The History of Encodings

By Digitize

Ever since I started in web development, I have constantly heard about encodings and have always been told to use UTF-8 and forget everything else. Being inquisitive, I don't usually follow directions without researching the topic. The following is the result of my research:

The first thing I did was define what an encoding is. One of the best-known encodings is Morse code, which you are probably familiar with. What is Morse code? Simply put, it takes a character set and maps each character to a distinct sequence of dots and dashes. In the same way, a character encoding takes a character set and represents each character with a number that the computer can store and process.

The next question is how encodings evolved. One of the first character encodings for computers was ASCII. Most computers at the time had 8-bit bytes, and ASCII conveniently used only 7 bits, leaving one bit left over. ASCII was able to represent every unaccented English letter, digit, and common punctuation mark using a number between 32 and 127, with space being 32 and the letter “A” being 65. The codes below 32 were reserved for control characters.
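To make the numbers concrete, here is a minimal Python sketch that looks up a few ASCII codes and checks that plain English text fits in 7 bits. The sample characters and the sample sentence are my own choices for illustration.

```python
# Look up the ASCII code of a few characters with the built-in ord().
codes = {ch: ord(ch) for ch in [" ", "A", "a", "~"]}
print(codes)

# Encode a plain English sentence as ASCII and check that every
# byte fits in 7 bits, meaning its value is below 128.
text = "Hello, world!"
encoded = text.encode("ascii")
fits_in_seven_bits = all(byte < 128 for byte in encoded)
print(list(encoded), fits_in_seven_bits)
```

Trying to encode a character such as “é” with `"ascii"` raises a `UnicodeEncodeError`, which is exactly the gap the rest of this history is about.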

Because ASCII left one bit unused, codes 128-255 were vacant, and many vendors filled them with their own characters, which led to a problem. For example, on some PCs character code 130 would display as “é”, but on computers sold in Israel code 130 represented the Hebrew letter gimel (ג), so when Americans sent their résumés to Israel, they would arrive as rגsumגs.
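The code 130 mix-up can be reproduced in Python. This sketch assumes the two machines used code page 437 (the original IBM PC character set) and code page 862 (DOS Hebrew); the article does not name the code pages, so treat those two as my assumption.

```python
# One byte, value 130, read under two different code pages.
raw = bytes([130])
as_us_pc = raw.decode("cp437")       # assumed US PC code page
as_hebrew_pc = raw.decode("cp862")   # assumed DOS Hebrew code page
print(as_us_pc, as_hebrew_pc)

# "résumés" written on the US machine and opened on the Hebrew one.
# \u00e9 is the letter é, written as an escape to keep the file ASCII-safe.
original = "r\u00e9sum\u00e9s"
garbled = original.encode("cp437").decode("cp862")
print(garbled)
```

The bytes never change on the way; only the table used to interpret them does.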

This issue was spotted and addressed with something called the ANSI standard. In the ANSI standard, everyone agreed on what to do with the codes below 128, but the codes from 128 up were handled with what were called code pages. Each code page assigned the codes from 128 up to the characters of a particular language or region. For example, computers in Israel would be sold with a code page containing Hebrew letters, while American computers would be sold with an English code page.

This was a great leap forward in terms of standardization in the industry, but it led to another issue. What if someone was bilingual? Only one code page could be active at a time, meaning you were out of luck if you needed to type in more than one language. Because not many people needed this, the issue wasn't that prominent. Until the internet!

The internet created a huge problem for character encoding because text now moved constantly between computers. If someone in America wrote a website that used characters above code 127, a visitor with a Hebrew code page installed would see Hebrew gibberish in their place. To solve this problem, Unicode was invented.

A common misconception about Unicode is that each character takes two bytes. People think that because it uses two bytes, Unicode is simply a 16-bit code, allowing 2^16, or 65,536, characters. This is incorrect. What the Unicode Consortium actually did was assign every character in every writing system its own number, called a code point, without deciding how that number is stored in memory. The English letter “A” is code point U+0041 and the Hebrew letter “ג” is U+05D2. Code points run up to U+10FFFF, which leaves room for more than a million characters, far more than 16 bits could hold.
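Unicode gives each character a number, called a code point, and that number is independent of how the character is stored. A small Python sketch shows this; the sample characters are my own picks.

```python
import sys

# ord() returns a character's Unicode code point.
# \u00e9 is é and \u05d2 is the Hebrew letter gimel.
samples = ["A", "\u00e9", "\u05d2"]
code_points = {ch: f"U+{ord(ch):04X}" for ch in samples}
print(code_points)

# An emoji sits well above 65,535, so 16 bits cannot hold every character.
emoji = "\U0001F600"
beyond_16_bits = ord(emoji) > 0xFFFF
print(hex(ord(emoji)), beyond_16_bits)

# The highest code point Python can represent.
print(hex(sys.maxunicode))
```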

But this led to two issues. The earliest way of storing Unicode used two bytes per character, so “Hello” would be 00 48 00 65 00 6C 00 6C 00 6F, which wastes a lot of zeros on English text. The second issue was that there were thousands of documents already written in ASCII and no one wanted to sit and convert them. So people continued to use ASCII, until UTF-8 was invented.
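The wasted zeros are easy to see in Python. This sketch assumes the two-byte form in question corresponds to what Python calls `utf-16-be` (big-endian, no byte order mark); `hex_bytes` is a small helper I made up for display.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)

# "Hello" stored with two bytes per character versus one.
two_byte_hello = hex_bytes("Hello".encode("utf-16-be"))
ascii_hello = hex_bytes("Hello".encode("ascii"))

print(two_byte_hello)
print(ascii_hello)
```

For English text, every other byte in the two-byte form is zero, so the file is twice as large as its ASCII equivalent.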

What the creators of UTF-8 did was remove the zeros from the codes below 128, so each of those characters is stored in a single byte. Characters from 128 up use 2, 3, or even 4 bytes. This accomplished two things. First, it made the UTF-8 encoding of English text identical to its ASCII encoding, so no one needed to convert all the documents previously written in ASCII. Second, it eliminated the zeros that were being wasted.
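Both properties can be checked with a short Python sketch. The sample characters below are my own choices, written as escapes so the snippet does not depend on how this page itself is encoded.

```python
# How many bytes UTF-8 needs for different characters.
# \u00e9 = é, \u05d2 = Hebrew gimel, \u20ac = euro sign, \U0001F600 = an emoji.
samples = ["A", "\u00e9", "\u05d2", "\u20ac", "\U0001F600"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)

# English text produces the same bytes in UTF-8 and in ASCII.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)

# So an old ASCII file can be read as UTF-8 without any conversion.
old_ascii_file = b"Hello"
decoded = old_ascii_file.decode("utf-8")
print(decoded)
```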

by JerusaHost Israel domain and Israel web hosting intern Binny Zupnick
