26 December 2007

Understanding Unicode

Computers understand everything as numbers. Each character is represented by a number, which is finally drawn as a character on the screen. This has been a major problem for legacy systems when writing programs for languages other than English, the primary reason being that the ASCII encoding does not have enough characters. So, obviously, internationalization of applications becomes a big issue.

ASCII and CodePage mechanism:
ASCII is a 7-bit encoding carried in one byte per character, so at most 2^8 or 256 different characters can fit in that byte. So if a program is to be written in a different language, the entire character set has to be replaced with a different one. Windows initially had a scheme called code pages: for each language it had a different code page. A Chinese version of Windows would use the Chinese code page. The problem here is that only one code page can be active at a time. So if a person in Europe connects to a US server, he'll see only English characters, and vice versa.
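To make the clash concrete, here is a small sketch (assuming a Windows build, and using the Win32 MultiByteToWideChar API just to decode bytes) that interprets one and the same byte under two different code pages, 1252 (Western European) and 1251 (Cyrillic):

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // One raw byte, 0xE4, as a NUL-terminated "string".
        char b[2] = { (char)0xE4, 0 };
        wchar_t w1252[2] = {0}, w1251[2] = {0};

        // Decode the same byte under two different code pages.
        MultiByteToWideChar(1252, 0, b, -1, w1252, 2);
        MultiByteToWideChar(1251, 0, b, -1, w1251, 2);

        // Under CP1252 the byte 0xE4 is U+00E4 (a with diaeresis);
        // under CP1251 the very same byte is U+0434 (Cyrillic small letter de).
        printf("CP1252 -> U+%04X, CP1251 -> U+%04X\n",
               (unsigned)w1252[0], (unsigned)w1251[0]);
        return 0;
    }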

Multi-Byte Character Set or Double-Byte Character Set:
One solution proposed for the above problem was a multi-byte character set. In this scheme, a character may be represented by either a single byte or a double byte. For a double-byte character, the lead (first) byte signals that the next byte belongs to the same character. So applications always have to check for lead bytes. The VC++ run-time provides the function isleadbyte(int c) to check whether a byte is a lead byte, as in the sketch below.
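A minimal sketch of what that check looks like, assuming the Microsoft C run-time (where isleadbyte respects the code page selected with setlocale) and some placeholder MBCS text:

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main()
    {
        // Pick up the system's default (possibly DBCS) code page.
        setlocale(LC_ALL, "");

        const unsigned char* text = (const unsigned char*)"some MBCS text";
        size_t i = 0;
        while (text[i] != 0)
        {
            if (isleadbyte(text[i]))
            {
                // Lead byte: this character occupies two bytes.
                printf("2-byte character at offset %u\n", (unsigned)i);
                i += 2;
            }
            else
            {
                // Ordinary single-byte character.
                i += 1;
            }
        }
        return 0;
    }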

Unicode:
Finally, all the big companies joined together and decided on a new strategy for this issue. A new character encoding scheme was devised with 16 bits. A 16-bit character set can support 2^16 or 65,536 characters. The standards of Unicode are hosted at Unicode. Although the original goal of the Unicode Consortium was to produce a 16-bit encoding standard, it ended up producing 3 different encoding forms:
UTF-8: This is an encoding form built on 8-bit units. The advantage of this scheme is that Unicode text transformed into UTF-8 leaves plain ASCII bytes unchanged, so it stays compatible with existing software.
UTF-16: This is the originally planned form, using 16-bit units. Characters outside the 16-bit range are encoded as pairs of 16-bit units called surrogates.
UTF-32: This uses fixed-width 32-bit units and is used where memory is not a constraint.

All 3 forms can be transformed into one another without any loss of data, since all of them encode the same common repertoire of characters.
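As a rough illustration of how the three forms relate, here is a sketch that encodes a single code point, U+20AC (the Euro sign), in UTF-32, UTF-16 and UTF-8 by hand:

    #include <stdio.h>

    int main()
    {
        const unsigned int cp = 0x20AC;            // code point U+20AC (Euro sign)

        // UTF-32: the code point itself, stored in one 32-bit unit.
        unsigned int utf32 = cp;

        // UTF-16: U+20AC fits in a single 16-bit unit
        // (only code points above U+FFFF need a surrogate pair).
        unsigned short utf16 = (unsigned short)cp;

        // UTF-8: code points in the range U+0800..U+FFFF take three bytes.
        unsigned char utf8[3];
        utf8[0] = (unsigned char)(0xE0 | (cp >> 12));
        utf8[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        utf8[2] = (unsigned char)(0x80 | (cp & 0x3F));

        printf("UTF-32: %08X\n", utf32);
        printf("UTF-16: %04X\n", (unsigned)utf16);
        printf("UTF-8 : %02X %02X %02X\n", utf8[0], utf8[1], utf8[2]);  // E2 82 AC
        return 0;
    }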
Note:
Windows NT/2000/XP use Unicode (UTF-16) as their native character set. So even if a program passes in ASCII data, it internally gets converted to Unicode, processed, reconverted to ASCII and returned.
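For instance (a sketch, assuming a Windows build), MessageBoxA and MessageBoxW show the same dialog; on NT-based systems the "A" version simply converts its text to Unicode and forwards it to the "W" version:

    #include <windows.h>

    int main()
    {
        // ANSI entry point: the strings are converted to Unicode internally.
        MessageBoxA(NULL, "Hello", "ANSI call", MB_OK);

        // Unicode entry point: the strings are passed through unchanged.
        MessageBoxW(NULL, L"Hello", L"Unicode call", MB_OK);
        return 0;
    }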
Most of the time, programs will need conversions from MBCS/DBCS to Unicode. If anybody needs to learn the conversion procedures, please follow the link at Microsoft MSDN. You can get all the information you need about this (of course, if the page has not been moved to a different location). The usual pattern is sketched below.
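For reference, the usual Win32 conversion pattern looks roughly like this (a sketch assuming the active ANSI code page, CP_ACP, and valid input; MSDN documents MultiByteToWideChar and WideCharToMultiByte in detail):

    #include <windows.h>
    #include <string>

    // MBCS/ANSI text -> UTF-16: the first call asks for the required length,
    // the second call does the actual conversion.
    std::wstring AnsiToUnicode(const std::string& in)
    {
        int len = MultiByteToWideChar(CP_ACP, 0, in.c_str(), -1, NULL, 0);
        if (len <= 0) return std::wstring();
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_ACP, 0, in.c_str(), -1, &out[0], len);
        out.resize(len - 1);          // drop the trailing NUL the API wrote
        return out;
    }

    // UTF-16 -> MBCS/ANSI text, the same two-step pattern in reverse.
    std::string UnicodeToAnsi(const std::wstring& in)
    {
        int len = WideCharToMultiByte(CP_ACP, 0, in.c_str(), -1, NULL, 0, NULL, NULL);
        if (len <= 0) return std::string();
        std::string out(len, '\0');
        WideCharToMultiByte(CP_ACP, 0, in.c_str(), -1, &out[0], len, NULL, NULL);
        out.resize(len - 1);
        return out;
    }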

In fact, the common goal expected from the whole effort is internationalization. But having a common character set solves only a part of the whole issue. Other things like dates, times, numbers, currencies and local conventions also have to be taken care of.
