A study on Unicode

Unicode is a word that I hear and see everywhere: in computer books, in a government's unveiling of a currency symbol, and even in the simple Windows Notepad. One term in particular keeps coming up: UTF-8. What the heck does it mean, and what is its significance?

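I know that UTF stands for Unicode Transformation Format (also expanded as UCS Transformation Format, where UCS is the Universal Coded Character Set), and that Unicode is a standard that assigns a number, called a code point, to each character. The word Universal seems interesting!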

Interestingly, UTF-8 can encode every character defined in Unicode, which covers the writing systems of practically every language in the world, if not yet languages from outside our world: the Unicode Consortium rejected the proposal to include Klingon in the character set, saying that it is not popular enough. But the battle is still on.

What does the 8 in UTF-8 mean though? It means that UTF-8 works in 8-bit units: a document encoded in UTF-8 stores each ASCII character in a single 8-bit byte.
Note, however, that UTF-8 uses extra bytes for characters beyond standard ASCII; not everything can be coded in 8 bits. UTF-8 encodes each character in one to four bytes, depending on the number of significant bits in the numerical value (the code point) of the character.
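
To see the one-to-four-byte rule in action, here is a small Python sketch. The sample characters and the helper name utf8_length are my own picks for illustration; any characters from the four ranges would do.

```python
# Sample characters, one from each UTF-8 length class (illustrative picks):
# U+0041 (A), U+00E9 (e with acute), U+279C (heavy arrow), U+1F600 (emoji).
samples = ["\u0041", "\u00e9", "\u279c", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({utf8_length(ch)} byte(s))")
```

The ASCII letter takes one byte, while the arrow from my prompt takes three.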

EXAMPLE: Unicode code point from UTF-8 bytes

My zsh terminal always uses the ➜ (Heavy Round-Tipped Rightwards Arrow) as its prompt. The problem is that not all fonts can display this arrow. But before I go looking for a replacement arrow, I need to know what the Unicode code point of the current one is. So here is what I did.

1) Saved the arrow in a file and used ‘od’ to get the hex bytes for it.
The byte sequence of the arrow is: E2 9E 9C. This is the UTF-8 encoding of the character, not its Unicode code point.

2) Wrote out each hex byte in binary:

E2 -> 1110 0010
9E -> 1001 1110
9C -> 1001 1100

3) Looked at the Wikipedia page and found out that I need to strip some marker
bits from each byte to get to the code point. (See table below)

a) The 1110 is the marker in the first byte; it identifies the start of a
3-byte sequence. I strip it and the remaining bits are part of the code point.
Thus I got 0010 (0x2)

b) The 10 is the marker in the 2nd byte; it identifies a continuation byte.
The rest is part of the code point. Hence I got 011110 (0x1E)

c) The 10 is again the continuation marker in the 3rd byte.
Hence I got 011100 (0x1C)

Number    Bits for    First       Last
of bytes  code point  code point  code point  Byte 1    Byte 2    Byte 3    Byte 4
1         7           U+0000      U+007F      0xxxxxxx
2         11          U+0080      U+07FF      110xxxxx  10xxxxxx
3         16          U+0800      U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
4         21          U+10000     U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

4) Now we combine the remaining bits of the 3 bytes back to back to get the code point.

2       1E      1C
0010    011110  011100  -> 0010 0111 1001 1100 (0x279C)

5) Thus the Unicode code point for ➜ is 0x279C, written U+279C. In a UTF-8
document this character is encoded in 3 bytes as E2 9E 9C.
Preview available here: https://unicode-table.com/en/279C/
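
The steps above can also be sketched in Python. The function name decode_utf8_char is hypothetical, and this is only a minimal illustration of the bit stripping under the layout in the table: it does not reject overlong encodings or surrogates the way a real decoder must. The last line cross-checks the result against Python's built-in decoder.

```python
def decode_utf8_char(data: bytes) -> int:
    """Return the code point of one UTF-8 encoded character.

    Illustrative helper only: checks the marker bits and the length,
    but not overlong forms or surrogate code points.
    """
    first = data[0]
    if first >> 7 == 0b0:            # 0xxxxxxx: 1 byte, 7 payload bits
        length, code_point = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx: 2 bytes, 5 payload bits
        length, code_point = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: 3 bytes, 4 payload bits
        length, code_point = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: 4 bytes, 3 payload bits
        length, code_point = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")

    if len(data) != length:
        raise ValueError("wrong number of bytes for this leading byte")

    for byte in data[1:]:
        if byte >> 6 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        # Make room for 6 more bits and append the payload of this byte.
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


arrow = bytes([0xE2, 0x9E, 0x9C])    # the bytes reported by od
code_point = decode_utf8_char(arrow)
print(f"U+{code_point:04X}")
# Cross-check against the built-in decoder.
assert code_point == ord(arrow.decode("utf-8"))
```

In everyday use, ord(arrow.decode("utf-8")) alone gives the same answer; the manual version is only there to mirror the walkthrough.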

I hope this document proves useful in some way and helps you understand Unicode a little more clearly. Feel free to point out any mistakes; all comments are welcome.

#ascii, #character-set, #linux, #unicode, #utf, #utf-8