Unicode | Live to make

Unicode is a word that I hear and see everywhere: in computer books, in a government's unveiling of a currency symbol, and even in plain Windows Notepad. One term in particular keeps coming up, UTF-8. What the heck does it mean, and what is its
significance?

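I know that UTF stands for Unicode Transformation Format (also expanded as UCS Transformation Format, where UCS is the Universal Coded Character Set), and that Unicode is a standard that assigns a number, called a code point, to every character. The word Universal seems interesting!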

Interestingly, UTF-8 can encode every character in Unicode, which covers the writing systems of essentially every language in the world, if not yet languages from outside our world: the Unicode Consortium rejected the proposal to include Klingon in the character set, saying that it is not popular enough. But the battle is still on.

What does the 8 in UTF-8 mean though? It simply means that UTF-8 works in
8-bit units: a document encoded in UTF-8 stores each ASCII character in a single 8-bit byte.
Note, however, that UTF-8 uses extra bytes for characters beyond standard ASCII; not everything can be coded in 8 bits. UTF-8 encodes each character in one to four bytes, depending on the number of significant bits in the numerical value (the code point) of the character.
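
To make the one-to-four-byte rule concrete, here is a small Python sketch. The sample characters and the helper name `utf8_length` are my own picks for illustration; only the built-in `str.encode` does the real work.

```python
# Sample characters, written as escapes so the source file's own encoding
# does not matter: A, e-acute, euro sign, the arrow from this post, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\u279c", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ').upper()}")
```

Plain ASCII stays at one byte, and the byte count grows as the code point gets larger.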

EXAMPLE: Unicode code point from the UTF-8 bit pattern

My zsh terminal always uses the ➜ (Heavy Round-Tipped Rightwards Arrow) as its prompt. The problem is that not all fonts can display this arrow. But before I go looking for a replacement arrow, I need to know what the Unicode code point of the current one is. So here is what I did.

1) Saved the arrow in a file and used ‘od’ to get the hex code for it.
The bytes in the file are: E2 9E 9C. This is the UTF-8 encoding of the arrow, but it is not yet the Unicode code point.

2) I began by writing the hex bytes in binary

E2 -> 1110 0010
9E -> 1001 1110
9C -> 1001 1100

3) Looked at the Wikipedia page and found out that I need to strip some marker
bits from each byte to get to the code point. (See table below)

a) The 1110 is the marker prefix in the first byte (the bits used to
recognize a three-byte sequence). I strip them and the remaining bits are
part of the code point. Thus I got 0010 (0x:2)

b) The 10 is the marker prefix in the 2nd byte (it flags a continuation byte).
The rest is part of the code point. Hence I got 011110 (0x:1E)

c) The 10 is again the marker prefix in the 3rd byte.
Hence I got 011100 (0x:1C)

Number of bytes	Bits for code point	First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
1	7	U+0000	U+007F	0xxxxxxx
2	11	U+0080	U+07FF	110xxxxx	10xxxxxx
3	16	U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	21	U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

4) Now we combine the remaining bits from the 3 bytes back to back and we get the code point.

2       1E      1C
0010    011110  011100  -> 0010 0111 1001 1100 (0x:279C)

5) Thus the Unicode code point for ➜ is 0x279C, written U+279C. This character is encoded in 3
bytes, E2 9E 9C, when stored in a UTF-8 document. Preview available here https://unicode-table.com/en/279C/
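
The steps above can be sketched in Python as a cross-check. The function name `decode_three_byte` is made up for this post. It only handles the three-byte row of the table and does not reject overlong or surrogate encodings, so treat it as a learning aid rather than a real decoder.

```python
import unicodedata


def decode_three_byte(b1: int, b2: int, b3: int) -> int:
    """Recover a code point from a three-byte UTF-8 sequence.

    Layout (from the table): 1110xxxx 10xxxxxx 10xxxxxx
    """
    if b1 >> 4 != 0b1110:
        raise ValueError("first byte must start with 1110")
    if b2 >> 6 != 0b10 or b3 >> 6 != 0b10:
        raise ValueError("continuation bytes must start with 10")

    high = b1 & 0b00001111  # keep the 4 payload bits of byte 1
    mid = b2 & 0b00111111   # keep the 6 payload bits of byte 2
    low = b3 & 0b00111111   # keep the 6 payload bits of byte 3

    # Place the three groups back to back: 4 + 6 + 6 = 16 bits.
    return (high << 12) | (mid << 6) | low


code_point = decode_three_byte(0xE2, 0x9E, 0x9C)
print(f"U+{code_point:04X}")  # U+279C

# Look up the official character name in Python's Unicode database.
print(unicodedata.name(chr(code_point)))

# Cross-check against Python's own UTF-8 decoder.
assert bytes([0xE2, 0x9E, 0x9C]).decode("utf-8") == chr(code_point)
```

The masks do the stripping from step 3, and the shifts do the back-to-back combining from step 4.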

Hope this document proves useful in some way and helps you understand Unicode a little more clearly. Feel free to point out any mistakes; all comments are welcome.

#ascii, #character-set, #linux, #unicode, #utf, #utf-8
