Let us briefly look at how computers store and display text.
ASCII is an old encoding system for storing characters as ones and zeroes. The ASCII standard uses 7 bits to represent 128 different symbols: numbers, English letters, control codes and supplementary symbols.
Here is an example of the lowercase letter "x" encoded with ASCII:
01111000
Converted to decimal, this is the number 120, which is the code for the letter "x" in the ASCII table. Other symbols follow the same logic: each character is assigned a number, and that number is what gets stored in binary.
Despite using only 7 bits, each ASCII character is stored in a byte (8 bits) for convenience. This is why the first bit is always a zero.
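If you want to check this yourself, here is a tiny Python snippet (Python is just a convenient choice here; any language that can print binary would do):

```python
# Look up the code of "x" and print it as 8 bits.
code = ord("x")
print(code)                 # 120
print(format(code, "08b"))  # 01111000
```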
But 128 values are not enough if you want to store and display anything other than the English alphabet. For things like Cyrillic characters, you need Unicode.
The vast majority of the computer world uses UTF-8, which is one of the encodings defined by the Unicode standard. It uses between one and four bytes (8 to 32 bits) per character, and it can represent just over 1.1 million unique code points. As of 2024, the standard defines almost 150,000 of them.
The first 128 characters are the same as found in ASCII, and they are also stored in one byte (8 bits).
The lowercase letter "x" also looks like this in UTF-8:
01111000
Again, when the character is stored in one byte, the first bit is always 0.
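You can see that UTF-8 really does produce the same single byte by asking Python to encode the character for you:

```python
# Encode "x" as UTF-8 and show each byte in binary.
encoded = "x".encode("utf-8")
print([format(byte, "08b") for byte in encoded])  # ['01111000']
```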
But what if we want to display the Cyrillic letter "Є", which is not in ASCII? Its Unicode code point is the number 1028. To store it in UTF-8, you need two bytes (16 bits):
11010000 10000100
Here is where it gets interesting. If you convert these 16 bits to decimal, you get 53380, not 1028. This is because UTF-8 does not just store the number; it also has to tell the computer how to read it.
When two or more bytes are needed, the first byte will start with as many ones as there are bytes, followed by a zero. This is an instruction for the computer. In our case, the first byte starts with 110, which means that there are two bytes in total. After that we have 10000, which is the first part of the actual number.
The following bytes will always start with 10 to indicate that they are part of the bigger value. Here, the second byte starts with 10, so the relevant bits are 000100.
Combined, we get the number 10000 000100, which is 1028 when converted to decimal.
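Here is a rough sketch of that decoding step in Python. It strips the 110 and 10 prefixes by hand and stitches the remaining bits back together:

```python
# The two UTF-8 bytes for "Є".
first, second = 0b11010000, 0b10000100

# Keep the lower 5 bits of the first byte (after the 110 prefix)
# and the lower 6 bits of the second byte (after the 10 prefix).
high_bits = first & 0b00011111   # 10000
low_bits = second & 0b00111111   # 000100

# Glue them together: 10000 followed by 000100.
code_point = (high_bits << 6) | low_bits
print(code_point)       # 1028
print(chr(code_point))  # Є
```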
The same applies to three-byte characters. Here's what the encoding of the euro sign "€" looks like:
11100010 10000010 10101100
When we remove all of the "instructional" bits at the start of each byte, we get 0010 000010 101100, or 8364 in decimal.
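Again, a quick way to double-check this, using Python's built-in encoder:

```python
# Encode the euro sign and show its three bytes in binary.
euro = "€".encode("utf-8")
print([format(byte, "08b") for byte in euro])
# ['11100010', '10000010', '10101100']
print(ord("€"))  # 8364
```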
And finally, here's the blueprint to store any four-byte character:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
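To tie everything together, here is a minimal sketch of a UTF-8 encoder in Python that follows the one-, two-, three- and four-byte patterns above. The function name utf8_encode and the smiley emoji in the test loop are just my own examples, and the sketch skips validation (for instance, it does not reject surrogate code points); in real code you would simply call your language's built-in encoder.

```python
def utf8_encode(char):
    """Encode a single character as UTF-8 bytes using the patterns above."""
    cp = ord(char)
    if cp < 0x80:     # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b00111111)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b00111111),
                      0b10000000 | (cp & 0b00111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b00111111),
                  0b10000000 | ((cp >> 6) & 0b00111111),
                  0b10000000 | (cp & 0b00111111)])

# Compare against Python's own encoder for a few characters.
for char in "xЄ€🙂":
    assert utf8_encode(char) == char.encode("utf-8")
    print(char, [format(b, "08b") for b in utf8_encode(char)])
```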
That's pretty much it. I hope this either gets you excited about character encodings or at least makes them less scary.