SixFoisNeuf

Totally irregular blog on computers and security


Unicode and UTF-8 (and emojis)

Posted on Mar 9, 2024

Doing some research for my “Bush hid the facts” article led me to read about Unicode, as well as the various ways to encode Unicode code points. I didn’t really look into the Unicode standard (it looked too complicated), but I still wanted to go back and have a look at the nitty-gritty details at some point.

I only knew that UTF-8 was a 21-bit encoding, and that it could encode any code point in 1 to 4 bytes.

  • 21-bit encoding: this means that UTF-8 can encode 21 bits worth of information by using 1 to 4 bytes. However, this does not mean that Unicode numbers go up to 2,097,151! The 4-byte bit layout has room for numbers beyond the Unicode range, which stops at U+10FFFF (1,114,111), but such sequences are not valid UTF-8
  • code point: this is what a “character” is called in Unicode. Each code point corresponds to a number. Most Western characters and simple emojis are a single code point: for example “🤓” (called NERD FACE) has code point number 129299, or U+1F913. However, some CJK characters and complex emojis are made up of multiple code points (we’ll see that later)
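To make the "code point is just a number" idea concrete, here is a small Python sketch. The helper name `is_valid_code_point` is mine, invented for illustration; it simply leans on the fact that Python's built-in `chr()` refuses anything above U+10FFFF.

```python
# Code points are plain integers; ord() and chr() convert in both directions.
nerd = "\U0001F913"            # NERD FACE
nerd_number = ord(nerd)        # 129299, i.e. 0x1F913

MAX_CODE_POINT = 0x10FFFF      # 1,114,111: the last Unicode code point
MAX_21_BITS = (1 << 21) - 1    # 2,097,151: the largest 21-bit number


def is_valid_code_point(n: int) -> bool:
    """Hypothetical helper: True if chr() accepts n as a code point."""
    try:
        chr(n)
        return True
    except (ValueError, OverflowError):
        return False
```

With this, `is_valid_code_point(MAX_CODE_POINT)` is true while `is_valid_code_point(MAX_21_BITS)` is false, which is the gap between "21 bits" and "the Unicode range".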

There are three main advantages to UTF-8 over UTF-16:

  • No encoded code point other than U+0000 itself will ever produce a NULL byte: preserves compatibility with C-style strings
  • Fully backwards-compatible with 7-bit ASCII: a plain old ASCII text is also valid UTF-8, representing the same text
  • For Western languages, UTF-8-encoded text uses roughly half as much memory as UTF-16
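These properties are easy to check from Python. The sample string below is an arbitrary choice of mine; I use the `utf-16-le` codec so that no byte-order mark gets added to the UTF-16 output and skews the size comparison.

```python
text = "Hello, world!"                 # arbitrary plain-ASCII sample

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")       # little-endian, no byte-order mark

ascii_compatible = utf8 == text.encode("ascii")   # same bytes as ASCII
size_ratio = len(utf16) / len(utf8)               # UTF-16 is twice as big here
utf8_has_zero_byte = 0 in utf8                    # no NULL bytes in UTF-8
utf16_has_zero_byte = 0 in utf16                  # plenty of them in UTF-16
```

For text that is mostly non-ASCII the ratio shrinks, so the "half the memory" figure only holds for ASCII-heavy content.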

Then, I stumbled upon “UTF-8: Bits, Bytes, and Benefits” when it was shared on lobste.rs. It presented in three simple rules how UTF-8 encodes numbers. The information jumped out at me, seemingly clear as day. Oops, did I just learn how to encode and decode UTF-8 by accident?

The algorithm is simple:

  • We’re working on the binary representation of the Unicode code point
  • Every byte after the first holds 6 bits worth of information, preceded by 2 “continuation” bits (10) at the start
  • The first byte starts with as many 1s as there are bytes, followed by a 0, followed by any leftover data we might need to pack (the exception is plain ASCII, which fits in a single byte starting with 0)
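The rules above can be sketched as a tiny hand-rolled encoder. This is only an illustration (the function name `utf8_encode` is mine, and real code should just call `str.encode`); unlike Python's codec, it does not reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point by hand, following the rules above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        return bytes([code_point])       # ASCII: one byte, top bit is 0

    # Pick the total length: 2, 3 or 4 bytes carry 11, 16 or 21 data bits.
    if code_point < 0x800:
        length = 2
    elif code_point < 0x10000:
        length = 3
    else:
        length = 4

    # Peel off 6 bits at a time, lowest bits first, for the trailing bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0b10000000 | (code_point & 0b111111))   # "10" + 6 bits
        code_point >>= 6

    # First byte: `length` ones, then a zero, then the leftover bits.
    prefix = (0xFF << (8 - length)) & 0xFF
    return bytes([prefix | code_point, *reversed(tail)])
```

Comparing `utf8_encode(cp)` with `chr(cp).encode("utf-8")` for a few values is a quick way to convince yourself the rules are right.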

I fired up Python, and tried some magic: my goal was to display the “😊” emoji. Emojipedia tells me this is Unicode code point U+1F60A.

>>> bin(0x1F60A)
'0b11111011000001010'
#  < c >< b  >< a  >  
>>> a = 0b10001010 # "10" + data
>>> b = 0b10011000 # "10" + data
>>> c = 0b10011111 # "10" + data
>>> d = 0b11110000 # "this is going to be 4 bytes" + no data
>>> bytearray([d, c, b, a]).decode("utf8")
'😊'

The process works! Armed with this new knowledge, I wanted to dig deeper into how various characters were encoded, so I built another small JavaScript web app, this time to decompose UTF-8-encoded text into its respective code points, detailing for each one how the algorithm works.

It’s just called “UTF-8 playground”, and is available at /utf8

It’s a playground, let’s play! In the next few sections, I’ll illustrate how emoji construction works by using this tool to split up popular emojis into their various code points.

A note about emojis

“Emojis” as they are understood in the West really came to popularity after they were added to Unicode in 2010 (Wikipedia). For example, Android unveiled emojis with the KitKat (4.4) release, in 2013.

All of the emojis added in Unicode 6.0 were single-code-point emojis, meaning that the Unicode organization assigned a different, unique number to each one. For example, the four heart colors (💛💚💙💜) each have their own code point.

We will see that there are other ways to make emojis!

Single code point emoji

Let’s analyze the cool emoji: 😎. Here’s what the playground has to say about it:

Character    Bytes Meaning                                  Tally
😎 U+1F60E   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x98  10011000 (continuation + add 011000)     000011111011000
             0x8E  10001110 (continuation + add 001110)     000011111011000001110 => 0x1F60E

It is indeed made up of a single character (code point), with number 0x1F60E. That’s a big number, so UTF-8 uses 4 bytes to encode it, but it’s still a single Unicode character.
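Going the other way, the "Tally" column is just a running accumulator. Here is a rough Python sketch of that decoding step; `utf8_decode_one` is a name I made up, this is not the playground's actual code, and it skips checks a real decoder needs (overlong encodings, surrogates, values above U+10FFFF).

```python
def utf8_decode_one(data: bytes) -> tuple[int, int]:
    """Decode the first code point of data; return (code_point, bytes_used)."""
    first = data[0]
    if first < 0b10000000:
        return first, 1                              # plain ASCII
    if first >> 5 == 0b110:
        length, tally = 2, first & 0b00011111        # 110xxxxx
    elif first >> 4 == 0b1110:
        length, tally = 3, first & 0b00001111        # 1110xxxx
    elif first >> 3 == 0b11110:
        length, tally = 4, first & 0b00000111        # 11110xxx
    else:
        raise ValueError("invalid first byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        tally = (tally << 6) | (byte & 0b00111111)   # append 6 more bits
    return tally, length
```

Feeding it `bytes([0xF0, 0x9F, 0x98, 0x8E])` walks through the same four steps as the table and ends on `0x1F60E`.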

Combining emojis

Next up, let’s have a look at a country flag emoji: 🇫🇷. (note: that’s supposed to be the French flag. If you’re seeing something else, your OS spoiled the answer for you)

Character    Bytes Meaning                                  Tally
🇫 U+1F1EB   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x87  10000111 (continuation + add 000111)     000011111000111
             0xAB  10101011 (continuation + add 101011)     000011111000111101011 => 0x1F1EB
🇷 U+1F1F7   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x87  10000111 (continuation + add 000111)     000011111000111
             0xB7  10110111 (continuation + add 110111)     000011111000111110111 => 0x1F1F7

Interesting! Instead of assigning a code point to each country flag (WorldPopulationReview lists 197 sovereign nations, plus 15 of disputed sovereignty), the Unicode Consortium decided to assign a code point to each of the 26 letters of the Latin alphabet (the “regional indicator symbols”), and simply recommends combining them using the ISO 3166-1 alpha-2 standard to encode the flags. Combining literally means “putting them next to one another”. Hit “backspace” next to a flag, and some software will just pop up the remaining letter.
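Since the 26 regional indicator symbols sit in a row starting at U+1F1E6, building a flag from a country code is a one-liner. A small sketch (the function name `flag` is mine; whether the result renders as a flag depends entirely on your fonts and OS):

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Turn an ISO 3166-1 alpha-2 code such as "FR" into a flag sequence."""
    if len(country_code) != 2 or not (country_code.isascii() and country_code.isalpha()):
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )
```

`flag("FR")` gives U+1F1EB followed by U+1F1F7, the same two code points as in the table above, for 8 bytes of UTF-8 in total.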

Here’s the full list of emojis in Unicode 15.1. Look up “RGI_Emoji_Flag_Sequence” to get to the flag list. You’ll see that most of them are made up of two code points.

A funny note is that country flag emojis do not work on Windows: I don’t believe there has ever been a public explanation, but rumor has it that it was easier to just not ship this feature than have to censor the Taiwanese flag for the Chinese market.

This juxtaposition is also how “skin tone” modifiers work: 🏻 🏼 🏽 🏾 🏿 just get plopped down after any 🤚 and boom: 🤚🏿

The final boss of country flag emojis: 🏴󠁧󠁒󠁳󠁣󠁴󠁿 (this seriously messes up my terminal)

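Literally [FLAG]GBSCT[CANCEL]: a black flag (U+1F3F4), five invisible “tag” characters spelling gbsct (GB-SCT is Scotland’s ISO 3166-2 code), and a cancel tag (U+E007F).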
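The tag characters mirror printable ASCII, shifted up by 0xE0000, so this sequence can be built programmatically too. A sketch, with `subdivision_flag` being a name I picked for illustration; it does no validation of the region string.

```python
BLACK_FLAG = "\U0001F3F4"    # WAVING BLACK FLAG
CANCEL_TAG = "\U000E007F"    # CANCEL TAG
TAG_OFFSET = 0xE0000         # tag characters are ASCII shifted by this much


def subdivision_flag(region: str) -> str:
    """Build a flag tag sequence from a lowercase ASCII string like "gbsct"."""
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in region)
    return BLACK_FLAG + tags + CANCEL_TAG


scotland = subdivision_flag("gbsct")
scotland_code_points = len(scotland)              # 7 code points
scotland_bytes = len(scotland.encode("utf-8"))    # 28 bytes, 4 per code point
```

Seven code points at four bytes each: no wonder terminals struggle with it.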

Special case of combining emojis: emoji promotions

Unicode already had “symbols” way before emojis came out: they were usually low-res and in black and white, but they were there! The Miscellaneous Symbols block is full of them: ☀, ⛏, and others.

There would have been two bad ways to handle this:

  • Create new code points for these, and have one for the black-and-white version, and one for the “emoji” version
  • Update the definition of the original symbols to require colour

The first one introduces confusion (which one are you supposed to use?), and the second one is not reassuring regarding the stability of the Unicode standard.

A third way was chosen: ☀️

Character    Bytes Meaning                                  Tally
☀ U+2600     0xE2  11100010 (3 bytes announcer + add 0010)  0010
             0x98  10011000 (continuation + add 011000)     0010011000
             0x80  10000000 (continuation + add 000000)     0010011000000000 => 0x2600
VS16 U+FE0F  0xEF  11101111 (3 bytes announcer + add 1111)  1111
             0xB8  10111000 (continuation + add 111000)     1111111000
             0x8F  10001111 (continuation + add 001111)     1111111000001111 => 0xFE0F

The “Variation Selectors” are a set of 16 code points (U+FE00 to U+FE0F), which apply to the character before them and indicate which variation to display. Variation Selector 16 (VS16, U+FE0F) means “show me the emoji variant”.

So, ☀ by itself is the old symbol, but append U+FE0F and it becomes ☀️ !!
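Python's `unicodedata` module makes it easy to see that the emoji sun really is two code points. A quick sketch; the variable names are mine, and I also include VS15 (U+FE0E), which asks for the text-style presentation instead.

```python
import unicodedata

SUN = "\u2600"
VS15 = "\uFE0E"   # VARIATION SELECTOR-15: ask for the text presentation
VS16 = "\uFE0F"   # VARIATION SELECTOR-16: ask for the emoji presentation

emoji_sun = SUN + VS16
text_sun = SUN + VS15

emoji_sun_names = [unicodedata.name(ch) for ch in emoji_sun]
emoji_sun_bytes = emoji_sun.encode("utf-8")
```

`emoji_sun_names` comes out as `['BLACK SUN WITH RAYS', 'VARIATION SELECTOR-16']`, and the six bytes match the table above: E2 98 80 then EF B8 8F.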

Zero Width Joiner combinations

The Zero Width Joiner is code point U+200D, and is used to link together two or more characters. It is used to encode complex scripts, such as Arabic. It is also a big part of how newer emojis are encoded!

Let’s take the example of 👨‍🦯 (“man with white cane”). We already have Man 👨, and we also have White Cane 🦯, so instead of allocating a brand new code point, why not combine them?

Character    Bytes Meaning                                  Tally
👨 U+1F468   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x91  10010001 (continuation + add 010001)     000011111010001
             0xA8  10101000 (continuation + add 101000)     000011111010001101000 => 0x1F468
ZWJ U+200D   0xE2  11100010 (3 bytes announcer + add 0010)  0010
             0x80  10000000 (continuation + add 000000)     0010000000
             0x8D  10001101 (continuation + add 001101)     0010000000001101 => 0x200D
🦯 U+1F9AF   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0xA6  10100110 (continuation + add 100110)     000011111100110
             0xAF  10101111 (continuation + add 101111)     000011111100110101111 => 0x1F9AF

I think that the way this works is the ZWJ implies a with? E.g. a man with a cane, or a flag with a rainbow. Also, it allows people to write stuff like 👨💻 without it turning into 👨‍💻 automatically.
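In code, the difference between "two emojis side by side" and "one joined emoji" is just whether U+200D sits between them. A short sketch (constant names are mine; how each string renders is up to your platform):

```python
ZWJ = "\u200D"               # ZERO WIDTH JOINER
MAN = "\U0001F468"
WHITE_CANE = "\U0001F9AF"
LAPTOP = "\U0001F4BB"

man_with_cane = MAN + ZWJ + WHITE_CANE     # intended as a single emoji
man_then_laptop = MAN + LAPTOP             # two separate emojis
man_technologist = MAN + ZWJ + LAPTOP      # joined into one emoji

cane_code_points = len(man_with_cane)              # 3 code points
cane_bytes = len(man_with_cane.encode("utf-8"))    # 4 + 3 + 4 = 11 bytes
```

The 11 bytes line up with the table: four for the man, three for the ZWJ, four for the cane.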

What is the biggest we can make an emoji (in bytes)?

Of course, I want to see where the limit is. Looking up this question on the Internet led me to this blog post, which concludes that the biggest they could find was 👨🏻‍❤️‍💋‍👨🏻, at 35 bytes.

The blog post is from last year, and mentions Unicode 15. We are still at Unicode version 15, albeit 15.1.0. This small update does not appear to add any new huge combination of emojis…so are we done?

I don’t think so! It appears that there is at least another bigger emoji: a custom family!

ZWJs allow us to compose our ideal family: by asking for two adults, two children, and specifying each one’s skin tone…

Character    Bytes Meaning                                  Tally
👨 U+1F468   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x91  10010001 (continuation + add 010001)     000011111010001
             0xA8  10101000 (continuation + add 101000)     000011111010001101000 => 0x1F468
🏿 U+1F3FF   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x8F  10001111 (continuation + add 001111)     000011111001111
             0xBF  10111111 (continuation + add 111111)     000011111001111111111 => 0x1F3FF
ZWJ U+200D   0xE2  11100010 (3 bytes announcer + add 0010)  0010
             0x80  10000000 (continuation + add 000000)     0010000000
             0x8D  10001101 (continuation + add 001101)     0010000000001101 => 0x200D
👨 U+1F468   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x91  10010001 (continuation + add 010001)     000011111010001
             0xA8  10101000 (continuation + add 101000)     000011111010001101000 => 0x1F468
🏿 U+1F3FF   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x8F  10001111 (continuation + add 001111)     000011111001111
             0xBF  10111111 (continuation + add 111111)     000011111001111111111 => 0x1F3FF
ZWJ U+200D   0xE2  11100010 (3 bytes announcer + add 0010)  0010
             0x80  10000000 (continuation + add 000000)     0010000000
             0x8D  10001101 (continuation + add 001101)     0010000000001101 => 0x200D
👦 U+1F466   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x91  10010001 (continuation + add 010001)     000011111010001
             0xA6  10100110 (continuation + add 100110)     000011111010001100110 => 0x1F466
🏿 U+1F3FF   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x8F  10001111 (continuation + add 001111)     000011111001111
             0xBF  10111111 (continuation + add 111111)     000011111001111111111 => 0x1F3FF
ZWJ U+200D   0xE2  11100010 (3 bytes announcer + add 0010)  0010
             0x80  10000000 (continuation + add 000000)     0010000000
             0x8D  10001101 (continuation + add 001101)     0010000000001101 => 0x200D
👧 U+1F467   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x91  10010001 (continuation + add 010001)     000011111010001
             0xA7  10100111 (continuation + add 100111)     000011111010001100111 => 0x1F467
🏿 U+1F3FF   0xF0  11110000 (4 bytes announcer + add 000)   000
             0x9F  10011111 (continuation + add 011111)     000011111
             0x8F  10001111 (continuation + add 001111)     000011111001111
             0xBF  10111111 (continuation + add 111111)     000011111001111111111 => 0x1F3FF

That’s a whopping 41 bytes!! We see both ways of associating code points together: 👨🏿‍👨🏿‍👦🏿‍👧🏿

  • The skin tone for each family member is right after the “man”, “boy” or “girl” emoji
  • They are all joined together with ZWJs
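Both byte counts can be double-checked by assembling the sequences from their parts in Python. This is just a sanity-check sketch with constant names of my own choosing; it says nothing about whether a given platform will render the result as one glyph.

```python
ZWJ = "\u200D"                             # ZERO WIDTH JOINER
VS16 = "\uFE0F"                            # VARIATION SELECTOR-16
MAN, BOY, GIRL = "\U0001F468", "\U0001F466", "\U0001F467"
HEART, KISS_MARK = "\u2764", "\U0001F48B"
LIGHT, DARK = "\U0001F3FB", "\U0001F3FF"   # skin tone modifiers

# Kiss: man (light), red heart as emoji, kiss mark, man (light).
kiss = ZWJ.join([MAN + LIGHT, HEART + VS16, KISS_MARK, MAN + LIGHT])

# Custom family: each member is followed by a skin tone, members joined by ZWJs.
family = ZWJ.join(member + DARK for member in (MAN, MAN, BOY, GIRL))

kiss_bytes = len(kiss.encode("utf-8"))       # 35
family_bytes = len(family.encode("utf-8"))   # 41
```

The family is 8 four-byte code points plus 3 three-byte ZWJs: 32 + 9 = 41.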

Update: Wow, this emoji is very inconsistent! It works fine on Firefox on Windows 10, but renders as 4 different 👨🏿 on iOS for example. I suppose custom families aren’t completely standardized yet?

What’s next?

Well, writing all this made me wish for a way to easily combine code points together, in addition to one to split them up. So look forward to that I guess.

Update on that: there is now a way to search for Unicode code points in the UTF-8 playground. Unicode code points are sorted into categories; I’ve only selected the most useful by default, but this can be changed by checking the boxes in the “Filter code points” drawer.
