
Iterating Over Emoji Characters the ES6 Way


Let’s say we want to write a function that turns a string into an array of characters. Let’s try and write it:
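Something like this, perhaps, with a plain index-based loop:

function stringToArray(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    // take the "character" at each index, one by one
    result.push(str.charAt(i));
  }
  return result;
}

console.log(stringToArray('Robin Hood'));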

Simple. And what does that console.log display?

[ 'R', 'o', 'b', 'i', 'n', ' ', 'H', 'o', 'o', 'd' ]

Good! But does it support Unicode? Let’s try a string with Hebrew characters:
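console.log(stringToArray('Robin Hoווd'));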

And…

[ 'R', 'o', 'b', 'i', 'n', ' ', 'H', 'o', 'ו', 'ו', 'd' ]

Yes! I can iterate over characters in Unicode. End of story, it works! I’m so happy, I wanna try it with a happy emoji:
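console.log(stringToArray('Robin H😀😀d'));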

And…

[ 'R', 'o', 'b', 'i', 'n', ' ', 'H', '�', '�', '�', '�', 'd' ]

Oh. No. Why don’t I see the emoji? And why did it split those emojis into four characters and not two?

To understand this, let us go back to the dawn of time…

The Days Before Unicode

Back in the day, there was ASCII. ASCII is an encoding that defines a number (called a code point) for each letter of the Latin alphabet, along with digits, lots of punctuation marks, and some control characters like CR (13) and LF (10).

And it was good for a long time. But after a while, the Europeans reared their heads, and wanted all those characters with accents on them, like Ö. But there was no more room in the ASCII table: ASCII is a 7-bit encoding, which means it can handle only code points from 0 to 127.

But what if we created an 8-bit encoding? We’d have 128 more code points. Is that enough? Yes! And so was born Latin-1, or as it is now called, ISO-8859-1.

But then came all those other pesky languages. For example, Hebrew. Is there room for both Hebrew and the Latin characters? Nope. And so was born ISO-8859-8, which contains ASCII plus all the Hebrew characters.

But are they compatible? No. When encoding characters to bytes (i.e. to code points), you have to decide whether you want to encode Latin characters or Hebrew characters. You can’t have both, because the same code point maps to a different character in each of the two encodings.

The Early Days Of Unicode

Unicode “solved” this problem. People around the world understood that this situation couldn’t go on. And while world peace isn’t a solved problem, standardizing which code point maps to which character is a solved problem. It’s called Unicode.

Unicode defines a long long long table which maps numbers (code points) to characters.

And what is the range of the code points? Well, it was originally defined as 32 bits (room for about 4 billion characters), but it was noticed that almost all the languages on Earth fit into 16 bits (i.e. 64K characters). There were some stubborn (and dead) languages that didn’t fit and needed more than 16 bits, but who cared? Only some academics.

So when a new language, Java, came out, it was decided that all strings in that language would encode characters in 16 bits, i.e. be two bytes wide. This encoding is called UTF-16. The decision was copied by other languages like C#. Did UTF-16 support code points above 65,535, i.e. beyond 16 bits? Yes, but such a code point needed four bytes, i.e. two “characters”. This means that if a string contains a code point above 65,535, that code point will be encoded as two “characters” in the string.

And guess which other language copied this behavior? That’s right — JavaScript.
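We can see this in JavaScript itself: a single emoji reports a length of two, and each of its two code units is one half of a surrogate pair (codePointAt, itself an ES6 addition, reveals the real code point):

const face = '😀';

console.log(face.length);                      // 2, because the emoji is stored as two 16-bit code units
console.log(face.charCodeAt(0).toString(16));  // 'd83d', the high surrogate
console.log(face.charCodeAt(1).toString(16));  // 'de00', the low surrogate
console.log(face.codePointAt(0).toString(16)); // '1f600', the actual code point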

Back to the Example

So why do the Hebrew characters in

stringToArray('Robin Hoווd')

generate the right characters, and yet this:

stringToArray('Robin H😀😀d')

doesn’t?

Because Hebrew is in the 16-bit “plane” (the Basic Multilingual Plane), but all the Emoji characters are above it.

Yes — Emojis came late to the scene, and there was no choice but to add them above the 16-bit plane.
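You can check where a character lives: the Hebrew letter ו has a code point well below 65,535, while 😀 sits above it:

console.log('ו'.codePointAt(0).toString(16)); // '5d5', inside the 16-bit plane
console.log('😀'.codePointAt(0).toString(16)); // '1f600', above it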

Suddenly, it wasn’t just academics who cared about Unicode characters above the 16-bit plane. Everybody cared about those characters now. Suddenly, everybody felt the pain of dealing with them.

But as we saw above, ES5 does not deal nicely with those characters.

So How Do We Iterate Over Unicode Characters?

In ES5 it’s actually pretty difficult: it basically comes down to finding those “surrogate pairs” and dealing with them yourself. You can find the solution here.
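For flavor, here’s a rough sketch of the idea (not the linked solution verbatim): detect a high surrogate and consume two code units as one character:

function stringToArrayES5(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // a code unit in 0xD800-0xDBFF is a high surrogate: it and the
    // following code unit together encode one code point above 0xFFFF
    if (code >= 0xd800 && code <= 0xdbff && i + 1 < str.length) {
      result.push(str.substring(i, i + 2));
      i++; // skip the low surrogate we just consumed
    } else {
      result.push(str.charAt(i));
    }
  }
  return result;
}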

But ES6 had to solve this problem, otherwise Emoji wouldn’t work, and everybody loves Emojis! How did they solve it? They defined that a string is an iterable, and that the iterator over a string deals with surrogate pairs automatically.

Let’s see if that’s correct:
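Here’s one way, using the iterator protocol directly:

const str = 'Robin H😀😀d';
const iterator = str[Symbol.iterator]();

let result = iterator.next();
while (!result.done) {
  console.log(result.value); // one whole character per code point
  result = iterator.next();
}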

So we create an iterator from an iterable string using str[Symbol.iterator](). Then we iterate over the iterator, using the iterator protocol in JavaScript: iterator.next().

And, yes, it works. It displays the 😀 emoji!

Using for-of

But there’s a simpler way of iterating over an iterable — for-of:
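With for-of, stringToArray might look like this:

function stringToArray(str) {
  const result = [];
  for (const character of str) {
    // for-of uses the string iterator, so surrogate pairs arrive whole
    result.push(character);
  }
  return result;
}

console.log(stringToArray('Robin H😀😀d'));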

And it works!

[ 'R', 'o', 'b', 'i', 'n', ' ', 'H', '😀', '😀', 'd' ]

And there’s an even easier solution:

Array.from
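Array.from accepts any iterable, plus an optional mapping function as its second argument. Something like this:

const str = 'Robin H😀😀d';

// the map function receives whole characters, not surrogate halves
const characters = Array.from(str, (character) =>
  character === '😀' ? '🙃' : character
);

console.log(characters);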

And it works: Array.from generates an array from the string, and it’s the correct array. Here, we map the 😀 to 🙃, and get:

[ 'R', 'o', 'b', 'i', 'n', ' ', 'H', '🙃', '🙃', 'd' ]

We can even turn it back into a string, using the array’s join method:
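const str = 'Robin H😀😀d';

const fixed = Array.from(str, (character) =>
  character === '😀' ? '🙃' : character
).join('');

console.log(fixed);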

And get…

Robin H🙃🙃d

Mission accomplished.

Epilogue: Does Babel Transpile It Correctly?

I didn’t know whether Babel’s transpilation could handle Unicode characters correctly.

So I checked (you can try it out via the link), and it works! If you’re using Babel, you can use this method, and it will still work.