A Unicode code point, which programmers often think of as one character, often corresponds to what the user thinks of as one character. Sometimes, however, a “character” is made up of multiple code points, as the examples above show.
This means that operations like slicing a string, or getting a character at a given index, may not work as expected. For instance, when the é is stored as two code points, the 4th code point of the string "Café" is 'e' (without the accent). Similarly, clipping the string to length 4 will remove the accent.
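As a small illustration in Python, whose strings are indexed by code point, here is a sketch of the indexing and clipping problem. It assumes the é is stored in its two-code-point form; the variable names are made up for this example.

```python
# "Café" with the é stored as 'e' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = "Cafe\u0301"

fourth = decomposed[3]    # the 4th code point
clipped = decomposed[:4]  # the first 4 code points

print(len(decomposed))    # 5
print(fourth == "e")      # True: the accent is a separate code point
print(clipped == "Cafe")  # True: clipping dropped the accent
```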
The technical term for such a group of code points is a grapheme cluster. See UAX #29: Unicode Text Segmentation.
A letter with a diacritic may be represented as the base letter followed by a combining mark. You normally think of é as one character, but it may be stored as 2 code points:
U+0065: LATIN SMALL LETTER E
U+0301: COMBINING ACUTE ACCENT
Similarly, ç = c + ¸, and å = a + ˚.
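To see the individual code points, one option in Python is to look up each one's name with the standard `unicodedata` module. This is only a sketch; `describe` is a hypothetical helper written for this example.

```python
import unicodedata

def describe(text):
    """List each code point in text as 'U+XXXX NAME' (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Base letter followed by a combining mark
e_acute = "e\u0301"
c_cedilla = "c\u0327"
a_ring = "a\u030a"

for text in (e_acute, c_cedilla, a_ring):
    print(text, "=", " + ".join(describe(text)))
```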
To complicate matters, there is often a code point for the composed form as well:
"Café" = 'C' + 'a' + 'f' + 'e' + '´'
"Café" = 'C' + 'a' + 'f' + 'é'
Although these strings look the same, they are not equal, and they don't even have the same length (5 and 4 respectively).
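A common way to deal with this is to normalize both strings to the same form before comparing them. The sketch below uses Python's `unicodedata.normalize` with the NFC form (which prefers composed code points); whether NFC or another form suits a given application is a judgment call this example does not settle.

```python
import unicodedata

decomposed = "Cafe\u0301"  # 'C' 'a' 'f' 'e' + combining acute accent
composed = "Caf\u00e9"     # 'C' 'a' 'f' 'é'

print(decomposed == composed)          # False
print(len(decomposed), len(composed))  # 5 4

# After normalizing to NFC, the two spellings compare equal.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)          # True
```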
There is this thing called Zalgo text, which pushes this to the extreme. Here is the first grapheme cluster of the example. It consists of 15 code points: the Latin letter H and 14 combining marks.
H̡̫̤̤̣͉̤ͭ̓̓̇͗̎̀
Although this doesn't show up in normal text, it shows that a “character” really can consist of an arbitrary number of code points.
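The following Python sketch groups each base character with the combining marks that follow it. It is a rough approximation for illustration only, not the full UAX #29 algorithm (it ignores emoji sequences, Hangul, and other rules). The function name and the sample string are made up for this example.

```python
import unicodedata

def group_combining_marks(text):
    """Attach every mark (general category M*) to the character before it.

    A simplification: real grapheme cluster segmentation follows UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Made-up sample: 'H' with 5 combining marks stacked on it, followed by 'i'
zalgo_like = "H\u0321\u032b\u0324\u0301\u0307"
clusters = group_combining_marks(zalgo_like + "i")

print(len(zalgo_like))  # 6 code points
print(len(clusters))    # 2 groups: the decorated 'H' and 'i'
```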
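A lot of emoji consist of more than one code point. Some are followed by a variation selector (U+FE0E or U+FE0F), and some are sequences of several emoji joined with a zero-width joiner (U+200D). On platforms which support it, such a sequence is rendered as a single emoji, for example an emoji of a family with two kids.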
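As an illustration, here is a Python sketch of one such joined sequence: four person emoji (MAN, WOMAN, GIRL, BOY) connected by U+200D. The exact sequence is chosen for this example and may differ from the one the text refers to.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# MAN, WOMAN, GIRL, BOY joined by zero-width joiners
people = ["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"]
family = ZWJ.join(people)

print(len(family))  # 7 code points: 4 emoji + 3 joiners
print([unicodedata.name(ch) for ch in family])
```

Where the font supports it, `family` displays as one emoji even though it is seven code points long.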