A recurring assumption is that when you deal with English text only, you are unlikely to encounter characters outside the ASCII character set. To avoid the work of handling Unicode correctly, people are tempted to strip non-ASCII characters or remove accents from letters.
The examples below show that this assumption is wrong: even for English text you should take care to handle Unicode characters correctly.
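To make the cost of those two shortcuts concrete, here is a minimal Python sketch. The helper names `strip_non_ascii` and `remove_accents` and the sample string are assumptions chosen for illustration; they do not come from any particular library.

```python
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def remove_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample: ordinary English with typographic quotes,
# two loanwords, a currency sign and an ellipsis.
sample = "“Naïve” café: £5…"

print(strip_non_ascii(sample))  # Nave caf: 5
print(remove_accents(sample))   # “Naive” cafe: £5…
```

Stripping silently deletes letters, quotes and the currency sign, so the result is no longer the same text. Removing accents changes the spelling and still leaves non-ASCII punctuation behind, so it does not avoid Unicode handling either.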
English text occasionally contains diacritics, as in café, naïve, or résumé.
Emoji are popular on social media.
☃ U+2603: SNOWMAN
😀 U+1F600: GRINNING FACE
🐪 U+1F42A: DROMEDARY CAMEL
Note that most emoji are outside the Basic Multilingual Plane. Many newer additions consist of more than one code point, for example a base character followed by a variation selector (U+FE0E or U+FE0F).
Almost all written text has punctuation marks which are outside the ASCII character set.
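One way to check points like these for any string, whether it contains emoji or punctuation, is to list its code points. The sketch below uses Python's standard `unicodedata` module; the `describe` helper and the sample strings are assumptions made for illustration.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one line per code point, flagging non-ASCII and non-BMP ones."""
    lines = []
    for ch in text:
        cp = ord(ch)
        name = unicodedata.name(ch, "<unnamed>")
        notes = []
        if cp > 0x7F:
            notes.append("non-ASCII")
        if cp > 0xFFFF:
            notes.append("outside BMP")
        suffix = f" [{', '.join(notes)}]" if notes else ""
        lines.append(f"U+{cp:04X} {name}{suffix}")
    return lines


# Typographic quotes, an apostrophe and an ellipsis in an English fragment.
for line in describe("“it’s”…"):
    print(line)

# SNOWMAN followed by VARIATION SELECTOR-16: one visible symbol, two code points.
snowman_emoji = "\u2603\ufe0f"
print(len(snowman_emoji))  # 2

# GRINNING FACE is a single code point but needs two UTF-16 code units.
print(len("\U0001F600".encode("utf-16-le")) // 2)  # 2
```

Code that assumes one character per code point, or one code point per 16-bit unit, will miscount or split text like this.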
There are a few common symbols in use: