unicode

Topics related to unicode:

Getting started with Unicode

The Unicode Standard is an international standardized character set. It attempts to assign every character and symbol from every writing system a unique number, called a code point. With every major new version, additional characters are added to the Standard to achieve this goal. Because it provides a unified character set for all writing systems, text can be exchanged in a Unicode format independent of any given platform.

The Unicode Standard also contains property data on the characters, and defines algorithms on how to properly manipulate characters. For example, these algorithms provide the correct method to search and display Unicode text.
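As a small illustration of code points and property data, the sketch below uses Python's standard `unicodedata` module to look up the name and general category of a few characters. The helper name `describe` and the sample characters are choices made for this example only.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# "A", e with acute accent, and the euro sign, written as escapes
for ch in ["A", "\u00e9", "\u20ac"]:
    print(describe(ch))
```

The general category ("Lu" for an uppercase letter, "Sc" for a currency symbol, and so on) is one of many properties the Standard defines for each character.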

English text is not ASCII only

An assumption which pops up regularly is that when dealing with English text only, you are unlikely to encounter characters outside the ASCII character set. To avoid problems with handling Unicode correctly, people are tempted to do things like stripping non-ASCII characters, or removing any accents on letters.

Words such as "naïve", "café" and "résumé", typographic quotation marks, and currency symbols such as "£" all occur in ordinary English text. They show this assumption is wrong, and even for English text you should take care to handle Unicode characters correctly.
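To make the risk concrete, here is a minimal sketch of what stripping non-ASCII characters does to an ordinary English sentence. The sample sentence and the helper `strip_non_ascii` are made up for this illustration; non-ASCII characters are written as escapes so the snippet does not depend on the encoding of the source file.

```python
# An English sentence containing accented letters, curly quotes and a pound sign.
text = "The na\u00efve caf\u00e9 owner said \u201cr\u00e9sum\u00e9s cost \u00a35\u201d"

def strip_non_ascii(s):
    """Drop every character that is not ASCII (the tempting but lossy approach)."""
    return s.encode("ascii", errors="ignore").decode("ascii")

stripped = strip_non_ascii(text)

print(text.isascii())  # False: plain English text, yet not ASCII only
print(stripped)        # The nave caf owner said rsums cost 5
```

The stripped result silently changes words and loses the currency symbol, which is usually worse than handling the characters properly.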

UTF-8 as an encoding of Unicode

What is UTF-8?

UTF-8 is a variable-length encoding that uses 8-bit code units, hence the name. It can encode any Unicode code point, and it is the dominant encoding on the internet (before 2008, ASCII was).
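The variable length is easy to observe. The following sketch encodes four sample characters (chosen for this example) and records how many 8-bit code units each one needs in UTF-8.

```python
# "A", e with acute accent, euro sign, and an emoji outside the Basic Multilingual Plane
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes (8-bit code units) each character occupies in UTF-8
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")

print(utf8_lengths)  # [1, 2, 3, 4]
```

ASCII characters keep their single-byte representation, which is one reason UTF-8 is backward compatible with ASCII.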

Is UTF-8 the same as Unicode?

"Unicode" isn't an encoding - it is a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them. But a lot of documentation uses it to refer to encodings. On Windows, for example, the term Unicode is used to refer to UTF-16.

UTF-8 is only one of the ways to encode Unicode. As an encoding, it converts sequences of bytes to sequences of characters and vice versa. UTF-16 and UTF-32 are other Unicode transformation formats.
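The difference between the character set and its encodings can be sketched by encoding the same string three ways. The sample string is an assumption for this example, and the explicit little-endian codec names (`utf-16-le`, `utf-32-le`) are used so that Python does not prepend a byte order mark.

```python
text = "caf\u00e9"  # four code points

sizes = {}
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(enc)
    # Decoding with the same encoding gives back the same code points.
    assert data.decode(enc) == text
    sizes[enc] = len(data)

print(sizes)  # {'utf-8': 5, 'utf-16-le': 8, 'utf-32-le': 16}
```

The code points are identical in all three cases; only the byte representation differs.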

BOM of UTF-8

All three may begin with a byte order mark (BOM), a magic number that signals several important things to a program (for example, Notepad++): that the incoming text stream is Unicode, and which Unicode encoding is used for the stream. However, the Unicode Consortium recommends storing UTF-8 without any signature. Some software, for example the gcc compiler, may complain if a file contains the UTF-8 signature. Many Windows programs, on the other hand, use the signature. And trying to detect the encoding of a stream of bytes doesn't always work.
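A minimal sketch of dealing with the UTF-8 signature in Python follows. It relies on the standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the helper `strip_utf8_bom` is a hypothetical name introduced for this example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# The "utf-8-sig" codec writes the signature when encoding ...
with_bom = "hello".encode("utf-8-sig")
print(with_bom)  # b'\xef\xbb\xbfhello'

# ... and removes it when decoding; plain "utf-8" keeps it as U+FEFF.
print(with_bom.decode("utf-8-sig") == "hello")   # True
print(with_bom.decode("utf-8") == "\ufeffhello")  # True
print(strip_utf8_bom(with_bom))                  # b'hello'
```

Decoding with `utf-8-sig` is a convenient choice when input may or may not carry the signature, since it accepts both forms.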

How to check if your project has UTF-8 encoding or not

UTF-8 is not yet universal, and software engineers and data scientists often run into problems with the encoding of text streams. Sometimes a project is supposed to use UTF-8, but another encoding is actually being used. Several tools can detect the encoding of a file:

Command-line tools, such as the Linux tool 'file' or PowerShell;
The Python package "chardet";
Notepad++, perhaps the most popular tool for a manual check.
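Besides these tools, a quick programmatic check is to test whether the bytes decode as strict UTF-8. The sketch below uses only the Python standard library; `is_valid_utf8` is a hypothetical helper name, and this is a validity check, not a statistical detector like chardet. In particular, pure ASCII data passes the check even if it was written with a different encoding in mind.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

utf8_bytes = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: a lone 0xE9 byte is not valid UTF-8
```

To check a file, read it in binary mode (for example with `open(path, "rb")`) and pass the resulting bytes to the function.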

Characters can consist of multiple code points

A Unicode code point, which programmers often think of as one character, usually corresponds to what the user thinks of as one character. Sometimes, however, a “character” is made up of multiple code points, such as an "é" written as the letter 'e' followed by a combining acute accent.

This means that operations like slicing a string, or getting the character at a given index, may not work as expected. For instance, when the "é" is stored as 'e' plus a combining accent, the 4th character of the string "Café" is 'e' (without the accent). Similarly, clipping the string to length 4 will remove the accent.
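The "Café" case can be reproduced in Python with the standard `unicodedata` module. The sketch assumes the decomposed form (NFD), in which the accent is a separate combining code point; in the composed form (NFC) the same word has only four code points.

```python
import unicodedata

# NFD splits the accented letter into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "Caf\u00e9")
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))            # 5 code points
print(decomposed[3])              # e  (the 4th code point, without the accent)
print(decomposed[:4] == "Cafe")   # True: clipping to length 4 drops the accent
print(len(composed))              # 4 code points
```

Normalizing to NFC before slicing helps in this particular case, but it is not a general fix, because many user-perceived characters have no precomposed form.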

The technical term for such a group of code points is a grapheme cluster. See UAX #29: Unicode Text Segmentation.
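The idea can be approximated with the standard library alone. The sketch below attaches each combining mark to the preceding base character. It is deliberately simplified and is not an implementation of UAX #29: it only looks at the canonical combining class and ignores other cases the full algorithm covers, such as emoji joined with zero-width joiners or Hangul syllable sequences. The function name `approximate_clusters` is hypothetical.

```python
import unicodedata

def approximate_clusters(text):
    """Group combining marks with the preceding character (simplified, not UAX #29)."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "Cafe\u0301"  # 'e' followed by a combining acute accent
clusters = approximate_clusters(word)

print(len(word))      # 5 code points
print(len(clusters))  # 4 user-perceived characters
```

For production use, a library that implements the full segmentation rules is the safer choice.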