Strings

Other topics

Remarks:

In .NET strings System.String are sequence of characters System.Char, each character is an UTF-16 encoded code-unit. This distinction is important because spoken language definition of character and .NET (and many other languages) definition of character are different.

One character, which should be correctly called grapheme, it's displayed as a glyph and it is defined by one or more Unicode code-points. Each code-point is then encoded in a sequence of code-units. Now it should be clear why a single System.Char does not always represent a grapheme, let's see in real world how they're different:

  • One grapheme, because of combining characters, may result in two or more code-points: is composed by two code-points: U+0061 LATIN SMALL LETTER A and U+0300 COMBINING GRAVE ACCENT. This is the most common mistake because "à".Length == 2 while you may expect 1.
  • There are duplicated characters, for example à may be a single code-point U+00E0 LATIN SMALL LETTER A WITH GRAVE or two code-points as explained above. Obviously they must compare the same: "\u00e0" == "\u0061\u0300" (even if "\u00e0".Length != "\u0061\u0300".Length). This is possible because of string normalization performed by String.Normalize() method.
  • An Unicode sequence may contain a composed or decomposed sequence, for example character U+D55C HAN CHARACTER may be a single code-point (encoded as a single code-unit in UTF-16) or a decomposed sequence of its syllables , and . They must be compared equal.
  • One code-point may be encoded to more than one code-units: character 𠂊 U+2008A HAN CHARACTER is encoded as two System.Char ("\ud840\udc8a") even if it is just one code-point: UTF-16 encoding is not fixed size! This is a source of countless bugs (also serious security bugs), if for example your application applies a maximum length and blindly truncates string at that then you may create an invalid string.
  • Some languages have digraph and trigraphs, for example in Czech ch is a standalone letter (after h and before i then when ordering a list of strings you will have fyzika before chemie.

There are much more issues about text handling, see for example How can I perform a Unicode aware character by character comparison? for a broader introduction and more links to related arguments.

In general when dealing with international text you may use this simple function to enumerate text elements in a string (avoiding to break Unicode surrogates and encoding):

public static class StringExtensions
{
    public static IEnumerable<string> EnumerateCharacters(this string s)
    {
        if (s == null)
            return Enumerable.Empty<string>();

        var enumerator = StringInfo.GetTextElementEnumerator(s.Normalize());
        while (enumerator.MoveNext())
            yield return (string)enumerator.Value;
    }
}

Count distinct characters

If you need to count distinct characters then, for the reasons explained in Remarks section, you can't simply use Length property because it's the length of the array of System.Char which are not characters but code-units (not Unicode code-points nor graphemes). If, for example, you simply write text.Distinct().Count() you will get incorrect results, correct code:

int distinctCharactersCount = text.EnumerateCharacters().Count();

One step further is to count occurrences of each character, if performance aren't an issue you may simply do it like this (in this example regardless of case):

var frequencies = text.EnumerateCharacters()
    .GroupBy(x => x, StringComparer.CurrentCultureIgnoreCase)
    .Select(x => new { Character = x.Key, Count = x.Count() };

Count characters

If you need to count characters then, for the reasons explained in Remarks section, you can't simply use Length property because it's the length of the array of System.Char which are not characters but code-units (not Unicode code-points nor graphemes). Correct code is then:

int length = text.EnumerateCharacters().Count();

A small optimization may rewrite EnumerateCharacters() extension method specifically for this purpose:

public static class StringExtensions
{
    public static int CountCharacters(this string text)
    {
        if (String.IsNullOrEmpty(text))
            return 0;

        int count = 0;
        var enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            ++count;

        return count;
    }
}

Count occurrences of a character

Because of the reasons explained in Remarks section you can't simply do this (unless you want to count occurrences of a specific code-unit):

int count = text.Count(x => x == ch);

You need a more complex function:

public static int CountOccurrencesOf(this string text, string character)
{
    return text.EnumerateCharacters()
        .Count(x => String.Equals(x, character, StringComparer.CurrentCulture));
}

Note that string comparison (in contrast to character comparison which is culture invariant) must always be performed according to rules to a specific culture.

Split string into fixed length blocks

We cannot break a string into arbitrary points (because a System.Char may not be valid alone because it's a combining character or part of a surrogate) then code must take that into account (note that with length I mean the number of graphemes not the number of code-units):

public static IEnumerable<string> Split(this string value, int desiredLength)
{
    var characters = StringInfo.GetTextElementEnumerator(value);
    while (characters.MoveNext())
        yield return String.Concat(Take(characters, desiredLength));
}

private static IEnumerable<string> Take(TextElementEnumerator enumerator, int count)
{
    for (int i = 0; i < count; ++i)
    {
        yield return (string)enumerator.Current;

        if (!enumerator.MoveNext())
            yield break;
    }
}

Convert string to/from another encoding

.NET strings contain System.Char (UTF-16 code-units). If you want to save (or manage) text with another encoding you have to work with an array of System.Byte.

Conversions are performed by classes derived from System.Text.Encoder and System.Text.Decoder which, together, can convert to/from another encoding (from a byte X encoded array byte[] to an UTF-16 encoded System.String and vice-versa).

Because the encoder/decoder usually works very close to each other they're grouped together in a class derived from System.Text.Encoding, derived classes offer conversions to/from popular encodings (UTF-8, UTF-16 and so on).

Examples:

Convert a string to UTF-8

byte[] data = Encoding.UTF8.GetBytes("This is my text");

Convert UTF-8 data to a string

var text = Encoding.UTF8.GetString(data);

Change encoding of an existing text file

This code will read content of an UTF-8 encoded text file and save it back encoded as UTF-16. Note that this code is not optimal if file is big because it will read all its content into memory:

var content = File.ReadAllText(path, Encoding.UTF8);
File.WriteAllText(content, Encoding.UTF16);

Object.ToString() virtual method

Everything in .NET is an object, hence every type has ToString() method defined in Object class which can be overridden. Default implementation of this method just returns the name of the type:

public class Foo
{
}

var foo = new Foo();
Console.WriteLine(foo); // outputs Foo

ToString() is implicitly called when concatinating value with a string:

public class Foo
{
    public override string ToString()
    {
        return "I am Foo";
    }
}

var foo = new Foo();
Console.WriteLine("I am bar and "+foo);// outputs I am bar and I am Foo

The result of this method is also extensively used by debugging tools. If, for some reason, you do not want to override this method, but want to customize how debugger shows the value of your type, use DebuggerDisplay Attribute (MSDN):

// [DebuggerDisplay("Person = FN {FirstName}, LN {LastName}")]
[DebuggerDisplay("Person = FN {"+nameof(Person.FirstName)+"}, LN {"+nameof(Person.LastName)+"}")]
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set;}
    // ...
}

Immutability of strings

Strings are immutable. You just cannot change existing string. Any operation on the string crates a new instance of the string having new value. It means that if you need to replace a single character in a very long string, memory will be allocated for a new value.

string veryLongString = ...
// memory is allocated
string newString = veryLongString.Remove(0,1); // removes first character of the string.

If you need to perform many operations with string value, use StringBuilder class which is designed for efficient strings manipulation:

var sb = new StringBuilder(someInitialString);
foreach(var str in manyManyStrings)
{
    sb.Append(str);
} 
var finalString = sb.ToString();

Сomparing strings

Despite String is a reference type == operator compares string values rather than references.

As you may know string is just an array of characters. But if you think that strings equality check and comparison is made character by character, you are mistaken. This operation is culture specific (see Remarks below): some character sequences can be treated as equal depending on the culture.

Think twice before short circuiting equality check by comparing Length properties of two strings!

Use overloads of String.Equals method which accept additional StringComparison enumeration value, if you need to change default behavior.

Contributors

Topic Id: 2227

Example Ids: 7280,7281,7282,7283,7284,10689,10690,10691

This site is not affiliated with any of the contributors.