Characters in NSString

2012-06-03 18:43:45 UTC

Working with Unicode in any encoding but UTF-32 (which we don’t use because, for nearly all text, it wastes tons of memory) has some pitfalls:

As UTF-8’s name implies, its code units (roughly speaking, character values) are 8 bits long. ASCII characters are all one code unit long (in UTF-8, this means that 1 ASCII character == 1 byte), but any character outside of that range must be encoded as multiple code units (multiple bytes). Thus, any single character above U+007F will end up as more than one byte in UTF-8 data.

This first observation is not limited to Emoji; it’s true of most characters in Unicode. Most characters take up more bytes in UTF-8 data than “characters” in an NSString.

As we’ll see a couple of tweets later, though, even NSString’s length can be problematic.

UTF-16 data may begin with a single specific character that is used as a byte-order mark.

(I should point out, just in case it isn’t obvious, that code units in UTF-16 are two bytes, as opposed to UTF-8’s one-byte code units. This still isn’t enough to encode any Unicode character in a single code unit, though, which will become important shortly.)

The BOM’s code point is U+FEFF. If you encode this in big-endian UTF-16 (UTF-16BE), it comes out as 0xFEFF, exactly as you’d expect. If you encode it in UTF-16LE, it comes out as 0xFFFE, which is not a character.

Thus, a BOM indicates which byte-order all of the subsequent code units should be in. If the first two bytes are 0xFFFE, you can guess that it’s 0xFEFF byte-swapped, and if that’s true, then the rest of the code units (if indeed they are UTF-16) are little-endian. The BOM isn’t considered part of the text; it’s removed in decoding.

The BOM is also used simply to promise and detect that the data is UTF-16: If you see one, in either byte order, then the rest of the data is probably UTF-16 in one form or the other.

So it’s useful to include the BOM for data that may be saved somewhere and later retrieved by something that may need to determine its encoding.

-[NSString dataUsingEncoding:] includes the BOM, so that you can just take the data and write it out (if it is the whole data—more on that in a moment). Since the data it returns has the BOM character in it, the data’s length includes the two bytes that encode that character. -[NSString lengthOfBytesUsingEncoding:], on the other hand, counts only the bytes for the characters in the string; it does not add 2 bytes for a BOM.

A corollary to this is that if you send dataUsingEncoding: to an empty string, the data it returns will not be empty. So, are you testing whether the string you’ve just encoded is empty by testing whether the data’s length is zero? If so, your test is broken: the length includes the BOM, so it will never be zero.

One problem with the BOM is that it should only appear at the start of the data, which means you can’t just encode a bunch of strings using dataUsingEncoding: and then, say, write them all to a file or socket one after another, because the output will end up with BOMs (or, worse, invalid characters, namely U+FFFE) sprinkled throughout.

The naïve solution to that is to staple the strings together, then encode and write out the entire agglomeration. If performance (particularly memory consumption) is an issue and you’re writing the output out piecemeal anyway, a more efficient solution would be to use getCharacters:range: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to extract raw UTF-16 code units into your own buffer.

Unicode, the character set, can hold up to 0x110000 characters (code points U+0000 through U+10FFFF). Foundation’s unichar type is 16-bit, which means it can only hold values within the range of 0x0000 to 0xFFFF.

This is a problem for all of the characters above 0xFFFF, including the Emoji characters, which are in the range from U+1F300 to U+1F64F.

UTF-16 addresses this problem by means of a system called surrogates. It’s similar to what UTF-8 does for the same problem, except that the code unit values UTF-16 uses come from two ranges of code points that Unicode reserves for exactly this purpose, so they can never be mistaken for real characters.

Surrogates come in pairs. The first one is called the high surrogate, and the second is called the low surrogate. The two reserved ranges are named accordingly: high surrogates occupy U+D800 to U+DBFF, and low surrogates U+DC00 to U+DFFF.

The bomb character, 💣, encodes to UTF-16 as 0xD83D 0xDCA3.

NSString and CFString use the word “character” all over the place, but what they really mean is “UTF-16 code unit”. So the aforementioned single-character string actually contains two “characters”:

2012-06-03 13:15:45.498 test[14761:707] 0: 0xD83D
2012-06-03 13:15:45.501 test[14761:707] 1: 0xDCA3

Beware of such things when enforcing length limits. Be sure of whether you’re counting ideal characters or code units in some encoding. Also make sure you’re clear on whether a destination with a length limit (e.g., Twitter) counts up to that limit in ideal characters or in code-units in some encoding.

Also, as @schwa mentions in the same tweet, this all applies to characterAtIndex: as well (indeed, everything in NS/CFString that talks about “characters”). So, for example, [bombString characterAtIndex:0UL] will really retrieve only half of the character.

As noted above, each of these Emoji characters is encoded in UTF-16 as two code units in a surrogate pair. A surrogate pair has a high surrogate and a low surrogate.

The high surrogate identifies a range of 2¹⁰ (1,024) characters; the low surrogate identifies a specific character within that range. Since the poop character and the bomb character are within the same range, they have the same high surrogate—i.e., the same first “character” in their NSString/UTF-16 representations.

As the example demonstrates, just because a string contains only one ideal character doesn’t mean that characterAtIndex:0 will return 1.0 character. It may return 0.5 characters.

Greg Titus answered this one for me:

No worries about surrogate pairs or lengths greater than 1 for characters that exist in ASCII (≤ U+007F).

Recap

  • “Characters” in NS/CFString are really UTF-16 code units.
  • Some characters in Unicode—including, but by no means limited to, Emoji—are outside the range of what a single UTF-16 code unit—a single NSString “character”—can hold.
  • Therefore, do not assume that a single character is a single “character”.
  • Neither should you assume that a single character will be a single byte in UTF-8. That sounds obvious, but…
  • Both of the preceding rules can trip you up when checking against length limits (or sending text to something else that will do such a check). Make sure you know whether the limit is in ideal characters (U+whatever) or code units in some encoding, and make sure you count the appropriate unit and do so correctly.
  • Those rules also have a way of tripping you up whenever you extract a single “character” at a time from a string. You should probably only do this when looking for known ASCII characters (e.g., for parsing purposes), and even then, please consider using NSScanner or NSRegularExpression instead.

5 Responses to “Characters in NSString”

  1. Jesper Says:

    What.

  2. Carl Says:

It gets worse: combining characters. ‘Ā’ can be written as one Unicode codepoint or as an ‘A’ plus a combining macron. If you want to truncate text correctly, you need to use a normalization form that turns combining characters into single codepoints first.

  3. David Says:

The problem Carl brings up is actually solved via -rangeOfComposedCharacterSequence*: or enumerateSubstringsInRange:options:usingBlock: (you can request iteration by composed character sequence). Not all combining characters can be normalized into a single code point.

  4. Kevin Says:

    Good information, thanks.

Ross Carter also has an old blog post that talks about similar issues: http://rosscarter.com/2008/173.html

  5. Jesper Says:

    Just to explain my previous comment: initially, this post cut off at the U+1F4A9 PILE OF POO in the first code block.
