Revamping Rogue Strings to Use UTF-8 Encoding

Character encoding is tricky these days, particularly from a language design and API standpoint.

In the early days of computing, the 128-value ASCII Table was all you needed to work with, meaning that a string could just be composed of bytes and you'd be done.

The original Unicode proposal defined a 16-bit character space (65,536 values). Java, JavaScript, and many other new languages just doubled the size of their string data and everything seemed fine once again, even though we were using twice as much space as necessary most of the time. The bulk of any web page is ASCII HTML and JavaScript code rather than localized text.

But then we realized we needed an even larger character space to truly have global language support. The Chinese, Japanese, and Korean (CJK) languages alone have 80,000 ideograph characters. So Unicode was expanded to include the space 0-10FFFF16, a little over a million characters.

Java & co. began to use a variable-length 16-bit system called UTF-16. It's both fairly simple and a massive kludge in my opinion. It works like this:

Unicode ValueUTF-16 Representation
0000...D7FF0000...D7FF
E000..FFFFE000...FFFF
D800..DFFF(Illegal)
10000...10FFFFD800+((ch-0x10000)>>10) followed by DC00+((ch-0x10000)&0x3FF) - called a surrogate pair

It's strange to have an encoding scheme where a certain range of original data is flat-out illegal. But the subtle and pervasive problem with UTF-16 is that it requires special case handling when working with strings and it's all too easy to leave out those checks and write code that doesn't properly work with, say, emoji.

I originally wrote Rogue to use 16-bit characters and ignored the problem of "extended Unicode" as something to figure out later.

I then came back and implemented a UTF-16 system and it worked fine but had the caveats I mentioned above.

After reading the UTF-8 Everywhere Manifesto recently I was inspired to revamp Rogue strings once again. The result is conceptually simple and elegant: all string methods interact with strings as though they were composed of 21-bit Unicode characters, while underneath the hood all strings are actually a sequence of UTF-8 encoded bytes (more info on UTF-8 encoding here].

I had some real motivation to figure out an elegant system: I've recently been doing a lot of Swift programming and, while Swift is trying to tackle this same problem, Swift is also a great example of what not to do. Quite honestly Swift's string handling is just a mess: it doesn't come with a standard set of string methods (there's no lastIndexOf() for instance) and you can't just index directly into strings; you have to retrieve a string's startIndex() and then call advancedBy(Int) to move the cursor over. That's a pretty clear sign that it's operating on UTF-8 or UTF-16 data and trying to avoid having to scan through all the data every access, but it's just too cumbersome.

Recently I wanted to take my app's bundle identifier, e.g. com.abepralle.MyApp, and get the substring that followed the last period in the ID, e.g. MyApp. Here is the code I had to write to accomplish that in Swift:

let bundleID = "com.abepralle.MyApp"

func lastIndexOf( st:String, lookFor:String )->Int?
{
  if let r = st.rangeOfString( lookFor, options:.BackwardsSearch )
  {
    return st.startIndex.distanceTo( r.startIndex )
  }
  
  return nil
}

if let dotIndex = lastIndexOf( bundleID, lookFor:"." )
{
  let name = bundleID.substringFromIndex( bundleID.startIndex.advancedBy(dotIndex+1) )
}

It could be a little less convoluted if I return a String.CharacterView.Index? instead of an Int?, return r.startIndex as the value, and get the substringFromIndex( dotIndex.advancedBy(1) ). But it's still pretty cumbersome.

Back to Rogue, I wanted to avoid that complexity while simultaneously allowing fast whole-character indexing and using UTF-8 internally.

My two big tricks were 1) to have each string cache the UTF-8 byte offset and the character index of the most recent character access, and 2) to set an is_ascii flag when all characters are single-byte values and use a direct offset instead of scanning through the string.

The built-in String class looks like this in C++ now:

struct RogueString : RogueObject
{
  RogueInt32 byte_count;       // in UTF-8 bytes
  RogueInt32 character_count;  // in whole characters
  RogueInt32 is_ascii;  // every character 1 byte?
  RogueInt32 cursor_offset;
  RogueInt32 cursor_index;
  RogueInt32 hash_code;
  RogueByte  utf8[];
};

Besides a validation step that ensures valid UTF-8 and sets the character_count and hash_code, this is the workhorse method that most other string methods rely on for setting correct indices before a character-at or substring operation, etc.:

RogueInt32 RogueString_set_cursor( RogueString* THIS, int index )
{
  // Sets this string's cursor_offset and cursor_index and
  // returns cursor_offset.
  if (THIS->is_ascii)
  {
    return THIS->cursor_offset = THIS->cursor_index = index;
  }

  RogueByte* utf8 = THIS->utf8;

  RogueInt32 c_offset;
  RogueInt32 c_index;
  if (index == 0)
  {
    THIS->cursor_index = 0;
    return THIS->cursor_offset = 0;
  }
  else if (index >= THIS->character_count - 1)
  {
    c_offset = THIS->byte_count;
    c_index = THIS->character_count;
  }
  else
  {
    c_offset  = THIS->cursor_offset;
    c_index = THIS->cursor_index;
  }

  while (c_index < index)
  {
    while ((utf8[++c_offset] & 0xC0) == 0x80) {}
    ++c_index;
  }

  while (c_index > index)
  {
    while ((utf8[--c_offset] & 0xC0) == 0x80) {}
    --c_index;
  }

  THIS->cursor_index = c_index;
  return THIS->cursor_offset = c_offset;
}

Incidentally I also reworked StringBuilder to use the same system.

Here's a small sample Rogue program that shows off the new String handling:

local st = "Crazy๐Ÿ˜œCool๐Ÿ˜ŽStuff"
trace st, st.count, st.byte_count
trace st[5], st[6]
trace st.from_first( '๐Ÿ˜œ' ).up_to_last( '๐Ÿ˜Ž' )

Output:

st:Crazy๐Ÿ˜œCool๐Ÿ˜ŽStuff st.count:16 st.byte_count:22
st[5]:๐Ÿ˜œ st[6]:C
st.from_first( '๐Ÿ˜œ' ).up_to_last( '๐Ÿ˜Ž' ):๐Ÿ˜œCool๐Ÿ˜Ž