The Best IntX Yet

[Edit: I've changed my mind about all this; see Even Better IntX].

I've always had some go-to technique I call IntX Encoding for compressing integers into a variable number of bits (fewer on average) but the specifics have changed over the years.

Old Ideas

Originally it was "White Wolf" style or "10-Again" - White Wolf tabletop games tend to have a "10-Again" rule where whenever you roll a 10 on D10 you get another roll to add to it. Bytes 0-254 would be normal and then byte 255 meant "add the next byte to it". This is obviously a poor way to do it in general but I was a newb and it actually worked fine for the small-scale numbers I was typically working with in those early days, often in the 0-512 range.

Then it was "simplified unsigned UTF-8". Having the first few bits of the first byte seemed genius - %0=1 byte, %10=2 bytes, %11=3 bytes - but in UTF-8 having each successive byte start with %10 seemed like a waste. I did something like that, but chose to use all 8 bits of my successive bytes after the first one. But my numbers were essentially unsigned, meaning that to get something as simple as "-1" I had to encode the full 32-bit value in 5 bytes (1 byte to tell me there were 4 bytes following, then the 32-bit value in 4 bytes).

I eventually had an epiphany and changed my encoded values to be signed rather than unsigned. Numbers -64..63 were more or less represented with the single byte 0xxxxxxx, -8192..8191 with two bytes 01xxxxxx xxxxxxxx, and so on.

And that's been my favorite scheme until now.

The New Idea

Reading up on Google's Protocol Buffers this morning, I realized that their basic varint encoding was even simpler than my scheme and just as effective: if the high bit of a byte is a 1 then it's a signal to continue and read the next byte as the next 7 bits (plus terminator bit). I feel as though I've come full circle in a sense - instead of 255 marking a continuation, a high bit of 1 marks a continuation. "1-Again" you might call it. 😀

Their approach has two disadvantages in my book: first, bytes are read in low-to-high order, which makes it a little tricky to reassemble since you don't know how many there are going to be. Second, they do a weird Zig-Zag Encoding thing with negative numbers.

I am switching my go-to technique of variable-bit integer encoding (or IntX as I call it) to be this variation on Google's idea:

  1. The integer is broken up into 7-bit groups and written out in high-to-low order.

  2. In each encoded byte a high bit of 1 indicates that the integer continues with the next byte.

  3. The encoded integer is considered signed. This means that Bit 6 (the second-highest) of the first byte should be propagated throughout the high bits that aren't read in.

Here's some Rogue code I programmed up to test it:

# New IntX (1-Again)
local bytes = Byte[]
println decode( encode(0x80000000,bytes) )
forEach (n in -300..-280) println decode( encode(n,bytes) )
forEach (n in -257..-239) println decode( encode(n,bytes) )
forEach (n in -129..-126) println decode( encode(n,bytes) )
forEach (n in -17..17) println decode( encode(n,bytes) )
forEach (n in 126..129) println decode( encode(n,bytes) )
forEach (n in 239..257) println decode( encode(n,bytes) )
forEach (n in 280..300) println decode( encode(n,bytes) )
println decode( encode(0x7fffffff,bytes) )

method encode( n:Int32,  bytes:Byte[], continuation=0:Int32 )->Byte[]
  if (n < 0)
    if (((n & !0x7f) != !0x7f) or not (n & 0x40))
      # Recurse if the remaining bits are not all 1's or if
      # the 7th bit of the current output is a 0, which
      # would fail to correctly signal a signed number
      # unless we recurse further.
      encode( n:>>>:7, bytes, 0x80 )
    endIf
  elseIf (n > 0x3f)
    # "n > 0x3f" means "recurse if the number is more than
    # 7 bits OR if the 7th bit is a 1" - the latter
    # would otherwise indicate a negative number when
    # decoded.
    encode( n:>>>:7, bytes, 0x80 )
  endIf
  bytes.add( continuation | (n & 127) )
  return bytes

method decode( bytes:Byte[] )->Int32
  local reader = bytes.reader
  local byte = reader.read
  local result = (byte & 0x7f)
  if (byte & 0x40) result |= 0xffffff80

  while (byte & 0x80)
    byte = reader.read
    result = (result :<<: 7) | (byte & 0x7f)
  endWhile

  bytes.clear  # reset for next time
  return result

The output is correct!

> roguec Test.rogue --execute
Writing Test.h...
Writing Test.cpp...
g++ Test.cpp -o test && ./test 

[Output edited into columns afterward]
-2147483648 -247        2           251         
-300        -246        3           252         
-299        -245        4           253         
-298        -244        5           254         
-297        -243        6           255         
-296        -242        7           256         
-295        -241        8           257         
-294        -240        9           280         
-293        -239        10          281         
-292        -129        11          282         
-291        -128        12          283         
-290        -127        13          284         
-289        -126        14          285         
-288        -17         15          286         
-287        -16         16          287         
-286        -15         17          288         
-285        -14         126         289         
-284        -13         127         290         
-283        -12         128         291         
-282        -11         129         292         
-281        -10         239         293         
-280        -9          240         294         
-257        -8          241         295         
-256        -7          242         296         
-255        -6          243         297         
-254        -5          244         298         
-253        -4          245         299         
-252        -3          246         300         
-251        -2          247         2147483647  
-250        -1          248         
-249        0           249         
-248        1           250