The Best IntX Yet
[Edit: I've changed my mind about all this; see Even Better IntX].
I've always had some go-to technique I call IntX Encoding for compressing integers into a variable number of bits (fewer on average) but the specifics have changed over the years.
Old Ideas
Originally it was "White Wolf" style or "10-Again" - White Wolf tabletop games tend to have a "10-Again" rule where whenever you roll a 10 on D10 you get another roll to add to it. Bytes 0-254 would be normal and then byte 255 meant "add the next byte to it". This is obviously a poor way to do it in general but I was a newb and it actually worked fine for the small-scale numbers I was typically working with in those early days, often in the 0-512 range.
Then it was "simplified unsigned UTF-8". Having the first few bits of the first byte seemed genius - %0=1 byte, %10=2 bytes, %11=3 bytes - but in UTF-8 having each successive byte start with %10 seemed like a waste. I did something like that, but chose to use all 8 bits of my successive bytes after the first one. But my numbers were essentially unsigned, meaning that to get something as simple as "-1" I had to encode the full 32-bit value in 5 bytes (1 byte to tell me there were 4 bytes following, then the 32-bit value in 4 bytes).
I eventually had an epiphany and changed my encoded values to be signed rather than unsigned. Numbers -64..63 were more or less represented with the single byte 0xxxxxxx, -8192..8191 with two bytes 01xxxxxx xxxxxxxx, and so on.
And that's been my favorite scheme until now.
The New Idea
Reading up on Google's Protocol Buffers this morning, I realized that their basic varint encoding was even simpler than my scheme and just as effective: if the high bit of a byte is a 1
then it's a signal to continue and read the next byte as the next 7 bits (plus terminator bit). I feel as though I've come full circle in a sense - instead of 255 marking a continuation, a high bit of 1
marks a continuation. "1-Again" you might call it. 😀
Their approach has two disadvantages in my book: first, bytes are read in low-to-high order, which makes it a little tricky to reassemble since you don't know how many there are going to be. Second, they do a weird Zig-Zag Encoding thing with negative numbers.
I am switching my go-to technique of variable-bit integer encoding (or IntX as I call it) to be this variation on Google's idea:
-
The integer is broken up into 7-bit groups and written out in high-to-low order.
-
In each encoded byte a high bit of
1
indicates that the integer continues with the next byte. -
The encoded integer is considered signed. This means that Bit 6 (the second-highest) of the first byte should be propagated throughout the high bits that aren't read in.
Here's some Rogue code I programmed up to test it:
# New IntX (1-Again)
local bytes = Byte[]
println decode( encode(0x80000000,bytes) )
forEach (n in -300..-280) println decode( encode(n,bytes) )
forEach (n in -257..-239) println decode( encode(n,bytes) )
forEach (n in -129..-126) println decode( encode(n,bytes) )
forEach (n in -17..17) println decode( encode(n,bytes) )
forEach (n in 126..129) println decode( encode(n,bytes) )
forEach (n in 239..257) println decode( encode(n,bytes) )
forEach (n in 280..300) println decode( encode(n,bytes) )
println decode( encode(0x7fffffff,bytes) )
method encode( n:Int32, bytes:Byte[], continuation=0:Int32 )->Byte[]
if (n < 0)
if (((n & !0x7f) != !0x7f) or not (n & 0x40))
# Recurse if the remaining bits are not all 1's or if
# the 7th bit of the current output is a 0, which
# would fail to correctly signal a signed number
# unless we recurse further.
encode( n:>>>:7, bytes, 0x80 )
endIf
elseIf (n > 0x3f)
# "n > 0x3f" means "recurse if the number is more than
# 7 bits OR if the 7th bit is a 1" - the latter
# would otherwise indicate a negative number when
# decoded.
encode( n:>>>:7, bytes, 0x80 )
endIf
bytes.add( continuation | (n & 127) )
return bytes
method decode( bytes:Byte[] )->Int32
local reader = bytes.reader
local byte = reader.read
local result = (byte & 0x7f)
if (byte & 0x40) result |= 0xffffff80
while (byte & 0x80)
byte = reader.read
result = (result :<<: 7) | (byte & 0x7f)
endWhile
bytes.clear # reset for next time
return result
The output is correct!
> roguec Test.rogue --execute
Writing Test.h...
Writing Test.cpp...
g++ Test.cpp -o test && ./test
[Output edited into columns afterward]
-2147483648 -247 2 251
-300 -246 3 252
-299 -245 4 253
-298 -244 5 254
-297 -243 6 255
-296 -242 7 256
-295 -241 8 257
-294 -240 9 280
-293 -239 10 281
-292 -129 11 282
-291 -128 12 283
-290 -127 13 284
-289 -126 14 285
-288 -17 15 286
-287 -16 16 287
-286 -15 17 288
-285 -14 126 289
-284 -13 127 290
-283 -12 128 291
-282 -11 129 292
-281 -10 239 293
-280 -9 240 294
-257 -8 241 295
-256 -7 242 296
-255 -6 243 297
-254 -5 244 298
-253 -4 245 299
-252 -3 246 300
-251 -2 247 2147483647
-250 -1 248
-249 0 249
-248 1 250