# The Best IntX Yet

[Edit: I've changed my mind about all this; see Even Better IntX].

I've always had some go-to technique I call IntX Encoding for compressing integers into a variable number of bits (fewer on average) but the specifics have changed over the years.

##### Old Ideas

Originally it was "White Wolf" style or "10-Again" - White Wolf tabletop games tend to have a "10-Again" rule where whenever you roll a 10 on D10 you get another roll to add to it. Bytes 0-254 would be normal and then byte 255 meant "add the next byte to it". This is obviously a poor way to do it in general but I was a newb and it actually worked fine for the small-scale numbers I was typically working with in those early days, often in the 0-512 range.

Then it was "simplified unsigned UTF-8". Having the first few bits of the first byte seemed genius - %0=1 byte, %10=2 bytes, %11=3 bytes - but in UTF-8 having each successive byte start with %10 seemed like a waste. I did something like that, but chose to use all 8 bits of my successive bytes after the first one. But my numbers were essentially unsigned, meaning that to get something as simple as "-1" I had to encode the full 32-bit value in 5 bytes (1 byte to tell me there were 4 bytes following, then the 32-bit value in 4 bytes).

I eventually had an epiphany and changed my encoded values to be signed rather than unsigned. Numbers -64..63 were more or less represented with the single byte 0xxxxxxx, -8192..8191 with two bytes 01xxxxxx xxxxxxxx, and so on.

And that's been my favorite scheme until now.

###### The New Idea

Reading up on Google's Protocol Buffers this morning, I realized that their basic varint encoding was even simpler than my scheme and just as effective: if the high bit of a byte is a `1` then it's a signal to continue and read the next byte as the next 7 bits (plus terminator bit). I feel as though I've come full circle in a sense - instead of 255 marking a continuation, a high bit of `1` marks a continuation. "1-Again" you might call it. ðŸ˜€

Their approach has two disadvantages in my book: first, bytes are read in low-to-high order, which makes it a little tricky to reassemble since you don't know how many there are going to be. Second, they do a weird Zig-Zag Encoding thing with negative numbers.

I am switching my go-to technique of variable-bit integer encoding (or IntX as I call it) to be this variation on Google's idea:

1. The integer is broken up into 7-bit groups and written out in high-to-low order.

2. In each encoded byte a high bit of `1` indicates that the integer continues with the next byte.

3. The encoded integer is considered signed. This means that Bit 6 (the second-highest) of the first byte should be propagated throughout the high bits that aren't read in.

Here's some Rogue code I programmed up to test it:

``````# New IntX (1-Again)
local bytes = Byte[]
println decode( encode(0x80000000,bytes) )
forEach (n in -300..-280) println decode( encode(n,bytes) )
forEach (n in -257..-239) println decode( encode(n,bytes) )
forEach (n in -129..-126) println decode( encode(n,bytes) )
forEach (n in -17..17) println decode( encode(n,bytes) )
forEach (n in 126..129) println decode( encode(n,bytes) )
forEach (n in 239..257) println decode( encode(n,bytes) )
forEach (n in 280..300) println decode( encode(n,bytes) )
println decode( encode(0x7fffffff,bytes) )

method encode( n:Int32,  bytes:Byte[], continuation=0:Int32 )->Byte[]
if (n < 0)
if (((n & !0x7f) != !0x7f) or not (n & 0x40))
# Recurse if the remaining bits are not all 1's or if
# the 7th bit of the current output is a 0, which
# would fail to correctly signal a signed number
# unless we recurse further.
encode( n:>>>:7, bytes, 0x80 )
endIf
elseIf (n > 0x3f)
# "n > 0x3f" means "recurse if the number is more than
# 7 bits OR if the 7th bit is a 1" - the latter
# would otherwise indicate a negative number when
# decoded.
encode( n:>>>:7, bytes, 0x80 )
endIf
bytes.add( continuation | (n & 127) )
return bytes

method decode( bytes:Byte[] )->Int32
local result = (byte & 0x7f)
if (byte & 0x40) result |= 0xffffff80

while (byte & 0x80)
result = (result :<<: 7) | (byte & 0x7f)
endWhile

bytes.clear  # reset for next time
return result
``````

The output is correct!

``````> roguec Test.rogue --execute
Writing Test.h...
Writing Test.cpp...
g++ Test.cpp -o test && ./test

[Output edited into columns afterward]
-2147483648 -247        2           251
-300        -246        3           252
-299        -245        4           253
-298        -244        5           254
-297        -243        6           255
-296        -242        7           256
-295        -241        8           257
-294        -240        9           280
-293        -239        10          281
-292        -129        11          282
-291        -128        12          283
-290        -127        13          284
-289        -126        14          285
-288        -17         15          286
-287        -16         16          287
-286        -15         17          288
-285        -14         126         289
-284        -13         127         290
-283        -12         128         291
-282        -11         129         292
-281        -10         239         293
-280        -9          240         294
-257        -8          241         295
-256        -7          242         296
-255        -6          243         297
-254        -5          244         298
-253        -4          245         299
-252        -3          246         300
-251        -2          247         2147483647
-250        -1          248
-249        0           249
-248        1           250
``````