By Ugorji Nwoke   05 Jun 2013   /blog   geek technology

Update on Binc data interchange format

We describe enhancements to the Binc data interchange format since its public release.

Also, the Binc spec is now stored on github (https://github.com/ugorji/binc). This allows us track revisions and affords better collaboration.

The enhancements were made to improve compactness of the format without losing precision or increasing encode or decode times. They are listed below:

Symbols:
Symbols are constant strings which are repeated a lot in the encoded stream. A symbol is analogous to enumerated types, field names, keys in maps with string key type, or other constant strings in mainstream programming languages.

In Binc, symbols are tagged with 1 or 2 bytes (allowing up to 65536 symbols supported in a stream). The value of the symbol is recorded in the stream only the first time, and thereafter just the tag is recorded.

Including symbol support reduced the encoded stream size by about 25% for our test dataset. This was got by eliminating duplication of field names and map keys in the stream.

Compact variable-length integers:
Previously, integers were variable-length, but the length jumped exponentially (ie integers were represented with 2, 4, 8, 16 … 2^15 bytes. This could lead to a lot of waste.

With the new format, integers are represented in 1, 2, 3, 4, 5 … 2^64-1 bytes. This saves on encoded space while allowing an insanely large precision integer value.

Leading bytes with value 0x00 (if unsigned or positive) or 0xff (if negative values) are pruned.

We considered using varint, but it increased encoding / decoding time significantly, while giving worse encoded size. Better performance and encoding size was got by pre-fixing the length.

Compact Floats:
Previously, floats were encoded with exactly the number of bytes as defined by IEEE 754 (e.g. 128-bit float encoded with 128 bits / 16 bytes.

With the enhancement, floats are now stored with up to the number of bytes defined (e.g. 128-bit float encoded with between 2 to 16 bytes). Trailing bytes with value 0x00 are pruned.

As an example, 64-bit double-precision 17.0 can be stored as [192 49] instead of [192 49 0 0 0 0 0 0].


The enhancements made resulted in significant reduction in encoding size, while still allowing very good encoding and decoding performance. See same benchmank results reproduced below:

Benchmark: 
 Struct recursive Depth:             1
 ApproxDeepSize Of benchmark Struct: 4786
Benchmark One-Pass Run:
    msgpack: len: 1564
       binc: len: 1191
        gob: len: 1972
       json: len: 2538
       bson: len: 3025
..............................................
PASS
Benchmark__Msgpack__Encode    50000      63656 ns/op
Benchmark__Msgpack__Decode    10000     118399 ns/op
Benchmark__Binc_____Encode    50000      65053 ns/op
Benchmark__Binc_____Decode    10000     115781 ns/op
Benchmark__Gob______Encode    10000     146038 ns/op
Benchmark__Gob______Decode     5000     435162 ns/op
Benchmark__Json_____Encode    10000     160450 ns/op
Benchmark__Json_____Decode     5000     314897 ns/op
Benchmark__Bson_____Encode    10000     174070 ns/op
Benchmark__Bson_____Decode    10000     231874 ns/op

Compare to the previous run when:

Benchmark One-Pass Run:
       msgpack: len: 1504
          binc: len: 1508
           gob: len: 1908
          json: len: 2402

This shows that including symbols reduced the encoded size from 1508 to 1191 for an object of size 4786 bytes. For bigger objects, the difference is more significant. See sample results for a bigger object (15K):

Benchmark: 
 Struct recursive Depth:             2
 ApproxDeepSize Of benchmark Struct: 15574
Benchmark One-Pass Run:
    msgpack: len: 5086
       binc: len: 3390
        gob: len: 4531
       json: len: 8250
       bson: len: 9838
..............................................
Benchmark__Msgpack__Encode     5000     325819 ns/op
Benchmark__Msgpack__Decode     5000     397460 ns/op
Benchmark__Binc_____Encode     5000     304940 ns/op
Benchmark__Binc_____Decode     5000     364586 ns/op
Benchmark__Gob______Encode     5000     344352 ns/op
Benchmark__Gob______Decode     5000     667213 ns/op
Benchmark__Json_____Encode     5000     452073 ns/op
Benchmark__Json_____Decode     2000     966823 ns/op
Benchmark__Bson_____Encode     5000     617730 ns/op
Benchmark__Bson_____Decode     2000     863080 ns/op

Binc is an interesting take on schema-less binary codecs with a lot of benefit. Give it a shot.

Tags: geek technology


Subscribe: Technology
© Ugorji Nwoke