We describe enhancements to the Binc data interchange format since its public release.
Also, the Binc spec is now stored on github (https://github.com/ugorji/binc). This allows us track revisions and affords better collaboration.
The enhancements were made to improve compactness of the format without losing precision or increasing encode or decode times. They are listed below:
Symbols are constant strings which are repeated a lot in the encoded stream. A symbol is analogous to enumerated types, field names, keys in maps with string key type, or other constant strings in mainstream programming languages.
In Binc, symbols are tagged with 1 or 2 bytes (allowing up to 65536 symbols supported in a stream). The value of the symbol is recorded in the stream only the first time, and thereafter just the tag is recorded.
Including symbol support reduced the encoded stream size by about 25% for our test dataset. This was got by eliminating duplication of field names and map keys in the stream.
Compact variable-length integers:
Previously, integers were variable-length, but the length jumped exponentially (ie integers were represented with 2, 4, 8, 16 … 2^15 bytes. This could lead to a lot of waste.
With the new format, integers are represented in 1, 2, 3, 4, 5 … 2^64-1 bytes. This saves on encoded space while allowing an insanely large precision integer value.
Leading bytes with value 0x00 (if unsigned or positive) or 0xff (if negative values) are pruned.
We considered using varint, but it increased encoding / decoding time significantly, while giving worse encoded size. Better performance and encoding size was got by pre-fixing the length.
Previously, floats were encoded with exactly the number of bytes as defined by IEEE 754 (e.g. 128-bit float encoded with 128 bits / 16 bytes.
With the enhancement, floats are now stored with up to the number of bytes defined (e.g. 128-bit float encoded with between 2 to 16 bytes). Trailing bytes with value 0x00 are pruned.
As an example, 64-bit double-precision 17.0 can be stored as [192 49] instead of [192 49 0 0 0 0 0 0].
The enhancements made resulted in significant reduction in encoding size, while still allowing very good encoding and decoding performance. See same benchmank results reproduced below:
Benchmark: Struct recursive Depth: 1 ApproxDeepSize Of benchmark Struct: 4786 Benchmark One-Pass Run: msgpack: len: 1564 binc: len: 1191 gob: len: 1972 json: len: 2538 bson: len: 3025 .............................................. PASS Benchmark__Msgpack__Encode 50000 63656 ns/op Benchmark__Msgpack__Decode 10000 118399 ns/op Benchmark__Binc_____Encode 50000 65053 ns/op Benchmark__Binc_____Decode 10000 115781 ns/op Benchmark__Gob______Encode 10000 146038 ns/op Benchmark__Gob______Decode 5000 435162 ns/op Benchmark__Json_____Encode 10000 160450 ns/op Benchmark__Json_____Decode 5000 314897 ns/op Benchmark__Bson_____Encode 10000 174070 ns/op Benchmark__Bson_____Decode 10000 231874 ns/op
Compare to the previous run when:
Benchmark One-Pass Run: msgpack: len: 1504 binc: len: 1508 gob: len: 1908 json: len: 2402
This shows that including symbols reduced the encoded size from 1508 to 1191 for an object of size 4786 bytes. For bigger objects, the difference is more significant. See sample results for a bigger object (15K):
Benchmark: Struct recursive Depth: 2 ApproxDeepSize Of benchmark Struct: 15574 Benchmark One-Pass Run: msgpack: len: 5086 binc: len: 3390 gob: len: 4531 json: len: 8250 bson: len: 9838 .............................................. Benchmark__Msgpack__Encode 5000 325819 ns/op Benchmark__Msgpack__Decode 5000 397460 ns/op Benchmark__Binc_____Encode 5000 304940 ns/op Benchmark__Binc_____Decode 5000 364586 ns/op Benchmark__Gob______Encode 5000 344352 ns/op Benchmark__Gob______Decode 5000 667213 ns/op Benchmark__Json_____Encode 5000 452073 ns/op Benchmark__Json_____Decode 2000 966823 ns/op Benchmark__Bson_____Encode 5000 617730 ns/op Benchmark__Bson_____Decode 2000 863080 ns/op
Binc is an interesting take on schema-less binary codecs with a lot of benefit. Give it a shot.