Showing 11 changed files with 67,061 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{ | ||
"go.inferGopath": false | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,71 @@ | ||
# intcomp | ||
Fast integer compression library | ||
# Integer Compression | ||
|
||
This library provides high performance (GB/s) compression and decompression of integers (int32/uint32/int64/uint64). | ||
|
||
A good compression factor can be achieved when, on average, the difference between two consecutive | |
input values remains small and can therefore be encoded with fewer bits. | |
|
||
Common use cases: | ||
|
||
- Timestamps | ||
- Offsets | ||
- Counter based identifiers | ||
|
||
The encoding schemes used here are based on [Dr. Daniel Lemire's research](https://lemire.me/blog/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/). | |
|
||
## Encoding Logic | ||
|
||
Data is encoded in blocks whose size is a multiple of 128 32-bit inputs or 256 64-bit inputs, in the | |
following manner: | |
|
||
1. Difference of consecutive inputs is computed (differential coding) | ||
2. ZigZag encoding is applied if a block contains at least one negative delta value | ||
3. The result is bit packed into the optimal number of bits for the block | ||
|
||
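The three steps above can be sketched in a few lines of Go. This is a minimal illustration, not the library's implementation: the helper names `deltas`, `zigzag` and `blockBitWidth` are hypothetical, the first delta is taken relative to 0 as a simplifying assumption, and the final bit packing itself is left out (only the bit width it would use is computed).

```go
package main

import (
	"fmt"
	"math/bits"
)

// deltas returns the difference between each value and its predecessor.
// Assumption for this sketch: the first delta is relative to 0.
func deltas(in []int32) []int32 {
	out := make([]int32, len(in))
	prev := int32(0)
	for i, v := range in {
		out[i] = v - prev
		prev = v
	}
	return out
}

// zigzag maps signed values to unsigned ones so that small magnitudes
// stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
func zigzag(v int32) uint32 {
	return uint32(v<<1) ^ uint32(v>>31)
}

// blockBitWidth reports how many bits per value a block of deltas needs,
// applying ZigZag only when at least one delta is negative.
func blockBitWidth(d []int32) (width int, zigzagged bool) {
	for _, v := range d {
		if v < 0 {
			zigzagged = true
			break
		}
	}
	var or uint32 // OR of all values: its highest set bit gives the width
	for _, v := range d {
		if zigzagged {
			or |= zigzag(v)
		} else {
			or |= uint32(v)
		}
	}
	return bits.Len32(or), zigzagged
}

func main() {
	d := deltas([]int32{100, 103, 101, 108})
	width, zz := blockBitWidth(d)
	fmt.Println(d, width, zz) // prints "[100 3 -2 7] 8 true"
}
```

In this example every value fits in 8 bits instead of 32, which is where the compression comes from.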
Any remaining input that does not fill a 128x32-bit or 256x64-bit block is encoded | |
in an additional block using Variable Byte encoding (with delta). | |
|
||
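As a rough illustration of Variable Byte encoding with delta, the sketch below writes each delta in 7-bit groups, using the high bit of a byte to signal that more bytes follow. The byte layout, the use of ZigZag for signed deltas, and the names `appendUvarint`, `encodeTail` and `decodeTail` are assumptions made for this example; the library's actual format may differ.

```go
package main

import "fmt"

// appendUvarint appends v using 7 data bits per byte; a set high bit
// means another byte follows (an assumed, LEB128-style convention).
func appendUvarint(dst []byte, v uint32) []byte {
	for v >= 0x80 {
		dst = append(dst, byte(v)|0x80)
		v >>= 7
	}
	return append(dst, byte(v))
}

// encodeTail delta-encodes in and stores each ZigZag-mapped delta as a
// variable-length integer. The first delta is relative to 0.
func encodeTail(in []int32) []byte {
	var out []byte
	prev := int32(0)
	for _, v := range in {
		d := v - prev
		prev = v
		out = appendUvarint(out, uint32(d<<1)^uint32(d>>31))
	}
	return out
}

// decodeTail reverses encodeTail. It assumes buf is well formed and
// does not guard against truncated input.
func decodeTail(buf []byte) []int32 {
	var out []int32
	prev := int32(0)
	for i := 0; i < len(buf); {
		var u uint32
		var shift uint
		for {
			b := buf[i]
			i++
			u |= uint32(b&0x7f) << shift
			if b < 0x80 {
				break
			}
			shift += 7
		}
		prev += int32(u>>1) ^ -int32(u&1) // undo ZigZag, then undo delta
		out = append(out, prev)
	}
	return out
}

func main() {
	buf := encodeTail([]int32{1000, 1001, 999})
	fmt.Printf("% x\n", buf)     // prints "d0 0f 02 03"
	fmt.Println(decodeTail(buf)) // prints "[1000 1001 999]"
}
```

Three 32-bit values take 4 bytes here instead of 12, without needing a full block of inputs.
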
## Append to compressed arrays | ||
|
||
In stream processing systems, data is usually received in chunks. | |
Compressing and aggregating small chunks can be inefficient and impractical. | |
|
||
This API provides a convenient way to handle such inputs: | ||
When adding data to a compressed buffer, if the last block is a small block, encoded with Variable Byte, | ||
it will be rewritten in order to provide better compression using bit packing. | ||
|
||
## Encoding of timestamps with nanosecond resolution | ||
|
||
Timestamps with nanosecond resolution sometimes have a lower actual internal resolution (e.g. microsecond). | |
To compress that type of data better, the encoding algorithm for int64 has a specific | |
optimization that provides a better compression factor in such cases. | |
|
||
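The text above does not say how this optimization works. One way to see why coarser resolution helps: nanosecond deltas produced by a microsecond clock are all multiples of 1000, so they share low-order zero bits that need not be stored for every value. The sketch below measures that effect; the function `packedWidths` and the trailing-zero approach are assumptions for illustration, not a description of the library's algorithm.

```go
package main

import (
	"fmt"
	"math/bits"
)

// packedWidths returns the bits per value needed to store the deltas as
// is (plain), and after shifting out the low-order zero bits that all
// deltas have in common (shifted). Hypothetical helper for illustration.
func packedWidths(deltas []uint64) (plain, shifted int) {
	var or uint64
	for _, d := range deltas {
		or |= d
	}
	if or == 0 {
		return 0, 0
	}
	plain = bits.Len64(or)
	shifted = plain - bits.TrailingZeros64(or)
	return plain, shifted
}

func main() {
	// Nanosecond deltas from a clock that only ticks every microsecond.
	deltas := []uint64{1000, 3000, 2000, 5000}
	plain, shifted := packedWidths(deltas)
	fmt.Println(plain, shifted) // prints "13 10"
}
```
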
## Usage | ||
|
||
```go | ||
input := []int32{1, 2, 3} | ||
|
||
// compress | ||
compressed := intcomp.CompressInt32(input, nil) | ||
// compress more data (append) | ||
compressed = intcomp.CompressInt32([]int32{4, 5, 6}, compressed) | ||
|
||
// uncompress | ||
data := intcomp.UncompressInt32(compressed, nil) | ||
// data: [1, 2, 3, 4, 5, 6] | ||
``` | ||
|
||
## Performance | ||
|
||
Benchmarks for the bit packing compression/decompression (MacBook Pro M1). | |
The results vary depending on the number of bits used to encode integers. | |
|
||
### Compression | ||
|
||
- 32bits: between 2.5 and 4.3 GB/s | ||
- 64bits: between 3.1 and 6.4 GB/s | ||
|
||
### Decompression | ||
|
||
- 32bits: between 3.2 and 5.3 GB/s | ||
- 64bits: between 7.3 and 19.1 GB/s |