Proposal: alternative compression methods #3147

Closed

acslk opened this issue Jun 14, 2016 · 24 comments

@acslk
Contributor

acslk commented Jun 14, 2016

Currently, Druid compresses segment values using one of the compression strategies LZ4, LZF, or uncompressed (mostly LZ4).

  • For serializing values, all of these strategies use the same general method: values are written into a buffer of a specified size, currently 0x10000 bytes. When the buffer is full, the strategy-specific method compresses the buffer into bytes, the bytes and their start position are written to the output, and the buffer is cleared.
  • For reading the value at a specified index, the general method is to keep a buffer of the same size and calculate the block number (index / buffer size). If that block is different from the one currently in the buffer, it is loaded and decompressed using the strategy-specific method. The value is then obtained from the loaded buffer.

This general method of compression does not perform well when reading small amounts of data with many skips, since an entire block is loaded and decompressed even if only a single value is required. A possible alternative is to use a fixed length for each compressed value, so that when accessing an index, the value's position can be calculated and the compressed value read directly from the file-mapped byte buffer, with no block copying or decompression required.
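A minimal sketch of the two read paths described above (hypothetical names, assuming fixed-width 64-bit values; not Druid's actual classes):

```java
import java.nio.ByteBuffer;

class LongReadPaths
{
  // Minimal view of a block-compressed column: the reader keeps one
  // decompressed block in memory at a time.
  interface BlockedLongColumn
  {
    int valuesPerBlock();
    int currentBlock();
    ByteBuffer currentBlockBuffer();        // decompressed values of the current block
    void loadAndDecompressBlock(int block);
  }

  // Block-based read: compute the block number, decompress the whole block if
  // it is not the one currently loaded, then read the value out of it.
  static long readBlockBased(BlockedLongColumn column, int index)
  {
    final int block = index / column.valuesPerBlock();
    final int offsetInBlock = index % column.valuesPerBlock();
    if (block != column.currentBlock()) {
      column.loadAndDecompressBlock(block); // decompresses a full block even for one value
    }
    return column.currentBlockBuffer().getLong(offsetInBlock * Long.BYTES);
  }

  // Fixed-size read: every value occupies the same number of bytes, so its
  // position can be computed and read straight from the mapped buffer,
  // with no block copy or decompression.
  static long readFixedSize(ByteBuffer mappedColumn, int index)
  {
    return mappedColumn.getLong(index * Long.BYTES);
  }
}
```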

Currently, the fixed-size approach cannot be added as a compression strategy, since CompressionStrategy only controls how to compress and decompress a given block of bytes. Serializers and suppliers call GenericIndexedWriter and GenericIndexed, which perform block-based compression using the compression strategy. The compression interface should be changed so that a compression strategy can decide whether or not to use block-based compression, allowing other compression methods such as fixed-size compression to be added.

@cheddar
Contributor

cheddar commented Jun 14, 2016

Do you know yet what the proposed change would be?

Also, rather than changing it at the GenericIndexed level, it might be nice to think of it one level up and change how the column itself is serialized/deserialized.

@acslk
Contributor Author

acslk commented Jun 14, 2016

The referenced PR is my proposed change for this issue. The GenericIndexed level wasn't changed; instead I added a CompressionFactory class that provides the serializer/deserializer for each compression format. The serializer/deserializer for each compression format then decides whether or not to use GenericIndexed for block compression.

@xvrl xvrl added the Discuss label Jun 15, 2016
@xvrl
Member

xvrl commented Jun 15, 2016

@acslk do you have any benchmarks on how much your refactoring affects performance for existing formats?

@acslk
Contributor Author

acslk commented Jun 15, 2016

Here's the same benchmark on the current build

CompressionFormatBenchmark.readContinuous     reqInt           lz4  avgt   10  21.461 ±  2.102  ms/op
CompressionFormatBenchmark.readContinuous     reqInt  uncompressed  avgt   10  20.045 ±  1.080  ms/op
CompressionFormatBenchmark.readContinuous     reqInt           lzf  avgt   10  30.746 ±  2.555  ms/op
CompressionFormatBenchmark.readContinuous  bytesLong           lz4  avgt   10  36.415 ±  5.561  ms/op
CompressionFormatBenchmark.readContinuous  bytesLong  uncompressed  avgt   10  21.796 ±  2.914  ms/op
CompressionFormatBenchmark.readContinuous  bytesLong           lzf  avgt   10  63.403 ±  3.479  ms/op
CompressionFormatBenchmark.readContinuous  timestamp           lz4  avgt   10  57.158 ±  6.190  ms/op
CompressionFormatBenchmark.readContinuous  timestamp  uncompressed  avgt   10  21.833 ±  3.410  ms/op
CompressionFormatBenchmark.readContinuous  timestamp           lzf  avgt   10  71.457 ±  5.528  ms/op
CompressionFormatBenchmark.readSkipping       reqInt           lz4  avgt   10   4.173 ±  0.723  ms/op
CompressionFormatBenchmark.readSkipping       reqInt  uncompressed  avgt   10   4.751 ± 10.018  ms/op
CompressionFormatBenchmark.readSkipping       reqInt           lzf  avgt   10   8.560 ±  1.818  ms/op
CompressionFormatBenchmark.readSkipping    bytesLong           lz4  avgt   10  19.305 ±  3.020  ms/op
CompressionFormatBenchmark.readSkipping    bytesLong  uncompressed  avgt   10   3.304 ±  2.450  ms/op
CompressionFormatBenchmark.readSkipping    bytesLong           lzf  avgt   10  42.172 ±  3.632  ms/op
CompressionFormatBenchmark.readSkipping    timestamp           lz4  avgt   10  38.374 ±  4.688  ms/op
CompressionFormatBenchmark.readSkipping    timestamp  uncompressed  avgt   10  10.264 ± 35.136  ms/op
CompressionFormatBenchmark.readSkipping    timestamp           lzf  avgt   10  55.982 ±  8.981  ms/op

There shouldn't be any difference in LZ4 and LZF performance, since I did not change any steps in the GenericIndexed decompression procedure. For the uncompressed strategy, I calculate the position of the value and read it directly from the byte buffer instead of going through GenericIndexed, so skipping access is faster in the refactored version.

@xvrl
Member

xvrl commented Jun 15, 2016

@acslk just because we don't change the logic doesn't mean it doesn't change the performance. There are a lot of factors, such as what can be inlined and the addition of virtual calls, that can severely impact performance, even for simple refactoring.

Based on my cursory assessment, it looks like there's about a 10% slow-down for readContinuous bytesLong lz4 with your refactoring. The error bounds are pretty wide, so we should try to narrow down the confidence interval (hint: benchmark on EC2, not a laptop + more benchmark runs)

@gianm
Contributor

gianm commented Jun 15, 2016

Thanks @acslk for the numbers.

I am generally positive on including new compression strategies if they are either strictly better or at least give us a new and useful position on the storage/performance tradeoff space.

The new 'uncompressed' looks strictly better than the old one, and the 'delta' from #3148 doesn't seem strictly better than lz4, but it is faster at the cost of using somewhat more storage, which is probably going to be useful for some people.

RLE for the __time column might also be nice to add in the future; it'd probably work well there. With queryGranularity set to minute or hour, it could potentially be both smaller and faster than lz4.

@gianm
Contributor

gianm commented Jun 15, 2016

agree with @xvrl that we should aim for equivalent performance for lz4.

@acslk
Contributor Author

acslk commented Jun 15, 2016

I've run the lz4 benchmark again with more iterations; the difference seems quite small now.

current build

CompressionFormatBenchmark.readContinuous     reqInt       lz4  avgt  100  20.062 ± 0.323  ms/op
CompressionFormatBenchmark.readContinuous  bytesLong       lz4  avgt  100  33.735 ± 0.273  ms/op
CompressionFormatBenchmark.readContinuous  timestamp       lz4  avgt  100  53.092 ± 0.713  ms/op
CompressionFormatBenchmark.readSkipping       reqInt       lz4  avgt  100   4.543 ± 1.195  ms/op
CompressionFormatBenchmark.readSkipping    bytesLong       lz4  avgt  100  17.544 ± 0.181  ms/op
CompressionFormatBenchmark.readSkipping    timestamp       lz4  avgt  100  36.180 ± 0.572  ms/op

refactored

CompressionFormatBenchmark.readContinuous     reqInt       lz4  avgt  100  19.989 ± 0.226  ms/op
CompressionFormatBenchmark.readContinuous  bytesLong       lz4  avgt  100  33.881 ± 0.325  ms/op
CompressionFormatBenchmark.readContinuous  timestamp       lz4  avgt  100  53.332 ± 0.657  ms/op
CompressionFormatBenchmark.readSkipping       reqInt       lz4  avgt  100   4.263 ± 0.499  ms/op
CompressionFormatBenchmark.readSkipping    bytesLong       lz4  avgt  100  17.529 ± 0.167  ms/op
CompressionFormatBenchmark.readSkipping    timestamp       lz4  avgt  100  36.342 ± 0.594  ms/op

@fjy fjy added this to the 0.9.2 milestone Jun 15, 2016
@xvrl
Member

xvrl commented Jun 16, 2016

@acslk thanks for the updated benchmark; glad to see there doesn't seem to be any meaningful impact. As for the new compression schemes, I'm wondering how many we actually need and want to support. More formats will mean more things to support and more potential bugs. Can we put together some use cases to explain which of those formats are actually useful?

@xvrl
Member

xvrl commented Jun 16, 2016

It might also make sense to deprecate formats, so we don't have to drag them with us forever. We should think of deprecating LZF, for instance. While we'll likely have to preserve backwards compatibility for a while, we'll probably want to steer users away from formats we'd like to remove in the future.

@xvrl
Member

xvrl commented Jun 16, 2016

@acslk from reading your description a bit more, I get the impression we are mixing up compression format, data encoding scheme, and column layout. I feel we should better separate those concepts to make things a bit clearer.

Here's the way I see things:

  • compression format is just a way to compress a block of arbitrary bytes to shrink storage requirements and allow more data to be memory mapped. Currently we support three of those compression formats: lzf, lz4, and uncompressed. Just to clarify, "uncompressed" never meant the data was laid out as a continuous set of floats / longs; it just meant each block of bytes was stored as is. I don't think many people should use it in practice; it was just added for completeness as a 'reference' compression format.
  • data encoding scheme on the other hand refers to the byte representation we choose for the column values. This can be delta-encoding, variable-length encoding, or other fancy ways of reducing the number of bytes we store for each value.
  • column layout, finally, is how we arrange all those values on disk / in memory. If we want to enable compression, we require some form of block layout. How values are represented in each block, however, is up to the encoding scheme. If we don't require a block format, we can just store values as is, sequentially. All we need at the end of the day is a way to translate a column index into either a direct position in the column, or a block number and a position within that block.

If we separate those concepts, then we could support things such as "delta-encoding + lz4" or "variable-length + lz4", or choose not to enable compression and store values as is.

If we believe there is value in doing so, we should consider separating those concepts now, to avoid too much backwards compatibility baggage down the road.
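A rough sketch of how that separation could look as interfaces (hypothetical names, not existing Druid classes), assuming long-typed values:

```java
import java.nio.ByteBuffer;

// Compression: agnostic to the value type, it only shrinks opaque blocks of bytes.
interface BlockCompression
{
  void compress(ByteBuffer uncompressedBlock, ByteBuffer out);
  void decompress(ByteBuffer compressedBlock, ByteBuffer out);
}

// Encoding: knows the value type and defines the byte representation
// (as-is, delta, variable-length, ...) within an uncompressed buffer.
interface LongEncoding
{
  void write(ByteBuffer out, long value);
  long read(ByteBuffer decodedValues, int offsetInBlock);
}

// Layout: translates a column index into either a direct position, or a
// block number plus an offset within that block.
interface ColumnLayout
{
  int blockNumber(int index);
  int offsetInBlock(int index);
}
```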

@gianm
Contributor

gianm commented Jun 16, 2016

@xvrl so in your terminology, basically "encoding" is something like "as-is", "RLE", "delta", "table", "variableLength" (some are random-access and some are not); and "compression" is something like "lz4" or "lzf"?

I think that split makes sense, with some caveats around minimizing copies:

  1. Encodings should never be forced to materialize decoded values. Some of them support random access (like "as-is", "table", "delta") and should be doable as reads directly out of a buffer. Some don't support random access (RLE/variableLength) but should still be doable as reads directly out of a buffer, ideally seeking to some checkpoint and then reading from there.
  2. The implementation of non-random-access encodings should be able to optimize for in-order access by remembering things like current-buffer-position and RLE state.
  3. It's important that "none" / "uncompressed" compression does not incur performance overhead; currently it does, since it copies into a decompression buffer anyway.

#3148 achieves (1) and (3) although not in a way that allows composing compressions and encodings. For the "skipping" benchmarks in #3148 there are serious performance gains, seemingly mostly from avoiding unnecessary copies.

Would it work if "compression" and "encoding" are different things, with capabilities like,

  • Decompressors should be passed a decompression buffer and are just responsible for decompressing blocks. The existing CompressedObjectStrategy should work fine for this, except "uncompressed" probably needs to be special-cased in the serdes (so they don't create a decompression buffer or use a CompressedObjectStrategy for it).
  • Decodings should be passed an already-decompressed buffer (either the original mmapped column if no compression is used, or a decompression buffer) and a row offset into the buffer that they should decode. Fixed-length decodings would use random access (read from sizePerEntry * offset) and variable-length decodings would read from the beginning until they find the target row offset. They should be able to optimize for in-order access by remembering things like the current buffer position and RLE state, and re-using that across multiple "get" calls if the buffer is the same (see the sketch below).
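Here's a minimal sketch of what composing a decompressor with a decoding could look like, including the in-order optimization of caching the last decompressed block; all names are hypothetical and assume long-typed values:

```java
import java.nio.ByteBuffer;

class ComposedLongReader
{
  interface Decompressor
  {
    // Decompresses the given block of the column into "out".
    void decompressBlock(ByteBuffer compressedColumn, int block, ByteBuffer out);
  }

  interface LongDecoding
  {
    // Reads the value at a row offset from an already-decompressed buffer.
    long read(ByteBuffer decompressedBlock, int offsetInBlock);
  }

  private final Decompressor decompressor;   // e.g. lz4; a pass-through for "uncompressed"
  private final LongDecoding decoding;       // e.g. as-is, delta, table
  private final int valuesPerBlock;
  private final ByteBuffer decompressionBuffer;
  private int currentBlock = -1;

  ComposedLongReader(Decompressor decompressor, LongDecoding decoding,
                     int valuesPerBlock, int blockSizeBytes)
  {
    this.decompressor = decompressor;
    this.decoding = decoding;
    this.valuesPerBlock = valuesPerBlock;
    this.decompressionBuffer = ByteBuffer.allocate(blockSizeBytes);
  }

  long get(ByteBuffer compressedColumn, int index)
  {
    final int block = index / valuesPerBlock;
    // In-order reads hit the same block repeatedly, so the block is only
    // decompressed when the requested index crosses a block boundary.
    if (block != currentBlock) {
      decompressionBuffer.clear();
      decompressor.decompressBlock(compressedColumn, block, decompressionBuffer);
      currentBlock = block;
    }
    return decoding.read(decompressionBuffer, index % valuesPerBlock);
  }
}
```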

I could see that being useful for stuff like "lz4 + delta" which might end up better than just lz4 by itself for people trying to optimize for storage space. Or not. But it could be interesting to try…

@acslk
Contributor Author

acslk commented Jun 16, 2016

@xvrl I agree that there are certain distinctions between the concepts you described, but I have some differing opinions.

The compression format in your description is the CompressionStrategy in the code. It compresses a block of arbitrary bytes into another block of bytes, which can only be used by decompressing it back into the original block.

By separating the concepts, the compression procedure should be

  • choose a column layout to decide on how values are arranged
  • choose a data encoding scheme to have a byte representation of each block of values
  • choose a CompressionStrategy to compress the byte representation of each block

The problem I see with this is that these concepts are not entirely unrelated. A compression strategy such as LZ4 is essentially a data encoding scheme: it belongs to the LZ77 family, which is somewhat like an advanced version of run-length encoding. I believe both the data encoding scheme and the CompressionStrategy have the same objective of writing objects into bytes that can later be decoded, and they should be considered the same concept. I've tried LZ4 compression on variable-length-encoded longs, and the resulting file size isn't much different from plain LZ4 compression. This is because variable-length encoding already eliminates a large amount of repeated zeros, so LZ4 compression is not as effective on top of it.

By combining these concepts, the process would then be

  • choose a column layout to decide on how values are arranged
  • write the bytes of each block with a data encoding scheme or compressionStrategy

Now, encodings such as table and delta allow random access in the encoded form, which means they don't need to decompress an entire block to read a value. This makes it possible for the column layout to compress the entire column together without dividing it into blocks. Note that compressing the entire column is not possible for non-random-access compression methods such as LZ4 and RLE, since they need to decompress the entire block. This means only certain combinations are available, and I've named these combinations CompressionFormat in #3148.
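To illustrate why an encoding like delta stays random-access in its encoded form, here is a toy sketch (hypothetical class, using a fixed 4 bytes per delta rather than any actual Druid format):

```java
import java.nio.ByteBuffer;

class SimpleDeltaLongColumn
{
  private final long base;          // minimum value of the column
  private final ByteBuffer deltas;  // one 4-byte delta per row, in row order

  SimpleDeltaLongColumn(long base, ByteBuffer deltas)
  {
    this.base = base;
    this.deltas = deltas;
  }

  // The position of any encoded value can be computed from its index, so no
  // block has to be decoded in full to read a single row.
  long get(int index)
  {
    return base + deltas.getInt(index * Integer.BYTES);
  }
}
```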

Basically the compression formats are

  • LZ4/LZF/Uncompressed = block layout + LZ4/LZF/Uncompressed
  • Table/Delta = entire layout + Table/Delta
  • Uncompressed_new = entire layout + Uncompressed

Within these, the Uncompressed and LZF formats should be deprecated, as they're worse versions of Uncompressed_new and LZ4.

@gianm
Contributor

gianm commented Jun 16, 2016

@acslk Did you check whether there is any benefit to combining lz4 + delta (or lz4 + rle, or lz4 + some other encoding)? Are any of those combinations either smaller or faster than lz4 by itself or the encoding by itself? If so, it could make sense to separate compression/encoding. If not, then there's no point in implementing compression and encoding as different things, as nobody would want to combine them anyway.

@gianm
Contributor

gianm commented Jun 17, 2016

@acslk, btw, on your point that compressions and encodings are kind of the same thing – yes, this is true :)

I think if we chose to separate them into two concepts, it wouldn't be because they are fundamentally different things, but because we feel that users would want to combine 'encodings that operate on whole blocks' with 'encodings that may be a little more nuanced', and it would be a pain for us to implement all possible combinations manually.

If we chose to have a single concept, that would be because we feel that it keeps the code simpler and we don't lose much in terms of useful flexibility.

@acslk
Contributor Author

acslk commented Jun 17, 2016

@gianm thanks for the clarification.

Here's a rough idea of the compressed sizes. I used the LZ4 compressor on the entire delta/table file and on the uncompressed file. It's not an accurate representation of what a block-compressed file would look like, but it should show the potential difference. The access time of the combination approach should be a bit slower than lz4, since it would follow the same block-decompression procedure, but instead of accessing uncompressed values it would access delta/table-encoded values.

10385854 Jun 16 16:02 bytesLong.(lz4)
9021088 Jun 16 15:58 bytesLong.delta(lz4)
353061 Jun 16 16:02 reqInt.(lz4)
164702 Jun 16 15:58 reqInt.delta(lz4)
19226288 Jun 16 16:02 timestamp.(lz4)
19493145 Jun 16 15:58 timestamp.delta(lz4)
62300 Jun 16 17:09 added.(lz4)
52194 Jun 16 17:10 added.delta(lz4)
72817 Jun 16 17:09 delta.(lz4)
63566 Jun 16 17:10 delta.delta(lz4)

@xvrl
Member

xvrl commented Jun 17, 2016

@gianm I agree the distinction between encoding and compression can be fuzzy. I would say this: encoding relies on the type of data being encoded and defines how we translate a series of values/objects into bytes. Compression is agnostic to the underlying data and operates on opaque bytes. So I think your suggestion to have decompressors and decoders is in line with what I suggested. We should obviously avoid copying data if there is no need for intermediate decoding / decompression buffers.

@acslk If I'm reading your results correctly, based on the numbers in the corresponding PR, it looks like bytesLong delta + lz4 is 50% of the size of bytesLong delta alone, and reqInt delta + lz4 is 10% of the size of reqInt delta? Sounds like there's some significant benefit to combining delta encoding + compression.

We should also keep in mind that raw speed alone is not the only factor going into choosing a format. With smaller data sizes we can memory map a lot more data, which means that compression can give you much better overall performance for a given hardware configuration than uncompressed data. The additional CPU time required will typically be orders of magnitude smaller than the time it takes to page in additional data from SSDs. Ultimately the user will have to decide what gives them the best performance / price-point trade-off.

@gianm
Contributor

gianm commented Jun 17, 2016

@xvrl I imagine that people who intend to have all of a particular dataset in memory and want high performance will prefer to use delta/table by itself and not combine it with lz4; for this set of needs, the 10–50% size increase is totally worth it for the speedups. The skipping benchmarks show a multiple-orders-of-magnitude speed improvement in filtered queries from avoiding lz4 (mostly due to not needing to decompress an entire block to read just one row). Even unfiltered queries still show a 2.5x speed improvement. So I don't think combining lz4 + delta would be useful for this set of needs.

People who intend to page a lot from SSDs will probably prefer smaller data sizes and should still be willing to take the CPU hit of lz4. Maybe a combination would be useful here? Do you think you'd want to deploy a combination on your cluster?

@gianm
Contributor

gianm commented Jun 17, 2016

I guess what I'm trying to figure out is: is the combination stuff useful enough to implement, i.e., is it something someone would actually want to deploy?

@xvrl
Member

xvrl commented Jun 17, 2016

@gianm For dimension values we already do "combination stuff", e.g. variable-size encoding + compression, so combinations are already in Druid, and we've deployed that with great success. When I added dimension compression last year, it gave us a 50% reduction in data size.

Now that we are thinking of expanding our set of encodings, I think it would make sense to de-couple the two, so we don't have to keep re-implementing compressed versions of various encodings whenever we see a potential to add compression.

@gianm
Contributor

gianm commented Jun 17, 2016

Ok, it sounds like you're saying you could actually want to deploy a combination, since it does look like delta+lz4 is ~15%ish smaller than lz4 on as-is encoded data for most of the columns that @acslk posted.

@acslk
Contributor Author

acslk commented Jun 20, 2016

From the discussion, it seems worth a shot to separate encoding and compression, which I'll call the encoding format and the compression strategy. My interpretation is that the compression strategy should be unaware of the data type and size, so it must decompress an entire compressed block to access its content. The encoding format is anything that does not fit this description, and it interprets the uncompressed content produced by the compression strategy.

This means that if the compression strategy is anything other than none (which indicates no compression strategy should be used), then it requires the block layout. If no strategy is used, there is no reason to use the block layout, as it creates overhead by copying uncompressed blocks. So there would be a choice of which encoding format and compression strategy to use, while whether or not to use the block layout is determined by the compression strategy chosen.
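As a sketch of that rule (hypothetical names, not actual Druid code):

```java
// "NONE" here stands for the "none" strategy described above.
enum Strategy { LZ4, LZF, NONE }

class LayoutChooser
{
  // The block layout is only needed when an actual compression strategy is
  // chosen; with no compression, values are read directly from the mapped
  // buffer, avoiding the block-copy overhead.
  static boolean useBlockLayout(Strategy strategy)
  {
    return strategy != Strategy.NONE;
  }
}
```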

@acslk
Contributor Author

acslk commented Jun 23, 2016

Updated #3148 with the new format

@jon-wei
Contributor

jon-wei commented Sep 2, 2016

Implemented in #3148

@jon-wei jon-wei closed this as completed Sep 2, 2016
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this issue Feb 25, 2022