Proposal: alternative compression methods #3147
Comments
Do you know yet what the proposed change would be? Also, rather than changing it at the GenericIndexed level, it might be nice to think of it one level up and change how the column itself is serialized/deserialized. |
The referenced PR is my proposed change for this issue. The GenericIndexed level wasn't changed; instead I added a CompressionFactory class that provides the serializer/deserializer for each compression format. The serializer/deserializer for each compression format then decides whether or not to use GenericIndexed for block compression. |
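To make the shape of that concrete, here is a rough sketch (hypothetical names and signatures, not the PR's actual API) of a factory that hands out a per-format reader, so each format can decide for itself whether a block-decompression step is involved:

```java
import java.nio.ByteBuffer;

@FunctionalInterface
interface LongReader {
    long get(ByteBuffer column, int index);
}

final class CompressionFactorySketch {
    // hypothetical format names, not the identifiers used in the PR
    enum Format { BLOCK_LZ4, DIRECT_UNCOMPRESSED }

    static LongReader readerFor(Format format) {
        switch (format) {
            case DIRECT_UNCOMPRESSED:
                // fixed-width values: compute the offset and read straight
                // from the mapped buffer, no block decompression involved
                return (column, index) -> column.getLong(index * Long.BYTES);
            case BLOCK_LZ4:
                // a block-based format would locate the block holding the index,
                // decompress it (e.g. via GenericIndexed), then read the value
                // out of the decompressed block
                return (column, index) -> {
                    throw new UnsupportedOperationException("block path omitted in this sketch");
                };
            default:
                throw new IllegalArgumentException("unknown format: " + format);
        }
    }
}
```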
@acslk do you have any benchmarks on how much your refactoring affects performance for existing formats? |
Here's the same benchmark on the current build
There shouldn't be any difference in LZ4 and LZF performance, since I did not change any steps in the GenericIndexed decompression procedure. For the uncompressed strategy, I calculate the position of the value and read it directly from the byte buffer instead of going through GenericIndexed, so skipping access is faster in the refactored version. |
@acslk just because we don't change the logic doesn't mean it doesn't change the performance. There are a lot of factors such as what can be inlined and addition of virtual calls that can severely impact performance, even for simple refactoring. Based on my cursory assessment, it looks like there's about a 10% slow-down for |
Thanks @acslk for the numbers. I am generally positive on including new compression strategies if they are either strictly better or at least give us a new and useful position on the storage/performance tradeoff space. The new 'uncompressed' looks strictly better than the old one, and the 'delta' from #3148 doesn't seem strictly better than lz4, but it is faster at the cost of using somewhat more storage, which is probably going to be useful for some people. RLE for the __time column might also be nice to add in the future, it'd probably work well there. With queryGranularity minute or hour it could potentially be both smaller & faster than lz4. |
Agree with @xvrl that we should aim for equivalent performance for lz4. |
I've run the lz4 benchmark again with more iterations, and the difference seems quite small now.
current build
refactored
|
@acslk thanks for the updated benchmark, glad to see there doesn't seem to be any meaningful impact. As far as the new compression schemes, I'm wondering how many new schemes we actually need and want to support. More formats will mean more things to support and more potential bugs. Can we put together some use-cases to explain which of those formats are actually useful? |
It might also make sense to deprecate formats, so we don't have to drag them with us forever. We should think of deprecating LZF, for instance. While we'll likely have to preserve backwards compatibility for a while, we'll probably want to steer users away from formats we'd like to remove in the future. |
@acslk from reading your description a bit more, I get the impression we are mixing up compression format, data encoding scheme, and column layout. I feel we should better separate those concepts to make things a bit clearer. Here's the way I see things:
If we separate those concepts, then we could support things such as "delta-encoding + lz4" or "variable-length + lz4", or choose not to enable compression and store values as-is. If we believe there is value in doing so, we should consider separating those concepts now, to avoid too much backwards-compatibility baggage down the road. |
@xvrl so in your terminology, basically "encoding" is something like "as-is", "RLE", "delta", "table", "variableLength" (some are random-access and some are not); and "compression" is something like "lz4" or "lzf"? I think that split makes sense, with some caveats around minimizing copies,
#3148 achieves (1) and (3) although not in a way that allows composing compressions and encodings. For the "skipping" benchmarks in #3148 there are serious performance gains, seemingly mostly from avoiding unnecessary copies. Would it work if "compression" and "encoding" are different things, with capabilities like,
I could see that being useful for stuff like "lz4 + delta" which might end up better than just lz4 by itself for people trying to optimize for storage space. Or not. But it could be interesting to try… |
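As a concrete illustration of the "encode, then block-compress" composition being discussed, here is a minimal sketch with made-up names; java.util.zip.Deflater stands in for LZ4 so the example has no external dependency, and the fixed 4-byte delta layout is only an assumption for the sketch:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

final class DeltaThenCompressSketch {
    // Delta-encode a block of longs against the block's minimum, using a
    // fixed 4-byte offset per value (assumes the deltas fit in an int).
    static byte[] deltaEncode(long[] block) {
        long base = Long.MAX_VALUE;
        for (long v : block) base = Math.min(base, v);
        ByteBuffer out = ByteBuffer.allocate(Long.BYTES + block.length * Integer.BYTES);
        out.putLong(base);
        for (long v : block) out.putInt((int) (v - base));
        return out.array();
    }

    // Hand the encoded bytes to an opaque block compressor. Deflater is used
    // here purely as a dependency-free stand-in for LZ4.
    static byte[] compress(byte[] encoded) {
        Deflater deflater = new Deflater();
        deflater.setInput(encoded);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // e.g. minute-granularity timestamps, a friendly case for delta encoding
        long[] timestamps = new long[1000];
        for (int i = 0; i < timestamps.length; i++) {
            timestamps[i] = 1_460_000_000_000L + i * 60_000L;
        }
        byte[] encoded = deltaEncode(timestamps);
        byte[] compressed = compress(encoded);
        System.out.println("encoded=" + encoded.length + " bytes, compressed=" + compressed.length + " bytes");
    }
}
```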
@xvrl I agree that there are certain distinctions between the concepts you described, but I have some different opinions. The compression format in your description is the CompressionStrategy in the code. It compresses a block of arbitrary bytes into another block of bytes, which can only be used by decompressing it back into the original block. By separating the concepts, the compression procedure would be
The problem I see with this is that these concepts are not entirely unrelated. A compression strategy such as LZ4 is essentially a data encoding scheme, since it belongs to the LZ77 family, which is something like an advanced version of run-length encoding. I believe both data encoding schemes and CompressionStrategy have the same objective of writing objects into bytes that can later be decoded, and they could be considered the same concept. I've run LZ4 compression on variable-length-encoded longs, and the resulting file size isn't much different from normal LZ4 compression. This is because variable-length encoding already eliminates a large amount of repeated zero bytes, so LZ4 compression on top of it is not as effective. By combining these concepts, the process would then be
Now, encodings such as table and delta allow random access in the compressed form, which means they don't need to decompress an entire block to read a value. This makes it possible for the column layout to compress the entire column together without dividing it into blocks. Note that compressing the entire column is not possible for non-random-access compression methods such as LZ4 and RLE, since they need to decompress the entire block. This means only certain combinations are available, and I've named these combinations CompressionFormat in #3148. Basically the compression formats are
Within these, the Uncompressed and LZF formats should be deprecated, as they're strictly worse versions of Uncompressed_new and LZ4. |
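For reference, this is roughly why a delta-encoded column keeps random access in its compressed form; the layout and names here are only illustrative, not necessarily what #3148 uses:

```java
import java.nio.ByteBuffer;

final class DeltaRandomAccessSketch {
    // Layout assumed for this sketch: [ base : 8 bytes ][ delta_0 .. delta_n-1 : 4 bytes each ]
    static final int HEADER_BYTES = Long.BYTES;

    static long read(ByteBuffer column, int index) {
        long base = column.getLong(0);
        // Each value occupies a fixed 4-byte slot, so its position is computable
        // from the index alone and no surrounding data has to be decoded.
        int delta = column.getInt(HEADER_BYTES + index * Integer.BYTES);
        return base + delta;
    }
}
```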
@acslk Did you see if there is any benefit to combining lz4 + delta (or lz4 + rle or lz4 + some other encoding)? Are any of those combinations either smaller or faster than lz4 by itself or the encoding by itself? If so it could make sense to separate compression/encoding. If not then there's no point in implementing compression and encoding as different things, as nobody would want to combine them anyway. |
@acslk, btw, on your point that compressions and encodings are kind of the same thing – yes, this is true :) I think if we chose to separate them into two concepts, it wouldn't be because they are fundamentally different things, but because we feel that users would want to combine 'encodings that operate on whole blocks' with 'encodings that may be a little more nuanced', and it would be a pain for us to implement all possible combinations manually. If we chose to have a single concept, that would be because we feel that it keeps the code simpler and we don't lose much in terms of useful flexibility. |
@gianm thanks for the clarification. Here's a brief idea of the compression sizes. I've used the LZ4 compressor on the entire delta/table file and on the uncompressed file. It's not an accurate representation of what a block-compressed file would look like, but it should show the potential difference. The access time of the combination approach should be a bit slower than lz4, since it would follow the same block decompression procedure, but instead of accessing uncompressed values it would access delta/table encoded values.
|
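To clarify what "accessing delta/table encoded values" after block decompression would mean in the table case, here is a rough sketch of a table-encoded read; the layout is purely illustrative, not the PR's exact format:

```java
import java.nio.ByteBuffer;

final class TableEncodingSketch {
    // Layout assumed for this sketch:
    // [ tableSize : 4 bytes ][ table : 8 bytes per distinct value ][ row ids : 2 bytes per row ]
    static long read(ByteBuffer column, int row) {
        int tableSize = column.getInt(0);
        int rowIdsStart = Integer.BYTES + tableSize * Long.BYTES;
        // look up the row's small fixed-width id, then fetch the actual value
        // from the table of distinct values
        int id = Short.toUnsignedInt(column.getShort(rowIdsStart + row * Short.BYTES));
        return column.getLong(Integer.BYTES + id * Long.BYTES);
    }
}
```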
@gianm I agree the distinction between encoding and compression can be fuzzy. I would say this: encoding relies on the type of data being encoded, and defines how we translate a series of values/objects into bytes. Compression is something that's agnostic to the underlying data and operates on opaque bytes. So I think your suggestion to have decompressors and decoders is in line with what I suggested. We should obviously avoid copying data if there is no need for intermediate decoding/decompression buffers. @acslk If I'm reading your results correctly, based on the numbers in the corresponding PR, it looks like bytesLong delta + lz4 is 50% of the size of bytesLong delta alone, and reqInt delta + lz4 is 10% of the size of reqInt delta? Sounds like there's some significant benefit to combining delta encoding + compression. We should also keep in mind that raw speed alone is not the only factor going into choosing a format. With smaller data sizes we can memory map a lot more data, which means that compression can give you much better overall performance for a given hardware configuration than uncompressed data. The additional CPU time required will typically be orders of magnitude smaller than the time it takes to page in additional data from SSDs. Ultimately the user will have to decide what gives them the best performance / price-point trade-off. |
@xvrl I imagine that people who intend to have all of a particular dataset in memory and want high performance will prefer to use delta/table by itself and not combine it with lz4; for this set of needs, the 10–50% size increase is totally worth it for the speedups. The skipping benchmarks show a speed improvement of multiple orders of magnitude in filtered queries from avoiding lz4 (mostly due to not needing to decompress an entire block to read just one row). Even unfiltered queries still show a 2.5x speed improvement. So I don't think combining lz4 + delta would be useful for this set of needs. People who intend to page a lot from SSDs will probably prefer smaller data sizes and should still be willing to take the CPU hit of lz4. Maybe a combination would be useful here? Do you think you'd want to deploy a combination on your cluster? |
I guess what I'm trying to figure out is, is the combination stuff useful enough to implement; i.e. is it something someone would actually want to deploy. |
@gianm For dimension values we already do "combination stuff", e.g. we do variable-size encoding + compression, so combination things are already in Druid, and we've deployed that with great success. When I added dimension compression last year, it gave us a 50% reduction in data size. Now that we are thinking of expanding our set of encodings, I think it would make sense to de-couple the two, so we don't have to keep re-implementing compressed versions of various encodings whenever we see potential to add compression. |
Ok, it sounds like you're saying you could actually want to deploy a combination, since it does look like delta+lz4 is ~15% smaller than lz4 on as-is encoded data for most of the columns that @acslk posted. |
From the discussion it seems worth a shot to separate encoding and compression, which I'll call the encoding format and the compression strategy. My interpretation is that a compression strategy is unaware of the data type and size, so it must decompress an entire compressed block to access its content. An encoding format is anything that does not fit this description, and it interprets the uncompressed content produced by the compression strategy. This means that if the compression strategy is anything other than none (which indicates no compression strategy should be used), it requires a block layout. If no strategy is used, there is no reason to use a block layout, as it creates overhead by copying uncompressed blocks. So there would be a choice of which encoding format and compression strategy to use, while whether or not to use a block layout is determined by the chosen compression strategy. |
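A minimal sketch of that rule (the enum names are illustrative, not the actual identifiers in the code):

```java
final class LayoutDecisionSketch {
    enum CompressionStrategy { NONE, LZ4, LZF }          // opaque, block-oriented
    enum EncodingFormat { LONGS_AS_IS, DELTA, TABLE }    // type-aware, interprets the bytes
    enum Layout { BLOCK, DIRECT }

    // A block layout is only needed when a real strategy is in play, since that
    // is what forces whole-block decompression; with NONE, a direct layout
    // avoids the extra block copy.
    static Layout layoutFor(CompressionStrategy strategy) {
        return strategy == CompressionStrategy.NONE ? Layout.DIRECT : Layout.BLOCK;
    }
}
```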
Updated #3148 with the new format |
Implemented in #3148 |
Currently Druid compresses segment values using a compression strategy of LZ4, LZF, or uncompressed (mostly LZ4).
This general method of compression does not perform well when reading small amounts of data with many skips, since entire blocks are loaded and decompressed even if only a single value is required. A possible alternative approach is to use a fixed size for each compressed value, so that when accessing an index, the position of the value can be calculated and the compressed value obtained directly from the file-mapped byte buffer, with no block copying or decompression required.
Currently, the fixed-size approach cannot be added as a compression strategy, since CompressionStrategy only has control over how to compress and decompress a given block of bytes. Serializers and suppliers call GenericIndexedWriter and GenericIndexed, which perform block-based compression using the compression strategy. The compression interface should be changed so that a compression strategy can decide whether or not to use block-based compression, allowing other compression methods such as fixed-size compression to be added.
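As a rough sketch of the fixed-size idea (hypothetical helper, not the proposed implementation), the read path reduces to offset arithmetic on the mapped buffer, which is what makes skip-heavy access cheap:

```java
import java.nio.ByteBuffer;

final class FixedSizeAccessSketch {
    // With a fixed size per value, the byte position of any index is plain
    // arithmetic, so a filtered read with many skips touches only the rows
    // it needs instead of pulling in and decompressing whole blocks.
    static long readLong(ByteBuffer mappedColumn, int headerBytes, int index) {
        return mappedColumn.getLong(headerBytes + index * Long.BYTES);
    }

    // e.g. a skip-heavy scan reading one row in every thousand
    static long sumEveryThousandth(ByteBuffer mappedColumn, int headerBytes, int numRows) {
        long sum = 0;
        for (int i = 0; i < numRows; i += 1000) {
            sum += readLong(mappedColumn, headerBytes, i);
        }
        return sum;
    }
}
```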