Add support for LZ4 as a compression format #81
Conversation
Add lz4 source tarball for built-in version. Tweak Event example to make it easier to compare compression algorithms. Tweak build files for Event example to include TreePlayer (for TTreePerfStats). Until we're sure we need it, no special path for Win32. Add -fPIC flag to compile lz4
@zzxuanyuan - can you post the above info in a tabular format? Also, can you repeat the tests using the file @pcanal posted in the other ticket? Thanks! |
@pcanal - I think this is ready to go in. Talking with Zhe, I think the reason the CMS file shows a smaller increase in decompression speed is mostly due to the more complex objects in CMS's files (which makes deserialization the bottleneck, not decompression). |
Could we do a quick set of igprof runs to confirm this hypothesis? |
@pcanal - I haven't shown Zhe igprof yet. I normally use it from the CMS context (with the CMS ROOT, CMS compilers, etc). Is there a good documentation page for using it with ROOT directly? |
There should be no difference with the CMS case ... The only thing to remember is to run it on the executable spelled root.exe (rather than the executable spelled 'root'). |
So the igprof report says that lz4 on the CMS file is 20% faster than zlib. It is odd that this is not reflected in the data, as it is supposed to be using "TTreePerfStats information, which gives us access to the compression-time-only rates." ... What am I missing? (i.e. do we have a bug in TTreePerfStats, or is the number above not the (de)compression-time-only rate?) Also, LZ4HC and zlib-1 are expected to have similar file sizes, so we could also add zlib-1 to the table? Thanks. |
@pcanal I wrote a testing program on my own. For compression, it reads the root file given in your ticket, compresses all the trees in it, and writes them out to another file. For decompression, it simply iterates over all entries in the compressed file. I used TStopwatch to measure the performance; I did not use TTreePerfStats in my program. |
|
I will rerun the tests and see how performance looks. |
I have some trouble using TTreePerfStats to accumulate perf stats over all the trees in the CMS file. The following is the decompression-only time from the tree "Events" (it constitutes 1.8 GB of the 1.9 GB file). I also attached zlib-1 (I assume it stands for zlib with level 1).
@pcanal regarding your previous comment: is there a way to compare (de)compression time only from the igprof result? I did not quite understand why lz4 is 20% faster than zlib. |
@pcanal @bbockelm I am still struggling to measure compression time. I can't find a good way to copy the tree before I attach it to TTreePerfStats. Here is a piece of code extracted from my program: TFile* rfile = TFile::Open("CMS_7250E9A5-682D-DF11-8701-002618943934.root");
TTree* rtree = (TTree*)rfile->Get("Events");
Long64_t nentries = rtree->GetEntries();
TFile* wfile = TFile::Open("copytree.root","RECREATE");
TTree* wtree = rtree->CloneTree(0);
TTreePerfStats* ioperf = new TTreePerfStats("Stats Events", wtree);
for (Long64_t i = 0; i < nentries; ++i) {
    rtree->GetEntry(i);
    wtree->Fill();
}
wtree->Write();
ioperf->Print();
wfile->Close();
rfile->Close(); But I can't get any useful information from ioperf. |
@pcanal - Zhe is going to post the output of ioperf->Print() soon so we can help figure out what's going wrong. I'm not entirely sure the tree is being cloned properly; can you look at the code he posted? However, looking at your prior comment, I'm not sure that igprof states there's a 20% improvement going from zlib to lz4 -- it looks more modest than that to me. |
I will check .... |
Here is the output of the program. I only list the "Events" tree here. @bbockelm I was wrong; the program did generate the correct root output file. I guess the following is the result it is supposed to produce?
|
Yes, I had forgotten that TTreePerfStats was not wired for writing ... only reading. Would you consider updating TTreePerfStats and TBasket to also track writing? Thanks. |
Yeah, I'd be happy to start tackling that - but I'd much prefer to wrap this PR up first. - Brian
|
Brian, so the numbers seem to say that lz4 and zlib-1 are equivalent in terms of performance and compression. What use case do you see where lz4 really wins (i.e. where CMS would really benefit from switching)? |
@zzxuanyuan - for non-CMS files, can you also post the results for zlib-1? @pcanal - even for CMS files, LZ4 beat ZLIB-1 at decompression speed, right? |
Here are the results from the "event" executable.
|
@bbockelm The CMS case, if I am not mistaken, is the one
where at the same compression level the run-time gain is 5% (and even less compared to zlib-6) .... |
I measured three times for each algorithm and wrote the averages in the table. I think the decompression performance of zlib-6 and zlib-1 fluctuates but is quite similar. I could double-check it later.
|
One important question is to verify whether the times are divided by the compressed size or the decompressed size. |
Here are the results I got. Again, I ran each algorithm three times, interleaving the three algorithms in round-robin fashion (zlib, lz4, zlib-1, zlib, lz4, zlib-1, zlib, lz4, zlib-1). ReadUZ* is the speed obtained by dividing by the uncompressed size, and Read* by the compressed size. There is no obvious gap between lz4 and zlib. I tried to reproduce the previous test results; I guess the difference might be due to the order of the tests (previously I ran each test three times and then switched to the next algorithm: zlib zlib zlib zlib-1 zlib-1 zlib-1 lz4 lz4 lz4). However, I did clean the page cache before each run. |
I see we have some overlap; I also worked a bit on compression algorithms for root: pseyfert/root-compression. Results come from an LHCb analyst-level root file. Measurements for a central production file are less advanced: plot. |
@pseyfert I am glad to see we have overlap here. I took a look at your figure. Could you give me a brief overview of your graph? |
@zzxuanyuan I ran a production test job for 18 different compression settings - zlib from level 1 to 9 and lzma from level 1 to 9, each time with the same events. Size is the size of the output file in bytes. I ran through valgrind/callgrind, and the cycles are the number of cycles spent in the R__zipMultipleAlgorithm function (and functions called from there). The spike in the lzma curve is something I haven't understood yet; I want to crosscheck it by running over a different set of events (possibly more, though then it gets annoying in terms of the space requirements of the output and the time the tests take). Not included in the numbers, in neither test/summary.txt nor the plot, is the RAM usage of a production job (just running tcmalloc with heap profiling and reporting the peak usage; this probably comes with large overhead because I report the memory usage of the full process, not only the compression): zlib: 237 MB. Please also look at this file for the analyst root file, where I not only report numbers for writing but also for reading (though the last column seems buggy). |
@pseyfert - to be clear though, you're talking about a different algorithm, right? (LZMA versus LZ4) |
Well, in the linked file (referring to analysis ntuples) I have zlib, lzma, lzo, lz4, zopfli, and brotli, for compression levels 1-9, and I measure size, compression time (cpu cycles and wall clock), compression RAM, decompression time (cpu cycles and wall clock), and decompression RAM. For production jobs I only have zlib and lzma for compression levels 1-9, measuring size, compression time (cpu cycles), and compression RAM. |
Ah, ok - I was confused because the previous comment only referenced LZMA. For historical reference, could you post some of the plots here (particularly with respect to LZ4 versus zlib-1)? |
I don't have readable plots ready at the moment (with all the numbers I brute-forced 2d plots for all combinations of benchmarks, and plotted 6 lines in each - one per algorithm); the plots are then always dominated by the worst algorithm in each category. But you can read the numbers from the linked files. NB: only the zopfli and brotli interfaces are from 2015; the lzo and lz4 interfaces are from 2011 and the lz4 backend is stuck in 2012. I've seen lz4 has received upstream patches which I should merge; effectively, the compression level in the old version is meaningless. We didn't follow up on lz4 back then as cpu seemed cheap and disk was rare, so we switched to lzma. |
@pseyfert - would it be possible to share the file you used for this? Particularly, I'd be curious to see how the compression ratio looks with the version of LZ4 @zzxuanyuan is using. I note the entire process read time (which LZ4 is supposed to optimize) is 6% faster with LZ4. Is there any way to fix the last column? |
Hi, okay, I now understand what's wrong in the last column: my handwritten callgrind profile reader cannot handle a function appearing twice in the profile (and I don't know right now what that means). So I'll read the profiles with kcachegrind and type the numbers by hand. I'm hesitant to share since it's real data at analysis level which is not yet published. I think it's less controversial if I take one of the ntuples which were used to generate the data for the kaggle challenge, since that data is already out in the wild. (I will query my convenors for what can be shared without too much "political overhead".) |
The last column is fixed now. As additional explanation: I don't think it matters much for the interpretation; you just cannot say "decompression is x times faster than compression", but you can say "(de)compression with zlib is x times slower than (de)compression with lz4". |
@pcanal - From the last update of @pseyfert (see https://github.com/pseyfert/root-compression/blob/master/test/summary.txt), the LZ4 read speed was 7.2x faster than zlib-1. As I suspected, we're hitting the fact that decompression speed is not a huge part of the overall read workflow. Hence, I think the impact here is going to be more noticeable as we increase deserialization speed. @pseyfert - if it's not possible to post the file, could you rebuild your ROOT with the |
@zzxuanyuan - I noticed that merge conflicts have snuck in. Can you update the branch? |
@bbockelm It should be good to use now. |
I successfully ran @zzxuanyuan's version but got benchmarks far from mine, so I got suspicious that I'm seeing effects of something else that changed in ROOT (or possibly of my having changed the options with which I compile ROOT). So I also reran my implementation in @zzxuanyuan's ROOT build (since I use LD_PRELOAD, I can override @zzxuanyuan's compression while keeping the entire rest of ROOT).
Seems my implementation takes many fewer cpu cycles. Since there's not much code in the actual R__zip method, my guess is that this comes from the fact that I use I left out the real-time tests, as for these I'd prefer the computer "undisturbed", which it isn't at the moment. |
update: instead of building a preload library I now ported my changes into root itself here. There are still items on the todo list:
|
Can one of the admins verify this patch? |
@zzxuanyuan - can you close this PR as the functionality got merged elsewhere? |
Replaced/Reworked in #59 which has been merged. |
This is an updated pull request from #59. The same experiments have been run and the performance results are shown here:
The following performance numbers are from the root file in @pcanal's ticket (https://root.cern.ch/files/CMS_7250E9A5-682D-DF11-8701-002618943934.root). The file is 1.9 GB, and when I tried to decompress it, its original size seems to be 6.4 GB. The following compression/decompression speeds are calculated by dividing 6.4 GB by the time each test ran. @bbockelm, we could discuss the implementation details of my tests tomorrow.