Add support for LZ4 as a compression format #81

Closed · wants to merge 7 commits

Conversation

@zzxuanyuan (Contributor)

This is an updated pull request from #59. The same experiments have been run, and the performance results are shown here:

| Algorithm | Compression (write) | Decompression (read) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 11.74 MB/s | 131.06 MB/s | 181 MB |
| lzma | 0.86 MB/s | 17.36 MB/s | 157 MB |
| lz4 | 5.22 MB/s | 143.81 MB/s | 221 MB |

The following performance numbers are from the ROOT file in @pcanal's ticket (https://root.cern.ch/files/CMS_7250E9A5-682D-DF11-8701-002618943934.root). The file is 1.9 GB, and when I decompressed it, its original size appears to be 6.4 GB. The compression/decompression speeds below are calculated by dividing 6.4 GB by the time each test ran. @bbockelm, we could discuss the implementation details of my tests tomorrow.

| Algorithm | Compression (write) | Decompression (read) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 15.83 MB/s | 63.23 MB/s | 1.6 GB |
| lzma | 1.28 MB/s | 22.62 MB/s | 1.2 GB |
| lz4 | 8.32 MB/s | 66.53 MB/s | 1.8 GB |

Add lz4 source tarball for built-in version.

Tweak Event example to make it easier to compare compression algorithms.

Tweak build files for Event example to include TreePlayer (for TTreePerfStats).

Until we're sure we need it, no special path for Win32.

Add -fPIC flag to compile lz4
@bbockelm (Contributor) commented Sep 2, 2015

@zzxuanyuan - can you post the above info in a tabular format?

Also, can you repeat the tests using the file @pcanal posted in the other ticket?

Thanks!

@bbockelm (Contributor) commented Sep 8, 2015

@pcanal - I think this is ready to go in.

Talking with Zhe, I think the reason the CMS file shows a smaller increase in decompression speed is mostly due to the more complex objects in CMS's files (which cause deserialization to be the bottleneck, not decompression).

@pcanal (Member) commented Sep 8, 2015

Could we do a quick set of igprof runs to confirm this hypothesis?

@bbockelm (Contributor) commented Sep 8, 2015

@pcanal - I haven't shown Zhe igprof yet. I normally use it from the CMS context (with the CMS ROOT, CMS compilers, etc). Is there a good documentation page for using it with ROOT directly?

@pcanal (Member) commented Sep 8, 2015

There should be no difference from the CMS case ... The only thing to remember is to run it on the executable spelled root.exe (rather than the one spelled 'root').

@zzxuanyuan (Contributor Author)

[screenshot: igprof profile of the decompression runs, 2015-09-13]

I have been profiling the decompression executable with zlib, lzma, and lz4. I attached the result here. lz4 uses the least time to decompress baskets (16.3%). @pcanal @bbockelm

@pcanal (Member) commented Sep 21, 2015

So the igprof report says that lz4 on the CMS file is 20% faster than zlib. It is odd that this is not reflected in the data, as the data is supposed to come from "TTreePerfStats information, which gives us access to the compression-time-only rates" ... What am I missing? (i.e., do we have a bug in TTreePerfStats, or is the number above not the (de)compression-time-only rate?)

Also, LZ4HC and zlib-1 are expected to have similar file sizes. So could we also add zlib-1 to the table?

Thanks.

@zzxuanyuan (Contributor Author)

@pcanal I wrote a testing program on my own. For compression, it basically reads the ROOT file given in your ticket, compresses all the trees in it, and writes them out to another file. For decompression, it simply iterates over all entries in the compressed file. I used TStopwatch to measure the performance. I did not use TTreePerfStats in my program.
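
(For context, a minimal sketch of the kind of TStopwatch-based read test described above; the file name, tree name, and function are illustrative assumptions, not the actual test program.)

    // Sketch only: time a full read pass, which forces every basket to be
    // decompressed. Names here are assumptions, not the real test code.
    #include "TFile.h"
    #include "TTree.h"
    #include "TStopwatch.h"
    #include <cstdio>

    void timed_read(const char* fname = "compressed.root")
    {
        TFile* f = TFile::Open(fname);
        TTree* t = (TTree*)f->Get("Events");
        Long64_t nentries = t->GetEntries();

        TStopwatch sw;
        sw.Start();
        for (Long64_t i = 0; i < nentries; ++i)
            t->GetEntry(i);  // reads and decompresses each entry's baskets
        sw.Stop();

        std::printf("real %.2f s, cpu %.2f s\n", sw.RealTime(), sw.CpuTime());
        f->Close();
    }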

@pcanal (Member) commented Sep 21, 2015

> I did not use TTreePerfStats in my program.

Why not? (It was the tool used by Brian originally and would have given more accurate information by focusing on 'just' the zipping/unzipping.)
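
(For reference, a minimal read-side TTreePerfStats setup might look like the sketch below; the file and tree names are illustrative.)

    // Sketch only: attach TTreePerfStats to the tree being read so that
    // Print() reports the unzip rates (ReadUZRT/ReadUZCP).
    #include "TFile.h"
    #include "TTree.h"
    #include "TTreePerfStats.h"

    void perfstats_read(const char* fname = "compressed.root")
    {
        TFile* f = TFile::Open(fname);
        TTree* t = (TTree*)f->Get("Events");
        TTreePerfStats* ps = new TTreePerfStats("ioperf", t);
        for (Long64_t i = 0; i < t->GetEntries(); ++i)
            t->GetEntry(i);
        ps->Print();  // UZ* rates are per uncompressed byte, Read* per compressed byte
        f->Close();
    }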

@zzxuanyuan (Contributor Author)

I will rerun the tests and see how performance looks.

Dr15Jones pushed a commit to Dr15Jones/root that referenced this pull request Sep 22, 2015:
Resolve ASan reports (use-after-free) in TClass and TCling
@zzxuanyuan (Contributor Author)

I am having some trouble using TTreePerfStats to accumulate perf stats for all the trees in the CMS file. The following is the decompression-only time from the tree "Events" (it constitutes 1.8 GB out of the 1.9 GB CMS file). I also attached zlib-1 (I assume that stands for zlib at compression level 1).

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 54.13 MB/s | 63.28 MB/s | 1.6 GB |
| lzma | 22.47 MB/s | 23.41 MB/s | 1.2 GB |
| lz4 | 56.36 MB/s | 66.06 MB/s | 1.8 GB |
| zlib-1 | 54.58 MB/s | 62.78 MB/s | 1.8 GB |

@pcanal, regarding your previous comment: is there a way to compare the (de)compression time only from the igprof result? I did not quite understand why lz4 is 20% faster than zlib.

@zzxuanyuan (Contributor Author)

@pcanal @bbockelm I am still struggling to measure compression time. I can't find a good way to copy the tree before I attach it to TTreePerfStats.

Here is a piece of code extracted from my program:

        TFile* rfile = TFile::Open("CMS_7250E9A5-682D-DF11-8701-002618943934.root");
        TTree* rtree = (TTree*)rfile->Get("Events");
        Long64_t nentries = rtree->GetEntries();
        TFile* wfile = TFile::Open("copytree.root", "RECREATE");
        TTree* wtree = rtree->CloneTree(0);  // clone the structure only, no entries
        TTreePerfStats* ioperf = new TTreePerfStats("Stats Events", wtree);
        for (Long64_t i = 0; i < nentries; ++i) {
                rtree->GetEntry(i);
                wtree->Fill();
        }
        wtree->Write();
        ioperf->Print();
        wfile->Close();
        rfile->Close();

But I can't get any useful information from ioperf.

@bbockelm (Contributor)

@pcanal - Zhe is going to post the output of ioperf->Print() soon so we can help figure out what's going wrong. I'm not entirely sure the tree is being cloned properly; can you look at the code he posted?

However, looking at your prior comment, I'm not sure that igprof states there's a 20% improvement going from zlib to lz4 -- it looks more modest than that to me.

@pcanal (Member) commented Sep 28, 2015

I will check ....

@zzxuanyuan (Contributor Author)

Here is the output of the program. I only list the "Events" tree here. @bbockelm, I was wrong: the program did generate the correct ROOT output file. I guess the following is the result it is supposed to produce?


    Stats Tree Events
    TreeCache = 30 MBytes
    N leaves  = 285
    ReadTotal = 0 MBytes
    ReadUnZip = -nan MBytes
    ReadCalls = 0
    ReadSize  =    -nan KBytes/read
    Readahead = 256 KBytes
    Readextra =  -nan per cent
    Real Time = 848.538 seconds
    CPU  Time = 815.200 seconds
    Disk Time =   0.000 seconds
    Disk IO   =    -nan MBytes/s
    ReadUZRT  =    -nan MBytes/s
    ReadUZCP  =    -nan MBytes/s
    ReadRT    =   0.000 MBytes/s
    ReadCP    =   0.000 MBytes/s

@pcanal (Member) commented Sep 28, 2015

Yes, I had forgotten that TTreePerfStats was not wired for writing ... only reading. Would you consider updating TTreePerfStats and TBasket to also track writing?

Thanks.

@bbockelm (Contributor)

Yeah, I'd be happy to start tackling that - but I'd much prefer to wrap this PR up first.


@pcanal (Member) commented Sep 30, 2015

Brian,

So the numbers seem to say that lz4 and zlib-1 are equivalent in terms of performance and compression. What use case do you see where lz4 really wins (i.e., where CMS would really benefit from switching)?

@bbockelm (Contributor)

@zzxuanyuan - for non-CMS files, can you also post the results for zlib-1?

@pcanal - even for CMS files, LZ4 beat ZLIB-1 at decompression speed, right?

@zzxuanyuan (Contributor Author)

Here are the results from the "event" executable.

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 122.62 MB/s | 136.61 MB/s | 181 MB |
| lz4 | 127.57 MB/s | 146.42 MB/s | 221 MB |
| zlib-1 | 105.57 MB/s | 118.10 MB/s | 197 MB |

@pcanal (Member) commented Oct 1, 2015

@bbockelm The CMS numbers, if I am not mistaken, are:

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 54.13 MB/s | 63.28 MB/s | 1.6 GB |
| lzma | 22.47 MB/s | 23.41 MB/s | 1.2 GB |
| lz4 | 56.36 MB/s | 66.06 MB/s | 1.8 GB |
| zlib-1 | 54.58 MB/s | 62.78 MB/s | 1.8 GB |

Where at the same compression level the run-time gain is 5% (and even less compared to zlib-6) ...
Hmm, the numbers are odd: zlib-6 is decompressing faster than zlib-1? Are the times divided by the compressed or the decompressed size?

@zzxuanyuan (Contributor Author)

I measured each algorithm three times and wrote the averages in the table. I think the performance of zlib-6 and zlib-1 fluctuates but is quite similar in terms of decompression. I could double-check it later.


@pcanal (Member) commented Oct 1, 2015

One important question is to verify whether the times are divided by the compressed size or the decompressed size.

@zzxuanyuan (Contributor Author)

| Algorithm | ReadUZRT (unzipped, real time) | ReadUZCT (unzipped, CPU time) | ReadUnZip (unzipped size) | ReadRT (zipped, real time) | ReadCT (zipped, CPU time) | ReadTotal (zipped size) |
| --- | --- | --- | --- | --- | --- | --- |
| zlib | 55.22 MB/s | 60.46 MB/s | 6803.79 MB | 13.34 MB/s | 14.60 MB/s | 1643.17 MB |
| lz4 | 54.67 MB/s | 60.77 MB/s | 6803.26 MB | 15.01 MB/s | 16.68 MB/s | 1867.59 MB |
| zlib-1 | 52.64 MB/s | 58.43 MB/s | 6802.96 MB | 14.74 MB/s | 16.37 MB/s | 1905.55 MB |

Here are the results I tested. Again, I ran each algorithm three times, interleaving the three algorithms in round-robin fashion (zlib lz4 zlib-1 zlib lz4 zlib-1 zlib lz4 zlib-1). ReadUZ* represents the speed computed by dividing the uncompressed size by the time, and Read* the speed computed by dividing the compressed size by the time. There is no obvious gap between lz4 and zlib. I tried to reproduce the previous test results; I guess the difference might be due to the ordering of the tests? (Previously I ran each test three times and then switched to the next algorithm, like zlib zlib zlib zlib-1 zlib-1 zlib-1 lz4 lz4 lz4.) However, I did clear the page cache before each run.

@pseyfert (Contributor)

I see we have some overlap; I also worked a bit on compression algorithms for ROOT in pseyfert/root-compression. The results come from an LHCb analyst-level ROOT file. Measurements for a central production file are less advanced: plot.

@zzxuanyuan (Contributor Author)

@pseyfert I am glad to see we have overlap here. I took a look at your figure. Could you give me a brief overview of your graph?
I am curious what happens if zlib goes beyond 40×10^6 writing cycles, and why is there a sharply folded line for lzma? For example, I guess there could be two compressed sizes if the writing cycles reach 160×10^6.

@pseyfert (Contributor)

@zzxuanyuan I ran a production test job for 18 different compression settings (zlib levels 1 to 9 and lzma levels 1 to 9), each time over the same events. Size is the size of the output file in bytes. I ran through valgrind/callgrind, and the cycles are the number of cycles spent in the R__zipMultipleAlgorithm function (and functions called from there). The spike in the lzma curve is something I haven't understood yet; I want to cross-check it by running over a different set of events (possibly more, though then it gets annoying in terms of the space requirements of the output and the time the tests take).
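
(As an aside, one way to scope the cycle counting to just the fill/compress phase is valgrind's client-request macros; the sketch below is an assumption about how such a measurement could be set up, not necessarily how these numbers were produced. It assumes the job is launched as `valgrind --tool=callgrind --instr-atstart=no root.exe -b -q job.C`, after which R__zipMultipleAlgorithm's inclusive cost can be read off in kcachegrind.)

    // Sketch only: restrict callgrind's counting to the write/compress phase.
    #include <valgrind/callgrind.h>
    #include "TFile.h"
    #include "TTree.h"

    void job()
    {
        TFile f("out.root", "RECREATE");  // illustrative output file
        TTree t("Events", "demo");
        double x = 0;
        t.Branch("x", &x);

        CALLGRIND_START_INSTRUMENTATION;  // cycles are counted from here...
        for (int i = 0; i < 1000000; ++i) { x = i * 0.5; t.Fill(); }  // baskets compress as they fill
        t.Write();                        // compresses and writes the remaining baskets
        CALLGRIND_STOP_INSTRUMENTATION;   // ...to here
        f.Close();
    }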

Not included in the numbers, in either test/summary.txt or the plot, is the RAM usage of a production job (just running tcmalloc with heap profiling and reporting the peak usage; this probably comes with a large overhead because I report the memory usage of the full process, not only the compression). This was a different job than the one from which the file size and cycle counts come.

| Setting | Peak RAM |
| --- | --- |
| zlib:x | 237 MB |
| lzma:1 | 239 MB |
| lzma:2 | 241 MB |
| lzma:3 | 262 MB |
| lzma:4 | 276 MB |
| lzma:5 | 322 MB |
| lzma:6 | 322 MB |
| lzma:7 | 411 MB |
| lzma:8 | 600 MB |
| lzma:9 | 904 MB |

Please also look at this file for the analyst ROOT file, where I report numbers not only for writing but also for reading (though the last column seems buggy).

@bbockelm (Contributor) commented Jan 8, 2016

@pseyfert - to be clear though, you're talking about a different algorithm, right? (LZMA versus LZ4)

@pseyfert (Contributor) commented Jan 8, 2016

Well, in the linked file (referring to analysis ntuples) I have zlib, lzma, lzo, lz4, zopfli, and brotli at compression levels 1-9, and measure size, compression time (CPU cycles and wall clock), compression RAM, decompression time (CPU cycles and wall clock), and decompression RAM.

For production jobs I only have zlib and lzma at compression levels 1-9, measuring size, compression time (CPU cycles), and compression RAM.

@bbockelm (Contributor) commented Jan 8, 2016

Ah, ok - I was confused because the previous comment only referenced LZMA.

For historical reference, could you post some of the plots here (particularly with respect to LZ4 versus zlib-1)?

@pseyfert (Contributor) commented Jan 8, 2016

I don't have readable plots ready at the moment (with all the numbers I brute-forced 2D plots for all combinations of benchmarks and plotted six lines in each, one per algorithm; the plots are then always dominated by the worst algorithm in each category). But you can read the numbers from
here:
row 2 (zlib-1) and rows 29-37 (lz4).

NB: only the zopfli and brotli interfaces are from 2015; the lzo and lz4 interfaces are from 2011, and the lz4 backend is stuck in 2012. I've seen that lz4 received upstream patches which I should merge; effectively, the compression level in the old version is meaningless. We didn't follow up on lz4 back then, as CPU seemed cheap and disk was scarce, so we switched to lzma.

@bbockelm (Contributor) commented Jan 8, 2016

@pseyfert - would it be possible to share the file you used for this? Particularly, I'd be curious to see how the compression ratio looks with the version of LZ4 @zzxuanyuan is using. I note the entire process read time (which LZ4 is supposed to optimize) is 6% faster with LZ4.

Is there any way to fix the last column?

@pseyfert (Contributor)

Hi, okay, I now understand what's wrong with the last column: my handwritten callgrind profile reader cannot handle a function appearing twice in the profile (I don't know right now what that means). Which means I'll read the profiles with kcachegrind and type the numbers by hand.

I'm hesitant to share, since it's real analysis-level data which is not yet published. I think it's less controversial if I take one of the ntuples that were used to generate the data for the Kaggle challenge, since that data is already out in the wild. (I will ask my conveners what can be shared without too much "political overhead".)

@pseyfert (Contributor)

The last column is fixed now. As additional explanation:
I have two ntuples, a small one from the first round of benchmarks and a larger one where the benchmarks take much longer (also, the small one fits on my notebook; the large one is processed on the cluster). For zlib-1 and lz4 I report the callgrind reading cycles from both files; for all other settings I only report the values from the larger file.

I don't think that matters much for the interpretation; you just cannot say "decompression is x times faster than compression", but you can say "(de)compression with zlib is x times slower than (de)compression with lz4".

@bbockelm (Contributor)

@pcanal - From the last update of @pseyfert (see https://github.com/pseyfert/root-compression/blob/master/test/summary.txt), the LZ4 read speed was 7.2X faster versus zlib-1. As I suspected, we're hitting the fact that decompression speed is not a huge part of the overall read workflow. Hence, I think the impact here is going to be more noticeable as we increase deserialization speed.

@pseyfert - if it's not possible to post the file, could you rebuild your ROOT with the LZ4-HC algorithm that @zzxuanyuan used in this PR and then check the filesize? I want to see how LZ4-HC's compression ratio compares with ZLIB-1 for this case.

@bbockelm (Contributor)

@zzxuanyuan - I noticed that merge conflicts have snuck in. Can you update the branch?

@zzxuanyuan (Contributor Author)

@bbockelm It should be good to use now.

@pseyfert (Contributor)

I successfully ran @zzxuanyuan's version but got benchmarks far from mine, so I became suspicious that I was seeing effects of something else that changed in ROOT (or possibly of having changed the options with which I compile ROOT). So I also reran my implementation in @zzxuanyuan's ROOT build (since I use LD_PRELOAD, I can override @zzxuanyuan's compression while keeping the entire rest of ROOT).
If you compare these new numbers with the previous ones I posted, the RAM consumption went up in general, but it's the same for my implementation and @zzxuanyuan's. In any case, I don't measure the RAM consumption of individual functions, and at some point I concluded that it is dominated by holding the ROOT file in memory.

| Algorithm | Compressed size (B) | Memory reading (peak, process, B) | Memory writing (peak, process, B) | Cycles writing (function) | Cycles reading (function) |
| --- | --- | --- | --- | --- | --- |
| ZLIB (yours) | 50704058 | 330447558 | 374567259 | 255700352 | 2443672496 |
| LZ4 (yours) | 69861890 | 357286997 | 374567259 | 188653871 | 1006741856 |
| ZLIB (mine) | 50704058 | 327875161 | 371586581 | 255702351 | 2443713877 |
| LZ4 (mine) | 69874907 | 354320671 | 371586581 | 28579724 | 275698694 |

It seems my implementation takes far fewer CPU cycles. Since there's not much code in the actual R__zip method, my guess is that this comes from the fact that I use LZ4_compress while you use LZ4_compress_limitedOutput?
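
(For illustration, the two entry points of the pre-2017 LZ4 API differ in whether the caller bounds the output; a minimal sketch, with illustrative input and buffer sizes:)

    #include <cstdio>
    #include <vector>
    #include "lz4.h"

    int main()
    {
        const char src[] = "aaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbb";
        const int srcSize = static_cast<int>(sizeof(src));

        // LZ4_compress: the caller must provide LZ4_compressBound(srcSize)
        // bytes of output space; it cannot fail for lack of room.
        std::vector<char> out1(LZ4_compressBound(srcSize));
        int n1 = LZ4_compress(src, out1.data(), srcSize);

        // LZ4_compress_limitedOutput: returns 0 if the result would not fit
        // in maxOutputSize -- the natural fit for a fixed-size basket buffer.
        std::vector<char> out2(srcSize);
        int n2 = LZ4_compress_limitedOutput(src, out2.data(), srcSize, srcSize);

        std::printf("LZ4_compress: %d B, limitedOutput: %d B\n", n1, n2);
        return 0;
    }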

I left out the real-time tests, as for those I'd prefer the computer "undisturbed", which it isn't at the moment.

@pseyfert (Contributor)

Update: instead of building a preload library, I have now ported my changes into ROOT itself here.

There are still items on the todo list:

- using an external LZ4 fails
- checksum tests are skipped
- compiler warnings should get fixed
- documentation and attribution (I looked at your PR as a guide for how to update the CMakeLists files)
- ./configure scripts not updated
- license issue (at the moment I state in the description that the changes require GPLv3)
- update/run benchmarks

@phsft-bot (Collaborator)

Can one of the admins verify this patch?

@bbockelm (Contributor) commented Aug 7, 2017

@zzxuanyuan - can you close this PR as the functionality got merged elsewhere?

@phsft-bot (Collaborator)

Can one of the admins verify this patch?

@pcanal (Member) commented Aug 9, 2017

Replaced/Reworked in #59 which has been merged.
