Add support for LZ4 as a compression format #81

Closed · wants to merge 7 commits

Conversation

@zzxuanyuan (Contributor)

This is an updated pull request from #59. The same experiments have been run, and the performance results are shown here:

| Algorithm | Compression (write) | Decompression (read) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 11.74 MB/s | 131.06 MB/s | 181 MB |
| lzma | 0.86 MB/s | 17.36 MB/s | 157 MB |
| lz4 | 5.22 MB/s | 143.81 MB/s | 221 MB |

The following performance numbers are from the ROOT file in @pcanal's ticket (https://root.cern.ch/files/CMS_7250E9A5-682D-DF11-8701-002618943934.root). The file is 1.9 GB, and when I decompressed it, its original size appears to be 6.4 GB. The compression/decompression speeds below are calculated by dividing 6.4 GB by the time each test ran. @bbockelm, we could discuss the implementation details of my tests tomorrow.

| Algorithm | Compression (write) | Decompression (read) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 15.83 MB/s | 63.23 MB/s | 1.6 GB |
| lzma | 1.28 MB/s | 22.62 MB/s | 1.2 GB |
| lz4 | 8.32 MB/s | 66.53 MB/s | 1.8 GB |

Add lz4 source tarball for built-in version.

Tweak Event example to make it easier to compare compression algorithms.

Tweak build files for Event example to include TreePlayer (for TTreePerfStats).

Until we're sure we need it, no special path for Win32.

Add -fPIC flag to compile lz4
@bbockelm (Contributor) commented Sep 2, 2015

@zzxuanyuan - can you post the above info in a tabular format?

Also, can you repeat the tests using the file @pcanal posted in the other ticket?

Thanks!

@bbockelm (Contributor) commented Sep 8, 2015

@pcanal - I think this is ready to go in.

Talking with Zhe, I think the reason the CMS file shows a smaller increase in decompression speed is mostly due to the more complex objects in CMS's files (which cause deserialization to be the bottleneck, not decompression).

@pcanal (Member) commented Sep 8, 2015

Could we do a quick set of igprof runs to confirm this hypothesis?

@bbockelm (Contributor) commented Sep 8, 2015

@pcanal - I haven't shown Zhe igprof yet. I normally use it from the CMS context (with the CMS ROOT, CMS compilers, etc). Is there a good documentation page for using it with ROOT directly?

@pcanal (Member) commented Sep 8, 2015

There should be no difference from the CMS case ... The only thing to remember is to run it on the executable spelled root.exe (rather than the one spelled 'root').

@zzxuanyuan (Contributor Author)

[screenshot: igprof profile of the decompression runs, 2015-09-13]

I have been profiling the decompression executable with zlib, lzma, and lz4. I attached the result here. lz4 uses the least time to decompress baskets (16.3%). @pcanal @bbockelm

@pcanal (Member) commented Sep 21, 2015

So the igprof report says that lz4 on the CMS file is 20% faster than zlib. It is odd that this is not reflected in the data, as the data is supposed to come from "TTreePerfStats information, which gives us access to the compression-time-only rates" ... What am I missing? (i.e., do we have a bug in TTreePerfStats, or is the number above not the (de)compression-time-only rate?)

Also, LZ4HC and zlib-1 are expected to have similar file sizes. So could we also add zlib-1 to the table?

Thanks.

@zzxuanyuan (Contributor Author)

@pcanal I wrote a testing program on my own. For compression, it basically reads the ROOT file given in your ticket, compresses all the trees in it, and writes them out to another file. For decompression, it simply iterates over all entries in the compressed file. I used TStopwatch to measure the performance. I did not use TTreePerfStats in my program.
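
(For context, a minimal sketch of the kind of TStopwatch-based read test described above; the file name, tree name, and function are illustrative assumptions, not the actual test program.)

    // Sketch only: time a full read pass, which forces every basket to be
    // decompressed. Names here are assumptions, not the real test code.
    #include "TFile.h"
    #include "TTree.h"
    #include "TStopwatch.h"
    #include <cstdio>

    void timed_read(const char* fname = "compressed.root")
    {
        TFile* f = TFile::Open(fname);
        TTree* t = (TTree*)f->Get("Events");
        Long64_t nentries = t->GetEntries();

        TStopwatch sw;
        sw.Start();
        for (Long64_t i = 0; i < nentries; ++i)
            t->GetEntry(i);  // reads and decompresses each entry's baskets
        sw.Stop();

        std::printf("real %.2f s, cpu %.2f s\n", sw.RealTime(), sw.CpuTime());
        f->Close();
    }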

@pcanal (Member) commented Sep 21, 2015

> I did not use TTreePerfStats in my program.

Why not? (It was the tool used by Brian originally and would have given more accurate information by focusing on 'just' the zipping/unzipping.)
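
(For reference, a minimal read-side TTreePerfStats setup might look like the sketch below; the file and tree names are illustrative.)

    // Sketch only: attach TTreePerfStats to the tree being read so that
    // Print() reports the unzip rates (ReadUZRT/ReadUZCP).
    #include "TFile.h"
    #include "TTree.h"
    #include "TTreePerfStats.h"

    void perfstats_read(const char* fname = "compressed.root")
    {
        TFile* f = TFile::Open(fname);
        TTree* t = (TTree*)f->Get("Events");
        TTreePerfStats* ps = new TTreePerfStats("ioperf", t);
        for (Long64_t i = 0; i < t->GetEntries(); ++i)
            t->GetEntry(i);
        ps->Print();  // UZ* rates are per uncompressed byte, Read* per compressed byte
        f->Close();
    }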

@zzxuanyuan (Contributor Author)

I will rerun the tests and see how performance looks.

Dr15Jones pushed a commit to Dr15Jones/root that referenced this pull request Sep 22, 2015:
Resolve ASan reports (use-after-free) in TClass and TCling
@zzxuanyuan (Contributor Author)

I am having some trouble using TTreePerfStats to accumulate perf stats for all the trees in the CMS file. The following is the decompression-only time from the tree "Events" (it constitutes 1.8 GB out of the 1.9 GB CMS file). I also attached zlib-1 (I assume that stands for zlib at compression level 1).

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 54.13 MB/s | 63.28 MB/s | 1.6 GB |
| lzma | 22.47 MB/s | 23.41 MB/s | 1.2 GB |
| lz4 | 56.36 MB/s | 66.06 MB/s | 1.8 GB |
| zlib-1 | 54.58 MB/s | 62.78 MB/s | 1.8 GB |

@pcanal, regarding your previous comment: is there a way to compare the (de)compression time only from the igprof result? I did not quite understand why lz4 is 20% faster than zlib.

@zzxuanyuan (Contributor Author)

@pcanal @bbockelm I am still struggling to measure compression time. I can't find a good way to copy the tree before I attach it to TTreePerfStats.

Here is a piece of code extracted from my program:

        TFile* rfile = TFile::Open("CMS_7250E9A5-682D-DF11-8701-002618943934.root");
        TTree* rtree = (TTree*)rfile->Get("Events");
        Long64_t nentries = rtree->GetEntries();
        TFile* wfile = TFile::Open("copytree.root", "RECREATE");
        TTree* wtree = rtree->CloneTree(0);  // clone the structure only, no entries
        TTreePerfStats* ioperf = new TTreePerfStats("Stats Events", wtree);
        for (Long64_t i = 0; i < nentries; ++i) {
                rtree->GetEntry(i);
                wtree->Fill();
        }
        wtree->Write();
        ioperf->Print();
        wfile->Close();
        rfile->Close();

But I can't get any useful information from ioperf.

@bbockelm (Contributor)

@pcanal - Zhe is going to post the output of ioperf->Print() soon so we can help figure out what's going wrong. I'm not entirely sure the tree is being cloned properly; can you look at the code he posted?

However, looking at your prior comment, I'm not sure that igprof states there's a 20% improvement going from zlib to lz4 -- it looks more modest than that to me.

@pcanal (Member) commented Sep 28, 2015

I will check ....

@zzxuanyuan (Contributor Author)

Here is the output of the program. I only list the "Events" tree here. @bbockelm, I was wrong: the program did generate the correct ROOT output file. I guess the following is the result it is supposed to produce?


    Stats Tree Events
    TreeCache = 30 MBytes
    N leaves  = 285
    ReadTotal = 0 MBytes
    ReadUnZip = -nan MBytes
    ReadCalls = 0
    ReadSize  =    -nan KBytes/read
    Readahead = 256 KBytes
    Readextra =  -nan per cent
    Real Time = 848.538 seconds
    CPU  Time = 815.200 seconds
    Disk Time =   0.000 seconds
    Disk IO   =    -nan MBytes/s
    ReadUZRT  =    -nan MBytes/s
    ReadUZCP  =    -nan MBytes/s
    ReadRT    =   0.000 MBytes/s
    ReadCP    =   0.000 MBytes/s

@pcanal (Member) commented Sep 28, 2015

Yes, I had forgotten that TTreePerfStats was not wired for writing ... only reading. Would you consider updating TTreePerfStats and TBasket to also track writing?

Thanks.

@bbockelm (Contributor)

Yeah, I'd be happy to start tackling that - but I'd much prefer to wrap this PR up first.


@pcanal (Member) commented Sep 30, 2015

Brian,

So the numbers seem to say that lz4 and zlib-1 are equivalent in terms of performance and compression. What use case do you see where lz4 really wins (i.e., where CMS would really benefit from switching)?

@bbockelm (Contributor)

@zzxuanyuan - for non-CMS files, can you also post the results for zlib-1?

@pcanal - even for CMS files, LZ4 beat ZLIB-1 at decompression speed, right?

@zzxuanyuan (Contributor Author)

Here are the results from the "event" executable.

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 122.62 MB/s | 136.61 MB/s | 181 MB |
| lz4 | 127.57 MB/s | 146.42 MB/s | 221 MB |
| zlib-1 | 105.57 MB/s | 118.10 MB/s | 197 MB |

@pcanal (Member) commented Oct 1, 2015

@bbockelm The CMS numbers, if I am not mistaken, are:

| Algorithm | Decompression (real time) | Decompression (CPU time) | Compressed file size |
| --- | --- | --- | --- |
| zlib | 54.13 MB/s | 63.28 MB/s | 1.6 GB |
| lzma | 22.47 MB/s | 23.41 MB/s | 1.2 GB |
| lz4 | 56.36 MB/s | 66.06 MB/s | 1.8 GB |
| zlib-1 | 54.58 MB/s | 62.78 MB/s | 1.8 GB |

Where at the same compression level the run-time gain is 5% (and even less compared to zlib-6) ...
Hmm, the numbers are odd: zlib-6 is decompressing faster than zlib-1? Are the times divided by the compressed or the decompressed size?

@zzxuanyuan (Contributor Author)

I measured each algorithm three times and wrote the averages in the table. I think the performance of zlib-6 and zlib-1 fluctuates but is quite similar in terms of decompression. I could double-check it later.


@pcanal (Member) commented Oct 1, 2015

One important question is to verify whether the times are divided by the compressed size or the decompressed size.

@zzxuanyuan (Contributor Author)

| Algorithm | ReadUZRT (unzipped, real time) | ReadUZCT (unzipped, CPU time) | ReadUnZip (unzipped size) | ReadRT (zipped, real time) | ReadCT (zipped, CPU time) | ReadTotal (zipped size) |
| --- | --- | --- | --- | --- | --- | --- |
| zlib | 55.22 MB/s | 60.46 MB/s | 6803.79 MB | 13.34 MB/s | 14.60 MB/s | 1643.17 MB |
| lz4 | 54.67 MB/s | 60.77 MB/s | 6803.26 MB | 15.01 MB/s | 16.68 MB/s | 1867.59 MB |
| zlib-1 | 52.64 MB/s | 58.43 MB/s | 6802.96 MB | 14.74 MB/s | 16.37 MB/s | 1905.55 MB |

Here are the results I tested. Again, I ran each algorithm three times, interleaving the three algorithms in round-robin fashion (zlib lz4 zlib-1 zlib lz4 zlib-1 zlib lz4 zlib-1). ReadUZ* represents the speed computed by dividing the uncompressed size by the time, and Read* the speed computed by dividing the compressed size by the time. There is no obvious gap between lz4 and zlib. I tried to reproduce the previous test results; I guess the difference might be due to the ordering of the tests? (Previously I ran each test three times and then switched to the next algorithm, like zlib zlib zlib zlib-1 zlib-1 zlib-1 lz4 lz4 lz4.) However, I did clear the page cache before each run.

@pseyfert (Contributor)

I see we have some overlap; I also worked a bit on compression algorithms for ROOT in pseyfert/root-compression. The results come from an LHCb analyst-level ROOT file. Measurements for a central production file are less advanced: plot.

@zzxuanyuan (Contributor Author)

@pseyfert I am glad to see we have overlap here. I took a look at your figure. Could you give me a brief overview of your graph?
I am curious what happens if zlib goes beyond 40×10^6 writing cycles, and why is there a sharply folded line for lzma? For example, I guess there could be two compressed sizes if the writing cycles reach 160×10^6.

@pseyfert (Contributor)

@zzxuanyuan I ran a production test job for 18 different compression settings (zlib levels 1 to 9 and lzma levels 1 to 9), each time over the same events. Size is the size of the output file in bytes. I ran through valgrind/callgrind, and the cycles are the number of cycles spent in the R__zipMultipleAlgorithm function (and functions called from there). The spike in the lzma curve is something I haven't understood yet; I want to cross-check it by running over a different set of events (possibly more, though then it gets annoying in terms of the space requirements of the output and the time the tests take).
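
(As an aside, one way to scope the cycle counting to just the fill/compress phase is valgrind's client-request macros; the sketch below is an assumption about how such a measurement could be set up, not necessarily how these numbers were produced. It assumes the job is launched as `valgrind --tool=callgrind --instr-atstart=no root.exe -b -q job.C`, after which R__zipMultipleAlgorithm's inclusive cost can be read off in kcachegrind.)

    // Sketch only: restrict callgrind's counting to the write/compress phase.
    #include <valgrind/callgrind.h>
    #include "TFile.h"
    #include "TTree.h"

    void job()
    {
        TFile f("out.root", "RECREATE");  // illustrative output file
        TTree t("Events", "demo");
        double x = 0;
        t.Branch("x", &x);

        CALLGRIND_START_INSTRUMENTATION;  // cycles are counted from here...
        for (int i = 0; i < 1000000; ++i) { x = i * 0.5; t.Fill(); }  // baskets compress as they fill
        t.Write();                        // compresses and writes the remaining baskets
        CALLGRIND_STOP_INSTRUMENTATION;   // ...to here
        f.Close();
    }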

Not included in the numbers, in either test/summary.txt or the plot, is the RAM usage of a production job (just running tcmalloc with heap profiling and reporting the peak usage; this probably comes with a large overhead because I report the memory usage of the full process, not only the compression). This was a different job than the one from which the file size and cycle counts come.

| Setting | Peak RAM |
| --- | --- |
| zlib:x | 237 MB |
| lzma:1 | 239 MB |
| lzma:2 | 241 MB |
| lzma:3 | 262 MB |
| lzma:4 | 276 MB |
| lzma:5 | 322 MB |
| lzma:6 | 322 MB |
| lzma:7 | 411 MB |
| lzma:8 | 600 MB |
| lzma:9 | 904 MB |

Please also look at this file for the analyst ROOT file, where I report numbers not only for writing but also for reading (though the last column seems buggy).

@bbockelm (Contributor) commented Jan 8, 2016

@pseyfert - to be clear though, you're talking about a different algorithm, right? (LZMA versus LZ4)

@pseyfert (Contributor) commented Jan 8, 2016

Well, in the linked file (referring to analysis ntuples) I have zlib, lzma, lzo, lz4, zopfli, and brotli at compression levels 1-9, and measure size, compression time (CPU cycles and wall clock), compression RAM, decompression time (CPU cycles and wall clock), and decompression RAM.

For production jobs I only have zlib and lzma at compression levels 1-9, measuring size, compression time (CPU cycles), and compression RAM.

@bbockelm (Contributor) commented Jan 8, 2016

Ah, ok - I was confused because the previous comment only referenced LZMA.

For historical reference, could you post some of the plots here (particularly with respect to LZ4 versus zlib-1)?

@pseyfert (Contributor) commented Jan 8, 2016

I don't have readable plots ready at the moment (with all the numbers I brute-forced 2D plots for all combinations of benchmarks and plotted six lines in each, one per algorithm; the plots are then always dominated by the worst algorithm in each category). But you can read the numbers from
here:
row 2 (zlib-1) and rows 29-37 (lz4).

NB: only the zopfli and brotli interfaces are from 2015; the lzo and lz4 interfaces are from 2011, and the lz4 backend is stuck in 2012. I've seen that lz4 received upstream patches which I should merge; effectively, the compression level in the old version is meaningless. We didn't follow up on lz4 back then, as CPU seemed cheap and disk was scarce, so we switched to lzma.

@bbockelm (Contributor) commented Jan 8, 2016

@pseyfert - would it be possible to share the file you used for this? Particularly, I'd be curious to see how the compression ratio looks with the version of LZ4 @zzxuanyuan is using. I note the entire process read time (which LZ4 is supposed to optimize) is 6% faster with LZ4.

Is there any way to fix the last column?

@pseyfert (Contributor)

Hi, okay, I now understand what's wrong with the last column: my handwritten callgrind profile reader cannot handle a function appearing twice in the profile (I don't know right now what that means). Which means I'll read the profiles with kcachegrind and type the numbers by hand.

I'm hesitant to share, since it's real analysis-level data which is not yet published. I think it's less controversial if I take one of the ntuples that were used to generate the data for the Kaggle challenge, since that data is already out in the wild. (I will ask my conveners what can be shared without too much "political overhead".)

@pseyfert (Contributor)

The last column is fixed now. As additional explanation:
I have two ntuples, a small one from the first round of benchmarks and a larger one where the benchmarks take much longer (also, the small one fits on my notebook; the large one is processed on the cluster). For zlib-1 and lz4 I report the callgrind reading cycles from both files; for all other settings I only report the values from the larger file.

I don't think that matters much for the interpretation; you just cannot say "decompression is x times faster than compression", but you can say "(de)compression with zlib is x times slower than (de)compression with lz4".

@bbockelm (Contributor)

@pcanal - From the last update of @pseyfert (see https://github.com/pseyfert/root-compression/blob/master/test/summary.txt), the LZ4 read speed was 7.2X faster versus zlib-1. As I suspected, we're hitting the fact that decompression speed is not a huge part of the overall read workflow. Hence, I think the impact here is going to be more noticeable as we increase deserialization speed.

@pseyfert - if it's not possible to post the file, could you rebuild your ROOT with the LZ4-HC algorithm that @zzxuanyuan used in this PR and then check the filesize? I want to see how LZ4-HC's compression ratio compares with ZLIB-1 for this case.

@bbockelm (Contributor)

@zzxuanyuan - I noticed that merge conflicts have snuck in. Can you update the branch?

@zzxuanyuan (Contributor Author)

@bbockelm It should be good to use now.

@pseyfert (Contributor)

I successfully ran @zzxuanyuan's version but got benchmarks far from mine, so I became suspicious that I was seeing effects of something else that changed in ROOT (or possibly of having changed the options with which I compile ROOT). So I also reran my implementation in @zzxuanyuan's ROOT build (since I use LD_PRELOAD, I can override @zzxuanyuan's compression while keeping the entire rest of ROOT).
If you compare these new numbers with the previous ones I posted, the RAM consumption went up in general, but it's the same for my implementation and @zzxuanyuan's. In any case, I don't measure the RAM consumption of individual functions, and at some point I concluded that it is dominated by holding the ROOT file in memory.

| Algorithm | Compressed size (B) | Memory reading (peak, process, B) | Memory writing (peak, process, B) | Cycles writing (function) | Cycles reading (function) |
| --- | --- | --- | --- | --- | --- |
| ZLIB (yours) | 50704058 | 330447558 | 374567259 | 255700352 | 2443672496 |
| LZ4 (yours) | 69861890 | 357286997 | 374567259 | 188653871 | 1006741856 |
| ZLIB (mine) | 50704058 | 327875161 | 371586581 | 255702351 | 2443713877 |
| LZ4 (mine) | 69874907 | 354320671 | 371586581 | 28579724 | 275698694 |

It seems my implementation takes far fewer CPU cycles. Since there's not much code in the actual R__zip method, my guess is that this comes from the fact that I use LZ4_compress while you use LZ4_compress_limitedOutput?
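
(For illustration, the two entry points of the pre-2017 LZ4 API differ in whether the caller bounds the output; a minimal sketch, with illustrative input and buffer sizes:)

    #include <cstdio>
    #include <vector>
    #include "lz4.h"

    int main()
    {
        const char src[] = "aaaaaaaaaabbbbbbbbbbaaaaaaaaaabbbbbbbbbb";
        const int srcSize = static_cast<int>(sizeof(src));

        // LZ4_compress: the caller must provide LZ4_compressBound(srcSize)
        // bytes of output space; it cannot fail for lack of room.
        std::vector<char> out1(LZ4_compressBound(srcSize));
        int n1 = LZ4_compress(src, out1.data(), srcSize);

        // LZ4_compress_limitedOutput: returns 0 if the result would not fit
        // in maxOutputSize -- the natural fit for a fixed-size basket buffer.
        std::vector<char> out2(srcSize);
        int n2 = LZ4_compress_limitedOutput(src, out2.data(), srcSize, srcSize);

        std::printf("LZ4_compress: %d B, limitedOutput: %d B\n", n1, n2);
        return 0;
    }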

I left out the real-time tests, as for those I'd prefer the computer "undisturbed", which it isn't at the moment.

@pseyfert (Contributor)

Update: instead of building a preload library, I have now ported my changes into ROOT itself here.

There are still items on the todo list:

- using an external LZ4 fails
- checksum tests are skipped
- compiler warnings should get fixed
- documentation and attribution (I looked at your PR as a guide for how to update the CMakeLists files)
- ./configure scripts not updated
- license issue (at the moment I state in the description that the changes require GPLv3)
- update/run benchmarks

@phsft-bot (Collaborator)

Can one of the admins verify this patch?

@bbockelm (Contributor) commented Aug 7, 2017

@zzxuanyuan - can you close this PR as the functionality got merged elsewhere?

@phsft-bot (Collaborator)

Can one of the admins verify this patch?

@pcanal (Member) commented Aug 9, 2017

Replaced/Reworked in #59 which has been merged.
