Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Apache Commons Compress to 1.27.1 #1961

Closed
wladimirleite opened this issue Oct 31, 2023 · 21 comments · Fixed by #2091 or #2341
Closed

Upgrade Apache Commons Compress to 1.27.1 #1961

wladimirleite opened this issue Oct 31, 2023 · 21 comments · Fixed by #2091 or #2341
Assignees
Labels
bug dependencies Pull requests that update a dependency file

Comments

@wladimirleite
Copy link
Member

wladimirleite commented Oct 31, 2023

Processing a 7z of ~7 GB, another user here got several errors.
Testing the archive (with the 7-zip program), it seems fine.
After investigating other possibilities, I tried to use the latest version of Apache Commons Compress (1.24.0), and the processing finished successfully (currently, version 1.21 is used).

@lfcnassif, I know that upgrading this kind of dependency is critical, so I will leave this issue as a suggestion to be evaluated.

@lfcnassif
Copy link
Member

Hi @tc-wleite! I think we can upgrade this and other dependencies on master, targeting 4.2.0. I already plan to upgrade Tika for 4.2.0 and I should run a large regression test before releasing it.

@lfcnassif lfcnassif added the dependencies Pull requests that update a dependency file label Oct 31, 2023
@wladimirleite wladimirleite changed the title Upgrade Apache Commons Compress? Upgrade Apache Commons Compress Nov 11, 2023
@wladimirleite
Copy link
Member Author

I noticed that #2091 pointed out a possible vulnerability in 1.21, recommending the upgrade to 1.26.

@lfcnassif
Copy link
Member

There are 2 fixed CVEs in 1.26, CVE-2024-26308 (OOME) is worse to me, let's upgrade. Sorry @wladimirleite, I missed this in the queue.

@lfcnassif
Copy link
Member

Reopening, while testing #2093, I detected in a small test a lot of *.ar (application/x-archive) files stopped to be expanded. I'll test Compress 1.24 and 1.25 to see if that *.ar regression disappears, but I think we should do a large test before this upgrade.

@lfcnassif lfcnassif reopened this Feb 22, 2024
@lfcnassif
Copy link
Member

CommonsCompress-1.25 doesn't have the *.ar regression I detected. I'll downgrade to it on master, but I'll leave this open until we run a larger test to see if there are other regressions.

@wladimirleite
Copy link
Member Author

Reopening, while testing #2093, I detected in a small test a lot of *.ar (application/x-archive) files stopped to be expanded.

That is bad news.

CommonsCompress-1.25 doesn't have the *.ar regression I detected. I'll downgrade to it on master, but I'll leave this open until we run a larger test to see if there are other regressions.

Are the vulnerabilities solved in 1.25?

@lfcnassif
Copy link
Member

lfcnassif commented Feb 22, 2024

Are the vulnerabilities solved in 1.25?

I think no. Although non ideal, we have timeouts to protect against infinite loops (DoS) and external parsing to isolate OOMEs. Fortunately there is no (known) Remote Code Execution vulnerability.

@lfcnassif
Copy link
Member

Hopefully

I meant Fortunately.

@wladimirleite
Copy link
Member Author

If the only problem with 1.26 is with AR, maybe one option would be using SevenZipParser instead of PackageParser.

@lfcnassif
Copy link
Member

lfcnassif commented Feb 22, 2024

If the only problem with 1.26 is with AR, maybe one option would be using SevenZipParser instead of PackageParser.

That worked!

@wladimirleite
Copy link
Member Author

wladimirleite commented Sep 19, 2024

@lfcnassif, working on a real case, very important valid ZIP files were not expanded. Fortunately, I noticed the parser exceptions and tried to figure out what was going on.
Using the latest version of common-compress (1.27.1) solved the issue.
I know that there are a lot of things to deal with before releasing 4.2, but upgrading this library seems very important.

@lfcnassif
Copy link
Member

Hi @wladimirleite, thank you for reporting that. ZIP is much more important to handle correctly than AR. I'll upgrade when I have some time. But could you test the original 7zip sample referenced by the first post here with commons compress 1.27.1?

@lfcnassif lfcnassif added bug and removed task labels Sep 20, 2024
@wladimirleite
Copy link
Member Author

But could you test the original 7zip sample referenced by the first post here with commons compress 1.27.1?

Sure! I will try with that 7z.

@lfcnassif
Copy link
Member

One question: does 4.1.6 version handle those valid zip files correctly?

@wladimirleite
Copy link
Member Author

One question: does 4.1.6 version handle those valid zip files correctly?

No, it doesn't.
I tested the 7z I mentioned when I opened this issue using common compress 1.27.1, and it worked fine.
However, I found other ZIPs (from the some source I mentioned yesterday), which still fail using 1.27.1 (they also fail using previous versions). Not sure what is causing this, as they are opened with other tools. I am investing it a bit further, and will update here.

@wladimirleite
Copy link
Member Author

One question: does 4.1.6 version handle those valid zip files correctly?

I am investing it a bit further, and will update here.

I found out that SevenZipParser does handle the ZIPs I mentioned (PackageParser doesn't).

@lfcnassif, as compressed files are quite important (may contain a lot of files), have you ever considered using a MultipleParser?At least for some more common types of archives, and only using the second parser if the first fails.
I noticed MultipleParser already has a stopAfterSomeParserWorks, although it seems currently not used.

@lfcnassif
Copy link
Member

lfcnassif commented Oct 16, 2024

I found out that SevenZipParser does handle the ZIPs I mentioned (PackageParser doesn't).

Great!

@lfcnassif, as compressed files are quite important (may contain a lot of files), have you ever considered using a MultipleParser?

Not for generic containers, but for specialized ones, yes, e.g. for DOCX files run PackageParser if OOXMLParser fails (running RawStringParser over ZIP files doesn't extract too much info), for DOC files run GenericOLEParser if OfficeParser fails, etc. A possible bad side effect would be extracting duplicated subitems if first parser fails after extracting a subset of them...

I noticed MultipleParser already has a stopAfterSomeParserWorks, although it seems currently not used.

Yes, I put it there thinking about use cases like above, but I never tested it. Taking a quick look at the code, at least the tikaException = null at line 124 could be moved to before the break...

@lfcnassif
Copy link
Member

lfcnassif commented Oct 17, 2024

Running a small to medium size regression test between compress 1.25.0 and 1.27.1...

@lfcnassif
Copy link
Member

Tests seem fine, there are very minor differences for good and for bad, but the critical *.ar regression I detected with compress 1.26.0 is gone with 1.27.1.

@lfcnassif
Copy link
Member

lfcnassif commented Oct 17, 2024

@wladimirleite, please feel free to open another issue to tackle the problem of some ZIPs still not being handled by compress-1.27.1, maybe using a MultipleParser.

@lfcnassif lfcnassif changed the title Upgrade Apache Commons Compress Upgrade Apache Commons Compress to 1.27.1 Oct 17, 2024
@wladimirleite
Copy link
Member Author

@wladimirleite, please feel free to open another issue to tackle the issue of ZIPs still not being handled by compress-1.27.1, maybe using a MultipleParser.

Great! For now upgrading to 1.27.1 will already solve a lot of cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug dependencies Pull requests that update a dependency file
Projects
None yet
2 participants