-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove BinaryReader and BinaryWriter references from ZipArchive #103153
Remove BinaryReader and BinaryWriter references from ZipArchive #103153
Conversation
We now read the data into a (sometimes stack-allocated) byte array and slice it up with BinaryPrimitives. This reduces the number of reads and writes required to read and write a ZipArchive. It also makes future work to enable async APIs easier, since BinaryReader and BinaryWriter lack this support. Also changed approach to reading central file directory headers. Rather than performing X reads per header, we read 4KB of data at a time and look for all applicable headers in that data. This should improve performance when dealing with many small files.
This allowed the removal of the ArchiveReader property from ZipArchive.
Now pooling the file IO buffers and the temporary buffers for extra fields of the CD file header (which would otherwise be allocated and deallocated in a loop.)
Tagging subscribers to this area: @dotnet/area-system-io-compression |
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
This handles 64x entries with 19-character filenames (and thus, 65-byte file headers.) As a result, it straddles two 4KB read buffers. Also corrected the seek logic while reading the central directory header
@edwardneal the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick partial review. Also needs resolving the merge conflict so we get proper CI results.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_ReadTests.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/tests/ZipArchive/zip_ReadTests.cs
Outdated
Show resolved
Hide resolved
Thanks @carlossanlop - I've addressed the merge conflict and made those test changes. I've got a number of post-merge test failures to deal with - I'll look at those shortly, ready for review post-14th. |
The buffer returned from the ArrayPool contained older data (including a ZIP header.) When reading the last chunk of the file (i.e a chunk which was less than BackwardsSeekingBufferSize) the buffer's Span wasn't resized to account for this. SeekBackwardsToSignature would thus find the older data, try to seek beyond the end of the stream and fail to read the file.
I fetched your branch and all the relevant tests passed locally: System.IO.Compression, System.IO.Compression.ZipFile and System.IO.Packaging. I merged the latest bits in main into this branch, as the last CI results were already stale and the build info was gone. Let's see what comes out of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left more feedback for you to consider. Thank you so much for this change.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Outdated
Show resolved
Hide resolved
Reads and writes are now performed using a new set of field lengths and locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edwardneal thanks for applying the suggestions, I'm happy with how this is looking! I left a (hopefully) final batch of comments for you to consider.
After addressing those comments, I'd like to run a CI run that doesn't get executed by default, which will test your code in mobile platforms. If we don't see anything concerning there, I think we can merge it.
If it's in your possibilities, we could try to get this merged before the Code Complete day for .NET 10 Preview 1, which is Monday January 27th. Let me know if this works for you, otherwise I can take over for the above two final steps.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipHelper.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipHelper.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipHelper.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipHelper.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipBlocks.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipBlocks.cs
Show resolved
Hide resolved
Formatting change; added one comment to Zip64EndOfCentralDirectoryLocator.SignatureConstantBytes; clarified comment on ZipHelper.SeekBackwardsAndRead
Added comment to SeekBackwardsToStream. Lingering references to SignatureConstantBytes.Length. Added two asserts to CanReadLargeCentralDirectoryHeader test, verifying that the archive entry metadata is in a sensible state.
/azp run runtime-extra-platforms |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you @edwardneal for this PR.
The CI failures in the default legs are all marked as expected. The runtime-extra-platforms legs unfortunately are plagued with exit 137
for all assemblies, so we'll have to verify the results there at a later time after that issue gets resolved.
@carlossanlop I've seen several outerloop tests failing on |
Okay, thank you. We might have to revert both to avoid risking the preview. |
Relates to #83909, #31460.
This removes all references to BinaryReader and BinaryWriter from ZipArchive and ZipArchiveEntry. It also changes the way that the ZIP central directory header is read from the source stream and makes one tweak to the way that the EOCD header is detected.
I've removed BinaryReader and BinaryWriter for three reasons:
The second change is to adjust the way that the ZIP central directory header is read. Previously, this was read from the source stream file-by-file. This PR now reads from the source in 4KB blocks and tries to read the headers from there. This is much faster. I've chosen not to implement it when writing the CD headers because they contain dynamic data and I didn't want to copy buffers around; I'm open to doing so.
The detection of the end-of-central-directory header is very similar too. It was already doing something similar, but with only 16 bytes at a time. I've tweaked this to read 4KB block instead, and changed the way it searches for the EOCD signature to use an intrinsic rather than byte-by-byte bit shuffling.
In both cases, I've picked 4KB because it feels like a small enough buffer to not make a massive difference to wait times, and it aligns with the 4KB buffer on FileStream (which I imagine would be the most common use case.)
There are performance improvements across the board. To benchmark this, I used a test wrapping stream which simulates the worst case - an Xms Thread.Sleep on every Read and Write call. Results are below, but in short:
Reads
Creation
Benchmark results - Reads
Benchmark results - Reads (scaling up by number of ZipArchiveEntry children)
Benchmark results - Creation
Benchmark header
NB: because this changes the number of reads/writes to a stream, it'll have an impact on the tests for #102704. I'll change these depending on the order the PRs are merged in.