archive/zip: Performance regression with reading ZIP files with many entries #48374
Yikes! I'm sorry this caused a production incident. You're correct that the `readDataDescriptor` call in `Reader.init` is causing many additional random-access reads, and there is no way to bypass it with the current public methods. This placement was proposed by @rsc in #34974 (comment). We'll have to take a closer look at whether it can be conditionally skipped or moved elsewhere. If I remember correctly, we were trying to balance the constraints of the existing package interface/behavior, the needs of the new methods, and edge cases in the zip format. Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?
We really need to operate via HTTP range requests for the functionality impacted here, as it is directly related to our CI artifacts, as uploaded by GitLab Runner instances. There are many projects that generate rather large artifacts in public/private instances, and it would be best not to force the transit round trip when we really just need the metadata.
Fixes golang#48374
Signed-off-by: Vinicius Tinti <[email protected]>
Change https://golang.org/cl/353715 mentions this issue.
Read the file data descriptor only when the file is opened. Previously, during zip.Reader.init, each new file had its data descriptor read during the file discovery loop. This caused extra reads for callers that just want to list the files. Fixes golang#48374
Change https://golang.org/cl/353716 mentions this issue.
@WarheadsSE @escholtz I created https://github.com/guilhem/chunkreaderat to solve this kind of performance issue with zip + HTTP range calls. In our case, using caching on read with a pipeline of Azure Blob -> zip -> csv -> zip -> Azure Blob improved performance by over 200x.
@guilhem I don't think caching reads will help here. (Unless the zip file is very small, in which case you'd probably be better off reading the entire file with a single read.) The problem is big zip files and reading small portions spaced throughout the file during init.
@escholtz Caching with my lib will help keep all the metadata in the same in-memory chunk.
@guilhem Happy to change my mind if you share benchmarks for the scenario described. Otherwise, I think this is sidetracking the original issue. The problem reported is additional per-file reads spread throughout large zip files. The sections are only read once so caching won't help.
@escholtz I did some tests and benchmarks: https://play.golang.org/p/6Da7Bq6vwLs

Results:

1.16

Conclusions:
If I'm reading this change correctly, the only benefit to loading the data descriptors in `Reader.init` is populating the `CRC32` field. @rsc: removing this could be a breaking change if anybody has started to rely on this field since 1.17, but it feels unlikely that anybody is relying on it (at least while it's a fairly young change). Are you able to comment on whether there's even a possibility of fixing this now?
@tinti Thank you for https://golang.org/cl/353715. I just tested the patch and unfortunately it doesn't entirely resolve the issue for how we use this at GitLab. We convert reads to HTTP range requests and we still generate more than we did previously. I believe the reason is that reading the data descriptor used to occur after reading a file entry (https://go-review.googlesource.com/c/go/+/312310/14/src/archive/zip/reader.go#b224) and would convert to one continuous read (the data descriptor immediately following the file data). Now that the data descriptor is read upon opening the file, we generate additional range requests. I wonder if the newer functionality introduced in 1.17 would continue to work as expected if we moved the data descriptor read back to occurring after the file entry read?
I was not reading it correctly. The CRC32 field is available, but is from the central directory. In the changes introduced in ddb648f, we started overwriting this value with the CRC32 from the data descriptor. Previously, the data descriptor value didn't overwrite the field, but was compared against the central directory entry's CRC32 and returned a checksum error if they did not match. |
I took a closer look at the comment and change history to refresh my memory, and it appears I reached similar conclusions in #34974 (comment) and https://go-review.googlesource.com/c/go/+/312310/11/src/archive/zip/reader.go#438
Unfortunately there was one test case
The existing behaviour didn't overwrite, but instead compared the data descriptor's CRC32 to the central directory's entry, and returned an error if they didn't match. I've opened this issue for this: #49089.

As the change evolved, it looks like much was reverted to match the previous implementation. It doesn't look like there's now any need to parse the uncompressed/compressed sizes from the data descriptor, nor to keep the CRC32 value and create a data descriptor object for each file entry. It looks like fully reverting to the previous data descriptor behaviour wouldn't be a breaking change after all, so maybe that's something we should now explore? The diff below works for me, and still passes the existing tests.

Diff:

```diff
diff --git a/src/archive/zip/reader.go b/src/archive/zip/reader.go
index c91a8d0..bb71f45 100644
--- a/src/archive/zip/reader.go
+++ b/src/archive/zip/reader.go
@@ -125,7 +125,6 @@ func (z *Reader) init(r io.ReaderAt, size int64) error {
 		if err != nil {
 			return err
 		}
-		f.readDataDescriptor()
 		z.File = append(z.File, f)
 	}
 	if uint16(len(z.File)) != uint16(end.directoryRecords) { // only compare 16 bits here
@@ -186,10 +185,15 @@ func (f *File) Open() (io.ReadCloser, error) {
 		return nil, ErrAlgorithm
 	}
 	var rc io.ReadCloser = dcomp(r)
+	var desr io.Reader
+	if f.hasDataDescriptor() {
+		desr = io.NewSectionReader(f.zipr, f.headerOffset+bodyOffset+size, dataDescriptorLen)
+	}
 	rc = &checksumReader{
 		rc:   rc,
 		hash: crc32.NewIEEE(),
 		f:    f,
+		desr: desr,
 	}
 	return rc, nil
 }
@@ -205,49 +209,13 @@ func (f *File) OpenRaw() (io.Reader, error) {
 	return r, nil
 }
 
-func (f *File) readDataDescriptor() {
-	if !f.hasDataDescriptor() {
-		return
-	}
-
-	bodyOffset, err := f.findBodyOffset()
-	if err != nil {
-		f.descErr = err
-		return
-	}
-
-	// In section 4.3.9.2 of the spec: "However ZIP64 format MAY be used
-	// regardless of the size of a file. When extracting, if the zip64
-	// extended information extra field is present for the file the
-	// compressed and uncompressed sizes will be 8 byte values."
-	//
-	// Historically, this package has used the compressed and uncompressed
-	// sizes from the central directory to determine if the package is
-	// zip64.
-	//
-	// For this case we allow either the extra field or sizes to determine
-	// the data descriptor length.
-	zip64 := f.zip64 || f.isZip64()
-	n := int64(dataDescriptorLen)
-	if zip64 {
-		n = dataDescriptor64Len
-	}
-	size := int64(f.CompressedSize64)
-	r := io.NewSectionReader(f.zipr, f.headerOffset+bodyOffset+size, n)
-	dd, err := readDataDescriptor(r, zip64)
-	if err != nil {
-		f.descErr = err
-		return
-	}
-	f.CRC32 = dd.crc32
-}
-
 type checksumReader struct {
 	rc    io.ReadCloser
 	hash  hash.Hash32
 	nread uint64 // number of bytes read so far
 	f     *File
-	err   error // sticky error
+	desr  io.Reader // if non-nil, where to read the data descriptor
+	err   error     // sticky error
 }
 
 func (r *checksumReader) Stat() (fs.FileInfo, error) {
@@ -268,12 +236,12 @@ func (r *checksumReader) Read(b []byte) (n int, err error) {
 		if r.nread != r.f.UncompressedSize64 {
 			return 0, io.ErrUnexpectedEOF
 		}
-		if r.f.hasDataDescriptor() {
-			if r.f.descErr != nil {
-				if r.f.descErr == io.EOF {
+		if r.desr != nil {
+			if err1 := readDataDescriptor(r.desr, r.f); err1 != nil {
+				if err1 == io.EOF {
 					err = io.ErrUnexpectedEOF
 				} else {
-					err = r.f.descErr
+					err = err1
 				}
 			} else if r.hash.Sum32() != r.f.CRC32 {
 				err = ErrChecksum
@@ -485,10 +453,8 @@ parseExtras:
 	return nil
 }
 
-func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
-	// Create enough space for the largest possible size
-	var buf [dataDescriptor64Len]byte
-
+func readDataDescriptor(r io.Reader, f *File) error {
+	var buf [dataDescriptorLen]byte
 	// The spec says: "Although not originally assigned a
 	// signature, the value 0x08074b50 has commonly been adopted
 	// as a signature value for the data descriptor record.
@@ -497,9 +463,10 @@ func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
 	// descriptors and should account for either case when reading
 	// ZIP files to ensure compatibility."
 	//
-	// First read just those 4 bytes to see if the signature exists.
+	// dataDescriptorLen includes the size of the signature but
+	// first read just those 4 bytes to see if it exists.
 	if _, err := io.ReadFull(r, buf[:4]); err != nil {
-		return nil, err
+		return err
 	}
 	off := 0
 	maybeSig := readBuf(buf[:4])
@@ -508,28 +475,21 @@ func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
 		// bytes.
 		off += 4
 	}
-
-	end := dataDescriptorLen - 4
-	if zip64 {
-		end = dataDescriptor64Len - 4
+	if _, err := io.ReadFull(r, buf[off:12]); err != nil {
+		return err
 	}
-	if _, err := io.ReadFull(r, buf[off:end]); err != nil {
-		return nil, err
+	b := readBuf(buf[:12])
+	if b.uint32() != f.CRC32 {
+		return ErrChecksum
 	}
-	b := readBuf(buf[:end])
-	out := &dataDescriptor{
-		crc32: b.uint32(),
-	}
+	// The two sizes that follow here can be either 32 bits or 64 bits
+	// but the spec is not very clear on this and different
+	// interpretations have been made causing incompatibilities. We
+	// already have the sizes from the central directory so we can
+	// just ignore these.
-	if zip64 {
-		out.compressedSize = b.uint64()
-		out.uncompressedSize = b.uint64()
-	} else {
-		out.compressedSize = uint64(b.uint32())
-		out.uncompressedSize = uint64(b.uint32())
-	}
-	return out, nil
+	return nil
 }
 
 func readDirectoryEnd(r io.ReaderAt, size int64) (dir *directoryEnd, err error) {
```
Change https://golang.org/cl/357489 mentions this issue.
Perhaps I was unclear or incorrect with that portion of the comment, but hopefully the remainder provides useful historical context. Yes, parts of the change were iterated on and/or reverted. Initially the thinking was to provide access to the data descriptor so the Raw and Copy methods would have the values and be able to read/write them exactly. The archive/zip package intentionally hides some of the inner workings of the zip format, for simplicity I assume, so it was not easy or desirable to provide data descriptor access. After discussion, we decided not to provide direct access to the data descriptor, so those portions were modified or made private.

@ianlancetaylor seemed to be in favor of keeping some of the data descriptor struct, methods, and tests, so it would be good to get his opinion on removing them, and @rsc's opinion on the removal from Reader.init.

I am in favor of your change, as long as it doesn't break the Copy and Raw method behavior, which I don't think it does (but I would need to take a closer look). Based on my older comments, it appears I reached the conclusion that the data descriptor could mostly be ignored in favor of the central directory values.

Is #49089 something we should add a new test for to prevent future regression, or is that already covered somehow?
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

What did you do?

We use HTTP Range Requests to seek around an externally stored ZIP file to extract the directory listing. For example:

What did you expect to see?

With Go v1.16.4:

What did you see instead?

It appears the call to `f.readDataDescriptor` is causing these extra 76 HTTP Range Requests (go/src/archive/zip/reader.go, line 120 in ddb648f). This was added in ddb648f for #34974.

Granted, not many people may be using the library with HTTP Range Requests, but this is compatible with the interface. This problem caused a production incident as our Google Cloud Storage request rate jumped from 2000/s to over 30,000/s (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5521).

The additional reads also slow down opening ZIP files, especially on a networked filesystem. The `OpenRaw()` interface doesn't really work as a substitute because the HTTP client uses `io.Reader`, so we'd need to buffer the output to a file, which we're trying to avoid.

@escholtz @ianlancetaylor I realize this parsing of the data descriptors might have been done to support the raw methods/ZIP64, but is there a way we can skip this since we're primarily interested in extracting the central directory header?