Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/zip: Performance regression with reading ZIP files with many entries #48374

Closed
stanhu opened this issue Sep 14, 2021 · 16 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@stanhu
Copy link

stanhu commented Sep 14, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stanhu/.cache/go-build"
GOENV="/home/stanhu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stanhu/.gvm/pkgsets/go1.17.1/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stanhu/.gvm/pkgsets/go1.17.1/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/stanhu/.gvm/gos/go1.17.1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/stanhu/.gvm/gos/go1.17.1/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4184767189=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We use HTTP Range Requests to seek around an externally stored ZIP file to extract the directory listing. For example:

package main

import (
	"archive/zip"
	"fmt"
	"net/http"

	"github.com/jfbus/httprs"
)

func main() {
	url := "https://stanhu-s3-test-bucket.s3.us-west-1.amazonaws.com/test.zip"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	httpClient := &http.Client{}

	resp, err := httpClient.Do(req)
	if err != nil || resp.StatusCode != http.StatusOK {
		panic(err)
	}

	rs := httprs.NewHttpReadSeeker(resp, httpClient)
	defer resp.Body.Close()
	defer rs.Close()

	archive, err := zip.NewReader(rs, resp.ContentLength)
	if err != nil {
		panic(err)
	}

	for _, entry := range archive.File {
		fmt.Println(entry.Name)
	}

	fmt.Printf("Range Requests: %d\n", rs.Requests)
}

What did you expect to see?

With Go v1.16.4:

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/projects_json/pipelines_empty.json
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 2

real	0m0.277s
user	0m0.060s
sys	0m0.028s

What did you see instead?

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 78

real	0m9.547s
user	0m0.239s
sys	0m0.154s

It appears the call to f.readDataDescriptor is causing these extra 76 HTTP Range Requests:

f.readDataDescriptor()

This was added in ddb648f for #34974.

Granted that not many people may be using the library with HTTP Range Requests, but this is compatible with the interface. This problem caused a production incident as our Google Cloud Storage request rate jumped from 2000/s to over 30,000/s (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5521).

The additional reads also slows down opening ZIP files, especially on a networked filesystem. The OpenRaw() interface doesn't really work as a substitute because the HTTP client uses io.Reader, so we'd need to buffer the output to a file, which we're trying to avoid.

@escholtz @ianlancetaylor I realize this parsing of the data descriptors might have been done to support the raw methods/ ZIP64, but is there a way we can skip this since we're primarily interested in extracting the central directory header?

@ALTree ALTree added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 14, 2021
@ALTree ALTree added this to the Go1.18 milestone Sep 14, 2021
@escholtz
Copy link
Contributor

Yikes! I'm sorry this caused a production incident.

You're correct that the readDataDescriptor in Reader.init is causing many additional random access reads. And there is no way to bypass that with the current public methods. This placement was proposed by @rsc in #34974 (comment). We'll have to take a closer look if it can be conditionally skipped or moved elsewhere. If I remember correctly, we were trying to balance constraints of the existing package interface/behavior, the needs of the new methods, and edge cases with the zip format.

Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?

@WarheadsSE
Copy link

Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?

We really need to be operating from the HTTP range requests for the functionality impacted here, as it is directly related to our CI artifacts, as uploaded by GitLab Runner instances. There are many projects that generate rather large artifacts in public/private instances, and it would be best not to force the transit round trip when we really just need the metadata.

tinti added a commit to tinti/go that referenced this issue Oct 3, 2021
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/353715 mentions this issue: archive/zip: lazy load file data descriptor

tinti added a commit to tinti/go that referenced this issue Oct 3, 2021
Read file data descriptor only when the file is open.

Previously, during zip.Reader.init, each new file had its data descriptor
read during the file discovery loop. This caused extra reads for those who
want just to list the files.

Fixes golang#48374
tinti added a commit to tinti/go that referenced this issue Oct 3, 2021
Read file data descriptor only when the file is open.

Previously, during zip.Reader.init, each new file had its data descriptor
read during the file discovery loop. This caused extra reads for those who
want just to list the files.

Fixes golang#48374
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/353716 mentions this issue: archive/zip: load file data descriptor on demand

@guilhem
Copy link

guilhem commented Oct 7, 2021

@WarheadsSE @escholtz I created https://github.com/guilhem/chunkreaderat to solve this kind of performance issue with zip + HTTP range call.

In our case, using caching on read with a pipeline of Azure Blob -> zip -> csv -> zip -> Azure Blob improved performance by over x200.

@escholtz
Copy link
Contributor

escholtz commented Oct 8, 2021

@guilhem I don't think caching reads will help here. (Unless the zip file is very small, in which case you'd probably be better off reading the entire file with a single read.) The problem is big zip files and reading small portions spaced throughout the file during init.

@guilhem
Copy link

guilhem commented Oct 8, 2021

@escholtz caching with my lib will help to keep all metadata in the same chunk in memory.
So any new read being in memory instead of doing a new http call.

@escholtz
Copy link
Contributor

escholtz commented Oct 9, 2021

@guilhem Happy to change my mind if you share benchmarks for the scenario described. Otherwise, I think this is sidetracking the original issue. The problem reported is additional per-file reads spread throughout large zip files. The sections are only read once so caching won't help.

@guilhem
Copy link

guilhem commented Oct 10, 2021

@escholtz I did some tests and benchmarks: https://play.golang.org/p/6Da7Bq6vwLs

Results:
1.17

goos: linux
goarch: amd64
pkg: benchzip
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Range Requests: 78
BenchmarkTestNoCache-8         	       1	53492265878 ns/op
Range Requests: 61
BenchmarkTest1024_1-8          	       1	43789868429 ns/op
Range Requests: 61
BenchmarkTest1024_10-8         	       1	41351607605 ns/op
Range Requests: 61
BenchmarkTest1024_100-8        	       1	41900755470 ns/op
Range Requests: 11
BenchmarkTest1024_64_1-8       	       1	8513921216 ns/op
Range Requests: 3
BenchmarkTest1024_64_10-8      	       1	2708611041 ns/op
Range Requests: 3
BenchmarkTest1024_64_100-8     	       1	4615014652 ns/op
Range Requests: 9
BenchmarkTest1024_256_1-8      	       1	6609027592 ns/op
Range Requests: 2
BenchmarkTest1024_256_10-8     	       1	1771012854 ns/op
Range Requests: 2
BenchmarkTest1024_256_100-8    	       1	2278849474 ns/op
Range Requests: 6
BenchmarkTest1024_1024_1-8     	       1	4454133078 ns/op
Range Requests: 2
BenchmarkTest1024_1024_10-8    	       1	1837363328 ns/op
Range Requests: 2
BenchmarkTest1024_1024_100-8   	       1	2234271086 ns/op
PASS
ok  	benchzip	215.579s

1.16

goos: linux
goarch: amd64
pkg: benchzip
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
Range Requests: 2
BenchmarkTestNoCache-8         	       1	1667792285 ns/op
Range Requests: 2
BenchmarkTest1024_1-8          	       1	1119663831 ns/op
Range Requests: 2
BenchmarkTest1024_10-8         	       1	1126226778 ns/op
Range Requests: 2
BenchmarkTest1024_100-8        	       1	1116728348 ns/op
Range Requests: 1
BenchmarkTest1024_64_1-8       	       1	1428527826 ns/op
Range Requests: 1
BenchmarkTest1024_64_10-8      	Range Requests: 1
Range Requests: 1
       2	 962857761 ns/op
Range Requests: 1
BenchmarkTest1024_64_100-8     	Range Requests: 1
Range Requests: 1
       2	 961547464 ns/op
Range Requests: 1
BenchmarkTest1024_256_1-8      	       1	1100716445 ns/op
Range Requests: 1
BenchmarkTest1024_256_10-8     	       1	1120852478 ns/op
Range Requests: 1
BenchmarkTest1024_256_100-8    	       1	1119547786 ns/op
Range Requests: 1
BenchmarkTest1024_1024_1-8     	       1	1148681635 ns/op
Range Requests: 1
BenchmarkTest1024_1024_10-8    	       1	1115886109 ns/op
Range Requests: 1
BenchmarkTest1024_1024_100-8   	       1	1247170472 ns/op
Range Requests: 0
BenchmarkTest1024_4096_100-8   	Range Requests: 0
Range Requests: 0
Range Requests: 0
Range Requests: 0
Range Requests: 0
       3	 380116127 ns/op
PASS
ok  	benchzip	21.856s

conclusions:

  • 1.16 has vanilla better performance than 1.17
  • even 1.16 performances can be improved by using a cache (x4 when cache is big enough)
  • 1.17 performances match 1.16 vanilla when using some buffers
  • 1.17 perfs stay lower than 1.16

@saracen
Copy link
Contributor

saracen commented Oct 19, 2021

Change https://golang.org/cl/353715 mentions this issue: archive/zip: lazy load file data descriptor

If I'm reading this change correctly, the only benefit to loading the data descriptors in init was that the CRC32 field was populated for each file entry and available immediately after opening the archive.

@rsc Removing this could be a breaking change if anybody has started to rely on this field since 1.17, but it feels unlikely that anybody is relying on it (at least while it's a fairly young change). Are you able to comment on whether there's even a possibility of fixing this now?

@saracen
Copy link
Contributor

saracen commented Oct 19, 2021

@tinti Thank you for https://golang.org/cl/353715

I just tested the patch and unfortunately it doesn't entirely resolve the issue for how we use this at GitLab.

We convert reads to HTTP range requests and we still generate more than we did previously. I believe the reason is that reading this data descriptor used to occur after reading a file entry (https://go-review.googlesource.com/c/go/+/312310/14/src/archive/zip/reader.go#b224) and would convert to one continous read (the data descriptor immediately following the file data). Now that the data descriptor is read upon opening the file, we generate additional range requests.

I wonder if the newer functionality introduced in 1.17 would continue to work as expected if we moved the data descriptor read back to occuring after the file entry read?

@saracen
Copy link
Contributor

saracen commented Oct 20, 2021

If I'm reading this change correctly, the only benefit to loading the data descriptors in init was that the CRC32 field was populated for each file entry and available immediately after opening the archive.

I was not reading it correctly.

The CRC32 field is available, but is from the central directory.

In the changes introduced in ddb648f, we started overwriting this value with the CRC32 from the data descriptor.

Previously, the data descriptor value didn't overwrite the field, but was compared against the central directory entry's CRC32 and returned a checksum error if they did not match.

@escholtz
Copy link
Contributor

I took a closer look at comment and change history to refresh my memory and it appears I reached similar conclusions in #34974 (comment) and https://go-review.googlesource.com/c/go/+/312310/11/src/archive/zip/reader.go#438

Ok, I used the approach rsc suggested and call this from Reader.init. Is it going to be in issue that we're adding a bunch of random reads throughout the file?
I maintain the existing behavior in that I only overwrite the CRC32 (even if it's not zero, otherwise I'll break existing tests).
I'm not convinced it's necessary to update UncompressedSize64.

Unfortunately there was one test case Bad-CRC32-in-data-descriptor that failed if we ignored the data descriptor. There was hesitancy (understandably) to introduce such a big change that also changed test cases at the same time.

@saracen
Copy link
Contributor

saracen commented Oct 20, 2021

I maintain the existing behavior in that I only overwrite the CRC32

The existing behaviour didn't overwrite, but instead compared the data descriptor's CRC32 to the central directory's entry, and returned an error if they didn't match. I've opened this issue for this: #49089.

As the change envolved, it looks like much was reverted to match the previous implementation. It doesn't look like there's now any need to be parsing the uncompressed/compressed sizes from data descriptor, nor keeping the CRC32 value and creating a data descriptor object for each file entry.

It looks like fully reverting to the previous data descriptor behaviour wouldn't be a breaking change after all, so maybe that's something we should now explore?

The diff below works for me, and still passes Bad-CRC32-in-data-descriptor. It also corrects #49089

Diff:

diff --git a/src/archive/zip/reader.go b/src/archive/zip/reader.go
index c91a8d0..bb71f45 100644
--- a/src/archive/zip/reader.go
+++ b/src/archive/zip/reader.go
@@ -125,7 +125,6 @@ func (z *Reader) init(r io.ReaderAt, size int64) error {
 		if err != nil {
 			return err
 		}
-		f.readDataDescriptor()
 		z.File = append(z.File, f)
 	}
 	if uint16(len(z.File)) != uint16(end.directoryRecords) { // only compare 16 bits here
@@ -186,10 +185,15 @@ func (f *File) Open() (io.ReadCloser, error) {
 		return nil, ErrAlgorithm
 	}
 	var rc io.ReadCloser = dcomp(r)
+	var desr io.Reader
+	if f.hasDataDescriptor() {
+		desr = io.NewSectionReader(f.zipr, f.headerOffset+bodyOffset+size, dataDescriptorLen)
+	}
 	rc = &checksumReader{
 		rc:   rc,
 		hash: crc32.NewIEEE(),
 		f:    f,
+		desr: desr,
 	}
 	return rc, nil
 }
@@ -205,49 +209,13 @@ func (f *File) OpenRaw() (io.Reader, error) {
 	return r, nil
 }
 
-func (f *File) readDataDescriptor() {
-	if !f.hasDataDescriptor() {
-		return
-	}
-
-	bodyOffset, err := f.findBodyOffset()
-	if err != nil {
-		f.descErr = err
-		return
-	}
-
-	// In section 4.3.9.2 of the spec: "However ZIP64 format MAY be used
-	// regardless of the size of a file.  When extracting, if the zip64
-	// extended information extra field is present for the file the
-	// compressed and uncompressed sizes will be 8 byte values."
-	//
-	// Historically, this package has used the compressed and uncompressed
-	// sizes from the central directory to determine if the package is
-	// zip64.
-	//
-	// For this case we allow either the extra field or sizes to determine
-	// the data descriptor length.
-	zip64 := f.zip64 || f.isZip64()
-	n := int64(dataDescriptorLen)
-	if zip64 {
-		n = dataDescriptor64Len
-	}
-	size := int64(f.CompressedSize64)
-	r := io.NewSectionReader(f.zipr, f.headerOffset+bodyOffset+size, n)
-	dd, err := readDataDescriptor(r, zip64)
-	if err != nil {
-		f.descErr = err
-		return
-	}
-	f.CRC32 = dd.crc32
-}
-
 type checksumReader struct {
 	rc    io.ReadCloser
 	hash  hash.Hash32
 	nread uint64 // number of bytes read so far
 	f     *File
-	err   error // sticky error
+	desr  io.Reader // if non-nil, where to read the data descriptor
+	err   error     // sticky error
 }
 
 func (r *checksumReader) Stat() (fs.FileInfo, error) {
@@ -268,12 +236,12 @@ func (r *checksumReader) Read(b []byte) (n int, err error) {
 		if r.nread != r.f.UncompressedSize64 {
 			return 0, io.ErrUnexpectedEOF
 		}
-		if r.f.hasDataDescriptor() {
-			if r.f.descErr != nil {
-				if r.f.descErr == io.EOF {
+		if r.desr != nil {
+			if err1 := readDataDescriptor(r.desr, r.f); err1 != nil {
+				if err1 == io.EOF {
 					err = io.ErrUnexpectedEOF
 				} else {
-					err = r.f.descErr
+					err = err1
 				}
 			} else if r.hash.Sum32() != r.f.CRC32 {
 				err = ErrChecksum
@@ -485,10 +453,8 @@ parseExtras:
 	return nil
 }
 
-func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
-	// Create enough space for the largest possible size
-	var buf [dataDescriptor64Len]byte
-
+func readDataDescriptor(r io.Reader, f *File) error {
+	var buf [dataDescriptorLen]byte
 	// The spec says: "Although not originally assigned a
 	// signature, the value 0x08074b50 has commonly been adopted
 	// as a signature value for the data descriptor record.
@@ -497,9 +463,10 @@ func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
 	// descriptors and should account for either case when reading
 	// ZIP files to ensure compatibility."
 	//
-	// First read just those 4 bytes to see if the signature exists.
+	// dataDescriptorLen includes the size of the signature but
+	// first read just those 4 bytes to see if it exists.
 	if _, err := io.ReadFull(r, buf[:4]); err != nil {
-		return nil, err
+		return err
 	}
 	off := 0
 	maybeSig := readBuf(buf[:4])
@@ -508,28 +475,21 @@ func readDataDescriptor(r io.Reader, zip64 bool) (*dataDescriptor, error) {
 		// bytes.
 		off += 4
 	}
-
-	end := dataDescriptorLen - 4
-	if zip64 {
-		end = dataDescriptor64Len - 4
+	if _, err := io.ReadFull(r, buf[off:12]); err != nil {
+		return err
 	}
-	if _, err := io.ReadFull(r, buf[off:end]); err != nil {
-		return nil, err
+	b := readBuf(buf[:12])
+	if b.uint32() != f.CRC32 {
+		return ErrChecksum
 	}
-	b := readBuf(buf[:end])
 
-	out := &dataDescriptor{
-		crc32: b.uint32(),
-	}
+	// The two sizes that follow here can be either 32 bits or 64 bits
+	// but the spec is not very clear on this and different
+	// interpretations has been made causing incompatibilities. We
+	// already have the sizes from the central directory so we can
+	// just ignore these.
 
-	if zip64 {
-		out.compressedSize = b.uint64()
-		out.uncompressedSize = b.uint64()
-	} else {
-		out.compressedSize = uint64(b.uint32())
-		out.uncompressedSize = uint64(b.uint32())
-	}
-	return out, nil
+	return nil
 }
 
 func readDirectoryEnd(r io.ReaderAt, size int64) (dir *directoryEnd, err error) {

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/357489 mentions this issue: archive/zip: fix data descriptor reads and checksum matches

@escholtz
Copy link
Contributor

Perhaps I was unclear or incorrect with that portion of the comment but hopefully the remainder provides useful historical context.

Yes, parts of the change were iterated on and/or reverted. Initially the thinking was to provide access to the data descriptor so the Raw and Copy methods would have the values and be able to read/write them exactly. The archive/zip package intentionally hides some of the inner workings of the zip format, for simplicity I assume, so it was not easy or desirable to provide data descriptor access. After discussion, we decided not to provide direct access to the data descriptor so those portions were modified or made private. ianlancetaylor seemed to be in favor of keeping some of the data descriptor struct, methods, and tests so it would be good to get his opinion on removing them and rsc's opinion on the removal from Reader.init.

I am in favor of your change. As long as it doesn't break the Copy and Raw method behavior which I don't think it does (but would need to take a closer look). Based on my older comments, it appears I reached the conclusion that the data descriptor could mostly be ignored in favor of the central directory values.

Is #49089 something we should add a new test for to prevent future regression or is that already covered somehow?

@golang golang locked and limited conversation to collaborators Nov 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
7 participants