Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy #34974

Closed
saracen opened this issue Oct 17, 2019 · 56 comments
Closed

archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy #34974

saracen opened this issue Oct 17, 2019 · 56 comments

Comments

@saracen
Copy link
Contributor

saracen commented Oct 17, 2019

If you have a compressed file, know its uncompressed size and crc32 checksum, it'd be nice to be able to add it to an archive as is.

I have three current use-cases for this:

  • Repackaging a zip file, removing or adding files, without incurring the associated compression overheads for files that already exist (somewhat achieving proposal: archive/zip: add support for appending files #15626)
  • Compress first and include later based on whether the compressed size was smaller than the original, without having to perform the compression twice.
  • Support concurrent compression: compress many files concurrently and then copy the already compressed files to the archive (similar to Apache Commons Compress' ParallelScatterZipCreator)

I see three different ways we could achieve this:

  1. Create a zip.CreateHeaderRaw(fh *FileHeader) (io.Writer, error) function, that uses the FileHeader's CRC32 and UncompressedSize64 fields set by the user.
  2. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but have a new FileHeader field that indicates we're going to write already compressed data (and then use CRC32 and UncompressedSize64 fields set by the user)
  3. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.

@julieqiu
Copy link
Member

/cc @dsnet

@julieqiu julieqiu added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 18, 2019
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/202217 mentions this issue: archive/zip: support adding raw files

@saracen
Copy link
Contributor Author

saracen commented Oct 19, 2019

Added change that uses option 1 (CreateHeaderRaw) because it feels weird to use a field in FileHeader solely for archiving as that structure is also used when reading an archive.

@ianlancetaylor
Copy link
Member

This is going to be new API, so turning into a proposal.

@dsnet Please add any thoughts you may have about this functionality. Thanks.

@ianlancetaylor ianlancetaylor changed the title archive/zip: add already compressed files proposal: archive/zip: add already compressed files Oct 19, 2019
@gopherbot gopherbot added this to the Proposal milestone Oct 19, 2019
@ianlancetaylor ianlancetaylor removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 19, 2019
@escholtz
Copy link
Contributor

The project I'm working on would appreciate this functionality for performance reasons. We need to replace a single file within a zip. The DEFLATE tax is high and wasteful since we decode/encode without needing or using the intermediate data.

I like @rsc's Copy implementation in https://github.com/rsc/zipmerge. It might not be quite as flexible as the other proposals, but I think it fits well within the existing API and supports efficiently creating a new zip from an existing zip while allowing the user to replace/delete/update specific files.

// Copy copies the file f (obtained from a Reader) into w.
// It copies the compressed form directly.
func (w *Writer) Copy(f *File) error {
	dataOffset, err := f.DataOffset()
	if err != nil {
		return err
	}
	if err := w.closeLastWriter(); err != nil {
		return err
	}

	fh := f.FileHeader
	h := &header{
		FileHeader: &fh,
		offset:     uint64(w.cw.count),
	}
	fh.Flags |= 0x8 // we will write a data descriptor
	w.dir = append(w.dir, h)

	if err := writeHeader(w.cw, &fh); err != nil {
		return err
	}

	r := io.NewSectionReader(f.zipr, dataOffset, int64(f.CompressedSize64))
	if _, err := io.Copy(w.cw, r); err != nil {
		return err
	}

	return writeDesc(w.cw, &fh)
}

Here are the latest compress/flate benchmarks from master at level 5 which archive/zip uses by default (Intel Xeon W-2135 CPU @ 3.70GHz × 6):

goos: linux
goarch: amd64
pkg: compress/flate
BenchmarkDecode/Digits/Level5/1e4-12         	    6530	    164347 ns/op	  60.85 MB/s	   40598 B/op	       7 allocs/op
BenchmarkDecode/Digits/Level5/1e5-12         	     960	   1131532 ns/op	  88.38 MB/s	   40842 B/op	      13 allocs/op
BenchmarkDecode/Digits/Level5/1e6-12         	     117	   9673619 ns/op	 103.37 MB/s	   44800 B/op	      79 allocs/op
BenchmarkDecode/Newton/Level5/1e4-12         	    8076	    164681 ns/op	  60.72 MB/s	   41114 B/op	      17 allocs/op
BenchmarkDecode/Newton/Level5/1e5-12         	    1653	    765625 ns/op	 130.61 MB/s	   44283 B/op	      31 allocs/op
BenchmarkDecode/Newton/Level5/1e6-12         	     180	   6408771 ns/op	 156.04 MB/s	   61144 B/op	     156 allocs/op
BenchmarkEncode/Digits/Level5/1e4-12         	    5710	    215617 ns/op	  46.38 MB/s
BenchmarkEncode/Digits/Level5/1e5-12         	     370	   3232511 ns/op	  30.94 MB/s
BenchmarkEncode/Digits/Level5/1e6-12         	      32	  34852520 ns/op	  28.69 MB/s
BenchmarkEncode/Newton/Level5/1e4-12         	    4774	    238334 ns/op	  41.96 MB/s
BenchmarkEncode/Newton/Level5/1e5-12         	     357	   2974790 ns/op	  33.62 MB/s
BenchmarkEncode/Newton/Level5/1e6-12         	      39	  30595764 ns/op	  32.68 MB/s
PASS

@saracen
Copy link
Contributor Author

saracen commented Dec 16, 2019

@escholtz I agree that having a (*Writer) Copy(f *File) would be nice, but as you said, it isn't quite as flexible.

For comparison, copying a file with the patch I've submitted would look something like this:

func Copy(zw *zip.Writer, zf io.ReaderAt, f *zip.File) error {
	offset, err := f.DataOffset()
	if err != nil {
		return err
	}

	w, err := zw.CreateHeaderRaw(&f.FileHeader)
	if err != nil {
		return err
	}

	_, err = io.Copy(w, io.NewSectionReader(zf, offset, int64(f.CompressedSize64)))

	return err
}

Having Copy() would be great, but if for some reason only one solution could be added, I'd prefer it to be something along the lines of CreateHeaderRaw.

For the other use cases originally mentioned, I've written https://github.com/saracen/fastzip that takes advantage of CreateHeaderRaw also (but I've had to vendor archive/zip and include my modification for it to work).

@klauspost
Copy link
Contributor

Implemented in klauspost/compress#214

saracen added a commit to saracen/fastzip that referenced this issue Jun 7, 2020
klauspost kindly implemented the CreateHeaderRaw proposal
(golang/go#34974) in
klauspost/compress#214. This repository was vendoring
the patched version (https://go-review.googlesource.com/c/go/+/202217/)
but having it supported in a well maintained library already used by fastzip is
a plus.
@rsc
Copy link
Contributor

rsc commented Jun 10, 2020

I vaguely remember discussing this with @rogpeppe years ago, but I can't find the discussion. Any clearer memory, Roger?

@rsc
Copy link
Contributor

rsc commented Jun 10, 2020

My memory is coming back. I think what I remember @rogpeppe and I talking about was the ability to modify an existing zip file by appending new files to it.

@saracen
Copy link
Contributor Author

saracen commented Jun 10, 2020

@rsc https://groups.google.com/g/golang-dev/c/ZSubqlF2G4k?pli=1 this might be the discussion, and yeah, was related to appending.

@rsc
Copy link
Contributor

rsc commented Jun 17, 2020

Adding a whole new Create method just to set this extra bit seems unfortunate - what if there is a second bit we want control over later? Add four new methods? And so on.

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

@saracen
Copy link
Contributor Author

saracen commented Jun 18, 2020

Probably better to add a new bool in the CreateHeader. Perhaps WriteRaw bool?

@rsc I originally preferred this too. The reason I didn't for my implementation is because FileHeader is also used when reading an archive too. It felt odd having a WriteRaw bool as part of that structure in the context of reads. Is this not an issue?

@rsc
Copy link
Contributor

rsc commented Jun 24, 2020

@saracen it seems like less of an issue than an exponential blowup of create methods. The NonUTF8 field is probably most analogous, although it does get set in the reader.

I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well. There at least we could add an OpenRaw method to zip.File. It's much harder to do that in what CreateHeader returns, because it's an io.Writer.

I suppose another way to store compressed files would be to register a no-op compressor.

Another way would be to have a method on zip.Writer: WriteRaw(bool) to set whether the compressors should be avoided. You'd z.WriteRaw(true) before calling CreateHeader.

Does any of these sound like a winner to anyone?

@saracen
Copy link
Contributor Author

saracen commented Jun 24, 2020

suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well.

@rsc At the moment, I believe there's two ways to access the raw bytes:

  • Open with a ReaderAt, and then use file.Offset and file.CompressedSize64 to access and read a specific section of the archive.
  • Register a decompressor and do whatever custom logic to the raw bytes.

I suppose another way to store compressed files would be to register a no-op compressor.

This sounds quite good, especially since it's similar to how you can handle raw reads by registering a decompressor.

Using a zip.NoOpCompressor would need to mean that the CompressedSize64 and CRC32 fields of the header are left unmodified when the header is closed:

zw := zip.NewWriter(archive)
zw.RegisterCompressor(zip.Deflate, zip.NoOpCompressor)

hdr := &zip.FileHeader{
  Name: "file.txt",

  // the following fields would be left untouched because
  // the writer detects that zip.NoOpCompressor is being used
  CRC32: 0xffffffff,
  CompressedSize64: 1024,
}

w, _ := zw.CreateHeader(hdr) (io.Writer, error)

// bytes are copied to the zip archive unmodified
io.Copy(w, buf)

@rsc
Copy link
Contributor

rsc commented Jul 8, 2020

The problem with the no-op compressor is that the uncompressed size would be wrong. That would have to be left alone. It's probably a little too subtle.

Given that you can already read the compressed data directly, as @saracen points out, maybe it's fine to focus on writing.

What if pre-filling CompressedSize64 (> 0) in the FileHeader means that the writes are expected to be pre-compressed? In that case CRC32 has to be filled in too, although zero is a valid CRC32 so we can't double-check that. But I believe 0 is not a valid CompressedSize64, and in any event it's not an interesting one.

In essence, CompressedSize64 = 0 means "I don't know, you compress it for me".

Edit: I see that this was @saracen's suggestion 3 in the original comment at the top.

@rsc
Copy link
Contributor

rsc commented Jul 15, 2020

@saracen, @escholtz, @klauspost, any thoughts about the most recent suggestion (previous comment)?

@dsnet
Copy link
Member

dsnet commented Jul 15, 2020

According RFC1951, the compression stream requires at least one block (with the BFINAL bit set). Therefore, it is impossible to have a compressed DEFLATE stream that is zero-bytes in length.

However, the ZIP file format allows for a number of different compression methods to be used (DEFLATE happens to be the most common by far). I think we should think about 1) whether we want to support other compression methods, and 2) whether we ever want to support other compression format that may theoretically have a zero-length output for an empty input.

@escholtz
Copy link
Contributor

I've been hesitant to comment because I didn't want to hijack @saracen's original proposal.

For my use case, the func (w *Writer) Copy(f *File) error implementation would be sufficient. I think it fits cleanly with the existing package exports and still maintains simplicity.

I agree that having both CreateHeader and CreateHeaderRaw does not feel like a clean standard library package.

Is there precedent in the standard library for using the CompressedSize64 approach?

@saracen
Copy link
Contributor Author

saracen commented Jul 15, 2020

For my use case, func (w *Writer) Copy(f *File) error doesn't help, but I can see its usefulness and how it relates this this problem.

@rsc In my original proposal, I'd written:

  1. Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.

Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.

@rsc rsc changed the title proposal: archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy Oct 28, 2020
@hherman1
Copy link

Maybe I missed this, but why not just open the file directly instead of using zip.Open if you want to write/read bytes directly?

@ianlancetaylor
Copy link
Member

@hherman1 This proposal is about working with compressed files that are contained within zip archives. It's not about working with the zip archive itself.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/312310 mentions this issue: archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy

@escholtz
Copy link
Contributor

When implementing this in https://golang.org/cl/312310 one challenge I discovered was figuring out how we should handle the trailing data descriptor (https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor) for the OpenRaw and CreateRaw methods. @ianlancetaylor also raised concern in the review for good reason.

I chose to return the data descriptor in the OpenRaw reader and give the CreateRaw writer responsibility for calculating and appending the data descriptor when necessary. This seemed optimal in terms of giving the caller full control, which is the intention of this change, while also conforming to the interface we agreed to. And it also ensures the data descriptor isn't lost by the Copy method.

Perhaps I'm overlooking a different/better approach or perhaps we want to change the interface to allow the same data descriptor reading/writing functionality but in a safer way.

What are your thoughts?

@saracen
Copy link
Contributor Author

saracen commented Apr 21, 2021

For me, OpenRaw/CreateRaw was mostly about providing raw access to the file data and not modifying the fields that I provide in the header on my behalf. I still expected it to serialize the header data in the same way.

I don't know whether it's needed, but I understand the reason for adding it.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.17, Backlog Apr 21, 2021
@escholtz
Copy link
Contributor

The data descriptor is part of the spec. We need to support it for the raw methods. The CRC32 and uncompressed size fields are not things we can compute within the raw methods. So we need to allow the caller to specify them.

The 3 data descriptor fields are also in the FileHeader but I don't think it's a good idea to overload them by copying the data descriptor fields into the FileHeader. Since it's a separate segment it should be treated as such.

The only other option I can think of is adding a DataDescriptor struct within the archive/zip package that handles marshalling and unmarshalling of the byte segment. But since the Raw methods are intended for advanced/rare use cases, I'm fine with allowing the caller to handle the marshalling and unmarshalling.

@rsc Are you ok with this approach (returning the data descriptor bytes from OpenRaw and requiring the caller to write the data descriptor bytes to the CreateRaw Writer)?

@ianlancetaylor
Copy link
Member

I think it's going to be error-prone to attach the data descriptor bytes to the reader. I suggest:

// DataDescriptor holds the data descriptor that optionally follows the file contents in the zip file.
type DataDescriptor struct {
    CRC32 uint32
    CompressedSize uint64
    UncompressedSize uint64
}

// OpenRaw is like Open but reads the raw, compressed, data from the file.
// If dataDescriptor is not nil, it contains the contents of the optional data descriptor following the file contents.
func (*File) OpenRaw() (r io.Reader, dataDescriptor *DataDescriptor, err error)

// CreateRaw is like CreateHeader but the data written to the io.Writer should already be compressed.
// If dataDescriptor is not nil, it will be written as the optional data descriptor following the file contents.
func (*Writer) CreateRaw(file *FileHeader, dataDescriptor *DataDescriptor) (io.Writer, error)

@dsnet Any thoughts?

@saracen
Copy link
Contributor Author

saracen commented Apr 22, 2021

The 3 data descriptor fields are also in the FileHeader but I don't think it's a good idea to overload them by copying the data descriptor fields into the FileHeader. Since it's a separate segment it should be treated as such.

@escholtz We serialize the known crc32, uncompressed and compressed size fields from FileHeader when we write the local file header. What benefit is there to not doing the same in the data descriptor? Would they ever be different?

@escholtz
Copy link
Contributor

For the CreateRaw case, if the FileHeader Data Descriptor 0x8 Flag is not set, but a non-nil dataDescriptor argument is passed, would we treat that as an error?

And if the CRC32, CompressedSize, and UncompressedSize fields are non-zero in the FileHeader and a non-nil dataDescriptor argument is passed, would we treat that as an error?

@escholtz
Copy link
Contributor

@saracen They could be different when reading a zip that doesn't strictly adhere to the spec. Probably a rare use case, but I can imagine some caller might want to know if and when there is a discrepancy and choose how to handle it in their own code.

@ianlancetaylor
Copy link
Member

For the CreateRaw case, if the FileHeader Data Descriptor 0x8 Flag is not set, but a non-nil dataDescriptor argument is passed, would we treat that as an error?

Sure.

And if the CRC32, CompressedSize, and UncompressedSize fields are non-zero in the FileHeader and a non-nil dataDescriptor argument is passed, would we treat that as an error?

I don't see why. But I am definitely not an expert.

Should we just take those values from the FileHeader and not worry about the data descriptor for CreateRaw?

@saracen
Copy link
Contributor Author

saracen commented Apr 22, 2021

They could be different when reading a zip that doesn't strictly adhere to the spec.

I think there's already a way this can be detected with the existing API. By using DataOffset and reading at the end of the compressed data size. It's such a rare case, I'm not sure we need the convenience to detect it with the new API.

On the write side, I'm not sure there's a use-case where we would want to intentionally have a different crc32/size than we have in the local header. I could be wrong though.

@rsc
Copy link
Contributor

rsc commented Apr 28, 2021

I tend to agree with @saracen that we don't need API changes here.
OpenRaw and CreateRaw are about "raw compressed bytes" not "all possible raw access to the zip data structures".

On the read side, we can fill in the missing FileHeader fields using the data descriptor during Reader.init. That will benefit all users, not just users of OpenRaw.

On the write side, the fields are in FileHeader and can be copied out of file header into a new data descriptor.

Again there is no goal to permit different values in the file header versus the data descriptor. That's not a nice zip file at all, whether it is technically well-formed or not.

@escholtz
Copy link
Contributor

escholtz commented Apr 30, 2021

After taking a closer look at the spec, the data descriptor values should always match the corresponding values in the central directory.

The central directory is already the source of truth when a zip is opened, so I don't think we even need to check the data descriptor unless we want to detect mismatches with the central directory and return an error? Currently the CRC32 from the data descriptor is used for validation (but I don't think it's checked against the central directory for mismatch) and the compressed and uncompressed sizes in the data descriptor are ignored.

Since the local file header is mostly ignored in the read path (see File.findBodyOffset) I'm fine with ignoring the data descriptor as well. Otherwise, we should probably only use it to detect mismatches.

Sorry for all the back and forth on the issue. First time going this deep on the zip spec and want to make sure I get it right.

4.4.7 CRC-32: (4 bytes)

   The CRC-32 algorithm was generously contributed by
   David Schwaderer and can be found in his excellent
   book "C Programmers Guide to NetBIOS" published by
   Howard W. Sams & Co. Inc.  The 'magic number' for
   the CRC is 0xdebb20e3.  The proper CRC pre and post
   conditioning is used, meaning that the CRC register
   is pre-conditioned with all ones (a starting value
   of 0xffffffff) and the value is post-conditioned by
   taking the one's complement of the CRC residual.
   If bit 3 of the general purpose flag is set, this
   field is set to zero in the local header and the correct
   value is put in the data descriptor and in the central
   directory. When encrypting the central directory, if the
   local header is not in ZIP64 format and general purpose 
   bit flag 13 is set indicating masking, the value stored 
   in the Local Header will be zero. 

4.4.8 compressed size: (4 bytes)
4.4.9 uncompressed size: (4 bytes)

   The size of the file compressed (4.4.8) and uncompressed,
   (4.4.9) respectively.  When a decryption header is present it 
   will be placed in front of the file data and the value of the
   compressed file size will include the bytes of the decryption
   header.  If bit 3 of the general purpose bit flag is set, 
   these fields are set to zero in the local header and the 
   correct values are put in the data descriptor and
   in the central directory.  If an archive is in ZIP64 format
   and the value in this field is 0xFFFFFFFF, the size will be
   in the corresponding 8 byte ZIP64 extended information 
   extra field.  When encrypting the central directory, if the
   local header is not in ZIP64 format and general purpose bit 
   flag 13 is set indicating masking, the value stored for the 
   uncompressed size in the Local Header will be zero. 

@ianlancetaylor
Copy link
Member

I mentioned on the CL that there is the spec and there is the reality. Zip files come from a lot of different programs with varying degrees of adherence to the formal spec. We shouldn't stop supporting any code that we currently support.

@rsc rsc mentioned this issue Jun 10, 2021
83 tasks
@dmitshur dmitshur modified the milestones: Backlog, Go1.17 Nov 6, 2021
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc rsc removed this from Proposals Oct 19, 2022
@golang golang locked and limited conversation to collaborators Nov 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests