archive/zip: add File.OpenRaw, Writer.CreateRaw, Writer.Copy #34974
/cc @dsnet
Change https://golang.org/cl/202217 mentions this issue.
Added a change that uses option 1 (a new `CreateHeaderRaw` function).
This is going to be a new API, so I'm turning this into a proposal. @dsnet, please add any thoughts you may have about this functionality. Thanks.
The project I'm working on would appreciate this functionality for performance reasons. We need to replace a single file within a zip. The DEFLATE tax is high and wasteful, since we decode and re-encode without needing or using the intermediate data. I like @rsc's Copy implementation in https://github.com/rsc/zipmerge. It might not be quite as flexible as the other proposals, but I think it fits well within the existing API and supports efficiently creating a new zip from an existing zip while allowing the user to replace, delete, or update specific files.

```go
// Copy copies the file f (obtained from a Reader) into w.
// It copies the compressed form directly.
func (w *Writer) Copy(f *File) error {
	dataOffset, err := f.DataOffset()
	if err != nil {
		return err
	}
	if err := w.closeLastWriter(); err != nil {
		return err
	}
	fh := f.FileHeader
	h := &header{
		FileHeader: &fh,
		offset:     uint64(w.cw.count),
	}
	fh.Flags |= 0x8 // we will write a data descriptor
	w.dir = append(w.dir, h)
	if err := writeHeader(w.cw, &fh); err != nil {
		return err
	}
	r := io.NewSectionReader(f.zipr, dataOffset, int64(f.CompressedSize64))
	if _, err := io.Copy(w.cw, r); err != nil {
		return err
	}
	return writeDesc(w.cw, &fh)
}
```

Here are the latest compress/flate benchmarks from master at level 5, which archive/zip uses by default (Intel Xeon W-2135 CPU @ 3.70GHz × 6):

*(benchmark table not preserved in this copy)*
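To make the replace-one-file use case concrete, here is a minimal sketch assuming a `Copy` method like the one above; `src`, `srcSize`, `dst`, and `replacement` are placeholders supplied by the caller:

```go
zr, err := zip.NewReader(src, srcSize) // src is an io.ReaderAt for the existing archive
if err != nil {
	return err
}
zw := zip.NewWriter(dst) // dst receives the rewritten archive
for _, f := range zr.File {
	if f.Name == "replace/me.txt" {
		// Rewrite this one file; its new contents are recompressed.
		w, err := zw.CreateHeader(&zip.FileHeader{Name: f.Name, Method: zip.Deflate})
		if err != nil {
			return err
		}
		if _, err := w.Write(replacement); err != nil {
			return err
		}
		continue
	}
	// Everything else is copied in compressed form: no DEFLATE tax.
	if err := zw.Copy(f); err != nil {
		return err
	}
}
return zw.Close()
```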
@escholtz I agree that having a `Copy` function would be useful. For comparison, copying a file with the patch I've submitted would look something like this:
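The example itself was lost in this copy of the thread; a plausible sketch, assuming the CL's `CreateHeaderRaw` together with the existing `DataOffset` accessor (`srcReaderAt` stands in for the caller's `io.ReaderAt` over the source archive):

```go
fh := f.FileHeader                // f is a *zip.File from the source Reader
w, err := zw.CreateHeaderRaw(&fh) // fh's CRC32/size fields are used as-is
if err != nil {
	return err
}
offset, err := f.DataOffset()
if err != nil {
	return err
}
// Copy the raw compressed bytes straight across, with no decode/encode cycle.
sr := io.NewSectionReader(srcReaderAt, offset, int64(f.CompressedSize64))
_, err = io.Copy(w, sr)
return err
```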
Having `Copy` would cover that case nicely. For the other use cases originally mentioned, I've written https://github.com/saracen/fastzip, which takes advantage of this functionality.
Implemented in klauspost/compress#214.
klauspost kindly implemented the CreateHeaderRaw proposal (golang/go#34974) in klauspost/compress#214. This repository was vendoring the patched version (https://go-review.googlesource.com/c/go/+/202217/), but having it supported in a well-maintained library already used by fastzip is a plus.
I vaguely remember discussing this with @rogpeppe years ago, but I can't find the discussion. Any clearer memory, Roger?
My memory is coming back. I think what I remember @rogpeppe and I talking about was the ability to modify an existing zip file by appending new files to it.
@rsc https://groups.google.com/g/golang-dev/c/ZSubqlF2G4k?pli=1 this might be the discussion, and yeah, it was related to appending.
Adding a whole new Create method just to set this extra bit seems unfortunate: what if there is a second bit we want control over later? Add four new methods? And so on. Probably better to add a new bool in the FileHeader passed to CreateHeader.
@rsc I originally preferred this too. The reason I didn't for my implementation is because …
@saracen it seems like less of an issue than an exponential blowup of Create methods. The NonUTF8 field is probably most analogous, although it does get set in the reader. I suppose the most interesting analogy is whether there's a need to get the raw bytes from the reader as well. There at least we could add an OpenRaw method to zip.File. It's much harder to do that in what CreateHeader returns, because it's an io.Writer. I suppose another way to store compressed files would be to register a no-op compressor. Another way would be to have a method on zip.Writer, WriteRaw(bool), to set whether the compressors should be avoided; you'd call z.WriteRaw(true) before calling CreateHeader. Do any of these sound like a winner to anyone?
@rsc At the moment, I believe there are two ways to access the raw bytes: using `DataOffset` with the underlying `io.ReaderAt`, or registering a pass-through decompressor.
This sounds quite good, especially since it's similar to how you can handle raw reads by registering a decompressor. Using a no-op compressor might look like this:

```go
zw := zip.NewWriter(archive)
zw.RegisterCompressor(zip.Deflate, zip.NoOpCompressor)

hdr := &zip.FileHeader{
	Name: "file.txt",
	// The following fields would be left untouched because the
	// writer detects that zip.NoOpCompressor is being used.
	CRC32:            0xffffffff,
	CompressedSize64: 1024,
}
w, _ := zw.CreateHeader(hdr)

// Bytes are copied to the zip archive unmodified.
io.Copy(w, buf)
```
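Note that `zip.NoOpCompressor` doesn't exist in the package; it's shorthand for something a caller could register today through the real `RegisterCompressor` API. A self-contained sketch:

```go
package main

import (
	"archive/zip"
	"io"
	"os"
)

// nopCloser wraps an io.Writer so it satisfies io.WriteCloser.
type nopCloser struct{ io.Writer }

func (nopCloser) Close() error { return nil }

func main() {
	zw := zip.NewWriter(os.Stdout)
	// zip.Compressor is func(io.Writer) (io.WriteCloser, error),
	// so a passthrough "compressor" just hands the writer back.
	zw.RegisterCompressor(zip.Deflate, func(w io.Writer) (io.WriteCloser, error) {
		return nopCloser{w}, nil
	})
	// ... CreateHeader calls and writes of pre-compressed data would follow ...
	zw.Close()
}
```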
The problem with the no-op compressor is that the uncompressed size would be wrong. That would have to be left alone. It's probably a little too subtle. Given that you can already read the compressed data directly, as @saracen points out, maybe it's fine to focus on writing. What if pre-filling CompressedSize64 (> 0) in the FileHeader means that the writes are expected to be pre-compressed? In that case CRC32 has to be filled in too, although zero is a valid CRC32 so we can't double-check that. But I believe 0 is not a valid CompressedSize64, and in any event it's not an interesting one. In essence, CompressedSize64 = 0 means "I don't know, you compress it for me". Edit: I see that this was @saracen's suggestion 3 in the original comment at the top.
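Under that convention, which is hypothetical rather than actual API, writing pre-compressed data might look like this; `crc`, `rawDeflate`, and `uncompressedLen` are assumed to be computed by the caller:

```go
hdr := &zip.FileHeader{
	Name:               "file.txt",
	Method:             zip.Deflate,
	CRC32:              crc,                     // over the uncompressed data
	CompressedSize64:   uint64(len(rawDeflate)), // nonzero: "already compressed"
	UncompressedSize64: uncompressedLen,
}
w, err := zw.CreateHeader(hdr) // would pass bytes through unmodified
if err != nil {
	return err
}
_, err = w.Write(rawDeflate) // written verbatim under this convention
```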
@saracen, @escholtz, @klauspost, any thoughts about the most recent suggestion (previous comment)?
According to RFC 1951, the compression stream requires at least one block (with the BFINAL bit set). Therefore, it is impossible for a DEFLATE stream to be zero bytes in length. However, the ZIP file format allows a number of different compression methods (DEFLATE happens to be by far the most common). I think we should consider 1) whether we want to support other compression methods, and 2) whether we ever want to support a compression format that may theoretically produce zero-length output for an empty input.
I've been hesitant to comment because I didn't want to hijack @saracen's original proposal. For my use case, the `Copy` approach would be sufficient. I agree that having both may be redundant. Is there precedent in the standard library for using the CompressedSize64 approach?
@rsc In my original proposal, I'd written:
> Use the existing CreateHeader(fh *FileHeader) (io.Writer, error) function, but if CompressedSize64 has already been set, assume that data written is already compressed.
Breaking existing code was my only reluctance to this approach. If we don't see that as an issue, it sounds like an easy way to support this.
Maybe I missed this, but why not just open the file directly instead of using zip.Open if you want to write/read bytes directly?
@hherman1 This proposal is about working with compressed files that are contained within zip archives. It's not about working with the zip archive itself.
Change https://golang.org/cl/312310 mentions this issue.
When implementing this in https://golang.org/cl/312310, one challenge I discovered was figuring out how we should handle the trailing data descriptor (https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor) for the OpenRaw and CreateRaw methods. @ianlancetaylor also raised concerns in the review, for good reason. I chose to return the data descriptor in the OpenRaw reader and give the CreateRaw writer responsibility for calculating and appending the data descriptor when necessary. This seemed optimal in terms of giving the caller full control, which is the intention of this change, while also conforming to the interface we agreed to. And it also ensures the data descriptor isn't lost by the Copy method. Perhaps I'm overlooking a different or better approach, or perhaps we want to change the interface to allow the same data-descriptor reading/writing functionality in a safer way. What are your thoughts?
For me, I don't know whether it's needed, but I understand the reason for adding it.
The data descriptor is part of the spec. We need to support it for the raw methods. The CRC32 and uncompressed size fields are not things we can compute within the raw methods, so we need to allow the caller to specify them. The 3 data descriptor fields are also in the FileHeader, but I don't think it's a good idea to overload them by copying the data descriptor fields into the FileHeader. Since it's a separate segment, it should be treated as such. The only other option I can think of is adding a DataDescriptor struct within the archive/zip package that handles marshalling and unmarshalling of the byte segment. But since the Raw methods are intended for advanced/rare use cases, I'm fine with allowing the caller to handle the marshalling and unmarshalling. @rsc Are you ok with this approach (returning the data descriptor bytes from OpenRaw and requiring the caller to write the data descriptor bytes to the CreateRaw Writer)?
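For scale, marshalling such a segment is small. A sketch of the ZIP64 form (optional signature, then CRC-32 and the two sizes, all little-endian; the non-ZIP64 form uses 4-byte sizes), with an invented helper name:

```go
import "encoding/binary"

// marshalDataDescriptor64 encodes a ZIP64-style data descriptor:
// optional signature, CRC-32, compressed size, uncompressed size.
func marshalDataDescriptor64(crc uint32, csize, usize uint64) []byte {
	buf := make([]byte, 0, 24)
	buf = binary.LittleEndian.AppendUint32(buf, 0x08074b50) // signature (optional per spec)
	buf = binary.LittleEndian.AppendUint32(buf, crc)
	buf = binary.LittleEndian.AppendUint64(buf, csize)
	buf = binary.LittleEndian.AppendUint64(buf, usize)
	return buf
}
```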
I think it's going to be error-prone to attach the data descriptor bytes to the reader. I suggest:

```go
// DataDescriptor holds the data descriptor that optionally follows
// the file contents in the zip file.
type DataDescriptor struct {
	CRC32            uint32
	CompressedSize   uint64
	UncompressedSize uint64
}

// OpenRaw is like Open but reads the raw, compressed data from the file.
// If dataDescriptor is not nil, it contains the contents of the optional
// data descriptor following the file contents.
func (*File) OpenRaw() (r io.Reader, dataDescriptor *DataDescriptor, err error)

// CreateRaw is like CreateHeader but the data written to the io.Writer
// should already be compressed.
// If dataDescriptor is not nil, it will be written as the optional data
// descriptor following the file contents.
func (*Writer) CreateRaw(file *FileHeader, dataDescriptor *DataDescriptor) (io.Writer, error)
```

@dsnet Any thoughts?
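With those signatures, copying one file's raw bytes between archives might look like this (a sketch against the proposed, not necessarily final, API):

```go
r, dd, err := f.OpenRaw() // f is a *zip.File from the source archive
if err != nil {
	return err
}
w, err := zw.CreateRaw(&f.FileHeader, dd) // dd may be nil; if not, it is re-emitted
if err != nil {
	return err
}
_, err = io.Copy(w, r) // compressed bytes pass through untouched
return err
```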
@escholtz We serialize the known CRC-32, uncompressed size, and compressed size fields from the FileHeader.
For the CreateRaw case, if the FileHeader Data Descriptor 0x8 flag is not set but a non-nil dataDescriptor argument is passed, would we treat that as an error? And if the CRC32, CompressedSize, and UncompressedSize fields are non-zero in the FileHeader and a non-nil dataDescriptor argument is passed, would we treat that as an error?
@saracen They could be different when reading a zip that doesn't strictly adhere to the spec. Probably a rare use case, but I can imagine some caller might want to know if and when there is a discrepancy and choose how to handle it in their own code.
Sure.
I don't see why. But I am definitely not an expert. Should we just take those values from the FileHeader and not worry about the data descriptor for CreateRaw?
I think there's already a way this can be detected by using the existing API. On the write side, I'm not sure there's a use case where we would want to intentionally have a different CRC-32/size than we have in the local header. I could be wrong though.
I tend to agree with @saracen that we don't need API changes here. On the read side, we can fill in the missing FileHeader fields using the data descriptor during Reader.init. That will benefit all users, not just users of OpenRaw. On the write side, the fields are in FileHeader and can be copied out of the file header into a new data descriptor. Again, there is no goal to permit different values in the file header versus the data descriptor. That's not a nice zip file at all, whether it is technically well-formed or not.
After taking a closer look at the spec, the data descriptor values should always match the corresponding values in the central directory. The central directory is already the source of truth when a zip is opened, so I don't think we even need to check the data descriptor, unless we want to detect mismatches with the central directory and return an error? Currently the CRC32 from the data descriptor is used for validation (but I don't think it's checked against the central directory for mismatch), and the compressed and uncompressed sizes in the data descriptor are ignored. Since the local file header is mostly ignored in the read path (see File.findBodyOffset), I'm fine with ignoring the data descriptor as well. Otherwise, we should probably only use it to detect mismatches. Sorry for all the back and forth on the issue. This is my first time going this deep into the zip spec, and I want to make sure I get it right.
I mentioned on the CL that there is the spec, and there is the reality. Zip files come from a lot of different programs with varying degrees of adherence to the formal spec. We shouldn't stop supporting any code that we currently support.
If you have a compressed file and know its uncompressed size and CRC-32 checksum, it'd be nice to be able to add it to an archive as is.
I have three current use-cases for this: …
I see three different ways we could achieve this:

1. A new `zip.CreateHeaderRaw(fh *FileHeader) (io.Writer, error)` function that uses the FileHeader's `CRC32` and `UncompressedSize64` fields set by the user.
2. Use the existing `CreateHeader(fh *FileHeader) (io.Writer, error)` function, but have a new FileHeader field that indicates we're going to write already compressed data (and then use the `CRC32` and `UncompressedSize64` fields set by the user).
3. Use the existing `CreateHeader(fh *FileHeader) (io.Writer, error)` function, but if `CompressedSize64` has already been set, assume that data written is already compressed.

I'm going to assume that option 3 would be a no-go, because existing code might suddenly break if a user has already set the CompressedSize for whatever reason, but hopefully the other options are viable.