Issue with archives.Identify when filename contains a compression extension #7

Open
luotianqi777 opened this issue Dec 17, 2024 · 4 comments
Labels
help wanted Extra attention is needed

luotianqi777 commented Dec 17, 2024

Title: Issue with archives.Identify when filename contains a compression extension

Description:

I encountered an issue while using the archives.Identify function from the github.com/mholt/archives package. When the file name contains the text of a compression extension without actually having that extension (e.g., test.gzero.zip, which contains .gz), the function returns an error: gzip: invalid header.

Here is the code snippet to reproduce the issue:

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/mholt/archives"
)

func main() {
	ctx := context.Background()
	stream, err := os.Open("testzip") // zip stream
	if err != nil {
		fmt.Println(err)
		return
	}
	defer stream.Close()

	_, _, err = archives.Identify(ctx, "test.gzero.zip", stream) // filename contains a compression extension
	fmt.Println(err) // matching zip: gzip: invalid header

	_, _, err = archives.Identify(ctx, "test.zip", stream)
	fmt.Println(err) // nil
}

Steps to Reproduce:

  1. Open a zip file stream using os.Open.
  2. Pass the filename with a compression extension (e.g., test.gzero.zip) to archives.Identify.
  3. Observe that the function returns an error (gzip: invalid header).

luotianqi777 (Author) commented Dec 17, 2024

The current implementation of the filename matching logic in archives.Identify uses strings.Contains, which may result in incorrect matches when the filename contains multiple extensions. For example, a file named test.gzero.szx.zip may cause unexpected behavior.

// match filename
if strings.Contains(strings.ToLower(filename), gz.Name()) {
	mr.ByName = true
}

Maybe we should split the filename by . and check each part for equality with the expected format.

// match filename
for _, w := range strings.Split(filename, ".")[1:] {
	if strings.EqualFold(gz.Name(), "."+w) {
		mr.ByName = true
		break
	}
}
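
To make the difference between the two checks concrete, here is a minimal standalone sketch using only the standard library. The helper names containsExt and hasExtSegment are hypothetical and are not part of the archives package; the extension is passed in as a plain string instead of coming from gz.Name().

```go
package main

import (
	"fmt"
	"strings"
)

// containsExt mirrors the substring check quoted above.
func containsExt(filename, ext string) bool {
	return strings.Contains(strings.ToLower(filename), ext)
}

// hasExtSegment is a hypothetical helper: it reports whether any
// dot-separated segment after the first one equals ext, ignoring case.
// ext is expected to include its leading dot, e.g. ".gz".
func hasExtSegment(filename, ext string) bool {
	for _, seg := range strings.Split(filename, ".")[1:] {
		if strings.EqualFold(ext, "."+seg) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{"test.gzero.zip", "foo.tar.gz", "FOO.GZ"} {
		fmt.Printf("%-16s contains=%v segment=%v\n",
			name, containsExt(name, ".gz"), hasExtSegment(name, ".gz"))
	}
}
```

Note that the segment check still matches an extension anywhere in the name, so a file such as foo.gz.zip would still be matched as gz by name.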

Or provide a strict mode that matches only using the file header.

mholt (Owner) commented Dec 17, 2024

Well, maybe we need to start with test cases then.

Is foo.tar.gz a tar file or a gzipped file?

What is foo.gz.zip?

Anyway, I agree we could improve this logic, but I am not sure what the answers are yet.

Identify(), and thus Match(), are used to determine how to read files... typically they expect an outer compression layer, if any, and then an archive format if there's a second match, within the compressed layer (if any).

So maybe the answer is a combination of chopping off a file extension after matching it, before matching the inner layer, or something like that; and making the filename matching more strict.
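
One possible reading of "chop off the extension, then match the inner layer" is sketched below. This is only an illustration under stated assumptions: splitLayers and the two extension tables are hypothetical and do not reflect the library's format registry or its Match() logic. Only the final extension is considered for the compression layer, and only the extension that remains after chopping is considered for the archive layer.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Hypothetical extension tables, for illustration only.
var (
	compressionExts = map[string]bool{".gz": true, ".bz2": true, ".xz": true, ".zst": true}
	archiveExts     = map[string]bool{".tar": true, ".zip": true, ".7z": true}
)

// splitLayers matches the final extension as the outer compression layer,
// chops it off if it matched, and then matches the new final extension as
// the archive format. Either result may be empty.
func splitLayers(filename string) (compression, archive string) {
	name := strings.ToLower(path.Base(filename))
	if ext := path.Ext(name); compressionExts[ext] {
		compression = ext
		name = strings.TrimSuffix(name, ext)
	}
	if ext := path.Ext(name); archiveExts[ext] {
		archive = ext
	}
	return compression, archive
}

func main() {
	for _, name := range []string{"foo.tar.gz", "foo.gz.zip", "test.gzero.zip", "foo.gz"} {
		c, a := splitLayers(name)
		fmt.Printf("%-16s compression=%q archive=%q\n", name, c, a)
	}
}
```

Under these assumptions, foo.tar.gz is a gzipped tar, while foo.gz.zip and test.gzero.zip are plain zip files, because .gz is not their final extension. Whether that is the right answer for foo.gz.zip is exactly the open question above.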

mholt added the "help wanted" label Dec 26, 2024

luotianqi777 (Author) commented Dec 30, 2024

Would this problem be better addressed if decompression processed only the outermost compression or archive layer at each step?

For example, with the tar.gz format, instead of handling tar and gz simultaneously, first decompress the gz layer, then extract the tar archive. Prioritizing the format identified by the file header would prevent decompression errors caused by incorrect file names. This approach is similar to the extraction strategy used by 7z. Even complex nested files such as foo.zip.tar.zip.gz.7z could then be processed normally.
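
As a rough illustration of peeling one layer at a time, the sketch below uses only the Go standard library (compress/gzip and archive/tar), not the archives package. The helpers buildTarGz and listTarGz are hypothetical names introduced for this example; the first builds a small .tar.gz in memory so the second has something to read.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// buildTarGz returns an in-memory .tar.gz holding a single regular file.
func buildTarGz(name string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gw)
	hdr := &tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(content); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// listTarGz peels the gzip layer first, then walks the tar archive inside it.
func listTarGz(data []byte) ([]string, error) {
	gr, err := gzip.NewReader(bytes.NewReader(data)) // outer layer: gzip
	if err != nil {
		return nil, err
	}
	defer gr.Close()

	var names []string
	tr := tar.NewReader(gr) // inner layer: tar
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

func main() {
	data, err := buildTarGz("hello.txt", []byte("hi"))
	if err != nil {
		fmt.Println(err)
		return
	}
	names, err := listTarGz(data)
	fmt.Println(names, err)
}
```

Each layer only sees the bytes produced by the layer outside it, so the file name plays no part once the outer format has been identified from the stream.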

mholt (Owner) commented Dec 30, 2024

> first decompress the gz layer, then extract the tar archive

That's basically how it works already. Anything that gets read from the archive is first decompressed.

Reading headers is preferred, but sometimes only a filename is available.
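
A minimal sketch of that preference order, assuming the caller can supply the first few bytes of the stream when one exists. guessFormat is a hypothetical helper, not the library's API, and it only knows two formats. The magic numbers themselves are well known: gzip streams begin with 0x1f 0x8b, and zip local file headers begin with "PK\x03\x04".

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

var (
	gzipMagic = []byte{0x1f, 0x8b}
	zipMagic  = []byte("PK\x03\x04")
)

// guessFormat prefers the stream header and falls back to the final
// filename extension when header is nil or not recognized.
// header holds the first bytes of the stream, if a stream is available.
func guessFormat(filename string, header []byte) string {
	switch {
	case bytes.HasPrefix(header, gzipMagic):
		return ".gz"
	case bytes.HasPrefix(header, zipMagic):
		return ".zip"
	}
	// No usable header: only the name is left to go on.
	name := strings.ToLower(filename)
	for _, ext := range []string{".gz", ".zip"} {
		if strings.HasSuffix(name, ext) {
			return ext
		}
	}
	return ""
}

func main() {
	fmt.Println(guessFormat("test.gzero.zip", []byte("PK\x03\x04")))  // header decides
	fmt.Println(guessFormat("misnamed.zip", []byte{0x1f, 0x8b, 0x08})) // header overrides the name
	fmt.Println(guessFormat("foo.tar.gz", nil))                        // filename only
}
```

With a header available, a misleading name such as test.gzero.zip never reaches the filename check; with only a name, a suffix match is stricter than a substring match.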
