Skip to content
This repository has been archived by the owner on Nov 19, 2024. It is now read-only.

zlib format incorrectly matches on ASCII files starting with the letter x #386

Merged
merged 3 commits into from
Sep 12, 2023

Conversation

dpgarrick
Copy link
Contributor

@dpgarrick dpgarrick commented Sep 11, 2023

Added a test to demonstrate a bug in the Identify routine where the zlib format incorrectly matches on ASCII files starting with the letter x (because ASCII x is the same as the first expected byte for ZlibHeader).

Unsure of how best to address.

The ZlibHeader match could be expanded to look at the second byte and verify it is one of the accepted values associated with the different compression levels.

Or could try and decompress the first part of the file using zlib? Not sure how robust that is, or how much you would need to read...maybe something like the following?

func (zz Zlib) Match(filename string, stream io.Reader) (MatchResult, error) {
	var mr MatchResult

	// match filename
	if strings.Contains(strings.ToLower(filename), zz.Name()) {
		mr.ByName = true
	}

	// match file header
	buf, err := readAtMost(stream, len(ZlibHeader))
	if err != nil {
		return mr, err
	}
	if bytes.Equal(buf, ZlibHeader) {
		// ZlibHeader can be accidentally matched by an ASCII file starting with the letter x
		// Try to decompress the first part of the stream to double check
		buf, err = readAtMost(stream, 4096)
		if err != nil {
			return mr, err
		}

		r, err := zlib.NewReader(bytes.NewReader(buf))
		if err != nil {
			return mr, nil // Not a zlib file
		}
		defer r.Close()

		_, err = io.ReadAll(r)
		if err == nil {
			mr.ByStream = true // Successfully decompressed, likely a zlib file
		}
	}

	return mr, nil
}

@mholt
Copy link
Owner

mholt commented Sep 11, 2023

Lol, in hindsight checking a single byte that happens to be in the ASCII range was not very bright of me 😂 ... I was likely in a hurry.

I'm headed to sleep now but I'm looking forward to checking this out! Thanks for the test.

@mholt
Copy link
Owner

mholt commented Sep 11, 2023

Apparently the first digit might not always be 7, but it is "usually" 7. (It could be 0-7)... and then there are other combinations for the next few bytes: https://stackoverflow.com/a/54915442/1048862

That SO answer gives a table of the 32 possible headers:

      FLEVEL: 0       1       2       3
CINFO:
     0      08 1D   08 5B   08 99   08 D7
     1      18 19   18 57   18 95   18 D3
     2      28 15   28 53   28 91   28 CF
     3      38 11   38 4F   38 8D   38 CB
     4      48 0D   48 4B   48 89   48 C7
     5      58 09   58 47   58 85   58 C3
     6      68 05   68 43   68 81   68 DE
     7      78 01   78 5E   78 9C   78 DA

and so I wonder if it'd be easy enough to just compare the first 4 bytes against those in this table and if there's any matches, it's likely zlib.

I like your approach for decompressing some of it to see if that works -- but it feels like that'd be slower (I'm guessing, though I'm not sure, that it makes allocations). It might not actually matter in practice, which approach we take.

Part of me likes the elegant idea of just comparing a fixed number of bytes, versus setting up a decompressor. What do you think?

@dpgarrick
Copy link
Contributor Author

Yeah I prefer to compare a set number of bytes as well. I pushed an implementation of that and added a test - not extensive in the sense that it doesn't test all the possible values so we are assuming the table provided on SO is accurate. But seems to work for the default compression case and fixes the original issue (matching on ASCII x).

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful! I like this a lot more. Thanks for your thoroughness :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants