zlib format incorrectly matches on ASCII files starting with the letter x #386

dpgarrick · 2023-09-11T04:32:54Z

Added a test to demonstrate a bug in the Identify routine where the zlib format incorrectly matches on ASCII files starting with the letter x (because ASCII x is the same as the first expected byte for ZlibHeader).

Unsure of how best to address.

The ZlibHeader match could be expanded to look at the second byte and verify it is one of the accepted values associated with the different compression levels.

Or could try and decompress the first part of the file using zlib? Not sure how robust that is, or how much you would need to read...maybe something like the following?

func (zz Zlib) Match(filename string, stream io.Reader) (MatchResult, error) {
	var mr MatchResult

	// match filename
	if strings.Contains(strings.ToLower(filename), zz.Name()) {
		mr.ByName = true
	}

	// match file header
	buf, err := readAtMost(stream, len(ZlibHeader))
	if err != nil {
		return mr, err
	}
	if bytes.Equal(buf, ZlibHeader) {
		// ZlibHeader can be accidentally matched by an ASCII file starting with the letter x
		// Try to decompress the first part of the stream to double check
		buf, err = readAtMost(stream, 4096)
		if err != nil {
			return mr, err
		}

		r, err := zlib.NewReader(bytes.NewReader(buf))
		if err != nil {
			return mr, nil // Not a zlib file
		}
		defer r.Close()

		_, err = io.ReadAll(r)
		if err == nil {
			mr.ByStream = true // Successfully decompressed, likely a zlib file
		}
	}

	return mr, nil
}

mholt · 2023-09-11T05:04:16Z

Lol, in hindsight checking a single byte that happens to be in the ASCII range was not very bright of me 😂 ... I was likely in a hurry.

I'm headed to sleep now but I'm looking forward to checking this out! Thanks for the test.

mholt · 2023-09-11T16:55:28Z

Apparently the first digit might not always be 7, but it is "usually" 7. (It could be 0-7)... and then there are other combinations for the next few bytes: https://stackoverflow.com/a/54915442/1048862

That SO answer gives a table of the 32 possible headers:

      FLEVEL: 0       1       2       3
CINFO:
     0      08 1D   08 5B   08 99   08 D7
     1      18 19   18 57   18 95   18 D3
     2      28 15   28 53   28 91   28 CF
     3      38 11   38 4F   38 8D   38 CB
     4      48 0D   48 4B   48 89   48 C7
     5      58 09   58 47   58 85   58 C3
     6      68 05   68 43   68 81   68 DE
     7      78 01   78 5E   78 9C   78 DA

and so I wonder if it'd be easy enough to just compare the first 4 bytes against those in this table and if there's any matches, it's likely zlib.

I like your approach for decompressing some of it to see if that works -- but it feels like that'd be slower (I'm guessing, though I'm not sure, that it makes allocations). It might not actually matter in practice, which approach we take.

Part of me likes the elegant idea of just comparing a fixed number of bytes, versus setting up a decompressor. What do you think?

dpgarrick · 2023-09-11T23:14:11Z

Yeah I prefer to compare a set number of bytes as well. I pushed an implementation of that and added a test - not extensive in the sense that it doesn't test all the possible values so we are assuming the table provided on SO is accurate. But seems to work for the default compression case and fixes the original issue (matching on ASCII x).

mholt

Beautiful! I like this a lot more. Thanks for your thoroughness :)

demo bug

feba955

dpgarrick added 2 commits September 12, 2023 11:08

check 2 bytes of zlib header

0c97052

add .zz test

99444e6

mholt approved these changes Sep 12, 2023

View reviewed changes

mholt merged commit 24fa33e into mholt:master Sep 12, 2023

rgmz mentioned this pull request Jun 8, 2024

archiver.Identify fails on non-archive file: zlib: invalid header #406

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zlib format incorrectly matches on ASCII files starting with the letter x #386

zlib format incorrectly matches on ASCII files starting with the letter x #386

dpgarrick commented Sep 11, 2023 •

edited

Loading

mholt commented Sep 11, 2023

mholt commented Sep 11, 2023

dpgarrick commented Sep 11, 2023

mholt left a comment

zlib format incorrectly matches on ASCII files starting with the letter x #386

zlib format incorrectly matches on ASCII files starting with the letter x #386

Conversation

dpgarrick commented Sep 11, 2023 • edited Loading

mholt commented Sep 11, 2023

mholt commented Sep 11, 2023

dpgarrick commented Sep 11, 2023

mholt left a comment

Choose a reason for hiding this comment

dpgarrick commented Sep 11, 2023 •

edited

Loading