Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject all DEFLATE streams that zlib rejects #288

Open
Dongmuliang opened this issue Jan 9, 2023 · 5 comments
Open

Reject all DEFLATE streams that zlib rejects #288

Dongmuliang opened this issue Jan 9, 2023 · 5 comments

Comments

@Dongmuliang
Copy link

Hi, I recently fuzz the libdeflate for parsing zlib format file and found some interesting cases.
Specifically, libdeflate accepts the file without any issue while another parser, the zlib rejects it, and I also contacted the zlib authors.

To check it whether valid or not, I use the following code (mainly from zlib_decompress/fuzz.c)

int main(int argc, char **argv)
{
	struct libdeflate_decompressor *d;
	int ret;
	int fd = open(argv[1], O_RDONLY);
	struct stat stbuf;
	assert(fd >= 0);
	ret = fstat(fd, &stbuf);
	assert(!ret);

	char in[stbuf.st_size];
	ret = read(fd, in, sizeof in);
	assert(ret == sizeof in);

	char out[sizeof(in) * 30];

	d = libdeflate_alloc_decompressor();
	size_t out_size = 0 ;

	enum libdeflate_result res = libdeflate_zlib_decompress(d, in, sizeof in, out, sizeof out, &out_size);
	printf("decode res:%d\n", res);
	libdeflate_free_decompressor(d);
	return 0;
}

These interesting files are attached!
pocs.zip

@ebiggers
Copy link
Owner

ebiggers commented Jan 9, 2023

There are several edge cases where for performance reasons, libdeflate is intentionally more accepting than zlib, in a safe way. The specific case that your example triggers is the case where the encoded codeword lengths expand to more than the number of codewords. But there are a few others too.

There isn't any real problem with doing this, since in general corruption in a DEFLATE stream can only be detected by a checksum anyway.

Can you elaborate on why you consider this to be a problem?

@Dongmuliang
Copy link
Author

Hi, @ebiggers , thanks for your explantion. Generally, any corruption of the compressed data should be timely notified to the users because it may lead to severe effects and difficult to make recovery. This is different from uncompressed text, which may probably still be useful despite the presence of some corrupted bytes.
Therefore, keeping it silent and accepting it is not a good choice.

It seems unlikely a real problem because there is a very low possibility that both checksum and corrupted data are satisfied at the same time. However, considering its wide usage, including some critical systems, the situation will be changed when a stealthy attacker is involved (e.g., an attacker may combine other bugs to hijack the checksum function, which can be used to correct the checksum maliciously).

@ebiggers
Copy link
Owner

Hi, @ebiggers , thanks for your explantion. Generally, any corruption of the compressed data should be timely notified to the users because it may lead to severe effects and difficult to make recovery.

Yes, which is why people who want to detect data corruption need to use a checksum (e.g. as the gzip and zlib wrapper formats for DEFLATE do), and not rely on the incidental built-in redundancies of the DEFLATE format which are much, much less effective at detecting data corruption. Corrupting a DEFLATE stream will very often create another valid DEFLATE stream. In contrast, just a 32-bit checksum will detect 99.99999997% of corruptions.

The question of when the DEFLATE decompressor should report an error when it's given an invalid stream, vs. remap it to a valid stream, is really just a minor quality-of-implementation question.

I'd argue that reporting DEFLATE decompression errors is actually sort of bad, because it misleads people into thinking that DEFLATE has built-in error detection, which it doesn't. You need a checksum if you want to detect data corruption.

That being said, I do understand that zlib is the standard implementation of DEFLATE, and it's generally better to be consistent with it... if only so that people running fuzzers that compare the implementations aren't confused.

It's easy to make libdeflate return an error when "the encoded codeword lengths expand to more than the number of codewords", so I'll do that. That handles two of your examples.

However, that still leaves the fact that DEFLATE streams can contain invalid litlen and offset symbols. Those are really hard to handle efficiently, other than by remapping them to valid symbols (as libdeflate does). I am probably not going to change what libdeflate does for those, as it does not seem worth it...

the situation will be changed when a stealthy attacker is involved (e.g., an attacker may combine other bugs to hijack the checksum function, which can be used to correct the checksum maliciously).

Detecting malicious modifications is totally irrelevant here, as a cryptographic MAC would be needed for that.

@ebiggers ebiggers changed the title accepting invalid inputs Reject all DEFLATE streams that zlib rejects Jan 10, 2023
ebiggers added a commit that referenced this issue Jan 10, 2023
When there are more encoded litlen and offset codeword lengths than
there are supposed to be based on the earlier fields, return
LIBDEFLATE_BAD_DATA instead of ignoring the extra lengths.

This isn't very important, but this aligns the behavior with zlib.

Also add a regression test for this change.

Update #288
ebiggers added a commit that referenced this issue Jan 10, 2023
When there are more encoded litlen and offset codeword lengths than
there are supposed to be based on the earlier fields, return
LIBDEFLATE_BAD_DATA instead of ignoring the extra lengths.

This isn't very important, but this aligns the behavior with zlib.

Also add a regression test for this change.

Update #288
ebiggers added a commit that referenced this issue Jan 10, 2023
When there are more encoded litlen and offset codeword lengths than
there are supposed to be based on the earlier fields, return
LIBDEFLATE_BAD_DATA instead of ignoring the extra lengths.

This isn't very important, but this aligns the behavior with zlib.

Also add a regression test for this change.

Update #288
@ebiggers
Copy link
Owner

65da376 handled poc1 and poc4.

2574818 would handle poc2 and poc3, but it would be a more complex change. I'm not sure it's worth merging, given that the existing behavior is safe and acceptable too. It also would only detect invalid offset symbols when they occur less than 4 GiB into the stream. I haven't found a way to detect them at positions greater than 4 GiB without adding overhead to the decompression inner loop. I don't want to slow down decompression for everyone just because of some artificial test.

@Dongmuliang
Copy link
Author

thanks for your explanation and bug fixing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants