JsonlReader fails with gzip.BadGzipFile: CRC check failed, but gzip doesn't seem to be bad #278

Open
nelson-liu opened this issue Aug 29, 2024 · 2 comments


nelson-liu commented Aug 29, 2024

Hi!

I seem to be running into this issue when processing a file with datatrove (on latest HEAD, 2da6f22). It raises a BadGzipFile error even though the gzip file appears fine.

Here's a minimal reproduction:

In [1]: from datatrove.pipeline.readers import JsonlReader

In [2]: reader = JsonlReader("s3://commoncrawl/")

In [3]: for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
   ...:     print(i)

yields:

---------------------------------------------------------------------------
BadGzipFile                               Traceback (most recent call last)
Cell In[3], line 1
----> 1 for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
      2     print(i)

File ~/miniconda3/envs/envname/lib/python3.10/site-packages/datatrove/pipeline/readers/jsonl.py:75, in JsonlReader.read_file(self, filepath)
     73 with self.data_folder.open(filepath, "r", compression=self.compression) as f:
     74     try:
---> 75         for li, line in enumerate(f):
     76             with self.track_time():
     77                 try:

File ~/miniconda3/envs/envname/lib/python3.10/gzip.py:314, in GzipFile.read1(self, size)
    312 if size < 0:
    313     size = io.DEFAULT_BUFFER_SIZE
--> 314 return self._buffer.read1(size)

BadGzipFile: CRC check failed 3835354860 != 826174170

However, the gzip file doesn't appear to be bad:

aws s3 cp s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz  .

followed by:

In [1]: import gzip

In [2]: with gzip.open('CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz', 'rb') as f:
   ...:   for line in f:
   ...:       print(line)

runs successfully. Furthermore:

gunzip CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz

also runs successfully.
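
For what it's worth, the trailer can also be checked by hand. A quick sketch (assuming a single-member gzip stream, whose last 8 bytes store the CRC32 and the uncompressed size mod 2**32):

import gzip
import struct
import zlib

path = "CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"

# Read the stored trailer: CRC32 and ISIZE (uncompressed size mod 2**32),
# both little-endian, in the last 8 bytes of the file.
with open(path, "rb") as f:
    f.seek(-8, 2)  # seek to 8 bytes before EOF
    stored_crc, stored_isize = struct.unpack("<II", f.read(8))

# Decompress the whole file and recompute the CRC32 of the payload.
with gzip.open(path, "rb") as f:
    data = f.read()

print("stored CRC:  ", stored_crc)
print("computed CRC:", zlib.crc32(data) & 0xFFFFFFFF)
print("stored size: ", stored_isize, "| actual mod 2**32:", len(data) % 2**32)

Since the gzip.open loop above completes without raising, these should agree on the local copy (Python's gzip module verifies the CRC at end of stream).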

Thanks!

@guipenedo
Collaborator

Hi! I can't reproduce your issue; the code you provided runs on my machine. Maybe there was some sort of temporary network issue? Could you make sure you are on the latest datatrove version?
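
To rule out a flaky transfer, one thing you could try is hashing the raw compressed bytes as streamed from S3 (datatrove reads through fsspec/s3fs) and comparing the digest against the local download. A rough sketch, assuming s3fs is installed and anonymous access works for this bucket:

import hashlib

import fsspec

url = (
    "s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/"
    "1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"
)

# Stream the still-compressed bytes in chunks and hash them; compare with
# `sha256sum` of the locally downloaded file. A mismatch would point at
# the transfer rather than the reader.
h = hashlib.sha256()
with fsspec.open(url, "rb", anon=True) as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())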

@nelson-liu
Author

Hm, this was run on an AWS machine, my local laptop, and another remote machine I was using, so I don't think it was a network issue... the AWS machine and my laptop are ARM-based, while the other remote machine should be x86. Out of curiosity, what version of Python did you use when testing?

> Could you make sure you are on the latest datatrove version?

I mentioned above that I was on HEAD at the time of issue creation. Is it better to be on the latest release version?
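
In case it helps narrow this down, here's a quick sketch for dumping environment details on each machine (the __version__ attributes are assumed to exist, hence the getattr fallback):

import platform
import sys

import datatrove
import fsspec

# Interpreter, architecture, and library versions that might differ
# between the ARM and x86 machines.
print(sys.version)
print(platform.machine(), platform.system())
print("datatrove:", getattr(datatrove, "__version__", "unknown"))
print("fsspec:   ", getattr(fsspec, "__version__", "unknown"))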
