JsonlReader fails with gzip.BadGzipFile: CRC check failed, but gzip doesn't seem to be bad #278

Open
nelson-liu opened this issue Aug 29, 2024 · 2 comments


nelson-liu commented Aug 29, 2024

Hi!

I seem to be running into this issue when processing a file with datatrove (on latest HEAD, 2da6f22). It raises a BadGzipFile error even though the gzip file appears fine.

Here's a minimal reproduction:

In [1]: from datatrove.pipeline.readers import JsonlReader

In [2]: reader = JsonlReader("s3://commoncrawl/")

In [3]: for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
   ...:     print(i)

yields:

---------------------------------------------------------------------------
BadGzipFile                               Traceback (most recent call last)
Cell In[3], line 1
----> 1 for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
      2     print(i)

File ~/miniconda3/envs/envname/lib/python3.10/site-packages/datatrove/pipeline/readers/jsonl.py:75, in JsonlReader.read_file(self, filepath)
     73 with self.data_folder.open(filepath, "r", compression=self.compression) as f:
     74     try:
---> 75         for li, line in enumerate(f):
     76             with self.track_time():
     77                 try:

File ~/miniconda3/envs/envname/lib/python3.10/gzip.py:314, in GzipFile.read1(self, size)
    312 if size < 0:
    313     size = io.DEFAULT_BUFFER_SIZE
--> 314 return self._buffer.read1(size)

BadGzipFile: CRC check failed 3835354860 != 826174170

However, the gzip file doesn't appear to be bad:

aws s3 cp s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz  .

followed by:

In [1]: import gzip

In [2]: with gzip.open('CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz', 'rb') as f:
   ...:   for line in f:
   ...:       print(line)

runs successfully. Furthermore:

gunzip CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz

also runs successfully.
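
For what it's worth, the trailer can also be checked by hand. A quick sketch (assuming a single-member gzip stream, whose last 8 bytes store the CRC32 and the uncompressed size mod 2**32):

import gzip
import struct
import zlib

path = "CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"

# Read the stored trailer: CRC32 and ISIZE (uncompressed size mod 2**32),
# both little-endian, in the last 8 bytes of the file.
with open(path, "rb") as f:
    f.seek(-8, 2)  # seek to 8 bytes before EOF
    stored_crc, stored_isize = struct.unpack("<II", f.read(8))

# Decompress the whole file and recompute the CRC32 of the payload.
with gzip.open(path, "rb") as f:
    data = f.read()

print("stored CRC:  ", stored_crc)
print("computed CRC:", zlib.crc32(data) & 0xFFFFFFFF)
print("stored size: ", stored_isize, "| actual mod 2**32:", len(data) % 2**32)

Since the gzip.open loop above completes without raising, these should agree on the local copy (Python's gzip module verifies the CRC at end of stream).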

Thanks!

@guipenedo
Collaborator

Hi! I can't reproduce your issue; the code you provided runs on my machine. Maybe there was some sort of temporary network issue? Could you make sure you are on the latest datatrove version?
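
To rule out a flaky transfer, one thing you could try is hashing the raw compressed bytes as streamed from S3 (datatrove reads through fsspec/s3fs) and comparing the digest against the local download. A rough sketch, assuming s3fs is installed and anonymous access works for this bucket:

import hashlib

import fsspec

url = (
    "s3://commoncrawl/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/"
    "1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"
)

# Stream the still-compressed bytes in chunks and hash them; compare with
# `sha256sum` of the locally downloaded file. A mismatch would point at
# the transfer rather than the reader.
h = hashlib.sha256()
with fsspec.open(url, "rb", anon=True) as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())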

@nelson-liu
Author

Hm, this was run on an AWS machine, my local laptop, and another remote machine I was using, so I don't think it was a network issue... the AWS machine and my laptop are ARM-based, while the other remote machine should be x86. Out of curiosity, what version of Python did you use when testing?

> Could you make sure you are on the latest datatrove version?

I mentioned above that I was on HEAD at the time of issue creation. Is it better to be on the latest release version?
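
In case it helps narrow this down, here's a quick sketch for dumping environment details on each machine (the __version__ attributes are assumed to exist, hence the getattr fallback):

import platform
import sys

import datatrove
import fsspec

# Interpreter, architecture, and library versions that might differ
# between the ARM and x86 machines.
print(sys.version)
print(platform.machine(), platform.system())
print("datatrove:", getattr(datatrove, "__version__", "unknown"))
print("fsspec:   ", getattr(fsspec, "__version__", "unknown"))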
