I seem to be running into this issue when processing a file with datatrove (on latest HEAD, 2da6f22). It complains about a `BadGzipFile` even though the gzip file appears fine.
Here's a minimal reproduction:
```python
In [1]: from datatrove.pipeline.readers import JsonlReader

In [2]: reader = JsonlReader("s3://commoncrawl/")

In [3]: for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
   ...:     print(i)
```
yields:
```
---------------------------------------------------------------------------
BadGzipFile                               Traceback (most recent call last)
Cell In[3], line 1
----> 1 for i in reader.read_file("contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2019-18/1555578530040.33/CC-MAIN-20190420200802-20190420222802-00178.jsonl.gz"):
      2     print(i)

File ~/miniconda3/envs/envname/lib/python3.10/site-packages/datatrove/pipeline/readers/jsonl.py:75, in JsonlReader.read_file(self, filepath)
     73 with self.data_folder.open(filepath, "r", compression=self.compression) as f:
     74     try:
---> 75         for li, line in enumerate(f):
     76             with self.track_time():
     77                 try:

File ~/miniconda3/envs/envname/lib/python3.10/gzip.py:314, in GzipFile.read1(self, size)
    312 if size < 0:
    313     size = io.DEFAULT_BUFFER_SIZE
--> 314 return self._buffer.read1(size)

BadGzipFile: CRC check failed 3835354860 != 826174170
```
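The two numbers in this error are CRC32 values: each gzip member ends with the CRC32 and length of the uncompressed data, which the reader recomputes during decompression. A minimal standard-library illustration of what is stored in that trailer:

```python
import gzip
import struct
import zlib

data = b'{"text": "hello"}\n'
blob = gzip.compress(data)

# The last 8 bytes of a gzip member are the CRC32 of the uncompressed
# data and its length (ISIZE), both little-endian 32-bit integers.
crc_stored, isize = struct.unpack("<II", blob[-8:])

print(crc_stored == zlib.crc32(data))  # → True
print(isize == len(data))              # → True
```

A "CRC check failed" error therefore means the bytes produced by decompression differ from the bytes the compressor originally checksummed.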
Hi! I can't reproduce your issue; the code you provided runs fine on my machine. Maybe there was some sort of temporary network issue? Could you make sure you are on the latest datatrove version?
Hm, this was run on an AWS machine, my local laptop, and another remote machine I was using, so I don't think it was a network issue. The AWS machine and my laptop are ARM-based, while the other remote machine should be x86. Out of curiosity, what version of Python did you use when testing?
> Could you make sure you are on the latest datatrove version?
I mentioned above that I was on HEAD at the time of issue creation. Is it better to be on the last release version?
However, the gzip file itself doesn't appear to be bad: both manual checks I ran against it complete successfully.
Thanks!
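The kind of integrity check described above can be sketched as a full streaming decompression, which forces the trailer CRC to be verified (the function name and local sample file here are illustrative, not from the report; the S3 object would first be downloaded locally):

```python
import gzip


def verify_gzip(path: str) -> bool:
    """Decompress the whole file; gzip raises BadGzipFile on a CRC
    mismatch and EOFError on truncation, so success means the trailer
    checksum matched the decompressed bytes."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # stream in 1 MiB chunks
                pass
        return True
    except (gzip.BadGzipFile, EOFError, OSError):
        return False


# Example: build a small local gzip file and verify it.
with gzip.open("sample.jsonl.gz", "wb") as f:
    f.write(b'{"text": "hello"}\n')

print(verify_gzip("sample.jsonl.gz"))  # → True
```

This is roughly what `gzip -t` does; if it passes locally but datatrove still fails on the same bytes, the corruption is happening somewhere in the read path rather than in the stored file.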