I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on C4. I downloaded `allenai/c4` from Hugging Face, updated the schema to `text` (string, doc content), `id` (long, unique id), and `source` ("c4"), and saved it as `json.gz` files of about 250 MB each. Every time I run `dolma -c c4-dedupe.yaml dedupe`, the output attribute is an empty list. Here is the yaml I am using (which is almost identical to the one provided at `configs/dolma-v1_5/para_dedupe/c4.yaml`):
The machine I am using has 360 vCPUs and runs Debian 11 with Python 3.10. I tried both `pip install dolma` and installing the library directly from the repo; neither worked. I also built a small example input like the one in this discussion, and that worked totally fine, so I am pretty confused about this result.
I would really appreciate any help or thoughts on why this might be happening.
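For reference, here is a minimal sketch of the kind of conversion described above, so the input format is unambiguous. The field names (`text`, `id`, `source`) come from the description; the helper names (`to_record`, `write_shard`, `read_shard`) and the one-JSON-object-per-line layout are assumptions for illustration, not the exact script that was used and not part of dolma.

```python
import gzip
import json
from typing import Iterable, Iterator


def to_record(text: str, doc_id: int) -> dict:
    # Hypothetical helper: builds one document with the three fields
    # described in the post. The integer id mirrors that description;
    # whether dolma expects a different id type is not verified here.
    return {"id": doc_id, "text": text, "source": "c4"}


def write_shard(texts: Iterable[str], path: str, start_id: int = 0) -> int:
    # Assumed layout: gzip-compressed file with one JSON object per line.
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for offset, text in enumerate(texts):
            f.write(json.dumps(to_record(text, start_id + offset)) + "\n")
            count += 1
    return count


def read_shard(path: str) -> Iterator[dict]:
    # Reads a shard back so the schema can be spot-checked before
    # running the deduper on it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Reading a few records back with `read_shard` is a quick way to confirm that every line has the expected fields and that paragraph breaks (newlines inside `text`) survived the conversion.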