Baseline data #61
Conversation
To fix the issue of the decon removing all the data, we tried deleting the bloom filter in S3 before rerunning, since the existing filter gets read in and added to rather than started fresh. It is unclear why this should change the filter (the data it's being run on should be identical) unless something is causing the bloom filter indexing to shift such that entries in the old filter are hashed differently (a small illustration of this is sketched below).
and then we tried rerunning everything after this step:
However this still had the same issue of removing almost everything in the dedup. |
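To illustrate the indexing hypothesis: in a typical bloom filter the bit position for an entry is hash(entry) mod the number of bits, so if the filter's size parameter changes (or is read differently), the same n-grams map to different bits and an existing filter no longer matches. A minimal sketch, not Dolma's actual implementation:

```python
import hashlib

def bit_index(entry: str, num_bits: int, seed: int = 0) -> int:
    """Bit position for an entry: hash(seed:entry) mod num_bits."""
    digest = hashlib.sha256(f"{seed}:{entry}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_bits

ngram = "the quick brown fox jumps over"
# The same n-gram lands on different bits when the filter size differs,
# so a filter built with one size cannot be correctly queried with another.
print(bit_index(ngram, num_bits=8 * 2**30))  # filter sized at 1 GiB
print(bit_index(ngram, num_bits=8 * 2**31))  # filter sized at 2 GiB
```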
We tried this approach again, but restarting from the step below where the eval data used to build the bloom filter is created, after first removing that step's output directory in case the way the bloom filter creation step adds attributes to it is the problem:
Additionally we changed the bloom filter byte size in
After this I am unfortunately still seeing the behavior with nearly all files being removed. |
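For reference, the standard sizing relations for a bloom filter (not specific to Dolma's implementation): for n expected elements and target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = (m/n)·ln 2. A quick sketch for checking what a given byte size implies; the element count here is a made-up placeholder:

```python
import math

def bloom_num_bits(n_items: int, fp_rate: float) -> int:
    """Optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)

def bloom_num_hashes(m_bits: int, n_items: int) -> int:
    """Optimal hash count: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))

n = 50_000_000                       # hypothetical number of eval n-grams
m = bloom_num_bits(n, fp_rate=1e-6)
print(m // 8, "bytes,", bloom_num_hashes(m, n), "hash functions")
```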
I tried something I just thought of to get more information for debugging the decon issues: I ran the decon pipeline using a saved copy of the bloom filter from option 1 that I hadn't accidentally overwritten, so this bloom filter should have been created correctly. However, when I run it on Falcon it starts removing almost all documents the same way as when I remade the bloom filter. This implies to me that the issue isn't with the bloom filter creation but rather with how we're using it. |
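One way to narrow down whether the filter is built wrong or applied wrong is to look at how much of the corpus the dedupe attributes actually flag before the mixer removes anything. A rough sketch, assuming the dedupe step writes gzipped JSONL attribute files with an `attributes` dict mapping the attribute name to a list of flagged spans; the paths and attribute name are placeholders:

```python
# Count how many documents have at least one flagged span in the dedupe
# attribute output. Paths and the attribute name are placeholders; adjust
# to the actual output layout.
import gzip, json, glob

ATTR_NAME = "bff_duplicate_paragraph_spans"   # hypothetical attribute name
flagged = total = 0
for path in glob.glob("attributes/decon/**/*.jsonl.gz", recursive=True):
    with gzip.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            if record.get("attributes", {}).get(ATTR_NAME):
                flagged += 1
print(f"{flagged}/{total} documents have at least one flagged span")
```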
Issues should have been fixed with #66. |
Starting over from the top now with new Dolma version (commit 2ee1ae2):
Setup Environment
Decon
Follow the steps in this readme to decontaminate.
Now let's do this with Pile since we want to train on it first. So we mark contamination:
Then we remove contamination:
Unfortunately this still results in near total removal:
Overall we have only 145725 / 210607728 = 0.0006919261766 of documents retained. |
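The retention numbers above come from counting documents before and after the removal step, presumably over the gzipped JSONL shards. A small sketch of that kind of count, with placeholder paths:

```python
# Count JSONL records in gzipped shards before and after contamination
# removal. The directory paths are placeholders.
import gzip, glob

def count_docs(pattern: str) -> int:
    n = 0
    for path in glob.glob(pattern, recursive=True):
        with gzip.open(path, "rt") as f:
            n += sum(1 for _ in f)
    return n

before = count_docs("pile/documents/**/*.jsonl.gz")
after = count_docs("pile/decontaminated/**/*.jsonl.gz")
print(f"{after} / {before} = {after / before:.10f} of documents retained")
```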
Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, now with the right wheels. Starting over from the top with the new Dolma version (commit 2ee1ae2):
Setup Environment
Decon
Follow the steps in this readme to decontaminate.
Now let's do this with Pile since we want to train on it first. So we mark contamination:
Then we remove contamination:
This initially errored out like this:
Rerunning the command didn't seem to reuse any of the already completed results, but it did finish without errors this time. Removal is more moderate this time, though surprisingly consistent from file to file:
Overall we now have 204809882 / 210607728 = 0.9724708772 of documents retained. |
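To check the "consistent from file to file" observation, per-file retention can be compared directly, assuming the input and output shards keep the same file names; the directory layout here is a placeholder:

```python
# Compare per-file retention between matching input/output shards to see
# how consistent removal is across files. Directory layout is an assumption.
import gzip, os, glob

def count(path: str) -> int:
    with gzip.open(path, "rt") as f:
        return sum(1 for _ in f)

rates = []
for in_path in sorted(glob.glob("pile/documents/*.jsonl.gz")):
    out_path = os.path.join("pile/decontaminated", os.path.basename(in_path))
    if os.path.exists(out_path):
        rates.append(count(out_path) / count(in_path))

if rates:
    print(f"min={min(rates):.4f} max={max(rates):.4f} "
          f"mean={sum(rates) / len(rates):.4f} over {len(rates)} files")
```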
Next we're trying to tokenize
But this gets the following error:
Luca says to just remove the offending line. So we rebuild after removing:
Rebuild env
Then try again:
This works and we upload the results to |
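For future rebuilds, a quick smoke test that the environment can load a tokenizer and round-trip a document can catch problems before launching the full tokenization job. The tokenizer name here is a placeholder, not necessarily the one used for this run:

```python
# Smoke test for a rebuilt environment: load a tokenizer and round-trip
# some text. The tokenizer name is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
text = "Baseline data decontamination smoke test."
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids) == text, len(ids), "tokens")
```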
Now applying all this to RedPajama we get:
900799243 / 901687943 = 0.999014404 documents retained
And tokenize:
|
And now Falcon:
decon
mix
check doc removal
912114192 / 918848690 = 0.9926707214 docs retained
Tokenize
|
We're redoing the Pile tokenization now because of a bug when tokenizing with more parallel processes than there are files in the dataset (a guard for this is sketched below). We push a new config and run:
resulting in:
|
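The failure mode above is launching more worker processes than there are input shards; the simple guard is to cap the process count at the number of files. A sketch of that guard (not Dolma's actual fix), with a placeholder path:

```python
# Never launch more worker processes than there are input files.
import glob, multiprocessing as mp

files = glob.glob("pile/decontaminated/**/*.jsonl.gz", recursive=True)
processes = max(1, min(mp.cpu_count(), len(files)))
print(f"{len(files)} files -> using {processes} processes")
```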
Now let's do c4:
This data is already deconned for Dolma, so we go straight to checking removal:
364156258 / 364156258 = 100% of documents retained
This seems unlikely, so we try deconning again.
check again:
364121142 / 364156258 = 0.9999035689 doc retention rate
Mix
check the number of files to make sure it's > 224 (the number of CPUs on this machine); a quick way to count them is sketched below
496 files
Tokenize
|
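The file-count check above can be done with a quick listing of the output prefix; the bucket and prefix here are placeholders:

```python
# Count output shards under an S3 prefix so we can confirm there are more
# files than CPU workers. Bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
count = 0
for page in paginator.paginate(Bucket="my-bucket", Prefix="c4/decontaminated/"):
    count += sum(1 for obj in page.get("Contents", [])
                 if obj["Key"].endswith(".jsonl.gz"))
print(count, "files")
```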
Now mc4:
dedup
Check removal
3928652800 / 3928733374 = 0.9999794911 docs retained
Mix
Tokenize
|
Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it needs the code at main as of afab18c.
Then tokenize:
|
Working on creating data with dolma v1.5 style decontamination from baseline datasets. Progress so far is commented below.