Replicating training / test split on models #191
Comments
@siddk would know better, but my first guess is that the code in auto.py performed the split, so ultimately it comes down to the HF Datasets method train_test_split with a validation ratio of 0.0005. I'm guessing this was done once and every random-seed experiment used the same split, but I am unsure which random seed was used for the initial data processing.
Code in Mistral: Line 112 in 315560f
Code in HF Datasets:
If I had to guess, I would assume it was done with seed=42, but that could certainly be wrong. I just note that 42 is the default seed when no seed is specified.
Honestly, I am unclear on which random seed was used for the data preprocessing, which makes it difficult to perfectly replicate the data split.
Here are some more details from @siddk
Hello,
We are running some experiments on Mistral models, and it would be useful to know how the openwebtext train-test split was done to train the models. That would allow us to replicate the split and evaluate the models on openwebtext without leakage.
Thanks for your help.