No validation set in openwebtext leads to failure. #195

Open
john-hewitt opened this issue Dec 28, 2022 · 13 comments

Comments

@john-hewitt

Describe the bug
After building the index for openwebtext, building the trainer fails (at line 161 of train.py) because no validation dataset is constructed. I believe this is because the lm_dataset object is built with Hugging Face's load_dataset on the openwebtext dataset, which has no validation split. The validation_ratio quinine config option is only used when building the custom_eval_datasets, not the lm_dataset object, so it does not portion out part of openwebtext as a validation set.
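
For context, a minimal sketch of the root cause outside of Mistral (assuming the stock Hugging Face openwebtext dataset):

from datasets import load_dataset

# openwebtext ships only a "train" split, so there is nothing for
# lm_dataset["validation"] to refer to later in train.py.
raw = load_dataset("openwebtext")
print(raw)           # DatasetDict with only a 'train' split
raw["validation"]    # raises KeyError -- roughly the failure mode train.py runs into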

To Reproduce
Replace datasets/wikitext2.yaml with datasets/openwebtext.yaml in mistral-micro.yaml (and make other artefact location changes) and run

deepspeed --num_gpus 4 --num_nodes 1 --master_addr machine1 train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 4 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id repro-bug-openweb-novalid

Expected behavior
No failure occurs at line 161 of train.py when lm_dataset['validation'] is accessed.

@J38
Contributor

J38 commented Dec 29, 2022

I think this is because David changed this from the original. It looks like get_auto_dataset() does the right thing? So if you just build the conventional Hugging Face cache (instead of David's custom index) with get_auto_dataset, it should work fine and create a standard tokenized Hugging Face cache with an extra validation set.
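
For illustration, a hedged sketch of that conventional path (the split ratio, seed, and variable names are assumptions, not Mistral's actual defaults):

from datasets import load_dataset

# Build a plain Hugging Face dataset and carve a validation split out of train,
# roughly what a get_auto_dataset-style path would do before tokenization.
raw = load_dataset("openwebtext")
split = raw["train"].train_test_split(test_size=0.0005, seed=42)
lm_dataset = {"train": split["train"], "validation": split["test"]}
# Tokenization / grouping would then run over both splits and be cached as usual.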

@J38
Contributor

J38 commented Dec 29, 2022

Basically, older Mistral just built a conventional Hugging Face cache; David created a new custom data handling setup and I guess didn't add in creating a validation set ...

@J38
Contributor

J38 commented Dec 29, 2022

I've started this branch: https://github.com/stanford-crfm/mistral/tree/mistral-flash-dec-2022

This should have the Mistral Feb 2022 code + some bug fixes, and it has worked with Flash Attention.

@J38
Contributor

J38 commented Dec 29, 2022

I'm in vacation mode, but I am happy to help you get this branch working ... you will need to install Flash Attention and a specially modified Hugging Face transformers as well ...

@J38
Contributor

J38 commented Dec 29, 2022

Some instructions on getting this working (remember to use the branch https://github.com/stanford-crfm/mistral/tree/mistral-flash-dec-2022):

  1. The standard Mistral environment should work, but note you need to install a custom transformers, so delete transformers from the pip-requirements.txt file when setting up your environment ...
conda create -n mistral python=3.8.12 pytorch=1.11.0 torchdata cudatoolkit=11.3 -c pytorch
conda activate mistral
pip install -r setup/pip-requirements.txt

I think this will work with newer PyTorch, etc ... but you need to make sure you build Flash Attention with whatever you are using ...

  2. Install transformers from Git (https://github.com/huggingface/transformers), then replace src/transformers/models/gpt2/modeling_gpt2.py with the version checked into this branch (in the transformers dir in the top-level directory of this repo).

  3. When creating an environment, make sure to install Flash Attention (https://github.com/HazyResearch/flash-attention) ... you may need to roll back to this commit: f515c77f2528b5062ebcc6c905c8817ca0ac0ad1 ... the last time I tried to get this working, it wasn't working because of issues with newer versions of Flash Attention, but they may've been resolved in main by now ... I rolled back to that commit and it was fine ...

Please let me know if you run into any issues and we can clean up this branch + instructions ... but if all goes well you should get super fast Flash Attention GPT-2 training, which is something like 2x faster ...

In the future we should think about reconciling this branch with current main ... but if you just want something working in the next day, this is the quickest route ...

@J38
Contributor

J38 commented Dec 29, 2022

Sample command:

Note: add a file called hostfile in the top-level directory (even a blank one if just using one machine).

deepspeed --hostfile hostfile --num_gpus 8 --num_nodes 1 --master_addr sphinx4 train.py --config conf/your_config.yaml --nnodes 1 --nproc_per_node 8 --training_arguments.per_device_train_batch_size 16 --training_arguments.deepspeed conf/deepspeed/z2-small-bf16-conf.json --run_id mistral-w-flash-demo

@J38
Contributor

J38 commented Dec 29, 2022

You need to use bf16 ... a bad feature of this branch right now is that this is just hard-coded here:

training_args.bf16 = True

So it'd be a good idea to make this more transparent ... this branch is sort of my personal experimentation that I got running and could use some clean up ...
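
For reference, bf16 is an ordinary TrainingArguments flag, so a sketch of a more transparent setup (assuming a recent transformers release; the output_dir and batch size here are placeholders) could look like:

from transformers import TrainingArguments

# Sketch only: let the run configuration set bf16 instead of hard-coding it in the branch.
training_args = TrainingArguments(
    output_dir="runs/mistral-w-flash-demo",  # placeholder
    bf16=True,                               # Flash Attention needs bf16/fp16; bf16 for stability
    per_device_train_batch_size=16,
)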

@J38
Contributor

J38 commented Dec 29, 2022

Flash Attention requires bf16 or fp16 ... and you need bf16 for the stability ...

@yandachen

yandachen commented Mar 6, 2023

@J38 Hello, I also had the same issue of the code not working for openwebtext due to the missing validation set, so I tried your solution above. But I encountered the error "ImportError: cannot import name 'LMDataCollator' from 'src.core.trainer'".
It looks like src.core.trainer in the branch https://github.com/stanford-crfm/mistral/tree/mistral-flash-dec-2022 does not have a class called LMDataCollator. Could you please help look into that?

@J38
Contributor

J38 commented Mar 7, 2023

Can you provide more details about what is causing that error (e.g. what line is failing in what file)? The branch is older code before changes were made, so it should not require LMDataCollator. Are you pre-training from scratch or trying to fine-tune a model trained with main branch code?

@J38
Contributor

J38 commented Mar 7, 2023

I guess it is this line:

from src.core.trainer import LMDataCollator

@yandachen

Yes, it is this line, and I believe LMDataCollator is used in line 158 of this file. But I was able to fix the dev set problem by adding a few lines of code on the main branch, so I think the issue is resolved.

@J38
Contributor

J38 commented Mar 7, 2023

I tried reverting train.py to the February 2022 version; does that help?
