Model Export to Hugging Face format and optionally upload #571

Closed · wants to merge 7 commits

Conversation

@rhys101 commented Jun 7, 2024

This continues the work on exporting llm.c models to Hugging Face formats (Issue 502).

It's a standalone export script that will convert a GPT2 llm.c binary model file to a local HF model directory. It copies over a standard GPT2 tokenizer into the HF model as well.

python export_hf.py --input input_file.bin --output model_name

It can also optionally upload the model to Hugging Face under the currently logged-in user account (run huggingface-cli login first if needed).

python export_hf.py --input input_file.bin --output model_name --push true
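For anyone reviewing, the conversion boils down to building a transformers GPT-2 model from the llm.c weights and then calling save_pretrained / push_to_hub. A rough sketch of that flow (not the script verbatim; read_llmc_checkpoint and the hparams key names below are placeholders for the header-parsing code in export_hf.py):

import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

def export_to_hf(input_bin, output_name, push=False):
    # Hypothetical helper standing in for the part of export_hf.py that parses
    # the llm.c binary header and returns hyperparameters plus fp32 tensors
    # already keyed by Hugging Face GPT-2 parameter names.
    hparams, state_dict = read_llmc_checkpoint(input_bin)
    config = GPT2Config(
        vocab_size=hparams["vocab_size"],
        n_positions=hparams["max_seq_len"],
        n_embd=hparams["channels"],
        n_layer=hparams["num_layers"],
        n_head=hparams["num_heads"],
    )
    model = GPT2LMHeadModel(config)
    model.load_state_dict(state_dict)      # names/shapes must match HF's GPT-2 layout
    model.save_pretrained(output_name)     # writes config.json and the weights
    # reuse the standard GPT-2 tokenizer files
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.save_pretrained(output_name)
    if push:
        # requires a prior `huggingface-cli login`
        model.push_to_hub(output_name)
        tokenizer.push_to_hub(output_name)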

I've tested a 124M example export, which gave some semi-coherent output; it improves quite a bit with repetition_penalty set to 1.3.
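If you want to reproduce that sampling check, something along these lines works against the exported directory (a sketch; the prompt is arbitrary and "model_name" is the output directory from the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("model_name")
model = AutoModelForCausalLM.from_pretrained("model_name")

inputs = tok("During photosynthesis in green plants,", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                     repetition_penalty=1.3)  # the penalty that helps the 124M model
print(tok.decode(out[0], skip_special_tokens=True))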

There may well be misconfigurations in the exported config.json, or issues with the conversion script, that become apparent with further review and testing.

@eliebak (Contributor) commented Jun 7, 2024

Seems to work reasonably well for me. You can see the model here: https://huggingface.co/eliebak/dummy-model2, and here is the generation test:
['During photosynthesis in green plants, the leaves of a plant are covered with chlorophyll. The chlorophyll is an essential component for photosynthetic activity and it helps to regulate the growth rate of plants.\nThe chlorophyll content of the leaves of a plant depends on several factors such as the temperature, light intensity, pH, and other parameters that affect its color. In addition, the amount of sunlight available at any given time affects the chlorophyll content of the leaf. Therefore, when you look through your']

@rhys101 (Author) commented Jun 13, 2024

That's great to hear @eliebak. I've since fixed and improved the script to allow selecting the output dtype as either float16 or float32, irrespective of the model.bin format, via the --dtype float16 or --dtype float32 command-line options. The default is float16.
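The dtype handling amounts to assembling the weights in float32 and then casting the whole model before saving, recording the choice in the config so inference picks it up. Roughly (a sketch, not the script's exact code):

import torch
from transformers import GPT2LMHeadModel

def save_with_dtype(model: GPT2LMHeadModel, output_dir: str, dtype_name: str = "float16"):
    dtype = {"float16": torch.float16, "float32": torch.float32}[dtype_name]
    model = model.to(dtype)              # cast every parameter from the fp32 copy
    model.config.torch_dtype = dtype     # serialized as "float16"/"float32" in config.json
    model.save_pretrained(output_dir)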

I've also run local evals using the EleutherAI evaluation harness, as described on the Open LLM Leaderboard, and compared them with the scores published for openai-community/gpt2 to get a better picture of the performance of the edu-fineweb-10B model.

gpt2-124M-edu-fineweb-10B Evals

@karpathy (Owner) commented

btw here is related code from @matthewdouglas

https://gist.github.com/matthewdouglas/1c0833f7fa9adbc54e4f5dc09e2b59a2

I'll want to merge one of these two to master

@karpathy (Owner) commented

@rhys101 can you share your Eleuther eval harness command? I was a bit surprised that their docs are very sparse on the actual evals one should be running.

@rhys101 (Author) commented Jun 14, 2024

> btw here is related code from @matthewdouglas
>
> https://gist.github.com/matthewdouglas/1c0833f7fa9adbc54e4f5dc09e2b59a2
>
> I'll want to merge one of these two to master

I think both are pretty similar; in fact I ran the float16 evals on its conversion and got the same output scores. One suggestion for the gist code would be to make torch_dtype explicit in the exported config.json: I think if none is specified it defaults to float32, so the model could take up twice the memory during inference even though the weights were actually saved as 16-bit.
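Concretely, the difference shows up at load time. A quick way to check (a sketch; "model_name" is a placeholder for the exported directory or Hub repo id):

import torch
from transformers import AutoModelForCausalLM

# Default load path: transformers materializes the model in float32
m32 = AutoModelForCausalLM.from_pretrained("model_name")
print(next(m32.parameters()).dtype)       # torch.float32

# Explicit override keeps the 16-bit footprint
m16 = AutoModelForCausalLM.from_pretrained("model_name", torch_dtype=torch.bfloat16)
print(next(m16.parameters()).dtype)       # torch.bfloat16

# torch_dtype="auto" defers to the torch_dtype recorded in config.json (or the
# checkpoint's own dtype), which is why writing it explicitly into the export helps
mauto = AutoModelForCausalLM.from_pretrained("model_name", torch_dtype="auto")
print(next(mauto.parameters()).dtype)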

@matthewdouglas please use whatever bits from this PR that are useful!

@rhys101 (Author) commented Jun 14, 2024

> @rhys101 can you share your Eleuther eval harness command? I was a bit surprised that their docs are very sparse on the actual evals one should be running.

I followed the guide on the Open LLM Leaderboard for the version of the EleutherAI harness they use and for how they configure each test.

Here are the two scripts I use locally to run the evals:

Python script summarize_eval.py

import json
import sys

# Results directory, e.g. lm-evaluation-harness/results/my_result
RESULT = sys.argv[1]
print("-" * 40)

# Metric to read for each task, following the Open LLM Leaderboard configuration
key = {"arc_challenge_25shot.json": "acc_norm",
       "gsm8k_5shot.json": "acc",
       "hellaswag_10shot.json": "acc_norm",
       "mmlu_5shot.json": "acc",
       "truthfulqa_0shot.json": "mc2",
       "winogrande_5shot.json": "acc"
       }

total = 0
for test in ["arc_challenge_25shot.json", "gsm8k_5shot.json", "hellaswag_10shot.json",
             "mmlu_5shot.json", "truthfulqa_0shot.json", "winogrande_5shot.json"]:
    with open("./%s/%s" % (RESULT, test)) as f:
        data = json.load(f)
    # Average the metric over all sub-tasks in the file (MMLU contains 57 of them)
    r_count = 0
    r_total = 0
    for test_name in data['results']:
        r_count += 1
        r_total += data['results'][test_name][key[test]]
    score = (r_total * 100) / r_count
    print(f"{test:<30} : {score:.4f}")
    total += score
average = total / 6.0
print("-" * 40)
print(f"Average Score                  : {average:.4f}")

Shell script eval.sh

# https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
# (See About tab -> REPRODUCIBILITY)

# Clone the evaluation harness:

# git clone https://github.com/EleutherAI/lm-evaluation-harness/
# cd lm-evaluation-harness
# git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# pip install -e .

# Then return to the parent directory and run this script

# cd ..
# ./eval.sh [model_name] [result_name]

# where model_name is either a HF model such as openai-community/gpt2 or a local path such as ./gpt2-124M-run1
# and result_name is the name of the folder under lm-evaluation-harness/results to store the evaluations

# Since the evals can take a couple of hours to run, depending on the model size, you may wish to
# run within a "screen" session or by using nohup to run the script:

# nohup ./eval.sh [model_name] [result_name] > run.txt 2> err.txt &

if [ -z "$1" ]; then
    echo "Error: missing HuggingFace model name or path to local model"
    echo "./eval.sh hf_account/model_name my_result"
  exit 1
fi
if [ -z "$2" ]; then
  echo "Error: missing output name for results"
    echo "./eval.sh hf_account/model_name my_result"
  exit 1
fi

export MODEL="$(realpath -s "$1")"
export RESULT="$2"
echo "Evaluating model $MODEL"
echo "Saving results to ./lm-evaluation-harness/results/$RESULT"

cd lm-evaluation-harness

python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/$RESULT/truthfulqa_0shot.json --device cuda
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks winogrande --batch_size 1 --no_cache --write_out --output_path results/$RESULT/winogrande_5shot.json --device cuda --num_fewshot 5
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/$RESULT/arc_challenge_25shot.json --device cuda --num_fewshot 25
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/$RESULT/hellaswag_10shot.json --device cuda --num_fewshot 10
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks gsm8k --batch_size 1 --no_cache --write_out --output_path results/$RESULT/gsm8k_5shot.json --device cuda --num_fewshot 5
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions --batch_size 1 --no_cache --write_out --output_path results/$RESULT/mmlu_5shot.json --device cuda --num_fewshot 5

cd ..
python summarize_eval.py lm-evaluation-harness/results/$RESULT

Then just run with ./eval.sh [model_name] [result_name] and go make some tea...

@karpathy (Owner) commented

Ty @rhys101! I'm organizing everything together right now and running it.
Worth keeping in mind that we shouldn't use float16. If the model is trained in either bf16 or fp32, it cannot be inferenced in fp16, because fp16 has a reduced exponent range. It must be inferenced in bf16. I'll make some of the edits here.
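The exponent-range point is easy to see numerically (a quick illustration, unrelated to the PR code itself):

import torch

# bf16 keeps fp32's 8 exponent bits (similar ~3.4e38 range, less mantissa precision);
# fp16 has only 5 exponent bits, so anything above ~65504 overflows to inf.
print(torch.finfo(torch.float32).max)    # ~3.40e+38
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
print(torch.finfo(torch.float16).max)    # 65504.0

x = torch.tensor([1e6])
print(x.to(torch.bfloat16))              # ~1e6, survives the cast
print(x.to(torch.float16))               # inf: why bf16/fp32 weights can break in fp16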

@karpathy (Owner) commented

@rhys101 is there a reason I'm missing for why tensor_bf16 also returns the weights in fp32?

@karpathy (Owner) commented

For e.g. HellaSwag I get warnings like:

Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). Running this sequence through the model will result in indexing errors

I'm not sure if the code handles this correctly, i.e. cropping the number of shots as needed to fit everything into the context.

@karpathy (Owner) commented

I am digging into why the Eleuther eval is SO slow. In llm.c it takes me like 1-2 seconds to evaluate HellaSwag, but here it is taking many long minutes, even messing with the batch size (which in your code defaults to 1).

@rhys101 (Author) commented Jun 14, 2024

The principle I was trying to follow was to keep float32 accuracy throughout, then convert if needed to float16/bfloat16. I've added the bfloat16 option in the latest push, but testing on the 774M model gives very low evals compared to float32 and float16, so it needs some thought/checking to see why.

Here are the float32 evals for the 774M model:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 30.887372 |
| gsm8k_5shot.json | 0.303260 |
| hellaswag_10shot.json | 57.797252 |
| mmlu_5shot.json | 26.173440 |
| truthfulqa_0shot.json | 35.708253 |
| winogrande_5shot.json | 59.589582 |
| Average Score | 35.076527 |

For float16:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 30.716724 |
| gsm8k_5shot.json | 0.151630 |
| hellaswag_10shot.json | 57.896833 |
| mmlu_5shot.json | 25.556487 |
| truthfulqa_0shot.json | 35.699654 |
| winogrande_5shot.json | 60.142068 |
| Average Score | 35.027233 |

Then bfloat16:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 24.658703 |
| gsm8k_5shot.json | 0.000000 |
| hellaswag_10shot.json | 35.172276 |
| mmlu_5shot.json | 25.212616 |
| truthfulqa_0shot.json | 38.108226 |
| winogrande_5shot.json | 50.276243 |
| Average Score | 28.904677 |

This happens with both this PR's code and the gist code (you can just edit config.json to set the torch_dtype used for inference; the evals pick this up). I need to look a bit more into what's happening with bfloat16 here.
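Concretely, the config.json edit I mean is just this (a sketch; the path is a placeholder for the local export directory):

import json

path = "gpt2-774M/config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["torch_dtype"] = "bfloat16"   # or "float16" / "float32"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)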

For reference, the openai-community/gpt2-large average score is 32.07

@karpathy (Owner) commented

@rhys101 how long does it take for you to run e.g. only HellaSwag? I'm used to this taking only a few seconds in llm.c

@rhys101 (Author) commented Jun 14, 2024

> I am digging into why the Eleuther eval is SO slow. In llm.c it takes me like 1-2 seconds to evaluate HellaSwag, but here it is taking many long minutes, even messing with the batch size (which in your code defaults to 1).

Yes, it's unreal; running the full evals can take quite a while, the MMLU set especially. I've ordered the tests in the shell script so that the quickest run first, giving early guidance on which direction the evals are heading.

@karpathy (Owner) commented

I see in nvidia-smi that the GPUs are not utilized at all, which could be part of it. Why do the official docs recommend batch size 1? I tried raising it but it doesn't appear to be significantly faster.

@rhys101 (Author) commented Jun 14, 2024

I've just fired off a 124M eval on a local 4090 to measure:

| 0 NVIDIA GeForce RTX 4090 Off | 00000000:26:00.0 Off | Off |
| 55% 76C P0 314W / 350W | 5891MiB / 24564MiB | 98% Default |

(The GPU drops to 0% in between each test while the CPU does the data prep, often for a long time.)

Timings are (min:sec):

| Test | Samples | Runtime |
| --- | --- | --- |
| truthfulqa | 5882 | 01:08 |
| winogrande | 2534 | 00:26 |
| arc | 4687 | 05:45 |
| hellaswag | 40168 | 52:00 |

(hellaswag was still running; that's the indicative time at the start)

@matthewdouglas commented Jun 14, 2024

It's reasonable to add torch_dtype to the config.json. When I was testing the inference I typically loaded it with torch_dtype=torch.bfloat16. I only bothered to export the weights in bf16.

The lm-eval-harness should be using the GPU when --device cuda is set. The inference dtype should be changeable in the eval harness with e.g. --model_args dtype="bfloat16" or --model_args dtype="float32".

There should also be an "auto" option for the batch size.
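For example, a single HellaSwag run with both of those applied would look something like this (assuming the pinned harness commit supports the dtype arg and batch_size auto; the local model path and result name are placeholders):

python main.py --model hf-causal-experimental --model_args pretrained=./gpt2-124M-run1,dtype="bfloat16",use_accelerate=True,trust_remote_code=True --tasks hellaswag --num_fewshot 10 --batch_size auto --no_cache --write_out --output_path results/test/hellaswag_10shot.json --device cuda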

@karpathy (Owner) commented

merged

@karpathy closed this Jun 19, 2024