Model Export to Hugging Face format and optionally upload #571

Closed · wants to merge 7 commits

Conversation

@rhys101 commented Jun 7, 2024

This continues the work on exporting llm.c models to Hugging Face formats (Issue 502).

It's a standalone export script that will convert a GPT2 llm.c binary model file to a local HF model directory. It copies over a standard GPT2 tokenizer into the HF model as well.

python export_hf.py --input input_file.bin --output model_name

It can also optionally upload the model to Hugging Face under the currently logged-in user account (run huggingface-cli login first if needed).

python export_hf.py --input input_file.bin --output model_name --push true
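For anyone reviewing, the conversion boils down to building a transformers GPT-2 model from the llm.c weights and then calling save_pretrained / push_to_hub. A rough sketch of that flow (not the script verbatim; read_llmc_checkpoint and the hparams key names below are placeholders for the header-parsing code in export_hf.py):

import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

def export_to_hf(input_bin, output_name, push=False):
    # Hypothetical helper standing in for the part of export_hf.py that parses
    # the llm.c binary header and returns hyperparameters plus fp32 tensors
    # already keyed by Hugging Face GPT-2 parameter names.
    hparams, state_dict = read_llmc_checkpoint(input_bin)
    config = GPT2Config(
        vocab_size=hparams["vocab_size"],
        n_positions=hparams["max_seq_len"],
        n_embd=hparams["channels"],
        n_layer=hparams["num_layers"],
        n_head=hparams["num_heads"],
    )
    model = GPT2LMHeadModel(config)
    model.load_state_dict(state_dict)      # names/shapes must match HF's GPT-2 layout
    model.save_pretrained(output_name)     # writes config.json and the weights
    # reuse the standard GPT-2 tokenizer files
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.save_pretrained(output_name)
    if push:
        # requires a prior `huggingface-cli login`
        model.push_to_hub(output_name)
        tokenizer.push_to_hub(output_name)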

I've tested a 124M example export, which gave some semi-coherent output; it improves quite a bit with repetition_penalty set to 1.3.
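If you want to reproduce that sampling check, something along these lines works against the exported directory (a sketch; the prompt is arbitrary and "model_name" is the output directory from the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("model_name")
model = AutoModelForCausalLM.from_pretrained("model_name")

inputs = tok("During photosynthesis in green plants,", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                     repetition_penalty=1.3)  # the penalty that helps the 124M model
print(tok.decode(out[0], skip_special_tokens=True))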

There may well be misconfigurations in the exported config.json, or issues with the conversion script, that become apparent with further review and testing.

@eliebak (Contributor) commented Jun 7, 2024

Seems to work reasonably well for me. You can see the model here: https://huggingface.co/eliebak/dummy-model2, and here is the generation test:
['During photosynthesis in green plants, the leaves of a plant are covered with chlorophyll. The chlorophyll is an essential component for photosynthetic activity and it helps to regulate the growth rate of plants.\nThe chlorophyll content of the leaves of a plant depends on several factors such as the temperature, light intensity, pH, and other parameters that affect its color. In addition, the amount of sunlight available at any given time affects the chlorophyll content of the leaf. Therefore, when you look through your']

@rhys101 (Author) commented Jun 13, 2024

That's great to hear @eliebak. I've since fixed and improved the script to allow selecting the output dtype as either float16 or float32, irrespective of the model.bin format, via the --dtype float16 or --dtype float32 command-line options. The default is float16.
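The dtype handling amounts to assembling the weights in float32 and then casting the whole model before saving, recording the choice in the config so inference picks it up. Roughly (a sketch, not the script's exact code):

import torch
from transformers import GPT2LMHeadModel

def save_with_dtype(model: GPT2LMHeadModel, output_dir: str, dtype_name: str = "float16"):
    dtype = {"float16": torch.float16, "float32": torch.float32}[dtype_name]
    model = model.to(dtype)              # cast every parameter from the fp32 copy
    model.config.torch_dtype = dtype     # serialized as "float16"/"float32" in config.json
    model.save_pretrained(output_dir)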

I've also run local evals using the EleutherAI evaluation harness, as described on the Open LLM Leaderboard, and compared them with the scores published for openai-community/gpt2 to get a better picture of the performance of the edu-fineweb-10B model.

gpt2-124M-edu-fineweb-10B Evals

@karpathy (Owner) commented

btw here is related code from @matthewdouglas

https://gist.github.com/matthewdouglas/1c0833f7fa9adbc54e4f5dc09e2b59a2

I'll want to merge one of these two to master

@karpathy (Owner) commented

@rhys101 can you share your Eleuther eval harness command? I was a bit surprised that their docs are very sparse on the actual evals one should be running.

@rhys101 (Author) commented Jun 14, 2024

> btw here is related code from @matthewdouglas
>
> https://gist.github.com/matthewdouglas/1c0833f7fa9adbc54e4f5dc09e2b59a2
>
> I'll want to merge one of these two to master

I think both are pretty similar; in fact I ran the float16 evals on its conversion and got the same output scores. One suggestion for the gist code would be to make torch_dtype explicit in the exported config.json: I think if none is specified it defaults to float32, so the model could take up twice the memory during inference even though the weights were actually saved as 16-bit.
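Concretely, the difference shows up at load time. A quick way to check (a sketch; "model_name" is a placeholder for the exported directory or Hub repo id):

import torch
from transformers import AutoModelForCausalLM

# Default load path: transformers materializes the model in float32
m32 = AutoModelForCausalLM.from_pretrained("model_name")
print(next(m32.parameters()).dtype)       # torch.float32

# Explicit override keeps the 16-bit footprint
m16 = AutoModelForCausalLM.from_pretrained("model_name", torch_dtype=torch.bfloat16)
print(next(m16.parameters()).dtype)       # torch.bfloat16

# torch_dtype="auto" defers to the torch_dtype recorded in config.json (or the
# checkpoint's own dtype), which is why writing it explicitly into the export helps
mauto = AutoModelForCausalLM.from_pretrained("model_name", torch_dtype="auto")
print(next(mauto.parameters()).dtype)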

@matthewdouglas please use whatever bits from this PR that are useful!

@rhys101 (Author) commented Jun 14, 2024

> @rhys101 can you share your Eleuther eval harness command? I was a bit surprised that their docs are very sparse on the actual evals one should be running.

I followed the guide on the Open LLM Leaderboard for the version of the EleutherAI harness they use and for how they configure each test.

Here are the two scripts I use locally to run the evals:

Python script summarize_eval.py

import json
import sys

# Results directory, e.g. lm-evaluation-harness/results/my_result
RESULT = sys.argv[1]
print("-" * 40)

# Metric to read for each task, following the Open LLM Leaderboard configuration
key = {"arc_challenge_25shot.json": "acc_norm",
       "gsm8k_5shot.json": "acc",
       "hellaswag_10shot.json": "acc_norm",
       "mmlu_5shot.json": "acc",
       "truthfulqa_0shot.json": "mc2",
       "winogrande_5shot.json": "acc"
       }

total = 0
for test in ["arc_challenge_25shot.json", "gsm8k_5shot.json", "hellaswag_10shot.json",
             "mmlu_5shot.json", "truthfulqa_0shot.json", "winogrande_5shot.json"]:
    with open("./%s/%s" % (RESULT, test)) as f:
        data = json.load(f)
    # Average the metric over all sub-tasks in the file (MMLU contains 57 of them)
    r_count = 0
    r_total = 0
    for test_name in data['results']:
        r_count += 1
        r_total += data['results'][test_name][key[test]]
    score = (r_total * 100) / r_count
    print(f"{test:<30} : {score:.4f}")
    total += score
average = total / 6.0
print("-" * 40)
print(f"Average Score                  : {average:.4f}")

Shell script eval.sh

# https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
# (See About tab -> REPRODUCIBILITY)

# Clone the evaluation harness:

# git clone https://github.com/EleutherAI/lm-evaluation-harness/
# cd lm-evaluation-harness
# git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# pip install -e .

# Then return to the parent directory and run this script

# cd ..
# ./eval.sh [model_name] [result_name]

# where model_name is either a HF model such as openai-community/gpt2 or a local path such as ./gpt2-124M-run1
# and result_name is the name of the folder under lm-evaluation-harness/results to store the evaluations

# Since the evals can take a couple of hours to run, depending on the model size, you may wish to
# run within a "screen" session or by using nohup to run the script:

# nohup ./eval.sh [model_name] [result_name] > run.txt 2> err.txt &

if [ -z "$1" ]; then
    echo "Error: missing HuggingFace model name or path to local model"
    echo "./eval.sh hf_account/model_name my_result"
  exit 1
fi
if [ -z "$2" ]; then
  echo "Error: missing output name for results"
    echo "./eval.sh hf_account/model_name my_result"
  exit 1
fi

export MODEL="$(realpath -s "$1")"
export RESULT="$2"
echo "Evaluating model $MODEL"
echo "Saving results to ./lm-evaluation-harness/results/$RESULT"

cd lm-evaluation-harness

python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/$RESULT/truthfulqa_0shot.json --device cuda
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks winogrande --batch_size 1 --no_cache --write_out --output_path results/$RESULT/winogrande_5shot.json --device cuda --num_fewshot 5
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/$RESULT/arc_challenge_25shot.json --device cuda --num_fewshot 25
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/$RESULT/hellaswag_10shot.json --device cuda --num_fewshot 10
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks gsm8k --batch_size 1 --no_cache --write_out --output_path results/$RESULT/gsm8k_5shot.json --device cuda --num_fewshot 5
python main.py --model hf-causal-experimental --model_args pretrained=$MODEL,use_accelerate=True,trust_remote_code=True --tasks hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions --batch_size 1 --no_cache --write_out --output_path results/$RESULT/mmlu_5shot.json --device cuda --num_fewshot 5

cd ..
python summarize_eval.py lm-evaluation-harness/results/$RESULT

Then just run with ./eval.sh [model_name] [result_name] and go make some tea...

@karpathy (Owner) commented

Ty @rhys101! I'm organizing everything together right now and running it.
Worth keeping in mind that we shouldn't use float16. If the model is trained in either bf16 or fp32, it cannot be inferenced in fp16, because fp16 has a reduced exponent range. It must be inferenced in bf16. I'll make some of the edits here.
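The exponent-range point is easy to see numerically (a quick illustration, unrelated to the PR code itself):

import torch

# bf16 keeps fp32's 8 exponent bits (similar ~3.4e38 range, less mantissa precision);
# fp16 has only 5 exponent bits, so anything above ~65504 overflows to inf.
print(torch.finfo(torch.float32).max)    # ~3.40e+38
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
print(torch.finfo(torch.float16).max)    # 65504.0

x = torch.tensor([1e6])
print(x.to(torch.bfloat16))              # ~1e6, survives the cast
print(x.to(torch.float16))               # inf: why bf16/fp32 weights can break in fp16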

@karpathy (Owner) commented

@rhys101 is there a reason I'm missing for why tensor_bf16 also returns the weights in fp32?

@karpathy (Owner) commented

For e.g. HellaSwag I get warnings like:

Token indices sequence length is longer than the specified maximum sequence length for this model (1091 > 1024). Running this sequence through the model will result in indexing errors

I'm not sure if the code handles this correctly, i.e. cropping the number of shots as needed to fit everything into the context.

@karpathy (Owner) commented

I am digging into why the Eleuther eval is SO slow. In llm.c it takes me like 1-2 seconds to evaluate HellaSwag, but here it is taking many long minutes, even messing with the batch size (which in your code defaults to 1).

@rhys101 (Author) commented Jun 14, 2024

The principle I was trying to follow was to keep float32 accuracy throughout, then convert if needed to float16/bfloat16. I've added the bfloat16 option in the latest push, but testing on the 774M model gives very low evals compared to float32 and float16, so it needs some thought/checking to see why.

Here are the float32 evals for the 774M model:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 30.887372 |
| gsm8k_5shot.json | 0.303260 |
| hellaswag_10shot.json | 57.797252 |
| mmlu_5shot.json | 26.173440 |
| truthfulqa_0shot.json | 35.708253 |
| winogrande_5shot.json | 59.589582 |
| Average Score | 35.076527 |

For float16:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 30.716724 |
| gsm8k_5shot.json | 0.151630 |
| hellaswag_10shot.json | 57.896833 |
| mmlu_5shot.json | 25.556487 |
| truthfulqa_0shot.json | 35.699654 |
| winogrande_5shot.json | 60.142068 |
| Average Score | 35.027233 |

Then bfloat16:

| Test | Score |
| --- | --- |
| arc_challenge_25shot.json | 24.658703 |
| gsm8k_5shot.json | 0.000000 |
| hellaswag_10shot.json | 35.172276 |
| mmlu_5shot.json | 25.212616 |
| truthfulqa_0shot.json | 38.108226 |
| winogrande_5shot.json | 50.276243 |
| Average Score | 28.904677 |

This happens with both this PR's code and the gist code (you can just edit config.json to set the torch_dtype used for inference; the evals pick this up). I need to look a bit more into what's happening with bfloat16 here.
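Concretely, the config.json edit I mean is just this (a sketch; the path is a placeholder for the local export directory):

import json

path = "gpt2-774M/config.json"
with open(path) as f:
    cfg = json.load(f)
cfg["torch_dtype"] = "bfloat16"   # or "float16" / "float32"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)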

For reference, the openai-community/gpt2-large average score is 32.07

@karpathy (Owner) commented

@rhys101 how long does it take for you to run e.g. only HellaSwag? I'm used to this taking only a few seconds in llm.c

@rhys101 (Author) commented Jun 14, 2024

> I am digging into why the Eleuther eval is SO slow. In llm.c it takes me like 1-2 seconds to evaluate HellaSwag, but here it is taking many long minutes, even messing with the batch size (which in your code defaults to 1).

Yes, it's unreal; running the full evals can take quite a while, the MMLU set especially. I've ordered the tests in the shell script so that the quickest run first, giving early guidance on which direction the evals are heading.

@karpathy (Owner) commented

I see in nvidia-smi that the GPUs are not utilized at all, which could be part of it. Why do the official docs recommend batch size 1? I tried raising it but it doesn't appear to be significantly faster.

@rhys101 (Author) commented Jun 14, 2024

I've just fired off a 124M eval on a local 4090 to measure:

| 0 NVIDIA GeForce RTX 4090 Off | 00000000:26:00.0 Off | Off |
| 55% 76C P0 314W / 350W | 5891MiB / 24564MiB | 98% Default |

(The GPU drops to 0% in between each test while the CPU does the data prep, often for a long time.)

Timings are (min:sec):

| Test | Samples | Runtime |
| --- | --- | --- |
| truthfulqa | 5882 | 01:08 |
| winogrande | 2534 | 00:26 |
| arc | 4687 | 05:45 |
| hellaswag | 40168 | 52:00 |

(hellaswag was still running; that's the indicative time at the start)

@matthewdouglas commented Jun 14, 2024

It's reasonable to add torch_dtype to the config.json. When I was testing the inference I typically loaded it with torch_dtype=torch.bfloat16. I only bothered to export the weights in bf16.

The lm-eval-harness should be using the GPU when --device cuda is set. The inference dtype should be changeable in the eval harness with e.g. --model_args dtype="bfloat16" or --model_args dtype="float32".

There should also be an "auto" option for the batch size.
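For example, a single HellaSwag run with both of those applied would look something like this (assuming the pinned harness commit supports the dtype arg and batch_size auto; the local model path and result name are placeholders):

python main.py --model hf-causal-experimental --model_args pretrained=./gpt2-124M-run1,dtype="bfloat16",use_accelerate=True,trust_remote_code=True --tasks hellaswag --num_fewshot 10 --batch_size auto --no_cache --write_out --output_path results/test/hellaswag_10shot.json --device cuda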

@karpathy (Owner) commented

merged

@karpathy closed this Jun 19, 2024