
Download Hugging Face models into Hugging Face cache #1285

Open · wants to merge 4 commits into main from hf-cache

Conversation

@vmpuri (Contributor) commented Oct 9, 2024

Currently, we download models to a local directory (~/.torchchat by default). For Hugging Face models, we should download to the Hugging Face cache instead.

As per Hugging Face:

By default, we recommend using the [cache system](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache) to download files from the Hub. You can specify a custom cache location using the cache_dir parameter in [hf_hub_download()](https://huggingface.co/docs/huggingface_hub/v0.25.1/en/package_reference/file_download#huggingface_hub.hf_hub_download) and [snapshot_download()](https://huggingface.co/docs/huggingface_hub/v0.25.1/en/package_reference/file_download#huggingface_hub.snapshot_download), or by setting the [HF_HOME](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hf_home) environment variable.
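For reference, a minimal sketch of what downloading into the cache looks like with huggingface_hub (illustrative only; the repo id and the omission of an auth token are assumptions, not this PR's exact code):

# Sketch: download a repo into the default Hugging Face cache and get back
# the local snapshot directory. Gated repos also require an HF token.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",  # illustrative repo id
    # cache_dir=None -> use the default cache (~/.cache/huggingface/hub,
    # or wherever HF_HOME points)
)
print(local_dir)  # .../models--meta-llama--Llama-3.2-1B-Instruct/snapshots/<revision>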

This PR also enables hf_transfer, a production-ready Rust library that speeds up downloads from Hugging Face. In my own testing, the speedup was over 2x:

// Before 

python3 torchchat.py download llama3.2-1b 
...
5.63s user 7.31s system 29% cpu 43.139 total

// After
python3 torchchat.py download llama3.2-1b  
...
7.59s user 2.81s system 48% cpu 21.551 total
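For context, a sketch of how hf_transfer is typically opted into (assumes the package is installed, e.g. via pip install hf_transfer; not this PR's exact code):

# Sketch: hf_transfer is enabled via an environment variable, which must be
# set in the process before huggingface_hub starts the download.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"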

Testing

Download

python3 torchchat.py download llama3.2-1b

Downloading meta-llama/Meta-Llama-3.2-1B-Instruct from Hugging Face to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct
.gitattributes: 100%|████████████████| 1.52k/1.52k [00:00<00:00, 4.16MB/s]
LICENSE.txt: 100%|███████████████████| 7.71k/7.71k [00:00<00:00, 52.1MB/s]
README.md: 100%|█████████████████████| 35.9k/35.9k [00:00<00:00, 3.75MB/s]
USE_POLICY.md: 100%|█████████████████| 6.02k/6.02k [00:00<00:00, 8.77MB/s]
config.json: 100%|███████████████████████| 877/877 [00:00<00:00, 4.47MB/s]
generation_config.json: 100%|████████████| 189/189 [00:00<00:00, 1.21MB/s]
consolidated.00.pth: 100%|██████████▉| 2.47G/2.47G [01:10<00:00, 35.1MB/s]
original/params.json: 100%|██████████████| 220/220 [00:00<00:00, 1.12MB/s]
tokenizer.model: 100%|███████████████| 2.18M/2.18M [00:00<00:00, 7.80MB/s]
special_tokens_map.json: 100%|███████████| 296/296 [00:00<00:00, 1.62MB/s]
tokenizer.json: 100%|████████████████| 9.09M/9.09M [00:00<00:00, 19.8MB/s]
tokenizer_config.json: 100%|█████████| 54.5k/54.5k [00:00<00:00, 51.7MB/s]
Model downloaded to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14
Converting meta-llama/Meta-Llama-3.2-1B-Instruct to torchchat format...
known configs: ['llava-1.5', '13B', '70B', 'CodeLlama-7b-Python-hf', 'Meta-Llama-3.1-70B-Tune', '34B', 'Meta-Llama-3.1-8B', 'stories42M', 'Llama-Guard-3-1B', '30B', 'Meta-Llama-3.1-8B-Tune', 'stories110M', 'Llama-3.2-11B-Vision', 'Meta-Llama-3.2-3B', 'Meta-Llama-3.1-70B', 'Meta-Llama-3.2-1B', '7B', 'stories15M', 'Llama-Guard-3-1B-INT4', 'Mistral-7B', 'Meta-Llama-3-70B', 'Meta-Llama-3-8B']
Model config {'block_size': 131072, 'vocab_size': 128256, 'n_layers': 16, 'n_heads': 32, 'dim': 2048, 'hidden_dim': 8192, 'n_local_heads': 8, 'head_dim': 64, 'rope_base': 500000.0, 'norm_eps': 1e-05, 'multiple_of': 256, 'ffn_dim_multiplier': 1.5, 'use_tiktoken': True, 'max_seq_length': 8192, 'rope_scaling': {'factor': 32.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192}, 'n_stages': 1, 'stage_idx': 0, 'attention_bias': False, 'feed_forward_bias': False}
Symlinking checkpoint to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth.
Done.
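For illustration, the symlinking step above might look roughly like this (paths and file names are placeholders, not the PR's actual code):

# Sketch: expose the converted checkpoint as model.pth inside the cached
# snapshot directory, so later commands can find it by a stable name.
import os
from pathlib import Path

snapshot_dir = Path("/path/to/hf/cache/models--org--model/snapshots/abc123")  # placeholder
converted = snapshot_dir / "converted_checkpoint.pth"  # hypothetical converted output
link = snapshot_dir / "model.pth"
if converted.exists() and not link.exists():
    os.symlink(converted, link)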

Generate

python3 torchchat.py generate llama3.2-1b --prompt "Write a monologue from this opening line: 'Let me tell you what bugs me about human endeavor.'"   

Using checkpoint path: /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth
Using checkpoint path: /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/model.pth
Using device=mps 
Loading model...
Time to load model: 0.55 seconds
-----------------------------------------------------------
Write a monologue from this opening line: 'Let me tell you what bugs me about human endeavor.'"Let me tell you what bugs me about human endeavor. It's the capacity to create something, anything, that's not already out there, and to spend the vast majority of their time trying to make it better or more complex. It's the relentless pursuit of perfection, as if the only thing that matters is the product itself, not the journey.

We create for a reason, I suppose. We have emotions, desires, and problems to solve. And we find ways to tinker, to improvise, and to optimize. But the human problem-solving machine is a double-edged sword. On the one hand, it's incredibly resourceful and adaptable. We can solve complex systems, crack open hearts, and even bring forth unprecedented innovations.

But on the other hand, it's also a curse. We're obsessed with making something exactly right. We spend hours, days, even years tweaking and refining, only to have it devolve into something that's almost, but not quite,2024-10-09:13:31:17,670 INFO     [generate.py:1146] 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                
Generated 199 tokens                 
Time for inference 1: 5.7620 sec total                 
Time to first token: 0.2066 sec with parallel prefill.                

      Total throughput: 34.7103 tokens/sec, 0.0288 s/token                 
First token throughput: 4.8414 tokens/sec, 0.2066 s/token                 
 Next token throughput: 35.8208 tokens/sec, 0.0279 s/token                     
2024-10-09:13:31:17,671 INFO     [generate.py:1157] 
Bandwidth achieved: 104.03 GB/s
2024-10-09:13:31:17,671 INFO     [generate.py:1161] *** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================


      Average tokens/sec (total): 34.71                 
Average tokens/sec (first token): 4.84                 
Average tokens/sec (next tokens): 35.82 

Where

python3 torchchat.py where llama3.2-1b

/Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14
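A sketch of one way to resolve that cached snapshot path without any network access (the repo id is illustrative, and this is not necessarily how torchchat.py where is implemented in this PR):

from huggingface_hub import snapshot_download

# local_files_only=True returns the cached snapshot path and raises if the
# model has not been downloaded yet.
path = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    local_files_only=True,
)
print(path)  # .../snapshots/<revision>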

List

python3 torchchat.py list 

Model                                        Aliases                                                    Downloaded 
-------------------------------------------- ---------------------------------------------------------- -----------
meta-llama/llama-2-7b-hf                     llama2-base, llama2-7b                                                
meta-llama/llama-2-7b-chat-hf                llama2, llama2-chat, llama2-7b-chat                                   
meta-llama/llama-2-13b-chat-hf               llama2-13b-chat                                                       
meta-llama/llama-2-70b-chat-hf               llama2-70b-chat                                                       
meta-llama/meta-llama-3-8b                   llama3-base                                                           
meta-llama/meta-llama-3-8b-instruct          llama3, llama3-chat, llama3-instruct                       Yes        
meta-llama/meta-llama-3-70b-instruct         llama3-70b                                                            
meta-llama/meta-llama-3.1-8b                 llama3.1-base                                                         
meta-llama/meta-llama-3.1-8b-instruct        llama3.1, llama3.1-chat, llama3.1-instruct                            
meta-llama/meta-llama-3.1-70b-instruct       llama3.1-70b                                                          
meta-llama/meta-llama-3.1-8b-instruct-tune   llama3.1-tune, llama3.1-chat-tune, llama3.1-instruct-tune             
meta-llama/meta-llama-3.1-70b-instruct-tune  llama3.1-70b-tune                                                     
meta-llama/meta-llama-3.2-1b                 llama3.2-1b-base                                                      
meta-llama/meta-llama-3.2-1b-instruct        llama3.2-1b, llama3.2-1b-chat, llama3.2-1b-instruct        Yes        
meta-llama/llama-guard-3-1b                  llama3-1b-guard, llama3.2-1b-guard                                    
meta-llama/meta-llama-3.2-3b                 llama3.2-3b-base                                                      
meta-llama/meta-llama-3.2-3b-instruct        llama3.2-3b, llama3.2-3b-chat, llama3.2-3b-instruct                   
meta-llama/llama-3.2-11b-vision              llama3.2-11B-base, Llama-3.2-11B-Vision-base                          
meta-llama/llama-3.2-11b-vision-instruct     llama3.2-11B, Llama-3.2-11B-Vision, Llama-3.2-mm                      
meta-llama/codellama-7b-python-hf            codellama, codellama-7b                                               
meta-llama/codellama-34b-python-hf           codellama-34b                                                         
mistralai/mistral-7b-v0.1                    mistral-7b-v01-base                                                   
mistralai/mistral-7b-instruct-v0.1           mistral-7b-v01-instruct                                               
mistralai/mistral-7b-instruct-v0.2           mistral, mistral-7b, mistral-7b-instruct                              
openlm-research/open_llama_7b                open-llama, open-llama-7b                                             
stories15m                                                                                              Yes        
stories42m                                                                                                         
stories110m                                                                                             Yes 
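A sketch of how the Downloaded column could be derived from the Hugging Face cache (illustrative; not necessarily this PR's implementation):

from huggingface_hub import scan_cache_dir

# scan_cache_dir() walks the local HF cache and reports every cached repo.
cached_repo_ids = {repo.repo_id for repo in scan_cache_dir().repos}
print("meta-llama/Llama-3.2-1B-Instruct" in cached_repo_ids)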

Remove

ls ~/.cache/huggingface/hub/
models--meta-llama--Llama-3.2-1B-Instruct
version.txt

python3 torchchat.py remove llama3.2-1b
Removing downloaded model artifacts for llama3.2-1b at /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct...
Done.


ls ~/.cache/huggingface/hub/
version.txt

Remove Again (file not present)

python3 torchchat.py remove llama3.2-1b
Model llama3.2-1b has no downloaded artifacts in /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct.

@vmpuri added the enhancement label on Oct 9, 2024
@vmpuri requested review from orionr and Jack-Khuu on October 9, 2024 20:17

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1285

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 84602c8 with merge base 438ebb1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Oct 9, 2024
@vmpuri linked an issue on Oct 9, 2024 that may be closed by this pull request
@vmpuri marked this pull request as ready for review on October 9, 2024 20:39
@swolchok (Contributor) left a comment


I don't see any kind of migration from the old path to the new one. How are people going to know we've abandoned gigabytes of downloaded models? Can we check the old path and move models over?

@@ -106,3 +106,5 @@ fi
set -x
$PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.2" psutil=="6.0.0"
)

export HF_HUB_ENABLE_HF_TRANSFER=1

I don't think this will do anything unless people are running install_requirements.sh via source, which would be unusual.

@@ -73,7 +74,7 @@ def __post_init__(self):
or (self.pte_path and Path(self.pte_path).is_file())
):
raise RuntimeError(
"need to specified a valid checkpoint path, checkpoint dir, gguf path, DSO path, or PTE path"
f"need to specified a valid checkpoint path, checkpoint dir, gguf path, DSO path, or PTE path {self.checkpoint_path}"

as written, this reads like it's suggesting a path. maybe s/PTE path/PTE path, got:/

@Jack-Khuu (Contributor)

Thanks for taking this on!! It'll make things much easier to adopt.

PS: You'll wanna rebase this onto main (you can use the update branch button, then git pull on your machine).

@vmpuri force-pushed the hf-cache branch 2 times, most recently from 0ca3941 to 5bf2315 on October 9, 2024 21:19
@vmpuri (Contributor, Author) commented Oct 9, 2024

Fixing @swolchok's suggestions in follow-up commits.

Regarding this:

I don't see any kind of migration from the old path to the new one. How are people going to know we've abandoned gigabytes of downloaded models? Can we check the old path and move models over?

What's the best way to do this? I've currently implemented this as:

  • User calls download.
  • We check all model configs whose distribution channel is Hugging Face.
  • If that model exists in the old location, we delete it.

Testing:

python3 torchchat.py download llama3.2-3b
Cleaning up old model artifacts in /Users/puri/.torchchat/model-cache/meta-llama/Llama-2-7b-chat-hf. New artifacts will be downloaded to /Users/puri/.cache/huggingface/hub
Cleaning up old model artifacts in /Users/puri/.torchchat/model-cache/meta-llama/Meta-Llama-3.1-8B-Instruct. New artifacts will be downloaded to /Users/puri/.cache/huggingface/hub
Cleaning up old model artifacts in /Users/puri/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct. New artifacts will be downloaded to /Users/puri/.cache/huggingface/hub
Downloading meta-llama/Meta-Llama-3.2-3B-Instruct from Hugging Face to /Users/puri/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/392a143b624368100f77a3eafaa4a2468ba50a72
...

This should clean up everything in one fell swoop. We could figure out a "lazy" way of doing this, but I think this is best.
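A minimal sketch of that cleanup pass (the function name and config attributes below are hypothetical, not the PR's actual code):

import shutil
from pathlib import Path

def cleanup_old_hf_artifacts(model_configs, old_models_dir: Path) -> None:
    # For every known model distributed via Hugging Face, delete any copy
    # left in the old ~/.torchchat model cache.
    for config in model_configs:
        if not config.is_hf_distributed:  # hypothetical attribute
            continue
        old_path = old_models_dir / config.name  # hypothetical attribute
        if old_path.is_dir():
            print(f"Cleaning up old model artifacts in {old_path}")
            shutil.rmtree(old_path)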


def get_model_dir(model_config: ModelConfig, models_dir: Optional[Path]) -> Path:
"""
Returns the directory where the model artifacts are stored.

Suggested change:
- Returns the directory where the model artifacts are stored.
+ Returns the directory where the model artifacts are or would be stored.

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# For each model with huggingface distribution path, clean up the old location.
def _delete_old_hf_models(models_dir: Path):

Let's name this more concretely and give more context in the docstring

delete_old_hf_models is ambiguous

@Jack-Khuu (Contributor)

FYI @lessw2020 @kwen2501, we're going to be leveraging the Hugging Face cache now.

@Jack-Khuu (Contributor)

@vmpuri One important thing to double-check is the behavior if someone already has an HF model downloaded from a different project.

@byjlw (Contributor) commented Oct 14, 2024

Regarding this:

This should clean up everything in one fell swoop. We could figure out a "lazy" way of doing this, but I think this is best.

Why redownload?
Copying to the new location would be much more efficient.

Also, I recommend updating the description of this PR so it's very clear what we'll do and for how long, as well as adding details on how to change the default cache location.

@byjlw (Contributor) commented Oct 14, 2024

Since we're in the process of moving files to a new directory, I'd also like to understand how we plan to save and manage quantized models.

There are very few use cases which involve doing inference at full precision and thus we should expect that all users will want to quantize and manage the quantized models. In fact, it likely doesn't make sense to keep the full precision models after quantization, at least in most cases.

We shouldn't necessarily solve this problem in this diff, but there should be an RFC that covers the whole problem so that we can ensure this diff moves toward the overall solution. Otherwise we may disrupt people's file locations a second time.

@vmpuri (Contributor, Author) commented Oct 16, 2024

Why redownload?
Copying to the new location would be much more efficient.

True, that was my first thought, but there are two main things that make this more complex (and therefore more error-prone):

  1. Possible Existing Mutations - It's possible that artifacts in the existing ~/.torchchat directory have been modified despite having the same names. If we transferred them and a user then tried to use the model from another application, that application wouldn't be operating on the original files. We should ensure the Hugging Face cache is read-only - currently, torchchat/cli/convert_hf_checkpoint.py mutates files in the cache, so I'll need to address this in this PR. This is a good use case for the existing ~/.torchchat directory (and perhaps a good place for other generated artifacts, e.g. quantized models).
  2. Replicating the Directory Structure - I'm not sure how to source all of the hashes and paths for the artifacts, or the snapshot ID; I don't believe any of this information is preserved in the current download. Currently, all of the files are stored under their plain names in ~/.torchchat. However, Hugging Face downloads the files for a particular snapshot into <HF_CACHE_DIR>/models--some-model--model-1B/blobs with a hash as the filename. The blobs are symlinked to human-readable paths, e.g. <HF_CACHE_DIR>/models--some-model--model-1B/snapshots/<snapshot_id>/original/model.pth would correspond to a file in the blobs folder.

Given that re-downloading the models to the Hugging Face cache is guaranteed to place artifacts in the expected location, I figure the tradeoff between long-term reliability and one-time efficiency makes sense. Re-downloading ~50 GB of models would be painful once, but manageable (just ~33 minutes even on a relatively slow 25 MB/s connection).

Let me know your thoughts (or if you know a way to recover the snapshot ID and blob IDs from our current download).
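For illustration, the snapshot-to-blob layout described in point 2 can be inspected directly, since each file under snapshots/ is a symlink into blobs/ (the exact path below is illustrative):

from pathlib import Path

snapshot_file = Path(
    "~/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct"
    "/snapshots/e9f8effbab1cbdc515c11ee6e098e3d5a9f51e14/tokenizer.model"
).expanduser()
# resolve() follows the symlink to the content-addressed blob it points at.
print(snapshot_file.resolve())  # .../blobs/<hash>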

Labels: CLA Signed, enhancement
Projects: None yet
Development: Successfully merging this pull request may close these issues: Leverage the HF cache for models
5 participants