Update Caching logic to only trigger on the first inference sample #1369

Jack-Khuu · 2024-11-13T02:56:56Z

When the model's cache is already set up, there is no need to call setup_caches each time a sample is passed in.
Repeated calls are normally harmless, but torchtune is noisy (as it should be) when setup_caches is called unnecessarily.

This PR adds a check so that the caches are set up only for the first sample.
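A minimal sketch of the first-sample check, assuming a toy model in place of the real one. `ToyModel`, `run_samples`, and the simplified `generate` signature are hypothetical stand-ins for illustration; only the `skip_cache_setup` flag and the `setup_caches(max_batch_size=1, max_seq_length=...)` call come from the diff in this PR, and the way the caller derives the flag from the sample index is an assumption.

```python
class ToyModel:
    """Hypothetical stand-in for a model that owns key-value caches."""

    def __init__(self):
        self.caches_ready = False
        self.warnings = []

    def setup_caches(self, max_batch_size, max_seq_length):
        # Mimic the noisy behaviour: a second call is skipped with a warning.
        if self.caches_ready:
            self.warnings.append("caches already set up; skipping")
            return
        self.caches_ready = True


def generate(model, prompt, max_seq_length, skip_cache_setup=False):
    # The guard added by this PR: only set up caches when not told to skip.
    if not skip_cache_setup:
        model.setup_caches(max_batch_size=1, max_seq_length=max_seq_length)
    return f"output for {prompt!r}"


def run_samples(model, prompt, num_samples, max_seq_length=128):
    # Assumption: the caller skips cache setup for every sample after the first.
    return [
        generate(model, prompt, max_seq_length, skip_cache_setup=(i > 0))
        for i in range(num_samples)
    ]
```

With this guard in place, running several samples against the same model calls `setup_caches` once, so the toy model records no warnings.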

Warnings that no longer appear

Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.
Key value caches are already setup. You cannot call ``setup_caches()`` twice. Skipping.

Generation after fix (no warning)

python torchchat.py generate llama3.2-11B --prompt "What's in this image?" --image-prompt assets/dog.jpg  --num-samples 2

Note: NumExpr detected 22 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.6.0.dev20241002+cu121 available.
lm_eval is not installed, GPTQ may not be usable
Using device=cuda NVIDIA PG509-210
Loading model...
Time to load model: 10.45 seconds
-----------------------------------------------------------
What's in this image?The image features a dog sitting on a skateboard with its tongue out, sporting sunglasses. The dog has a white chest with brown ears and a brown patch of fur between its eyes and nose. It wears a blue collar and red sunglasses. The skateboard is red and yellow, with two yellow wheels on either side, and the dog appears to be sitting on top of it while facing the camera. The background of the image is blurry but seems to feature a paved road lined with green grass and trees.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 99 tokens
Time for inference 1: 17.3737 sec total
Time to first token: 2.5819 sec with parallel prefill.

      Total throughput: 5.7558 tokens/sec, 0.1737 s/token
First token throughput: 0.3873 tokens/sec, 2.5819 s/token
 Next token throughput: 6.6929 tokens/sec, 0.1494 s/token

Bandwidth achieved: 122.55 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

What's in this image?The image depicts a medium-sized white dog sitting on a red skateboard on an asphalt path. The dog has brown ears and a tan patch over one eye, giving it a slightly inquisitive appearance. Its tongue is protruding slightly from its mouth, which is slightly open, suggesting that the dog may be panting or playing along with the photo.

The dog is wearing red-framed sunglasses with black lenses, an alternative to a pair of goggles, and a blue collar. The skateboard features yellow wheels and has the word "CRAZ" written on the underside. The dog's body is facing forward, but it's looking toward the camera with its head turned slightly to the side, as if posing.

The background of the image shows a green grassy area and a hedge or bush behind it. The overall atmosphere suggests that the dog is enjoying a fun day out, possibly on a sunny day, and is ready to take a ride on its skateboard. The image is likely intended to be
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generated 199 tokens
Time for inference 2: 32.2208 sec total
Time to first token: 1.3305 sec with parallel prefill.

      Total throughput: 6.2072 tokens/sec, 0.1611 s/token
First token throughput: 0.7516 tokens/sec, 1.3305 s/token
 Next token throughput: 6.4421 tokens/sec, 0.1552 s/token

Bandwidth achieved: 132.16 GB/s

========================================

pytorch-bot · 2024-11-13T02:56:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1369

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7047d79 with merge base 93f713f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Jack-Khuu · 2024-11-13T18:25:53Z

torchchat/generate.py

-                        max_batch_size=1,
-                        max_seq_length=max_seq_length,
-                    )
+            if not skip_cache_setup:


This is the only change in this block; the rest is whitespace.

Is there any way to tell the cache status directly from the model, instead of forwarding a new attribute from outside?

Not off the top of my head, but it's definitely worth baking into our model abstraction in the future.
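As a rough illustration of what baking this into the model abstraction could look like, here is a sketch in which the model tracks its own cache status, so callers need no extra flag. Everything here is hypothetical: `CacheAwareModel`, `has_caches`, and `ensure_caches` are invented names for this example and are not part of torchchat or torchtune.

```python
class CacheAwareModel:
    """Hypothetical model wrapper that knows whether its caches exist."""

    def __init__(self):
        self._caches_ready = False
        self.setup_count = 0  # counts real setup work, for illustration only

    def has_caches(self):
        # Hypothetical status query; lets callers ask instead of tracking state.
        return self._caches_ready

    def setup_caches(self, max_batch_size, max_seq_length):
        self.setup_count += 1
        self._caches_ready = True

    def ensure_caches(self, max_batch_size, max_seq_length):
        # Idempotent entry point: sets up caches only when they are missing.
        if not self.has_caches():
            self.setup_caches(max_batch_size, max_seq_length)
```

A caller could then invoke `ensure_caches` before every sample without passing state in from outside; the trade-off is that the model abstraction takes on the bookkeeping.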

Gasoonjia · 2024-11-13T19:26:23Z

torchchat/generate.py

@@ -591,6 +591,7 @@ def generate(
            Dict[str, Any]
        ] = None,  # List of Image prompt tensors for multimodal models
        start_pos: int = 0,
+        skip_cache_setup: bool = False,


I'm OK with this for now, but introducing new inputs into the generate function might trigger my nightmare 😣, moving it farther away from our target.
I would rather delegate this to the model side to suppress the warning messages.

I agree, it's not great. Luckily it's light, so we can abstract it easily later on.

Jack-Khuu added 2 commits November 12, 2024 16:14

Only set up during the first sample

116c5c2

Cleaner

0163f61

Jack-Khuu requested review from Gasoonjia and vmpuri November 13, 2024 02:56

facebook-github-bot added the CLA Signed label Nov 13, 2024

Merge branch 'main' into skip-setup

7047d79

Jack-Khuu commented Nov 13, 2024

Gasoonjia reviewed Nov 13, 2024

Gasoonjia approved these changes Nov 13, 2024

Jack-Khuu merged commit 6eae887 into main Nov 13, 2024
52 checks passed
