[REVIEW] Fix Padding Related Bugs: `Crossfit` #66

VibhuJawa · 2024-07-31T07:08:51Z

This PR plans to fix:

TODO:

Add tests
Verify the fix for "Error with different padding size in batch" (#65); @ryantwolf can probably help.

Signed-off-by: Vibhu Jawa <[email protected]>

sarahyurick

Could you please add an example test for padding_side="right" vs padding_side="left"?

VibhuJawa · 2024-08-02T06:34:12Z

> Could you please add an example test for padding_side="right" vs padding_side="left"?

Added
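
For readers following the thread, the kind of check requested above can be sketched roughly as follows. This is a plain-Python illustration, not the test that was added in this PR: `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption made only for the example.

```python
from typing import List


def pad_batch(seqs: List[List[int]], pad_id: int, padding_side: str = "right") -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch.

    Hypothetical helper for illustration; not part of crossfit.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side!r}")
    width = max((len(s) for s in seqs), default=0)
    padded = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        # Left padding puts the fill before the tokens, right padding after them.
        padded.append(fill + s if padding_side == "left" else s + fill)
    return padded


def test_padding_side() -> None:
    batch = [[5, 6, 7], [8]]
    assert pad_batch(batch, pad_id=0, padding_side="right") == [[5, 6, 7], [8, 0, 0]]
    assert pad_batch(batch, pad_id=0, padding_side="left") == [[5, 6, 7], [0, 0, 8]]


test_padding_side()
```

The same token ids end up in different columns depending on the side, which is why any downstream clipping has to know the padding direction.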

ryantwolf

Just some minor comments, thanks for fixing this!

crossfit/backend/torch/hf/model.py

crossfit/backend/torch/loader.py

VibhuJawa · 2024-08-05T04:24:01Z

Resolved comments @ryantwolf

VibhuJawa · 2024-08-05T04:28:57Z

Verified same results before and after memory changes

import os
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

new_model_dir = "/home/nfs/vjawa/.cf/memory" 
old_model_dir = "/home/nfs/vjawa/bkp_memory_curves"

model_name = "microsoft/deberta-v3-base"
#model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_fname = "mem_model.pkl"

old_model_path = os.path.join(old_model_dir, model_name, model_fname)
new_model_path = os.path.join(new_model_dir, model_name, model_fname)

old_model = joblib.load(old_model_path)
new_model = joblib.load(new_model_path)

# Compare the old fit against the new fit (not each model against itself).
assert np.allclose(old_model.coef_, new_model.coef_)
assert np.isclose(old_model.intercept_, new_model.intercept_)

ryantwolf

Two minor things remain that you might want to fix, but other than that looks good to me.

crossfit/backend/torch/hf/model.py

sarahyurick

Tests LGTM, thanks!

VibhuJawa · 2024-08-05T18:14:23Z

@ryantwolf, I addressed your review and added improvements. Thanks again for the careful review and for helping improve performance and the user experience.

crossfit/backend/torch/hf/model.py

ryantwolf · 2024-08-05T19:49:16Z

crossfit/backend/torch/hf/memory_curve_utils.py

+    X: list[list[int]] = []
+    y: list[float] = []
+
+    max_seq = min(AutoTokenizer.from_pretrained(path_or_name).model_max_length, end_seq_len)


Similar to the other issue, I think every use of AutoTokenizer or AutoConfig in this file should instead call the corresponding method on the model. I'm now getting this error with Llama Guard:

Traceback (most recent call last):
  File "/usr/local/bin/aegis_classifier_inference", line 8, in <module>
    sys.exit(console_script())
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/scripts/aegis_classifier_inference.py", line 126, in console_script
    main()
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/scripts/aegis_classifier_inference.py", line 62, in main
    domain_classifier = AegisClassifier(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/classifiers/aegis.py", line 161, in __init__
    model = AegisHFModel(config=config)
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/classifiers/aegis.py", line 90, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/crossfit/backend/torch/hf/model.py", line 58, in __init__
    self.mem = fit_memory_estimate_curve(
  File "/usr/local/lib/python3.10/dist-packages/crossfit/backend/torch/hf/memory_curve_utils.py", line 44, in fit_memory_estimate_curve
    max_seq = min(AutoTokenizer.from_pretrained(path_or_name).model_max_length, end_seq_len)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 843, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2032, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'meta-llama/LlamaGuard-7b'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'meta-llama/LlamaGuard-7b' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Resolved by: 987836f
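
The suggestion above, asking the model object for what it knows instead of reloading a tokenizer by name, can be sketched as below. This is only an illustration under stated assumptions and not the code from 987836f: `resolve_max_seq`, `SupportsMaxSeqLength`, and `GatedModelStub` are hypothetical names, and the stub stands in for a model subclass that overrides how its tokenizer or config is loaded.

```python
from typing import Protocol


class SupportsMaxSeqLength(Protocol):
    """Anything that can report its own maximum sequence length (hypothetical interface)."""

    def max_seq_length(self) -> int: ...


def resolve_max_seq(model: SupportsMaxSeqLength, end_seq_len: int) -> int:
    """Cap the sequence-length sweep using the model object itself.

    No tokenizer is loaded by name here, so subclasses with custom loading
    logic are respected.
    """
    return min(model.max_seq_length(), end_seq_len)


class GatedModelStub:
    """Stand-in for a model subclass with its own loading logic (assumption for the example)."""

    def __init__(self, token: str, max_len: int) -> None:
        self._token = token
        self._max_len = max_len

    def max_seq_length(self) -> int:
        # A real subclass would consult its own tokenizer or config here.
        return self._max_len


model = GatedModelStub(token="hf_placeholder", max_len=4096)
print(resolve_max_seq(model, end_seq_len=2048))  # prints 2048
```

Routing the lookup through the model keeps a single place where loading details live, so a generic call by name cannot bypass them.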

ryantwolf

My tests ran successfully. Thanks so much for doing this @VibhuJawa!

VibhuJawa added 4 commits July 31, 2024 00:05

Add crossfit bits

130d465

Signed-off-by: Vibhu Jawa <[email protected]>

Add padding fixes

67ed3a4

Signed-off-by: Vibhu Jawa <[email protected]>

Fix test

b20b4e9

Signed-off-by: Vibhu Jawa <[email protected]>

Add docstrings

91dd34d

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa marked this pull request as ready for review July 31, 2024 08:00

VibhuJawa changed the title from "[DRAFT] Fix Padding Related Bugs: Crossfit" to "[REVIEW] Fix Padding Related Bugs: Crossfit" Jul 31, 2024

VibhuJawa added 2 commits July 31, 2024 01:04

fix torch import

1b02e4b

Signed-off-by: Vibhu Jawa <[email protected]>

fix torch import

d63761e

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa requested a review from sarahyurick July 31, 2024 08:12

sarahyurick suggested changes Jul 31, 2024

VibhuJawa added 6 commits July 31, 2024 16:00

fix padding to only pad the last dim

2c19aa2

Signed-off-by: Vibhu Jawa <[email protected]>

fix padding tests

3109e1f

Signed-off-by: Vibhu Jawa <[email protected]>

Add test for left/right

7d0fb6f

Signed-off-by: Vibhu Jawa <[email protected]>

Skip test for cf_loader

97b3aea

Signed-off-by: Vibhu Jawa <[email protected]>

Fix bugs in clipping

62aed47

Signed-off-by: Vibhu Jawa <[email protected]>

Fix bugs in clipping

bd73727

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf reviewed Aug 2, 2024

crossfit/backend/torch/hf/model.py (outdated)

crossfit/backend/torch/hf/model.py (outdated)

crossfit/backend/torch/loader.py (outdated)

crossfit/backend/torch/loader.py (outdated)

VibhuJawa added 3 commits August 4, 2024 19:43

Add early stopping to HF memory estimation

bbdc4f2

Signed-off-by: Vibhu Jawa <[email protected]>

Fix copy-right year

4a2b544

Signed-off-by: Vibhu Jawa <[email protected]>

Add copyright year

a553b11

Signed-off-by: Vibhu Jawa <[email protected]>

VibhuJawa mentioned this pull request Aug 5, 2024

Fix clipping for models with non default padding id and direction #58

Closed

ryantwolf mentioned this pull request Aug 5, 2024

Add AEGIS classifier NVIDIA/NeMo-Curator#172

Merged

3 tasks

ryantwolf reviewed Aug 5, 2024

crossfit/backend/torch/hf/model.py (outdated)

sarahyurick approved these changes Aug 5, 2024

VibhuJawa added 2 commits August 5, 2024 10:47

Address last of Ryan's reviews

a6f4ac2

Signed-off-by: Vibhu Jawa <[email protected]>

Skip loading model if its allready fitted

de6fb38

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf reviewed Aug 5, 2024

crossfit/backend/torch/hf/model.py (outdated)

VibhuJawa added 2 commits August 5, 2024 11:54

Use self.load_cfg instead of AutoConfig.from_pretrained

23a8aeb

Signed-off-by: Vibhu Jawa <[email protected]>

Use self.load_cfg instead of AutoConfig.from_pretrained

15e27bb

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf reviewed Aug 5, 2024

Fix memory_curve_utils and skip loading cfg/tokenizer here

987836f

Signed-off-by: Vibhu Jawa <[email protected]>

ryantwolf approved these changes Aug 5, 2024

VibhuJawa merged commit 0cc2993 into rapidsai:main Aug 5, 2024
10 checks passed

This was referenced Aug 6, 2024

Error with different padding size in batch #65

Closed

[BUG] clipping logic fails when padding token is not 0. #59

Closed
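
As a rough illustration of the clipping issue referenced above (#59), the sketch below trims all-padding columns using the tokenizer's actual pad id and padding side instead of assuming that padding is 0 and sits on the right. It is plain Python with a hypothetical `clip_padding` helper, not the implementation merged in this PR.

```python
from typing import List


def clip_padding(batch: List[List[int]], pad_id: int, padding_side: str = "right") -> List[List[int]]:
    """Drop columns that are padding in every row, looking only at the padded side.

    Hypothetical helper; assumes all rows already share the same width.
    """
    if padding_side not in ("left", "right"):
        raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side!r}")
    if not batch:
        return batch
    width = len(batch[0])

    def n_pad(row: List[int]) -> int:
        # Count consecutive pad ids starting from the padded edge of the row.
        tokens = row if padding_side == "left" else reversed(row)
        count = 0
        for tok in tokens:
            if tok != pad_id:
                break
            count += 1
        return count

    # Keep as many columns as the row with the least padding needs.
    keep = width - min(n_pad(row) for row in batch)
    if padding_side == "left":
        return [row[width - keep:] for row in batch]
    return [row[:keep] for row in batch]


# Pad id 1 (not 0): a check such as `tok != 0` would see no padding here at all.
right_padded = [[5, 6, 1, 1], [7, 1, 1, 1]]
left_padded = [[1, 1, 5, 6], [1, 1, 1, 7]]
print(clip_padding(right_padded, pad_id=1, padding_side="right"))
print(clip_padding(left_padded, pad_id=1, padding_side="left"))
```

Counting padding only from the padded edge means a pad id that also appears inside a sequence does not cause real tokens to be dropped.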
