-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM #115
base: main
Are you sure you want to change the base?
Conversation
@@ -257,15 +258,15 @@ def complete_code( | |||
if s[0] == tokenizer.bos_token_id: | |||
s = s[1:] | |||
gen_code = tokenizer.decode( | |||
s, skip_special_tokens=False, clean_up_tokenization_spaces=False | |||
s, skip_special_tokens=False, clean_up_tokenization_spaces=clean_up_tokenization_spaces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one should be checked for correctness by maintainers. Earlier was set to False, but with the flag being passed, the default behaviour is now conditioned on the flag.
More explicitly, if a user does not supply the arg --clean_up_tokenizations=False
the expected default behaviour of this code path changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually most models (including replit) and tasks decoding happen here with clean_up_tokenizations
as False because eos_token
is in stop_words
whenever it exists. I suggest we change the argument definition so it stores False and becomes True when called by the user. For the else scenario, default behavior will change but I don't think that impacts performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! clean_up_tokenization_spaces
is already set to False during decoding for most tasks and models including replit model, so I suggested below that we keep it as False by default and True if a user specifies the argument.
The automodel_kwargs
additions looks good.
@@ -257,15 +258,15 @@ def complete_code( | |||
if s[0] == tokenizer.bos_token_id: | |||
s = s[1:] | |||
gen_code = tokenizer.decode( | |||
s, skip_special_tokens=False, clean_up_tokenization_spaces=False | |||
s, skip_special_tokens=False, clean_up_tokenization_spaces=clean_up_tokenization_spaces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually most models (including replit) and tasks decoding happen here with clean_up_tokenizations
as False because eos_token
is in stop_words
whenever it exists. I suggest we change the argument definition so it stores False and becomes True when called by the user. For the else scenario, default behavior will change but I don't think that impacts performance.
action="store_false", | ||
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
action="store_false", | |
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True", | |
action="store_true", | |
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to True for specific models, defaults to False", |
see comment above
Why
We require the ability to configure the
tokenizer.decode
call, as well as model args in theAutoModelForCausalLM.from_pretrained
to support models like ReplitLM.What changed
We add two input arguments with safe default behaviour to the
main.py
script:clean_up_tokenization_spaces
:bool
tokenizer.decode
to prevent tokenization spaces from being cleaned up. This flag affects spacing and therefore syntax in generated code with certain tokenizers such as the ReplitLM tokenizer.True
, storesFalse
automodel_kwargs
:json.loads
, aka. a "stringified" JSONAutoModelForCausalLM.from_pretrained
askwargs
. See the logic of why and how this works here in thetransformers
documentation."{}"
.Rollout
[x] This is fully backward and forward compatible