
Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM #115

Open

wants to merge 3 commits into base: main
Conversation

@madhavatreplit commented Jul 12, 2023

Why

We need the ability to configure the tokenizer.decode call, as well as the model kwargs passed to AutoModelForCausalLM.from_pretrained, to support models like ReplitLM.

What changed

We add two input arguments with safe default behaviour to the main.py script (a sketch of the wiring follows the list):

  1. clean_up_tokenization_spaces : bool
  • This boolean flag is passed to tokenizer.decode to prevent tokenization spaces from being cleaned up. The flag affects spacing, and therefore syntax, in code generated with certain tokenizers such as the ReplitLM tokenizer.
  • Defaults to True; passing the flag stores False (argparse store_false semantics).
  2. automodel_kwargs : json.loads, i.e. a "stringified" JSON
  • A "stringified" JSON that sets which default config values should be overridden in this harness to reproduce results.
  • Updates the default init config key-values by being passed into AutoModelForCausalLM.from_pretrained as kwargs; see the transformers documentation for why and how this works.
  • Defaults to the empty stringified JSON: "{}".
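
A minimal sketch of how these two arguments might be wired up. The argument names follow the description above; the checkpoint name and exact placement in main.py are illustrative assumptions, not code from this PR:

```python
import argparse
import json

from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()

# store_false: the default is True, and passing the bare flag flips it to
# False, i.e. the "defaults to True, stores False" behaviour described above.
parser.add_argument(
    "--clean_up_tokenization_spaces",
    action="store_false",
    help="Pass to set clean_up_tokenization_spaces=False in tokenizer.decode()",
)

# type=json.loads converts the "stringified" JSON into a dict. argparse also
# runs the type converter on string defaults, so the default becomes {}.
parser.add_argument(
    "--automodel_kwargs",
    type=json.loads,
    default="{}",
    help="Stringified JSON of kwargs forwarded to AutoModelForCausalLM.from_pretrained",
)

args = parser.parse_args()

# The parsed kwargs override default config values at model load time.
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",  # illustrative checkpoint, not from the PR
    trust_remote_code=True,      # ReplitLM ships custom modeling code
    **args.automodel_kwargs,
)
```

An invocation overriding one config value would then look something like this (note that JSON booleans are lowercase):

```
python main.py --clean_up_tokenization_spaces --automodel_kwargs '{"use_cache": false}'
```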

Rollout

[x] This is fully backward and forward compatible

```diff
@@ -257,15 +258,15 @@ def complete_code(
             if s[0] == tokenizer.bos_token_id:
                 s = s[1:]
             gen_code = tokenizer.decode(
-                s, skip_special_tokens=False, clean_up_tokenization_spaces=False
+                s, skip_special_tokens=False, clean_up_tokenization_spaces=clean_up_tokenization_spaces
```
@madhavatreplit (Author) commented Jul 12, 2023:

This one should be checked for correctness by maintainers. Earlier this was hardcoded to False, but with the flag being passed, the behaviour is now conditioned on the flag.

More explicitly, if a user does not supply the --clean_up_tokenization_spaces flag, the expected default behaviour of this code path changes.
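
For context on what is at stake here: with clean_up_tokenization_spaces=True, transformers strips spaces before punctuation after decoding, which can silently change whitespace-sensitive generated code. A minimal illustration, using the GPT-2 tokenizer purely for convenience (the PR reports the same sensitivity for the ReplitLM tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("x = f( a , b )")["input_ids"]

# The cleanup pass collapses " ," to ",", so the decoded string no longer
# round-trips; with cleanup disabled the original spacing is preserved.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))   # x = f( a, b )
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # x = f( a , b )
```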

A collaborator replied:

Actually, for most models (including replit) and tasks, decoding happens here with clean_up_tokenization_spaces as False, because eos_token is in stop_words whenever it exists. I suggest we change the argument definition so that it defaults to False and becomes True when the user passes the flag. For the else scenario, the default behavior will change, but I don't think that impacts performance.

@madhavatreplit changed the title from "Adding additional optional arguments for decoding flags and AutoModel kwargs to support models like ReplitLM" to "Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM" Jul 12, 2023
@madhavatreplit marked this pull request as ready for review July 12, 2023 18:09
@loubnabnl (Collaborator) left a comment:

Thanks for the PR! clean_up_tokenization_spaces is already set to False during decoding for most tasks and models, including the replit model, so I suggested below that we keep it False by default and set it to True only when a user specifies the argument.
The automodel_kwargs addition looks good.


Comment on lines +107 to +108
action="store_false",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
action="store_false",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True",
action="store_true",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to True for specific models, defaults to False",

see comment above
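
For reference, the two argparse actions under discussion behave as follows. This is a generic sketch, not code from this PR, and the flag names are made up:

```python
import argparse

parser = argparse.ArgumentParser()
# store_false: default is True; passing the flag sets it to False
# (the PR's original definition).
parser.add_argument("--old-style", action="store_false")
# store_true: default is False; passing the flag sets it to True
# (the reviewer's suggestion).
parser.add_argument("--new-style", action="store_true")

print(parser.parse_args([]))
# Namespace(old_style=True, new_style=False)
print(parser.parse_args(["--old-style", "--new-style"]))
# Namespace(old_style=False, new_style=True)
```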
