
Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM #115

Open

wants to merge 3 commits into base: main
Conversation

@madhavatreplit commented Jul 12, 2023

Why

We need the ability to configure the tokenizer.decode call, as well as the model kwargs passed to AutoModelForCausalLM.from_pretrained, to support models like ReplitLM.

What changed

We add two input arguments with safe default behaviour to the main.py script (a sketch of the wiring follows the list):

  1. clean_up_tokenization_spaces : bool
  • This boolean flag is passed to tokenizer.decode to prevent tokenization spaces from being cleaned up. The flag affects spacing, and therefore syntax, in code generated with certain tokenizers such as the ReplitLM tokenizer.
  • Defaults to True; passing the flag stores False (argparse store_false semantics).
  2. automodel_kwargs : json.loads, i.e. a "stringified" JSON
  • A "stringified" JSON that sets which default config values should be overridden in this harness to reproduce results.
  • Updates the default init config key-values by being passed into AutoModelForCausalLM.from_pretrained as kwargs; see the transformers documentation for why and how this works.
  • Defaults to the empty stringified JSON: "{}".
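
A minimal sketch of how these two arguments might be wired up. The argument names follow the description above; the checkpoint name and exact placement in main.py are illustrative assumptions, not code from this PR:

```python
import argparse
import json

from transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()

# store_false: the default is True, and passing the bare flag flips it to
# False, i.e. the "defaults to True, stores False" behaviour described above.
parser.add_argument(
    "--clean_up_tokenization_spaces",
    action="store_false",
    help="Pass to set clean_up_tokenization_spaces=False in tokenizer.decode()",
)

# type=json.loads converts the "stringified" JSON into a dict. argparse also
# runs the type converter on string defaults, so the default becomes {}.
parser.add_argument(
    "--automodel_kwargs",
    type=json.loads,
    default="{}",
    help="Stringified JSON of kwargs forwarded to AutoModelForCausalLM.from_pretrained",
)

args = parser.parse_args()

# The parsed kwargs override default config values at model load time.
model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",  # illustrative checkpoint, not from the PR
    trust_remote_code=True,      # ReplitLM ships custom modeling code
    **args.automodel_kwargs,
)
```

An invocation overriding one config value would then look something like this (note that JSON booleans are lowercase):

```
python main.py --clean_up_tokenization_spaces --automodel_kwargs '{"use_cache": false}'
```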

Rollout

[x] This is fully backward and forward compatible

```diff
@@ -257,15 +258,15 @@ def complete_code(
             if s[0] == tokenizer.bos_token_id:
                 s = s[1:]
             gen_code = tokenizer.decode(
-                s, skip_special_tokens=False, clean_up_tokenization_spaces=False
+                s, skip_special_tokens=False, clean_up_tokenization_spaces=clean_up_tokenization_spaces
```
@madhavatreplit (Author) commented Jul 12, 2023:

This one should be checked for correctness by maintainers. Earlier this was hardcoded to False, but with the flag being passed, the behaviour is now conditioned on the flag.

More explicitly, if a user does not supply the --clean_up_tokenization_spaces flag, the expected default behaviour of this code path changes.
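
For context on what is at stake here: with clean_up_tokenization_spaces=True, transformers strips spaces before punctuation after decoding, which can silently change whitespace-sensitive generated code. A minimal illustration, using the GPT-2 tokenizer purely for convenience (the PR reports the same sensitivity for the ReplitLM tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("x = f( a , b )")["input_ids"]

# The cleanup pass collapses " ," to ",", so the decoded string no longer
# round-trips; with cleanup disabled the original spacing is preserved.
print(tokenizer.decode(ids, clean_up_tokenization_spaces=True))   # x = f( a, b )
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # x = f( a , b )
```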

A collaborator replied:

Actually, for most models (including replit) and tasks, decoding happens here with clean_up_tokenization_spaces as False, because eos_token is in stop_words whenever it exists. I suggest we change the argument definition so that it defaults to False and becomes True when the user passes the flag. For the else scenario, the default behavior will change, but I don't think that impacts performance.

@madhavatreplit changed the title from "Adding additional optional arguments for decoding flags and AutoModel kwargs to support models like ReplitLM" to "Adding additional optional args for decoding flags and AutoModel kwargs to support models like ReplitLM" Jul 12, 2023
@madhavatreplit marked this pull request as ready for review July 12, 2023 18:09
@loubnabnl (Collaborator) left a comment:

Thanks for the PR! clean_up_tokenization_spaces is already set to False during decoding for most tasks and models, including the replit model, so I suggested below that we keep it False by default and set it to True only when a user specifies the argument.
The automodel_kwargs addition looks good.


Comment on lines +107 to +108
action="store_false",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
action="store_false",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to False for specific models, defaults to True",
action="store_true",
help="Set the clean_up_tokenization_spaces in tokenizer.decode() to True for specific models, defaults to False",

see comment above
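
For reference, the two argparse actions under discussion behave as follows. This is a generic sketch, not code from this PR, and the flag names are made up:

```python
import argparse

parser = argparse.ArgumentParser()
# store_false: default is True; passing the flag sets it to False
# (the PR's original definition).
parser.add_argument("--old-style", action="store_false")
# store_true: default is False; passing the flag sets it to True
# (the reviewer's suggestion).
parser.add_argument("--new-style", action="store_true")

print(parser.parse_args([]))
# Namespace(old_style=True, new_style=False)
print(parser.parse_args(["--old-style", "--new-style"]))
# Namespace(old_style=False, new_style=True)
```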
