During the MMLU evaluation of our LLM, we encountered an issue where correct answers are being marked as incorrect due to format mismatches. Specifically, when the model outputs the correct numerical answer but includes a preceding letter (e.g., "C. 12" instead of just "12"), the scoring system fails to recognize it as correct.
Current Behavior
The scoring system marks answers as incorrect if they don't exactly match the reference answer format, even if the numerical value is correct.
Expected Behavior
The scoring system should be able to correctly identify and score answers that are numerically correct, regardless of minor formatting differences such as preceding letters.
Hi @DerryChan, unfortunately this is something we don't plan to support for the default built-in MMLU scenario.
Some suggestions that you could try for your use case:
You could change your model to respect the max_tokens parameter, which is set to 1 for MMLU. This will usually cause the model to output only the letter.
If your model is instruction-tuned, you can try adding an additional prompt that tells the model to only respond with a single letter. In particular, adding output_format_instructions=mmlu to your run entry (e.g. mmlu:output_format_instructions=mmlu,model=text) will add "Answer with only a single letter." to the prompt.
You could implement your own MMLU variant that uses a modified metric that performs the additional post-processing necessary to interpret your model's output.
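For the last suggestion, the post-processing step could look roughly like the sketch below. It is not part of HELM and does not use any HELM API: `extract_choice_letter` and `is_correct` are hypothetical helper names, and the sketch assumes four options labeled A through D with the reference stored as a single letter. A custom metric would call something like this before comparing the prediction with the reference.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of HELM.
# Assumption: options are labeled A-D and the reference is a single letter.
# The pattern accepts an optional "Answer:" prefix and optional parentheses,
# and requires the letter to be followed by a separator or end of string so
# that ordinary words such as "Cat" are not mistaken for a choice.
_LETTER_PATTERN = re.compile(
    r"^\s*(?:answer\s*[:\-]?\s*)?\(?([A-D])\)?(?=[\s.:)\-]|$)",
    re.IGNORECASE,
)


def extract_choice_letter(completion: str) -> Optional[str]:
    """Return the leading option letter in upper case, or None if absent."""
    match = _LETTER_PATTERN.match(completion)
    return match.group(1).upper() if match else None


def is_correct(completion: str, reference_letter: str) -> bool:
    """Compare the extracted letter with the reference letter."""
    predicted = extract_choice_letter(completion)
    return predicted is not None and predicted == reference_letter.strip().upper()


if __name__ == "__main__":
    for text in ["C. 12", "(b) 7", "Answer: C", "12", "Cat"]:
        print(repr(text), "->", extract_choice_letter(text))
```

An output with no leading letter (for example a bare "12") yields `None` and is scored as incorrect here. Mapping a bare option value back to its letter would need the question's option list and is left out of this sketch.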