GPQA scenario and tests added #3068
Both the scenario (GPQA, 3 subsets, with HF datasets) and the tests are added. All tests pass. @yifanmai Regarding the run spec function, should we create one in
Currently seeing issues with the automatic checks; will look into it.
FYI
instances = scenario.get_instances(tmpdir)
assert len(instances) == 448
assert instances[0].input == Input(
text="A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?" |
Avoid posting plain text from the dataset, as per the instructions from the dataset card for GPQA. Just check the string length instead.
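One way to follow this suggestion is to assert on the length of the first input's text rather than on its content. The following is a minimal sketch only: `Input` and `Instance` are stand-in dataclasses, `check_first_instance` is a hypothetical helper, and the expected length is that of a made-up placeholder string, not a value taken from GPQA.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Input:
    """Stand-in for the Input class used in the test; only `text` is modeled."""
    text: str


@dataclass(frozen=True)
class Instance:
    """Stand-in for the Instance class; only `input` is modeled."""
    input: Input


def check_first_instance(instances: List[Instance], expected_count: int, expected_first_len: int) -> None:
    """Check the instance count and the length of the first input, without embedding dataset text."""
    assert len(instances) == expected_count
    assert len(instances[0].input.text) == expected_first_len


# Made-up data so the sketch runs without the real dataset.
fake_instances = [Instance(Input("placeholder question?")), Instance(Input("another one"))]
check_first_instance(fake_instances, expected_count=2, expected_first_len=21)
```

A length check is weaker than an equality check, but it still catches loading the wrong row or subset while keeping the question text out of the repository.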
instance = Instance(
    input=input,
    references=references,
    split=TRAIN_SPLIT
In the paper, the authors defined five fixed train split examples in this file. All other examples are test split.
You should also filter out the train examples from the test set. Unfortunately, the examples in the train split are also in the Hugging Face dataset, so you may have to do this filtering manually. You could do something like hardcoding the indexes of the train examples, but the indexes might be different depending on the subset.
Also see the discussion here:
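The index-based filtering described in this comment could be sketched roughly as follows. The subset names and index values are placeholders for illustration, not the real GPQA train indexes, and `split_rows` is a hypothetical helper; the mapping is keyed by subset because, as noted, the indexes may differ between subsets.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Placeholder mapping: subset name -> row indexes of the fixed train examples.
# Both the names and the indexes are made up for this sketch.
EXCLUDED_TRAIN_EXAMPLES: Dict[str, Set[int]] = {
    "subset_a": {1, 3},
    "subset_b": {0},
}


def split_rows(rows: Sequence[str], subset: str) -> Tuple[List[str], List[str]]:
    """Return (train_rows, test_rows) for one subset based on hardcoded indexes."""
    excluded = EXCLUDED_TRAIN_EXAMPLES[subset]
    train: List[str] = []
    test: List[str] = []
    for idx, row in enumerate(rows):
        # Rows whose index is listed for this subset go to train; all others go to test.
        (train if idx in excluded else test).append(row)
    return train, test


train_rows, test_rows = split_rows(["q0", "q1", "q2", "q3"], "subset_a")
```

Hardcoded indexes only stay valid while the row order of the upstream dataset is stable, so matching on a stable identifier would be more robust if the dataset provides one.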
Reference(Output(text=row["Correct Answer"].strip()), tags=[CORRECT_TAG]),
Reference(Output(text=row["Incorrect Answer 1"].strip()), tags=[]),
Reference(Output(text=row["Incorrect Answer 2"].strip()), tags=[]),
Reference(Output(text=row["Incorrect Answer 3"].strip()), tags=[])
This always sets the correct answer to A, which is not great. The authors shuffle the choices deterministically, and you should do something similar.
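A deterministic shuffle along these lines could look like the sketch below. It is not the authors' exact procedure: the fixed seed, the `(text, is_correct)` tuple representation, and the helper name `shuffled_choices` are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_choices(correct: str, incorrect: Sequence[str], rng: random.Random) -> List[Tuple[str, bool]]:
    """Return (text, is_correct) pairs in an order determined by the seeded generator."""
    choices = [(correct.strip(), True)] + [(text.strip(), False) for text in incorrect]
    rng.shuffle(choices)  # shuffles in place using the supplied generator
    return choices


# One seeded generator reused across rows in a fixed order keeps every run reproducible.
rng = random.Random(0)  # the seed value 0 is an arbitrary choice for this sketch
example = shuffled_choices("right", ["wrong 1", "wrong 2", "wrong 3"], rng)
```

Each pair can then be turned into a `Reference`, with `CORRECT_TAG` attached only where the flag is true, so the correct answer no longer always lands in position A.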
4c91a01 to eab0e56
… examples, reformatted.
eab0e56 to e445043
if idx in EXCLUDED_TRAIN_EXAMPLES[self.subset]:
    continue
This has no train split examples currently; let's address this in a different PR.