Endpoints Integration to evaluate closed source Models. #179
Conversation
merge from main.
merge from main.
Merge from main.
This feature is powered by LiteLLM. The reason for using LiteLLM is that it provides a clean interface and proxies for connecting to any endpoint that follows the OpenAI format.
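For reference, calling an OpenAI-format endpoint through LiteLLM looks roughly like the sketch below (the model name, prompt, and sampling parameters are illustrative placeholders, not the exact code in this PR):

```python
import litellm

# LiteLLM reads provider credentials from the environment (assumed to be set
# beforehand), e.g. OPENAI_API_KEY for OpenAI-compatible endpoints.
response = litellm.completion(
    model="gpt-3.5-turbo",  # any model served behind an OpenAI-format endpoint
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.2,
    max_tokens=256,
)
print(response.choices[0].message.content)
```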
…stead of method of TokenizedDataset class
This commit just defines a simple structure for the methods.
…ions for better and further customization.
…and other for endpoints
This is the result:

{
  "instruct-humaneval": {
    "pass@1": 0.651219512195122,
    "pass@10": 0.7535606619462241
  }
}
Merge from main.
Hey @loubnabnl, is it possible to review this PR? Thanks
Thanks for the implementation! I left some comments. If I understand correctly, you're just using the instruct-humaneval prompt as is to evaluate GPT-3.5; I understood from this comment that you needed to change the prompt.
Regarding the score, the 65% pass@1 result is a bit far from what's reported by the community, e.g. 77 pass@1 on the EvalPlus leaderboard and 76 by DeepSeekCoder. Maybe inspect the generations to see whether it's a post-processing issue, or compare with the evalplus repo to see which prompts and postprocessing make sense for the OAI models.
It would also be good to add some documentation on how to run this and which tasks and models are supported and tested, both here and in the README.
In case you want to test other benchmarks, MultiPL-E (for multilingual HumanEval) or HumanEvalPack (inspired from this, they already handle postprocessing) could be good starting points.
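On the post-processing point: chat models typically wrap their answer in markdown fences, so one plausible post-processing step (purely illustrative, not the harness's actual logic) is to extract the fenced code before running the unit tests:

```python
import re

def extract_code_block(generation: str) -> str:
    """Return the contents of the first markdown code fence, or the raw text if none is found."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", generation, re.DOTALL)
    return match.group(1) if match else generation
```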
if api_organization:
    litellm.organization = api_organization

def fetch_dataset_from_task(self, task_name: str):
maybe let's leave prompt building to the generation file instead, and keep a format similar to the Evaluator class + have them inherit the code parts that are similar, to condense the code.
I think the self.args.limit_start logic changed compared to Evaluator, any reason for that?
@@ -1,5 +1,8 @@
import os
import json
let's add the API generation logic to a separate generation_api.py file, since the implementations are separate and this adds a lot of new functions.
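A rough sketch of what such a split could look like, with hypothetical module, function, and argument names (not the layout that was actually merged):

```python
# generation_api.py (hypothetical): endpoint-based generation, kept apart from
# the accelerate-based generation path in generation.py.
import litellm

def parallel_generations_by_api(prompts, args):
    """Query the configured endpoint once per prompt and collect n_samples completions."""
    generations = []
    for prompt in prompts:
        response = litellm.completion(
            model=args.model,  # e.g. "gpt-3.5-turbo"
            messages=[{"role": "user", "content": prompt}],
            temperature=args.temperature,
            max_tokens=args.max_length_generation,
            n=args.n_samples,
        )
        generations.append([choice.message.content for choice in response.choices])
    return generations
```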
try:
    import litellm
except ImportError as e:
    print('EvaluationForEndpoint requires package litellm to be installed.')
let's throw an error instead of a print, given the code will fail after this
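In other words, the import guard could re-raise, for example (a sketch of the suggestion, not necessarily the final wording):

```python
try:
    import litellm
except ImportError as e:
    # Fail immediately: nothing below can work without litellm installed.
    raise ImportError(
        "EvaluationForEndpoint requires the litellm package; install it with `pip install litellm`."
    ) from e
```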
else:
    task_names = pattern_match(args.tasks.split(","), ALL_TASKS)

def evaluate_huggingface_model_with_accelarator(task_names, args):
Suggested change:
- def evaluate_huggingface_model_with_accelarator(task_names, args):
+ def evaluate_huggingface_model_with_accelerator(task_names, args):
# Save all args to config
results["config"] = vars(args)
if not args.generation_only:
    dumped = json.dumps(results, indent=2)
    if accelerator.is_main_process:
let's keep this condition, otherwise results will be printed n_device times; you can check if it's a huggingface service before.
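A minimal sketch of that guard, assuming a hypothetical flag that distinguishes the huggingface/accelerate path from the API path (the names are illustrative):

```python
if is_huggingface_service:
    # accelerate path: only the main process writes, so results are not emitted n_device times
    if accelerator.is_main_process:
        print(dumped)
else:
    # API path: single process, no accelerator involved
    print(dumped)
```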
Merge from main
Hi @loubnabnl, apologies for not staying up to date with this PR. Due to bandwidth issues I may not be able to contribute to it further, and since a lot has changed in the meantime (in terms of metrics and models), I am closing this PR. If I open a new one in the future, I will definitely follow the comments you mentioned. Thanks
This PR integrates closed-source models interfaced through API endpoints. Solves issue #161