Endpoints Integration to evaluate closed-source Models #179

Conversation

Anindyadeep

This PR integrates closed-source models accessed through API endpoints. Solves issue #161.

Anindyadeep and others added 6 commits December 2, 2023 19:53
This feature is powered by LiteLLM. The reason for using LiteLLM is that it provides
a good interface and proxies for connecting to any endpoint in the OpenAI format.
This commit just defines a simple structure for the methods.
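As a rough sketch of the interface LiteLLM provides (illustrative only, not the code added in this PR; the model name and prompt are placeholders, and an OPENAI_API_KEY environment variable is assumed to be set):

import litellm

# Any supported or proxied OpenAI-compatible endpoint is reached through the
# same OpenAI-style completion call.
response = litellm.completion(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)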
@Anindyadeep Anindyadeep marked this pull request as draft January 6, 2024 08:00
@Anindyadeep
Author

Anindyadeep commented Jan 7, 2024

This is the result on instruct-humaneval:

{
  "instruct-humaneval": {
    "pass@1": 0.651219512195122,
    "pass@10": 0.7535606619462241
  }
}

@Anindyadeep Anindyadeep marked this pull request as ready for review January 7, 2024 18:10
@Anindyadeep
Author

Hey @loubnabnl, is it possible to review this PR?

Thanks

Collaborator

@loubnabnl left a comment


Thanks for the implementation! I left some comments. If I understand correctly, you're just using the instruct-humaneval prompt as-is to evaluate GPT-3.5; I understood from this comment that you needed to change the prompt.

Regarding the score, the 65% pass@1 result is a bit far from what's reported by the community, e.g. 77 pass@1 on the EvalPlus leaderboard and 76 by DeepSeekCoder. Maybe inspect the generations to see if it's a post-processing issue, or compare with the EvalPlus repo to see which prompts and post-processing make sense for the OpenAI models.

It would also be good to add some documentation on how to run this and which tasks and models are supported and tested, both here and in the README.

In case you want to test other benchmarks, MultiPL-E (for multilingual HumanEval) or HumanEvalPack (inspired by this; they already handle post-processing) could be good starting points.

if api_organization:
    litellm.organization = api_organization

def fetch_dataset_from_task(self, task_name: str):
Collaborator


Maybe let's leave prompt building to the generation file instead, and keep a format similar to the Evaluator class, having them inherit the code parts that are similar to condense the code.

I think the self.args.limit_start logic changed compared to Evaluator; any reason for that?
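A rough sketch of what that inheritance could look like (all names below, such as EndpointEvaluator, get_task, build_prompts and call_endpoint, are illustrative placeholders rather than the harness's actual API):

# Evaluator is the harness's existing base class (import omitted here).
class EndpointEvaluator(Evaluator):
    # Hypothetical subclass: reuse the shared setup and evaluation logic from
    # Evaluator and override only the generation step to call the API endpoint.
    def generate_text(self, task_name):
        task = self.get_task(task_name)      # placeholder helper
        prompts = self.build_prompts(task)   # placeholder helper
        return [self.call_endpoint(p) for p in prompts]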

@@ -1,5 +1,8 @@
import os
import json
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the API generation logic to a separate generation_api.py file since the implementations are separate and this adds a lot of new functions

try:
    import litellm
except ImportError as e:
    print('EvaluationForEndpoint requires package litellm to be installed.')
Collaborator


Let's raise an error instead of printing, given the code will fail after this.
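For example (a sketch of the suggested change, not the PR's final code):

try:
    import litellm
except ImportError as e:
    # Fail fast: the endpoint evaluator cannot run without litellm installed.
    raise ImportError(
        "EvaluationForEndpoint requires the litellm package; "
        "install it with pip install litellm"
    ) from e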

else:
    task_names = pattern_match(args.tasks.split(","), ALL_TASKS)

def evaluate_huggingface_model_with_accelarator(task_names, args):
Collaborator


Suggested change
- def evaluate_huggingface_model_with_accelarator(task_names, args):
+ def evaluate_huggingface_model_with_accelerator(task_names, args):


# Save all args to config
results["config"] = vars(args)
if not args.generation_only:
    dumped = json.dumps(results, indent=2)
    if accelerator.is_main_process:
Collaborator


Let's keep this condition, otherwise results will be printed n_device times; you can check whether it's the Hugging Face service before.
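A minimal sketch of that guard (is_api_service is a hypothetical flag for the endpoint path, not the PR's actual variable):

# Dump results once: the API path runs in a single process, while the
# Hugging Face path guards on the accelerator's main process.
if is_api_service or accelerator.is_main_process:
    print(dumped)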

@Anindyadeep
Author

Hi @loubnabnl, apologies for not staying updated on the PR. Due to bandwidth issues, I may not be able to contribute further to this PR, and since a lot of things have changed by now (in terms of metrics and models), I am closing it.

If I open a new one in the future, I will definitely address the comments you mentioned. Thanks!

@Anindyadeep Anindyadeep closed this Jun 6, 2024