Endpoints Integration to evaluate closed-source Models #179

Conversation

Anindyadeep

This PR integrates closed-source models accessed through API endpoints. Solves issue #161.

Anindyadeep and others added 6 commits December 2, 2023 19:53
This feature is powered by LiteLLM. The reason for using LiteLLM is that it provides
a good interface and proxies for connecting to any endpoint in the OpenAI format.
This commit just defines a simple structure for the methods.
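As a rough sketch of the interface LiteLLM provides (illustrative only, not the code added in this PR; the model name and prompt are placeholders, and an OPENAI_API_KEY environment variable is assumed to be set):

import litellm

# Any supported or proxied OpenAI-compatible endpoint is reached through the
# same OpenAI-style completion call.
response = litellm.completion(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)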
@Anindyadeep Anindyadeep marked this pull request as draft January 6, 2024 08:00
@Anindyadeep
Author

Anindyadeep commented Jan 7, 2024

This is the result on instruct-humaneval:

{
  "instruct-humaneval": {
    "pass@1": 0.651219512195122,
    "pass@10": 0.7535606619462241
  }
}

@Anindyadeep Anindyadeep marked this pull request as ready for review January 7, 2024 18:10
@Anindyadeep
Author

Hey @loubnabnl, is it possible to review this PR?

Thanks

Collaborator

@loubnabnl left a comment


Thanks for the implementation! I left some comments. If I understand correctly, you're just using the instruct-humaneval prompt as-is to evaluate GPT-3.5; I understood from this comment that you needed to change the prompt.

Regarding the score, the 65% pass@1 result is a bit far from what's reported by the community, e.g. 77 pass@1 on the EvalPlus leaderboard and 76 by DeepSeekCoder. Maybe inspect the generations to see if it's a post-processing issue, or compare with the EvalPlus repo to see which prompts and post-processing make sense for the OpenAI models.

It would also be good to add some documentation on how to run this and which tasks and models are supported and tested, both here and in the README.

In case you want to test other benchmarks, MultiPL-E (for multilingual HumanEval) or HumanEvalPack (inspired by this; they already handle post-processing) could be good starting points.

if api_organization:
    litellm.organization = api_organization

def fetch_dataset_from_task(self, task_name: str):
Collaborator


Maybe let's leave prompt building to the generation file instead, and keep a format similar to the Evaluator class, having them inherit the code parts that are similar to condense the code.

I think the self.args.limit_start logic changed compared to Evaluator; any reason for that?
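A rough sketch of what that inheritance could look like (all names below, such as EndpointEvaluator, get_task, build_prompts and call_endpoint, are illustrative placeholders rather than the harness's actual API):

# Evaluator is the harness's existing base class (import omitted here).
class EndpointEvaluator(Evaluator):
    # Hypothetical subclass: reuse the shared setup and evaluation logic from
    # Evaluator and override only the generation step to call the API endpoint.
    def generate_text(self, task_name):
        task = self.get_task(task_name)      # placeholder helper
        prompts = self.build_prompts(task)   # placeholder helper
        return [self.call_endpoint(p) for p in prompts]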

@@ -1,5 +1,8 @@
import os
import json
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add the API generation logic to a separate generation_api.py file since the implementations are separate and this adds a lot of new functions

try:
    import litellm
except ImportError as e:
    print('EvaluationForEndpoint requires package litellm to be installed.')
Collaborator


Let's raise an error instead of printing, given the code will fail after this.
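For example (a sketch of the suggested change, not the PR's final code):

try:
    import litellm
except ImportError as e:
    # Fail fast: the endpoint evaluator cannot run without litellm installed.
    raise ImportError(
        "EvaluationForEndpoint requires the litellm package; "
        "install it with pip install litellm"
    ) from e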

else:
    task_names = pattern_match(args.tasks.split(","), ALL_TASKS)

def evaluate_huggingface_model_with_accelarator(task_names, args):
Collaborator


Suggested change
- def evaluate_huggingface_model_with_accelarator(task_names, args):
+ def evaluate_huggingface_model_with_accelerator(task_names, args):


# Save all args to config
results["config"] = vars(args)
if not args.generation_only:
    dumped = json.dumps(results, indent=2)
    if accelerator.is_main_process:
Collaborator


Let's keep this condition, otherwise results will be printed n_device times; you can check whether it's the Hugging Face service before.
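A minimal sketch of that guard (is_api_service is a hypothetical flag for the endpoint path, not the PR's actual variable):

# Dump results once: the API path runs in a single process, while the
# Hugging Face path guards on the accelerator's main process.
if is_api_service or accelerator.is_main_process:
    print(dumped)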

@Anindyadeep
Author

Hi @loubnabnl, apologies for not staying updated on the PR. Due to bandwidth issues, I may not be able to contribute further to this PR, and since a lot of things have changed by now (in terms of metrics and models), I am closing it.

If I open a new one in the future, I will definitely address the comments you mentioned. Thanks!

@Anindyadeep Anindyadeep closed this Jun 6, 2024