
Generation with custom data & evaluator #98

Open
amitshermann opened this issue Sep 15, 2024 · 3 comments

@amitshermann

Hi,

We'd like to use AutoPrompt for a generation task where both the input and output are text. We've also developed an evaluator that scores the input-output pairs (e.g., a float between 0 and 1).

Our goal is to optimize the output using our dataset and evaluator, but we're unsure how to set this up with AutoPrompt. Could you provide guidance on how to achieve this?

Thanks in advance,

@Eladlev
Owner

Eladlev commented Sep 15, 2024

Hi,
Yes, it is relatively simple to tweak the system for this use case. These are the steps you should follow:

  1. Remove the first step in the optimization (the ranker optimization): lines 40-54 in run_generation_pipeline.py.
  2. Prepare a csv with your dataset inputs, following the instructions in this comment; the only difference is that in your case you can also leave the annotation field empty.
  3. Put the csv from step 2 at <base_folder>/generator/dataset.csv, and add the flag --load_dump <base_folder>.
  4. In this if statement, add a custom option and set it to:
    return utils.set_function_from_iterrow(lambda record: custom_score('###User input:\n' + record['text'] + '\n####model prediction:\n' + record['prediction']))
    where custom_score is your score function (adapt the format according to your function's input); a minimal sketch is given after this list.
  5. In the config file, change this value to custom and set the error_threshold to 0.5.
  6. Here, change the scale to 0-1.

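For reference, here is a minimal, self-contained sketch of steps 2-4 (this is not the repository's code: the base folder name, the dummy custom_score body, and anything beyond the text, prediction and annotation columns are assumptions you should adapt to your own setup):

import os
import pandas as pd

# Hypothetical stand-in for your evaluator: any callable that maps the formatted
# input/output pair to a float in [0, 1]. Replace the body with your real scorer.
def custom_score(pair_text: str) -> float:
    return min(1.0, len(pair_text) / 1000.0)

# Steps 2-3: a dataset csv with your inputs; the annotation column may stay empty.
base_folder = "my_dump"  # placeholder for the folder passed via --load_dump <base_folder>
os.makedirs(os.path.join(base_folder, "generator"), exist_ok=True)

dataset = pd.DataFrame({
    "text": ["Summarize the following article ...",
             "Write a product description for ..."],
    "annotation": ["", ""],   # can be left empty for this use case
    "prediction": ["", ""],   # filled in by the pipeline after generation
})
dataset.to_csv(os.path.join(base_folder, "generator", "dataset.csv"), index=False)

# Step 4: the per-row score, matching the lambda passed to
# utils.set_function_from_iterrow in the snippet above.
def score_record(record) -> float:
    return custom_score(
        "###User input:\n" + record["text"] +
        "\n####model prediction:\n" + record["prediction"]
    )

dataset["score"] = dataset.apply(score_record, axis=1)
print(dataset[["text", "score"]])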
That's all! It should work with all these changes. If there are any issues, I can also help with the integration on the Discord server.

@amitshermann
Author

amitshermann commented Sep 16, 2024

Thank you,
What does the error_threshold mean? Will it make the score Boolean? If so, my custom_eval function kind of loses its meaning; for example, I want the model to understand the difference between a score of 0.8 and a score of 0.6.

Thanks in advance,

@Eladlev
Owner

Eladlev commented Sep 17, 2024

The error_threshold determines which examples are put on the list that is provided to the analyzer (we take the worst ones from this list). These samples are considered candidates that could potentially be improved.
You can set a very high threshold here (for example 0.9); if there are too many samples, it simply takes the worst ones from the list.
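
As a toy illustration only (not the repository's code; the number of samples taken is an assumption), the threshold just decides which samples become candidates for the analyzer, while the continuous scores still determine which of those are the worst:

scores = [0.95, 0.82, 0.61, 0.40, 0.33]   # your custom 0-1 scores per sample
error_threshold = 0.9

candidates = [s for s in scores if s < error_threshold]  # potentially improvable samples
worst_first = sorted(candidates)                          # the analyzer gets the worst ones
print(worst_first[:3])   # -> [0.33, 0.4, 0.61]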
