Entire chat conversation evaluation #5654

Mosquid · 2024-12-09T09:07:17Z

Mosquid
Dec 9, 2024

Hey there,

I'm trying to create an evaluator that allows me to evaluate the entire chat history and determine if the task has been completed. Is there a playbook on how to do this?

Thanks

@dosu · 2024-12-09T09:08:48Z

dosubot[bot]
bot Dec 9, 2024

There isn't a specific playbook or documentation in the Arize Phoenix repository on evaluating entire chat histories for task completion. However, the codebase includes tracing and logging capabilities through OpenInference, which can be used to track the evaluation of chat conversations ^[1]^[2].

For processing and evaluating chat conversations, the codebase uses several dependencies, including opentelemetry-api, opentelemetry-sdk, and openinference-semantic-conventions, among others ^[3]^[4]^[5].

You might consider leveraging these tools and libraries to build your evaluator, focusing on how OpenInference can help you trace and log the necessary data to determine task completion.



axiomofjoy · 2024-12-09T19:04:23Z

axiomofjoy
Dec 9, 2024
Maintainer

Hey @Mosquid, are you using our datasets and experiments APIs?


Jgilhuly · 2024-12-10T00:47:44Z

Jgilhuly
Dec 10, 2024
Collaborator

Hi @Mosquid

Expanding on Xander's comment above, we don't have a prebuilt template or evaluator that reviews full conversations. The Bring Your Own Evaluator docs would be a good place to start. Those will walk you through how to set up your own custom evaluator in Phoenix.
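As a rough illustration of what a custom full-conversation evaluator might start from, here is a minimal sketch in plain Python. It does not use any Phoenix API: the template wording, the label names (`completed` / `incomplete`), and the helper names are all assumptions made for the example, and the actual LLM call is left out.

```python
# Hypothetical eval prompt; the wording and the two labels are assumptions.
TASK_COMPLETION_TEMPLATE = """You are evaluating a conversation between a user and an assistant.

[Task]: {task}
[Conversation]:
{conversation}

Has the task been completed by the end of the conversation?
Answer with a single word: "completed" or "incomplete"."""

# Allowed output labels ("rails") for the evaluator.
RAILS = ("completed", "incomplete")


def render_conversation(messages):
    """Flatten a list of {"role", "content"} dicts into one line per turn."""
    return "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)


def build_eval_prompt(task, messages):
    """Fill the template with the task and the entire chat history."""
    return TASK_COMPLETION_TEMPLATE.format(
        task=task, conversation=render_conversation(messages)
    )


def snap_to_rails(raw_output, rails=RAILS):
    """Map raw LLM text to an allowed label, or None if it matches neither."""
    text = raw_output.strip().lower().strip(".\"'")
    return text if text in rails else None
```

The prompt returned by `build_eval_prompt` would be sent to whichever LLM you use, and `snap_to_rails` applied to its reply, so that unexpected output is surfaced as `None` instead of being counted as a label.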

After you get the initial pipes working there, our Datasets & Experiments features give you a good way to iterate on and improve your evaluator prompt. I'd recommend:

1. Creating a small dataset of labeled examples. The input columns would be a JSON of the conversation history from your app and the task requested. The output would be a human-annotated label saying whether or not the task was completed. This doesn't have to be a huge dataset, but it should reasonably cover the examples you expect to see.
2. Either setting up an experiment to test versions of your eval prompt, or using our new prompt playground feature to test them in the platform against your dataset.
3. Tweaking your prompt until you get maximum alignment between the LLM-generated eval labels and your human-annotated ground truth.

That approach should help you create an eval prompt that you can then use at scale with confidence that it won't diverge far from your ground truth.
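To make "alignment" concrete, here is a minimal sketch of how agreement between human labels and LLM eval labels could be measured outside of Phoenix. The label values and function names are assumptions for illustration; in practice the two lists would come from your labeled dataset and your evaluator's output.

```python
from collections import Counter


def agreement(human_labels, llm_labels):
    """Fraction of examples where the LLM label equals the human label."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == l for h, l in zip(human_labels, llm_labels))
    return matches / len(human_labels)


def confusion(human_labels, llm_labels):
    """Count each (human, llm) label pair to see where the prompt disagrees."""
    return Counter(zip(human_labels, llm_labels))


# Hypothetical labels for four conversations.
human = ["completed", "incomplete", "completed", "incomplete"]
llm = ["completed", "incomplete", "incomplete", "incomplete"]

print(agreement(human, llm))   # 0.75
print(confusion(human, llm))
```

Tracking the per-pair counts alongside the overall agreement shows which kind of mistake a prompt revision fixes, for example conversations that humans marked completed but the evaluator marked incomplete.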

