chore(llmobs): implement skeleton code for ragas faithfulness evaluator #10662

lievan · 2024-09-13T19:32:17Z

This PR implements EvaluatorRunner, a periodic service for LLM Obs that runs evaluations on batches of finished spans.

We add the _DD_LLMOBS_EVALUATORS environment variable to determine which evaluators should be enabled. Right now, the only supported evaluator is ragas_faithfulness.
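
As a rough illustration of the idea, a comma-separated value such as the one passed in _DD_LLMOBS_EVALUATORS could be turned into a list of enabled evaluators along these lines. This is only a sketch: the helper name `parse_enabled_evaluators` and the `SUPPORTED_EVALUATORS` set are hypothetical and are not taken from this PR's code.

```python
import os

# Hypothetical registry. The PR description states that ragas_faithfulness
# is the only supported evaluator right now.
SUPPORTED_EVALUATORS = {"ragas_faithfulness"}


def parse_enabled_evaluators(raw):
    """Split a comma-separated string and keep only known evaluator names."""
    if not raw:
        return []
    enabled = []
    for name in raw.split(","):
        name = name.strip()
        # Skip unknown names and duplicates instead of raising.
        if name in SUPPORTED_EVALUATORS and name not in enabled:
            enabled.append(name)
    return enabled


if __name__ == "__main__":
    print(parse_enabled_evaluators(os.getenv("_DD_LLMOBS_EVALUATORS", "")))
```

Silently ignoring unknown names is a design choice made for this sketch; the real implementation may log or reject them instead.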

Within the trace processor, span events are enqueued to the evaluator runner after they have been enqueued to the span writer.

On each call of periodic(), we run a list of evaluators on the batch of finished spans. An evaluator is defined as a function that takes a span as an argument and returns an evaluation metric.
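
The flow described above can be sketched roughly as follows. This is not the EvaluatorRunner from the PR: the class name `EvaluatorRunnerSketch`, the `buffer_limit` parameter, and the `metrics` list (a stand-in for wherever evaluation metrics are actually submitted) are all assumptions made for illustration.

```python
import threading


class EvaluatorRunnerSketch:
    """Buffers finished span events and runs every evaluator on each flush."""

    def __init__(self, evaluators, buffer_limit=1000):
        self._lock = threading.Lock()
        self._buffer = []
        self._buffer_limit = buffer_limit  # hypothetical cap on queued events
        self._evaluators = list(evaluators)
        self.metrics = []  # stand-in for submitting metrics to a writer

    def enqueue(self, span_event):
        """Queue one finished span event; drop it if the buffer is full."""
        with self._lock:
            if len(self._buffer) >= self._buffer_limit:
                return
            self._buffer.append(span_event)

    def periodic(self):
        """Take the current batch and run each evaluator on each span event."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        for span_event in batch:
            for evaluate in self._evaluators:
                self.metrics.append(evaluate(span_event))
```

Swapping the buffer out under the lock keeps enqueue() cheap for the caller while the evaluators run outside the lock.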

Right now, the faithfulness function is a dummy function that just returns an evaluation metric with a score of 1. In a future PR we will implement the actual faithfulness evaluation.
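
A placeholder evaluator of this kind might look like the sketch below. The class name `DummyFaithfulnessEvaluator` and every field in the returned dictionary (`label`, `metric_type`, `score_value`, and so on) are hypothetical; the PR description only says that the dummy returns a metric with a score of 1.

```python
import time


class DummyFaithfulnessEvaluator:
    """Placeholder evaluator that always reports a score of 1."""

    LABEL = "ragas_faithfulness"

    def evaluate(self, span_event):
        # The metric shape is an assumption made for this sketch,
        # not the schema used by the PR.
        return {
            "span_id": span_event.get("span_id"),
            "trace_id": span_event.get("trace_id"),
            "label": self.LABEL,
            "metric_type": "score",
            "score_value": 1,
            "timestamp_ms": int(time.time() * 1000),
        }
```

Because the evaluator never inspects the span's input or output, it only exercises the plumbing between the trace processor, the runner, and the metric destination.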

Intended Usage

_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py

! No user-facing changes for this PR !

No changelog entry is needed, since this PR only implements the internal skeleton code necessary for the RAGAS evaluation integration. The environment variable that enables the ragas evaluator service is hidden (_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED) and will be made public when we implement an actual faithfulness function.

(Full e2e poc, which contains some differences)

See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2024-09-13T19:32:57Z

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/ragas/faithfulness.py                        @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

datadog-dd-trace-py-rkomorn · 2024-09-13T19:50:26Z

Datadog Report

Branch report: evan.li/ragas-skeleton
Commit report: be893a1
Test service: dd-trace-py

✅ 0 Failed, 100 Passed, 850 Skipped, 1m 26.57s Total duration (13m 4.6s time saved)

tests/llmobs/test_llmobs_service.py

tests/llmobs/test_llmobs_ragas_faithfulness_evaluator.py

tests/llmobs/conftest.py

tests/llmobs/test_llmobs_ragas_faithfulness_evaluator.py

ddtrace/settings/config.py

pr-commenter · 2024-09-17T17:11:35Z

Benchmarks

Benchmark execution time: 2024-09-20 20:32:45

Comparing candidate commit 04d202e in PR branch evan.li/ragas-skeleton with baseline commit 33daba9 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 356 metrics, 48 unstable metrics.

Yun-Kim · 2024-09-17T17:33:58Z

ddtrace/llmobs/_trace_processor.py

        self._span_writer = llmobs_span_writer
+        self._evaluators = evaluators


If we're going to have evaluators be a part of the trace processor, we're going to need to ensure we do the same thing on _child_after_fork() as we do for the span writer, i.e. something like

    self._evaluator = self._evaluator.recreate()
    self._trace_processor._evaluator = self._evaluator
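
To make the suggestion concrete, here is a self-contained sketch of the fork-handling pattern the comment describes. All class names are hypothetical stand-ins, and a recreate() method on the evaluator runner is the reviewer's proposal, not something this thread confirms was implemented.

```python
class PeriodicWorkerSketch:
    """Stand-in for a buffered periodic service such as a writer or runner."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.buffer = []

    def recreate(self):
        # Return a fresh instance with the same settings and an empty buffer,
        # so the child process does not reuse state inherited from the parent.
        return self.__class__(interval=self.interval)


class TraceProcessorSketch:
    def __init__(self, span_writer, evaluator_runner):
        self._span_writer = span_writer
        self._evaluator_runner = evaluator_runner


class LLMObsServiceSketch:
    def __init__(self):
        self._span_writer = PeriodicWorkerSketch()
        self._evaluator_runner = PeriodicWorkerSketch()
        self._trace_processor = TraceProcessorSketch(
            self._span_writer, self._evaluator_runner
        )

    def _child_after_fork(self):
        # Recreate both workers and re-point the trace processor at them.
        self._span_writer = self._span_writer.recreate()
        self._trace_processor._span_writer = self._span_writer
        self._evaluator_runner = self._evaluator_runner.recreate()
        self._trace_processor._evaluator_runner = self._evaluator_runner
```

The important detail is the second assignment in each pair: without it the trace processor in the child would keep enqueuing to the worker object inherited from the parent.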

ddtrace/llmobs/_evaluations/ragas/faithfulness/evaluator.py

ddtrace/llmobs/_trace_processor.py

tests/llmobs/test_llmobs_evaluator_runner.py

lievan added 2 commits September 13, 2024 14:51

implement ragas faithfulenss runner with dummy ragas score generator

571d317

remove newline

4b3d840

lievan changed the title ~~feat(llmobs): Implement ragas faithfulenss runner with dummy ragas score generator~~ feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator Sep 13, 2024

lievan changed the title ~~feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator~~ feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator Sep 13, 2024

lievan changed the title ~~feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator~~ feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function Sep 13, 2024

lievan mentioned this pull request Sep 13, 2024

feat(llmobs): poc ragas evaluation integration #10431

Draft

2 tasks

lievan added the changelog/no-changelog A changelog entry is not required for this PR. label Sep 13, 2024

lievan added 3 commits September 16, 2024 08:55

pydantic v1

7b9c929

refactor into evaluator list

2e883a0

add unit tests

7b31443

datadog-datadog-prod-us1 bot reviewed Sep 17, 2024


lievan added 2 commits September 17, 2024 12:49

fix expectde span event

13229bd

merg conf

b493e20

lievan marked this pull request as ready for review September 17, 2024 16:54

lievan requested review from a team as code owners September 17, 2024 16:54

lievan requested review from tabgok and rachelyangdog September 17, 2024 16:54

Yun-Kim changed the title ~~feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function Sep 17, 2024

Yun-Kim reviewed Sep 17, 2024


ddtrace/settings/config.py (outdated, resolved)

lievan changed the title ~~chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function Sep 17, 2024

lievan changed the title ~~chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function~~ chore(llmobs): implement skeleton code for ragas faithfulness evaluator Sep 17, 2024

remove config option, use only env var

d290dcd

Yun-Kim reviewed Sep 17, 2024


lievan added 2 commits September 17, 2024 14:33

address comments

6e49cca

refactor into one evaluator service

fcf9991

datadog-datadog-prod-us1 bot reviewed Sep 18, 2024


tests/llmobs/test_llmobs_evaluator_runner.py (resolved)

tests/llmobs/test_llmobs_evaluator_runner.py (resolved)

lievan and others added 5 commits September 18, 2024 11:10

dont cancel futures

10b276f

refactor dummy faithfulness into class

b6fa4e0

rename field to label

be893a1

Merge branch 'main' into evan.li/ragas-skeleton

a309330

Merge branch 'main' into evan.li/ragas-skeleton

04d202e
