chore(llmobs): implement skeleton code for ragas faithfulness evaluator #10662

Open
wants to merge 15 commits into
base: main
from

Conversation

lievan
Contributor

@lievan lievan commented Sep 13, 2024

This PR implements EvaluatorRunner, a periodic service for LLM Obs that runs evaluations on batches of finished spans.

We add the _DD_LLMOBS_EVALUATORS environment variable to determine which evaluators are enabled. Right now, the only supported evaluator is ragas_faithfulness.

Within the trace processor, span events are enqueued to the evaluator runner after they have been enqueued to the span writer.

On each call to periodic(), we run the list of enabled evaluators on the batch of finished spans. An evaluator is a function that takes a span as its argument and returns an evaluation metric.

Right now, the faithfulness function is a dummy function that just returns an eval metric with score label 1. In a future PR we will implement the actual faithfulness evaluation.
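As a rough illustration of the flow described above (not the code in this PR), the runner can be pictured as a buffer that is drained on each periodic() call. The class name SketchEvaluatorRunner, the dict-based span and metric shapes, and the metric field names are all assumptions made for this sketch.

```python
import threading
from typing import Any, Callable, Dict, List

# Hypothetical shapes for this sketch only; the real span event and
# evaluation metric types are defined inside ddtrace.
SpanEvent = Dict[str, Any]
EvalMetric = Dict[str, Any]
Evaluator = Callable[[SpanEvent], EvalMetric]


def dummy_faithfulness(span: SpanEvent) -> EvalMetric:
    """Placeholder evaluator: always reports a score of 1 (field names assumed)."""
    return {
        "label": "ragas_faithfulness",
        "span_id": span.get("span_id"),
        "score_value": 1,
    }


class SketchEvaluatorRunner:
    """Buffers finished spans and evaluates them in batches."""

    def __init__(self, evaluators: List[Evaluator]) -> None:
        self._lock = threading.Lock()
        self._buffer: List[SpanEvent] = []
        self._evaluators = evaluators
        self.metrics: List[EvalMetric] = []

    def enqueue(self, span: SpanEvent) -> None:
        with self._lock:
            self._buffer.append(span)

    def periodic(self) -> None:
        # Swap the buffer out under the lock so enqueue() is not blocked
        # while the evaluators run on the batch.
        with self._lock:
            batch, self._buffer = self._buffer, []
        for span in batch:
            for evaluator in self._evaluators:
                self.metrics.append(evaluator(span))
```

In this sketch the metrics are simply collected in a list; how the real runner submits them is outside the scope of this description.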

Intended Usage

_DD_LLMOBS_EVALUATORS="ragas_faithfulness,ragas_.." DD_LLMOBS_ENABLED=true python3 app.py
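For illustration, one way the comma-separated value could be turned into a list of evaluator functions is sketched below. The SUPPORTED_EVALUATORS registry, the enabled_evaluators helper, and the choice to silently skip unknown names are assumptions of this sketch, not necessarily what the PR does.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional

Evaluator = Callable[[dict], dict]


def dummy_faithfulness(span: dict) -> dict:
    """Placeholder evaluator (metric field names are assumed)."""
    return {"label": "ragas_faithfulness", "score_value": 1}


# Hypothetical registry mapping evaluator names to functions.
SUPPORTED_EVALUATORS: Dict[str, Evaluator] = {
    "ragas_faithfulness": dummy_faithfulness,
}


def enabled_evaluators(env: Optional[Mapping[str, str]] = None) -> List[Evaluator]:
    """Read _DD_LLMOBS_EVALUATORS and return the matching evaluator functions."""
    env = os.environ if env is None else env
    raw = env.get("_DD_LLMOBS_EVALUATORS", "")
    # Split on commas, tolerate surrounding whitespace and empty entries.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    # Assumption: names that are not in the registry are ignored.
    return [SUPPORTED_EVALUATORS[name] for name in names if name in SUPPORTED_EVALUATORS]
```

Passing the environment as an argument keeps the helper easy to test without mutating os.environ.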

! No user-facing changes in this PR !

No changelog since this PR only implements the internal skeleton code necessary for RAGAS evaluation integration. The environment variable to enable the ragas evaluator service is hidden (_DD_LLMOBS_RAGAS_FAITHFULNESS_ENABLED) and will be made public when we implement an actual faithfulness function.

(Full e2e poc, which contains some differences)

See #10431 for an idea of what the full e2e implementation of the ragas integration looks like.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@lievan lievan changed the title from "feat(llmobs): Implement ragas faithfulenss runner with dummy ragas score generator" to "feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator" Sep 13, 2024
@lievan lievan changed the title from "feat(llmobs): implement ragas faithfulenss runner with dummy ragas score generator" to "feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator" Sep 13, 2024
@lievan lievan changed the title from "feat(llmobs): implement ragas faithfulness runner with dummy ragas score generator" to "feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function" Sep 13, 2024
Contributor

github-actions bot commented Sep 13, 2024

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/ragas/faithfulness.py                        @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

@lievan lievan added the changelog/no-changelog A changelog entry is not required for this PR. label Sep 13, 2024

datadog-dd-trace-py-rkomorn bot commented Sep 13, 2024

Datadog Report

Branch report: evan.li/ragas-skeleton
Commit report: be893a1
Test service: dd-trace-py

✅ 0 Failed, 100 Passed, 850 Skipped, 1m 26.57s Total duration (13m 4.6s time saved)

@lievan lievan marked this pull request as ready for review September 17, 2024 16:54
@lievan lievan requested review from a team as code owners September 17, 2024 16:54
@Yun-Kim Yun-Kim changed the title from "feat(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function" to "chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function" Sep 17, 2024
ddtrace/settings/config.py (outdated, resolved)
@lievan lievan changed the title from "chore(llmobs): implement skeleton of ragas faithfulness runner that uses a dummy faithfulness function" to "chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function" Sep 17, 2024
@lievan lievan changed the title from "chore(llmobs): implement skeleton code for ragas faithfulness runner that uses a dummy faithfulness function" to "chore(llmobs): implement skeleton code for ragas faithfulness evaluator" Sep 17, 2024

pr-commenter bot commented Sep 17, 2024

Benchmarks

Benchmark execution time: 2024-09-20 20:32:45

Comparing candidate commit 04d202e in PR branch evan.li/ragas-skeleton with baseline commit 33daba9 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 356 metrics, 48 unstable metrics.

self._span_writer = llmobs_span_writer
self._evaluators = evaluators
Contributor

If evaluators are going to be part of the trace processor, we need to do the same thing in _child_after_fork() as we do for the span writer, i.e. something like:

self._evaluator = self._evaluator.recreate()
self._trace_processor._evaluator = self._evaluator
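To make the suggestion concrete, here is a minimal sketch of the recreate-after-fork pattern using stand-in classes. SketchRunner, SketchTraceProcessor, and SketchService are hypothetical names for illustration; only _child_after_fork(), recreate(), and the two attribute names come from the snippet above.

```python
from typing import Callable, List


class SketchRunner:
    """Stand-in for the evaluator runner; holds config plus a span buffer."""

    def __init__(self, evaluators: List[Callable[[dict], dict]]) -> None:
        self._evaluators = list(evaluators)
        self._buffer: List[dict] = []

    def enqueue(self, span: dict) -> None:
        self._buffer.append(span)

    def recreate(self) -> "SketchRunner":
        # Fresh instance with the same configuration and an empty buffer,
        # so the child does not re-process spans buffered by the parent.
        return SketchRunner(self._evaluators)


class SketchTraceProcessor:
    """Stand-in for the trace processor, which keeps a runner reference."""

    def __init__(self, evaluator: SketchRunner) -> None:
        self._evaluator = evaluator


class SketchService:
    """Stand-in for the service that owns both objects."""

    def __init__(self, evaluator: SketchRunner) -> None:
        self._evaluator = evaluator
        self._trace_processor = SketchTraceProcessor(evaluator)

    def _child_after_fork(self) -> None:
        # Replace the runner and re-point the trace processor at the new
        # instance; otherwise the processor keeps enqueueing to the old one.
        self._evaluator = self._evaluator.recreate()
        self._trace_processor._evaluator = self._evaluator
```

The second assignment is the part that is easy to miss: recreating the runner without updating the trace processor would leave the two out of sync in the child process.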

ddtrace/llmobs/_trace_processor.py (outdated, resolved)
2 participants