
docs(evals): Updated docs for evals #1512

Open · wants to merge 6 commits into main
Conversation

@ssbushi (Contributor) commented Dec 13, 2024

Checklist (if applicable):

@ssbushi marked this pull request as ready for review on December 13, 2024 at 04:25
docs/evaluation.md (outdated, resolved)

You can see the details of your evaluation run on this page, including the original input, extracted context, and metrics (if any).

<!-- TODO, more convincing conclusion here? -->
ssbushi (Contributor Author):

@mjchristy thoughts?

You can also provide custom extractors to be used by the `eval:extractData` and
`eval:flow` commands. Custom extractors let you override the default
extraction logic, giving you more control over creating datasets and evaluating them.
<!-- TODO: Any caveats on where this approach does not work (ES5 or something?) -->
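For illustration only (not part of this PR's diff), here is a rough sketch of what such a custom extractor configuration might look like in a `genkit-tools.conf.js` file. The flow name, step names, and exact schema below are assumptions made for this example:

```js
// genkit-tools.conf.js — hypothetical sketch; the flow name, step names, and
// extractor schema here are assumptions for illustration, not taken from this PR.
module.exports = {
  evaluators: [
    {
      // The flow whose evaluation data these extractors should apply to.
      actionRef: '/flow/myFlow',
      extractors: {
        // Use the output of a specific trace step as the evaluation `context`.
        context: { outputOf: 'retrieve-documents' },
        // Use a named step's output directly as the evaluation `output`.
        output: 'generate-answer',
      },
    },
  ],
};
```

With something like this in place, `eval:extractData` and `eval:flow` would pick up the overridden fields instead of falling back to the default extraction logic.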
ssbushi (Contributor Author):

@pavelgj thoughts?

@mjchristy (Contributor) left a comment:

Overall, this LGTM. I wonder if we need to give them a RAG flow to evaluate, esp. given that we chose 'maliciousness'.

// ...
});
```
1. **Inference-based evaluation**: In this type of evaluation, a system is run on a collection of pre-determined inputs and the corresponding outputs are assessed for quality.
Contributor:

"A system is run on" sounds strange to me. Maybe we just say, "This type of evaluation is run against a collection of..." or similar

Contributor:

s/1./*

Numbered lists are really more for sequences (i.e., procedures, etc.).

```posix-terminal
npm install @genkit-ai/evaluator @genkit-ai/vertexai
```
2. **Raw evaluation**: This type of evaluation directly assesses the quality of inputs without any inference. This approach is typically used with automated evaluation using metrics. All required fields for evaluation (`context`, `output`) must be present in the input dataset. This is useful when you have data coming from an external source (e.g., collected from your production traces) and you simply want an objective measurement of the quality of the collected data.
Contributor:

Do you mean quality of "outputs" here?

ssbushi (Contributor Author):

Hmm, no? Because the inputs contain the outputs ;)

I can see how this can be confusing though.
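To make the distinction between the two quoted evaluation modes concrete, here is a minimal sketch of a raw-evaluation dataset, where `output` and `context` are already present so no inference is needed. The file names, dataset shape, and exact CLI invocations below are assumptions for illustration, not taken from this PR:

```json
[
  {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": ["France is a country in Western Europe. Its capital is Paris."]
  }
]
```

```posix-terminal
# Raw evaluation: score the already-collected data directly, with no inference.
genkit eval:run collected_dataset.json

# Inference-based evaluation: run the flow on the inputs first, then score the outputs.
genkit eval:flow myFlow --input inputs.json
```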

@ssbushi (Contributor Author) commented Dec 18, 2024

> Overall, this LGTM. I wonder if we need to give them a RAG flow to evaluate, esp. given that we chose 'maliciousness'.

Maliciousness is not RAG specific.
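As background on the metric under discussion, here is a rough sketch of how the evaluator plugin installed above might be configured with a maliciousness metric. The import names, judge model, and option names are assumptions based on typical Genkit usage, not details taken from this PR:

```ts
// Hypothetical configuration sketch; names and options are assumptions.
import { genkit } from 'genkit';
import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';
import { vertexAI, gemini15Pro } from '@genkit-ai/vertexai';

const ai = genkit({
  plugins: [
    vertexAI(),
    // Register the maliciousness evaluator, using a Vertex AI model as the judge.
    genkitEval({
      judge: gemini15Pro,
      metrics: [GenkitMetric.MALICIOUSNESS],
    }),
  ],
});
```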

2. **Raw evaluation**: This type of evaluation directly assesses the quality of inputs without any inference. This approach is typically used with automated evaluation using metrics. All required fields for evaluation (`context`, `output`) must be present in the input dataset. This is useful when you have data coming from an external source (e.g., collected from your production traces) and you simply want an objective measurement of the quality of the collected data.
Contributor:
s/2./*
(see comment on L17)
