docs(evals): Updated docs for evals #1512
base: main
Conversation
You can see the details of your evaluation run on this page, including the original input, extracted context, and metrics (if any).
<!-- TODO, more convincing conclusion here? -->
@mjchristy thoughts?
You can also provide custom extractors to be used in the `eval:extractData` and
`eval:flow` commands. Custom extractors allow you to override the default
extraction logic, giving you more control when creating and evaluating datasets.
<!-- TODO: Any caveats on where this approach does not work (ES5 or something?) -->
@pavelgj thoughts?
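For context, here is a minimal sketch of how a custom extractor might be registered in `genkit-tools.conf.js`; the flow and step names are hypothetical placeholders, and the exact config schema should be checked against the evaluation docs:

```js
// genkit-tools.conf.js -- sketch only; '/flow/myFlow' and the step
// names below are hypothetical.
module.exports = {
  evaluators: [
    {
      actionRef: '/flow/myFlow',
      extractors: {
        // Take the `context` field from the output of a named trace step.
        context: { outputOf: 'retrieve-docs' },
        // Or name a step directly to use its output as the `output` field.
        output: 'generate-answer',
      },
    },
  ],
};
```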
Overall, this LGTM. I wonder if we need to give them a RAG flow to evaluate, esp. given that we chose 'maliciousness'.
```js
// ...
});
```
1. **Inference-based evaluation**: In this type of evaluation, a system is run on a collection of pre-determined inputs and the corresponding outputs are assessed for quality.
"A system is run on" sounds strange to me. Maybe we just say, "This type of evaluation is run against a collection of..." or similar
s/1./*
Numbered lists are really more for sequences (i.e., procedures, etc.).
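As a concrete anchor for the inference-based evaluation described above: a run is typically kicked off from the CLI. A sketch, with a hypothetical flow name and input file:

```posix-terminal
genkit eval:flow myFlow --input testInputs.json
```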
```posix-terminal
npm install @genkit-ai/evaluator @genkit-ai/vertexai
```
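After installing these packages, the evaluator plugin gets registered alongside the model plugin, which is presumably what the `// ... });` fragment quoted above is the tail of. A minimal sketch; the judge model, metric, and embedder chosen here are assumptions to double-check against the plugin docs:

```js
import { configureGenkit } from '@genkit-ai/core';
import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';
import { vertexAI, geminiPro, textEmbeddingGecko } from '@genkit-ai/vertexai';

export default configureGenkit({
  plugins: [
    vertexAI(),
    genkitEval({
      judge: geminiPro,                      // LLM used to score outputs (assumption)
      metrics: [GenkitMetric.MALICIOUSNESS], // the metric this doc chose
      embedder: textEmbeddingGecko,          // needed by some metrics (assumption)
    }),
  ],
  // ...
});
```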
2. **Raw evaluation**: This type of evaluation directly assesses the quality of inputs without any inference. This approach typically is used with automated evaluation using metrics. All required fields for evaluation (`context`, `output`) must be present in the input dataset. This is useful when you have data coming from an external source (eg: collected from your production traces) and you simply want to have an objective measurement of the quality of the collected data.
Do you mean quality of "outputs" here?
Hmm, no? Because the inputs contain the outputs ;)
I can see how this can be confusing though.
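To make the raw-evaluation shape concrete, a dataset entry might look like the sketch below. The field values are invented, and the exact field names should be verified against the evaluator schema:

```json
[
  {
    "input": "Which city is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": ["France's capital and largest city is Paris."]
  }
]
```

A file like this can then be scored directly, with no inference step, via a command like `genkit eval:run`.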
Maliciousness is not RAG specific.
s/2./*
(see comment on L17)