
Refactor and improve coverage #62

Merged
11 commits · May 22, 2024
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Added pass rate metric to summary ([#60](https://github.com/awslabs/agent-evaluation/pull/60))

### Changed
- Renamed `TestResult.success` to `TestResult.passed` ([#62](https://github.com/awslabs/agent-evaluation/pull/62))
- Moved `agenteval.TargetResponse` to `agenteval.targets.TargetResponse`. Documentation for creating custom targets also updated to reflect this change ([#62](https://github.com/awslabs/agent-evaluation/pull/62))
- Renamed the target config `type` from `bedrock-knowledgebase` to `bedrock-knowledge-base` ([#62](https://github.com/awslabs/agent-evaluation/pull/62))

## [0.2.0] - 2024-05-13

### Changed
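The renames in the Unreleased section above are breaking for user code that imports `TargetResponse` from the package root or reads `TestResult.success`. A minimal migration sketch (hypothetical user code, not part of this PR) illustrates both Python-facing changes; the `bedrock-knowledge-base` rename is shown in the `agenteval.yml` diff further below.

```python
# Hypothetical user code showing the two Python-facing renames; the summarize()
# helper is illustrative only.

# Before this PR:
#   from agenteval import TargetResponse
#   status = "pass" if test_result.success else "fail"

# After this PR:
from agenteval.targets import TargetResponse  # moved out of the package root


def summarize(test_result) -> str:
    # TestResult.success was renamed to TestResult.passed
    return "pass" if test_result.passed else "fail"
```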
6 changes: 3 additions & 3 deletions docs/hooks.md
@@ -2,9 +2,9 @@ You can specify hooks that run before and/or after evaluating a test. This is us

To create your hooks, define a Python module containing a subclass of [Hook](reference/hook.md#src.agenteval.hook.Hook). The name of this module must contain the suffix `_hook` (e.g. `my_evaluation_hook`).

- Implement the `pre_evaluate` method for a hook that runs *before* evaluation. In this method, you have access to the [Test](reference/test.md#src.agenteval.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace) via the `test` and `trace` arguments, respectively.
- Implement the `pre_evaluate` method for a hook that runs *before* evaluation. In this method, you have access to the [Test](reference/test.md#src.agenteval.test.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace) via the `test` and `trace` arguments, respectively.

- Implement the `post_evaluate` method for a hook that runs *after* evaluation. Similar to the `pre_evaluate` method, you have access to the [Test](reference/test.md#src.agenteval.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace). You also have access to the [TestResult](reference/test_result.md#src.agenteval.test_result.TestResult) via the `test_result` argument. You may override the attributes of the `TestResult` if you plan to use this hook to perform additional testing, such as integration testing.
- Implement the `post_evaluate` method for a hook that runs *after* evaluation. Similar to the `pre_evaluate` method, you have access to the [Test](reference/test.md#src.agenteval.test.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace). You also have access to the [TestResult](reference/test_result.md#src.agenteval.test.test_result.TestResult) via the `test_result` argument. You may override the attributes of the `TestResult` if you plan to use this hook to perform additional testing, such as integration testing.


```python title="my_evaluation_hook.py"
@@ -81,7 +81,7 @@ In this example, we will test an agent that can make dinner reservations. In add

# override the test result based on query result
if not row:
test_result.success = False
test_result.passed = False
test_result.result = "Integration test failed"
test_result.reasoning = "Record was not inserted into the database"
```
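The snippet above shows only the tail of the hook example. A complete minimal module, assuming the `pre_evaluate`/`post_evaluate` signatures described earlier and using a hypothetical database check, would look roughly like this (remember the module name must end in `_hook`):

```python
# Sketch of a full hook module (e.g. my_evaluation_hook.py); the database lookup
# is a hypothetical stand-in for any post-evaluation integration check.
from agenteval import Hook
from agenteval.test import Test, TestResult
from agenteval.trace import Trace


class MyEvaluationHook(Hook):
    def pre_evaluate(self, test: Test, trace: Trace) -> None:
        # Runs before evaluation, e.g. to seed fixtures the agent depends on.
        pass

    def post_evaluate(self, test: Test, trace: Trace, test_result: TestResult) -> None:
        # Runs after evaluation; TestResult attributes may be overridden here.
        if not self._record_exists():
            test_result.passed = False
            test_result.result = "Integration test failed"
            test_result.reasoning = "Record was not inserted into the database"

    def _record_exists(self) -> bool:
        # Hypothetical check against the application's database.
        return True
```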
1 change: 1 addition & 0 deletions docs/reference/base_target.md
@@ -0,0 +1 @@
::: src.agenteval.targets.base_target
1 change: 0 additions & 1 deletion docs/reference/conversation.md

This file was deleted.

2 changes: 0 additions & 2 deletions docs/reference/evaluator.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/reference/target.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/reference/target_response.md
@@ -0,0 +1 @@
::: src.agenteval.targets.target_response
2 changes: 1 addition & 1 deletion docs/reference/test.md
@@ -1 +1 @@
::: src.agenteval.test
::: src.agenteval.test.test
2 changes: 1 addition & 1 deletion docs/reference/test_result.md
@@ -1 +1 @@
::: src.agenteval.test_result
::: src.agenteval.test.test_result
2 changes: 1 addition & 1 deletion docs/targets/bedrock_knowledge_bases.md
@@ -12,7 +12,7 @@ The principal must have the following permissions:

```yaml title="agenteval.yml"
target:
type: bedrock-knowledgebase
type: bedrock-knowledge-base
model_id: my-model-id
knowledge_base_id: my-kb-id
```
13 changes: 6 additions & 7 deletions docs/targets/custom_targets.md
@@ -1,10 +1,11 @@
# Custom Targets

If you want to test an agent that is not natively supported, you can bring your own Target by defining a Python module containing a subclass of [BaseTarget](../reference/target.md#src.agenteval.targets.target.BaseTarget). The name of this module must contain the suffix `_target` (e.g. `my_custom_target`), and the subclass should implement the `invoke` method to invoke your agent.
If you want to test an agent that is not natively supported, you can bring your own Target by defining a Python module containing a subclass of [BaseTarget](../reference/base_target.md#src.agenteval.targets.base_target.BaseTarget). The name of this module must contain the suffix `_target` (e.g. `my_custom_target`).

The subclass should implement the `invoke` method to invoke your agent and return a [TargetResponse](../reference/target_response.md#src.agenteval.targets.target_response.TargetResponse).

```python title="my_custom_target.py"
from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse
from my_agent import MyAgent

class MyCustomTarget(BaseTarget):
@@ -47,8 +48,7 @@ We will implement a custom Target that invokes an agent exposed as a REST API.

import requests

from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse


class MyAPITarget(BaseTarget):
@@ -96,8 +96,7 @@ We will create a simple [LangChain](https://python.langchain.com/docs/modules/ag
from langchain import hub
from langchain.agents import AgentExecutor, create_xml_agent

from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse

llm = Bedrock(model_id="anthropic.claude-v2:1")

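Because the diff truncates the custom target examples, here is a self-contained sketch under the new combined import. The `invoke` signature (a single prompt string) and the `response` keyword of `TargetResponse` are assumptions drawn from the surrounding documentation; the module name must end in `_target`.

```python
# Minimal custom target sketch (e.g. my_custom_target.py); the echo behaviour is
# purely illustrative, and the invoke/TargetResponse shapes are assumed.
from agenteval.targets import BaseTarget, TargetResponse


class EchoTarget(BaseTarget):
    """A trivial target that repeats the prompt, useful as a wiring smoke test."""

    def invoke(self, prompt: str) -> TargetResponse:
        # A real target would call your agent here and return its reply.
        return TargetResponse(response=f"echo: {prompt}")
```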
2 changes: 1 addition & 1 deletion docs/targets/q_business.md
@@ -1,4 +1,4 @@
# Amazon Q for Business
# Amazon Q Business

Amazon Q Business is a generative AI–powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. For more information, visit the AWS documentation [here](https://docs.aws.amazon.com/amazonq/latest/business-use-dg/what-is.html).

13 changes: 6 additions & 7 deletions mkdocs.yml
@@ -52,13 +52,12 @@ nav:
- Hooks: hooks.md
- CLI: cli.md
- Reference:
- conversation: reference/conversation.md
- evaluator: reference/evaluator.md
- hook: reference/hook.md
- target: reference/target.md
- test: reference/test.md
- test_result: reference/test_result.md
- trace: reference/trace.md
- BaseTarget: reference/base_target.md
- Hook: reference/hook.md
- TargetResponse: reference/target_response.md
- Test: reference/test.md
- TestResult: reference/test_result.md
- Trace: reference/trace.md
repo_url: https://github.com/awslabs/agent-evaluation
repo_name: awslabs/agent-evaluation
markdown_extensions:
4 changes: 4 additions & 0 deletions setup.cfg
@@ -1,3 +1,7 @@
[flake8]
# E501 line too long (85 > 79 characters)
extend-ignore = E501

[coverage:run]
omit =
src/agenteval/hook.py
3 changes: 1 addition & 2 deletions src/agenteval/__init__.py
@@ -10,9 +10,8 @@
from rich.logging import RichHandler

from .hook import Hook
from .target_response import TargetResponse

__all__ = ["Hook", "TargetResponse"]
__all__ = ["Hook"]
__version__ = version("agent-evaluation")


72 changes: 29 additions & 43 deletions src/agenteval/cli.py
@@ -1,23 +1,27 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

import logging
import os
from enum import Enum
from typing import Optional

import click

from agenteval.plan import Plan
from agenteval.runner import Runner
from agenteval.plan.exceptions import TestFailureError

logger = logging.getLogger(__name__)

class ExitCode(Enum):
TESTS_FAILED = 1
PLAN_ALREADY_EXISTS = 2

def validate_directory(directory):
if not os.path.isdir(directory):
raise NotADirectoryError(f"{directory} is not a directory")
if not os.access(directory, os.R_OK) or not os.access(directory, os.W_OK):
raise PermissionError(f"No read/write permissions for {directory}")

def validate_directory(ctx, param, value):
if value:
if not os.path.isdir(value):
raise click.BadParameter(f"{value} is not a directory")
if not os.access(value, os.R_OK) or not os.access(value, os.W_OK):
raise click.BadParameter(f"No read/write permissions for {value}")


@click.group()
@@ -30,51 +34,49 @@ def cli():
"--plan-dir",
type=str,
required=False,
help="The destination directory for storing the test plan. If unspecified, then the test plan is saved to the current working directory.",
help="The directory to store the test plan. If a directory is not provided, the test plan will be saved to the current working directory.",
callback=validate_directory,
)
def init(plan_dir: Optional[str]):
if plan_dir:
validate_directory(plan_dir)
try:
path = Plan.init_plan(plan_dir)
logger.info(f"[green]Test plan created at {path}")

except FileExistsError as e:
logger.error(f"[red]{e}")
exit(1)
Plan.init_plan(plan_dir)
except FileExistsError:
exit(ExitCode.PLAN_ALREADY_EXISTS.value)


@cli.command(help="Run test plan.")
@click.option(
"--filter",
type=str,
required=False,
help="Specifies the test(s) to run. Multiple tests should be seperated using a comma. If unspecified, all tests from the test plan will be run.",
help="Specifies the test(s) to run, where multiple tests should be seperated using a comma. If a filter is not provided, all tests will be run.",
)
@click.option(
"--plan-dir",
type=str,
required=False,
help="The directory where the test plan is stored. If unspecified, then the current working directory is used.",
help="The directory where the test plan is stored. If a directory is not provided, the test plan will be read from the current working directory.",
callback=validate_directory,
)
@click.option(
"--verbose",
is_flag=True,
type=bool,
default=False,
help="Controls the verbosity of the terminal logs.",
help="Whether to enable verbose logging. Defaults to False.",
)
@click.option(
"--num-threads",
type=int,
required=False,
help="Number of threads (and thus tests) to run concurrently. If unspecified, number of threads will be capped at 45.",
help="Number of threads used to run tests concurrently. If the number of threads is not provided, the thread count will be set to the number of tests (up to a maximum of 45 threads).",
)
@click.option(
"--work-dir",
type=str,
required=False,
help="The directory where the test result and trace will be generated. If unspecified, then the current working directory is used.",
help="The directory where the test result and trace will be generated. If a directory is not provided, the assets will be saved to the current working directory.",
callback=validate_directory,
)
def run(
filter: Optional[str],
@@ -84,26 +86,10 @@ def run(
work_dir: Optional[str],
):
try:
plan = Plan.load(plan_dir, filter)
if work_dir:
validate_directory(work_dir)
runner = Runner(
plan,
verbose,
num_threads,
work_dir,
plan = Plan.load(plan_dir)
plan.run(
verbose=verbose, num_threads=num_threads, work_dir=work_dir, filter=filter
)
num_failed = runner.run()
_num_failed_exit(num_failed)

except Exception as e:
_exception_exit(e)


def _num_failed_exit(num_failed):
exit(1 if num_failed else 0)


def _exception_exit(e):
logger.exception(f"Error running test: {e}")
exit(1)
except TestFailureError:
exit(ExitCode.TESTS_FAILED.value)
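The `ExitCode` enum and the click-native `validate_directory` callback make the CLI failure paths straightforward to cover. A hypothetical sketch using click's `CliRunner` (the `cli` group and `ExitCode` come from the diff; the test itself and the `init` command registration are assumptions) shows one way to exercise the `PLAN_ALREADY_EXISTS` path:

```python
# Hypothetical pytest-style test for the refactored CLI; assumes the "init"
# command is registered on the cli group as shown in the diff.
from click.testing import CliRunner

from agenteval.cli import ExitCode, cli


def test_init_twice_reports_existing_plan(tmp_path):
    runner = CliRunner()
    # First call creates the plan; the second should hit the FileExistsError branch.
    first = runner.invoke(cli, ["init", "--plan-dir", str(tmp_path)])
    second = runner.invoke(cli, ["init", "--plan-dir", str(tmp_path)])
    assert first.exit_code == 0
    assert second.exit_code == ExitCode.PLAN_ALREADY_EXISTS.value
```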
4 changes: 3 additions & 1 deletion src/agenteval/conversation.py
@@ -15,11 +15,13 @@ class Conversation:
"""

def __init__(self):
"""
Initialize the conversation.
"""
self.messages = []
self.turns = _START_TURN_COUNT

def __iter__(self):
"""Allow iteration over conversation messages."""
return iter(self.messages)

def add_turn(self, user_message: str, agent_response: str):
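The diff cuts off before the body of `add_turn`, but the intended usage of `Conversation` follows from what is shown: turns are added as user/agent pairs and `__iter__` exposes the stored messages. A hedged usage sketch (the exact message format is an assumption):

```python
# Hypothetical usage of Conversation; how add_turn stores each pair is not shown
# in the diff, so only the iteration behaviour below is taken from it.
from agenteval.conversation import Conversation

conversation = Conversation()
conversation.add_turn(
    "Book a table for two at 7pm",
    "Your reservation is confirmed.",
)

# __iter__ returns iter(self.messages), so the conversation can be looped over directly.
for message in conversation:
    print(message)
```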
28 changes: 15 additions & 13 deletions src/agenteval/evaluators/base_evaluator.py
@@ -8,12 +8,10 @@
from agenteval.conversation import Conversation
from agenteval.hook import Hook
from agenteval.targets import BaseTarget
from agenteval.test import Test
from agenteval.test_result import TestResult
from agenteval.test import Test, TestResult
from agenteval.trace import Trace
from agenteval.utils import create_boto3_client, import_class

_DEFAULT_MAX_RETRY = 10
_BOTO3_SERVICE_NAME = "bedrock-runtime"


@@ -44,20 +42,21 @@ def __init__(
aws_profile: Optional[str] = None,
aws_region: Optional[str] = None,
endpoint_url: Optional[str] = None,
max_retry: int = _DEFAULT_MAX_RETRY,
max_retry: int = 10,
):
"""Initialize the evaluator instance for a given `Test` and `Target`.
"""Initialize the evaluator.

Args:
test (Test): The test case.
target (BaseTarget): The target agent being evaluated.
work_dir (str): The work directory.
work_dir (str): The directory where the test result and trace will be
generated.
model_id (str): The ID of the Bedrock model used to run evaluation.
provisioned_throughput_arn (str, optional): The ARN of the provisioned throughput.
aws_profile (str, optional): The AWS profile name.
aws_region (str, optional): The AWS region.
endpoint_url (str, optional): The endpoint URL for the AWS service.
max_retry (int, optional): The maximum number of retry attempts.
provisioned_throughput_arn (Optional[str]): The ARN of the provisioned throughput.
aws_profile (Optional[str]): The AWS profile name.
aws_region (Optional[str]): The AWS region.
endpoint_url (Optional[str]): The endpoint URL for the AWS service.
max_retry (int): The maximum number of retry attempts.
"""
self.test = test
self.target = target
Expand All @@ -77,10 +76,10 @@ def __init__(

@abstractmethod
def evaluate(self) -> TestResult:
"""Conduct a test.
"""Conduct the test.

Returns:
TestResult: The result of the test.
TestResult
"""
pass

@@ -125,6 +124,9 @@ def run(self) -> TestResult:
"""
Run the evaluator within a trace context manager and run hooks
if provided.

Returns:
TestResult
"""

hook_cls = self._get_hook_cls(self.test.hook)
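To close out the evaluator changes, a skeleton subclass shows the abstract contract under the new `agenteval.test` import path; everything beyond the import paths and the `evaluate` signature is a stub.

```python
# Skeleton evaluator subclass; module paths and the evaluate() contract come from
# the diff above, the body is intentionally left unimplemented.
from agenteval.evaluators.base_evaluator import BaseEvaluator
from agenteval.test import TestResult


class MyEvaluator(BaseEvaluator):
    def evaluate(self) -> TestResult:
        # Drive the conversation via self.target and self.test, then return a
        # TestResult whose passed/result/reasoning fields describe the outcome.
        raise NotImplementedError
```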