
Refactor and improve coverage #62

Merged
11 commits · May 22, 2024
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added
- Added pass rate metric to summary ([#60](https://github.com/awslabs/agent-evaluation/pull/60))

### Changed
- Renamed `TestResult.success` to `TestResult.passed` ([#62](https://github.com/awslabs/agent-evaluation/pull/62))
- Moved `agenteval.TargetResponse` to `agenteval.targets.TargetResponse`. Documentation for creating custom targets also updated to reflect this change ([#62](https://github.com/awslabs/agent-evaluation/pull/62))
- Renamed the target config `type` from `bedrock-knowledgebase` to `bedrock-knowledge-base` ([#62](https://github.com/awslabs/agent-evaluation/pull/62))

## [0.2.0] - 2024-05-13

### Changed
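The renames in the Unreleased section above are breaking for user code that imports `TargetResponse` from the package root or reads `TestResult.success`. A minimal migration sketch (hypothetical user code, not part of this PR) illustrates both Python-facing changes; the `bedrock-knowledge-base` rename is shown in the `agenteval.yml` diff further below.

```python
# Hypothetical user code showing the two Python-facing renames; the summarize()
# helper is illustrative only.

# Before this PR:
#   from agenteval import TargetResponse
#   status = "pass" if test_result.success else "fail"

# After this PR:
from agenteval.targets import TargetResponse  # moved out of the package root


def summarize(test_result) -> str:
    # TestResult.success was renamed to TestResult.passed
    return "pass" if test_result.passed else "fail"
```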
6 changes: 3 additions & 3 deletions docs/hooks.md
@@ -2,9 +2,9 @@ You can specify hooks that run before and/or after evaluating a test. This is us

To create your hooks, define a Python module containing a subclass of [Hook](reference/hook.md#src.agenteval.hook.Hook). The name of this module must contain the suffix `_hook` (e.g. `my_evaluation_hook`).

- Implement the `pre_evaluate` method for a hook that runs *before* evaluation. In this method, you have access to the [Test](reference/test.md#src.agenteval.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace) via the `test` and `trace` arguments, respectively.
- Implement the `pre_evaluate` method for a hook that runs *before* evaluation. In this method, you have access to the [Test](reference/test.md#src.agenteval.test.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace) via the `test` and `trace` arguments, respectively.

- Implement the `post_evaluate` method for a hook that runs *after* evaluation. Similar to the `pre_evaluate` method, you have access to the [Test](reference/test.md#src.agenteval.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace). You also have access to the [TestResult](reference/test_result.md#src.agenteval.test_result.TestResult) via the `test_result` argument. You may override the attributes of the `TestResult` if you plan to use this hook to perform additional testing, such as integration testing.
- Implement the `post_evaluate` method for a hook that runs *after* evaluation. Similar to the `pre_evaluate` method, you have access to the [Test](reference/test.md#src.agenteval.test.test.Test) and [Trace](reference/trace.md#src.agenteval.trace.Trace). You also have access to the [TestResult](reference/test_result.md#src.agenteval.test.test_result.TestResult) via the `test_result` argument. You may override the attributes of the `TestResult` if you plan to use this hook to perform additional testing, such as integration testing.


```python title="my_evaluation_hook.py"
@@ -81,7 +81,7 @@ In this example, we will test an agent that can make dinner reservations. In add

# override the test result based on query result
if not row:
test_result.success = False
test_result.passed = False
test_result.result = "Integration test failed"
test_result.reasoning = "Record was not inserted into the database"
```
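The snippet above shows only the tail of the hook example. A complete minimal module, assuming the `pre_evaluate`/`post_evaluate` signatures described earlier and using a hypothetical database check, would look roughly like this (remember the module name must end in `_hook`):

```python
# Sketch of a full hook module (e.g. my_evaluation_hook.py); the database lookup
# is a hypothetical stand-in for any post-evaluation integration check.
from agenteval import Hook
from agenteval.test import Test, TestResult
from agenteval.trace import Trace


class MyEvaluationHook(Hook):
    def pre_evaluate(self, test: Test, trace: Trace) -> None:
        # Runs before evaluation, e.g. to seed fixtures the agent depends on.
        pass

    def post_evaluate(self, test: Test, trace: Trace, test_result: TestResult) -> None:
        # Runs after evaluation; TestResult attributes may be overridden here.
        if not self._record_exists():
            test_result.passed = False
            test_result.result = "Integration test failed"
            test_result.reasoning = "Record was not inserted into the database"

    def _record_exists(self) -> bool:
        # Hypothetical check against the application's database.
        return True
```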
1 change: 1 addition & 0 deletions docs/reference/base_target.md
@@ -0,0 +1 @@
::: src.agenteval.targets.base_target
1 change: 0 additions & 1 deletion docs/reference/conversation.md

This file was deleted.

2 changes: 0 additions & 2 deletions docs/reference/evaluator.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/reference/target.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/reference/target_response.md
@@ -0,0 +1 @@
::: src.agenteval.targets.target_response
2 changes: 1 addition & 1 deletion docs/reference/test.md
@@ -1 +1 @@
::: src.agenteval.test
::: src.agenteval.test.test
2 changes: 1 addition & 1 deletion docs/reference/test_result.md
@@ -1 +1 @@
::: src.agenteval.test_result
::: src.agenteval.test.test_result
2 changes: 1 addition & 1 deletion docs/targets/bedrock_knowledge_bases.md
@@ -12,7 +12,7 @@ The principal must have the following permissions:

```yaml title="agenteval.yml"
target:
type: bedrock-knowledgebase
type: bedrock-knowledge-base
model_id: my-model-id
knowledge_base_id: my-kb-id
```
13 changes: 6 additions & 7 deletions docs/targets/custom_targets.md
@@ -1,10 +1,11 @@
# Custom Targets

If you want to test an agent that is not natively supported, you can bring your own Target by defining a Python module containing a subclass of [BaseTarget](../reference/target.md#src.agenteval.targets.target.BaseTarget). The name of this module must contain the suffix `_target` (e.g. `my_custom_target`), and the subclass should implement the `invoke` method to invoke your agent.
If you want to test an agent that is not natively supported, you can bring your own Target by defining a Python module containing a subclass of [BaseTarget](../reference/base_target.md#src.agenteval.targets.base_target.BaseTarget). The name of this module must contain the suffix `_target` (e.g. `my_custom_target`).

The subclass should implement the `invoke` method to invoke your agent and return a [TargetResponse](../reference/target_response.md#src.agenteval.targets.target_response.TargetResponse).

```python title="my_custom_target.py"
from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse
from my_agent import MyAgent

class MyCustomTarget(BaseTarget):
@@ -47,8 +48,7 @@ We will implement a custom Target that invokes an agent exposed as a REST API.

import requests

from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse


class MyAPITarget(BaseTarget):
@@ -96,8 +96,7 @@ We will create a simple [LangChain](https://python.langchain.com/docs/modules/ag
from langchain import hub
from langchain.agents import AgentExecutor, create_xml_agent

from agenteval.targets import BaseTarget
from agenteval import TargetResponse
from agenteval.targets import BaseTarget, TargetResponse

llm = Bedrock(model_id="anthropic.claude-v2:1")

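Because the diff truncates the custom target examples, here is a self-contained sketch under the new combined import. The `invoke` signature (a single prompt string) and the `response` keyword of `TargetResponse` are assumptions drawn from the surrounding documentation; the module name must end in `_target`.

```python
# Minimal custom target sketch (e.g. my_custom_target.py); the echo behaviour is
# purely illustrative, and the invoke/TargetResponse shapes are assumed.
from agenteval.targets import BaseTarget, TargetResponse


class EchoTarget(BaseTarget):
    """A trivial target that repeats the prompt, useful as a wiring smoke test."""

    def invoke(self, prompt: str) -> TargetResponse:
        # A real target would call your agent here and return its reply.
        return TargetResponse(response=f"echo: {prompt}")
```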
2 changes: 1 addition & 1 deletion docs/targets/q_business.md
@@ -1,4 +1,4 @@
# Amazon Q for Business
# Amazon Q Business

Amazon Q Business is a generative AI–powered assistant that can answer questions, provide summaries, generate content, and securely complete tasks based on data and information in your enterprise systems. For more information, visit the AWS documentation [here](https://docs.aws.amazon.com/amazonq/latest/business-use-dg/what-is.html).

13 changes: 6 additions & 7 deletions mkdocs.yml
@@ -52,13 +52,12 @@ nav:
- Hooks: hooks.md
- CLI: cli.md
- Reference:
- conversation: reference/conversation.md
- evaluator: reference/evaluator.md
- hook: reference/hook.md
- target: reference/target.md
- test: reference/test.md
- test_result: reference/test_result.md
- trace: reference/trace.md
- BaseTarget: reference/base_target.md
- Hook: reference/hook.md
- TargetResponse: reference/target_response.md
- Test: reference/test.md
- TestResult: reference/test_result.md
- Trace: reference/trace.md
repo_url: https://github.com/awslabs/agent-evaluation
repo_name: awslabs/agent-evaluation
markdown_extensions:
4 changes: 4 additions & 0 deletions setup.cfg
@@ -1,3 +1,7 @@
[flake8]
# E501 line too long (85 > 79 characters)
extend-ignore = E501

[coverage:run]
omit =
src/agenteval/hook.py
3 changes: 1 addition & 2 deletions src/agenteval/__init__.py
@@ -10,9 +10,8 @@
from rich.logging import RichHandler

from .hook import Hook
from .target_response import TargetResponse

__all__ = ["Hook", "TargetResponse"]
__all__ = ["Hook"]
__version__ = version("agent-evaluation")


72 changes: 29 additions & 43 deletions src/agenteval/cli.py
@@ -1,23 +1,27 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

import logging
import os
from enum import Enum
from typing import Optional

import click

from agenteval.plan import Plan
from agenteval.runner import Runner
from agenteval.plan.exceptions import TestFailureError

logger = logging.getLogger(__name__)

class ExitCode(Enum):
TESTS_FAILED = 1
PLAN_ALREADY_EXISTS = 2

def validate_directory(directory):
if not os.path.isdir(directory):
raise NotADirectoryError(f"{directory} is not a directory")
if not os.access(directory, os.R_OK) or not os.access(directory, os.W_OK):
raise PermissionError(f"No read/write permissions for {directory}")

def validate_directory(ctx, param, value):
if value:
if not os.path.isdir(value):
raise click.BadParameter(f"{value} is not a directory")
if not os.access(value, os.R_OK) or not os.access(value, os.W_OK):
raise click.BadParameter(f"No read/write permissions for {value}")


@click.group()
@@ -30,51 +34,49 @@ def cli():
"--plan-dir",
type=str,
required=False,
help="The destination directory for storing the test plan. If unspecified, then the test plan is saved to the current working directory.",
help="The directory to store the test plan. If a directory is not provided, the test plan will be saved to the current working directory.",
callback=validate_directory,
)
def init(plan_dir: Optional[str]):
if plan_dir:
validate_directory(plan_dir)
try:
path = Plan.init_plan(plan_dir)
logger.info(f"[green]Test plan created at {path}")

except FileExistsError as e:
logger.error(f"[red]{e}")
exit(1)
Plan.init_plan(plan_dir)
except FileExistsError:
exit(ExitCode.PLAN_ALREADY_EXISTS.value)


@cli.command(help="Run test plan.")
@click.option(
"--filter",
type=str,
required=False,
help="Specifies the test(s) to run. Multiple tests should be seperated using a comma. If unspecified, all tests from the test plan will be run.",
help="Specifies the test(s) to run, where multiple tests should be seperated using a comma. If a filter is not provided, all tests will be run.",
)
@click.option(
"--plan-dir",
type=str,
required=False,
help="The directory where the test plan is stored. If unspecified, then the current working directory is used.",
help="The directory where the test plan is stored. If a directory is not provided, the test plan will be read from the current working directory.",
callback=validate_directory,
)
@click.option(
"--verbose",
is_flag=True,
type=bool,
default=False,
help="Controls the verbosity of the terminal logs.",
help="Whether to enable verbose logging. Defaults to False.",
)
@click.option(
"--num-threads",
type=int,
required=False,
help="Number of threads (and thus tests) to run concurrently. If unspecified, number of threads will be capped at 45.",
help="Number of threads used to run tests concurrently. If the number of threads is not provided, the thread count will be set to the number of tests (up to a maximum of 45 threads).",
)
@click.option(
"--work-dir",
type=str,
required=False,
help="The directory where the test result and trace will be generated. If unspecified, then the current working directory is used.",
help="The directory where the test result and trace will be generated. If a directory is not provided, the assets will be saved to the current working directory.",
callback=validate_directory,
)
def run(
filter: Optional[str],
@@ -84,26 +86,10 @@ def run(
work_dir: Optional[str],
):
try:
plan = Plan.load(plan_dir, filter)
if work_dir:
validate_directory(work_dir)
runner = Runner(
plan,
verbose,
num_threads,
work_dir,
plan = Plan.load(plan_dir)
plan.run(
verbose=verbose, num_threads=num_threads, work_dir=work_dir, filter=filter
)
num_failed = runner.run()
_num_failed_exit(num_failed)

except Exception as e:
_exception_exit(e)


def _num_failed_exit(num_failed):
exit(1 if num_failed else 0)


def _exception_exit(e):
logger.exception(f"Error running test: {e}")
exit(1)
except TestFailureError:
exit(ExitCode.TESTS_FAILED.value)
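The `ExitCode` enum and the click-native `validate_directory` callback make the CLI failure paths straightforward to cover. A hypothetical sketch using click's `CliRunner` (the `cli` group and `ExitCode` come from the diff; the test itself and the `init` command registration are assumptions) shows one way to exercise the `PLAN_ALREADY_EXISTS` path:

```python
# Hypothetical pytest-style test for the refactored CLI; assumes the "init"
# command is registered on the cli group as shown in the diff.
from click.testing import CliRunner

from agenteval.cli import ExitCode, cli


def test_init_twice_reports_existing_plan(tmp_path):
    runner = CliRunner()
    # First call creates the plan; the second should hit the FileExistsError branch.
    first = runner.invoke(cli, ["init", "--plan-dir", str(tmp_path)])
    second = runner.invoke(cli, ["init", "--plan-dir", str(tmp_path)])
    assert first.exit_code == 0
    assert second.exit_code == ExitCode.PLAN_ALREADY_EXISTS.value
```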
4 changes: 3 additions & 1 deletion src/agenteval/conversation.py
@@ -15,11 +15,13 @@ class Conversation:
"""

def __init__(self):
"""
Initialize the conversation.
"""
self.messages = []
self.turns = _START_TURN_COUNT

def __iter__(self):
"""Allow iteration over conversation messages."""
return iter(self.messages)

def add_turn(self, user_message: str, agent_response: str):
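The diff cuts off before the body of `add_turn`, but the intended usage of `Conversation` follows from what is shown: turns are added as user/agent pairs and `__iter__` exposes the stored messages. A hedged usage sketch (the exact message format is an assumption):

```python
# Hypothetical usage of Conversation; how add_turn stores each pair is not shown
# in the diff, so only the iteration behaviour below is taken from it.
from agenteval.conversation import Conversation

conversation = Conversation()
conversation.add_turn(
    "Book a table for two at 7pm",
    "Your reservation is confirmed.",
)

# __iter__ returns iter(self.messages), so the conversation can be looped over directly.
for message in conversation:
    print(message)
```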
28 changes: 15 additions & 13 deletions src/agenteval/evaluators/base_evaluator.py
@@ -8,12 +8,10 @@
from agenteval.conversation import Conversation
from agenteval.hook import Hook
from agenteval.targets import BaseTarget
from agenteval.test import Test
from agenteval.test_result import TestResult
from agenteval.test import Test, TestResult
from agenteval.trace import Trace
from agenteval.utils import create_boto3_client, import_class

_DEFAULT_MAX_RETRY = 10
_BOTO3_SERVICE_NAME = "bedrock-runtime"


@@ -44,20 +42,21 @@ def __init__(
aws_profile: Optional[str] = None,
aws_region: Optional[str] = None,
endpoint_url: Optional[str] = None,
max_retry: int = _DEFAULT_MAX_RETRY,
max_retry: int = 10,
):
"""Initialize the evaluator instance for a given `Test` and `Target`.
"""Initialize the evaluator.

Args:
test (Test): The test case.
target (BaseTarget): The target agent being evaluated.
work_dir (str): The work directory.
work_dir (str): The directory where the test result and trace will be
generated.
model_id (str): The ID of the Bedrock model used to run evaluation.
provisioned_throughput_arn (str, optional): The ARN of the provisioned throughput.
aws_profile (str, optional): The AWS profile name.
aws_region (str, optional): The AWS region.
endpoint_url (str, optional): The endpoint URL for the AWS service.
max_retry (int, optional): The maximum number of retry attempts.
provisioned_throughput_arn (Optional[str]): The ARN of the provisioned throughput.
aws_profile (Optional[str]): The AWS profile name.
aws_region (Optional[str]): The AWS region.
endpoint_url (Optional[str]): The endpoint URL for the AWS service.
max_retry (int): The maximum number of retry attempts.
"""
self.test = test
self.target = target
Expand All @@ -77,10 +76,10 @@ def __init__(

@abstractmethod
def evaluate(self) -> TestResult:
"""Conduct a test.
"""Conduct the test.

Returns:
TestResult: The result of the test.
TestResult
"""
pass

@@ -125,6 +124,9 @@ def run(self) -> TestResult:
"""
Run the evaluator within a trace context manager and run hooks
if provided.

Returns:
TestResult
"""

hook_cls = self._get_hook_cls(self.test.hook)
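To close out the evaluator changes, a skeleton subclass shows the abstract contract under the new `agenteval.test` import path; everything beyond the import paths and the `evaluate` signature is a stub.

```python
# Skeleton evaluator subclass; module paths and the evaluate() contract come from
# the diff above, the body is intentionally left unimplemented.
from agenteval.evaluators.base_evaluator import BaseEvaluator
from agenteval.test import TestResult


class MyEvaluator(BaseEvaluator):
    def evaluate(self) -> TestResult:
        # Drive the conversation via self.target and self.test, then return a
        # TestResult whose passed/result/reasoning fields describe the outcome.
        raise NotImplementedError
```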