diff --git a/blogs/13_bfcl_v3_multi_turn.html b/blogs/13_bfcl_v3_multi_turn.html index 2c28526e5..d8cf7d2ab 100644 --- a/blogs/13_bfcl_v3_multi_turn.html +++ b/blogs/13_bfcl_v3_multi_turn.html @@ -99,6 +99,67 @@ .code-toggle:hover { text-decoration: underline; } + + .level-1 { + background-color: #E2E2E2; + /* Darkest gray for top level */ + font-weight: bold; + } + + .level-2 { + background-color: #E8E8E8; + /* Slightly lighter gray */ + } + + .level-3 { + background-color: #EFEFEF; + /* Medium gray */ + } + + .level-4 { + background-color: #F5F5F5; + /* Light gray */ + } + + .level-5 { + background-color: #FFFFFF; + /* Pure white */ + } + + .scoring-table { + width: 100%; + border-collapse: collapse; + table-layout: auto; + } + + .scoring-table th, + .scoring-table td { + word-wrap: break-word; + padding: 5px; + text-align: center; + } + + @media screen and (max-width: 768px) { + + .scoring-table th, + .scoring-table td { + padding: 1px; + } + } + + .note { + font-size: 0.8em; + color: #666; + } + + .na { + color: #999; + font-style: italic; + } + + .emoji { + display: block; + } @@ -119,7 +180,7 @@

🦍 Gorilla: Large Language Model Connected with Massive APIs

-

BFCL V3: Introducing Multi-Turn & Multi-Step Function Calling

+

BFCL V3: Multi-Turn & Multi-Step Function Calling


-

Existing Tool Calling Dataset

+

Existing Tool Calling Dataset

Table Example @@ -417,74 +482,77 @@

Existing Tool Calling Dataset


-

Dataset Composition

+

Dataset Composition

+

Newly Introduced Categories

Responsive Image - Berkeley Function-Calling - Leaderboard (BFCL V3 β€’ Multi-Turn & Multi-Step Function Calling) Data Composition + BFCL V3 β€’ Multi-Turn & Multi-Step Function Calling Data Composition


+ +
Examples + style="max-width: 100%; width: 100%; margin-bottom: 20px;"> + + An example of how the missing parameter and missing function are augmented from the base entries. +
+

Here we visualize the data statistics of the BFCL V3 Base Multi Turn dataset (the augmented categories follow similar distributions):

@@ -511,13 +579,271 @@

Dataset Composition

Top 15 Tool Usage in BFCL V3
- + +

Full Leaderboard Score Composition

+

The full BFCL V3 leaderboard score is composed of the following components. The number in parentheses indicates the number of entries in each category.

+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ πŸ† Overall Score +
Unweighted Average
+
(4751)
+
+ πŸ“‚ Non-Live (Single-Turn) +
Unweighted Average
+
(1700)
+
+ πŸ’‘ Live (Single-Turn) +
Weighted Average
+
(2251)
+
+ πŸ”„ Multi-Turn +
Unweighted Average
+
(800)
+
+ 🌳 AST +
Unweighted Average
+
(1150)
+
+ πŸš€ Exec +
Unweighted Average
+
(310)
+
+ 🧐 Relevance Detection +
(240)
+
+ 🌳 AST +
Weighted Average
+
(1351)
+
+ 🧐 Relevance Detection +
Independent Categories
+
(900)
+
+ πŸ“ Base Case +
(200)
+
+ πŸ” Augmented Entries +
Independent Categories
+
(600)
+
+ 🀏 Simple +
Unweighted Average
+
+ πŸ“¦ Multiple + + πŸ”€ Parallel + + πŸ“¦πŸ”€ Parallel Multiple + + 🀏 Simple +
Unweighted Average
+
+ πŸ“¦ Multiple + + πŸ”€ Parallel + + πŸ“¦πŸ”€ Parallel Multiple + + + + 🀏 Simple + + πŸ“¦ Multiple + + πŸ”€ Parallel + + πŸ“¦πŸ”€ Parallel Multiple + + + + + + + + + + + + +
+ 🐍 Python Simple AST +
(400)
+
+ β˜• Java Simple AST +
(100)
+
+ πŸ’» JavaScript Simple AST +
(50)
+
+ 🐍 Python Multiple AST +
(200)
+
+ 🐍 Python Parallel AST +
(200)
+
+ 🐍 Python Parallel Multiple AST +
(200)
+
+ 🐍 Python Simple Exec +
(100)
+
+ 🌐 REST API Exec +
(70)
+
+ 🐍 Python Multiple Exec +
(50)
+
+ 🐍 Python Parallel Exec +
(50)
+
+ 🐍 Python Parallel Multiple Exec +
(40)
+
+ ❌ Irrelevance +
(240)
+
+ 🐍 Python Live Simple AST +
(258)
+
+ 🐍 Python Live Multiple AST +
(1053)
+
+ 🐍 Python Live Parallel AST +
(16)
+
+ 🐍 Python Live Parallel Multiple AST +
(24)
+
+ βœ… Live Relevance +
(18)
+
+ ❌ Live Irrelevance +
(882)
+
+ πŸ“ Python Multi Turn Base +
(200)
+
+ πŸ” Python Multi Turn Missing Function +
(200)
+
+ ⚠️ Python Multi Turn Missing Parameter +
(200)
+
+ πŸ“œ Python Multi Turn Long Context +
(200)
+
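To make the aggregation concrete, here is a minimal sketch of how an unweighted average differs from a weighted one when rolling category accuracies up into an overall score. The helper functions and the accuracy numbers are made up for illustration; the table above remains the authoritative description of which categories feed into each average.

# Illustrative only: rolling hypothetical category accuracies up into an overall score.
def unweighted_average(scores):
    """Every category counts equally, regardless of how many entries it has."""
    return sum(scores) / len(scores)

def weighted_average(scores_and_counts):
    """Categories are weighted by their number of entries."""
    total = sum(count for _, count in scores_and_counts)
    return sum(score * count for score, count in scores_and_counts) / total

# Hypothetical per-category accuracies (not real leaderboard numbers).
non_live = unweighted_average([0.85, 0.80, 0.75])                  # e.g. AST, Exec, Relevance Detection
live = weighted_average([(0.70, 258), (0.65, 1053), (0.60, 882)])  # e.g. Live AST and Live Irrelevance
multi_turn = unweighted_average([0.45, 0.30, 0.35, 0.25])          # Base, Missing Function/Parameter, Long Context

overall = unweighted_average([non_live, live, multi_turn])
print(f"Overall score: {overall:.3f}")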
-
-

Data Curation Methodology

-

In this section, we detail our data curation methodology for the BFCL V3 β€’ Multi-Turn & Multi-Step dataset. The dataset curation - process consists of hand-curated data generation for four components of BFCL V3 β€’ Multi-Turn & Multi-Step: API codebase creation, graph +
+

Data Curation Methodology

+

In this section, we detail our data curation methodology for the BFCL V3 β€’ Multi-Turn & Multi-Step + dataset. The dataset curation + process consists of hand-curated data generation for four components of BFCL V3 β€’ Multi-Turn & + Multi-Step: API codebase creation, graph edge construction, task generation, and human-labeled ground truth multi-turn trajectories, as well as a comprehensive data validation process.

Dataset with human-in-the-loop pre-processing and post-processing

@@ -528,8 +854,10 @@

Dataset with human-in-the-loop pre-processing and post-processing

execution results based on initial configurations.

Our team believes that a synthetic dataset alone is not enough and that human labeling is essential. We hand-curate the APIs because humans can create more interconnected and densely useful functions for evaluation purposes. Even with this, we went through 11 rounds of data filtering, highlighting the importance and the challenges of function calling.

Responsive Image
1. API Codebase Creation

Primary Domain APIs:

@@ -556,19 +887,25 @@

1. API Codebase Creation

Cross-functional APIs:

All eight domains took inspiration from our experience with Open Functions data collection and public interest in popular agent application domains.

-

The four primary API domains are evenly distributed across the test cases in Base, and Augmented Multi-Turn. For example, there are 200 test entries in - Base category and 0-49 utilizes Gorilla File System, 50-99 utilizes Vehicle Control, 100-149 utilizes Trading Bots, and 150-199 utilizes Travel Booking. +

The four primary API domains are evenly distributed across the test cases in the Base and Augmented Multi-Turn categories. For example, of the 200 test entries in the Base category, entries 0-49 use the Gorilla File System, 50-99 use Vehicle Control, 100-149 use Trading Bots, and 150-199 use Travel Booking.


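As a concrete illustration of this layout, a hypothetical helper (not part of the benchmark code) that maps a Base-category entry index to its API domain could look like:

# Hypothetical helper: map a Base-category entry index (0-199) to its API domain,
# following the even 50-entry split described above.
def domain_for_base_entry(index: int) -> str:
    if not 0 <= index <= 199:
        raise ValueError("The Base category only has entries 0-199")
    domains = ["Gorilla File System", "Vehicle Control", "Trading Bots", "Travel Booking"]
    return domains[index // 50]

assert domain_for_base_entry(0) == "Gorilla File System"
assert domain_for_base_entry(125) == "Trading Bots"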
@@ -642,6 +979,7 @@

4. Human-Labeled Ground Truth Multi-Turn Trajectories

Validation Process

+

Responsive Image @@ -661,7 +999,8 @@

1. Question Validation (πŸ§‘β€πŸ’»)

  • Clarity and Specificity: Ambiguous questions are refined to provide more specific instructions.
  • Example: A question like β€œUpload the document” is improved to β€œUpload the - <NUMBER_OF_BYTES>.pdf document.”

    + <NUMBER_OF_BYTES>.pdf document.” +

  • Complete Information: The question and previous questions up to the current turn or through exploration within the environment must contain all the necessary details for the model to @@ -680,20 +1019,23 @@

    2. Human-Labeled Ground Truth Validation (πŸ§‘β€πŸ’»+ πŸ’» )

  • Executability: Ensuring the labeled functions can be correctly executed in sequence without errors.
  • Example: If the question asks for booking a flight, the ground truth should correctly call the - book_flight function without any internal errors.

    + book_flight function without any internal errors. +

  • Alignment with User Request: The execution path must be reasonable and consistent with the question's intent, whether implicit or explicit.
  • Example: If the question asks for a flight reservation without providing the flight's cost, the implicit request is to fetch the current ticket price by calling get_flight_cost before - book_flight.

    + book_flight. +

  • Brevity: The execution path should be logically concise under the premise of the previous two criteria.
  • Example: If the question asks for posting a tweet and mentioning another user, only the functions authenticate(username, password) and post_tweet(content, tags, mentions) should be called. The function mention(tweet_id, mentioned_usernames) is unnecessary since - post_tweet can handle user mentions.

    + post_tweet can handle user mentions. +

    3. Initial Configuration Validation ( πŸ’» )

    @@ -717,7 +1059,8 @@

    4. Function List Validation ( πŸ’» )

  • Completeness: All necessary functions must be present in the function list, allowing the model to make appropriate calls.
  • Example: If the task involves sending a tweet, the function list should include the - post_tweet function to ensure the model can complete the action.

    + post_tweet function to ensure the model can complete the action. +

    5. API Code Validation (πŸ§‘β€πŸ’»+ πŸ’» )

    @@ -740,7 +1083,8 @@

    5. API Code Validation (πŸ§‘β€πŸ’»+ πŸ’» )

  • Automated Format Checkers: Tools like mypy (a static type checker) and pydocstyle are used to enforce strict compliance with type hints and docstring formatting. - These automated tools check for:
  • + These automated tools check for: +
    -

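To illustrate the style these automated checkers enforce, here is a sketch of an API method with full type hints and a docstring. The function body is a made-up stub for illustration, not the benchmark's actual implementation.

def get_flight_cost(origin: str, destination: str, travel_class: str = "economy") -> float:
    """Return the ticket price for a flight between two airports.

    Args:
        origin: IATA code of the departure airport.
        destination: IATA code of the arrival airport.
        travel_class: Cabin class, one of "economy", "business", or "first".

    Returns:
        Ticket price in USD.
    """
    # Stub pricing table so the example is self-contained and passes a static type check.
    base_prices = {"economy": 400.0, "business": 1200.0, "first": 2500.0}
    return base_prices[travel_class]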
    Model Inference & Execution

    +

    Model Inference Process

    Initialization

    Each entry comes with its own initial_config, which is used to initialize the API backend state. For example, a file system backend might start with a set of pre-existing files, and a messaging API @@ -772,9 +1116,11 @@
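As a rough sketch of this mechanism, loading an entry's initial_config might look like the following; the schema, class body, and method name here are illustrative rather than the exact BFCL code.

# Illustrative sketch: seeding an API backend from an entry's initial_config.
initial_config = {
    "GorillaFileSystem": {"root": {"report.txt": "Q3 numbers", "notes.md": "draft"}},
}

class GorillaFileSystem:
    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def load_scenario(self, config: dict) -> None:
        # Copy the pre-existing files into the in-memory file system.
        self.files = dict(config.get("root", {}))

backend = GorillaFileSystem()
backend.load_scenario(initial_config["GorillaFileSystem"])
print(backend.files)  # {'report.txt': 'Q3 numbers', 'notes.md': 'draft'}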

    Step Execution
    the chat history so far.
  • Function Call Execution: If the model makes any valid function calls (i.e., decode_exec in the model handler returns a non-empty list), the function calls are executed - in the API backend in the order that model provides.
  • + in the API backend in the order that model provides. +
  • Updating Chat History: The model's response is then added to the chat history, along - with the execution results (if any). For prompting models, since they don't use the `tool` role tag, we will provide back the execution results under the `user` role.
  • + with the execution results (if any). For prompting models, since they don't use the `tool` role tag, we + will provide back the execution results under the `user` role.
    End of a Step

    After updating the chat history with the model's response and execution results, the current step ends. @@ -786,7 +1132,8 @@
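Putting the pieces above together, a single step can be sketched roughly as follows. decode_exec is the handler hook named above; the other names (generate, execute) are placeholders, and this is a simplification of the harness rather than its actual code.

# Simplified sketch of one inference step (not the actual BFCL harness).
def run_one_step(model_handler, backend, chat_history, is_fc_model: bool) -> bool:
    # 1. The model produces a response given the chat history so far.
    response = model_handler.generate(chat_history)

    # 2. Decode the response into executable function calls (possibly an empty list).
    calls = model_handler.decode_exec(response)  # e.g. ['cat(file_name="notes.md")']

    # 3. Execute the calls against the API backend, in the order the model provided.
    results = [backend.execute(call) for call in calls]

    # 4. Append the model's response and the execution results to the chat history.
    chat_history.append({"role": "assistant", "content": response})
    result_role = "tool" if is_fc_model else "user"  # prompting models receive results in a user message
    for result in results:
        chat_history.append({"role": result_role, "content": str(result)})

    # Report whether any valid function call was made in this step.
    return bool(calls)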

    Termination of a Turn

  • No Output Termination: If the model doesn't output any valid function calls in a step, we consider this the end of the current turn and move on to the next turn. This could occur if the model outputs chatting messages or if its response cannot be properly decoded into executable function calls - (the latter usually happens when the model is in prompting mode and is not following instructions). This method aligns with how we + (the latter usually happens when the model is in prompting mode and is not following instructions). This + method aligns with how we determine if the model makes any function call in the irrelevance category in the single-turn scenario, and we find it to work effectively.
  • Step Limit Force Termination: If the model takes more than 20 steps within a turn, the @@ -800,14 +1147,16 @@

    Termination of a Turn

    Note For Multi Turn Missing Function Category

In this category, one or more functions are held out from the function list provided to the model at the beginning; they will be provided in a later turn (never the first turn). For FC models, the added functions are simply appended to the tools list. For prompting models, since we supply all the tools in the system prompt at the beginning and it is not possible to modify the system prompt in the middle of the conversation, we provide the held-out function definitions in the content of a user message instead.
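A rough sketch of that hand-off is below; the message wording and tool format are illustrative, and the exact representation in the harness may differ.

import json

# Illustrative sketch: supplying held-out function definitions at a later turn.
def inject_held_out_functions(tools: list, chat_history: list, held_out_defs: list, is_fc_model: bool) -> None:
    if is_fc_model:
        # FC models: append the newly revealed definitions to the tools list.
        tools.extend(held_out_defs)
    else:
        # Prompting models: the system prompt cannot change mid-conversation, so the
        # definitions are surfaced in the content of a user message instead.
        chat_history.append({
            "role": "user",
            "content": "Additional functions are now available:\n" + json.dumps(held_out_defs, indent=2),
        })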

    Why We Avoid Certain Techniques (e.g. ReAct)

    -

    In BFCL V3 β€’ Multi-Turn & Multi-Step, we deliberately avoid using techniques like prompt engineering and ReAct, which combines +

    In BFCL V3 β€’ Multi-Turn & Multi-Step, we deliberately avoid using techniques like prompt engineering + and ReAct, which combines reasoning and acting through specific prompting methods. While ReAct and other techniques can improve models' function calling performance in certain cases, we chose not to use it throughout the BFCL series to evaluate base LLMs with the same standards to isolate the effects from using additional optimization @@ -815,26 +1164,51 @@

    Why We Avoid Certain Techniques (e.g. ReAct)

  • +

    Evaluation Metrics

    +

    BFCL V3 has different evaluation metrics for single-turn and multi-turn tasks.

    +

    Single-turn Evaluation Metrics

    +

    + Single-turn evaluation metrics (i.e., Abstract Syntax Tree (AST) Evaluation, Executable Function Evaluation, and Relevance Detection) are the same as those used in BFCL v1 and + BFCL v2. Please refer to the previous blog posts for more + details. +

    Multi-turn Evaluation Metrics

    -

    In BFCL V3 β€’ Multi-Turn & Multi-Step, we employed state-based evaluation and response-based evaluation to assess the model's performance the multi-turn categories.

    - At the end of every turn, we mark an entry as correct if it passes both checks in all turns. Note that force-terminated entries will be marked wrong, even if they pass the checks. - +

    For multi-turn tasks, two types of checks are used: state-based evaluation and + response-based evaluation. +

    + These two checks are run at the end of every turn. + An entry is marked as correct if it passes both checks in all turns. + Note that force-terminated entries will be marked wrong, even if they happen to pass the checks.
1. State-based evaluation focuses on comparing the backend system's state (excluding private attributes, i.e., the ones that start with _) after all function calls have been executed at the end of each turn of the conversation. We expect that, given a user request, there can be multiple ways to fulfill it, which we cannot exhaustively enumerate, but the end state, or end result, should be consistent with the ground truth labeling. State-based evaluation captures the correctness of model executions that modify the internal state via writes and deletes, e.g., creating a new file or deleting a stock from the watchlist.

    2. -

Response-based evaluation compares the model's execution path against the minimal viable execution paths labeled in the ground truth. A minimal viable execution path is the list of function calls that must be executed in order to produce the response the user requests. Response-based evaluation ensures that read-only requests can be properly evaluated, e.g., getting the stock price or getting the weather information.

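As a minimal sketch of what these two checks could look like in code (simplified and with hypothetical helper names; the real checker operates on the backend instances produced by the model's calls versus the ground-truth calls):

# Simplified sketch of the two multi-turn checks (not the official evaluator).
def _public_state(obj: object) -> dict:
    """Public instance attributes; private ones starting with '_' are ignored."""
    return {k: v for k, v in vars(obj).items() if not k.startswith("_")}

def state_check(model_backend: object, ground_truth_backend: object) -> bool:
    return _public_state(model_backend) == _public_state(ground_truth_backend)

def response_check(model_calls: list, minimal_viable_path: list) -> bool:
    """Subset match: every call in the minimal viable path must appear in the model's path."""
    return all(required in model_calls for required in minimal_viable_path)

# A turn passes only if both checks pass; an entry passes only if every turn passes.
def turn_passes(model_backend, truth_backend, model_calls, minimal_viable_path) -> bool:
    return state_check(model_backend, truth_backend) and response_check(model_calls, minimal_viable_path)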
    In the - following sections, we will discuss the advantages and limitations of state-based evaluation in multi-turn function calling and why we need a subset-matched response-based evaluation as well.

    + following sections, we will discuss the advantages and limitations of state-based evaluation in multi-turn + function calling and why we need a subset-matched response-based evaluation as well.

    Why State-based Evaluation

    -

    Minicking state offer a different perspective of real world performance evalution as autonomous agents can detour on its own discreet while achieving the tasks after all. Instead of only checking if each individual function output is correct, we +

Mimicking state offers a different perspective on real-world performance evaluation, since autonomous agents may take detours at their own discretion while still accomplishing the task. Instead of only checking whether each individual function output is correct, we compare the attributes of the system's state after every turn against the expected state. If the model successfully brings the system to the correct state at the end of each turn, it passes the evaluation.

    For example, if a model is tasked with a series of actions such as:

    @@ -846,17 +1220,29 @@

    Why State-based Evaluation

    In state-based evaluation, the system checks after each turn whether the file exists, whether the correct data was written, and whether the file is properly closed. If all the required state attributes are present and correct at each turn, the evaluation succeeds.

    - +

    Limitations of State-Based Evaluation

    - While state-based evaluation is a powerful tool for assessing multi-turn function calling models, it does have some limitations. For example, some functions don't have a direct impact on the system's state, such as get_zipcode_by_city or estimate_distance. We will not be able to tell if the model has actually invoked those functions or not, if relying solely on state-based evaluation. We want to make sure that the model is making the necessary function calls and reasoning through the task, instead of just memorizing or guessing the correct information; we want the model to call get_zipcode_by_city("Berkeley") to get the zip code for Berkeley is 94710, and then use that zip code to call get_weather_by_zipcode(94710) to get the weather information, instead of directly calling get_weather_by_zipcode(94710) and hope that it is the correct zip code for Berkeley (this would be hallucination!). - In such cases, response-based evaluation can be a good complement to state-based evaluation, as it can provide additional insights into the model's behavior and decision-making process. + While state-based evaluation is a powerful tool for assessing multi-turn function calling models, it does + have some limitations. For example, some functions don't have a direct impact on the system's state, such as + get_zipcode_by_city or estimate_distance. We will not be able to tell if the model + has actually invoked those functions or not, if relying solely on state-based evaluation. We want to make + sure that the model is making the necessary function calls and reasoning through the task, instead of just + memorizing or guessing the correct information; we want the model to call + get_zipcode_by_city("Berkeley") to get the zip code for Berkeley is 94710, and then use that + zip code to call get_weather_by_zipcode(94710) to get the weather information, instead of + directly calling get_weather_by_zipcode(94710) and hope that it is the correct zip code for + Berkeley (this would be hallucination!). + In such cases, response-based evaluation can be a good complement to state-based evaluation, as it can + provide additional insights into the model's behavior and decision-making process.

    Why Subset-Matched Response-based Evaluation

    -

    In earlier versions like BFCL V1 and V2, a pure response-based evaluation was used; the model response must match the ground truth in full. This approach evaluated the +

    In earlier versions like BFCL V1 and V2, a pure response-based evaluation was used; the model response must + match the ground truth in full. This approach evaluated the model based on the immediate function response, either by analyzing the return values or by checking the - Abstract Syntax Tree (AST) structure. However, it faces several limitations when it comes to multi-turn categories:

    + Abstract Syntax Tree (AST) structure. However, it faces several limitations when it comes to multi-turn + categories:

    -

    Result & Error Analysis

    +

    Result & Error Analysis