diff --git a/blogs/13_bfcl_v3_multi_turn.html b/blogs/13_bfcl_v3_multi_turn.html index 2c28526e5..d8cf7d2ab 100644 --- a/blogs/13_bfcl_v3_multi_turn.html +++ b/blogs/13_bfcl_v3_multi_turn.html @@ -99,6 +99,67 @@ .code-toggle:hover { text-decoration: underline; } + + .level-1 { + background-color: #E2E2E2; + /* Darkest gray for top level */ + font-weight: bold; + } + + .level-2 { + background-color: #E8E8E8; + /* Slightly lighter gray */ + } + + .level-3 { + background-color: #EFEFEF; + /* Medium gray */ + } + + .level-4 { + background-color: #F5F5F5; + /* Light gray */ + } + + .level-5 { + background-color: #FFFFFF; + /* Pure white */ + } + + .scoring-table { + width: 100%; + border-collapse: collapse; + table-layout: auto; + } + + .scoring-table th, + .scoring-table td { + word-wrap: break-word; + padding: 5px; + text-align: center; + } + + @media screen and (max-width: 768px) { + + .scoring-table th, + .scoring-table td { + padding: 1px; + } + } + + .note { + font-size: 0.8em; + color: #666; + } + + .na { + color: #999; + font-style: italic; + } + + .emoji { + display: block; + } @@ -119,7 +180,7 @@
- The Berkeley Function-Calling Leaderboard (BFCL) V3 takes a significant leap forward by - introducing a new multi-turn, and multi-step function calling (tool usage) category. - Only at BFCL V3 • Multi-Turn & Multi-Step, you will see a LLM stuck in a loop, listing the current directory, write a non-existing - file, and list the directory again... You will ask LLM to make a social media post. - LLM will force you to spell your username and password to login despite the fact that you are already - browsing other people's posts! This is only possible with multi-turn, - and multi-step function calling (tool usage). Note that BFCL V3 contains the Expert Curated (Non-live) dataset introduced in BFCL V1 and User Contributed (Live) dataset introduced in BFCL V2 and the multi-turn, and multi-step category introduced in BFCL V3. + The Berkeley Function-Calling Leaderboard (BFCL) V3 + takes a significant step forward by introducing a new category for multi-turn and multi-step function calls (tool + use). Only in BFCL V3, you might see a model loop over and over: listing a directory, trying to write to a + file that isn't there, and then listing again. Or it might demand your username and password even though + you've already logged in and are browsing other users' posts. Such scenarios are only possible when a model + can use multiple turns and steps in its function calls.
- Understanding these more advanced interactions builds on the foundation of single-turn single-step function calling, where models takes an user input prompt and selects one or more functions with appropriately filled parameters from a set of provided function options, without further interaction. If you're unfamiliar with single-turn single-step function calling and the evaluation metrics we used, check out our earlier blog on single-turn single-step function calling for a deeper dive. - - + Note that BFCL V3 still includes the Expert Curated (Non-live) dataset from BFCL V1 and + the User Contributed (Live) dataset from BFCL V2. On + top of that, it now tests how well models can handle back-and-forth (multi-turn) and step-by-step + (multi-step) interactions.
- BFCL V3 • Multi-Turn & Multi-Step is a critical advancement in evaluating how Large Language Models (LLMs) interact with diverse
- scenarios through invoking right functions.
- Multi-turn function calling allows models to engage in a back-and-forth interaction with users, making it
- possible for LLMs to navigate through
- the complex tasks by asking clarifying questions. In contrast to multi-turn (user t0, assistant t1,
- user t2, assistant t3, ..)
,
- multi-step is where the LLM can break the response down into multiple steps (user t0, assistant t1,
- assistant t2,..)
.
- This new paradigm mimics real-world agentic behaviors where AI assistants might have to plan execution
- paths, request and
- extract critical information, and handle sequential function invokes to complete a task.
+ If you're new to single-turn, single-step function calling, be sure to check out our earlier blog
+ post for more background. That post explains how a model should pick a single function and fill in its
+ parameters without asking follow-up questions.
- In addition, this is the first time BFCL performs API state verifications as the ground truth validation. In
- previous iterations, BFCL has been
- dominated by dissecting function parameter pairs using AST and matching them in a list of possible answers.
- In BFCL V3, we will not perform an exact match
- on the parameters but on the state. As long as the internal state of an API system(file system, travel
- booking system) stays intact, we mark them as correct.
+ With BFCL V3, we're looking at more complex tasks. Multi-turn function calls (user t0, assistant t1,
+ user t2, assistant t3, ..)
let models interact with the user back-and-forth, asking questions and
+ refining their approach. Multi-step calls (user t0, assistant t1,
+ assistant t2,..)
let a model break its response into smaller parts before giving a final answer.
+ This setup mimics real-world cases where an AI might need to plan, gather info, and chain several actions
+ together.
+
+ Another new twist in BFCL V3 is how we check the model's answers. Instead of dissecting function parameter + pairs using AST and matching them in a list of possible answers, we now verify the actual state of the API + system (like file systems or booking systems) after the model runs its functions. This gives us a more + realistic way to see if the model did the right thing.
Quick Links: -
- In this blog, we start off by describing the difference between multi-step and multi-turn function calling - and explaining why both concepts - are important in the real world. We will then present the key features of our benchmarking and findings when - evaluating against SOTA models. - Lastly, we will showcase the evaluation dataset construction process and highlight the importance of - a human annotated dataset. - + In this post, we'll explain the difference between multi-step and multi-turn function calling and why both + matter in real use cases. Then we'll highlight what sets this new benchmark apart and share our findings + after testing top models. Finally, we'll walk through how we built the evaluation dataset and why having + human-annotated data is so important.
In a single-turn interaction, the assistant can fulfill a user's request by making one function call. - The user's request is typically a straight forward instruction or question that is state-agnostic.
+ These requests are typically straightforward, self-contained, and state-agnostic (i.e., do not rely on + prior context). +Multi-step refers to an interaction where the assistant performs several internal function calls to - address a single user request. This can be interpreted as models proactively plan and gather information to fulfill a request.
++ Multi-step interactions require the assistant to execute multiple internal function calls to address a + single user request. This process reflects the assistant's ability to proactively plan and gather + information to deliver a comprehensive response. The user only interacts with the model once (at the + very beginning), and the model then interacts with the system back-and-forth to complete the task. +
Multi-turn interaction involves multiple exchanges between the user and the assistant. Each turn can - contain multiple steps. Multi-turn interaction requires models to collect from previous context to answer follow-up questions.
++ Multi-turn interactions involve an extended exchange between the user and the assistant, consisting of + multiple conversational turns. Each turn may involve several steps, and the assistant must retain and + utilize contextual information from previous exchanges to handle follow-up queries effectively. The user + will interact with the model multiple times throughout the process. +
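To make the distinction concrete, here is a minimal sketch of the two interaction patterns as chat-message lists. This is illustrative only, not the official BFCL message schema, and the function calls shown are hypothetical:

```python
# Minimal sketch (not the official BFCL schema): the same task, expressed as
# a multi-step trace vs. a multi-turn conversation.

multi_step = [
    {"role": "user",      "content": "Book me the cheapest flight to NYC next Friday."},
    {"role": "assistant", "content": "get_flight_cost(destination='NYC', date='next Friday')"},
    {"role": "assistant", "content": "book_flight(destination='NYC', date='next Friday')"},
]  # one user turn; the assistant chains several steps on its own

multi_turn = [
    {"role": "user",      "content": "Book me a flight to NYC."},
    {"role": "assistant", "content": "Sure - which date would you like to depart?"},
    {"role": "user",      "content": "Next Friday, please."},
    {"role": "assistant", "content": "book_flight(destination='NYC', date='next Friday')"},
]  # several user turns; each turn may itself contain multiple steps
```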
- Berkeley Function-Calling - Leaderboard (BFCL V3 • Multi-Turn & Multi-Step Function Calling) Data Composition + BFCL V3 • Multi-Turn & Multi-Step Function Calling Data Composition
Here we visualize the data statistics of the BFCL V3 Base Multi Turn dataset (the augmented categories follow similar distributions):
The full BFCL V3 leaderboard score is composed of the following components. The number in parentheses + indicates the number of entries in each category.
+
+ π Overall Score
+ Unweighted Average
+ (4751)
+ |
+ |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+ π Non-Live (Single-Turn)
+ Unweighted Average
+ (1700)
+ |
+
+ π‘ Live (Single-Turn)
+ Weighted Average
+ (2251)
+ |
+
+ π Multi-Turn
+ Unweighted Average
+ (800)
+ |
+ |||||||||||||||||||
+ π³ AST
+ Unweighted Average
+ (1150)
+ |
+
+ π Exec
+ Unweighted Average
+ (310)
+ |
+
+ π§ Relevance Detection
+ (240)
+ |
+
+
+
+ π³ AST
+ Weighted Average
+ (1351)
+ |
+
+ π§ Relevance Detection
+ Independent Categories
+ (900)
+ |
+
+
+
+ π Base Case
+ (200)
+ |
+
+ π Augmented Entries
+ Independent Categories
+ (600)
+ |
+ |||||||||||||||
+ π€ Simple
+ Unweighted Average
+ |
+ + π¦ Multiple + | ++ π Parallel + | ++ π¦π Parallel Multiple + | + + +
+ π€ Simple
+ Unweighted Average
+ |
+ + π¦ Multiple + | ++ π Parallel + | ++ π¦π Parallel Multiple + | + + ++ + | + + ++ π€ Simple + | ++ π¦ Multiple + | ++ π Parallel + | ++ π¦π Parallel Multiple + | + + ++ + | ++ + | + + ++ + | ++ + | ++ + | ++ + | +|||
+ π Python Simple AST
+ (400)
+ |
+
+ β Java Simple AST
+ (100)
+ |
+
+ π» JavaScript Simple AST
+ (50)
+ |
+
+
+
+ π Python Multiple AST
+ (200)
+ |
+
+ π Python Parallel AST
+ (200)
+ |
+
+ π Python Parallel Multiple AST
+ (200)
+ |
+
+
+
+ π Python Simple Exec
+ (100)
+ |
+
+ π REST API Exec
+ (70)
+ |
+
+
+
+ π Python Multiple Exec
+ (50)
+ |
+
+ π Python Parallel Exec
+ (50)
+ |
+
+ π Python Parallel Multiple Exec
+ (40)
+ |
+
+
+
+ β Irrelevance
+ (240)
+ |
+
+
+
+ π Python Live Simple AST
+ (258)
+ |
+
+ π Python Live Multiple AST
+ (1053)
+ |
+
+ π Python Live Parallel AST
+ (16)
+ |
+
+ π Python Live Parallel Multiple AST
+ (24)
+ |
+
+
+
+ β
Live Relevance
+ (18)
+ |
+
+ β Live Irrelevance
+ (882)
+ |
+
+
+
+ π Python Multi Turn Base
+ (200)
+ |
+
+ π Python Multi Turn Missing Function
+ (200)
+ |
+
+ ⚠️ Python Multi Turn Missing Parameter
+ (200)
+ |
+
+ π Python Multi Turn Long Context
+ (200)
+ |
+
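To make the aggregation concrete, the following sketch shows how a composite score of this shape could be computed. The per-category accuracies are hypothetical and the helper functions are ours, not the leaderboard's actual implementation; the entry counts mirror the table above.

```python
# Hedged sketch of the score composition shown in the table above.

def unweighted_average(scores: dict) -> float:
    """Plain mean over category accuracies, ignoring entry counts."""
    return sum(scores.values()) / len(scores)

def weighted_average(scores: dict, counts: dict) -> float:
    """Mean over categories, weighted by the number of entries in each."""
    total = sum(counts.values())
    return sum(scores[c] * counts[c] for c in scores) / total

# Hypothetical per-category accuracies for one model.
non_live = unweighted_average({"ast": 0.88, "exec": 0.85, "relevance": 0.80})
live = weighted_average(
    {"simple": 0.82, "multiple": 0.75, "parallel": 0.60,
     "parallel_multiple": 0.55, "relevance": 0.70},
    {"simple": 258, "multiple": 1053, "parallel": 16,
     "parallel_multiple": 24, "relevance": 900},
)
multi_turn = unweighted_average(
    {"base": 0.45, "miss_func": 0.30, "miss_param": 0.25, "long_context": 0.20}
)
overall = unweighted_average({"non_live": non_live, "live": live, "multi_turn": multi_turn})
print(f"Overall: {overall:.3f}")  # ~0.63 with the hypothetical numbers above
```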
In this section, we detail our data curation methodology for the BFCL V3 • Multi-Turn & Multi-Step dataset. The dataset curation
- process consists of hand-curated data generation for four components of BFCL V3 • Multi-Turn & Multi-Step: API codebase creation, graph
+
+
In this section, we detail our data curation methodology for the BFCL V3 • Multi-Turn & Multi-Step + dataset. The dataset curation + process consists of hand-curated data generation for four components of BFCL V3 • Multi-Turn & + Multi-Step: API codebase creation, graph edge construction, task generation, and human-labeled ground truth multi-turn trajectories, as well as a comprehensive data validation process.
- Our team believes that synthetic dataset by itself alone is not enough and human labeling is essential. We take care of the APIs created by humans as we believe human can - generate more connected and densely important functions useful for evaluation purposes. Even with this, we went through 11 rounds of data-filtering, highlighting the importance and challenges of function calling. + Our team believes that a synthetic dataset alone is not enough and that human labeling is essential. We + rely on human-authored APIs because we believe humans can + design more interconnected and densely useful functions for evaluation purposes. Even so, we + went through 11 rounds of data filtering, highlighting the importance and challenges of function calling.
1. API Codebase CreationPrimary Domain APIs:
startEngine(...)
, displayCarStatus(...)
,
- estimate_distance(...)
estimate_distance(...)
+
get_stock_info(...)
, place_order(...)
,
- get_watchlist(...)
get_watchlist(...)
+
book_flight(...)
,
- get_nearest_airport_by_city(...)
, purchase_insurance(...)
get_nearest_airport_by_city(...)
, purchase_insurance(...)
+
ls(...)
, cd(...)
, cat(...)
Cross-functional APIs:
send_message(...)
, delete_message(...)
,
- view_messages_received(...)
view_messages_received(...)
+
post_tweet(...)
, retweet(...)
,
- comment(...)
comment(...)
+
create_ticket(...)
, get_ticket(...)
,
- close_ticket(...)
close_ticket(...)
+
logarithm(...)
, mean(...)
,
- standard_deviation(...)
standard_deviation(...)
+
All eight domains took inspiration from our experience with Open Functions data collection and public interest in popular agent application domains.
-The four primary API domains are evenly distributed across the test cases in Base, and Augmented Multi-Turn. For example, there are 200 test entries in - Base category and 0-49 utilizes Gorilla File System, 50-99 utilizes Vehicle Control, 100-149 utilizes Trading Bots, and 150-199 utilizes Travel Booking. +
The four primary API domains are evenly distributed across the test cases in the Base and Augmented + Multi-Turn categories. For example, of the 200 test entries in + the Base category, entries 0-49 use the Gorilla File System, 50-99 use Vehicle Control, 100-149 use + Trading Bots, and 150-199 use Travel Booking.
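To give a flavor of what such an API backend looks like, here is a hedged, toy sketch of a stateful file-system class. The real Gorilla File System is far richer; only the method names echo the ones listed above.

```python
# Toy sketch of a stateful API backend in the spirit of the Gorilla File
# System; the actual implementation has many more methods and error cases.

class ToyFileSystem:
    def __init__(self, initial_files=None):
        self.files = dict(initial_files or {})   # public state: filename -> content
        self._call_log = []                      # private (underscore) attribute

    def ls(self):
        """List files in the current directory (read-only; state unchanged)."""
        self._call_log.append("ls")
        return sorted(self.files)

    def cat(self, filename):
        """Return a file's contents (read-only)."""
        self._call_log.append(f"cat {filename}")
        return self.files.get(filename, f"cat: {filename}: No such file")

    def write(self, filename, content):
        """Create or overwrite a file (mutates public state)."""
        self._call_log.append(f"write {filename}")
        self.files[filename] = content
        return f"wrote {len(content)} characters to {filename}"
```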
@@ -661,7 +999,8 @@
Example: A question like βUpload the documentβ is improved to βUpload the
- <NUMBER_OF_BYTES>
.pdf document.β
<NUMBER_OF_BYTES>
.pdf document.β
+
Example: If the question asks for booking a flight, the ground truth should correctly call the
- book_flight
function without any internal errors.
book_flight
function without any internal errors.
+
Example: If the question asks for a flight reservation without providing the flight's cost, the
implicit request is to fetch the current ticket price by calling get_flight_cost
before
- book_flight
.
book_flight
.
+
Example: If the question asks for posting a tweet and mentioning another user, only the functions
authenticate(username, password)
and post_tweet(content, tags, mentions)
should be called. The function mention(tweet_id, mentioned_usernames)
is unnecessary since
- post_tweet
can handle user mentions.
post_tweet
can handle user mentions.
+
Example: If the task involves sending a tweet, the function list should include the
- post_tweet
function to ensure the model can complete the action.
post_tweet
function to ensure the model can complete the action.
+
mypy
(a static type checker) and
pydocstyle
are used to enforce strict compliance with type hints and docstring formatting.
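As an illustration, a function written to satisfy those checks might look like the sketch below. The class, method, and its backing `_distance_table` are hypothetical, shown only to demonstrate the enforced type hints and docstring style.

```python
class TravelAPI:
    """Hypothetical backend fragment, shown for docstring/type-hint style."""

    _distance_table: dict = {("Berkeley", "San Francisco"): 22.5}

    def estimate_distance(self, city_a: str, city_b: str) -> dict:
        """Estimate the driving distance between two cities.

        Args:
            city_a: Name of the origin city.
            city_b: Name of the destination city.

        Returns:
            A dictionary with a single key, "distance", in kilometers.
        """
        return {"distance": self._distance_table[(city_a, city_b)]}
```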
- These automated tools check for: Each entry comes with its own initial_config
, which is used to initialize the API backend
state. For example, a file system backend might start with a set of pre-existing files, and a messaging API
@@ -772,9 +1116,11 @@
decode_exec
in the model handler returns a non-empty list), the function calls are executed
- in the API backend in the order that model provides. After updating the chat history with the model's response and execution results, the current step ends. @@ -786,7 +1132,8 @@
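Conceptually, the per-turn loop just described can be sketched as follows. This is hedged pseudocode of the prose above, not the actual BFCL handler; `handler`, `backend`, and the step cap are simplified stand-ins.

```python
# Hedged sketch of one conversation turn with multiple model steps.

def run_turn(handler, backend, chat_history, max_steps=20):
    for _ in range(max_steps):            # cap steps; runaway loops get force-terminated
        response = handler.inference(chat_history)
        calls = handler.decode_exec(response)          # e.g. ["ls()", "cat('log.txt')"]
        chat_history.append({"role": "assistant", "content": response})
        if not calls:                     # nothing decoded: the model is done with this turn
            break
        for call in calls:                # execute in the order the model provided
            result = backend.execute(call)             # may mutate backend state
            chat_history.append({"role": "tool", "content": str(result)})
    return chat_history
```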
irrelevance
category in the single-turn
scenario, and we find it to work effectively.
In this category, one or more functions are held out from the function list provided to the model at the
- beginning; they will be provided in a later turns (never the first turn). For FC models, the added functions will
+ beginning; they will be provided in a later turn (never the first turn). For FC models, the added functions
+ will
just be appended to the tools
list. But for prompting models, since we supplied all the tools
at the beginning in the system prompt and it's not possible to modify the system prompt in the middle of the
conversation, we will provide the held-out function definitions in the content of a user message instead.
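A hedged sketch of that injection logic is below; the function and the message wording are ours, for illustration only.

```python
# Hedged sketch: surfacing held-out functions at a later turn.

def inject_held_out(tools, held_out_defs, chat_history, is_fc_model):
    if is_fc_model:
        # FC models: append the new definitions to the `tools` list.
        tools.extend(held_out_defs)
    else:
        # Prompting models: the system prompt cannot change mid-conversation,
        # so deliver the definitions in the content of a user message instead.
        listing = "\n".join(str(d) for d in held_out_defs)
        chat_history.append({"role": "user",
                             "content": f"New functions are now available:\n{listing}"})
    return tools, chat_history
```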
In BFCL V3 • Multi-Turn & Multi-Step, we deliberately avoid using techniques like prompt engineering and ReAct, which combines +
In BFCL V3 • Multi-Turn & Multi-Step, we deliberately avoid using techniques like prompt engineering + and ReAct, which combines reasoning and acting through specific prompting methods. While ReAct and other techniques can improve models' function calling performance in certain cases, we chose not to use them throughout the BFCL series, so that we evaluate base LLMs under the same standards and isolate the effects of additional optimization @@ -815,26 +1164,51 @@
BFCL V3 has different evaluation metrics for single-turn and multi-turn tasks.
+ Single-turn evaluation metrics (i.e., Abstract Syntax Tree (AST) Evaluation, Executable Function Evaluation, and Relevance Detection) are the same as those used in BFCL V1 and + BFCL V2. Please refer to the previous blog posts for more + details.
In BFCL V3 • Multi-Turn & Multi-Step, we employed state-based evaluation and response-based evaluation to assess the model's performance in the multi-turn categories.
- At the end of every turn, we mark an entry as correct if it passes both checks in all turns. Note that force-terminated entries will be marked wrong, even if they pass the checks. - +For multi-turn tasks, two types of checks are used: state-based evaluation and + response-based evaluation. +
+ These two checks are run at the end of every turn. + An entry is marked as correct if it passes both checks in all turns. + Note that force-terminated entries will be marked wrong, even if they happen to pass the checks.
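As a hedged sketch, the overall pass criterion can be expressed like this (the field names are illustrative, not the evaluator's actual data structure):

```python
# Hedged sketch: an entry passes only if every turn passes both checks
# and the model was never force-terminated.

def entry_correct(turn_results, force_terminated):
    return (not force_terminated) and all(
        turn["state_ok"] and turn["response_ok"] for turn in turn_results
    )
```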
- State-based evaluation focuses on comparing the backend system's state (excluding the private attributes, i.e., the ones that start with _
) after all function calls are executed at the end of each turn of the conversation. We expect that given a user request, there can be multiple ways to fulfill the demand, which we are not able to measure, but the end state, or end result, should be consistent with ground truth labelling. The state-based evaluation capture the correctness of model executions that modify the internal state via write and delete e.g. create a new file, delete a stock from watchlist.
_
) after all function calls are executed
+ at the end of each turn of the conversation. We expect that, given a user request, there can be multiple
+ ways to fulfill it; we cannot enumerate every valid path, but the end state, or end result, should
+ be consistent with the ground-truth labeling. State-based evaluation captures the
+ correctness of model executions that modify the internal state via write and delete operations,
+ e.g., creating a new file or deleting a stock from the watchlist.
+
Response-based evaluation compares the model's execution path against the minimial viable execution result paths as labeled in ground truth. The minimial viable execution result paths refer to a list of function calls that must be executed in order to produce desired response as user requests. Having response-based evaluation ensure read only request can be properly evaluated e.g. get the stock price, get the weather information. +
Response-based evaluation compares the model's execution path against the minimal + viable execution result paths as labeled in the ground truth. The minimal viable execution result paths + refer to the list of function calls that must be executed in order to produce the response the user + requests. Response-based evaluation ensures that read-only requests + can be properly evaluated, e.g., getting the stock price or the weather information.
In the - following sections, we will discuss the advantages and limitations of state-based evaluation in multi-turn function calling and why we need a subset-matched response-based evaluation as well.
+ following sections, we will discuss the advantages and limitations of state-based evaluation in multi-turn + function calling and why we need a subset-matched response-based evaluation as well.Minicking state offer a different perspective of real world performance evalution as autonomous agents can detour on its own discreet while achieving the tasks after all. Instead of only checking if each individual function output is correct, we +
Mimicking state offers a different perspective on real-world performance evaluation, since autonomous agents can + take detours at their own discretion while still completing the task. Instead of only checking whether each individual + function output is correct, we
For example, if a model is tasked with a series of actions such as:
@@ -846,17 +1220,29 @@In state-based evaluation, the system checks after each turn whether the file exists, whether the correct data was written, and whether the file is properly closed. If all the required state attributes are present and correct at each turn, the evaluation succeeds.
- +
- While state-based evaluation is a powerful tool for assessing multi-turn function calling models, it does have some limitations. For example, some functions don't have a direct impact on the system's state, such as get_zipcode_by_city
or estimate_distance
. We will not be able to tell if the model has actually invoked those functions or not, if relying solely on state-based evaluation. We want to make sure that the model is making the necessary function calls and reasoning through the task, instead of just memorizing or guessing the correct information; we want the model to call get_zipcode_by_city("Berkeley")
to get the zip code for Berkeley is 94710, and then use that zip code to call get_weather_by_zipcode(94710)
to get the weather information, instead of directly calling get_weather_by_zipcode(94710)
and hope that it is the correct zip code for Berkeley (this would be hallucination!).
- In such cases, response-based evaluation can be a good complement to state-based evaluation, as it can provide additional insights into the model's behavior and decision-making process.
+ While state-based evaluation is a powerful tool for assessing multi-turn function calling models, it does
+ have some limitations. For example, some functions don't have a direct impact on the system's state, such as
+ get_zipcode_by_city
or estimate_distance
. We will not be able to tell if the model
+ has actually invoked those functions when relying solely on state-based evaluation. We want to make
+ sure that the model is making the necessary function calls and reasoning through the task, instead of just
+ memorizing or guessing the correct information; we want the model to call
+ get_zipcode_by_city("Berkeley")
to learn that the zip code for Berkeley is 94710, and then use that
+ zip code to call get_weather_by_zipcode(94710)
to get the weather information, instead of
+ directly calling get_weather_by_zipcode(94710)
and hoping that it is the correct zip code for
+ Berkeley (this would be hallucination!).
+ In such cases, response-based evaluation can be a good complement to state-based evaluation, as it can
+ provide additional insights into the model's behavior and decision-making process.
In earlier versions like BFCL V1 and V2, a pure response-based evaluation was used; the model response must match the ground truth in full. This approach evaluated the +
In earlier versions like BFCL V1 and V2, a pure response-based evaluation was used; the model response must + match the ground truth in full. This approach evaluated the model based on the immediate function response, either by analyzing the return values or by checking the - Abstract Syntax Tree (AST) structure. However, it faces several limitations when it comes to multi-turn categories:
+ Abstract Syntax Tree (AST) structure. However, it faces several limitations when it comes to multi-turn + categories:The question asks about purchasing Nvidia stock, and in order to do this, relevant stock information like
+
+
In the above example, the question asks about purchasing Nvidia stock, and in order to do this, relevant stock information like
its current price must be retrieved. In the ground truth, the get_stock_info
function is
invoked, with the stock symbol value (provided in the question) passed in. However, the model does not
know that the value in the question is indeed the stock symbol, so it tries to first retrieve the symbol
@@ -892,17 +1278,22 @@
This example illustrates how evaluating based solely on the response is not ideal. Sometimes, the model - may not have enough information to determine the minimial viable execution result paths, and taking extra steps should not be - penalized. Thus, we introduce a subset-matched response-based evaluation to address these limitations. The ground truth is the the minimial viable execution result paths labled with full context of the existing states. The model result is considered correct if it contains the ground truth as a subset, even if it contains additional function calls or takes a different trajectory.
+ may not have enough information to determine the minimal viable execution result paths, and taking extra + steps should not be + penalized. Thus, we introduce a subset-matched response-based evaluation to address these limitations. The + ground truth is the minimal viable execution result paths labeled with full context of the existing + states. The model result is considered correct if it contains the ground truth as a subset, even if it + contains additional function calls or takes a different trajectory.
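In code, the subset check can be sketched roughly like this (hedged; the real evaluator also normalizes and parses the calls before matching):

```python
# Hedged sketch of subset-matched response-based evaluation for a single turn.

def response_subset_match(model_calls, ground_truth_calls):
    """Pass if every minimal viable ground-truth call appears somewhere in the
    model's execution path; extra exploratory calls are not penalized."""
    return all(call in model_calls for call in ground_truth_calls)

# Extra lookups are fine; missing required calls are not.
truth = ["get_stock_info(symbol='NVDA')", "place_order(symbol='NVDA', amount=10)"]
model = ["get_symbol_by_name(name='Nvidia')"] + truth
assert response_subset_match(model, truth)
```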
ℹ️ Context: Fuel tank has 5 gallons of gas initially. The max capacity is 50 gallons.
ℹ️ Context: Fuel tank has 5 gallons of gas initially. The max capacity is 50 gallons. +
🧑‍💻 User Prompt: Fill the fuel tank until we are able to reach Rivermist. Oil costs money so I just need to reach there, I don't need a full tank.
🤖 Model Text Response: It seems that the fuel tank is already full or nearly full, + and I can't add more fuel beyond its capacity. Would you like me to display the current fuel status?
@@ -964,7 +1356,8 @@βΉοΈ Context: Twitter API has initially been authenticated.
-π§βπ» User Prompt: I would appreciate it if you could share the sorted result on social +
π§βπ» User Prompt: I would appreciate it if you could share the sorted result on + social media, ensuring to tag 'current tech trend' and mention Julia, our insightful team.
@@ -982,7 +1375,8 @@π€ Model Text Response: It seems that I'm unable to authenticate with Twitter at the moment. +
🤖 Model Text Response: It seems that I'm unable to authenticate with Twitter at the + moment. Please ensure your credentials are correct or try again later.
diff --git a/leaderboard.html b/leaderboard.html index e29c8b49d..b78e26fb7 100644 --- a/leaderboard.html +++ b/leaderboard.html @@ -113,7 +113,7 @@ - FC = native support for function/tool calling. + FC = native support for function/tool calling. + Prompt = workaround for function calling, using the model's normal text generation capability.
Cost is calculated as an estimate of the cost per 1000 function calls, in USD. Latency is measured in seconds. - For Open-Source Models, the cost and latency are calculated when serving with vLLM using 8 V100 GPUs. The formula can be found in - the blog.
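As a rough illustration only, a per-1000-calls cost estimate could be computed as below; the actual formula is the one in the blog linked above, and the token counts and prices here are hypothetical parameters.

```python
# Hedged sketch of a cost estimate per 1000 function calls (USD).

def cost_per_1000_calls(mean_input_tokens, mean_output_tokens,
                        usd_per_1m_input, usd_per_1m_output):
    per_call = (mean_input_tokens * usd_per_1m_input
                + mean_output_tokens * usd_per_1m_output) / 1_000_000
    return 1000 * per_call

# Example with hypothetical values: 1200 input / 80 output tokens per call.
print(cost_per_1000_calls(1200, 80, usd_per_1m_input=2.50, usd_per_1m_output=10.00))
```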
- AST Summary is the unweighted average of the four test categories under AST Evaluation. - Exec Summary is the unweighted average of the four test categories under Exec Evaluation. Overall Accuracy is the unweighted average of all the sub-categories. + For details on score composition, please refer to our blog.
Click on column header to sort. If you would like to add @@ -157,7 +155,7 @@
- Models were evaluated using commit d7e52e5. + Models are evaluated using commit d7e52e5. All the model responses we obtained are available here. To reproduce the results, please check out our codebase at this checkpoint.