Executable Code Actions Elicit Better LLM Agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
Proceedings of the 41st International Conference on Machine Learning
Department of Computer Science, University of Illinois Urbana-Champaign

Table of Contents

Overall Summary

Study Background and Main Findings

This paper addresses limitations in current Large Language Model (LLM) agents, which typically interact with environments or tools using predefined text or structured JSON formats. These formats often restrict the complexity and flexibility of actions an agent can perform, hindering their ability to solve complex, multi-step problems. The research proposes CodeAct, a novel framework where LLM agents generate executable Python code as their actions. This approach aims to create a unified and more powerful action space by leveraging Python's inherent expressiveness, control flow structures (like loops and conditionals), data handling capabilities, and access to a vast ecosystem of existing software libraries.
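To make the framework concrete, the following is a minimal sketch of how a CodeAct-style execution loop could work: the agent emits Python, the environment runs it, and the output (or a traceback) becomes the next observation. The `execute_action` helper and the `lookup_phone_price` tool are illustrative assumptions, not the paper's implementation.

```python
import io
import traceback
from contextlib import redirect_stdout

def execute_action(code: str, tools: dict) -> str:
    """Run an LLM-generated Python action; return its stdout or a traceback."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, dict(tools))  # tools are exposed as ordinary functions
        return buffer.getvalue() or "(no output)"
    except Exception:
        # The traceback itself is returned as the observation,
        # which is what enables self-debugging from execution feedback.
        return traceback.format_exc()

# Hypothetical tool; a real deployment would sandbox execution.
tools = {"lookup_phone_price": lambda country: {"USA": 699, "India": 649}[country]}

action = """
prices = {c: lookup_phone_price(c) for c in ["USA", "India"]}
print(min(prices, key=prices.get))
"""
print(execute_action(action, tools))  # -> India
```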

The effectiveness of CodeAct was evaluated by comparing it against text and JSON action formats across 17 different LLMs on two benchmarks: API-Bank (for simple, single-tool actions) and a newly curated benchmark, M3ToolEval (designed for complex tasks requiring multiple tools and interaction turns). The results demonstrated that CodeAct performs comparably or better on simple tasks and significantly outperforms alternatives on complex tasks, achieving up to a 20% absolute increase in success rate and requiring fewer interaction turns. This suggests CodeAct effectively utilizes LLMs' familiarity with code (from pre-training) and excels where complex logic or tool composition is needed.

Recognizing a performance gap between open-source and proprietary LLMs, the researchers developed CodeActInstruct, a dataset containing ~7,000 multi-turn interaction examples using CodeAct, specifically filtered to include instances of self-debugging and improvement based on feedback (like error messages). They used this dataset, combined with general conversation data, to fine-tune open-source models (Llama-2 7B, Mistral 7B), creating CodeActAgent. Evaluation showed that CodeActAgent significantly improved performance on agent tasks compared to baseline open-source models, demonstrated generalization to text-based actions, and maintained strong performance on general LLM benchmarks.

The main conclusion is that using executable Python code (CodeAct) offers a substantial advantage over text/JSON for LLM agent actions, particularly in complex scenarios. The CodeAct framework, combined with targeted instruction tuning using datasets like CodeActInstruct, enables the development of more capable and autonomous agents that can leverage existing software ecosystems and even debug their own actions. This work provides both a conceptual framework and practical resources (datasets, models) for advancing LLM agent capabilities.

Research Impact and Future Directions

This research presents a compelling argument for shifting the paradigm of Large Language Model (LLM) agent actions from constrained formats like JSON or text towards executable Python code, encapsulated in the CodeAct framework. The core strength lies in leveraging the inherent flexibility, control flow (e.g., loops, conditionals), data manipulation capabilities, and vast library ecosystem of a mature programming language. Empirical results, particularly the up to 20% higher success rate on complex multi-tool tasks compared to traditional methods, provide substantial evidence for CodeAct's potential. The development of the M3ToolEval benchmark specifically addresses a gap in evaluating complex agent interactions, strengthening the validation.

The creation of the CodeActInstruct dataset and the subsequent fine-tuning of CodeActAgent models represent significant practical contributions, particularly for advancing open-source LLM agent capabilities. The demonstration that fine-tuning with this specialized data, mixed with general conversation data, improves agent performance without substantially degrading general abilities is a valuable finding for practical LLM development. The agent's ability to leverage existing Python packages and perform self-debugging based on execution errors points towards more autonomous and capable AI systems.

However, some limitations warrant consideration. While CodeAct improves performance, a significant gap persists between open-source and leading closed-source models, suggesting that the action format alone isn't a panacea for underlying model capability differences. The effectiveness of self-debugging might vary depending on error complexity and the base model's reasoning ability. Furthermore, while the benchmarks used are valuable, generalizing performance to the full spectrum of real-world tasks requires ongoing validation. The reliance on LLMs to generate correct and safe code also introduces potential security considerations not fully explored within this scope. Despite these points, CodeAct offers a promising and demonstrably effective direction for building more powerful and flexible LLM agents.

Critical Analysis and Recommendations

Clear Problem Definition (written-content)
The abstract clearly defines the problem of limited action scope and flexibility in current LLM agents using JSON/text formats. + This establishes a strong rationale and context for the proposed CodeAct solution, immediately grounding the reader in the paper's objectives.
Section: Abstract
Highlights Key Performance Result (written-content)
The abstract highlights CodeAct's superior performance with a specific metric (up to 20% higher success rate). + Quantifying the key result upfront effectively conveys the significance and potential impact of the proposed framework.
Section: Abstract
Specify Novelty of Curated Benchmark (written-content)
The abstract mentions a newly curated benchmark but doesn't specify its unique contribution (e.g., focus on complex multi-tool tasks). + Clarifying the benchmark's novelty would strengthen the abstract by highlighting a key methodological contribution and providing better context for the performance claims.
Section: Abstract
Clear Context, Problem Definition, and Solution Presentation (written-content)
The introduction clearly defines the problem with text/JSON actions and systematically presents CodeAct's advantages (e.g., dynamic interaction, library access, control flow). + This structured presentation effectively motivates the research and helps readers grasp the core value proposition of the CodeAct framework.
Section: Introduction
Effective Use of Figure Reference for Conceptual Clarity (written-content)
The introduction effectively uses Figure 1 to visually contrast CodeAct with text/JSON and preview performance gains. + This visual aid reinforces the conceptual differences and quantitative advantages, making the core arguments more accessible and compelling.
Section: Introduction
Strong Empirical Validation Strategy (written-content)
The paper provides strong empirical validation using two distinct experiments: API-Bank for atomic actions (testing familiarity) and the novel M3ToolEval benchmark for complex multi-tool actions (testing control/data flow). + This well-designed strategy effectively isolates and demonstrates the different hypothesized benefits of CodeAct, lending credibility to the claims.
Section: CodeAct Makes LLMs Better Agents
Novel Benchmark (M3ToolEval) for Complex Tool Use (written-content)
The introduction and evaluation using the M3ToolEval benchmark addresses a gap by focusing on complex multi-tool composition, unlike many existing benchmarks. + This novel benchmark allows for a more rigorous evaluation of CodeAct's capabilities in scenarios where its advantages are expected to be most pronounced.
Section: CodeAct Makes LLMs Better Agents
Quantitative Evidence Supporting CodeAct Superiority (written-content)
Results (Tables 2 & 3) show CodeAct outperforms JSON/Text, achieving up to 20.7% higher success rates and requiring fewer turns on the complex M3ToolEval benchmark for most LLMs. + This quantitative evidence strongly supports the central claim that CodeAct enhances agent performance, particularly for complex tasks.
Section: CodeAct Makes LLMs Better Agents
Demonstrates Practical Benefits (Libraries, Self-Debugging) (written-content)
Section 2.4 demonstrates CodeAct's ability to leverage existing software libraries (e.g., Pandas, Scikit-Learn) and facilitate self-debugging via error messages (Figure 3). + This showcases practical benefits beyond benchmark scores, highlighting CodeAct's potential for building more autonomous and versatile agents capable of complex workflows.
Section: CodeAct Makes LLMs Better Agents
Elaborate on Open/Closed-Source Gap and Link to Motivation (written-content)
The significant performance gap observed between open- and closed-source models on M3ToolEval using CodeAct is noted, but the potential reasons and link to the motivation for CodeActInstruct are not fully elaborated in the main text. + Expanding on this gap and explicitly connecting it to the need for fine-tuning (Section 3) would create a stronger narrative bridge and better justify the subsequent work.
Section: CodeAct Makes LLMs Better Agents
Detailed and Systematic Dataset Construction (CodeActInstruct) (written-content)
The methodology for constructing the CodeActInstruct dataset (7k trajectories) is detailed, including source selection, repurposing, trajectory generation, and filtering for self-improvement patterns. + This systematic approach enhances the dataset's quality and relevance for training agents capable of learning from interaction and self-correction.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Comprehensive Evaluation of Fine-Tuned Agent (CodeActAgent) (written-content)
The evaluation of CodeActAgent is comprehensive, assessing performance on CodeAct tasks (in/out-of-domain), generalization to text actions, and standard LLM benchmarks. + This holistic evaluation provides a robust assessment of the fine-tuned models' capabilities and the trade-offs involved.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Successful Integration of Diverse Training Data (written-content)
The successful integration of CodeActInstruct with general conversation data improves agent performance without significantly harming general capabilities (Table 5, Table A.8). + This demonstrates a practical and effective fine-tuning strategy for developing specialized agents that retain broad utility.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Clear Contextualization and Problem Framing (written-content)
The Related Work section effectively situates CodeAct within the standard LLM agent architecture and clearly frames the problem as standardizing the action space. + This provides clear context and focus for the paper's contributions within the broader field.
Section: Related Work
Effective Differentiation from Related Concepts (written-content)
The section differentiates CodeAct from general code generation and concurrent work (TaskWeaver) by referencing detailed comparisons in appendices. + This appropriately acknowledges related research while maintaining focus in the main text and clarifying the unique aspects of CodeAct.
Section: Related Work
Concise Summary of Contributions and Core Advantage (written-content)
The conclusion concisely summarizes the main contributions (CodeAct framework, CodeActInstruct dataset, CodeActAgent model) and reiterates the core advantage over text/JSON actions. + This provides a clear and effective wrap-up of the paper's key takeaways.
Section: Conclusions
Highlights Key Agent Capabilities (written-content)
The conclusion highlights key capabilities of CodeActAgent, such as Python integration, complex task execution, library use, and self-debugging. + This effectively emphasizes the practical potential and advanced features enabled by the proposed approach.
Section: Conclusions

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Comparison between CodeAct and Text / JSON as action. (top)...
Full Caption

Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).

Figure/Table Image (Page 2)
Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).
First Reference in Text
However, both methods typically suffer from constrained scope of action spaces (actions are usually tailored for specific tasks) and restricted flexibility (e.g., tool uses in Fig. 1 top left).
Description
  • Illustrative Comparison of Action Methods: This part of the figure presents a side-by-side comparison of how a Large Language Model (LLM) agent tackles a specific task: finding the most cost-effective country (USA, Japan, Germany, India) to buy a smartphone ('CodeAct 1'). LLM agents are AI systems designed to understand instructions and perform actions, like using software tools (APIs). The left side shows the agent using traditional Text or JSON formats to call available tools (like 'lookup_rates', 'lookup_phone_price') one by one. Each call requires a separate interaction with the 'Environment'. The right side shows the same task performed using 'CodeAct', where the agent generates a single block of Python code. This code uses loops ('for country in countries:') to iterate through the countries, calls the necessary tools as functions within the code, stores results in variables ('final_prices'), and uses built-in Python functions ('min') to find the minimum price, achieving the result in fewer overall interactions.
  • Efficiency Claim (Fewer Actions): The example highlights that the CodeAct method can achieve the task objective with fewer interactions compared to the Text/JSON method, which requires multiple back-and-forth steps for each country being checked.
  • Code Features Demonstrated (Control Flow, Data Flow, Library Use): The CodeAct example demonstrates the use of programming concepts like variables to store intermediate results (e.g., 'final_prices'), loops ('for') for repetitive operations across different countries, and leveraging existing Python library functions ('min'), contrasting with the single-tool-call limitation often seen in Text/JSON approaches. A hedged reconstruction of this action appears after this list.
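To ground the description above, here is a hedged, runnable reconstruction of the CodeAct action from the figure; the tool signatures and the stub return values are assumptions for illustration.

```python
# Illustrative reconstruction of the Figure 1 (top right) CodeAct action.
# The tool signatures and the stub values below are assumptions.

def lookup_rates(country: str):
    """Stub: returns (exchange_rate_to_USD, sales_tax_rate)."""
    return {"USA": (1.0, 0.08), "Japan": (0.0067, 0.10),
            "Germany": (1.07, 0.19), "India": (0.012, 0.18)}[country]

def lookup_phone_price(model: str, country: str) -> float:
    """Stub: returns the local list price of `model` in `country`."""
    return {"USA": 699, "Japan": 98000, "Germany": 659, "India": 55000}[country]

countries = ["USA", "Japan", "Germany", "India"]
final_prices = {}
for country in countries:                        # control flow: loop over candidates
    rate, tax = lookup_rates(country)
    price = lookup_phone_price("CodeAct 1", country)
    final_prices[country] = price * rate * (1 + tax)   # data flow via variables

best = min(final_prices, key=final_prices.get)   # re-use of Python's built-in min
print(f"Cheapest option: {best} (${final_prices[best]:.2f})")
```

Packing the whole comparison into one action is what lets CodeAct return a result after a single environment round-trip, versus one tool call per turn for Text/JSON.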
Scientific Validity
  • Representativeness of Example: The example serves as a conceptual illustration. Its representativeness of typical or complex scenarios where CodeAct's advantages become significant is not established by this single instance alone. The chosen task might be specifically selected to favor the CodeAct approach.
  • Accuracy of Depicted Mechanisms: The illustration accurately depicts how Python code can encapsulate multiple tool calls, control flow, and data manipulation, contrasting with the typical one-call-per-interaction pattern of simple Text/JSON tool use.
  • Support for Efficiency Claim: The claim 'Fewer Actions Required!' is visually supported by the reduced number of agent interaction blocks shown on the right compared to the left, assuming the omitted steps on the left are numerous.
Communication
  • Visual layout and annotations: The side-by-side layout effectively contrasts the Text/JSON approach with the CodeAct approach for the illustrative task. The annotations highlighting 'Fewer Actions Required!' and specific code advantages (e.g., control flow, library re-use) aid comprehension.
  • Clarity of interaction flow: The step-by-step interaction flow for both methods is generally clear, showing the action taken by the agent and the environment's response.
  • Conceptual clarity: While illustrative, the example clearly conveys the core concept of using executable code versus structured text/JSON for agent actions.
  • Omission of steps: The omission of intermediate steps for brevity (indicated by '[...]') is necessary but slightly hinders a full understanding of the interaction count difference without referring to the text.

CodeAct Makes LLMs Better Agents

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: General agent multi-turn interaction framework that describes the...
Full Caption

Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).

Figure/Table Image (Page 4)
Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).
First Reference in Text
CodeActInstruct is motivated by a general agent framework consisting of agent, user, and environments (Fig. 2) and focuses on agent-environment interactions with the computer (information seeking, software package use, external memory) and the physical world (robot planning).
Description
  • Comparison Setup: This part of the figure displays quantitative results comparing three different 'action modes' – ways for Large Language Models (LLMs) to perform actions – namely CodeAct (using Python code), JSON (using a structured data format), and Text (using plain text commands). The comparison is performed on a benchmark dataset called M³ToolEval, which presumably involves tasks requiring tool use.
  • Metrics and Models Compared: Two metrics are shown in separate bar charts: 'Success Rate (%)' (higher is better), indicating how often the LLM successfully completed the task, and 'Average Number of Interaction Turns' (lower is better), showing how many steps were needed on average. Results are presented for several different LLMs, including various versions of GPT (e.g., gpt-4-1106-preview, gpt-3.5-turbo-0613), Claude-2, Gemini-pro, text-davinci-003, and Llama-2-70b-chat-hf.
  • Success Rate Results Trend: For most models shown, particularly the more capable ones like gpt-4 versions, CodeAct (represented by the darkest bar) generally achieves the highest success rate compared to JSON and Text actions. For instance, gpt-4-1106-preview shows a success rate noticeably above 70% with CodeAct, while its success rate with JSON and Text is closer to 50%.
  • Interaction Turns Results Trend: Correspondingly, for the higher-performing models, CodeAct also tends to require fewer interaction turns. For gpt-4-1106-preview, the average turns are below 6 for CodeAct, whereas they are closer to 8 for JSON and Text.
  • Trend with Model Capability: The charts visually suggest that the performance gap between CodeAct and the other methods widens for more powerful LLMs (those listed higher on the y-axis), implying CodeAct's benefits might be more pronounced with more capable base models.
Scientific Validity
  • Benchmark Dependency: The figure presents results from experiments on the M³ToolEval benchmark. The validity depends on the quality, complexity, and representativeness of this benchmark, which is described as newly curated in the text (§2.3). Without detailed information on the benchmark tasks, it's difficult to fully assess if the observed performance differences generalize.
  • Model Coverage: The comparison across 17 LLMs (though only 8 are shown clearly in the snippet) provides reasonable breadth, covering both open-source and proprietary models.
  • Appropriateness of Metrics: The metrics (Success Rate, Average Turns) are standard and appropriate for evaluating agent task performance and efficiency.
  • Consistency with Claims: The figure visually supports the paper's claim that CodeAct outperforms alternatives, particularly for more capable LLMs, on this specific benchmark. The trend of increasing advantage with model scale appears consistent across the presented models.
  • Potential Confounding Factors: Potential confounding factors, such as differences in prompt engineering for each action mode or the specific implementation details, are not detailed in the figure caption itself but are crucial for the validity of the comparison.
Communication
  • Chart type clarity: The horizontal bar charts are a standard and clear way to compare success rates and interaction turns across different models and action modes.
  • Use of color-coding: Color-coding the bars based on the action mode (Code, JSON, Text) allows for easy visual comparison of the different approaches for each LLM.
  • Axis and model labeling: Labeling the axes ('Success Rate (%)', 'Average Number of Interaction Turns') and listing the specific LLM models clearly identifies the metrics and subjects of comparison.
  • Separation of metrics: The separation into two charts (Success Rate and Interaction Turns) allows for focused comparison on each metric.
  • Axis scaling: The range of the x-axis for Success Rate (0-70%) might visually exaggerate differences for lower-performing models. Similarly, the Interaction Turns axis (5-10) focuses on a specific range.
  • Interpretation dependency: While the figure presents the data, interpreting the significance of the differences (e.g., whether a 5% increase in success rate is statistically significant) requires referring to the main text or statistical analysis not present in the figure itself.
Figure 3: Example multi-turn interaction with Python packages using...
Full Caption

Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.

Figure/Table Image (Page 6)
Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.
First Reference in Text
As shown in Fig. 3, CodeActAgent, designed for seamless integration with Python, can carry out sophisticated tasks (e.g., model training, data visualization) using existing Python packages.
Description
  • Interaction Overview: This figure demonstrates a conversational interaction between a human user and an AI agent called CodeActAgent (powered by the Mistral-7b language model). The goal is to perform a data science task involving an auto MPG (Miles Per Gallon) dataset.
  • Initial Task Request: The interaction spans multiple turns. The user initially asks the agent to download the dataset, preprocess it (check for missing values), split it into training and testing sets, and train a regression model (a statistical technique to predict MPG based on other car features).
  • Use of Python Libraries: The agent uses specific Python libraries: `pandas` for data loading and manipulation (reading a CSV file from a URL), `numpy` for numerical operations, and `scikit-learn` for machine learning tasks like splitting data (`train_test_split`) and implementing the `LinearRegression` model.
  • Self-Debugging Example (ValueError): The figure highlights the agent's self-debugging capability. Initially, the agent's code fails due to unexpected characters ('?') in the data, causing a `ValueError`. The environment provides a traceback (an error report). The agent analyzes this feedback and generates corrected code to handle the issue (replacing '?' with NaN - 'Not a Number' - and dropping rows with missing values). A hedged sketch of this cleaning-and-training workflow follows this list.
  • Model Training, Evaluation, and Follow-up: After successfully training the model, the agent reports evaluation metrics: Mean Squared Error (MSE), a measure of prediction error (10.71 shown), and R^2 score, indicating how well the model fits the data (0.79 shown, closer to 1 is better). It also handles follow-up requests, like calculating these metrics for the training set.
  • Data Visualization and Further Debugging: The user then asks for a visualization of the model's coefficients (values indicating the importance of each input feature). The agent uses the `matplotlib` library to create a bar chart. Further self-debugging occurs when the visualization code initially fails (e.g., `AttributeError`, incorrect function arguments for rotating labels). The agent iteratively corrects its code based on error messages over several (partially omitted) turns until the visualization is successfully generated.
  • Zero-Shot Capability Demonstration: A key aspect noted in the caption is that this interaction happens without 'in-context demonstrations', meaning the agent wasn't given prior examples of how to perform this specific task; it relies on its pre-trained knowledge.
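The data-science portion of this interaction can be sketched as below; this is not the agent's actual code, and a toy inline dataset (with a '?' entry standing in for the malformed value) replaces the downloaded auto MPG CSV so the snippet runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Tiny stand-in for the auto MPG data; the real interaction downloads a CSV.
# The '?' entry reproduces the malformed value that triggered the ValueError.
df = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150", "140", "198", "220", "215"],
    "weight":     [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
    "mpg":        [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0],
})

# Self-debug step narrated in the figure: replace '?' with NaN, then drop rows.
df = df.replace("?", np.nan).dropna()
df["horsepower"] = df["horsepower"].astype(float)

X, y = df[["horsepower", "weight"]], df["mpg"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```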
Scientific Validity
  • Demonstration of Claimed Capabilities: The figure provides a concrete example supporting the claim that CodeActAgent can interact with Python packages (pandas, scikit-learn, matplotlib) to perform complex, multi-step tasks.
  • Demonstration of Self-Debugging: The example effectively showcases the multi-turn interaction and, crucially, the agent's ability to use automated feedback (error messages) for self-debugging, a key feature highlighted in the paper.
  • Representativeness: As a single example, it may not be fully representative of the agent's general performance across all possible tasks or error types. The success shown could be specific to this particular scenario or model version (Mistral-7b).
  • Evidence for Zero-Shot Learning: The successful execution without in-context examples provides evidence for the agent's zero-shot capabilities in this domain, leveraging its underlying training.
  • Task Complexity: The complexity of the task (data loading, cleaning, modeling, evaluation, visualization, iterative debugging) lends credibility to the demonstration of sophisticated task handling.
Communication
  • Clear demarcation of interaction roles: The layout clearly distinguishes between the User's requests, the CodeActAgent's thoughts and code actions, and the Environment's responses (including code output and error messages).
  • Effective use of annotations: Annotations highlighting specific capabilities (e.g., 'Use Pandas Library...', 'Self-Debug from Automated Feedback', 'Use Matplotlib Library...') effectively guide the reader's attention to the key demonstrated features.
  • Illustration of multi-turn interaction: The multi-turn dialogue format successfully illustrates the iterative nature of the interaction, including follow-up questions and error correction cycles.
  • Impact of omissions: While necessary for brevity, the omission of some messages and code snippets (indicated by '[...omitted for space...]') slightly fragments the flow and requires trust or external viewing (via the provided link) for a complete picture.
  • Provision of link to full interaction: Including a link to the full interaction is a good practice for transparency and allows interested readers to examine the complete sequence.
  • Showcasing complex workflow: The figure effectively showcases a complex workflow involving data processing, model training, evaluation, and visualization, demonstrating the agent's ability to handle sophisticated tasks.
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
Figure/Table Image (Page 3)
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
First Reference in Text
In this section, we first describe CodeAct framework (§2.1) and provide empirical evidence that supports the choice of CodeAct.
Description
  • Comparison Overview: This table presents a qualitative comparison between two methods for defining actions taken by Large Language Model (LLM) agents: CodeAct (using executable Python code) and traditional Text/JSON formats. LLM agents are AI systems that can perform tasks by interacting with environments or tools.
  • Data Availability and Complex Operations: The comparison is structured across four key aspects. For 'Availability of Data', it states CodeAct benefits from the large amount of code data already available for 'pre-training' LLMs (the initial phase where models learn from vast datasets), whereas Text/JSON requires specific data curation. For 'Complex Operation', it claims CodeAct natively supports 'control and data flow' (like loops or conditional statements, e.g., 'if-then'), while Text/JSON requires careful engineering for similar complexity. 'Control flow' refers to the order in which instructions are executed, while 'data flow' refers to how data is passed between operations.
  • Tool Availability and Automated Feedback: Regarding 'Availability of Tools', the table asserts that CodeAct allows direct use of existing 'software packages' (libraries of pre-written code, like Python's extensive collection found on PyPI), while Text/JSON often requires human effort to create or adapt tools. For 'Automated Feedback', it highlights that CodeAct can leverage built-in programming language feedback mechanisms like 'traceback' (an error report generated when code fails), which are common in software development, whereas obtaining feedback for Text/JSON actions might require more manual setup.
  • Summary Indicators: The table uses checkmarks (✓) to indicate perceived advantages for CodeAct in all four categories and crosses (X) to indicate perceived limitations for Text/JSON in the same categories.
Scientific Validity
  • Qualitative Claims vs. Evidence: The table presents qualitative arguments and claims about the benefits of CodeAct. While these claims are plausible and align with common understanding of programming vs. structured text, the table itself does not provide quantitative evidence; it serves to outline the hypothesized advantages that are presumably tested later in the paper.
  • Data Availability Claim: The claim regarding data availability for pre-training is generally true; LLMs are trained on vast amounts of text and code from the internet. However, the extent to which this pre-training directly translates to effective agent action generation in the CodeAct format versus simpler formats requires empirical validation.
  • Complex Operations Claim: The comparison regarding complex operations (control/data flow) accurately reflects the inherent capabilities of programming languages versus typical JSON/Text API call structures. Code inherently supports loops, conditionals, and variable manipulation.
  • Tool Availability Claim: The claim about tool availability accurately points to the vast ecosystem of existing software libraries accessible via code, contrasting with the often bespoke nature of tools defined for Text/JSON agents.
  • Automated Feedback Claim: The point about automated feedback (e.g., tracebacks) is valid, as programming environments provide rich debugging information. Integrating such feedback effectively into an LLM agent's reasoning loop is a key aspect of the CodeAct proposal.
  • Potential Oversimplification of Text/JSON: The comparison might oversimplify the Text/JSON approach. While basic implementations might be limited, more sophisticated frameworks exist that attempt to add complexity (e.g., orchestration layers) to Text/JSON agents, although perhaps less natively than code.
Communication
  • Clear comparison format: The table uses a clear two-column comparison format, making it easy to contrast CodeAct and Text/JSON across the specified criteria.
  • Effective use of symbols: The use of checkmarks ("✓") and crosses ("X") provides an immediate visual summary of the purported advantages and disadvantages of each approach.
  • Relevant comparison criteria: The criteria listed (Availability of Data, Complex Operation, Availability of Tools, Automated Feedback) are relevant dimensions for comparing agent action frameworks.
  • Concise explanations: The brief descriptions accompanying each checkmark/cross concisely explain the reasoning behind the assessment (e.g., 'Large quantity of code available for pre-training', 'Requires human effort to curate tools').
  • Use of footnotes for clarification: Footnotes provide necessary context and sources for claims, such as links related to Python packages and error handling.
  • Summarization effectiveness: The table effectively summarizes the core arguments for preferring CodeAct, serving as a useful high-level overview before diving into detailed experimental results.
Table 2: Atomic API call correctness on API-Bank. The best performance is...
Full Caption

Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.

Figure/Table Image (Page 5)
Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.
First Reference in Text
We present results in Tab. 2.
Description
  • Experiment Goal and Context: This table presents results from an experiment measuring the 'correctness' of different Large Language Models (LLMs) when making 'atomic API calls'. An 'atomic API call' refers to a single, indivisible request made to a software tool or service (an Application Programming Interface or API). 'Correctness' is judged by whether executing the generated API call produces outputs that match the benchmark's ground truth (as noted under Scientific Validity below). The experiment uses the 'API-Bank' benchmark, a dataset designed to test how well LLMs can use tools.
  • Action Formats Compared: The table compares three different formats ('Format of Action') the LLMs were asked to use for making these API calls: 'CodeAct' (generating a Python function call), 'JSON' (generating a structured data object), and 'Text' (generating a plain text representation of the call).
  • Models Tested and Metric: Results are shown as percentages (Correctness %, higher is better) for a variety of LLMs, categorized into 'Open-source LLMs' (models whose underlying code is publicly available, like Llama-2 and Mistral) and 'Closed-source LLMs' (proprietary models like GPT-4, Claude-2, Gemini-pro).
  • Results for Open-Source Models: For open-source models, performance varies. For example, Llama-2-70b-chat-hf achieves 35.6% correctness with CodeAct and 37.6% with Text, but only 14.3% with JSON. Mistral-7B-Instruct-v0.1 shows low scores across all formats (2.5%, 2.3%, 3.0%). CodeAct and Text often perform better than JSON for these models.
  • Results for Closed-Source Models: For closed-source models, correctness scores are generally much higher. For instance, gpt-4-1106-preview achieves 76.7% with CodeAct, 82.7% with JSON, and 73.4% with Text. Here, JSON often performs comparably or sometimes better than CodeAct, unlike the trend in open-source models. gpt-4-0613 shows 75.4% for CodeAct and 82.0% for JSON.
  • Overall Format Performance Summary: A summary at the bottom ('Frequency of Best-Performing Format') counts how many times each format achieved the highest score. Overall, CodeAct was best 8 times, JSON 5 times, and Text 4 times across all 17 models listed.
Scientific Validity
  • Standardized Benchmark (API-Bank): The use of the API-Bank benchmark provides a standardized basis for comparison, assuming the benchmark itself is well-designed and relevant for testing atomic tool use.
  • Controlled Experiment Design (Atomic Calls): Focusing on 'atomic' API calls isolates the LLM's ability to formulate a single correct call, specifically ablating the control/data flow advantages tested elsewhere. This provides a controlled comparison of format familiarity/preference.
  • Appropriate Metric (Correctness): The 'correctness' metric, defined in the text as matching ground-truth API outputs after execution, seems appropriate for evaluating functional accuracy in this context.
  • Diversity of Models Tested: The inclusion of a diverse set of 17 LLMs, both open- and closed-source, strengthens the generalizability of the findings.
  • Observed Dichotomy (Open vs. Closed Source): The results suggest a potential difference in how open-source vs. closed-source models handle different action formats, particularly JSON. This might reflect differences in pre-training data or subsequent fine-tuning strategies employed by closed-source model providers, a valid point for investigation.
  • Consistency with Field Observations: The relatively lower performance of open-source models compared to closed-source models on this task is consistent with general observations in the field.
Communication
  • Table structure and clarity: The table structure is clear, grouping models into open-source and closed-source categories and presenting correctness scores for each action format (CodeAct, JSON, Text) side-by-side.
  • Use of formatting (bold/underline): Using bolding for the best performance and underlining for the second-best within each row (model) effectively highlights the top-performing action formats for each LLM.
  • Clarity of headers: Column headers ('Format of Action', 'Correctness (%,↑)', 'CodeAct', 'JSON', 'Text') are unambiguous.
  • Inclusion of summary statistics: The inclusion of a summary row ('Frequency of Best-Performing Format ↑') at the bottom provides a useful high-level takeaway regarding which format most often yields the best results across the tested models.
  • Model specificity: Listing a wide range of specific LLMs (e.g., CodeLlama variants, Llama-2 variants, Mistral, Claude, GPT variants, Gemini) allows for detailed comparison.
Table 3: Success rates (higher the better) and average turns required per...
Full Caption

Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.

Figure/Table Image (Page 5)
Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.
First Reference in Text
We include full results in Tab. 3 and a subset of results for visualization in Fig. 1.
Description
  • Benchmark and Purpose: This table evaluates various Large Language Models (LLMs) on a benchmark called 'M³ToolEval'. This benchmark is designed to test the ability of LLMs to solve complex tasks that require using multiple software tools over several steps or 'turns' of interaction.
  • Metrics Measured: Two primary metrics are reported: 'Success Rate (%)', which measures how often the LLM successfully completed the tasks (higher values are better), and 'Avg. Turns', the average number of interaction steps required per task (lower values are better).
  • Action Formats Compared: Similar to Table 2, the comparison focuses on three different 'action formats' the LLMs used: 'CodeAct' (Python code), 'JSON', and 'Text'. The performance for each format is shown for each LLM.
  • Models Tested: The table includes results for numerous LLMs, separated into 'Open-source LLMs' (like Llama-2, Mistral) and 'Closed-source LLMs' (like GPT-4, Claude-2, Gemini-pro).
  • Performance Highlights (Success Rate): Performance varies significantly. Open-source models generally show low success rates (e.g., Llama-2-70b-chat-hf achieves 11.0% success with CodeAct, 3.7% with JSON/Text). Closed-source models perform much better; for example, 'gpt-4-1106-preview' achieves a 74.4% success rate with CodeAct, compared to 52.4% with JSON and 53.7% with Text.
  • Performance Highlights (Average Turns): In terms of efficiency, 'gpt-4-1106-preview' using CodeAct required the fewest average turns (5.5), whereas JSON and Text required more turns (7.6 and 7.7, respectively). This trend of CodeAct requiring fewer turns is observed for many, but not all, models.
  • Overall Format Performance Summary: The summary rows indicate that CodeAct was the best-performing format for success rate in 12 out of 17 models and required the fewest turns in 12 out of 17 models, suggesting a strong advantage on this benchmark compared to JSON (best success rate 5 times, fewest turns 3 times) and Text (best success rate 4 times, fewest turns 2 times).
Scientific Validity
  • Benchmark Relevance (M³ToolEval): The M³ToolEval benchmark, described in the text (§2.3) as requiring complex coordination and composition of multiple tools in multi-turn interactions, is specifically designed to test the scenarios where CodeAct is hypothesized to excel (leveraging control and data flow). Evaluating on this benchmark directly tests the central claims.
  • Appropriateness of Metrics: The use of success rate and average interaction turns are standard and appropriate metrics for evaluating task completion and efficiency in agent benchmarks.
  • Model Diversity: Testing across a wide range of LLMs (17 total, including open- and closed-source) enhances the robustness and potential generalizability of the findings regarding the effectiveness of CodeAct.
  • Support for Claims (Benefit in Complexity): The results presented strongly support the paper's claim that CodeAct's advantages (seen modestly in atomic calls, Table 2) become more prominent in complex tasks. The absolute improvements in success rate (e.g., ~20% for gpt-4-1106-preview) and reductions in turns are substantial.
  • Zero-Shot Evaluation Setting: The text mentions the evaluation is zero-shot (§2.3), meaning models were not given specific examples within the prompt. This tests the models' inherent ability to use the action formats without task-specific prompt engineering, adding to the validity.
  • Observed Performance Gaps: The large performance gap between open-source and top closed-source models highlights limitations in current open-source models' abilities on complex agent tasks, even with the potentially advantageous CodeAct format.
Communication
  • Clear presentation of metrics: The table effectively presents two key metrics (success rate, average turns) side-by-side for easy comparison across different action formats and models.
  • Logical grouping of models: Grouping models into open-source and closed-source categories aids in comparing trends between these two types of models.
  • Effective use of formatting: The use of bolding and underlining to highlight the best and second-best results per model is an effective visual aid for quickly identifying top performers.
  • Clarity of headers: Headers are clear and indicate the directionality of metrics ('%,↑' for success rate, '↓' for average turns).
  • Inclusion of summary statistics: Including summary rows ('Frequency of Best-performing Format ↑') provides a concise takeaway regarding the overall performance dominance of each action format across the evaluated models for both metrics.

Empowering Open-source LLM Agent to be Better at CodeAct

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 4: Statistics of our training mixture and comparison with prior work....
Full Caption

Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.

Figure/Table Image (Page 7)
Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.
First Reference in Text
The statistics of the resulting dataset CodeActInstruct are shown in Tab. 4.
Description
  • Purpose and Data Categories: This table details the composition and size of the data mixture used for training the CodeActAgent models in the study. It separates the data into three main categories: datasets from 'Prior Work' for comparison, the authors' newly compiled 'CodeActInstruct' dataset, and 'General Conversation' data.
  • Statistics Provided and Token Explanation: For each dataset listed, the table provides three statistics: '# of Data Instances' (number of individual examples or dialogues), '# of Total Tokens' (the total number of basic text units, like words or sub-words, as defined by the Llama-2 tokenizer), and 'Avg. Tokens Per Instance' (the average length of an example). 'Tokens' are the fundamental units that language models process; different models break text down into tokens differently, hence specifying the 'Llama-2 tokenizer' is important. A hedged sketch of how such token statistics could be reproduced follows this list.
  • CodeActInstruct Dataset Statistics: The 'CodeActInstruct' dataset, created by the authors, is shown to consist of 7,139 instances totaling approximately 10.6 million tokens, with an average length of 1482 tokens per instance. It's compiled from several sources targeting different agent capabilities: Information Seeking (HotpotQA - 1,664 instances), Software Package/Tool Usage (MATH - 1,732 instances, APPS - 647 instances), External Memory (WikiTableQuestion - 1,065 instances), and Robot Planning (ALFWorld - 2,031 instances).
  • General Conversation Data Statistics: The 'General Conversation' data includes sources like OpenOrca (50,000 instances, ~14M tokens), ShareGPT (two sources totaling ~14.6k instances and ~36M tokens), and CapyBara (4,647 instances, ~5M tokens). This part of the mixture totals 69,230 instances and over 55 million tokens, generally with shorter average lengths (e.g., 280 for OpenOrca, 797 overall) compared to CodeActInstruct.
  • Comparison with Prior Work Datasets: For context, statistics for two prior datasets, FireAct and AgentInstruct, are also listed. FireAct has 2,063 instances and ~0.5M tokens, while AgentInstruct has 1,866 instances and ~2.5M tokens. This comparison highlights that the authors' CodeActInstruct dataset contains significantly more instances and tokens than these specific prior works.
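As a hedged sketch, token statistics like those described above could be reproduced roughly as follows with the Hugging Face `transformers` library; the tokenizer checkpoint name and the way instances are represented are assumptions (and the official Llama-2 checkpoint is gated, requiring access approval).

```python
from transformers import AutoTokenizer

# Assumption: the gated official Llama-2 checkpoint is used for tokenization.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def token_stats(instances: list[list[str]]) -> tuple[int, float]:
    """Each instance is a multi-turn trajectory given as a list of message strings.
    Returns (total tokens, average tokens per instance)."""
    per_instance = [
        sum(len(tokenizer.encode(m, add_special_tokens=False)) for m in inst)
        for inst in instances
    ]
    total = sum(per_instance)
    return total, total / max(len(per_instance), 1)

# Toy usage with two fake trajectories:
total, avg = token_stats([
    ["Find the MPG dataset.", "import pandas as pd"],
    ["Solve x + 1 = 2.", "x = 1"],
])
print(total, round(avg, 1))
```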
Scientific Validity
  • Transparency of Data Composition: Providing a detailed breakdown of the training data mixture, including sources, instance counts, and token counts, is crucial for transparency and reproducibility.
  • Dataset Size and Diversity (CodeActInstruct): The CodeActInstruct dataset appears reasonably large (7k instances, 10M tokens) and diverse, covering multiple domains relevant to agent capabilities (search, tool use, memory, planning). This diversity supports the goal of training a versatile agent.
  • Data Mixture Strategy: The inclusion of a much larger general conversation dataset (69k instances, 55M tokens) alongside the specialized CodeActInstruct data reflects a common strategy to maintain the model's general language understanding and conversational abilities while fine-tuning for specific tasks.
  • Tokenizer Specification: Specifying the tokenizer (Llama-2) used for calculating token statistics is essential, as token counts can vary significantly between different tokenizers.
  • Contextualization via Comparison: Comparing the scale of CodeActInstruct to prior works (FireAct, AgentInstruct) helps position the contribution but relies on the specific choice of prior works for comparison.
  • Token Length Insights: The average token lengths suggest that CodeActInstruct instances (avg. 1482 tokens) are substantially longer than the general conversation instances (avg. 797 tokens), likely reflecting the complexity and multi-turn nature of the agent tasks.
Communication
  • Clear categorization of data: The table clearly categorizes the training data into 'Prior Work', 'CodeActInstruct (Ours)', and 'General Conversation', making it easy to understand the components of the training mixture.
  • Detailed breakdown of data sources: Breaking down 'CodeActInstruct' and 'General Conversation' into their constituent datasets (e.g., HotpotQA, MATH, OpenOrca, ShareGPT) provides transparency about the data sources.
  • Inclusion of relevant statistics: Including statistics like the number of instances, total tokens, and average tokens per instance provides a quantitative overview of the dataset sizes.
  • Comparison with prior work: Comparing the size of the authors' 'CodeActInstruct' dataset with 'Prior Work' (FireAct, AgentInstruct) helps contextualize its scale.
  • Specification of tokenizer: Specifying the tokenizer used (Llama-2) is crucial information for reproducibility and understanding the token counts.
  • Clarity of headers: The column headers are clear and accurately describe the data presented.
Table 5: Evaluation results for CodeActAgent. The best results among all...
Full Caption

Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.

Figure/Table Image (Page 8)
Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
First Reference in Text
As shown in Tab. 5, CodeActAgent (both variants) perform better than all evaluated open-source LLMs on both the in- and out-of-domain subsets of MINT.
Description
  • Evaluation Overview: This table presents a comprehensive evaluation of the authors' 'CodeActAgent' models (fine-tuned versions of Llama-2 7B and Mistral 7B) against various baseline models and prior work. The evaluation covers two main categories: 'Agent Tasks' and 'Generic Tasks'. '7B' refers to the model size, indicating approximately 7 billion parameters, a measure of model complexity.
  • Agent Task Benchmarks: 'Agent Tasks' assess the models' ability to perform actions. This is split into tasks using 'Code as Action' (evaluated on MINT and M³ToolEval benchmarks) and 'Text as Action' (evaluated on Miniwob++ and SciWorld). MINT (Multi-turn Interaction with Tools) results are further divided into 'ID' (In-Domain, tasks similar to the training data) and 'OD' (Out-of-Domain, tasks different from training data). M³ToolEval is a benchmark for complex multi-tool use. Miniwob++ involves controlling web browser elements, and SciWorld is a text-based environment simulating science experiments.
  • Generic Task Benchmarks: 'Generic Tasks' assess broader capabilities. MMLU measures knowledge across many subjects. HumanEval tests code generation ability. GSM8K assesses mathematical reasoning. MT-Bench evaluates instruction following and conversational ability.
  • Example Results for CodeActAgent (Mistral): Results are presented as scores (likely percentages or benchmark-specific metrics). For example, CodeActAgent (Mistral) scores 57.4 on MINT ID (Code), 32.4 on MINT OD (Code), 12.2 on M³ToolEval, 46.2 on Miniwob++, 15.9 on SciWorld, 59.1 on MMLU, 34.7 on HumanEval, 58.0 on GSM8K, and 8.2 on MT-Bench.
  • Performance Comparisons: The table compares CodeActAgent to baseline models (e.g., Llama2 Chat, Mistral Instruct), prior agent models (FireAct, AgentLM), and powerful closed-source models (gpt-3.5-turbo-0613, gpt-4-0613). CodeActAgent (Mistral) generally outperforms other open-source models on agent tasks (indicated by bolding) and maintains strong performance on generic tasks.
  • Overall Average Performance: An 'Overall Average' score is calculated to summarize performance across diverse tasks. The caption notes it excludes ID tasks for fairness and normalizes the MT-Bench score (likely because its 1-10 scale differs from the percentage-based metrics). CodeActAgent (Mistral) achieves the highest overall average (42.5) among open-source models. A hedged arithmetic check of this average follows this list.
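As a hedged arithmetic check of the caption's description, assume the 1-10 MT-Bench score is rescaled to a 0-100 range and in-domain (ID) results are excluded; averaging the CodeActAgent (Mistral) numbers quoted above then lands very close to the reported 42.5.

```python
# Scores quoted in the description for CodeActAgent (Mistral); ID tasks excluded.
# Assumption: MT-Bench (1-10 scale) is normalized by multiplying by 10.
scores = {
    "MINT OD": 32.4, "M3ToolEval": 12.2, "Miniwob++": 46.2, "SciWorld": 15.9,
    "MMLU": 59.1, "HumanEval": 34.7, "GSM8K": 58.0, "MT-Bench": 8.2 * 10,
}
overall = sum(scores.values()) / len(scores)
print(f"{overall:.1f}")  # ~42.6, close to the 42.5 reported in Table 5
```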
Scientific Validity
  • Breadth and Quality of Benchmarks: The evaluation uses a diverse suite of established benchmarks covering both specialized agent capabilities (MINT, M³ToolEval, Miniwob++, SciWorld) and general LLM abilities (MMLU, HumanEval, GSM8K, MT-Bench), providing a comprehensive assessment.
  • ID/OD Split Evaluation: Distinguishing between In-Domain (ID) and Out-of-Domain (OD) performance on MINT is methodologically sound, as it helps assess the model's ability to generalize beyond its specific training tasks.
  • Inclusion of Baselines and Prior Work: Comparing against relevant baselines (standard instruction-tuned models) and prior work focused on agents (FireAct, AgentLM) provides appropriate context for evaluating the contribution of CodeActAgent.
  • Support for Claims: The results clearly show significant improvements for CodeActAgent, particularly the Mistral variant, over other open-source models on agent-specific tasks, supporting the paper's claims about the effectiveness of the proposed fine-tuning approach.
  • Calculation of Overall Average: The normalization of MT-Bench and exclusion of ID tasks in the overall average calculation is a reasonable approach to create a more balanced summary metric across disparate benchmarks.
  • Contextualization with SOTA: The performance gap between the best open-source model (CodeActAgent Mistral) and the closed-source models (especially GPT-4) remains substantial across most tasks, reflecting the current landscape in LLM capabilities.
  • Empirical Evidence Strength: The table provides strong empirical evidence for the effectiveness of the CodeActInstruct fine-tuning data and methodology for improving open-source LLM agent capabilities.
Communication
  • Table organization: The table is well-structured, clearly separating models, evaluation task categories (Agent Tasks, Generic Tasks), and specific benchmarks.
  • Granularity of results: Grouping agent tasks by action type (Code vs. Text) and distinguishing between In-Domain (ID) and Out-of-Domain (OD) results for MINT provides valuable granularity.
  • Clarity of headers and abbreviations: Headers are mostly clear, although abbreviations like MINT, MMLU, GSM8K, etc., rely on reader familiarity or reference to the text.
  • Use of formatting (bold/underline): The use of bolding and underlining effectively highlights the best and second-best performing open-source models on each task, aiding quick comparison.
  • Overall average and explanation: The inclusion of an 'Overall Average' provides a useful summary, and the caption clarifies how it's calculated (normalization, exclusion of ID tasks), which is good practice.
  • Inclusion of closed-source benchmarks: Comparing open-source models (including the authors' CodeActAgent) with closed-source models (gpt-3.5, gpt-4) provides important context regarding the current state-of-the-art.
Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and...
Full Caption

Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and M³ToolEval.

Figure/Table Image (Page 14)
Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and M³ToolEval.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Context: This table provides concrete examples of how a single intended action – adding an agenda item with specific content ('Meeting with John') and time ('2023-10-26 09:00:00') – would be represented using the three different action formats discussed in the paper: CodeAct, JSON, and Text. These formats are used by Large Language Model (LLM) agents to invoke tools or APIs.
  • CodeAct Example: For the 'CodeAct' format, the action is shown as a Python function call: `AddAgenda(content="Meeting with John", time="2023-10-26 09:00:00")`. This format uses standard programming syntax.
  • JSON Example: For the 'JSON' format, the action is represented as a JSON (JavaScript Object Notation) object. This is a structured data format using key-value pairs: `{"action": "AddAgenda", "content": "Meeting with John", "time": "2023-10-26 09:00:00"}`. Here, the action name and its parameters are explicitly labeled fields within the object.
  • Text Example: For the 'Text' format, the action is shown as a semi-structured text string: `Action: AddAgenda, content: Meeting with John, time: 2023-10-26 09:00:00`. This format uses keywords and delimiters (like commas and colons) but is less rigidly structured than JSON or code. A hedged sketch of how each format would be parsed and dispatched follows this list.
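To make the practical difference between the formats concrete, here is a hedged sketch of how each representation of the same action would reach the tool; the `AddAgenda` implementation and the parsing rules are assumptions.

```python
import json

def AddAgenda(content: str, time: str) -> str:
    """Hypothetical tool backing the 'AddAgenda' action."""
    return f"Added '{content}' at {time}"

# CodeAct: the action is directly executable Python.
exec('print(AddAgenda(content="Meeting with John", time="2023-10-26 09:00:00"))')

# JSON: the action must be parsed, then dispatched to the matching tool.
action = json.loads('{"action": "AddAgenda", "content": "Meeting with John", '
                    '"time": "2023-10-26 09:00:00"}')
tool = {"AddAgenda": AddAgenda}[action.pop("action")]
print(tool(**action))

# Text: the key-value string needs custom parsing before dispatch.
text = "Action: AddAgenda, content: Meeting with John, time: 2023-10-26 09:00:00"
fields = dict(part.split(": ", 1) for part in text.split(", "))
print({"AddAgenda": AddAgenda}[fields.pop("Action")](**fields))
```

Consistent with Table 1's argument, the code action is executable as-is, while the JSON and Text forms require the framework designer to define and maintain a parsing and dispatch layer.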
Scientific Validity
  • Accuracy of Representation: The examples accurately reflect the syntax and structure characteristic of Python function calls (CodeAct), JSON objects, and simple key-value text representations (Text).
  • Illustrative Value: The table serves an illustrative purpose, clarifying the concrete differences between the abstract concepts of 'Code as Action', 'JSON as Action', and 'Text as Action' used throughout the paper's experiments (e.g., in Table 2 and Table 3).
  • Suitability of Example Action: While simple, the chosen action ('AddAgenda') effectively demonstrates how parameters are handled in each format, which is sufficient for illustrating the basic syntactic differences relevant to atomic API calls.
  • Contextual Role: This table does not present experimental results but provides necessary context for understanding the experimental setup and results presented elsewhere.
Communication
  • Clarity of comparison: The table clearly presents the same conceptual action ('AddAgenda') formatted in three distinct ways (CodeAct, JSON, Text), allowing for direct comparison.
  • Effective layout: The side-by-side layout is simple and effective for illustrating the syntactic differences between the action formats.
  • Concrete examples: The examples chosen are concrete and easy to understand, clearly showing how parameters ('content', 'time') are represented in each format.
  • Clear labeling: Labeling each row with the format name ('CodeAct', 'JSON', 'Text') is unambiguous.
Table A.7: Comparison between M³ToolEval and existing tool-use evaluation...
Full Caption

Table A.7: Comparison between M³ToolEval and existing tool-use evaluation benchmark.

Figure/Table Image (Page 14)
Table A.7: Comparison between M³ToolEval and existing tool-use evaluation benchmark.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Benchmarks Compared: This table compares the M³ToolEval benchmark, introduced in this paper, against four existing benchmarks used for evaluating how well Large Language Models (LLMs) can use software tools or APIs. The benchmarks compared are ToolBench (Qin et al., 2023b), APIBench (Patil et al., 2023), API-Bank (Li et al., 2023), and another benchmark also named ToolBench (Xu et al., 2023).
  • Comparison Criteria: The comparison is based on five criteria: 'Requiring multi-turn interaction' (does the task need multiple steps?), 'Multiple tools' (does the task involve using more than one tool?), 'Evaluation' method (how is success measured?), 'No dependency on external API*' (does the benchmark rely on third-party web services?), and 'Supported API Action Format' (what formats can the LLM use to call tools?).
  • Multi-turn and Multi-tool Comparison: According to the table, M³ToolEval requires multi-turn interaction and involves multiple tools, similar to ToolBench (Qin et al.) but unlike the other three benchmarks listed, which are marked with 'X' for these features.
  • Evaluation Method Comparison: For evaluation, M³ToolEval uses 'Answer Match' (checking if the final answer is correct); a minimal sketch of such a check follows this list. ToolBench (Qin et al.) uses an 'LLM Evaluator' (another LLM judges the performance). APIBench uses 'AST Tree Match' (comparing the structure of generated code/calls). API-Bank uses 'API-Call Match'. ToolBench (Xu et al.) uses 'Test Case' evaluation.
  • External API Dependency Comparison: M³ToolEval is marked as having 'No dependency on external API', meaning it doesn't rely on potentially unavailable third-party services, unlike ToolBench (Qin et al.), APIBench, and ToolBench (Xu et al.). API-Bank is also marked as not having this dependency.
  • Supported Action Format Comparison: Regarding supported action formats, M³ToolEval supports 'CodeAct & JSON & Text'. ToolBench (Qin et al.) and API-Bank support 'JSON'. APIBench and ToolBench (Xu et al.) support 'CodeAct'.
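For concreteness, an 'Answer Match' check of the kind attributed to M³ToolEval might look like the sketch below; the normalization rules here are an assumption, not taken from the benchmark's implementation.

```python
# Hedged sketch of an answer-match evaluator: compare the agent's final answer
# to a reference answer after light normalization (assumed rules).
def answer_match(predicted: str, reference: str) -> bool:
    normalize = lambda s: " ".join(s.strip().lower().split())
    return normalize(predicted) == normalize(reference)

print(answer_match("  42 ", "42"))             # True
print(answer_match("Paris", "paris"))          # True
print(answer_match("Paris, France", "Paris"))  # False
```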
Scientific Validity
  • Positioning of M³ToolEval: The table aims to position M³ToolEval within the landscape of existing benchmarks by highlighting its unique combination of features, particularly multi-turn interaction, multiple tools, answer-based evaluation, lack of external dependency, and broad action format support.
  • Relevance of Comparison Criteria: The chosen comparison criteria (multi-turn, multi-tool, evaluation method, external dependency, format support) are relevant dimensions for differentiating benchmark designs and capabilities in the context of LLM tool use.
  • Accuracy of Benchmark Characterization: The characterization of each benchmark against the criteria appears generally consistent with the descriptions in the cited papers, although nuances might exist. For example, the exact nature of 'multi-turn' or 'multiple tools' can vary.
  • Practical Advantage Highlighted (API Dependency): Highlighting the lack of dependency on external APIs is a valid practical advantage, as external services can introduce unreliability or cost.
  • Benchmark Suitability for Study: Supporting multiple action formats (CodeAct, JSON, Text) makes M³ToolEval suitable for the comparative experiments conducted in the paper (e.g., Table 3).
  • Justification for New Benchmark: The table effectively justifies the need for creating M³ToolEval by showing that no single existing benchmark (among those compared) combined all the desired features, particularly the focus on complex, multi-step, multi-tool tasks evaluated by final answer correctness without external dependencies.
Communication
  • Clear comparison format: The table uses a clear columnar format to compare M³ToolEval against four other benchmarks across several key features.
  • Effective use of symbols: The use of checkmarks (✓) and crosses (X) provides a quick visual summary of whether each benchmark possesses a specific feature.
  • Clear feature labels: Row headers clearly define the features being compared (e.g., 'Requiring multi-turn interaction', 'Multiple tools').
  • Clear benchmark identification: Column headers clearly identify the benchmarks being compared, including citations.
  • Highlighting unique features: The table effectively highlights the purported unique combination of features in M³ToolEval (multi-turn, multiple tools, answer matching evaluation, no external API dependency, support for CodeAct/JSON/Text).
  • Informative footnote: The footnote explaining the potential impact of relying on external APIs adds important context to that specific comparison point.
Table A.8: Ablation study results. The best results are bolded, and the...
Full Caption

Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.

Figure/Table Image (Page 14)
Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Ablation Study: This table presents an 'ablation study', a type of experiment where components are systematically removed to understand their contribution. Here, the study investigates the importance of the two main parts of the training data mixture used to create the CodeActAgent models: the specialized 'CodeActInstruct' dataset and the 'general conversations' dataset.
  • Models and Conditions Compared: Two base models are considered: the Llama2-based CodeActAgent and the Mistral-based CodeActAgent. For each, three versions are compared: the full model (trained on both data types), a model trained without the CodeActInstruct data ('w/o CodeAct'), and a model trained without the general conversation data ('w/o general conversations').
  • Evaluation Benchmarks: The models are evaluated on the same suite of 'Agent Tasks' (MINT ID/OD, Miniwob++, SciWorld) and 'Generic LLM Tasks' (MMLU, HumanEval, GSM8K, MTBench) as in Table 5.
  • Results for Llama2-based Ablation: For the Llama2-based agent, removing CodeActInstruct data ('w/o CodeAct') drastically reduces performance on MINT agent tasks (e.g., MINT ID drops from 51.3 to 17.0) but has a smaller impact on generic tasks (e.g., MMLU drops slightly from 50.6 to 49.5). Removing general conversations ('w/o general conversations') significantly hurts generic task performance (e.g., MMLU drops to 46.4, MTBench drops from 7.5 to 4.1) and also negatively impacts agent tasks, though less severely than removing CodeActInstruct data (MINT ID drops to 29.2).
  • Results for Mistral-based Ablation: Similar trends are observed for the Mistral-based agent. Removing CodeActInstruct data ('w/o CodeAct') substantially lowers agent task scores (MINT ID drops from 57.4 to 32.9) while maintaining relatively high generic task scores (MMLU 59.9, HumanEval 33.2). Removing general conversations ('w/o general conversations') severely degrades performance on both generic tasks (MMLU drops to 52.4, MTBench drops from 8.2 to 2.6) and agent tasks (MINT ID drops to 50.5, Miniwob++ drops to 0.0).
  • Overall Average Impact: The 'Overall Average' score reflects these trends, showing significant drops when either data component is removed, but highlighting the particular importance of CodeActInstruct data for agent tasks and general conversation data for generic capabilities and overall robustness (a quick arithmetic view of these drops follows this list).
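To put the quoted drops in relative terms, the back-of-the-envelope sketch below uses only the MINT (ID) numbers restated above; it is illustrative arithmetic, not a recomputation of the paper's results.

```python
# Relative performance drop (%) when a training-data component is removed,
# computed from the MINT (ID) numbers quoted in this description.
def relative_drop(full: float, ablated: float) -> float:
    return 100.0 * (full - ablated) / full

print(f"Llama2 w/o CodeAct:  {relative_drop(51.3, 17.0):.1f}% drop")   # ≈ 66.9%
print(f"Mistral w/o CodeAct: {relative_drop(57.4, 32.9):.1f}% drop")   # ≈ 42.7%
```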
Scientific Validity
  • Methodological Soundness (Ablation): Ablation studies are a standard and valid method for assessing the contribution of different components (in this case, training data subsets) to a model's performance.
  • Clear Component Separation: Ablating the two primary distinct components of the training mixture (specialized agent data vs. general conversation data) allows for a clear analysis of their respective roles.
  • Comprehensive Evaluation Suite: Evaluating on a diverse set of both agent-specific and generic tasks provides strong evidence for the differential impact of the ablated data components on different capabilities.
  • Support for Hypotheses: The results strongly support the hypothesis that the CodeActInstruct data is crucial for imparting the specialized agent skills, while the general conversation data is essential for maintaining broad language understanding and instruction-following capabilities.
  • Consistency Across Models: The consistency of the trends observed across both Llama2 and Mistral base models strengthens the conclusion about the roles of the different data components.
  • Justification for Data Mixture: The study effectively demonstrates that simply training on general conversation data is insufficient for achieving strong agent performance ('w/o CodeAct' results), and training only on agent data significantly degrades general capabilities ('w/o general conversations' results), justifying the use of the data mixture.
Communication
  • Clarity of Ablation Comparison: The table clearly presents the results of the ablation study, comparing the full model against versions trained without specific data components.
  • Grouping by Base Model: Grouping results by base model (Llama2 and Mistral) makes it easy to see the impact of ablations on each.
  • Consistent Structure with Table 5: The structure mirrors Table 5, facilitating comparison of ablated models to the full model and other baselines.
  • Headers and Formatting: Headers and formatting (bolding/underlining) are used effectively, consistent with previous tables.
  • Clarity of Ablation Labels: The labels 'w/o CodeAct' and 'w/o general conversations' clearly indicate the ablated conditions.
Table A.9: CodeActInstruct components and the number of instances for training...
Full Caption

Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.

Figure/Table Image (Page 18)
Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Components: This table outlines the different datasets that make up the 'CodeActInstruct' collection, which was used as the basis for generating 'training trajectories'. A training trajectory is essentially a recorded example of an AI agent successfully completing a task, often involving multiple steps and interactions, which can then be used to teach the agent.
  • Dataset Breakdown and Instance Counts: The table categorizes the components by the 'Domain' or 'Capability' they are intended to train. These include: 'Web Search' capability using the HotpotQA dataset (3,000 instances); 'Math Reasoning' using the MATH dataset (5,586 instances); 'Code Generation' using the APPS dataset (4,439 instances); 'Tabular Reasoning' using the WikiTableQuestion dataset (3,000 instances); and 'Embodied Planning' (simulated robot tasks) using the ALFWorld dataset (3,553 instances).
  • Capabilities Targeted by Each Dataset: Each listed dataset corresponds to a specific skill the researchers aimed to develop in their CodeActAgent. For example, HotpotQA involves answering questions that require finding and combining information from multiple sources (simulating web search). MATH and APPS involve solving math problems and programming challenges, respectively, requiring logical reasoning and the use of code/libraries. WikiTableQuestion involves reasoning over data presented in tables. ALFWorld involves planning sequences of actions in a simulated household environment.
  • Focus on Instance Count for Trajectory Generation: The table focuses specifically on the number of 'instances' (individual problems or tasks) drawn from each source dataset that were used for the initial step of generating interaction examples (trajectories). The total number of instances listed sums to 19,578 (a quick tally is shown after this list).
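A quick tally of the listed components confirms the stated total; the parenthetical capability labels below paraphrase the mapping given in this description.

```python
# Per-dataset instance counts used for trajectory generation, as listed above.
instances = {
    "HotpotQA (web search)":        3000,
    "MATH (math reasoning)":        5586,
    "APPS (code generation)":       4439,
    "WikiTableQuestion (tabular)":  3000,
    "ALFWorld (embodied planning)": 3553,
}
assert sum(instances.values()) == 19578
print(sum(instances.values()))  # 19578
```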
Scientific Validity
  • Dataset Diversity and Relevance: The selection of diverse datasets (HotpotQA, MATH, APPS, WikiTableQuestion, ALFWorld) covering various agent capabilities (search, reasoning, coding, planning) provides a strong foundation for generating comprehensive training data for a multi-skilled agent.
  • Use of Established Benchmarks: Using established benchmarks as source data ensures a certain level of task quality and relevance, leveraging prior work in specific capability areas.
  • Scale of Data Selection: The number of instances selected from each dataset seems substantial (thousands per category), suggesting an effort to capture sufficient examples for each capability during the trajectory generation phase.
  • Distinction from Final Training Data: This table describes the input data used for generating trajectories. The scientific validity of the final CodeActInstruct dataset (as detailed in Table 4, which shows ~7k instances after filtering) depends on the quality of the trajectory generation process and the subsequent filtering heuristics applied (described in §G.2), not just the initial selection shown here.
  • Data Selection/Filtering Rationale: The process of down-sampling or selecting instances from the original datasets (e.g., selecting 'hard' instances from HotpotQA, filtering MATH by difficulty, as mentioned in §G.1) before trajectory generation is a reasonable step to focus computational effort on more challenging examples.
Communication
  • Clear Structure: The table clearly lists the components of the CodeActInstruct dataset, categorized by domain/capability.
  • Mapping Datasets to Capabilities: Mapping specific datasets (e.g., HotpotQA, MATH) to capabilities (e.g., Information seeking, Math Reasoning) provides clarity on the intended purpose of each component.
  • Quantitative Information (Instances): Providing the number of instances for each component gives a quantitative sense of the data scale used for generating trajectories, although total token counts (as in Table 4) are omitted here.
  • Clarity of Headers: Headers ('Domain', 'Capability', 'Dataset', '# of Instances') are clear and informative.
  • Inclusion of Citations: Citations for the datasets used are included, allowing readers to find more information about the source data.
