Executable Code Actions Elicit Better LLM Agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
Proceedings of the 41st International Conference on Machine Learning
Department of Computer Science, University of Illinois Urbana-Champaign

Table of Contents

Overall Summary

Study Background and Main Findings

This paper addresses limitations in current Large Language Model (LLM) agents, which typically interact with environments or tools using predefined text or structured JSON formats. These formats often restrict the complexity and flexibility of actions an agent can perform, hindering their ability to solve complex, multi-step problems. The research proposes CodeAct, a novel framework where LLM agents generate executable Python code as their actions. This approach aims to create a unified and more powerful action space by leveraging Python's inherent expressiveness, control flow structures (like loops and conditionals), data handling capabilities, and access to a vast ecosystem of existing software libraries.
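To make the framework concrete, the interaction can be pictured as a loop: the model emits Python source, the environment executes it, and the captured output (or an error traceback) is returned as the next observation. The sketch below illustrates this loop under simplifying assumptions; the executor, the stubbed model call, and the empty tool registry are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of a CodeAct-style interaction loop (illustrative assumptions only;
# the executor, stub LLM, and tool registry are not the paper's implementation).
import contextlib
import io
import traceback


def execute_code_action(code: str, tools: dict) -> str:
    """Run one code action; return captured stdout or a traceback as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, dict(tools))  # tools are exposed as ordinary Python callables
        return buffer.getvalue() or "(no output)"
    except Exception:
        return traceback.format_exc()  # error text becomes feedback the agent can act on


def stub_llm(history: list) -> str:
    """Placeholder for the policy model; returns a fixed code action for this demo."""
    return "total = sum(range(10))\nprint(total)"


history = ["User: add the numbers 0 through 9"]
action = stub_llm(history)                     # agent turn: emit executable Python
observation = execute_code_action(action, {})  # environment turn: run it
history.extend([action, observation])          # the observation feeds the next turn
print(observation)                             # prints 45
```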

The effectiveness of CodeAct was evaluated by comparing it against text and JSON action formats across 17 different LLMs on two benchmarks: API-Bank (for simple, single-tool actions) and a newly curated benchmark, M3ToolEval (designed for complex tasks requiring multiple tools and interaction turns). The results demonstrated that CodeAct performs comparably or better on simple tasks and significantly outperforms alternatives on complex tasks, achieving up to a 20% absolute increase in success rate and requiring fewer interaction turns. This suggests CodeAct effectively utilizes LLMs' familiarity with code (from pre-training) and excels where complex logic or tool composition is needed.

Recognizing a performance gap between open-source and proprietary LLMs, the researchers developed CodeActInstruct, a dataset containing ~7,000 multi-turn interaction examples using CodeAct, specifically filtered to include instances of self-debugging and improvement based on feedback (like error messages). They used this dataset, combined with general conversation data, to fine-tune open-source models (Llama-2 7B, Mistral 7B), creating CodeActAgent. Evaluation showed that CodeActAgent significantly improved performance on agent tasks compared to baseline open-source models, demonstrated generalization to text-based actions, and maintained strong performance on general LLM benchmarks.
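The self-improvement filter can be pictured with a toy heuristic: keep only trajectories in which an execution error appears and the task is nonetheless solved by the end. The snippet below is a hypothetical sketch of that idea; the trajectory schema and the 'Traceback' string check are assumptions for illustration, not the actual CodeActInstruct pipeline (described in §3.1 of the paper).

```python
# Hypothetical filter for self-improvement trajectories (schema and heuristic assumed;
# the real CodeActInstruct construction is described in §3.1 of the paper).
from typing import List, TypedDict


class Turn(TypedDict):
    role: str      # "agent" or "environment"
    content: str   # code action or execution output


def shows_self_improvement(trajectory: List[Turn], task_solved: bool) -> bool:
    """Keep trajectories where an execution error occurs and the task still succeeds."""
    saw_error = any(
        turn["role"] == "environment" and "Traceback" in turn["content"]
        for turn in trajectory
    )
    return saw_error and task_solved


example: List[Turn] = [
    {"role": "agent", "content": "df = pd.read_csv(url)"},
    {"role": "environment", "content": "Traceback (most recent call last): ..."},
    {"role": "agent", "content": "import pandas as pd\ndf = pd.read_csv(url)"},
    {"role": "environment", "content": "(no output)"},
]
print(shows_self_improvement(example, task_solved=True))  # prints True
```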

The main conclusion is that using executable Python code (CodeAct) offers a substantial advantage over text/JSON for LLM agent actions, particularly in complex scenarios. The CodeAct framework, combined with targeted instruction tuning using datasets like CodeActInstruct, enables the development of more capable and autonomous agents that can leverage existing software ecosystems and even debug their own actions. This work provides both a conceptual framework and practical resources (datasets, models) for advancing LLM agent capabilities.

Research Impact and Future Directions

This research presents a compelling argument for shifting the paradigm of Large Language Model (LLM) agent actions from constrained formats like JSON or text towards executable Python code, encapsulated in the CodeAct framework. The core strength lies in leveraging the inherent flexibility, control flow (e.g., loops, conditionals), data manipulation capabilities, and vast library ecosystem of a mature programming language. Empirical results, particularly the up to 20% higher success rate on complex multi-tool tasks compared to traditional methods, provide substantial evidence for CodeAct's potential. The development of the M3ToolEval benchmark specifically addresses a gap in evaluating complex agent interactions, strengthening the validation.

The creation of the CodeActInstruct dataset and the subsequent fine-tuning of CodeActAgent models represent significant practical contributions, particularly for advancing open-source LLM agent capabilities. The demonstration that fine-tuning with this specialized data, mixed with general conversation data, improves agent performance without substantially degrading general abilities is a valuable finding for practical LLM development. The agent's ability to leverage existing Python packages and perform self-debugging based on execution errors points towards more autonomous and capable AI systems.

However, some limitations warrant consideration. While CodeAct improves performance, a significant gap persists between open-source and leading closed-source models, suggesting that the action format alone isn't a panacea for underlying model capability differences. The effectiveness of self-debugging might vary depending on error complexity and the base model's reasoning ability. Furthermore, while the benchmarks used are valuable, generalizing performance to the full spectrum of real-world tasks requires ongoing validation. The reliance on LLMs to generate correct and safe code also introduces potential security considerations not fully explored within this scope. Despite these points, CodeAct offers a promising and demonstrably effective direction for building more powerful and flexible LLM agents.

Critical Analysis and Recommendations

Clear Problem Definition (written-content)
The abstract clearly defines the problem of limited action scope and flexibility in current LLM agents using JSON/text formats. + This establishes a strong rationale and context for the proposed CodeAct solution, immediately grounding the reader in the paper's objectives.
Section: Abstract
Highlights Key Performance Result (written-content)
The abstract highlights CodeAct's superior performance with a specific metric (up to 20% higher success rate). + Quantifying the key result upfront effectively conveys the significance and potential impact of the proposed framework.
Section: Abstract
Specify Novelty of Curated Benchmark (written-content)
The abstract mentions a newly curated benchmark but doesn't specify its unique contribution (e.g., focus on complex multi-tool tasks). + Clarifying the benchmark's novelty would strengthen the abstract by highlighting a key methodological contribution and providing better context for the performance claims.
Section: Abstract
Clear Context, Problem Definition, and Solution Presentation (written-content)
The introduction clearly defines the problem with text/JSON actions and systematically presents CodeAct's advantages (e.g., dynamic interaction, library access, control flow). + This structured presentation effectively motivates the research and helps readers grasp the core value proposition of the CodeAct framework.
Section: Introduction
Effective Use of Figure Reference for Conceptual Clarity (written-content)
The introduction effectively uses Figure 1 to visually contrast CodeAct with text/JSON and preview performance gains. + This visual aid reinforces the conceptual differences and quantitative advantages, making the core arguments more accessible and compelling.
Section: Introduction
Strong Empirical Validation Strategy (written-content)
The paper provides strong empirical validation using two distinct experiments: API-Bank for atomic actions (testing familiarity) and the novel M3ToolEval benchmark for complex multi-tool actions (testing control/data flow). + This well-designed strategy effectively isolates and demonstrates the different hypothesized benefits of CodeAct, lending credibility to the claims.
Section: CodeAct Makes LLMs Better Agents
Novel Benchmark (M3ToolEval) for Complex Tool Use (written-content)
The introduction and evaluation using the M3ToolEval benchmark addresses a gap by focusing on complex multi-tool composition, unlike many existing benchmarks. + This novel benchmark allows for a more rigorous evaluation of CodeAct's capabilities in scenarios where its advantages are expected to be most pronounced.
Section: CodeAct Makes LLMs Better Agents
Quantitative Evidence Supporting CodeAct Superiority (written-content)
Results (Tables 2 & 3) show CodeAct outperforms JSON/Text, achieving up to 20.7% higher success rates and requiring fewer turns on the complex M3ToolEval benchmark for most LLMs. + This quantitative evidence strongly supports the central claim that CodeAct enhances agent performance, particularly for complex tasks.
Section: CodeAct Makes LLMs Better Agents
Demonstrates Practical Benefits (Libraries, Self-Debugging) (written-content)
Section 2.4 demonstrates CodeAct's ability to leverage existing software libraries (e.g., Pandas, Scikit-Learn) and facilitate self-debugging via error messages (Figure 3). + This showcases practical benefits beyond benchmark scores, highlighting CodeAct's potential for building more autonomous and versatile agents capable of complex workflows.
Section: CodeAct Makes LLMs Better Agents
Elaborate on Open/Closed-Source Gap and Link to Motivation (written-content)
The significant performance gap observed between open- and closed-source models on M3ToolEval using CodeAct is noted, but the potential reasons and link to the motivation for CodeActInstruct are not fully elaborated in the main text. + Expanding on this gap and explicitly connecting it to the need for fine-tuning (Section 3) would create a stronger narrative bridge and better justify the subsequent work.
Section: CodeAct Makes LLMs Better Agents
Detailed and Systematic Dataset Construction (CodeActInstruct) (written-content)
The methodology for constructing the CodeActInstruct dataset (7k trajectories) is detailed, including source selection, repurposing, trajectory generation, and filtering for self-improvement patterns. + This systematic approach enhances the dataset's quality and relevance for training agents capable of learning from interaction and self-correction.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Comprehensive Evaluation of Fine-Tuned Agent (CodeActAgent) (written-content)
The evaluation of CodeActAgent is comprehensive, assessing performance on CodeAct tasks (in/out-of-domain), generalization to text actions, and standard LLM benchmarks. + This holistic evaluation provides a robust assessment of the fine-tuned models' capabilities and the trade-offs involved.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Successful Integration of Diverse Training Data (written-content)
The successful integration of CodeActInstruct with general conversation data improves agent performance without significantly harming general capabilities (Table 5, Table A.8). + This demonstrates a practical and effective fine-tuning strategy for developing specialized agents that retain broad utility.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Clear Contextualization and Problem Framing (written-content)
The Related Work section effectively situates CodeAct within the standard LLM agent architecture and clearly frames the problem as standardizing the action space. + This provides clear context and focus for the paper's contributions within the broader field.
Section: Related Work
Effective Differentiation from Related Concepts (written-content)
The section differentiates CodeAct from general code generation and concurrent work (TaskWeaver) by referencing detailed comparisons in appendices. + This appropriately acknowledges related research while maintaining focus in the main text and clarifying the unique aspects of CodeAct.
Section: Related Work
Concise Summary of Contributions and Core Advantage (written-content)
The conclusion concisely summarizes the main contributions (CodeAct framework, CodeActInstruct dataset, CodeActAgent model) and reiterates the core advantage over text/JSON actions. + This provides a clear and effective wrap-up of the paper's key takeaways.
Section: Conclusions
Highlights Key Agent Capabilities (written-content)
The conclusion highlights key capabilities of CodeActAgent, such as Python integration, complex task execution, library use, and self-debugging. + This effectively emphasizes the practical potential and advanced features enabled by the proposed approach.
Section: Conclusions

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Comparison between CodeAct and Text / JSON as action. (top)...
Full Caption

Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).

Figure/Table Image (Page 2)
Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).
First Reference in Text
However, both methods typically suffer from constrained scope of action spaces (actions are usually tailored for specific tasks) and restricted flexibility (e.g., tool uses in Fig. 1 top left).
Description
  • Illustrative Comparison of Action Methods: This part of the figure presents a side-by-side comparison of how a Large Language Model (LLM) agent tackles a specific task: finding the most cost-effective country (USA, Japan, Germany, India) in which to buy a smartphone (the phone model is named 'CodeAct 1' in the figure). LLM agents are AI systems designed to understand instructions and perform actions, such as using software tools (APIs). The left side shows the agent using traditional Text or JSON formats to call the available tools (e.g., 'lookup_rates', 'lookup_phone_price') one by one, with each call requiring a separate interaction with the 'Environment'. The right side shows the same task performed using 'CodeAct', where the agent generates a single block of Python code. This code uses a loop ('for country in countries:') to iterate through the countries, calls the necessary tools as functions, stores intermediate results in variables ('final_prices'), and uses the built-in 'min' function to find the lowest price, completing the task in fewer overall interactions. A minimal code sketch of this composition appears after this list.
  • Efficiency Claim (Fewer Actions): The example highlights that the CodeAct method can achieve the task objective with fewer interactions compared to the Text/JSON method, which requires multiple back-and-forth steps for each country being checked.
  • Code Features Demonstrated (Control Flow, Data Flow, Library Use): The CodeAct example demonstrates the use of programming concepts like variables to store intermediate results (e.g., 'final_prices'), loops ('for') for repetitive operations across different countries, and leveraging existing Python library functions ('min'), contrasting with the single-tool-call limitation often seen in Text/JSON approaches.
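To ground the description above, the following sketch reproduces the composition pattern from the figure. The tool bodies and numbers are stand-ins (in the real setting 'lookup_rates' and 'lookup_phone_price' are provided by the environment); only the loop over countries, the per-country tool calls, and the use of 'min' follow the figure.

```python
# Sketch of the CodeAct action in Figure 1 (tool bodies and prices are stand-ins; only
# the composition pattern - loop, per-country tool calls, min() - follows the figure).

def lookup_rates(country: str) -> tuple:
    """Stand-in tool: returns (exchange rate to USD, tax rate) for a country."""
    return {"USA": (1.0, 0.08), "Japan": (150.0, 0.10),
            "Germany": (0.92, 0.19), "India": (83.0, 0.18)}[country]


def lookup_phone_price(model: str, country: str) -> float:
    """Stand-in tool: returns the local price of the given phone model."""
    return {"USA": 999.0, "Japan": 140000.0, "Germany": 910.0, "India": 79000.0}[country]


countries = ["USA", "Japan", "Germany", "India"]
final_prices = {}
for country in countries:                                # control flow across tool calls
    exchange_rate, tax_rate = lookup_rates(country)
    local_price = lookup_phone_price("CodeAct 1", country)
    final_prices[country] = local_price * (1 + tax_rate) / exchange_rate  # convert to USD
most_cost_effective = min(final_prices, key=final_prices.get)  # reuse a built-in function
print(most_cost_effective, round(final_prices[most_cost_effective], 2))
```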
Scientific Validity
  • Representativeness of Example: The example serves as a conceptual illustration. Its representativeness of typical or complex scenarios where CodeAct's advantages become significant is not established by this single instance alone. The chosen task might be specifically selected to favor the CodeAct approach.
  • Accuracy of Depicted Mechanisms: The illustration accurately depicts how Python code can encapsulate multiple tool calls, control flow, and data manipulation, contrasting with the typical one-call-per-interaction pattern of simple Text/JSON tool use.
  • Support for Efficiency Claim: The claim 'Fewer Actions Required!' is visually supported by the reduced number of agent interaction blocks shown on the right compared to the left, assuming the omitted steps on the left are numerous.
Communication
  • Visual layout and annotations: The side-by-side layout effectively contrasts the Text/JSON approach with the CodeAct approach for the illustrative task. The annotations highlighting 'Fewer Actions Required!' and specific code advantages (e.g., control flow, library re-use) aid comprehension.
  • Clarity of interaction flow: The step-by-step interaction flow for both methods is generally clear, showing the action taken by the agent and the environment's response.
  • Conceptual clarity: While illustrative, the example clearly conveys the core concept of using executable code versus structured text/JSON for agent actions.
  • Omission of steps: The omission of intermediate steps for brevity (indicated by '[...]') is necessary but slightly hinders a full understanding of the interaction count difference without referring to the text.

CodeAct Makes LLMs Better Agents

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: General agent multi-turn interaction framework that describes the...
Full Caption

Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).

Figure/Table Image (Page 4)
Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).
First Reference in Text
CodeActInstruct is motivated by a general agent framework consisting of agent, user, and environments (Fig. 2) and focuses on agent-environment interactions with the computer (information seeking, software package use, external memory) and the physical world (robot planning).
Description
  • Comparison Setup: This part of the figure displays quantitative results comparing three different 'action modes' – ways for Large Language Models (LLMs) to perform actions – namely CodeAct (using Python code), JSON (using a structured data format), and Text (using plain text commands). The comparison is performed on the M³ToolEval benchmark, which consists of complex tasks requiring multiple tools (§2.3).
  • Metrics and Models Compared: Two metrics are shown in separate bar charts: 'Success Rate (%)' (higher is better), indicating how often the LLM successfully completed the task, and 'Average Number of Interaction Turns' (lower is better), showing how many steps were needed on average. Results are presented for several different LLMs, including various versions of GPT (e.g., gpt-4-1106-preview, gpt-3.5-turbo-0613), Claude-2, Gemini-pro, text-davinci-003, and Llama-2-70b-chat-hf.
  • Success Rate Results Trend: For most models shown, particularly the more capable ones like gpt-4 versions, CodeAct (represented by the darkest bar) generally achieves the highest success rate compared to JSON and Text actions. For instance, gpt-4-1106-preview shows a success rate noticeably above 70% with CodeAct, while its success rate with JSON and Text is closer to 50%.
  • Interaction Turns Results Trend: Correspondingly, for the higher-performing models, CodeAct also tends to require fewer interaction turns. For gpt-4-1106-preview, the average turns are below 6 for CodeAct, whereas they are closer to 8 for JSON and Text.
  • Trend with Model Capability: The charts visually suggest that the performance gap between CodeAct and the other methods widens for more powerful LLMs (those listed higher on the y-axis), implying CodeAct's benefits might be more pronounced with more capable base models.
Scientific Validity
  • Benchmark Dependency: The figure presents results from experiments on the M³ToolEval benchmark. The validity depends on the quality, complexity, and representativeness of this benchmark, which is described as newly curated in the text (§2.3). Without detailed information on the benchmark tasks, it's difficult to fully assess if the observed performance differences generalize.
  • Model Coverage: The comparison across 17 LLMs (though only 8 are shown clearly in the snippet) provides reasonable breadth, covering both open-source and proprietary models.
  • Appropriateness of Metrics: The metrics (Success Rate, Average Turns) are standard and appropriate for evaluating agent task performance and efficiency.
  • Consistency with Claims: The figure visually supports the paper's claim that CodeAct outperforms alternatives, particularly for more capable LLMs, on this specific benchmark. The trend of increasing advantage with model scale appears consistent across the presented models.
  • Potential Confounding Factors: Potential confounding factors, such as differences in prompt engineering for each action mode or the specific implementation details, are not detailed in the figure caption itself but are crucial for the validity of the comparison.
Communication
  • Chart type clarity: The horizontal bar charts are a standard and clear way to compare success rates and interaction turns across different models and action modes.
  • Use of color-coding: Color-coding the bars based on the action mode (Code, JSON, Text) allows for easy visual comparison of the different approaches for each LLM.
  • Axis and model labeling: Labeling the axes ('Success Rate (%)', 'Average Number of Interaction Turns') and listing the specific LLM models clearly identifies the metrics and subjects of comparison.
  • Separation of metrics: The separation into two charts (Success Rate and Interaction Turns) allows for focused comparison on each metric.
  • Axis scaling: The range of the x-axis for Success Rate (0-70%) might visually exaggerate differences for lower-performing models. Similarly, the Interaction Turns axis (5-10) focuses on a specific range.
  • Interpretation dependency: While the figure presents the data, interpreting the significance of the differences (e.g., whether a 5% increase in success rate is statistically significant) requires referring to the main text or statistical analysis not present in the figure itself.
Figure 3: Example multi-turn interaction with Python packages using...
Full Caption

Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.

Figure/Table Image (Page 6)
Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.
First Reference in Text
As shown in Fig. 3, CodeActAgent, designed for seamless integration with Python, can carry out sophisticated tasks (e.g., model training, data visualization) using existing Python packages.
Description
  • Interaction Overview: This figure demonstrates a conversational interaction between a human user and an AI agent called CodeActAgent (powered by the Mistral-7b language model). The goal is to perform a data science task involving an auto MPG (Miles Per Gallon) dataset.
  • Initial Task Request: The interaction spans multiple turns. The user initially asks the agent to download the dataset, preprocess it (check for missing values), split it into training and testing sets, and train a regression model (a statistical technique to predict MPG based on other car features).
  • Use of Python Libraries: The agent uses specific Python libraries: `pandas` for data loading and manipulation (reading a CSV file from a URL), `numpy` for numerical operations, and `scikit-learn` for machine learning tasks like splitting data (`train_test_split`) and implementing the `LinearRegression` model. A minimal sketch of this pipeline appears after this list.
  • Self-Debugging Example (ValueError): The figure highlights the agent's self-debugging capability. Initially, the agent's code fails due to unexpected characters ('?') in the data, causing a `ValueError`. The environment provides a traceback (an error report). The agent analyzes this feedback and generates corrected code to handle the issue (replacing '?' with NaN - 'Not a Number' - and dropping rows with missing values).
  • Model Training, Evaluation, and Follow-up: After successfully training the model, the agent reports evaluation metrics: Mean Squared Error (MSE), a measure of prediction error (10.71 shown), and R^2 score, indicating how well the model fits the data (0.79 shown, closer to 1 is better). It also handles follow-up requests, like calculating these metrics for the training set.
  • Data Visualization and Further Debugging: The user then asks for a visualization of the model's coefficients (values indicating the importance of each input feature). The agent uses the `matplotlib` library to create a bar chart. Further self-debugging occurs when the visualization code initially fails (e.g., `AttributeError`, incorrect function arguments for rotating labels). The agent iteratively corrects its code based on error messages over several (partially omitted) turns until the visualization is successfully generated.
  • Zero-Shot Capability Demonstration: A key aspect noted in the caption is that this interaction happens without 'in-context demonstrations', meaning the agent wasn't given prior examples of how to perform this specific task; it relies on its pre-trained knowledge.
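The workflow described in this list is a fairly standard pandas/scikit-learn pipeline; the sketch below reconstructs its main steps. The dataset URL, parsing options, and train/test split settings are assumptions for illustration, and the resulting metrics will not exactly match those reported in the figure.

```python
# Sketch of the Figure 3 pipeline (dataset URL, parsing options, and split settings are
# assumptions; the '?' -> NaN cleanup, LinearRegression fit, and MSE/R^2 follow the figure).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model_year", "origin"]
df = pd.read_csv(url, names=columns, sep=" ", skipinitialspace=True, comment="\t")

# The raw file marks missing horsepower values with '?', which breaks numeric parsing;
# replace them with NaN and drop the affected rows (the step the agent had to debug).
df = df.replace("?", np.nan).dropna()
df["horsepower"] = df["horsepower"].astype(float)

X = df.drop(columns=["mpg"])
y = df["mpg"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))  # the figure reports 10.71 on its own split
print("R^2:", r2_score(y_test, pred))            # and an R^2 of 0.79
```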
Scientific Validity
  • Demonstration of Claimed Capabilities: The figure provides a concrete example supporting the claim that CodeActAgent can interact with Python packages (pandas, scikit-learn, matplotlib) to perform complex, multi-step tasks.
  • Demonstration of Self-Debugging: The example effectively showcases the multi-turn interaction and, crucially, the agent's ability to use automated feedback (error messages) for self-debugging, a key feature highlighted in the paper.
  • Representativeness: As a single example, it may not be fully representative of the agent's general performance across all possible tasks or error types. The success shown could be specific to this particular scenario or model version (Mistral-7b).
  • Evidence for Zero-Shot Learning: The successful execution without in-context examples provides evidence for the agent's zero-shot capabilities in this domain, leveraging its underlying training.
  • Task Complexity: The complexity of the task (data loading, cleaning, modeling, evaluation, visualization, iterative debugging) lends credibility to the demonstration of sophisticated task handling.
Communication
  • Clear demarcation of interaction roles: The layout clearly distinguishes between the User's requests, the CodeActAgent's thoughts and code actions, and the Environment's responses (including code output and error messages).
  • Effective use of annotations: Annotations highlighting specific capabilities (e.g., 'Use Pandas Library...', 'Self-Debug from Automated Feedback', 'Use Matplotlib Library...') effectively guide the reader's attention to the key demonstrated features.
  • Illustration of multi-turn interaction: The multi-turn dialogue format successfully illustrates the iterative nature of the interaction, including follow-up questions and error correction cycles.
  • Impact of omissions: While necessary for brevity, the omission of some messages and code snippets (indicated by '[...omitted for space...]') slightly fragments the flow and requires trust or external viewing (via the provided link) for a complete picture.
  • Provision of link to full interaction: Including a link to the full interaction is a good practice for transparency and allows interested readers to examine the complete sequence.
  • Showcasing complex workflow: The figure effectively showcases a complex workflow involving data processing, model training, evaluation, and visualization, demonstrating the agent's ability to handle sophisticated tasks.
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
Figure/Table Image (Page 3)
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
First Reference in Text
In this section, we first describe CodeAct framework (§2.1) and provide empirical evidence that supports the choice of CodeAct.
Description
  • Comparison Overview: This table presents a qualitative comparison between two methods for defining actions taken by Large Language Model (LLM) agents: CodeAct (using executable Python code) and traditional Text/JSON formats. LLM agents are AI systems that can perform tasks by interacting with environments or tools.
  • Data Availability and Complex Operations: The comparison is structured across four key aspects. For 'Availability of Data', it states CodeAct benefits from the large amount of code data already available for 'pre-training' LLMs (the initial phase where models learn from vast datasets), whereas Text/JSON requires specific data curation. For 'Complex Operation', it claims CodeAct natively supports 'control and data flow' (like loops or conditional statements, e.g., 'if-then'), while Text/JSON requires careful engineering for similar complexity. 'Control flow' refers to the order in which instructions are executed, while 'data flow' refers to how data is passed between operations.
  • Tool Availability and Automated Feedback: Regarding 'Availability of Tools', the table asserts that CodeAct allows direct use of existing 'software packages' (libraries of pre-written code, like Python's extensive collection found on PyPI), while Text/JSON often requires human effort to create or adapt tools. For 'Automated Feedback', it highlights that CodeAct can leverage built-in programming language feedback mechanisms like 'traceback' (an error report generated when code fails), which are common in software development, whereas obtaining feedback for Text/JSON actions might require more manual setup.
  • Summary Indicators: The table uses checkmarks (✓) to indicate perceived advantages for CodeAct in all four categories and crosses (X) to indicate perceived limitations for Text/JSON in the same categories.
Scientific Validity
  • Qualitative Claims vs. Evidence: The table presents qualitative arguments and claims about the benefits of CodeAct. While these claims are plausible and align with common understanding of programming vs. structured text, the table itself does not provide quantitative evidence; it serves to outline the hypothesized advantages that are presumably tested later in the paper.
  • Data Availability Claim: The claim regarding data availability for pre-training is generally true; LLMs are trained on vast amounts of text and code from the internet. However, the extent to which this pre-training directly translates to effective agent action generation in the CodeAct format versus simpler formats requires empirical validation.
  • Complex Operations Claim: The comparison regarding complex operations (control/data flow) accurately reflects the inherent capabilities of programming languages versus typical JSON/Text API call structures. Code inherently supports loops, conditionals, and variable manipulation.
  • Tool Availability Claim: The claim about tool availability accurately points to the vast ecosystem of existing software libraries accessible via code, contrasting with the often bespoke nature of tools defined for Text/JSON agents.
  • Automated Feedback Claim: The point about automated feedback (e.g., tracebacks) is valid, as programming environments provide rich debugging information. Integrating such feedback effectively into an LLM agent's reasoning loop is a key aspect of the CodeAct proposal.
  • Potential Oversimplification of Text/JSON: The comparison might oversimplify the Text/JSON approach. While basic implementations might be limited, more sophisticated frameworks exist that attempt to add complexity (e.g., orchestration layers) to Text/JSON agents, although perhaps less natively than code.
Communication
  • Clear comparison format: The table uses a clear two-column comparison format, making it easy to contrast CodeAct and Text/JSON across the specified criteria.
  • Effective use of symbols: The use of checkmarks ("✓") and crosses ("X") provides an immediate visual summary of the purported advantages and disadvantages of each approach.
  • Relevant comparison criteria: The criteria listed (Availability of Data, Complex Operation, Availability of Tools, Automated Feedback) are relevant dimensions for comparing agent action frameworks.
  • Concise explanations: The brief descriptions accompanying each checkmark/cross concisely explain the reasoning behind the assessment (e.g., 'Large quantity of code available for pre-training', 'Requires human effort to curate tools').
  • Use of footnotes for clarification: Footnotes provide necessary context and sources for claims, such as links related to Python packages and error handling.
  • Summarization effectiveness: The table effectively summarizes the core arguments for preferring CodeAct, serving as a useful high-level overview before diving into detailed experimental results.
Table 2: Atomic API call correctness on API-Bank. The best performance is...
Full Caption

Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.

Figure/Table Image (Page 5)
Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.
First Reference in Text
We present results in Tab. 2.
Description
  • Experiment Goal and Context: This table presents results from an experiment measuring the 'correctness' of different Large Language Models (LLMs) when making 'atomic API calls'. An 'atomic API call' refers to a single, indivisible request made to a software tool or service (an Application Programming Interface or API). 'Correctness' here likely means whether the LLM generated the API call with the exact right structure, tool name, and parameters as expected by the benchmark. The experiment uses the 'API-Bank' benchmark, a dataset designed to test how well LLMs can use tools.
  • Action Formats Compared: The table compares three different formats ('Format of Action') the LLMs were asked to use for making these API calls: 'CodeAct' (generating a Python function call), 'JSON' (generating a structured data object), and 'Text' (generating a plain-text representation of the call). A small illustration of the same call rendered in each format appears after this list.
  • Models Tested and Metric: Results are shown as percentages (Correctness %, higher is better) for a variety of LLMs, categorized into 'Open-source LLMs' (models whose underlying code is publicly available, like Llama-2 and Mistral) and 'Closed-source LLMs' (proprietary models like GPT-4, Claude-2, Gemini-pro).
  • Results for Open-Source Models: For open-source models, performance varies. For example, Llama-2-70b-chat-hf achieves 35.6% correctness with CodeAct and 37.6% with Text, but only 14.3% with JSON. Mistral-7B-Instruct-v0.1 shows low scores across all formats (2.5%, 2.3%, 3.0%). CodeAct and Text often perform better than JSON for these models.
  • Results for Closed-Source Models: For closed-source models, correctness scores are generally much higher. For instance, gpt-4-1106-preview achieves 76.7% with CodeAct, 82.7% with JSON, and 73.4% with Text. Here, JSON often performs comparably or sometimes better than CodeAct, unlike the trend in open-source models. gpt-4-0613 shows 75.4% for CodeAct and 82.0% for JSON.
  • Overall Format Performance Summary: A summary at the bottom ('Frequency of Best-Performing Format') counts how many times each format achieved the highest score. Overall, CodeAct was best 8 times, JSON 5 times, and Text 4 times across all 17 models listed.
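To make the three action formats concrete, the snippet below renders one hypothetical atomic call in each of them; the tool name and arguments are illustrative and are not drawn from API-Bank.

```python
# One hypothetical atomic tool call rendered in the three action formats compared in
# Table 2 (the tool name and arguments are illustrative, not taken from API-Bank).
import json

# CodeAct: the action is executable Python.
codeact_action = 'get_weather(city="Tokyo", unit="celsius")'

# JSON: the action is a structured object the framework must parse and dispatch.
json_action = json.dumps({
    "tool": "get_weather",
    "arguments": {"city": "Tokyo", "unit": "celsius"},
})

# Text: the action follows a plain-text convention the framework must parse.
text_action = "Action: get_weather\nAction Input: Tokyo, celsius"

for label, action in [("CodeAct", codeact_action), ("JSON", json_action), ("Text", text_action)]:
    print(f"{label}:\n{action}\n")
```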
Scientific Validity
  • Standardized Benchmark (API-Bank): The use of the API-Bank benchmark provides a standardized basis for comparison, assuming the benchmark itself is well-designed and relevant for testing atomic tool use.
  • Controlled Experiment Design (Atomic Calls): Focusing on 'atomic' API calls isolates the LLM's ability to formulate a single correct call, specifically ablating the control/data flow advantages tested elsewhere. This provides a controlled comparison of format familiarity/preference.
  • Appropriate Metric (Correctness): The 'correctness' metric, defined in the text as matching ground-truth API outputs after execution, seems appropriate for evaluating functional accuracy in this context.
  • Diversity of Models Tested: The inclusion of a diverse set of 17 LLMs, both open- and closed-source, strengthens the generalizability of the findings.
  • Observed Dichotomy (Open vs. Closed Source): The results suggest a potential difference in how open-source vs. closed-source models handle different action formats, particularly JSON. This might reflect differences in pre-training data or subsequent fine-tuning strategies employed by closed-source model providers, a valid point for investigation.
  • Consistency with Field Observations: The relatively lower performance of open-source models compared to closed-source models on this task is consistent with general observations in the field.
Communication
  • Table structure and clarity: The table structure is clear, grouping models into open-source and closed-source categories and presenting correctness scores for each action format (CodeAct, JSON, Text) side-by-side.
  • Use of formatting (bold/underline): Using bolding for the best performance and underlining for the second-best within each row (model) effectively highlights the top-performing action formats for each LLM.
  • Clarity of headers: Column headers ('Format of Action', 'Correctness (%,↑)', 'CodeAct', 'JSON', 'Text') are unambiguous.
  • Inclusion of summary statistics: The inclusion of a summary row ('Frequency of Best-Performing Format ↑') at the bottom provides a useful high-level takeaway regarding which format most often yields the best results across the tested models.
  • Model specificity: Listing a wide range of specific LLMs (e.g., CodeLlama variants, Llama-2 variants, Mistral, Claude, GPT variants, Gemini) allows for detailed comparison.
Table 3: Success rates (higher the better) and average turns required per...
Full Caption

Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.

Figure/Table Image (Page 5)
Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.
First Reference in Text
We include full results in Tab. 3 and a subset of results for visualization in Fig. 1.
Description
  • Benchmark and Purpose: This table evaluates various Large Language Models (LLMs) on a benchmark called 'M³ToolEval'. This benchmark is designed to test the ability of LLMs to solve complex tasks that require using multiple software tools over several steps or 'turns' of interaction.
  • Metrics Measured: Two primary metrics are reported: 'Success Rate (%)', which measures how often the LLM successfully completed the tasks (higher values are better), and 'Avg. Turns', the average number of interaction steps required per task (lower values are better).
  • Action Formats Compared: Similar to Table 2, the comparison focuses on three different 'action formats' the LLMs used: 'CodeAct' (Python code), 'JSON', and 'Text'. The performance for each format is shown for each LLM.
  • Models Tested: The table includes results for numerous LLMs, separated into 'Open-source LLMs' (like Llama-2, Mistral) and 'Closed-source LLMs' (like GPT-4, Claude-2, Gemini-pro).
  • Performance Highlights (Success Rate): Performance varies significantly. Open-source models generally show low success rates (e.g., Llama-2-70b-chat-hf achieves 11.0% success with CodeAct, 3.7% with JSON/Text). Closed-source models perform much better; for example, 'gpt-4-1106-preview' achieves a 74.4% success rate with CodeAct, compared to 52.4% with JSON and 53.7% with Text.
  • Performance Highlights (Average Turns): In terms of efficiency, 'gpt-4-1106-preview' using CodeAct required the fewest average turns (5.5), whereas JSON and Text required more turns (7.6 and 7.7, respectively). This trend of CodeAct requiring fewer turns is observed for many, but not all, models.
  • Overall Format Performance Summary: The summary rows indicate that CodeAct was the best-performing format for success rate in 12 out of 17 models and required the fewest turns in 12 out of 17 models, suggesting a strong advantage on this benchmark compared to JSON (best success rate 5 times, fewest turns 3 times) and Text (best success rate 4 times, fewest turns 2 times).
Scientific Validity
  • Benchmark Relevance (M³ToolEval): The M³ToolEval benchmark, described in the text (§2.3) as requiring complex coordination and composition of multiple tools in multi-turn interactions, is specifically designed to test the scenarios where CodeAct is hypothesized to excel (leveraging control and data flow). Evaluating on this benchmark directly tests the central claims.
  • Appropriateness of Metrics: The use of success rate and average interaction turns are standard and appropriate metrics for evaluating task completion and efficiency in agent benchmarks.
  • Model Diversity: Testing across a wide range of LLMs (17 total, including open- and closed-source) enhances the robustness and potential generalizability of the findings regarding the effectiveness of CodeAct.
  • Support for Claims (Benefit in Complexity): The results presented strongly support the paper's claim that CodeAct's advantages (seen modestly in atomic calls, Table 2) become more prominent in complex tasks. The absolute improvements in success rate (e.g., ~20% for gpt-4-1106-preview) and reductions in turns are substantial.
  • Zero-Shot Evaluation Setting: The text mentions the evaluation is zero-shot (§2.3), meaning models were not given specific examples within the prompt. This tests the models' inherent ability to use the action formats without task-specific prompt engineering, adding to the validity.
  • Observed Performance Gaps: The large performance gap between open-source and top closed-source models highlights limitations in current open-source models' abilities on complex agent tasks, even with the potentially advantageous CodeAct format.
Communication
  • Clear presentation of metrics: The table effectively presents two key metrics (success rate, average turns) side-by-side for easy comparison across different action formats and models.
  • Logical grouping of models: Grouping models into open-source and closed-source categories aids in comparing trends between these two types of models.
  • Effective use of formatting: The use of bolding and underlining to highlight the best and second-best results per model is an effective visual aid for quickly identifying top performers.
  • Clarity of headers: Headers are clear and indicate the directionality of metrics ('%,↑' for success rate, '↓' for average turns).
  • Inclusion of summary statistics: Including summary rows ('Frequency of Best-performing Format ↑') provides a concise takeaway regarding the overall performance dominance of each action format across the evaluated models for both metrics.

Empowering Open-source LLM Agent to be Better at CodeAct

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 4: Statistics of our training mixture and comparison with prior work....
Full Caption

Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.

Figure/Table Image (Page 7)