Executable Code Actions Elicit Better LLM Agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, Heng Ji
Proceedings of the 41st International Conference on Machine Learning
Department of Computer Science, University of Illinois Urbana-Champaign

Table of Contents

Overall Summary

Study Background and Main Findings

This paper addresses limitations in current Large Language Model (LLM) agents, which typically interact with environments or tools using predefined text or structured JSON formats. These formats often restrict the complexity and flexibility of actions an agent can perform, hindering their ability to solve complex, multi-step problems. The research proposes CodeAct, a novel framework where LLM agents generate executable Python code as their actions. This approach aims to create a unified and more powerful action space by leveraging Python's inherent expressiveness, control flow structures (like loops and conditionals), data handling capabilities, and access to a vast ecosystem of existing software libraries.
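To make the framework concrete, the following is a minimal sketch of how a CodeAct-style execution loop could work: the agent emits Python, the environment runs it, and the output (or a traceback) becomes the next observation. The `execute_action` helper and the `lookup_phone_price` tool are illustrative assumptions, not the paper's implementation.

```python
import io
import traceback
from contextlib import redirect_stdout

def execute_action(code: str, tools: dict) -> str:
    """Run an LLM-generated Python action; return its stdout or a traceback."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, dict(tools))  # tools are exposed as ordinary functions
        return buffer.getvalue() or "(no output)"
    except Exception:
        # The traceback itself is returned as the observation,
        # which is what enables self-debugging from execution feedback.
        return traceback.format_exc()

# Hypothetical tool; a real deployment would sandbox execution.
tools = {"lookup_phone_price": lambda country: {"USA": 699, "India": 649}[country]}

action = """
prices = {c: lookup_phone_price(c) for c in ["USA", "India"]}
print(min(prices, key=prices.get))
"""
print(execute_action(action, tools))  # -> India
```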

The effectiveness of CodeAct was evaluated by comparing it against text and JSON action formats across 17 different LLMs on two benchmarks: API-Bank (for simple, single-tool actions) and a newly curated benchmark, M3ToolEval (designed for complex tasks requiring multiple tools and interaction turns). The results demonstrated that CodeAct performs comparably or better on simple tasks and significantly outperforms alternatives on complex tasks, achieving up to a 20% absolute increase in success rate and requiring fewer interaction turns. This suggests CodeAct effectively utilizes LLMs' familiarity with code (from pre-training) and excels where complex logic or tool composition is needed.

Recognizing a performance gap between open-source and proprietary LLMs, the researchers developed CodeActInstruct, a dataset containing ~7,000 multi-turn interaction examples using CodeAct, specifically filtered to include instances of self-debugging and improvement based on feedback (like error messages). They used this dataset, combined with general conversation data, to fine-tune open-source models (Llama-2 7B, Mistral 7B), creating CodeActAgent. Evaluation showed that CodeActAgent significantly improved performance on agent tasks compared to baseline open-source models, demonstrated generalization to text-based actions, and maintained strong performance on general LLM benchmarks.

The main conclusion is that using executable Python code (CodeAct) offers a substantial advantage over text/JSON for LLM agent actions, particularly in complex scenarios. The CodeAct framework, combined with targeted instruction tuning using datasets like CodeActInstruct, enables the development of more capable and autonomous agents that can leverage existing software ecosystems and even debug their own actions. This work provides both a conceptual framework and practical resources (datasets, models) for advancing LLM agent capabilities.

Research Impact and Future Directions

This research presents a compelling argument for shifting the paradigm of Large Language Model (LLM) agent actions from constrained formats like JSON or text towards executable Python code, encapsulated in the CodeAct framework. The core strength lies in leveraging the inherent flexibility, control flow (e.g., loops, conditionals), data manipulation capabilities, and vast library ecosystem of a mature programming language. Empirical results, particularly the up to 20% higher success rate on complex multi-tool tasks compared to traditional methods, provide substantial evidence for CodeAct's potential. The development of the M3ToolEval benchmark specifically addresses a gap in evaluating complex agent interactions, strengthening the validation.

The creation of the CodeActInstruct dataset and the subsequent fine-tuning of CodeActAgent models represent significant practical contributions, particularly for advancing open-source LLM agent capabilities. The demonstration that fine-tuning with this specialized data, mixed with general conversation data, improves agent performance without substantially degrading general abilities is a valuable finding for practical LLM development. The agent's ability to leverage existing Python packages and perform self-debugging based on execution errors points towards more autonomous and capable AI systems.

However, some limitations warrant consideration. While CodeAct improves performance, a significant gap persists between open-source and leading closed-source models, suggesting that the action format alone isn't a panacea for underlying model capability differences. The effectiveness of self-debugging might vary depending on error complexity and the base model's reasoning ability. Furthermore, while the benchmarks used are valuable, generalizing performance to the full spectrum of real-world tasks requires ongoing validation. The reliance on LLMs to generate correct and safe code also introduces potential security considerations not fully explored within this scope. Despite these points, CodeAct offers a promising and demonstrably effective direction for building more powerful and flexible LLM agents.

Critical Analysis and Recommendations

Clear Problem Definition (written-content)
The abstract clearly defines the problem of limited action scope and flexibility in current LLM agents using JSON/text formats. + This establishes a strong rationale and context for the proposed CodeAct solution, immediately grounding the reader in the paper's objectives.
Section: Abstract
Highlights Key Performance Result (written-content)
The abstract highlights CodeAct's superior performance with a specific metric (up to 20% higher success rate). + Quantifying the key result upfront effectively conveys the significance and potential impact of the proposed framework.
Section: Abstract
Specify Novelty of Curated Benchmark (written-content)
The abstract mentions a newly curated benchmark but doesn't specify its unique contribution (e.g., focus on complex multi-tool tasks). + Clarifying the benchmark's novelty would strengthen the abstract by highlighting a key methodological contribution and providing better context for the performance claims.
Section: Abstract
Clear Context, Problem Definition, and Solution Presentation (written-content)
The introduction clearly defines the problem with text/JSON actions and systematically presents CodeAct's advantages (e.g., dynamic interaction, library access, control flow). + This structured presentation effectively motivates the research and helps readers grasp the core value proposition of the CodeAct framework.
Section: Introduction
Effective Use of Figure Reference for Conceptual Clarity (written-content)
The introduction effectively uses Figure 1 to visually contrast CodeAct with text/JSON and preview performance gains. + This visual aid reinforces the conceptual differences and quantitative advantages, making the core arguments more accessible and compelling.
Section: Introduction
Strong Empirical Validation Strategy (written-content)
The paper provides strong empirical validation using two distinct experiments: API-Bank for atomic actions (testing familiarity) and the novel M3ToolEval benchmark for complex multi-tool actions (testing control/data flow). + This well-designed strategy effectively isolates and demonstrates the different hypothesized benefits of CodeAct, lending credibility to the claims.
Section: CodeAct Makes LLMs Better Agents
Novel Benchmark (M3ToolEval) for Complex Tool Use (written-content)
The introduction and evaluation using the M3ToolEval benchmark addresses a gap by focusing on complex multi-tool composition, unlike many existing benchmarks. + This novel benchmark allows for a more rigorous evaluation of CodeAct's capabilities in scenarios where its advantages are expected to be most pronounced.
Section: CodeAct Makes LLMs Better Agents
Quantitative Evidence Supporting CodeAct Superiority (written-content)
Results (Tables 2 & 3) show CodeAct outperforms JSON/Text, achieving up to 20.7% higher success rates and requiring fewer turns on the complex M3ToolEval benchmark for most LLMs. + This quantitative evidence strongly supports the central claim that CodeAct enhances agent performance, particularly for complex tasks.
Section: CodeAct Makes LLMs Better Agents
Demonstrates Practical Benefits (Libraries, Self-Debugging) (written-content)
Section 2.4 demonstrates CodeAct's ability to leverage existing software libraries (e.g., Pandas, Scikit-Learn) and facilitate self-debugging via error messages (Figure 3). + This showcases practical benefits beyond benchmark scores, highlighting CodeAct's potential for building more autonomous and versatile agents capable of complex workflows.
Section: CodeAct Makes LLMs Better Agents
Elaborate on Open/Closed-Source Gap and Link to Motivation (written-content)
The significant performance gap observed between open- and closed-source models on M3ToolEval using CodeAct is noted, but the potential reasons and link to the motivation for CodeActInstruct are not fully elaborated in the main text. + Expanding on this gap and explicitly connecting it to the need for fine-tuning (Section 3) would create a stronger narrative bridge and better justify the subsequent work.
Section: CodeAct Makes LLMs Better Agents
Detailed and Systematic Dataset Construction (CodeActInstruct) (written-content)
The methodology for constructing the CodeActInstruct dataset (7k trajectories) is detailed, including source selection, repurposing, trajectory generation, and filtering for self-improvement patterns. + This systematic approach enhances the dataset's quality and relevance for training agents capable of learning from interaction and self-correction.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Comprehensive Evaluation of Fine-Tuned Agent (CodeActAgent) (written-content)
The evaluation of CodeActAgent is comprehensive, assessing performance on CodeAct tasks (in/out-of-domain), generalization to text actions, and standard LLM benchmarks. + This holistic evaluation provides a robust assessment of the fine-tuned models' capabilities and the trade-offs involved.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Successful Integration of Diverse Training Data (written-content)
The successful integration of CodeActInstruct with general conversation data improves agent performance without significantly harming general capabilities (Table 5, Table A.8). + This demonstrates a practical and effective fine-tuning strategy for developing specialized agents that retain broad utility.
Section: Empowering Open-source LLM Agent to be Better at CodeAct
Clear Contextualization and Problem Framing (written-content)
The Related Work section effectively situates CodeAct within the standard LLM agent architecture and clearly frames the problem as standardizing the action space. + This provides clear context and focus for the paper's contributions within the broader field.
Section: Related Work
Effective Differentiation from Related Concepts (written-content)
The section differentiates CodeAct from general code generation and concurrent work (TaskWeaver) by referencing detailed comparisons in appendices. + This appropriately acknowledges related research while maintaining focus in the main text and clarifying the unique aspects of CodeAct.
Section: Related Work
Concise Summary of Contributions and Core Advantage (written-content)
The conclusion concisely summarizes the main contributions (CodeAct framework, CodeActInstruct dataset, CodeActAgent model) and reiterates the core advantage over text/JSON actions. + This provides a clear and effective wrap-up of the paper's key takeaways.
Section: Conclusions
Highlights Key Agent Capabilities (written-content)
The conclusion highlights key capabilities of CodeActAgent, such as Python integration, complex task execution, library use, and self-debugging. + This effectively emphasizes the practical potential and advanced features enabled by the proposed approach.
Section: Conclusions

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Comparison between CodeAct and Text / JSON as action. (top)...
Full Caption

Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).

Figure/Table Image (Page 2)
Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).
First Reference in Text
However, both methods typically suffer from constrained scope of action spaces (actions are usually tailored for specific tasks) and restricted flexibility (e.g., tool uses in Fig. 1 top left).
Description
  • Illustrative Comparison of Action Methods: This part of the figure presents a side-by-side comparison of how a Large Language Model (LLM) agent tackles a specific task: finding the most cost-effective country (USA, Japan, Germany, India) to buy a smartphone ('CodeAct 1'). LLM agents are AI systems designed to understand instructions and perform actions, like using software tools (APIs). The left side shows the agent using traditional Text or JSON formats to call available tools (like 'lookup_rates', 'lookup_phone_price') one by one. Each call requires a separate interaction with the 'Environment'. The right side shows the same task performed using 'CodeAct', where the agent generates a single block of Python code. This code uses loops ('for country in countries:') to iterate through the countries, calls the necessary tools as functions within the code, stores results in variables ('final_prices'), and uses built-in Python functions ('min') to find the minimum price, achieving the result in fewer overall interactions.
  • Efficiency Claim (Fewer Actions): The example highlights that the CodeAct method can achieve the task objective with fewer interactions compared to the Text/JSON method, which requires multiple back-and-forth steps for each country being checked.
  • Code Features Demonstrated (Control Flow, Data Flow, Library Use): The CodeAct example demonstrates the use of programming concepts like variables to store intermediate results (e.g., 'final_prices'), loops ('for') for repetitive operations across different countries, and leveraging existing Python library functions ('min'), contrasting with the single-tool-call limitation often seen in Text/JSON approaches. A hedged reconstruction of this action appears after this list.
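To ground the description above, here is a hedged, runnable reconstruction of the CodeAct action from the figure; the tool signatures and the stub return values are assumptions for illustration.

```python
# Illustrative reconstruction of the Figure 1 (top right) CodeAct action.
# The tool signatures and the stub values below are assumptions.

def lookup_rates(country: str):
    """Stub: returns (exchange_rate_to_USD, sales_tax_rate)."""
    return {"USA": (1.0, 0.08), "Japan": (0.0067, 0.10),
            "Germany": (1.07, 0.19), "India": (0.012, 0.18)}[country]

def lookup_phone_price(model: str, country: str) -> float:
    """Stub: returns the local list price of `model` in `country`."""
    return {"USA": 699, "Japan": 98000, "Germany": 659, "India": 55000}[country]

countries = ["USA", "Japan", "Germany", "India"]
final_prices = {}
for country in countries:                        # control flow: loop over candidates
    rate, tax = lookup_rates(country)
    price = lookup_phone_price("CodeAct 1", country)
    final_prices[country] = price * rate * (1 + tax)   # data flow via variables

best = min(final_prices, key=final_prices.get)   # re-use of Python's built-in min
print(f"Cheapest option: {best} (${final_prices[best]:.2f})")
```

Packing the whole comparison into one action is what lets CodeAct return a result after a single environment round-trip, versus one tool call per turn for Text/JSON.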
Scientific Validity
  • Representativeness of Example: The example serves as a conceptual illustration. Its representativeness of typical or complex scenarios where CodeAct's advantages become significant is not established by this single instance alone. The chosen task might be specifically selected to favor the CodeAct approach.
  • Accuracy of Depicted Mechanisms: The illustration accurately depicts how Python code can encapsulate multiple tool calls, control flow, and data manipulation, contrasting with the typical one-call-per-interaction pattern of simple Text/JSON tool use.
  • Support for Efficiency Claim: The claim 'Fewer Actions Required!' is visually supported by the reduced number of agent interaction blocks shown on the right compared to the left, assuming the omitted steps on the left are numerous.
Communication
  • Visual layout and annotations: The side-by-side layout effectively contrasts the Text/JSON approach with the CodeAct approach for the illustrative task. The annotations highlighting 'Fewer Actions Required!' and specific code advantages (e.g., control flow, library re-use) aid comprehension.
  • Clarity of interaction flow: The step-by-step interaction flow for both methods is generally clear, showing the action taken by the agent and the environment's response.
  • Conceptual clarity: While illustrative, the example clearly conveys the core concept of using executable code versus structured text/JSON for agent actions.
  • Omission of steps: The omission of intermediate steps for brevity (indicated by '[...]') is necessary but slightly hinders a full understanding of the interaction count difference without referring to the text.

CodeAct Makes LLMs Better Agents

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: General agent multi-turn interaction framework that describes the...
Full Caption

Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).

Figure/Table Image (Page 4)
Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).
First Reference in Text
CodeActInstruct is motivated by a general agent framework consisting of agent, user, and environments (Fig. 2) and focuses on agent-environment interactions with the computer (information seeking, software package use, external memory) and the physical world (robot planning).
Description
  • Comparison Setup: This part of the figure displays quantitative results comparing three different 'action modes' – ways for Large Language Models (LLMs) to perform actions – namely CodeAct (using Python code), JSON (using a structured data format), and Text (using plain text commands). The comparison is performed on a benchmark dataset called M³ToolEval, which presumably involves tasks requiring tool use.
  • Metrics and Models Compared: Two metrics are shown in separate bar charts: 'Success Rate (%)' (higher is better), indicating how often the LLM successfully completed the task, and 'Average Number of Interaction Turns' (lower is better), showing how many steps were needed on average. Results are presented for several different LLMs, including various versions of GPT (e.g., gpt-4-1106-preview, gpt-3.5-turbo-0613), Claude-2, Gemini-pro, text-davinci-003, and Llama-2-70b-chat-hf.
  • Success Rate Results Trend: For most models shown, particularly the more capable ones like gpt-4 versions, CodeAct (represented by the darkest bar) generally achieves the highest success rate compared to JSON and Text actions. For instance, gpt-4-1106-preview shows a success rate noticeably above 70% with CodeAct, while its success rate with JSON and Text is closer to 50%.
  • Interaction Turns Results Trend: Correspondingly, for the higher-performing models, CodeAct also tends to require fewer interaction turns. For gpt-4-1106-preview, the average turns are below 6 for CodeAct, whereas they are closer to 8 for JSON and Text.
  • Trend with Model Capability: The charts visually suggest that the performance gap between CodeAct and the other methods widens for more powerful LLMs (those listed higher on the y-axis), implying CodeAct's benefits might be more pronounced with more capable base models.
Scientific Validity
  • Benchmark Dependency: The figure presents results from experiments on the M³ToolEval benchmark. The validity depends on the quality, complexity, and representativeness of this benchmark, which is described as newly curated in the text (§2.3). Without detailed information on the benchmark tasks, it's difficult to fully assess if the observed performance differences generalize.
  • Model Coverage: The comparison across 17 LLMs (though only 8 are shown clearly in the snippet) provides reasonable breadth, covering both open-source and proprietary models.
  • Appropriateness of Metrics: The metrics (Success Rate, Average Turns) are standard and appropriate for evaluating agent task performance and efficiency.
  • Consistency with Claims: The figure visually supports the paper's claim that CodeAct outperforms alternatives, particularly for more capable LLMs, on this specific benchmark. The trend of increasing advantage with model scale appears consistent across the presented models.
  • Potential Confounding Factors: Potential confounding factors, such as differences in prompt engineering for each action mode or the specific implementation details, are not detailed in the figure caption itself but are crucial for the validity of the comparison.
Communication
  • Chart type clarity: The horizontal bar charts are a standard and clear way to compare success rates and interaction turns across different models and action modes.
  • Use of color-coding: Color-coding the bars based on the action mode (Code, JSON, Text) allows for easy visual comparison of the different approaches for each LLM.
  • Axis and model labeling: Labeling the axes ('Success Rate (%)', 'Average Number of Interaction Turns') and listing the specific LLM models clearly identifies the metrics and subjects of comparison.
  • Separation of metrics: The separation into two charts (Success Rate and Interaction Turns) allows for focused comparison on each metric.
  • Axis scaling: The range of the x-axis for Success Rate (0-70%) might visually exaggerate differences for lower-performing models. Similarly, the Interaction Turns axis (5-10) focuses on a specific range.
  • Interpretation dependency: While the figure presents the data, interpreting the significance of the differences (e.g., whether a 5% increase in success rate is statistically significant) requires referring to the main text or statistical analysis not present in the figure itself.
Figure 3: Example multi-turn interaction with Python packages using...
Full Caption

Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.

Figure/Table Image (Page 6)
Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.
First Reference in Text
As shown in Fig. 3, CodeActAgent, designed for seamless integration with Python, can carry out sophisticated tasks (e.g., model training, data visualization) using existing Python packages.
Description
  • Interaction Overview: This figure demonstrates a conversational interaction between a human user and an AI agent called CodeActAgent (powered by the Mistral-7b language model). The goal is to perform a data science task involving an auto MPG (Miles Per Gallon) dataset.
  • Initial Task Request: The interaction spans multiple turns. The user initially asks the agent to download the dataset, preprocess it (check for missing values), split it into training and testing sets, and train a regression model (a statistical technique to predict MPG based on other car features).
  • Use of Python Libraries: The agent uses specific Python libraries: `pandas` for data loading and manipulation (reading a CSV file from a URL), `numpy` for numerical operations, and `scikit-learn` for machine learning tasks like splitting data (`train_test_split`) and implementing the `LinearRegression` model.
  • Self-Debugging Example (ValueError): The figure highlights the agent's self-debugging capability. Initially, the agent's code fails due to unexpected characters ('?') in the data, causing a `ValueError`. The environment provides a traceback (an error report). The agent analyzes this feedback and generates corrected code to handle the issue (replacing '?' with NaN - 'Not a Number' - and dropping rows with missing values). A hedged sketch of this cleaning-and-training workflow follows this list.
  • Model Training, Evaluation, and Follow-up: After successfully training the model, the agent reports evaluation metrics: Mean Squared Error (MSE), a measure of prediction error (10.71 shown), and R^2 score, indicating how well the model fits the data (0.79 shown, closer to 1 is better). It also handles follow-up requests, like calculating these metrics for the training set.
  • Data Visualization and Further Debugging: The user then asks for a visualization of the model's coefficients (values indicating the importance of each input feature). The agent uses the `matplotlib` library to create a bar chart. Further self-debugging occurs when the visualization code initially fails (e.g., `AttributeError`, incorrect function arguments for rotating labels). The agent iteratively corrects its code based on error messages over several (partially omitted) turns until the visualization is successfully generated.
  • Zero-Shot Capability Demonstration: A key aspect noted in the caption is that this interaction happens without 'in-context demonstrations', meaning the agent wasn't given prior examples of how to perform this specific task; it relies on its pre-trained knowledge.
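The data-science portion of this interaction can be sketched as below; this is not the agent's actual code, and a toy inline dataset (with a '?' entry standing in for the malformed value) replaces the downloaded auto MPG CSV so the snippet runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Tiny stand-in for the auto MPG data; the real interaction downloads a CSV.
# The '?' entry reproduces the malformed value that triggered the ValueError.
df = pd.DataFrame({
    "horsepower": ["130", "165", "?", "150", "140", "198", "220", "215"],
    "weight":     [3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312],
    "mpg":        [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0],
})

# Self-debug step narrated in the figure: replace '?' with NaN, then drop rows.
df = df.replace("?", np.nan).dropna()
df["horsepower"] = df["horsepower"].astype(float)

X, y = df[["horsepower", "weight"]], df["mpg"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```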
Scientific Validity
  • Demonstration of Claimed Capabilities: The figure provides a concrete example supporting the claim that CodeActAgent can interact with Python packages (pandas, scikit-learn, matplotlib) to perform complex, multi-step tasks.
  • Demonstration of Self-Debugging: The example effectively showcases the multi-turn interaction and, crucially, the agent's ability to use automated feedback (error messages) for self-debugging, a key feature highlighted in the paper.
  • Representativeness: As a single example, it may not be fully representative of the agent's general performance across all possible tasks or error types. The success shown could be specific to this particular scenario or model version (Mistral-7b).
  • Evidence for Zero-Shot Learning: The successful execution without in-context examples provides evidence for the agent's zero-shot capabilities in this domain, leveraging its underlying training.
  • Task Complexity: The complexity of the task (data loading, cleaning, modeling, evaluation, visualization, iterative debugging) lends credibility to the demonstration of sophisticated task handling.
Communication
  • Clear demarcation of interaction roles: The layout clearly distinguishes between the User's requests, the CodeActAgent's thoughts and code actions, and the Environment's responses (including code output and error messages).
  • Effective use of annotations: Annotations highlighting specific capabilities (e.g., 'Use Pandas Library...', 'Self-Debug from Automated Feedback', 'Use Matplotlib Library...') effectively guide the reader's attention to the key demonstrated features.
  • Illustration of multi-turn interaction: The multi-turn dialogue format successfully illustrates the iterative nature of the interaction, including follow-up questions and error correction cycles.
  • Impact of omissions: While necessary for brevity, the omission of some messages and code snippets (indicated by '[...omitted for space...]') slightly fragments the flow and requires trust or external viewing (via the provided link) for a complete picture.
  • Provision of link to full interaction: Including a link to the full interaction is a good practice for transparency and allows interested readers to examine the complete sequence.
  • Showcasing complex workflow: The figure effectively showcases a complex workflow involving data processing, model training, evaluation, and visualization, demonstrating the agent's ability to handle sophisticated tasks.
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
Figure/Table Image (Page 3)
Table 1: The benefit of CodeAct compared to using Text/JSON for LLM action.
First Reference in Text
In this section, we first describe CodeAct framework (§2.1) and provide empirical evidence that supports the choice of CodeAct.
Description
  • Comparison Overview: This table presents a qualitative comparison between two methods for defining actions taken by Large Language Model (LLM) agents: CodeAct (using executable Python code) and traditional Text/JSON formats. LLM agents are AI systems that can perform tasks by interacting with environments or tools.
  • Data Availability and Complex Operations: The comparison is structured across four key aspects. For 'Availability of Data', it states CodeAct benefits from the large amount of code data already available for 'pre-training' LLMs (the initial phase where models learn from vast datasets), whereas Text/JSON requires specific data curation. For 'Complex Operation', it claims CodeAct natively supports 'control and data flow' (like loops or conditional statements, e.g., 'if-then'), while Text/JSON requires careful engineering for similar complexity. 'Control flow' refers to the order in which instructions are executed, while 'data flow' refers to how data is passed between operations.
  • Tool Availability and Automated Feedback: Regarding 'Availability of Tools', the table asserts that CodeAct allows direct use of existing 'software packages' (libraries of pre-written code, like Python's extensive collection found on PyPI), while Text/JSON often requires human effort to create or adapt tools. For 'Automated Feedback', it highlights that CodeAct can leverage built-in programming language feedback mechanisms like 'traceback' (an error report generated when code fails), which are common in software development, whereas obtaining feedback for Text/JSON actions might require more manual setup.
  • Summary Indicators: The table uses checkmarks (✓) to indicate perceived advantages for CodeAct in all four categories and crosses (X) to indicate perceived limitations for Text/JSON in the same categories.
Scientific Validity
  • Qualitative Claims vs. Evidence: The table presents qualitative arguments and claims about the benefits of CodeAct. While these claims are plausible and align with common understanding of programming vs. structured text, the table itself does not provide quantitative evidence; it serves to outline the hypothesized advantages that are presumably tested later in the paper.
  • Data Availability Claim: The claim regarding data availability for pre-training is generally true; LLMs are trained on vast amounts of text and code from the internet. However, the extent to which this pre-training directly translates to effective agent action generation in the CodeAct format versus simpler formats requires empirical validation.
  • Complex Operations Claim: The comparison regarding complex operations (control/data flow) accurately reflects the inherent capabilities of programming languages versus typical JSON/Text API call structures. Code inherently supports loops, conditionals, and variable manipulation.
  • Tool Availability Claim: The claim about tool availability accurately points to the vast ecosystem of existing software libraries accessible via code, contrasting with the often bespoke nature of tools defined for Text/JSON agents.
  • Automated Feedback Claim: The point about automated feedback (e.g., tracebacks) is valid, as programming environments provide rich debugging information. Integrating such feedback effectively into an LLM agent's reasoning loop is a key aspect of the CodeAct proposal.
  • Potential Oversimplification of Text/JSON: The comparison might oversimplify the Text/JSON approach. While basic implementations might be limited, more sophisticated frameworks exist that attempt to add complexity (e.g., orchestration layers) to Text/JSON agents, although perhaps less natively than code.
Communication
  • Clear comparison format: The table uses a clear two-column comparison format, making it easy to contrast CodeAct and Text/JSON across the specified criteria.
  • Effective use of symbols: The use of checkmarks ("✓") and crosses ("X") provides an immediate visual summary of the purported advantages and disadvantages of each approach.
  • Relevant comparison criteria: The criteria listed (Availability of Data, Complex Operation, Availability of Tools, Automated Feedback) are relevant dimensions for comparing agent action frameworks.
  • Concise explanations: The brief descriptions accompanying each checkmark/cross concisely explain the reasoning behind the assessment (e.g., 'Large quantity of code available for pre-training', 'Requires human effort to curate tools').
  • Use of footnotes for clarification: Footnotes provide necessary context and sources for claims, such as links related to Python packages and error handling.
  • Summarization effectiveness: The table effectively summarizes the core arguments for preferring CodeAct, serving as a useful high-level overview before diving into detailed experimental results.
Table 2: Atomic API call correctness on API-Bank. The best performance is...
Full Caption

Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.

Figure/Table Image (Page 5)
Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.
First Reference in Text
We present results in Tab. 2.
Description
  • Experiment Goal and Context: This table presents results from an experiment measuring the 'correctness' of different Large Language Models (LLMs) when making 'atomic API calls'. An 'atomic API call' refers to a single, indivisible request made to a software tool or service (an Application Programming Interface or API). 'Correctness' is judged by whether executing the generated API call produces outputs that match the benchmark's ground truth (as noted under Scientific Validity below). The experiment uses the 'API-Bank' benchmark, a dataset designed to test how well LLMs can use tools.
  • Action Formats Compared: The table compares three different formats ('Format of Action') the LLMs were asked to use for making these API calls: 'CodeAct' (generating a Python function call), 'JSON' (generating a structured data object), and 'Text' (generating a plain text representation of the call).
  • Models Tested and Metric: Results are shown as percentages (Correctness %, higher is better) for a variety of LLMs, categorized into 'Open-source LLMs' (models whose underlying code is publicly available, like Llama-2 and Mistral) and 'Closed-source LLMs' (proprietary models like GPT-4, Claude-2, Gemini-pro).
  • Results for Open-Source Models: For open-source models, performance varies. For example, Llama-2-70b-chat-hf achieves 35.6% correctness with CodeAct and 37.6% with Text, but only 14.3% with JSON. Mistral-7B-Instruct-v0.1 shows low scores across all formats (2.5%, 2.3%, 3.0%). CodeAct and Text often perform better than JSON for these models.
  • Results for Closed-Source Models: For closed-source models, correctness scores are generally much higher. For instance, gpt-4-1106-preview achieves 76.7% with CodeAct, 82.7% with JSON, and 73.4% with Text. Here, JSON often performs comparably or sometimes better than CodeAct, unlike the trend in open-source models. gpt-4-0613 shows 75.4% for CodeAct and 82.0% for JSON.
  • Overall Format Performance Summary: A summary at the bottom ('Frequency of Best-Performing Format') counts how many times each format achieved the highest score. Overall, CodeAct was best 8 times, JSON 5 times, and Text 4 times across all 17 models listed.
Scientific Validity
  • Standardized Benchmark (API-Bank): The use of the API-Bank benchmark provides a standardized basis for comparison, assuming the benchmark itself is well-designed and relevant for testing atomic tool use.
  • Controlled Experiment Design (Atomic Calls): Focusing on 'atomic' API calls isolates the LLM's ability to formulate a single correct call, specifically ablating the control/data flow advantages tested elsewhere. This provides a controlled comparison of format familiarity/preference.
  • Appropriate Metric (Correctness): The 'correctness' metric, defined in the text as matching ground-truth API outputs after execution, seems appropriate for evaluating functional accuracy in this context.
  • Diversity of Models Tested: The inclusion of a diverse set of 17 LLMs, both open- and closed-source, strengthens the generalizability of the findings.
  • Observed Dichotomy (Open vs. Closed Source): The results suggest a potential difference in how open-source vs. closed-source models handle different action formats, particularly JSON. This might reflect differences in pre-training data or subsequent fine-tuning strategies employed by closed-source model providers, a valid point for investigation.
  • Consistency with Field Observations: The relatively lower performance of open-source models compared to closed-source models on this task is consistent with general observations in the field.
Communication
  • Table structure and clarity: The table structure is clear, grouping models into open-source and closed-source categories and presenting correctness scores for each action format (CodeAct, JSON, Text) side-by-side.
  • Use of formatting (bold/underline): Using bolding for the best performance and underlining for the second-best within each row (model) effectively highlights the top-performing action formats for each LLM.
  • Clarity of headers: Column headers ('Format of Action', 'Correctness (%,↑)', 'CodeAct', 'JSON', 'Text') are unambiguous.
  • Inclusion of summary statistics: The inclusion of a summary row ('Frequency of Best-Performing Format ↑') at the bottom provides a useful high-level takeaway regarding which format most often yields the best results across the tested models.
  • Model specificity: Listing a wide range of specific LLMs (e.g., CodeLlama variants, Llama-2 variants, Mistral, Claude, GPT variants, Gemini) allows for detailed comparison.
Table 3: Success rates (higher the better) and average turns required per...
Full Caption

Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.

Figure/Table Image (Page 5)
Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.
First Reference in Text
We include full results in Tab. 3 and a subset of results for visualization in Fig. 1.
Description
  • Benchmark and Purpose: This table evaluates various Large Language Models (LLMs) on a benchmark called 'M³ToolEval'. This benchmark is designed to test the ability of LLMs to solve complex tasks that require using multiple software tools over several steps or 'turns' of interaction.
  • Metrics Measured: Two primary metrics are reported: 'Success Rate (%)', which measures how often the LLM successfully completed the tasks (higher values are better), and 'Avg. Turns', the average number of interaction steps required per task (lower values are better).
  • Action Formats Compared: Similar to Table 2, the comparison focuses on three different 'action formats' the LLMs used: 'CodeAct' (Python code), 'JSON', and 'Text'. The performance for each format is shown for each LLM.
  • Models Tested: The table includes results for numerous LLMs, separated into 'Open-source LLMs' (like Llama-2, Mistral) and 'Closed-source LLMs' (like GPT-4, Claude-2, Gemini-pro).
  • Performance Highlights (Success Rate): Performance varies significantly. Open-source models generally show low success rates (e.g., Llama-2-70b-chat-hf achieves 11.0% success with CodeAct, 3.7% with JSON/Text). Closed-source models perform much better; for example, 'gpt-4-1106-preview' achieves a 74.4% success rate with CodeAct, compared to 52.4% with JSON and 53.7% with Text.
  • Performance Highlights (Average Turns): In terms of efficiency, 'gpt-4-1106-preview' using CodeAct required the fewest average turns (5.5), whereas JSON and Text required more turns (7.6 and 7.7, respectively). This trend of CodeAct requiring fewer turns is observed for many, but not all, models.
  • Overall Format Performance Summary: The summary rows indicate that CodeAct was the best-performing format for success rate in 12 out of 17 models and required the fewest turns in 12 out of 17 models, suggesting a strong advantage on this benchmark compared to JSON (best success rate 5 times, fewest turns 3 times) and Text (best success rate 4 times, fewest turns 2 times).
Scientific Validity
  • Benchmark Relevance (M³ToolEval): The M³ToolEval benchmark, described in the text (§2.3) as requiring complex coordination and composition of multiple tools in multi-turn interactions, is specifically designed to test the scenarios where CodeAct is hypothesized to excel (leveraging control and data flow). Evaluating on this benchmark directly tests the central claims.
  • Appropriateness of Metrics: The use of success rate and average interaction turns are standard and appropriate metrics for evaluating task completion and efficiency in agent benchmarks.
  • Model Diversity: Testing across a wide range of LLMs (17 total, including open- and closed-source) enhances the robustness and potential generalizability of the findings regarding the effectiveness of CodeAct.
  • Support for Claims (Benefit in Complexity): The results presented strongly support the paper's claim that CodeAct's advantages (seen modestly in atomic calls, Table 2) become more prominent in complex tasks. The absolute improvements in success rate (e.g., ~20% for gpt-4-1106-preview) and reductions in turns are substantial.
  • Zero-Shot Evaluation Setting: The text mentions the evaluation is zero-shot (§2.3), meaning models were not given specific examples within the prompt. This tests the models' inherent ability to use the action formats without task-specific prompt engineering, adding to the validity.
  • Observed Performance Gaps: The large performance gap between open-source and top closed-source models highlights limitations in current open-source models' abilities on complex agent tasks, even with the potentially advantageous CodeAct format.
Communication
  • Clear presentation of metrics: The table effectively presents two key metrics (success rate, average turns) side-by-side for easy comparison across different action formats and models.
  • Logical grouping of models: Grouping models into open-source and closed-source categories aids in comparing trends between these two types of models.
  • Effective use of formatting: The use of bolding and underlining to highlight the best and second-best results per model is an effective visual aid for quickly identifying top performers.
  • Clarity of headers: Headers are clear and indicate the directionality of metrics ('%,↑' for success rate, '↓' for average turns).
  • Inclusion of summary statistics: Including summary rows ('Frequency of Best-performing Format ↑') provides a concise takeaway regarding the overall performance dominance of each action format across the evaluated models for both metrics.

Empowering Open-source LLM Agent to be Better at CodeAct

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 4: Statistics of our training mixture and comparison with prior work....
Full Caption

Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.

Figure/Table Image (Page 7)
Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.
First Reference in Text
The statistics of the resulting dataset CodeActInstruct are shown in Tab. 4.
Description
  • Purpose and Data Categories: This table details the composition and size of the data mixture used for training the CodeActAgent models in the study. It separates the data into three main categories: datasets from 'Prior Work' for comparison, the authors' newly compiled 'CodeActInstruct' dataset, and 'General Conversation' data.
  • Statistics Provided and Token Explanation: For each dataset listed, the table provides three statistics: '# of Data Instances' (number of individual examples or dialogues), '# of Total Tokens' (the total number of basic text units, like words or sub-words, as defined by the Llama-2 tokenizer), and 'Avg. Tokens Per Instance' (the average length of an example). 'Tokens' are the fundamental units that language models process; different models break text down into tokens differently, hence specifying the 'Llama-2 tokenizer' is important. A hedged sketch of how such token statistics could be reproduced follows this list.
  • CodeActInstruct Dataset Statistics: The 'CodeActInstruct' dataset, created by the authors, is shown to consist of 7,139 instances totaling approximately 10.6 million tokens, with an average length of 1482 tokens per instance. It's compiled from several sources targeting different agent capabilities: Information Seeking (HotpotQA - 1,664 instances), Software Package/Tool Usage (MATH - 1,732 instances, APPS - 647 instances), External Memory (WikiTableQuestion - 1,065 instances), and Robot Planning (ALFWorld - 2,031 instances).
  • General Conversation Data Statistics: The 'General Conversation' data includes sources like OpenOrca (50,000 instances, ~14M tokens), ShareGPT (two sources totaling ~14.6k instances and ~36M tokens), and CapyBara (4,647 instances, ~5M tokens). This part of the mixture totals 69,230 instances and over 55 million tokens, generally with shorter average lengths (e.g., 280 for OpenOrca, 797 overall) compared to CodeActInstruct.
  • Comparison with Prior Work Datasets: For context, statistics for two prior datasets, FireAct and AgentInstruct, are also listed. FireAct has 2,063 instances and ~0.5M tokens, while AgentInstruct has 1,866 instances and ~2.5M tokens. This comparison highlights that the authors' CodeActInstruct dataset contains significantly more instances and tokens than these specific prior works.
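As a hedged sketch, token statistics like those described above could be reproduced roughly as follows with the Hugging Face `transformers` library; the tokenizer checkpoint name and the way instances are represented are assumptions (and the official Llama-2 checkpoint is gated, requiring access approval).

```python
from transformers import AutoTokenizer

# Assumption: the gated official Llama-2 checkpoint is used for tokenization.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def token_stats(instances: list[list[str]]) -> tuple[int, float]:
    """Each instance is a multi-turn trajectory given as a list of message strings.
    Returns (total tokens, average tokens per instance)."""
    per_instance = [
        sum(len(tokenizer.encode(m, add_special_tokens=False)) for m in inst)
        for inst in instances
    ]
    total = sum(per_instance)
    return total, total / max(len(per_instance), 1)

# Toy usage with two fake trajectories:
total, avg = token_stats([
    ["Find the MPG dataset.", "import pandas as pd"],
    ["Solve x + 1 = 2.", "x = 1"],
])
print(total, round(avg, 1))
```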
Scientific Validity
  • Transparency of Data Composition: Providing a detailed breakdown of the training data mixture, including sources, instance counts, and token counts, is crucial for transparency and reproducibility.
  • Dataset Size and Diversity (CodeActInstruct): The CodeActInstruct dataset appears reasonably large (7k instances, 10M tokens) and diverse, covering multiple domains relevant to agent capabilities (search, tool use, memory, planning). This diversity supports the goal of training a versatile agent.
  • Data Mixture Strategy: The inclusion of a much larger general conversation dataset (69k instances, 55M tokens) alongside the specialized CodeActInstruct data reflects a common strategy to maintain the model's general language understanding and conversational abilities while fine-tuning for specific tasks.
  • Tokenizer Specification: Specifying the tokenizer (Llama-2) used for calculating token statistics is essential, as token counts can vary significantly between different tokenizers.
  • Contextualization via Comparison: Comparing the scale of CodeActInstruct to prior works (FireAct, AgentInstruct) helps position the contribution but relies on the specific choice of prior works for comparison.
  • Token Length Insights: The average token lengths suggest that CodeActInstruct instances (avg. 1482 tokens) are substantially longer than the general conversation instances (avg. 797 tokens), likely reflecting the complexity and multi-turn nature of the agent tasks.
Communication
  • Clear categorization of data: The table clearly categorizes the training data into 'Prior Work', 'CodeActInstruct (Ours)', and 'General Conversation', making it easy to understand the components of the training mixture.
  • Detailed breakdown of data sources: Breaking down 'CodeActInstruct' and 'General Conversation' into their constituent datasets (e.g., HotpotQA, MATH, OpenOrca, ShareGPT) provides transparency about the data sources.
  • Inclusion of relevant statistics: Including statistics like the number of instances, total tokens, and average tokens per instance provides a quantitative overview of the dataset sizes.
  • Comparison with prior work: Comparing the size of the authors' 'CodeActInstruct' dataset with 'Prior Work' (FireAct, AgentInstruct) helps contextualize its scale.
  • Specification of tokenizer: Specifying the tokenizer used (Llama-2) is crucial information for reproducibility and understanding the token counts.
  • Clarity of headers: The column headers are clear and accurately describe the data presented.
Table 5: Evaluation results for CodeActAgent. The best results among all...
Full Caption

Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.

Figure/Table Image (Page 8)
Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
First Reference in Text
As shown in Tab. 5, CodeActAgent (both variants) perform better than all evaluated open-source LLMs on both the in- and out-of-domain subsets of MINT.
Description
  • Evaluation Overview: This table presents a comprehensive evaluation of the authors' 'CodeActAgent' models (fine-tuned versions of Llama-2 7B and Mistral 7B) against various baseline models and prior work. The evaluation covers two main categories: 'Agent Tasks' and 'Generic Tasks'. '7B' refers to the model size, indicating approximately 7 billion parameters, a measure of model complexity.
  • Agent Task Benchmarks: 'Agent Tasks' assess the models' ability to perform actions. This is split into tasks using 'Code as Action' (evaluated on MINT and M³ToolEval benchmarks) and 'Text as Action' (evaluated on Miniwob++ and SciWorld). MINT (Multi-turn Interaction with Tools) results are further divided into 'ID' (In-Domain, tasks similar to the training data) and 'OD' (Out-of-Domain, tasks different from training data). M³ToolEval is a benchmark for complex multi-tool use. Miniwob++ involves controlling web browser elements, and SciWorld is a text-based environment simulating science experiments.
  • Generic Task Benchmarks: 'Generic Tasks' assess broader capabilities. MMLU measures knowledge across many subjects. HumanEval tests code generation ability. GSM8K assesses mathematical reasoning. MT-Bench evaluates instruction following and conversational ability.
  • Example Results for CodeActAgent (Mistral): Results are presented as scores (likely percentages or benchmark-specific metrics). For example, CodeActAgent (Mistral) scores 57.4 on MINT ID (Code), 32.4 on MINT OD (Code), 12.2 on M³ToolEval, 46.2 on Miniwob++, 15.9 on SciWorld, 59.1 on MMLU, 34.7 on HumanEval, 58.0 on GSM8K, and 8.2 on MT-Bench.
  • Performance Comparisons: The table compares CodeActAgent to baseline models (e.g., Llama2 Chat, Mistral Instruct), prior agent models (FireAct, AgentLM), and powerful closed-source models (gpt-3.5-turbo-0613, gpt-4-0613). CodeActAgent (Mistral) generally outperforms other open-source models on agent tasks (indicated by bolding) and maintains strong performance on generic tasks.
  • Overall Average Performance: An 'Overall Average' score is calculated to summarize performance across diverse tasks. The caption notes it excludes ID tasks for fairness and normalizes the MT-Bench score (likely because its 1-10 scale differs from the percentage-based metrics). CodeActAgent (Mistral) achieves the highest overall average (42.5) among open-source models. A hedged arithmetic check of this average follows this list.
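As a hedged arithmetic check of the caption's description, assume the 1-10 MT-Bench score is rescaled to a 0-100 range and in-domain (ID) results are excluded; averaging the CodeActAgent (Mistral) numbers quoted above then lands very close to the reported 42.5.

```python
# Scores quoted in the description for CodeActAgent (Mistral); ID tasks excluded.
# Assumption: MT-Bench (1-10 scale) is normalized by multiplying by 10.
scores = {
    "MINT OD": 32.4, "M3ToolEval": 12.2, "Miniwob++": 46.2, "SciWorld": 15.9,
    "MMLU": 59.1, "HumanEval": 34.7, "GSM8K": 58.0, "MT-Bench": 8.2 * 10,
}
overall = sum(scores.values()) / len(scores)
print(f"{overall:.1f}")  # ~42.6, close to the 42.5 reported in Table 5
```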
Scientific Validity
  • Breadth and Quality of Benchmarks: The evaluation uses a diverse suite of established benchmarks covering both specialized agent capabilities (MINT, M³ToolEval, Miniwob++, SciWorld) and general LLM abilities (MMLU, HumanEval, GSM8K, MT-Bench), providing a comprehensive assessment.
  • ID/OD Split Evaluation: Distinguishing between In-Domain (ID) and Out-of-Domain (OD) performance on MINT is methodologically sound, as it helps assess the model's ability to generalize beyond its specific training tasks.
  • Inclusion of Baselines and Prior Work: Comparing against relevant baselines (standard instruction-tuned models) and prior work focused on agents (FireAct, AgentLM) provides appropriate context for evaluating the contribution of CodeActAgent.
  • Support for Claims: The results clearly show significant improvements for CodeActAgent, particularly the Mistral variant, over other open-source models on agent-specific tasks, supporting the paper's claims about the effectiveness of the proposed fine-tuning approach.
  • Calculation of Overall Average: The normalization of MT-Bench and exclusion of ID tasks in the overall average calculation is a reasonable approach to create a more balanced summary metric across disparate benchmarks.
  • Contextualization with SOTA: The performance gap between the best open-source model (CodeActAgent Mistral) and the closed-source models (especially GPT-4) remains substantial across most tasks, reflecting the current landscape in LLM capabilities.
  • Empirical Evidence Strength: The table provides strong empirical evidence for the effectiveness of the CodeActInstruct fine-tuning data and methodology for improving open-source LLM agent capabilities.
Communication
  • Table organization: The table is well-structured, clearly separating models, evaluation task categories (Agent Tasks, Generic Tasks), and specific benchmarks.
  • Granularity of results: Grouping agent tasks by action type (Code vs. Text) and distinguishing between In-Domain (ID) and Out-of-Domain (OD) results for MINT provides valuable granularity.
  • Clarity of headers and abbreviations: Headers are mostly clear, although abbreviations like MINT, MMLU, GSM8K, etc., rely on reader familiarity or reference to the text.
  • Use of formatting (bold/underline): The use of bolding and underlining effectively highlights the best and second-best performing open-source models on each task, aiding quick comparison.
  • Overall average and explanation: The inclusion of an 'Overall Average' provides a useful summary, and the caption clarifies how it's calculated (normalization, exclusion of ID tasks), which is good practice.
  • Inclusion of closed-source benchmarks: Comparing open-source models (including the authors' CodeActAgent) with closed-source models (gpt-3.5, gpt-4) provides important context regarding the current state-of-the-art.
Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and...
Full Caption

Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and M³ToolEval.

Figure/Table Image (Page 14)
Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and M³ToolEval.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Context: This table provides concrete examples of how a single intended action – adding an agenda item with specific content ('Meeting with John') and time ('2023-10-26 09:00:00') – would be represented using the three different action formats discussed in the paper: CodeAct, JSON, and Text. These formats are used by Large Language Model (LLM) agents to invoke tools or APIs.
  • CodeAct Example: For the 'CodeAct' format, the action is shown as a Python function call: `AddAgenda(content="Meeting with John", time="2023-10-26 09:00:00")`. This format uses standard programming syntax.
  • JSON Example: For the 'JSON' format, the action is represented as a JSON (JavaScript Object Notation) object. This is a structured data format using key-value pairs: `{"action": "AddAgenda", "content": "Meeting with John", "time": "2023-10-26 09:00:00"}`. Here, the action name and its parameters are explicitly labeled fields within the object.
  • Text Example: For the 'Text' format, the action is shown as a semi-structured text string: `Action: AddAgenda, content: Meeting with John, time: 2023-10-26 09:00:00`. This format uses keywords and delimiters (like commas and colons) but is less rigidly structured than JSON or code. A hedged sketch of how each format would be parsed and dispatched follows this list.
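To make the practical difference between the formats concrete, here is a hedged sketch of how each representation of the same action would reach the tool; the `AddAgenda` implementation and the parsing rules are assumptions.

```python
import json

def AddAgenda(content: str, time: str) -> str:
    """Hypothetical tool backing the 'AddAgenda' action."""
    return f"Added '{content}' at {time}"

# CodeAct: the action is directly executable Python.
exec('print(AddAgenda(content="Meeting with John", time="2023-10-26 09:00:00"))')

# JSON: the action must be parsed, then dispatched to the matching tool.
action = json.loads('{"action": "AddAgenda", "content": "Meeting with John", '
                    '"time": "2023-10-26 09:00:00"}')
tool = {"AddAgenda": AddAgenda}[action.pop("action")]
print(tool(**action))

# Text: the key-value string needs custom parsing before dispatch.
text = "Action: AddAgenda, content: Meeting with John, time: 2023-10-26 09:00:00"
fields = dict(part.split(": ", 1) for part in text.split(", "))
print({"AddAgenda": AddAgenda}[fields.pop("Action")](**fields))
```

Consistent with Table 1's argument, the code action is executable as-is, while the JSON and Text forms require the framework designer to define and maintain a parsing and dispatch layer.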
Scientific Validity
  • Accuracy of Representation: The examples accurately reflect the syntax and structure characteristic of Python function calls (CodeAct), JSON objects, and simple key-value text representations (Text).
  • Illustrative Value: The table serves an illustrative purpose, clarifying the concrete differences between the abstract concepts of 'Code as Action', 'JSON as Action', and 'Text as Action' used throughout the paper's experiments (e.g., in Table 2 and Table 3).
  • Suitability of Example Action: While simple, the chosen action ('AddAgenda') effectively demonstrates how parameters are handled in each format, which is sufficient for illustrating the basic syntactic differences relevant to atomic API calls.
  • Contextual Role: This table does not present experimental results but provides necessary context for understanding the experimental setup and results presented elsewhere.
Communication
  • Clarity of comparison: The table clearly presents the same conceptual action ('AddAgenda') formatted in three distinct ways (CodeAct, JSON, Text), allowing for direct comparison.
  • Effective layout: The side-by-side layout is simple and effective for illustrating the syntactic differences between the action formats.
  • Concrete examples: The examples chosen are concrete and easy to understand, clearly showing how parameters ('content', 'time') are represented in each format.
  • Clear labeling: Labeling each row with the format name ('CodeAct', 'JSON', 'Text') is unambiguous.
Table A.7: Comparison between M³ToolEval and existing tool-use evaluation...
Full Caption

Table A.7: Comparison between M³ToolEval and existing tool-use evaluation benchmark.

Figure/Table Image (Page 14)
Table A.7: Comparison between M³ToolEval and existing tool-use evaluation benchmark.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Benchmarks Compared: This table compares the M³ToolEval benchmark, introduced in this paper, against four existing benchmarks used for evaluating how well Large Language Models (LLMs) can use software tools or APIs. The benchmarks compared are ToolBench (Qin et al., 2023b), APIBench (Patil et al., 2023), API-Bank (Li et al., 2023), and another benchmark also named ToolBench (Xu et al., 2023).
  • Comparison Criteria: The comparison is based on five criteria: 'Requiring multi-turn interaction' (does the task need multiple steps?), 'Multiple tools' (does the task involve using more than one tool?), 'Evaluation' method (how is success measured?), 'No dependency on external API*' (does the benchmark rely on third-party web services?), and 'Supported API Action Format' (what formats can the LLM use to call tools?).
  • Multi-turn and Multi-tool Comparison: According to the table, M³ToolEval requires multi-turn interaction and involves multiple tools, similar to ToolBench (Qin et al.) but unlike the other three benchmarks listed, which are marked with 'X' for these features.
  • Evaluation Method Comparison: For evaluation, M³ToolEval uses 'Answer Match' (checking if the final answer is correct); a minimal sketch of such a check follows this list. ToolBench (Qin et al.) uses an 'LLM Evaluator' (another LLM judges the performance). APIBench uses 'AST Tree Match' (comparing the structure of generated code/calls). API-Bank uses 'API-Call Match'. ToolBench (Xu et al.) uses 'Test Case' evaluation.
  • External API Dependency Comparison: M³ToolEval is marked as having 'No dependency on external API', meaning it doesn't rely on potentially unavailable third-party services, unlike ToolBench (Qin et al.), APIBench, and ToolBench (Xu et al.). API-Bank is also marked as not having this dependency.
  • Supported Action Format Comparison: Regarding supported action formats, M³ToolEval supports 'CodeAct & JSON & Text'. ToolBench (Qin et al.) and API-Bank support 'JSON'. APIBench and ToolBench (Xu et al.) support 'CodeAct'.
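For concreteness, an 'Answer Match' check of the kind attributed to M³ToolEval might look like the sketch below; the normalization rules here are an assumption, not taken from the benchmark's implementation.

```python
# Hedged sketch of an answer-match evaluator: compare the agent's final answer
# to a reference answer after light normalization (assumed rules).
def answer_match(predicted: str, reference: str) -> bool:
    normalize = lambda s: " ".join(s.strip().lower().split())
    return normalize(predicted) == normalize(reference)

print(answer_match("  42 ", "42"))             # True
print(answer_match("Paris", "paris"))          # True
print(answer_match("Paris, France", "Paris"))  # False
```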
Scientific Validity
  • Positioning of M³ToolEval: The table aims to position M³ToolEval within the landscape of existing benchmarks by highlighting its unique combination of features, particularly multi-turn interaction, multiple tools, answer-based evaluation, lack of external dependency, and broad action format support.
  • Relevance of Comparison Criteria: The chosen comparison criteria (multi-turn, multi-tool, evaluation method, external dependency, format support) are relevant dimensions for differentiating benchmark designs and capabilities in the context of LLM tool use.
  • Accuracy of Benchmark Characterization: The characterization of each benchmark against the criteria appears generally consistent with the descriptions in the cited papers, although nuances might exist. For example, the exact nature of 'multi-turn' or 'multiple tools' can vary.
  • Practical Advantage Highlighted (API Dependency): Highlighting the lack of dependency on external APIs is a valid practical advantage, as external services can introduce unreliability or cost.
  • Benchmark Suitability for Study: Supporting multiple action formats (CodeAct, JSON, Text) makes M³ToolEval suitable for the comparative experiments conducted in the paper (e.g., Table 3).
  • Justification for New Benchmark: The table effectively justifies the need for creating M³ToolEval by showing that no single existing benchmark (among those compared) combined all the desired features, particularly the focus on complex, multi-step, multi-tool tasks evaluated by final answer correctness without external dependencies.
Communication
  • Clear comparison format: The table uses a clear columnar format to compare M³ToolEval against four other benchmarks across several key features.
  • Effective use of symbols: The use of checkmarks (✓) and crosses (X) provides a quick visual summary of whether each benchmark possesses a specific feature.
  • Clear feature labels: Row headers clearly define the features being compared (e.g., 'Requiring multi-turn interaction', 'Multiple tools').
  • Clear benchmark identification: Column headers clearly identify the benchmarks being compared, including citations.
  • Highlighting unique features: The table effectively highlights the purported unique combination of features in M³ToolEval (multi-turn, multiple tools, answer matching evaluation, no external API dependency, support for CodeAct/JSON/Text).
  • Informative footnote: The footnote explaining the potential impact of relying on external APIs adds important context to that specific comparison point.
Table A.8: Ablation study results. The best results are bolded, and the...
Full Caption

Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.

Figure/Table Image (Page 14)
Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Ablation Study: This table presents an 'ablation study', a type of experiment where components are systematically removed to understand their contribution. Here, the study investigates the importance of the two main parts of the training data mixture used to create the CodeActAgent models: the specialized 'CodeActInstruct' dataset and the 'general conversations' dataset.
  • Models and Conditions Compared: Two base models are considered: the Llama2-based CodeActAgent and the Mistral-based CodeActAgent. For each, three versions are compared: the full model (trained on both data types), a model trained without the CodeActInstruct data ('w/o CodeAct'), and a model trained without the general conversation data ('w/o general conversations').
  • Evaluation Benchmarks: The models are evaluated on the same suite of 'Agent Tasks' (MINT ID/OD, Miniwob++, SciWorld) and 'Generic LLM Tasks' (MMLU, HumanEval, GSM8K, MTBench) as in Table 5.
  • Results for Llama2-based Ablation: For the Llama2-based agent, removing CodeActInstruct data ('w/o CodeAct') drastically reduces performance on MINT agent tasks (e.g., MINT ID drops from 51.3 to 17.0) but has a smaller impact on generic tasks (e.g., MMLU drops slightly from 50.6 to 49.5). Removing general conversations ('w/o general conversations') significantly hurts generic task performance (e.g., MMLU drops to 46.4, MTBench drops from 7.5 to 4.1) and also negatively impacts agent tasks, though less severely than removing CodeActInstruct data (MINT ID drops to 29.2).
  • Results for Mistral-based Ablation: Similar trends are observed for the Mistral-based agent. Removing CodeActInstruct data ('w/o CodeAct') substantially lowers agent task scores (MINT ID drops from 57.4 to 32.9) while maintaining relatively high generic task scores (MMLU 59.9, HumanEval 33.2). Removing general conversations ('w/o general conversations') severely degrades performance on both generic tasks (MMLU drops to 52.4, MTBench drops from 8.2 to 2.6) and agent tasks (MINT ID drops to 50.5, Miniwob++ drops to 0.0).
  • Overall Average Impact: The 'Overall Average' score reflects these trends, showing significant drops when either data component is removed, but highlighting the particular importance of CodeActInstruct data for agent tasks and general conversation data for generic capabilities and overall robustness (a quick arithmetic view of these drops follows this list).
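To put the quoted drops in relative terms, the back-of-the-envelope sketch below uses only the MINT (ID) numbers restated above; it is illustrative arithmetic, not a recomputation of the paper's results.

```python
# Relative performance drop (%) when a training-data component is removed,
# computed from the MINT (ID) numbers quoted in this description.
def relative_drop(full: float, ablated: float) -> float:
    return 100.0 * (full - ablated) / full

print(f"Llama2 w/o CodeAct:  {relative_drop(51.3, 17.0):.1f}% drop")   # ≈ 66.9%
print(f"Mistral w/o CodeAct: {relative_drop(57.4, 32.9):.1f}% drop")   # ≈ 42.7%
```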
Scientific Validity
  • Methodological Soundness (Ablation): Ablation studies are a standard and valid method for assessing the contribution of different components (in this case, training data subsets) to a model's performance.
  • Clear Component Separation: Ablating the two primary distinct components of the training mixture (specialized agent data vs. general conversation data) allows for a clear analysis of their respective roles.
  • Comprehensive Evaluation Suite: Evaluating on a diverse set of both agent-specific and generic tasks provides strong evidence for the differential impact of the ablated data components on different capabilities.
  • Support for Hypotheses: The results strongly support the hypothesis that the CodeActInstruct data is crucial for imparting the specialized agent skills, while the general conversation data is essential for maintaining broad language understanding and instruction-following capabilities.
  • Consistency Across Models: The consistency of the trends observed across both Llama2 and Mistral base models strengthens the conclusion about the roles of the different data components.
  • Justification for Data Mixture: The study effectively demonstrates that simply training on general conversation data is insufficient for achieving strong agent performance ('w/o CodeAct' results), and training only on agent data significantly degrades general capabilities ('w/o general conversations' results), justifying the use of the data mixture.
Communication
  • Clarity of Ablation Comparison: The table clearly presents the results of the ablation study, comparing the full model against versions trained without specific data components.
  • Grouping by Base Model: Grouping results by base model (Llama2 and Mistral) makes it easy to see the impact of ablations on each.
  • Consistent Structure with Table 5: The structure mirrors Table 5, facilitating comparison of ablated models to the full model and other baselines.
  • Headers and Formatting: Headers and formatting (bolding/underlining) are used effectively, consistent with previous tables.
  • Clarity of Ablation Labels: The labels 'w/o CodeAct' and 'w/o general conversations' clearly indicate the ablated conditions.
Table A.9: CodeActInstruct components and the number of instances for training...
Full Caption

Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.

Figure/Table Image (Page 18)
Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose and Components: This table outlines the different datasets that make up the 'CodeActInstruct' collection, which was used as the basis for generating 'training trajectories'. A training trajectory is essentially a recorded example of an AI agent successfully completing a task, often involving multiple steps and interactions, which can then be used to teach the agent.
  • Dataset Breakdown and Instance Counts: The table categorizes the components by the 'Domain' or 'Capability' they are intended to train. These include: 'Web Search' capability using the HotpotQA dataset (3,000 instances); 'Math Reasoning' using the MATH dataset (5,586 instances); 'Code Generation' using the APPS dataset (4,439 instances); 'Tabular Reasoning' using the WikiTableQuestion dataset (3,000 instances); and 'Embodied Planning' (simulated robot tasks) using the ALFWorld dataset (3,553 instances).
  • Capabilities Targeted by Each Dataset: Each listed dataset corresponds to a specific skill the researchers aimed to develop in their CodeActAgent. For example, HotpotQA involves answering questions that require finding and combining information from multiple sources (simulating web search). MATH and APPS involve solving math problems and programming challenges, respectively, requiring logical reasoning and the use of code/libraries. WikiTableQuestion involves reasoning over data presented in tables. ALFWorld involves planning sequences of actions in a simulated household environment.
  • Focus on Instance Count for Trajectory Generation: The table focuses specifically on the number of 'instances' (individual problems or tasks) drawn from each source dataset that were used for the initial step of generating interaction examples (trajectories). The total number of instances listed sums to 19,578 (a quick tally is shown after this list).
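A quick tally of the listed components confirms the stated total; the parenthetical capability labels below paraphrase the mapping given in this description.

```python
# Per-dataset instance counts used for trajectory generation, as listed above.
instances = {
    "HotpotQA (web search)":        3000,
    "MATH (math reasoning)":        5586,
    "APPS (code generation)":       4439,
    "WikiTableQuestion (tabular)":  3000,
    "ALFWorld (embodied planning)": 3553,
}
assert sum(instances.values()) == 19578
print(sum(instances.values()))  # 19578
```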
Scientific Validity
  • Dataset Diversity and Relevance: The selection of diverse datasets (HotpotQA, MATH, APPS, WikiTableQuestion, ALFWorld) covering various agent capabilities (search, reasoning, coding, planning) provides a strong foundation for generating comprehensive training data for a multi-skilled agent.
  • Use of Established Benchmarks: Using established benchmarks as source data ensures a certain level of task quality and relevance, leveraging prior work in specific capability areas.
  • Scale of Data Selection: The number of instances selected from each dataset seems substantial (thousands per category), suggesting an effort to capture sufficient examples for each capability during the trajectory generation phase.
  • Distinction from Final Training Data: This table describes the input data used for generating trajectories. The scientific validity of the final CodeActInstruct dataset (as detailed in Table 4, which shows ~7k instances after filtering) depends on the quality of the trajectory generation process and the subsequent filtering heuristics applied (described in §G.2), not just the initial selection shown here.
  • Data Selection/Filtering Rationale: The process of down-sampling or selecting instances from the original datasets (e.g., selecting 'hard' instances from HotpotQA, filtering MATH by difficulty, as mentioned in §G.1) before trajectory generation is a reasonable step to focus computational effort on more challenging examples.
Communication
  • Clear Structure: The table clearly lists the components of the CodeActInstruct dataset, categorized by domain/capability.
  • Mapping Datasets to Capabilities: Mapping specific datasets (e.g., HotpotQA, MATH) to capabilities (e.g., Information seeking, Math Reasoning) provides clarity on the intended purpose of each component.
  • Quantitative Information (Instances): Providing the number of instances for each component gives a quantitative sense of the data scale used for generating trajectories, although total token counts (as in Table 4) are omitted here.
  • Clarity of Headers: Headers ('Domain', 'Capability', 'Dataset', '# of Instances') are clear and informative.
  • Inclusion of Citations: Citations for the datasets used are included, allowing readers to find more information about the source data.
