This paper addresses limitations in current Large Language Model (LLM) agents, which typically interact with environments or tools using predefined text or structured JSON formats. These formats often restrict the complexity and flexibility of actions an agent can perform, hindering their ability to solve complex, multi-step problems. The research proposes CodeAct, a novel framework where LLM agents generate executable Python code as their actions. This approach aims to create a unified and more powerful action space by leveraging Python's inherent expressiveness, control flow structures (like loops and conditionals), data handling capabilities, and access to a vast ecosystem of existing software libraries.
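To make the contrast concrete, the sketch below shows the same hypothetical booking step as a single-tool JSON call versus a CodeAct-style action; the tool names, signatures, and data are invented for illustration and are not the paper's actual API.

```python
# Mock tools standing in for whatever the environment exposes (names are illustrative).
def search_flights(origin, dest):
    return [{"id": "F1", "price": 420}, {"id": "F2", "price": 275}]

def book_flight(flight_id):
    return f"Booked {flight_id}"

# A JSON-style action is limited to one rigid tool call per turn, e.g.
#   {"tool": "search_flights", "args": {"origin": "SFO", "dest": "JFK"}}
# The equivalent CodeAct action composes tools, filters data, and branches in one turn:
flights = search_flights(origin="SFO", dest="JFK")
cheap = [f for f in flights if f["price"] < 300]   # data handling
if cheap:                                          # control flow
    print(book_flight(cheap[0]["id"]))             # -> Booked F2
else:
    print("No flight under $300; relax the price filter.")
```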
The effectiveness of CodeAct was evaluated by comparing it against text and JSON action formats across 17 different LLMs on two benchmarks: API-Bank (for simple, single-tool actions) and a newly curated benchmark, M3ToolEval (designed for complex tasks requiring multiple tools and interaction turns). The results demonstrated that CodeAct performs comparably to or better than the alternative formats on simple tasks and significantly outperforms them on complex tasks, achieving up to a 20% absolute increase in success rate and requiring fewer interaction turns. This suggests CodeAct effectively utilizes LLMs' familiarity with code (from pre-training) and excels where complex logic or tool composition is needed.
Recognizing a performance gap between open-source and proprietary LLMs, the researchers developed CodeActInstruct, a dataset containing ~7,000 multi-turn interaction examples using CodeAct, specifically filtered to include instances of self-debugging and improvement based on feedback (like error messages). They used this dataset, combined with general conversation data, to fine-tune open-source models (Llama-2 7B, Mistral 7B), creating CodeActAgent. Evaluation showed that CodeActAgent significantly improved performance on agent tasks compared to baseline open-source models, demonstrated generalization to text-based actions, and maintained strong performance on general LLM benchmarks.
The main conclusion is that using executable Python code (CodeAct) offers a substantial advantage over text/JSON for LLM agent actions, particularly in complex scenarios. The CodeAct framework, combined with targeted instruction tuning using datasets like CodeActInstruct, enables the development of more capable and autonomous agents that can leverage existing software ecosystems and even debug their own actions. This work provides both a conceptual framework and practical resources (datasets, models) for advancing LLM agent capabilities.
This research presents a compelling argument for shifting the paradigm of Large Language Model (LLM) agent actions from constrained formats like JSON or text towards executable Python code, encapsulated in the CodeAct framework. The core strength lies in leveraging the inherent flexibility, control flow (e.g., loops, conditionals), data manipulation capabilities, and vast library ecosystem of a mature programming language. Empirical results, particularly the up to 20% higher success rate on complex multi-tool tasks compared to traditional methods, provide substantial evidence for CodeAct's potential. The development of the M3ToolEval benchmark specifically addresses a gap in evaluating complex agent interactions, strengthening the validation.
The creation of the CodeActInstruct dataset and the subsequent fine-tuning of CodeActAgent models represent significant practical contributions, particularly for advancing open-source LLM agent capabilities. The demonstration that fine-tuning with this specialized data, mixed with general conversation data, improves agent performance without substantially degrading general abilities is a valuable finding for practical LLM development. The agent's ability to leverage existing Python packages and perform self-debugging based on execution errors points towards more autonomous and capable AI systems.
However, some limitations warrant consideration. While CodeAct improves performance, a significant gap persists between open-source and leading closed-source models, suggesting that the action format alone isn't a panacea for underlying model capability differences. The effectiveness of self-debugging might vary depending on error complexity and the base model's reasoning ability. Furthermore, while the benchmarks used are valuable, generalizing performance to the full spectrum of real-world tasks requires ongoing validation. The reliance on LLMs to generate correct and safe code also introduces potential security considerations not fully explored within this scope. Despite these points, CodeAct offers a promising and demonstrably effective direction for building more powerful and flexible LLM agents.
The abstract clearly articulates the problem being addressed, specifically the limitations inherent in current LLM agent action generation methods that rely on JSON or text formats. It effectively highlights the constraints related to action space scope and flexibility, setting a strong foundation for the proposed solution.
The abstract concisely introduces CodeAct as the core contribution, defining it as a unified action space based on executable Python code. This provides the reader with an immediate understanding of the paper's central proposal.
The abstract effectively summarizes the key performance advantage of CodeAct by stating its superiority over alternatives and quantifying the improvement with a specific metric (up to 20% higher success rate). This immediately conveys the significance of the findings.
The abstract successfully outlines the paper's main contributions beyond the core CodeAct framework, including the development of the CodeActInstruct dataset and the CodeActAgent model. This gives a comprehensive overview of the work's scope.
This medium-impact improvement would enhance the reader's understanding of the evaluation's novelty and rigor directly within the abstract. The abstract introduces a new benchmark alongside API-Bank but doesn't specify what gap this new benchmark addresses or what makes it distinct. Clarifying its purpose (e.g., focusing on complex multi-tool composition) would strengthen the abstract by immediately highlighting a key aspect of the methodological contribution and providing context for the performance claims.
Implementation: Modify the sentence mentioning the benchmark to briefly indicate its specific focus or the type of complexity it introduces. For example: "...on API-Bank and a newly curated benchmark designed for complex multi-tool interactions shows that CodeAct outperforms..."
The introduction effectively establishes the context by highlighting the advancements in LLMs and their application as agents, while clearly articulating the limitations of current action generation methods (text/JSON), thereby setting a strong motivation for the research.
The paper clearly introduces CodeAct as a novel framework and systematically presents its distinct advantages over existing approaches in a well-structured, enumerated list, facilitating reader comprehension.
The introduction effectively leverages Figure 1 (referenced on pages 1 and 2) to visually contrast CodeAct with text/JSON actions and quantitatively preview performance gains, reinforcing the core arguments.
The introduction successfully motivates the study by outlining the limitations of prior work, including previous attempts at code generation for agents, and logically positioning CodeAct as a superior alternative.
The section provides a concise preview of the experimental validation strategy and key findings, outlining how the benefits of CodeAct (LLM familiarity, control/data flow) are demonstrated through specific experiments and benchmarks (API-Bank, M3ToolEval).
This low-impact improvement would enhance the clarity of the critique of prior work for readers less familiar with robotics or specific agent control literature. The Introduction section, aiming to establish context, benefits from defining potentially specialized terms early on. Briefly explaining "pre-specified control primitives" when introduced would ensure broader comprehension of the limitations being discussed, thereby strengthening the justification for CodeAct's more general approach.
Implementation: When mentioning that prior code-generation work relies on "pre-specified control primitives," add a brief parenthetical explanation or a short phrase clarifying the term. For example: "...typically rely on pre-specified control primitives (i.e., basic, predefined commands for interaction)..." or "...typically rely on pre-specified control primitives, which are fundamental action commands designed for a specific system..."
Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).
The section clearly defines the CodeAct framework within a general agent interaction model (agent, user, environment), explicitly stating its core mechanism: using executable Python code for agent-environment actions and receiving execution results as observations.
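As a minimal sketch of this interaction model (assuming a simple in-process executor, which the paper does not prescribe), the snippet below runs a code action and returns whatever it prints, or the traceback it raises, as the next observation fed back to the agent.

```python
import io
import traceback
from contextlib import redirect_stdout

def execute_code_action(code: str, namespace: dict) -> str:
    """Execute a CodeAct-style Python action; return stdout (or the traceback) as the observation."""
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)              # the environment runs the agent's code
        return buffer.getvalue() or "[executed, no output]"
    except Exception:
        return traceback.format_exc()          # execution errors become observations

# One simulated turn: the "agent" emits code, the environment replies with an observation.
state = {}                                     # variables persist across turns
print(execute_code_action("total = sum(range(10))\nprint(total)", state))  # -> 45
# A buggy action yields a traceback the agent can use to self-correct on the next turn.
print(execute_code_action("print(undefined_var)", state))
```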
The paper provides strong empirical support for CodeAct's advantages through two well-designed experiments. The first (API-Bank) isolates the benefit of LLM familiarity with code by testing atomic actions, while the second (M3ToolEval) demonstrates the advantages of control/data flow in complex, multi-tool scenarios.
The introduction and evaluation of the M3ToolEval benchmark effectively addresses a gap in existing tool-use evaluations, specifically the lack of benchmarks requiring complex multi-tool composition across different action formats, thereby strengthening the evaluation of CodeAct's capabilities.
The results presented in Tables 2 and 3 compellingly show CodeAct's superiority over JSON and text formats, particularly for complex tasks (higher success rates, fewer turns) and for open-source models, quantitatively supporting the central claims of the paper.
Section 2.4 effectively illustrates the practical benefits of CodeAct beyond basic tool calls, showcasing its ability to leverage existing software libraries (Pandas, Scikit-Learn, Matplotlib) and facilitate self-debugging through automated error feedback, enhancing the agent's autonomy and capability.
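A hypothetical exchange of this kind might look like the following; the dataframe and the bug are invented for illustration, and the try/except merely simulates the error observation the executor would return.

```python
import pandas as pd

df = pd.DataFrame({"price": [250, 410, 180], "airline": ["A", "B", "C"]})

# Turn 1 -- the agent's action contains a bug (wrong column name).
try:
    print(df["cost"].mean())
except KeyError as err:                    # the executor would return this as the observation
    print(f"Observation: KeyError: {err}")

# Turn 2 -- guided by the error, the agent inspects the schema and retries.
print(df.columns.tolist())                 # ['price', 'airline']
print(df["price"].mean())                  # 280.0
```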
This low-impact improvement would enhance the methodological clarity in Section 2.2. This section aims to isolate the benefit of LLM familiarity with code by testing atomic actions, explicitly ablating control/data flow advantages. Explicitly stating why this ablation is crucial (i.e., to separate the 'familiarity' effect from the 'expressiveness' effect tested later) would strengthen the reader's understanding of the experimental logic and how it supports RQ1.
Implementation: In Section 2.2, when introducing the API-Bank experiment setup, add a sentence clarifying the rationale behind focusing on atomic actions. For example: "By focusing on atomic tool calls, where only a single tool is invoked per action, we intentionally ablate the control and data flow advantages inherent in CodeAct (tested in §2.3) to specifically isolate and assess the benefit derived from LLMs' pre-existing familiarity with code structures compared to JSON or text."
This medium-impact suggestion aims to improve the interpretation of results in Section 2.3. The section highlights a significant performance gap between open- and closed-source models using CodeAct on M3ToolEval, attributing it to weaker task-solving and instruction-following in open models. Expanding slightly on potential reasons for this weakness (e.g., differences in pre-training scale/data, architectural choices, or less targeted alignment for complex agency) and more explicitly linking this gap to the motivation for developing CodeActInstruct (introduced in Section 3) would create a stronger narrative bridge and better justify the subsequent focus on improving open-source models.
Implementation: In the paragraph discussing the quantitative results on M3ToolEval (end of Section 2.3), after noting the performance gap and suggesting reasons, add a sentence that explicitly connects this finding to the work presented later. For example: "...suggesting an urgent need to improve open-source LLMs for practical, real-world tasks under the zero-shot setting. This observed gap underscores the importance of targeted instruction tuning, motivating the development of our CodeActInstruct dataset (§3.1) designed to enhance these specific capabilities in open-source models."
Figure 2: General agent multi-turn interaction framework that describes the role of CodeAct and motivates the construction of our data mixture. CodeActInstruct focuses on the agent-environment interactions and specifically filters for the self-improved planning behavior, while general conversation data we include focuses on agent-user interaction (§3.1).
Figure 3: Example multi-turn interaction with Python packages using CodeActAgent (Mistral-7b). No in-context demonstrations are provided to the model. Some messages are omitted for space. See https://chat.xwang.dev/r/Vqn108G for complete interaction.
Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.
Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.
The section clearly articulates the motivation for the work presented, directly addressing the performance gap between open- and closed-source models identified in the previous section (§2.3) and positioning the development of CodeActInstruct and CodeActAgent as a solution.
The methodology for constructing the CodeActInstruct dataset is detailed and systematic, covering use case selection, repurposing of existing datasets, data down-sampling, multi-turn interaction conversion, trajectory generation, and quality filtering based on interaction patterns (self-improvement).
The paper effectively compares CodeActInstruct to prior agent instruction datasets (AgentInstruct, FireAct), highlighting its advantages in terms of action modality (code vs. text), practicality, domain diversity, data quality focus (self-debugging), scale, and empirical performance improvements.
The evaluation of CodeActAgent is comprehensive, testing performance not only on CodeAct tasks (in-domain and out-of-domain) but also assessing generalization to text-based actions and performance on standard general LLM benchmarks, providing a holistic view of the fine-tuned models' capabilities.
The section demonstrates the successful integration of agent-specific interaction data (CodeActInstruct) with general conversation data, showing this mixture improves agent performance without significantly harming general LLM capabilities, offering a practical fine-tuning strategy.
This low-impact improvement would enhance methodological clarity regarding the dataset creation process. Section 3.1 describes selecting trajectories that exhibit self-improvement, a key quality criterion for CodeActInstruct. While the concept is clear, briefly mentioning the specific heuristic or indicator used to identify such trajectories (e.g., presence of an error traceback followed by successful execution in a later turn, as implied and detailed in Appendix G.2) directly within this section would add a layer of operational detail.
Implementation: In the paragraph 'Enhancing Agent's Capabilities of Improving from Interaction', modify the sentence describing the selection criteria. For example: "To achieve this, we selectively preserve those trajectories, identified by the pattern of initial code execution errors followed by successful rectification in subsequent turns, wherein the model initially encounters errors but rectifies these inaccuracies..."
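As a rough sketch of such a heuristic (the paper's exact filtering procedure is given in its Appendix G.2 and is not reproduced here), a trajectory could be kept when an execution error is later followed by an error-free observation; the message schema and error markers below are assumptions for illustration.

```python
ERROR_MARKERS = ("Traceback (most recent call last)", "Error:", "Exception")

def shows_self_improvement(trajectory: list[dict]) -> bool:
    """Keep trajectories where an execution error is later followed by an error-free observation.

    `trajectory` is assumed to alternate agent actions and environment observations,
    each as {"role": ..., "content": ...}; this schema is illustrative, not the paper's.
    """
    observations = [t["content"] for t in trajectory if t["role"] == "environment"]
    error_turns = [i for i, obs in enumerate(observations)
                   if any(m in obs for m in ERROR_MARKERS)]
    if not error_turns:
        return False                       # never failed: no self-correction to learn from
    later = observations[error_turns[0] + 1:]
    return any(all(m not in obs for m in ERROR_MARKERS) for obs in later)
```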
This medium-impact suggestion aims to improve the flow and interpretation of the CodeActAgent evaluation results. Section 3.2 presents the core findings on the fine-tuned models. The text notes the surprising lack of improvement for the Llama-2 variant on M3ToolEval and defers the explanation entirely to Appendix H. Briefly acknowledging the unexpected nature of this specific result and hinting at the hypothesized reason (potential pre-training data artifacts, as detailed in Appendix H) within this main results paragraph would provide better immediate context for the reader and create a smoother narrative.
Implementation: After stating that no improvement is observed for the Llama-2 variant on M3ToolEval, add a sentence acknowledging the anomaly and briefly referencing the hypothesized cause. For example: "Surprisingly, no improvement is observed for the Llama-2 variant. This unexpected result, potentially linked to artifacts in the model's pre-training data (discussed further in Appendix H), contrasts with the gains seen in the Mistral variant."
Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and general conversation data. Token statistics are computed using Llama-2 tokenizer.
Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
Table A.6: Example of actions for re-purposed API-Bank (Li et al., 2023) and M³ToolEval.
Table A.7: Comparison between M³ToolEval and existing tool-use evaluation benchmark.
Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation correspondingly. Overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for fair comparison.
Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.
The section effectively situates the paper's contribution within the established four-component structure of LLM agents (profiles, memory, reasoning/planning, action modules), clearly identifying the action module as the core focus.
The related work clearly articulates the specific problem addressed, namely the standardization of the action space for LLM agents, distinguishing it from broader agent research.
The section appropriately acknowledges and differentiates the work from related concepts like general code generation for problem-solving and concurrent studies (TaskWeaver) by directing readers to Appendices A and B for detailed comparisons, keeping the main text concise.
The section concisely summarizes the two primary methodologies for enhancing LLM agents (prompt engineering and instruction tuning), providing relevant citations and examples for each, which effectively sets the stage for understanding the paper's approach.
This low-impact improvement would enhance immediate clarity within the Related Work section itself. While deferring the detailed comparison with TaskWeaver to Appendix B is efficient, briefly stating the primary distinction (e.g., CodeAct's focus on empirical validation and open-source model fine-tuning vs. TaskWeaver's conceptual framework, as suggested in Appendix B) directly in Section 4.1 would give readers a quicker understanding of the key difference without needing to immediately consult the appendix. This addition aligns with the section's purpose of contextualizing and differentiating the work.
Implementation: After mentioning TaskWeaver and referencing Appendix B, add a short phrase summarizing the core difference. For example: "...similarly endorses the use of code. While TaskWeaver provides a conceptual framework, our work focuses on extensive empirical validation and open-source agent development (Appendix B)." or a similar concise summary based on the distinctions in Appendix B.
This low-impact improvement would strengthen the coherence between the general discussion of agent improvement methods and the specific contributions of this paper. Section 4.2 effectively surveys prompt engineering and instruction tuning. Explicitly linking this survey back to the paper's chosen methodology (instruction tuning via CodeActInstruct for the CodeAct framework) would reinforce the rationale presented earlier and better integrate this subsection into the paper's narrative arc. This fits the section's role of positioning the work within existing methodologies.
Implementation: At the end of Section 4.2, after discussing prompt engineering and instruction tuning approaches, add a sentence that connects this overview to the paper's specific approach. For example: "This paper primarily leverages the instruction tuning paradigm, developing the CodeActInstruct dataset (§3.1) to specifically enhance agent capabilities within the proposed CodeAct framework."
The conclusion effectively synthesizes the paper's primary contributions, clearly restating the introduction of the CodeAct framework, the CodeActInstruct dataset, and the CodeActAgent model.
The section clearly reiterates the core value proposition of CodeAct, emphasizing its advantage over traditional text or JSON-based actions, particularly for complex tasks.
The conclusion effectively highlights the key features and advanced capabilities of the resulting CodeActAgent, including its seamless Python integration, ability to execute complex tasks, leverage existing libraries, and perform autonomous self-debugging.
This low-impact improvement would enhance the conclusiveness of the summary by directly tying the stated advantages back to the empirical evidence presented earlier. The Conclusions section restates CodeAct's advantages but doesn't explicitly reference the quantitative results (e.g., improved success rates shown in Section 2) that substantiate this claim. Adding a brief mention of the empirical validation strengthens the concluding statement.
Implementation: Modify the sentence stating CodeAct's advantage to include a brief reference to the empirical findings. For example: "...which, as demonstrated through extensive experiments (§2), is advantageous over using text or JSON action, especially in complex scenarios." or "...which is advantageous over using text or JSON action, offering empirically verified improvements in success rates for complex scenarios."