Evaluating the Planning Abilities of Large Language and Reasoning Models using PlanBench

Overall Summary

Overview

This research paper evaluates the planning capabilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs), focusing on OpenAI's o1 model, using the PlanBench benchmark and its block-stacking problems. The study finds that LLMs have made only limited progress on planning, while o1 improves substantially, a gain the authors attribute to approximate reasoning rather than the approximate retrieval that characterizes LLMs. However, o1's performance is not robust: accuracy degrades on longer or more complex problems, highlighting the need for further research into LRM architectures and evaluation methodologies. The study also emphasizes that evaluating LRMs requires considering not only accuracy but also efficiency, cost, and guarantees, all of which matter for practical applications.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 2

Description: Table 2 compares the performance (accuracy and time) of o1, other LLMs, and a classical planner (Fast Downward) on various Blocksworld tasks. It shows that while o1 outperforms LLMs, it's not as fast or accurate as Fast Downward, highlighting the trade-off between general-purpose models and specialized systems.

Relevance: This table underscores the performance gap between LRMs and dedicated planning systems, highlighting the need for further research to bridge this gap and improve LRM efficiency.

Figure 3

Description: Figure 3 depicts the decrease in o1's accuracy as the length of the required plan increases in Blocksworld. It visually demonstrates the model's limitations in handling more complex problems.

Relevance: This figure visually reinforces the key finding that o1's reasoning abilities struggle with increased problem complexity, highlighting a crucial area for future development.

Conclusion

This research paper demonstrates that while Large Reasoning Models (LRMs) like o1 represent a significant advancement over LLMs in planning tasks, challenges remain, particularly in handling complex problems and unsolvable instances. o1's improved performance comes at a higher computational cost, emphasizing the need to consider efficiency alongside accuracy in practical applications. Future research should focus on developing more robust LRM architectures that can efficiently handle longer plans and unsolvable problems, exploring hybrid approaches that combine the strengths of LLMs/LRMs with dedicated solvers, and developing more comprehensive evaluation methods that consider accuracy, efficiency, and guarantees. Furthermore, a deeper understanding of LRM reasoning processes and error analysis will be crucial for refining these models and unlocking their full potential in real-world planning scenarios.

Section Analysis

Abstract

Overview

This abstract introduces a study evaluating the planning abilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs), particularly OpenAI's o1 model, using the PlanBench benchmark. It highlights the slow progress of LLMs in planning tasks and suggests that o1, while significantly improved, still has limitations in terms of accuracy, efficiency, and guarantees.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

This introduction sets the stage for evaluating Large Reasoning Models (LRMs), specifically OpenAI's o1, using the PlanBench benchmark. It emphasizes the shift from approximate retrieval in LLMs to approximate reasoning in LRMs, highlighting o1's potential and the need for new evaluation methods.

Key Aspects

Strengths

Suggestions for Improvement

State-of-the-Art LLMs Still Can’t Plan

Overview

This section examines the performance of existing Large Language Models (LLMs) on the PlanBench benchmark, particularly focusing on block-stacking problems. It finds that even the most advanced LLMs struggle with these planning tasks, especially variations like "Mystery Blocksworld," suggesting their limitations in reasoning and planning compared to retrieval tasks.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 1

Table 1 compares the performance of various Large Language Models (LLMs) on two block-stacking tasks: 'Blocksworld' and 'Mystery Blocksworld'. 'Blocksworld' involves arranging blocks according to clear instructions, while 'Mystery Blocksworld' uses confusing language for the same task. The table shows how many out of 600 problems each LLM solved correctly in both zero-shot (no example) and one-shot (one example provided) scenarios. The models are grouped by family (Claude, GPT, LLaMA, Gemini).

First Mention

Text: "Table 1: Performance on 600 instances from the Blocksworld and Mystery Blocksworld domains across large language models from different families, using both zero-shot and one-shot prompts. Best-in-class accuracies are bolded."

Context: This table appears in the 'State-of-the-Art LLMs Still Can't Plan' section on page 2. It presents the performance of various LLMs on the Blocksworld and Mystery Blocksworld tasks, which serves as a baseline for comparison with the newer LRM models discussed later.

Relevance: This table is crucial because it demonstrates that even the most advanced LLMs struggle with planning tasks, especially when the instructions are unclear. This highlights the need for models with better reasoning abilities, like the LRMs discussed in the paper.

Critique
Visual Aspects
  • The table is well-organized and easy to read.
  • Using bold font for the best results in each row makes it easy to see which model performed best in each category.
Analytical Aspects
  • The table clearly shows that LLMs are much better at 'Blocksworld' than 'Mystery Blocksworld', suggesting they rely on recognizing familiar phrasing rather than true understanding.
  • Including both the number of correct solutions and the percentage makes the data easier to interpret.
  • The table could benefit from a brief explanation of why 'Mystery Blocksworld' is a useful test.
Numeric Data
  • Best Blocksworld Zero-Shot Accuracy: 376 correct instances out of 600 (62.6%)
  • Best Mystery Blocksworld Zero-Shot Accuracy: 21 correct instances out of 600 (3.5%)
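
To make the zero-shot/one-shot setup and the obfuscation idea behind Mystery Blocksworld concrete, here is a minimal, hypothetical sketch in Python of how such prompts might be assembled. The wording below is invented for illustration; it does not reproduce PlanBench's actual prompt templates or Mystery Blocksworld's real vocabulary substitution.

```python
# Hypothetical sketch only: these strings are invented for illustration and do
# not reproduce PlanBench's actual prompts or Mystery Blocksworld's real renaming.

DOMAIN_RULES = (
    "I am playing with a set of blocks. I can pick up a block, put it down, "
    "stack it on another block, or unstack it from another block, one at a time."
)

QUERY = (
    "Initial state: B is on the table, A is on B, my hand is empty.\n"
    "Goal: B is on top of A.\n"
    "What is my plan?"
)

# A worked example included only in the one-shot prompt.
EXAMPLE = (
    "Initial state: D is on the table, C is on D, my hand is empty.\n"
    "Goal: D is on top of C.\n"
    "Plan: unstack C from D; put down C; pick up D; stack D on C."
)

zero_shot_prompt = DOMAIN_RULES + "\n\n" + QUERY
one_shot_prompt = DOMAIN_RULES + "\n\n" + EXAMPLE + "\n\n" + QUERY

# Mystery Blocksworld poses the same underlying problem but systematically renames
# actions and predicates with meaningless words (placeholder words shown here),
# which defeats answers based on surface familiarity with block-stacking text.
OBFUSCATED_QUERY = (
    "Initial state: object B is floop, object A is wozzled onto B, my gripper is idle.\n"
    "Goal: object B is wozzled onto A.\n"
    "What sequence of actions achieves this?"
)
```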

From Approximate Retrieval to Approximate Reasoning: Evaluating o1

Overview

This section discusses the shift from approximate retrieval (like looking up answers in a vast library) in Large Language Models (LLMs) to approximate reasoning (like figuring things out step-by-step) in Large Reasoning Models (LRMs), using OpenAI's o1 as a prime example. It evaluates o1's performance on PlanBench, a benchmark for planning tasks, and finds that while o1 shows significant improvement over LLMs, its performance isn't perfect and degrades with more complex problems.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 1

Figure 1 shows two line graphs comparing the performance of different models on the 'Mystery Blocksworld' task. One graph represents 'zero-shot' performance (no prior example given), and the other shows 'one-shot' performance (one example provided). The x-axis of each graph represents the length of the solution plan (number of steps), and the y-axis represents the percentage of problems solved correctly. Each line on the graphs corresponds to a different model, and you can see how their accuracy changes as the plans get longer. Generally, the lines slope downwards, meaning accuracy drops as the problems get harder (longer plans needed).

First Mention

Text: "Figure 1: These examples are on Mystery Blocksworld. Fast Downward, a domain-independent planner [8] solves all given instances near-instantly with guaranteed perfect accuracy. LLMs struggle on even the smallest instances. The two LRMs we tested, o1-preview and o1-mini, are surprisingly effective, but this performance is still not robust, and degrades quickly with length."

Context: This figure is introduced at the beginning of the 'From Approximate Retrieval to Approximate Reasoning: Evaluating o1' section on page 3. It visually demonstrates the performance difference between LLMs, LRMs (o1-preview and o1-mini), and a classical planner (Fast Downward) on the Mystery Blocksworld task, highlighting the relative effectiveness of the LRMs compared to LLMs and their limitations compared to Fast Downward.

Relevance: This figure is important because it visually shows that the new 'reasoning' models (LRMs) perform much better than standard language models (LLMs) on the tricky 'Mystery Blocksworld' task. It also shows that even these new models aren't perfect and struggle with longer, more complex problems. This supports the idea that while LRMs are a step forward, there's still a lot of room for improvement in planning and reasoning.

Critique
Visual Aspects
  • The graphs are generally clear, but the colors of some lines are too similar, making them hard to distinguish.
  • The y-axis label '% correct' could be more descriptive, such as 'Accuracy (%)'.
  • Adding a clear visual marker (like a thicker line or distinct symbol) for the best-performing LRM would improve readability.
Analytical Aspects
  • The figure effectively demonstrates the performance degradation with increasing plan length.
  • It would be helpful to include a brief explanation of 'zero-shot' and 'one-shot' in the caption for a broader audience.
  • While the figure shows trends, it lacks any indication of statistical significance or variability (e.g., error bars).
Numeric Data
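
As a concrete illustration of how a curve like Figure 1's is computed, the following minimal Python sketch buckets per-instance correctness by the length of the reference plan. The `results` records are made-up placeholders, not the paper's data.

```python
from collections import defaultdict

# `results` is a hypothetical list of (reference_plan_length, answer_was_correct)
# records for one model; the values below are placeholders, not the paper's data.
results = [(2, True), (2, True), (4, True), (4, False), (6, False)]

by_length = defaultdict(list)
for plan_length, correct in results:
    by_length[plan_length].append(correct)

# Percentage of instances answered correctly at each reference plan length,
# i.e. one point per x-value of a curve like those in Figure 1.
accuracy_by_length = {
    length: 100.0 * sum(flags) / len(flags)
    for length, flags in sorted(by_length.items())
}
print(accuracy_by_length)  # {2: 100.0, 4: 50.0, 6: 0.0}
```
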
table 2

Table 2 shows how well different models perform on several variations of the 'Blocksworld' problem, including a tricky version called 'Mystery Blocksworld' and an even trickier 'Randomized Mystery Blocksworld'. It compares the new 'reasoning' models (o1-preview and o1-mini) with a traditional planning system called 'Fast Downward'. The table shows how many out of 600 problems each model solved correctly and the average time it took them. 'Fast Downward' solves all the problems perfectly and very quickly. The o1 models do well on the regular 'Blocksworld' but struggle more with the 'Mystery' versions, taking much longer to answer.

First Mention

Text: "Table 2: Performance and average time taken on 600 instances from the Blocksworld, Mystery Blocksworld and Randomized Mystery Blocksworld domains by OpenAI's ol family of large reasoning models and Fast Downward"

Context: This table, appearing on page 3, follows Figure 1 and provides a more detailed breakdown of the performance of the o1 models and Fast Downward on different Blocksworld variations, including accuracy and average time taken. It further emphasizes the performance gap between LRMs and a dedicated planner.

Relevance: This table is important because it provides a direct comparison between the new LRMs, older LLMs (indirectly through comparison with Fast Downward), and a dedicated planning system. It shows that while LRMs are much better than LLMs, they are still not as good or as fast as specialized tools. This highlights the trade-offs between general-purpose language models and specialized systems.

Critique
Visual Aspects
  • The table is well-structured, but some empty cells make it slightly harder to compare all models directly.
  • Adding a visual cue to highlight the best performance in each row (besides Fast Downward) would be helpful.
Analytical Aspects
  • The table effectively presents both accuracy and time data, allowing for a more comprehensive evaluation.
  • Including the 'Randomized Mystery Blocksworld' is important to show that the results aren't just due to the specific wording of 'Mystery Blocksworld'.
  • The table could benefit from a brief explanation of why 'Fast Downward' is used as a comparison point.
Numeric Data
  • o1-preview Blocksworld Accuracy: 587 correct instances out of 600 (97.8%)
  • o1-mini Blocksworld Accuracy: 600 correct instances out of 600 (100%)
  • Fast Downward Blocksworld Accuracy: 600 correct instances out of 600 (100%)
  • o1-preview Blocksworld Average Time: 40.43 seconds
  • o1-mini Blocksworld Average Time: 35.54 seconds
  • Fast Downward Blocksworld Average Time: 0.265 seconds
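
For context on the Fast Downward baseline, here is a hedged sketch of how one instance might be timed against a local Fast Downward checkout. The file paths and the astar(lmcut()) search configuration are assumptions chosen for illustration; the paper's exact setup is not specified here.

```python
import subprocess
import time

# Assumes a local Fast Downward checkout; domain.pddl and problem.pddl are
# placeholder paths, and astar(lmcut()) is just one common optimal-search config.
start = time.perf_counter()
subprocess.run(
    ["./fast-downward.py", "domain.pddl", "problem.pddl",
     "--search", "astar(lmcut())"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"wall-clock time: {elapsed:.3f} s")
# Averaging such timings over all 600 instances gives numbers on the scale of the
# "average time" column in Table 2 (e.g. 0.265 s for Fast Downward on Blocksworld).
```
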
figure 2

Figure 2 presents two graphs illustrating the relationship between plan length and the average number of reasoning tokens used by the o1-preview model. Graph (a) shows this relationship for 'Mystery Blocksworld', where obfuscated language is used to describe the block stacking task. Graph (b) shows the same relationship for the standard 'Blocksworld' task with clear instructions. Both graphs have 'Plan Length' on the x-axis and 'Average Reasoning Tokens' on the y-axis.

First Mention

Text: "Figure 2"

Context: This figure is introduced on page 4, in the section discussing the shift from approximate retrieval to approximate reasoning. It is used to illustrate how o1-preview's resource usage (reasoning tokens) changes with the complexity of the planning problem.

Relevance: Figure 2 helps to understand how o1-preview's reasoning process scales with problem complexity. It shows whether the model's resource consumption increases proportionally with the difficulty of the task, which is important for evaluating its efficiency.

Critique
Visual Aspects
  • The graphs are generally clear, with labeled axes and titles.
  • The shaded areas around the lines, presumably representing variability, are a bit too dark and make it slightly difficult to see the exact trend lines.
  • Using different colors or line styles for the trend lines and the shaded areas would improve readability.
Analytical Aspects
  • The caption could be more explicit about what 'reasoning tokens' are and why they matter.
  • The graphs don't show a clear correlation between plan length and reasoning tokens for the standard Blocksworld task, which raises questions about the model's behavior.
  • It would be helpful to include some statistical measure of the correlation or lack thereof between the variables.
Numeric Data
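
As a rough sketch of where the "reasoning tokens" counts in Figure 2 come from, the snippet below queries an o1 model through the OpenAI Python SDK and reads the usage metadata. The `completion_tokens_details.reasoning_tokens` field reflects the API as documented at the time of writing and may differ across SDK versions, so treat it as an assumption.

```python
from openai import OpenAI  # assumes the openai v1.x Python SDK

client = OpenAI()

def reasoning_tokens_for(prompt: str) -> int:
    # Sketch only: the usage field below is how o1 models reported hidden
    # reasoning-token counts at the time of writing; names may change.
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.completion_tokens_details.reasoning_tokens

# Averaging these counts per reference plan length (as in the earlier bucketing
# sketch) yields curves of the kind shown in Figure 2.
```
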
figure 3

Figure 3 (partially out of frame in the provided document) appears to be a line graph showing how the accuracy of different models changes as the length of the plan (number of steps) increases. The x-axis represents 'Plan Length', and the y-axis represents '% correct'. Multiple lines, likely representing different models, are plotted on the graph, showing a general trend of decreasing accuracy with increasing plan length.

First Mention

Text: "Figure 3: All models are evaluated on Blocksworld. ol-preview outperforms the other LLMs, but its performance degrades more quickly with length."

Context: This figure, first mentioned on page 4, is part of the discussion on evaluating o1 and its performance on the Blocksworld task, especially as the problem size increases. It is presented after the discussion of o1's performance on the original test set and before the analysis of unsolvable instances.

Relevance: This figure is important because it shows how well o1-preview scales to more complex planning problems compared to other LLMs. The ability to handle longer plans is a key indicator of a model's reasoning capabilities.

Critique
Visual Aspects
  • The figure is unfortunately cut off in the provided document, making it impossible to fully interpret the data or see the legend.
  • The visible portion suggests that the lines might be close together, which could make it hard to distinguish between different models. Different line styles or markers would help.
  • The caption should be placed below the figure for better readability.
Analytical Aspects
  • The caption mentions that o1-preview's performance degrades quickly with length, but it would be more informative to quantify this degradation (e.g., 'accuracy drops by X% for every Y additional steps').
  • It's unclear from the visible portion what the range of plan lengths tested is. This information is crucial for understanding the scope of the evaluation.
  • The caption could explain why performance degradation with length is a significant issue for planning models.
Numeric Data
figure 3

Figure 3 is a line graph showing how the accuracy of the o1-preview model changes as the problems it tries to solve get more complex. The x-axis represents the number of steps needed to solve a problem (such as the number of moves required to stack blocks a certain way), ranging from 20 to 40. The y-axis represents how often the model gets the right answer (percentage correct), from 0% to 100%. Multiple lines appear on the graph, likely comparing o1-preview's performance to other models, and the overall trend shows accuracy falling as the required number of steps increases.

First Mention

Text: "Figure 3: Extending even the (regular, not obfuscated) Blocksworld dataset to problems requiring greater numbers of steps worsens the performance of o1-preview. When tested on 110 instances which each require at least 20 steps to solve, it only manages 23.63%."

Context: This figure is introduced on page 4 in the section 'From Approximate Retrieval to Approximate Reasoning: Evaluating o1'. It is presented to show how o1-preview's performance changes as the complexity of the Blocksworld problems increases.

Relevance: This figure is important because it shows that even though o1-preview is better than older models, it still has trouble with harder problems. This tells us that there's still room for improvement in how these models reason and plan.

Critique
Visual Aspects
  • The graph is partially cut off on page 4, making it hard to see the full picture.
  • The lines and shaded areas overlap, making it difficult to distinguish the performance of different models or conditions.
  • The labels on the axes are clear, but the legend or model descriptions are not fully visible on page 4.
Analytical Aspects
  • The figure focuses on how accuracy changes with problem complexity, which is a key aspect of evaluating planning models.
  • The caption clearly states the main takeaway: o1-preview's performance degrades with increasing problem size.
  • The figure would be stronger if it included some measure of variability or uncertainty, like error bars or confidence intervals.
Numeric Data
  • o1-preview Accuracy on 20+ step problems: 23.63 %
  • Number of instances tested: 110
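
For scale, 23.63% of 110 instances corresponds to roughly 26 correctly solved problems (26/110 ≈ 23.6%).
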
table 3

Table 3 shows how well OpenAI's o1-preview model can tell when a block-stacking problem is impossible to solve. It looks at two types of problems: regular 'Blocksworld' and a more complicated version called 'Randomized Mystery Blocksworld'. The table reports two rates: 1. The 'True Negative' rate: how often the model correctly says a problem is impossible when it actually is. 2. The 'False Negative' rate: how often the model wrongly says a problem is impossible when it actually has a solution. Think of it like a screening test for impossibility: you want it to flag the problems that are genuinely unsolvable (true negatives) without mislabeling solvable problems as impossible (false negatives).

First Mention

Text: "Table 3: Rate of claiming that a problem is impossible by OpenAI’s ol-preview on 100 unsolvable and 600 solvable instances in the Blocksworld and Randomized Mystery Blocksworld domains. The True Negative rate is the percent of unsolvable instances that were correctly marked as unsolvable. The False Negative rate is the percent of solvable instances that were incorrectly marked as unsolvable. Previous models are not shown in this table as their true negative and false negative rates were generally 0% across the board."

Context: This table appears on page 5, within the section discussing o1's evaluation. It follows the discussion of o1-preview's performance on larger Blocksworld problems and precedes the analysis of accuracy/cost tradeoffs.

Relevance: This table is important because it shows a new aspect of o1-preview's abilities - figuring out when a problem can't be solved at all. This is useful in real-world situations where knowing something is impossible is just as important as finding a solution.

Critique
Visual Aspects
  • The table is simple and easy to understand.
  • Clearly labeling the columns and rows makes the data easy to interpret.
Analytical Aspects
  • The table clearly shows that o1-preview is better at identifying impossible problems in regular 'Blocksworld' than in 'Randomized Mystery Blocksworld'.
  • The caption explains 'True Negative' and 'False Negative' rates clearly, which is helpful for a non-expert audience.
  • The table could be improved by showing results for other models, even if their rates are mostly 0%, to provide a better comparison.
Numeric Data
  • Blocksworld True Negative Rate: 27 %
  • Blocksworld False Negative Rate: 0 %
  • Randomized Mystery Blocksworld True Negative Rate: 16 %
  • Randomized Mystery Blocksworld False Negative Rate: 11.5 %
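
To pin down how the two rates are computed, here is a minimal Python sketch; the `verdicts` records are hypothetical placeholders, and the counts in the closing comment come from the Blocksworld column of Table 3.

```python
# `verdicts` is a hypothetical list of (instance_is_solvable, model_said_unsolvable) pairs.
def negative_rates(verdicts):
    unsolvable_flags = [said for solvable, said in verdicts if not solvable]
    solvable_flags = [said for solvable, said in verdicts if solvable]
    # True negative rate: unsolvable instances correctly flagged as unsolvable.
    tn_rate = 100.0 * sum(unsolvable_flags) / len(unsolvable_flags)
    # False negative rate: solvable instances incorrectly flagged as unsolvable.
    fn_rate = 100.0 * sum(solvable_flags) / len(solvable_flags)
    return tn_rate, fn_rate

# With Table 3's Blocksworld numbers (100 unsolvable and 600 solvable instances),
# 27 correct "impossible" calls and 0 incorrect ones give 27% and 0% respectively.
```
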
table 4

This table presents the cost per 100 instances, measured in US dollars, for using different Large Language Models (LLMs) and Large Reasoning Models (LRMs). It separates the models into two categories: LLMs (like Claude, GPT variants, and Gemini) and LRMs (specifically o1-preview and o1-mini).

First Mention

Text: "Table 4: Cost per 100 instances (in USD). LRMs are significantly more expensive than LLMs."

Context: This table is presented on page 6, within the discussion of accuracy/cost tradeoffs and guarantees for LRMs. It follows the analysis of o1's performance and precedes a comparison with classical planners and LLM-Modulo systems.

Relevance: This table is highly relevant because it directly addresses the cost implications of using LRMs for planning tasks. It highlights the significant cost difference between LLMs and LRMs, which is a crucial factor to consider when evaluating their practical applicability.

Critique
Visual Aspects
  • The table is clear and easy to read, with a simple structure that facilitates quick comparison between models.
  • The division into LLMs and LRMs is helpful for understanding the cost disparities between the two model types.
Analytical Aspects
  • The table effectively communicates the substantial cost difference between LLMs and LRMs, emphasizing the financial implications of using more computationally intensive models.
  • While the table focuses on cost, it would be beneficial to connect these costs to the performance differences shown in earlier tables and figures. This would provide a more comprehensive view of the cost-benefit tradeoff.
  • The table could also include a brief explanation of the 'reasoning tokens' and how they contribute to the higher cost of LRMs.
Numeric Data
  • Cost of o1-preview (per 100 instances): 42.12 USD
  • Cost of o1-mini (per 100 instances): 3.69 USD
  • Cost of GPT-4 (per 100 instances): 1.8 USD
  • Cost of GPT-4 Turbo (per 100 instances): 1.2 USD
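
As a back-of-envelope illustration (not a figure reported in the paper), the sketch below combines Table 4's costs with Table 2's Blocksworld accuracy to estimate the cost per correctly solved instance.

```python
# Back-of-envelope sketch combining Table 4 (cost per 100 instances, USD) with
# Table 2's Blocksworld accuracy; this derived figure is not reported in the paper.
cost_per_100 = {"o1-preview": 42.12, "o1-mini": 3.69, "GPT-4": 1.80, "GPT-4 Turbo": 1.20}
blocksworld_accuracy = {"o1-preview": 587 / 600}  # 97.8%, from Table 2

model = "o1-preview"
cost_per_correct = (cost_per_100[model] / 100) / blocksworld_accuracy[model]
print(f"{model}: ~${cost_per_correct:.2f} per correctly solved Blocksworld instance")
# ≈ $0.43, compared with under $0.02 per instance (before accounting for accuracy)
# for the GPT-4 models listed in Table 4.
```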

Conclusion

Overview

This conclusion summarizes the findings of the study, highlighting the improved performance of Large Reasoning Models (LRMs like OpenAI's o1) compared to traditional LLMs on planning tasks using the PlanBench benchmark. While LLMs showed some progress on basic Blocksworld problems, they struggled with more complex or obfuscated versions. LRMs, particularly o1, demonstrated significantly better accuracy but still faced limitations with longer problems and unsolvable instances. The conclusion also emphasizes the importance of considering accuracy/efficiency trade-offs and the lack of correctness guarantees with LRMs, suggesting alternative approaches like LLM-Modulo systems or dedicated solvers for certain applications.

Key Aspects

Strengths

Suggestions for Improvement
