Evaluating the Planning Abilities of Large Language and Reasoning Models using PlanBench

Overall Summary

Overview

This research paper evaluates the planning capabilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs), focusing on OpenAI's o1 model, using the PlanBench benchmark and its block-stacking problems. The study finds that LLMs have made only limited progress on planning, while o1 improves substantially, a gain the authors attribute to approximate reasoning rather than the approximate retrieval that characterizes LLMs. However, o1's performance is not robust: accuracy degrades on longer or more complex problems, highlighting the need for further research into LRM architectures and evaluation methodologies. The study also emphasizes that evaluating LRMs requires considering not only accuracy but also efficiency, cost, and guarantees, all of which matter for practical applications.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 2

Description: Table 2 compares the performance (accuracy and time) of o1, other LLMs, and a classical planner (Fast Downward) on various Blocksworld tasks. It shows that while o1 outperforms LLMs, it's not as fast or accurate as Fast Downward, highlighting the trade-off between general-purpose models and specialized systems.

Relevance: This table underscores the performance gap between LRMs and dedicated planning systems, highlighting the need for further research to bridge this gap and improve LRM efficiency.

Figure 3

Description: Figure 3 depicts the decrease in o1's accuracy as the length of the required plan increases in Blocksworld. It visually demonstrates the model's limitations in handling more complex problems.

Relevance: This figure visually reinforces the key finding that o1's reasoning abilities struggle with increased problem complexity, highlighting a crucial area for future development.

Conclusion

This research paper demonstrates that while Large Reasoning Models (LRMs) like o1 represent a significant advancement over LLMs in planning tasks, challenges remain, particularly in handling complex problems and unsolvable instances. o1's improved performance comes at a higher computational cost, emphasizing the need to consider efficiency alongside accuracy in practical applications. Future research should focus on developing more robust LRM architectures that can efficiently handle longer plans and unsolvable problems, exploring hybrid approaches that combine the strengths of LLMs/LRMs with dedicated solvers, and developing more comprehensive evaluation methods that consider accuracy, efficiency, and guarantees. Furthermore, a deeper understanding of LRM reasoning processes and error analysis will be crucial for refining these models and unlocking their full potential in real-world planning scenarios.

Section Analysis

Abstract

Overview

This abstract introduces a study evaluating the planning abilities of Large Language Models (LLMs) and Large Reasoning Models (LRMs), particularly OpenAI's o1 model, using the PlanBench benchmark. It highlights the slow progress of LLMs in planning tasks and suggests that o1, while significantly improved, still has limitations in terms of accuracy, efficiency, and guarantees.

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Overview

This introduction sets the stage for evaluating Large Reasoning Models (LRMs), specifically OpenAI's o1, using the PlanBench benchmark. It emphasizes the shift from approximate retrieval in LLMs to approximate reasoning in LRMs, highlighting o1's potential and the need for new evaluation methods.

Key Aspects

Strengths

Suggestions for Improvement

State-of-the-Art LLMs Still Can’t Plan

Overview

This section examines the performance of existing Large Language Models (LLMs) on the PlanBench benchmark, particularly focusing on block-stacking problems. It finds that even the most advanced LLMs struggle with these planning tasks, especially variations like "Mystery Blocksworld," suggesting their limitations in reasoning and planning compared to retrieval tasks.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

table 1

Table 1 compares the performance of various Large Language Models (LLMs) on two block-stacking tasks: 'Blocksworld' and 'Mystery Blocksworld'. 'Blocksworld' involves arranging blocks according to clear instructions, while 'Mystery Blocksworld' uses confusing language for the same task. The table shows how many out of 600 problems each LLM solved correctly in both zero-shot (no example) and one-shot (one example provided) scenarios. The models are grouped by family (Claude, GPT, LLaMA, Gemini).

First Mention

Text: "Table 1: Performance on 600 instances from the Blocksworld and Mystery Blocksworld domains across large language models from different families, using both zero-shot and one-shot prompts. Best-in-class accuracies are bolded."

Context: This table appears in the 'State-of-the-Art LLMs Still Can't Plan' section on page 2. It presents the performance of various LLMs on the Blocksworld and Mystery Blocksworld tasks, which serves as a baseline for comparison with the newer LRM models discussed later.

Relevance: This table is crucial because it demonstrates that even the most advanced LLMs struggle with planning tasks, especially when the instructions are unclear. This highlights the need for models with better reasoning abilities, like the LRMs discussed in the paper.

Critique
Visual Aspects
  • The table is well-organized and easy to read.
  • Using bold font for the best results in each row makes it easy to see which model performed best in each category.
Analytical Aspects
  • The table clearly shows that LLMs are much better at 'Blocksworld' than 'Mystery Blocksworld', suggesting they rely on recognizing familiar phrasing rather than true understanding.
  • Including both the number of correct solutions and the percentage makes the data easier to interpret.
  • The table could benefit from a brief explanation of why 'Mystery Blocksworld' is a useful test.
Numeric Data
  • Best Blocksworld Zero-Shot Accuracy: 376 correct instances out of 600 (62.6%)
  • Best Mystery Blocksworld Zero-Shot Accuracy: 21 correct instances out of 600 (3.5%)
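
To make the zero-shot/one-shot setup and the obfuscation idea behind Mystery Blocksworld concrete, here is a minimal, hypothetical sketch in Python of how such prompts might be assembled. The wording below is invented for illustration; it does not reproduce PlanBench's actual prompt templates or Mystery Blocksworld's real vocabulary substitution.

```python
# Hypothetical sketch only: these strings are invented for illustration and do
# not reproduce PlanBench's actual prompts or Mystery Blocksworld's real renaming.

DOMAIN_RULES = (
    "I am playing with a set of blocks. I can pick up a block, put it down, "
    "stack it on another block, or unstack it from another block, one at a time."
)

QUERY = (
    "Initial state: B is on the table, A is on B, my hand is empty.\n"
    "Goal: B is on top of A.\n"
    "What is my plan?"
)

# A worked example included only in the one-shot prompt.
EXAMPLE = (
    "Initial state: D is on the table, C is on D, my hand is empty.\n"
    "Goal: D is on top of C.\n"
    "Plan: unstack C from D; put down C; pick up D; stack D on C."
)

zero_shot_prompt = DOMAIN_RULES + "\n\n" + QUERY
one_shot_prompt = DOMAIN_RULES + "\n\n" + EXAMPLE + "\n\n" + QUERY

# Mystery Blocksworld poses the same underlying problem but systematically renames
# actions and predicates with meaningless words (placeholder words shown here),
# which defeats answers based on surface familiarity with block-stacking text.
OBFUSCATED_QUERY = (
    "Initial state: object B is floop, object A is wozzled onto B, my gripper is idle.\n"
    "Goal: object B is wozzled onto A.\n"
    "What sequence of actions achieves this?"
)
```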

From Approximate Retrieval to Approximate Reasoning: Evaluating o1

Overview

This section discusses the shift from approximate retrieval (like looking up answers in a vast library) in Large Language Models (LLMs) to approximate reasoning (like figuring things out step-by-step) in Large Reasoning Models (LRMs), using OpenAI's o1 as a prime example. It evaluates o1's performance on PlanBench, a benchmark for planning tasks, and finds that while o1 shows significant improvement over LLMs, its performance isn't perfect and degrades with more complex problems.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

figure 1

Figure 1 shows two line graphs comparing the performance of different models on the 'Mystery Blocksworld' task. One graph represents 'zero-shot' performance (no prior example given), and the other shows 'one-shot' performance (one example provided). The x-axis of each graph represents the length of the solution plan (number of steps), and the y-axis represents the percentage of problems solved correctly. Each line on the graphs corresponds to a different model, and you can see how their accuracy changes as the plans get longer. Generally, the lines slope downwards, meaning accuracy drops as the problems get harder (longer plans needed).

First Mention

Text: "Figure 1: These examples are on Mystery Blocksworld. Fast Downward, a domain-independent planner [8] solves all given instances near-instantly with guaranteed perfect accuracy. LLMs struggle on even the smallest instances. The two LRMs we tested, o1-preview and o1-mini, are surprisingly effective, but this performance is still not robust, and degrades quickly with length."

Context: This figure is introduced at the beginning of the 'From Approximate Retrieval to Approximate Reasoning: Evaluating o1' section on page 3. It visually demonstrates the performance difference between LLMs, LRMs (o1-preview and o1-mini), and a classical planner (Fast Downward) on the Mystery Blocksworld task, highlighting the relative effectiveness of the LRMs compared to LLMs and their limitations compared to Fast Downward.

Relevance: This figure is important because it visually shows that the new 'reasoning' models (LRMs) perform much better than standard language models (LLMs) on the tricky 'Mystery Blocksworld' task. It also shows that even these new models aren't perfect and struggle with longer, more complex problems. This supports the idea that while LRMs are a step forward, there's still a lot of room for improvement in planning and reasoning.

Critique
Visual Aspects
  • The graphs are generally clear, but the colors of some lines are too similar, making them hard to distinguish.
  • The y-axis label '% correct' could be more descriptive, such as 'Accuracy (%)'.
  • Adding a clear visual marker (like a thicker line or distinct symbol) for the best-performing LRM would improve readability.
Analytical Aspects
  • The figure effectively demonstrates the performance degradation with increasing plan length.
  • It would be helpful to include a brief explanation of 'zero-shot' and 'one-shot' in the caption for a broader audience.
  • While the figure shows trends, it lacks any indication of statistical significance or variability (e.g., error bars).
Numeric Data
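
As a concrete illustration of how a curve like Figure 1's is computed, the following minimal Python sketch buckets per-instance correctness by the length of the reference plan. The `results` records are made-up placeholders, not the paper's data.

```python
from collections import defaultdict

# `results` is a hypothetical list of (reference_plan_length, answer_was_correct)
# records for one model; the values below are placeholders, not the paper's data.
results = [(2, True), (2, True), (4, True), (4, False), (6, False)]

by_length = defaultdict(list)
for plan_length, correct in results:
    by_length[plan_length].append(correct)

# Percentage of instances answered correctly at each reference plan length,
# i.e. one point per x-value of a curve like those in Figure 1.
accuracy_by_length = {
    length: 100.0 * sum(flags) / len(flags)
    for length, flags in sorted(by_length.items())
}
print(accuracy_by_length)  # {2: 100.0, 4: 50.0, 6: 0.0}
```
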
table 2

Table 2 shows how well different models perform on several variations of the 'Blocksworld' problem, including a tricky version called 'Mystery Blocksworld' and an even trickier 'Randomized Mystery Blocksworld'. It compares the new 'reasoning' models (o1-preview and o1-mini) with a traditional planning system called 'Fast Downward'. The table shows how many out of 600 problems each model solved correctly and the average time it took them. 'Fast Downward' solves all the problems perfectly and very quickly. The o1 models do well on the regular 'Blocksworld' but struggle more with the 'Mystery' versions, taking much longer to answer.

First Mention

Text: "Table 2: Performance and average time taken on 600 instances from the Blocksworld, Mystery Blocksworld and Randomized Mystery Blocksworld domains by OpenAI's ol family of large reasoning models and Fast Downward"

Context: This table, appearing on page 3, follows Figure 1 and provides a more detailed breakdown of the performance of the o1 models and Fast Downward on different Blocksworld variations, including accuracy and average time taken. It further emphasizes the performance gap between LRMs and a dedicated planner.

Relevance: This table is important because it provides a direct comparison between the new LRMs, older LLMs (indirectly through comparison with Fast Downward), and a dedicated planning system. It shows that while LRMs are much better than LLMs, they are still not as good or as fast as specialized tools. This highlights the trade-offs between general-purpose language models and specialized systems.

Critique
Visual Aspects
  • The table is well-structured, but some empty cells make it slightly harder to compare all models directly.
  • Adding a visual cue to highlight the best performance in each row (besides Fast Downward) would be helpful.
Analytical Aspects
  • The table effectively presents both accuracy and time data, allowing for a more comprehensive evaluation.
  • Including the 'Randomized Mystery Blocksworld' is important to show that the results aren't just due to the specific wording of 'Mystery Blocksworld'.
  • The table could benefit from a brief explanation of why 'Fast Downward' is used as a comparison point.
Numeric Data
  • o1-preview Blocksworld Accuracy: 587 correct instances out of 600 (97.8%)
  • o1-mini Blocksworld Accuracy: 600 correct instances out of 600 (100%)
  • Fast Downward Blocksworld Accuracy: 600 correct instances out of 600 (100%)
  • o1-preview Blocksworld Average Time: 40.43 seconds
  • o1-mini Blocksworld Average Time: 35.54 seconds
  • Fast Downward Blocksworld Average Time: 0.265 seconds
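
For context on the Fast Downward baseline, here is a hedged sketch of how one instance might be timed against a local Fast Downward checkout. The file paths and the astar(lmcut()) search configuration are assumptions chosen for illustration; the paper's exact setup is not specified here.

```python
import subprocess
import time

# Assumes a local Fast Downward checkout; domain.pddl and problem.pddl are
# placeholder paths, and astar(lmcut()) is just one common optimal-search config.
start = time.perf_counter()
subprocess.run(
    ["./fast-downward.py", "domain.pddl", "problem.pddl",
     "--search", "astar(lmcut())"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"wall-clock time: {elapsed:.3f} s")
# Averaging such timings over all 600 instances gives numbers on the scale of the
# "average time" column in Table 2 (e.g. 0.265 s for Fast Downward on Blocksworld).
```
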
figure 2

Figure 2 presents two graphs illustrating the relationship between plan length and the average number of reasoning tokens used by the o1-preview model. Graph (a) shows this relationship for 'Mystery Blocksworld', where obfuscated language is used to describe the block stacking task. Graph (b) shows the same relationship for the standard 'Blocksworld' task with clear instructions. Both graphs have 'Plan Length' on the x-axis and 'Average Reasoning Tokens' on the y-axis.

First Mention

Text: "Figure 2"

Context: This figure is introduced on page 4, in the section discussing the shift from approximate retrieval to approximate reasoning. It is used to illustrate how o1-preview's resource usage (reasoning tokens) changes with the complexity of the planning problem.

Relevance: Figure 2 helps to understand how o1-preview's reasoning process scales with problem complexity. It shows whether the model's resource consumption increases proportionally with the difficulty of the task, which is important for evaluating its efficiency.

Critique
Visual Aspects
  • The graphs are generally clear, with labeled axes and titles.
  • The shaded areas around the lines, presumably representing variability, are a bit too dark and make it slightly difficult to see the exact trend lines.
  • Using different colors or line styles for the trend lines and the shaded areas would improve readability.
Analytical Aspects
  • The caption could be more explicit about what 'reasoning tokens' are and why they matter.
  • The graphs don't show a clear correlation between plan length and reasoning tokens for the standard Blocksworld task, which raises questions about the model's behavior.
  • It would be helpful to include some statistical measure of the correlation or lack thereof between the variables.
Numeric Data
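
As a rough sketch of where the "reasoning tokens" counts in Figure 2 come from, the snippet below queries an o1 model through the OpenAI Python SDK and reads the usage metadata. The `completion_tokens_details.reasoning_tokens` field reflects the API as documented at the time of writing and may differ across SDK versions, so treat it as an assumption.

```python
from openai import OpenAI  # assumes the openai v1.x Python SDK

client = OpenAI()

def reasoning_tokens_for(prompt: str) -> int:
    # Sketch only: the usage field below is how o1 models reported hidden
    # reasoning-token counts at the time of writing; names may change.
    resp = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.completion_tokens_details.reasoning_tokens

# Averaging these counts per reference plan length (as in the earlier bucketing
# sketch) yields curves of the kind shown in Figure 2.
```
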
figure 3

Figure 3 (partially out of frame in the provided document) appears to be a line graph showing how the accuracy of different models changes as the length of the plan (number of steps) increases. The x-axis represents 'Plan Length', and the y-axis represents '% correct'. Multiple lines, likely representing different models, are plotted on the graph, showing a general trend of decreasing accuracy with increasing plan length.

First Mention

Text: "Figure 3: All models are evaluated on Blocksworld. ol-preview outperforms the other LLMs, but its performance degrades more quickly with length."

Context: This figure, first mentioned on page 4, is part of the discussion on evaluating o1 and its performance on the Blocksworld task, especially as the problem size increases. It is presented after the discussion of o1's performance on the original test set and before the analysis of unsolvable instances.

Relevance: This figure is important because it shows how well o1-preview scales to more complex planning problems compared to other LLMs. The ability to handle longer plans is a key indicator of a model's reasoning capabilities.

Critique
Visual Aspects
  • The figure is unfortunately cut off in the provided document, making it impossible to fully interpret the data or see the legend.
  • The visible portion suggests that the lines might be close together, which could make it hard to distinguish between different models. Different line styles or markers would help.
  • The caption should be placed below the figure for better readability.
Analytical Aspects
  • The caption mentions that o1-preview's performance degrades quickly with length, but it would be more informative to quantify this degradation (e.g., 'accuracy drops by X% for every Y additional steps').
  • It's unclear from the visible portion what the range of plan lengths tested is. This information is crucial for understanding the scope of the evaluation.
  • The caption could explain why performance degradation with length is a significant issue for planning models.
Numeric Data
figure 3

Figure 3 is a line graph showing how the accuracy of the o1-preview model changes as the problems it tries to solve get more complex. The x-axis represents the number of steps needed to solve a problem (such as the number of moves required to stack blocks a certain way), ranging from 20 to 40. The y-axis represents how often the model gets the right answer (percentage correct), from 0% to 100%. Multiple lines appear on the graph, likely comparing o1-preview's performance to other models, and the overall trend shows accuracy falling as the required number of steps increases.

First Mention

Text: "Figure 3: Extending even the (regular, not obfuscated) Blocksworld dataset to problems requiring greater numbers of steps worsens the performance of o1-preview. When tested on 110 instances which each require at least 20 steps to solve, it only manages 23.63%."

Context: This figure is introduced on page 4 in the section 'From Approximate Retrieval to Approximate Reasoning: Evaluating o1'. It is presented to show how o1-preview's performance changes as the complexity of the Blocksworld problems increases.

Relevance: This figure is important because it shows that even though o1-preview is better than older models, it still has trouble with harder problems. This tells us that there's still room for improvement in how these models reason and plan.

Critique
Visual Aspects
  • The graph is partially cut off on page 4, making it hard to see the full picture.
  • The lines and shaded areas overlap, making it difficult to distinguish the performance of different models or conditions.
  • The labels on the axes are clear, but the legend or model descriptions are not fully visible on page 4.
Analytical Aspects
  • The figure focuses on how accuracy changes with problem complexity, which is a key aspect of evaluating planning models.
  • The caption clearly states the main takeaway: o1-preview's performance degrades with increasing problem size.
  • The figure would be stronger if it included some measure of variability or uncertainty, like error bars or confidence intervals.
Numeric Data
  • o1-preview Accuracy on 20+ step problems: 23.63 %
  • Number of instances tested: 110
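
For scale, 23.63% of 110 instances corresponds to roughly 26 correctly solved problems (26/110 ≈ 23.6%).
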
table 3

Table 3 shows how well OpenAI's o1-preview model can tell when a block-stacking problem is impossible to solve. It looks at two types of problems: regular 'Blocksworld' and a more complicated version called 'Randomized Mystery Blocksworld'. The table reports two rates: 1. The 'True Negative' rate: how often the model correctly says a problem is impossible when it actually is. 2. The 'False Negative' rate: how often the model wrongly says a problem is impossible when it actually has a solution. Think of it like a screening test for impossibility: you want it to flag the problems that are genuinely unsolvable (true negatives) without mislabeling solvable problems as impossible (false negatives).

First Mention

Text: "Table 3: Rate of claiming that a problem is impossible by OpenAI’s ol-preview on 100 unsolvable and 600 solvable instances in the Blocksworld and Randomized Mystery Blocksworld domains. The True Negative rate is the percent of unsolvable instances that were correctly marked as unsolvable. The False Negative rate is the percent of solvable instances that were incorrectly marked as unsolvable. Previous models are not shown in this table as their true negative and false negative rates were generally 0% across the board."

Context: This table appears on page 5, within the section discussing o1's evaluation. It follows the discussion of o1-preview's performance on larger Blocksworld problems and precedes the analysis of accuracy/cost tradeoffs.

Relevance: This table is important because it shows a new aspect of o1-preview's abilities - figuring out when a problem can't be solved at all. This is useful in real-world situations where knowing something is impossible is just as important as finding a solution.

Critique
Visual Aspects
  • The table is simple and easy to understand.
  • Clearly labeling the columns and rows makes the data easy to interpret.
Analytical Aspects
  • The table clearly shows that o1-preview is better at identifying impossible problems in regular 'Blocksworld' than in 'Randomized Mystery Blocksworld'.
  • The caption explains 'True Negative' and 'False Negative' rates clearly, which is helpful for a non-expert audience.
  • The table could be improved by showing results for other models, even if their rates are mostly 0%, to provide a better comparison.
Numeric Data
  • Blocksworld True Negative Rate: 27 %
  • Blocksworld False Negative Rate: 0 %
  • Randomized Mystery Blocksworld True Negative Rate: 16 %
  • Randomized Mystery Blocksworld False Negative Rate: 11.5 %
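
To pin down how the two rates are computed, here is a minimal Python sketch; the `verdicts` records are hypothetical placeholders, and the counts in the closing comment come from the Blocksworld column of Table 3.

```python
# `verdicts` is a hypothetical list of (instance_is_solvable, model_said_unsolvable) pairs.
def negative_rates(verdicts):
    unsolvable_flags = [said for solvable, said in verdicts if not solvable]
    solvable_flags = [said for solvable, said in verdicts if solvable]
    # True negative rate: unsolvable instances correctly flagged as unsolvable.
    tn_rate = 100.0 * sum(unsolvable_flags) / len(unsolvable_flags)
    # False negative rate: solvable instances incorrectly flagged as unsolvable.
    fn_rate = 100.0 * sum(solvable_flags) / len(solvable_flags)
    return tn_rate, fn_rate

# With Table 3's Blocksworld numbers (100 unsolvable and 600 solvable instances),
# 27 correct "impossible" calls and 0 incorrect ones give 27% and 0% respectively.
```
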
table 4

This table presents the cost per 100 instances, measured in US dollars, for using different Large Language Models (LLMs) and Large Reasoning Models (LRMs). It separates the models into two categories: LLMs (like Claude, GPT variants, and Gemini) and LRMs (specifically o1-preview and o1-mini).

First Mention

Text: "Table 4: Cost per 100 instances (in USD). LRMs are significantly more expensive than LLMs."

Context: This table is presented on page 6, within the discussion of accuracy/cost tradeoffs and guarantees for LRMs. It follows the analysis of o1's performance and precedes a comparison with classical planners and LLM-Modulo systems.

Relevance: This table is highly relevant because it directly addresses the cost implications of using LRMs for planning tasks. It highlights the significant cost difference between LLMs and LRMs, which is a crucial factor to consider when evaluating their practical applicability.

Critique
Visual Aspects
  • The table is clear and easy to read, with a simple structure that facilitates quick comparison between models.
  • The division into LLMs and LRMs is helpful for understanding the cost disparities between the two model types.
Analytical Aspects
  • The table effectively communicates the substantial cost difference between LLMs and LRMs, emphasizing the financial implications of using more computationally intensive models.
  • While the table focuses on cost, it would be beneficial to connect these costs to the performance differences shown in earlier tables and figures. This would provide a more comprehensive view of the cost-benefit tradeoff.
  • The table could also include a brief explanation of the 'reasoning tokens' and how they contribute to the higher cost of LRMs.
Numeric Data
  • Cost of o1-preview (per 100 instances): 42.12 USD
  • Cost of o1-mini (per 100 instances): 3.69 USD
  • Cost of GPT-4 (per 100 instances): 1.8 USD
  • Cost of GPT-4 Turbo (per 100 instances): 1.2 USD
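
As a back-of-envelope illustration (not a figure reported in the paper), the sketch below combines Table 4's costs with Table 2's Blocksworld accuracy to estimate the cost per correctly solved instance.

```python
# Back-of-envelope sketch combining Table 4 (cost per 100 instances, USD) with
# Table 2's Blocksworld accuracy; this derived figure is not reported in the paper.
cost_per_100 = {"o1-preview": 42.12, "o1-mini": 3.69, "GPT-4": 1.80, "GPT-4 Turbo": 1.20}
blocksworld_accuracy = {"o1-preview": 587 / 600}  # 97.8%, from Table 2

model = "o1-preview"
cost_per_correct = (cost_per_100[model] / 100) / blocksworld_accuracy[model]
print(f"{model}: ~${cost_per_correct:.2f} per correctly solved Blocksworld instance")
# ≈ $0.43, compared with under $0.02 per instance (before accounting for accuracy)
# for the GPT-4 models listed in Table 4.
```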

Conclusion

Overview

This conclusion summarizes the findings of the study, highlighting the improved performance of Large Reasoning Models (LRMs like OpenAI's o1) compared to traditional LLMs on planning tasks using the PlanBench benchmark. While LLMs showed some progress on basic Blocksworld problems, they struggled with more complex or obfuscated versions. LRMs, particularly o1, demonstrated significantly better accuracy but still faced limitations with longer problems and unsolvable instances. The conclusion also emphasizes the importance of considering accuracy/efficiency trade-offs and the lack of correctness guarantees with LRMs, suggesting alternative approaches like LLM-Modulo systems or dedicated solvers for certain applications.

Key Aspects

Strengths

Suggestions for Improvement
