Measuring AI Ability to Complete Long Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
arXiv
Model Evaluation & Threat Research (METR)

Overall Summary

Study Background and Main Findings

This paper addresses the challenge of translating AI benchmark scores into meaningful real-world capabilities. Current benchmarks often saturate quickly or use artificial tasks, making it difficult to gauge AI progress on complex, long-duration work relevant to human experts. To overcome this, the authors propose a novel metric: the 'task completion time horizon'. This metric quantifies AI capability by determining the typical time required by human experts to complete tasks that a given AI model can successfully accomplish with a specific probability (e.g., 50%). It provides an intuitive, continuous measure anchored in human performance.

The methodology involved compiling a diverse suite of tasks spanning software engineering and AI research, ranging from seconds to many hours in human completion time. These tasks were drawn from existing benchmarks (HCAST, RE-Bench) and newly created short tasks (SWAA). Human professionals with relevant expertise were timed performing these tasks to establish baseline durations. Subsequently, various frontier AI models released between 2019 and 2025 were evaluated on the same tasks using standardized agent frameworks ('scaffolding') that provide tools for interaction. By modeling AI success probability as a function of human task time using logistic regression, the 50% time horizon was calculated for each model.
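To make the logistic-fit step concrete, a minimal sketch is given below; it is an illustration rather than the authors' code, the task times and success rates are hypothetical, and the paper itself fits weighted logistic regressions to individual run outcomes rather than to pre-aggregated per-task rates.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-task data: human completion time (minutes) and the
# fraction of runs a given model completed successfully on each task.
human_minutes = np.array([0.1, 0.5, 2.0, 8.0, 30.0, 120.0, 480.0])
success_rate = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.15, 0.05])

def logistic(log_t, a, b):
    """Success probability as a function of log task length."""
    return 1.0 / (1.0 + np.exp(-(a + b * log_t)))

(a, b), _ = curve_fit(logistic, np.log(human_minutes), success_rate, p0=(0.0, -1.0))

# The 50% horizon is where the fitted curve crosses p = 0.5,
# i.e. a + b * log(t) = 0, so t = exp(-a / b).
print(f"50% time horizon ≈ {np.exp(-a / b):.1f} human-minutes")
```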

The central finding is that the 50% time horizon for frontier AI models on these tasks has grown exponentially since 2019, doubling approximately every seven months (95% CI: 171-249 days). Current leading models like Claude 3.7 Sonnet achieve a horizon of around 50-60 minutes. Qualitative analysis suggests this rapid improvement is driven by enhanced reliability, better adaptation to mistakes, improved logical reasoning, and more effective tool use. However, the study also found a significant gap between the 50% horizon and the 80% horizon (requiring higher reliability), where the latter is much shorter (~15 minutes for Claude 3.7 Sonnet), indicating challenges remain in achieving dependable performance on longer tasks.

The authors conducted several experiments to test the robustness and external validity of their findings, including analyzing performance on tasks with varying 'messiness' (real-world complexity factors), comparing results with an external benchmark (SWE-bench), and evaluating models on real internal software issues. While the exponential trend appeared robust across different conditions, the absolute values and doubling times showed sensitivity to the benchmark and human time estimation methods. The paper concludes by discussing the significant implications of this rapid growth for future automation potential (extrapolating to month-long tasks within ~5 years, albeit with caveats) and AI safety, while carefully acknowledging limitations related to task representativeness and the interpretation of human baselines.
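As a rough back-of-the-envelope check on that extrapolation (the 1-hour starting horizon and the 167-hour work-month below are illustrative assumptions, not figures taken from the paper):

```python
import math

current_horizon_hours = 1.0   # assumed ~1 hour for a 2025 frontier model
doubling_time_months = 7.0    # headline doubling time from the paper
target_hours = 167.0          # roughly one working month of human time

doublings = math.log2(target_hours / current_horizon_hours)   # ~7.4 doublings
years = doublings * doubling_time_months / 12                 # ~4.3 years
print(f"{doublings:.1f} doublings ≈ {years:.1f} years")
```

Roughly 7.4 doublings at seven months each lands near 4.3 years, broadly consistent with the paper's caveated ~5-year extrapolation.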

Research Impact and Future Directions

This research introduces a valuable new metric, the 'task completion time horizon,' offering a more intuitive way to track AI progress in handling complex, time-consuming tasks compared to traditional benchmarks. The core finding of an exponential increase in this horizon, doubling roughly every seven months since 2019 within the domain of software and research engineering tasks, is statistically robust within the study's framework and suggests a rapid acceleration of AI capabilities relevant to automating intellectual labor.

However, interpreting the practical implications requires significant caution. The time horizon metric is highly sensitive to methodological choices, particularly the nature and context of the human baselines used for comparison. The study itself demonstrates that AI performance aligns better with the performance of low-context humans (such as external contractors) than with that of high-context experts, meaning the measured horizons might overestimate the time AI would save compared to experienced employees working in familiar environments. Furthermore, the benchmark tasks, despite efforts towards realism, still differ systematically from the 'messiness' and dynamic nature of many real-world jobs, raising questions about the external validity of the observed trend. The difference in observed doubling times when using an external benchmark (SWE-bench) further underscores the sensitivity to task selection and human time estimation methods.

Therefore, while the paper provides compelling evidence of rapid AI progress on increasingly complex benchmark tasks, directly extrapolating this trend to predict timelines for widespread automation of specific real-world jobs (like the 5-year prediction for month-long software tasks) is speculative. The study highlights a significant trend but also underscores the critical need for more research into AI performance on truly naturalistic tasks and the complex factors influencing real-world deployment. Key unanswered questions remain about the sustainability of the exponential trend and how well performance on these benchmarks translates to diverse, dynamic, and context-rich professional environments.

Critical Analysis and Recommendations

Clear Problem Statement and Proposed Metric (written-content)
Observation: The abstract clearly states the problem of ambiguous AI benchmark meaning and proposes the '50%-task-completion time horizon' metric. Impact: This immediately establishes the paper's motivation and core contribution, providing a clear framework for understanding the research.
Section: Abstract
Concise Summary of Key Findings (written-content)
Observation: The abstract concisely summarizes the key quantitative findings: a ~50-minute current horizon for frontier models and a historical doubling time of ~7 months. Impact: This provides readers with the essential results upfront, highlighting the magnitude and speed of AI progress according to the proposed metric.
Section: Abstract
Briefly Define Task Domains (written-content)
Issue: The abstract lists task sets (RE-Bench, HCAST, SWAA) without defining their domains. Impact: Adding brief parenthetical descriptions (e.g., 'AI research & engineering', 'diverse software engineering') would improve immediate comprehension for readers unfamiliar with these specific benchmarks, clarifying the scope of evaluation early on.
Section: Abstract
Clear Problem Definition (Benchmark Limitations) (written-content)
Observation: The introduction clearly outlines the limitations of existing AI benchmarks (artificiality, saturation, lack of unified metric). Impact: This effectively motivates the need for the novel 'task completion time horizon' metric proposed by the paper.
Section: Introduction
Effective Visualization of Exponential Growth Trend (Fig 1) (graphical-figure)
Observation: Figure 1 visually presents the core finding of exponential growth in the 50% time horizon using a log-linear plot, including trend lines and confidence intervals. Impact: This graphical representation makes the rapid, consistent improvement trend immediately apparent and provides quantitative details (doubling time, R-squared) supporting the claim.
Section: Introduction
Detailed Human Baselining Methodology (written-content)
Observation: The paper details the methodology for establishing human baseline performance, including recruitment, environment, incentives, and data collection across different task suites. Impact: This transparency and detail lend credibility to the human time estimates, which are fundamental to the calculation and interpretation of the AI time horizon metric.
Section: Methods
Logical Derivation of Time Horizon Metric (written-content)
Observation: The derivation of the time horizon metric using logistic regression, inspired by Item Response Theory but anchored by human time, is clearly explained. Impact: This provides a logical and statistically grounded basis for the core metric, allowing readers to understand how AI success rates are translated into a time-based capability measure.
Section: Methods
Robust Statistical Analysis of Trends (written-content)
Observation: Robust statistical methods, including OLS regression on log-transformed data and hierarchical bootstrapping for confidence intervals, are used to analyze the time horizon trend. Impact: This adds statistical rigor to the quantitative findings, particularly the estimation of the doubling time and its uncertainty.
Section: Methods
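A minimal sketch of what such a two-level (family, then task) bootstrap could look like is given below; the data layout and the refitting function `horizon_fn` are assumptions made for illustration rather than the authors' exact procedure.

```python
import numpy as np

def hierarchical_bootstrap(horizon_fn, runs_by_family, n_boot=1000, seed=0):
    """Resample task families, then tasks within each family, then refit.

    runs_by_family: dict mapping family name -> list of per-task run records.
    horizon_fn: function that refits the trend on a resampled dataset and
                returns the quantity of interest (e.g. doubling time in days).
    """
    rng = np.random.default_rng(seed)
    families = list(runs_by_family)
    estimates = []
    for _ in range(n_boot):
        sampled_families = rng.choice(families, size=len(families), replace=True)
        resample = []
        for fam in sampled_families:
            tasks = runs_by_family[fam]
            idx = rng.integers(0, len(tasks), size=len(tasks))
            resample.extend(tasks[i] for i in idx)
        estimates.append(horizon_fn(resample))
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return lo, hi  # 95% percentile confidence interval
```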
Explicitly Justify Task Family Weighting in Main Text (written-content)
Issue: The rationale for weighting tasks by the inverse square root of their family size (mentioned in Fig 3 caption) is not explicitly stated in the main methods text. Impact: Briefly explaining this weighting's purpose (e.g., ensuring diverse representation, preventing large families from dominating results) in the main text would strengthen the methodological justification.
Section: Methods
Clear Comparison of 50% and 80% Time Horizons (written-content)
Observation: The results clearly compare the 50% and 80% time horizons, showing similar doubling times but significantly lower absolute horizons at the 80% reliability level. Impact: This highlights the crucial gap between AI achieving occasional success and achieving high reliability on complex tasks, providing a more nuanced view of current capabilities.
Section: Results
Comprehensive External Validity and Robustness Checks (written-content)
Observation: The paper includes multiple supplementary experiments (retrodiction, messiness analysis, SWE-bench comparison, internal PRs) to assess external validity and robustness. Impact: This multi-pronged approach significantly strengthens the paper's claims by testing the core findings under different conditions and against different data sources, addressing potential limitations.
Section: Results
Nuanced Analysis of 'Messiness' Impact (written-content)
Observation: The analysis of 'messiness' factors finds that while models perform worse on messier tasks, the rate of improvement over time is similar for high and low messiness subsets. Impact: This nuanced finding addresses concerns about potential plateaus on more realistic tasks, suggesting that progress is occurring even on tasks with greater complexity, at least within the evaluated range.
Section: Results
Insightful Internal PR Experiment (Low vs High Context) (written-content)
Observation: The internal PR experiment shows AI performance aligns better with low-context contractor time than high-context maintainer time. Impact: This provides a crucial insight for interpreting the time horizon metric, suggesting it may better reflect the capability to replace low-context human labor rather than highly experienced experts working in familiar domains.
Section: Results
Nuanced Interpretation of the Metric (written-content)
Observation: The discussion thoughtfully addresses the complexities of interpreting the time horizon metric, emphasizing its dependence on task distribution and the context/skill of human baseliners. Impact: This demonstrates a sophisticated understanding of the metric's limitations and guides readers towards a more nuanced interpretation of the results.
Section: Discussion
Candid Acknowledgment of Limitations and Future Work (written-content)
Observation: The authors candidly acknowledge key limitations (model elicitation, baselining rigor, task naturalness, limited compute use) and propose specific directions for future work. Impact: This transparency enhances the paper's credibility and provides a valuable roadmap for subsequent research in AI capability evaluation.
Section: Discussion
Highlights Economic Implications (Inference Cost, Fig 13) (graphical-figure)
Observation: Figure 13 shows that the computational cost of successful AI runs is often a small fraction (<10%) of the estimated cost of human expert labor for the same task duration. Impact: This highlights the economic potential of AI automation and suggests significant headroom for improving AI performance (e.g., via more compute-intensive methods) while remaining cost-competitive.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: The length of tasks (measured by how long they take human...
Full Caption

Figure 1: The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years (Section 4).

Figure/Table Image (Page 2)
First Reference in Text
We find that the 50% time horizon has been growing exponentially from 2019-2025 on our tasks, with a doubling time of approximately seven months (Figure 1).
Description
  • AI Capability Metric: 50% Task Completion Time Horizon: The figure illustrates how the capability of advanced AI systems, specifically 'generalist autonomous frontier model agents' (cutting-edge AIs designed for diverse, independent task execution), has changed over time. Capability is measured using a metric called the '50% task completion time horizon'. This represents the maximum duration a typical human expert would need to complete a task that the AI system can successfully finish 50% of the time it tries.
  • Data Representation: The figure shows data points representing various AI models released between 2019 and 2025. For each model, the corresponding point indicates the calculated 50% time horizon on the vertical axis (logarithmic scale) plotted against its release date on the horizontal axis.
  • Observed Trend: Exponential Growth: The central finding highlighted is that this AI capability metric has been increasing 'exponentially'. This means the rate of improvement is accelerating, similar to how money grows with compound interest. The figure indicates this time horizon has been doubling approximately every 7 months over the past 6 years.
  • Trend Quantification: The figure includes a trend line fitted to the data points, visually representing the exponential growth. Text within the figure quantifies this trend, stating a doubling time of 7 months and a high R-squared value (R² = 0.98), a statistical measure suggesting the exponential model fits the observed data very well.
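A minimal sketch of the log-linear trend fit that yields such a doubling time is shown below; the (release date, horizon) pairs are hypothetical values of roughly the magnitudes quoted in this review, not the paper's underlying data.

```python
import numpy as np

# Hypothetical (release date, 50%-horizon in minutes) pairs of roughly the
# magnitudes discussed in this review; not the paper's underlying data.
release_year = np.array([2019.1, 2020.5, 2022.2, 2023.2, 2024.5, 2025.1])
horizon_minutes = np.array([0.03, 0.15, 1.0, 5.0, 25.0, 59.0])

# OLS fit of log2(horizon) on release date: the slope is doublings per year.
slope, intercept = np.polyfit(release_year, np.log2(horizon_minutes), 1)
print(f"Doubling time ≈ {365.25 / slope:.0f} days")
```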
Scientific Validity
  • Novelty and Potential Limitations of the Metric: The '50% time horizon' metric is a novel and intuitive way to quantify AI agent capabilities in terms of human task completion time. However, its validity depends heavily on the representativeness of the chosen task suite (HCAST, RE-Bench, SWAA) and the reliability of the human baseline time estimates. The paper acknowledges potential limitations regarding external validity.
  • Robustness of the Exponential Trend Fit: The claim of exponential growth (doubling time of ~7 months) appears strongly supported by the data presented visually and the high R-squared value (0.98) reported for the fit. This suggests a consistent trend over the 2019-2025 period for the specific tasks evaluated.
  • Reliance on Human Baselines: The measurement relies on comparing AI performance to 'human professionals'. The definition, selection, and performance consistency of these human baseliners are crucial for the metric's validity. Variability in human skill, motivation, and context could introduce noise or bias into the task time estimates, affecting the calculated AI time horizons.
  • Scope and Generalizability: The observed trend reflects the performance of specific AI models within particular agent scaffolds on a defined set of tasks. Generalizing this trend to broader AI capabilities or different task domains requires caution, as performance can be sensitive to the evaluation setup and task distribution.
  • Potential Confounding Factors: While the trend is strong, attributing it solely to core AI model improvements versus advancements in agent scaffolding, tool use integration, or prompt engineering techniques used during evaluation is complex. The paper attempts to use consistent scaffolding, but disentangling these factors remains a challenge.
Communication
  • Clarity and Conciseness of Caption: The caption clearly states the figure's main finding: the exponential growth trend in the 50% task completion time horizon for AI agents. It effectively defines the metric (task length solvable by AI with 50% success, measured in human time) and quantifies the trend (doubling every ~7 months over 6 years).
  • Appropriateness of Visual Representation: The figure uses a scatter plot with model release date on the x-axis and the logarithm of human task time on the y-axis. This logarithmic scaling on the y-axis is appropriate for visualizing exponential growth as a linear trend, making the doubling time concept easier to grasp.
  • Inclusion of Trend Metrics and Uncertainty: The inclusion of the trend line, the calculated doubling time (7 months), the 95% confidence interval for the doubling time (171 to 249 days, mentioned in figure text), and the R-squared value (0.98, indicating a strong fit) enhances the figure's informativeness and allows assessment of the trend's robustness and uncertainty.
  • Contextual Labeling of Data Points: Labeling specific AI models (e.g., GPT-2, GPT-4 0314, Claude 3.7 Sonnet) on the plot provides valuable context and allows readers to associate specific capability levels with known systems.

Methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: Our methodology for measuring AI agent time horizon.
Figure/Table Image (Page 3)
First Reference in Text
We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.
Description
  • Overall Methodology Overview: The figure outlines a three-step process designed to measure the 'time horizon' of AI agents. The time horizon is a way to quantify how advanced an AI is by determining the length of tasks it can reliably complete, where task length is measured by how long it takes a skilled human.
  • Step 1: Task Suite Creation: The first step involves creating a 'Diverse Task Suite'. This means assembling a collection of tasks varying in type and difficulty. The figure shows examples like HCAST (diverse agency tasks, 1 min-30 hrs), SWAA (short software actions, 1-30 sec), and RE-Bench (AI R&D tasks, 8 hrs), totaling 170 tasks used in the study. Task diversity is important to get a broad measure of capability.
  • Step 2: Performance Measurement (Human & AI): The second step is 'Task Performance' measurement. This involves having both humans ('Human Runs') and AI systems ('Agent Runs') attempt the tasks. For humans, the time taken to successfully complete a task is recorded to establish a 'Time Estimate' for its difficulty. For AI agents, their 'Success Rate' (how often they complete the task correctly) is measured.
  • Step 3: Time Horizon Calculation and Trend Analysis: The third step is 'Time Horizon Analysis'. This uses the data from Step 2. It involves finding the 'Horizon Length' for each AI model – essentially, the human time estimate corresponding to tasks the AI can solve with a specific success rate (e.g., 50%). This Horizon Length is then plotted against the 'Model Release Dates' to analyze trends over time, such as calculating the 'Doubling Time' – how long it takes for the AI's time horizon to double.
Scientific Validity
  • Structured Quantitative Approach: The methodology provides a structured, quantitative approach to tracking AI progress based on task completion capabilities relative to human performance, moving beyond simple benchmark scores. The concept of 'time horizon' offers an interpretable metric.
  • Dependence on Task Suite Representativeness: The validity hinges significantly on the quality and diversity of the task suite. If the suite is not representative of real-world tasks or relevant capabilities, the measured time horizon may not generalize. The paper uses HCAST, RE-Bench, and SWAA, aiming for diversity, but representativeness is inherently difficult to guarantee.
  • Reliance on Human Baseline Quality: Establishing accurate and unbiased human baseline times ('Time Estimate') is critical and challenging. Factors like baseliner skill, experience, motivation, and the exclusion of failed attempts can influence the time estimates and thus the calculated AI horizons.
  • Methodological Choices in Analysis: The use of a specific success rate threshold (e.g., 50% or 80%) and a logistic model (mentioned in text, abstracted in figure) to determine the time horizon involves methodological choices. The sensitivity of the results to these choices (threshold level, specific regression model) should be considered.
  • Focus on Task Completion Time/Success: The methodology focuses on task completion success and time, but doesn't explicitly capture other aspects of performance like efficiency (e.g., computational cost, mentioned later in Fig 13), robustness, or safety, which are also crucial dimensions of AI capability.
Communication
  • Flowchart Structure and Clarity: The figure effectively uses a three-panel flowchart structure to visually summarize the core methodology, enhancing reader comprehension compared to text alone. The numbered steps (1. Diverse Task Suite, 2. Task Performance, 3. Time Horizon Analysis) provide a clear, logical flow.
  • Iconography and Labeling: The use of simple icons (e.g., clock, checkmark, cross, graph) and concise labels within each panel aids in quickly grasping the key inputs and outputs of each methodological step (e.g., Time Estimate, Success Rate, Horizon Length, Doubling Time).
  • Visual Summary Effectiveness: The diagram serves as an excellent visual abstract of the methodology described in the text (Sections 3 and 4), providing a high-level overview before delving into details. It helps orient the reader to the overall process.
  • Abstraction of Analysis Details: While generally clear, the specific mathematical relationship or modeling technique used in Step 3 (Time Horizon Analysis) to derive Horizon Length and Doubling Time from Time Estimates, Success Rates, and Model Release Dates is abstracted. Readers need to consult the text (Section 4.1) for the logistic regression details.
Table 1: Example tasks of differing durations.
Figure/Table Image (Page 7)
First Reference in Text
We find that our contract baseliners take 5x-18x longer to resolve issues than repository maintainers.
Description
  • Purpose of the Table: The table presents five specific examples of tasks used to evaluate AI agents, chosen to illustrate the variety of task types and the wide range of time they typically take skilled humans to complete.
  • Table Structure and Content: Each row represents a different task, identified by a 'Family' name (like 'find_shell_script' or 'cuda_backtesting'). It lists the estimated 'Length' (time for a human expert, ranging from 3 seconds to 8 hours) and provides a brief 'Description' of what the task involves.
  • Range of Task Examples and Durations: The tasks range significantly in complexity and duration. For instance, 'find_shell_script' is a quick multiple-choice question taking 3 seconds. 'wikipedia_research' involves finding factual information, taking about 1 minute. 'munge_data' requires writing a script to transform data formats, estimated at 56 minutes. 'cuda_backtesting' is a complex programming task involving optimizing code using CUDA (a platform for parallel computing using NVIDIA graphics cards) for financial backtesting (simulating trading strategies on historical data), estimated to take 8 hours.
  • Variety of Skills Illustrated: The descriptions highlight different skills tested, including basic knowledge retrieval, debugging (fixing errors in code or data, like in 'oxdna_simple' which involves a molecular dynamics simulation package called oxDNA), data processing, and advanced programming/optimization.
Scientific Validity
  • Representativeness of Examples: The table provides anecdotal examples rather than a systematic overview of the entire task distribution. While illustrative, these few examples may not fully represent the balance or characteristics of the 170 tasks used in the study.
  • Accuracy of Time Estimates: The 'Length' column represents estimated human completion times. The accuracy and consistency of these estimates (whether derived from actual baselining or researcher estimation) are crucial for the table's validity as an illustration of task difficulty scaling. The text mentions these are based on geometric means or estimates.
  • Plausibility of Task Descriptions and Durations: The tasks selected appear relevant to software engineering, ML, and general reasoning, aligning with the paper's focus. The descriptions seem plausible for the estimated durations.
  • Illustrative Purpose vs. Data Presentation: The table serves primarily as an illustrative tool within the 'Methods' section to give context to the task suite. Its scientific contribution is limited to providing concrete examples rather than presenting primary data or analysis.
Communication
  • Clarity and Structure: The table effectively uses a simple columnar format (Family, Length, Description) to present concrete examples of tasks used in the study, making it easy to grasp the range of complexities and time scales involved.
  • Illustrative Examples: Providing specific examples with estimated human completion times (from seconds to hours) gives readers a tangible sense of the different capability levels being measured. This is more illustrative than abstract descriptions alone.
  • Conciseness and Informativeness of Descriptions: The descriptions, while concise, offer sufficient detail to understand the nature of each task (e.g., multiple choice, research, bug fixing, data transformation, code optimization).
  • Demonstration of Task Range: The selection spans several orders of magnitude in duration (3 seconds to 8 hours), effectively conveying the wide range of task difficulties included in the overall benchmark suite.
Table 2: The source of our time estimates by task suite.
Figure/Table Image (Page 8)
First Reference in Text
In total, 148 of our 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.
Description
  • Purpose of the Table: This table details where the researchers got the 'time estimates' – the data on how long each task typically takes a human expert – for the different sets of tasks (called 'task suites') used in their study.
  • Breakdown by Task Suite and Source: It breaks down the task suites: HCAST, RE-Bench, and SWAA. For each suite, it specifies the 'Time Estimate Source', meaning whether the time was determined by 'Baseline' (timing actual humans performing the task) or 'Estimate' (researchers estimating the time, likely based on judgment or indirect data).
  • Numerical Breakdown of Tasks: The table shows the 'Number of Tasks' for each category. For HCAST, 76 tasks had times derived from human baselines, while 21 tasks had researcher-estimated times. For RE-Bench, all 6 tasks used human baselines. For SWAA, all 66 tasks used human baselines. No tasks in RE-Bench or SWAA relied on researcher estimates.
  • Summary of Data Sources: Overall, the table indicates that out of the 169 tasks considered in this part of the analysis (76+21+6+66), the vast majority (76+6+66 = 148 tasks) had their difficulty measured using direct human performance data, while a smaller portion (21 tasks, all within the HCAST suite) relied on researcher estimations.
Scientific Validity
  • Use of Researcher Estimates: The table highlights a potential source of uncertainty or bias in the overall results, as 21 out of 97 HCAST tasks (and 21 out of 169 total tasks) rely on researcher estimates rather than direct human baselining. Researcher estimates can be subjective and less reliable than empirical data.
  • Potential Bias in HCAST Estimates: The reliance on estimates specifically within the HCAST suite, which contains longer and potentially more complex tasks, might disproportionately affect the time horizon calculations for more capable models if these estimates are systematically biased (e.g., under- or over-estimating difficulty).
  • Transparency Regarding Data Origin: The transparency in reporting the source for each task suite is commendable and allows readers to assess the potential impact of the estimation method. It strengthens the credibility of the study by acknowledging this limitation.
  • Predominance of Baseline Data: The fact that the majority of tasks (148/169) and all tasks in RE-Bench and SWAA are based on human baselines provides a reasonably strong empirical foundation for the overall analysis, mitigating concerns about the researcher estimates to some extent.
Communication
  • Clarity and Structure: The table clearly and concisely summarizes the origin of the human time estimates for each task suite (HCAST, RE-Bench, SWAA). The structure (Suite, Source, Number of Tasks) is logical and easy to follow.
  • Transparency of Data Sources: It effectively communicates the reliance on actual human performance data ('Baseline') for the majority of tasks (148 out of 169, or ~88%), while transparently indicating where researcher judgment ('Estimate') was used (21 HCAST tasks).
  • Complementary to Text: The table provides a useful breakdown that complements the textual description, allowing readers to quickly grasp the data foundation for the time horizon calculations.
Figure 3: Average task success rate across our entire combined suite, for each...
Full Caption

Figure 3: Average task success rate across our entire combined suite, for each model.

Figure/Table Image (Page 9)
First Reference in Text
We performed 8 runs per agent/task pair and report the average results in Figure 3.
Description
  • Purpose: Model Performance Comparison: This figure presents a bar chart comparing the overall performance of various AI models released between approximately 2019 and 2025.
  • Metric Shown: Average Success Rate: The height of each bar represents the 'Average task success rate' for a specific AI model, calculated across all tasks in the combined benchmark suite (HCAST, SWAA, and RE-Bench). Success rate is the percentage of tasks the AI completed successfully.
  • Models Compared and General Trend: The models evaluated include early models like GPT-2 (showing near 0% success) and davinci-002 (GPT-3, also low success), various versions of GPT-4, Claude 3 models, and the o1 models. The bars generally increase in height from left to right, indicating that newer models tend to perform better.
  • Diversity Weighting: The success rates are 'diversity-weighted'. As explained in the caption's footnote, this means tasks belonging to larger 'families' (groups of similar tasks) have their contribution to the average reduced (specifically, weighted by the inverse square root of the family size). This prevents the overall average from being skewed by performance on just one or two large groups of similar tasks.
  • Performance of Recent Models: Recent models like Claude 3.7 Sonnet and o1 show the highest average success rates, appearing to exceed 60-70% on this weighted average across the task suite.
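One plausible reading of the inverse-square-root family weighting described above is sketched below; the function and the toy data are illustrative and not taken from the authors' code.

```python
import numpy as np

def diversity_weighted_mean(success, family):
    """Average per-task success with each task weighted by 1/sqrt(family size)."""
    success, family = np.asarray(success, dtype=float), np.asarray(family)
    sizes = {f: np.sum(family == f) for f in np.unique(family)}
    weights = np.array([1.0 / np.sqrt(sizes[f]) for f in family])
    return float(np.sum(weights * success) / np.sum(weights))

# A four-task family counts less per task than two standalone tasks:
# the unweighted mean would be 0.67, the diversity-weighted mean is 0.50.
print(diversity_weighted_mean([1, 1, 1, 1, 0, 0],
                              ["big", "big", "big", "big", "solo1", "solo2"]))
```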
Scientific Validity
  • Basic Performance Overview: The figure presents a simple average success rate, which serves as a basic measure of overall capability across the defined task suite before more complex analysis (like time horizon calculation).
  • Validity of Averaging Across Diverse Tasks: The validity of the average success rate as a meaningful measure depends on the representativeness of the task suite and the appropriateness of the scoring criteria (often binary success/failure, as detailed later). Averaging can mask significant variations in performance across different task types or domains.
  • Justification and Impact of Diversity Weighting: The 'diversity weighting' (by inverse square root of family size) is a reasonable heuristic to mitigate the influence of large task families. However, the specific choice of weighting function is somewhat arbitrary, and different weighting schemes could yield different average scores.
  • Number of Runs per Task: The results are based on an average of 8 runs per agent/task pair (mentioned in reference text), which adds some robustness against single-run anomalies. However, the figure itself doesn't show the variability across these runs.
  • Influence of Agent Scaffolding: The performance reflects the specific combination of AI model and the 'scaffold' (the code environment and tools the AI uses to interact with tasks). Differences in scaffolding, especially for the o1 models (mentioned in Section 3.3.1), could influence performance comparisons.
Communication
  • Clarity of Bar Chart Format: The bar chart format provides a clear and straightforward visual comparison of the average performance levels across the different AI models tested.
  • Effective Ordering of Models: Models are ordered chronologically by approximate release date along the x-axis, effectively illustrating the general trend of increasing success rates over time.
  • Clear Axis Labeling: The y-axis is clearly labeled 'Average success rate' with a range from 0% to 100%, making the performance scale easy to interpret.
  • Informative Title and Caption: The title and caption clearly state what is being plotted (average success rate), the scope (entire combined suite), and mention the diversity weighting, providing essential context.
  • Lack of Variance Information: While showing averages is useful for a high-level overview, the chart does not display variance or confidence intervals for the success rates, which could provide additional insight into the reliability of the differences between models.
Figure 4: Model success rates are negatively correlated with how much time it...
Full Caption

Figure 4: Model success rates are negatively correlated with how much time it takes a human to complete the task.

Figure/Table Image (Page 10)
First Reference in Text
This decrease in success rate over length (Figure 4) is well-fit by an exponential model (R² ≈ 0.83 when regressing model success rate against the logarithm of human time-to-complete).
Description
  • Purpose: Relationship between Human Time and AI Success: This figure is a scatter plot designed to show the relationship between how long a task takes a human expert to complete and how successful AI models are, on average, at completing that same task.
  • Data Representation: Axes and Data Points: Each point on the graph represents a single task from the benchmark suite (HCAST, RE-Bench, or SWAA, indicated by different markers). The horizontal position (x-axis) of a point shows the time it typically takes a human to complete that task, plotted on a logarithmic scale (meaning equal distances represent multiplicative increases, like 1 sec, 10 sec, 100 sec). The vertical position (y-axis) shows the 'Mean Model Success Rate' – the average success rate achieved across all the different AI models tested on that specific task.
  • Observed Trend: Negative Correlation: The overall pattern of the points slopes downwards from left to right. This indicates a 'negative correlation': tasks that take humans a shorter time (left side) tend to have higher average success rates for AI models (top part), while tasks that take humans longer (right side) tend to have lower average success rates for AI models (bottom part).
  • Trend Quantification: Regression Line and R-squared: A dashed line representing a linear regression fit is drawn through the points, visually summarizing the trend. The figure also reports an R-squared (R²) value of 0.83. R-squared is a statistical measure of how well the regression line fits the data, ranging from 0 to 1. A value of 0.83 suggests that approximately 83% of the variation in the average AI success rate across tasks can be explained by the logarithm of the human completion time, indicating a strong relationship.
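To illustrate the kind of regression behind the quoted R² ≈ 0.83, a short sketch with made-up per-task numbers:

```python
import numpy as np

# Made-up per-task data: human time (minutes) and mean AI success rate.
human_minutes = np.array([0.05, 0.2, 1.0, 5.0, 20.0, 60.0, 240.0, 480.0])
mean_success = np.array([0.92, 0.85, 0.70, 0.55, 0.35, 0.22, 0.10, 0.06])

x = np.log(human_minutes)
slope, intercept = np.polyfit(x, mean_success, 1)
pred = slope * x + intercept
ss_res = np.sum((mean_success - pred) ** 2)
ss_tot = np.sum((mean_success - np.mean(mean_success)) ** 2)
print(f"slope = {slope:.3f} per log-unit, R² = {1 - ss_res / ss_tot:.2f}")
```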
Scientific Validity
  • Validation of Human Time as Difficulty Proxy: The figure provides strong evidence that human completion time serves as a reasonable proxy for task difficulty for AI models, averaged across models. The negative correlation (longer human time implies lower AI success) aligns with intuition.
  • Support for Methodological Approach: The high R-squared value (0.83) suggests a strong relationship between log(human time) and mean AI success rate. This supports the methodological choice later in the paper to use human time as the primary dimension for calculating the 'time horizon' via logistic regression (inspired by Item Response Theory).
  • Residual Variance and Other Factors: While the trend is strong (R²=0.83), there is still considerable scatter around the regression line. This indicates that human time alone does not perfectly predict AI success; other task characteristics (e.g., specific skills required, 'messiness') also influence difficulty for AI, which is explored later in the paper (Section 6.2, Figure 10).
  • Use of Mean Success Rate Across Models: The y-axis represents the mean success rate across all tested models. This averaging might obscure important differences between models. For instance, some models might deviate significantly from this average trend, performing unusually well or poorly on tasks of specific lengths. The analysis focuses on the average difficulty landscape.
  • Consistency between Figure and Reference Text: The regression mentioned in the reference text (exponential model fit, R² ≈ 0.83) corresponds to the linear fit shown in the figure because the x-axis (human time) is plotted logarithmically while the y-axis (success rate) is linear. A linear relationship on these axes implies an exponential decay relationship between linear human time and success rate.
Communication
  • Appropriateness of Scatter Plot: The scatter plot effectively visualizes the relationship between the difficulty of a task (measured by human completion time) and the average success rate of AI models on that task.
  • Clarity of Axes and Scaling: The axes are clearly labeled ('Human Time-to-Complete' and 'Mean Model Success Rate'). The logarithmic scale on the x-axis is appropriate for visualizing data spanning several orders of magnitude (seconds to hours) and is clearly indicated by the non-linear spacing of time labels.
  • Legend for Task Suites: The inclusion of a legend distinguishing between task suites (HCAST, RE-Bench, SWAA) allows readers to see if the trend holds across different types of tasks, although the overlap makes detailed comparison difficult.
  • Inclusion of Trend Line and R-squared: Adding the linear regression trend line and the R-squared value (R² = 0.83) directly on the plot provides immediate visual and statistical confirmation of the negative correlation mentioned in the caption.
  • Caption Clarity: The caption clearly and accurately summarizes the main takeaway message of the figure.
Figure 5: Success rates of all models on our test suite, showing the...
Full Caption

Figure 5: Success rates of all models on our test suite, showing the computation of time horizon as predicted 50% success rate time.

Figure/Table Image (Page 11)
First Reference in Text
Specifically, we perform Ordinary Least Squares regression on log(model_horizon) = α + β · release_date.
Description
  • Figure Structure: Multi-Panel Plots per Model: This figure consists of multiple small plots (panels), one for each AI model tested, illustrating how the '50% time horizon' is calculated for that specific model.
  • Axes Definition: Each panel plots the AI model's 'Success Probability' (vertical axis, ranging from 0 to 1, or 0% to 100%) against the 'Task length (human time-to-complete)' (horizontal axis, logarithmic scale from seconds to days).
  • Data Representation: Empirical Points and Fitted Curve: The blue bars/points represent the actual measured 'Empirical success rates' of the model on tasks grouped by their human completion time, with error bars indicating the uncertainty (±2 standard errors). The green curve is a 'Fitted curve' derived from a statistical technique called logistic regression. This curve models the probability of the AI succeeding as the task length (difficulty) increases – generally, success probability decreases as tasks get longer.
  • Time Horizon Calculation (50% Threshold): The key calculation shown is the 'Time Horizon'. This is found by locating the point where the fitted green curve intersects the 50% success probability level (0.5 on the vertical axis). A vertical dashed red line is drawn down from this intersection point to the horizontal axis, indicating the corresponding human task completion time. This time value is the model's 50% time horizon.
  • Examples of Calculated Time Horizons: The figure shows a wide range of time horizons across models. For example, GPT-2 has a horizon of only 2 seconds, davinci-002 (GPT-3) is at 9 seconds, GPT-4 0314 is at 5 minutes, while the most capable model shown, Claude 3.7 Sonnet, reaches a 50% time horizon of 59 minutes.
  • Observed Data Discontinuity ('Jump'): The caption notes a 'jump' in success rates between tasks shorter than 1 minute (mostly SWAA tasks) and tasks longer than 1 minute (mostly HCAST tasks), which is visible as a potential discontinuity in the empirical data points around the 1-minute mark in some panels.
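For a fitted curve of the form p = σ(a + b·log t), the threshold crossing used for the horizon has a closed form; the small helper below is illustrative (not the authors' code) and also covers the 80% threshold used in Figure 6.

```python
import math

def horizon_at(a, b, p=0.5):
    """Task length where a fitted curve sigma(a + b*log t) crosses success
    probability p; b is negative for curves that decay with task length."""
    return math.exp((math.log(p / (1 - p)) - a) / b)

# With the same fitted parameters, the 80% horizon is much shorter than the
# 50% horizon (a and b here are illustrative values, not fitted ones).
a, b = 4.0, -1.0
print(horizon_at(a, b, 0.5), horizon_at(a, b, 0.8))
```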
Scientific Validity
  • Appropriateness of Logistic Regression (IRT): The use of logistic regression to model success probability as a function of task difficulty (log human time) is a standard and appropriate technique, drawing inspiration from Item Response Theory (IRT) as mentioned in Section 4.1. It provides a principled way to estimate the difficulty level corresponding to a specific success probability.
  • Goodness-of-Fit Assessment: The figure visually suggests the logistic model provides a 'fairly good' fit, as stated in the caption, capturing the general downward trend of success rate with increasing task length for most models. However, deviations exist.
  • Implications of the SWAA/HCAST 'Jump': The noticeable 'jump' or discontinuity around the 1-minute mark in several panels, corresponding to the boundary between SWAA and HCAST tasks, suggests the simple logistic model might not perfectly capture the underlying data structure. This could stem from differences in task types, scoring methods, or human baseline accuracy between the two suites, potentially affecting the precision of the calculated horizon, especially for models whose 50% threshold falls near this boundary.
  • Dependence on Input Data Quality: The calculation of the time horizon depends directly on the quality of the human baseline times (x-axis) and the measured AI success rates (y-axis data points). Any noise or bias in these underlying measurements will propagate into the fitted curve and the resulting time horizon estimate.
  • Choice of 50% Success Threshold: The choice of the 50% success threshold is a key parameter. While common in psychometrics, the resulting time horizon value is specific to this threshold. Using a different threshold (e.g., 80%, as shown in Figure 6) would yield different, likely shorter, time horizons.
  • Representation of Uncertainty: The error bars (±2SE) on the empirical points provide some indication of uncertainty in the measured success rates for binned task lengths, but the figure does not explicitly show confidence intervals for the fitted logistic curve or the derived time horizon itself, although these are likely incorporated into the error bars shown later in Figure 1.
Communication
  • Multi-Panel Layout Clarity: The multi-panel layout, dedicating one plot per model, allows for a clear, uncluttered view of each model's performance curve and its corresponding time horizon calculation.
  • Integration of Empirical Data and Model Fit: Within each panel, the visualization effectively combines empirical data (binned success rates with error bars) and the fitted logistic regression curve, allowing readers to visually assess the model fit.
  • Clear Indication of Time Horizon: The use of a vertical dashed red line clearly indicates the calculated 50% time horizon, and the accompanying text explicitly states this value (e.g., 'Time Horizon: 59 min' for Claude 3.7 Sonnet), making the key output easily identifiable.
  • Consistent Axis Scaling: Consistent axis scaling (logarithmic x-axis for time, linear y-axis for probability) across panels facilitates comparison between models, although direct visual comparison requires looking across multiple plots.
  • Legend Clarity: The legend clearly explains the components: the fitted curve and the empirical success rates with standard error bars (±2SE).
  • Caption Accuracy: The caption accurately describes the figure's content, explaining that it shows success rates and how the 50% time horizon is derived.
  • Visualization of Data Discontinuity: The visual 'jump' in success rates around the 1-minute mark, mentioned in the caption text, is apparent in several panels, visually supporting the text's observation about the SWAA/HCAST boundary.
Figure 6: Trend in 80% success rate time horizon.
Figure/Table Image (Page 12)
First Reference in Text
We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.
Description
  • Purpose: Trend in Higher-Reliability AI Capability: This figure plots the progress of AI models over time, similar to Figure 1, but uses a stricter measure of capability: the '80% task completion time horizon'. This represents the maximum length of a task (measured in human completion time) that an AI model can successfully complete 80% of the time it tries.
  • Plot Axes and Scaling: Like Figure 1, it's a scatter plot where the horizontal axis is the AI model's release date (from 2019 to 2027) and the vertical axis is the calculated time horizon, plotted on a logarithmic scale (from 1 second to 4 hours).
  • Data Points and Trend Line: Each point represents an AI model (e.g., davinci-002, GPT-4 0314, Claude 3.7 Sonnet), showing its calculated 80% time horizon. A trend line is fitted to these points.
  • Observed Trend and Quantification: The figure shows that the 80% time horizon has also been growing exponentially, with a calculated doubling time of 213 days. The R-squared value (R² = 0.97) indicates this exponential model fits the data very well for the models shown (starting from 2020).
  • Comparison with 50% Time Horizon: A faint grey line representing the 50% time horizon trend from Figure 1 is included for comparison. Visually, the points and trend line for the 80% horizon are significantly lower on the graph than the 50% horizon trend, indicating that achieving 80% reliability requires tackling much shorter tasks compared to achieving 50% reliability.
  • Magnitude of 80% Horizon: The most capable model shown (Claude 3.7 Sonnet) has an 80% time horizon of around 15 minutes, substantially less than its 50% time horizon of nearly 1 hour shown in Figure 5.
Scientific Validity
  • Sensitivity Analysis for Success Threshold: Calculating the 80% time horizon provides a valuable sensitivity analysis regarding the choice of success threshold. It demonstrates that while the exponential growth trend appears robust (similar doubling time to 50% horizon), the absolute capability level is highly sensitive to the reliability requirement.
  • Relevance of Higher Reliability Metric: The 80% threshold represents a higher standard of reliability, which may be more relevant for practical applications where occasional failure is less acceptable. Measuring this provides a more conservative estimate of AI capabilities.
  • Highlighting the Reliability Gap: The significantly shorter horizons at the 80% level (visually ~5x shorter, as stated in text) highlight a substantial gap between models sometimes succeeding on complex tasks (50% horizon) and reliably succeeding (80% horizon). This suggests limitations in current models' robustness or consistency.
  • Potential Estimation Challenges: Estimating the 80% success point might be statistically more challenging or require more data than the 50% point, especially for harder tasks where success rates are low. This could introduce greater uncertainty into the 80% horizon estimates, although the R²=0.97 suggests a good fit for the models included.
  • Model Inclusion Criteria and Trend Start Date: The analysis excludes the earliest models (like GPT-2) and starts the trend fit from 2020-01-01. This is likely because their performance was too low to reliably estimate an 80% success threshold, even on the easiest tasks. This selective inclusion should be noted when interpreting the trend.
Communication
  • Consistent Visual Format: The figure effectively uses the same format as Figure 1 (log-linear plot of time horizon vs. release date) to show the trend for the 80% success rate threshold, facilitating direct comparison.
  • Effective Visual Comparison (50% vs 80%): Including the 50% horizon trend line (in grey) provides an immediate visual reference point, clearly illustrating that the 80% horizons are substantially lower, as mentioned in the text.
  • Clear Axis Labeling: The axes are clearly labeled, with the y-axis specifying 'Task time (for humans) that model completes with 80% success rate', removing ambiguity about the metric being plotted.
  • Inclusion of Trend Metrics: Key metrics like the doubling time (213 days) and R-squared value (0.97) are displayed directly on the plot, summarizing the trend's characteristics.
  • Legend Clarity: The legend clearly identifies the models plotted, although fewer models are included compared to the 50% horizon plot (Figure 1), likely due to difficulty in estimating 80% horizons for lower-performing models.
Figure 14: Stacked histogram of tasks by difficulty rating.
Figure/Table Image (Page 27)
First Reference in Text
HCAST mainly includes tasks longer than 4 minutes, while we focused on tasks in the 2-second to 15-second range with SWAA in order to measure GPT-2 and GPT-3.
Description
  • Purpose: Task Distribution by Difficulty: This figure is a 'stacked histogram', a type of bar chart used to show how data is distributed across different categories or ranges. Here, it shows the distribution of the benchmark tasks based on their difficulty.
  • X-axis: Task Difficulty (Log Human Time): The horizontal axis (x-axis) represents the 'Human task time', which serves as the measure of task difficulty. It's shown on a logarithmic scale, meaning equal distances represent multiplicative increases in time (e.g., 1 sec, 4 sec, 15 sec, 1 min, 4 min, etc.). Tasks are grouped into bins based on their estimated human completion time.
  • Y-axis: Number of Tasks: The vertical axis (y-axis) represents the 'Number of tasks' falling into each difficulty bin.
  • Stacked Bars: Contribution of Task Suites: Each bar is 'stacked', meaning it's divided into colored segments. The colors correspond to the source task suite: SWAA (green), RE-Bench (orange), and HCAST (blue), as indicated by the legend. The height of each colored segment within a bar shows how many tasks from that specific suite fall into that difficulty bin. The total height of a bar shows the total number of tasks in that bin.
  • Observed Distribution Pattern: The histogram shows that the SWAA tasks (green) are concentrated in the very short duration bins (mostly under 15 seconds). The HCAST tasks (blue) and RE-Bench tasks (orange) make up the majority of tasks in the longer duration bins (minutes to hours). There appears to be a lower density of tasks in the range between roughly 15 seconds and 1-4 minutes.
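A short sketch of how such a log-binned stacked histogram can be produced is given below; the per-suite task-time samples are synthetic placeholders, not the study's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholder task times in seconds for each suite (not real data).
rng = np.random.default_rng(0)
times = {"SWAA": rng.lognormal(1.5, 0.8, 66),
         "HCAST": rng.lognormal(7.5, 1.5, 97),
         "RE-Bench": np.full(6, 8 * 3600.0)}

bins = np.logspace(0, np.log10(16 * 3600), 20)  # 1 second to 16 hours
plt.hist(list(times.values()), bins=bins, stacked=True, label=list(times))
plt.xscale("log")
plt.xlabel("Human task time (s)")
plt.ylabel("Number of tasks")
plt.legend()
plt.show()
```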
Scientific Validity
  • Accurate Representation of Task Suite: The histogram accurately depicts the composition of the combined task suite used in the study, categorized by source and estimated human difficulty. It provides a transparent overview of the benchmark's characteristics.
  • Justification for SWAA Suite: The figure visually justifies the creation of the SWAA suite. By showing the concentration of HCAST/RE-Bench tasks at longer durations, it highlights the need for shorter tasks (provided by SWAA) to effectively measure the capabilities of less advanced models like GPT-2 and GPT-3, which would likely fail most HCAST/RE-Bench tasks.
  • Implications of the Distribution Gap: The distribution reveals a potential limitation: the relative sparsity of tasks in the intermediate difficulty range (tens of seconds to a few minutes). This gap could slightly affect the precision of the time horizon calculation for models whose 50% capability level falls within this specific range, as the logistic regression fit might be less constrained by data in this zone.
  • Dependence on Human Time Estimates: The validity of the distribution relies on the accuracy of the human time estimates used to assign tasks to difficulty bins. Any systematic errors in these time estimates would distort the histogram.
  • Choice of Histogram Bins: The choice of bin widths for the histogram can influence its visual appearance. While the logarithmic scale helps manage the wide range, the specific bin boundaries chosen are not explicitly stated but appear reasonable for visualizing the overall distribution.
Communication
  • Appropriateness of Visualization: The stacked histogram is an appropriate visualization choice to show both the overall distribution of tasks by difficulty (human time) and the contribution of each task suite (HCAST, RE-Bench, SWAA) within each difficulty bin.
  • Clarity of Logarithmic X-axis: The logarithmic scale on the x-axis ('Human task time') effectively handles the wide range of task durations (seconds to hours) and allows different time scales to be represented clearly.
  • Effective Use of Stacking and Legend: The color-coded stacking and the legend clearly distinguish the three task sources, making it easy to see their relative prevalence at different difficulty levels.
  • Visual Confirmation of Textual Claims: The figure visually confirms the statement in the caption and reference text: SWAA tasks are concentrated at the very short end (seconds), while HCAST and RE-Bench tasks dominate the longer durations (minutes to hours).
  • Highlighting the Task Distribution Gap: The histogram clearly highlights a relative gap in task density between the ~15-second upper range of SWAA and the ~1-4 minute lower range of HCAST, which is relevant to the study's methodology for measuring models across different capability levels.
Figure 16: Success rates and time horizon of human baseliners.
Figure/Table Image (Page 29)
First Reference in Text
Figure 16 shows a graph of baseliner success rate by task length.
Description
  • Purpose: Calculate Human Time Horizon: This figure applies the same time horizon calculation methodology used for AI models (as shown in Figure 5) to the human baseline data.
  • Axes Definition: The plot shows the 'Success Probability' (vertical axis) of the human baseliners completing tasks versus the 'Task length (human time-to-complete)' (horizontal axis, logarithmic scale).
  • Data Representation: Empirical Rates and Fitted Curve: Grey bars/points represent the empirical success rates of humans on tasks grouped by difficulty, with error bars (±2SE). A black curve shows the fitted logistic regression model representing the probability of human success as task length increases.
  • Calculated Human Time Horizon (1 hr 37 min): Following the same procedure as for AI models, the figure identifies the task length at which the fitted curve crosses the 50% success probability mark. This yields a calculated 'Time Horizon' for the human baseliners of 1 hour and 37 minutes. (A minimal code sketch of this horizon calculation follows this list.)
  • Observation: Decreasing Success Rate, Low Horizon: The plot shows that human success rates decrease as task length increases, similar to AI models, but the calculated 50% horizon is much lower than might be intuitively expected (e.g., lower than the 8 hours humans were often allotted).
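
The horizon calculation described above can be illustrated with a short sketch, assuming per-run records of human baseline time and a binary success flag. The scikit-learn call and the near-unregularized setting are illustrative choices, not the paper's exact implementation, which adds task weighting and regularization.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fifty_percent_horizon(human_minutes, successes):
    """Fit P(success) ~ sigmoid(a + b * log2(human time)) and return the
    task length (minutes) at which the fitted probability crosses 50%."""
    X = np.log2(np.asarray(human_minutes, dtype=float)).reshape(-1, 1)
    y = np.asarray(successes, dtype=int)
    fit = LogisticRegression(C=1e6).fit(X, y)   # large C => effectively unregularized
    a, b = fit.intercept_[0], fit.coef_[0][0]
    # P = 0.5 exactly where a + b * log2(t) = 0, i.e. t = 2 ** (-a / b)
    return 2.0 ** (-a / b)

# Hypothetical runs: successes on short tasks, failures on long ones.
times = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]   # human minutes per task
wins  = [1, 1, 1, 1, 1,  1,  0,  0,   0,   0]     # 1 = success
print(f"50% horizon ~ {fifty_percent_horizon(times, wins):.0f} minutes")
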
Scientific Validity
  • Methodological Consistency: Applying the same IRT-inspired logistic regression methodology used for AI models to the human data allows for a methodologically consistent comparison, even if the resulting 'human time horizon' requires careful interpretation.
  • Interpretation Challenges of the Human Horizon: The calculated human time horizon of ~1.5 hours is surprisingly low, given baseliners were often paid for up to 8 hours. As discussed critically in the text (Section B.1.1), this low value is likely an artifact of the methodology: filtering only for successful runs (biasing towards shorter times) and potentially incentivizing baseliners to give up early on tasks they perceived as difficult or time-consuming, rather than reflecting the true upper limit of human capability within an 8-hour timeframe.
  • Non-Comparability with AI Horizons: Because the human horizon calculation is affected by these methodological choices (especially filtering for success), it is explicitly noted in the text (Section B.1.1) and implied in the caption's note that this value is not directly comparable to the AI time horizons calculated in the main analysis. It serves more as a methodological check than a true measure of human capability limits under the study conditions.
  • Reflection of Baselining Process Data: The plot does accurately reflect the observed success rates of humans under the specific conditions and incentives of the baselining process. The decreasing success rate with task length is an expected finding.
  • Highlighting Challenges in Human Baselining: The analysis highlights the significant challenges and potential biases involved in establishing reliable human performance baselines, particularly for complex, long-duration tasks where failure rates are high and motivation/incentives play a large role.
Communication
  • Clarity of Visualization: The plot clearly visualizes the relationship between task length and the success rate of the human baseliners, using the same format as the individual model plots in Figure 5 (logistic curve fit to empirical success rates).
  • Data Representation: The inclusion of empirical success rate bins with error bars allows assessment of the data variability and the goodness-of-fit of the logistic curve.
  • Indication of Human Time Horizon: The calculated 50% time horizon (1 hr, 37 min) is clearly indicated with a dashed line and text annotation, analogous to the AI model plots.
  • Contextualization via Caption and Text: The caption, combined with the crucial note in the text (Section B.1.1), clarifies that this calculated 'human time horizon' is artificially low due to methodological factors and not directly comparable to AI horizons, which is important context.
Table 4: Results of baselines on selected internal PRs
Figure/Table Image (Page 30)
Table 4: Results of baselines on selected internal PRs
First Reference in Text
Results of these baselines can be seen in table 4.
Description
  • Purpose: Comparing Human Performance on Internal Tasks: This table shows the results of having two different groups of humans attempt to resolve two specific internal software issues (labeled Issue 1 and Issue 8). These issues were likely related to fixing bugs or adding features in the researchers' own code repositories.
  • Human Groups Compared: Maintainer vs. Baseliner: The two groups of humans compared are: 1) 'Repository Maintainer', presumably someone very familiar with the specific codebase, and 2) 'Baseliner', likely an external contractor or employee with relevant skills but without prior deep context on that specific repository (similar to the baseliners used for the main benchmark suites).
  • Data Presented: Time and Score: For each issue and each type of human agent, the table lists the 'Time taken' to complete the task and the resulting 'Score' (on a scale likely from 0 to 1, based on the scoring description in Appendix B.2, where 1.0 means the fix could be merged as is).
  • Results for Issue 1: For Issue 1, the maintainer took 5 minutes and scored 1.0, while the baseliner took 81 minutes and also scored 1.0. This shows a ~16x difference in time for the same outcome.
  • Results for Issue 8: For Issue 8, the maintainer took 20 minutes and scored 1.0. The baseliner took 113 minutes but only achieved a score of 0.25 (indicating significant issues with the proposed fix). This shows a ~5.5x difference in time with a much worse outcome for the baseliner.
Scientific Validity
  • Demonstration of Context Effect: The comparison directly addresses the important variable of 'context' in software development. It empirically demonstrates that familiarity with a codebase (maintainer) dramatically reduces the time required compared to a skilled individual without that context (baseliner). This has significant implications for interpreting AI performance, as AI agents typically operate with low context.
  • Small Sample Size (N=2 Issues): The table presents results for only two selected issues. While illustrative, this is a very small sample size, and the magnitude of the time difference (5x-16x observed here) might vary significantly across different tasks or codebases.
  • Subjectivity of Manual Scoring: The scoring (1.0, 0.25) is based on manual assessment by maintainers (Appendix B.2), introducing subjectivity. While likely necessary for real-world PRs, it's less objective than the automated scoring used in the main benchmarks.
  • Ecological Validity vs. Specificity: These 'internal PRs' represent real work performed by METR staff (Section 6.4), potentially making them more representative of actual software engineering tasks than standardized benchmark tasks. However, they might also reflect idiosyncrasies of METR's specific codebase or workflow.
  • Implications for Interpreting AI Time Horizons: The results support the argument made in Section 6.4 that AI performance on benchmarks (where tasks are designed for low context) might better correspond to the performance of low-context humans (baseliners) rather than high-context humans (maintainers). If AI time horizons are calculated relative to baseliner time, they might overestimate the time needed to replace high-context human work.
Communication
  • Clarity of Comparison: The table clearly presents a comparison between two types of human agents (Repository Maintainer vs. Baseliner) on two specific issues.
  • Table Structure: The columns (Issue, Agent, Time taken, Score) are logically structured and easy to understand.
  • Highlighting Time Discrepancy: It effectively highlights the dramatic difference in time taken between maintainers (who have high context) and baseliners (who have low context) for the same task, supporting the discussion in Section 6.4 and Appendix B.2.
  • Inclusion of Scores: The scores provide context for the time taken, showing that even with much more time, the baseliner performance (score) could be lower (e.g., Issue 8).
Table 5: Internal PR Per-Task Average Scores (number of trials in parentheses).
Figure/Table Image (Page 30)
Table 5: Internal PR Per-Task Average Scores (number of trials in parentheses).
First Reference in Text
Note that we did minimal processing on Issue 9 to turn it into two issues, as in practice the issue description contained two entirely separate pieces of work.
Description
  • Purpose: AI Performance on Internal Tasks: This table shows the performance results of three specific AI models (GPT-4o, Claude 3.5 Sonnet, and o1) when they were tested on a small set of internal software development tasks, referred to as 'Internal PRs' (Pull Requests).
  • Tasks Evaluated: The tasks are identified by 'Task ID' (Issue 1, Issue 8, Issue 9-1, Issue 9-2, Issue 10, Issue 11). Note that Issue 9 was split into two separate tasks (9-1 and 9-2) because the original description contained two distinct pieces of work.
  • Metric: Average Score: For each task and each AI model, the table reports the average 'Score' achieved. The scoring likely follows the manual assessment criteria described in Appendix B.2 (0 to 1 scale based on merge readiness).
  • Number of Trials: The number in parentheses next to each score indicates the 'number of trials' or attempts the AI made for that specific task, over which the average score was calculated.
  • Summary of Results: Performance varied significantly. For example, on Issue 1, o1 achieved the highest average score (0.875 over 6 trials), while GPT-4o scored 0.35 (5 trials). On Issue 9-2, Claude 3.5 Sonnet achieved a perfect score (1.0 over 5 trials), while GPT-4o scored 0.0. On most other issues (8, 9-1, 10, 11), all models performed poorly, scoring 0.0 or very close to it.
Scientific Validity
  • Ecological Validity of Tasks: These tasks represent 'real-world' software issues from the researchers' internal repository, potentially offering higher ecological validity than standard benchmarks. However, they are specific to one particular codebase and workflow.
  • Subjectivity of Manual Scoring: The performance is measured using manual scoring by maintainers (Appendix B.2), which introduces subjectivity compared to automated scoring. Consistency between scorers is important but not quantified here.
  • Very Small Task Sample Size: The number of tasks (6 derived from 5 original issues) is very small. Results on this limited set may not generalize to broader AI performance on real-world software engineering.
  • Variable Number of Trials: The number of trials per task varies (from 5 to 12). Averages based on fewer trials are less statistically reliable.
  • Consistency with Human Baseline Findings: The results show generally poor performance by the AI models on these internal tasks compared to human maintainers (Table 4) and even human baseliners on Issue 1 (Table 4), highlighting the difficulty AI faces with real-world, high-context tasks, consistent with the paper's discussion.
  • Appropriateness of Splitting Issue 9: Splitting Issue 9 into two parts seems methodologically sound given the description that it contained two separate pieces of work, allowing for a more granular assessment.
Communication
  • Clear Presentation of Scores: The table clearly presents the average scores for three AI models across six distinct internal tasks (derived from five original issues).
  • Logical Structure: The structure (Task ID rows, Model columns) is straightforward and facilitates comparison across models for a given task.
  • Inclusion of Trial Numbers: Including the number of trials in parentheses for each score provides important context about the amount of data underlying each average, indicating the robustness (or lack thereof) for each data point.
  • Caption Clarity: The caption accurately describes the table content.
  • Handling of Split Issue: The splitting of 'Issue 9' into 'Issue 9-1' and 'Issue 9-2' is noted in the reference text and reflected in the table, adding clarity about the task definition.
Table 6: Comparison of time to fix issues by repo maintainers and baseliners
Figure/Table Image (Page 31)
Table 6: Comparison of time to fix issues by repo maintainers and baseliners
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Comparing Human Time with Varying Context: This table compares the time it took for two different groups of humans to fix specific software issues ('tasks') within an internal code repository (likely belonging to the research organization, METR).
  • Groups Compared: The humans compared are 'repo maintainers' (people deeply familiar with the specific codebase) and 'baseliners' (skilled individuals, likely external contractors, without prior deep familiarity, similar to those used in the main benchmarks).
  • Data Presented: Task ID, Maintainer Time, Baseliner Time (with Score): The table lists six specific tasks identified by 'Task ID' (e.g., 'eval-analysis-public-1', 'eval-analysis-public-8'). For each task, it shows the time taken in minutes by the maintainer and, where available, the time taken by the baseliner. Baseliner times are annotated with the quality score achieved.
  • Slowdown Factor Calculation: A 'Slowdown' column calculates the ratio of Baseliner Time to Maintainer Time for the tasks where both times are available. This factor quantifies how much longer the less familiar baseliner took.
  • Observed Time Differences: The results show maintainers completing tasks in times ranging from 5 minutes to 235 minutes. For the three tasks completed by baseliners, their times were much longer (81, 113, and 93 minutes compared to 5, 20, and 5 minutes for maintainers, respectively).
  • Magnitude of Slowdown: The calculated slowdown factors are substantial, ranging from approximately 5.5x (for Issue 8, where the baseliner also scored poorly) to 16x (Issue 1) and 18.6x (Issue 11). This indicates baseliners took 5 to 18 times longer than maintainers.
Scientific Validity
  • Demonstrates Impact of Context: The table provides strong empirical evidence for the significant impact of codebase familiarity ('context') on human software development time. This directly supports the paper's discussion about the challenges of comparing AI (typically low-context) performance to human performance.
  • Limited Sample Size and Generalizability: The analysis is based on a small number of internal tasks (6 listed, only 3 with full comparison data). These tasks are specific to one organization's codebase, which may limit the generalizability of the exact slowdown factors (5x-18.6x) to other software projects or task types.
  • Confounding Variable: Score/Quality: The comparison is potentially confounded by the quality of the outcome, as indicated by the baseliner scores. For Issue 8, the 5.5x slowdown is accompanied by a much lower score (0.25 vs 1.0), suggesting the baseliner didn't truly complete the task to the same standard, making the time comparison less direct.
  • Assumptions about Human Groups: The definition and selection process for 'maintainer' versus 'baseliner' are crucial but assumed based on context from Appendix B.2. Consistency in skill levels between the groups (aside from context) would be important for isolating the effect of familiarity.
  • Implications for Human Baseline Time Estimates: Despite limitations, the data strongly suggests that using low-context human baseliner times as a proxy for task difficulty (as done in the main analysis) likely results in significantly longer time estimates than would be achieved by experienced developers working within their own codebase.
Communication
  • Clear Comparative Structure: The table uses a clear columnar format to directly compare the time taken by repository maintainers versus external baseliners on the same set of internal software tasks.
  • Effective Use of 'Slowdown' Metric: Including the 'Slowdown' factor (ratio of baseliner time to maintainer time) provides an immediate, quantitative measure of the difference in speed, effectively highlighting the magnitude of the context effect.
  • Contextualization with Baseliner Scores: Annotating the baseliner times with the scores achieved (e.g., '113 (score .25)') adds crucial context, showing that the longer time taken by baseliners did not always result in equivalent quality outcomes compared to maintainers.
  • Clear Task Identification: Task IDs are clearly listed, allowing reference back to specific issues if needed (although descriptions aren't in this table).
  • Transparency via Footnote: The footnote indicating a slight variant for one run adds transparency.
Table 7: Scaffolding used for each model in this report
Figure/Table Image (Page 32)
Table 7: Scaffolding used for each model in this report
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Specify Agent Scaffolding: This table specifies the 'Agent scaffolding' used for each different AI model tested in the study. 'Scaffolding' refers to the surrounding code framework and tools that allow the core AI language model to interact with the task environment, execute commands (like running code or searching the web), manage its inputs and outputs, and structure its actions to solve a problem. It's like the operating system and basic tools provided to the AI 'brain'.
  • Models Listed: The table lists 11 different AI models, including various versions of Claude, GPT-3, GPT-4, and o1.
  • Predominant Scaffolding: modular-public: For most models (e.g., all Claude versions, davinci-002, gpt-3.5-turbo-instruct, all GPT-4 versions including GPT-4o), the table indicates that the 'modular-public' scaffolding was used. This suggests a consistent framework was applied for the majority of evaluations.
  • Exceptions: o1 and o1-preview Scaffolds: Two exceptions are noted: the 'o1' model used a scaffolding called 'triframe', and the 'o1-preview' model used one called 'duet'. This indicates these specific models were run using different interaction frameworks compared to the others.
Scientific Validity
  • Importance for Reproducibility and Interpretation: Specifying the scaffolding used for each model is critical for reproducibility and interpreting results. The performance of an AI agent depends significantly on both the underlying language model and the scaffolding that enables its interaction and tool use.
  • Enhanced Comparability via Consistent Scaffolding: The use of a consistent scaffold ('modular-public') for the majority of models enhances the comparability of their results, as differences are more likely attributable to the models themselves rather than the framework.
  • Potential Confound from Different Scaffolds (o1 models): The use of different scaffolds ('triframe', 'duet') for the o1 models introduces a potential confound. Differences in performance between o1/o1-preview and the other models might be due partly to the scaffolding rather than solely the model capabilities. The text (Section 3.3.1) acknowledges these scaffolds were different because o1 models seemed to struggle with the standard one, but quantifying the impact of the scaffold itself is difficult.
  • Transparency of Experimental Setup: The table provides transparency about the experimental setup, allowing readers to understand the specific agent implementations evaluated.
Communication
  • Clarity and Conciseness: The table clearly lists each AI model evaluated and the specific agent scaffolding used with it, providing essential methodological detail in a simple, easy-to-read format.
  • Highlighting Exceptions: It effectively highlights that most models used the 'modular-public' scaffold, while explicitly noting the exceptions (o1 using 'triframe', o1-preview using 'duet'), drawing attention to potential methodological differences.
  • Transparency and Reference Value: This table serves as a crucial reference for understanding the experimental setup described in Section 3.3.1 and Appendix B.3, ensuring transparency about the agent implementation.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 7: Time horizons on HCAST + RE-bench, for models starting with GPT-4...
Full Caption

Figure 7: Time horizons on HCAST + RE-bench, for models starting with GPT-4 0314.

Figure/Table Image (Page 14)
Figure 7: Time horizons on HCAST + RE-bench, for models starting with GPT-4 0314.
First Reference in Text
This trend (with Claude 3.7 Sonnet added) is shown in Figure 7: time horizon doubles about every six months.
Description
  • Purpose: Trend Analysis on Subset: This figure shows the trend in the '50% task completion time horizon' (the typical human time for tasks an AI completes successfully half the time) specifically focusing on more recent AI models and a subset of the tasks.
  • Data Subset Used: Unlike Figure 1 which used the full dataset (HCAST, RE-Bench, SWAA) and models from 2019 onwards, this plot uses only the HCAST and RE-Bench task suites (excluding the very short SWAA tasks) and only includes models released from 2023 onwards, starting with GPT-4 0314.
  • Plot Axes and Scaling: The plot format is similar to Figure 1: a scatter plot with model release date on the horizontal axis (from 2023 to 2025) and the calculated 50% time horizon on the vertical axis (logarithmic scale, from 1 minute to 1 hour).
  • Data Points and Trend Line: Points represent individual models (GPT-4 versions, Claude 3 versions, o1 models). A trend line is fitted to these points.
  • Observed Trend and Quantification: The analysis for this subset (2023+ models on HCAST + RE-Bench) finds a doubling time of 191 days (approximately 6.3 months) with an R-squared value of 0.91, indicating a strong exponential fit within this specific timeframe and task set. (A sketch of this doubling-time calculation follows this list.)
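
As a rough illustration of how a doubling time such as the 191 days quoted above is obtained, the sketch below fits ordinary least squares to log2(time horizon) against release date and reports 1/slope as days per doubling. The dates and horizon values are placeholders, not the paper's measured numbers.

from datetime import date
import numpy as np

def doubling_time_days(release_dates, horizons_minutes):
    t0 = min(release_dates)
    x = np.array([(d - t0).days for d in release_dates], dtype=float)
    y = np.log2(np.array(horizons_minutes, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)    # doublings gained per day
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return 1.0 / slope, r2

# Placeholder (release date, 50% horizon in minutes) pairs.
dates    = [date(2023, 3, 14), date(2024, 6, 20), date(2025, 2, 24)]
horizons = [5.0, 25.0, 55.0]
dt, r2 = doubling_time_days(dates, horizons)
print(f"doubling time ~ {dt:.0f} days, R^2 = {r2:.2f}")
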
Scientific Validity
  • Robustness Check for Trend: This analysis serves as a robustness check, examining if the trend observed in the full dataset (Figure 1) holds when restricting the analysis to more recent models and excluding the shortest (SWAA) tasks. The similar doubling time (191 days vs. 212 days in Figure 8) suggests the overall trend is not solely driven by early models or the inclusion of very short tasks.
  • Limited Number of Data Points: The analysis is based on a smaller number of models (8 distinct points visible) over a shorter time period (~2 years) compared to the main analysis (Figure 1). While the R² is high (0.91), trend estimates derived from fewer data points are generally less statistically robust and more sensitive to individual model performance.
  • Impact of Excluding SWAA Tasks: Excluding SWAA tasks removes the potential discontinuity observed around the 1-minute mark in Figure 5. This might lead to a cleaner fit but focuses the analysis on tasks typically taking minutes to hours.
  • Acknowledged Uncertainty in Trend Estimation: The text (Section 6.1) notes that error bars were very wide for this subset analysis and that restricting further to 2024-only data yielded a much faster doubling time, highlighting the uncertainty associated with trend extrapolation from limited recent data. This figure presents the 2023+ trend.
  • Dependence on Underlying Data and Methodology: The calculation still relies on the validity of the HCAST and RE-Bench task difficulties (human time estimates) and the 50% time horizon methodology.
Communication
  • Consistent Plot Format: The plot uses a consistent log-linear format (log time horizon vs. linear release date) similar to Figures 1 and 6, aiding comparison across different analyses.
  • Clear Title and Scope: The title clearly specifies the subset of data used (HCAST + RE-Bench only) and the models included (starting from GPT-4 0314), preventing confusion with the main analysis in Figure 1.
  • Axis Labeling: Axes are clearly labeled, indicating the 50% success rate time horizon is being plotted.
  • Legend Clarity: The legend identifies the specific models plotted in this subset.
  • Inclusion of Trend Metrics: Key trend metrics (doubling time: 191 days, R²: 0.91) and the relevant time period (2023-01-01+ data) are displayed directly on the plot, summarizing the findings for this specific data subset.
Figure 8: The full time series for the time horizon of models, by release date.
Figure/Table Image (Page 15)
Figure 8: The full time series for the time horizon of models, by release date.
First Reference in Text
We plot in blue the regression from only 2023+ data on HCAST + RE-Bench tasks, extended into the past, and in gray the regression with all tasks (including SWAA) on the whole 6 year period (Figure 8).
Description
  • Purpose: Full Time Series and Trend Comparison: This figure presents the main result of the paper – the 50% time horizon for various AI models plotted against their release dates over a 6-year period (2019-2025). The time horizon measures AI capability based on the length of tasks (in human time) the AI can complete successfully 50% of the time.
  • Data Points: Model Horizons on Full Task Suite: The plot shows individual data points representing the 50% time horizon calculated for each AI model using the entire task suite (HCAST, RE-Bench, and SWAA). Models range from early ones like GPT-2 to recent ones like Claude 3.7 Sonnet.
  • Trend Lines: Full History vs. Recent Subset: Two trend lines are overlaid: 1) A gray line representing the exponential fit to all data points across the full 6-year period (2019-2025), yielding a doubling time of 212 days. 2) A blue line representing the exponential fit using only data from 2023 onwards and only on the longer HCAST + RE-Bench tasks (excluding SWAA), yielding a doubling time of 191 days. This blue line is visually extended backward into the pre-2023 period.
  • Comparison Goal: Retrodiction: The primary purpose of comparing the lines is 'retrodiction' – checking whether the trend fitted to recent data (blue line, ~6.3-month doubling), when extended backward in time, remains consistent with the earlier data points and with the full-history trend (gray line, ~7-month doubling). The visual closeness of the two lines suggests that it does.
Scientific Validity
  • Robustness Check via Trend Comparison: The figure directly compares trends derived from different subsets of the data (full history/all tasks vs. recent history/longer tasks). This comparison acts as a robustness check for the main finding (the ~7-month doubling time). The close agreement between the 191-day doubling time (recent/longer tasks) and the 212-day doubling time (full history/all tasks) strengthens the conclusion that the rapid exponential growth is a consistent feature, not unduly influenced by early data points or the inclusion of very short tasks.
  • Validity of Retrodiction Approach: Plotting the trend from the recent subset (blue line) backward in time (retrodiction) is a valid way to visually assess its consistency with earlier data points. The fact that the extrapolated blue line aligns reasonably well with the earlier data points (which were not used in its fitting) supports the idea that the underlying growth dynamic might not have drastically changed recently, despite potential acceleration suggested by the slightly shorter doubling time (191 vs 212 days).
  • Consistency Across Task Subsets: The analysis acknowledges the potential difference introduced by excluding SWAA tasks for the blue line fit. The consistency observed suggests that the trend on longer tasks (>1 min) is similar to the overall trend, at least in the recent period.
  • Statistical Significance of Trend Difference: While the trends are similar, the blue line (191 days) does suggest slightly faster recent growth than the overall trend (212 days). However, given the limited data points post-2023 used to fit the blue line, this difference might not be statistically significant or might be sensitive to the specific models included, as noted in Section 6.1.
  • Interpretation Caveat: Points vs. Blue Line Fit Basis: It's important to note the data points shown are based on calculations including SWAA, while the blue line fit excludes SWAA. This is appropriate for the retrodiction goal (comparing the slope derived from the recent/longer tasks against the full historical data) but requires careful interpretation.
Communication
  • Effective Overlay for Comparison: The plot effectively overlays two distinct trend lines (one based on recent data/tasks, one on the full history) against the backdrop of the full dataset's data points, clearly facilitating the intended comparison for retrodiction.
  • Clear Labeling of Trend Lines: Both trend lines are clearly labeled within the plot area, specifying the data subset (e.g., 'non-SWAA tasks, 2023-2025 models' vs. 'all data') and the resulting doubling time, making the comparison explicit.
  • Color Coding: The use of different colors (blue and gray) for the two trend lines enhances visual distinction.
  • Consistent Axis Scaling: Consistent log-linear axes (log time horizon vs. linear release date) are used, consistent with other figures, aiding interpretation of exponential growth.
  • Caption Clarity: The caption text explicitly states which line corresponds to which fit and which data points are shown, reducing potential ambiguity.
Figure 9: Performance trends over time for HCAST and RE-Bench tasks by length...
Full Caption

Figure 9: Performance trends over time for HCAST and RE-Bench tasks by length and messiness (Section 6.2).

Figure/Table Image (Page 16)
Figure 9: Performance trends over time for HCAST and RE-Bench tasks by length and messiness (Section 6.2).
First Reference in Text
The data spans only 2023-2024 as pre-2023 models score 0 on non-SWAA tasks.
Description
  • Purpose: Analyzing Performance Trends by Task Characteristics: This figure investigates how AI model performance has improved over time (specifically between early 2023 and mid-2025) depending on two task characteristics: length (how long they take humans) and 'messiness'. 'Messiness' refers to a score assigned to tasks based on factors expected to make them harder or less clean than typical benchmarks, like having unclear goals or requiring interaction with changing environments (details in Section 6.2).
  • Structure: 2x2 Grid by Length and Messiness: The figure is divided into four panels arranged in a 2x2 grid. The top row shows results for shorter tasks ('Task Length: < 1 hour'), while the bottom row shows results for longer tasks ('Task Length: 1+ hours'). The left column shows results for the 50% 'least messy' tasks within each length category, while the right column shows results for the 50% 'most messy' tasks.
  • Axes and Data Points: Each panel plots the 'Weighted Success Rate' (average success rate adjusted for task family size; see the sketch after this list) on the vertical axis against the 'Model Release Date' on the horizontal axis. Data points represent different AI models (identified by a shared legend) released between early 2023 and mid-2025. Error bars are shown for each point.
  • Observed Trends and Comparisons: The plots generally show an upward trend, indicating performance improvement over time in all four conditions. Comparing panels, success rates are visibly lower for longer tasks (bottom row vs. top row) and for messier tasks (right column vs. left column). However, the rate of improvement (the slope of the trend) appears visually similar across the messiness split (left vs. right columns) for both short and long tasks.
  • Data Subset Used: The analysis is restricted to tasks from the HCAST and RE-Bench suites and only includes models from 2023 onwards, because, as the reference text notes, earlier models generally scored zero on these longer, non-SWAA tasks.
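
A minimal sketch of the family-weighted success rate, assuming the adjustment simply gives each task family equal total weight (the paper's exact diversity weighting may differ in detail); the task family names below are hypothetical.

from collections import defaultdict

def family_weighted_success_rate(records):
    """records: iterable of (task_family, success_bool) pairs."""
    by_family = defaultdict(list)
    for family, success in records:
        by_family[family].append(1.0 if success else 0.0)
    # Each family contributes its mean success rate with equal weight,
    # so large families cannot dominate the aggregate.
    family_means = [sum(v) / len(v) for v in by_family.values()]
    return sum(family_means) / len(family_means)

# Hypothetical task families and outcomes.
runs = [("file_io", True), ("file_io", False), ("file_io", True),
        ("web_scrape", False), ("web_scrape", False),
        ("train_classifier", True)]
print(f"weighted success rate = {family_weighted_success_rate(runs):.2f}")
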
Scientific Validity
  • Value of Stratified Analysis: Splitting the data by task length and messiness allows for a more nuanced analysis than looking at the overall average trend. It directly addresses the question of whether improvement rates differ for tasks perceived as more realistic or complex ('messier').
  • Subjectivity and Reliability of 'Messiness' Score: The 'messiness' score is based on researcher ratings of 16 factors (Section 6.2, Appendix D.4). While aiming to capture real-world complexities, the definition and weighting of these factors, and the reliability of the ratings, introduce subjectivity. The median split into 'least' and 'most' messy is a common but potentially coarse way to categorize tasks.
  • Support for Claim of Similar Trends (with Caveats): The visual observation that trends are similar across messiness levels supports the paper's claim that there is currently no evidence of a performance plateau specifically for messier tasks within this dataset and timeframe. However, the limited timeframe (2023-2024) and number of models restrict the statistical power to detect subtle differences in trends.
  • Limitations of Data Subset: The analysis is limited to HCAST and RE-Bench tasks, excluding SWAA. This focuses on more complex tasks but means the findings might not apply to very short actions. The exclusion of pre-2023 models due to zero scores is necessary but limits the historical perspective within this specific figure.
  • Methodological Considerations (Weighting, Error Bars): The use of weighted success rates accounts for task family diversity, which is methodologically sound. The presence of error bars acknowledges variability, but a formal statistical comparison of the slopes between conditions would strengthen the claim of similar trends.
Communication
  • Grid Layout Effectiveness: The 2x2 grid layout effectively organizes the data, allowing for visual comparison of performance trends across four conditions defined by task length (<1 hour vs 1+ hours) and messiness (least messy vs most messy).
  • Clear Panel Titling: Each panel is clearly titled, specifying the task length, messiness score range, and number of tasks included, providing essential context for interpreting the data within that panel.
  • Shared Legend: Using a shared legend for the models across all panels is efficient, although it requires the reader to reference it while comparing panels.
  • Clarity of Trend Visualization: Plotting weighted success rate against model release date clearly shows the performance improvement over time within each condition. The inclusion of error bars (presumably standard error, though not explicitly stated in this caption) provides a visual sense of data variability.
  • Visual Support for Main Message: The figure visually supports the key message discussed in the text: while absolute success rates differ (lower for longer and messier tasks), the upward trend (rate of improvement) appears qualitatively similar across all four conditions, with no obvious plateauing in the 'most messy' categories.
Figure 10: We plot the excess success rate (the observed empirical task success...
Full Caption

Figure 10: We plot the excess success rate (the observed empirical task success rate, minus success rate we would predict using the task's length, see Section 4.1) against messiness score for each task.

Figure/Table Image (Page 17)
Figure 10: We plot the excess success rate (the observed empirical task success rate, minus success rate we would predict using the task's length, see Section 4.1) against messiness score for each task.
First Reference in Text
As discussed in Section 6.2, there is a negative relationship between excess success rates and messiness.
Description
  • Purpose: Relationship between Task Messiness and AI Difficulty: This figure is a scatter plot examining whether tasks rated as more 'messy' are harder for AI models than would be expected based solely on how long those tasks take humans.
  • X-axis: Task Messiness Score: The horizontal axis (x-axis) represents the 'Task Messiness Score', a numerical rating assigned to each task based on characteristics like unclear goals or changing environments (ranging from 1 to 6 on the plot). A higher score indicates a 'messier' task.
  • Y-axis: Excess Success Rate (Residual): The vertical axis (y-axis) shows the 'Excess Success Rate', calculated as the difference between the actual average success rate of AI models on a task ('Observed') and the success rate predicted based only on the task's length/human time ('Predicted', using the model from Section 4.1). A positive value means AI performed better than expected for the task's length; a negative value means it performed worse.
  • Data Points: Individual Tasks: Each 'x' marker on the plot represents a single task (likely from the HCAST and RE-Bench suites, for which messiness was scored). Its position reflects its messiness score and its excess success rate.
  • Observed Trend: Negative Correlation: The points generally trend downwards from left to right, indicating a negative correlation: tasks with higher messiness scores tend to have lower (more negative) excess success rates, meaning AI models performed worse on these tasks than their length alone would predict.
  • Trend Quantification: Regression and R-squared: A dashed line shows the linear regression fit to the data. The R-squared value (R²=0.251) is displayed, indicating that about 25% of the variation in excess success rate is linearly associated with the messiness score. This suggests messiness has a noticeable negative impact, but other factors also contribute to deviations in AI performance. (A sketch of this residual regression follows this list.)
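
The residual regression just described can be sketched as below; the observed/predicted values and messiness scores are illustrative stand-ins, and in the paper the length-based prediction comes from the logistic model of Section 4.1.

import numpy as np

def excess_success_regression(observed, predicted, messiness):
    residual = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    m = np.asarray(messiness, dtype=float)
    slope, intercept = np.polyfit(m, residual, 1)
    r2 = np.corrcoef(m, residual)[0, 1] ** 2
    return slope, intercept, r2

# Placeholder per-task values: observed mean success rate, length-based
# prediction, and messiness score (1-6, as on the plot's x-axis).
obs  = [0.80, 0.55, 0.40, 0.30, 0.20, 0.10]
pred = [0.70, 0.60, 0.50, 0.45, 0.40, 0.35]
mess = [1, 2, 3, 4, 5, 6]
slope, intercept, r2 = excess_success_regression(obs, pred, mess)
print(f"slope = {slope:.3f} per messiness point, R^2 = {r2:.2f}")
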
Scientific Validity
  • Validity of Residual Analysis Approach: The concept of 'excess success rate' (residual analysis) is a valid statistical approach to investigate the impact of a secondary variable (messiness) after accounting for a primary predictor (task length/difficulty).
  • Evidence for Impact of Messiness: The finding of a negative correlation (R²=0.251) provides quantitative evidence supporting the hypothesis that 'messiness', as defined and measured by the authors, negatively impacts AI performance beyond the effect of task duration.
  • Dependence on Messiness Score Validity: The scientific validity is significantly constrained by the reliability and construct validity of the 'messiness score'. Since this score is based on subjective ratings of multiple factors (Appendix D.4), its precision and objectivity are limited. Different definitions or ratings of messiness could yield different results.
  • Limited Explanatory Power (Low R-squared): The R² value of 0.251, while statistically indicating a relationship, is relatively low. This implies that messiness, as measured, explains only a modest portion of why AI performance deviates from length-based predictions. Other task characteristics or model-specific interactions likely play a substantial role.
  • Potential Heterogeneity Across Models: The analysis appears to pool results across different AI models by using the mean observed success rate and a prediction based on average model performance (derived from the Section 4.1 model). It's possible that the impact of messiness differs significantly between various AI architectures or training methods, which is not explored here.
  • Task Subset Limitation: The analysis likely pertains only to HCAST and RE-Bench tasks, as messiness scores were defined for these (Section 6.2). The findings may not generalize to the shorter SWAA tasks.
Communication
  • Clear Visualization Format: The scatter plot effectively visualizes the relationship between the two variables: task messiness score and excess success rate.
  • Axis Labeling and Clarity: Axes are clearly labeled. The y-axis label 'Residual (Observed - Predicted Success Rate)' accurately reflects the calculation, while the caption's use of 'excess success rate' provides a more intuitive interpretation.
  • Inclusion of Trend Line and R-squared: The inclusion of the linear regression line and the R-squared value (R²=0.251) directly on the plot helps quantify the observed negative trend.
  • Informative Caption: The caption clearly defines the 'excess success rate' metric and references the relevant section (4.1) for details on the prediction model, aiding reader understanding.
  • Support for Textual Claims: The plot directly supports the textual claim in Section 6.2 about a negative relationship between messiness and performance beyond task length.
Figure 11: Performance of frontier AI models using reported SWE-bench Verified...
Full Caption

Figure 11: Performance of frontier AI models using reported SWE-bench Verified results (Section 6.3).

Figure/Table Image (Page 17)
Figure 11: Performance of frontier AI models using reported SWE-bench Verified results (Section 6.3).
First Reference in Text
We observe a similar exponential trend to Figure 1, albeit with a steeper slope.
Description
  • Purpose: Trend Analysis on External Benchmark: This figure presents an analysis of AI model capability trends, similar to Figure 1 and Figure 8, but using data from a different source: the SWE-bench Verified benchmark. SWE-bench Verified is a standard test suite focused on software engineering tasks derived from real issues found in open-source code repositories like GitHub.
  • Metric and Models: The plot shows the calculated '50% task completion time horizon' for several recent AI models (Claude 3 Opus, GPT-4 1106, GPT-4o, Claude 3.5 Sonnet Old/New, o1) based on their performance on SWE-bench Verified tasks.
  • Plot Axes and Scaling: The horizontal axis represents the model release date (from 2023 to 2025), and the vertical axis shows the 50% time horizon on a logarithmic scale (from seconds to hours).
  • Observed Trend: Exponential Growth: Similar to the main analysis, the figure reveals an exponential growth trend in the 50% time horizon on SWE-bench Verified tasks for models released between 2023 and 2025.
  • Trend Quantification: Shorter Doubling Time: The trend line fitted to this data suggests a doubling time of only 70 days, which is significantly faster than the ~212-day doubling time found using the authors' main task suite (Figure 8). The R-squared value (R² = 0.87) indicates a reasonably strong fit for this exponential trend on the SWE-bench data.
  • Source of Human Time Estimates: Crucially, the 'human time' estimates underlying the horizon calculation for this figure come from annotator estimates provided with SWE-bench Verified, rather than from direct timing of human baseliners performing the tasks, as was done for the authors' main dataset.
Scientific Validity
  • External Validation of Trend: Applying the time horizon methodology to an independent, standard benchmark (SWE-bench Verified) serves as a valuable external validation check. Observing a similar exponential trend strengthens the general hypothesis of rapid AI capability growth in relevant domains like software engineering.
  • Sensitivity to Human Time Estimation Method: The significant discrepancy in the doubling time (70 days vs. ~212 days) highlights the sensitivity of the quantitative results to the specific dataset and, critically, the method used for estimating human task completion times. The paper (Section 6.3, Appendix D.3) attributes this difference primarily to SWE-bench Verified's annotator time estimates potentially underestimating the difficulty of easier tasks compared to the authors' direct baselining, which compresses the time scale and steepens the apparent slope.
  • Benchmark Dependence of Doubling Time: While confirming the exponential pattern, the large difference in the rate (doubling time) cautions against treating any single doubling time estimate as definitive. It underscores that the absolute rate is benchmark- and methodology-dependent.
  • Reproducibility using Public Data: The analysis uses publicly available SWE-bench results and time estimates, enhancing reproducibility for this specific part of the study.
  • Limited Model/Time Span: The analysis is limited to a smaller set of recent models (6 distinct models/versions) over a shorter timeframe (2023-2025) compared to the main analysis, which might affect the robustness of the 70-day doubling time estimate.
  • Goodness-of-Fit Comparison: The R-squared value of 0.87, while indicating a good fit, is slightly lower than for the main analysis (R²~0.96-0.98), suggesting potentially more noise or deviation from a perfect exponential trend in the SWE-bench data or its time estimates.
Communication
  • Consistent Visual Format: The figure maintains the consistent log-linear plot format used throughout the paper (log time horizon vs. linear release date), which effectively visualizes exponential growth and aids comparison with previous figures (e.g., Figure 1, Figure 8).
  • Clear Axis Labeling: Axes are clearly labeled, specifying the 50% success rate time horizon and the model release date range (2023-2025).
  • Legend Clarity: The legend clearly identifies the specific frontier models included in this analysis.
  • Inclusion of Trend Metrics: Trend metrics (doubling time: 70 days, R²: 0.87) and the relevant time period (2023-01-01+ data) are displayed directly on the plot, summarizing the quantitative findings for this specific benchmark.
  • Caption Clarity and Data Source Identification: The caption explicitly states the data source (SWE-bench Verified results), making it clear that this analysis uses an external benchmark.
  • Visual Representation of Steeper Slope: The steeper slope compared to Figure 1/8 is visually evident, supporting the caption's and text's assertion.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 3: Number of different categories of failures for 31 failed runs by GPT-4...
Full Caption

Table 3: Number of different categories of failures for 31 failed runs by GPT-4 1106 and 32 failed runs by o1 (Section 5).

Figure/Table Image (Page 13)
Table 3: Number of different categories of failures for 31 failed runs by GPT-4 1106 and 32 failed runs by o1 (Section 5).
First Reference in Text
We report the results in Table 3.
Description
  • Purpose: Failure Analysis Comparison: This table summarizes a qualitative analysis of why two different AI models, GPT-4 1106 and o1, failed on tasks they did not successfully complete. Researchers looked at 31 instances where GPT-4 1106 failed and 32 instances where o1 failed.
  • Failure Categories: The table lists five distinct 'Failure type' categories identified by the researchers: 'Poor planning/tool choice' (making a bad plan or picking the wrong tool), 'Incorrect mental math/reasoning' (errors in logic or calculation), 'Premature task abandonment' (giving up or submitting an answer too early without checking), 'Repeating failed actions' (trying the same unsuccessful approach multiple times), and 'Other' (failures not fitting the main categories).
  • Failure Counts per Model: For each failure category, the table shows the number of times that type of failure was observed for GPT-4 1106 and for o1. For example, 'Poor planning/tool choice' occurred 4 times for GPT-4 and 6 times for o1. 'Incorrect mental math/reasoning' occurred 6 times for GPT-4 and 7 times for o1.
  • Key Differences Observed: The table highlights notable differences. GPT-4 1106 frequently failed by 'Repeating failed actions' (12 out of 31 failures), whereas this was rare for o1 (only 2 out of 32 failures). Conversely, o1 failed more often due to 'Premature task abandonment' (16 out of 32 failures) compared to GPT-4 (8 out of 31 failures).
  • Total Counts: The total number of categorized failures matches the number of runs analyzed for each model (31 for GPT-4, 32 for o1), indicating each failed run was assigned to one category.
Scientific Validity
  • Subjectivity and Reliability of Manual Labeling: The failure categorization relies on manual labeling by contractors based on predefined criteria (described in Section 5). The reliability and consistency of this labeling process are crucial but not explicitly quantified (e.g., inter-rater reliability). Subjectivity in interpreting agent behavior could influence the counts.
  • Sample Size and Representativeness: The analysis is based on a sample of 31 and 32 failed runs for each model, respectively. While providing insights, these sample sizes are relatively small, and the representativeness of these samples regarding the models' overall failure modes across all tasks might be limited.
  • Exclusivity of Failure Categories: The failure categories are presented as mutually exclusive, but in practice, a single failed run might exhibit multiple issues. The methodology for assigning a single primary failure category is not detailed, which could affect interpretation.
  • Confounding Factor: Task Difficulty at Failure: The comparison between GPT-4 1106 and o1 is informative, suggesting different dominant failure modes. However, as noted in the caption text, o1 generally succeeds on more tasks, meaning its failures occur on comparatively harder tasks than GPT-4's failures. This difference in the difficulty of tasks where failures occur complicates direct comparison of failure types as solely reflecting inherent model weaknesses.
  • Value as Qualitative Insight: The qualitative analysis provides valuable hypotheses about model improvements (e.g., o1 being better at adapting from mistakes) and remaining weaknesses (e.g., o1 prematurely abandoning tasks), complementing the quantitative performance metrics.
Communication
  • Clarity and Simplicity: The table uses a clear and simple format to compare the frequency of different failure types between two AI models (GPT-4 1106 and o1).
  • Categorization Clarity: The failure categories listed are reasonably distinct and provide insight into the different ways the agents failed during task execution.
  • Use of Raw Counts: Presenting raw counts allows for direct comparison, although percentages might have offered additional perspective on the relative prevalence of each failure type within each model's total failures (a small sketch after this list computes such percentages).
  • Informative Caption: The caption clearly states the models being compared, the total number of failed runs analyzed for each (31 for GPT-4, 32 for o1), and references the relevant section for context.
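
For reference, the counts discussed above convert to within-model percentages as in the sketch below (the 'Other' counts are inferred from the stated totals of 31 and 32 categorized failures).

counts = {
    "Poor planning/tool choice":       {"GPT-4 1106": 4,  "o1": 6},
    "Incorrect mental math/reasoning": {"GPT-4 1106": 6,  "o1": 7},
    "Premature task abandonment":      {"GPT-4 1106": 8,  "o1": 16},
    "Repeating failed actions":        {"GPT-4 1106": 12, "o1": 2},
    "Other":                           {"GPT-4 1106": 1,  "o1": 1},  # inferred from totals
}
totals = {"GPT-4 1106": 31, "o1": 32}
for failure_type, per_model in counts.items():
    shares = ", ".join(f"{m}: {n / totals[m]:.0%}" for m, n in per_model.items())
    print(f"{failure_type:<33} {shares}")
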
Figure 12: A sensitivity analysis of the extrapolated date at which frontier AI...
Full Caption

Figure 12: A sensitivity analysis of the extrapolated date at which frontier AI systems will have a horizon of 1 month.

Figure/Table Image (Page 18)
Figure 12: A sensitivity analysis of the extrapolated date at which frontier AI systems will have a horizon of 1 month.
First Reference in Text
In each row, we apply 10,000 random perturbations to our data and find the distribution over the date of 1-month AI implied by the perturbed data (Figure 12).
Description
  • Purpose: Sensitivity Analysis of 1-Month AI Forecast: This figure explores the uncertainty surrounding the prediction of when AI might reach a '1-month time horizon' (capable of completing tasks that take humans 1 month, or 167 working hours). It does this through a 'sensitivity analysis', which examines how much the predicted date changes when various sources of randomness or uncertainty in the original data and analysis method are considered.
  • Visualization Method: Box Plots: The figure displays multiple horizontal 'box plots' (also known as box-and-whisker plots). Each plot summarizes the distribution of predicted dates for achieving the 1-month horizon, obtained by running the prediction 10,000 times with random variations introduced according to a specific source of uncertainty.
  • Sources of Uncertainty Analyzed: Each row corresponds to a different source of uncertainty being tested: 'Bootstrap (tasks)' reflects uncertainty due to the specific set of tasks chosen; 'Bootstrap (runs)' reflects randomness in individual AI attempts; 'Bootstrap (models)' reflects uncertainty from the specific set of models analyzed; 'Weighting/Regularization' tests different analysis choices; 'IID Baseline Noise' simulates randomness in the human time estimates. (A minimal bootstrap example follows this list.)
  • Interpretation of Box Plots: Within each box plot, the central line typically represents the median predicted date, the box spans the interquartile range (IQR, the middle 50% of predictions), and the 'whiskers' extend to cover a wider range (here, the 10th to 90th percentiles). Wider boxes/whiskers indicate greater uncertainty from that source.
  • Overall Uncertainty and Trend Comparison: The bottom two rows show 'Overall' uncertainty estimates. One combines the uncertainties assuming the trend observed from 2019-2025 continues. The other combines uncertainties assuming the potentially faster trend observed only in 2024-2025 continues. This latter prediction is centered earlier (around late 2027/early 2028) but has wider uncertainty.
  • Forecast Range (Statistical Uncertainty Only): The central estimate from the overall 2019-2025 trend analysis falls in late 2029, with the 10th-90th percentile range spanning roughly mid-2028 to mid-2030, based purely on these modeled statistical uncertainties.
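
A minimal sketch of one row of this analysis ('Bootstrap (models)'): resample the models with replacement, refit the log-linear trend, and record the date at which the fitted horizon reaches one work-month (167 hours). The (release date, horizon) pairs below are placeholders, and the paper's full analysis also perturbs tasks, runs, baseline times, and weighting choices.

from datetime import date, timedelta
import numpy as np

rng = np.random.default_rng(0)
# Placeholder (release date, 50% horizon in minutes) pairs.
dates    = [date(2019, 2, 14), date(2020, 6, 11), date(2023, 3, 14),
            date(2024, 6, 20), date(2025, 2, 24)]
horizons = [0.03, 0.15, 5.0, 25.0, 55.0]
TARGET_MINUTES = 167 * 60                       # one work-month of human time

t0 = min(dates)
x = np.array([(d - t0).days for d in dates], dtype=float)
y = np.log2(horizons)

days_to_target = []
for _ in range(10_000):
    idx = rng.integers(0, len(x), len(x))       # resample models with replacement
    if len(set(x[idx])) < 2:                    # need two distinct dates to fit a slope
        continue
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    days_to_target.append((np.log2(TARGET_MINUTES) - intercept) / slope)

lo, med, hi = np.percentile(days_to_target, [10, 50, 90])
as_date = lambda d: (t0 + timedelta(days=float(d))).isoformat()
print(f"1-month horizon: median {as_date(med)}, 10th-90th pct {as_date(lo)} to {as_date(hi)}")
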
Scientific Validity
  • Appropriateness of Sensitivity Analysis Framework: The sensitivity analysis framework, involving perturbations based on identified sources of statistical noise (sampling of tasks, runs, models; measurement noise; analysis choices), is a valid approach to quantify the uncertainty inherent in the historical data and modeling process.
  • Validity of Perturbation Methods: Bootstrapping is a standard and robust statistical technique for estimating uncertainty arising from sampling (tasks, runs, models). Simulating noise in baseline times and varying analysis parameters are also reasonable ways to probe sensitivity.
  • Identification of Dominant Uncertainties: The analysis correctly identifies that statistical uncertainty related to the past trend (represented by the widths of the box plots) is relatively small compared to the potential impact of assuming different future trend rates (comparing the bottom two 'Overall' rows) or unmodeled factors.
  • Crucial Limitation: Exclusion of Future Trend Changes and External Validity: A major limitation, clearly stated in the caption and text, is that this analysis only captures uncertainty quantifiable from the existing data and methodology. It explicitly excludes uncertainty about future changes in the AI development trend (e.g., acceleration or slowdown) and external validity (whether the trend on these benchmark tasks generalizes to real-world capabilities needed for a 1-month horizon). These unmodeled factors likely dominate the true forecast uncertainty.
  • Sensitivity to Assumed Trend Period: The comparison between extrapolating the full 2019-2025 trend versus the potentially faster 2024-2025 trend highlights the significant sensitivity of long-term forecasts to assumptions about near-term dynamics. The wider uncertainty associated with the 2024-2025 trend extrapolation appropriately reflects the greater statistical uncertainty in estimating a trend from fewer data points.
  • Relevance of Target Horizon: The choice of 1 month (167 hours) as the target horizon is justified in the text (Section 7.1) based on previous conceptual work (Ngo 2023) and its potential significance for transformative AI, lending relevance to the specific extrapolation target.
Communication
  • Effective Use of Box Plots: The use of box plots effectively visualizes the distribution (median, interquartile range, 10th-90th percentiles) of the extrapolated dates resulting from perturbations related to each source of uncertainty.
  • Clear Labeling of Uncertainty Sources: Each row is clearly labeled with the specific source of uncertainty being analyzed (e.g., 'Bootstrap (tasks)', 'Weighting/Regularization'), making it easy to understand what each distribution represents.
  • Facilitation of Comparison: Comparing the width and position of the different box plots allows for a straightforward visual assessment of the relative impact of each modeled source of uncertainty on the forecast.
  • Clear X-axis Labeling: The x-axis clearly indicates the extrapolated date range, providing context for the distributions.
  • Clarity of 'Overall' Trend Comparison: The inclusion of 'Overall' distributions, especially comparing the full trend vs. the recent trend, effectively highlights the significant impact of assuming different underlying growth rates.
  • Informative and Caveated Caption: The caption accurately describes the figure's purpose as a sensitivity analysis but crucially (and appropriately) notes that it does not account for future trend changes or external validity concerns, managing reader expectations about the forecast's limitations.
Figure 13: Cost of a successful run using an LLM agent as a fraction of the...
Full Caption

Figure 13: Cost of a successful run using an LLM agent as a fraction of the cost of the salary of a human expert performing the same task.

Figure/Table Image (Page 22)
Figure 13: Cost of a successful run using an LLM agent as a fraction of the cost of the salary of a human expert performing the same task.
First Reference in Text
This implies that if inference-time computation could be used to improve performance, there is substantial room to do so while still remaining economically competitive with human experts (Figure 13).
Description
  • Purpose: Comparing AI vs Human Costs: This figure is a scatter plot illustrating the economic cost of using an AI agent (specifically, a Large Language Model or LLM agent) to successfully complete a task, compared to the cost of paying a human expert to do the same task.
  • X-axis: Task Length (Human Time): Each point represents one successful run of an AI agent on a specific task. The horizontal position (x-axis) shows the 'Task Length' – how long that task typically takes a human expert, plotted on a logarithmic scale (from 1 second to 1 day).
  • Y-axis: Cost Ratio (Model Cost / Human Cost): The vertical position (y-axis) shows the 'Cost Ratio', calculated as the AI model's computational cost for that successful run divided by the estimated salary cost of a human expert performing the task for the corresponding duration. This axis is also logarithmic, spanning ratios from 0.00001 (AI is 100,000 times cheaper) to 1 (AI cost equals human cost).
  • Cost Assumptions: The human cost is based on an assumed average salary for a relevant expert (specifically, $143.61/hour, mentioned in Section 8.2 as an average L4 Engineer salary at Google divided by 2000 hours). The AI cost likely refers to the expense of using the LLM's computational resources (inference cost); a cost-ratio sketch follows this description.
  • Main Observation: AI Runs Generally Cheaper: The vast majority of the 1460 data points shown are clustered in the lower part of the graph, with cost ratios significantly below 1, and often below 0.1 (or 10%). This indicates that for most tasks where the AI succeeded in this study, the computational cost was less than 10% of the estimated cost of human expert labor for the same task duration.
  • Relationship between Length and Cost Ratio: There is a general, though noisy, positive relationship: longer tasks tend to incur higher absolute AI costs and, in this plot, generally higher cost ratios as well, although many long tasks still show low relative costs.
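A minimal sketch of the plotted ratio, assuming the $143.61/hour figure from Section 8.2; the inference spend and task duration in the example are hypothetical.

```python
HUMAN_RATE_PER_HOUR = 143.61  # assumed expert rate from Section 8.2

def cost_ratio(model_cost_usd, human_task_minutes):
    """Model inference cost of a successful run divided by the imputed cost of
    a human expert working on the task for its baseline duration."""
    human_cost_usd = HUMAN_RATE_PER_HOUR * human_task_minutes / 60.0
    return model_cost_usd / human_cost_usd

# Hypothetical run: $1.20 of inference spend on a task with a 2-hour human baseline.
print(cost_ratio(1.20, 120))  # ~0.004, i.e. well below the 0.1 line in the figure
```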
Scientific Validity
  • Sensitivity to Human Cost Assumption: The calculation heavily depends on the assumed human expert salary ($143.61/hour). Different salary assumptions (e.g., for different expertise levels, locations, or inclusion of overhead costs) would shift the cost ratios up or down. The chosen figure seems plausible for a skilled software engineer in a high-cost region but might not be universally representative.
  • Definition and Variability of Model Cost: The definition and calculation of 'Model Cost' are not detailed here but are crucial. This typically involves API costs or estimated costs of running the model, which can vary significantly between models, providers, and over time. It likely focuses on inference cost, potentially excluding training or fine-tuning costs.
  • Exclusion of Failed Run Costs: The plot only includes data from successful AI runs. Failed runs also incur costs but produce no value, affecting the overall economic viability. This analysis focuses specifically on the cost when successful.
  • Support for Economic Competitiveness Argument: The figure provides evidence supporting the economic argument made in the text: since current successful runs are often very cheap relative to human labor, there is headroom to potentially spend more computational resources (e.g., on techniques like best-of-k sampling or self-reflection mentioned in Section 8.2) to improve AI success rates while potentially remaining cost-competitive.
  • Comparison Basis (AI Cost vs Human Time Cost): The ratio compares the AI agent's compute cost against the labor cost implied by the human baseline time. The AI agent's own wall-clock time, which may be faster or slower than the human baseline, is not factored into the ratio.
Communication
  • Clear Visualization Format: The scatter plot effectively visualizes the relationship between task length and the relative cost of using an AI agent versus a human expert for successful task completions.
  • Appropriate Use of Log-Log Scales: The use of logarithmic scales on both axes is appropriate for displaying data that spans several orders of magnitude (task lengths from seconds to hours, cost ratios from <0.00001 to >1). This allows the dense cluster of low-cost, short-duration tasks to be seen alongside longer, more expensive runs.
  • Axis and Caption Clarity: Axes are clearly labeled ('Task Length' and 'Cost Ratio (Model Cost / Human Cost)'). The caption clearly explains what the cost ratio represents.
  • Visual Support for Textual Claim: The plot visually supports the claim made in the text (Section 8.2) that most successful AI runs are significantly cheaper than the estimated human cost, with the majority of points falling well below the 0.1 (10%) cost ratio line.
  • Contextual Information in Title: The title includes the sample size (n=1460 successful runs analyzed) and notes that a small percentage (1.2%) are outside the visible range, providing useful context about the data coverage.
Figure 15: The 7 original RE-Bench tasks.
Figure/Table Image (Page 27)
Figure 15: The 7 original RE-Bench tasks.
First Reference in Text
RE-Bench See Figure 15 for a description of the RE-Bench tasks.
Description
  • Purpose: Describe RE-Bench Tasks: This figure presents a table summarizing the seven tasks included in the 'RE-Bench' benchmark suite, which was designed to evaluate the capabilities of AI systems on complex machine learning (ML) research and engineering problems.
  • Task Examples and Descriptions: The table lists each task (referred to as an 'Environment') and provides a 'Brief description' of the goal. The tasks cover a range of ML-related challenges:
  • Optimize LLM Foundry: Reduce the running time of a given script for fine-tuning a large language model (LLM).
  • Optimize a Kernel: Write efficient low-level code (a 'kernel') for a graphics processing unit (GPU) to speed up a specific computation (prefix sum).
  • Fix Embedding: Repair a 'corrupted' ML model whose embeddings have been mixed up, aiming to recover its original performance on a standard text dataset (OpenWebText).
  • Scaling Law Experiment: Predict how well a larger ML model will perform based on experiments with much smaller models, testing understanding of 'scaling laws' (how model performance changes with size, data, and compute).
  • Restricted Architecture MLM: Build a text prediction model (Masked Language Model, or MLM) using only a limited set of basic programming tools (PyTorch primitives), testing creativity under constraints.
  • Finetune GPT-2 for QA: Improve the performance of the GPT-2 language model specifically for question-answering tasks, making it an effective chatbot.
  • Scaffolding for Rust Codecontest: Develop prompts and support structures ('scaffolding') to help another AI model (GPT-3.5) solve competitive programming problems written in the Rust language.
  • Scoring Functions: For each task, the table specifies the 'Scoring function' used to measure the AI's performance. These functions vary depending on the task goal, including measuring runtime (log time), prediction accuracy (log loss), or competitive success (win percentage against other models or percentage of problems solved).
Scientific Validity
  • Accurate Description of Benchmark: The table accurately describes the tasks comprising the RE-Bench suite, as defined in the original RE-Bench paper (Wijk et al., 2024, reference [2]).
  • Relevance and Complexity of Tasks: RE-Bench tasks are designed to be significantly more complex and open-ended than typical coding benchmarks like HumanEval, requiring multi-step reasoning, experimentation, and interaction with complex codebases. They represent a step towards evaluating more realistic research engineering capabilities.
  • Objectivity of Scoring Functions: The scoring functions are objective and automatically calculable, which is essential for standardized benchmarking. They directly reflect the optimization goal stated in the task description (e.g., minimizing loss, maximizing win rate).
  • Transparency about Task Limitations: The footnote regarding the 'Restricted Architecture MLM' task acknowledges a potential issue where models might circumvent the intended constraints, demonstrating transparency about benchmark limitations.
  • Contribution to Overall Methodology: As part of the overall methodology, the inclusion of these complex, long-duration tasks (estimated 8 hours human time) is crucial for probing the capabilities of frontier models beyond simpler tasks.
Communication
  • Clear Tabular Structure: The table effectively summarizes the 7 RE-Bench tasks using a clear, structured format with columns for Environment, Brief description, and Scoring function.
  • Concise Task Descriptions: The 'Brief description' column provides concise summaries of complex tasks, giving readers a good sense of the challenges involved (e.g., optimizing runtime, fixing corrupted models, predicting scaling laws).
  • Explicit Scoring Functions: The 'Scoring function' column clearly states the objective metric used to evaluate performance on each task (e.g., log time, log loss, win percentage), which is crucial for understanding how success is measured.
  • Categorization by Goal: Grouping tasks by the optimization goal (runtime, loss, win-rate) provides helpful categorization.
  • Reference Value: The figure serves as a useful reference, consolidating information about the RE-Bench suite mentioned in the text.
Figure 17: Linear, hyperbolic, and exponential fits for model time horizon...
Full Caption

Figure 17: Linear, hyperbolic, and exponential fits for model time horizon since 2019.

Figure/Table Image (Page 36)
Figure 17: Linear, hyperbolic, and exponential fits for model time horizon since 2019.
First Reference in Text
Linear and hyperbolic curves have poor fits (Figure 17).
Description
  • Core Data Presented: This figure revisits the plot of AI model 50% time horizon versus release date (from 2019 onwards), showing the same data points as Figure 1 and Figure 8.
  • Comparison of Trend Line Fits: The key feature of this figure is that it overlays three different mathematical models describing the trend in the data: a linear fit, a hyperbolic fit, and an exponential fit. On this log-linear plot only the exponential fit appears as a straight line; the linear and hyperbolic fits appear as curves.
  • Goodness-of-Fit Metrics (R-squared): The plot displays the R-squared (R²) value for each fit. R-squared is a statistical measure indicating how well the fitted line explains the variation in the data, with values closer to 1 indicating a better fit. The exponential fit has a much higher R² (0.98) compared to the hyperbolic (0.68) and linear (0.50) fits.
  • Visual Assessment of Fits: Visually, the exponential fit (straight line on this plot) closely follows the data points, especially the more recent ones, while the linear and hyperbolic fits clearly deviate significantly from the observed trend.
  • Exponential Doubling Time: The doubling time (212 days) associated with the best-fitting exponential model is also reiterated (a fitting sketch follows this list).
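For reference, a minimal sketch of one way to obtain such an exponential fit: ordinary least squares on log2 of the horizon against release date, with the doubling time derived from the slope and R² computed on the log-transformed values. The data points are hypothetical and the authors' exact fitting and weighting choices may differ; the linear and hyperbolic alternatives would be fit analogously with different functional forms.

```python
import numpy as np

def fit_exponential(release_years, horizon_minutes):
    """OLS fit of log2(horizon) against release date; the slope is doublings/year."""
    x = np.asarray(release_years, dtype=float)
    y = np.log2(np.asarray(horizon_minutes, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return 365.25 / slope, r2          # doubling time in days, R^2

# Hypothetical horizons (minutes) roughly doubling every ~7 months, 2019-2025.
years    = np.array([2019.1, 2020.5, 2022.2, 2023.2, 2024.5, 2025.1])
horizons = np.array([0.05, 0.6, 4.0, 8.0, 30.0, 60.0])
print(fit_exponential(years, horizons))
```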
Scientific Validity
  • Valid Model Comparison Approach: Comparing different functional forms is a standard and valid approach in trend analysis to determine the model that best describes the data. This figure provides strong justification for selecting the exponential model.
  • Strong Quantitative Support for Exponential Fit: The R-squared values provide quantitative support for the visual assessment. The stark difference (0.98 vs 0.68/0.50) strongly suggests that the exponential model is a significantly better description of the observed capability growth trend on this metric than linear or hyperbolic models over this period.
  • Reinforcement of Main Claim: The analysis reinforces the paper's main claim of rapid, accelerating growth in AI capabilities as measured by the time horizon metric.
  • Limitations of Model Selection: While the exponential fit is clearly superior among the three tested, this does not guarantee that it is the 'true' underlying growth pattern or that it will continue indefinitely. Other complex growth models (e.g., logistic, double exponential) might eventually become relevant, but based on the available data (2019-2025), the simple exponential provides an excellent fit.
  • Robustness of Conclusion Despite Fit Details: The specific forms of the linear and hyperbolic models fitted are not detailed (e.g., were they fitted on log-transformed or linear y-values?), but their poor visual fit and low R-squared values make the conclusion robust regardless of minor variations in fitting procedure.
Communication
  • Direct Visual Comparison of Fits: Overlaying the three different functional fits (linear, hyperbolic, exponential) on the same plot allows for direct visual comparison of how well each model captures the trend in the data points.
  • Clear Quantitative Comparison (R-squared): Including the R-squared value for each fit directly on the plot (Linear: 0.50, Hyperbolic: 0.68, Exponential: 0.98) provides immediate quantitative evidence for the superiority of the exponential model, strongly supporting the reference text's claim of poor fits for linear and hyperbolic curves.
  • Consistent and Appropriate Axis Scaling: The use of a logarithmic y-axis is consistent with other figures showing this data and is appropriate for visualizing the exponential growth trend as approximately linear.
  • Justification for Model Choice: The figure effectively communicates why the exponential model was chosen for the main analysis and extrapolation presented in the paper.
Figure 18: Time horizon with continuous (non-binarized) scoring.
Figure/Table Image (Page 37)
Figure 18: Time horizon with continuous (non-binarized) scoring.
First Reference in Text
Claude 3.7 Sonnet has a 50% time horizon of nearly 2 hours.
Description
  • Purpose: Sensitivity Analysis with Continuous Scoring: This figure presents a sensitivity analysis, recalculating the AI model capability trend (50% time horizon vs. release date) using a different scoring method. Instead of 'binarized' scoring (where a task is either a success or failure based on a threshold), this analysis uses 'continuous' scoring, allowing for partial credit.
  • Continuous Scoring Explained: Continuous scoring means that for tasks where performance is measured on a scale (e.g., minimizing error in an ML model), the raw score (between 0 and 1) is used directly in the analysis, rather than being converted to a binary pass/fail based on whether it meets a specific threshold (such as human performance); a sketch of both scoring variants follows this description.
  • Plot Format and Data: The plot format is identical to Figure 1: model release date (2019-2027) on the x-axis and the 50% time horizon on the y-axis (logarithmic scale). Data points represent the calculated horizons for various models using continuous scores.
  • Recalculated Trend Metrics: The exponential trend is recalculated using these continuous-score-based horizons. The resulting doubling time is 201 days, very similar to the 212 days found with binarized scoring (Figure 8/1). The R-squared value is 0.97, indicating a similarly strong exponential fit.
  • Impact on Absolute Horizon Values: While the trend (doubling time) is similar, the absolute time horizon values calculated using continuous scoring are noticeably higher, especially for recent models. As mentioned in the reference text and caption, Claude 3.7 Sonnet's 50% horizon is nearly 2 hours (~120 minutes) with continuous scoring, compared to ~1 hour (59 minutes) with binarized scoring (Figure 5).
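A minimal sketch of the horizon calculation under both scoring variants, assuming the paper's logistic model of success probability against log task length. Treating continuous scores as weighted success/failure pairs is an assumption made here for illustration, not necessarily the authors' exact procedure, and the example data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fifty_percent_horizon(task_minutes, scores, continuous=False):
    """Fit P(success) = sigmoid(a + b*log2(t)) and return t (minutes) at P = 0.5."""
    t = np.log2(np.asarray(task_minutes, dtype=float))
    s = np.asarray(scores, dtype=float)
    if continuous:
        # Fractional scores treated as weighted success/failure pairs.
        X = np.concatenate([t, t]).reshape(-1, 1)
        y = np.concatenate([np.ones_like(s), np.zeros_like(s)])
        w = np.concatenate([s, 1.0 - s]) + 1e-9
    else:
        X = t.reshape(-1, 1)
        y = (s >= 0.5).astype(int)      # binarize at a fixed threshold
        w = None
    clf = LogisticRegression().fit(X, y, sample_weight=w)
    a, b = clf.intercept_[0], clf.coef_[0, 0]
    return 2.0 ** (-a / b)              # horizon where the fitted curve crosses 0.5

# Hypothetical per-task scores in [0, 1] against human task length in minutes.
minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
scores  = [1, 1, 0.9, 1, 0.3, 0.8, 0.4, 0.6, 0.1, 0]
print(fifty_percent_horizon(minutes, scores))                   # binarized
print(fifty_percent_horizon(minutes, scores, continuous=True))  # continuous
```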
Scientific Validity
  • Robustness Check for Scoring Method: Using continuous scoring serves as a valid sensitivity analysis to assess whether the main finding (exponential trend with ~7-month doubling time) is robust to the choice of scoring method (binarized vs. continuous). The similar doubling time (201 days vs. 212 days) suggests the trend itself is robust.
  • Potential Benefit: Capturing Partial Credit: Continuous scoring potentially captures more signal, especially from complex tasks like RE-Bench where models might achieve partial success without meeting the binary threshold. This could arguably provide a more nuanced measure of capability.
  • Interpretation Challenge and Potential Overstatement: The interpretation of the 50% time horizon becomes less intuitive with continuous scoring. Instead of 'time for tasks the AI passes 50% of the time', it becomes 'time for tasks where the AI achieves an average score of 0.5'. The paper acknowledges this might 'overstate' the horizon (caption text), as achieving an average score of 0.5 might be easier than achieving binary success 50% of the time, especially if partial credit is easily obtained.
  • Highlighting Sensitivity of Absolute Values: The significantly higher absolute horizon values (e.g., ~2 hours vs ~1 hour for Claude 3.7 Sonnet) highlight the substantial impact of the scoring definition on the resulting capability estimate. This underscores the importance of clearly defining 'success' when measuring AI performance.
  • Methodological Rigor: The analysis demonstrates methodological rigor by exploring the impact of key analysis choices (scoring method).
Communication
  • Consistent Plot Format: The figure uses the same consistent log-linear plot format (log time horizon vs. linear release date) as Figures 1, 6, 8, etc., facilitating comparison of results under different analysis assumptions.
  • Clear Title Indicating Variation: The title clearly indicates the key difference from the main analysis: the use of 'Continuous (non-binarized) scoring'. This immediately informs the reader about the methodological variation being tested.
  • Standard Labeling: Axes are clearly labeled, and the legend identifies the plotted models.
  • Inclusion of Trend Metrics: The calculated trend metrics (doubling time: 201 days, R²: 0.97) are displayed directly on the plot, allowing for easy comparison with the main analysis (212 days, R²: 0.98 in Fig 8/1).
  • Visual Clarity of Higher Horizons: The visual representation clearly shows higher absolute time horizon values for recent models compared to Figure 1, supporting the text's claim that continuous scoring yields higher horizons.
Figure 19: 2024-2025 and 2019-2025 exponential fits for 50% time horizon.
Figure/Table Image (Page 37)
Figure 19: 2024-2025 and 2019-2025 exponential fits for 50% time horizon.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Comparing Long-term vs. Recent Trends: This figure compares two different exponential trend lines fitted to the 50% time horizon data for AI models, plotted against their release dates.
  • Data and Axes: The plot shows the same underlying data points (model time horizons vs. release date) as Figures 1, 8, and 17, using a logarithmic scale for the time horizon (y-axis) and a linear scale for the release date (x-axis, 2019-2027).
  • Trend 1: Full History (2019-2025): One trend line (likely gray, consistent with Fig 8) represents the exponential fit using data from the entire 2019-2025 period. This fit yields a doubling time of 212 days.
  • Trend 2: Recent History (2024-2025): The second trend line (likely dashed blue) represents an exponential fit using only the data points from models released in the 2024-2025 period. This fit yields a much shorter doubling time of 118 days.
  • Observation: Steeper Recent Trend: The visual comparison shows that the trend line based only on the most recent data points (2024-2025) is significantly steeper than the trend line based on the entire 6-year history, suggesting a potential acceleration in the rate of capability improvement in the latest period.
Scientific Validity
  • Valid Approach for Investigating Acceleration: Comparing trends fitted over different time periods is a valid method to investigate potential changes or accelerations in the underlying growth rate. This analysis directly addresses the question of whether recent progress is faster than the long-term average.
  • Statistical Uncertainty of Recent Trend: The trend fitted only to 2024-2025 data is based on very few data points (likely only 6 models, as mentioned in Section 7.2). Trend lines fitted to small datasets are highly sensitive to noise and individual data points, making the 118-day doubling time estimate statistically uncertain and potentially unreliable as an indicator of a sustained new trend. The paper acknowledges this uncertainty elsewhere (Section 6.1, 7.2).
  • Potential Causes of Apparent Acceleration (Signal vs. Noise): The apparent acceleration could be genuine, reflecting faster underlying progress in AI development recently. Alternatively, it could be noise, an artifact of the specific models released in 2024-2025 happening to perform particularly well relative to the long-term trend, or influenced by methodological factors like improved elicitation for recent models (Section 8.2). Disentangling these possibilities requires more data over time.
  • Relevance for Forecasting: While statistically uncertain, the possibility of acceleration highlighted by the 118-day doubling time is highly relevant for forecasting, as discussed in Section 7. Even if uncertain, it represents a plausible faster trajectory that significantly impacts extrapolation.
  • Dependence on Underlying Data: The comparison relies on the same underlying time horizon data and methodology as the main analysis, inheriting its strengths and weaknesses (e.g., dependence on task suite, human baselines).
Communication
  • Clear Overlay for Trend Comparison: The figure effectively overlays two trend lines representing different time periods (full history vs. very recent) on the same plot, allowing for direct visual comparison of their slopes.
  • Explicit Labeling of Trends and Doubling Times: The trend lines are clearly labeled with their respective time periods and calculated doubling times (212 days for 2019-2025, 118 days for 2024-2025), making the quantitative difference immediately apparent.
  • Visual Distinction of Trend Lines: Using distinct colors (or line styles, though colors are implied) helps differentiate the two trend lines.
  • Consistent Axis Scaling: The consistent log-linear axes facilitate the interpretation of exponential growth and comparison with other figures.
  • Highlighting Potential Acceleration: The plot visually highlights the possibility of recent acceleration in the growth trend, as the 2024-2025 line has a noticeably steeper slope.
Table 8: We convert the SWE-bench Verified time annotations into task...
Full Caption

Table 8: We convert the SWE-bench Verified time annotations into task estimates, by taking the geometric mean of the time annotation.

Figure/Table Image (Page 38)
Table 8: We convert the SWE-bench Verified time annotations into task estimates, by taking the geometric mean of the time annotation.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: SWE-bench Time Estimate Interpretation and Validation: This table details how the researchers interpreted the time estimates provided with the external SWE-bench Verified benchmark and compares these estimates to their own measurements for some tasks.
  • SWE-bench Time Buckets: SWE-bench Verified categorizes tasks into four 'Task Time Buckets' based on estimated human fix time: '< 15 min', '15 min-1 hour', '1-4 hours', and '> 4 hours'.
  • Derivation of Task Time Estimates: To get a single number for analysis, the authors calculated a 'Task Time Estimate' for each bucket, stated in the caption to be the 'geometric mean of the time annotation'. The '15 min-1 hour' bucket yields 30.0 min (the geometric mean of 15 and 60) and the '1-4 hours' bucket yields 120.0 min (the geometric mean of 60 and 240). The '< 15 min' estimate of 3.9 min is consistent with assuming a 1-minute lower bound (√(1×15) ≈ 3.9), and the '> 4 hours' estimate of 480.0 min is consistent with a 16-hour (960-minute) upper bound (√(240×960) = 480 min, i.e. 8 hours), though these bounds are not stated explicitly (a calculation sketch follows this description).
  • Comparison with Actual Baseline Times: Crucially, the table includes an 'Average Baseline Time' column. This shows the average time it actually took the authors' own human baseliners (external contractors) to complete a sample of tasks falling into those SWE-bench buckets. This data is only available for the first three buckets.
  • Observed Discrepancy: There's a major difference for the shortest bucket: SWE-bench's '< 15 min' bucket was estimated at 3.9 min by the authors' method, but their baseliners actually took an average of 32.9 minutes for tasks in this category. For the '1-4 hours' bucket, the estimate (120.0 min) was closer to the measured average baseline time (131.6 min).
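A minimal sketch of the bucket-to-estimate conversion; the 1-minute lower bound and 960-minute upper bound for the open-ended buckets are inferred from the reported values rather than stated in the table.

```python
import math

# Assumed bucket boundaries in minutes; the 1-min and 960-min endpoints for the
# open-ended buckets are inferred because sqrt(1*15) ~= 3.9 and sqrt(240*960) = 480.
BUCKET_BOUNDS_MIN = {
    "< 15 min":      (1, 15),
    "15 min-1 hour": (15, 60),
    "1-4 hours":     (60, 240),
    "> 4 hours":     (240, 960),
}

def geometric_mean(lo, hi):
    return math.sqrt(lo * hi)

for bucket, (lo, hi) in BUCKET_BOUNDS_MIN.items():
    print(f"{bucket:>14}: {geometric_mean(lo, hi):6.1f} min")
# -> 3.9, 30.0, 120.0, 480.0, matching the 'Task Time Estimate' column in Table 8
```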
Scientific Validity
  • Method for Converting Buckets to Estimates: Converting time buckets into single point estimates (using geometric mean) is a necessary step for quantitative analysis like the regression shown in Figure 11. Using the geometric mean is reasonable for data spanning orders of magnitude.
  • Importance of Empirical Validation: The empirical validation by comparing estimated times with actual baseline times measured by the authors (even on a small sample) is a critical piece of analysis. It reveals a significant methodological issue with using the SWE-bench annotations directly.
  • Explanation for Steeper Slope in Figure 11: The finding that the '< 15 min' bucket is significantly underestimated by the SWE-bench annotations (average actual time ~33 min vs. estimate ~4 min) provides a strong, data-driven explanation for why the trend line in Figure 11 (using SWE-bench estimates) appears much steeper than the main analysis trend (Figure 1/8). Underestimating the difficulty of easy tasks compresses the time scale at the lower end, artificially increasing the slope of the capability growth curve.
  • Limited Sample Size for Validation: The validation is based on a limited number of baseline runs (Appendix D.3 mentions 7 baselines across 6 tasks). While the discrepancy for the '< 15 min' bucket is large and likely robust, the exact magnitude of underestimation might vary with more data.
  • Implications for Benchmark Comparability: This finding highlights the critical dependence of time horizon calculations on the accuracy of human time estimates, especially for shorter tasks where relative errors can be large. It suggests caution when comparing results across benchmarks that use different methods for estimating task difficulty.
Communication
  • Clear Comparison Structure: The table clearly presents the different time buckets used in SWE-bench Verified annotations and contrasts the derived 'Task Time Estimate' (geometric mean of bucket boundaries) with the 'Average Baseline Time' measured empirically by the authors for a subset of tasks.
  • Highlighting Time Discrepancy: It effectively highlights the significant discrepancy between the estimated time (3.9 min) and the actual average baseline time (32.9 min) for the '< 15 min fix' bucket, visually supporting the argument made in Appendix D.3 about underestimation.
  • Methodological Clarity in Caption: The caption explains the method used to derive the 'Task Time Estimate' (geometric mean), adding methodological transparency.
  • Context Provided by Multiple Buckets: Including data for multiple buckets allows readers to see that the discrepancy is most pronounced for the shortest time category.
  • Interpretive Aid in Caption: The cautionary note in the caption text about likely underestimation further aids correct interpretation.
Table 9: The time horizon of less capable models is substantially longer on our...
Full Caption

Table 9: The time horizon of less capable models is substantially longer on our tasks than on SWE-bench Verified.

Figure/Table Image (Page 38)
Table 9: The time horizon of less capable models is substantially longer on our tasks than on SWE-bench Verified.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Comparing Time Horizons Across Benchmarks: This table compares the calculated '50% time horizon' (a measure of AI capability based on the length of tasks it can reliably complete) for six different AI models when measured using two different sets of tasks.
  • Models Compared: The models listed range from slightly older (Claude 3 Opus, GPT-4 1106) to more recent ones (GPT-4o, Claude 3.5 Sonnet, o1).
  • Data Columns: Horizons on Two Benchmarks: For each model, the table shows the 50% time horizon calculated using the authors' own task suite ('Our Tasks' - HCAST, RE-Bench, SWAA) and the horizon calculated using the external SWE-bench Verified benchmark.
  • Ratio Calculation: A final column calculates the 'Model Time Horizon Ratio' by dividing the horizon on 'Our Tasks' by the horizon on 'SWE-bench Verified'.
  • Observation for Less Capable Models: The table shows that for the earlier models listed (Claude 3 Opus, GPT-4 1106), the time horizon measured on the authors' tasks is substantially longer (around 7-8 times longer) than when measured on SWE-bench Verified (e.g., 6.42 min vs 0.83 min for Opus; 8.56 min vs 1.18 min for GPT-4 1106).
  • Observation for More Capable Models: This discrepancy decreases for more recent models. For GPT-4o, the ratio is 1.5x. For Claude 3.5 Sonnet (New), it's 1.7x. For the most capable model listed according to 'Our Tasks', o1, the ratio is actually less than 1 (0.8x), meaning its horizon was measured as shorter on the authors' tasks (39.21 min) compared to SWE-bench Verified (51.21 min).
Scientific Validity
  • Demonstrates Benchmark Dependence: The comparison directly demonstrates the significant impact of the chosen benchmark (including its tasks and associated human time estimates) on the resulting time horizon metric. It shows that absolute horizon values are not directly comparable across different benchmarks.
  • Supports Hypothesis on SWE-bench Time Estimates: The results strongly support the hypothesis raised in Table 8 and Appendix D.3: the time estimates used in SWE-bench Verified (especially for easier tasks) likely underestimate actual human difficulty compared to the authors' baselining method. This leads to inflated performance (shorter time horizons) for less capable models on SWE-bench.
  • Trend in Ratio Across Models: The decreasing ratio for more capable models might suggest that the time estimation discrepancy between the benchmarks is less pronounced for harder tasks, or that the performance scaling of highly capable models is measured more consistently across the benchmarks.
  • Interpretation of o1 Ratio: The fact that o1 has a lower ratio (0.8x) is interesting. It could imply o1 performs relatively better on the authors' task mix compared to SWE-bench, or it might reflect noise/uncertainty in the horizon calculations at the higher end, or differences in scaffolding (as noted in Table 7).
  • Reliance on Calculation Validity and Model Sample: The comparison relies on the validity of the time horizon calculations for both datasets. The number of models compared (6) is limited but covers a relevant range of recent capabilities.
Communication
  • Clear Comparative Structure: The table clearly presents a side-by-side comparison of the 50% time horizon calculated using two different task sets ('Our Tasks' vs. 'SWE-bench Verified') for several AI models.
  • Direct Comparison: Listing the models and their corresponding horizons on both benchmarks allows for easy identification of discrepancies.
  • Effective Use of Ratio: The inclusion of the 'Model Time Horizon Ratio' column effectively quantifies the difference between the two benchmarks for each model, highlighting the main point summarized in the caption.
  • Caption Accuracy: The caption accurately reflects the key observation presented in the table – the substantial difference in horizons, particularly for less capable models.
Table 10: Messiness Factor Definitions 1-8
Figure/Table Image (Page 39)
Table 10: Messiness Factor Definitions 1-8
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Defining Task 'Messiness' Factors: This table provides detailed definitions for the first 8 out of 16 factors used by the researchers to assess the 'messiness' of tasks in the HCAST and RE-Bench suites. 'Messiness' refers to characteristics thought to make tasks more like real-world problems and potentially harder for AI, compared to clean, well-defined benchmark tasks.
  • Factors Defined (1-8): Each row defines one factor. The factors listed are: 1. Real life source (task derived from or representative of a real problem), 2. Resource limited (requires careful use of finite resources beyond just time/cost), 3. Not easily resettable (hard to undo actions and return to the start state), 4. Irreversible mistake availability (>20% chance a typical human makes an unrecoverable error early on), 5. Dynamic environment (environment changes independently of the agent), 6. Difficult counterfactuals (hard to determine the cause of outcomes), 7. Not purely automatic scoring (requires some manual judgment), 8. Implicit generalizability required (success requires fulfilling unstated goals or producing a useful output beyond explicit criteria).
  • Format: Factor Name and Definition/Criteria: The 'Definition' column provides specific criteria for determining whether a task possesses the characteristic described by the factor. These appear to be binary (yes/no) judgments that contribute to an overall messiness score for each task (as used in Section 6.2 and Figure 10).
Scientific Validity
  • Construct Validity and Relevance: Attempting to operationalize and quantify 'messiness' or real-world task complexity is a valuable endeavor, as standard benchmarks often lack these dimensions. The chosen factors appear prima facie relevant to capturing such complexities (e.g., resource constraints, dynamic environments, ambiguity).
  • Subjectivity and Reliability of Definitions: The definitions rely heavily on subjective judgments (e.g., 'somewhat likely (>20%)', 'significant', 'saliently possible', 'sensible judgement calls'). This inherent subjectivity makes consistent and reliable rating across different individuals challenging, potentially impacting the validity of the messiness scores. Quantifying inter-rater reliability would strengthen this aspect.
  • Factor Selection and Weighting: The selection of these specific 16 factors (including those in Table 11) is based on researcher judgment about what constitutes relevant 'messiness'. While plausible, the list may not be exhaustive, and the equal weighting of each factor in the final score (implied by Section 6.2) is a simplification.
  • Operationalization for Binary Scoring: The definitions provide clear criteria for the presence/absence of each factor, enabling the binary scoring used in the analysis (Section 6.2).
  • Potential for Adversarial Selection Bias: As acknowledged in Appendix D.4, selecting factors based partly on perceived relevance to AI performance might introduce some bias, potentially overstating the correlation between 'messiness' and AI difficulty if factors were chosen specifically because current models struggle with them.
Communication
  • Clear Structure and Format: The table clearly defines the first eight factors contributing to the 'messiness' score. The two-column format (Factor | Definition) is straightforward.
  • Detailed Definitions: The definitions provided are detailed, often including specific criteria or examples (e.g., the conditions for 'Resource limited', the >20% likelihood for 'Irreversible mistake'). This level of detail aims to guide consistent rating.
  • Appropriate Placement (Appendix): Placing these detailed definitions in an appendix (implied location based on typical paper structure and reference in Section 6.2) is appropriate, keeping the main text focused while providing necessary methodological background.
  • Intuitive Factor Names: The factor names themselves (e.g., 'Dynamic environment', 'Resource limited') provide a quick intuition about the concept being captured.
  • Potential Complexity/Ambiguity: While detailed, the complexity and occasional subjectivity within the definitions (e.g., judging 'significant' parts or 'sensible judgement calls') might make them challenging to apply perfectly consistently without further clarification or examples.
Table 11: Messiness Factor Definitions 9-16
Figure/Table Image (Page 40)
Table 11: Messiness Factor Definitions 9-16
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Defining Messiness Factors 9-16: This table continues the definition of the 'messiness' factors started in Table 10, defining factors 9 through 16. These factors contribute to a score measuring how much a task deviates from a clean, simple benchmark towards more complex, real-world characteristics.
  • Factors Defined (9-16): The factors defined here are: 9. Non explicit scoring description (scoring criteria are not fully stated or have hidden aspects), 10. Is suboptimal behavior exploited (task involves an adversary that punishes poor agent choices), 11. No provided verification mechanisms (agent cannot easily check its answer mid-run), 12. Real-time coordination (requires managing multiple parallel processes or coordinating with external services in real-time), 13. Self modification required (agent needs to modify its own code or tools), 14. Self improvement required (agent needs to learn or adapt to improve its general performance), 15. Information seeking required (agent needs to actively gather information not provided initially), 16. Novel situation (task involves unusual constraints or properties).
  • Format: Factor Name and Definition/Criteria: Like Table 10, the 'Definition' column provides detailed criteria for making a binary (yes/no) judgment about whether each factor applies to a given task (a scoring sketch follows this description).
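A minimal sketch of how a per-task messiness score could be assembled from the 16 binary factor judgments, assuming the equal weighting implied by Section 6.2; the factor names are abbreviated and the example ratings are hypothetical.

```python
# The 16 binary messiness factors from Tables 10 and 11 (names abbreviated).
MESSINESS_FACTORS = [
    "real_life_source", "resource_limited", "not_easily_resettable",
    "irreversible_mistakes", "dynamic_environment", "difficult_counterfactuals",
    "not_purely_automatic_scoring", "implicit_generalizability",
    "non_explicit_scoring", "suboptimal_behavior_exploited",
    "no_verification_mechanisms", "real_time_coordination",
    "self_modification_required", "self_improvement_required",
    "information_seeking_required", "novel_situation",
]

def messiness_score(ratings):
    """Equal-weighted sum of binary factor judgments (0-16); missing factors count as absent."""
    return sum(int(ratings.get(factor, False)) for factor in MESSINESS_FACTORS)

# Hypothetical rating of a task judged to have three messiness factors.
example = {"real_life_source": True, "resource_limited": True,
           "information_seeking_required": True}
print(messiness_score(example))  # -> 3
```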
Scientific Validity
  • Construct Validity and Relevance: These factors continue the attempt to capture relevant aspects of real-world task complexity, covering areas like ambiguity in goals/scoring, adversarial interactions, lack of feedback, coordination needs, and self-adaptation requirements.
  • Subjectivity and Reliability of Definitions: Similar to factors 1-8, these definitions involve subjective judgments (e.g., 'significant hidden information', 'actively exploit weakness', 'non-trivial amount of work', 'unusual constraint'). This limits the objectivity and potential reliability of the ratings.
  • Relevance and Difficulty of Assessing Advanced Factors: Factors like 'Self modification required' and 'Self improvement required' touch upon advanced agent capabilities that are highly relevant to future AI progress but may be difficult to assess accurately or consistently, especially with current models.
  • Completeness vs. Equal Weighting: The complete set of 16 factors provides a multi-faceted view of 'messiness'. However, the assumption of equal weighting when combining these into a single score (implied by Section 6.2) remains a significant simplification, as different factors might impact AI performance to vastly different degrees.
  • Enabling Quantitative Analysis: The operationalization allows for the calculation of the messiness score used in the analyses (Figures 9, 10, 20, 21), enabling the quantitative investigation of messiness effects, despite the limitations in the score's precision.
Communication
  • Clear Structure and Format: Similar to Table 10, this table uses a clear two-column format (Factor | Definition) to define the remaining messiness factors.
  • Detailed Definitions: The definitions continue to provide detailed criteria, aiming for consistency in rating, although complexity and potential ambiguity remain.
  • Factor Names: Factor names (e.g., 'Non explicit scoring', 'Suboptimal behavior exploited') offer some intuition.
  • Completeness (with Table 10): This table, along with Table 10, provides the complete operationalization of the 'messiness' construct used in the study, essential for methodological transparency.
Figure 20: Messier tasks tend to be longer.
Figure/Table Image (Page 41)
Figure 20: Messier tasks tend to be longer.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Relationship between Messiness and Length: This figure is a scatter plot designed to investigate the relationship between the 'messiness' of a task and its difficulty as measured by the time it takes a human to complete it.
  • Data Points: Individual Tasks: Each point on the graph represents a single task (likely from the HCAST and RE-Bench suites, for which messiness was scored).
  • X-axis: Task Messiness Score: The horizontal axis (x-axis) shows the 'Task Messiness Score' assigned to each task, ranging from approximately 1 to 7. Higher scores indicate 'messier' tasks.
  • Y-axis: Task Length (Log Human Time): The vertical axis (y-axis) shows the 'Task Length for Human', specifically the base-10 logarithm of the time in minutes. A value of 0 corresponds to 1 minute (10^0), 1 to 10 minutes (10^1), 2 to 100 minutes (10^2), and 3 to 1000 minutes (10^3, roughly 16 hours).
  • Color Coding: AI Success Rate: The color of each point indicates the 'Observed Success Rate' (presumably the mean success rate across AI models for that task), ranging from blue (low success, ~0.0) to red (high success, ~0.8).
  • Observed Correlations: The plot shows a general upward trend: tasks with higher messiness scores (further right) also tend to take humans longer to complete (higher up). It also shows a lack of tasks that are both very short (low y-value) and very messy (high x-value). Furthermore, the coloring suggests that tasks in the upper right (long and messy) tend to have low AI success rates (blue points).
Scientific Validity
  • Demonstration of Correlation: The figure demonstrates a positive correlation between the authors' measure of task 'messiness' and task length (human time). This is a plausible relationship, as more complex or ill-defined ('messy') tasks might naturally require more time to understand and execute.
  • Highlighting Confounding Variable: This observed correlation is important because it highlights that task length is a potential confounding variable when analyzing the impact of messiness on AI performance (as done in Figures 9 and 10). Since messier tasks are also longer, the poorer AI performance on messier tasks (seen in Fig 10) could be partly due to their length, not just their messiness. The residual analysis in Fig 10 attempts to disentangle this.
  • Dependence on Input Data Validity: The validity of the plot relies on the validity and reliability of the underlying data: the messiness scores (subjective ratings) and the human task length estimates.
  • Implications of Task Distribution: The non-uniform distribution of tasks (few short, messy tasks) is an important characteristic of the benchmark suite used. This distribution could limit the ability to fully separate the effects of length and messiness in statistical analyses.
  • Confirmation of Difficulty for AI: The addition of AI success rate via color provides a useful multi-dimensional view, visually confirming that tasks rated as long and messy are indeed associated with lower AI success rates.
Communication
  • Appropriate Visualization Choice: The scatter plot is an effective choice for visualizing the relationship between two continuous (or pseudo-continuous) variables: task messiness score and human task length.
  • Effective Use of Color Coding: Using color intensity to represent a third variable, observed AI success rate, adds a useful layer of information, allowing visual correlation between messiness, length, and AI performance.
  • Clear Axes and Scaling: The axes are clearly labeled ('Task Messiness Score', 'Task Length for Human (log10 minutes)'). The logarithmic scale for task length is appropriate given the range.
  • Clear Legend: The color bar legend clearly indicates the mapping between color and observed success rate.
  • Accurate Caption: The caption succinctly summarizes the primary relationship shown in the plot (positive correlation between messiness and length).
Figure 21: Model success rates on HCAST + RE-Bench tasks, split by task...
Full Caption

Figure 21: Model success rates on HCAST + RE-Bench tasks, split by task messiness rating.

Figure/Table Image (Page 42)
Figure 21: Model success rates on HCAST + RE-Bench tasks, split by task messiness rating.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Performance Trend by Task Messiness: This figure presents trends in AI model performance over time, specifically focusing on how task 'messiness' affects success rates. It uses data only from the HCAST and RE-Bench task suites.
  • Structure: Split by Messiness: The figure is split into two panels. The top panel shows results for the 50% 'least messy' tasks (messiness score < 2.3, N=55 tasks). The bottom panel shows results for the 50% 'most messy' tasks (messiness score >= 3.0, N=59 tasks). Note: There appears to be a typo in the top panel title regarding the messiness score threshold compared to Figure 9, but the intent is a split based on messiness.
  • Axes: Weighted Success Rate vs. Release Date: Each panel plots the 'Weighted Success Rate' (average success rate adjusted for task family diversity) on the vertical axis against the 'Model Release Date' on the horizontal axis (ranging from early 2022 to mid-2025).
  • Data Points and Models: Data points represent various AI models released during this period (identified by a shared legend), showing their average performance on the respective task subset. Error bars indicate variability.
  • Observed Trends and Comparison: Both panels show a clear upward trend, indicating performance improvement over time for both less messy and more messy tasks. Comparing the panels, the absolute success rates are generally lower in the bottom panel (more messy tasks) than the top panel (less messy tasks) for any given model/time.
  • Time Frame and Model Inclusion: The time frame starts around 2022 because, as noted for Figure 9, earlier models like davinci-002 and gpt-3.5-turbo instruct scored near zero on these non-SWAA tasks.
Scientific Validity
  • Valid Analysis of Messiness Effect: Isolating the effect of messiness by splitting the tasks provides a direct test of whether progress differs on tasks with more real-world complexities. This is a relevant analysis for understanding the generalizability of AI improvements.
  • Qualitative vs. Quantitative Trend Comparison: The conclusion drawn from visual inspection – that the rate of improvement is similar for both subsets – is qualitative. A formal statistical comparison of the slopes of trend lines fitted to each subset would provide stronger evidence.
  • Dependence on Messiness Score Validity: The analysis relies on the validity of the 'messiness' score, which, as discussed for Tables 10/11 and Figure 10, is based on subjective ratings and has limitations.
  • Data Subset Limitation: The analysis is restricted to HCAST+RE-Bench tasks and models from roughly 2022 onwards. The findings might not apply to simpler tasks or earlier capability levels.
  • Evidence Against Messiness-Specific Plateau: The figure provides evidence against the hypothesis that AI progress might be plateauing specifically on 'messier' or more realistic tasks, at least within the scope of this dataset and timeframe. Both subsets show substantial improvement.
  • Confounding with Task Length: Potential confounding between messiness and task length (shown in Figure 20) still exists, although this figure analyzes the trend over time within each messiness split, somewhat mitigating the direct comparison issue seen in Figure 10.
Communication
  • Effective Two-Panel Layout: The two-panel layout effectively separates the data based on the messiness split (50% least vs. 50% most messy), allowing for focused comparison of trends within each subset.
  • Clear Panel Titling: Each panel is clearly titled with the messiness score range and the number of tasks included, providing necessary context.
  • Clarity of Trend Visualization: Plotting weighted success rate against model release date clearly shows the performance improvement over time within each messiness category.
  • Consistency and Shared Legend: The use of a shared legend and consistent axis scaling facilitates comparison between the two panels.
  • Visual Support for Key Finding: The figure visually supports the key finding discussed in Section 6.2: while absolute success rates are lower on messier tasks, the rate of improvement over time appears similar in both subsets, with no clear evidence of a plateau specific to the higher messiness tasks.
Figure 22: Correlation matrix of observed success rates across all models and...
Full Caption

Figure 22: Correlation matrix of observed success rates across all models and tasks.

Figure/Table Image (Page 43)
Figure 22: Correlation matrix of observed success rates across all models and tasks.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Show Inter-Model Performance Correlation: This figure displays a 'correlation matrix', which is a square grid used to show how strongly different variables are related to each other. In this case, the 'variables' are the success rates of various AI models tested in the study.
  • Matrix Structure: Models on Axes: Both the rows and columns of the grid represent the different AI models, listed in approximate chronological order from GPT-2 to Claude 3.7 Sonnet.
  • Cell Content: Correlation Coefficient: Each cell in the grid shows the 'correlation coefficient' between the success rates of the model in that row and the model in that column, calculated across all the tasks in the benchmark suite. A correlation coefficient ranges from -1 to +1. A value close to +1 (shown in dark red) means the two models tend to perform similarly – succeeding on the same tasks and failing on the same tasks. A value close to 0 (blue/white) means there's little linear relationship between their success patterns.
  • Matrix Properties: Symmetry and Diagonal: The matrix is symmetric around the main diagonal (top-left to bottom-right) because the correlation between model A and model B is the same as between model B and model A. The diagonal cells always have a value of 1 (perfect correlation), as any model's performance is perfectly correlated with itself.
  • Observation: Generally High Positive Correlations: The figure shows generally high positive correlations between most models (many values are above 0.7, indicated by orange/red colors). This suggests that, to a large extent, tasks that are difficult for one model tend to be difficult for other models too. Correlations are particularly high between models released close together in time or from the same family (e.g., different versions of GPT-4).
  • Mean Correlation Value (0.73): The text below the figure states that the mean correlation across all pairs of models is 0.73, quantifying the overall strong positive relationship (a computation sketch follows this description).
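A minimal sketch of how such a model-by-model matrix can be computed from a task-by-model table of success rates, assuming Pearson correlation (the coefficient type is not stated in the paper); the data are hypothetical, generated so that models share a common task-difficulty component.

```python
import numpy as np
import pandas as pd

# Hypothetical success-rate table: one row per task, one column per model,
# with a shared "difficulty" term so that models agree on which tasks are hard.
rng = np.random.default_rng(0)
difficulty = rng.uniform(0, 1, size=200)
models = ["gpt-4-0314", "claude-3.5-sonnet", "o1"]
rates = pd.DataFrame(
    {m: np.clip(1 - difficulty + rng.normal(0, 0.15, size=200), 0, 1) for m in models}
)

corr = rates.corr(method="pearson")            # model-by-model correlation matrix
upper = corr.values[np.triu_indices_from(corr.values, k=1)]
print(corr.round(2))
print(f"mean pairwise correlation: {upper.mean():.2f}")
```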
Scientific Validity
  • Standard Methodological Approach: Calculating pairwise correlations between model success rates across a common set of tasks is a standard and valid method to understand the similarity in performance profiles and whether models share common difficulties.
  • Evidence for Shared Difficulty Structure: The generally high correlations (mean 0.73) provide evidence for a substantial shared component of task difficulty that affects most models similarly. This aligns with the finding in Figure 4 that human time-to-complete (a proxy for difficulty) is strongly correlated with average model success.
  • Plausibility of Correlation Patterns: The pattern where temporally closer or architecturally similar models exhibit higher correlations is plausible and expected, reflecting incremental improvements or shared underlying architectures/training data.
  • Dependence on Success Rate Reliability: The calculation assumes that the success rate for each model on each task is a reliable measure. If success rates are noisy (e.g., due to few runs per task), the calculated correlations might be attenuated (lower than the true underlying relationship).
  • Correlation vs. Causation: While showing association, the correlation matrix doesn't explain why models perform similarly. The shared difficulty could stem from inherent task properties, limitations common to current AI architectures, or overlaps in training data.
  • Unspecified Correlation Type (Minor Point): The specific type of correlation coefficient (e.g., Pearson, Spearman) is not stated, but Pearson is standard for this type of data and likely used.
Communication
  • Effective Heatmap Visualization: The heatmap visualization is a standard and effective way to represent a correlation matrix, allowing for quick visual identification of patterns and strengths of relationships.
  • Clear Color Coding: Using a color scale (ranging from blue/white for lower correlation to dark red for higher correlation, indicated by the color bar) effectively encodes the magnitude of the correlation coefficients.
  • Inclusion of Numerical Values: Including the numerical correlation coefficient within each cell provides precise quantitative information alongside the visual representation.
  • Clear Axis Labeling: Both axes are clearly labeled with the names of the AI models being compared.
  • Informative Caption and Summary Stat: The caption clearly states what the figure represents (correlation matrix of observed success rates). The additional text below the figure mentioning the mean correlation (0.73) provides a useful summary statistic.
Figure 23: Correlation matrix of excess success rates (defined as (observed - predicted) / predicted) across all...
Full Caption

Figure 23: Correlation matrix of excess success rates (defined as (observed - predicted) / predicted success rate) across all models and tasks.

Figure/Table Image (Page 43)
Figure 23: Correlation matrix of excess success rates (defined as (observed - predicted) / predicted success rate) across all models and tasks.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Show Inter-Model Correlation (Residuals): This figure shows another correlation matrix, similar in structure to Figure 22, comparing the performance patterns of different AI models across the benchmark tasks.
  • Metric Correlated: Excess Success Rate (Residuals): However, instead of correlating the raw success rates (as in Figure 22), this matrix correlates the 'excess success rates'. Excess success rate for a model on a task measures how much better or worse that model performed compared to what would be expected based only on the task's length (human time). It's calculated as (Actual Success Rate - Predicted Success Rate) / Predicted Success Rate. A positive excess rate means the model did better than expected for that task length; negative means it did worse (a small computation sketch follows this description).
  • Matrix Structure and Cell Content: The rows and columns again represent the different AI models. Each cell contains the correlation coefficient between the excess success rates of the two corresponding models, calculated across all tasks.
  • Observation: Lower Correlations than Raw Rates: The correlations shown here are generally lower than those in Figure 22. While still mostly positive (red/orange cells), the values are smaller, often between 0.3 and 0.7, with some closer to zero (white/light blue). This indicates that once the dominant effect of task length (difficulty) is accounted for, the remaining performance variations (the 'excess' rates) are less strongly correlated between models, though some correlation persists.
  • Mean Correlation Value (0.40): The text below the figure states the mean correlation of these excess success rates is 0.40, significantly lower than the 0.73 mean correlation for raw success rates (Figure 22).
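A minimal sketch of the residual correlation, using the definition given above; the per-task observed rates and length-based predictions are hypothetical, and in the paper the predictions would come from the logistic fit against task length (Section 4.1).

```python
import numpy as np

def excess_success(observed, predicted):
    """Excess success rate as defined above: (observed - predicted) / predicted."""
    observed, predicted = np.asarray(observed, dtype=float), np.asarray(predicted, dtype=float)
    return (observed - predicted) / predicted

# Hypothetical per-task observed rates and length-based predictions for two models.
obs_a, pred_a = [0.95, 0.60, 0.55, 0.20, 0.15], [0.90, 0.70, 0.50, 0.30, 0.10]
obs_b, pred_b = [0.85, 0.75, 0.40, 0.35, 0.05], [0.88, 0.65, 0.45, 0.25, 0.08]

residual_corr = np.corrcoef(excess_success(obs_a, pred_a),
                            excess_success(obs_b, pred_b))[0, 1]
print(residual_corr)   # correlation of the two models' residual performance
```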
Scientific Validity
  • Validity of Residual Correlation Analysis: Analyzing the correlation of residuals (excess success rates) is a valid statistical technique to investigate whether there are shared patterns in model performance after accounting for the primary predictor (task length). It helps determine if models consistently deviate from the length-based prediction in similar ways across tasks.
  • Confirms Dominance of Task Length Effect: The finding that the mean correlation drops significantly (from 0.73 for raw rates to 0.40 for excess rates) confirms that task length (as proxied by human time) is the dominant factor explaining the shared performance patterns between models. Models largely agree on which tasks are easy or hard based on length.
  • Evidence for Shared Residual Variance: The remaining positive correlation (mean 0.40) indicates that there are still shared factors influencing model performance beyond just task length. Models do show some consistency in which tasks they find unexpectedly easy or hard, relative to the length-based prediction. This residual shared variance could be due to factors like the 'messiness' explored in Section 6.2, specific skills required by certain task families, or other unmeasured task properties.
  • Dependence on Prediction Model Accuracy: The validity of the excess success rates, and thus their correlations, depends on the accuracy of the 'predicted' success rate model (derived from the logistic regression in Section 4.1). If the prediction model is misspecified, the residuals will be less meaningful. (A sketch of this kind of length-based logistic fit follows this list.)
  • Nuanced Understanding of Performance Factors: This analysis provides further nuance to the understanding of model capabilities, suggesting that while length is key, other systematic factors also contribute to cross-model performance patterns.
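For context on where the 'predicted' success rates come from, here is a minimal sketch (with invented attempt data) of a length-based logistic fit of the kind Section 4.1 describes: success is modeled as a logistic function of log2(human completion time), and the 50% horizon is read off where the fitted probability crosses 0.5. The data, regularization, and any weighting are assumptions for illustration and may differ from the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task outcomes for one model: human completion times (minutes)
# and whether the model succeeded on that task.
human_minutes = np.array([0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Logistic fit of success on log2(human time); large C means weak regularization.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, succeeded)

# 50% time horizon: the task length at which the fitted log-odds equal zero.
slope, intercept = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = 2 ** (-intercept / slope)
print(f"Estimated 50% time horizon: {horizon_minutes:.1f} minutes")
```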
Communication
  • Effective Heatmap Visualization: The heatmap visualization effectively displays the correlation matrix for excess success rates, similar to Figure 22.
  • Appropriate Color Scale: The color scale, now ranging from negative (blue) through zero (white) to positive (red) correlations, appropriately reflects the nature of excess success rates (which can be positive or negative).
  • Inclusion of Numerical Values: Including numerical correlation coefficients in each cell provides precise data.
  • Clear Axis Labeling: Axes are clearly labeled with model names.
  • Informative Caption and Definition: The caption clearly defines 'excess success rate' using the formula, linking it back to the observed and predicted success rates (where prediction is based on task length, Section 4.1). The summary statistic (mean correlation: 0.40) is also helpful.
Figure 24: Change in time horizon of all frontier models over time.
Figure/Table Image (Page 44)
Figure 24: Change in time horizon of all frontier models over time.
First Reference in Text
Note: the data displayed is the same as in Figure 1, but with a linear axis.
Description
  • Purpose: Visualize Trend with Linear Scale: This figure plots the '50% task completion time horizon' (a measure of AI capability) against the release date for various 'frontier' AI models, similar to Figure 1.
  • Axis Scaling: Linear Y-axis: The key difference from Figure 1 is the scale used for the vertical axis (y-axis). Here, the 'Task time (for humans) that model completes with 50% success rate' is plotted on a standard 'linear' scale (e.g., 15 min, 30 min, 45 min, 1 hr have equal spacing), whereas Figure 1 used a logarithmic scale.
  • Axis Scaling: Linear X-axis: The horizontal axis (x-axis) remains the model release date, shown linearly from 2019 to 2025.
  • Data Points: Frontier Models: The data points represent the same frontier models shown in Figure 1 (e.g., GPT-2, davinci-002, GPT-4 0314, Claude 3.5 Sonnet, o1, Claude 3.7 Sonnet).
  • Observed Shape: Upward Curving Trend (Acceleration): Because the y-axis is linear, the exponential growth trend observed in Figure 1 now appears as a curve that bends sharply upwards, indicating that the absolute increase in time horizon per unit of time is accelerating.
Scientific Validity
  • Data Accuracy (Same as Figure 1): The figure accurately replots the same underlying time horizon data presented in Figure 1. Its scientific validity rests on the validity of those original calculations.
  • Demonstrates Absolute Acceleration: Presenting the data on a linear scale provides a visually intuitive demonstration of the accelerating absolute gains in capability (measured in minutes/hours of task time). While the relative growth rate might be constant (as suggested by the exponential fit in Figure 1), the absolute improvements become larger over time, which is characteristic of exponential growth (illustrated in the sketch after this list).
  • Less Suitable for Quantitative Trend Assessment: This representation is less suitable than the log-linear plot (Figure 1) for quantitatively assessing the consistency of the exponential trend or for estimating parameters like the doubling time, as exponential growth does not appear linear on this scale.
  • Potential for Misinterpretation of Acceleration: The visual impression of dramatic recent acceleration could be slightly misleading if interpreted without reference to the log-scale plot, as any exponential process looks like it's 'taking off' when plotted linearly over enough time.
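The point about absolute versus relative growth can be illustrated with a short sketch. Assuming a constant doubling time of roughly seven months (about 212 days) and a hypothetical 0.1-minute horizon at the start of 2019, the year-over-year ratio stays fixed while the absolute yearly increase keeps growing, which is exactly what produces the upward-bending curve on a linear axis.

```python
import numpy as np

# Constant doubling time of ~212 days, starting from a hypothetical 0.1-minute
# horizon at the beginning of 2019.
doubling_days = 212
years = np.arange(2019, 2026)
horizon = 0.1 * 2 ** ((years - 2019) * 365 / doubling_days)  # minutes

# The ratio between consecutive years is constant, but the absolute increase grows.
for year, prev, cur in zip(years[1:], horizon[:-1], horizon[1:]):
    print(f"{year}: {cur:8.1f} min  (+{cur - prev:8.1f} min vs. prior year, "
          f"ratio {cur / prev:.2f})")
```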
Communication
  • Visual Emphasis on Acceleration: Using a linear scale for the y-axis (time horizon) makes the accelerating nature of the capability growth visually apparent – the curve bends upwards steeply. This contrasts with the linear appearance on the log scale in Figure 1.
  • Obscures Constant Relative Growth: While emphasizing acceleration, the linear scale makes it harder to visually assess whether the growth is truly exponential (i.e., has a constant doubling time) compared to the log-linear plot in Figure 1, where exponential growth appears as a straight line.
  • Compression of Early Data: The plot compresses the data points for earlier, less capable models near the bottom of the graph, making it difficult to discern their individual values or the trend in the early period.
  • Clarity of Labels and Legend: Labels and the legend are clear, identifying the axes and the specific models plotted.
  • Complementary Perspective to Figure 1: This figure serves as a useful complement to Figure 1, offering a different perspective on the same data. The reference text note clarifying it uses the same data but a linear axis is crucial for correct interpretation.
Figure 25: Time horizon of all models we measured, including non-frontier...
Full Caption

Figure 25: Time horizon of all models we measured, including non-frontier models.

Figure/Table Image (Page 44)
Figure 25: Time horizon of all models we measured, including non-frontier models.
First Reference in Text
Not explicitly referenced in main text
Description
  • Purpose: Comprehensive Trend Including Non-Frontier Models: This figure presents a comprehensive view of the 50% time horizon plotted against the release date for all AI models measured by the authors in this study, explicitly including models that might be considered 'non-frontier' (i.e., not the absolute state-of-the-art at their release or slightly older versions).
  • Plot Format and Axes: The plot uses the same format as Figure 1: model release date (2019-2027) on the horizontal axis and the 50% time horizon on the vertical axis (logarithmic scale, seconds to hours).
  • Inclusion of Additional Models: Compared to Figure 1 or 24, this plot includes additional data points for models like GPT-4 Turbo and GPT-4 0125, which fall between the main GPT-4 releases and Claude 3 Opus/GPT-4o.
  • Overall Trend Calculation: An exponential trend line is fitted to all the data points shown. The calculated doubling time for this comprehensive set is 218 days, and the R-squared value is 0.96, indicating a very strong exponential fit across all measured models.
  • Consistency of Non-Frontier Models with Trend: The included non-frontier models generally appear to fall close to the overall exponential trend line, suggesting their performance is largely consistent with the capability trajectory defined by the frontier models.
Scientific Validity
  • Robustness Check and Completeness: Including non-frontier models provides a more complete picture of the capability landscape and acts as a robustness check for the trend observed using only frontier models. The fact that these models generally align with the trend strengthens the conclusion that the exponential growth is not merely an artifact of cherry-picking the best models.
  • Consistency of Trend Metrics: The calculated doubling time (218 days) and R² (0.96) based on this fuller dataset are very similar to those reported elsewhere using slightly different subsets (e.g., 212 days, R²=0.98 in Fig 8/17), indicating the main finding is robust to the precise definition of 'frontier'.
  • Dependence on Underlying Methodology: The validity still depends on the underlying methodology for calculating the 50% time horizon for each model (task suite representativeness, human baseline accuracy, logistic regression fit).
  • Selection of Non-Frontier Models: The selection of which 'non-frontier' models to include is not explicitly justified (they are simply 'all models we measured'). While providing more data, it's not necessarily a systematic sampling of all relevant non-frontier models available.
Communication
  • Consistent Plot Format: The plot uses the standard log-linear format established earlier, clearly showing the 50% time horizon against model release dates.
  • Comprehensive Legend: The legend clearly identifies all models plotted, including those implicitly treated as 'non-frontier' (such as GPT-4 Turbo and GPT-4 0125), which appear here but not in Figures 1 and 24, where only frontier models are shown.
  • Caption Clarity: The caption explicitly states that this plot includes 'non-frontier' models, clarifying the scope of the data shown.
  • Inclusion of Overall Trend Metrics: Displaying the overall trend line and its metrics (Doubling time: 218 days, R²: 0.96) calculated across all measured models provides a comprehensive view of the trend based on the full dataset analyzed in the appendix/supplementary material.
  • Visual Reinforcement of Trend: Visually, the inclusion of the intermediate 'non-frontier' models helps fill in the trend line, showing they generally fall along the same exponential trajectory established by the frontier models.
Figure 26: Length in human expert clock-time of tasks that frontier models can...
Full Caption

Figure 26: Length in human expert clock-time of tasks that frontier models can perform competently over time.

Figure/Table Image (Page 45)
Figure 26: Length in human expert clock-time of tasks that frontier models can perform competently over time.
First Reference in Text
See Section 4 for details on time horizon length calculation.
Description
  • Purpose: Visualize AI Capability Growth Trend: This figure plots the '50% task completion time horizon' for various AI models against their release date, illustrating the growth in AI capability over time from 2019 to 2025. The time horizon represents the length of tasks (measured in human expert time) that an AI model can successfully complete 50% of the time.
  • Axes and Scaling: The vertical axis shows the time horizon on a logarithmic scale (ranging from 1 second to over 4 hours), while the horizontal axis shows the model release date linearly.
  • Data Points and Specific Model Handling: Data points represent individual AI models, including early models like GPT-2 and davinci-002 (representing GPT-3) and later models like GPT-4 and Claude 3.7 Sonnet. The caption notes specific handling for some models: davinci-002 and gpt-3.5-turbo-instruct are plotted at the release dates of GPT-3 and GPT-3.5 respectively, and GPT-2's score is imputed as zero for longer tasks where it was incompatible.
  • Trend Line and Confidence Interval: A linear regression line is fitted to the logarithm of the time horizons against the release date, visually representing the exponential growth trend. A shaded region indicates the 95% confidence interval for this trend.
  • Quantified Trend: Doubling Time and Fit: The analysis indicates a strong exponential trend (R² = 0.98), with the 50% time horizon doubling approximately every 7 months (95% CI: 171 to 249 days). (The sketch after this list illustrates this style of log-linear fit.)
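To illustrate this style of fit with invented release dates and horizons (not the paper's data): regress log2 of the 50% time horizon on release date, take the reciprocal of the slope as the doubling time, and compute R² in log space.

```python
import numpy as np

# Hypothetical (days since first release, 50% horizon in minutes) pairs standing
# in for the frontier models plotted in this figure.
release_days = np.array([0, 400, 800, 1200, 1600, 2000, 2200])
horizon_minutes = np.array([0.05, 0.3, 1.5, 8.0, 25.0, 40.0, 55.0])

# Fit log2(horizon) as a linear function of release date.
slope, intercept = np.polyfit(release_days, np.log2(horizon_minutes), 1)
doubling_time_days = 1.0 / slope  # one doubling every 1/slope days

# R^2 of the fit, computed in log space.
pred = slope * release_days + intercept
resid = np.log2(horizon_minutes) - pred
r2 = 1 - resid.var() / np.log2(horizon_minutes).var()
print(f"Doubling time: {doubling_time_days:.0f} days, R^2 (log space): {r2:.2f}")
```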
Scientific Validity
  • Core Methodology (Same as Figure 1): This figure appears to be identical or near-identical to Figure 1, presenting the paper's central finding. Its validity rests on the methodology detailed in Section 4: the representativeness of the task suite, the reliability of human baseline times, the appropriateness of the 50% success threshold, and the logistic regression used to calculate the horizon for each model.
  • Approximation of Release Dates: The specific handling of model release dates (using proxy dates for GPT-3/3.5 based on available models like davinci-002) is a necessary approximation due to API availability but introduces potential minor inaccuracies in the temporal positioning.
  • Justification for GPT-2 Imputation: Imputing GPT-2's score as zero for longer tasks is justified in the text (Section 3.3.1) based on its low context length and the zero scores of the much more capable davinci-002 on those tasks. This imputation likely has minimal impact on the overall trend calculation, which is dominated by later models. (A minimal sketch of such an imputation follows this list.)
  • Strong Statistical Fit: The high R-squared value (0.98) and the visually tight confidence interval suggest a very strong and consistent exponential trend within the measured data (2019-2025) according to this specific metric and methodology.
  • External Validity and Extrapolation Concerns: As with Figure 1, the primary scientific questions revolve around the external validity (generalizability beyond these tasks/metrics) and the sustainability of this exponential trend into the future, which are discussed extensively in the paper.
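As a minimal sketch of the imputation described above (the task table, the 60-minute cutoff, and all success rates are hypothetical): tasks GPT-2 could not attempt are assigned a score of zero, while other models' scores are left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical success-rate table (models x tasks); NaN marks tasks GPT-2 was not run on.
task_minutes = pd.Series({"t1": 0.5, "t2": 5.0, "t3": 90.0, "t4": 240.0})
success = pd.DataFrame(
    {"t1": [0.9, 1.0], "t2": [0.2, 0.9], "t3": [np.nan, 0.4], "t4": [np.nan, 0.1]},
    index=["gpt-2", "later-model"],
)

# Impute zero for GPT-2 on tasks above an (assumed) 60-minute threshold.
long_tasks = task_minutes[task_minutes > 60].index
success.loc["gpt-2", long_tasks] = success.loc["gpt-2", long_tasks].fillna(0.0)
print(success)
```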
Communication
  • Effective Log-Linear Visualization: The log-linear plot effectively visualizes the exponential growth trend, making the constant doubling time appear as a linear relationship. This is a standard and clear way to present such data.
  • Clear Axis Labeling and Scaling: Axes are clearly labeled ('Model release date', 'Task time (for humans) that model completes with 50% success rate'), and the logarithmic scale on the y-axis is appropriate for the range of values.
  • Legend Clarity: The legend clearly identifies the models plotted.
  • Comprehensive Trend Metrics: The inclusion of the trend line, confidence interval (shaded region), doubling time (7 months), 95% CI for doubling time (171-249 days), and R-squared value (0.98) provides comprehensive quantitative information about the trend and its uncertainty.
  • Clarity of Model Placement/Imputation: The explanatory text about the placement of davinci-002 and gpt-3.5-turbo-instruct and the imputation for GPT-2 (which appears in the figure description rather than the formal caption) is crucial for accurate interpretation; placing this information directly in the formal caption or legend would improve accessibility.