Measuring AI Ability to Complete Long Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
arXiv
Model Evaluation & Threat Research (METR)

Overall Summary

Study Background and Main Findings

This paper addresses the challenge of translating AI benchmark scores into meaningful real-world capabilities. Current benchmarks often saturate quickly or use artificial tasks, making it difficult to gauge AI progress on complex, long-duration work relevant to human experts. To overcome this, the authors propose a novel metric: the 'task completion time horizon'. This metric quantifies AI capability by determining the typical time required by human experts to complete tasks that a given AI model can successfully accomplish with a specific probability (e.g., 50%). It provides an intuitive, continuous measure anchored in human performance.

The methodology involved compiling a diverse suite of tasks spanning software engineering and AI research, ranging from seconds to many hours in human completion time. These tasks were drawn from existing benchmarks (HCAST, RE-Bench) and newly created short tasks (SWAA). Human professionals with relevant expertise were timed performing these tasks to establish baseline durations. Subsequently, various frontier AI models released between 2019 and 2025 were evaluated on the same tasks using standardized agent frameworks ('scaffolding') that provide tools for interaction. By modeling AI success probability as a function of human task time using logistic regression, the 50% time horizon was calculated for each model.
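
As a rough illustration of this step (a minimal sketch on made-up run data, not the authors' code, and ignoring their task weighting and regularization details), the 50% horizon can be read off a logistic fit of success against log task time:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical per-run data for one model: human completion time of each task
    # (minutes) and whether the agent's attempt succeeded (1) or failed (0).
    human_minutes = np.array([0.1, 0.5, 1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
    agent_success = np.array([1,   1,   1, 1, 1, 0, 1,  0,  0,  0,   0,   0])

    # Model P(success) as a logistic function of log2(human time).
    X = np.log2(human_minutes).reshape(-1, 1)
    clf = LogisticRegression(C=1e6).fit(X, agent_success)  # near-unregularized fit

    # The 50% horizon is where the fitted probability crosses 0.5, i.e. where the
    # linear predictor w*log2(t) + b equals zero.
    w, b = clf.coef_[0][0], clf.intercept_[0]
    horizon_minutes = 2 ** (-b / w)
    print(f"50% time horizon ≈ {horizon_minutes:.1f} human-minutes")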

The central finding is that the 50% time horizon for frontier AI models on these tasks has grown exponentially since 2019, doubling approximately every seven months (95% CI: 171-249 days). Current leading models like Claude 3.7 Sonnet achieve a horizon of around 50-60 minutes. Qualitative analysis suggests this rapid improvement is driven by enhanced reliability, better adaptation to mistakes, improved logical reasoning, and more effective tool use. However, the study also found a significant gap between the 50% horizon and the 80% horizon (requiring higher reliability), where the latter is much shorter (~15 minutes for Claude 3.7 Sonnet), indicating challenges remain in achieving dependable performance on longer tasks.

The authors conducted several experiments to test the robustness and external validity of their findings, including analyzing performance on tasks with varying 'messiness' (real-world complexity factors), comparing results with an external benchmark (SWE-bench), and evaluating models on real internal software issues. While the exponential trend appeared robust across different conditions, the absolute values and doubling times showed sensitivity to the benchmark and human time estimation methods. The paper concludes by discussing the significant implications of this rapid growth for future automation potential (extrapolating to month-long tasks within ~5 years, albeit with caveats) and AI safety, while carefully acknowledging limitations related to task representativeness and the interpretation of human baselines.
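
As a sanity check on the month-long-task extrapolation (simple compound-growth arithmetic, not the authors' forecasting model; the one-month target of roughly 167 working hours is an assumption):

    import math

    current_horizon_hours = 1.0    # ~50-60 minute horizon of current frontier models
    target_horizon_hours = 167.0   # ~one working month (40 h/week * ~4.2 weeks)
    doubling_time_months = 7.0     # central estimate from the paper

    doublings = math.log2(target_horizon_hours / current_horizon_hours)
    months = doublings * doubling_time_months
    print(f"{doublings:.1f} doublings ≈ {months:.0f} months ≈ {months / 12:.1f} years")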

Research Impact and Future Directions

This research introduces a valuable new metric, the 'task completion time horizon,' offering a more intuitive way to track AI progress in handling complex, time-consuming tasks compared to traditional benchmarks. The core finding of an exponential increase in this horizon, doubling roughly every seven months since 2019 within the domain of software and research engineering tasks, is statistically robust within the study's framework and suggests a rapid acceleration of AI capabilities relevant to automating intellectual labor.

However, interpreting the practical implications requires significant caution. The time horizon metric is highly sensitive to methodological choices, particularly the nature and context of the human baselines used for comparison. The study itself demonstrates that AI performance aligns better with low-context human performance (like external contractors) than high-context experts, meaning the measured horizons might overestimate the time AI would save compared to experienced employees working in familiar environments. Furthermore, the benchmark tasks, despite efforts towards realism, still differ systematically from the 'messiness' and dynamic nature of many real-world jobs, raising questions about the external validity of the observed trend. The difference in observed doubling times when using an external benchmark (SWE-bench) further underscores the sensitivity to task selection and human time estimation methods.

Therefore, while the paper provides compelling evidence of rapid AI progress on increasingly complex benchmark tasks, directly extrapolating this trend to predict timelines for widespread automation of specific real-world jobs (like the 5-year prediction for month-long software tasks) is speculative. The study highlights a significant trend but also underscores the critical need for more research into AI performance on truly naturalistic tasks and the complex factors influencing real-world deployment. Key unanswered questions remain about the sustainability of the exponential trend and how well performance on these benchmarks translates to diverse, dynamic, and context-rich professional environments.

Critical Analysis and Recommendations

Clear Problem Statement and Proposed Metric (written-content)
Observation: The abstract clearly states the problem of ambiguous AI benchmark meaning and proposes the '50%-task-completion time horizon' metric. Impact: This immediately establishes the paper's motivation and core contribution, providing a clear framework for understanding the research.
Section: Abstract
Concise Summary of Key Findings (written-content)
Observation: The abstract concisely summarizes the key quantitative findings: a ~50-minute current horizon for frontier models and a historical doubling time of ~7 months. Impact: This provides readers with the essential results upfront, highlighting the magnitude and speed of AI progress according to the proposed metric.
Section: Abstract
Briefly Define Task Domains (written-content)
Issue: The abstract lists task sets (RE-Bench, HCAST, SWAA) without defining their domains. Impact: Adding brief parenthetical descriptions (e.g., 'AI research & engineering', 'diverse software engineering') would improve immediate comprehension for readers unfamiliar with these specific benchmarks, clarifying the scope of evaluation early on.
Section: Abstract
Clear Problem Definition (Benchmark Limitations) (written-content)
Observation: The introduction clearly outlines the limitations of existing AI benchmarks (artificiality, saturation, lack of unified metric). Impact: This effectively motivates the need for the novel 'task completion time horizon' metric proposed by the paper.
Section: Introduction
Effective Visualization of Exponential Growth Trend (Fig 1) (graphical-figure)
Observation: Figure 1 visually presents the core finding of exponential growth in the 50% time horizon using a log-linear plot, including trend lines and confidence intervals. Impact: This graphical representation makes the rapid, consistent improvement trend immediately apparent and provides quantitative details (doubling time, R-squared) supporting the claim.
Section: Introduction
Detailed Human Baselining Methodology (written-content)
Observation: The paper details the methodology for establishing human baseline performance, including recruitment, environment, incentives, and data collection across different task suites. Impact: This transparency and detail lend credibility to the human time estimates, which are fundamental to the calculation and interpretation of the AI time horizon metric.
Section: Methods
Logical Derivation of Time Horizon Metric (written-content)
Observation: The derivation of the time horizon metric using logistic regression, inspired by Item Response Theory but anchored by human time, is clearly explained. Impact: This provides a logical and statistically grounded basis for the core metric, allowing readers to understand how AI success rates are translated into a time-based capability measure.
Section: Methods
Robust Statistical Analysis of Trends (written-content)
Observation: Robust statistical methods, including OLS regression on log-transformed data and hierarchical bootstrapping for confidence intervals, are used to analyze the time horizon trend; a minimal sketch of the bootstrap idea appears after this list. Impact: This adds statistical rigor to the quantitative findings, particularly the estimation of the doubling time and its uncertainty.
Section: Methods
Explicitly Justify Task Family Weighting in Main Text (written-content)
Issue: The rationale for weighting tasks by the inverse square root of their family size (mentioned in Fig 3 caption) is not explicitly stated in the main methods text. Impact: Briefly explaining this weighting's purpose (e.g., ensuring diverse representation, preventing large families from dominating results) in the main text would strengthen the methodological justification.
Section: Methods
Clear Comparison of 50% and 80% Time Horizons (written-content)
Observation: The results clearly compare the 50% and 80% time horizons, showing similar doubling times but significantly lower absolute horizons at the 80% reliability level. Impact: This highlights the crucial gap between AI achieving occasional success and achieving high reliability on complex tasks, providing a more nuanced view of current capabilities.
Section: Results
Comprehensive External Validity and Robustness Checks (written-content)
Observation: The paper includes multiple supplementary experiments (retrodiction, messiness analysis, SWE-bench comparison, internal PRs) to assess external validity and robustness. Impact: This multi-pronged approach significantly strengthens the paper's claims by testing the core findings under different conditions and against different data sources, addressing potential limitations.
Section: Results
Nuanced Analysis of 'Messiness' Impact (written-content)
Observation: The analysis of 'messiness' factors finds that while models perform worse on messier tasks, the rate of improvement over time is similar for high and low messiness subsets. Impact: This nuanced finding addresses concerns about potential plateaus on more realistic tasks, suggesting that progress is occurring even on tasks with greater complexity, at least within the evaluated range.
Section: Results
Insightful Internal PR Experiment (Low vs High Context) (written-content)
Observation: The internal PR experiment shows AI performance aligns better with low-context contractor time than high-context maintainer time. Impact: This provides a crucial insight for interpreting the time horizon metric, suggesting it may better reflect the capability to replace low-context human labor rather than highly experienced experts working in familiar domains.
Section: Results
Nuanced Interpretation of the Metric (written-content)
Observation: The discussion thoughtfully addresses the complexities of interpreting the time horizon metric, emphasizing its dependence on task distribution and the context/skill of human baseliners. Impact: This demonstrates a sophisticated understanding of the metric's limitations and guides readers towards a more nuanced interpretation of the results.
Section: Discussion
Candid Acknowledgment of Limitations and Future Work (written-content)
Observation: The authors candidly acknowledge key limitations (model elicitation, baselining rigor, task naturalness, limited compute use) and propose specific directions for future work. Impact: This transparency enhances the paper's credibility and provides a valuable roadmap for subsequent research in AI capability evaluation.
Section: Discussion
Highlights Economic Implications (Inference Cost, Fig 13) (graphical-figure)
Observation: Figure 13 shows that the computational cost of successful AI runs is often a small fraction (<10%) of the estimated cost of human expert labor for the same task duration; an illustrative calculation appears after this list. Impact: This highlights the economic potential of AI automation and suggests significant headroom for improving AI performance (e.g., via more compute-intensive methods) while remaining cost-competitive.
Section: Discussion
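
A minimal sketch of the hierarchical bootstrap idea noted in the statistical-analysis item above, resampling at two levels (task family, then task) over hypothetical success rates; the paper applies the same resampling idea around its full horizon-and-trend pipeline rather than a simple mean:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical success rates of one model, grouped by task family.
    families = {
        "file_recovery": [1.0, 0.9, 0.8],
        "web_research":  [0.6, 0.7],
        "kernel_opt":    [0.1, 0.0, 0.2, 0.1],
    }

    def bootstrap_ci(families, n_boot=2000):
        """Two-level bootstrap: resample families, then tasks within each family."""
        names = list(families)
        stats = []
        for _ in range(n_boot):
            sampled = rng.choice(names, size=len(names), replace=True)
            family_means = []
            for fam in sampled:
                tasks = np.asarray(families[fam])
                family_means.append(rng.choice(tasks, size=len(tasks), replace=True).mean())
            stats.append(np.mean(family_means))  # statistic recomputed per replicate
        return np.percentile(stats, [2.5, 97.5])

    lo, hi = bootstrap_ci(families)
    print(f"95% bootstrap CI for the mean success rate: [{lo:.2f}, {hi:.2f}]")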
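
An illustrative version of the Figure 13 comparison flagged in the final item above; every number below is a hypothetical placeholder rather than a value from the paper:

    # Cost comparison for one hypothetical 2-hour task (illustrative numbers only).
    human_hourly_rate = 100.0        # assumed expert labor cost, $/hour
    task_hours = 2.0
    human_cost = human_hourly_rate * task_hours

    tokens_used = 400_000            # assumed tokens consumed by a successful agent run
    price_per_million_tokens = 10.0  # assumed blended $/1M tokens
    agent_cost = tokens_used / 1e6 * price_per_million_tokens

    print(f"agent ≈ ${agent_cost:.2f} vs human ≈ ${human_cost:.0f} "
          f"(ratio ≈ {agent_cost / human_cost:.1%})")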

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: The length of tasks (measured by how long they take human...
Full Caption

Figure 1: The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years (Section 4).

Figure/Table Image (Page 2)
First Reference in Text
We find that the 50% time horizon has been growing exponentially from 2019-2025 on our tasks, with a doubling time of approximately seven months (Figure 1).
Description
  • AI Capability Metric: 50% Task Completion Time Horizon: The figure illustrates how the capability of advanced AI systems, specifically 'generalist autonomous frontier model agents' (cutting-edge AIs designed for diverse, independent task execution), has changed over time. Capability is measured using a metric called the '50% task completion time horizon'. This represents the maximum duration a typical human expert would need to complete a task that the AI system can successfully finish 50% of the time it tries.
  • Data Representation: The figure shows data points representing various AI models released between 2019 and 2025. For each model, the corresponding point indicates the calculated 50% time horizon on the vertical axis (logarithmic scale) plotted against its release date on the horizontal axis.
  • Observed Trend: Exponential Growth: The central finding highlighted is that this AI capability metric has been increasing 'exponentially'. This means the rate of improvement is accelerating, similar to how money grows with compound interest. The figure indicates this time horizon has been doubling approximately every 7 months over the past 6 years.
  • Trend Quantification: The figure includes a trend line fitted to the data points, visually representing the exponential growth. Text within the figure quantifies this trend, stating a doubling time of 7 months and a high R-squared value (R² = 0.98), a statistical measure suggesting the exponential model fits the observed data very well.
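A minimal sketch of the trend fit summarized above, using invented (release date, horizon) pairs rather than the paper's data: regress log2(horizon) on release date and convert the slope into a doubling time.

    import numpy as np

    # Hypothetical (release year, 50% horizon in minutes) pairs.
    years    = np.array([2019.2, 2020.5, 2022.2, 2023.2, 2024.5, 2025.1])
    horizons = np.array([0.03,   0.15,   1.0,    5.0,    25.0,   55.0])

    # OLS on log2(horizon): the slope is the number of doublings per year.
    slope, intercept = np.polyfit(years, np.log2(horizons), 1)
    doubling_time_days = 365.25 / slope

    residuals = np.log2(horizons) - (slope * years + intercept)
    r2 = 1 - residuals.var() / np.log2(horizons).var()
    print(f"doubling time ≈ {doubling_time_days:.0f} days, R² = {r2:.2f}")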
Scientific Validity
  • Novelty and Potential Limitations of the Metric: The '50% time horizon' metric is a novel and intuitive way to quantify AI agent capabilities in terms of human task completion time. However, its validity depends heavily on the representativeness of the chosen task suite (HCAST, RE-Bench, SWAA) and the reliability of the human baseline time estimates. The paper acknowledges potential limitations regarding external validity.
  • Robustness of the Exponential Trend Fit: The claim of exponential growth (doubling time of ~7 months) appears strongly supported by the data presented visually and the high R-squared value (0.98) reported for the fit. This suggests a consistent trend over the 2019-2025 period for the specific tasks evaluated.
  • Reliance on Human Baselines: The measurement relies on comparing AI performance to 'human professionals'. The definition, selection, and performance consistency of these human baseliners are crucial for the metric's validity. Variability in human skill, motivation, and context could introduce noise or bias into the task time estimates, affecting the calculated AI time horizons.
  • Scope and Generalizability: The observed trend reflects the performance of specific AI models within particular agent scaffolds on a defined set of tasks. Generalizing this trend to broader AI capabilities or different task domains requires caution, as performance can be sensitive to the evaluation setup and task distribution.
  • Potential Confounding Factors: While the trend is strong, attributing it solely to core AI model improvements versus advancements in agent scaffolding, tool use integration, or prompt engineering techniques used during evaluation is complex. The paper attempts to use consistent scaffolding, but disentangling these factors remains a challenge.
Communication
  • Clarity and Conciseness of Caption: The caption clearly states the figure's main finding: the exponential growth trend in the 50% task completion time horizon for AI agents. It effectively defines the metric (task length solvable by AI with 50% success, measured in human time) and quantifies the trend (doubling every ~7 months over 6 years).
  • Appropriateness of Visual Representation: The figure uses a scatter plot with model release date on the x-axis and the logarithm of human task time on the y-axis. This logarithmic scaling on the y-axis is appropriate for visualizing exponential growth as a linear trend, making the doubling time concept easier to grasp.
  • Inclusion of Trend Metrics and Uncertainty: The inclusion of the trend line, the calculated doubling time (7 months), the 95% confidence interval for the doubling time (171 to 249 days, mentioned in figure text), and the R-squared value (0.98, indicating a strong fit) enhances the figure's informativeness and allows assessment of the trend's robustness and uncertainty.
  • Contextual Labeling of Data Points: Labeling specific AI models (e.g., GPT-2, GPT-4 0314, Claude 3.7 Sonnet) on the plot provides valuable context and allows readers to associate specific capability levels with known systems.

Methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: Our methodology for measuring AI agent time horizon.
Figure/Table Image (Page 3)
First Reference in Text
We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.
Description
  • Overall Methodology Overview: The figure outlines a three-step process designed to measure the 'time horizon' of AI agents. The time horizon is a way to quantify how advanced an AI is by determining the length of tasks it can reliably complete, where task length is measured by how long it takes a skilled human.
  • Step 1: Task Suite Creation: The first step involves creating a 'Diverse Task Suite'. This means assembling a collection of tasks varying in type and difficulty. The figure shows examples like HCAST (diverse agency tasks, 1 min-30 hrs), SWAA (short software actions, 1-30 sec), and RE-Bench (AI R&D tasks, 8 hrs), totaling 170 tasks used in the study. Task diversity is important to get a broad measure of capability.
  • Step 2: Performance Measurement (Human & AI): The second step is 'Task Performance' measurement. This involves having both humans ('Human Runs') and AI systems ('Agent Runs') attempt the tasks. For humans, the time taken to successfully complete a task is recorded to establish a 'Time Estimate' for its difficulty. For AI agents, their 'Success Rate' (how often they complete the task correctly) is measured.
  • Step 3: Time Horizon Calculation and Trend Analysis: The third step is 'Time Horizon Analysis'. This uses the data from Step 2. It involves finding the 'Horizon Length' for each AI model – essentially, the human time estimate corresponding to tasks the AI can solve with a specific success rate (e.g., 50%). This Horizon Length is then plotted against the 'Model Release Dates' to analyze trends over time, such as calculating the 'Doubling Time' – how long it takes for the AI's time horizon to double.
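One way to picture the data flowing through these three steps (an assumed record layout for illustration, not the authors' schema):

    from dataclasses import dataclass

    @dataclass
    class Task:
        family: str            # e.g. "munge_data"
        suite: str             # "HCAST", "RE-Bench", or "SWAA"
        human_minutes: float   # step 2 output: time estimate from human baseline runs

    @dataclass
    class AgentRun:
        model: str             # e.g. "gpt-4-0314"
        task: Task
        success: bool          # step 2 output: scored agent attempt

    # Step 3 groups AgentRuns by model, fits P(success) against log(task.human_minutes),
    # reads off the 50% horizon, then regresses log(horizon) on model release date
    # to obtain the doubling time.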
Scientific Validity
  • Structured Quantitative Approach: The methodology provides a structured, quantitative approach to tracking AI progress based on task completion capabilities relative to human performance, moving beyond simple benchmark scores. The concept of 'time horizon' offers an interpretable metric.
  • Dependence on Task Suite Representativeness: The validity hinges significantly on the quality and diversity of the task suite. If the suite is not representative of real-world tasks or relevant capabilities, the measured time horizon may not generalize. The paper uses HCAST, RE-Bench, and SWAA, aiming for diversity, but representativeness is inherently difficult to guarantee.
  • Reliance on Human Baseline Quality: Establishing accurate and unbiased human baseline times ('Time Estimate') is critical and challenging. Factors like baseliner skill, experience, motivation, and the exclusion of failed attempts can influence the time estimates and thus the calculated AI horizons.
  • Methodological Choices in Analysis: The use of a specific success rate threshold (e.g., 50% or 80%) and a logistic model (mentioned in text, abstracted in figure) to determine the time horizon involves methodological choices. The sensitivity of the results to these choices (threshold level, specific regression model) should be considered.
  • Focus on Task Completion Time/Success: The methodology focuses on task completion success and time, but doesn't explicitly capture other aspects of performance like efficiency (e.g., computational cost, mentioned later in Fig 13), robustness, or safety, which are also crucial dimensions of AI capability.
Communication
  • Flowchart Structure and Clarity: The figure effectively uses a three-panel flowchart structure to visually summarize the core methodology, enhancing reader comprehension compared to text alone. The numbered steps (1. Diverse Task Suite, 2. Task Performance, 3. Time Horizon Analysis) provide a clear, logical flow.
  • Iconography and Labeling: The use of simple icons (e.g., clock, checkmark, cross, graph) and concise labels within each panel aids in quickly grasping the key inputs and outputs of each methodological step (e.g., Time Estimate, Success Rate, Horizon Length, Doubling Time).
  • Visual Summary Effectiveness: The diagram serves as an excellent visual abstract of the methodology described in the text (Sections 3 and 4), providing a high-level overview before delving into details. It helps orient the reader to the overall process.
  • Abstraction of Analysis Details: While generally clear, the specific mathematical relationship or modeling technique used in Step 3 (Time Horizon Analysis) to derive Horizon Length and Doubling Time from Time Estimates, Success Rates, and Model Release Dates is abstracted. Readers need to consult the text (Section 4.1) for the logistic regression details.
Table 1: Example tasks of differing durations.
Figure/Table Image (Page 7)
First Reference in Text
We find that our contract baseliners take 5x-18x longer to resolve issues than repository maintainers.
Description
  • Purpose of the Table: The table presents five specific examples of tasks used to evaluate AI agents, chosen to illustrate the variety of task types and the wide range of time they typically take skilled humans to complete.
  • Table Structure and Content: Each row represents a different task, identified by a 'Family' name (like 'find_shell_script' or 'cuda_backtesting'). It lists the estimated 'Length' (time for a human expert, ranging from 3 seconds to 8 hours) and provides a brief 'Description' of what the task involves.
  • Range of Task Examples and Durations: The tasks range significantly in complexity and duration. For instance, 'find_shell_script' is a quick multiple-choice question taking 3 seconds. 'wikipedia_research' involves finding factual information, taking about 1 minute. 'munge_data' requires writing a script to transform data formats, estimated at 56 minutes. 'cuda_backtesting' is a complex programming task involving optimizing code using CUDA (a platform for parallel computing using NVIDIA graphics cards) for financial backtesting (simulating trading strategies on historical data), estimated to take 8 hours.
  • Variety of Skills Illustrated: The descriptions highlight different skills tested, including basic knowledge retrieval, debugging (fixing errors in code or data, like in 'oxdna_simple' which involves a molecular dynamics simulation package called oxDNA), data processing, and advanced programming/optimization.
Scientific Validity
  • Representativeness of Examples: The table provides anecdotal examples rather than a systematic overview of the entire task distribution. While illustrative, these few examples may not fully represent the balance or characteristics of the 170 tasks used in the study.
  • Accuracy of Time Estimates: The 'Length' column represents estimated human completion times. The accuracy and consistency of these estimates (whether derived from actual baselining or researcher estimation) are crucial for the table's validity as an illustration of task difficulty scaling. The text mentions these are based on geometric means or estimates.
  • Plausibility of Task Descriptions and Durations: The tasks selected appear relevant to software engineering, ML, and general reasoning, aligning with the paper's focus. The descriptions seem plausible for the estimated durations.
  • Illustrative Purpose vs. Data Presentation: The table serves primarily as an illustrative tool within the 'Methods' section to give context to the task suite. Its scientific contribution is limited to providing concrete examples rather than presenting primary data or analysis.
Communication
  • Clarity and Structure: The table effectively uses a simple columnar format (Family, Length, Description) to present concrete examples of tasks used in the study, making it easy to grasp the range of complexities and time scales involved.
  • Illustrative Examples: Providing specific examples with estimated human completion times (from seconds to hours) gives readers a tangible sense of the different capability levels being measured. This is more illustrative than abstract descriptions alone.
  • Conciseness and Informativeness of Descriptions: The descriptions, while concise, offer sufficient detail to understand the nature of each task (e.g., multiple choice, research, bug fixing, data transformation, code optimization).
  • Demonstration of Task Range: The selection spans several orders of magnitude in duration (3 seconds to 8 hours), effectively conveying the wide range of task difficulties included in the overall benchmark suite.
Table 2: The source of our time estimates by task suite.
Figure/Table Image (Page 8)
First Reference in Text
In total, 148 of our 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.
Description
  • Purpose of the Table: This table details where the researchers got the 'time estimates' – the data on how long each task typically takes a human expert – for the different sets of tasks (called 'task suites') used in their study.
  • Breakdown by Task Suite and Source: It breaks down the task suites: HCAST, RE-Bench, and SWAA. For each suite, it specifies the 'Time Estimate Source', meaning whether the time was determined by 'Baseline' (timing actual humans performing the task) or 'Estimate' (researchers estimating the time, likely based on judgment or indirect data).
  • Numerical Breakdown of Tasks: The table shows the 'Number of Tasks' for each category. For HCAST, 76 tasks had times derived from human baselines, while 21 tasks had researcher-estimated times. For RE-Bench, all 6 tasks used human baselines. For SWAA, all 66 tasks used human baselines. No tasks in RE-Bench or SWAA relied on researcher estimates.
  • Summary of Data Sources: Overall, the table indicates that out of the 169 tasks considered in this part of the analysis (76+21+6+66), the vast majority (76+6+66 = 148 tasks) had their difficulty measured using direct human performance data, while a smaller portion (21 tasks, all within the HCAST suite) relied on researcher estimations.
Scientific Validity
  • Use of Researcher Estimates: The table highlights a potential source of uncertainty or bias in the overall results, as 21 out of 97 HCAST tasks (and 21 out of 169 total tasks) rely on researcher estimates rather than direct human baselining. Researcher estimates can be subjective and less reliable than empirical data.
  • Potential Bias in HCAST Estimates: The reliance on estimates specifically within the HCAST suite, which contains longer and potentially more complex tasks, might disproportionately affect the time horizon calculations for more capable models if these estimates are systematically biased (e.g., under- or over-estimating difficulty).
  • Transparency Regarding Data Origin: The transparency in reporting the source for each task suite is commendable and allows readers to assess the potential impact of the estimation method. It strengthens the credibility of the study by acknowledging this limitation.
  • Predominance of Baseline Data: The fact that the majority of tasks (148/169) and all tasks in RE-Bench and SWAA are based on human baselines provides a reasonably strong empirical foundation for the overall analysis, mitigating concerns about the researcher estimates to some extent.
Communication
  • Clarity and Structure: The table clearly and concisely summarizes the origin of the human time estimates for each task suite (HCAST, RE-Bench, SWAA). The structure (Suite, Source, Number of Tasks) is logical and easy to follow.
  • Transparency of Data Sources: It effectively communicates the reliance on actual human performance data ('Baseline') for the majority of tasks (148 out of 169, or ~88%), while transparently indicating where researcher judgment ('Estimate') was used (21 HCAST tasks).
  • Complementary to Text: The table provides a useful breakdown that complements the textual description, allowing readers to quickly grasp the data foundation for the time horizon calculations.
Figure 3: Average task success rate across our entire combined suite, for each...
Full Caption

Figure 3: Average task success rate across our entire combined suite, for each model.

Figure/Table Image (Page 9)
First Reference in Text
We performed 8 runs per agent/task pair and report the average results in Figure 3.
Description
  • Purpose: Model Performance Comparison: This figure presents a bar chart comparing the overall performance of various AI models released between approximately 2019 and 2025.
  • Metric Shown: Average Success Rate: The height of each bar represents the 'Average task success rate' for a specific AI model, calculated across all tasks in the combined benchmark suite (HCAST, SWAA, and RE-Bench). Success rate is the percentage of tasks the AI completed successfully.
  • Models Compared and General Trend: The models evaluated include early models like GPT-2 (showing near 0% success) and davinci-002 (GPT-3, also low success), various versions of GPT-4, Claude 3 models, and the o1 models. The bars generally increase in height from left to right, indicating that newer models tend to perform better.
  • Diversity Weighting: The success rates are 'diversity-weighted'. As explained in the caption's footnote, this means tasks belonging to larger 'families' (groups of similar tasks) have their contribution to the average reduced (specifically, weighted by the inverse square root of the family size). This prevents the overall average from being skewed by performance on just one or two large groups of similar tasks.
  • Performance of Recent Models: Recent models like Claude 3.7 Sonnet and o1 show the highest average success rates, appearing to exceed 60-70% on this weighted average across the task suite.
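A minimal sketch of the inverse-square-root diversity weighting described above, with hypothetical task families:

    import numpy as np
    from collections import Counter

    # Hypothetical per-task results for one model: (task family, success rate).
    results = [("web_research", 0.9), ("web_research", 0.7), ("web_research", 0.8),
               ("web_research", 0.6), ("cuda_opt", 0.1), ("debug_oxdna", 0.5)]

    family_sizes = Counter(family for family, _ in results)
    weights = np.array([1 / np.sqrt(family_sizes[family]) for family, _ in results])
    scores  = np.array([score for _, score in results])

    # The large 'web_research' family counts for less after weighting.
    print(f"unweighted mean = {scores.mean():.2f}, "
          f"diversity-weighted mean = {np.average(scores, weights=weights):.2f}")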
Scientific Validity
  • Basic Performance Overview: The figure presents a simple average success rate, which serves as a basic measure of overall capability across the defined task suite before more complex analysis (like time horizon calculation).
  • Validity of Averaging Across Diverse Tasks: The validity of the average success rate as a meaningful measure depends on the representativeness of the task suite and the appropriateness of the scoring criteria (often binary success/failure, as detailed later). Averaging can mask significant variations in performance across different task types or domains.
  • Justification and Impact of Diversity Weighting: The 'diversity weighting' (by inverse square root of family size) is a reasonable heuristic to mitigate the influence of large task families. However, the specific choice of weighting function is somewhat arbitrary, and different weighting schemes could yield different average scores.
  • Number of Runs per Task: The results are based on an average of 8 runs per agent/task pair (mentioned in reference text), which adds some robustness against single-run anomalies. However, the figure itself doesn't show the variability across these runs.
  • Influence of Agent Scaffolding: The performance reflects the specific combination of AI model and the 'scaffold' (the code environment and tools the AI uses to interact with tasks). Differences in scaffolding, especially for the o1 models (mentioned in Section 3.3.1), could influence performance comparisons.
Communication
  • Clarity of Bar Chart Format: The bar chart format provides a clear and straightforward visual comparison of the average performance levels across the different AI models tested.
  • Effective Ordering of Models: Models are ordered chronologically by approximate release date along the x-axis, effectively illustrating the general trend of increasing success rates over time.
  • Clear Axis Labeling: The y-axis is clearly labeled 'Average success rate' with a range from 0% to 100%, making the performance scale easy to interpret.
  • Informative Title and Caption: The title and caption clearly state what is being plotted (average success rate), the scope (entire combined suite), and mention the diversity weighting, providing essential context.
  • Lack of Variance Information: While showing averages is useful for a high-level overview, the chart does not display variance or confidence intervals for the success rates, which could provide additional insight into the reliability of the differences between models.
Figure 4: Model success rates are negatively correlated with how much time it...
Full Caption

Figure 4: Model success rates are negatively correlated with how much time it takes a human to complete the task.

Figure/Table Image (Page 10)
First Reference in Text
This decrease in success rate over length (Figure 4) is well-fit by an exponential model (R² ≈ 0.83 when regressing model success rate against the logarithm of human time-to-complete).
Description
  • Purpose: Relationship between Human Time and AI Success: This figure is a scatter plot designed to show the relationship between how long a task takes a human expert to complete and how successful AI models are, on average, at completing that same task.
  • Data Representation: Axes and Data Points: Each point on the graph represents a single task from the benchmark suite (HCAST, RE-Bench, or SWAA, indicated by different markers). The horizontal position (x-axis) of a point shows the time it typically takes a human to complete that task, plotted on a logarithmic scale (meaning equal distances represent multiplicative increases, like 1 sec, 10 sec, 100 sec). The vertical position (y-axis) shows the 'Mean Model Success Rate' – the average success rate achieved across all the different AI models tested on that specific task.
  • Observed Trend: Negative Correlation: The overall pattern of the points slopes downwards from left to right. This indicates a 'negative correlation': tasks that take humans a shorter time (left side) tend to have higher average success rates for AI models (top part), while tasks that take humans longer (right side) tend to have lower average success rates for AI models (bottom part).
  • Trend Quantification: Regression Line and R-squared: A dashed line representing a linear regression fit is drawn through the points, visually summarizing the trend. The figure also reports an R-squared (R²) value of 0.83. R-squared is a statistical measure of how well the regression line fits the data, ranging from 0 to 1. A value of 0.83 suggests that approximately 83% of the variation in the average AI success rate across tasks can be explained by the logarithm of the human completion time, indicating a strong relationship.
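A minimal sketch of the reported correlation, using made-up per-task values: regress each task's mean model success rate on the logarithm of its human completion time and compute R².

    import numpy as np

    # Hypothetical per-task data: human time (minutes) and success rate averaged
    # across all models.
    human_minutes = np.array([0.05, 0.2, 1.0, 5.0, 15.0, 60.0, 240.0, 480.0])
    mean_success  = np.array([0.95, 0.90, 0.80, 0.60, 0.45, 0.30, 0.15, 0.10])

    x = np.log(human_minutes)
    slope, intercept = np.polyfit(x, mean_success, 1)
    r2 = np.corrcoef(x, mean_success)[0, 1] ** 2  # R² of the simple linear fit
    print(f"slope = {slope:.3f} per log-unit of human time, R² = {r2:.2f}")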
Scientific Validity
  • Validation of Human Time as Difficulty Proxy: The figure provides strong evidence that human completion time serves as a reasonable proxy for task difficulty for AI models, averaged across models. The negative correlation (longer human time implies lower AI success) aligns with intuition.
  • Support for Methodological Approach: The high R-squared value (0.83) suggests a strong relationship between log(human time) and mean AI success rate. This supports the methodological choice later in the paper to use human time as the primary dimension for calculating the 'time horizon' via logistic regression (inspired by Item Response Theory).
  • Residual Variance and Other Factors: While the trend is strong (R²=0.83), there is still considerable scatter around the regression line. This indicates that human time alone does not perfectly predict AI success; other task characteristics (e.g., specific skills required, 'messiness') also influence difficulty for AI, which is explored later in the paper (Section 6.2, Figure 10).
  • Use of Mean Success Rate Across Models: The y-axis represents the mean success rate across all tested models. This averaging might obscure important differences between models. For instance, some models might deviate significantly from this average trend, performing unusually well or poorly on tasks of specific lengths. The analysis focuses on the average difficulty landscape.
  • Consistency between Figure and Reference Text: The regression mentioned in the reference text (exponential model fit, R² ≈ 0.83) corresponds to the linear fit shown in the figure because the x-axis (human time) is plotted logarithmically while the y-axis (success rate) is linear. A linear relationship on these axes implies an exponential decay relationship between linear human time and success rate.
Communication
  • Appropriateness of Scatter Plot: The scatter plot effectively visualizes the relationship between the difficulty of a task (measured by human completion time) and the average success rate of AI models on that task.
  • Clarity of Axes and Scaling: The axes are clearly labeled ('Human Time-to-Complete' and 'Mean Model Success Rate'). The logarithmic scale on the x-axis is appropriate for visualizing data spanning several orders of magnitude (seconds to hours) and is clearly indicated by the non-linear spacing of time labels.
  • Legend for Task Suites: The inclusion of a legend distinguishing between task suites (HCAST, RE-Bench, SWAA) allows readers to see if the trend holds across different types of tasks, although the overlap makes detailed comparison difficult.
  • Inclusion of Trend Line and R-squared: Adding the linear regression trend line and the R-squared value (R² = 0.83) directly on the plot provides immediate visual and statistical confirmation of the negative correlation mentioned in the caption.
  • Caption Clarity: The caption clearly and accurately summarizes the main takeaway message of the figure.
Figure 5: Success rates of all models on our test suite, showing the...
Full Caption

Figure 5: Success rates of all models on our test suite, showing the computation of time horizon as predicted 50% success rate time.

Figure/Table Image (Page 11)
First Reference in Text
Specifically, we perform Ordinary Least Squares regression on log(model_horizon) = a + β·release_date.
Description
  • Figure Structure: Multi-Panel Plots per Model: This figure consists of multiple small plots (panels), one for each AI model tested, illustrating how the '50% time horizon' is calculated for that specific model.
  • Axes Definition: Each panel plots the AI model's 'Success Probability' (vertical axis, ranging from 0 to 1, or 0% to 100%) against the 'Task length (human time-to-complete)' (horizontal axis, logarithmic scale from seconds to days).
  • Data Representation: Empirical Points and Fitted Curve: The blue bars/points represent the actual measured 'Empirical success rates' of the model on tasks grouped by their human completion time, with error bars indicating the uncertainty (±2 standard errors). The green curve is a 'Fitted curve' derived from a statistical technique called logistic regression. This curve models the probability of the AI succeeding as the task length (difficulty) increases – generally, success probability decreases as tasks get longer.
  • Time Horizon Calculation (50% Threshold): The key calculation shown is the 'Time Horizon'. This is found by locating the point where the fitted green curve intersects the 50% success probability level (0.5 on the vertical axis). A vertical dashed red line is drawn down from this intersection point to the horizontal axis, indicating the corresponding human task completion time. This time value is the model's 50% time horizon.
  • Examples of Calculated Time Horizons: The figure shows a wide range of time horizons across models. For example, GPT-2 has a horizon of only 2 seconds, davinci-002 (GPT-3) is at 9 seconds, GPT-4 0314 is at 5 minutes, while the most capable model shown, Claude 3.7 Sonnet, reaches a 50% time horizon of 59 minutes.
  • Observed Data Discontinuity ('Jump'): The caption notes a 'jump' in success rates between tasks shorter than 1 minute (mostly SWAA tasks) and tasks longer than 1 minute (mostly HCAST tasks), which is visible as a potential discontinuity in the empirical data points around the 1-minute mark in some panels.
Scientific Validity
  • Appropriateness of Logistic Regression (IRT): The use of logistic regression to model success probability as a function of task difficulty (log human time) is a standard and appropriate technique, drawing inspiration from Item Response Theory (IRT) as mentioned in Section 4.1. It provides a principled way to estimate the difficulty level corresponding to a specific success probability.
  • Goodness-of-Fit Assessment: The figure visually suggests the logistic model provides a 'fairly good' fit, as stated in the caption, capturing the general downward trend of success rate with increasing task length for most models. However, deviations exist.
  • Implications of the SWAA/HCAST 'Jump': The noticeable 'jump' or discontinuity around the 1-minute mark in several panels, corresponding to the boundary between SWAA and HCAST tasks, suggests the simple logistic model might not perfectly capture the underlying data structure. This could stem from differences in task types, scoring methods, or human baseline accuracy between the two suites, potentially affecting the precision of the calculated horizon, especially for models whose 50% threshold falls near this boundary.
  • Dependence on Input Data Quality: The calculation of the time horizon depends directly on the quality of the human baseline times (x-axis) and the measured AI success rates (y-axis data points). Any noise or bias in these underlying measurements will propagate into the fitted curve and the resulting time horizon estimate.
  • Choice of 50% Success Threshold: The choice of the 50% success threshold is a key parameter. While common in psychometrics, the resulting time horizon value is specific to this threshold. Using a different threshold (e.g., 80%, as shown in Figure 6) would yield different, likely shorter, time horizons.
  • Representation of Uncertainty: The error bars (±2SE) on the empirical points provide some indication of uncertainty in the measured success rates for binned task lengths, but the figure does not explicitly show confidence intervals for the fitted logistic curve or the derived time horizon itself, although these are likely incorporated into the error bars shown later in Figure 1.
Communication
  • Multi-Panel Layout Clarity: The multi-panel layout, dedicating one plot per model, allows for a clear, uncluttered view of each model's performance curve and its corresponding time horizon calculation.
  • Integration of Empirical Data and Model Fit: Within each panel, the visualization effectively combines empirical data (binned success rates with error bars) and the fitted logistic regression curve, allowing readers to visually assess the model fit.
  • Clear Indication of Time Horizon: The use of a vertical dashed red line clearly indicates the calculated 50% time horizon, and the accompanying text explicitly states this value (e.g., 'Time Horizon: 59 min' for Claude 3.7 Sonnet), making the key output easily identifiable.
  • Consistent Axis Scaling: Consistent axis scaling (logarithmic x-axis for time, linear y-axis for probability) across panels facilitates comparison between models, although direct visual comparison requires looking across multiple plots.
  • Legend Clarity: The legend clearly explains the components: the fitted curve and the empirical success rates with standard error bars (±2SE).
  • Caption Accuracy: The caption accurately describes the figure's content, explaining that it shows success rates and how the 50% time horizon is derived.
  • Visualization of Data Discontinuity: The visual 'jump' in success rates around the 1-minute mark, mentioned in the caption text, is apparent in several panels, visually supporting the text's observation about the SWAA/HCAST boundary.
Figure 6: Trend in 80% success rate time horizon.
Figure/Table Image (Page 12)
First Reference in Text
We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.
Description
  • Purpose: Trend in Higher-Reliability AI Capability: This figure plots the progress of AI models over time, similar to Figure 1, but uses a stricter measure of capability: the '80% task completion time horizon'. This represents the maximum length of a task (measured in human completion time) that an AI model can successfully complete 80% of the time it tries.
  • Plot Axes and Scaling: Like Figure 1, it's a scatter plot where the horizontal axis is the AI model's release date (from 2019 to 2027) and the vertical axis is the calculated time horizon, plotted on a logarithmic scale (from 1 second to 4 hours).
  • Data Points and Trend Line: Each point represents an AI model (e.g., davinci-002, GPT-4 0314, Claude 3.7 Sonnet), showing its calculated 80% time horizon. A trend line is fitted to these points.
  • Observed Trend and Quantification: The figure shows that the 80% time horizon has also been growing exponentially, with a calculated doubling time of 213 days. The R-squared value (R² = 0.97) indicates this exponential model fits the data very well for the models shown (starting from 2020).
  • Comparison with 50% Time Horizon: A faint grey line representing the 50% time horizon trend from Figure 1 is included for comparison. Visually, the points and trend line for the 80% horizon are significantly lower on the graph than the 50% horizon trend, indicating that achieving 80% reliability requires tackling much shorter tasks compared to achieving 50% reliability.
  • Magnitude of 80% Horizon: The most capable model shown (Claude 3.7 Sonnet) has an 80% time horizon of around 15 minutes, substantially less than its 50% time horizon of nearly 1 hour shown in Figure 5.
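Given a single fitted logistic curve, the 50% and 80% horizons differ only in where that curve is cut; a small sketch with assumed fit parameters (not values estimated from the paper's data):

    import math

    # Assumed logistic fit for one model: P(success) = sigmoid(b + w * log2(minutes)),
    # with w < 0 so success falls as tasks get longer.
    w, b = -0.6, 3.55

    def horizon(p):
        """Human task time (minutes) at which the fitted success probability equals p."""
        logit_p = math.log(p / (1 - p))
        return 2 ** ((logit_p - b) / w)

    print(f"50% horizon ≈ {horizon(0.5):.0f} min, 80% horizon ≈ {horizon(0.8):.0f} min")

With these assumed parameters the 80% horizon comes out roughly 5x shorter than the 50% horizon, mirroring the qualitative gap described in the text.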
Scientific Validity
  • Sensitivity Analysis for Success Threshold: Calculating the 80% time horizon provides a valuable sensitivity analysis regarding the choice of success threshold. It demonstrates that while the exponential growth trend appears robust (similar doubling time to 50% horizon), the absolute capability level is highly sensitive to the reliability requirement.
  • Relevance of Higher Reliability Metric: The 80% threshold represents a higher standard of reliability, which may be more relevant for practical applications where occasional failure is less acceptable. Measuring this provides a more conservative estimate of AI capabilities.
  • Highlighting the Reliability Gap: The significantly shorter horizons at the 80% level (visually ~5x shorter, as stated in text) highlight a substantial gap between models sometimes succeeding on complex tasks (50% horizon) and reliably succeeding (80% horizon). This suggests limitations in current models' robustness or consistency.
  • Potential Estimation Challenges: Estimating the 80% success point might be statistically more challenging or require more data than the 50% point, especially for harder tasks where success rates are low. This could introduce greater uncertainty into the 80% horizon estimates, although the R²=0.97 suggests a good fit for the models included.
  • Model Inclusion Criteria and Trend Start Date: The analysis excludes the earliest models (like GPT-2) and starts the trend fit from 2020-01-01. This is likely because their performance was too low to reliably estimate an 80% success threshold, even on the easiest tasks. This selective inclusion should be noted when interpreting the trend.
Communication
  • Consistent Visual Format: The figure effectively uses the same format as Figure 1 (log-linear plot of time horizon vs. release date) to show the trend for the 80% success rate threshold, facilitating direct comparison.
  • Effective Visual Comparison (50% vs 80%): Including the 50% horizon trend line (in grey) provides an immediate visual reference point, clearly illustrating that the 80% horizons are substantially lower, as mentioned in the text.
  • Clear Axis Labeling: The axes are clearly labeled, with the y-axis specifying 'Task time (for humans) that model completes with 80% success rate', removing ambiguity about the metric being plotted.
  • Inclusion of Trend Metrics: Key metrics like the doubling time (213 days) and R-squared value (0.97) are displayed directly on the plot, summarizing the trend's characteristics.
  • Legend Clarity: The legend clearly identifies the models plotted, although fewer models are included compared to the 50% horizon plot (Figure 1), likely due to difficulty in estimating 80% horizons for lower-performing models.
Figure 14: Stacked histogram of tasks by difficulty rating.
Figure/Table Image (Page 27)
First Reference in Text
HCAST mainly includes tasks longer than 4 minutes, while we focused on tasks in the 2-second to 15-second range with SWAA in order to measure GPT-2 and GPT-3.
Description
  • Purpose: Task Distribution by Difficulty: This figure is a 'stacked histogram', a type of bar chart used to show how data is distributed across different categories or ranges. Here, it shows the distribution of the benchmark tasks based on their difficulty.
  • X-axis: Task Difficulty (Log Human Time): The horizontal axis (x-axis) represents the 'Human task time', which serves as the measure of task difficulty. It's shown on a logarithmic scale, meaning equal distances represent multiplicative increases in time (e.g., 1 sec, 4 sec, 15 sec, 1 min, 4 min, etc.). Tasks are grouped into bins based on their estimated human completion time.
  • Y-axis: Number of Tasks: The vertical axis (y-axis) represents the 'Number of tasks' falling into each difficulty bin.
  • Stacked Bars: Contribution of Task Suites: Each bar is 'stacked', meaning it's divided into colored segments. The colors correspond to the source task suite: SWAA (green), RE-Bench (orange), and HCAST (blue), as indicated by the legend. The height of each colored segment within a bar shows how many tasks from that specific suite fall into that difficulty bin. The total height of a bar shows the total number of tasks in that bin.
  • Observed Distribution Pattern: The histogram shows that the SWAA tasks (green) are concentrated in the very short duration bins (mostly under 15 seconds). The HCAST tasks (blue) and RE-Bench tasks (orange) make up the majority of tasks in the longer duration bins (minutes to hours). There appears to be a lower density of tasks in the range between roughly 15 seconds and 1-4 minutes.
Scientific Validity
  • Accurate Representation of Task Suite: The histogram accurately depicts the composition of the combined task suite used in the study, categorized by source and estimated human difficulty. It provides a transparent overview of the benchmark's characteristics.
  • Justification for SWAA Suite: The figure visually justifies the creation of the SWAA suite. By showing the concentration of HCAST/RE-Bench tasks at longer durations, it highlights the need for shorter tasks (provided by SWAA) to effectively measure the capabilities of less advanced models like GPT-2 and GPT-3, which would likely fail most HCAST/RE-Bench tasks.
  • Implications of the Distribution Gap: The distribution reveals a potential limitation: the relative sparsity of tasks in the intermediate difficulty range (tens of seconds to a few minutes). This gap could slightly affect the precision of the time horizon calculation for models whose 50% capability level falls within this specific range, as the logistic regression fit might be less constrained by data in this zone.
  • Dependence on Human Time Estimates: The validity of the distribution relies on the accuracy of the human time estimates used to assign tasks to difficulty bins. Any systematic errors in these time estimates would distort the histogram.
  • Choice of Histogram Bins: The choice of bin widths for the histogram can influence its visual appearance. While the logarithmic scale helps manage the wide range, the specific bin boundaries chosen are not explicitly stated but appear reasonable for visualizing the overall distribution.
Communication
  • Appropriateness of Visualization: The stacked histogram is an appropriate visualization choice to show both the overall distribution of tasks by difficulty (human time) and the contribution of each task suite (HCAST, RE-Bench, SWAA) within each difficulty bin.
  • Clarity of Logarithmic X-axis: The logarithmic scale on the x-axis ('Human task time') effectively handles the wide range of task durations (seconds to hours) and allows different time scales to be represented clearly.
  • Effective Use of Stacking and Legend: The color-coded stacking and the legend clearly distinguish the three task sources, making it easy to see their relative prevalence at different difficulty levels.
  • Visual Confirmation of Textual Claims: The figure visually confirms the statement in the caption and reference text: SWAA tasks are concentrated at the very short end (seconds), while HCAST and RE-Bench tasks dominate the longer durations (minutes to hours).
  • Highlighting the Task Distribution Gap: The histogram clearly highlights a relative gap in task density between the ~15-second upper range of SWAA and the ~1-4 minute lower range of HCAST, which is relevant to the study's methodology for measuring models across different capability levels.
Figure 16: Success rates and time horizon of human baseliners.
Figure/Table Image (Page 29)
First Reference in Text
Figure 16 shows a graph of baseliner success rate by task length.
Description
  • Purpose: Calculate Human Time Horizon: This figure applies the same time horizon calculation methodology used for AI models (as shown in Figure 5) to the human baseline data.
  • Axes Definition: The plot shows the 'Success Probability' (vertical axis) of the human baseliners completing tasks versus the 'Task length (human time-to-complete)' (horizontal axis, logarithmic scale).
  • Data Representation: Empirical Rates and Fitted Curve: Grey bars/points represent the empirical success rates of humans on tasks grouped by difficulty, with error bars (±2SE). A black curve shows the fitted logistic regression model representing the probability of human success as task length increases.
  • Calculated Human Time Horizon (1 hr 37 min): Following the same procedure as for AI models, the figure identifies the task length at which the fitted curve crosses the 50% success probability mark. This yields a calculated 'Time Horizon' for the human baseliners of 1 hour and 37 minutes.
  • Observation: Decreasing Success Rate, Low Horizon: The plot shows that human success rates decrease as task length increases, similar to AI models, but the calculated 50% horizon is much lower than might be intuitively expected (e.g., lower than the 8 hours humans were often allotted).
Scientific Validity
  • Methodological Consistency: Applying the same IRT-inspired logistic regression methodology used for AI models to the human data allows for a methodologically consistent comparison, even if the resulting 'human time horizon' requires careful interpretation.
  • Interpretation Challenges of the Human Horizon: The calculated human time horizon of ~1.5 hours is surprisingly low, given baseliners were often paid for up to 8 hours. As discussed critically in the text (Section B.1.1), this low value is likely an artifact of the methodology: filtering only for successful runs (biasing towards shorter times) and potentially incentivizing baseliners to give up early on tasks they perceived as difficult or time-consuming, rather than reflecting the true upper limit of human capability within an 8-hour timeframe.
  • Non-Comparability with AI Horizons: Because the human horizon calculation is affected by these methodological choices (especially filtering for success), it is explicitly noted in the text (Section B.1.1) and implied in the caption's note that this value is not directly comparable to the AI time horizons calculated in the main analysis. It serves more as a methodological check than a true measure of human capability limits under the study conditions.
  • Reflection of Baselining Process Data: The plot does accurately reflect the observed success rates of humans under the specific conditions and incentives of the baselining process. The decreasing success rate with task length is an expected finding.
  • Highlighting Challenges in Human Baselining: The analysis highlights the significant challenges and potential biases involved in establishing reliable human performance baselines, particularly for complex, long-duration tasks where failure rates are high and motivation/incentives play a large role.
Communication
  • Clarity of Visualization: The plot clearly visualizes the relationship between task length and the success rate of the human baseliners, using the same format as the individual model plots in Figure 5 (logistic curve fit to empirical success rates).
  • Data Representation: The inclusion of empirical success rate bins with error bars allows assessment of the data variability and the goodness-of-fit of the logistic curve.
  • Indication of Human Time Horizon: The calculated 50% time horizon (1 hr, 37 min) is clearly indicated with a dashed line and text annotation, analogous to the AI model plots.
  • Contextualization via Caption and Text: The caption, combined with the crucial note in the text (Section B.1.1), clarifies that this calculated 'human time horizon' is artificially low due to methodological factors and not directly comparable to AI horizons, which is important context.
Table 4: Results of baselines on selected internal PRs
Figure/Table Image (Page 30)