This paper addresses the challenge of translating AI benchmark scores into meaningful real-world capabilities. Current benchmarks often saturate quickly or use artificial tasks, making it difficult to gauge AI progress on complex, long-duration work relevant to human experts. To overcome this, the authors propose a novel metric: the 'task completion time horizon'. This metric quantifies AI capability by determining the typical time required by human experts to complete tasks that a given AI model can successfully accomplish with a specific probability (e.g., 50%). It provides an intuitive, continuous measure anchored in human performance.
The methodology involved compiling a diverse suite of tasks spanning software engineering and AI research, ranging from seconds to many hours in human completion time. These tasks were drawn from existing benchmarks (HCAST, RE-Bench) and newly created short tasks (SWAA). Human professionals with relevant expertise were timed performing these tasks to establish baseline durations. Subsequently, various frontier AI models released between 2019 and 2025 were evaluated on the same tasks using standardized agent frameworks ('scaffolding') that provide tools for interaction. By modeling AI success probability as a function of human task time using logistic regression, the 50% time horizon was calculated for each model.
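To make the horizon computation concrete, here is a minimal sketch of the kind of calculation described above: fit a logistic regression of AI success against log human completion time, then solve for the task length at which the predicted success probability equals a target (50%, or 80% for the stricter horizon). The data values and variable names below are illustrative assumptions, not the authors' actual data or code, which additionally involves task weighting and other details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative (made-up) data: human baseline time in minutes for each task,
# and whether the AI agent succeeded on that task.
human_minutes = np.array([2, 5, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
ai_succeeded  = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0])

# Model P(success) as a logistic function of log2(human time), mirroring the
# paper's use of logistic regression on (log) human task length.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, ai_succeeded)  # weak regularization

def horizon(p=0.5):
    """Human task length (minutes) at which the predicted success probability is p."""
    beta0 = clf.intercept_[0]
    beta1 = clf.coef_[0, 0]
    # Solve p = sigmoid(beta0 + beta1 * log2(t))  =>  log2(t) = (logit(p) - beta0) / beta1
    logit_p = np.log(p / (1 - p))
    return 2 ** ((logit_p - beta0) / beta1)

print(f"50% time horizon: {horizon(0.5):.1f} min")
print(f"80% time horizon: {horizon(0.8):.1f} min")
```

Because the fitted success curve declines with task length, the same model yields a shorter horizon at the 80% threshold than at 50%, which is the gap discussed in the results below.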
The central finding is that the 50% time horizon for frontier AI models on these tasks has grown exponentially since 2019, doubling approximately every seven months (95% CI: 171-249 days). Current leading models like Claude 3.7 Sonnet achieve a horizon of around 50-60 minutes. Qualitative analysis suggests this rapid improvement is driven by enhanced reliability, better adaptation to mistakes, improved logical reasoning, and more effective tool use. However, the study also found a significant gap between the 50% horizon and the 80% horizon (requiring higher reliability), where the latter is much shorter (~15 minutes for Claude 3.7 Sonnet), indicating challenges remain in achieving dependable performance on longer tasks.
The authors conducted several experiments to test the robustness and external validity of their findings, including analyzing performance on tasks with varying 'messiness' (real-world complexity factors), comparing results with an external benchmark (SWE-bench), and evaluating models on real internal software issues. While the exponential trend appeared robust across different conditions, the absolute values and doubling times showed sensitivity to the benchmark and human time estimation methods. The paper concludes by discussing the significant implications of this rapid growth for future automation potential (extrapolating to month-long tasks within ~5 years, albeit with caveats) and AI safety, while carefully acknowledging limitations related to task representativeness and the interpretation of human baselines.
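As a rough sanity check on the extrapolation mentioned above (not the paper's exact calculation), one can start from a roughly one-hour 50% horizon and a roughly seven-month doubling time and ask how long it would take to reach a horizon of one work-month; the 167-hour work-month figure is an assumption for illustration.

```python
import math

# Back-of-the-envelope extrapolation: how many doublings from ~1 hour to one
# work-month, and how long does that take at ~7 months per doubling?
current_horizon_hours = 1     # roughly the current frontier 50% horizon
target_horizon_hours  = 167   # one work-month, assuming ~167 working hours
doubling_time_months  = 7

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_time_months
print(f"{doublings_needed:.1f} doublings ≈ {months_needed:.0f} months "
      f"≈ {months_needed / 12:.1f} years")
# ≈ 7.4 doublings ≈ 52 months ≈ 4.3 years, consistent with the ~5-year figure
```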
This research introduces a valuable new metric, the 'task completion time horizon,' offering a more intuitive way to track AI progress in handling complex, time-consuming tasks compared to traditional benchmarks. The core finding of an exponential increase in this horizon, doubling roughly every seven months since 2019 within the domain of software and research engineering tasks, is statistically robust within the study's framework and suggests a rapid acceleration of AI capabilities relevant to automating intellectual labor.
However, interpreting the practical implications requires significant caution. The time horizon metric is highly sensitive to methodological choices, particularly the nature and context of the human baselines used for comparison. The study itself demonstrates that AI performance aligns better with low-context human performance (like external contractors) than high-context experts, meaning the measured horizons might overestimate the time AI would save compared to experienced employees working in familiar environments. Furthermore, the benchmark tasks, despite efforts towards realism, still differ systematically from the 'messiness' and dynamic nature of many real-world jobs, raising questions about the external validity of the observed trend. The difference in observed doubling times when using an external benchmark (SWE-bench) further underscores the sensitivity to task selection and human time estimation methods.
Therefore, while the paper provides compelling evidence of rapid AI progress on increasingly complex benchmark tasks, directly extrapolating this trend to predict timelines for widespread automation of specific real-world jobs (like the 5-year prediction for month-long software tasks) is speculative. The study highlights a significant trend but also underscores the critical need for more research into AI performance on truly naturalistic tasks and the complex factors influencing real-world deployment. Key unanswered questions remain about the sustainability of the exponential trend and how well performance on these benchmarks translates to diverse, dynamic, and context-rich professional environments.
The abstract effectively introduces the core problem—the ambiguity of AI benchmark performance in real-world terms—and immediately proposes a novel metric to address it.
It concisely summarizes the main quantitative findings regarding the current time horizon of frontier AI models and the historical trend of its rapid growth.
The abstract briefly outlines the methodology (human timing, task sets) and identifies the key factors perceived to drive the observed improvements in AI capabilities.
The authors responsibly acknowledge the study's limitations and discuss the broader implications of their findings, including potential future capabilities and risks.
This low-impact improvement would enhance reader comprehension regarding the scope of the evaluation. The Abstract serves as a gateway to the paper, and clarifying the nature of the tasks evaluated early on aids readers in contextualizing the findings, especially those unfamiliar with the specific benchmarks cited.
Implementation: After listing the task sets, add a brief parenthetical description of the domains they cover. For example, change "...on a combination of RE-Bench, HCAST, and 66 novel shorter tasks." to "...on a combination of RE-Bench (AI research & engineering), HCAST (diverse software engineering and reasoning tasks), and 66 novel shorter tasks."
The introduction effectively establishes the context by highlighting the rapid advancement of AI capabilities and the associated potential risks, thereby motivating the need for robust evaluation methods.
The text clearly articulates the limitations of existing AI benchmarks, such as artificiality, adversarial selection, saturation, and lack of a unified, intuitive metric, setting the stage for the proposed solution.
The proposed 'task completion time horizon' metric is introduced logically as a solution to the identified benchmark limitations, offering an intuitive way to compare AI capabilities against human performance.
The introduction provides a concise yet comprehensive overview of the paper's methodology, key findings (exponential growth trend), and structure, effectively guiding the reader.
This low-impact improvement would slightly enhance reader anticipation and contextual understanding for the later discussion on external validity. The Introduction section appropriately raises the question of external validity, and briefly hinting at the nature of the challenges (i.e., the 'messiness' of real-world tasks compared to benchmarks) provides a slightly stronger bridge to Section 6 without preempting its detailed discussion.
Implementation: In the paragraph discussing external validity (page 3), after mentioning the question, consider adding a brief clause hinting at the types of real-world complexities. For instance, change "...this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks." to "...this raises the question of external validity (Section 6): whether the exponential trend holds on real-world tasks, which often involve greater ambiguity, dynamism, or 'messiness' than typical benchmarks."
Figure 1: The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years (Section 4).
The paper clearly defines the three distinct task suites (HCAST, RE-Bench, SWAA) used for evaluation, detailing their origins, characteristics (type of tasks, duration range), and the rationale for their inclusion, providing a solid foundation for understanding the scope of the assessment.
The methodology for establishing human baseline performance is well-described, including the recruitment of skilled professionals, the environment used (Vivaria), incentives, data collection methods (screen/audio recording), and specific procedures for each task suite (HCAST, RE-Bench, SWAA), lending credibility to the human time estimates.
The process for evaluating AI agent performance is transparently outlined, specifying the use of consistent agent scaffolds (modular-public, with noted exceptions), the provision of similar affordances as humans, and the handling of model limitations (e.g., GPT-2 imputation).
The derivation of the 'time horizon' metric is logically presented, drawing inspiration from Item Response Theory (IRT) but adapting it by using human baseline time as a proxy for difficulty. The use of logistic regression to model the 50% success probability is clearly explained.
The paper employs robust statistical methods for analyzing the trend in time horizon, including Ordinary Least Squares regression on the logarithm of time horizon against release date and hierarchical bootstrapping to estimate confidence intervals, adding rigor to the quantitative findings.
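The following sketch illustrates the shape of that trend analysis under stated assumptions: OLS of log2(time horizon) on release date, with the doubling time read off the slope, plus a simple (non-hierarchical) bootstrap for a rough confidence interval. The release dates and horizon values are fabricated for illustration, and the paper's hierarchical bootstrap additionally resamples task families and tasks.

```python
import numpy as np

# Illustrative inputs: model release dates (fractional years) and 50% horizons (minutes).
release_year = np.array([2019.2, 2020.5, 2022.2, 2023.2, 2024.4, 2025.1])
horizon_min  = np.array([0.05,   0.2,    1.5,    8.0,    25.0,   55.0])

# OLS on log2(horizon): the slope is in doublings per year.
slope, intercept = np.polyfit(release_year, np.log2(horizon_min), 1)
print(f"Doubling time ≈ {365.25 / slope:.0f} days")

# Simple bootstrap over models for a rough 95% CI on the doubling time.
rng = np.random.default_rng(0)
boot = []
n = len(release_year)
for _ in range(10_000):
    idx = rng.integers(0, n, n)
    if len(np.unique(release_year[idx])) < 2:
        continue  # need at least two distinct dates to fit a slope
    s, _ = np.polyfit(release_year[idx], np.log2(horizon_min[idx]), 1)
    boot.append(365.25 / s)
print("95% CI (days):", np.percentile(boot, [2.5, 97.5]))
```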
This low-impact improvement would enhance clarity regarding the evaluation setup. The Methods section mentions a 'simple prompting scaffold' for SWAA tasks and 'slightly different scaffold[s]' for o1 models, referencing Appendix B.3 for details. While the appendix provides information, briefly summarizing the key difference (e.g., nature of prompting for SWAA, specific challenges addressed for o1) directly within Section 3.3.1 would improve flow and immediate comprehension for readers focused on the main methodology, without requiring them to immediately jump to the appendix.
Implementation: In Section 3.3.1, after mentioning the exceptions for SWAA and o1 models, add a brief parenthetical phrase or sentence summarizing the main difference in their scaffolds compared to modular-public. For example: "...except for the SWAA tasks, which used a simple prompting scaffold (primarily for formatting input/output)," and "We used a slightly different scaffold for o1-preview and o1 (e.g., employing alternative tool-use strategies), because they seemed to struggle..."
This low-impact improvement would enhance the justification for a specific methodological choice. Figure 3's caption mentions weighting tasks by the inverse square root of their family size to reduce the influence of large families, a standard practice in the main results. Explicitly stating the rationale for this weighting (i.e., to ensure diversity and prevent large families from dominating the overall results) within the main text, perhaps in Section 3.1 or 3.3.2 where task families or results are first discussed, would strengthen the methodological explanation.
Implementation: In Section 3.1 where task families are introduced, or in Section 3.3.2 when presenting initial results, add a sentence explaining the purpose of the inverse square root weighting. For example, after explaining task families in Section 3.1: "To ensure diverse representation and prevent larger families from disproportionately influencing overall capability metrics, tasks are typically weighted by the inverse square root of their family size in our analyses."
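For readers unfamiliar with the scheme, a small hypothetical example of the inverse-square-root family weighting looks like the following; column names and data are assumed for illustration only.

```python
import numpy as np
import pandas as pd

# Each task is weighted by 1/sqrt(family size), so a 4-task family carries
# 2x (not 4x) the total weight of a single-task family.
runs = pd.DataFrame({
    "task":    ["a1", "a2", "a3", "a4", "b1"],
    "family":  ["fam_a", "fam_a", "fam_a", "fam_a", "fam_b"],
    "success": [1, 0, 1, 1, 1],
})
family_size = runs.groupby("family")["task"].transform("nunique")
runs["weight"] = 1 / np.sqrt(family_size)

weighted_success = np.average(runs["success"], weights=runs["weight"])
print(f"Weighted average success rate: {weighted_success:.2f}")
```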
Figure 3: Average task success rate across our entire combined suite, for each model.
Figure 4: Model success rates are negatively correlated with how much time it takes a human to complete the task.
Figure 5: Success rates of all models on our test suite, illustrating how the time horizon is computed as the human task length at which the predicted success rate is 50%.
The section clearly presents the time horizon results at both 50% and 80% success rates, using figures (Fig 6) to visualize the 80% trend and explicitly comparing the doubling times and absolute horizon lengths. This comparison effectively highlights the gap between occasional success and reliable performance.
The qualitative analysis provides valuable context for the quantitative trends. It systematically categorizes task families, identifies areas of significant improvement (tool use, adaptability, reasoning), and pinpoints persistent weaknesses ('messier' environments), offering insights beyond simple success rates.
The quantitative failure analysis (Table 3) comparing GPT-4 1106 and o1 provides concrete evidence for the qualitative claim of improved adaptability, specifically noting the drastic reduction in 'repeating failed actions'. This adds empirical weight to the qualitative observations.
The section undertakes a commendable effort to assess external validity and robustness through multiple distinct experiments (retrodiction, messiness factors, SWE-bench, internal PRs). This multi-pronged approach strengthens the paper's claims by testing the core findings under different conditions and against different data sources.
The analysis of 'messiness' factors is nuanced. While confirming that models perform worse on messier tasks (Fig 10), it importantly finds that the rate of improvement over time is similar for high and low messiness subsets (Fig 9). This finding directly addresses potential concerns about plateaus on more realistic tasks.
The internal PR experiment provides a valuable, albeit small-scale, comparison to real-world software engineering work. The finding that AI performance aligns better with low-context contractor time than high-context maintainer time offers a crucial insight into interpreting the time horizon metric.
This low-impact improvement would enhance the clarity of the failure analysis interpretation. The Results section notes that the high rate of 'premature task abandonment' for o1 might stem from task difficulty or 'idiosyncrasies of o1'. Briefly elaborating on potential idiosyncrasies (even if speculative, e.g., related to its specific training, architecture, or the custom scaffold mentioned in Methods/Appendix B.3) or explicitly stating the difficulty in distinguishing these factors would provide a more complete interpretation of the failure data presented in Table 3.
Implementation: In the paragraph discussing Table 3 results (page 13), after mentioning the possibility of 'idiosyncrasies of o1', add a brief clause or sentence. For example: "...or may reflect idiosyncrasies of o1, potentially related to its training objectives or differences in how it interacts with its specific agent scaffold (see Appendix B.3). Further analysis would be needed to disentangle these possibilities."
This low-impact improvement would enhance the presentation of the internal PR experiment findings. Section 6.4 effectively summarizes the key result regarding AI performance aligning with contractor time versus maintainer time, which supports the low-context interpretation of the time horizon metric. Visualizing this comparison, perhaps via a simple bar chart or scatter plot showing AI scores/success against contractor time and maintainer time for the five PRs, would make this important finding more immediately apparent and impactful for the reader. This visualization could be placed within Section 6.4 or referenced in Appendix B.2.
Implementation: Consider adding a small figure (e.g., Figure 11b or a supplementary figure referenced here) that plots, for the five internal PRs: (a) Maintainer time-to-complete, (b) Contractor time-to-complete, and (c) AI agent success rates or scores (perhaps averaged across the tested models). This would visually reinforce the finding that AI performance tracks contractor time better than maintainer time.
Figure 7: Time horizons on HCAST + RE-Bench for models starting with GPT-4 0314.
Figure 9: Performance trends over time for HCAST and RE-Bench tasks by length and messiness (Section 6.2).
Figure 10: We plot the excess success rate (the observed empirical task success rate, minus the success rate we would predict from the task's length; see Section 4.1) against the messiness score for each task.
Figure 11: Performance of frontier AI models using reported SWE-bench Verified results (Section 6.3).
The discussion effectively synthesizes the core findings, reiterating the proposal of the time horizon metric, the methodology used for its calculation, and the key result of exponential growth.
The section thoughtfully addresses the nuances and complexities involved in measuring and interpreting the time horizon metric, including the significant impact of baseliner context and skill, task distribution effects, and challenges in measuring extreme horizons.
The authors candidly identify several key limitations of the current work and propose concrete directions for future research, including improving model elicitation, enhancing human baselining rigor, expanding task diversity, and leveraging more inference compute.
The discussion connects the findings to broader concepts like Artificial General Intelligence (AGI), noting that the time horizon metric would become infinite for an AGI capable of matching human experts across all tasks, and appropriately links this to the extrapolation challenges.
The discussion highlights the economic implications of current AI agent performance, noting that even with limited inference compute, successful runs are often significantly cheaper than human labor, suggesting substantial potential for cost-effective automation and performance improvement.
This medium-impact improvement would enhance the discussion around future capabilities and potential acceleration factors. The Discussion section appropriately identifies increased use of inference compute as a potential area for future work and performance improvement (Section 8.2). However, explicitly linking this back to the factors discussed in Section 7.2.2 (Compute Scaling and Automation of AI R&D) would strengthen the argument. Specifically, emphasizing how increased inference compute could not only improve current performance (as noted) but also potentially accelerate the rate of progress by enabling more complex agentic behaviors or substituting for training compute could provide a more complete picture of its future impact.
Implementation: In the 'More use of inference compute' subsection of Section 8.2, after discussing the economic competitiveness and potential for performance improvement, add a sentence explicitly linking increased inference compute usage to the potential acceleration factors discussed in Section 7.2.2. For example: "Furthermore, leveraging significantly more inference compute, as discussed in Section 7.2.2 regarding compute scaling and potential R&D automation, could not only boost success rates but potentially accelerate the overall trend in time horizon by enabling more sophisticated agentic strategies or partially substituting for training compute limitations."
This low-impact improvement would add clarity to the discussion of limitations regarding task naturalness. Section 8.2 correctly points out systematic differences between the benchmark tasks and real-world work, referencing Section 6. While Section 7.2.1 lists specific differences (automatic scoring, no interaction, etc.), explicitly reiterating one or two key examples of these differences within Section 8.2 would reinforce the point for readers focused on the Discussion and reduce the need to immediately refer back to earlier sections for context on the nature of the 'unnaturalness'.
Implementation: In the 'More natural, varied tasks' subsection of Section 8.2, after stating the reasons to believe the task distribution is systematically different, briefly incorporate examples mentioned in Section 7.2.1. For instance: "...there remain many differences that we did not explore. For example, the reliance on automatic scoring limits open-endedness, the lack of interaction with other agents ignores coordination challenges, and the tasks rarely involve stringent resource constraints or dynamic environments common in real-world scenarios (as detailed in Section 7.2.1)."
Table 3: Number of different categories of failures for 31 failed runs by GPT-4 1106 and 32 failed runs by o1 (Section 5).
Figure 12: A sensitivity analysis of the extrapolated date at which frontier AI systems will have a horizon of 1 month.
Figure 13: Cost of a successful run using an LLM agent as a fraction of the cost of the salary of a human expert performing the same task.
Figure 17: Linear, hyperbolic, and exponential fits for model time horizon since 2019.
Table 8: We convert the SWE-bench Verified time annotations into task time estimates by taking the geometric mean of each annotation's range.
Table 9: The time horizon of less capable models is substantially longer on our tasks than on SWE-bench Verified.
Figure 21: Model success rates on HCAST + RE-Bench tasks, split by task messiness rating.
Figure 22: Correlation matrix of observed success rates across all models and tasks.
Figure 23: Correlation matrix of excess success rates (see Figure 10) across all models and tasks.
Figure 25: Time horizon of all models we measured, including non-frontier models.
Figure 26: Length in human expert clock-time of tasks that frontier models can perform competently over time.