This study evaluated the performance of the o1-preview model on a range of medical reasoning tasks, comparing it with human physicians and prior LLMs such as GPT-4. The o1-preview model demonstrated significant improvements in differential diagnosis generation (78.3% accuracy, 95% CI 70.7%-84.8%), diagnostic reasoning (median score 97% vs 74% for physicians on Landmark cases), and management reasoning (median score 86% vs 34% for physicians on Grey Matters cases). However, it showed no improvement in probabilistic reasoning compared with GPT-4. Inter-rater agreement between physicians was substantial to near-perfect for most tasks (kappa = 0.66 for differential diagnosis, 0.89 for the R-IDEA score, 0.71 for Grey Matters cases).
The study provides a comprehensive evaluation of the o1-preview model's capabilities in medical reasoning, demonstrating its superior performance in several key areas compared with previous models and human physicians. The use of robust statistical measures and the clear presentation of results through figures and tables are major strengths.

However, the study's impact is limited by the lack of a clearly stated research question, insufficient detail on the specific metrics and prompts used, and limited discussion of the clinical significance of the findings. The absence of improvement in probabilistic reasoning is not adequately addressed, and while the study acknowledges some limitations, it does not fully explore their implications. The conclusion summarizes the findings but does not clearly articulate the main takeaways or provide specific guidance for future research.

The study's practical utility is promising, particularly in differential diagnosis and management reasoning, but further research, including human-computer interaction studies and clinical trials, is needed to fully realize its potential. Key unanswered questions include how to improve the model's probabilistic reasoning, how to address its verbosity, and how to integrate it effectively into clinical workflows. The methodological limitations, particularly the lack of detail on prompts and metrics, could affect the reproducibility and generalizability of the findings. Overall, the study represents a significant step forward in evaluating LLMs for medical reasoning, but further refinements are needed to fully realize its benefits for end users.
This high-impact improvement would enhance the abstract's clarity and completeness, making it more informative and impactful. The abstract currently lacks a clear statement of the study's limitations, which is crucial for proper interpretation of the findings. Including limitations in the abstract ensures that readers are aware of potential caveats from the outset. Addressing this would strengthen the paper by providing a balanced perspective on the study's contributions and limitations, which is essential for maintaining scientific rigor and transparency. This would also prevent over-interpretation of the results and encourage a more nuanced understanding of the study's implications. Ultimately, explicitly stating the study's limitations in the abstract would improve the overall quality and credibility of the research.
Implementation: Add a sentence or two to the end of the abstract summarizing the main limitations of the study. For example, 'Limitations of this study include the focus on internal medicine cases, the potential for verbosity bias, and the lack of human-computer interaction analysis.'
This medium-impact improvement would enhance the abstract's clarity and contextual understanding by explicitly stating the specific version of the model used. The abstract currently mentions the 'o1-preview model' but does not provide the exact version, which is important for reproducibility and future comparisons. Adding the specific version will ensure that other researchers can accurately replicate the study and understand the precise model being evaluated. This will strengthen the paper by increasing transparency and enabling more precise comparisons with other studies. It also provides essential information for readers to assess the study's findings and implications. Ultimately, stating the exact model version enhances the scientific rigor and reproducibility of the work.
Implementation: Include the specific version of the o1-preview model used in the study. For example, change the sentence to read: 'We sought to evaluate OpenAI's o1-preview model (o1-preview-2024-09-12), a model developed to increase run-time via chain-of-thought processes prior to generating a response.'
This medium-impact improvement would enhance the abstract's clarity and completeness by explicitly stating the primary outcome measure. The abstract mentions 'comparison of the o1-preview output' but does not clearly define what metric was used for this comparison. Stating the primary outcome measure will ensure that readers immediately understand how the model's performance was evaluated. This will strengthen the paper by making the abstract more informative and accessible, allowing readers to quickly grasp the study's main findings. It also provides essential context for interpreting the results and assessing the study's implications. Ultimately, specifying the primary outcome measure will improve the overall quality and clarity of the abstract.
Implementation: Add a phrase or sentence that explicitly states the primary outcome measure. For example, change the sentence to read: 'Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs, measured by the percentage of correct diagnoses and management decisions.'
This high-impact improvement would enhance the introduction's clarity and contextual understanding by explicitly stating the study's main research question. The introduction currently provides background on the limitations of multiple-choice benchmarks and the promise of LLMs in clinical reasoning, but it does not explicitly articulate the central question the study aims to answer. Adding a clear research question would provide a focal point for the reader and set the stage for the study's methodology and findings. This is essential for the introduction as it provides the reader with a clear understanding of what the study is trying to achieve. A well-defined research question ensures that the study's purpose is immediately apparent, guiding the reader's understanding of the subsequent sections. Ultimately, stating the research question would significantly improve the introduction's clarity and purpose, making it more effective in setting up the study.
Implementation: Add a sentence or two at the end of the introduction that clearly states the research question. For example, 'Therefore, this study aims to evaluate the performance of the o1-preview model across five medical reasoning domains, comparing it with human physicians and prior LLMs.'
This medium-impact improvement would enhance the introduction's completeness by briefly mentioning the limitations of the study. The introduction currently focuses on the rationale for the study and the need for better benchmarks, but it does not provide any indication of the study's limitations. Including a brief mention of the limitations would provide a balanced perspective and prepare the reader for a more nuanced interpretation of the results. This is crucial in the introduction to give the reader an idea of the study's scope and the potential caveats to the findings. Addressing this would strengthen the paper by demonstrating the authors' awareness of the study's limitations and promoting a more critical evaluation of the results. Ultimately, briefly mentioning the study's limitations in the introduction would improve the paper's credibility and transparency.
Implementation: Add a sentence or two at the end of the introduction that briefly mentions the main limitations of the study. For example, 'While this study provides valuable insights, it is limited by its focus on internal medicine cases and the absence of human-computer interaction analysis.'
This medium-impact improvement would enhance the introduction's flow and logical progression by explicitly connecting the study's focus on complex reasoning to the limitations of existing benchmarks. The introduction discusses the limitations of multiple-choice benchmarks and the importance of complex reasoning, but it does not explicitly link these two concepts. By making this connection explicit, the authors would clarify why the study focuses on complex reasoning tasks. This connection is important in the introduction as it establishes the logical flow of the argument and justifies the study's approach. Addressing this would strengthen the paper by making the rationale for the study more transparent and coherent. Ultimately, explicitly connecting the limitations of benchmarks to the focus on complex reasoning would improve the introduction's clarity and logical structure.
Implementation: Add a sentence that explicitly connects the limitations of multiple-choice benchmarks to the study's focus on complex reasoning. For example, 'Given that multiple-choice benchmarks do not capture the complexity of clinical decision-making, this study focuses on evaluating the model's performance in complex reasoning tasks.'
The Results section effectively uses statistical measures like Cohen's kappa, confidence intervals, and p-values to quantify the agreement between physician raters and to compare the performance of the o1-preview model with other models and human controls. This rigorous approach provides a strong foundation for the study's findings and enhances the credibility of the results.
The Results section presents data clearly through figures and tables, which aids in understanding the findings. The figures visually compare the model's performance with other benchmarks, and the tables provide detailed examples of the model's outputs. This enhances the accessibility of the results and allows readers to easily interpret the data.
The Results section includes a variety of clinical reasoning tasks, including differential diagnosis generation, diagnostic test selection, presentation of reasoning, and probabilistic reasoning. This broad approach provides a comprehensive assessment of the model's capabilities across different aspects of medical decision-making, increasing the generalizability of the findings.
This medium-impact improvement would enhance the clarity and completeness of the Results section by explicitly stating the specific metrics used for each experiment. While the section presents various results, it does not always clearly define the exact metrics used for each comparison. For example, while the section mentions 'agreement' for differential diagnoses, it does not specify how 'agreement' was measured for the 'Grey Matters Management Cases'. Explicitly stating the metrics would ensure that readers understand how the model's performance was evaluated in each experiment. This is crucial for the Results section as it provides the necessary context for interpreting the findings and enhances the reproducibility of the study. This would strengthen the paper by increasing transparency and enabling more precise comparisons with other studies. Ultimately, explicitly stating the metrics used for each experiment would improve the overall quality and clarity of the Results section.
Implementation: For each experiment, include a sentence that clearly states the metric used for evaluation. For example, when presenting the results for the 'Grey Matters Management Cases', add a phrase like 'The primary outcome was the percentage of total points obtained by o1-preview, measured using a 100-point scale'.
This medium-impact improvement would enhance the Results section by explicitly mentioning any limitations associated with each experiment. While the Discussion section addresses general limitations, the Results section would benefit from brief, specific limitations for each experiment. For example, the kappa value for diagnostic test selection was low, which should be noted in the results. Adding these specific limitations would provide a balanced perspective on the study's findings and prevent over-interpretation of the results. This is important in the Results section to give the reader an idea of the scope and potential caveats of the findings. Addressing this would strengthen the paper by demonstrating the authors' awareness of the limitations and promoting a more critical evaluation of the results. Ultimately, briefly mentioning the limitations within the Results section would improve the paper's credibility and transparency.
Implementation: For each experiment, include a sentence or phrase that briefly mentions any specific limitations. For example, when presenting the results for the diagnostic test selection, add a phrase like 'While the proportion of agreements was high, the kappa was low due to severe class imbalance'.
This medium-impact improvement would enhance the Results section by providing more context for the clinical significance of the findings. While the section presents the model's performance on various tasks, it does not always explain the clinical relevance of these results. For example, the section mentions that the model achieved a perfect R-IDEA score in 78/80 cases, but it does not discuss the implications of this result for clinical practice. Adding this context would help readers understand the practical importance of the study's findings. This is crucial in the Results section as it bridges the gap between statistical results and real-world applications. Addressing this would strengthen the paper by making the results more meaningful and relevant to the clinical audience. Ultimately, adding clinical context to the findings would improve the overall impact and practical value of the study.
Implementation: For each experiment, include a sentence or two that discusses the clinical significance of the results. For example, after presenting the R-IDEA scores, add a phrase like 'This suggests that the model is highly capable of documenting clinical reasoning, which is a crucial aspect of medical practice'.
Figure 1: Barplot showing the accuracy of including the correct diagnosis in the differential for differential diagnosis (DDx) generators and LLMs on the NEJM CPCs, sorted by year. Data for other LLMs or DDx generators were obtained from the literature [8, 23, 36]. The 95% confidence intervals were computed using a one-sample binomial test.
Figure 2: A. Comparison of o1-preview with a previous evaluation of GPT-4 in providing the exact or very close diagnosis (Bond scores 4-5) on the same 70 cases. Bars are annotated with the accuracy of each model. 95% confidence intervals were computed using a one-sample binomial test. P-value was computed using McNemar's test. B. Histogram of o1 performance as measured by the Bond Score on the complete set of 143 cases.
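For readers who want to reproduce the interval and significance calculations described in these captions, the following is a minimal sketch in Python of how an exact (Clopper-Pearson) binomial confidence interval and McNemar's test are typically computed with statsmodels; the counts shown are placeholders for illustration, not the study's data.

    # Sketch: exact binomial CI and McNemar's test (placeholder counts, not study data)
    from statsmodels.stats.proportion import proportion_confint
    from statsmodels.stats.contingency_tables import mcnemar

    # Exact (Clopper-Pearson) 95% CI for a proportion, e.g. k correct diagnoses out of n cases
    k, n = 112, 143  # hypothetical counts
    low, high = proportion_confint(k, n, alpha=0.05, method="beta")
    print(f"accuracy = {k / n:.3f}, 95% CI [{low:.3f}, {high:.3f}]")

    # McNemar's test on paired outcomes for two models rated on the same cases
    # rows = model A correct/incorrect, columns = model B correct/incorrect (hypothetical table)
    table = [[45, 18],
             [4, 3]]
    result = mcnemar(table, exact=True)
    print(f"McNemar p-value = {result.pvalue:.4f}")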
Figure 3: Performance of o1-preview in predicting the next diagnostic tests that should be ordered. Performance was measured by two physicians using a Likert scale of “Unhelpful,” “Helpful,” and “Exactly right.” We excluded 7 cases from the total case set in which it did not make sense to ask for the next test (Supplement 1B).
Figure 4: A. Distribution of 312 R-IDEA scores stratified by respondents on 20 NEJM Healer cases. B. Box plot of the proportion of cannot-miss diagnoses included in differential diagnosis for the initial triage presentation. The total sample size in this figure is 70, with 18 responses from attending physicians, GPT-4 and o1-preview, and 16 responses from residents. Two cases were excluded because the cannot-miss diagnoses could not be identified. Ns: not statistically significant.
Figure 5: A. Box plot of normalized management reasoning points by LLMs and physicians. Five cases were included. We generated one o1-preview response for each case. The prior study collected five GPT-4 responses to each case, 176 responses from physicians with access to GPT-4, and 199 responses from physicians with access to conventional resources. *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001. B. Box plot of normalized diagnostic reasoning points by model and physicians. Six diagnostic challenges were included. We generated one o1-preview response for each case. The prior study collected three GPT-4 responses to all cases, 25 responses from physicians with access to GPT-4, and 25 responses from physicians with access to conventional resources. Ns: not statistically significant.
Figure 6: Density plots for the distribution of responses by o1-preview, GPT-4, and humans to clinical vignettes asking for (1) the pretest probability of disease, (2) the updated probability after a positive test result, and (3) the updated probability after a negative test result. The shaded blue region indicates the reference range based on a review of the literature from a prior study [22]. Human responses are from 553 medical practitioners (290 resident physicians, 202 attending physicians, and 61 nurse practitioners or physician assistants). One hundred predictions were generated by GPT-4 and o1-preview for each question.
Table 1: Three examples in which o1-preview correctly diagnosed a complex case that GPT-4 could not solve. GPT-4 examples are from a prior study [8].
Table 2: Three examples of the o1-preview suggested testing plan compared to the testing plan conducted. One example scored a two, indicating that the test was appropriate and nearly identical to the case plan. A score of one indicates that the suggested diagnostics would have been helpful or would have yielded the diagnosis via another test not used in the case. A score of zero indicates that the suggested diagnostics would be unhelpful. Verbose rationales from o1-preview were abridged by a physician (Z.K.) to better fit in the table.
The Discussion section effectively summarizes the key findings, highlighting the superior performance of the o1-preview model in differential diagnosis, diagnostic reasoning, and management reasoning. This clear summary provides a strong foundation for the subsequent discussion of implications and limitations.
The Discussion section appropriately connects the study's findings to broader implications for clinical practice, such as the potential for AI to mitigate diagnostic errors and delays. This demonstrates the practical relevance of the research and its potential impact on healthcare.
The Discussion section acknowledges the limitations of the study, including the model's verbosity, the lack of human-computer interaction analysis, and the focus on internal medicine cases. This demonstrates a balanced and critical perspective on the research.
This high-impact improvement would enhance the Discussion section by explicitly stating the main conclusions of the study in a concise manner. While the section discusses various implications and limitations, it does not clearly articulate the core conclusions that the authors want the reader to take away. Adding a concluding paragraph that summarizes the main findings and their significance would provide a strong closing statement. This is crucial for the Discussion section as it reinforces the study's key contributions and provides a clear sense of closure. This would strengthen the paper by ensuring that the main messages are clearly conveyed and that the reader is left with a clear understanding of the study's overall impact. Ultimately, explicitly stating the main conclusions would significantly improve the Discussion section's effectiveness.
Implementation: Add a concluding paragraph that explicitly states the main conclusions of the study. For example, 'In conclusion, this study demonstrates that the o1-preview model exhibits superior performance in complex medical reasoning tasks, particularly in differential diagnosis and management. These findings suggest the potential for AI to improve clinical practice, but further research is needed to address limitations and ensure safe implementation.'
This medium-impact improvement would enhance the Discussion section by explicitly addressing the lack of improvement in probabilistic reasoning. While the section notes that no improvements were observed in probabilistic reasoning, it does not fully explore the potential reasons for this finding. Discussing why the model did not improve in this area would provide a more comprehensive analysis of the results. This is important in the Discussion section as it provides a more nuanced understanding of the model's capabilities and limitations. Addressing this would strengthen the paper by demonstrating a thorough exploration of all the findings and encouraging future research in this area. Ultimately, explicitly discussing the lack of improvement in probabilistic reasoning would improve the overall quality and depth of the Discussion section.
Implementation: Add a paragraph that explores the potential reasons for the lack of improvement in probabilistic reasoning. For example, 'The lack of improvement in probabilistic reasoning tasks may be due to the model's difficulty in handling abstract concepts or the limitations of the current probabilistic reasoning benchmarks. Further research is needed to investigate these factors and develop more robust evaluation methods.'
This medium-impact improvement would enhance the Discussion section by providing more specific recommendations for future research. While the section mentions the need for new benchmarks and trials in real-world settings, it does not provide specific directions for future studies. Suggesting concrete research avenues would make the discussion more actionable and impactful. This is important in the Discussion section as it provides a clear path for future research and encourages the scientific community to build upon the current findings. Addressing this would strengthen the paper by demonstrating a forward-thinking approach and highlighting the potential for further advancements in the field. Ultimately, providing more specific recommendations for future research would improve the overall value and impact of the Discussion section.
Implementation: Add a paragraph that provides specific recommendations for future research. For example, 'Future research should focus on developing new benchmarks that incorporate human-computer interaction, exploring the model's performance in diverse clinical settings, and investigating methods to improve its probabilistic reasoning capabilities. Additionally, clinical trials are needed to evaluate the impact of these models on patient outcomes.'
The Methods section clearly identifies the specific model used (o1-preview-2024-09-12) and how it was accessed through OpenAI's API. This is crucial for reproducibility and allows other researchers to understand exactly which model was evaluated.
The Methods section provides a detailed description of the case selection process for the NEJM CPCs, including the specific years of publication and the criteria for inclusion (cases with a "Differential Diagnosis" section). This level of detail is important for understanding the dataset used in the study.
The Methods section clearly defines the primary outcomes for each experiment, including differential diagnosis quality, quality of the suggested testing plan, and the R-IDEA score for clinical reasoning documentation. This helps to clarify what the study was measuring and how the model's performance was evaluated.
The Methods section thoroughly describes the scoring systems used, such as the Bond Score for differential diagnoses and the Likert scale for testing plans. It also specifies the use of linear-weighted Cohen's kappa for interrater agreement and how discordant scores were resolved. This level of detail enhances the rigor and transparency of the study.
This high-impact improvement would enhance the Methods section by explicitly stating the specific prompts used for each experiment. While the section mentions adapting prompts from previous studies, it does not provide the exact wording of these prompts, which is essential for reproducibility. Including the prompts would allow other researchers to replicate the study precisely and understand the context in which the model was evaluated. This is crucial for the Methods section as it details the core experimental procedures. This would strengthen the paper by increasing transparency and enabling more precise comparisons with other studies. Ultimately, providing the specific prompts would significantly improve the study's scientific contribution by ensuring its findings can be properly contextualized and replicated.
Implementation: Include the exact prompts used for each experiment, either in the main text or in the supplementary materials. For example, for the NEJM CPCs, include the exact prompt used to query the model after the differential diagnosis prediction ('What diagnostic tests would you order next given this differential?').
This medium-impact improvement would enhance the Methods section by explicitly stating the rationale for using a linear-weighted Cohen's kappa instead of a standard Cohen's kappa. While the section mentions using a linear-weighted kappa, it does not explain why this specific measure was chosen. Providing this rationale would clarify the methodological choices and ensure that readers understand the statistical approach. This is important in the Methods section as it provides context for the statistical analyses. This would strengthen the paper by demonstrating the authors' awareness of the nuances of interrater reliability measures and promoting a more critical evaluation of the results. Ultimately, explaining the rationale for using a linear-weighted kappa would improve the overall quality and transparency of the Methods section.
Implementation: Add a sentence or two explaining why a linear-weighted Cohen's kappa was used. For example, 'A linear-weighted Cohen's kappa was used to assess interrater agreement because it accounts for the degree of disagreement, which is important when scoring systems have ordinal categories.'
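To make this rationale concrete, a linear-weighted kappa can be computed directly from two raters' ordinal scores. The snippet below is an illustrative sketch using scikit-learn's cohen_kappa_score with made-up ratings, not the study's data; it shows how the weighted variant penalizes near-misses less than distant disagreements.

    # Sketch: unweighted vs linear-weighted Cohen's kappa on ordinal scores (illustrative data)
    from sklearn.metrics import cohen_kappa_score

    rater_a = [2, 1, 0, 2, 1, 2, 0, 1]  # hypothetical scores on a 0-2 ordinal scale
    rater_b = [2, 1, 1, 2, 0, 2, 0, 1]

    unweighted = cohen_kappa_score(rater_a, rater_b)                  # all disagreements count equally
    weighted = cohen_kappa_score(rater_a, rater_b, weights="linear")  # penalty grows with category distance
    print(f"unweighted kappa = {unweighted:.2f}, linear-weighted kappa = {weighted:.2f}")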
This medium-impact improvement would enhance the Methods section by explicitly stating the criteria used to define "cannot-miss" diagnoses. While the section mentions using a list of "cannot-miss" diagnoses defined in a prior study, it does not specify what these criteria were. Providing these criteria would clarify the study's methodology and ensure that readers understand how these diagnoses were identified. This is crucial for the Methods section as it describes the key elements of the study's design. This would strengthen the paper by increasing transparency and enabling more precise comparisons with other studies. Ultimately, specifying the criteria for "cannot-miss" diagnoses would improve the overall quality and clarity of the Methods section.
Implementation: Include the criteria used to define "cannot-miss" diagnoses, either in the main text or in the supplementary materials. For example, 'Cannot-miss diagnoses were defined as conditions that, if missed, could lead to significant patient harm or death.'
This medium-impact improvement would enhance the Methods section by providing more detail on the statistical analysis used for the probabilistic reasoning cases. While the section mentions using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), it does not fully explain how these metrics were applied in this context. Providing more detail would help readers understand the statistical methods and ensure the reproducibility of the study. This is important in the Methods section as it provides context for the statistical analyses. This would strengthen the paper by demonstrating a thorough understanding of the statistical methods and promoting a more critical evaluation of the results. Ultimately, providing more detail on the statistical analysis for probabilistic reasoning would improve the overall quality and transparency of the Methods section.
Implementation: Add a sentence or two explaining how MAE and MAPE were computed in the context of the probabilistic reasoning cases. For example, 'MAE and MAPE were computed by comparing the model's predicted probabilities to the reference probabilities for each case, averaging the absolute errors across all cases'.
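As a concrete illustration of this suggested wording, MAE and MAPE can be computed with a few lines of numpy. The probabilities below are hypothetical; in the study, the reference values would come from the literature-derived ranges used for each vignette.

    # Sketch: MAE and MAPE between predicted and reference probabilities (hypothetical values)
    import numpy as np

    predicted = np.array([0.30, 0.65, 0.10, 0.80])  # model-estimated probabilities
    reference = np.array([0.25, 0.70, 0.05, 0.85])  # literature-derived reference probabilities

    mae = np.mean(np.abs(predicted - reference))                     # mean absolute error
    mape = np.mean(np.abs(predicted - reference) / reference) * 100  # mean absolute percentage error (%)
    print(f"MAE = {mae:.3f}, MAPE = {mape:.1f}%")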