Superhuman performance of a large language model on the reasoning tasks of a physician

Overall Summary

Overview

This study evaluated the performance of the o1-preview model on various medical reasoning tasks, comparing it to human physicians and prior LLMs like GPT-4. The o1-preview model demonstrated significant improvements in differential diagnosis generation (78.3% accuracy, 95% CI 70.7%-84.8%), diagnostic reasoning (median score 97% vs 74% for physicians on Landmark cases), and management reasoning (median score 86% vs 34% for physicians on Grey Matters cases). However, it showed no improvement in probabilistic reasoning compared to GPT-4. Inter-rater agreement between physicians was substantial for most tasks (kappa = 0.66 for differential diagnosis, 0.89 for R-IDEA score, 0.71 for Grey Matters cases).

Key Points

Lack of Clear Statement of Study Limitations in Abstract (written-content)
The abstract currently lacks a clear statement of the study's limitations, which is crucial for proper interpretation of the findings. Addressing this high-impact issue would give readers a balanced view of the study's contributions and caveats from the outset, discourage over-interpretation of the results, and strengthen the scientific rigor, transparency, and credibility of the work.
Section: Abstract
Unclear Research Question in Introduction (written-content)
The introduction provides background on the limitations of multiple-choice benchmarks and the promise of LLMs in clinical reasoning, but it does not explicitly articulate the central question the study aims to answer. Addressing this high-impact issue by stating a clear research question would give readers a focal point, make the study's purpose immediately apparent, and set the stage for the methodology and findings that follow.
Section: Introduction
Robust Statistical Measures in Results (written-content)
The Results section effectively uses statistical measures like Cohen's kappa, confidence intervals, and p-values to quantify the agreement between physician raters and to compare the performance of the o1-preview model with other models and human controls. This rigorous approach provides a strong foundation for the study's findings and enhances the credibility of the results.
Section: Results
Unclear Metrics for Each Experiment in Results (written-content)
The Results section does not always define the exact metric used for each comparison; for example, it reports 'agreement' for differential diagnoses but does not specify how agreement was measured for the Grey Matters management cases. Addressing this medium-impact issue by explicitly stating the metric for each experiment would give readers the context needed to interpret the findings, improve reproducibility, and enable more precise comparisons with other studies.
Section: Results
Lack of Clinical Context in Results (written-content)
The Results section presents the model's performance on various tasks but does not always explain the clinical relevance of the findings; for example, it reports a perfect R-IDEA score in 78/80 cases without discussing the implications for clinical practice. Addressing this medium-impact issue by adding clinical context would bridge the gap between statistical results and real-world applications and make the findings more meaningful to a clinical audience.
Section: Results
Ambiguity in Definition of Correct Diagnosis in Figure 1 (graphical-figure)
The authors did not specify how they defined a 'correct diagnosis' in the differential. This ambiguity could affect the interpretation of the accuracy values, particularly if a specific rank within the differential was required for a diagnosis to count as 'correct'.
Section: Results
Lack of Methodological Details for the R-IDEA Score in Figure 4 (graphical-figure)
The figure and the reference text do not explicitly explain the methodology behind the R-IDEA score, and whether it is a scoring system that is validated for use in this study. Without a description of the methodology and validation of the R-IDEA, it is not possible to evaluate the scientific validity of this element. Additionally, it is unclear whether the data for the different models and groups was generated using the same methodology and prompts, which could impact the overall validity of the study.
Section: Results
Lack of Methodological Details for Normalization and Scoring Metrics in Figure 5 (graphical-figure)
The caption and the reference text mention that the scores are 'normalized' but they don't provide any details about how the scores were normalized. The process of normalization is important for the scientific validity, as it can impact the results of the study. Also, it is not clear whether the methods used to calculate the management reasoning and diagnostic reasoning points are valid or reliable measures of these constructs. Without more information on the scoring methodology, it is not possible to assess the scientific validity of this element.
Section: Results
Lack of a Systematic Approach in Table 1 (graphical-figure)
The table does not provide a systematic or comprehensive assessment of the models' performance. The selection of cases is not justified, and it is unclear if they were randomly selected or chosen because they were the only cases that show this effect. The inclusion of only three cases could be insufficient to draw broader conclusions, and might introduce a bias into the study. The table also does not explain how the case information was presented to each model, or if the prompts were similar.
Section: Results
Clear Summary of Key Findings in Discussion (written-content)
The Discussion section effectively summarizes the key findings, highlighting the superior performance of the o1-preview model in differential diagnosis, diagnostic reasoning, and management reasoning. This clear summary provides a strong foundation for the subsequent discussion of implications and limitations.
Section: Discussion
Unclear Main Conclusions in Discussion (written-content)
While the Discussion covers various implications and limitations, it does not concisely articulate the core conclusions the authors want the reader to take away. Addressing this high-impact issue with a closing paragraph that summarizes the main findings and their significance would reinforce the study's key contributions, give the section a clear sense of closure, and ensure the main messages are unambiguously conveyed.
Section: Discussion
Clear Identification of the Model and Access Method in Methods (written-content)
The Methods section clearly identifies the specific model used (o1-preview-2024-09-12) and how it was accessed through OpenAI's API. This is crucial for reproducibility and allows other researchers to understand exactly which model was evaluated.
Section: Methods
Lack of Specific Prompts in Methods (written-content)
The Methods section mentions adapting prompts from previous studies but does not provide their exact wording, which is essential for reproducibility. Addressing this high-impact issue by including the specific prompts used for each experiment would allow other researchers to replicate the study precisely, understand the context in which the model was evaluated, and make more precise comparisons with other studies.
Section: Methods

Conclusion

The study provides a comprehensive evaluation of the o1-preview model's capabilities in medical reasoning, demonstrating its superior performance in several key areas compared to previous models and human physicians. The use of robust statistical measures and the clear presentation of results through figures and tables are major strengths. However, the study's impact is limited by the lack of a clearly stated research question, insufficient detail on specific metrics and prompts used, and limited discussion of the clinical significance of the findings. Additionally, the lack of improvement in probabilistic reasoning is not adequately addressed. While the study acknowledges some limitations, it does not fully explore their implications. The conclusion effectively summarizes the findings but fails to clearly articulate the main takeaways and provide specific guidance for future research. The study's practical utility is promising, particularly in differential diagnosis and management reasoning, but further research, including human-computer interaction studies and clinical trials, is needed to fully realize its potential. Key unanswered questions include how to improve the model's probabilistic reasoning, how to address its verbosity, and how to effectively integrate it into clinical workflows. The methodological limitations, particularly the lack of detail on prompts and metrics, could affect the reproducibility and generalizability of the findings. Overall, the study represents a significant step forward in evaluating LLMs for medical reasoning, but further refinements are needed to fully leverage its potential benefits for end users.

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Barplot showing the accuracy of including the correct diagnosis in...
Full Caption

Figure 1: Barplot showing the accuracy of including the correct diagnosis in the differential for differential diagnosis (DDx) generators and LLMs on the NEJM CPCs, sorted by year. Data for other LLMs or DDx generators was obtained from the literature [36, 23, 8]. The 95% confidence intervals were computed using a one-sample binomial test.

First Reference in Text
o1-preview included the correct diagnosis in its differential in 78.3% of cases (95% CI, 70.7% to 84.8%) (Figure 1).
Description
  • Key aspect of what is shown: This figure is a barplot which is a type of graph that uses rectangular bars to show numerical data. The height or length of each bar corresponds to a specific value, making it easy to compare different categories. In this case, each bar represents the accuracy of a different method or model for generating a differential diagnosis. The ‘differential diagnosis’ is a list of possible medical conditions that could be causing a patient's symptoms. Each bar is labeled with the name of the model or method used, which can be an automated ‘differential diagnosis generator’ or a ‘Large Language Model’ (LLM), a type of artificial intelligence system. The height of each bar shows the percentage of times that a correct diagnosis was included in the generated differential diagnosis. ‘Accuracy’ in this context means the ability of the model to include the actual correct diagnosis within the list of differential diagnoses. The bars are sorted by the year the model or method was developed. The vertical lines on each bar are 95% ‘confidence intervals’, which are a range of values where the true value of the accuracy is likely to fall. The confidence intervals are calculated using a statistical test called a 'one-sample binomial test'. This test helps to account for the uncertainty in the accuracy calculation due to having only a limited number of medical cases to analyze. The test provides a measure of how much variability you might expect in the result.
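For readers who want to see how such an interval is obtained, the sketch below computes an exact (Clopper-Pearson) 95% binomial confidence interval in Python. It is an editorial illustration, not the authors' code; the counts are placeholders chosen to correspond to the reported 78.3% accuracy.

```python
from scipy.stats import binomtest

# Placeholder counts: 112 of 143 NEJM CPC cases with the correct
# diagnosis included in the differential (about 78.3%).
correct, total = 112, 143

result = binomtest(correct, total)
ci = result.proportion_ci(confidence_level=0.95, method="exact")  # Clopper-Pearson interval

print(f"accuracy = {correct / total:.1%}")
print(f"95% CI = ({ci.low:.1%}, {ci.high:.1%})")
```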
Scientific Validity
  • Appropriateness of statistical methods: The use of a one-sample binomial test to calculate confidence intervals is appropriate for proportions. The 95% confidence intervals provide a measure of the uncertainty associated with each model's estimated accuracy. The fact that the data for other LLMs and DDx generators was sourced from literature is acceptable; however, it is important to note that these might have been measured differently, affecting the validity of the comparison.
  • Data comparison and potential biases: The sorting of models by year is reasonable for showing the evolution of the field, but it could also be biased if earlier models were evaluated with different methodologies or data sets. It is important to ensure that the different methods/models were evaluated on a comparable dataset of medical cases, which isn't clearly stated in the caption. The inclusion of o1-preview data with other LLMs is useful, but it does not highlight whether the model is better than its predecessors in terms of medical accuracy.
  • Definition of correct diagnosis: The authors did not specify how they defined a 'correct diagnosis' in the differential. This ambiguity could affect the interpretation of the accuracy values, particularly if a specific rank within the differential was required for a diagnosis to count as 'correct'.
Communication
  • Clarity and completeness of the caption: The caption clearly indicates the figure's purpose, which is to compare the diagnostic accuracy of different methods. The use of 'differential diagnosis (DDx) generators' and 'LLMs' indicates that the barplot compares automated approaches with large language models. The caption also explains that the data is sorted by year and that information was sourced from existing literature. The mention of '95% confidence intervals' shows that the statistical uncertainty of each estimate is being considered. However, the figure could be improved by adding a clear definition for 'correct diagnosis'.
  • Effectiveness of the barplot presentation: The barplot is generally effective in presenting the data visually. The inclusion of error bars representing 95% confidence intervals allows for a visual assessment of the uncertainty associated with each estimate. The sorting of the models by year enables the reader to quickly see how newer approaches tend to perform better. However, using a barplot with distinct categories (models and methods) might not accurately represent the actual change in performance that is being measured and might benefit from using an alternative representation.
  • Axis labels and descriptions: The y-axis label, ‘Model’, is clear. However, the x-axis label, ‘Percent Correct Diagnosis in Differential’, could be more precise by defining the criteria for a ‘correct diagnosis’, or by adding additional information in the caption. For example, 'a model's top suggestion was the correct answer' or 'the target diagnosis was within a ranked list of differential diagnoses'.
Figure 2: A. Comparison of o1-preview with a previous evaluation of GPT-4 in...
Full Caption

Figure 2: A. Comparison of o1-preview with a previous evaluation of GPT-4 in providing the exact or very close diagnosis (Bond scores 4-5) on the same 70 cases. Bars are annotated with the accuracy of each model. 95% confidence intervals were computed using a one-sample binomial test. P-value was computed using McNemar's test. B. Histogram of o1 performance as measured by the Bond Score on the complete set of 143 cases.

First Reference in Text
On 70 cases evaluated using GPT-4 in a prior study, o1-preview produced a response with the exact or a very close diagnosis in 88.6% of cases, compared to 72.9% of cases by GPT-4 (p=.015, Figure 2).
Description
  • Key aspect of what is shown: This figure has two parts, A and B. Figure 2A is a barplot which is a type of graph that uses rectangular bars to show numerical data. Here, it is used to compare the performance of two different AI models: o1-preview and GPT-4. Each model was evaluated on the same 70 medical cases, and the bar height represents the percentage of times that the model provided a diagnosis that was either 'exact' or 'very close' to the correct one. The term 'Bond scores 4-5' is used here, which is a scoring method where 5 is the best possible score and indicates an exact diagnosis. The vertical lines on top of each bar represent the 95% 'confidence intervals', which is a range of values where the true accuracy is likely to fall. This accounts for the uncertainty due to limited data. Additionally, the caption notes that the 'p-value was computed using McNemar's test'. A 'p-value' is a statistical measure that indicates the likelihood that the difference observed between the two models is due to random chance. The McNemar's test is a specific statistical test used for comparing paired data, like the results for the same 70 cases being assessed by the two different models. Figure 2B is a 'histogram' which is a type of graph used to show the distribution of numerical data. Here, the histogram shows the distribution of ‘Bond scores’ achieved by the o1-preview model when it is evaluated across all 143 cases, which includes the 70 cases from Figure 2A. Each bar in the histogram shows the number of cases that have a particular Bond score, which can range from 0 to 5. This helps to see how frequently the o1-preview model achieved each level of diagnostic accuracy.
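As an illustration of how a paired comparison like this is carried out, the sketch below runs an exact McNemar's test on a hypothetical 2x2 table of concordant and discordant case outcomes; the cell counts are invented to roughly match the reported accuracies and are not the study's data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired outcomes on the same 70 cases (not the paper's data).
# Rows: o1-preview correct / incorrect; columns: GPT-4 correct / incorrect.
table = np.array([
    [48, 14],   # o1-preview correct:   48 both correct, 14 only o1-preview correct
    [ 3,  5],   # o1-preview incorrect:  3 only GPT-4 correct, 5 both incorrect
])

# The exact test uses only the discordant cells (14 vs 3).
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p-value = {result.pvalue:.3f}")
```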
Scientific Validity
  • Appropriateness of statistical methods: The use of a one-sample binomial test for calculating the 95% confidence intervals is appropriate for proportions. The McNemar's test is correctly used for comparing the paired proportions between o1-preview and GPT-4, where each model was tested on the same set of 70 cases. The reported p-value of 0.015 indicates a statistically significant difference between the two models, suggesting that the difference observed in the accuracy is unlikely due to random chance. However, the validity of this comparison relies on the assumption that the 70 cases were a representative sample from the full data set, and this should be addressed.
  • Interpretation of statistical results: The reference text correctly points out that the 70 cases used for comparison are the same cases used in a prior study with GPT-4. This allows for a direct comparison between the two models. The performance of o1-preview (88.6%) is significantly better than GPT-4 (72.9%), as highlighted by the p-value. This supports the claim that o1-preview is an improvement over its predecessor.
  • Methodological details and definition of scoring metrics: The figure and reference text mention that ‘Bond scores 4-5’ are considered ‘exact or very close diagnosis’. However, the definition of Bond scores is not discussed in depth. This raises questions about the degree of subjectivity in the scoring and whether it is reproducible by other researchers. A detailed description of how the Bond score was developed is needed for evaluating this element's validity. In addition, it is not clear whether the data for the GPT-4 evaluation was collected using the same methodology, prompts, and raters. If the data was collected using different conditions, this could affect the validity of the comparison.
Communication
  • Clarity and completeness of the caption: The caption for Figure 2 is well structured and provides the necessary details. It clearly states that part A is a comparison of two models (o1-preview and GPT-4) based on 'exact or very close diagnosis' which is defined by 'Bond scores 4-5'. It also mentions that the comparison uses the same 70 cases. The statistical tests used (one-sample binomial test and McNemar's test) are also included in the caption. Part B is described as a histogram of the o1-preview model performance based on the Bond Score over 143 cases. Overall, this provides good context for the figure's content.
  • Appropriateness of visualization methods: The use of a barplot in Figure 2A is appropriate for comparing the performance of two models. The accuracy of each model is clearly indicated by the height of the bar. The 95% confidence intervals show the uncertainty of each estimated accuracy. The annotation of the bars with the accuracy is also helpful for quick interpretation. The histogram in Figure 2B is effective in showing the distribution of the o1-preview model's Bond scores, where most of the cases have a Bond score of 4 or 5.
  • Axis labels and descriptions: The axis labels in both subfigures are clear. However, the histogram in Figure 2B could benefit from a brief explanation of the Bond Score in the figure itself, rather than just the caption, as some readers may not fully understand what a Bond score is and how it is measured. It is important to have this context, especially because it is a scoring methodology developed by the authors themselves.
Figure 3: Performance of o1-preview in predicting the next diagnostic tests...
Full Caption

Figure 3: Performance of o1-preview in predicting the next diagnostic tests that should be ordered. Performance was measured by two physicians using a likert scale of “Unhelpful,” "Helpful,” and “Exactly right.” We excluded 7 cases from the total case set in which it did not make sense to ask for the next test (Supplement 1B).

First Reference in Text
In 87.5% of cases, o1-preview selected the correct test to order, in another 11% of cases the chosen testing plan was judged by the two physicians to be helpful, and in 1.5% of cases it would have been unhelpful (Figure 3).
Description
  • Key aspect of what is shown: This figure is a barplot, which is a type of graph that uses bars to display numerical data. In this case, each bar represents the number of times the o1-preview model's suggestion for the next diagnostic test was rated as ‘Unhelpful,’ ‘Helpful,’ or ‘Exactly right.’ The rating was provided by two physicians, who evaluated the model's suggestions for different medical cases. A ‘diagnostic test’ is a procedure used to help identify a disease or condition, and the model was tasked with suggesting the next test that should be ordered, after being presented with a medical case. A ‘Likert scale’ is a rating scale where people indicate their level of agreement or disagreement, or how they perceive a particular item or question. Here, the Likert scale has three categories: ‘Unhelpful’, which indicates that the suggested test would not help in diagnosis; ‘Helpful’, which means that the suggested test would provide some benefit; and ‘Exactly right’, which implies that the suggested test is the most appropriate one to perform. The figure does not provide the total number of cases evaluated, but the caption notes that 7 cases were excluded because 'it did not make sense to ask for the next test', which means that for these 7 cases, no diagnostic test was deemed necessary or helpful.
Scientific Validity
  • Appropriateness of measurement scale: The use of a Likert scale is appropriate for capturing the subjective judgments of the physicians. The three categories ('Unhelpful,' 'Helpful,' 'Exactly right') are sufficiently distinct for this purpose. The percentages provided in the reference text accurately represent the data displayed in the barplot. The reference text also provides the distribution of the ratings, which supports the conclusion that the o1-preview model is generally accurate in predicting the next test.
  • Potential biases due to case exclusions: The exclusion of 7 cases is a potential source of bias, as the authors mention that 'it did not make sense to ask for the next test', but it is unclear what this criterion is. The reasons behind the exclusion of these cases should be explicitly stated in the main text and not just in the supplement. Without this information, it's impossible to assess whether these exclusions were justified and if they could impact the overall results. Additionally, the reference text uses the term 'correct test to order' which is not directly described in the figure. This term is also not present in the caption.
  • Lack of inter-rater reliability assessment: The study mentions that two physicians were used to score the suggested test plans, however, the methodology does not indicate whether their scores were statistically compared for inter-rater reliability. Inter-rater reliability is critical to ensure that the scores provided are consistent and not biased by the subjective interpretation of a single physician. This needs to be explicitly addressed.
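To illustrate the kind of inter-rater check being requested here, the sketch below computes Cohen's kappa between two raters' Likert labels using scikit-learn; the ratings are invented for illustration and are not taken from the study.

```python
from sklearn.metrics import cohen_kappa_score

# Invented ratings from two physicians for ten cases (not study data).
rater_1 = ["Exactly right", "Exactly right", "Helpful", "Exactly right", "Unhelpful",
           "Exactly right", "Helpful", "Exactly right", "Exactly right", "Helpful"]
rater_2 = ["Exactly right", "Helpful", "Helpful", "Exactly right", "Unhelpful",
           "Exactly right", "Exactly right", "Exactly right", "Exactly right", "Helpful"]

# Cohen's kappa corrects the raw agreement rate for agreement expected by chance.
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```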
Communication
  • Clarity and completeness of the caption: The caption clearly states the figure's purpose which is to evaluate how well the o1-preview model can predict the next diagnostic test. It also describes how the performance was measured (using a Likert scale), and mentions that 7 cases were excluded. However, the caption could be improved by stating which kind of cases were excluded, and why, and by stating the total number of cases evaluated.
  • Effectiveness of the barplot presentation: The barplot effectively shows the distribution of ratings given by the physicians. The categories are clear ('Unhelpful,' 'Helpful,' 'Exactly right'), and the use of a barplot is suitable for this kind of categorical data. The visual representation effectively conveys the distribution of responses and makes it clear that the majority of the tests suggested by the model were 'Exactly Right'.
  • Axis labels and descriptions: The axis labels are descriptive, and the use of 'Diagnostic Test Score' on the x-axis is generally understandable. However, it might be beneficial to explicitly state that the scores are based on the Likert scale ratings provided by the physicians, in case that was not clear to the reader.
Figure 4: A. Distribution of 312 R-IDEA scores stratified by respondents on 20...
Full Caption

Figure 4: A. Distribution of 312 R-IDEA scores stratified by respondents on 20 NEJM Healer cases. B. Box plot of the proportion of cannot-miss diagnoses included in differential diagnosis for the initial triage presentation. The total sample size in this figure is 70, with 18 responses from attending physicians, GPT-4 and o1-preview, and 16 responses from residents. Two cases were excluded because the cannot-miss diagnoses could not be identified. Ns: not statistically significant.

First Reference in Text
GPT-4 (47/80, p<0.0001), attending physicians (28/80, p<0.0001), and resident physicians (16/80, p<0.0001) as shown in Figure 4A.
Description
  • Key aspect of what is shown: This figure has two parts, A and B. Figure 4A shows a series of histograms, which are a type of graph that use bars to show the distribution of numerical data. Here, each histogram shows the distribution of 'R-IDEA scores' for a particular group of respondents. The 'R-IDEA' score is a measure of the quality of clinical reasoning documentation, and each score ranges from 0 to 10. The ‘respondents’ can be either the o1-preview AI model, the GPT-4 AI model, attending physicians, or resident physicians. The histograms are used to compare the overall distribution of the R-IDEA scores across different respondent types, and a higher R-IDEA score indicates better documentation of clinical reasoning. Figure 4B is a ‘box plot’ which is a type of graph used to show the distribution of a numerical variable using its median, quartiles, and range. The box plot represents the proportion of ‘cannot-miss’ diagnoses included by different groups during the ‘initial triage presentation’. A ‘cannot-miss’ diagnosis refers to a diagnosis that must be considered in a patient's assessment because it has a high risk of severe negative outcomes if it is not diagnosed in a timely manner. Each box plot corresponds to a different group: resident physicians, attending physicians, the GPT-4 AI model, and the o1-preview AI model. The box plot shows the median (the line inside the box), the interquartile range (the box itself), and the range of data (the lines extending from the box). Additionally, the caption indicates that 'ns' refers to 'not statistically significant', which is a statistical term used to show that the observed differences between the groups are not likely due to random chance. Two cases were excluded from the analysis in part B, because the “cannot-miss” diagnoses could not be identified for those cases.
Scientific Validity
  • Appropriateness of statistical methods: The use of histograms to show the distribution of R-IDEA scores in Figure 4A is appropriate. The reference text correctly states the number of cases that obtained a perfect R-IDEA score (10/10) for the GPT-4, attending physicians, and resident physicians. It also mentions the p-values obtained, which indicate that all these groups were significantly different from the o1-preview model. This statistical analysis is adequate for testing whether the o1-preview model is significantly different from the other groups.
  • Clarity of reference to the figure: The reference text references Figure 4A, but does not refer to Figure 4B. The reference text also only reports the number of perfect scores for different groups, and does not report on the statistical significance of the differences between the groups, which are presented in the figure. The reference text mentions the p-values but does not state what statistical test was used to compute these. This makes it difficult to assess the overall validity of the statistical test performed.
  • Lack of methodological details for the R-IDEA score: The figure and the reference text do not explicitly explain the methodology behind the R-IDEA score, and whether it is a scoring system that is validated for use in this study. Without a description of the methodology and validation of the R-IDEA, it is not possible to evaluate the scientific validity of this element. Additionally, it is unclear whether the data for the different models and groups was generated using the same methodology and prompts, which could impact the overall validity of the study.
Communication
  • Clarity and completeness of the caption: The caption is relatively comprehensive, describing both subfigures A and B. It mentions that Figure 4A shows the distribution of 'R-IDEA' scores for different respondents and Figure 4B shows a box plot of 'cannot-miss' diagnoses. The caption also provides the total sample sizes, the number of responses per group, and the reason for excluding two cases. However, it could benefit from a brief explanation of what 'R-IDEA' scores and 'cannot-miss' diagnoses mean for the reader.
  • Appropriateness of visualization methods: Figure 4A uses histograms to display the distribution of R-IDEA scores. This is appropriate for visualizing how the scores are distributed for each group (o1-preview, GPT-4, attending physicians, and resident physicians). Figure 4B uses a box plot, which is a good choice for showing the median, quartiles, and range of the 'cannot-miss' diagnoses for each group. Both plots are effective in communicating the respective datasets; however, the histograms could be improved by including a numerical y-axis. Without a numerical y-axis, it is difficult to assess the significance of the data.
  • Axis labels and descriptions: The axis labels in both subfigures are clear. However, the y-axis label in the histograms in Figure 4A is missing, and the x-axis label in Figure 4B is also missing. The use of “ns” for non-significant differences in Figure 4B is effective, but the figure could benefit from also providing the actual p-values for the comparisons.
Figure 5: A. Box plot of normalized management reasoning points by LLMs and...
Full Caption

Figure 5: A. Box plot of normalized management reasoning points by LLMs and physicians. Five cases were included. We generated one o1-preview response for each case. The prior study collected five GPT-4 responses to each case, 176 responses from physicians with access to GPT-4, and 199 responses from physicians with access to conventional resources. *: p <= 0.05, **: p <= 0.01, ***: p <= 0.001, ****: p <= 0.0001. B. Box plot of normalized diagnostic reasoning points by model and physicians. Six diagnostic challenges were included. We generated one o1-preview response for each case. The prior study collected three GPT-4 responses to all cases, 25 responses from physicians with access to GPT-4, and 25 responses from physicians with access to conventional resources. Ns: not statistically significant.

First Reference in Text
The median score for the o1-preview per case was 86% (IQR, 82%-87%) (Figure 5A) as compared to GPT-4 (median 42%, IQR 33%-52%), physicians with access to GPT-4 (median 41%, IQR 31%-54%), and physicians with conventional resources (median 34%, IQR 23%-48%).
Description
  • Key aspect of what is shown: This figure has two parts, A and B. Both figures use ‘box plots’, which are a type of graph used to show the distribution of numerical data through its quartiles, median and range. Figure 5A shows the distribution of ‘normalized management reasoning points’ for different groups. ‘Management reasoning’ refers to the process of deciding the next steps in a patient’s care, and ‘normalized’ means that the scores have been adjusted to a scale of 0 to 100 to allow for a fair comparison across different cases. The groups being compared are the o1-preview AI model, the GPT-4 AI model, physicians with access to GPT-4, and physicians with access to conventional resources. The caption notes that there were five medical cases used to evaluate the management reasoning, and that one response was generated for each case using the o1-preview model. It also notes that the prior study collected five GPT-4 responses per case and 176 and 199 responses for physicians with access to GPT-4 and conventional resources, respectively. Figure 5B shows a box plot of ‘normalized diagnostic reasoning points’ for similar groups. ‘Diagnostic reasoning’ refers to the process of identifying the most likely diagnosis, and as in Figure 5A, the scores have been normalized to a scale of 0 to 100. The figure indicates that six diagnostic cases were used for the evaluation and that one response was generated for each case using the o1-preview model. The prior study collected three GPT-4 responses per case, and 25 responses from physicians with access to GPT-4 and conventional resources. Additionally, the caption notes that “ns” indicates ‘not statistically significant’ and the asterisks represent different levels of statistical significance. Statistical significance is used to determine whether the observed differences are likely due to random chance or a real effect.
Scientific Validity
  • Appropriateness of descriptive statistics: The use of box plots to represent the distribution of scores is appropriate. The reference text correctly provides the median and interquartile range (IQR) values for each group in Figure 5A. The IQR is a measure of the variability in the data and provides context for the median value. The reference text highlights the superior performance of the o1-preview model as compared to the other groups.
  • Clarity of reference to the figure: The reference text focuses on Figure 5A and does not reference Figure 5B. The reference text only reports descriptive statistics, and does not mention any of the statistical tests used, or whether the differences were statistically significant. Although the reference text reports the medians for each group, it does not include the p-values that are shown in the figure. This makes it difficult to evaluate whether the differences between groups are significant.
  • Lack of methodological details for normalization and scoring metrics: The caption and the reference text mention that the scores are 'normalized' but they don't provide any details about how the scores were normalized. The process of normalization is important for the scientific validity, as it can impact the results of the study. Also, it is not clear whether the methods used to calculate the management reasoning and diagnostic reasoning points are valid or reliable measures of these constructs. Without more information on the scoring methodology, it is not possible to assess the scientific validity of this element.
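To make this critique concrete, one plausible reading of 'normalized' is that each response's earned points are divided by the maximum points available for that case and expressed on a 0-100 scale. The sketch below implements that assumption; it is an editorial illustration, not the authors' documented procedure.

```python
def normalize_points(earned: float, max_points: float) -> float:
    """Scale a rubric score to 0-100, assuming 'normalized' means earned
    points divided by the case maximum (an assumption, since the paper
    does not document its procedure)."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return 100.0 * earned / max_points

# Hypothetical rubric scores for one management case.
print(normalize_points(earned=17, max_points=20))  # -> 85.0
```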
Communication
  • Clarity and completeness of the caption: The caption for Figure 5 is generally comprehensive, clearly differentiating between the two subfigures (A and B). It specifies that both are box plots comparing LLMs and physicians, but each subfigure focuses on a different task (management and diagnostic reasoning). The caption also provides crucial information, including the number of cases, the number of responses collected for each model and group, and a definition of the statistical significance symbols. However, the caption could be improved by adding a brief explanation of what is meant by 'normalized' and by including more information on how the ‘management reasoning points’ and ‘diagnostic reasoning points’ were determined.
  • Appropriateness of visualization methods: The use of box plots in both subfigures is appropriate for showing the distribution of scores for each group. The box plots effectively display the median, interquartile range, and outliers for each group, which enables a quick comparison of the performance of the different models and physicians. However, it is difficult to compare the distributions across groups because of the lack of y-axis labels.
  • Axis labels and descriptions: The x-axis labels in both figures are clear and accurately describe each group being compared. The use of asterisks to indicate statistical significance is effective; however, the y-axis labels are missing, which limits the interpretability of the data. The inclusion of the number of cases in the caption is useful, but it would also be beneficial to include this information in the figure itself, for better readability.
Figure 6: Density plots for the distribution of responses by o1-preview, GPT-4...
Full Caption

Figure 6: Density plots for the distribution of responses by o1-preview, GPT-4 and humans to clinical vignettes asking for (1) the pretest probability of disease, (2) the updated probability after a positive test result, and (3) the updated probability after a negative test result. The shaded blue indicates the reference range based on a review of literature from a prior study [22]. Human responses are from 553 medical practitioners (290 resident physicians, 202 attending physicians, and 61 nurse practitioners or physician assistants). 100 predictions were generated by GPT-4 and o1-preview for each question.

First Reference in Text
As shown in Figure 6 and Table 3, o1-preview performs similarly to GPT-4 in estimating pre-test and post-test probabilities.
Description
  • Key aspect of what is shown: This figure uses ‘density plots’, which are a type of graph used to display the distribution of continuous numerical data. Here, each plot shows the distribution of probability estimates made by the o1-preview AI model, the GPT-4 AI model, and a group of human medical practitioners. These estimates are related to the probability of a disease given different clinical vignettes or scenarios. A ‘clinical vignette’ is a short, hypothetical case scenario used for educational or research purposes. The ‘pretest probability of disease’ refers to the initial probability of a patient having a particular disease before any tests are done. The ‘updated probability after a positive test result’ refers to the probability of the disease after a positive test result is obtained. Similarly, the ‘updated probability after a negative test result’ is the probability after a negative test result. The figure contains 4 sets of density plots which show the changes in probability estimates for each clinical vignette after different test results. The shaded blue area indicates the ‘reference range’, which is a range of probabilities from prior literature that is used as a benchmark. The figure caption also notes that 553 medical practitioners were used as a human control, and that each model generated 100 estimates for each probability question. This shows how the models and the humans respond to the same questions, and whether they align with the reference ranges.
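For context on what a well-calibrated update looks like in this kind of task, the sketch below applies Bayes' theorem using likelihood ratios derived from an assumed sensitivity and specificity; the numbers are illustrative and are not taken from the study's vignettes or reference ranges.

```python
def post_test_probability(pretest: float, sensitivity: float,
                          specificity: float, positive: bool) -> float:
    """Update a pretest probability after a test result via Bayes' theorem,
    expressed with likelihood ratios (illustrative values only)."""
    lr = sensitivity / (1 - specificity) if positive else (1 - sensitivity) / specificity
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Assumed example: 20% pretest probability, test with 90% sensitivity and 85% specificity.
print(f"{post_test_probability(0.20, 0.90, 0.85, positive=True):.1%}")   # after a positive result (~60%)
print(f"{post_test_probability(0.20, 0.90, 0.85, positive=False):.1%}")  # after a negative result (~3%)
```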
Scientific Validity
  • Appropriateness of visualization methods: The use of density plots is an appropriate method for visualizing the distribution of probabilistic estimates. The figure shows the variability in the estimates of each group. The inclusion of the reference range provides a useful benchmark to evaluate the model performance. The reference text correctly highlights that o1-preview and GPT-4 perform similarly in estimating pre-test and post-test probabilities. This claim is supported by the figure, which shows that the distributions of the two models are very similar.
  • Interpretation of the results: The reference text refers to both Figure 6 and Table 3, but does not specify the specific table values that support the claim that both models perform similarly. This claim could be further supported by citing specific data from Table 3. The reference text is also somewhat vague, as it states that both models perform similarly, but does not address the fact that the o1-preview model seems to be slightly closer to the reference range in the 'Stress test for coronary artery disease' scenario, and that all groups are often far from the reference ranges. This requires a more detailed evaluation of the data.
  • Lack of methodological details: The figure caption states that the reference ranges are based on a review of literature from a prior study, but does not explain how the reference ranges were established, which limits the ability to assess the scientific validity of these benchmarks. Also, the figure does not explicitly state which specific cases were used, and it is not clear whether the human controls were presented with the same prompts and scenarios as the AI models. Without further details on the methodology, it is difficult to assess the scientific validity of this element.
Communication
  • Clarity and completeness of the caption: The caption provides good context for the figure, stating that it shows density plots representing responses by o1-preview, GPT-4, and humans for clinical vignettes. It clearly explains that the responses are for pretest probabilities, probabilities after positive tests, and probabilities after negative tests. The caption also mentions that the shaded blue area is the reference range obtained from a prior study, and provides the number of human participants and the number of predictions generated by the models. However, the caption does not mention which specific clinical cases were used, nor does it define the meaning of 'pretest' and 'post-test' probabilities, and it would benefit from an explanation of the term 'density plot'.
  • Effectiveness of the density plot presentation: The use of density plots is suitable for visualizing the distribution of probabilistic estimates for each group. The plots are well-organized, allowing the reader to see how the distributions of each group change after a positive and negative test result. The overlay of the shaded blue area for the reference range helps the reader evaluate the model and human performance with respect to established benchmarks. However, the density plots would benefit from a numerical y-axis to evaluate the actual densities of each distribution.
  • Axis labels and descriptions: The axis labels are clear ('Estimates, %'), which is useful to interpret the data. The figure is also well-organized into three rows, each representing a different clinical case, and each column represents a different probability estimate (pretest, post-test positive, post-test negative). This layout makes it easy to compare the different scenarios. However, the cases used are not stated within the figure itself, and the y-axis labels are missing, which makes it difficult to assess the distribution of the data.
Table 1: Three examples in which o1-preview correctly diagnosed a complex case...
Full Caption

Table 1: Three examples in which o1-preview correctly diagnosed a complex case that GPT-4 could not solve. GPT-4 examples are from a prior study [8].

First Reference in Text
Examples of o1-preview solving a complex case are shown in Table 1.
Description
  • Key aspect of what is shown: Table 1 is a table which is a way of organizing information using rows and columns. The table is used to present examples of where the o1-preview AI model was able to correctly diagnose a complex medical case, while the GPT-4 AI model did not. Each row in the table corresponds to a different medical case, which is identified by the case number in the first column. The ‘Final Diagnosis’ column shows the actual diagnosis of the case. The ‘GPT-4 Differential’ column shows the list of possible diagnoses that the GPT-4 model generated. A ‘differential diagnosis’ is a list of possible medical conditions that could be causing a patient’s symptoms. The ‘o1-preview Differential’ column shows the list of possible diagnoses generated by the o1-preview model. The models can provide a list of possible diagnoses, but can also include a “Most Likely Diagnosis” which is the diagnosis they deemed most probable, based on the information provided. A ‘Bond Score’ is included at the end of each differential diagnosis, which is a score that measures how close the model's diagnoses are to the actual diagnosis. It is a numerical score, and higher scores generally indicate better performance. The caption notes that the GPT-4 examples are from a prior study, which means that the GPT-4 model was evaluated in a separate study, and the data has been included here for comparison.
Scientific Validity
  • Usefulness of examples: The table is useful for providing specific examples of cases where the o1-preview model outperforms GPT-4. The examples are well-chosen and demonstrate the strengths of the o1-preview model in complex diagnostic situations. The reference text accurately points out that Table 1 shows examples of o1-preview correctly diagnosing a complex case. However, the reference text could be improved by highlighting that GPT-4 failed in these cases.
  • Lack of a systematic approach: The table does not provide a systematic or comprehensive assessment of the models' performance. The selection of cases is not justified, and it is unclear if they were randomly selected or chosen because they were the only cases that show this effect. The inclusion of only three cases could be insufficient to draw broader conclusions, and might introduce a bias into the study. The table also does not explain how the case information was presented to each model, or if the prompts were similar.
  • Lack of methodological details: The use of 'Bond Score' as a metric is included in the table but is not explained in the reference text. The definition of 'correct diagnosis' is not explicit in the table or the reference text. Without a clear explanation of the scoring system and the criteria for a 'correct diagnosis', it is difficult to assess the scientific validity of the data presented. Also, the table lacks information on how the ‘differential diagnosis’ for each model was generated, which is critical for evaluating the methodology.
Communication
  • Clarity and completeness of the caption: The caption is clear and concise, indicating the purpose of the table which is to show examples of complex cases that o1-preview diagnosed correctly while GPT-4 did not. It also mentions that the GPT-4 data is from a prior study, which is important for context. However, the caption does not explicitly state how a 'correct diagnosis' was defined, or what the criteria for a 'complex case' were, which could limit the reader's understanding of the data presented.
  • Effectiveness of the table presentation: The table is well-organized with clear column headers, making it easy to compare the different diagnoses provided by both models. The table is effective at presenting the information clearly, and the use of bold text to highlight the 'Most Likely Diagnosis' is helpful. However, it would be beneficial to explicitly state what a ‘differential diagnosis’ is in the context of this table, as well as to include the actual case information for the reader.
  • Column headers and descriptions: The column headers are descriptive and informative. The inclusion of 'Bond Score' is useful but requires further context, as this scoring methodology is not well-known. The table uses a clear font and layout, which contributes to readability. However, it would be beneficial to add a numerical ID to each case for better referencing.
Table 2: Three examples of the o1-preview suggested testing plan compared to...
Full Caption

Table 2: Three examples of the o1-preview suggested testing plan compared to the testing plan conducted. One example scored a two, indicating that the test was appropriate and nearly identical to the case plan. A score of one indicates that the suggested diagnostics would have been helpful or yielded the diagnosis via another test not used in the case. A score of zero indicates that the suggested diagnostics would be unhelpful. Verbose rationales from o1-preview were abridged by a physician (Z.K.) to better fit in the table.

First Reference in Text
Examples are shown in Table 2.
Description
  • Key aspect of what is shown: Table 2 is a table, which is a structured way of organizing information using rows and columns. This table is used to show examples of how the o1-preview AI model's suggestions for diagnostic testing compare to the actual tests that were conducted in three different medical cases. The ‘Case’ column indicates the case number, and the ‘Case Test Plan’ column provides a description of the diagnostic tests that were actually performed in each case. The ‘o1-preview Suggested Test Plan’ column lists the tests that the o1-preview model suggested for each case. The ‘Score’ column provides a numerical score (0, 1, or 2) that is used to evaluate how well the o1-preview's suggested plan aligned with the actual test plan. A score of 2 indicates that the suggested plan was appropriate and nearly identical to the actual plan. A score of 1 indicates that the suggested plan was helpful or could have yielded the diagnosis via another test that was not used. A score of 0 indicates that the suggested plan was unhelpful. The table presents the information side-by-side so that it is easy to compare what was done with what the model suggested. The caption also mentions that the 'verbose rationales' of the o1-preview model were abridged or shortened by a physician (Z.K.) to better fit in the table, which implies that the model provided more detailed explanations than what is presented in the table.
Scientific Validity
  • Usefulness of examples: The table is useful for providing specific examples of cases where the o1-preview model's suggested testing plans were evaluated. The table highlights the model's ability to suggest appropriate testing strategies. The reference text correctly states that Table 2 shows examples of the o1-preview model's suggested testing plans. However, the reference text could be improved by stating that the table compares the model's suggestions to the actual testing plans used in each case.
  • Lack of a systematic approach: The table only provides three examples and does not provide any statistical measures of the overall performance of the model in recommending the next tests. This means that the examples might be biased and are not representative of the model's overall performance. The selection criteria for these cases is not mentioned, and it is unclear if the selected examples are representative of the overall dataset. The table also does not state how the actual test plans were selected, or whether they were the optimal plans.
  • Lack of methodological details: The table provides a score for each case, but does not specify how the scores were assigned, and whether there was any inter-rater reliability measure. Without a clear explanation of how the scores were assigned, it is difficult to assess the scientific validity of the evaluation. Additionally, the table lacks information on how the 'o1-preview Suggested Test Plan' was generated. The methodology does not specify whether the models were provided with the complete case information, or just the differential diagnosis.
Communication
  • Clarity and completeness of the caption: The caption clearly states the purpose of the table, which is to compare the o1-preview's suggested testing plans with the actual testing plans conducted in three medical cases. The caption also provides a clear explanation of the scoring system (0, 1, and 2) used to evaluate the suggested plans. The caption also mentions that the verbose rationales of the o1-preview were abridged to fit the table, which is a useful note for the reader. However, the caption does not provide information on what 'testing plans' are, and how they are defined, and does not state how the actual plans were selected.
  • Effectiveness of the table presentation: The table is well-organized, with columns for the 'Case', 'Case Test Plan', 'o1-preview Suggested Test Plan', and 'Score'. The use of a table is appropriate for presenting this kind of comparative data. The table effectively presents the actual test plans and the model's suggestions side-by-side, which makes it easy for the reader to compare them. The use of bold text to indicate important information and the numerical score is also useful.
  • Column headers and descriptions: The column headers are descriptive and clearly indicate the content of each column. The table provides a clear numerical score that is associated with a description of its meaning in the caption. However, it would be beneficial to include a numerical ID for each case for better referencing and to clarify what the criteria were for a test plan to be considered 'helpful', 'unhelpful', or 'exactly right'.
Table 3: Probabilistic Reasoning Before and After Testing by o1-preview
First Reference in Text
As shown in Figure 6 and Table 3, o1-preview performs similarly to GPT-4 in estimating pre-test and post-test probabilities.
Description
  • Key aspect of what is shown: Table 3 is a table, which is a way of organizing information using rows and columns. This table is used to present the results of the probabilistic reasoning performance of the o1-preview AI model. The table shows how the model estimates the probability of a disease both before any tests are done (‘Before test’), and after either a positive or a negative test result (‘After positive test result’ and ‘After negative test result’). The ‘Reference probability range’ column shows the range of probabilities based on prior literature, which is used as a benchmark. The ‘o1-preview (n=100)’ and ‘GPT-4 (n=100)’ columns show the median estimate, and the ‘interquartile range’ (IQR) which is a measure of the variability in the data, for each probability estimate. The ‘Clinician (n=553)’ column shows the median estimate for a group of 553 medical practitioners. Additionally, the table includes the ‘Mean Absolute Error (MAE)’ and ‘Mean Absolute Percentage Error (MAPE)’ for each model and for the human clinicians, which are statistical measures of how different the models’ estimates are from the reference ranges. The caption specifies that 100 predictions were generated by both models for each question.
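To clarify the two error metrics mentioned here, the sketch below computes MAE and MAPE against a single reference value; the estimates and the reference are invented, and the paper may have implemented these calculations differently (for example, measuring error against the nearest bound of the reference range).

```python
import numpy as np

def mae_and_mape(estimates: np.ndarray, reference: float) -> tuple[float, float]:
    """Mean absolute error and mean absolute percentage error of probability
    estimates against one reference value (an editorial simplification)."""
    errors = np.abs(estimates - reference)
    mae = errors.mean()
    mape = (errors / reference).mean() * 100
    return mae, mape

# Invented probability estimates (%) against an assumed reference value of 25%.
estimates = np.array([20, 30, 45, 25, 10], dtype=float)
mae, mape = mae_and_mape(estimates, reference=25.0)
print(f"MAE = {mae:.1f} percentage points, MAPE = {mape:.1f}%")
```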
Scientific Validity
  • Usefulness of numerical data: The table provides numerical data that supports the claim that o1-preview and GPT-4 perform similarly in probabilistic reasoning. The inclusion of MAE and MAPE provides a quantitative measure of the models' performance. The reference text correctly notes that the o1-preview model performs similarly to GPT-4. However, it could be improved by highlighting that the performance of both models is not always within the reference ranges.
  • Lack of statistical test details: The table lacks information on the statistical tests used to determine the p-values. Without this information, it is difficult to assess whether the differences between the models and the human controls are statistically significant. Also, the table does not specify the number of cases included in each scenario, which makes it difficult to interpret the results.
  • Lack of methodological details: The table does not explicitly explain what a 'pre-test' and 'post-test' probability is, and it lacks detail on the actual clinical vignettes used. Without this context, it is difficult to assess the validity of the probabilistic reasoning task. Additionally, the table does not specify the methodology used to obtain the reference probability ranges, and if the prior studies are comparable.
Communication
  • Clarity and completeness of the caption: The caption is concise and clearly indicates the table's purpose, which is to show the probabilistic reasoning performance of the o1-preview model before and after testing. However, the caption could be improved by specifying what type of testing is being referred to, and whether the human controls were also included in this table.
  • Effectiveness of the table presentation: The table is well-organized, with clear column headers. The use of a table is appropriate for presenting the numerical data associated with the different models. The table is effective at showing the pre-test and post-test probability estimates, as well as the mean absolute error (MAE) and mean absolute percentage error (MAPE) for each scenario. However, it would be useful to add a definition of MAE and MAPE in the table itself.
  • Column headers and descriptions: The column headers are descriptive and informative. The inclusion of 'Reference probability range' is useful for providing a benchmark for the estimates. The use of p-value indicators is helpful, but could be improved by adding a footnote explaining what statistical test was used to obtain these values. The table also provides the interquartile range (IQR) for each estimate, but it does not explicitly label it as such.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Methods

Key Aspects

Strengths

Suggestions for Improvement
