Evaluation of OpenAI's o1 Large Language Model in the Medical Domain

Overall Summary

Overview

This study evaluates the performance of OpenAI's o1, a large language model with internalized chain-of-thought, in the medical field. Using 37 medical datasets, including two newly constructed question-answering datasets (NEJMQA and LancetQA) built from professional medical quizzes in the New England Journal of Medicine and The Lancet, the study assesses o1's understanding, reasoning, and multilingual capabilities. The research aims to address limitations in existing medical LLM benchmarks and to provide a comprehensive evaluation of o1's potential in clinical applications.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 2

Description: Presents accuracy and F1 results for o1 and other LLMs across various medical tasks, demonstrating o1's superior performance in Knowledge QA and Medical Calculation.

Relevance: Provides key quantitative evidence supporting o1's enhanced clinical understanding and reasoning.

Figure 4

Description: Compares o1 and GPT-4's responses on a LancetQA diagnostic question, visually showcasing o1's accurate and concise reasoning compared to GPT-4's hallucination and incorrect answer.

Relevance: Illustrates o1's superior reasoning ability in a real-world clinical scenario.

Conclusion

This study demonstrates the potential of OpenAI's o1 model as a step towards an "AI doctor." The evaluation across multiple medical datasets reveals o1's enhanced understanding and reasoning capabilities, showing promise for bridging the gap between AI and human doctors. However, further research is needed to address limitations such as hallucination and to develop more robust evaluation metrics for advanced LLMs in the medical domain. Future work should explore areas like safety, retrieval augmented generation, and the broader implications of integrating such models into healthcare practices.

Section Analysis

Abstract

Overview

This abstract introduces a study evaluating OpenAI's o1, a large language model with internalized chain-of-thought, in the medical field. It focuses on o1's understanding, reasoning, and multilingual capabilities across 37 medical datasets, including new question-answering tasks based on professional medical quizzes. The study finds that o1 shows promise in medical applications, outperforming GPT-4, but also identifies weaknesses like hallucination and inconsistent multilingual abilities.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1

Figure 1 is a radar chart comparing the performance of five large language models (LLMs) on twelve medical datasets. The LLMs compared are o1, GPT-4, GPT-3.5, Meditron-70B, and Llama3-8B. The datasets include PubMedQA, MedQA, NEJMQA, MedMCQA, LancetQA, Medbullets, PUBHEALTH Ver., MedNLI-Dis, MedCalc-Bench, MIMIC4ED, PICO, and MedBench. The chart suggests o1 generally outperforms the other models on most datasets.

First Mention

Text: "Figure 1: Overall results of o1 and other 4 strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over close- and open-source models."

Context: This figure caption appears alongside the abstract and introduces the first figure, which summarizes the overall performance comparison of the different LLMs.

Relevance: This figure is highly relevant to the abstract as it visually summarizes the main finding of the study: o1's superior performance on a range of medical datasets, supporting the claim that it advances the development of an 'AI doctor'.

Critique
Visual Aspects
  • The radar chart format allows for easy comparison across multiple datasets simultaneously.
  • The color-coding helps differentiate the models, but several lines lie close together and can be hard to tell apart.
  • The specific performance metric (presumably accuracy) is not labeled on the chart itself, although it is mentioned in the caption and surrounding text.
Analytical Aspects
  • The figure shows a general trend of o1's superior performance, but specific numerical values are difficult to discern from the chart itself.
  • The caption mentions a 'clear performance advantage', but no statistical tests are mentioned to support this claim.
  • The choice of 12 datasets and their representativeness of the medical domain could be further discussed.
Numeric Data
  • Number of LLMs compared: 5
  • Number of medical datasets: 12
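
For readers who want to reproduce this kind of comparison, the sketch below shows one way a radar chart like Figure 1 could be drawn with matplotlib. The dataset names follow the figure, but the accuracy values and the choice of two models are placeholders for illustration, not the paper's results.

  # Minimal radar-chart sketch; the scores below are placeholders, not the paper's numbers.
  import numpy as np
  import matplotlib.pyplot as plt

  datasets = ["PubMedQA", "MedQA", "NEJMQA", "MedMCQA", "LancetQA", "Medbullets"]
  scores = {                      # hypothetical per-dataset accuracies (%)
      "o1":    [75, 90, 80, 83, 85, 77],
      "GPT-4": [55, 74, 64, 72, 70, 60],
  }

  angles = np.linspace(0, 2 * np.pi, len(datasets), endpoint=False).tolist()
  angles += angles[:1]            # repeat the first angle to close the polygon

  fig, ax = plt.subplots(subplot_kw={"polar": True})
  for model, vals in scores.items():
      vals = vals + vals[:1]      # close the polygon for this model
      ax.plot(angles, vals, label=model)
      ax.fill(angles, vals, alpha=0.1)

  ax.set_xticks(angles[:-1])
  ax.set_xticklabels(datasets)
  ax.set_ylim(0, 100)
  ax.legend(loc="upper right")
  plt.show()
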
Figure 2

Figure 2 is a bar chart showing the average accuracy of the same five LLMs (o1, GPT-4, GPT-3.5, Meditron-70B, and Llama3-8B) across 19 medical datasets. o1 achieves the highest average accuracy of 74.3%.

First Mention

Text: "Figure 2: Average accuracy of o1 and other 4 strong LLMs. o1 achieves the highest average accuracy of 74.3% across 19 medical datasets."

Context: This sentence, appearing immediately after the mention of Figure 1, introduces the second figure focusing on average accuracy across a wider range of datasets.

Relevance: This figure complements Figure 1 by providing a summary statistic (average accuracy) across a larger set of datasets, further strengthening the claim of o1's superior performance and its relevance to the abstract's main point.

Critique
Visual Aspects
  • The bar chart is clear and easy to interpret, with the model achieving the highest accuracy clearly visible.
  • The use of different colors and icons for each LLM enhances visual distinction.
  • The y-axis is clearly labeled with 'Accuracy' and percentage values.
Analytical Aspects
  • While the average accuracy is provided for o1, the average accuracies for other models are not explicitly stated in the caption.
  • The figure doesn't show the individual dataset accuracies that contribute to the average, limiting a deeper understanding of performance variations.
  • No statistical tests are mentioned to assess the significance of the differences in average accuracy.
Numeric Data
  • Average accuracy of o1: 74.3 %
  • Number of LLMs compared: 5
  • Number of medical datasets: 19
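
As a small aside on how a headline number like the 74.3% average is typically obtained, the sketch below takes an unweighted mean over per-dataset accuracies. The dataset scores shown are illustrative, and the assumption of an unweighted (macro) average is ours; the caption does not state the weighting.

  # Minimal sketch: unweighted average of per-dataset accuracies (illustrative values).
  per_dataset_accuracy = {
      "PubMedQA": 0.750,   # hypothetical scores, not the paper's numbers
      "MedQA":    0.900,
      "MedMCQA":  0.830,
      # ... one entry per evaluated dataset (19 datasets in Figure 2)
  }

  average_accuracy = sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)
  print(f"Average accuracy: {average_accuracy:.1%}")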

Introduction

Overview

This introduction sets the stage for a study evaluating the OpenAI o1 model's performance in medicine. It discusses the concept of general intelligence, the progress of LLMs, and the limitations of current medical LLM benchmarks. The paper aims to address this gap by evaluating o1's understanding, reasoning, and multilingual capabilities in medicine using existing and newly created datasets.

Key Aspects

Strengths

Suggestions for Improvement

Related Works

Overview

This section discusses prior work on large language models (LLMs) with enhanced reasoning abilities, particularly in the medical domain. It highlights the development of LLMs with chain-of-thought (CoT) reasoning and reinforcement learning, and their application to medical tasks. The section also emphasizes the need for comprehensive evaluation of LLMs in medicine, considering understanding, reasoning, and multilinguality.

Key Aspects

Strengths

Suggestions for Improvement

Evaluation Pipeline

Overview

This section details the evaluation pipeline used in the study, outlining the taxonomy of evaluations, the aspects and tasks involved, and the metrics employed. The pipeline assesses language models across understanding, reasoning, and multilinguality using a variety of medical datasets and prompting strategies. The section emphasizes the comprehensive nature of the evaluation, aiming to provide a holistic view of LLM performance in the medical domain.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 3

Figure 3 illustrates the evaluation pipeline used in the study. It outlines the aspects of model capabilities (understanding, reasoning, and multilinguality), the prompt strategies (direct prompting, chain-of-thought, few-shot prompting), the language models used (o1, GPT-3.5, GPT-4, Meditron, Llama3), and the evaluation metrics (Prediction Accuracy, Hallucination, Tendency, and Free-form Text Generation Capability). The diagram shows a flow from aspects and tasks to prompt strategies, then to language models, and finally to evaluation. A brief sketch of the three prompt styles appears after this entry.

First Mention

Text: "First, we present the taxonomy of our evaluation, along with an overview of the evaluation pipeline as shown in Figure 3."

Context: This sentence appears in the 'Overall Taxonomy of Evaluations' subsection, introducing the figure that visually represents the evaluation pipeline.

Relevance: Figure 3 is crucial for understanding the methodology of the study. It visually outlines the entire evaluation process, connecting the aspects being evaluated, the prompting strategies, the models used, and the metrics employed. This provides a clear framework for the experiments and results presented later in the paper.

Critique
Visual Aspects
  • The flow diagram format clearly illustrates the steps in the evaluation pipeline.
  • The use of distinct boxes for each stage and arrows to indicate the flow enhances readability.
  • The figure could benefit from a brief explanation of how 'Tendency' is measured as an evaluation metric.
Analytical Aspects
  • The pipeline logically connects the different components of the evaluation, providing a structured approach to assessing LLM performance.
  • The inclusion of different prompting strategies allows for a more comprehensive evaluation of the models' capabilities.
  • The figure could be strengthened by providing a more precise definition of 'Free-form Text Generation Capability' and how it is evaluated.
Numeric Data
  • Number of aspects: 3
  • Number of prompting strategies: 3
  • Number of language models: 5
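
As referenced above, the following sketch illustrates what the three prompt strategies named in Figure 3 typically look like for a multiple-choice medical question. The template wording and the example question are hypothetical and are not taken from the paper.

  # Hypothetical templates for the three prompt strategies in Figure 3.
  QUESTION = ("A two-month-old infant presents with a desquamating rash. "
              "What is the most likely diagnosis?")
  OPTIONS = "A) Congenital candidiasis  B) Congenital syphilis  C) Atopic dermatitis  D) Scabies"

  # Direct prompting: ask for the answer with no intermediate reasoning.
  direct_prompt = f"{QUESTION}\n{OPTIONS}\nAnswer with a single option letter."

  # Chain-of-thought prompting: elicit step-by-step reasoning before the answer.
  cot_prompt = f"{QUESTION}\n{OPTIONS}\nLet's think step by step, then give the final option letter."

  # Few-shot prompting: prepend worked examples before the target question.
  few_shot_prompt = (
      "Q: <example question 1>\nA: <worked answer 1>\n\n"
      "Q: <example question 2>\nA: <worked answer 2>\n\n"
      f"Q: {QUESTION}\n{OPTIONS}\nA:"
  )
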
Table 1

Table 1 provides a detailed list of the 37 datasets used in the evaluation, categorized by three fundamental aspects: Understanding (Concept Recognition, Text Summary), Reasoning (Knowledge QA, Clinical Decision Support, Agent, Medical Calculation), and Multilinguality (Knowledge QA, Agent). For each dataset, the table lists the task, a brief description, and the metrics used for evaluation. Asterisks mark newly constructed datasets.

First Mention

Text: "In Table 1, our evaluation efforts are structured into three main parts: aspect, task, and dataset."

Context: This sentence, located in the 'Aspects and Tasks' subsection, introduces Table 1, which details the datasets used in the evaluation, organized by aspect and task.

Relevance: Table 1 is essential for understanding the scope and methodology of the study. It provides a comprehensive overview of the datasets used, linking them to the three aspects being evaluated (understanding, reasoning, and multilinguality). This allows readers to assess the breadth and relevance of the evaluation.

Critique
Visual Aspects
  • The table is well-organized, with clear headings and consistent formatting.
  • The use of asterisks to denote newly constructed datasets is helpful.
  • The table could benefit from a clearer visual separation between the three aspects (Understanding, Reasoning, Multilinguality) for easier navigation.
Analytical Aspects
  • The table provides valuable information about the datasets, including their descriptions and associated metrics.
  • The categorization of datasets by aspect and task helps structure the evaluation and allows for comparisons across different capabilities.
  • The table could be further enhanced by including the number of instances or data points in each dataset to give a better sense of their size and complexity.
Numeric Data
  • Number of datasets for Understanding: 14
  • Number of datasets for Reasoning: 19
  • Number of datasets for Multilinguality: 4
  • Total number of datasets: 37

Experiments

Overview

This section details the experimental setup, including prompting strategies and the models evaluated, and presents the main result: o1's improved clinical understanding and reasoning. o1 outperforms the other models, particularly in knowledge understanding and diagnostic reasoning scenarios, suggesting a step closer to an "AI doctor". The section also sets the stage for further analysis by noting additional observations and potential limitations.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 2

Table 2 presents the Accuracy (Acc.) and F1 results for several models (o1, GPT-4, GPT-3.5, Meditron-70B, Llama3-8B) across four tasks related to two aspects: Understanding (Concept Recognition) and Reasoning (Clinical Decision Support, Knowledge QA, Medical Calculation). The table highlights o1's results with a gray background and includes average scores for each metric. o1 generally outperforms other models, especially in Knowledge QA and Medical Calculation tasks. For example, o1 achieves 72.6% average F1 score in Concept Recognition, 74.9% average accuracy in Clinical Decision Support, and 84.8% average accuracy in Knowledge QA.

First Mention

Text: "Results presented in Table 2 demonstrate that o1 outperforms other models on the understanding aspect in most clinical tasks."

Context: This sentence, located in the 'Main Result' subsection, introduces Table 2 to support the claim of o1's superior performance in understanding and reasoning tasks.

Relevance: Table 2 directly supports the main claim of the section by providing quantitative evidence of o1's performance in various medical tasks. It highlights o1's strengths in knowledge understanding and reasoning, contributing to the overall argument that the model is a step closer to an 'AI doctor'.

Critique
Visual Aspects
  • The table is well-organized and easy to read, with clear headings for aspects, tasks, datasets, and models.
  • The gray background effectively highlights o1's results.
  • The use of abbreviations (Acc., F1) might require readers to refer back to earlier sections for clarification.
Analytical Aspects
  • The table provides a good overview of model performance across different tasks, but it lacks statistical significance testing to confirm the observed differences.
  • The inclusion of average scores is helpful for overall comparison, but it might obscure performance variations across individual datasets.
  • The choice of specific datasets within each task and their representativeness of the broader medical domain could be further discussed.
Numeric Data
  • o1 Average F1 (Concept Recognition): 72.6 %
  • o1 Average Accuracy (Clinical Decision Support): 74.9 %
  • o1 Average Accuracy (Knowledge QA): 84.8 %
  • o1 Accuracy (MedCalc-Bench): 34.9 %
  • GPT-4 Accuracy (MedCalc-Bench): 25.5 %
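
For reference, this is a minimal sketch of how the accuracy and F1 figures reported in Table 2 could be computed for multiple-choice predictions with scikit-learn. The labels are invented toy data, and the choice of macro averaging for F1 is an assumption, since the exact variant is not stated in this summary.

  # Minimal sketch: accuracy and F1 for multiple-choice predictions (toy labels).
  from sklearn.metrics import accuracy_score, f1_score

  y_true = ["A", "C", "B", "D", "A", "B"]   # gold answers (invented)
  y_pred = ["A", "C", "B", "A", "A", "C"]   # model predictions (invented)

  acc = accuracy_score(y_true, y_pred)
  f1 = f1_score(y_true, y_pred, average="macro")   # averaging mode is an assumption

  print(f"Accuracy: {acc:.1%}, macro-F1: {f1:.1%}")
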
Table 3

Table 3 presents the BLEU-1 (B-1) and ROUGE-1 (R-1) results for the same models as in Table 2, focusing on three tasks related to two aspects: Understanding (Text Summary, Concept Recognition) and Reasoning (Clinical Decision Support). o1's results are highlighted. o1 shows competitive performance in text summarization and concept recognition, achieving average ROUGE-1 scores of 31.4% and 32.5%, respectively. For example, in Text Summary, o1 achieves an average ROUGE-1 score of 31.4%, compared to 29.0% for GPT-4 and 27.7% for GPT-3.5.

First Mention

Text: "Additionally, on the summarization task in Table 3, o1 achieves a 2.4% and 3.7% increase in ROUGE-1 score over GPT-4 and GPT-3.5 (i.e., 31.4% vs. 29.0% vs. 27.7%), demonstrating its enhanced capacity for real-world clinical understanding."

Context: This sentence follows the discussion of Table 2 and introduces Table 3 to further demonstrate o1's enhanced capacity for clinical understanding in summarization tasks.

Relevance: Table 3 complements Table 2 by providing additional evidence of o1's performance using different metrics (BLEU-1 and ROUGE-1) commonly used for evaluating text generation tasks. This further supports the section's argument about o1's enhanced clinical understanding and its potential as an 'AI doctor'.

Critique
Visual Aspects
  • The table is well-structured, similar to Table 2, with clear headings and highlighting for o1's results.
  • The inclusion of both BLEU-1 and ROUGE-1 provides a more comprehensive evaluation.
  • The upward arrows next to the metric names (B-1 ↑, R-1 ↑) clearly indicate that higher scores are better.
Analytical Aspects
  • While the table shows numerical differences between models, it doesn't include statistical significance testing, making it difficult to draw strong conclusions about the superiority of one model over another.
  • The average scores are helpful for overall comparison but might mask performance variations across individual datasets.
  • The choice of BLEU-1 and ROUGE-1 could be further justified, and other relevant metrics for text generation could be considered.
Numeric Data
  • o1 Average ROUGE-1 (Text Summary): 31.4 %
  • GPT-4 Average ROUGE-1 (Text Summary): 29.0 %
  • GPT-3.5 Average ROUGE-1 (Text Summary): 27.7 %
  • o1 Average ROUGE-1 (Concept Recognition): 32.5 %
  • o1 Average ROUGE-1 (Clinical Decision Support): 24.4 %
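
To make the metrics in Table 3 concrete, here is a minimal sketch of computing BLEU-1 and ROUGE-1 for a single reference/candidate pair, assuming the nltk and rouge-score packages. The texts are toy examples, and whether the paper reports ROUGE-1 recall or F-measure is not specified in this summary, so the F-measure below is an assumption.

  # Minimal sketch: BLEU-1 and ROUGE-1 for one reference/candidate pair (toy texts).
  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
  from rouge_score import rouge_scorer

  reference = "the patient presented with fever and a maculopapular rash"
  candidate = "the patient had fever and a rash"

  # BLEU-1: all n-gram weight on unigrams.
  bleu1 = sentence_bleu(
      [reference.split()], candidate.split(),
      weights=(1.0, 0, 0, 0),
      smoothing_function=SmoothingFunction().method1,
  )

  # ROUGE-1 via Google's rouge-score package (F-measure shown).
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
  rouge1 = scorer.score(reference, candidate)["rouge1"].fmeasure

  print(f"BLEU-1: {bleu1:.3f}  ROUGE-1: {rouge1:.3f}")
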
Table 4

Table 4 presents the AlignScore and Mauve results for the same models across three tasks and two aspects, similar to Table 3. Higher AlignScore indicates better factual consistency, while higher Mauve indicates better distribution match with human-written text. o1 shows some improvements in Mauve compared to other models, but its AlignScore is generally lower than GPT-4's, suggesting potential issues with hallucination. For example, in Text Summary, o1's average AlignScore is 20.3, while GPT-4's is 21.6.

First Mention

Text: "We use AlignScore (Zha et al., 2023) to evaluate hallucination in LLMs. In Table 4, the o1 model demonstrates a 1.3% decrease in AlignScore compared to GPT-4 across five text summarization datasets."

Context: This sentence, appearing in the 'Further Analysis' subsection, introduces Table 4 to discuss the issue of hallucination in o1, using AlignScore as a metric.

Relevance: Table 4 is relevant because it addresses the important issue of hallucination, a known weakness of LLMs. By presenting AlignScore and Mauve results, it provides insights into the factual consistency and distribution match of the generated text, adding a crucial dimension to the evaluation of o1's performance.

Critique
Visual Aspects
  • The table is well-organized, consistent with the previous tables, making it easy to compare results across different models and metrics.
  • The upward arrows next to AlignScore and Mauve clearly indicate that higher values are desirable.
  • The table could benefit from visually separating the Understanding and Reasoning aspects for clearer navigation.
Analytical Aspects
  • The table provides valuable data on hallucination and distribution match, but it lacks statistical significance testing.
  • The average scores provide an overview but may mask variations across individual datasets.
  • The choice of AlignScore and Mauve could be further discussed, and other relevant metrics for evaluating hallucination and text quality could be considered.
Numeric Data
  • o1 Average AlignScore (Text Summary): 20.3
  • GPT-4 Average AlignScore (Text Summary): 21.6
  • o1 Average Mauve (Text Summary): 14.8
  • GPT-4 Average Mauve (Text Summary): 17.9
  • o1 Average AlignScore (Concept Recognition): 26.5
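
For orientation, the sketch below shows how a Mauve score of the kind reported in Table 4 could be computed with the mauve-text package; in practice one would pass the full sets of model outputs and human references rather than the two toy sentences used here. AlignScore would be computed analogously with its released package, whose interface is not reproduced here.

  # Minimal sketch: Mauve between model outputs and human-written references.
  # Assumes `pip install mauve-text`; the texts below are toy examples.
  import mauve

  human_texts = [
      "The infant presented with a desquamating rash and hepatosplenomegaly.",
      "Serologic testing confirmed the suspected congenital infection.",
  ]
  model_texts = [
      "The two-month-old had a peeling rash and an enlarged liver and spleen.",
      "Serology confirmed the diagnosis of a congenital infection.",
  ]

  # compute_mauve embeds both text sets with a pretrained LM and compares their
  # distributions; higher Mauve indicates a closer match to human-written text.
  out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts)
  print(f"Mauve: {out.mauve:.3f}")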
Figure 4

Figure 4 compares the answers and reasoning provided by o1 and GPT-4 on a question from the LancetQA dataset. The question involves diagnosing a rash in a two-month-old infant. o1 correctly identifies Congenital Syphilis, while GPT-4 incorrectly chooses Congenital Candidiasis. o1's reasoning is shorter and more concise, while GPT-4's explanation includes hallucinated information.

First Mention

Text: "In addition to delivering higher accuracy, o1 provides more concise and straightforward answers. In the example illustrated in Figure 4, o1 generates shorter interpretations while offering the correct answer."

Context: This sentence, following the discussion of o1's strong reasoning abilities, introduces Figure 4 to provide a specific example comparing o1 and GPT-4's responses on a diagnostic question.

Relevance: Figure 4 strengthens the section's argument about o1's superior reasoning abilities by providing a qualitative example. It visually demonstrates o1's concise and accurate reasoning compared to GPT-4's incorrect answer and hallucinated explanation, further supporting the claim that o1 is getting closer to an 'AI doctor'.

Critique
Visual Aspects
  • The figure clearly presents the question, options, and responses of both models.
  • The use of a checkmark for o1 and a cross for GPT-4 clearly indicates the correct and incorrect answers.
  • The figure could benefit from highlighting the specific parts of GPT-4's response that are considered hallucinations.
  • The figure could also be improved by including the full text of the question instead of just a summary.
Analytical Aspects
  • The figure provides a compelling example of o1's accurate and concise reasoning in a clinical scenario.
  • The comparison with GPT-4 highlights the potential for hallucination in other LLMs.
  • The figure could be strengthened by providing additional examples from different datasets to demonstrate the generalizability of o1's performance, along with more context about the LancetQA dataset and its characteristics.
Table 5

Table 5 compares the accuracy of o1, GPT-4, and GPT-3.5 on two agentic benchmarks: AI Hospital and AgentClinic. AI Hospital tasks include Symptom identification, Medical Examination, Diagnostic Results, Diagnostic Rationales, and Treatment Plan. AgentClinic includes MedQA and NEJM subsets. o1 generally performs well, but GPT-4 outperforms it on certain AI Hospital tasks.

First Mention

Text: "In more complex reasoning scenarios that involve multi-turn conversations and environmental simulations, o1 outperforms both GPT-4 and GPT-3.5 on the AgentClinic benchmark, achieving accuracy gains of at least 15.5% and 10% with scores of 45.5% and 20.0% on its MedQA and NEJM subsets, respectively."

Context: This paragraph discusses o1's performance in complex reasoning scenarios using agentic benchmarks, referencing Table 5 for detailed results.

Relevance: Table 5 provides quantitative evidence of o1's performance in complex, multi-turn medical scenarios, which are crucial for real-world clinical applications. It directly relates to the section's focus on evaluating o1's reasoning abilities.

Critique
Visual Aspects
  • The table is clear and easy to read, with distinct rows for each model and columns for each task.
  • The use of percentages makes it easy to compare performance across models and tasks.
  • The table could benefit from highlighting the best-performing model for each task.
Analytical Aspects
  • The table provides a valuable comparison across different models and agentic tasks.
  • The inclusion of both AI Hospital and AgentClinic benchmarks offers a broader perspective on agentic performance.
  • The table could be strengthened by providing more details about the tasks within each benchmark and the specific metrics used to calculate accuracy.
Numeric Data
  • o1 accuracy on AI Hospital (Symptom): 67.0 %
  • GPT-4 accuracy on AI Hospital (Symptom): 66.7 %
  • o1 accuracy on AgentClinic (MedQA): 45.5 %
  • GPT-4 accuracy on AgentClinic (MedQA): 30.4 %
  • o1 accuracy on AgentClinic (NEJM): 20.0 %
  • GPT-4 accuracy on AgentClinic (NEJM): 10.0 %
Table 6

Table 6 presents the accuracy of o1 and GPT-4, with and without Chain-of-Thought (CoT) prompting, on five knowledge QA datasets: PubMedQA, MedQA, MedMCQA, LancetQA, and NEJMQA. The results show that CoT prompting improves the performance of both models, even though o1 is trained with CoT data.

First Mention

Text: "o1 was released using chain-of-thought (CoT) data embedding in the training process; however, we found that applying the CoT prompting still enhances o1’s performance on knowledge QA tasks in medicine, as shown in Table 6."

Context: This paragraph discusses the impact of CoT prompting on o1 and GPT-4's performance, referring to Table 6 for detailed results.

Relevance: Table 6 directly addresses the question of whether CoT prompting benefits models already trained with CoT data. This is relevant to the section's focus on analyzing o1's performance and the impact of different prompting strategies.

Critique
Visual Aspects
  • The table clearly presents the accuracy scores for each model and prompting condition.
  • The inclusion of citations for each dataset enhances credibility.
  • The table could be improved by visually highlighting the performance gains from CoT prompting.
Analytical Aspects
  • The table demonstrates that CoT prompting can still improve performance even for models trained on CoT data.
  • The results suggest that explicit CoT prompting can complement the model's implicit CoT training, yielding further gains in certain cases.
  • The table could be strengthened by including statistical significance tests to assess the reliability of the observed improvements.
Numeric Data
  • o1 accuracy on PubMedQA (no CoT): 75.0 %
  • o1 accuracy on PubMedQA (CoT): 75.2 %
  • GPT-4 accuracy on PubMedQA (no CoT): 52.8 %
  • GPT-4 accuracy on PubMedQA (CoT): 62.2 %
  • o1 accuracy on MedQA (no CoT): 95.0 %
  • o1 accuracy on MedQA (CoT): 95.2 %
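
To make the with/without-CoT comparison behind Table 6 concrete, here is a minimal evaluation-loop sketch using the OpenAI Python client. The prompt wording, the placeholder model name, and the answer-extraction heuristic are assumptions for illustration, not the paper's protocol.

  # Minimal sketch: accuracy with and without a CoT instruction (assumptions noted above).
  import re
  from openai import OpenAI

  client = OpenAI()      # reads OPENAI_API_KEY from the environment
  MODEL = "gpt-4o"       # placeholder; substitute the model under evaluation

  def ask(question: str, options: str, use_cot: bool) -> str:
      suffix = ("Let's think step by step, then end with 'Answer: <letter>'."
                if use_cot else "Reply with only the option letter.")
      resp = client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content": f"{question}\n{options}\n{suffix}"}],
      )
      text = resp.choices[0].message.content or ""
      letters = re.findall(r"\b([A-D])\b", text)
      return letters[-1] if letters else ""   # take the last option letter mentioned

  def accuracy(items: list[dict], use_cot: bool) -> float:
      # items: [{"question": ..., "options": ..., "answer": "B"}, ...]
      correct = sum(ask(it["question"], it["options"], use_cot) == it["answer"] for it in items)
      return correct / len(items)

  # Usage: compare accuracy(items, use_cot=False) against accuracy(items, use_cot=True).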

Discussion

Overview

This discussion section analyzes the implications of o1's performance in the medical domain, highlighting both its potential benefits and drawbacks. It raises concerns about practical drawbacks of o1, such as increased decoding time and inconsistent performance across tasks. The section also emphasizes the need to rethink evaluation metrics for stronger LLMs, advocating for more robust measures beyond traditional metrics like BLEU and ROUGE. Finally, it calls for more reliable prompting techniques adapted to the evolving internal prompting strategies of future LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Conclusion

Overview

This conclusion summarizes the study's findings on OpenAI's o1 model in the medical domain, emphasizing its progress towards an "AI doctor." It evaluated o1 across three key aspects (understanding, reasoning, and multilingual capabilities) using 37 medical datasets, including two novel ones. The study concludes that o1 shows promise in bridging the gap between AI and human doctors.

Key Aspects

Strengths

Suggestions for Improvement
