Evaluation of OpenAI's o1 Large Language Model in the Medical Domain

Overall Summary

Overview

This study evaluates the performance of OpenAI's o1, a large language model with internalized chain-of-thought, in the medical field. Using 37 medical datasets, including two newly constructed question-answering datasets (NEJMQA and LancetQA) built from professional medical quizzes in the New England Journal of Medicine and The Lancet, the study assesses o1's understanding, reasoning, and multilingual capabilities. The research aims to address limitations in existing medical LLM benchmarks and to provide a comprehensive evaluation of o1's potential in clinical applications.

Key Findings

Strengths

Areas for Improvement

Significant Elements

Table 2

Description: Presents accuracy and F1 results for o1 and other LLMs across various medical tasks, demonstrating o1's superior performance in Knowledge QA and Medical Calculation.

Relevance: Provides key quantitative evidence supporting o1's enhanced clinical understanding and reasoning.

Figure 4

Description: Compares o1 and GPT-4's responses on a LancetQA diagnostic question, visually showcasing o1's accurate and concise reasoning compared to GPT-4's hallucination and incorrect answer.

Relevance: Illustrates o1's superior reasoning ability in a real-world clinical scenario.

Conclusion

This study demonstrates the potential of OpenAI's o1 model as a step towards an "AI doctor." The evaluation across multiple medical datasets reveals o1's enhanced understanding and reasoning capabilities, showing promise for bridging the gap between AI and human doctors. However, further research is needed to address limitations such as hallucination and to develop more robust evaluation metrics for advanced LLMs in the medical domain. Future work should explore areas like safety, retrieval augmented generation, and the broader implications of integrating such models into healthcare practices.

Section Analysis

Abstract

Overview

This abstract introduces a study evaluating OpenAI's o1, a large language model with internalized chain-of-thought, in the medical field. It focuses on o1's understanding, reasoning, and multilingual capabilities across 37 medical datasets, including new question-answering tasks based on professional medical quizzes. The study finds that o1 shows promise in medical applications, outperforming GPT-4, but also identifies weaknesses like hallucination and inconsistent multilingual abilities.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1

Figure 1 is a radar chart comparing the performance of five large language models (LLMs) on twelve medical datasets. The LLMs compared are o1, GPT-4, GPT-3.5, Meditron-70B, and Llama3-8B. The datasets include PubMedQA, MedQA, NEJMQA, MedMCQA, LancetQA, Medbullets, PUBHEALTH Ver., MedNLI-Dis, MedCalc-Bench, MIMIC4ED, PICO, and MedBench. The chart suggests o1 generally outperforms the other models on most datasets.

First Mention

Text: "Figure 1: Overall results of o1 and other 4 strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over close- and open-source models."

Context: This figure caption appears alongside the abstract and introduces the first figure, which summarizes the overall performance comparison of the different LLMs.

Relevance: This figure is highly relevant to the abstract as it visually summarizes the main finding of the study: o1's superior performance on a range of medical datasets, supporting the claim that it advances the development of an 'AI doctor'.

Critique
Visual Aspects
  • The radar chart format allows for easy comparison across multiple datasets simultaneously.
  • The color-coding helps differentiate the models, but several lines lie close together and can be hard to tell apart.
  • The specific performance metric (presumably accuracy) is not labeled on the chart itself, although it is mentioned in the caption and surrounding text.
Analytical Aspects
  • The figure shows a general trend of o1's superior performance, but specific numerical values are difficult to discern from the chart itself.
  • The caption mentions a 'clear performance advantage', but no statistical tests are mentioned to support this claim.
  • The choice of 12 datasets and their representativeness of the medical domain could be further discussed.
Numeric Data
  • Number of LLMs compared: 5
  • Number of medical datasets: 12
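
For readers who want to reproduce this kind of comparison, the sketch below shows one way a radar chart like Figure 1 could be drawn with matplotlib. The dataset names follow the figure, but the accuracy values and the choice of two models are placeholders for illustration, not the paper's results.

  # Minimal radar-chart sketch; the scores below are placeholders, not the paper's numbers.
  import numpy as np
  import matplotlib.pyplot as plt

  datasets = ["PubMedQA", "MedQA", "NEJMQA", "MedMCQA", "LancetQA", "Medbullets"]
  scores = {                      # hypothetical per-dataset accuracies (%)
      "o1":    [75, 90, 80, 83, 85, 77],
      "GPT-4": [55, 74, 64, 72, 70, 60],
  }

  angles = np.linspace(0, 2 * np.pi, len(datasets), endpoint=False).tolist()
  angles += angles[:1]            # repeat the first angle to close the polygon

  fig, ax = plt.subplots(subplot_kw={"polar": True})
  for model, vals in scores.items():
      vals = vals + vals[:1]      # close the polygon for this model
      ax.plot(angles, vals, label=model)
      ax.fill(angles, vals, alpha=0.1)

  ax.set_xticks(angles[:-1])
  ax.set_xticklabels(datasets)
  ax.set_ylim(0, 100)
  ax.legend(loc="upper right")
  plt.show()
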
Figure 2

Figure 2 is a bar chart showing the average accuracy of the same five LLMs (o1, GPT-4, GPT-3.5, Meditron-70B, and Llama3-8B) across 19 medical datasets. o1 achieves the highest average accuracy of 74.3%.

First Mention

Text: "Figure 2: Average accuracy of o1 and other 4 strong LLMs. o1 achieves the highest average accuracy of 74.3% across 19 medical datasets."

Context: This sentence, appearing immediately after the mention of Figure 1, introduces the second figure focusing on average accuracy across a wider range of datasets.

Relevance: This figure complements Figure 1 by providing a summary statistic (average accuracy) across a larger set of datasets, further strengthening the claim of o1's superior performance and its relevance to the abstract's main point.

Critique
Visual Aspects
  • The bar chart is clear and easy to interpret, with the model achieving the highest accuracy clearly visible.
  • The use of different colors and icons for each LLM enhances visual distinction.
  • The y-axis is clearly labeled with 'Accuracy' and percentage values.
Analytical Aspects
  • While the average accuracy is provided for o1, the average accuracies for other models are not explicitly stated in the caption.
  • The figure doesn't show the individual dataset accuracies that contribute to the average, limiting a deeper understanding of performance variations.
  • No statistical tests are mentioned to assess the significance of the differences in average accuracy.
Numeric Data
  • Average accuracy of o1: 74.3 %
  • Number of LLMs compared: 5
  • Number of medical datasets: 19
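
As a small aside on how a headline number like the 74.3% average is typically obtained, the sketch below takes an unweighted mean over per-dataset accuracies. The dataset scores shown are illustrative, and the assumption of an unweighted (macro) average is ours; the caption does not state the weighting.

  # Minimal sketch: unweighted average of per-dataset accuracies (illustrative values).
  per_dataset_accuracy = {
      "PubMedQA": 0.750,   # hypothetical scores, not the paper's numbers
      "MedQA":    0.900,
      "MedMCQA":  0.830,
      # ... one entry per evaluated dataset (19 datasets in Figure 2)
  }

  average_accuracy = sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)
  print(f"Average accuracy: {average_accuracy:.1%}")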

Introduction

Overview

This introduction sets the stage for a study evaluating the OpenAI o1 model's performance in medicine. It discusses the concept of general intelligence, the progress of LLMs, and the limitations of current medical LLM benchmarks. The paper aims to address this gap by evaluating o1's understanding, reasoning, and multilingual capabilities in medicine using existing and newly created datasets.

Key Aspects

Strengths

Suggestions for Improvement

Related Works

Overview

This section discusses prior work on large language models (LLMs) with enhanced reasoning abilities, particularly in the medical domain. It highlights the development of LLMs with chain-of-thought (CoT) reasoning and reinforcement learning, and their application to medical tasks. The section also emphasizes the need for comprehensive evaluation of LLMs in medicine, considering understanding, reasoning, and multilinguality.

Key Aspects

Strengths

Suggestions for Improvement

Evaluation Pipeline

Overview

This section details the evaluation pipeline used in the study, outlining the taxonomy of evaluations, the aspects and tasks involved, and the metrics employed. The pipeline assesses language models across understanding, reasoning, and multilinguality using a variety of medical datasets and prompting strategies. The section emphasizes the comprehensive nature of the evaluation, aiming to provide a holistic view of LLM performance in the medical domain.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 3

Figure 3 illustrates the evaluation pipeline used in the study. It outlines the aspects of model capabilities (understanding, reasoning, and multilinguality), the prompt strategies (direct prompting, chain-of-thought, few-shot prompting), the language models used (o1, GPT-3.5, GPT-4, Meditron, Llama3), and the evaluation metrics (Prediction Accuracy, Hallucination, Tendency, and Free-form Text Generation Capability). The diagram shows a flow from aspects and tasks to prompt strategies, then to language models, and finally to evaluation. A brief sketch of the three prompt styles appears after this entry.

First Mention

Text: "First, we present the taxonomy of our evaluation, along with an overview of the evaluation pipeline as shown in Figure 3."

Context: This sentence appears in the 'Overall Taxonomy of Evaluations' subsection, introducing the figure that visually represents the evaluation pipeline.

Relevance: Figure 3 is crucial for understanding the methodology of the study. It visually outlines the entire evaluation process, connecting the aspects being evaluated, the prompting strategies, the models used, and the metrics employed. This provides a clear framework for the experiments and results presented later in the paper.

Critique
Visual Aspects
  • The flow diagram format clearly illustrates the steps in the evaluation pipeline.
  • The use of distinct boxes for each stage and arrows to indicate the flow enhances readability.
  • The figure could benefit from a brief explanation of how 'Tendency' is measured as an evaluation metric.
Analytical Aspects
  • The pipeline logically connects the different components of the evaluation, providing a structured approach to assessing LLM performance.
  • The inclusion of different prompting strategies allows for a more comprehensive evaluation of the models' capabilities.
  • The figure could be strengthened by providing a more precise definition of 'Free-form Text Generation Capability' and how it is evaluated.
Numeric Data
  • Number of aspects: 3
  • Number of prompting strategies: 3
  • Number of language models: 5
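
As referenced above, the following sketch illustrates what the three prompt strategies named in Figure 3 typically look like for a multiple-choice medical question. The template wording and the example question are hypothetical and are not taken from the paper.

  # Hypothetical templates for the three prompt strategies in Figure 3.
  QUESTION = ("A two-month-old infant presents with a desquamating rash. "
              "What is the most likely diagnosis?")
  OPTIONS = "A) Congenital candidiasis  B) Congenital syphilis  C) Atopic dermatitis  D) Scabies"

  # Direct prompting: ask for the answer with no intermediate reasoning.
  direct_prompt = f"{QUESTION}\n{OPTIONS}\nAnswer with a single option letter."

  # Chain-of-thought prompting: elicit step-by-step reasoning before the answer.
  cot_prompt = f"{QUESTION}\n{OPTIONS}\nLet's think step by step, then give the final option letter."

  # Few-shot prompting: prepend worked examples before the target question.
  few_shot_prompt = (
      "Q: <example question 1>\nA: <worked answer 1>\n\n"
      "Q: <example question 2>\nA: <worked answer 2>\n\n"
      f"Q: {QUESTION}\n{OPTIONS}\nA:"
  )
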
Table 1

Table 1 provides a detailed list of the 37 datasets used in the evaluation, categorized by three fundamental aspects: Understanding (Concept Recognition, Text Summary), Reasoning (Knowledge QA, Clinical Decision Support, Agent, Medical Calculation), and Multilinguality (Knowledge QA, Agent). For each dataset, the table lists the task, a brief description, and the metrics used for evaluation. Asterisks mark newly constructed datasets.

First Mention

Text: "In Table 1, our evaluation efforts are structured into three main parts: aspect, task, and dataset."

Context: This sentence, located in the 'Aspects and Tasks' subsection, introduces Table 1, which details the datasets used in the evaluation, organized by aspect and task.

Relevance: Table 1 is essential for understanding the scope and methodology of the study. It provides a comprehensive overview of the datasets used, linking them to the three aspects being evaluated (understanding, reasoning, and multilinguality). This allows readers to assess the breadth and relevance of the evaluation.

Critique
Visual Aspects
  • The table is well-organized, with clear headings and consistent formatting.
  • The use of asterisks to denote newly constructed datasets is helpful.
  • The table could benefit from a clearer visual separation between the three aspects (Understanding, Reasoning, Multilinguality) for easier navigation.
Analytical Aspects
  • The table provides valuable information about the datasets, including their descriptions and associated metrics.
  • The categorization of datasets by aspect and task helps structure the evaluation and allows for comparisons across different capabilities.
  • The table could be further enhanced by including the number of instances or data points in each dataset to give a better sense of their size and complexity.
Numeric Data
  • Number of datasets for Understanding: 14
  • Number of datasets for Reasoning: 19
  • Number of datasets for Multilinguality: 4
  • Total number of datasets: 37

Experiments

Overview

This section details the experimental setup, including prompting strategies and the models evaluated, and presents the main result: o1's improved clinical understanding and reasoning. o1 outperforms the other models, particularly in knowledge understanding and diagnostic reasoning scenarios, suggesting a step closer to an "AI doctor". The section also sets the stage for further analysis by noting additional observations and potential limitations.

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Table 2

Table 2 presents the Accuracy (Acc.) and F1 results for several models (o1, GPT-4, GPT-3.5, Meditron-70B, Llama3-8B) across four tasks related to two aspects: Understanding (Concept Recognition) and Reasoning (Clinical Decision Support, Knowledge QA, Medical Calculation). The table highlights o1's results with a gray background and includes average scores for each metric. o1 generally outperforms other models, especially in Knowledge QA and Medical Calculation tasks. For example, o1 achieves 72.6% average F1 score in Concept Recognition, 74.9% average accuracy in Clinical Decision Support, and 84.8% average accuracy in Knowledge QA.

First Mention

Text: "Results presented in Table 2 demonstrate that o1 outperforms other models on the understanding aspect in most clinical tasks."

Context: This sentence, located in the 'Main Result' subsection, introduces Table 2 to support the claim of o1's superior performance in understanding and reasoning tasks.

Relevance: Table 2 directly supports the main claim of the section by providing quantitative evidence of o1's performance in various medical tasks. It highlights o1's strengths in knowledge understanding and reasoning, contributing to the overall argument that the model is a step closer to an 'AI doctor'.

Critique
Visual Aspects
  • The table is well-organized and easy to read, with clear headings for aspects, tasks, datasets, and models.
  • The gray background effectively highlights o1's results.
  • The use of abbreviations (Acc., F1) might require readers to refer back to earlier sections for clarification.
Analytical Aspects
  • The table provides a good overview of model performance across different tasks, but it lacks statistical significance testing to confirm the observed differences.
  • The inclusion of average scores is helpful for overall comparison, but it might obscure performance variations across individual datasets.
  • The choice of specific datasets within each task and their representativeness of the broader medical domain could be further discussed.
Numeric Data
  • o1 Average F1 (Concept Recognition): 72.6 %
  • o1 Average Accuracy (Clinical Decision Support): 74.9 %
  • o1 Average Accuracy (Knowledge QA): 84.8 %
  • o1 Accuracy (MedCalc-Bench): 34.9 %
  • GPT-4 Accuracy (MedCalc-Bench): 25.5 %
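
For reference, this is a minimal sketch of how the accuracy and F1 figures reported in Table 2 could be computed for multiple-choice predictions with scikit-learn. The labels are invented toy data, and the choice of macro averaging for F1 is an assumption, since the exact variant is not stated in this summary.

  # Minimal sketch: accuracy and F1 for multiple-choice predictions (toy labels).
  from sklearn.metrics import accuracy_score, f1_score

  y_true = ["A", "C", "B", "D", "A", "B"]   # gold answers (invented)
  y_pred = ["A", "C", "B", "A", "A", "C"]   # model predictions (invented)

  acc = accuracy_score(y_true, y_pred)
  f1 = f1_score(y_true, y_pred, average="macro")   # averaging mode is an assumption

  print(f"Accuracy: {acc:.1%}, macro-F1: {f1:.1%}")
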
Table 3

Table 3 presents the BLEU-1 (B-1) and ROUGE-1 (R-1) results for the same models as in Table 2, focusing on three tasks related to two aspects: Understanding (Text Summary, Concept Recognition) and Reasoning (Clinical Decision Support). o1's results are highlighted. o1 shows competitive performance in text summarization and concept recognition, achieving average ROUGE-1 scores of 31.4% and 32.5%, respectively. For example, in Text Summary, o1 achieves an average ROUGE-1 score of 31.4%, compared to 29.0% for GPT-4 and 27.7% for GPT-3.5.

First Mention

Text: "Additionally, on the summarization task in Table 3, o1 achieves a 2.4% and 3.7% increase in ROUGE-1 score over GPT-4 and GPT-3.5 (i.e., 31.4% vs. 29.0% vs. 27.7%), demonstrating its enhanced capacity for real-world clinical understanding."

Context: This sentence follows the discussion of Table 2 and introduces Table 3 to further demonstrate o1's enhanced capacity for clinical understanding in summarization tasks.

Relevance: Table 3 complements Table 2 by providing additional evidence of o1's performance using different metrics (BLEU-1 and ROUGE-1) commonly used for evaluating text generation tasks. This further supports the section's argument about o1's enhanced clinical understanding and its potential as an 'AI doctor'.

Critique
Visual Aspects
  • The table is well-structured, similar to Table 2, with clear headings and highlighting for o1's results.
  • The inclusion of both BLEU-1 and ROUGE-1 provides a more comprehensive evaluation.
  • The upward arrows next to the metric names (B-1 ↑, R-1 ↑) clearly indicate that higher scores are better.
Analytical Aspects
  • While the table shows numerical differences between models, it doesn't include statistical significance testing, making it difficult to draw strong conclusions about the superiority of one model over another.
  • The average scores are helpful for overall comparison but might mask performance variations across individual datasets.
  • The choice of BLEU-1 and ROUGE-1 could be further justified, and other relevant metrics for text generation could be considered.
Numeric Data
  • o1 Average ROUGE-1 (Text Summary): 31.4 %
  • GPT-4 Average ROUGE-1 (Text Summary): 29.0 %
  • GPT-3.5 Average ROUGE-1 (Text Summary): 27.7 %
  • o1 Average ROUGE-1 (Concept Recognition): 32.5 %
  • o1 Average ROUGE-1 (Clinical Decision Support): 24.4 %
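
To make the metrics in Table 3 concrete, here is a minimal sketch of computing BLEU-1 and ROUGE-1 for a single reference/candidate pair, assuming the nltk and rouge-score packages. The texts are toy examples, and whether the paper reports ROUGE-1 recall or F-measure is not specified in this summary, so the F-measure below is an assumption.

  # Minimal sketch: BLEU-1 and ROUGE-1 for one reference/candidate pair (toy texts).
  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
  from rouge_score import rouge_scorer

  reference = "the patient presented with fever and a maculopapular rash"
  candidate = "the patient had fever and a rash"

  # BLEU-1: all n-gram weight on unigrams.
  bleu1 = sentence_bleu(
      [reference.split()], candidate.split(),
      weights=(1.0, 0, 0, 0),
      smoothing_function=SmoothingFunction().method1,
  )

  # ROUGE-1 via Google's rouge-score package (F-measure shown).
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
  rouge1 = scorer.score(reference, candidate)["rouge1"].fmeasure

  print(f"BLEU-1: {bleu1:.3f}  ROUGE-1: {rouge1:.3f}")
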
Table 4

Table 4 presents the AlignScore and Mauve results for the same models across three tasks and two aspects, similar to Table 3. Higher AlignScore indicates better factual consistency, while higher Mauve indicates better distribution match with human-written text. o1 shows some improvements in Mauve compared to other models, but its AlignScore is generally lower than GPT-4's, suggesting potential issues with hallucination. For example, in Text Summary, o1's average AlignScore is 20.3, while GPT-4's is 21.6.

First Mention

Text: "We use AlignScore (Zha et al., 2023) to evaluate hallucination in LLMs. In Table 4, the o1 model demonstrates a 1.3% decrease in AlignScore compared to GPT-4 across five text summarization datasets."

Context: This sentence, appearing in the 'Further Analysis' subsection, introduces Table 4 to discuss the issue of hallucination in o1, using AlignScore as a metric.

Relevance: Table 4 is relevant because it addresses the important issue of hallucination, a known weakness of LLMs. By presenting AlignScore and Mauve results, it provides insights into the factual consistency and distribution match of the generated text, adding a crucial dimension to the evaluation of o1's performance.

Critique
Visual Aspects
  • The table is well-organized, consistent with the previous tables, making it easy to compare results across different models and metrics.
  • The upward arrows next to AlignScore and Mauve clearly indicate that higher values are desirable.
  • The table could benefit from visually separating the Understanding and Reasoning aspects for clearer navigation.
Analytical Aspects
  • The table provides valuable data on hallucination and distribution match, but it lacks statistical significance testing.
  • The average scores provide an overview but may mask variations across individual datasets.
  • The choice of AlignScore and Mauve could be further discussed, and other relevant metrics for evaluating hallucination and text quality could be considered.
Numeric Data
  • o1 Average AlignScore (Text Summary): 20.3
  • GPT-4 Average AlignScore (Text Summary): 21.6
  • o1 Average Mauve (Text Summary): 14.8
  • GPT-4 Average Mauve (Text Summary): 17.9
  • o1 Average AlignScore (Concept Recognition): 26.5
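
For orientation, the sketch below shows how a Mauve score of the kind reported in Table 4 could be computed with the mauve-text package; in practice one would pass the full sets of model outputs and human references rather than the two toy sentences used here. AlignScore would be computed analogously with its released package, whose interface is not reproduced here.

  # Minimal sketch: Mauve between model outputs and human-written references.
  # Assumes `pip install mauve-text`; the texts below are toy examples.
  import mauve

  human_texts = [
      "The infant presented with a desquamating rash and hepatosplenomegaly.",
      "Serologic testing confirmed the suspected congenital infection.",
  ]
  model_texts = [
      "The two-month-old had a peeling rash and an enlarged liver and spleen.",
      "Serology confirmed the diagnosis of a congenital infection.",
  ]

  # compute_mauve embeds both text sets with a pretrained LM and compares their
  # distributions; higher Mauve indicates a closer match to human-written text.
  out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts)
  print(f"Mauve: {out.mauve:.3f}")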
Figure 4

Figure 4 compares the answers and reasoning provided by o1 and GPT-4 on a question from the LancetQA dataset. The question involves diagnosing a rash in a two-month-old infant. o1 correctly identifies Congenital Syphilis, while GPT-4 incorrectly chooses Congenital Candidiasis. o1's reasoning is shorter and more concise, while GPT-4's explanation includes hallucinated information.

First Mention

Text: "In addition to delivering higher accuracy, o1 provides more concise and straightforward answers. In the example illustrated in Figure 4, o1 generates shorter interpretations while offering the correct answer."

Context: This sentence, following the discussion of o1's strong reasoning abilities, introduces Figure 4 to provide a specific example comparing o1 and GPT-4's responses on a diagnostic question.

Relevance: Figure 4 strengthens the section's argument about o1's superior reasoning abilities by providing a qualitative example. It visually demonstrates o1's concise and accurate reasoning compared to GPT-4's incorrect answer and hallucinated explanation, further supporting the claim that o1 is getting closer to an 'AI doctor'.

Critique
Visual Aspects
  • The figure clearly presents the question, options, and responses of both models.
  • The use of a checkmark for o1 and a cross for GPT-4 clearly indicates the correct and incorrect answers.
  • The figure could benefit from highlighting the specific parts of GPT-4's response that are considered hallucinations.
  • The figure could also be improved by including the full text of the question instead of just a summary.
Analytical Aspects
  • The figure provides a compelling example of o1's accurate and concise reasoning in a clinical scenario.
  • The comparison with GPT-4 highlights the potential for hallucination in other LLMs.
  • The figure could be strengthened by providing additional examples from different datasets to demonstrate the generalizability of o1's performance, along with more context about the LancetQA dataset and its characteristics.
Table 5

Table 5 compares the accuracy of o1, GPT-4, and GPT-3.5 on two agentic benchmarks: AI Hospital and AgentClinic. AI Hospital tasks include Symptom identification, Medical Examination, Diagnostic Results, Diagnostic Rationales, and Treatment Plan. AgentClinic includes MedQA and NEJM subsets. o1 generally performs well, but GPT-4 outperforms it on certain AI Hospital tasks.

First Mention

Text: "In more complex reasoning scenarios that involve multi-turn conversations and environmental simulations, o1 outperforms both GPT-4 and GPT-3.5 on the AgentClinic benchmark, achieving accuracy gains of at least 15.5% and 10% with scores of 45.5% and 20.0% on its MedQA and NEJM subsets, respectively."

Context: This paragraph discusses o1's performance in complex reasoning scenarios using agentic benchmarks, referencing Table 5 for detailed results.

Relevance: Table 5 provides quantitative evidence of o1's performance in complex, multi-turn medical scenarios, which are crucial for real-world clinical applications. It directly relates to the section's focus on evaluating o1's reasoning abilities.

Critique
Visual Aspects
  • The table is clear and easy to read, with distinct rows for each model and columns for each task.
  • The use of percentages makes it easy to compare performance across models and tasks.
  • The table could benefit from highlighting the best-performing model for each task.
Analytical Aspects
  • The table provides a valuable comparison across different models and agentic tasks.
  • The inclusion of both AI Hospital and AgentClinic benchmarks offers a broader perspective on agentic performance.
  • The table could be strengthened by providing more details about the tasks within each benchmark and the specific metrics used to calculate accuracy.
Numeric Data
  • o1 accuracy on AI Hospital (Symptom): 67.0 %
  • GPT-4 accuracy on AI Hospital (Symptom): 66.7 %
  • o1 accuracy on AgentClinic (MedQA): 45.5 %
  • GPT-4 accuracy on AgentClinic (MedQA): 30.4 %
  • o1 accuracy on AgentClinic (NEJM): 20.0 %
  • GPT-4 accuracy on AgentClinic (NEJM): 10.0 %
Table 6

Table 6 presents the accuracy of o1 and GPT-4, with and without Chain-of-Thought (CoT) prompting, on five knowledge QA datasets: PubMedQA, MedQA, MedMCQA, LancetQA, and NEJMQA. The results show that CoT prompting improves the performance of both models, even though o1 is trained with CoT data.

First Mention

Text: "o1 was released using chain-of-thought (CoT) data embedding in the training process; however, we found that applying the CoT prompting still enhances o1’s performance on knowledge QA tasks in medicine, as shown in Table 6."

Context: This paragraph discusses the impact of CoT prompting on o1 and GPT-4's performance, referring to Table 6 for detailed results.

Relevance: Table 6 directly addresses the question of whether CoT prompting benefits models already trained with CoT data. This is relevant to the section's focus on analyzing o1's performance and the impact of different prompting strategies.

Critique
Visual Aspects
  • The table clearly presents the accuracy scores for each model and prompting condition.
  • The inclusion of citations for each dataset enhances credibility.
  • The table could be improved by visually highlighting the performance gains from CoT prompting.
Analytical Aspects
  • The table demonstrates that CoT prompting can still improve performance even for models trained on CoT data.
  • The results suggest that explicit CoT prompting can complement the model's implicit CoT training, yielding further gains in certain cases.
  • The table could be strengthened by including statistical significance tests to assess the reliability of the observed improvements.
Numeric Data
  • o1 accuracy on PubMedQA (no CoT): 75.0 %
  • o1 accuracy on PubMedQA (CoT): 75.2 %
  • GPT-4 accuracy on PubMedQA (no CoT): 52.8 %
  • GPT-4 accuracy on PubMedQA (CoT): 62.2 %
  • o1 accuracy on MedQA (no CoT): 95.0 %
  • o1 accuracy on MedQA (CoT): 95.2 %
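
To make the with/without-CoT comparison behind Table 6 concrete, here is a minimal evaluation-loop sketch using the OpenAI Python client. The prompt wording, the placeholder model name, and the answer-extraction heuristic are assumptions for illustration, not the paper's protocol.

  # Minimal sketch: accuracy with and without a CoT instruction (assumptions noted above).
  import re
  from openai import OpenAI

  client = OpenAI()      # reads OPENAI_API_KEY from the environment
  MODEL = "gpt-4o"       # placeholder; substitute the model under evaluation

  def ask(question: str, options: str, use_cot: bool) -> str:
      suffix = ("Let's think step by step, then end with 'Answer: <letter>'."
                if use_cot else "Reply with only the option letter.")
      resp = client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content": f"{question}\n{options}\n{suffix}"}],
      )
      text = resp.choices[0].message.content or ""
      letters = re.findall(r"\b([A-D])\b", text)
      return letters[-1] if letters else ""   # take the last option letter mentioned

  def accuracy(items: list[dict], use_cot: bool) -> float:
      # items: [{"question": ..., "options": ..., "answer": "B"}, ...]
      correct = sum(ask(it["question"], it["options"], use_cot) == it["answer"] for it in items)
      return correct / len(items)

  # Usage: compare accuracy(items, use_cot=False) against accuracy(items, use_cot=True).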

Discussion

Overview

This discussion section analyzes the implications of o1's performance in the medical domain, highlighting both its potential benefits and drawbacks. It raises concerns about practical drawbacks of o1, such as increased decoding time and inconsistent performance across tasks. The section also emphasizes the need to rethink evaluation metrics for stronger LLMs, advocating for more robust measures beyond traditional metrics like BLEU and ROUGE. Finally, it calls for more reliable prompting techniques adapted to the evolving internal prompting strategies of future LLMs.

Key Aspects

Strengths

Suggestions for Improvement

Conclusion

Overview

This conclusion summarizes the study's findings on OpenAI's o1 model in the medical domain, emphasizing its progress towards an "AI doctor." It evaluated o1 across three key aspects (understanding, reasoning, and multilingual capabilities) using 37 medical datasets, including two novel ones. The study concludes that o1 shows promise in bridging the gap between AI and human doctors.

Key Aspects

Strengths

Suggestions for Improvement
