Towards conversational diagnostic artificial intelligence

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, Elahe Vedadi, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Le Hou, Albert Webson, Kavita Kulkarni, S. Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S. Corrado, Yossi Matias, Alan Karthikesalingam, Vivek Natarajan
Nature
Google Research, Mountain View, CA, USA

Overall Summary

Study Background and Main Findings

This paper introduces and evaluates AMIE (Articulate Medical Intelligence Explorer), an artificial intelligence system based on large language models (LLMs), specifically optimized for engaging in diagnostic conversations, a core component of medical practice involving complex history-taking and clinical reasoning. Recognizing the limitations of existing medical AI and the difficulty of replicating human clinical dialogue, the researchers developed AMIE using a combination of real-world medical data and an innovative 'self-play' simulated environment in which AI agents converse to generate diverse training dialogues and receive automated feedback. This approach allowed the system to learn across many medical conditions and contexts.

The primary objective was to compare AMIE's performance against human primary care physicians (PCPs) in a realistic, albeit simulated, setting. The researchers designed a rigorous evaluation framework using a randomized, double-blind crossover study modeled after the Objective Structured Clinical Examination (OSCE), a standard method for assessing clinical skills. Twenty PCPs and AMIE conducted text-based consultations with validated patient-actors portraying 159 different medical scenarios. Performance was assessed across multiple clinically relevant dimensions, including diagnostic accuracy (comparing generated lists of possible diagnoses, or differential diagnoses (DDx), against ground truth), history-taking quality, management planning, communication skills, and empathy, using ratings from both the patient-actors and independent specialist physicians.

The results indicated that, within this specific text-chat based simulated environment, AMIE demonstrated statistically significantly higher diagnostic accuracy than the PCPs (e.g., top-1 accuracy against ground truth was ~85% for AMIE vs. ~75% for PCPs, P < 0.05). Furthermore, AMIE received superior ratings from specialist physicians on 30 out of 32 quality axes and from patient-actors on 25 out of 26 axes, including measures of empathy and communication clarity. Analysis suggested AMIE's advantage stemmed more from interpreting information to form a diagnosis rather than from gathering information more effectively.

The authors conclude that AMIE represents a significant milestone in developing conversational AI for diagnostic purposes. However, they appropriately caution that the results must be interpreted carefully due to major limitations, particularly the use of a text-chat interface unfamiliar to clinicians (which likely biased the comparison) and the simulated nature of the evaluation. Substantial further research focusing on safety, reliability, fairness, and rigorous clinical validation in real-world settings is deemed essential before systems like AMIE could be considered for practical application in healthcare.

Research Impact and Future Directions

This study represents a significant technical achievement, demonstrating that a large language model (LLM) optimized for diagnostic dialogue, AMIE, can outperform primary care physicians (PCPs) in simulated, text-based consultations across key metrics like diagnostic accuracy and communication quality. The use of a randomized, double-blind crossover design modeled after Objective Structured Clinical Examinations (OSCEs) provides a rigorous framework for comparison within the study's specific context. AMIE's superior performance, particularly in diagnostic reasoning and perceived empathy by both patient-actors and specialist evaluators, highlights the potential of advanced AI in complex medical interactions.

However, the study's conclusions must be interpreted with considerable caution due to fundamental limitations inherent in its design. The reliance on synchronous text-chat, a modality unfamiliar to most clinicians for diagnostic purposes, likely disadvantaged the PCPs and favored the text-native LLM, potentially exaggerating AMIE's relative performance. Furthermore, the evaluation occurred within a simulated environment using trained patient-actors and predefined scenarios. This controlled setting cannot fully replicate the complexity, unpredictability, and multi-modal nature (including non-verbal cues) of real-world clinical encounters. Therefore, the study demonstrates AMIE's capabilities in a specific, artificial setting but does not provide sufficient evidence to claim superiority over human physicians in actual clinical practice.

The findings strongly suggest potential future applications for AI like AMIE as assistive tools—perhaps helping clinicians generate differential diagnoses, draft patient communications, or summarize information. Yet, the path to real-world deployment is long and requires addressing critical challenges. Substantial further research and rigorous clinical validation are essential to ensure safety, reliability, efficacy, fairness, and privacy. Key unanswered questions include how AMIE performs in diverse, real-world patient populations and clinical settings, how to effectively mitigate potential biases, and how to best integrate such tools into clinical workflows with appropriate human oversight. This work serves as a crucial proof-of-concept and a milestone for conversational AI in medicine, but underscores the extensive validation needed before such technology can be responsibly integrated into patient care.

Critical Analysis and Recommendations

Effective Motivation (written-content)
Clear Problem Statement and Motivation: The abstract effectively establishes the importance of physician-patient dialogue and the challenge of replicating it with AI. This clearly justifies the research need and engages the reader.
Section: Abstract
Lack of Quantitative Key Result in Abstract (written-content)
Quantify 'Greater Diagnostic Accuracy' Claim: The abstract states AMIE had 'greater diagnostic accuracy' but doesn't provide the key quantitative result (e.g., top-1 accuracy difference from Fig 3a). Including the primary accuracy metric would make the abstract's summary of findings significantly more informative and impactful.
Section: Abstract
Insufficient Context for Key Limitation (written-content)
Contextualize Text-Chat Limitation Impact: The abstract mentions the text-chat interface limitation but doesn't briefly state its potential impact (disadvantaging clinicians). Adding this context would improve the initial interpretation of the results presented in the abstract.
Section: Abstract
Robust Evidence for Diagnostic Accuracy Claim (written-content)
Clear Quantitative Demonstration of Diagnostic Superiority: The results clearly show AMIE's statistically significant higher diagnostic accuracy compared to PCPs using quantitative data (Fig. 3) and appropriate statistical tests. This provides robust evidence for a central claim within the study's context.
Section: Results
Comprehensive Conversation Quality Assessment (graphical-figure)
Multi-Perspective Evaluation of Conversation Quality: Conversation quality was comprehensively assessed using standardized rubrics from both patient-actors (Fig. 4) and specialist physicians (Fig. 5). This multi-faceted approach adds significant credibility and depth to the findings on communication and empathy.
Section: Results
Analysis of Performance Source (written-content)
Insightful Analysis of Performance Drivers: The study effectively investigated why AMIE performed better diagnostically, concluding it excels more in interpreting information than acquiring it. This analysis adds depth beyond simply reporting the performance difference.
Section: Results
Lack of Direct Reporting for Inter-Rater Reliability (graphical-figure)
Report Inter-Rater Reliability Metric Directly: While IRR for specialist ratings (Fig. 5) is mentioned as being in supplementary info, stating the actual metric (e.g., Fleiss' Kappa) in the main results text would immediately strengthen the perceived robustness of these subjective ratings.
Section: Results
Rigorous Handling of Text-Chat Limitation (written-content)
Thorough Acknowledgment of Interface Limitations: The discussion proactively and extensively addresses the major limitation of the text-chat interface, acknowledging its unfamiliarity for clinicians and potential to disadvantage them. This demonstrates scientific rigor and balanced interpretation.
Section: Discussion
Responsible Focus on Fairness and Bias (written-content)
Explicit Focus on Fairness, Bias, and Future Mitigation: The discussion dedicates significant attention to the critical issues of fairness and bias, acknowledging current limitations and outlining necessary future work. This highlights a responsible approach to AI development.
Section: Discussion
Responsible Framing of Future Work (written-content)
Cautious and Responsible Framing of Deployment Path: The discussion clearly outlines the many steps (safety, reliability, ethics, oversight) required before translation to practice. This manages expectations appropriately.
Section: Discussion
Lack of Concrete Examples for Human-AI Collaboration (written-content)
Elaborate on Specific Human-AI Complementarity Workflows: The discussion mentions human-AI complementarity but lacks concrete examples of how this might work (e.g., AI suggesting DDx options, drafting empathic responses). Providing specific examples would make this important concept more tangible.
Section: Discussion
Insufficient Emphasis on Simulation Limitation Impact (written-content)
Simulation-Based Design Limits Generalizability: The study's reliance on simulated scenarios and patient-actors, while necessary for control, fundamentally limits the ability to generalize findings to real-world clinical practice. The discussion acknowledges simulation limits but could more strongly emphasize how this constrains the interpretation of comparative performance.
Section: Discussion
Effective Summary of Contribution (written-content)
Clear Summary of Milestone Achieved: The conclusion effectively synthesizes the core achievement, positioning AMIE's performance in the simulated setting as a significant milestone for conversational AI in diagnostics.
Section: Conclusion
Clear Articulation of Research-Practice Gap (written-content)
Explicit Statement of Research-to-Practice Gap: The conclusion clearly articulates the substantial gap between the experimental findings and real-world application, emphasizing the need for extensive further research. This provides essential context and manages expectations.
Section: Conclusion
Lack of Specificity on Clinical Validation Need (written-content)
Explicitly Mention Need for Clinical Validation: While the conclusion stresses the need for 'substantial additional research', explicitly mentioning 'rigorous clinical validation' or 'clinical trials' would add specificity and reinforce the critical next step for translating these findings.
Section: Conclusion
Novel Self-Play Training Method (written-content)
Innovative Self-Play Simulation Framework: The detailed description of the self-play environment for scaling training data represents a novel and well-explained methodological contribution for training specialized conversational AI.
Section: Methods
Rigorous Evaluation Methodology (written-content)
Rigorous and Comprehensive Evaluation Design (Remote OSCE): The methods meticulously detail the randomized, blinded, crossover remote OSCE design, use of validated patient-actors, diverse scenarios, clinically relevant metrics (PACES, PCCBP, etc.), and multi-perspective assessment. This rigorous design significantly boosts the credibility of the comparative findings within the study's context.
Section: Methods
Transparent Statistical Methods (written-content)
Transparent Statistical Analysis Methods: The specific statistical tests (bootstrap, Wilcoxon signed-rank) and corrections (FDR) used are clearly described. This transparency supports the reproducibility and validity of the results.
Section: Methods
Lack of Summary for Agent Instructions (written-content)
Briefly Summarize Core Agent Instructions in Main Text: The prompts for the self-play agents are in supplementary material, but briefly summarizing the core instruction for each agent (vignette generator, patient, doctor, critic) in the Methods section would improve the immediate understanding of this novel simulation framework.
Section: Methods
Establishes Novelty Effectively (written-content)
Clear Differentiation from Prior AI Work: The related work section effectively distinguishes this study from previous AI applications in medicine (e.g., symptom checkers, transcription) by highlighting their limitations. This clearly establishes the novelty and contribution of the AMIE research.
Section: Related work
Justifies Rigorous Evaluation Approach (written-content)
Critical Assessment of Prior Evaluation Methods: The section rightly critiques the inadequacy of evaluation metrics used in prior AI dialogue studies (e.g., fluency, relevance) compared to clinical standards. This justifies the paper's more rigorous, clinically-aligned evaluation approach.
Section: Related work

Section Analysis

Non-Text Elements

Fig. 1 | Overview of contributions. AMIE is a conversational medical AI...
Full Caption

Fig. 1 | Overview of contributions. AMIE is a conversational medical AI optimized for diagnostic dialogue.

Figure/Table Image (Page 2)
First Reference in Text
Our key contributions (Fig. 1) are summarized here.
Description
  • AMIE System Design and Fine-tuning: The diagram outlines the development and evaluation of AMIE, an AI system for medical diagnosis through conversation. It shows AMIE's system design involving multiple data inputs (like medical reasoning datasets, real-world dialogues, and simulated dialogues) used for 'fine-tuning' – a process of adapting a general large language model (LLM) for this specific medical task.
  • Self-Play Training Mechanism: A key part of AMIE's training involves 'self-play', where the AI learns by interacting with itself. The diagram shows two loops: an 'inner' loop where AMIE acts as both doctor and patient, receiving feedback from an AI 'critic' to improve its responses within a single simulated conversation, and an 'outer' loop where these improved simulated dialogues are collected and used for further rounds of fine-tuning the main AMIE model. A schematic code sketch of this loop structure follows this list.
  • Inference Reasoning Chain: During use ('inference'), AMIE employs a 'reasoning chain' involving analyzing the conversation context, generating a potential response, and then refining that response before presenting it to the user.
  • Randomized Evaluation Study (Remote OSCE): The evaluation method depicted is a randomized study designed like an OSCE (Objective Structured Clinical Examination – a standard test format for medical skills). In this setup, actors playing simulated patients interact via text chat randomly with either AMIE or real Primary Care Physicians (PCPs).
  • Comparative Performance Summary (Radar Chart): A radar chart summarizes the comparative performance, illustrating that AMIE (represented by the orange line encompassing a larger area) is suggested to outperform PCPs (blue line) across various evaluation metrics like diagnostic accuracy, management planning, empathy, and patient confidence, according to both specialist physicians and patient-actors.
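The inner/outer self-play structure summarized above can be made concrete with a short, purely schematic sketch. Every name below (the toy agent, the no-op fine-tuning step, the fixed loop lengths) is a hypothetical stand-in for illustration; this is not the authors' implementation, which uses LLM calls for each role.

```python
# Schematic sketch of the self-play structure shown in Fig. 1 (hypothetical
# stand-ins only). Inner loop: one agent role-plays patient and doctor, with a
# critic refining each doctor turn. Outer loop: collected dialogues feed
# further fine-tuning, and the process repeats.
from typing import Callable, List

def simulate_dialogue(vignette: str, agent: Callable[[str], str]) -> str:
    """Inner loop: generate one refined simulated dialogue for a vignette."""
    transcript = f"Vignette: {vignette}\n"
    for turn in range(3):  # short fixed-length dialogue, purely for illustration
        transcript += agent(f"PATIENT turn {turn}, given:\n{transcript}") + "\n"
        draft = agent(f"DOCTOR turn {turn}, given:\n{transcript}")
        feedback = agent(f"CRITIC feedback on:\n{draft}")
        transcript += agent(f"DOCTOR revision of:\n{draft}\nusing feedback:\n{feedback}") + "\n"
    return transcript

def outer_loop(vignettes: List[str], agent: Callable[[str], str], fine_tune) -> Callable[[str], str]:
    """Outer loop: collect refined dialogues, fine-tune on them, repeat."""
    for _ in range(2):  # two illustrative outer iterations
        dialogues = [simulate_dialogue(v, agent) for v in vignettes]
        agent = fine_tune(agent, dialogues)  # returns an improved agent
    return agent

# Toy stand-ins so the sketch runs end to end (a real system would call an LLM).
toy_agent = lambda prompt: f"<response to {len(prompt)}-char prompt>"
toy_fine_tune = lambda agent, dialogues: agent  # no-op placeholder for fine-tuning
outer_loop(["chest pain, 54-year-old man", "headache, 23-year-old woman"], toy_agent, toy_fine_tune)
```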
Scientific Validity
  • ✅ Comprehensive methodological overview: The diagram provides a coherent and logical overview of the system's architecture, training process (including the novel self-play mechanism), and the evaluation strategy, aligning well with the abstract's description.
  • ✅ Representation of rigorous evaluation design: The inclusion of the randomized, double-blind crossover study design (remote OSCE) directly addresses how the system's performance was compared against human clinicians, representing a rigorous evaluation approach for this type of AI.
  • 💡 High-level depiction of self-play: The schematic nature of the self-play loops and the critic feedback mechanism is appropriate for an overview figure but lacks the specific details needed to fully assess the technical novelty or potential limitations of this training approach (e.g., specific feedback criteria, data flow specifics).
  • 💡 Simplified representation of inference: The inference reasoning chain (Analyse, Generate, Refine) is presented linearly. While illustrative, the actual process within a sophisticated LLM might be more complex or iterative, which isn't captured here.
  • 💡 Schematic nature of comparison results: The radar chart visually implies AMIE's superiority across many axes. As a schematic overview, this is acceptable, but it doesn't present quantitative data, error bars, or statistical significance, which are necessary to substantiate the claims and are presumably detailed elsewhere in the paper.
Communication
  • ✅ Effective use of mixed graphical elements: The figure effectively uses a combination of flowcharts, icons, and a summary graphic (radar chart) to provide a high-level overview of the complex system architecture, training methodology, and evaluation framework.
  • ✅ Logical flow and structure: The diagram follows a logical flow from left to right, top to bottom, generally guiding the reader through system design, training, evaluation, and comparison.
  • ✅ Intuitive summary graphic: The radar chart provides a visually intuitive snapshot comparing AMIE and PCP performance across multiple dimensions, summarizing a key outcome.
  • 💡 Information density and label size: The diagram is information-dense, containing many components, labels, and arrows. Some text labels within blocks are small, potentially hindering readability, especially in smaller print or screen sizes. Consider increasing font sizes or simplifying labels where possible.
  • 💡 Potential ambiguity in terminology: The term 'Simulated dialogue' appears both as input data for fine-tuning and as an output of the 'Simulated dialogue generator'. While contextually different, using slightly different terminology (e.g., 'Simulated Training Dialogues' vs. 'Generated Dialogue') could prevent potential ambiguity.
  • 💡 Clarity of 'Critic' block connections: The specific inputs and outputs for the 'Critic' block within the inner self-play loop could be made more explicit to clarify its precise role in refining the dialogue.
Fig. 2 | Overview of randomized study design. A PCP and AMIE perform (in a...
Full Caption

Fig. 2 | Overview of randomized study design. A PCP and AMIE perform (in a randomized order) a virtual remote OSCE with simulated patients by means of an online multi-turn synchronous text chat and produce answers

Figure/Table Image (Page 3)
First Reference in Text
Next we designed and conducted a blinded, remote objective structured clinical examination (OSCE) study (Fig. 2) using 159 case scenarios from clinical providers
Description
  • Study Overview and Format: This figure diagrams the methodology for a study comparing a medical AI called AMIE with human Primary Care Physicians (PCPs). The study uses a format similar to an OSCE (Objective Structured Clinical Examination), which is a standardized way of testing clinical skills, but conducted remotely via text chat.
  • Step 1: Randomized, Blinded Consultation: Step 1 shows the core interaction: A simulated patient (an actor trained to portray a specific medical case from a 'scenario pack') engages in an online text chat consultation. Crucially, the patient interacts with both a PCP and the AMIE system, but the order is randomized, and the patient is blinded (doesn't know if they are chatting with the human or the AI). A small allocation sketch of this crossover randomization follows this list.
  • Step 2: Post-Questionnaires and Data Collection: Step 2 involves data collection immediately after each consultation. Both the patient-actor and the 'OSCE agent' (either the PCP or AMIE) complete post-questionnaires. The patient-actor provides feedback using standardized tools like GMCPQ (General Medical Council Patient Questionnaire), PACES (Practical Assessment of Clinical Examination Skills – focusing on empathy/patient concerns), and metrics derived from PCCBP (Patient-Centred Communication Best Practice – assessing relationship building). The OSCE agent provides a DDx (Differential Diagnosis – a ranked list of possible conditions), proposed investigations/treatment, and escalation plans.
  • Step 3: Specialist Physician Evaluation: Step 3 depicts the evaluation phase. Specialist physicians, who are experts in the relevant medical field, review the data collected in Step 2 (scenario details, consultation transcript, agent's DDx/plan) for both the PCP and AMIE consultations corresponding to the same patient scenario. They evaluate performance based on criteria including diagnostic accuracy (DDx), quality of diagnosis and management, and communication skills (using PCCBP and PACES frameworks). The patient-actor's questionnaire responses also contribute to the evaluation.
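To make the randomization and blinding in Step 1 concrete, here is a minimal allocation sketch using hypothetical identifiers. The round-robin assignment of PCPs to scenarios is an assumption made purely for illustration; the paper does not describe its scheduling in this form.

```python
# Minimal sketch of the randomized crossover assignment depicted in Fig. 2:
# every scenario is consulted by both a PCP and AMIE, in a randomly chosen
# order, and the patient-actor only ever sees a neutral label.
import random

random.seed(0)
scenarios = [f"scenario_{i:03d}" for i in range(159)]  # 159 OSCE case scenarios
pcps = [f"pcp_{i:02d}" for i in range(20)]             # 20 participating PCPs

schedule = []
for idx, scenario in enumerate(scenarios):
    pcp = pcps[idx % len(pcps)]          # hypothetical round-robin PCP allocation
    arms = ["AMIE", pcp]
    random.shuffle(arms)                 # randomize consultation order per scenario
    for order, agent in enumerate(arms, start=1):
        schedule.append({
            "scenario": scenario,
            "agent": agent,                   # hidden from the patient-actor
            "blinded_label": f"Dr #{order}",  # what the patient-actor sees
            "order": order,
        })

print(schedule[:4])
```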
Scientific Validity
  • ✅ Strong randomized crossover design: The diagram clearly depicts a randomized crossover design where each simulated patient interacts with both AMIE and a PCP. This allows for within-subject comparison, reducing variability and strengthening the validity of the comparison between the AI and human clinicians.
  • ✅ Inclusion of patient-actor blinding: The blinding of the patient-actor to whether they are interacting with AMIE or a PCP is a crucial methodological strength, minimizing potential bias in their interaction style and subsequent ratings.
  • ✅ Independent specialist evaluation: The use of specialist physicians, separate from the participating PCPs, to evaluate the performance based on transcripts and agent outputs ensures an independent assessment, further reducing bias.
  • ✅ Use of standardized evaluation metrics: Employing standardized evaluation criteria (PCCBP, PACES, DDx accuracy) derived from established medical assessment practices (OSCEs, questionnaires) provides a structured and relevant framework for comparing performance.
  • 💡 Limitation: Use of simulated patients/scenarios: The study relies on simulated patients and scenarios. While necessary for standardization and scale, this may not fully capture the complexity, unpredictability, and nuances of real-world patient encounters.
  • 💡 Potential bias due to text-chat modality: The interaction medium is synchronous text chat. As noted in the text, this may be unfamiliar to PCPs for diagnostic consultations, potentially disadvantaging them compared to the LLM-based AMIE, which operates natively in text. The diagram itself doesn't highlight this potential bias, but it's inherent in the depicted method.
  • 💡 Lack of detail on resolving evaluator disagreements: The diagram shows evaluation criteria but doesn't specify how conflicting ratings between the multiple specialist evaluators (mentioned in the text) were resolved (e.g., consensus, majority vote, averaging), which is an important detail for understanding the final evaluation.
Communication
  • ✅ Clear three-step structure: The figure clearly illustrates the three main steps of the study design (consultation, post-questionnaires, evaluation) in a sequential manner, making the overall process easy to follow.
  • ✅ Effective use of icons: The use of icons effectively represents the different participants (patient-actor, PCP, AMIE, specialist physician) and data elements (scenario pack, transcript, questionnaires).
  • ✅ Highlights randomization and blinding: The diagram visually emphasizes the randomization between the PCP and AMIE arms and the blinding aspect for the simulated patient, clearly communicating key methodological strengths.
  • ✅ Logical data flow: The flow is logical, showing data collection (transcripts, questionnaires) feeding into the final evaluation step.
  • 💡 High density of acronyms: While standard in medical assessment contexts, the figure relies heavily on acronyms (OSCE, PCP, AMIE, DDx, GMCPQ, PCCBP, PACES). While defined in the text, a brief expansion in a legend or footnote within the figure itself could improve its standalone readability for a broader audience.
  • 💡 Ambiguity in 'OSCE data' arrow: The arrow indicating 'OSCE data' feeding into the 'Evaluation criteria' box is slightly ambiguous. Clarify if 'OSCE data' refers to the combined transcripts and questionnaires or just the questionnaires/DDx lists.
Fig. 3 | Specialist-rated top-k diagnostic accuracy. a, b, The AMIE and PCP top-k...
Full Caption

Fig. 3 | Specialist-rated top-k diagnostic accuracy. a, b, The AMIE and PCP top-k DDx accuracies, determined by the majority vote of three specialists, are compared across 159 scenarios with respect to the ground-truth diagnosis (a) and all diagnoses in the accepted differential (b).

Figure/Table Image (Page 4)
First Reference in Text
Figure 3 shows the top-k accuracy for AMIE and the PCPs, considering matches with the ground-truth diagnosis (Fig. 3a) and matches with any item on the accepted differential (Fig. 3b).
Description
  • Top-k Diagnostic Accuracy vs. Ground Truth: This line graph compares the diagnostic accuracy of the AMIE AI system (orange line) against human Primary Care Physicians (PCPs, blue line) based on specialist ratings across 159 simulated medical scenarios. Accuracy is measured as 'top-k accuracy', meaning the percentage of scenarios where the single correct 'ground-truth' diagnosis was found within the top 'k' diagnoses listed by either AMIE or the PCP. The x-axis shows 'k' ranging from 1 to 10. A computational sketch of this metric, with bootstrap confidence intervals, follows this list.
  • AMIE vs. PCP Performance Trend: The graph shows that for all values of k from 1 to 10, AMIE consistently achieves higher average top-k accuracy than the PCPs. For instance, at k=1 (meaning the top diagnosis listed was the correct one), AMIE's accuracy is approximately 85%, while PCP accuracy is around 75%.
  • Accuracy Increase with k: As 'k' increases (allowing for more potential matches within the ranked list), the accuracy for both AMIE and PCPs increases, eventually plateauing. AMIE's accuracy reaches over 95% by k=5, while PCP accuracy approaches 90% by k=10.
  • Confidence Intervals: Shaded areas around each line represent 95% confidence intervals, indicating the range of statistical uncertainty for the average accuracy at each value of k. The confidence intervals for AMIE and PCPs show little to no overlap, visually suggesting a statistically significant difference.
  • Statistical Significance: The caption notes that the differences are statistically significant (P < 0.05 after FDR correction) for all values of k, based on bootstrap testing.
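The top-k metric and the bootstrap confidence intervals described above can be illustrated with a minimal sketch on hypothetical data. The rank data below are random placeholders, not the study's results, and the percentile-bootstrap procedure is a generic stand-in for the paper's bootstrap testing.

```python
# Minimal sketch of top-k DDx accuracy with bootstrap confidence intervals,
# using hypothetical data. Each scenario contributes the rank position of the
# ground-truth diagnosis within the submitted DDx list (-1 if absent); top-k
# accuracy is the fraction of scenarios with that rank among the first k.
import numpy as np

rng = np.random.default_rng(0)
n_scenarios, list_len = 159, 10
gt_rank = rng.integers(-1, list_len, size=n_scenarios)  # hypothetical 0-based ranks

def top_k_accuracy(ranks: np.ndarray, k: int) -> float:
    return float(np.mean((ranks >= 0) & (ranks < k)))

def bootstrap_ci(ranks: np.ndarray, k: int, n_boot: int = 2000, alpha: float = 0.05):
    """Percentile bootstrap CI over scenarios for the top-k accuracy."""
    stats = [
        top_k_accuracy(rng.choice(ranks, size=len(ranks), replace=True), k)
        for _ in range(n_boot)
    ]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

for k in (1, 3, 5, 10):
    lo, hi = bootstrap_ci(gt_rank, k)
    print(f"top-{k}: {top_k_accuracy(gt_rank, k):.2%} (95% CI {lo:.2%}-{hi:.2%})")
```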
Scientific Validity
  • ✅ Appropriate metric (Top-k accuracy vs ground-truth): The use of top-k accuracy is an appropriate metric for evaluating ranked lists of differential diagnoses, particularly comparing performance when only the single best answer is considered correct.
  • ✅ Evaluation based on specialist majority vote: Basing the evaluation on the majority vote of three specialist physicians provides a degree of robustness against individual rater bias, although inter-rater reliability details (mentioned elsewhere in the text) are important context.
  • ✅ Sound statistical analysis approach: The use of bootstrap testing with FDR correction for multiple comparisons (k=1 to 10) is a statistically sound approach to determine the significance of the observed differences.
  • ✅ Strong support for claim of AMIE superiority (vs ground-truth): The graph strongly supports the claim made in the text and caption that AMIE demonstrates greater diagnostic accuracy than PCPs when measured against the ground-truth diagnosis across all top-k levels.
  • ✅ Reasonable sample size (159 scenarios): The sample size of 159 scenarios provides a reasonable basis for comparison, although generalizability might depend on the diversity and representativeness of these scenarios.
  • 💡 Limitation: Focus on single ground-truth: This panel only considers the single 'ground-truth' diagnosis. Clinical reality often involves multiple plausible diagnoses. Panel 3b addresses this limitation by considering the 'accepted differential'.
Communication
  • ✅ Clear visual comparison: The line graph effectively compares the performance of AMIE and PCPs over the range of k values. Using distinct colors (Orange for AMIE, Blue for PCP) and shaded confidence intervals enhances clarity.
  • ✅ Clear axis labeling: The x-axis ('Top-k') and y-axis ('Accuracy (%)') are clearly labeled, making the graph easy to interpret.
  • ✅ Inclusion of confidence intervals: The inclusion of 95% confidence intervals (shaded areas) provides a visual representation of the uncertainty around the mean accuracy estimates.
  • ✅ Informative caption: The caption clearly explains what is being plotted (top-k accuracy vs ground-truth) and the basis for the evaluation (specialist majority vote, 159 scenarios). The reference to statistical significance testing (FDR-adjusted p-values provided in the caption text) adds important context, although the values are not on the graph itself.
  • 💡 Potential for showing data distribution: While the overall trend is clear, plotting individual data points (perhaps semi-transparently) could offer insight into the distribution and variability of accuracy across scenarios, although this might clutter the graph.
Fig. 4 | Patient-actor ratings. Conversation qualities, as assessed by the...
Full Caption

Fig. 4 | Patient-actor ratings. Conversation qualities, as assessed by the patient-actors upon conclusion of the consultation.

Figure/Table Image (Page 5)
First Reference in Text
Patient-actor ratings. Figure 4 presents the various conversation qualities the patient-actors assessed following their consultations with the OSCE agents.
Description
  • Comparison Overview: This figure presents patient-actor ratings comparing the AI system AMIE (top bar in each pair) with human Primary Care Physicians (PCPs, bottom bar) on various aspects of conversation quality during simulated medical consultations.
  • Rating Scales Used: Ratings were collected using questions adapted from standardized instruments: GMCPQ (General Medical Council Patient Questionnaire), PACES (Practical Assessment of Clinical Examination Skills), and PCCBP (Patient-Centred Communication Best Practice). These cover aspects like politeness, listening skills, empathy, explaining conditions, building rapport, and honesty.
  • Data Visualization Format: The results are shown as divergent stacked bar charts. Each bar represents 100% of the consultations for either AMIE or PCP. The segments within each bar show the percentage of consultations receiving a specific rating, mapped to a five-point scale from 'Very unfavourable' (dark red) to 'Very favourable' (dark blue). For Yes/No questions, 'Yes' is mapped to 'Favourable' and 'No' to 'Unfavourable'.
  • Overall Trend: AMIE Rated More Favorably: Visually, for the vast majority of the 26 quality metrics shown, the blue segments (representing favourable ratings) are larger for AMIE compared to PCPs, while the red segments (unfavourable ratings) are smaller for AMIE.
  • Statistical Significance Markers: Statistical significance (P-value from Wilcoxon signed-rank tests, corrected for multiple comparisons using the false discovery rate (FDR) method) and the number of valid comparisons (N) are provided for each metric. The caption notes that AMIE was rated significantly better (P < 0.05) on 25 out of 26 axes. A minimal sketch of this test-and-correction procedure follows this list.
  • Exception: 'Acknowledging mistakes': One exception where no significant difference was found is 'Acknowledging mistakes', which had a much smaller sample size (N=46) because it was only applicable when a mistake was actually made and pointed out.
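For the per-axis comparison described above (paired Wilcoxon signed-rank tests with FDR correction across axes), a minimal sketch on hypothetical ratings might look like the following. It uses standard SciPy/statsmodels routines and random placeholder data; it is not the authors' analysis code.

```python
# Sketch of paired Wilcoxon signed-rank tests per evaluation axis, with
# Benjamini-Hochberg FDR correction across axes. Ratings are hypothetical.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_scenarios, n_axes = 159, 26  # e.g. the 26 patient-actor axes

# Hypothetical paired ordinal ratings (1-5) for AMIE and PCP on each axis.
amie = rng.integers(1, 6, size=(n_axes, n_scenarios))
pcp = rng.integers(1, 6, size=(n_axes, n_scenarios))

p_values = []
for a_ax, p_ax in zip(amie, pcp):
    if np.all(a_ax == p_ax):        # Wilcoxon is undefined if every pair is tied
        p_values.append(1.0)
    else:
        p_values.append(wilcoxon(a_ax, p_ax).pvalue)

# FDR (Benjamini-Hochberg) correction across the evaluation axes.
rejected, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for i, (p_raw, p_corr, sig) in enumerate(zip(p_values, p_adj, rejected)):
    print(f"axis {i:2d}: raw p={p_raw:.4f}  FDR-adjusted p={p_corr:.4f}  significant={sig}")
```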
Scientific Validity
  • ✅ Valid use of patient-actor ratings: Using patient-actors, who are trained to simulate cases and provide feedback, is a standard and valid methodology in OSCEs and medical education research for assessing interactional skills.
  • ✅ Use of established questionnaire frameworks: Basing the assessment on questions derived from established and validated questionnaires (GMCPQ, PACES, PCCBP) adds rigor and relevance to the measured constructs.
  • ✅ Grounded in randomized crossover design: The comparison is based on the randomized crossover design described in Fig. 2, where each patient-actor interacted with both AMIE and a PCP, strengthening the comparison by controlling for patient variability.
  • ✅ Appropriate statistical test (Wilcoxon signed-rank): The use of the Wilcoxon signed-rank test is appropriate for comparing paired, ordinal rating scale data.
  • ✅ Appropriate correction for multiple comparisons (FDR): The application of FDR correction appropriately addresses the issue of multiple comparisons across the 26 axes.
  • ✅ Strong support for claim of AMIE superiority in patient-actor ratings: The figure strongly supports the claim that patient-actors rated AMIE's conversational qualities significantly higher than PCPs' across most dimensions assessed in this specific study setup.
  • 💡 Limitation: Subjectivity of ratings: Patient ratings, even from trained actors, are subjective and may be influenced by factors beyond the objective quality of the interaction (e.g., length of response, perceived formality).
  • 💡 Limitation: Context dependence (simulation, text-chat): As with other results, these findings are specific to the simulated, text-based consultation format, which may differ from real-world, multi-modal interactions.
  • 💡 Limitation highlighted by low N for 'Acknowledging mistakes': The significantly smaller N for 'Acknowledging mistakes' highlights a potential limitation in assessing rare events within this study design.
Communication
  • ✅ Effective visualization format: The use of divergent stacked bar charts effectively visualizes the distribution of ratings (from 'Very unfavourable' to 'Very favourable') for both AMIE and PCP on each quality metric, allowing for direct comparison.
  • ✅ Consistent and intuitive color mapping: The consistent color scheme across all bars, mapping colors to levels of favorability (e.g., darker blue for 'Very favourable', darker red for 'Very unfavourable'), aids interpretation.
  • ✅ Clear grouping by questionnaire: Grouping related questions under standardized questionnaire headings (GMCPQ, PACES, PCCBP) provides structure and context.
  • ✅ Direct comparison layout: Placing the AMIE and PCP bars adjacent for each metric facilitates direct comparison.
  • ✅ Inclusion of statistical details (P-value, N): Including the P-value and sample size (N) next to each comparison provides immediate statistical context, enhancing the figure's self-containedness.
  • ✅ Clear legend: The legend clearly explains the color mapping and the order (AMIE top, PCP bottom).
  • 💡 Minor label length considerations: Some labels for the quality metrics are slightly long (e.g., 'Discussing roles and responsibilities (Y/N)'). While accurate, minor abbreviation or rephrasing might slightly improve visual balance if space were constrained, but it's generally acceptable.
  • 💡 Explicit legend note for Y/N mapping: The mapping of binary 'Yes/No' responses to 'Favourable'/'Unfavourable' is explained in the caption text but could be explicitly noted in the legend for absolute clarity within the figure itself (e.g., adding '(Y/N mapped)' to relevant legend entries).
Fig. 5 | Specialist physician ratings. Conversation and reasoning qualities, as...
Full Caption

Fig. 5 | Specialist physician ratings. Conversation and reasoning qualities, as assessed by specialist physicians.

Figure/Table Image (Page 6)
First Reference in Text
Again, AMIE's responses were rated significantly better by the specialists than those from the PCPs on 30 out of 32 evaluation axes, with the specialists preferring AMIE's consultations, diagnoses and management plans over those from the PCPs (Fig. 5).
Description
  • Comparison Overview: Specialist Ratings: This figure displays ratings from specialist physicians who evaluated the quality of consultations and reasoning demonstrated by the AI system AMIE (top bar in each pair) compared to human Primary Care Physicians (PCPs, bottom bar). The evaluation covered 159 simulated patient scenarios.
  • Rating Scales and Categories: Ratings were based on established clinical assessment frameworks: PACES (Practical Assessment of Clinical Examination Skills - covering aspects like history taking accuracy, clarity, structure, empathy, and patient welfare) and PCCBP (Patient-Centred Communication Best Practice - covering relationship fostering, information gathering/providing, decision making, etc.). Additional metrics assessed the appropriateness and comprehensiveness of the differential diagnosis (DDx - the list of possible conditions) and the proposed management plan (investigations, treatments, follow-up).
  • Data Visualization Format: Data is visualized using divergent stacked bar charts, where each bar shows the distribution of ratings from 'Very unfavourable' (dark red) to 'Very favourable' (dark blue) based on the median rating from three specialists per case. Yes/No questions are mapped to 'Favourable'/'Unfavourable'. A small sketch of this per-case aggregation follows this list.
  • Overall Trend: AMIE Rated More Favorably by Specialists: The visual trend across nearly all metrics shows larger blue segments (favourable ratings) and smaller red segments (unfavourable ratings) for AMIE compared to PCPs, indicating specialists generally rated AMIE higher.
  • Statistical Significance: The reference text and figure annotations state that AMIE was rated significantly better (P < 0.05 after FDR correction using Wilcoxon signed-rank tests) on 30 out of the 32 evaluation axes presented.
  • Exceptions: Non-significant Differences: Two metrics showed no significant difference: 'Escalation recommendation appropriate (Y/N)' (P = 0.1210) and 'Confabulation absent (Y/N)' (P = 0.4795). For these, the bars appear visually similar between AMIE and PCP, and both performed well (high percentage of 'Favourable'/'Yes' ratings).
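The per-case aggregation behind these charts (median of three specialist ratings per case, with binary items mapped onto the favourability scale) can be sketched as follows with hypothetical ratings; plotting the resulting distribution as a divergent stacked bar is then straightforward.

```python
# Sketch of the per-case aggregation underlying Fig. 5, on hypothetical data:
# three specialist ratings per case on a five-point scale are reduced to a
# per-case median, and the distribution over cases is what the bars display.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
scale = ["Very unfavourable", "Unfavourable", "Neither", "Favourable", "Very favourable"]

n_cases, n_raters = 159, 3
ratings = rng.integers(1, 6, size=(n_cases, n_raters))         # hypothetical 1-5 ratings
case_medians = np.median(ratings, axis=1).astype(int)          # median of 3 integers is integral

distribution = Counter(case_medians)
for level in range(1, 6):
    share = 100 * distribution.get(level, 0) / n_cases
    print(f"{scale[level - 1]:>18}: {share:5.1f}%")

# Binary (Y/N) axes are mapped onto the same favourability scale before plotting.
yn = rng.integers(0, 2, size=n_cases)                           # hypothetical yes/no answers
print(Counter(np.where(yn == 1, "Favourable", "Unfavourable")))
```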
Scientific Validity
  • ✅ Credible evaluation by specialist physicians: Using specialist physicians, who possess domain expertise relevant to the scenarios, provides a credible and clinically relevant assessment of consultation quality, diagnosis, and management.
  • ✅ Use of median ratings from multiple specialists: Employing median ratings from three specialists helps mitigate individual rater bias and is an appropriate aggregation method for ordinal rating scales.
  • ✅ Comprehensive assessment across multiple dimensions: The assessment covers a broad range of clinically important dimensions, including communication skills (PACES, PCCBP), diagnostic reasoning (DDx appropriateness/comprehensiveness), and management planning, providing a comprehensive evaluation.
  • ✅ Appropriate statistical analysis: The statistical approach (Wilcoxon signed-rank test for paired ordinal data, with FDR correction for multiple comparisons) is methodologically sound.
  • ✅ Strong support for claim of AMIE superiority in specialist ratings: The figure provides strong visual and statistical support for the central claim that specialist physicians rated AMIE significantly higher than PCPs across the vast majority of evaluation axes in this study.
  • 💡 Limitation: Subjectivity of specialist ratings: While specialists provide expert judgment, their ratings are still subjective and could potentially be influenced by factors like response length or writing style, even when evaluating transcripts and outputs.
  • 💡 Limitation: Evaluation based on transcripts/outputs, not live interaction: The evaluation is based on reviewing transcripts and post-consultation outputs (like DDx lists), not a live interaction. Specialists lack access to non-verbal cues or real-time interaction dynamics that might influence assessment in a true clinical setting or live OSCE.
  • ✅ Non-significant findings provide valuable context: The findings regarding the two non-significant axes ('Escalation recommendation', 'Confabulation absent') are also informative, suggesting areas where AMIE performed comparably well to PCPs, or where perhaps the task was less discriminating.
Communication
  • ✅ Effective visualization format: The figure effectively uses divergent stacked bar charts, similar to Fig. 4, allowing for easy visual comparison of rating distributions between AMIE and PCPs across numerous metrics.
  • ✅ Clear grouping of metrics: Metrics are clearly grouped by assessment category (PACES, PCCBP, Diagnosis and management), which aids interpretation and provides structure.
  • ✅ Consistent color scheme and layout: The consistent color scheme mapping favorability and the adjacent placement of AMIE/PCP bars facilitate direct comparison.
  • ✅ Inclusion of statistical details (P-value, N): Including P-values and sample sizes (N) directly on the chart for each metric significantly enhances its informational value and self-containedness.
  • ✅ Clear legend: The legend clearly defines the rating scale mapping and identifies the AMIE (top) and PCP (bottom) bars.
  • ✅ Manages high information density well: The number of metrics presented is high (32 axes). While comprehensive, this makes the figure quite dense. However, the clear structure prevents it from becoming overwhelming.
  • 💡 Consider acronym expansion in legend/footnote: Similar to Fig. 4, expanding acronyms like PACES and PCCBP briefly in a footnote or legend could improve accessibility for readers less familiar with these specific assessment tools.
Extended Data Fig. 2 | DDx top-k accuracy for non-disease-states and positive...
Full Caption

Extended Data Fig. 2 | DDx top-k accuracy for non-disease-states and positive disease-states. a, b: Specialist rated DDx top-k accuracy for the 149 "positive" scenarios with respect to (a) the ground-truth diagnosis and (b) the accepted differentials. c, d: Specialist rated DDx top-k accuracy for the 10 "negative" scenarios with respect to (c) the ground-truth diagnosis and (d) the accepted differentials.

Figure/Table Image (Page 16)
First Reference in Text
AMIE has superior DDx accuracy on the set of 149 primarily positive disease state scenarios (in which only three scenarios had a ground-truth of a non-disease state).
Description
  • Top-k Accuracy vs Accepted Differentials (Positive Scenarios): This graph shows the top-k accuracy for the 149 'positive' scenarios (where a disease state is generally present), but accuracy is measured against any diagnosis listed in the 'accepted differential' list provided by specialists, not just the single ground truth. This represents a more lenient accuracy measure.
  • AMIE vs PCP Performance: Similar to panel (a), AMIE (orange line) shows higher accuracy than PCPs (blue line) across all k values.
  • Accuracy Levels and Trend: Accuracies are higher overall compared to panel (a) because matching any accepted differential is easier than matching the specific ground truth. AMIE's top-1 accuracy is around 90%, reaching over 95% by k=3. PCP's top-1 accuracy is around 80%, reaching over 90% by k=4.
  • Confidence Intervals and Significance: The confidence intervals (shaded areas) do not overlap, visually suggesting a significant difference, which is confirmed by the p-values in the caption (P < 0.05 for all k after FDR correction).
Scientific Validity
  • ✅ Clinically relevant metric (vs accepted differentials): Measuring accuracy against the accepted differential list is a clinically relevant approach, as often multiple diagnoses are considered plausible or acceptable initially.
  • ✅ Focus on positive scenarios: Comparing performance on positive disease state scenarios is important for evaluating the system's ability to identify actual conditions.
  • ✅ Sound statistical analysis: The statistical analysis (bootstrap testing with FDR correction) remains appropriate for this comparison.
  • ✅ Strong support for claim (vs accepted differentials): The results strongly support the claim that AMIE's diagnostic lists are more likely than PCPs' lists to contain an acceptable diagnosis within the top-k positions for these positive scenarios.
  • 💡 Dependence on specialist consensus for accepted list: The definition of 'accepted differentials' relies on specialist consensus, which might have some inherent subjectivity, although using a majority vote mitigates this.
Communication
  • ✅ Consistent and clear visualization: Clear presentation using line graphs with distinct colors and confidence intervals, consistent with other figures.
  • ✅ Clear labeling: Axes are clearly labeled, and the legend is unambiguous.
  • ✅ Specific caption description: The caption clearly defines the content of this specific panel (positive scenarios vs accepted differentials).
  • ✅ Statistical context provided: Including FDR-adjusted p-values in the caption text provides necessary statistical context.
Extended Data Fig. 3 | Specialist rated DDx accuracy by scenario specialty....
Full Caption

Extended Data Fig. 3 | Specialist rated DDx accuracy by scenario specialty. Top-k DDx accuracy for scenarios with respect to the ground-truth in (a) Cardiology (N=31, not significant), (b) Gastroenterology (N=33, not significant), (c) Internal Medicine (N=16, significant for all k), (d) Neurology (N=32, significant for k > 5), (e) Obstetrics and Gynaecology (OBGYN)/Urology (N=15, not significant), (f) Respiratory (N=32, significant for all k).

Figure/Table Image (Page 17)