Towards conversational diagnostic artificial intelligence

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, Elahe Vedadi, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Le Hou, Albert Webson, Kavita Kulkarni, S. Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S. Corrado, Yossi Matias, Alan Karthikesalingam, Vivek Natarajan
Nature
Google Research, Mountain View, CA, USA

Overall Summary

Study Background and Main Findings

This paper introduces and evaluates AMIE (Articulate Medical Intelligence Explorer), an artificial intelligence system based on large language models (LLMs), specifically optimized for engaging in diagnostic conversations—a core component of medical practice involving complex history-taking and clinical reasoning. Recognizing the limitations of existing medical AI and the difficulty of replicating human clinical dialogue, the researchers developed AMIE using a combination of real-world medical data and an innovative 'self-play' simulated environment, where AI agents interact to generate diverse training dialogues and receive automated feedback, allowing the system to learn across many medical conditions and contexts.
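
The paper's self-play setup is described here only at a high level; the following Python sketch illustrates how an inner dialogue loop with critic feedback and an outer fine-tuning loop could be organized. All prompts and the `llm`/`fine_tune` callables are assumptions for illustration, not the authors' implementation.

```python
"""Minimal sketch of an AMIE-style self-play loop (illustrative only, not the
authors' code). `llm` is any text-in/text-out model callable supplied by the
reader; `fine_tune` is a hypothetical training routine."""
from typing import Callable, List

def self_play_dialogue(llm: Callable[[str], str], vignette: str,
                       max_turns: int = 10) -> List[str]:
    """Inner loop: one simulated consultation in which the same model role-plays
    patient and doctor, with a critic pass refining each doctor turn."""
    dialogue: List[str] = []
    for _ in range(max_turns):
        patient = llm(f"Role-play the patient in this vignette.\nVignette: {vignette}\n"
                      f"Dialogue so far: {dialogue}")
        draft = llm(f"Role-play the doctor and reply to the patient.\n"
                    f"Dialogue so far: {dialogue + [patient]}")
        feedback = llm(f"Critique this doctor reply for accuracy, completeness and "
                       f"empathy: {draft}")
        doctor = llm(f"Improve the doctor reply using this feedback.\n"
                     f"Reply: {draft}\nFeedback: {feedback}")
        dialogue += [f"Patient: {patient}", f"Doctor: {doctor}"]
    return dialogue

def self_play_round(llm: Callable[[str], str], fine_tune: Callable,
                    vignettes: List[str]):
    """Outer loop: generate refined simulated dialogues, then fine-tune on them;
    repeating this yields successively improved models."""
    corpus = [self_play_dialogue(llm, v) for v in vignettes]
    return fine_tune(corpus)
```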

The primary objective was to compare AMIE's performance against human primary care physicians (PCPs) in a realistic, albeit simulated, setting. The researchers designed a rigorous evaluation framework using a randomized, double-blind crossover study modeled after the Objective Structured Clinical Examination (OSCE), a standard method for assessing clinical skills. Twenty PCPs and AMIE conducted text-based consultations with validated patient-actors portraying 159 different medical scenarios. Performance was assessed across multiple clinically relevant dimensions, including diagnostic accuracy (comparing generated lists of possible diagnoses, or differential diagnoses (DDx), against ground truth), history-taking quality, management planning, communication skills, and empathy, using ratings from both the patient-actors and independent specialist physicians.

The results indicated that, within this specific text-chat-based simulated environment, AMIE achieved higher diagnostic accuracy than the PCPs, a statistically significant difference (e.g., top-1 accuracy against the ground-truth diagnosis was ~85% for AMIE vs. ~75% for PCPs, P < 0.05). Furthermore, AMIE received superior ratings from specialist physicians on 30 out of 32 quality axes and from patient-actors on 25 out of 26 axes, including measures of empathy and communication clarity. Analysis suggested that AMIE's advantage stemmed more from interpreting the gathered information to form a diagnosis than from gathering information more effectively.
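
For reference, top-k accuracy simply asks whether an acceptable diagnosis appears among the first k entries of the ranked differential. The sketch below shows how such a metric can be computed; the matching step is reduced to exact string comparison, whereas the study relied on specialist judgment (majority vote) or a model-based evaluator to decide matches, and the example data are invented.

```python
from typing import List, Sequence

def top_k_accuracy(ddx_lists: Sequence[List[str]],
                   ground_truths: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose ground-truth diagnosis appears in the top-k DDx.

    Matching is simplified to case-insensitive string equality; in the study,
    specialists (or an auto-evaluator model) judged whether entries matched."""
    hits = 0
    for ddx, truth in zip(ddx_lists, ground_truths):
        if any(d.strip().lower() == truth.strip().lower() for d in ddx[:k]):
            hits += 1
    return hits / len(ground_truths)

# Toy example (hypothetical data, not from the study):
ddx = [["acute pericarditis", "myocardial infarction", "costochondritis"],
       ["GERD", "peptic ulcer disease"]]
truths = ["myocardial infarction", "cholecystitis"]
print(top_k_accuracy(ddx, truths, k=1))  # 0.0
print(top_k_accuracy(ddx, truths, k=2))  # 0.5
```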

The authors conclude that AMIE represents a significant milestone in developing conversational AI for diagnostic purposes. However, they appropriately caution that the results must be interpreted carefully due to major limitations, particularly the use of a text-chat interface unfamiliar to clinicians (which likely biased the comparison) and the simulated nature of the evaluation. Substantial further research focusing on safety, reliability, fairness, and rigorous clinical validation in real-world settings is deemed essential before systems like AMIE could be considered for practical application in healthcare.

Research Impact and Future Directions

This study represents a significant technical achievement, demonstrating that a large language model (LLM) optimized for diagnostic dialogue, AMIE, can outperform primary care physicians (PCPs) in simulated, text-based consultations across key metrics like diagnostic accuracy and communication quality. The use of a randomized, double-blind crossover design modeled after Objective Structured Clinical Examinations (OSCEs) provides a rigorous framework for comparison within the study's specific context. AMIE's superior performance, particularly in diagnostic reasoning and perceived empathy by both patient-actors and specialist evaluators, highlights the potential of advanced AI in complex medical interactions.

However, the study's conclusions must be interpreted with considerable caution due to fundamental limitations inherent in its design. The reliance on synchronous text-chat, a modality unfamiliar to most clinicians for diagnostic purposes, likely disadvantaged the PCPs and favored the text-native LLM, potentially exaggerating AMIE's relative performance. Furthermore, the evaluation occurred within a simulated environment using trained patient-actors and predefined scenarios. This controlled setting cannot fully replicate the complexity, unpredictability, and multi-modal nature (including non-verbal cues) of real-world clinical encounters. Therefore, the study demonstrates AMIE's capabilities in a specific, artificial setting but does not provide sufficient evidence to claim superiority over human physicians in actual clinical practice.

The findings strongly suggest potential future applications for AI like AMIE as assistive tools—perhaps helping clinicians generate differential diagnoses, draft patient communications, or summarize information. Yet, the path to real-world deployment is long and requires addressing critical challenges. Substantial further research and rigorous clinical validation are essential to ensure safety, reliability, efficacy, fairness, and privacy. Key unanswered questions include how AMIE performs in diverse, real-world patient populations and clinical settings, how to effectively mitigate potential biases, and how to best integrate such tools into clinical workflows with appropriate human oversight. This work serves as a crucial proof-of-concept and a milestone for conversational AI in medicine, but underscores the extensive validation needed before such technology can be responsibly integrated into patient care.

Critical Analysis and Recommendations

Effective Motivation (written-content)
Clear Problem Statement and Motivation: The abstract effectively establishes the importance of physician-patient dialogue and the challenge of replicating it with AI. This clearly justifies the research need and engages the reader.
Section: Abstract
Lack of Quantitative Key Result in Abstract (written-content)
Quantify 'Greater Diagnostic Accuracy' Claim: The abstract states AMIE had 'greater diagnostic accuracy' but doesn't provide the key quantitative result (e.g., top-1 accuracy difference from Fig 3a). Including the primary accuracy metric would make the abstract's summary of findings significantly more informative and impactful.
Section: Abstract
Insufficient Context for Key Limitation (written-content)
Contextualize Text-Chat Limitation Impact: The abstract mentions the text-chat interface limitation but doesn't briefly state its potential impact (disadvantaging clinicians). Adding this context would improve the initial interpretation of the results presented in the abstract.
Section: Abstract
Robust Evidence for Diagnostic Accuracy Claim (written-content)
Clear Quantitative Demonstration of Diagnostic Superiority: The results clearly show AMIE's statistically significant higher diagnostic accuracy compared to PCPs using quantitative data (Fig. 3) and appropriate statistical tests. This provides robust evidence for a central claim within the study's context.
Section: Results
Comprehensive Conversation Quality Assessment (graphical-figure)
Multi-Perspective Evaluation of Conversation Quality: Conversation quality was comprehensively assessed using standardized rubrics from both patient-actors (Fig. 4) and specialist physicians (Fig. 5). This multi-faceted approach adds significant credibility and depth to the findings on communication and empathy.
Section: Results
Analysis of Performance Source (written-content)
Insightful Analysis of Performance Drivers: The study effectively investigated why AMIE performed better diagnostically, concluding it excels more in interpreting information than acquiring it. This analysis adds depth beyond simply reporting the performance difference.
Section: Results
Lack of Direct Reporting for Inter-Rater Reliability (graphical-figure)
Report Inter-Rater Reliability Metric Directly: While IRR for specialist ratings (Fig. 5) is mentioned as being in supplementary info, stating the actual metric (e.g., Fleiss' Kappa) in the main results text would immediately strengthen the perceived robustness of these subjective ratings.
Section: Results
Rigorous Handling of Text-Chat Limitation (written-content)
Thorough Acknowledgment of Interface Limitations: The discussion proactively and extensively addresses the major limitation of the text-chat interface, acknowledging its unfamiliarity for clinicians and potential to disadvantage them. This demonstrates scientific rigor and balanced interpretation.
Section: Discussion
Responsible Focus on Fairness and Bias (written-content)
Explicit Focus on Fairness, Bias, and Future Mitigation: The discussion dedicates significant attention to the critical issues of fairness and bias, acknowledging current limitations and outlining necessary future work. This highlights a responsible approach to AI development.
Section: Discussion
Responsible Framing of Future Work (written-content)
Cautious and Responsible Framing of Deployment Path: The discussion clearly outlines the many steps (safety, reliability, ethics, oversight) required before translation to practice. This manages expectations appropriately.
Section: Discussion
Lack of Concrete Examples for Human-AI Collaboration (written-content)
Elaborate on Specific Human-AI Complementarity Workflows: The discussion mentions human-AI complementarity but lacks concrete examples of how this might work (e.g., AI suggesting DDx options, drafting empathic responses). Providing specific examples would make this important concept more tangible.
Section: Discussion
Insufficient Emphasis on Simulation Limitation Impact (written-content)
Simulation-Based Design Limits Generalizability: The study's reliance on simulated scenarios and patient-actors, while necessary for control, fundamentally limits the ability to generalize findings to real-world clinical practice. The discussion acknowledges simulation limits but could more strongly emphasize how this constrains the interpretation of comparative performance.
Section: Discussion
Effective Summary of Contribution (written-content)
Clear Summary of Milestone Achieved: The conclusion effectively synthesizes the core achievement, positioning AMIE's performance in the simulated setting as a significant milestone for conversational AI in diagnostics.
Section: Conclusion
Clear Articulation of Research-Practice Gap (written-content)
Explicit Statement of Research-to-Practice Gap: The conclusion clearly articulates the substantial gap between the experimental findings and real-world application, emphasizing the need for extensive further research. This provides essential context and manages expectations.
Section: Conclusion
Lack of Specificity on Clinical Validation Need (written-content)
Explicitly Mention Need for Clinical Validation: While the conclusion stresses the need for 'substantial additional research', explicitly mentioning 'rigorous clinical validation' or 'clinical trials' would add specificity and reinforce the critical next step for translating these findings.
Section: Conclusion
Novel Self-Play Training Method (written-content)
Innovative Self-Play Simulation Framework: The detailed description of the self-play environment for scaling training data represents a novel and well-explained methodological contribution for training specialized conversational AI.
Section: Methods
Rigorous Evaluation Methodology (written-content)
Rigorous and Comprehensive Evaluation Design (Remote OSCE): The methods meticulously detail the randomized, blinded, crossover remote OSCE design, use of validated patient-actors, diverse scenarios, clinically relevant metrics (PACES, PCCBP, etc.), and multi-perspective assessment. This rigorous design significantly boosts the credibility of the comparative findings within the study's context.
Section: Methods
Transparent Statistical Methods (written-content)
Transparent Statistical Analysis Methods: The specific statistical tests (bootstrap, Wilcoxon signed-rank) and corrections (FDR) used are clearly described. This transparency supports the reproducibility and validity of the results.
Section: Methods
Lack of Summary for Agent Instructions (written-content)
Briefly Summarize Core Agent Instructions in Main Text: The prompts for the self-play agents are in supplementary material, but briefly summarizing the core instruction for each agent (vignette generator, patient, doctor, critic) in the Methods section would improve the immediate understanding of this novel simulation framework.
Section: Methods
Establishes Novelty Effectively (written-content)
Clear Differentiation from Prior AI Work: The related work section effectively distinguishes this study from previous AI applications in medicine (e.g., symptom checkers, transcription) by highlighting their limitations. This clearly establishes the novelty and contribution of the AMIE research.
Section: Related work
Justifies Rigorous Evaluation Approach (written-content)
Critical Assessment of Prior Evaluation Methods: The section rightly critiques the inadequacy of evaluation metrics used in prior AI dialogue studies (e.g., fluency, relevance) compared to clinical standards. This justifies the paper's more rigorous, clinically-aligned evaluation approach.
Section: Related work

Section Analysis

Non-Text Elements

Fig. 1 | Overview of contributions. AMIE is a conversational medical AI...
Full Caption

Fig. 1 | Overview of contributions. AMIE is a conversational medical AI optimized for diagnostic dialogue.

Figure/Table Image (Page 2)
First Reference in Text
Our key contributions (Fig. 1) are summarized here.
Description
  • AMIE System Design and Fine-tuning: The diagram outlines the development and evaluation of AMIE, an AI system for medical diagnosis through conversation. It shows AMIE's system design involving multiple data inputs (like medical reasoning datasets, real-world dialogues, and simulated dialogues) used for 'fine-tuning' – a process of adapting a general large language model (LLM) for this specific medical task.
  • Self-Play Training Mechanism: A key part of AMIE's training involves 'self-play', where the AI learns by interacting with itself. The diagram shows two loops: an 'inner' loop where AMIE acts as both doctor and patient, receiving feedback from an AI 'critic' to improve its responses within a single simulated conversation, and an 'outer' loop where these improved simulated dialogues are collected and used for further rounds of fine-tuning the main AMIE model.
  • Inference Reasoning Chain: During use ('inference'), AMIE employs a 'reasoning chain' involving analyzing the conversation context, generating a potential response, and then refining that response before presenting it to the user (a minimal illustrative sketch of this pattern follows this list).
  • Randomized Evaluation Study (Remote OSCE): The evaluation method depicted is a randomized study designed like an OSCE (Objective Structured Clinical Examination – a standard test format for medical skills). In this setup, actors playing simulated patients interact via text chat randomly with either AMIE or real Primary Care Physicians (PCPs).
  • Comparative Performance Summary (Radar Chart): A radar chart summarizes the comparative performance, illustrating that AMIE (represented by the orange line encompassing a larger area) is suggested to outperform PCPs (blue line) across various evaluation metrics like diagnostic accuracy, management planning, empathy, and patient confidence, according to both specialist physicians and patient-actors.
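A minimal sketch of that analyse-generate-refine pattern as chained model calls is shown below; the prompts and the `llm` callable are illustrative assumptions rather than the paper's actual prompts.

```python
from typing import Callable

def amie_style_reply(llm: Callable[[str], str], dialogue: str) -> str:
    """Illustrative three-step reasoning chain: analyse -> generate -> refine.
    Prompt wording and structure are assumptions for illustration only."""
    analysis = llm(
        "Summarize the patient's presentation so far, the current differential, "
        f"and what information is still missing:\n{dialogue}"
    )
    draft = llm(
        "Given this analysis, write the doctor's next message (ask the most "
        f"informative question or explain next steps):\nAnalysis:\n{analysis}"
    )
    final = llm(
        "Review the draft reply for clinical accuracy, clarity and empathy, and "
        f"return an improved version:\nDraft:\n{draft}\nAnalysis:\n{analysis}"
    )
    return final
```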
Scientific Validity
  • ✅ Comprehensive methodological overview: The diagram provides a coherent and logical overview of the system's architecture, training process (including the novel self-play mechanism), and the evaluation strategy, aligning well with the abstract's description.
  • ✅ Representation of rigorous evaluation design: The inclusion of the randomized, double-blind crossover study design (remote OSCE) directly addresses how the system's performance was compared against human clinicians, representing a rigorous evaluation approach for this type of AI.
  • 💡 High-level depiction of self-play: The schematic nature of the self-play loops and the critic feedback mechanism is appropriate for an overview figure but lacks the specific details needed to fully assess the technical novelty or potential limitations of this training approach (e.g., specific feedback criteria, data flow specifics).
  • 💡 Simplified representation of inference: The inference reasoning chain (Analyse, Generate, Refine) is presented linearly. While illustrative, the actual process within a sophisticated LLM might be more complex or iterative, which isn't captured here.
  • 💡 Schematic nature of comparison results: The radar chart visually implies AMIE's superiority across many axes. As a schematic overview, this is acceptable, but it doesn't present quantitative data, error bars, or statistical significance, which are necessary to substantiate the claims and are presumably detailed elsewhere in the paper.
Communication
  • ✅ Effective use of mixed graphical elements: The figure effectively uses a combination of flowcharts, icons, and a summary graphic (radar chart) to provide a high-level overview of the complex system architecture, training methodology, and evaluation framework.
  • ✅ Logical flow and structure: The diagram follows a logical flow from left to right, top to bottom, generally guiding the reader through system design, training, evaluation, and comparison.
  • ✅ Intuitive summary graphic: The radar chart provides a visually intuitive snapshot comparing AMIE and PCP performance across multiple dimensions, summarizing a key outcome.
  • 💡 Information density and label size: The diagram is information-dense, containing many components, labels, and arrows. Some text labels within blocks are small, potentially hindering readability, especially in smaller print or screen sizes. Consider increasing font sizes or simplifying labels where possible.
  • 💡 Potential ambiguity in terminology: The term 'Simulated dialogue' appears both as input data for fine-tuning and as an output of the 'Simulated dialogue generator'. While contextually different, using slightly different terminology (e.g., 'Simulated Training Dialogues' vs. 'Generated Dialogue') could prevent potential ambiguity.
  • 💡 Clarity of 'Critic' block connections: The specific inputs and outputs for the 'Critic' block within the inner self-play loop could be made more explicit to clarify its precise role in refining the dialogue.
Fig. 2 | Overview of randomized study design. A PCP and AMIE perform (in a...
Full Caption

Fig. 2 | Overview of randomized study design. A PCP and AMIE perform (in a randomized order) a virtual remote OSCE with simulated patients by means of an online multi-turn synchronous text chat and produce answers

Figure/Table Image (Page 3)
First Reference in Text
Next we designed and conducted a blinded, remote objective structured clinical examination (OSCE) study (Fig. 2) using 159 case scenarios from clinical providers
Description
  • Study Overview and Format: This figure diagrams the methodology for a study comparing a medical AI called AMIE with human Primary Care Physicians (PCPs). The study uses a format similar to an OSCE (Objective Structured Clinical Examination), which is a standardized way of testing clinical skills, but conducted remotely via text chat.
  • Step 1: Randomized, Blinded Consultation: Step 1 shows the core interaction: A simulated patient (an actor trained to portray a specific medical case from a 'scenario pack') engages in an online text chat consultation. Crucially, the patient interacts with both a PCP and the AMIE system, but the order is randomized, and the patient is blinded (doesn't know if they are chatting with the human or the AI).
  • Step 2: Post-Questionnaires and Data Collection: Step 2 involves data collection immediately after each consultation. Both the patient-actor and the 'OSCE agent' (either the PCP or AMIE) complete post-questionnaires. The patient-actor provides feedback using standardized tools like GMCPQ (General Medical Council Patient Questionnaire), PACES (Practical Assessment of Clinical Examination Skills – focusing on empathy/patient concerns), and metrics derived from PCCBP (Patient-Centred Communication Best Practice – assessing relationship building). The OSCE agent provides a DDx (Differential Diagnosis – a ranked list of possible conditions), proposed investigations/treatment, and escalation plans.
  • Step 3: Specialist Physician Evaluation: Step 3 depicts the evaluation phase. Specialist physicians, who are experts in the relevant medical field, review the data collected in Step 2 (scenario details, consultation transcript, agent's DDx/plan) for both the PCP and AMIE consultations corresponding to the same patient scenario. They evaluate performance based on criteria including diagnostic accuracy (DDx), quality of diagnosis and management, and communication skills (using PCCBP and PACES frameworks). The patient-actor's questionnaire responses also contribute to the evaluation.
Scientific Validity
  • ✅ Strong randomized crossover design: The diagram clearly depicts a randomized crossover design where each simulated patient interacts with both AMIE and a PCP. This allows for within-subject comparison, reducing variability and strengthening the validity of the comparison between the AI and human clinicians.
  • ✅ Inclusion of patient-actor blinding: The blinding of the patient-actor to whether they are interacting with AMIE or a PCP is a crucial methodological strength, minimizing potential bias in their interaction style and subsequent ratings.
  • ✅ Independent specialist evaluation: The use of specialist physicians, separate from the participating PCPs, to evaluate the performance based on transcripts and agent outputs ensures an independent assessment, further reducing bias.
  • ✅ Use of standardized evaluation metrics: Employing standardized evaluation criteria (PCCBP, PACES, DDx accuracy) derived from established medical assessment practices (OSCEs, questionnaires) provides a structured and relevant framework for comparing performance.
  • 💡 Limitation: Use of simulated patients/scenarios: The study relies on simulated patients and scenarios. While necessary for standardization and scale, this may not fully capture the complexity, unpredictability, and nuances of real-world patient encounters.
  • 💡 Potential bias due to text-chat modality: The interaction medium is synchronous text chat. As noted in the text, this may be unfamiliar to PCPs for diagnostic consultations, potentially disadvantaging them compared to the LLM-based AMIE, which operates natively in text. The diagram itself doesn't highlight this potential bias, but it's inherent in the depicted method.
  • 💡 Lack of detail on resolving evaluator disagreements: The diagram shows evaluation criteria but doesn't specify how conflicting ratings between the multiple specialist evaluators (mentioned in the text) were resolved (e.g., consensus, majority vote, averaging), which is an important detail for understanding the final evaluation.
Communication
  • ✅ Clear three-step structure: The figure clearly illustrates the three main steps of the study design (consultation, post-questionnaires, evaluation) in a sequential manner, making the overall process easy to follow.
  • ✅ Effective use of icons: The use of icons effectively represents the different participants (patient-actor, PCP, AMIE, specialist physician) and data elements (scenario pack, transcript, questionnaires).
  • ✅ Highlights randomization and blinding: The diagram visually emphasizes the randomization between the PCP and AMIE arms and the blinding aspect for the simulated patient, clearly communicating key methodological strengths.
  • ✅ Logical data flow: The flow is logical, showing data collection (transcripts, questionnaires) feeding into the final evaluation step.
  • 💡 High density of acronyms: While standard in medical assessment contexts, the figure relies heavily on acronyms (OSCE, PCP, AMIE, DDx, GMCPQ, PCCBP, PACES). While defined in the text, a brief expansion in a legend or footnote within the figure itself could improve its standalone readability for a broader audience.
  • 💡 Ambiguity in 'OSCE data' arrow: The arrow indicating 'OSCE data' feeding into the 'Evaluation criteria' box is slightly ambiguous. Clarify if 'OSCE data' refers to the combined transcripts and questionnaires or just the questionnaires/DDx lists.
Fig. 3 | Specialist-rated top-k diagnostic accuracy. a, b, The AMIE and PCP top-k...
Full Caption

Fig. 3 | Specialist-rated top-k diagnostic accuracy. a, b, The AMIE and PCP top-k DDx accuracies, determined by the majority vote of three specialists, are compared across 159 scenarios with respect to the ground-truth diagnosis (a) and all diagnoses in the accepted differential (b).

Figure/Table Image (Page 4)
First Reference in Text
Figure 3 shows the top-k accuracy for AMIE and the PCPs, considering matches with the ground-truth diagnosis (Fig. 3a) and matches with any item on the accepted differential (Fig. 3b).
Description
  • Top-k Diagnostic Accuracy vs. Ground Truth: This line graph compares the diagnostic accuracy of the AMIE AI system (orange line) against human Primary Care Physicians (PCPs, blue line) based on specialist ratings across 159 simulated medical scenarios. Accuracy is measured as 'top-k accuracy', meaning the percentage of scenarios where the single correct 'ground-truth' diagnosis was found within the top 'k' diagnoses listed by either AMIE or the PCP. The x-axis shows 'k' ranging from 1 to 10.
  • AMIE vs. PCP Performance Trend: The graph shows that for all values of k from 1 to 10, AMIE consistently achieves higher average top-k accuracy than the PCPs. For instance, at k=1 (meaning the top diagnosis listed was the correct one), AMIE's accuracy is approximately 85%, while PCP accuracy is around 75%.
  • Accuracy Increase with k: As 'k' increases (allowing for more potential matches within the ranked list), the accuracy for both AMIE and PCPs increases, eventually plateauing. AMIE's accuracy reaches over 95% by k=5, while PCP accuracy approaches 90% by k=10.
  • Confidence Intervals: Shaded areas around each line represent 95% confidence intervals, indicating the range of statistical uncertainty for the average accuracy at each value of k. The confidence intervals for AMIE and the PCPs show little to no overlap, visually suggesting a statistically significant difference.
  • Statistical Significance: The caption notes that the differences are statistically significant (P < 0.05 after FDR correction) for all values of k, based on bootstrap testing.
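A compact sketch of this kind of significance testing (a paired bootstrap over scenarios followed by Benjamini-Hochberg FDR correction across k = 1 to 10) is shown below. It illustrates the general approach rather than the authors' code, and the `amie_hits`/`pcp_hits` arrays in the usage comment are hypothetical.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def paired_bootstrap_p(hits_a: np.ndarray, hits_b: np.ndarray,
                       n_boot: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the accuracy difference between two agents,
    resampling the same scenarios for both (paired bootstrap).
    hits_a/hits_b are 0/1 arrays: did the agent's top-k DDx contain the diagnosis?"""
    rng = np.random.default_rng(seed)
    n = len(hits_a)
    observed = hits_a.mean() - hits_b.mean()
    idx = rng.integers(0, n, size=(n_boot, n))          # resampled scenario indices
    boot_diffs = hits_a[idx].mean(axis=1) - hits_b[idx].mean(axis=1)
    # Centre the bootstrap distribution to approximate the null of no difference.
    p = np.mean(np.abs(boot_diffs - boot_diffs.mean()) >= abs(observed))
    return max(p, 1.0 / n_boot)

# Hypothetical usage: one p-value per k = 1..10, then FDR correction.
# p_values = [paired_bootstrap_p(amie_hits[k], pcp_hits[k]) for k in range(1, 11)]
# rejected, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```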
Scientific Validity
  • ✅ Appropriate metric (Top-k accuracy vs ground-truth): The use of top-k accuracy is an appropriate metric for evaluating ranked lists of differential diagnoses, particularly comparing performance when only the single best answer is considered correct.
  • ✅ Evaluation based on specialist majority vote: Basing the evaluation on the majority vote of three specialist physicians provides a degree of robustness against individual rater bias, although inter-rater reliability details (mentioned elsewhere in the text) are important context.
  • ✅ Sound statistical analysis approach: The use of bootstrap testing with FDR correction for multiple comparisons (k=1 to 10) is a statistically sound approach to determine the significance of the observed differences.
  • ✅ Strong support for claim of AMIE superiority (vs ground-truth): The graph strongly supports the claim made in the text and caption that AMIE demonstrates greater diagnostic accuracy than PCPs when measured against the ground-truth diagnosis across all top-k levels.
  • ✅ Reasonable sample size (159 scenarios): The sample size of 159 scenarios provides a reasonable basis for comparison, although generalizability might depend on the diversity and representativeness of these scenarios.
  • 💡 Limitation: Focus on single ground-truth: This panel only considers the single 'ground-truth' diagnosis. Clinical reality often involves multiple plausible diagnoses. Panel 3b addresses this limitation by considering the 'accepted differential'.
Communication
  • ✅ Clear visual comparison: The line graph effectively compares the performance of AMIE and PCPs over the range of k values. Using distinct colors (Orange for AMIE, Blue for PCP) and shaded confidence intervals enhances clarity.
  • ✅ Clear axis labeling: The x-axis ('Top-k') and y-axis ('Accuracy (%)') are clearly labeled, making the graph easy to interpret.
  • ✅ Inclusion of confidence intervals: The inclusion of 95% confidence intervals (shaded areas) provides a visual representation of the uncertainty around the mean accuracy estimates.
  • ✅ Informative caption: The caption clearly explains what is being plotted (top-k accuracy vs ground-truth) and the basis for the evaluation (specialist majority vote, 159 scenarios). The reference to statistical significance testing (FDR-adjusted p-values provided in the caption text) adds important context, although the values are not on the graph itself.
  • 💡 Potential for showing data distribution: While the overall trend is clear, plotting individual data points (perhaps semi-transparently) could offer insight into the distribution and variability of accuracy across scenarios, although this might clutter the graph.
Fig. 4 | Patient-actor ratings. Conversation qualities, as assessed by the...
Full Caption

Fig. 4 | Patient-actor ratings. Conversation qualities, as assessed by the patient-actors upon conclusion of the consultation.

Figure/Table Image (Page 5)
First Reference in Text
Patient-actor ratings. Figure 4 presents the various conversation qualities the patient-actors assessed following their consultations with the OSCE agents.
Description
  • Comparison Overview: This figure presents patient-actor ratings comparing the AI system AMIE (top bar in each pair) with human Primary Care Physicians (PCPs, bottom bar) on various aspects of conversation quality during simulated medical consultations.
  • Rating Scales Used: Ratings were collected using questions adapted from standardized instruments: GMCPQ (General Medical Council Patient Questionnaire), PACES (Practical Assessment of Clinical Examination Skills), and PCCBP (Patient-Centred Communication Best Practice). These cover aspects like politeness, listening skills, empathy, explaining conditions, building rapport, and honesty.
  • Data Visualization Format: The results are shown as divergent stacked bar charts. Each bar represents 100% of the consultations for either AMIE or PCP. The segments within each bar show the percentage of consultations receiving a specific rating, mapped to a five-point scale from 'Very unfavourable' (dark red) to 'Very favourable' (dark blue). For Yes/No questions, 'Yes' is mapped to 'Favourable' and 'No' to 'Unfavourable'.
  • Overall Trend: AMIE Rated More Favorably: Visually, for the vast majority of the 26 quality metrics shown, the blue segments (representing favourable ratings) are larger for AMIE compared to PCPs, while the red segments (unfavourable ratings) are smaller for AMIE.
  • Statistical Significance Markers: Statistical significance (P-value from Wilcoxon signed-rank tests, corrected for multiple comparisons using FDR - False Discovery Rate) and the number of valid comparisons (N) are provided for each metric. The caption notes that AMIE was rated significantly better (P < 0.05) on 25 out of 26 axes. A small code sketch of this testing procedure follows this list.
  • Exception: 'Acknowledging mistakes': One exception where no significant difference was found is 'Acknowledging mistakes', which had a much smaller sample size (N=46) because it was only applicable when a mistake was actually made and pointed out.
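A brief sketch of the paired-ratings analysis referenced above (a Wilcoxon signed-rank test per axis, followed by FDR correction) is given below; the integer encoding of the rating scale and the array names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def compare_rating_axes(amie: np.ndarray, pcp: np.ndarray, alpha: float = 0.05):
    """amie, pcp: arrays of shape (n_axes, n_scenarios) holding ordinal ratings
    encoded as integers (e.g. 1 = very unfavourable ... 5 = very favourable).
    Assumes each axis has at least some non-tied pairs.
    Returns FDR-adjusted p-values and a boolean significance mask per axis."""
    p_values = []
    for a_ratings, p_ratings in zip(amie, pcp):
        # Paired test on the same scenarios; 'zsplit' distributes zero differences.
        stat, p = wilcoxon(a_ratings, p_ratings, zero_method="zsplit")
        p_values.append(p)
    significant, p_adj, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")
    return p_adj, significant
```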
Scientific Validity
  • ✅ Valid use of patient-actor ratings: Using patient-actors, who are trained to simulate cases and provide feedback, is a standard and valid methodology in OSCEs and medical education research for assessing interactional skills.
  • ✅ Use of established questionnaire frameworks: Basing the assessment on questions derived from established and validated questionnaires (GMCPQ, PACES, PCCBP) adds rigor and relevance to the measured constructs.
  • ✅ Grounded in randomized crossover design: The comparison is based on the randomized crossover design described in Fig. 2, where each patient-actor interacted with both AMIE and a PCP, strengthening the comparison by controlling for patient variability.
  • ✅ Appropriate statistical test (Wilcoxon signed-rank): The use of the Wilcoxon signed-rank test is appropriate for comparing paired, ordinal rating scale data.
  • ✅ Appropriate correction for multiple comparisons (FDR): The application of FDR correction appropriately addresses the issue of multiple comparisons across the 26 axes.
  • ✅ Strong support for claim of AMIE superiority in patient-actor ratings: The figure strongly supports the claim that patient-actors rated AMIE's conversational qualities significantly higher than PCPs' across most dimensions assessed in this specific study setup.
  • 💡 Limitation: Subjectivity of ratings: Patient ratings, even from trained actors, are subjective and may be influenced by factors beyond the objective quality of the interaction (e.g., length of response, perceived formality).
  • 💡 Limitation: Context dependence (simulation, text-chat): As with other results, these findings are specific to the simulated, text-based consultation format, which may differ from real-world, multi-modal interactions.
  • 💡 Limitation highlighted by low N for 'Acknowledging mistakes': The significantly smaller N for 'Acknowledging mistakes' highlights a potential limitation in assessing rare events within this study design.
Communication
  • ✅ Effective visualization format: The use of divergent stacked bar charts effectively visualizes the distribution of ratings (from 'Very unfavourable' to 'Very favourable') for both AMIE and PCP on each quality metric, allowing for direct comparison.
  • ✅ Consistent and intuitive color mapping: The consistent color scheme across all bars, mapping colors to levels of favorability (e.g., darker blue for 'Very favourable', darker red for 'Very unfavourable'), aids interpretation.
  • ✅ Clear grouping by questionnaire: Grouping related questions under standardized questionnaire headings (GMCPQ, PACES, PCCBP) provides structure and context.
  • ✅ Direct comparison layout: Placing the AMIE and PCP bars adjacent for each metric facilitates direct comparison.
  • ✅ Inclusion of statistical details (P-value, N): Including the P-value and sample size (N) next to each comparison provides immediate statistical context, enhancing the figure's self-containedness.
  • ✅ Clear legend: The legend clearly explains the color mapping and the order (AMIE top, PCP bottom).
  • 💡 Minor label length considerations: Some labels for the quality metrics are slightly long (e.g., 'Discussing roles and responsibilities (Y/N)'). While accurate, minor abbreviation or rephrasing might slightly improve visual balance if space were constrained, but it's generally acceptable.
  • 💡 Explicit legend note for Y/N mapping: The mapping of binary 'Yes/No' responses to 'Favourable'/'Unfavourable' is explained in the caption text but could be explicitly noted in the legend for absolute clarity within the figure itself (e.g., adding '(Y/N mapped)' to relevant legend entries).
Fig. 5 | Specialist physician ratings. Conversation and reasoning qualities, as...
Full Caption

Fig. 5 | Specialist physician ratings. Conversation and reasoning qualities, as assessed by specialist physicians.

Figure/Table Image (Page 6)
First Reference in Text
Again, AMIE's responses were rated significantly better by the specialists than those from the PCPs on 30 out of 32 evaluation axes, with the specialists preferring AMIE's consultations, diagnoses and management plans over those from the PCPs (Fig. 5).
Description
  • Comparison Overview: Specialist Ratings: This figure displays ratings from specialist physicians who evaluated the quality of consultations and reasoning demonstrated by the AI system AMIE (top bar in each pair) compared to human Primary Care Physicians (PCPs, bottom bar). The evaluation covered 159 simulated patient scenarios.
  • Rating Scales and Categories: Ratings were based on established clinical assessment frameworks: PACES (Practical Assessment of Clinical Examination Skills - covering aspects like history taking accuracy, clarity, structure, empathy, and patient welfare) and PCCBP (Patient-Centred Communication Best Practice - covering relationship fostering, information gathering/providing, decision making, etc.). Additional metrics assessed the appropriateness and comprehensiveness of the differential diagnosis (DDx - the list of possible conditions) and the proposed management plan (investigations, treatments, follow-up).
  • Data Visualization Format: Data is visualized using divergent stacked bar charts, where each bar shows the distribution of ratings from 'Very unfavourable' (dark red) to 'Very favourable' (dark blue) based on the median rating from three specialists per case. Yes/No questions are mapped to 'Favourable'/'Unfavourable'.
  • Overall Trend: AMIE Rated More Favorably by Specialists: The visual trend across nearly all metrics shows larger blue segments (favourable ratings) and smaller red segments (unfavourable ratings) for AMIE compared to PCPs, indicating specialists generally rated AMIE higher.
  • Statistical Significance: The reference text and figure annotations state that AMIE was rated significantly better (P < 0.05 after FDR correction using Wilcoxon signed-rank tests) on 30 out of the 32 evaluation axes presented.
  • Exceptions: Non-significant Differences: Two metrics showed no significant difference: 'Escalation recommendation appropriate (Y/N)' (P = 0.1210) and 'Confabulation absent (Y/N)' (P = 0.4795). For these, the bars appear visually similar between AMIE and PCP, and both performed well (high percentage of 'Favourable'/'Yes' ratings).
Scientific Validity
  • ✅ Credible evaluation by specialist physicians: Using specialist physicians, who possess domain expertise relevant to the scenarios, provides a credible and clinically relevant assessment of consultation quality, diagnosis, and management.
  • ✅ Use of median ratings from multiple specialists: Employing median ratings from three specialists helps mitigate individual rater bias and is an appropriate aggregation method for ordinal rating scales.
  • ✅ Comprehensive assessment across multiple dimensions: The assessment covers a broad range of clinically important dimensions, including communication skills (PACES, PCCBP), diagnostic reasoning (DDx appropriateness/comprehensiveness), and management planning, providing a comprehensive evaluation.
  • ✅ Appropriate statistical analysis: The statistical approach (Wilcoxon signed-rank test for paired ordinal data, with FDR correction for multiple comparisons) is methodologically sound.
  • ✅ Strong support for claim of AMIE superiority in specialist ratings: The figure provides strong visual and statistical support for the central claim that specialist physicians rated AMIE significantly higher than PCPs across the vast majority of evaluation axes in this study.
  • 💡 Limitation: Subjectivity of specialist ratings: While specialists provide expert judgment, their ratings are still subjective and could potentially be influenced by factors like response length or writing style, even when evaluating transcripts and outputs.
  • 💡 Limitation: Evaluation based on transcripts/outputs, not live interaction: The evaluation is based on reviewing transcripts and post-consultation outputs (like DDx lists), not a live interaction. Specialists lack access to non-verbal cues or real-time interaction dynamics that might influence assessment in a true clinical setting or live OSCE.
  • ✅ Non-significant findings provide valuable context: The findings regarding the two non-significant axes ('Escalation recommendation', 'Confabulation absent') are also informative, suggesting areas where AMIE performed comparably well to PCPs, or where perhaps the task was less discriminating.
Communication
  • ✅ Effective visualization format: The figure effectively uses divergent stacked bar charts, similar to Fig. 4, allowing for easy visual comparison of rating distributions between AMIE and PCPs across numerous metrics.
  • ✅ Clear grouping of metrics: Metrics are clearly grouped by assessment category (PACES, PCCBP, Diagnosis and management), which aids interpretation and provides structure.
  • ✅ Consistent color scheme and layout: The consistent color scheme mapping favorability and the adjacent placement of AMIE/PCP bars facilitate direct comparison.
  • ✅ Inclusion of statistical details (P-value, N): Including P-values and sample sizes (N) directly on the chart for each metric significantly enhances its informational value and self-containedness.
  • ✅ Clear legend: The legend clearly defines the rating scale mapping and identifies the AMIE (top) and PCP (bottom) bars.
  • ✅ Manages high information density well: The number of metrics presented is high (32 axes). While comprehensive, this makes the figure quite dense. However, the clear structure prevents it from becoming overwhelming.
  • 💡 Consider acronym expansion in legend/footnote: Similar to Fig. 4, expanding acronyms like PACES and PCCBP briefly in a footnote or legend could improve accessibility for readers less familiar with these specific assessment tools.
Extended Data Fig. 2 | DDx top-k accuracy for non-disease-states and positive...
Full Caption

Extended Data Fig. 2 | DDx top-k accuracy for non-disease-states and positive disease-states. a, b: Specialist rated DDx top-k accuracy for the 149 "positive" scenarios with respect to (a) the ground-truth diagnosis and (b) the accepted differentials. c, d: Specialist rated DDx top-k accuracy for the 10 "negative" scenarios with respect to (c) the ground-truth diagnosis and (d) the accepted differentials.

Figure/Table Image (Page 16)
First Reference in Text
AMIE has superior DDx accuracy on the set of 149 primarily positive disease state scenarios (in which only three scenarios had a ground-truth of a non-disease state).
Description
  • Top-k Accuracy vs Accepted Differentials (Positive Scenarios): This graph shows the top-k accuracy for the 149 'positive' scenarios (where a disease state is generally present), but accuracy is measured against any diagnosis listed in the 'accepted differential' list provided by specialists, not just the single ground truth. This represents a more lenient accuracy measure.
  • AMIE vs PCP Performance: Similar to panel (a), AMIE (orange line) shows higher accuracy than PCPs (blue line) across all k values.
  • Accuracy Levels and Trend: Accuracies are higher overall compared to panel (a) because matching any accepted differential is easier than matching the specific ground truth. AMIE's top-1 accuracy is around 90%, reaching over 95% by k=3. PCP's top-1 accuracy is around 80%, reaching over 90% by k=4.
  • Confidence Intervals and Significance: The confidence intervals (shaded areas) do not overlap, visually suggesting a significant difference, which is confirmed by the p-values in the caption (P < 0.05 for all k after FDR correction).
Scientific Validity
  • ✅ Clinically relevant metric (vs accepted differentials): Measuring accuracy against the accepted differential list is a clinically relevant approach, as often multiple diagnoses are considered plausible or acceptable initially.
  • ✅ Focus on positive scenarios: Comparing performance on positive disease state scenarios is important for evaluating the system's ability to identify actual conditions.
  • ✅ Sound statistical analysis: The statistical analysis (bootstrap testing with FDR correction) remains appropriate for this comparison.
  • ✅ Strong support for claim (vs accepted differentials): The results strongly support the claim that AMIE's diagnostic lists are more likely than PCPs' lists to contain an acceptable diagnosis within the top-k positions for these positive scenarios.
  • 💡 Dependence on specialist consensus for accepted list: The definition of 'accepted differentials' relies on specialist consensus, which might have some inherent subjectivity, although using a majority vote mitigates this.
Communication
  • ✅ Consistent and clear visualization: Clear presentation using line graphs with distinct colors and confidence intervals, consistent with other figures.
  • ✅ Clear labeling: Axes are clearly labeled, and the legend is unambiguous.
  • ✅ Specific caption description: The caption clearly defines the content of this specific panel (positive scenarios vs accepted differentials).
  • ✅ Statistical context provided: Including FDR-adjusted p-values in the caption text provides necessary statistical context.
Extended Data Fig. 3 | Specialist rated DDx accuracy by scenario specialty....
Full Caption

Extended Data Fig. 3 | Specialist rated DDx accuracy by scenario specialty. Top-k DDx accuracy for scenarios with respect to the ground-truth in (a) Cardiology (N=31, not significant), (b) Gastroenterology (N=33, not significant), (c) Internal Medicine (N=16, significant for all k), (d) Neurology (N=32, significant for k > 5), (e) Obstetrics and Gynaecology (OBGYN)/Urology (N=15, not significant), (f) Respiratory (N=32, significant for all k).

Figure/Table Image (Page 17)
First Reference in Text
Accuracy by specialty. Extended Data Fig. 3 illustrates the DDx accuracy achieved by AMIE and the PCPs across the six medical specialties covered by the scenarios in our study.
Description
  • Cardiology Top-k Accuracy vs Ground Truth (N=31): This panel (a) specifically shows the top-k diagnostic accuracy for 31 scenarios within the Cardiology specialty. It compares AMIE (orange line) and PCPs (blue line) based on whether their ranked list of diagnoses (differential diagnosis or DDx) included the single correct 'ground-truth' diagnosis within the top 'k' positions (where k ranges from 1 to 10).
  • Performance Comparison and Confidence Intervals: Visually, the lines for AMIE and PCP are close together, with their 95% confidence intervals (shaded areas) overlapping substantially across all values of k.
  • Accuracy Levels: AMIE's top-1 accuracy appears slightly above 80%, while PCP's is slightly below 80%. Both improve as k increases.
  • Non-Significant Difference: The caption explicitly states that the difference between AMIE and PCP accuracy in Cardiology scenarios was not statistically significant (P-values > 0.05 after FDR correction).
Scientific Validity
  • ✅ Important subgroup analysis: Breaking down the overall accuracy by specialty is crucial for understanding potential variations in performance across different clinical domains.
  • ✅ Appropriate metrics and statistics: The analysis uses the appropriate metric (top-k accuracy vs ground truth) and statistical comparison methods (bootstrap testing with FDR mentioned in caption text for Fig 3).
  • ✅ Moderate sample size (N=31): The sample size for Cardiology (N=31) is moderate, providing some confidence in the finding of non-significance, although larger numbers would increase power.
  • ✅ Highlights domain-specific performance: The finding that AMIE does not significantly outperform PCPs in Cardiology contrasts with the overall results and results in other specialties (like Respiratory), highlighting domain-specific performance differences.
  • 💡 Potential uncontrolled variable: scenario difficulty: Potential variations in scenario difficulty within or between specialties are not controlled for in this visualization itself, although the randomization helps mitigate systematic bias.
Communication
  • ✅ Consistent visualization format: The visualization uses a consistent format (line graph, colors, confidence intervals) across all panels, facilitating comparison between specialties.
  • ✅ Clear labeling: Axes are clearly labeled ('Top-k', 'Cardiovascular Accuracy (%)'). The title clearly indicates the specialty.
  • ✅ Informative caption elements: The caption provides the sample size (N=31) and the key finding (not significant), aiding interpretation.
  • ✅ Visuals align with statistics: The overlapping confidence intervals visually align with the reported non-significance.
Extended Data Fig. 4 | DDx accuracy by location. a, b: Specialist DDx rating of...
Full Caption

Extended Data Fig. 4 | DDx accuracy by location. a, b: Specialist DDx rating of AMIE and the PCPs with respect to the ground-truth for the 77 cases conducted in Canada (a) and 82 cases in India (b).

Figure/Table Image (Page 18)
First Reference in Text
Accuracy by location. We observed that both AMIE and the PCPs had higher diagnostic accuracy in consultations performed in the Canada OSCE lab compared to those enacted in the India OSCE lab.
Description
  • Top-k Accuracy vs Ground Truth (Canada Cases, N=77): This graph (panel a) isolates the performance data for the 77 simulated patient scenarios conducted within the Canadian OSCE (Objective Structured Clinical Examination) lab setting. It plots the top-k diagnostic accuracy, comparing AMIE (orange line) against PCPs (blue line). Accuracy is defined as the percentage of cases where the correct ground-truth diagnosis was listed within the top 'k' positions (k=1 to 10) of the differential diagnosis provided by the agent.
  • AMIE vs PCP Performance (Canada): Within the Canadian subset, AMIE consistently demonstrates higher top-k accuracy than PCPs for all values of k. The gap appears wider than in the overall cohort (Fig 3a). AMIE's top-1 accuracy is near 90%, while PCP top-1 accuracy is around 75%.
  • Accuracy Trend with k (Canada): Both AMIE and PCP accuracies improve as k increases, with AMIE reaching near 100% accuracy by k=5, and PCPs approaching 90% accuracy by k=10.
  • Confidence Intervals and Significance (Canada): The 95% confidence intervals (shaded areas) for AMIE and PCP are distinctly separated across all k values. The caption confirms this difference is statistically significant (P < 0.05 after FDR correction for all k).
Scientific Validity
  • ✅ Valid subgroup analysis by location: Analyzing performance stratified by location (OSCE lab origin) is a valid and important step to check for potential confounding factors related to differences in scenarios, patient-actors, or participating clinicians between the sites.
  • ✅ Sufficient sample size for subgroup (N=77): The sample size for the Canadian subset (N=77) is substantial enough to allow for meaningful comparison and statistical testing.
  • ✅ Strengthens overall conclusion: The consistent finding of AMIE's superiority within this subgroup strengthens the overall study conclusion, demonstrating the effect holds within a specific geographical/lab context.
  • ✅ Appropriate metrics and statistics: The comparison uses appropriate metrics (top-k accuracy vs ground truth) and statistical methods (bootstrap testing with FDR correction).
  • 💡 Does not identify cause of location difference: While stratified by lab location, this analysis doesn't disentangle the specific reasons for potential performance differences between Canada and India (e.g., inherent scenario difficulty, actor training, PCP baseline performance). It identifies the difference but not the cause.
  • 💡 Requires comparison with panel (b) for full context: The reference text notes higher accuracy in Canada compared to India for both AMIE and PCPs. This panel only shows the Canada data; comparison with panel (b) is needed to fully appreciate the location effect described in the text.
Communication
  • ✅ Consistent visualization format: The graph uses the standard, clear format seen in previous figures (line graph, distinct colors, confidence intervals) for consistency.
  • ✅ Clear labeling and title: Axes are clearly labeled ('Top-k', 'Accuracy (%)'), and the title clearly indicates the location ('Canada').
  • ✅ Informative caption elements (N, p-values): The caption specifies the sample size (N=77) and provides the FDR-adjusted p-values, making the figure informative.
  • ✅ Visuals align with statistics: The non-overlapping confidence intervals visually reinforce the statistical significance reported in the caption.
Extended Data Fig. 5 | Auto-evaluation of DDx performance. a, b: Top-k DDx...
Full Caption

Extended Data Fig. 5 | Auto-evaluation of DDx performance. a, b: Top-k DDx auto-evaluation of AMIE's and the PCP's differential diagnoses from their own consultations with respect to the ground-truth (a, significant for k > 3) and the list of accepted differentials (b, significant for k > 4). c, d: Top-k DDx auto-evaluation of AMIE's differential diagnoses when provided its own vs. the PCP's consultation transcript with respect to the ground-truth (c, not significant) and the list of accepted differentials (d, not significant).

Figure/Table Image (Page 19)
First Reference in Text
Auto-evaluation accuracy. We reproduced the DDx accuracy analysis with our model-based DDx auto-evaluator using the same procedure as in Fig. 3.
Description
  • Auto-Evaluated Accuracy Comparison: This graph (panel a) shows diagnostic accuracy results similar to Fig. 3a, but instead of human specialist ratings, it uses an automated evaluation system ('auto-evaluator') – likely another AI model – to judge the accuracy. It compares AMIE (orange line) and PCPs (blue line) based on their performance in their respective consultations across 159 scenarios. A hedged sketch of how such a model-based judge might work appears after this list.
  • Metric: Top-k Accuracy vs Ground Truth: Accuracy is measured using 'top-k accuracy' against the single correct 'ground-truth' diagnosis. This means the graph shows the percentage of cases where the ground truth diagnosis was found within the top 'k' diagnoses listed by AMIE or the PCP, as determined by the auto-evaluator. 'k' ranges from 1 to 10.
  • Performance Trend and Significance: The trend shows AMIE generally performing better than PCPs, with the difference becoming more pronounced and statistically significant (according to the caption) for k values greater than 3. For k=1, 2, 3, the lines are closer, and confidence intervals overlap more.
  • Comparison to Specialist Ratings (Fig. 3a): Compared to the specialist ratings in Fig. 3a, the absolute accuracy values appear slightly lower here for both AMIE and the PCPs, especially at lower k values (e.g., top-1 accuracy of around 75-80% for AMIE here vs. ~85% in Fig. 3a), suggesting the auto-evaluator may be stricter or apply different criteria.
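To make the top-k metric concrete, the sketch below computes it from ranked DDx lists. The `is_match` predicate is a deliberate placeholder: in the study, the match judgment comes from specialist raters or the model-based auto-evaluator rather than string comparison, so the exact-match example shown is purely illustrative.

```python
from typing import Callable, Sequence

def top_k_accuracy(
    ddx_lists: Sequence[Sequence[str]],    # one ranked DDx list per scenario
    ground_truths: Sequence[str],          # one ground-truth diagnosis per scenario
    k: int,
    is_match: Callable[[str, str], bool],  # match judgment (specialist or auto-evaluator)
) -> float:
    """Fraction of scenarios whose ground-truth diagnosis appears in the top k."""
    hits = sum(
        any(is_match(candidate, truth) for candidate in ddx[:k])
        for ddx, truth in zip(ddx_lists, ground_truths)
    )
    return hits / len(ddx_lists)

# Purely illustrative usage with a naive exact-match predicate:
exact = lambda cand, truth: cand.strip().lower() == truth.strip().lower()
ddx_lists = [["migraine", "tension headache", "cluster headache"],
             ["GERD", "peptic ulcer disease"]]
ground_truths = ["tension headache", "gastritis"]
print(top_k_accuracy(ddx_lists, ground_truths, k=3, is_match=exact))  # -> 0.5
```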
Scientific Validity
  • ✅ Potential for scalable/consistent evaluation: Using an automated evaluator allows for scalable and potentially more consistent assessment than human raters, although its validity depends heavily on how well the auto-evaluator aligns with human expert judgment or ground truth.
  • ✅ Broad alignment with specialist-rated trends: The finding that AMIE outperforms PCPs, especially for k>3, generally aligns with the trend observed in the specialist ratings (Fig. 3a), providing some validation for the auto-evaluator's ability to discern performance differences.
  • ✅ Appropriate statistical testing: The statistical significance testing (bootstrap with FDR correction mentioned in Fig. 3 caption, assumed applied here) is appropriate.
  • 💡 Validity of the auto-evaluator is assumed, not demonstrated: The validity and potential biases of the 'model-based DDx auto-evaluator' itself are not detailed here. Its performance relative to human specialists as an evaluator is crucial but not shown in this figure; an illustrative sketch of how such a judge could be structured follows this list.
  • 💡 Discrepancy with specialist ratings needs explanation: The discrepancy in absolute accuracy scores compared to Fig. 3a suggests the auto-evaluator may not perfectly replicate specialist judgment. Understanding the reasons for this difference (e.g., criteria, strictness) is important.
  • 💡 Different significance thresholds compared to specialist ratings: The significance threshold (k>3) differs from Fig. 3a (significant for all k), further highlighting potential differences between auto-evaluation and specialist evaluation.
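Because the auto-evaluator's internals are not described in this figure, the following is only an illustrative pattern for how a model-based judge could be plugged into the `is_match` interface sketched above; it is not the authors' implementation, and `llm_complete` is a hypothetical text-completion callable.

```python
def make_llm_matcher(llm_complete):
    """Wrap a text-completion callable into an is_match(candidate, truth) predicate.

    `llm_complete` is a hypothetical function that takes a prompt string and
    returns the model's text response; any LLM client could be adapted to it."""
    def is_match(candidate: str, truth: str) -> bool:
        prompt = (
            "Do the following two diagnoses refer to substantially the same "
            "condition? Answer 'yes' or 'no'.\n"
            f"Candidate diagnosis: {candidate}\n"
            f"Ground-truth diagnosis: {truth}\n"
            "Answer:"
        )
        return llm_complete(prompt).strip().lower().startswith("yes")
    return is_match
```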
Communication
  • ✅ Consistent and clear visualization: The line graph format is consistent with previous figures (e.g., Fig. 3), facilitating comparison. Distinct colors and shaded confidence intervals are used effectively.
  • ✅ Clear labeling: Axes ('Top-k', 'Auto-eval Accuracy (%)') and the legend (AMIE, PCP) are clearly labeled.
  • ✅ Informative caption elements: The caption specifies the comparison (AMIE vs PCP based on own consults, vs ground truth) and significance levels, aiding interpretation.
  • ✅ Visuals align with reported significance thresholds: The visual separation between the lines and confidence intervals becomes clearer for k > 3, aligning with the significance reported in the caption.
Extended Data Fig. 6 | Consultation verbosity and efficiency of information...
Full Caption

Extended Data Fig. 6 | Consultation verbosity and efficiency of information acquisition. a, Total patient actor words elicited by AMIE and the PCPs. b, Total words sent to patient actor from AMIE and the PCPs. c, Total number of turns in AMIE vs. the PCP consultations.

Figure/Table Image (Page 20)
Extended Data Fig. 6 | Consultation verbosity and efficiency of information acquisition. a, Total patient actor words elicited by AMIE and the PCPs. b, Total words sent to patient actor from AMIE and the PCPs. c, Total number of turns in AMIE vs. the PCP consultations.
First Reference in Text
Efficiency of information acquisition. Although AMIE displayed greater verbosity compared to the PCPs, in terms of total number of words generated in their responses during the consultation, the number of conversational turns and the number of words elicited from the patient-actors were similar across both OSCE agents, as illustrated in Extended Data Fig. 6a-c.
Description
  • Comparison of Patient Word Count: This panel (a) uses box plots to compare the total number of words spoken (or typed, in this text-based context) by the simulated patient-actors during consultations with AMIE versus consultations with human Primary Care Physicians (PCPs). A box plot summarizes the distribution of data, showing the median (center line), the middle 50% of the data (the box, or interquartile range), and the overall range excluding outliers (the 'whiskers'). Outliers are shown as individual points.
  • Similar Median Word Counts: The median number of words elicited from patient-actors appears very similar for both AMIE and the PCPs, at roughly 400 words. The interquartile ranges (IQRs) also look comparable.
  • Presence of Outliers: Both distributions show some outliers, indicating consultations where the patient-actor typed significantly more words than average.
  • Support for Text Claim: This visualization supports the statement in the reference text that the number of words elicited from patient-actors was similar across both OSCE agents (AMIE and PCPs).
Scientific Validity
  • ✅ Relevant metric for information elicitation: Measuring the number of words elicited from the patient is a relevant, quantifiable proxy for the amount of information gathered during the consultation.
  • ✅ Appropriate visualization method: The box plot is an appropriate visualization method for comparing the distributions of this continuous variable (word count) between the two groups (AMIE vs PCP).
  • ✅ Supports claim of similar elicitation: The visual similarity strongly supports the claim in the text that patient word elicitation was comparable between AMIE and PCPs.
  • 💡 Statistical test results not shown on plot: The plot itself does not display the results of a statistical test to formally confirm the lack of a significant difference between the two groups, although the visual overlap suggests the difference is unlikely to be significant; an illustrative test is sketched after this list.
  • 💡 Limitation: Word count vs information quality: Word count is only a proxy for information quality; it doesn't measure the relevance or clinical utility of the elicited words.
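A formal test of the 'similar word counts' claim would be simple to add. The example below runs a two-sided Mann-Whitney U test on hypothetical word counts, purely to illustrate the kind of analysis that could accompany the plot; the paper does not report this test here, and the numbers are invented for the sketch.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-consultation word counts; the real per-consultation values
# are not reproduced in the figure, so these numbers are invented for the sketch.
amie_words = np.array([380, 420, 450, 390, 610, 400, 415])
pcp_words = np.array([370, 430, 440, 385, 700, 395, 410])

stat, p = mannwhitneyu(amie_words, pcp_words, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")
```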
Communication
  • ✅ Clear visualization of distribution: The box plot format clearly displays the distribution (median, interquartile range, outliers) of patient word counts for both AMIE and PCP consultations.
  • ✅ Clear axis labeling: Axes are clearly labeled ('OSCE Agent', 'Total Patient Actor Words'), making the plot easy to understand.
  • ✅ Facilitates direct comparison: The direct side-by-side comparison facilitates assessment of similarities or differences between the two agents.
  • ✅ Clear caption component: The caption clearly defines what this panel represents.
  • 💡 Consider adding median value annotation: While the box plot shows the distribution, adding the specific median values as text annotations could slightly enhance readability.

Discussion

Key Aspects

Strengths

Suggestions for Improvement

Conclusion

Key Aspects

Strengths

Suggestions for Improvement

Methods

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Extended Data Fig. 1 | User interfaces for the online consultation and...
Full Caption

Extended Data Fig. 1 | User interfaces for the online consultation and evaluation processes.

Figure/Table Image (Page 15)
Extended Data Fig. 1 | User interfaces for the online consultation and evaluation processes.
First Reference in Text
Extended Data Fig. 1 | User interfaces for the online consultation and evaluation processes.
Description
  • Figure Overview: This figure displays screenshots of the two primary user interfaces used in the study.
  • Chat Interface: The left panel shows the 'Chat Interface'. This is the screen used for the online text-based consultations between the simulated patient-actor and the OSCE agent (either AMIE or a PCP). It resembles a standard instant messaging application, with messages appearing in bubbles, indicating the back-and-forth nature of the diagnostic dialogue. An example exchange about headache symptoms is visible.
  • Specialist Physician Evaluation Interface: The right panel shows the 'Specialist Physician Evaluation Interface'. This is the tool used by the specialist physicians to assess the performance of AMIE and the PCPs after the consultations. It displays several sections: the original patient scenario description (1. Scenario), the probable and alternative diagnoses that form the answer key (2. Answer Key), the full transcript of the text consultation (3. Online Text-based Consultation), and the post-consultation answers provided by the OSCE agent, such as their differential diagnosis list and escalation decision (4. Post-Questionnaire Completed by Doctor). The interface includes specific questions for the specialist evaluator, such as rating how closely the agent's differential diagnosis matched the answer key, answered via radio buttons and checkboxes.
Scientific Validity
  • ✅ Enhances methodological transparency: Showing the user interfaces enhances the transparency and reproducibility of the study methodology. It allows readers to see the specific tools used for data collection and evaluation.
  • ✅ Appropriate chat interface design: The chat interface appears suitable for conducting synchronous text-based medical consultations, providing a clear log of the interaction.
  • ✅ Structured evaluation interface: The evaluation interface seems well-structured for presenting the necessary information (scenario, transcript, agent responses, answer key) to the specialist evaluators and collecting their ratings in a standardized format.
  • ✅ Supports description of study design: The figure provides visual evidence supporting the description of the remote OSCE process outlined in the Methods and Fig. 2.
  • 💡 Cannot assess usability from static images: While the interfaces appear functional, the static screenshots do not allow for assessment of usability factors (e.g., ease of navigation, response time) which could potentially influence participant experience or evaluator efficiency, although this is a minor point for a static figure.
  • 💡 Interface design could potentially influence evaluators (though appears neutral): The design of the evaluation interface, particularly how information is presented and questions are framed, could potentially influence specialist ratings. However, the interface appears designed to present information neutrally.
Communication
  • ✅ Clear presentation of key interfaces: The figure clearly presents the two key interfaces side-by-side, allowing readers to visualize the tools used for both the primary interaction (chat) and the subsequent evaluation.
  • ✅ Legible screenshots: The screenshots are generally clear and legible, showing representative examples of the chat flow and the evaluation form.
  • 💡 Potential for annotations: Annotations or callouts could have been used to highlight specific features or elements within the interfaces that are particularly relevant to the study design (e.g., the specific rating scales in the evaluation interface, the input/output fields in the chat).
  • ✅ Adequate resolution: The resolution appears adequate for understanding the layout and general content of the interfaces.
  • ✅ Enhances understanding of methodology: Providing these visual examples greatly enhances the reader's understanding of the experimental setup described in the Methods section and Figure 2.
Extended Data Table 1 | Practical Assessment of Clinical Examination Skills...
Full Caption

Extended Data Table 1 | Practical Assessment of Clinical Examination Skills (PACES) rubric details

Figure/Table Image (Page 21)
Extended Data Table 1 | Practical Assessment of Clinical Examination Skills (PACES) rubric details
First Reference in Text
Not explicitly referenced in main text
Description
  • Rubric Overview: This table outlines the specific questions and structure of the Practical Assessment of Clinical Examination Skills (PACES) rubric used in the study to evaluate the performance of AMIE and the PCPs.
  • Assessment Domains: The rubric is divided into several key domains: 'Clinical Communication Skills' (history taking, covering the presenting complaint, systems review, and past medical, family and medication history, as well as explaining information accurately, clearly, and professionally with appropriate structure and comprehensiveness), 'Differential Diagnosis' (constructing a sensible diagnosis list), 'Clinical Judgement' (selecting an appropriate management plan), 'Managing Patient Concerns' (addressing concerns, confirming understanding, showing empathy), and 'Maintaining Patient Welfare'. A sketch of how this structure could be encoded for analysis follows this list.
  • Rating Scale: Most assessment questions listed are rated using a '5-point scale'. The specific anchors or meaning of the points on the scale are not defined within this table.
  • Assessors: The table indicates who performed the assessment for each question. Most items under 'Clinical Communication Skills', 'Differential Diagnosis', and 'Clinical Judgement' were assessed solely by the 'Specialist' physicians. Items under 'Managing Patient Concerns' and 'Maintaining Patient Welfare' were assessed by both the 'Specialist & Patient Actor'.
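For readers who want to tabulate or re-analyze ratings against this rubric, its structure maps naturally onto a small data schema. The encoding below is an illustrative sketch rather than an artefact released with the paper, and the question texts are abbreviated, not verbatim from the table.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RubricItem:
    domain: str                   # PACES domain the question belongs to
    question: str                 # abbreviated question text (not verbatim)
    scale: str                    # rating scale as described in the table
    assessed_by: Tuple[str, ...]  # who rated the item

PACES_ITEMS = [
    RubricItem("Clinical Communication Skills",
               "Elicited history (presenting complaint, systems review, past medical/family/medication history)",
               "5-point scale", ("Specialist",)),
    RubricItem("Clinical Communication Skills",
               "Explained information accurately, clearly and professionally",
               "5-point scale", ("Specialist",)),
    RubricItem("Differential Diagnosis",
               "Constructed a sensible differential diagnosis",
               "5-point scale", ("Specialist",)),
    RubricItem("Clinical Judgement",
               "Selected an appropriate management plan",
               "5-point scale", ("Specialist",)),
    RubricItem("Managing Patient Concerns",
               "Addressed concerns, confirmed understanding, showed empathy",
               "5-point scale", ("Specialist", "Patient Actor")),
    RubricItem("Maintaining Patient Welfare",
               "Maintained patient welfare",
               "5-point scale", ("Specialist", "Patient Actor")),
]
```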
Scientific Validity
  • ✅ Use of an established assessment framework (PACES): The PACES framework is a well-established component of postgraduate medical examinations (specifically the MRCP(UK) exam) and is recognized for assessing clinical skills, lending credibility to its use here.
  • ✅ Enhances methodological transparency: Detailing the specific rubric questions enhances the transparency and reproducibility of the evaluation methodology. It allows readers to understand precisely which aspects of performance were assessed under the PACES criteria.
  • ✅ Clinically relevant assessment domains: The domains covered by the rubric (communication, diagnosis, judgment, patient concerns, welfare) are clinically relevant and appropriate for evaluating the quality of a diagnostic consultation.
  • ✅ Clear distinction between assessor roles: The rubric clearly distinguishes which items were rated by specialists versus those rated by both specialists and patient-actors, reflecting a thoughtful approach to capturing different perspectives.
  • 💡 Potential limitation: Applying in-person rubric to text-based interaction: The PACES exam traditionally involves face-to-face encounters with real or simulated patients, including physical examination. Applying this rubric, particularly communication aspects, to a purely text-based interaction might require adaptation or interpretation by the raters, which is a potential limitation.
  • 💡 Lack of definition for 5-point scale anchors: The specific anchors of the 5-point scale are not defined. While likely standardized within the PACES context, explicitly stating them would remove any ambiguity about how performance levels were categorized.
Communication
  • ✅ Clear tabular structure: The table is clearly structured with distinct columns for the assessment question, the rating scale used, and who performed the assessment.
  • ✅ Logical grouping of questions: Questions are grouped under logical headings (Clinical Communication Skills, Differential Diagnosis, Clinical Judgement, Managing Patient Concerns, Maintaining Patient Welfare), making the rubric's scope easy to grasp.
  • ✅ Concise presentation: The information is presented concisely and efficiently.
  • 💡 Define scale anchors: Specifying '5-point scale' is clear, but defining the anchors of the scale (e.g., 1=Very Poor, 5=Very Good) directly in the table or a footnote would make it fully self-contained.
  • ✅ Clear indication of assessors: The 'Assessed by' column clearly indicates whether specialists alone or both specialists and patient-actors rated each item, which is crucial information.
Extended Data Table 2 | Patient-Centred Communication Best Practice (PCCBP)...
Full Caption

Extended Data Table 2 | Patient-Centred Communication Best Practice (PCCBP) rubric details

Figure/Table Image (Page 22)
Extended Data Table 2 | Patient-Centred Communication Best Practice (PCCBP) rubric details
First Reference in Text
Not explicitly referenced in main text
Description
  • Rubric Overview: This table details the structure of the Patient-Centred Communication Best Practice (PCCBP) rubric used for evaluating communication aspects in the study.
  • Assessment Domains: The rubric assesses six core domains of patient-centered communication: 'Fostering the Relationship', 'Gathering Information', 'Providing Information', 'Decision Making', 'Enabling Disease and Treatment-Related Behavior', and 'Responding to Emotions'. For each domain, a general question is posed about rating the doctor's behavior.
  • Rating Scales and Assessors (Specialist): The table indicates the type of rating scale used for each domain. Most domains ('Gathering Information', 'Providing Information', 'Decision Making', 'Enabling Behavior', 'Responding to Emotions') were assessed by 'Specialist' physicians using a '5-point scale'.
  • Rating Scales and Assessors (Fostering Relationship): For the 'Fostering the Relationship' domain, the table shows two assessment methods: Specialists used a '5-point scale', while the 'Patient Actor' used a 'Binary scale per criterion'. The specific criteria for the binary scale are not listed.
Scientific Validity
  • ✅ Use of an established communication framework (PCCBP): PCCBP represents a well-established conceptual framework for evaluating key aspects of effective patient-clinician communication, making its use relevant and appropriate for this study.
  • ✅ Enhances methodological transparency: Clearly defining the domains assessed under PCCBP enhances the transparency of the communication quality evaluation.
  • ✅ Clinically relevant communication domains: The domains covered are central to patient-centered care and provide a comprehensive assessment of communication beyond basic information exchange.
  • ✅ Differentiated assessment methods: Distinguishing between specialist ratings (on a 5-point scale) and patient-actor ratings (binary criteria for relationship fostering) shows a nuanced approach, although the specifics of the binary criteria are missing.
  • 💡 Applicability of framework to text-based interactions: The PCCBP framework was developed for general clinical encounters. Its direct applicability and potential need for adaptation when evaluating purely text-based interactions should be considered, as non-verbal cues are absent.
  • 💡 Missing details on binary criteria limit full assessment: Without knowing the specific binary criteria used by the patient-actor for 'Fostering the Relationship', it's difficult to fully assess the validity and scope of that particular measurement.
  • 💡 Rater reliability not addressed in table: The reliability of applying these scales (both 5-point and binary) in this context (specialists evaluating transcripts, actors rating text interactions) is not addressed in the table itself.
Communication
  • ✅ Clear structure by domain: The table clearly outlines the main domains of the PCCBP framework used for evaluation.
  • ✅ Well-defined columns: Columns for Question, Scale, and Assessed by are distinct and easy to understand.
  • ✅ Clean layout: The layout is clean and uncluttered, making the information accessible.
  • 💡 Specific binary criteria not listed: While the general question for each domain is clear (e.g., "How would you rate the doctor's behavior of FOSTERING A RELATIONSHIP with the patient?"), the table notes that the Patient Actor used a 'Binary scale per criterion' for 'Fostering the Relationship', but these specific criteria are not listed in the table itself. Listing these criteria or referencing where they can be found would improve completeness.
  • 💡 Define 5-point scale anchors: Similar to Table 1, explicitly defining the anchors for the '5-point scale' would enhance clarity.
Extended Data Table 3 | Diagnosis and Management rubric details
Figure/Table Image (Page 23)
Extended Data Table 3 | Diagnosis and Management rubric details
First Reference in Text
Not explicitly referenced in main text
Description
  • Rubric Overview: This table details the rubric used by specialist physicians to evaluate the diagnostic and management capabilities demonstrated by AMIE and the PCPs.
  • Diagnosis Assessment Criteria: The 'Diagnosis' section assesses the appropriateness (5-point scale: Very Inappropriate to Very Appropriate) and comprehensiveness (4-point scale: Major candidates missing to All reasonable candidates included) of the differential diagnosis (DDx). It also evaluates how closely the DDx came to including the 'Probable Diagnosis' and any 'Plausible Alternative Diagnoses' from the provided answer key (both on 5-point scales measuring relatedness).
  • Management Assessment Criteria: The 'Management' section evaluates several aspects: whether escalation to non-text consultation was appropriately recommended (4-point scale); whether appropriate investigations were suggested (3-point scale: No/Incomplete/Comprehensive) and inappropriate ones avoided (Binary: Yes/No); whether appropriate treatments were suggested (3-point scale) and inappropriate ones avoided (Binary); the overall appropriateness of the management plan including emergency referrals (5-point scale); and the appropriateness of follow-up recommendations (4-point scale).
  • Confabulation Assessment: A final 'Confabulation' section assesses whether the OSCE agent (AMIE or PCP) made things up or stated non-factual information during the consultation or in their post-questionnaire answers (Binary scale: Yes/No).
  • Assessor: All items in this rubric were assessed exclusively by the 'Specialist' physicians.
Scientific Validity
  • ✅ Clinically relevant assessment criteria: The rubric covers key aspects of clinical reasoning and decision-making, including generating a differential diagnosis, selecting appropriate investigations and treatments, and planning follow-up, making it highly relevant for evaluating diagnostic performance.
  • ✅ Standardized assessment approach: Using specific questions with defined scales and options promotes standardized assessment across different specialists and scenarios.
  • ✅ Use of answer key for objective comparison: Assessing against an 'answer key' (probable/plausible diagnoses) provides an objective benchmark for diagnostic accuracy components.
  • ✅ Inclusion of confabulation assessment: Including an assessment for 'Confabulation' is important for evaluating the safety and reliability of AI systems like AMIE, which can sometimes generate plausible but incorrect information.
  • 💡 Post-hoc evaluation limitation: The evaluation relies on specialist judgment based on reviewing transcripts and outputs. While necessary, this post-hoc assessment might differ from evaluating decision-making processes in real-time.
  • 💡 Inherent subjectivity in 'appropriateness' ratings: Some criteria, like the 'appropriateness' of the DDx or management plan, still involve subjective judgment by the specialist, even with defined scales. Inter-rater reliability data (mentioned elsewhere in the paper) is crucial context.
  • 💡 Validity depends on quality of answer keys: The quality of the 'answer key' itself (the ground truth and plausible alternatives) is critical to the validity of the diagnostic accuracy assessments. The process for creating these keys is not detailed in the table.
Communication
  • ✅ Clear structure and sections: The table is well-organized into Diagnosis, Management, and Confabulation sections, with clear questions.
  • ✅ Explicit scales and options: Each question specifies the rating scale (e.g., 5-point, 4-point, binary) and lists the explicit options or anchors for multi-point scales, enhancing clarity and reproducibility.
  • ✅ Clear indication of assessor: Consistently indicating that the 'Specialist' is the assessor for all items simplifies understanding.
  • ✅ Clear question phrasing: The questions are phrased clearly and target specific aspects of diagnostic and management quality.

Related work

Key Aspects

Strengths

Suggestions for Improvement