Sequential Diagnosis with Language Models

Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz
arXiv: 2506.22405v1
Microsoft AI

Overall Summary

Study Background and Main Findings

This paper addresses the challenge of evaluating medical AI diagnostic systems by introducing a novel benchmark, the Sequential Diagnosis Benchmark (SDBench), and a new AI system, the MAI Diagnostic Orchestrator (MAI-DxO). SDBench moves beyond traditional static evaluations by presenting complex medical cases as interactive scenarios where the AI must sequentially request information, order tests, and ultimately propose a diagnosis, mirroring the real-world diagnostic process. Crucially, SDBench assesses performance not only on diagnostic accuracy but also on the cost of the diagnostic workup.

MAI-DxO is designed to excel within this framework. It simulates a 'virtual panel' of physician personas, each with a specific role (e.g., generating hypotheses, challenging assumptions, considering costs). This orchestrated approach aims to replicate the benefits of team-based clinical reasoning, mitigating cognitive biases and optimizing the balance between accuracy and cost. The system is evaluated against a range of state-of-the-art language models and a cohort of experienced physicians.

The results demonstrate that MAI-DxO achieves significantly higher diagnostic accuracy than both baseline AI models and human physicians on the SDBench cases, while simultaneously reducing diagnostic costs. For example, when paired with a leading language model, MAI-DxO achieved 80% accuracy, four times the 20% average of generalist physicians, while cutting costs by 20% relative to physicians and by 70% relative to the same model used without orchestration. Furthermore, the system's performance gains generalize across language models from different providers, demonstrating its model-agnostic design.

The authors argue that MAI-DxO's 'superhuman' performance stems from its ability to combine the broad knowledge of a generalist with the specialized expertise of multiple specialists, a capability no single human possesses. They raise the important question of whether future AI evaluations should benchmark against individual clinicians or entire hospital teams. The paper concludes by discussing limitations, such as the benchmark's reliance on complex, atypical cases, and outlines future research directions, including developing more representative benchmarks and validating the system in real-world clinical settings.

Research Impact and Future Directions

This paper presents a compelling advancement in medical AI diagnosis by introducing a novel benchmark, SDBench, and a high-performing diagnostic system, MAI-DxO. SDBench's focus on sequential reasoning, cost-awareness, and clinical realism makes it a valuable contribution to the field, addressing key limitations of existing static benchmarks. MAI-DxO's innovative multi-persona architecture, simulating a virtual panel of physicians, demonstrates the potential of structured reasoning to improve both diagnostic accuracy and cost-efficiency. The system's model-agnostic design further enhances its practical applicability.

While the study's reliance on complex, atypical cases and a simplified cost model limits the direct generalizability of its findings to everyday clinical practice, it provides a strong foundation for future research. The authors appropriately acknowledge these limitations and propose concrete directions for future work, including developing more representative benchmarks, validating the system in real-world settings, and exploring applications in medical education. The core contribution of this work lies in demonstrating the feasibility and potential of a structured, cost-aware approach to AI-driven diagnosis, opening new avenues for improving healthcare delivery and medical training.

The study's simulation-based design, while offering valuable insights, inherently constrains the strength of causal claims. The observed improvements in accuracy and cost-efficiency are demonstrably linked to the MAI-DxO architecture within the controlled environment of SDBench. However, extrapolating these gains to real-world clinical settings requires further investigation. The study cannot definitively prove the superiority of MAI-DxO over human clinicians or other AI systems in actual practice due to the simplified nature of the simulation. It does, however, provide compelling evidence for the potential of this approach, establishing a strong case for further research and development.

Critical Analysis and Recommendations

Clear and Impactful Headline Result (written-content)
The abstract effectively presents a clear headline result (80% accuracy vs. 20% for physicians), immediately establishing the potential impact of the AI system. This clear comparison makes the paper's contribution memorable and impactful.
Section: Abstract
Dual Contribution of Benchmark and System (written-content)
The abstract introduces both a novel benchmark (SDBench) and a new AI system (MAI-DxO), demonstrating a comprehensive approach to addressing the challenges of medical AI evaluation. This dual contribution strengthens the paper's overall impact.
Section: Abstract
Focus on Cost-Effectiveness (written-content)
The abstract emphasizes cost-effectiveness as a key metric alongside accuracy, aligning the research with practical healthcare needs. This focus on real-world relevance enhances the paper's appeal to clinicians and policymakers.
Section: Abstract
Compelling Problem-Solution Narrative (written-content)
The introduction establishes a clear problem-solution narrative, highlighting the limitations of static AI evaluations and positioning SDBench and MAI-DxO as solutions. This logical structure makes the paper's contributions easy to understand.
Section: Introduction
Quantification of Contributions (written-content)
The introduction immediately quantifies the significance of the contributions by presenting key performance metrics (20% physician accuracy, 79-85% MAI-DxO accuracy). This use of concrete data strengthens the paper's claims.
Section: Introduction
Clear Demonstration but Potential Selection Bias (graphical-figure)
Figure 1 effectively demonstrates the core concept of sequential diagnosis through a step-by-step example. However, presenting only a 'happy path' scenario may introduce selection bias and limit insight into the model's limitations.
Section: Introduction
Innovative Handling of Missing Information (written-content)
The methods section describes an innovative approach to handling missing information by generating synthetic findings. This enhances clinical realism and avoids biasing the evaluation. The validation of this approach by a physician panel further strengthens its credibility.
Section: Methods
Rigorous Judge Validation (written-content)
The methods section details a rigorous validation of the automated judging mechanism, comparing its scores to those of human physicians. This strengthens the credibility of the accuracy metric.
Section: Methods
Undefined Certainty Threshold (written-content)
The methods section lacks a clear definition of the 'certainty' threshold used by MAI-DxO to commit to a diagnosis. Specifying this parameter is crucial for reproducibility and understanding the model's behavior.
Section: Methods
Powerful Visualization but Limited Comparison (graphical-figure)
Figure 7 powerfully visualizes the core findings using a Pareto frontier, demonstrating MAI-DxO's dominance in the cost-accuracy trade-off. However, the lack of variance visualization for physician performance limits a nuanced comparison.
Section: Results
Quantitative and Qualitative Analysis (written-content)
The results section combines quantitative data with a compelling qualitative case study, illustrating how MAI-DxO's structured reasoning leads to superior outcomes. This strengthens the paper's narrative and explanatory power.
Section: Results
Framing of Superhuman Performance (written-content)
The discussion provides a thoughtful framing of 'superhuman' performance by comparing the AI to a team of specialists rather than individual physicians. This nuanced perspective addresses a key challenge in AI evaluation.
Section: Discussion
Acknowledgement of Limitations (written-content)
The discussion acknowledges key limitations, such as the non-representative case distribution and simplified cost model, enhancing the paper's credibility and transparency.
Section: Discussion
Roadmap for Future Research (written-content)
The discussion proposes concrete directions for future research, including developing more realistic benchmarks and exploring real-world applications. This provides a clear roadmap for building upon the paper's contributions.
Section: Discussion
Lack of Concrete Framework for Team-Based Evaluation (written-content)
While the discussion raises the important question of team-based AI evaluation, it would be strengthened by proposing a concrete framework for such an evaluation. This would make the idea more actionable and impactful.
Section: Discussion

Section Analysis

Abstract

Introduction

Non-Text Elements

Figure 1: Example of an AI agent solving a sequential-diagnosis reasoning...
Full Caption

Figure 1: Example of an AI agent solving a sequential-diagnosis reasoning problem.

Figure/Table Image (Page 4)
First Reference in Text
Sequential diagnosis is a cornerstone of clinical reasoning, wherein physicians refine their diagnostic hypotheses step-by-step through iterative questioning and testing. Figure 1 illustrates how a diagnostician might approach a case given limited initial information, posing broad then increasingly specific questions to narrow down the differential to a likely malignancy, followed by imaging, biopsy, and specialist studies to arrive at a final diagnosis.
Description
  • A step-by-step diagnostic dialogue: The figure displays a flowchart of a simulated diagnostic process for a 29-year-old woman with a sore throat and swelling. It shows a conversation between an 'AI Diagnostic agent' and a 'Gatekeeper agent' which provides case information. The process is iterative: the AI begins with a broad question about the patient's history, receives an answer, then requests a series of increasingly specific medical tests to narrow down the possible diagnoses.
  • Iterative hypothesis testing using advanced diagnostics: The AI agent is shown refining its hypothesis based on test results. After initial tests, it considers 'Nasopharyngeal carcinoma' but rules it out. It then orders further tests to investigate 'Alveolar rhabdomyosarcoma'. This involves several advanced techniques: 'H&E' (Hematoxylin and Eosin), a standard stain for visualizing basic cell structure; 'Immunohistochemistry (IHC)', which uses antibodies to tag specific proteins in cells to identify their type (e.g., desmin, myogenin); and 'FISH' (Fluorescence in situ hybridization), a genetic test that uses fluorescent probes to detect specific chromosomal abnormalities like the 'FOXO1 rearrangement' relevant to this cancer type.
  • Final diagnosis and performance evaluation: The figure culminates in a final judgment. The AI agent diagnoses 'Embryonal rhabdomyosarcoma of the right peritonsillar region'. This is compared against the 'NEJM ground truth diagnosis' of 'Embryonal rhabdomyosarcoma of the pharynx'. A 'Judge' agent scores the AI's answer a perfect 5/5, noting that it correctly identified the disease and provided a more specific location.
Scientific Validity
  • ✅ Excellent demonstration of the core concept: The figure provides a compelling illustration of the core concept of sequential diagnosis. By showing the step-by-step process of information gathering and hypothesis refinement, it effectively demonstrates the dynamic reasoning paradigm the paper aims to model, strongly supporting the claims made in the reference text.
  • 💡 Potential for selection bias with a single 'happy path' example: The figure presents a single, successful case where the AI performs exceptionally well, even providing a more specific diagnosis than the ground truth. This represents a 'happy path' scenario and may introduce selection bias. To provide a more balanced and rigorous perspective, the authors should consider showing an example of a diagnostic failure, a less efficient pathway, or a case where the AI's reasoning is flawed. This would offer valuable insight into the model's limitations.
  • 💡 Lack of representativeness of the chosen case: The chosen example involves a rare and complex diagnosis (embryonal rhabdomyosarcoma) requiring highly specialized and expensive tests (extensive IHC panel, FISH). While this showcases the model's capability on a difficult problem, it is not representative of the vast majority of clinical encounters. The authors should clarify whether this example was chosen for its illustrative power or if it reflects typical performance across a range of case difficulties.
  • ✅ Inclusion of an objective evaluation mechanism: The inclusion of a 'Judge' component that provides a quantitative score (5/5) and qualitative feedback on the AI's final diagnosis is a methodological strength. It introduces a clear, objective measure of performance for this specific instance, grounding the example in a defined evaluation framework.
Communication
  • ✅ Clear conversational layout: The two-column layout effectively separates the 'Diagnostic agent' actions from the 'Gatekeeper agent' responses, creating a clear, conversational flow that is easy to follow. The use of distinct visual blocks for different types of interactions (e.g., 'Patient history question', 'Test request') further enhances readability.
  • 💡 Ambiguous representation of AI's internal state: The AI's internal reasoning (e.g., 'Let's gather some history.', 'Not it.') is presented as simple text, making it visually indistinguishable from its direct outputs. To improve clarity, consider using a distinct visual style, such as italics, a different color, or a thought bubble, to differentiate the AI's internal monologue from its explicit queries to the Gatekeeper.
  • 💡 Poor text legibility and density: The text within the response blocks, particularly the detailed results for immunohistochemistry and histology, is dense and uses a small font. This may impair readability, especially in print or on smaller screens. Consider summarizing the key findings (e.g., 'IHC panel positive for myogenic markers, negative for others') and providing the full list in a supplement or appendix.
  • ✅ High degree of self-containment: The figure and caption are highly effective at conveying the core message on their own. A reader can understand the concept of iterative diagnosis and see a demonstration of the AI's process without needing to refer to the main text.

Methods

Non-Text Elements

Table 1: Five-point Likert rubric used by the Judge agent.
Figure/Table Image (Page 7)
First Reference in Text
The rubric evaluates key dimensions of diagnostic quality, including the core disease entity, etiology, anatomic site, specificity, and overall completeness, with a particular emphasis on whether the candidate diagnosis would meaningfully alter clinical management. To ensure contextual understanding, the Judge had full access to each case file during adjudication. We set a cut-off of > 4 on a five-point Likert scale to count as a “correct” diagnosis, based on the clinical rationale that clinical management would remain largely unchanged above this threshold.
Description
  • A 5-point rating scale for diagnostic accuracy: This table presents a scoring system, known as a rubric, for grading the quality of a medical diagnosis. The system uses a 5-point Likert scale, which is a common rating method where 1 is the lowest score ('Completely incorrect') and 5 is the highest ('Perfect / Clinically superior').
  • Detailed definitions for score levels: Each score has a detailed 'anchor' description explaining the criteria. A score of 5 is given for a diagnosis that is identical to the reference or even more specific and correct. In contrast, a score of 1 is for a diagnosis that is completely wrong and could lead to harmful patient care.
  • Multi-dimensional evaluation criteria: The rubric specifies five key aspects of a diagnosis to be evaluated: the main disease ('core disease entity'), its cause ('etiology'), its location in the body ('anatomic site'), its level of detail ('specificity'), and its thoroughness ('completeness'). A critical distinction is made between a score of 4 ('Mostly correct'), where patient management would not change, and a score of 3 ('Partially correct'), where the error is major enough to alter the treatment plan.
  • Strict threshold for a 'correct' diagnosis: As noted in the reference text, the authors set a strict threshold for a 'correct' diagnosis, requiring a score greater than 4. This means that only diagnoses receiving a perfect score of 5 are counted as correct in their final analysis, while a 'Mostly correct' diagnosis (score 4) is considered incorrect.
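To make the stated cutoff concrete, here is a minimal scoring sketch assuming per-case Likert scores from the Judge are available as integers; the helper names are illustrative and not from the paper:

```python
from statistics import mean

LIKERT_CUTOFF = 4  # "correct" requires a score strictly greater than 4, i.e. only 5/5

def is_correct(judge_score: int) -> bool:
    """Binarize a 5-point Likert judge score using the paper's stated > 4 cutoff."""
    return judge_score > LIKERT_CUTOFF

def accuracy(judge_scores: list[int]) -> float:
    """Fraction of cases whose judged diagnosis counts as correct."""
    return mean(is_correct(s) for s in judge_scores)

# A score of 4 ("Mostly correct") does not count toward reported accuracy.
print(accuracy([5, 4, 3, 5, 1]))  # 0.4
```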
Scientific Validity
  • ✅ Strong operationalization of a key metric: The rubric is a significant methodological strength. It operationalizes the abstract concept of 'diagnostic accuracy' into a structured, quantifiable metric. By defining explicit criteria for each score, it enhances the reproducibility and objectivity of the evaluation process.
  • ✅ Focus on clinical relevance and impact: A key strength is the grounding of the rubric in clinical utility. The distinction between scores 3 and 4 hinges on whether the error would 'alter work-up or prognosis.' This ensures that the evaluation prioritizes outcomes that are meaningful to patient care, rather than just semantic correctness.
  • 💡 The threshold for a 'correct' diagnosis seems overly stringent: The authors define 'correct' as a score > 4 (i.e., only 5). However, the definition for a score of 4 ('Mostly correct...Overall management would remain largely unchanged') describes what many would consider a clinically successful and acceptable diagnosis. This stringent threshold should be more robustly justified. While it sets a high bar for the AI, it may not fully reflect clinical reality, where a diagnosis with a 'minor incompleteness' is often sufficient and effective.
  • 💡 Inherent subjectivity in qualitative descriptors: While the rubric is detailed, terms like 'minor incompleteness' vs. 'major error' retain a degree of subjectivity. The authors mitigate this by validating against human raters and reporting inter-rater reliability in the text, which is excellent practice. However, the rubric itself could be strengthened by including concrete, anonymized examples for each score level to further standardize interpretation and reduce potential rater drift.
Communication
  • ✅ Clear and logical structure: The table's three-column layout (Score, Label, Definition) is logical and highly effective. It allows a reader to quickly understand the grading scale and the criteria associated with each level, making the complex evaluation process transparent.
  • ✅ Informative and well-defined anchors: The detailed descriptions for each score point, or 'anchors', are well-written and provide concrete criteria for evaluation. For example, distinguishing between a 'major error' that would 'alter work-up' (Score 3) and a 'minor incompleteness' where 'management would remain largely unchanged' (Score 4) is a critical and well-communicated detail.
  • 💡 Integrate evaluation dimensions into the table body: The five dimensions of evaluation (core disease, etiology, etc.) are mentioned in the caption text but are not part of the table structure itself. To improve the table as a self-contained instrument, consider adding a sub-section within the 'Definition / Anchor' column that explicitly lists these dimensions as a checklist for each score, reinforcing the evaluation criteria.
  • 💡 Use visual hierarchy to highlight key phrases: The text within the 'Definition / Anchor' column is dense. To improve scannability and emphasize the key distinctions between scores, consider using bold formatting for critical phrases, such as '**management would remain largely unchanged**', '**alter work-up or prognosis**', or '**likely lead to harmful care**'. This would create a stronger visual hierarchy.
Figure 2: Multiagent orchestration in the SDBench benchmark.
Figure/Table Image (Page 6)
First Reference in Text
Figure 2: Multiagent orchestration in the SDBench benchmark.
Description
  • System architecture flowchart: This flowchart illustrates the architecture of the 'SDBench' system, a benchmark designed to test diagnostic reasoning. It shows how different automated 'agents'—specialized software programs—interact to simulate a medical diagnosis.
  • Three-agent interaction model: The process involves three main agents: a 'Diagnostic Agent' (the AI or human being tested), a 'Gatekeeper Agent', and a 'Judge Agent'. The Diagnostic Agent interacts with the Gatekeeper in a 'Sequential diagnosis cycle', asking questions and ordering tests to arrive at a 'Final Diagnosis'.
  • The Gatekeeper and synthetic data generation: The Gatekeeper acts as an information oracle, using a database of 304 real, complex medical cases from the New England Journal of Medicine (NEJM). A key feature is its ability to generate plausible 'synthetic answers' for queries about information not present in the original case file, preventing the simulation from stalling.
  • Dual evaluation of accuracy and cost: Once a diagnosis is made, two evaluations occur. The Judge Agent compares the proposed diagnosis to the 'Ground Truth Diagnosis' to determine if it is correct or incorrect. Concurrently, a 'Cost Estimator' calculates the total monetary cost of all tests ordered during the process, providing a second performance metric.
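To make the interaction pattern in the diagram concrete, here is a minimal sketch of one sequential-diagnosis cycle; the agent interfaces (`next_action`, `respond`, `grade`, `price`) and the turn limit are assumptions for illustration, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Running record of one SDBench-style case (hypothetical structure)."""
    transcript: list = field(default_factory=list)
    total_cost: float = 0.0

def run_case(diagnostic_agent, gatekeeper, judge, cost_estimator, case, max_turns=20):
    ep = Episode()
    for _ in range(max_turns):
        # The diagnostic agent picks its next action from what it has seen so far.
        action = diagnostic_agent.next_action(case.initial_vignette, ep.transcript)
        if action.kind == "diagnosis":
            score = judge.grade(action.text, case.ground_truth)  # 1-5 Likert rubric
            return score, ep.total_cost
        # Questions and test orders go to the gatekeeper, which answers from the case
        # file or synthesizes a plausible finding when the detail is not recorded.
        reply = gatekeeper.respond(case, action)
        ep.transcript.append((action, reply))
        if action.kind == "test":
            ep.total_cost += cost_estimator.price(action.text)
    return None, ep.total_cost  # no diagnosis committed within the turn budget
```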
Scientific Validity
  • ✅ Robust multi-faceted evaluation framework: The diagram clearly outlines a comprehensive and methodologically sound evaluation framework. By incorporating both diagnostic accuracy (via the Judge) and resource utilization (via the Cost Estimator), the benchmark provides a more holistic assessment of an agent's performance than accuracy-only metrics, better reflecting real-world clinical constraints.
  • ✅ Novel approach to handling missing information: The concept of a Gatekeeper agent that generates synthetic findings for out-of-scope queries is a significant methodological innovation. This approach enhances the ecological validity of the simulation by avoiding unnatural 'Not Available' responses that could inadvertently guide the diagnostic agent, thus allowing for a more realistic assessment of its reasoning pathways.
  • 💡 Oversimplification of the 'Judge Agent' output: The diagram shows the Judge Agent producing a binary 'Correct'/'Incorrect' output. This is an oversimplification of the evaluation process described in the text and Table 1, which uses a more nuanced 5-point Likert scale. The diagram should ideally reflect this nuance, perhaps with a note or reference to the rubric, to more accurately represent the methodological rigor of the scoring system.
  • 💡 Lack of implementation detail: The diagram successfully abstracts a complex system into a digestible format, which is a strength. However, it omits key details about the agents' implementation (e.g., the specific language models used for each agent, the rules governing the Gatekeeper). While a full description belongs in the text, the diagram could be slightly more informative without adding clutter, for instance, by noting the model type under the agent name (e.g., 'Gatekeeper Agent (LM-based)').
Communication
  • ✅ Excellent visual flow and organization: The diagram uses a clear, logical flow with well-defined boxes and directional arrows to illustrate the complex interaction between different system components. The central placement of the 'Sequential diagnosis cycle' effectively highlights the core iterative process.
  • ✅ Effective use of iconography: The use of simple icons (e.g., a gate for the Gatekeeper, scales for the Judge) provides intuitive visual shorthand for the function of each agent, making the diagram easier to parse at a glance.
  • 💡 Visual clutter and overlapping elements: The diagram is visually cluttered. The shaded background boxes and overlapping elements (like the 'Final Diagnosis' arrow crossing the 'Judge Agent' box) reduce clarity. Simplifying the color scheme and ensuring a clean, non-overlapping layout would improve readability.
  • 💡 Ambiguous relationship between candidate agents and the Diagnostic Agent role: The relationship between the components on the left ('MAI-Dx Orchestrator', 'Baseline models', 'Physicians') and the 'Diagnostic Agent' box is ambiguous. It is unclear whether these are alternative implementations of the Diagnostic Agent or inputs that configure it. Clarifying this relationship, for instance by labeling the arrow 'can be implemented as', would improve the diagram's precision.
Figure 3: Participating physicians and models are provided with a case abstract...
Full Caption

Figure 3: Participating physicians and models are provided with a case abstract to begin the sequential diagnosis process.

Figure/Table Image (Page 8)
First Reference in Text
As described in Section 2, each case begins with a brief clinical vignette (typically 2-3 sentences, as in Figure 3) summarizing the patient's chief complaint.
Description
  • Example of an initial case summary: The figure displays a text box titled 'Initial Information Provided', which contains a short medical case summary. This summary serves as the starting point for the diagnostic challenge presented to both human physicians and AI models in the study.
  • Key clinical details provided: The specific example describes a 52-year-old man in Argentina with fever and 'hypoxemic respiratory failure', a serious condition where breathing is inadequate to maintain sufficient oxygen in the blood. The summary also notes two key findings: 'pulmonary opacities', which are shadows on a lung X-ray or CT scan that can indicate fluid or inflammation, and a 'hematocrit' of 56.9%. Hematocrit measures the proportion of red blood cells in the blood; 56.9% is abnormally high (a condition called polycythemia), which can make the blood thicker and lead to complications.
  • Minimalist information to prompt inquiry: This vignette is intentionally brief, providing only a few critical data points. This minimalist setup is designed to mimic a real-world scenario where a doctor begins with limited information and must decide which questions to ask or tests to order next to solve the case.
Scientific Validity
  • ✅ Enhances methodological transparency and reproducibility: Providing a concrete example of the initial prompt is a major strength. It makes the experimental methodology transparent and allows other researchers to understand the exact nature of the task given to the participants, which is crucial for reproducibility.
  • ✅ High ecological validity: The vignette format strongly supports the study's goal of simulating realistic clinical reasoning. Clinicians frequently start with a concise, information-poor summary from a referral note or initial triage, and this setup effectively establishes that starting condition for the experiment.
  • 💡 Unclear representativeness of the example case: The chosen example is clinically complex, involving multiple abnormal findings and a specific geographic location (Argentina) that might suggest certain endemic diseases. It is unclear if this level of complexity is representative of the entire 304-case dataset. The authors should clarify whether this is an example of a typical, easy, or difficult case to help readers gauge the overall difficulty of the benchmark.
Communication
  • ✅ Simplicity and clarity: The figure's design is minimalist and highly effective. By presenting only a simple text box, it focuses the reader's attention entirely on the content of the initial information, which is the sole purpose of the element. There is no distracting clutter.
  • ✅ Excellent support for the text: The element perfectly illustrates the concept of a 'brief clinical vignette' described in the text. It provides a concrete, easy-to-understand example that reinforces the methodological description, making the experimental setup very clear.
  • 💡 Minor improvement for self-containment: The caption is clear, but the figure itself could be more self-contained. Adding a subtitle within the figure, such as 'Example of an Initial Case Vignette,' would immediately clarify its function as a representative sample rather than a generic instruction.
Figure 4: Prompt used for baseline performance estimation.
Figure/Table Image (Page 9)
First Reference in Text
The baseline prompt (Figure 4) instructed models to use simple XML tags for requesting tests (<test>) and asking questions (<question>), with a final <diagnosis> tag for submitting their answer.
Description
  • Instructions for an AI model: This figure displays the exact text of a 'prompt'—a set of instructions given to an AI language model—used to establish a baseline level of performance. The prompt instructs the AI to act as a 'diagnostic assistant'.
  • Structured command format: The prompt specifies a structured communication format using 'XML tags', which are simple commands enclosed in angle brackets. The AI must use `<test>` tags to order medical tests and `<question>` tags to ask for patient information. This ensures the AI's output is standardized and can be automatically processed by the system.
  • A key operational constraint: A key rule is established: the AI cannot mix test requests and questions in the same turn. It must choose to do one or the other. This forces a choice between different information-gathering strategies at each step of the diagnostic process.
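As a concrete illustration of how output in this format could be consumed, a small parsing sketch follows; the regular-expression approach and the mixing check are assumptions, since the paper does not describe its parser:

```python
import re

TAGS = ("question", "test", "diagnosis")

def parse_turn(model_output: str) -> dict[str, list[str]]:
    """Extract the contents of <question>, <test>, and <diagnosis> tags from one turn."""
    actions = {tag: re.findall(rf"<{tag}>(.*?)</{tag}>", model_output, re.DOTALL)
               for tag in TAGS}
    # Enforce the stated constraint: a single turn may not mix questions and tests.
    if actions["question"] and actions["test"]:
        raise ValueError("A turn may contain questions or tests, not both.")
    return actions

print(parse_turn("<test>CBC</test><test>Chest X-ray</test>"))
# {'question': [], 'test': ['CBC', 'Chest X-ray'], 'diagnosis': []}
```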
Scientific Validity
  • ✅ Excellent methodological transparency: Presenting the exact prompt is a critical component of methodological transparency and reproducibility. It allows other researchers to replicate the baseline condition precisely and verify the results, which is a significant strength.
  • ✅ Establishes a fair and simple baseline: The prompt is appropriately minimal, avoiding complex instructions or strategic hints. This establishes a true 'out-of-the-box' baseline, ensuring that the performance measured is genuinely that of the model itself, not the quality of the prompt engineering. This makes the comparison with the more advanced MAI-DxO system fair and meaningful.
  • 💡 Subjective instruction introduces uncontrolled variance: The instruction 'Make sure to ask for enough questions and tests to reach a diagnosis' is ambiguous and subjective. It does not define what 'enough' means, leaving a critical aspect of the agent's behavior (its stopping criterion) uncontrolled. This could introduce significant variability in performance that is unrelated to the model's core reasoning ability.
  • 💡 An experimental constraint that may reduce ecological validity: The constraint that disallows mixing `<test>` and `<question>` tags in the same turn is an important experimental design choice that simplifies the interaction logic. However, it may not perfectly reflect clinical practice. The authors should briefly acknowledge this simplification and its potential impact on the realism of the diagnostic simulation.
Communication
  • ✅ Appropriate visual presentation: The figure uses a monospaced font inside a clearly demarcated box, which effectively mimics a code block or system instruction. This visual choice clearly communicates the nature of the element as a direct input to a computer model.
  • ✅ Clear examples enhance understanding: The inclusion of concrete examples for each command (e.g., `<test>CBC</test>`, `<question>...</question>`) is highly effective. It removes any ambiguity about the required output format, making the instructions easy to understand for both the reader and the language model.
  • ✅ High degree of self-containment: The figure and its caption are perfectly aligned and self-contained. A reader can immediately grasp that this is the exact text used to instruct the baseline models, which is the figure's entire purpose.
Figure 5: Overview of the MAI-Dx Orchestrator
Figure/Table Image (Page 10)
First Reference in Text
As shown in Figure 5, a single language model role-plays five distinct medical personas, each contributing specialized expertise to the diagnostic process.
Description
  • A model-agnostic system architecture: This figure is a flowchart illustrating the architecture of the 'MAI-Dx Orchestrator', an advanced AI system for medical diagnosis. The system is designed to be 'model-agnostic', meaning it can be powered by various underlying AI models like GPT, Claude, or Gemini, as shown in the 'Model selected' box.
  • A 'Virtual Doctor Panel' with five AI personas: The core of the system is a 'Virtual Doctor Panel'. As stated in the text, a single AI model takes on five different roles or 'personas' to analyze a case. These include 'Dr Hypothesis' (to generate potential diagnoses), 'Dr Challenger' (to question assumptions and prevent errors), and 'Dr Stewardship' (to consider the cost of tests). This simulates a multi-disciplinary team meeting to leverage different expert perspectives.
  • An iterative decision-making process with internal checks: After an internal 'Chain of Debate', the panel chooses one of three actions: ask a question, request a test, or provide a diagnosis. Before the action is sent to the external 'SDBench Framework' (the testing environment), it goes through an internal review loop involving 'Cost analysis' and 'Diagnosis confirmation', with a final 'Decision to proceed'. This represents a built-in check for cost-effectiveness and confidence.
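A minimal sketch of the single-model, multi-persona pattern described above, assuming a generic `llm_call(prompt) -> str` function; the persona instructions and the synthesis step are illustrative paraphrases, not the authors' prompts:

```python
# Only three of the five personas are named in this summary; the full panel has five roles.
PERSONAS = {
    "Dr Hypothesis":  "Maintain a ranked differential diagnosis for the case.",
    "Dr Challenger":  "Argue against the leading hypothesis and flag cognitive biases.",
    "Dr Stewardship": "Weigh the cost and necessity of every proposed test.",
}

def panel_turn(llm_call, case_state: str) -> str:
    """One 'chain of debate' round: the same model answers once per persona,
    then a synthesis call commits to a single action for the SDBench framework."""
    opinions = []
    for name, role in PERSONAS.items():
        prompt = f"You are {name}. {role}\n\nCase so far:\n{case_state}\n\nYour input:"
        opinions.append(f"{name}: {llm_call(prompt)}")
    synthesis = (
        "You chair a virtual doctor panel. Based on the opinions below, output exactly one "
        "action as <question>...</question>, <test>...</test>, or <diagnosis>...</diagnosis>.\n\n"
        + "\n".join(opinions)
    )
    return llm_call(synthesis)
```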
Scientific Validity
  • ✅ Novel architecture designed to mitigate cognitive bias: The multi-persona 'Virtual Doctor Panel' is a novel and methodologically strong concept. By explicitly programming different cognitive roles (e.g., skepticism, cost-consciousness, hypothesis generation), the architecture is designed to mitigate common cognitive biases in diagnosis, such as premature closure or anchoring bias. This represents a more sophisticated approach than a single, monolithic AI agent.
  • ✅ Integration of cost-effectiveness into the core design: The explicit inclusion of a 'Cost analysis' step and a 'Dr Stewardship' agent is a significant strength. It grounds the system in real-world clinical constraints, where cost is a critical factor alongside accuracy. This makes the system's performance evaluation more holistic and clinically relevant.
  • 💡 Undefined mechanism for the 'Chain of Debate': The diagram shows a 'Chain of Debate' but does not specify how the five personas interact or reach a consensus. Is it a structured vote? A weighted average? A generative debate? This lack of detail about the core deliberation mechanism is a key omission that makes the system's internal reasoning process opaque. The authors should clarify this protocol.
  • 💡 Potential lack of cognitive diversity with single-model role-playing: The reference text states that a single language model role-plays all five personas. This raises a critical question about the true independence of these roles. It is plausible that a single model might struggle to generate genuinely adversarial or diverse viewpoints, potentially leading to a form of AI 'groupthink'. The authors should discuss this potential limitation and ideally provide evidence that the personas exhibit distinct and independent reasoning patterns.
Communication
  • ✅ Clear modular structure: The diagram effectively uses a modular, flowchart-style layout with nested boxes to represent the system's architecture. The separation of the 'Virtual Doctor Panel', the decision-making loop, and the external 'SDBench Framework' is visually intuitive and easy to follow.
  • ✅ Effective use of iconography: The use of simple, recognizable icons for each of the five agent personas (e.g., a brain for 'Dr Hypothesis', a warning sign for 'Dr Challenger') is an excellent design choice. It makes the roles of the different agents immediately understandable and the diagram more engaging.
  • 💡 Ambiguous information flow and visual clutter: The diagram is visually busy. The flow of information could be clearer; for example, the 'Chain of Debate' is represented by a vague arrow, and the feedback loop from 'SDBench Framework' points to the top of the main box rather than a specific input stage. To improve this, ensure all arrows have clear start and end points and simplify the background shading to reduce clutter.
  • 💡 Poor color contrast and palette: The color palette is muted and lacks contrast, which may pose accessibility challenges and makes it difficult to distinguish between different functional blocks. Using a more vibrant and distinct color scheme for the key components (e.g., the panel, the decision loop, the external framework) would improve readability and visual appeal.
Figure 6: Interface developed for physicians to attempt cases from SDBench.
Figure/Table Image (Page 12)
First Reference in Text
Thus, human physicians participated in SDBench the same way as an AI diagnostic agent (Figure 6).
Description
  • A web-based interface for human participants: This figure displays a screenshot of the web-based user interface (UI) that human physicians used to solve diagnostic cases. The layout is divided into three main sections: a static case summary, a dynamic history of interactions, and a set of input fields for the user.
  • Example case information: At the top, a 'Case Summary' box provides the initial information for a specific case: a 62-year-old man with abdominal pain after eating ('postprandial'), weight loss, and a history of 'cirrhosis' (severe liver scarring) and 'portal vein thrombosis' (a blood clot in a major vein connected to the liver).
  • Interaction history and results: The middle section shows a log of the physician's interactions. The example shows the physician asked a question about the abdominal pain and ordered a 'Complete blood count' (a standard blood test). The system's response is shown for each, with the blood test results presented numerically alongside their normal reference ranges for easy comparison.
  • Structured input fields for user actions: At the bottom, there are three separate text boxes allowing the physician to take their next action: 'Ask Question about Patient History', 'Order Diagnostic Tests', or 'Submit Final Diagnosis'. Each has its own dedicated button, structuring the user's input into one of three distinct action types.
Scientific Validity
  • ✅ Ensures a standardized experimental condition for humans: The use of a single, standardized interface for all human participants is a major methodological strength. It ensures that differences in performance are attributable to the physicians' diagnostic abilities rather than variations in how they received information, thus enhancing the internal validity of the human benchmark data.
  • ✅ Creates a fair comparison by mirroring AI agent constraints: The UI's structure, with separate input fields for questions, tests, and diagnoses, directly mirrors the structured output format required of the AI agents (e.g., `<question>`, `<test>`). This parallel structure is crucial for ensuring a fair and direct comparison between human and AI performance, as both operate under analogous constraints.
  • 💡 Simplified interface reduces ecological validity: The UI presents a simplified, text-based version of a clinical encounter. Real-world electronic health records (EHRs) are far more complex, with structured data, imaging viewers, and graphical trend displays. While this simplification is necessary for a controlled experiment, it represents a limitation in ecological validity. Performance in this clean environment may not fully predict performance in the cluttered, information-dense context of a real EHR.
  • 💡 Relies on participant adherence to external resource restrictions: The text states that physicians were instructed not to use external resources like search engines. The interface itself does not appear to have any technical controls to prevent this (e.g., by locking the user's screen). The study therefore relies on the participants' adherence to instructions, which introduces a potential source of unmonitored variability in the human performance data.
Communication
  • ✅ Clean and logical user interface design: The interface has a clean, uncluttered, and logical layout. The top-to-bottom flow, from the static 'Case Summary' to the dynamic interaction history and finally the user input fields, is highly intuitive and easy for a user to follow.
  • ✅ Effective presentation of clinical data: Presenting lab results, like the complete blood count, with the patient's value directly next to the normal 'reference range' is a best practice in medical data visualization. It allows for immediate and effortless interpretation of whether a result is normal or abnormal.
  • 💡 Lack of history management for complex cases: The interface presents the interaction history as a single, continuous scroll. For complex cases involving many questions and tests, this could become cumbersome to navigate. To improve usability, consider adding features to manage the information, such as collapsible sections for each turn or a persistent summary panel.
  • 💡 Generic placeholder text in input fields: The input fields for 'Ask Question', 'Order Diagnostic Tests', and 'Submit Final Diagnosis' are visually distinct, which is good. However, the placeholder text (e.g., 'What is your age?') is generic. Providing more specific examples or tooltips could better guide users on the expected format and specificity of their inputs.

Results

Non-Text Elements

Figure 7: Pareto-frontier showing diagnostic accuracy versus average cumulative...
Full Caption

Figure 7: Pareto-frontier showing diagnostic accuracy versus average cumulative monetary cost for each agent.

Figure/Table Image (Page 13)
First Reference in Text
We present the performance of all diagnostic agents on SDBench in Figure 7.
Description
  • A cost vs. accuracy performance plot: This figure is a scatter plot that compares the performance of various diagnostic agents. The vertical axis measures diagnostic accuracy as a percentage (from 0% to 100%), while the horizontal axis measures the average monetary cost in US dollars (from $0 to $8000) to reach a diagnosis.
  • Two Pareto frontiers comparing systems: The plot features two 'Pareto frontiers', which are lines that connect the best-performing options. The blue line represents the 'MAI-DxO' system, showing that it can achieve very high accuracy (up to 85.5% with the 'ensemble' method for about $7,184) and can also achieve good accuracy (79.9%) at a much lower cost ($2,396). This frontier is consistently better—higher accuracy for a given cost—than the gray frontier, which represents the baseline 'off-the-shelf' AI models.
  • Performance of AI models and human physicians: The performance of individual AI models is shown as colored circles. For example, the 'o3' model achieves the highest accuracy among baseline models (78.6%) but at the highest cost ($7,850). In contrast, human physicians, represented by red crosses, show much lower accuracy, clustering around an average of 20% accuracy at an average cost of about $2,963.
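For readers less familiar with the construction, the frontier in such a plot can be computed with a standard dominance filter over per-agent (cost, accuracy) points; the sketch below uses the illustrative numbers quoted above and is not code from the paper:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep only (cost, accuracy) points not dominated by a cheaper, more accurate point."""
    frontier, best_accuracy = [], float("-inf")
    for cost, acc in sorted(points):      # ascending cost
        if acc > best_accuracy:           # strictly better than everything cheaper
            frontier.append((cost, acc))
            best_accuracy = acc
    return frontier

# (cost in USD, accuracy) for a few agents mentioned in the figure description.
agents = [(7850, 0.786), (2396, 0.799), (7184, 0.855), (2963, 0.20)]
print(pareto_frontier(agents))  # [(2396, 0.799), (7184, 0.855)]
```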
Scientific Validity
  • ✅ Holistic and clinically relevant evaluation metric: Plotting accuracy against cost and visualizing the Pareto frontier is a methodologically sophisticated and highly appropriate way to evaluate diagnostic systems. It moves beyond a single accuracy metric to provide a more holistic view of performance that incorporates resource utilization, which is critical for real-world clinical applications.
  • ✅ Strong visual support for the paper's central claims: The figure provides very strong evidence for the paper's main conclusion. The clear dominance of the MAI-DxO frontier over all other agents, including strong baseline models and human physicians, visually substantiates the claim that the orchestrated system advances both diagnostic precision and cost-effectiveness.
  • 💡 Absence of uncertainty measures: The data points represent average performance across many cases, but the plot lacks any indication of uncertainty or variability (e.g., error bars, confidence intervals). For example, the cost for a given model could have a very wide distribution. Adding error bars for both accuracy and cost would provide a more complete and scientifically rigorous picture of the performance data.
  • 💡 Inconsistent datasets for comparison: The text indicates that AI models were evaluated on all 304 cases, while physicians were evaluated on a 56-case test set. This is a potential confounding variable, as the performance metrics are not derived from the exact same data population. While this is clarified elsewhere, a note in the figure caption is essential to prevent misinterpretation of this direct visual comparison.
Communication
  • ✅ Excellent choice of visualization: The use of a Pareto frontier is an excellent visualization choice. It immediately and intuitively communicates the core message of the paper: that the MAI-DxO system (blue line) represents a superior trade-off between cost and accuracy compared to baseline models (gray line) and physicians.
  • ✅ Clear and effective legend: The legend is well-organized, using color to group models by their provider and distinct shapes for physicians. This allows for quick identification of the different agent types being compared.
  • 💡 Data point clutter and label overlap: The plot is cluttered with many unlabeled and overlapping data points, particularly in the center. This makes it difficult to identify the performance of specific models and compare them. To improve this, consider labeling only the most important points on the plot (e.g., the points on the frontiers and the physician average) and providing a comprehensive table with all data points in a supplement.
  • 💡 Ineffective representation of physician data: The individual physician data points (red crosses) add to the clutter without providing significant insight beyond showing variability. A more effective approach would be to represent the physician cohort as a single point for the average, with error bars or a shaded ellipse to represent the distribution of performance on both axes. This would simplify the visual and make the comparison to the AI agents clearer.
Figure 8: Accuracy improvements delivered by MAI-DxO (no budget constraints)...
Full Caption

Figure 8: Accuracy improvements delivered by MAI-DxO (no budget constraints) across different large language models.

Figure/Table Image (Page 15)
First Reference in Text
Figure 8 demonstrates that MAI-DxO consistently improves diagnostic accuracy across all sufficiently capable foundation models, with particularly pronounced gains for weaker baselines, suggesting the framework helps weaker models overcome their limitations through structured reasoning.
Description
  • A before-and-after comparison of AI model accuracy: This bar chart compares the diagnostic accuracy of several different large language models (LLMs) before and after being integrated into the MAI-DxO orchestration system. The horizontal axis lists the different LLMs, such as 'Claude-4-opus' and 'GPT-4.1', while the vertical axis shows diagnostic accuracy as a percentage.
  • Visual representation of accuracy boost: For each model, a solid colored bar shows its 'Baseline' accuracy when used with a simple prompt. A lighter-colored bar on top shows the additional accuracy gained when using the 'MAI-DxO boost'. The total height of the bar represents the final accuracy with the MAI-DxO system.
  • Variable performance gains across models: The chart demonstrates that the MAI-DxO system improves the accuracy of every model tested. The magnitude of the improvement varies; for example, the 'o3' model improves from 78.6% to 81.9% (a 3.3 percentage point gain), while the 'Deepseek-R1' model shows a much larger improvement, from 47.4% to 65.5% (an 18.1 percentage point gain).
  • Indication of statistical significance: Asterisks above the bars (e.g., '***') indicate that the observed accuracy improvements are statistically significant, meaning they are unlikely to be due to random chance.
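The excerpt does not state which test produced the significance markers; one generic way to assess a paired, per-case accuracy gain is a case-level bootstrap, sketched below as an assumed method rather than the authors':

```python
import random

def bootstrap_gain_ci(baseline: list[int], boosted: list[int],
                      n_resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap confidence interval for the accuracy gain on paired per-case results.

    `baseline` and `boosted` hold 0/1 correctness for the same cases, in the same order.
    """
    assert len(baseline) == len(boosted)
    rng, n, gains = random.Random(seed), len(baseline), []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample cases with replacement
        gains.append(sum(boosted[i] - baseline[i] for i in idx) / n)
    gains.sort()
    return gains[int(0.025 * n_resamples)], gains[int(0.975 * n_resamples)]
```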
Scientific Validity
  • ✅ Demonstrates model-agnostic performance gains: The inclusion of a wide range of models from multiple providers (OpenAI, Google, Anthropic, etc.) is a major strength. It demonstrates that the MAI-DxO framework is 'model-agnostic' and its benefits are not tied to a specific proprietary model, which strongly supports the paper's claims of generalizability.
  • ✅ Supports the hypothesis about scaffolding weaker models: The chart provides strong visual evidence for the claim made in the reference text that gains are more pronounced for 'weaker baselines'. For instance, models with lower baseline accuracies like Deepseek-R1 and Grok-3 show the largest relative improvements, supporting the hypothesis that the framework provides a 'scaffolding' for less capable models.
  • 💡 Absence of error bars or uncertainty measures: The bars represent mean accuracies, but the chart lacks error bars or any other measure of uncertainty. Without them, it is difficult to assess the variability of the performance or the true significance of the differences between models. For example, the small 3.3 point gain for 'o3' might be within the noise of the measurement. Including confidence intervals or standard error bars is essential for a more rigorous presentation.
  • 💡 Omission of the cost dimension: This figure isolates accuracy but omits the corresponding cost, which is the other critical variable in the paper's central thesis. The caption notes this is for the 'no budget constraints' version, but a reader could misinterpret these accuracy gains as being 'free'. The authors should explicitly remind the reader that these gains come at a cost, which is analyzed elsewhere (e.g., in Figure 7), to avoid presenting an incomplete picture.
Communication
  • ✅ Excellent visualization of performance improvement: The visualization choice is highly effective. By showing the baseline performance as a solid bar and the improved performance as a lighter extension on top, the chart intuitively and immediately communicates the 'boost' or 'lift' provided by the MAI-DxO system for each model. This is more impactful than a standard grouped bar chart.
  • ✅ Clear data labels: The chart is well-labeled, with the exact numerical accuracy values placed on top of the bars. This allows for precise data extraction without requiring the reader to estimate values from the y-axis, which is a best practice.
  • 💡 Ambiguous legend design: The legend is slightly confusing. It attempts to combine the company color code with the baseline/boost distinction in a single entry (e.g., 'Anthropic' for the solid bar, 'MAI-DxO boost' for the light bar). A clearer approach would be to have two separate legends: one mapping colors to companies, and another explaining that the solid bar represents the baseline and the full bar represents the MAI-DxO result.
  • ✅ Good use of color to group agents: The color coding by model provider (Anthropic, Google, OpenAI, etc.) is a useful feature that allows for at-a-glance comparisons between different model families.
Figure 9: Pareto frontier curves of MAI-DxO and baseline prompting across...
Full Caption

Figure 9: Pareto frontier curves of MAI-DxO and baseline prompting across validation and held-out test data.

Figure/Table Image (Page 16)
First Reference in Text
In Figure 9, we report stratified Pareto frontier curves of model performance across the validation (248 cases) and test (56 cases) sets.
Description
  • A comparison of cost-accuracy trade-offs: This figure is a line graph that shows four 'Pareto frontiers'. A Pareto frontier is a curve representing the optimal trade-off between two competing goals—in this case, getting the highest diagnostic accuracy (vertical axis) for the lowest possible cost (horizontal axis).
  • Comparing two systems on two datasets: The graph compares two AI systems: the advanced 'MAI-DxO' system (blue lines) and the standard 'Baseline' system (gray lines). Each system's performance is shown on two different datasets: a 'validation set' (solid lines), which is data the system was developed with, and a 'test set' (dashed lines), which is completely new data used to check if the performance is real and not just memorization.
  • Demonstration of robust performance: The key finding is that the performance curves for the validation set (solid lines) and the test set (dashed lines) are very close to each other for both the MAI-DxO and Baseline systems. This indicates that both systems 'generalize' well, meaning their performance is robust and not simply due to overfitting to the development data.
  • Consistent superiority of the MAI-DxO system: The chart also reaffirms that the MAI-DxO system (blue lines) is consistently superior to the Baseline system (gray lines), as the blue curves are always higher (better accuracy) for any given cost. This superiority holds true on both the validation and the unseen test data.
Scientific Validity
  • ✅ Strong evidence of generalizability: This figure provides a rigorous test of the system's generalizability. By explicitly separating performance on a validation set (248 cases) from a held-out test set (56 cases), the authors are following best practices in machine learning evaluation to guard against overfitting. The close alignment of the solid and dashed curves provides strong evidence that the reported performance gains are robust.
  • ✅ Appropriate choice of visualization for multi-objective evaluation: The use of Pareto frontiers is a highly appropriate and sophisticated method for this analysis. It correctly frames the problem as a multi-objective optimization (maximizing accuracy while minimizing cost) and provides a much richer picture of performance than single-point estimates would.
  • ✅ Provides strong support for the paper's conclusions: The figure strongly supports the paper's conclusion that the MAI-DxO system's advantages are robust and not driven by memorization effects. The consistent gap between the blue and gray lines across both datasets is compelling evidence for the superiority of the orchestrated approach.
  • 💡 Discrepancy in dataset sizes and statistical power: The reference text notes that the test set (56 cases) is substantially smaller than the validation set (248 cases). This means the performance estimates for the test set are subject to greater statistical uncertainty. While the curves appear similar, the authors should ideally supplement this visual with a statistical test to confirm that the frontiers are not significantly different, or at least acknowledge the difference in statistical power between the two estimates in the figure caption.
Communication
  • ✅ Effective use of line styles: The use of solid lines for the validation set and dashed lines for the test set is a standard and highly effective visual convention. It allows for an immediate and intuitive comparison of performance on seen versus unseen data.
  • ✅ Clear and uncluttered design: The plot is clean, well-labeled, and uses a simple color scheme (blue for the enhanced system, gray for the baseline) that effectively communicates the main comparison. The information is presented without unnecessary clutter.
  • 💡 Potentially redundant legend: The legend is clear but could be more efficient. The labels 'Baseline Frontier (Validation)' and 'Baseline Frontier (Test)' are repetitive. A more streamlined legend could group by system ('Baseline', 'MAI-DxO') and line style ('Validation', 'Test') to reduce redundancy.
  • 💡 Omission of underlying data points: The graph shows only the smoothed frontier curves and omits the individual data points (i.e., the specific model configurations) that define them. While this enhances clarity, showing the underlying points as faint markers would provide a better sense of how many configurations were tested and how densely the frontier is populated, without significantly cluttering the view.
Figure 10: Case level scores of MAI-DxO variants and clinicians across the 56...
Full Caption

Figure 10: Case level scores of MAI-DxO variants and clinicians across the 56 case test set.

Figure/Table Image (Page 23)
First Reference in Text
Figure 10: Case level scores of MAI-DxO variants and clinicians across the 56 case test set.
Description
  • A heatmap of agent-by-case performance: This figure is a heatmap, a grid where the color of each cell represents a value. It visualizes the performance of different diagnostic agents (rows) on each of the 56 individual medical cases in the test set (columns).
  • Ordered comparison of AI variants and clinicians: The agents on the y-axis include seven different versions of the MAI-DxO system and 18 individual human clinicians. The agents are ordered from best-performing at the top (MAI-Dx Ensemble) to worst-performing at the bottom. The cases on the x-axis are ordered from easiest (on the left) to hardest (on the right), as determined by the performance of the best AI model.
  • Color-coded performance scores: The color of each cell indicates the score achieved by that agent on that case, based on the 5-point rubric from Table 1. Dark green represents a perfect score, red represents a completely incorrect diagnosis, and gray indicates that the clinician did not attempt that case.
  • Visual pattern of AI superiority: The visual pattern clearly shows that the top rows, corresponding to the MAI-DxO models, are predominantly green, indicating consistently high performance. In contrast, the bottom rows, corresponding to the clinicians, show a mix of all colors, including a large amount of red (incorrect) and gray (not attempted), even on the 'easier' cases on the left side of the chart.
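A brief sketch of how such a heatmap matrix can be assembled and ordered (rows by agent performance, columns by the best agent's per-case scores); the NaN convention for unattempted cases and the function name are assumptions for illustration:

```python
import numpy as np

def order_score_matrix(scores: np.ndarray, agents: list[str]):
    """Sort an agents-by-cases matrix of 1-5 judge scores for heatmap display.

    NaN marks cases a clinician did not attempt. Rows are ordered best agent first;
    columns are ordered from easiest to hardest according to the top row.
    """
    row_order = np.argsort(-np.nanmean(scores, axis=1))
    col_order = np.argsort(-np.nan_to_num(scores[row_order[0]], nan=0.0))
    ordered = scores[np.ix_(row_order, col_order)]
    return ordered, [agents[i] for i in row_order]
```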
Scientific Validity
  • ✅ Provides high-granularity, transparent data: This figure provides a highly granular and transparent view of the raw performance data, moving beyond simple averages. Showing the score for every agent on every case is a methodological strength that allows for a deeper and more nuanced analysis of performance patterns.
  • ✅ Strong visual support for the paper's conclusions: The visualization strongly supports the paper's claims. The stark visual difference between the performance of the MAI-DxO models (top rows) and the clinicians (bottom rows) provides compelling evidence for the AI system's superior diagnostic accuracy on this challenging test set.
  • 💡 The definition of 'case difficulty' is agent-dependent: The caption states that cases are ordered by difficulty 'according to MAI-DxO (ensemble)'. This is a reasonable proxy for difficulty, but it is not an objective measure. This ordering might mask cases that are disproportionately difficult for the AI but easier for humans. Acknowledging that 'difficulty' is defined from the AI's perspective is important for proper interpretation.
  • ✅ Transparently shows incomplete data from clinicians: The inclusion of the 'Not Attempted' category for clinicians is important for transparency. It reveals that the average performance scores for clinicians are based on a subset of cases, which is a critical piece of context for comparing them to the AI models that attempted all cases.
Communication
  • ✅ Excellent choice of visualization: A heatmap is an excellent visualization choice for this type of matrix data. It allows for the immediate visual identification of performance patterns across both agents and individual cases.
  • ✅ Intuitive and clear color scale: The color scale is intuitive, using a 'traffic light' system (green for correct, red for incorrect) that is easy to interpret. The use of a distinct neutral color (gray) for 'Not Attempted' is also very effective.
  • ✅ Thoughtful ordering of rows and columns: The sorting of rows (agents) by overall performance and columns (cases) by difficulty is a powerful analytical choice. It creates a clear visual gradient from the top-left (easy cases, best agents) to the bottom-right (hard cases, worst agents), which powerfully communicates the main findings.
  • 💡 Illegible x-axis labels: The labels on the x-axis (the specific case numbers) are vertically oriented, extremely small, and practically illegible. This prevents any meaningful analysis of individual cases directly from the figure. To fix this, the cases could be numbered 1-56 on the axis, with a corresponding lookup table provided in a supplement.

Discussion
