Variation in Large Language Model Recommendations in Challenging Inpatient Management Scenarios

Susan Landon, Thomas Savage, S. Ryan Greysen, Eric Bressman
Journal of General Internal Medicine
Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

Overall Summary

Study Background and Main Findings

This study investigates the reliability of large language models (LLMs) in common, judgment-dependent clinical scenarios where no single "correct" answer exists. The primary objective was to describe and quantify the variation in recommendations both across different commercially available LLMs (inter-model agreement) and within a single LLM when queried multiple times (intra-model consistency). The authors aimed to understand how these AI tools handle the ambiguity inherent in the "art" of medicine, as opposed to their well-documented performance on standardized tests with clear answers.

The research employed a cross-sectional simulation design. Six prominent LLMs were tested, comprising five general-purpose models and one domain-specific model trained on biomedical literature. Each model was presented with four brief clinical vignettes representing common inpatient dilemmas, such as decisions on blood transfusions or anticoagulation. To measure consistency, each vignette was posed to each model in five independent sessions, resulting in a total of 120 model-vignette interactions (6 models × 4 vignettes × 5 sessions). The primary outcomes were the model's majority recommendation and its internal consistency score, a metric of stability ranging from 0 (no consistency) to 1 (perfect consistency).
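The consistency metric can be sketched concretely. A minimal implementation, assuming the score is simply the modal answer's share of the repeated runs (a definition consistent with the reported values: 0.6 for a 3-of-5 split, 0.8 for 4 of 5, 1.0 for unanimity):

```python
from collections import Counter

def majority_and_consistency(responses):
    """Return the modal recommendation across repeated runs and an
    internal consistency score: the modal answer's share of all runs.
    With five runs and a binary recommendation, the possible scores
    are 0.6, 0.8, and 1.0."""
    counts = Counter(responses)
    recommendation, n = counts.most_common(1)[0]
    return recommendation, n / len(responses)

# Example: a model that flips twice across five identical prompts
rec, score = majority_and_consistency(
    ["transfuse", "observe", "transfuse", "transfuse", "observe"]
)
print(rec, score)  # transfuse 0.6
```

Under this reading, a score of 0.6 is the weakest possible result for a binary recommendation queried five times: the model disagreed with itself on two of the five runs.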

The results revealed substantial variability in LLM recommendations. Inter-model agreement was low, with recommendations diverging in every scenario; for two key decisions, the models were split exactly 50/50. Furthermore, intra-model consistency was often poor, with some models changing their recommendation on the same case in up to two of five repeated queries, yielding consistency scores as low as 0.60. Only in the scenario with the clearest guideline-based answer (bridging anticoagulation) did a strong consensus (83%) emerge, though even then, some models remained internally inconsistent.

The study concludes that for nuanced clinical questions, LLMs produce highly variable and sometimes unstable recommendations, mirroring the ambiguity of the tasks themselves. The authors provide direct guidance for clinicians, advising them to treat LLM outputs as a single perspective rather than a definitive answer, consider querying multiple models, and always retain final responsibility for patient care. This work highlights the risks of relying on single LLM outputs for complex decisions and underscores the need for methods that can surface model uncertainty to ensure the safe integration of this technology into medicine.

Research Impact and Future Directions

Overall, the evidence strongly supports the conclusion that current large language models are unreliable for nuanced, judgment-dependent clinical decision-making. The strongest corroborating findings are the substantial inter-model disagreements, which reached a 50/50 split in two of four scenarios, and the poor intra-model consistency, with some models changing their advice up to 40% of the time on identical prompts. This core claim is tempered by the finding that models reached a strong consensus (83% agreement) in the one scenario governed by clear clinical guidelines, suggesting their reliability may be context-dependent and higher for questions with a more established evidence base.

Major Limitations and Risks: The study's conclusions are constrained by several methodological factors. First, the simulation-based design using brief, synthetic vignettes may not accurately reflect complex, real-world clinical practice where clinicians can engage in iterative dialogue with an LLM and have access to richer patient data. This limits the direct applicability of the findings to actual clinical workflows. Second, as noted in the Methods analysis, the selection of only four vignettes without a clear justification raises questions about generalizability; the observed variability might be specific to these scenarios. Finally, the qualitative summaries of model reasoning presented in the results tables lack a described methodology for their extraction, introducing a risk of researcher bias in the interpretation of why models differed.

Based on this simulation, the recommendation is that clinicians should not use LLMs as primary or sole decision-makers for ambiguous clinical cases. Confidence in this recommendation is High for the specific models and conditions tested. However, confidence is Medium when generalizing this behavior to all 'gray-zone' clinical scenarios or future, more advanced models, primarily due to the simulation design's limitations. The most critical unanswered question is how this observed variability translates to real-world clinical practice. A prospective study evaluating the impact of LLM use on actual physician decisions and patient outcomes is the essential next step to raise confidence and develop evidence-based guidelines for safe implementation.

Critical Analysis and Recommendations

Actionable Guidance for Clinicians (written-content)
The abstract concludes with direct, practical recommendations for clinicians, advising them to view LLM output critically, sample multiple models, and retain final responsibility. This translates the research findings into immediately useful guidance, enhancing the paper's clinical relevance and impact by framing the technology as a tool to be managed rather than an oracle to be followed.
Section: Abstract
Effective Framing of Clinical Ambiguity (written-content)
The introduction effectively frames the research problem by distinguishing between the 'science' of medicine (tasks with correct answers) and the 'art' of medicine (judgment-based decisions). This framing clearly articulates the gap between most LLM evaluations, which focus on exam-style questions, and the reality of clinical practice, establishing the study's novelty and high relevance.
Section: Introduction
Dual-Axis Measurement of LLM Variability (written-content)
The study's methodology robustly assesses two critical forms of unreliability: variation between different models (inter-model) and instability within a single model over repeated queries (intra-model). This dual focus provides a comprehensive and nuanced view of LLM performance that is central to the paper's contribution, moving beyond a simple comparison of final recommendations.
Section: Methods
Lack of Justification for Vignette Selection (written-content)
Limitation Type: Small Sample Size Limits Generalizability. The methods do not provide a specific rationale for selecting these four particular clinical problems over other possibilities. Adding a justification for their selection—for example, that they represent a range of common yet challenging inpatient decisions—would strengthen confidence in the vignettes' representativeness and the generalizability of the findings to other ambiguous clinical scenarios.
Section: Methods
Granular, Multi-Dimensional Data Presentation (graphical-figure)
The results tables (Tables 2-5) effectively summarize each model's recommendation, internal consistency, key clinical reasoning, and proposed management plan. This format allows for an efficient and deep side-by-side comparison, providing valuable insight into how the models' reasoning differs, not just that their final conclusions diverge.
Section: Results
Lack of a High-Level Visual Synthesis of Results (graphical-figure)
While the detailed tables are effective, the paper lacks a single summary figure to display the core findings—inter-model recommendation splits and intra-model consistency—for all four vignettes in one place. A composite visual would provide an intuitive, at-a-glance overview of LLM variability, dramatically increasing the accessibility and immediate impact of the study's central message.
Section: Results
Clear Explanation of Model Stochasticity (written-content)
The discussion effectively explains to a clinical audience that internal inconsistency is not a flaw but an inherent 'feature of probabilistic text generation.' This demystifies the models' 'stochastic nature' and provides a clear, mechanistic rationale for the observed 'flip-flops,' preventing the dangerous misconception that LLMs function like deterministic calculators.
Section: Discussion
Superficial Analysis of Communication Style's Impact (written-content)
The discussion identifies contrasting communication styles between models (e.g., authoritative vs. verbose) but does not deeply explore the potential psychological impact on clinicians. Explicitly discussing how an authoritative tone might engender over-reliance (automation bias), while a verbose style might be unduly dismissed, would significantly strengthen the paper's practical implications for safe LLM use.
Section: Discussion

Section Analysis

Methods

Non-Text Elements

Table 1 Case Vignettes
Figure/Table Image (Page 2)
First Reference in Text
nuanced situations in which (1) guidelines provide limited direction or available evidence does not cover the exact scenario, and (2) bedside management is known to vary among clinicians (Table 1).
Description
  • Table Content and Purpose: This table presents the exact text of four hypothetical medical scenarios, called vignettes, that were used as prompts to test the clinical decision-making abilities of six different large language models (LLMs). The scenarios were designed to reflect common but complex inpatient situations where there isn't a single, clear-cut "correct" answer, forcing the AI to weigh competing factors.
  • Vignette 1: Transfusing borderline hemoglobin: This scenario describes a 58-year-old male with a weak heart (HFrEF), clogged heart arteries (CAD), and moderate kidney disease (CKD3). His hemoglobin—a protein in red blood cells that carries oxygen—is 6.9 g/dL, which is just below the typical threshold of 7.0 g/dL for a blood transfusion. The dilemma is whether to give him blood, which could improve oxygen delivery but also risk fluid overload in his already weak heart.
  • Vignette 2: Restarting anticoagulation after bleed: This vignette features a 72-year-old female with atrial fibrillation (Afib), an irregular heartbeat that increases stroke risk, for which she takes a blood thinner (apixaban). She was hospitalized for a bleeding stomach ulcer, which has been treated. The question is whether to restart her blood thinner before she goes home. Restarting it would protect her from a stroke but carries the risk of causing the ulcer to bleed again.
  • Vignette 3: Discharge with lab abnormality: This case involves a 67-year-old male recovering well from pneumonia. However, his lab tests show a small increase in creatinine (from 1.6 to 1.85 mg/dL), a waste product that indicates kidney function. A higher level suggests the kidneys might be under stress. The decision is whether to discharge him home as he feels better or keep him in the hospital to monitor his kidney function.
  • Vignette 4: Bridging anticoagulation in high risk: This scenario presents a 67-year-old female with multiple heart and vascular diseases who is on a blood thinner (apixaban) for Afib. She needs a procedure where the blood thinner must be temporarily stopped. The question is whether to use 'bridging' therapy—giving a short-acting, injectable blood thinner to cover the period when she is off her usual medication. This is a complex decision, as bridging can reduce stroke risk but significantly increases bleeding risk.
Scientific Validity
  • ✅ High clinical relevance and realism: The vignettes selected represent authentic and common clinical dilemmas in inpatient medicine. They effectively simulate the 'gray zones' of clinical practice where physician judgment varies, making them an excellent tool for assessing the nuanced reasoning capabilities of LLMs beyond simple factual recall.
  • ✅ Appropriate design for the research question: The study aims to describe variation within and across LLMs. By designing prompts with no single correct answer, the methodology is well-suited to elicit and analyze differences in recommendations and internal consistency, which is a key objective of the paper.
  • 💡 Limited detail in prompts may introduce confounding variability: The prompts are intentionally brief to mimic a typical user query. However, this lack of detail (e.g., absence of vital signs, full medication lists, patient-specific goals of care) forces the LLMs to make assumptions. While this tests their handling of ambiguity, it also means that observed variability could stem from different implicit assumptions rather than core reasoning differences. Providing a slightly more detailed, standardized case could offer a more controlled comparison.
  • 💡 Selection process for vignettes is not described: The paper mentions the vignettes reflect common scenarios in the authors' practice, but the process for selecting these specific four cases is not detailed. Clarifying whether these were chosen via a consensus panel, a review of common consultation questions, or another systematic method would strengthen the claim that they are representative of key judgment-dependent decisions.
Communication
  • ✅ Excellent clarity and organization: The table is straightforward and highly effective. The two-column layout clearly separates the general topic from the specific prompt text, allowing readers to quickly understand the exact inputs used in the experiment.
  • ✅ Self-contained and informative: The table functions as a complete reference for the study's experimental prompts. A reader can understand the core of the methodological setup from this table alone, without needing to refer back and forth to the main text.
  • 💡 Readability of prompt text could be improved: The prompt for each vignette is presented as a single block of text. To enhance scannability, consider using formatting such as bullet points or line breaks to separate patient demographics, past medical history, current clinical situation, and the final question. This would make the dense clinical information easier to parse.

Results

Non-Text Elements

Table 2 Responses to Vignette 1—Transfusing Borderline Hemoglobin
Figure/Table Image (Page 3)
First Reference in Text
In a majority of repeated prompts, four of six (67%) models recommended transfusion now, while two (33%) recommended continued observation (Table 2).
Description
  • Divergent Overall Recommendations: The table shows a split in recommendations among the six large language models (LLMs) for a patient with borderline low hemoglobin. Four models (GPT-o1, Grok 3, Gemini 2.0, and OpenEvidence) recommended to 'Transfuse' the patient, while two models (GPT-4o and Claude Sonnet 3.7) recommended to 'Observe'. This aligns with the text, showing a 67% vs. 33% split in the primary advice given.
  • Variable Internal Consistency: The table reports the 'internal consistency' for each model, which measures how often a model gave the same answer when asked the identical question five times. This is scored from 0 to 1, where 1 means it gave the same answer every time. Two models, GPT-o1 and OpenEvidence, were perfectly consistent (1.0). In contrast, Claude Sonnet 3.7 and Grok 3 were the least consistent (0.6), meaning they changed their recommendation in two out of the five trials.
  • Differing Clinical Justifications: The 'Key considerations' column reveals the reasoning behind the recommendations. Models advising to 'Transfuse' frequently focused on the numerical hemoglobin value being below the standard 7 g/dL threshold. Conversely, models advising to 'Observe' highlighted the patient's lack of symptoms and the potential risk of fluid overload, a complication where excess fluid builds up in the body, which is dangerous for someone with a weak heart.
  • Management Plans and Cited Sources: The 'Management plan' column details the specific actions suggested, such as transfusing one unit of blood slowly. The 'Sources cited' column shows which medical guidelines or studies the models referenced to support their advice. For example, several models cited guidelines from the American Association of Blood Banks (AABB) or the TRICC trial, a major study on transfusion thresholds. Notably, some models provided no sources for their recommendations.
Scientific Validity
  • ✅ Excellent reporting of model variability: The inclusion of the 'Internal consistency' metric is a major strength. It transparently quantifies the stochastic nature of these models, providing crucial data on their reliability for a single query, which is a novel and important contribution to the literature on LLM evaluation in medicine.
  • ✅ Granular, multi-dimensional assessment: The table provides a rich, multi-faceted view of model performance by breaking down responses into overall recommendation, consistency, key reasoning points, management plan, and sources. This goes beyond a simple right/wrong analysis and allows for a deeper understanding of how the models approach clinical problems.
  • 💡 Subjectivity in qualitative data extraction: The 'Key considerations' and 'Management plan' columns are qualitative summaries of the LLMs' text outputs. The methodology for this summarization is not described, which introduces potential for researcher interpretation bias. Including a brief methods note on how these points were extracted and coded (e.g., by two independent reviewers with a consensus process) would improve the rigor and reproducibility of these qualitative findings.
  • 💡 The 'Overall' recommendation metric could be clarified: The 'Overall' recommendation is defined in the methods as the majority choice across five runs. For models with an internal consistency of 0.6, this means the recommendation was made in only 3 of 5 instances. While this is a reasonable approach, adding a footnote to the table explicitly defining 'Overall' and noting the majority threshold (e.g., '≥3 of 5 runs') would enhance clarity and prevent misinterpretation of the data's strength.
Communication
  • ✅ Clear and logical structure: The table is well-organized with clear column headers. The information flows logically from the high-level summary (Overall recommendation) to the detailed justification (Key considerations, Management plan, Sources cited), making it easy for the reader to follow and compare the different models.
  • ✅ Effective use of summarization: The use of bullet points to summarize key considerations and management steps is highly effective. It condenses potentially verbose LLM outputs into scannable, digestible information, allowing for efficient comparison across models without sacrificing essential details.
  • 💡 Grouping rows could enhance visual impact: To more immediately highlight the primary finding of inter-model disagreement, consider grouping the rows by the 'Overall' recommendation. Placing the two 'Observe' models together and the four 'Transfuse' models together, perhaps separated by a subtle horizontal line or shading, would make the divergence in advice instantly apparent to the reader.
  • 💡 Define abbreviations for broader accessibility: The table uses several clinical abbreviations (e.g., Hgb, CKD, AABB, TRICC, CBC). While familiar to an internal medicine audience, adding a footnote that defines these terms would make the table more self-contained and accessible to readers from other disciplines.
Table 3 Responses to Vignette 2—Restarting Anticoagulation After Bleed
Figure/Table Image (Page 3)
First Reference in Text
Three (50%) models recommended restarting anticoagulation on discharge, and three (50%) recommended holding in a majority of prompts (Table 3).
Description
  • Even Split in Overall Recommendations: The table shows a perfect 50/50 split among the six large language models (LLMs) regarding whether to restart a blood thinner (anticoagulation) after a stomach bleed. Three models (GPT-4o, Gemini 2.0, OpenEvidence) recommended restarting the medication, while the other three (GPT-o1, Claude Sonnet 3.7, Grok 3) recommended against restarting it for the time being.
  • Variable Internal Consistency: The 'internal consistency' score, which measures if a model gives the same answer over five identical queries, varied significantly. Claude Sonnet 3.7 and OpenEvidence were perfectly consistent (1.0), always giving the same advice. In contrast, GPT-o1 and Grok 3 were the least consistent (0.6), changing their recommendation in two of the five trials.
  • Conflicting Clinical Justifications: The 'Key considerations' column highlights the core clinical dilemma. Models recommending 'Restart' focused on the patient's high risk of stroke and the fact that the bleeding source was successfully treated ('definitively addressed'). Models recommending 'Don't restart' prioritized the risk of the ulcer re-bleeding, emphasizing that the cauterized vessel needs time to heal.
  • Differences in Management Plans: The proposed management plans differed, particularly among the models that advised holding the medication. The recommended duration to pause the blood thinner ranged from 4-7 days (Grok 3) to 1-2 weeks (Claude Sonnet 3.7). Most models, regardless of their primary recommendation, also suggested starting a proton pump inhibitor (PPI), a medication to reduce stomach acid and help the ulcer heal.
Scientific Validity
  • ✅ Excellent demonstration of clinical equipoise: The 50/50 split in recommendations effectively mirrors the real-world clinical equipoise surrounding this decision. This makes the vignette a well-chosen instrument for probing LLM reasoning in areas of genuine medical uncertainty, which is a key strength of the study design.
  • ✅ Granular data allows for deeper analysis of reasoning: The table provides rich qualitative data by including columns for 'Key considerations' and 'Management plan'. This allows the reader to understand the 'why' behind the different recommendations, moving beyond a simple tally of outcomes and offering insight into the models' reasoning processes.
  • 💡 Variable quality of evidence cited by models: The 'Sources cited' column reveals a critical finding: the quality of evidence used by the models is inconsistent. While some cite appropriate clinical guidelines (ACC, ACG) and trials, others refer vaguely to 'expert recommendations' or, in one case, a 'lay audience webpage about stroke risk'. This variability in sourcing is a significant aspect of model performance that warrants further investigation.
  • 💡 Lack of context for internal inconsistency: The table reports that some models had low consistency (0.6), but it does not provide insight into what might have triggered the change in recommendation across runs. While a full analysis is likely beyond the table's scope, acknowledging this limitation or providing a sample of the differing outputs in an appendix would add valuable context to the stochastic nature of the models.
Communication
  • ✅ Clear structure facilitates model comparison: The table is logically organized, with distinct columns for each aspect of the LLM's response. This parallel structure makes it very easy for readers to compare and contrast the different models on multiple dimensions, such as their overall choice, consistency, and rationale.
  • ✅ Caption is specific and informative: The caption clearly and concisely describes the content of the table, specifying both the vignette number and its clinical topic. This allows the table to be understood as a standalone element.
  • 💡 Visual grouping would enhance the main takeaway: The central finding is the 50/50 split. To make this more immediately apparent, consider visually grouping the three 'Restart' models and the three 'Don't restart' models. A simple horizontal line or light shading to separate the two blocks of rows would instantly draw the reader's eye to the fundamental disagreement.
  • 💡 Acronyms could be defined: The table uses several acronyms (e.g., AC, PPI, EGD, ACC, ACG) that may not be familiar to all readers. Adding a footnote to define these terms would improve the table's accessibility and ensure it is self-contained.
Table 4 Responses to Vignette 3—Discharge with Modest Creatinine Rise
Figure/Table Image (Page 4)
First Reference in Text
Across five re-prompts, three (50%) models recommended discharging today, and three (50%) recommended delaying a majority of the time (Table 4).
Description
  • Even Split in Recommendations: The table shows that the six large language models (LLMs) were evenly divided on whether to discharge a patient who was recovering from pneumonia but had a small increase in creatinine, a waste product in the blood that helps measure kidney function. Three models (GPT-4o, Grok 3, OpenEvidence) recommended 'Discharge', while the other three (GPT-o1, Claude Sonnet 3.7, Gemini 2.0) recommended 'Don't discharge'.
  • High Internal Consistency: Compared to other scenarios, the models were generally more consistent in their advice for this vignette. 'Internal consistency' measures how often a model gives the same answer when asked the same question five times. Four of the six models were perfectly consistent (a score of 1.0), and the other two wavered only once (a score of 0.8). This suggests this type of clinical question may elicit more stable responses from these AIs.
  • Differing Interpretations of the Lab Result: The 'Key considerations' column shows two distinct ways of thinking. The 'Discharge' models framed the creatinine increase as mild, expected, and not meeting the formal criteria for Acute Kidney Injury (AKI), which is a sudden decline in kidney function. They also emphasized the risks of a prolonged hospital stay. In contrast, the 'Don't discharge' models interpreted the same lab value as a potential early sign of AKI, warranting caution and further monitoring in the hospital.
  • Contrasting Management Plans and Sources: The proposed actions reflected the core recommendations. 'Discharge' models suggested repeating lab tests as an outpatient in 2-3 days. 'Don't discharge' models called for inpatient monitoring and additional lab tests within 24 hours. Notably, most models did not cite any sources, though Grok 3 referenced KDIGO (global kidney disease guidelines) and OpenEvidence cited ATS/IDSA guidelines related to discharge readiness for pneumonia.
Scientific Validity
  • ✅ High ecological validity of the vignette: The chosen scenario—a modest, unexpected lab abnormality in an otherwise improving patient—is an excellent and highly realistic test case. It effectively probes the models' ability to navigate clinical uncertainty and weigh the risks of over-investigation against the risks of a premature discharge.
  • ✅ Important finding regarding model-specific errors: The reference text notes that 'Gemini consistently misapplied the AKI definition.' This is a critical finding, demonstrating that some models may generate confident but factually incorrect reasoning. While not detailed in the table itself, identifying such specific failure modes is a significant contribution of the study.
  • 💡 Inconsistent application of clinical criteria: The table reveals that models disagree on whether the creatinine rise meets the criteria for AKI. This highlights a key challenge: LLMs may interpret established clinical definitions differently. This finding suggests that relying on LLMs for tasks requiring strict adherence to diagnostic criteria could be problematic without further validation.
  • 💡 The relevance of cited sources is variable: OpenEvidence's citation of pneumonia discharge guidelines (ATS, IDSA) is noteworthy. While these guidelines do address overall stability for discharge, they are not specific to the management of acute kidney injury. This raises questions about the model's ability to select the most relevant evidence for a specific clinical problem, as opposed to the patient's general diagnosis.
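The models' disagreement over whether a creatinine rise from 1.6 to 1.85 mg/dL constitutes AKI can be checked mechanically. A sketch against the standard KDIGO serum creatinine criteria (an absolute rise of ≥0.3 mg/dL within 48 hours, or a rise to ≥1.5× baseline), assuming those are the thresholds the models were implicitly applying:

```python
def meets_kdigo_aki_creatinine(baseline, current):
    """KDIGO stage 1 AKI by serum creatinine (mg/dL): an absolute
    rise of >= 0.3 mg/dL (within 48 h) or a rise to >= 1.5x baseline
    (within 7 days). Urine-output criteria are omitted in this sketch."""
    return (current - baseline) >= 0.3 or current >= 1.5 * baseline

# The vignette's values: 1.6 -> 1.85 mg/dL
# Absolute rise is 0.25 mg/dL and the ratio is ~1.16x baseline,
# so neither creatinine criterion is met.
print(meets_kdigo_aki_creatinine(1.6, 1.85))  # False
```

By this arithmetic, the 'Discharge' models' framing (formal AKI criteria not met) is the literal reading of the definition, which makes Gemini's reported misapplication of the AKI definition all the more notable.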
Communication
  • ✅ Effective and clear tabular layout: The table is well-structured, with distinct columns that logically present the data from the overall recommendation to the specific details of the reasoning and management plan. This facilitates easy comparison across the different LLM agents.
  • ✅ Summaries are concise and informative: The use of bullet points to summarize the 'Key considerations' and 'Management plan' is effective. It distills the core logic from the LLM outputs into a format that is quick to read and understand.
  • 💡 Visual grouping could improve readability: To make the 50/50 split in recommendations more immediately obvious, consider visually separating the three 'Discharge' models from the three 'Don't discharge' models. A subtle horizontal line or different background shading for the two groups would enhance the table's scannability.
  • 💡 Lack of a legend for abbreviations: The table uses several abbreviations (AKI, KDIGO, ATS, IDSA, FeNa) that might not be familiar to all readers. Including a footnote that defines these terms would make the table more self-contained and accessible to a broader scientific audience.
Table 5 Responses to Vignette 4—Bridging Anticoagulation in High Risk
Figure/Table Image (Page 5)
Table 5 Responses to Vignette 4—Bridging Anticoagulation in High Risk
First Reference in Text
Over successive prompts, in a majority of cases, five (83%) models recommended holding anticoagulation without bridging, while one (17%) favored bridging (Table 5).
Description
  • Strong Consensus Against Bridging: The table shows a strong agreement among the large language models (LLMs) on a complex medication management question. For a high-risk patient needing to temporarily stop their blood thinner (apixaban) for a procedure, five out of the six models (83%) recommended against using 'bridging'. Bridging is the practice of using a short-acting, injectable blood thinner to cover the period when the main oral medication is stopped. Only one model, Grok 3, recommended in favor of bridging.
  • High but Imperfect Internal Consistency: Most models were highly consistent in their advice. 'Internal consistency' measures if a model gives the same answer when asked the same question five times. Four models (GPT-4o, GPT-o1, Claude Sonnet 3.7, and OpenEvidence) were perfectly consistent, scoring 1.0. However, two models, Grok 3 and Gemini 2.0, were less stable, scoring 0.6, meaning they changed their recommendation in two of the five trials.
  • Reasoning Based on Modern Anticoagulant Guidelines: The 'Key considerations' column reveals that the majority of models based their 'Don't bridge' recommendation on current medical guidelines. They correctly noted that bridging is generally not recommended for patients on newer blood thinners known as DOACs (Direct Oral Anticoagulants), like apixaban. The models acknowledged the patient's high stroke risk but concluded that the bleeding risk from bridging outweighed the benefits.
  • Outlier Model's Nuanced Rationale: The one model that recommended bridging, Grok 3, provided a more nuanced rationale. It noted that the patient's risk profile was higher than the population studied in the landmark BRIDGE trial (a major study that influenced bridging guidelines) and that her additional vascular diseases increased her clotting risk, justifying the use of bridging in this specific case.
  • Evidence and Guideline Citation: Several models demonstrated an ability to cite relevant, high-quality evidence. They referred to guidelines from major medical societies like the American College of Cardiology (ACC), American Heart Association (AHA), and the American College of Chest Physicians, as well as the pivotal BRIDGE trial, to support their recommendations.
Scientific Validity
  • ✅ Excellent choice of vignette to test guideline application: This scenario is a strong test of an LLM's ability to apply established, evidence-based guidelines. The question of bridging DOACs is a common clinical query where the guideline recommendation (don't bridge) often conflicts with clinician anxiety about a high-risk patient, making it a perfect test of guideline adherence versus risk perception.
  • ✅ Data reveals sophisticated reasoning in some models: The table successfully captures nuanced differences in reasoning. Grok 3's ability to question the applicability of the BRIDGE trial to this specific, higher-risk patient, while ultimately providing a non-standard recommendation, demonstrates a level of sophisticated reasoning beyond simple rule-following. This is a significant finding regarding model capabilities.
  • 💡 Inconsistency of the outlier model is a key finding: Grok 3, the only model to recommend bridging, had an internal consistency of only 0.6. This is a critical detail, as it suggests the model's 'contrarian' view is unstable. It would be valuable to clarify in the text what the model recommended in the two minority runs—presumably 'don't bridge'—as this would significantly contextualize its overall stance as being highly variable rather than firmly pro-bridging.
  • 💡 Potentially irrelevant clinical reasoning noted: Gemini 2.0's warning about heparin-induced thrombocytopenia (HIT) is an interesting but clinically peripheral point when deciding on a short course of LMWH for bridging. While factually correct, its inclusion may represent a model retrieving a loosely associated fact rather than demonstrating salient clinical reasoning, highlighting a potential failure mode in how LLMs prioritize information.
Communication
  • ✅ Clear structure and direct support for text: The table is well-organized and its primary finding—the 5 vs. 1 split in recommendations—is immediately clear and directly supports the quantitative summary (83% vs. 17%) presented in the main text.
  • ✅ Effective summarization of complex outputs: The bulleted lists in the 'Key considerations' and 'Management plan' columns effectively distill the core logic and action items from what were likely much longer text outputs from the LLMs. This allows for efficient comparison between models.
  • 💡 Visually highlight the outlier for greater impact: The main narrative of this table is the consensus versus the single outlier (Grok 3). To enhance this, consider visually separating the row for Grok 3 from the others using a different background shade or a horizontal rule. This would immediately guide the reader's eye to the most significant point of contrast.
  • 💡 Define acronyms in a footnote: The table is rich with clinical acronyms (CHA2DS2-VASc, DOACs, CAD, PAD, LMWH, ACC, AHA, CHEST, ASH). To improve accessibility for a broader audience, provide a footnote that defines these terms. This would make the table more self-contained.
