This study investigates the reliability of large language models (LLMs) in common, judgment-dependent clinical scenarios where no single "correct" answer exists. The primary objective was to describe and quantify the variation in recommendations both across different commercially available LLMs (inter-model agreement) and within a single LLM when queried multiple times (intra-model consistency). The authors aimed to understand how these AI tools handle the ambiguity inherent in the "art" of medicine, as opposed to their well-documented performance on standardized tests with clear answers.
The research employed a cross-sectional simulation design. Six prominent LLMs, including five general-purpose models and one domain-specific model trained on biomedical literature, were tested. Each model was presented with four brief clinical vignettes representing common inpatient dilemmas, such as decisions on blood transfusions or anticoagulation. To measure consistency, each vignette was posed to each model in five independent sessions, yielding 120 queries in total (6 models × 4 vignettes × 5 repetitions). The primary outcomes were each model's majority recommendation and its internal consistency score, a metric of stability ranging from 0 (no consistency) to 1 (perfect consistency).
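Neither outcome requires more than a counting step. A minimal sketch is given below, assuming (consistent with the reported floor of 0.60 arising from a 3-of-5 split) that the internal consistency score is the fraction of a model's five responses that match its own modal answer; the function and example responses are hypothetical stand-ins, not the authors' code or data.

```python
from collections import Counter

def majority_and_consistency(responses):
    """Return (majority recommendation, internal consistency) for one model on one vignette.

    Illustrative reconstruction, not the authors' code: consistency is taken to be
    the share of the five independent sessions agreeing with the modal answer,
    so 3 of 5 identical responses -> 0.60 and 5 of 5 -> 1.0.
    """
    answer, count = Counter(responses).most_common(1)[0]
    return answer, count / len(responses)

# Hypothetical run: one model queried in five fresh sessions on the transfusion vignette.
runs = ["transfuse", "transfuse", "observe", "transfuse", "observe"]
print(majority_and_consistency(runs))  # ('transfuse', 0.6) -- the lowest score reported
```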
The results revealed substantial variability in LLM recommendations. Inter-model agreement was low, with recommendations diverging in every scenario; for two key decisions, the models were split exactly 50/50. Furthermore, intra-model consistency was often poor, with some models changing their recommendation on the same case in up to two of five repeated queries, yielding consistency scores as low as 0.60. Only in the scenario with the clearest guideline-based answer (bridging anticoagulation) did a strong consensus (83%) emerge, though even then, some models remained internally inconsistent.
The study concludes that for nuanced clinical questions, LLMs produce highly variable and sometimes unstable recommendations, mirroring the ambiguity of the tasks themselves. The authors provide direct guidance for clinicians, advising them to treat LLM outputs as a single perspective rather than a definitive answer, consider querying multiple models, and always retain final responsibility for patient care. This work highlights the risks of relying on single LLM outputs for complex decisions and underscores the need for methods that can surface model uncertainty to ensure the safe integration of this technology into medicine.
Overall, the evidence strongly supports the conclusion that current large language models are unreliable for nuanced, judgment-dependent clinical decision-making. The strongest corroborating findings are the substantial inter-model disagreements, which reached a 50/50 split in two of four scenarios, and the poor intra-model consistency, with some models changing their advice up to 40% of the time on identical prompts. This core claim is tempered by the finding that models reached a strong consensus (83% agreement) in the one scenario governed by clear clinical guidelines, suggesting their reliability may be context-dependent and higher for questions with a more established evidence base.
Major Limitations and Risks: The study's conclusions are constrained by several methodological factors. First, the simulation-based design using brief, synthetic vignettes may not accurately reflect complex, real-world clinical practice where clinicians can engage in iterative dialogue with an LLM and have access to richer patient data. This limits the direct applicability of the findings to actual clinical workflows. Second, as noted in the Methods analysis, the selection of only four vignettes without a clear justification raises questions about generalizability; the observed variability might be specific to these scenarios. Finally, the qualitative summaries of model reasoning presented in the results tables lack a described methodology for their extraction, introducing a risk of researcher bias in the interpretation of why models differed.
Based on this simulation, the recommendation is that clinicians should not use LLMs as primary or sole decision-makers for ambiguous clinical cases. Confidence in this recommendation is High for the specific models and conditions tested. However, confidence is Medium when generalizing this behavior to all 'gray-zone' clinical scenarios or future, more advanced models, primarily due to the simulation design's limitations. The most critical unanswered question is how this observed variability translates to real-world clinical practice. A prospective study evaluating the impact of LLM use on actual physician decisions and patient outcomes is the essential next step to raise confidence and develop evidence-based guidelines for safe implementation.
The abstract is impeccably organized using a standard structured-abstract format (Importance, Objective, Design, Exposures, Main Measures, Results, and Conclusions). This structure provides exceptional clarity, allowing readers to quickly grasp the study's rationale, methodology, key findings, and implications without ambiguity.
The results are presented with specific quantitative data, such as the percentage split in recommendations (e.g., 67% vs 33%) and the internal consistency scores (as low as 0.60). This numerical evidence provides a robust and compelling foundation for the study's conclusions, moving beyond purely qualitative descriptions of variability.
The conclusion translates the research findings into direct, practical recommendations for practicing clinicians. By advising users to view LLM output critically, sample multiple models, and retain final responsibility, the abstract enhances its clinical relevance and impact.
High impact. The abstract introduces a critical comparison between general-purpose and domain-specific LLMs in the 'DESIGN' section but waits until the 'RESULTS' to name the specific model. Explicitly identifying OpenEvidence as the domain-specific model in the 'DESIGN' section would immediately establish the experimental contrast, creating a stronger narrative link to the key finding that it was the most consistent model. This change would improve clarity and help the reader anticipate the significance of the results.
Implementation: Revise the sentence in the 'DESIGN' section to read: 'Six LLMs were queried: five general-purpose (GPT-4o, GPT-o1, Claude 3.7 Sonnet, Grok 3, and Gemini 2.0 Flash) and one domain-specific, OpenEvidence.'
The introduction masterfully employs a classic 'funnel' or 'inverted pyramid' structure. It begins with the broad context of LLMs in medicine, narrows to the specific problem of their performance in ambiguous clinical scenarios, and culminates in the study's precise research question and its significance. This logical progression makes the rationale for the study exceptionally clear, persuasive, and easy for the reader to follow.
The authors frame the core research problem with the highly effective and relatable distinction between the 'art' and 'science' of medicine. This framing immediately resonates with a clinical audience, clearly articulating the gap between current LLM testing paradigms and the judgment-based reality of medical practice. It successfully establishes the relevance and urgency of the investigation.
High impact. The introduction does an excellent job of establishing the theoretical problem of clinical ambiguity but remains abstract. Mentioning one or two concrete examples of the clinical dilemmas to be tested (e.g., transfusion or anticoagulation decisions) at the end of the section would ground the research question in tangible terms for the reader. This would make the study's purpose more vivid and create a smoother, more engaging transition into the Methods section.
Implementation: In the final paragraph, revise the sentence 'In this study, we examine how different commercially available LLMs respond to nuanced real-world management scenarios commonly encountered in inpatient medicine' to something like: 'In this study, we examine how different commercially available LLMs respond to nuanced real-world management scenarios commonly encountered in inpatient medicine, such as decisions regarding blood transfusions and anticoagulation.'
The section provides an exceptionally clear and detailed account of the study's methodology. It explicitly names the six LLMs, details the exact prompting protocol including the primer text, and precisely defines the quantitative outcomes like internal consistency. This high degree of transparency allows for critical evaluation and facilitates the potential for replication by other researchers, a cornerstone of strong scientific reporting.
The study design effectively mirrors potential real-world clinical application. The use of brief vignettes, default model 'creativity' settings, and a non-iterative prompting process reflects how a busy practitioner might quickly query an LLM. This pragmatic approach strengthens the external validity and practical relevance of the study's findings for clinicians considering these tools in their workflow.
The methodology is robustly designed to capture two critical forms of variation. By querying six different models, it assesses inter-model agreement, while the protocol of posing each case five times in fresh sessions allows for a direct quantitative measurement of intra-model consistency. This dual focus provides a comprehensive and nuanced view of LLM reliability that is central to the paper's contribution.
Medium impact. The methods state that the vignettes represent nuanced situations but do not provide a specific rationale for selecting these four particular clinical problems (transfusion, anticoagulation, discharge, and bridging) over other possibilities. Adding a brief justification—for example, that they represent a range of common yet challenging inpatient decisions involving different types of risk-benefit calculations—would strengthen the reader's confidence in the vignettes' representativeness and the generalizability of the findings.
Implementation: In the 'Study Design' subsection, after the sentence ending with '(Table 1).7, 8', add a sentence such as: 'These specific scenarios were selected to represent a diverse set of common inpatient dilemmas, including decisions about medical therapies (transfusion, anticoagulation), disposition planning (discharge readiness), and peri-procedural management (bridging).'
Medium impact. The 'Model Selection' section lists the six LLMs tested but does not explain the criteria for their inclusion. While the models are prominent, explicitly stating the rationale (e.g., chosen for their widespread availability, representation of different developers, and high performance on prior benchmarks) would enhance methodological transparency. This clarification helps the reader understand why this specific set of models provides a meaningful snapshot of the current LLM landscape.
Implementation: At the beginning of the 'Model Selection' section, add a sentence such as: 'The selected models represent a cross-section of the most widely used and publicly accessible platforms from major developers at the time of the study, including both general-purpose and biomedically focused architectures.'
The results for each vignette are meticulously organized into clear, comprehensive tables (Tables 2-5). These tables effectively summarize each model's overall recommendation, internal consistency, key clinical considerations, proposed management plan, and cited sources. This format allows for efficient side-by-side comparison and a deep, nuanced understanding of inter-model differences.
The section effectively uses quantitative data to substantiate its core findings of inter- and intra-model variability. For each scenario, the text provides specific percentages for inter-model agreement (e.g., 67% vs 33%) and descriptive statistics for intra-model consistency (median and range). This numerical rigor provides a solid, evidence-based foundation for the paper's conclusions.
The analysis moves beyond simply reporting the final recommendations to explore the underlying clinical reasoning articulated by the models. By summarizing the key factors that different models emphasized (e.g., 'restrictive 7 g/dL threshold' vs. 'volume-overload risk'), the authors provide valuable insight into how these systems approach complex risk-benefit trade-offs, mirroring the cognitive processes of human clinicians.
High impact. While the tables are detailed and effective, a single summary figure would dramatically increase the accessibility and immediate impact of the paper's central findings. A composite visual, such as a panel of bar charts, could display the inter-model recommendation split and median intra-model consistency for all four vignettes in one place. This would provide readers with an intuitive, at-a-glance overview of the study's core message about LLM variability, powerfully complementing the granular detail in the tables.
Implementation: Create a composite figure with four panels, one for each vignette. In each panel, use a stacked or grouped bar chart to visually represent the percentage split in recommendations (e.g., 67% Transfuse vs. 33% Observe). Annotate each panel with the median and range of the internal consistency scores for that vignette to synthesize both primary outcomes visually.
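A minimal matplotlib sketch of one way to lay out such a composite figure is shown below. Only the transfusion split (67%/33%), the two 50/50 splits, and the 83% bridging consensus are taken from this review; the option labels for the other three panels and the consistency annotations are placeholders to be filled in from Tables 2-5, and the layout choices are illustrative rather than prescriptive.

```python
import matplotlib.pyplot as plt

# Recommendation splits per vignette. The transfusion split (67/33) and the
# bridging consensus (83%) are quoted in the review; the two 50/50 splits are
# described but their option labels are not, so generic labels are used here.
# Replace labels and the consistency annotations with the values in Tables 2-5.
panels = [
    ("Transfusion",     ("Transfuse", "Observe"),               (67, 33)),
    ("Anticoagulation", ("Majority option", "Minority option"), (50, 50)),
    ("Discharge",       ("Majority option", "Minority option"), (50, 50)),
    ("Bridging",        ("Majority option", "Minority option"), (83, 17)),
]
consistency_notes = ["consistency: median (range)"] * 4  # fill in from Tables 2-5

fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharey=True)
for ax, (name, labels, split), note in zip(axes, panels, consistency_notes):
    ax.bar(labels, split, color=["#4C72B0", "#DD8452"])
    ax.set_title(name)
    ax.set_ylim(0, 100)
    ax.tick_params(axis="x", labelrotation=15)
    # Annotate each panel with the intra-model consistency summary.
    ax.text(0.5, 0.92, note, transform=ax.transAxes, ha="center", fontsize=8)
axes[0].set_ylabel("% of models recommending")
fig.suptitle("Inter-model splits and intra-model consistency across the four vignettes")
fig.tight_layout()
plt.show()
```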
The discussion excels at moving beyond a simple restatement of the results to provide a cohesive interpretation of their meaning. It synthesizes the findings of inter- and intra-model variability into a clear, overarching narrative about the unreliability of LLMs in clinically ambiguous situations, immediately establishing the paper's central thesis.
The authors effectively demystify a core technical concept for a clinical audience by explaining that internal inconsistency is not a flaw but a fundamental 'feature of probabilistic text generation.' This explanation of the models' 'stochastic nature' provides a clear, mechanistic rationale for the observed 'flip-flops,' enhancing the reader's understanding of the technology's inherent behavior and limitations.
The discussion clearly positions the study's contribution by contrasting its focus on 'gray-zone' medicine with the majority of published LLM research that evaluates tasks with an 'unambiguous ground truth.' This framing effectively highlights the novelty and clinical relevance of the paper, demonstrating how it addresses a critical gap in the literature concerning real-world LLM application.
High impact. The discussion astutely identifies the contrasting communication styles between models, such as the verbose Grok and the 'authoritative' OpenEvidence. However, it could more deeply explore the potential psychological impact of these styles on clinicians. Explicitly discussing how an authoritative tone might engender over-reliance or automation bias, while a stream-of-consciousness style might be dismissed, would significantly strengthen the paper's practical implications for safe LLM use. This belongs in the Discussion as it is an interpretation of the findings on model nuances.
Implementation: Add a few sentences after the point about OpenEvidence's attractive concreteness. For example: 'This stylistic difference carries significant implications for clinical practice. An authoritative tone, while appealing for its clarity, may inadvertently promote automation bias, leading clinicians to accept recommendations with less critical scrutiny. Conversely, a verbose, exploratory style might be perceived as less confident and be unduly dismissed, even if its reasoning is sound. Future work should investigate how these presentational nuances influence user trust and decision-making.'