This study investigates the reliability of large language models (LLMs) in common, judgment-dependent clinical scenarios where no single "correct" answer exists. The primary objective was to describe and quantify the variation in recommendations both across different commercially available LLMs (inter-model agreement) and within a single LLM when queried multiple times (intra-model consistency). The authors aimed to understand how these AI tools handle the ambiguity inherent in the "art" of medicine, as opposed to their well-documented performance on standardized tests with clear answers.
The research employed a cross-sectional simulation design. Six prominent LLMs, including five general-purpose models and one domain-specific model trained on biomedical literature, were tested. Each model was presented with four brief clinical vignettes representing common inpatient dilemmas, such as decisions on blood transfusions or anticoagulation. To measure consistency, each vignette was posed to each model in five independent sessions, yielding 120 queries in total (6 models × 4 vignettes × 5 repetitions). The primary outcomes were each model's majority recommendation and its internal consistency score, a metric of stability ranging from 0 (no consistency) to 1 (perfect consistency).
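Neither outcome requires more than a counting step. A minimal sketch is given below, assuming (consistent with the reported floor of 0.60 arising from a 3-of-5 split) that the internal consistency score is the fraction of a model's five responses that match its own modal answer; the function and example responses are hypothetical stand-ins, not the authors' code or data.

```python
from collections import Counter

def majority_and_consistency(responses):
    """Return (majority recommendation, internal consistency) for one model on one vignette.

    Illustrative reconstruction, not the authors' code: consistency is taken to be
    the share of the five independent sessions agreeing with the modal answer,
    so 3 of 5 identical responses -> 0.60 and 5 of 5 -> 1.0.
    """
    answer, count = Counter(responses).most_common(1)[0]
    return answer, count / len(responses)

# Hypothetical run: one model queried in five fresh sessions on the transfusion vignette.
runs = ["transfuse", "transfuse", "observe", "transfuse", "observe"]
print(majority_and_consistency(runs))  # ('transfuse', 0.6) -- the lowest score reported
```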
The results revealed substantial variability in LLM recommendations. Inter-model agreement was low, with recommendations diverging in every scenario; for two key decisions, the models were split exactly 50/50. Furthermore, intra-model consistency was often poor, with some models changing their recommendation on the same case in up to two of five repeated queries, yielding consistency scores as low as 0.60. Only in the scenario with the clearest guideline-based answer (bridging anticoagulation) did a strong consensus (83%) emerge, though even then, some models remained internally inconsistent.
The study concludes that for nuanced clinical questions, LLMs produce highly variable and sometimes unstable recommendations, mirroring the ambiguity of the tasks themselves. The authors provide direct guidance for clinicians, advising them to treat LLM outputs as a single perspective rather than a definitive answer, consider querying multiple models, and always retain final responsibility for patient care. This work highlights the risks of relying on single LLM outputs for complex decisions and underscores the need for methods that can surface model uncertainty to ensure the safe integration of this technology into medicine.
Overall, the evidence strongly supports the conclusion that current large language models are unreliable for nuanced, judgment-dependent clinical decision-making. The strongest corroborating findings are the substantial inter-model disagreements, which reached a 50/50 split in two of four scenarios, and the poor intra-model consistency, with some models changing their advice up to 40% of the time on identical prompts. This core claim is tempered by the finding that models reached a strong consensus (83% agreement) in the one scenario governed by clear clinical guidelines, suggesting their reliability may be context-dependent and higher for questions with a more established evidence base.
Major Limitations and Risks: The study's conclusions are constrained by several methodological factors. First, the simulation-based design using brief, synthetic vignettes may not accurately reflect complex, real-world clinical practice where clinicians can engage in iterative dialogue with an LLM and have access to richer patient data. This limits the direct applicability of the findings to actual clinical workflows. Second, as noted in the Methods analysis, the selection of only four vignettes without a clear justification raises questions about generalizability; the observed variability might be specific to these scenarios. Finally, the qualitative summaries of model reasoning presented in the results tables lack a described methodology for their extraction, introducing a risk of researcher bias in the interpretation of why models differed.
Based on this simulation, the recommendation is that clinicians should not use LLMs as primary or sole decision-makers for ambiguous clinical cases. Confidence in this recommendation is High for the specific models and conditions tested. However, confidence is Medium when generalizing this behavior to all 'gray-zone' clinical scenarios or future, more advanced models, primarily due to the simulation design's limitations. The most critical unanswered question is how this observed variability translates to real-world clinical practice. A prospective study evaluating the impact of LLM use on actual physician decisions and patient outcomes is the essential next step to raise confidence and develop evidence-based guidelines for safe implementation.
The abstract is impeccably organized using a standard structured-abstract format (Importance, Objective, Design, Exposures, Main Measures, Results, and Conclusions). This structure provides exceptional clarity, allowing readers to quickly grasp the study's rationale, methodology, key findings, and implications without ambiguity.
The results are presented with specific quantitative data, such as the percentage split in recommendations (e.g., 67% vs 33%) and the internal consistency scores (as low as 0.60). This numerical evidence provides a robust and compelling foundation for the study's conclusions, moving beyond purely qualitative descriptions of variability.
The conclusion translates the research findings into direct, practical recommendations for practicing clinicians. By advising users to view LLM output critically, sample multiple models, and retain final responsibility, the abstract enhances its clinical relevance and impact.
High impact. The abstract introduces a critical comparison between general-purpose and domain-specific LLMs in the 'DESIGN' section but waits until the 'RESULTS' to name the specific model. Explicitly identifying OpenEvidence as the domain-specific model in the 'DESIGN' section would immediately establish the experimental contrast, creating a stronger narrative link to the key finding that it was the most consistent model. This change would improve clarity and help the reader anticipate the significance of the results.
Implementation: Revise the sentence in the 'DESIGN' section to read: 'Six LLMs were queried: five general-purpose (GPT-4o, GPT-o1, Claude 3.7 Sonnet, Grok 3, and Gemini 2.0 Flash) and one domain-specific, OpenEvidence.'
The introduction masterfully employs a classic 'funnel' or 'inverted pyramid' structure. It begins with the broad context of LLMs in medicine, narrows to the specific problem of their performance in ambiguous clinical scenarios, and culminates in the study's precise research question and its significance. This logical progression makes the rationale for the study exceptionally clear, persuasive, and easy for the reader to follow.
The authors frame the core research problem with the highly effective and relatable distinction between the 'art' and 'science' of medicine. This framing immediately resonates with a clinical audience, clearly articulating the gap between current LLM testing paradigms and the judgment-based reality of medical practice. It successfully establishes the relevance and urgency of the investigation.
High impact. The introduction does an excellent job of establishing the theoretical problem of clinical ambiguity but remains abstract. Mentioning one or two concrete examples of the clinical dilemmas to be tested (e.g., transfusion or anticoagulation decisions) at the end of the section would ground the research question in tangible terms for the reader. This would make the study's purpose more vivid and create a smoother, more engaging transition into the Methods section.
Implementation: In the final paragraph, revise the sentence 'In this study, we examine how different commercially available LLMs respond to nuanced real-world management scenarios commonly encountered in inpatient medicine' to something like: 'In this study, we examine how different commercially available LLMs respond to nuanced real-world management scenarios commonly encountered in inpatient medicine, such as decisions regarding blood transfusions and anticoagulation.'
The section provides an exceptionally clear and detailed account of the study's methodology. It explicitly names the six LLMs, details the exact prompting protocol including the primer text, and precisely defines the quantitative outcomes like internal consistency. This high degree of transparency allows for critical evaluation and facilitates the potential for replication by other researchers, a cornerstone of strong scientific reporting.
The study design effectively mirrors potential real-world clinical application. The use of brief vignettes, default model 'creativity' settings, and a non-iterative prompting process reflects how a busy practitioner might quickly query an LLM. This pragmatic approach strengthens the external validity and practical relevance of the study's findings for clinicians considering these tools in their workflow.
The methodology is robustly designed to capture two critical forms of variation. By querying six different models, it assesses inter-model agreement, while the protocol of posing each case five times in fresh sessions allows for a direct quantitative measurement of intra-model consistency. This dual focus provides a comprehensive and nuanced view of LLM reliability that is central to the paper's contribution.
Medium impact. The methods state that the vignettes represent nuanced situations but do not provide a specific rationale for selecting these four particular clinical problems (transfusion, anticoagulation, discharge, and bridging) over other possibilities. Adding a brief justification—for example, that they represent a range of common yet challenging inpatient decisions involving different types of risk-benefit calculations—would strengthen the reader's confidence in the vignettes' representativeness and the generalizability of the findings.
Implementation: In the 'Study Design' subsection, after the sentence ending with '(Table 1).7, 8', add a sentence such as: 'These specific scenarios were selected to represent a diverse set of common inpatient dilemmas, including decisions about medical therapies (transfusion, anticoagulation), disposition planning (discharge readiness), and peri-procedural management (bridging).'
Medium impact. The 'Model Selection' section lists the six LLMs tested but does not explain the criteria for their inclusion. While the models are prominent, explicitly stating the rationale (e.g., chosen for their widespread availability, representation of different developers, and high performance on prior benchmarks) would enhance methodological transparency. This clarification helps the reader understand why this specific set of models provides a meaningful snapshot of the current LLM landscape.
Implementation: At the beginning of the 'Model Selection' section, add a sentence such as: 'The selected models represent a cross-section of the most widely used and publicly accessible platforms from major developers at the time of the study, including both general-purpose and biomedically focused architectures.'
The results for each vignette are meticulously organized into clear, comprehensive tables (Tables 2-5). These tables effectively summarize each model's overall recommendation, internal consistency, key clinical considerations, proposed management plan, and cited sources. This format allows for efficient side-by-side comparison and a deep, nuanced understanding of inter-model differences.
The section effectively uses quantitative data to substantiate its core findings of inter- and intra-model variability. For each scenario, the text provides specific percentages for inter-model agreement (e.g., 67% vs 33%) and descriptive statistics for intra-model consistency (median and range). This numerical rigor provides a solid, evidence-based foundation for the paper's conclusions.
The analysis moves beyond simply reporting the final recommendations to explore the underlying clinical reasoning articulated by the models. By summarizing the key factors that different models emphasized (e.g., 'restrictive 7 g/dL threshold' vs. 'volume-overload risk'), the authors provide valuable insight into how these systems approach complex risk-benefit trade-offs, mirroring the cognitive processes of human clinicians.
High impact. While the tables are detailed and effective, a single summary figure would dramatically increase the accessibility and immediate impact of the paper's central findings. A composite visual, such as a panel of bar charts, could display the inter-model recommendation split and median intra-model consistency for all four vignettes in one place. This would provide readers with an intuitive, at-a-glance overview of the study's core message about LLM variability, powerfully complementing the granular detail in the tables.
Implementation: Create a composite figure with four panels, one for each vignette. In each panel, use a stacked or grouped bar chart to visually represent the percentage split in recommendations (e.g., 67% Transfuse vs. 33% Observe). Annotate each panel with the median and range of the internal consistency scores for that vignette to synthesize both primary outcomes visually.
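A minimal matplotlib sketch of one way to lay out such a composite figure is shown below. Only the transfusion split (67%/33%), the two 50/50 splits, and the 83% bridging consensus are taken from this review; the option labels for the other three panels and the consistency annotations are placeholders to be filled in from Tables 2-5, and the layout choices are illustrative rather than prescriptive.

```python
import matplotlib.pyplot as plt

# Recommendation splits per vignette. The transfusion split (67/33) and the
# bridging consensus (83%) are quoted in the review; the two 50/50 splits are
# described but their option labels are not, so generic labels are used here.
# Replace labels and the consistency annotations with the values in Tables 2-5.
panels = [
    ("Transfusion",     ("Transfuse", "Observe"),               (67, 33)),
    ("Anticoagulation", ("Majority option", "Minority option"), (50, 50)),
    ("Discharge",       ("Majority option", "Minority option"), (50, 50)),
    ("Bridging",        ("Majority option", "Minority option"), (83, 17)),
]
consistency_notes = ["consistency: median (range)"] * 4  # fill in from Tables 2-5

fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharey=True)
for ax, (name, labels, split), note in zip(axes, panels, consistency_notes):
    ax.bar(labels, split, color=["#4C72B0", "#DD8452"])
    ax.set_title(name)
    ax.set_ylim(0, 100)
    ax.tick_params(axis="x", labelrotation=15)
    # Annotate each panel with the intra-model consistency summary.
    ax.text(0.5, 0.92, note, transform=ax.transAxes, ha="center", fontsize=8)
axes[0].set_ylabel("% of models recommending")
fig.suptitle("Inter-model splits and intra-model consistency across the four vignettes")
fig.tight_layout()
plt.show()
```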
The discussion excels at moving beyond a simple restatement of the results to provide a cohesive interpretation of their meaning. It synthesizes the findings of inter- and intra-model variability into a clear, overarching narrative about the unreliability of LLMs in clinically ambiguous situations, immediately establishing the paper's central thesis.
The authors effectively demystify a core technical concept for a clinical audience by explaining that internal inconsistency is not a flaw but a fundamental 'feature of probabilistic text generation.' This explanation of the models' 'stochastic nature' provides a clear, mechanistic rationale for the observed 'flip-flops,' enhancing the reader's understanding of the technology's inherent behavior and limitations.
The discussion clearly positions the study's contribution by contrasting its focus on 'gray-zone' medicine with the majority of published LLM research that evaluates tasks with an 'unambiguous ground truth.' This framing effectively highlights the novelty and clinical relevance of the paper, demonstrating how it addresses a critical gap in the literature concerning real-world LLM application.
High impact. The discussion astutely identifies the contrasting communication styles between models, such as the verbose Grok and the 'authoritative' OpenEvidence. However, it could more deeply explore the potential psychological impact of these styles on clinicians. Explicitly discussing how an authoritative tone might engender over-reliance or automation bias, while a stream-of-consciousness style might be dismissed, would significantly strengthen the paper's practical implications for safe LLM use. This belongs in the Discussion as it is an interpretation of the findings on model nuances.
Implementation: Add a few sentences after the point about OpenEvidence's attractive concreteness. For example: 'This stylistic difference carries significant implications for clinical practice. An authoritative tone, while appealing for its clarity, may inadvertently promote automation bias, leading clinicians to accept recommendations with less critical scrutiny. Conversely, a verbose, exploratory style might be perceived as less confident and be unduly dismissed, even if its reasoning is sound. Future work should investigate how these presentational nuances influence user trust and decision-making.'