GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Table of Contents

Overall Summary

Study Background and Main Findings

This randomized controlled trial assessed the impact of LLM assistance (GPT-4) on physician performance in management reasoning tasks. Physicians using the LLM scored significantly higher than those using conventional resources (43.0% vs 35.7%; difference = 6.5%, 95% CI = 2.7% to 10.2%, P < 0.001), but also spent significantly more time per case (801.5 s vs 690.2 s; difference = 119.3 s, 95% CI = 17.4 to 221.2 s, P = 0.022). The LLM alone performed comparably to LLM-assisted physicians.

Research Impact and Future Directions

The study provides compelling evidence that LLM assistance, specifically GPT-4, significantly improves physician performance on management reasoning tasks compared with conventional resources alone. The randomized controlled trial design and statistically significant results (P < 0.001) support a causal link between LLM use and improved scores. However, this improvement came with a statistically significant increase in time spent per case. The study also found that the LLM alone performed comparably to LLM-assisted physicians, suggesting potential for independent use, although this finding requires cautious interpretation given the small number of observations in the LLM-alone arm.

The practical utility of these findings is promising, suggesting that LLMs could serve as valuable decision support tools in clinical practice, particularly for complex management decisions. The study is well-contextualized within existing literature, acknowledging the growing body of research on LLMs in diagnostic reasoning and highlighting the novelty of its focus on management reasoning. However, the study's reliance on clinical vignettes, while ecologically valid to a degree, limits the direct applicability of the findings to real-world clinical settings. The increased time spent per case also raises questions about the efficiency of LLM use in time-constrained environments.

While the study demonstrates a clear benefit in terms of improved scores, the guidance for practitioners must be nuanced. LLMs like GPT-4 show potential as decision support tools, but their implementation should be approached with caution. The increased time investment, potential for bias, and risk of hallucinations/misinformation need careful consideration. The study strongly suggests that LLMs should be used as *assistive* tools, augmenting physician judgment rather than replacing it. Further research is needed to optimize the integration of LLMs into clinical workflows and to minimize potential risks.

Several critical questions remain unanswered. The long-term effects of LLM use on physician skills and clinical judgment are unknown. Will over-reliance on LLMs lead to deskilling or reduced critical thinking? The study's limitations, particularly the use of clinical vignettes and the lack of external validity evidence for the scoring rubrics, raise concerns about the generalizability of the findings. While the study addresses potential harm and finds no significant difference between groups, the long-term safety implications of LLM use in real-world clinical practice require further investigation. Future research should focus on real-world clinical trials, longitudinal studies, and the development of robust methods for detecting and mitigating potential biases and harms associated with LLM use.

Critical Analysis and Recommendations

Improved Physician Performance with LLM Assistance (written-content)
Physicians using GPT-4 scored significantly higher on management reasoning tasks (mean difference 6.5%, 95% CI = 2.7 to 10.2, P < 0.001). This demonstrates a statistically significant improvement in performance with LLM assistance, suggesting a potential for enhanced clinical decision-making.
Section: Results
Rigorous Study Design (written-content)
The study used a prospective, randomized, controlled trial design. This rigorous methodology minimizes bias and strengthens the causal inference between LLM use and improved physician performance.
Section: Methods
Rigorous Scoring Rubric Development (written-content)
The scoring rubrics were developed using a modified Delphi process with an expert panel and showed substantial inter-rater reliability (pooled kappa = 0.80). This indicates a rigorous and reliable approach to assessing the complex task of management reasoning.
Section: Methods
Increased Time Spent with LLM (written-content)
Physicians using the LLM spent significantly more time per case (mean difference = 119.3s, 95% CI = 17.4 to 221.2, P = 0.022). This increased time investment raises concerns about the efficiency of LLM use in time-constrained clinical settings.
Section: Results
Use of Clinical Vignettes (written-content)
The study used clinical vignettes rather than real patient cases. While based on real encounters, this limits the direct generalizability of the findings to real-world clinical practice.
Section: Discussion
Lack of Detail in Methods (written-content)
The Methods section lacks specific details on the randomization method and the training provided to participants in the GPT-4 arm. This lack of detail hinders reproducibility and makes it difficult to fully assess potential biases.
Section: Methods
Clear Study Flow Diagram (graphical-figure)
Figure 1 clearly presents the study flow diagram, adhering to CONSORT guidelines. This enhances transparency in reporting participant enrollment, allocation, and analysis.
Section: Results
Comparison with LLM Alone (graphical-figure)
Figure 3 includes a comparison with the LLM used alone, providing valuable insight into its potential as an independent tool. However, the small sample size for this arm (n=25) limits the statistical power of this comparison.
Section: Results
Incomplete Discussion of Risks (written-content)
The Discussion section does not fully address the potential for bias in the LLM's output or the risk of over-reliance on the LLM by physicians. A more balanced discussion of risks and benefits is needed.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | Study flow diagram. The study included 92 practicing attending...
Full Caption

Fig. 1 | Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

Figure/Table Image (Page 2)
Fig. 1 | Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • The study included 92 physicians with various training.: The study flow diagram, often called a CONSORT diagram, visually explains how participants moved through the study. Ninety-two practicing attending physicians and residents (doctors still in training) took part, with backgrounds in internal medicine (adult healthcare), family medicine (healthcare for all ages), or emergency medicine (immediate medical care). Each reviewed five expert-developed cases. Scoring was based on rubrics, or detailed scoring guides, developed using a Delphi process: a method for reaching expert consensus by having experts answer questions and receive anonymous feedback over repeated rounds. The doctors were then randomly assigned to one of two groups: one used GPT-4 (a large language model) via ChatGPT alongside conventional resources such as UpToDate (an online medical reference) and Google; the other used conventional resources alone. The primary measure was the difference in total rubric score between the groups; secondary measures included domain-specific scores and the time spent per case.
  • Physicians were randomized evenly between the LLM and conventional resources groups.: The physicians were randomized to either GPT-4 with conventional resources or conventional resources alone. Randomization is a process of randomly assigning participants to different groups in a study, such as a treatment group or a control group. This helps to ensure that the groups are as similar as possible at the beginning of the study, so that any differences in outcomes can be attributed to the intervention being studied. This is often done by using a computer-generated random number sequence to assign participants to groups.
  • 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents: The reference text from the Results section notes that 73% (67 out of 92) of the participants were attending physicians, meaning fully qualified doctors, while the remaining 27% (25 out of 92) were residents, or doctors still undergoing training. These percentages are used to describe the study population.
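As a minimal sketch of what computer-generated 1:1 assignment could look like (the trial's actual allocation procedure is not detailed here, so the seed and arm labels below are purely illustrative):

```python
import random

def randomize(participant_ids, seed=2024):
    """Illustrative 1:1 randomization: shuffle participant IDs with a
    seeded pseudorandom generator, then split the list in half.
    This is a generic sketch, not the trial's documented procedure."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"llm": ids[:half], "conventional": ids[half:]}

# 92 participants, as in the study, yields two arms of 46
arms = randomize(range(92))
```

Seeding the generator makes the allocation reproducible for auditing; real trials typically add safeguards such as allocation concealment, which this sketch omits.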
Scientific Validity
  • The study flow diagram is a standard element in RCTs.: The study flow diagram is a standard element in RCTs, adhering to CONSORT guidelines, and is essential for transparency. The mention of the Delphi process for rubric development strengthens the validity by highlighting a rigorous approach to outcome assessment.
  • Even randomization minimizes selection bias.: The even randomization, as stated in the reference text and implied by the flow diagram, is crucial for minimizing selection bias. The percentages of attending physicians and residents are descriptive and relevant for characterizing the study population.
Communication
  • The caption provides a good overview of the study.: The caption effectively summarizes the key aspects of the study design and outcomes, providing a concise overview for readers to quickly understand the figure's content. It clearly states the number of participants, their backgrounds, the intervention, and the primary and secondary outcomes.
  • The caption could be more explicit about the purpose of a CONSORT diagram.: While the caption is informative, it could benefit from a more explicit mention of the CONSORT diagram's purpose, which is to visually represent the flow of participants through each stage of the randomized controlled trial.
Table 1 | Participant characteristics according to randomized group
Figure/Table Image (Page 3)
Table 1 | Participant characteristics according to randomized group
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • Table 1 describes participant characteristics split by randomized group.: Table 1 is all about describing the people who took part in the study. It shows key details about them, split into two groups based on how they were randomly assigned: one group used GPT-4 along with regular resources, and the other used only regular resources. The table lists things like what stage of their career they're at (attending physician or resident), their medical specialty (internal medicine, emergency medicine, or family medicine), how many years they've been in medical training, and their past experience with using GPT-like tools. It also includes a number called the 'standardized mean difference' or SMD. The SMD is a way to measure how different the two groups are for each of these characteristics. An SMD close to zero means the groups are very similar, which is what you want when you randomly assign people to groups in a study.
  • The table provides numbers, percentages, means and standard deviations.: The table provides specific numbers and percentages for each characteristic within each group. For instance, it shows that in the overall study population (n=92), 73% were attending physicians. This is broken down to 74% in the GPT-4 group and 72% in the conventional resources group. The table also gives the mean (average) and standard deviation (a measure of spread) for continuous variables like years in medical training.
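The standardized mean difference described above can be illustrated with a short sketch; the two groups' "years in training" values below are made up for demonstration and are not taken from Table 1:

```python
from statistics import mean, stdev

def standardized_mean_difference(group_a, group_b):
    """SMD for a continuous variable: the difference in group means
    divided by the pooled (sample) standard deviation."""
    na, nb = len(group_a), len(group_b)
    sa, sb = stdev(group_a), stdev(group_b)
    pooled_sd = (((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical years-in-training values for two small arms
llm_arm = [6, 8, 5, 9, 7]
control_arm = [7, 6, 8, 5, 7]
smd = standardized_mean_difference(llm_arm, control_arm)  # ~0.29
```

Because the SMD is expressed in standard-deviation units, the same |SMD| < 0.2 rule of thumb applies regardless of the variable's original scale.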
Scientific Validity
  • Describing participant characteristics is essential for assessing comparability.: The inclusion of participant characteristics is essential for assessing the comparability of the randomized groups at baseline. This is a standard practice in RCTs to ensure that any observed differences in outcomes are attributable to the intervention rather than pre-existing differences between the groups.
  • The use of SMD is appropriate for assessing balance.: The use of the standardized mean difference (SMD) is appropriate for assessing the balance between the groups. While there is no universally agreed-upon threshold, an SMD less than 0.2 is generally considered a small and acceptable difference, suggesting adequate randomization.
Communication
  • The table is well-organized and facilitates comparison.: The table's clear labeling and organization facilitate easy comparison of baseline characteristics between the two randomized groups. The inclusion of the standardized mean difference (SMD) provides a quick assessment of any imbalances.
  • A definition of SMD would improve understanding for some readers.: While the table is generally clear, providing a brief definition of SMD in the caption or a footnote would enhance understanding for readers unfamiliar with this statistic.
Table 2 | Comparisons of the primary and secondary outcomes for physicians with...
Full Caption

Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)

Figure/Table Image (Page 3)
Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Table 2 compares outcomes between LLM and conventional resources groups.: Table 2 presents a head-to-head comparison of how well doctors performed on the cases, depending on whether they had access to the GPT-4 language model or only conventional resources. The table focuses on both the main (primary) outcome and other related (secondary) outcomes. To make the scores easier to understand and compare, they've been converted to a scale from 0 to 100. The table shows two main numbers for each outcome: the average score (mean) and the middle score (median). The average can be skewed by very high or low scores, while the median represents the true middle value. The table also includes the interquartile range (IQR), which is the range between the 25th and 75th percentile, representing the spread of the middle 50% of the data. Finally, it provides the difference between the two groups, a confidence interval (CI), and a P-value. The confidence interval is a range of values that we are fairly sure contains the true population mean. A p-value is a measure of the probability that an observed difference could have occurred just by random chance, if there really was no difference.
  • LLM group had an average total score of 43.0 vs 35.7 for conventional resources (p<0.001).: The table shows that the average total score for physicians using LLM was 43.0 (out of 100), while for those using conventional resources it was 35.7. The difference between these averages was 6.5, with a 95% confidence interval ranging from 2.7 to 10.2, and a p-value less than 0.001. This suggests that the LLM group performed significantly better than the conventional resources group.
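A simple normal-approximation version of such a confidence interval for a difference in means can be sketched as follows. Note that the paper's actual analysis uses a generalized mixed-effects model to handle repeated cases per physician; this unpaired version, with invented scores, is only for building intuition:

```python
from statistics import mean, stdev

def diff_ci_95(a, b):
    """Difference in means (a - b) with a normal-approximation 95% CI.
    Illustrative only: the study's reported CI comes from a
    mixed-effects model, not this simple unpaired formula."""
    diff = mean(a) - mean(b)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return diff, diff - 1.96 * se, diff + 1.96 * se

# Hypothetical standardized case scores for two small groups
llm_scores = [50, 40, 45, 35, 45]
control_scores = [35, 30, 40, 35, 30]
diff, lo, hi = diff_ci_95(llm_scores, control_scores)
```

If the resulting interval excludes zero, the observed difference is unlikely to be a chance finding at the 5% level, which is the same logic reflected in the table's P-values.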
Scientific Validity
  • The table provides comprehensive statistical information.: The table provides essential statistical information for evaluating the study's findings. The inclusion of means, medians, interquartile ranges, differences, confidence intervals, and p-values allows for a comprehensive assessment of the magnitude and statistical significance of the observed effects.
  • The use of a generalized mixed-effects model is appropriate.: The use of a generalized mixed-effects model, as mentioned in the table footnote, is appropriate for accounting for the potential correlation between cases for a given participant. This approach strengthens the validity of the statistical analysis.
Communication
  • Score standardization enhances interpretability.: The standardization of scores to a 0-100 scale enhances interpretability and allows for easier comparison across different outcome measures. The inclusion of both means and medians provides a more complete picture of the data distribution, as means can be influenced by outliers.
  • Clarity could be improved by specifying the level of confidence for CIs in the table header.: While the table is generally well-organized, the presentation of confidence intervals (CI) could be improved. Specifying the level of confidence (95% CI) in the column header, rather than only in the reference text, would provide clarity and avoid ambiguity.
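The 0-100 standardization itself could be as simple as the following sketch, assuming each rubric scores from 0 up to a case-specific maximum (the 26-point total below is hypothetical; the paper's per-case point totals are not reproduced here):

```python
def standardize(raw_score, max_points):
    """Rescale a raw rubric score to the 0-100 scale used in Table 2.
    Assumes the rubric's minimum possible score is 0."""
    return 100 * raw_score / max_points

# A hypothetical 13 points out of a 26-point rubric maps to 50.0
half_credit = standardize(13, 26)
```

Rescaling every case to the same range is what makes scores comparable across rubrics with different point totals.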
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with...
Full Caption

Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 4)
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Figure 2 uses a box plot to compare total scores between the two groups.: Figure 2 uses a box plot to show how the total scores compare between the two groups of doctors. The scores have already been standardized to a 0-100 scale, making it easier to compare. A box plot is a way of summarizing data using five key numbers: the median (the middle value), the first quartile (the value below which 25% of the data falls), the third quartile (the value below which 75% of the data falls), and the minimum and maximum values. The box itself shows the range between the first and third quartiles, representing the middle 50% of the data. The line inside the box shows the median. The 'whiskers' extend out from the box to the furthest data points that are within 1.5 times the interquartile range (IQR) from the box. Any points beyond the whiskers are considered outliers.
  • The caption specifies the number of physicians and cases in each group.: The caption specifies that 92 physicians participated in the study, with 46 in each group. The LLM group completed 178 cases, while the conventional resources group completed 197 cases. The box plot visually represents the distribution of these scores for each group.
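The box-plot elements defined in the caption (median, quartiles, and 1.5 × IQR whiskers) can be computed as in this sketch. The scores are invented, and the quartiles use the simple median-of-halves convention; plotting libraries may use slightly different interpolation rules:

```python
def box_plot_stats(values):
    """Median, quartiles and Tukey whisker limits (1.5 x IQR rule).
    Whiskers extend to the furthest data points inside the fences;
    anything beyond them would be drawn as an outlier."""
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        mid = len(seq) // 2
        return seq[mid] if len(seq) % 2 else (seq[mid - 1] + seq[mid]) / 2

    med = median(xs)
    q1 = median(xs[: n // 2])          # lower half, excluding the median
    q3 = median(xs[(n + 1) // 2 :])    # upper half, excluding the median
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    low_whisker = min(x for x in xs if x >= lo_fence)
    high_whisker = max(x for x in xs if x <= hi_fence)
    return med, q1, q3, low_whisker, high_whisker

# Hypothetical standardized scores; the 90 falls beyond the upper
# fence, so the upper whisker stops at 55 and 90 is an outlier
stats = box_plot_stats([20, 30, 35, 40, 42, 45, 50, 55, 90])
```

Seen this way, the whiskers summarize the bulk of the distribution while flagging extreme cases separately, which is why box plots suit skewed score data better than means alone.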
Scientific Validity
  • The use of a box plot is appropriate for visualizing the distribution of scores.: The use of a box plot is appropriate for visualizing the distribution of scores and comparing the central tendency and spread between the two groups. The caption clearly defines the elements of the box plot, ensuring accurate interpretation.
  • The reference to Table 2 provides complementary statistical information.: The reference to Table 2 provides the statistical information (mean, confidence interval, p-value) that complements the visual representation in the box plot. This combination of visual and numerical data enhances the rigor and interpretability of the results.
Communication
  • The box plot effectively visualizes the distribution of scores.: The box plot effectively visualizes the distribution of total scores for each group, making it easy to compare the median, quartiles, and range of scores. The clear labeling and concise caption contribute to the figure's understandability.
  • Adding a visual indicator of statistical significance could enhance the figure's impact.: While the box plot is informative, adding a visual indicator of statistical significance (e.g., an asterisk) could further enhance its impact. Also, reporting the exact p-value within the figure or caption would provide more precise information than simply stating 'P < 0.001' in the reference text.
Fig. 3 | Comparison of the primary outcome according to GPT alone versus...
Full Caption

Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 4)
Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
While trending toward scoring higher than humans using conventional resources (43.7% versus 35.7%, difference = 7.3%, 95% CI = -0.7% to 15.4%, P = 0.074) (Fig. 3).
Description
  • Figure 3 compares primary outcomes across three groups using a box plot.: Figure 3 presents a comparison of the primary outcome (total score standardized to 0-100) across three groups: GPT-4 used alone, physicians using GPT-4, and physicians using conventional resources. It uses a box plot to visualize the distribution of scores for each group. The box plot shows the median (the middle value), the first and third quartiles (the 25th and 75th percentiles), and the range of the data (with whiskers extending to 1.5 times the interquartile range).
  • The caption explains the methodology for the GPT-alone arm.: The caption explains that the 'GPT-alone arm' represents the model being prompted by the study team to complete the five cases. Each case was prompted five times, resulting in a total of 25 observations. The physicians with GPT-4 group included 46 participants who completed 178 cases, and the physician with conventional resources group included 46 participants who completed 197 cases.
Scientific Validity
  • The inclusion of a GPT-alone arm is a strength of the study.: The inclusion of a GPT-alone arm is a strength of the study, as it allows for a direct comparison of the model's performance against human physicians. This provides valuable information about the potential for standalone AI applications in this domain.
  • The small sample size of the GPT-alone arm should be considered.: The relatively small number of observations in the GPT-alone arm (25) compared to the physician groups (178 and 197) should be considered when interpreting the results. This difference in sample size may limit the statistical power to detect significant differences.
  • The p-value of 0.074 indicates a trend, but is not statistically significant.: The p-value of 0.074 indicates a trend towards higher scores for GPT alone compared to physicians using conventional resources, but this difference is not statistically significant at the conventional alpha level of 0.05. Therefore, caution is warranted when interpreting this result.
Communication
  • The figure effectively compares GPT alone against human physicians.: The figure effectively compares the performance of GPT alone against human physicians, both with and without GPT assistance. The inclusion of the GPT-alone arm provides valuable insight into the potential of the model as a standalone tool.
  • The caption clearly explains the methodology for the GPT-alone arm.: The caption clearly explains the methodology for the GPT-alone arm, including the number of prompts and observations. This transparency is crucial for understanding the context of the GPT-alone results.
  • The p-value of 0.074 should be interpreted cautiously.: The p-value of 0.074 is marginal and should be interpreted cautiously. While the text notes a 'trend', it's important to avoid overstating the significance of this result.
Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and...
Full Caption

Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and physicians using conventional resources only. Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 5)