GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Table of Contents

Overall Summary

Study Background and Main Findings

This randomized controlled trial assessed the impact of LLM assistance (GPT-4) on physician performance in management reasoning tasks. Physicians using the LLM scored significantly higher than those using conventional resources (43.0% vs 35.7%; difference = 6.5%, 95% CI = 2.7% to 10.2%, P < 0.001), but also spent significantly more time per case (801.5 s vs 690.2 s; difference = 119.3 s, 95% CI = 17.4 to 221.2 s, P = 0.022). The LLM alone performed comparably to LLM-assisted physicians.

Research Impact and Future Directions

The study provides compelling evidence that LLM assistance, specifically GPT-4, significantly improves physician performance on management reasoning tasks compared with conventional resources alone. The randomized controlled trial design and statistically significant results (P < 0.001) support a causal link between LLM use and improved scores. However, this improvement came with a statistically significant increase in time spent per case. The study also found that the LLM alone performed comparably to LLM-assisted physicians, suggesting potential for independent use, although this finding requires cautious interpretation given the small number of observations in the LLM-alone arm.

The practical utility of these findings is promising, suggesting that LLMs could serve as valuable decision support tools in clinical practice, particularly for complex management decisions. The study is well-contextualized within existing literature, acknowledging the growing body of research on LLMs in diagnostic reasoning and highlighting the novelty of its focus on management reasoning. However, the study's reliance on clinical vignettes, while ecologically valid to a degree, limits the direct applicability of the findings to real-world clinical settings. The increased time spent per case also raises questions about the efficiency of LLM use in time-constrained environments.

While the study demonstrates a clear benefit in terms of improved scores, the guidance for practitioners must be nuanced. LLMs like GPT-4 show potential as decision support tools, but their implementation should be approached with caution. The increased time investment, potential for bias, and risk of hallucinations/misinformation need careful consideration. The study strongly suggests that LLMs should be used as *assistive* tools, augmenting physician judgment rather than replacing it. Further research is needed to optimize the integration of LLMs into clinical workflows and to minimize potential risks.

Several critical questions remain unanswered. The long-term effects of LLM use on physician skills and clinical judgment are unknown. Will over-reliance on LLMs lead to deskilling or reduced critical thinking? The study's limitations, particularly the use of clinical vignettes and the lack of external validity evidence for the scoring rubrics, raise concerns about the generalizability of the findings. While the study addresses potential harm and finds no significant difference between groups, the long-term safety implications of LLM use in real-world clinical practice require further investigation. Future research should focus on real-world clinical trials, longitudinal studies, and the development of robust methods for detecting and mitigating potential biases and harms associated with LLM use.

Critical Analysis and Recommendations

Improved Physician Performance with LLM Assistance (written-content)
Physicians using GPT-4 scored significantly higher on management reasoning tasks (mean difference 6.5%, 95% CI = 2.7 to 10.2, P < 0.001). This demonstrates a statistically significant improvement in performance with LLM assistance, suggesting a potential for enhanced clinical decision-making.
Section: Results
Rigorous Study Design (written-content)
The study used a prospective, randomized, controlled trial design. This rigorous methodology minimizes bias and strengthens the causal inference between LLM use and improved physician performance.
Section: Methods
Rigorous Scoring Rubric Development (written-content)
The scoring rubrics were developed using a modified Delphi process with an expert panel and showed substantial inter-rater reliability (pooled kappa = 0.80). This indicates a rigorous and reliable approach to assessing the complex task of management reasoning.
Section: Methods
Increased Time Spent with LLM (written-content)
Physicians using the LLM spent significantly more time per case (mean difference = 119.3s, 95% CI = 17.4 to 221.2, P = 0.022). This increased time investment raises concerns about the efficiency of LLM use in time-constrained clinical settings.
Section: Results
Use of Clinical Vignettes (written-content)
The study used clinical vignettes rather than real patient cases. While based on real encounters, this limits the direct generalizability of the findings to real-world clinical practice.
Section: Discussion
Lack of Detail in Methods (written-content)
The Methods section lacks specific details on the randomization method and the training provided to participants in the GPT-4 arm. This lack of detail hinders reproducibility and makes it difficult to fully assess potential biases.
Section: Methods
Clear Study Flow Diagram (graphical-figure)
Figure 1 clearly presents the study flow diagram, adhering to CONSORT guidelines. This enhances transparency in reporting participant enrollment, allocation, and analysis.
Section: Results
Comparison with LLM Alone (graphical-figure)
Figure 3 includes a comparison with the LLM used alone, providing valuable insight into its potential as an independent tool. However, the small sample size for this arm (n=25) limits the statistical power of this comparison.
Section: Results
Incomplete Discussion of Risks (written-content)
The Discussion section does not fully address the potential for bias in the LLM's output or the risk of over-reliance on the LLM by physicians. A more balanced discussion of risks and benefits is needed.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | Study flow diagram. The study included 92 practicing attending...
Full Caption

Fig. 1 | Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.

Figure/Table Image (Page 2)
Fig. 1 | Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • The study included 92 physicians with various training.: The study flow diagram, often called a CONSORT diagram, visually explains how participants moved through the study. Ninety-two practicing attending physicians and residents (doctors still in training) took part, with backgrounds in internal medicine (adult healthcare), family medicine (healthcare for all ages), or emergency medicine (immediate medical care). Each reviewed five expert-developed cases. Scoring was based on rubrics, or detailed scoring guides, developed using a Delphi process: a method for reaching expert consensus by having experts answer questions and receive anonymous feedback over repeated rounds. The doctors were then randomly assigned to one of two groups: one used GPT-4 (a large language model) via ChatGPT alongside conventional resources such as UpToDate (an online medical reference) and Google; the other used conventional resources alone. The primary measure was the difference in total rubric score between the groups; secondary measures included domain-specific scores and the time spent per case.
  • Physicians were randomized evenly between the LLM and conventional resources groups.: The physicians were randomized to either GPT-4 with conventional resources or conventional resources alone. Randomization is a process of randomly assigning participants to different groups in a study, such as a treatment group or a control group. This helps to ensure that the groups are as similar as possible at the beginning of the study, so that any differences in outcomes can be attributed to the intervention being studied. This is often done by using a computer-generated random number sequence to assign participants to groups.
  • 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents: The reference text from the Results section notes that 73% (67 out of 92) of the participants were attending physicians, meaning fully qualified doctors, while the remaining 27% (25 out of 92) were residents, or doctors still undergoing training. These percentages are used to describe the study population.
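As a minimal sketch of what computer-generated 1:1 assignment could look like (the trial's actual allocation procedure is not detailed here, so the seed and arm labels below are purely illustrative):

```python
import random

def randomize(participant_ids, seed=2024):
    """Illustrative 1:1 randomization: shuffle participant IDs with a
    seeded pseudorandom generator, then split the list in half.
    This is a generic sketch, not the trial's documented procedure."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"llm": ids[:half], "conventional": ids[half:]}

# 92 participants, as in the study, yields two arms of 46
arms = randomize(range(92))
```

Seeding the generator makes the allocation reproducible for auditing; real trials typically add safeguards such as allocation concealment, which this sketch omits.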
Scientific Validity
  • The study flow diagram is a standard element in RCTs.: The study flow diagram is a standard element in RCTs, adhering to CONSORT guidelines, and is essential for transparency. The mention of the Delphi process for rubric development strengthens the validity by highlighting a rigorous approach to outcome assessment.
  • Even randomization minimizes selection bias.: The even randomization, as stated in the reference text and implied by the flow diagram, is crucial for minimizing selection bias. The percentages of attending physicians and residents are descriptive and relevant for characterizing the study population.
Communication
  • The caption provides a good overview of the study.: The caption effectively summarizes the key aspects of the study design and outcomes, providing a concise overview for readers to quickly understand the figure's content. It clearly states the number of participants, their backgrounds, the intervention, and the primary and secondary outcomes.
  • The caption could be more explicit about the purpose of a CONSORT diagram.: While the caption is informative, it could benefit from a more explicit mention of the CONSORT diagram's purpose, which is to visually represent the flow of participants through each stage of the randomized controlled trial.
Table 1 | Participant characteristics according to randomized group
Figure/Table Image (Page 3)
Table 1 | Participant characteristics according to randomized group
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • Table 1 describes participant characteristics split by randomized group.: Table 1 is all about describing the people who took part in the study. It shows key details about them, split into two groups based on how they were randomly assigned: one group used GPT-4 along with regular resources, and the other used only regular resources. The table lists things like what stage of their career they're at (attending physician or resident), their medical specialty (internal medicine, emergency medicine, or family medicine), how many years they've been in medical training, and their past experience with using GPT-like tools. It also includes a number called the 'standardized mean difference' or SMD. The SMD is a way to measure how different the two groups are for each of these characteristics. An SMD close to zero means the groups are very similar, which is what you want when you randomly assign people to groups in a study.
  • The table provides numbers, percentages, means and standard deviations.: The table provides specific numbers and percentages for each characteristic within each group. For instance, it shows that in the overall study population (n=92), 73% were attending physicians. This is broken down to 74% in the GPT-4 group and 72% in the conventional resources group. The table also gives the mean (average) and standard deviation (a measure of spread) for continuous variables like years in medical training.
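The standardized mean difference described above can be illustrated with a short sketch; the two groups' "years in training" values below are made up for demonstration and are not taken from Table 1:

```python
from statistics import mean, stdev

def standardized_mean_difference(group_a, group_b):
    """SMD for a continuous variable: the difference in group means
    divided by the pooled (sample) standard deviation."""
    na, nb = len(group_a), len(group_b)
    sa, sb = stdev(group_a), stdev(group_b)
    pooled_sd = (((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical years-in-training values for two small arms
llm_arm = [6, 8, 5, 9, 7]
control_arm = [7, 6, 8, 5, 7]
smd = standardized_mean_difference(llm_arm, control_arm)  # ~0.29
```

Because the SMD is expressed in standard-deviation units, the same |SMD| < 0.2 rule of thumb applies regardless of the variable's original scale.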
Scientific Validity
  • Describing participant characteristics is essential for assessing comparability.: The inclusion of participant characteristics is essential for assessing the comparability of the randomized groups at baseline. This is a standard practice in RCTs to ensure that any observed differences in outcomes are attributable to the intervention rather than pre-existing differences between the groups.
  • The use of SMD is appropriate for assessing balance.: The use of the standardized mean difference (SMD) is appropriate for assessing the balance between the groups. While there is no universally agreed-upon threshold, an SMD less than 0.2 is generally considered a small and acceptable difference, suggesting adequate randomization.
Communication
  • The table is well-organized and facilitates comparison.: The table's clear labeling and organization facilitate easy comparison of baseline characteristics between the two randomized groups. The inclusion of the standardized mean difference (SMD) provides a quick assessment of any imbalances.
  • A definition of SMD would improve understanding for some readers.: While the table is generally clear, providing a brief definition of SMD in the caption or a footnote would enhance understanding for readers unfamiliar with this statistic.
Table 2 | Comparisons of the primary and secondary outcomes for physicians with...
Full Caption

Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)

Figure/Table Image (Page 3)
Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Table 2 compares outcomes between LLM and conventional resources groups.: Table 2 presents a head-to-head comparison of how well doctors performed on the cases, depending on whether they had access to the GPT-4 language model or only conventional resources. The table focuses on both the main (primary) outcome and other related (secondary) outcomes. To make the scores easier to understand and compare, they've been converted to a scale from 0 to 100. The table shows two main numbers for each outcome: the average score (mean) and the middle score (median). The average can be skewed by very high or low scores, while the median represents the true middle value. The table also includes the interquartile range (IQR), which is the range between the 25th and 75th percentile, representing the spread of the middle 50% of the data. Finally, it provides the difference between the two groups, a confidence interval (CI), and a P-value. The confidence interval is a range of values that we are fairly sure contains the true population mean. A p-value is a measure of the probability that an observed difference could have occurred just by random chance, if there really was no difference.
  • LLM group had an average total score of 43.0 vs 35.7 for conventional resources (p<0.001).: The table shows that the average total score for physicians using LLM was 43.0 (out of 100), while for those using conventional resources it was 35.7. The difference between these averages was 6.5, with a 95% confidence interval ranging from 2.7 to 10.2, and a p-value less than 0.001. This suggests that the LLM group performed significantly better than the conventional resources group.
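A simple normal-approximation version of such a confidence interval for a difference in means can be sketched as follows. Note that the paper's actual analysis uses a generalized mixed-effects model to handle repeated cases per physician; this unpaired version, with invented scores, is only for building intuition:

```python
from statistics import mean, stdev

def diff_ci_95(a, b):
    """Difference in means (a - b) with a normal-approximation 95% CI.
    Illustrative only: the study's reported CI comes from a
    mixed-effects model, not this simple unpaired formula."""
    diff = mean(a) - mean(b)
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return diff, diff - 1.96 * se, diff + 1.96 * se

# Hypothetical standardized case scores for two small groups
llm_scores = [50, 40, 45, 35, 45]
control_scores = [35, 30, 40, 35, 30]
diff, lo, hi = diff_ci_95(llm_scores, control_scores)
```

If the resulting interval excludes zero, the observed difference is unlikely to be a chance finding at the 5% level, which is the same logic reflected in the table's P-values.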
Scientific Validity
  • The table provides comprehensive statistical information.: The table provides essential statistical information for evaluating the study's findings. The inclusion of means, medians, interquartile ranges, differences, confidence intervals, and p-values allows for a comprehensive assessment of the magnitude and statistical significance of the observed effects.
  • The use of a generalized mixed-effects model is appropriate.: The use of a generalized mixed-effects model, as mentioned in the table footnote, is appropriate for accounting for the potential correlation between cases for a given participant. This approach strengthens the validity of the statistical analysis.
Communication
  • Score standardization enhances interpretability.: The standardization of scores to a 0-100 scale enhances interpretability and allows for easier comparison across different outcome measures. The inclusion of both means and medians provides a more complete picture of the data distribution, as means can be influenced by outliers.
  • Clarity could be improved by specifying the level of confidence for CIs in the table header.: While the table is generally well-organized, the presentation of confidence intervals (CI) could be improved. Specifying the level of confidence (95% CI) in the column header, rather than only in the reference text, would provide clarity and avoid ambiguity.
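The 0-100 standardization itself could be as simple as the following sketch, assuming each rubric scores from 0 up to a case-specific maximum (the 26-point total below is hypothetical; the paper's per-case point totals are not reproduced here):

```python
def standardize(raw_score, max_points):
    """Rescale a raw rubric score to the 0-100 scale used in Table 2.
    Assumes the rubric's minimum possible score is 0."""
    return 100 * raw_score / max_points

# A hypothetical 13 points out of a 26-point rubric maps to 50.0
half_credit = standardize(13, 26)
```

Rescaling every case to the same range is what makes scores comparable across rubrics with different point totals.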
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with...
Full Caption

Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 4)
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Figure 2 uses a box plot to compare total scores between the two groups.: Figure 2 uses a box plot to show how the total scores compare between the two groups of doctors. The scores have already been standardized to a 0-100 scale, making it easier to compare. A box plot is a way of summarizing data using five key numbers: the median (the middle value), the first quartile (the value below which 25% of the data falls), the third quartile (the value below which 75% of the data falls), and the minimum and maximum values. The box itself shows the range between the first and third quartiles, representing the middle 50% of the data. The line inside the box shows the median. The 'whiskers' extend out from the box to the furthest data points that are within 1.5 times the interquartile range (IQR) from the box. Any points beyond the whiskers are considered outliers.
  • The caption specifies the number of physicians and cases in each group.: The caption specifies that 92 physicians participated in the study, with 46 in each group. The LLM group completed 178 cases, while the conventional resources group completed 197 cases. The box plot visually represents the distribution of these scores for each group.
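The box-plot elements defined in the caption (median, quartiles, and 1.5 × IQR whiskers) can be computed as in this sketch. The scores are invented, and the quartiles use the simple median-of-halves convention; plotting libraries may use slightly different interpolation rules:

```python
def box_plot_stats(values):
    """Median, quartiles and Tukey whisker limits (1.5 x IQR rule).
    Whiskers extend to the furthest data points inside the fences;
    anything beyond them would be drawn as an outlier."""
    xs = sorted(values)
    n = len(xs)

    def median(seq):
        mid = len(seq) // 2
        return seq[mid] if len(seq) % 2 else (seq[mid - 1] + seq[mid]) / 2

    med = median(xs)
    q1 = median(xs[: n // 2])          # lower half, excluding the median
    q3 = median(xs[(n + 1) // 2 :])    # upper half, excluding the median
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    low_whisker = min(x for x in xs if x >= lo_fence)
    high_whisker = max(x for x in xs if x <= hi_fence)
    return med, q1, q3, low_whisker, high_whisker

# Hypothetical standardized scores; the 90 falls beyond the upper
# fence, so the upper whisker stops at 55 and 90 is an outlier
stats = box_plot_stats([20, 30, 35, 40, 42, 45, 50, 55, 90])
```

Seen this way, the whiskers summarize the bulk of the distribution while flagging extreme cases separately, which is why box plots suit skewed score data better than means alone.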
Scientific Validity
  • The use of a box plot is appropriate for visualizing the distribution of scores.: The use of a box plot is appropriate for visualizing the distribution of scores and comparing the central tendency and spread between the two groups. The caption clearly defines the elements of the box plot, ensuring accurate interpretation.
  • The reference to Table 2 provides complementary statistical information.: The reference to Table 2 provides the statistical information (mean, confidence interval, p-value) that complements the visual representation in the box plot. This combination of visual and numerical data enhances the rigor and interpretability of the results.
Communication
  • The box plot effectively visualizes the distribution of scores.: The box plot effectively visualizes the distribution of total scores for each group, making it easy to compare the median, quartiles, and range of scores. The clear labeling and concise caption contribute to the figure's understandability.
  • Adding a visual indicator of statistical significance could enhance the figure's impact.: While the box plot is informative, adding a visual indicator of statistical significance (e.g., an asterisk) could further enhance its impact. Also, reporting the exact p-value within the figure or caption would provide more precise information than simply stating 'P < 0.001' in the reference text.
Fig. 3 | Comparison of the primary outcome according to GPT alone versus...
Full Caption

Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 4)
Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
While trending toward scoring higher than humans using conventional resources (43.7% versus 35.7%, difference = 7.3%, 95% CI = -0.7% to 15.4%, P = 0.074) (Fig. 3).
Description
  • Figure 3 compares primary outcomes across three groups using a box plot.: Figure 3 presents a comparison of the primary outcome (total score standardized to 0-100) across three groups: GPT-4 used alone, physicians using GPT-4, and physicians using conventional resources. It uses a box plot to visualize the distribution of scores for each group. The box plot shows the median (the middle value), the first and third quartiles (the 25th and 75th percentiles), and the range of the data (with whiskers extending to 1.5 times the interquartile range).
  • The caption explains the methodology for the GPT-alone arm.: The caption explains that the 'GPT-alone arm' represents the model being prompted by the study team to complete the five cases. Each case was prompted five times, resulting in a total of 25 observations. The physicians with GPT-4 group included 46 participants who completed 178 cases, and the physician with conventional resources group included 46 participants who completed 197 cases.
Scientific Validity
  • The inclusion of a GPT-alone arm is a strength of the study.: The inclusion of a GPT-alone arm is a strength of the study, as it allows for a direct comparison of the model's performance against human physicians. This provides valuable information about the potential for standalone AI applications in this domain.
  • The small sample size of the GPT-alone arm should be considered.: The relatively small number of observations in the GPT-alone arm (25) compared to the physician groups (178 and 197) should be considered when interpreting the results. This difference in sample size may limit the statistical power to detect significant differences.
  • The p-value of 0.074 indicates a trend, but is not statistically significant.: The p-value of 0.074 indicates a trend towards higher scores for GPT alone compared to physicians using conventional resources, but this difference is not statistically significant at the conventional alpha level of 0.05. Therefore, caution is warranted when interpreting this result.
Communication
  • The figure effectively compares GPT alone against human physicians.: The figure effectively compares the performance of GPT alone against human physicians, both with and without GPT assistance. The inclusion of the GPT-alone arm provides valuable insight into the potential of the model as a standalone tool.
  • The caption clearly explains the methodology for the GPT-alone arm.: The caption clearly explains the methodology for the GPT-alone arm, including the number of prompts and observations. This transparency is crucial for understanding the context of the GPT-alone results.
  • The p-value of 0.074 should be interpreted cautiously.: The p-value of 0.074 is marginal and should be interpreted cautiously. While the text notes a 'trend', it's important to avoid overstating the significance of this result.
Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and...
Full Caption

Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and physicians using conventional resources only. Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.

Figure/Table Image (Page 5)