GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial

Overall Summary

Study Background and Main Findings

This randomized controlled trial assessed the impact of LLM assistance (GPT-4) on physician performance in management reasoning tasks. Physicians using the LLM scored significantly higher than those using conventional resources (43.0% vs 35.7%; reported difference = 6.5 percentage points, 95% CI = 2.7 to 10.2, P < 0.001), but also spent significantly more time per case (801.5 s vs 690.2 s; reported difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.022). The reported differences do not equal the simple gaps between the group means (7.3 points and 111.3 s), which suggests they are model-based estimates. The LLM alone performed comparably to LLM-assisted physicians.

Research Impact and Future Directions

The study provides compelling evidence that LLM assistance, specifically using GPT-4, significantly improves physician performance on management reasoning tasks compared to using conventional resources alone. The randomized controlled trial design and statistically significant results (p < 0.001) support a causal link between LLM use and improved scores. However, it's crucial to note that this improvement comes with a statistically significant increase in time spent per case. The study also found that the LLM alone performed comparably to LLM-assisted physicians, suggesting potential for independent use, although this finding requires cautious interpretation due to the smaller sample size in the LLM-alone arm.

The practical utility of these findings is promising, suggesting that LLMs could serve as valuable decision support tools in clinical practice, particularly for complex management decisions. The study is well-contextualized within existing literature, acknowledging the growing body of research on LLMs in diagnostic reasoning and highlighting the novelty of its focus on management reasoning. However, the study's reliance on clinical vignettes, while ecologically valid to a degree, limits the direct applicability of the findings to real-world clinical settings. The increased time spent per case also raises questions about the efficiency of LLM use in time-constrained environments.

While the study demonstrates a clear benefit in terms of improved scores, the guidance for practitioners must be nuanced. LLMs like GPT-4 show potential as decision support tools, but their implementation should be approached with caution. The increased time investment, potential for bias, and risk of hallucinations/misinformation need careful consideration. The study strongly suggests that LLMs should be used as *assistive* tools, augmenting physician judgment rather than replacing it. Further research is needed to optimize the integration of LLMs into clinical workflows and to minimize potential risks.

Several critical questions remain unanswered. The long-term effects of LLM use on physician skills and clinical judgment are unknown. Will over-reliance on LLMs lead to deskilling or reduced critical thinking? The study's limitations, particularly the use of clinical vignettes and the lack of external validity evidence for the scoring rubrics, raise concerns about the generalizability of the findings. While the study addresses potential harm and finds no significant difference between groups, the long-term safety implications of LLM use in real-world clinical practice require further investigation. Future research should focus on real-world clinical trials, longitudinal studies, and the development of robust methods for detecting and mitigating potential biases and harms associated with LLM use.

Critical Analysis and Recommendations

Improved Physician Performance with LLM Assistance (written-content)
Physicians using GPT-4 scored significantly higher on management reasoning tasks (reported difference 6.5 percentage points, 95% CI = 2.7 to 10.2, P < 0.001). This is a statistically significant improvement in performance with LLM assistance, suggesting potential for enhanced clinical decision-making.
Section: Results
Rigorous Study Design (written-content)
The study used a prospective, randomized, controlled trial design. This rigorous methodology minimizes bias and strengthens the causal inference between LLM use and improved physician performance.
Section: Methods
Rigorous Scoring Rubric Development (written-content)
The scoring rubrics were developed using a modified Delphi process with an expert panel and showed substantial inter-rater reliability (pooled kappa = 0.80). This indicates a rigorous and reliable approach to assessing the complex task of management reasoning.
Section: Methods
Increased Time Spent with LLM (written-content)
Physicians using the LLM spent significantly more time per case (mean difference = 119.3s, 95% CI = 17.4 to 221.2, P = 0.022). This increased time investment raises concerns about the efficiency of LLM use in time-constrained clinical settings.
Section: Results
Use of Clinical Vignettes (written-content)
The study used clinical vignettes rather than real patient cases. While based on real encounters, this limits the direct generalizability of the findings to real-world clinical practice.
Section: Discussion
Lack of Detail in Methods (written-content)
The Methods section lacks specific details on the randomization method and the training provided to participants in the GPT-4 arm. This lack of detail hinders reproducibility and makes it difficult to fully assess potential biases.
Section: Methods
Clear Study Flow Diagram (graphical-figure)
Figure 1 clearly presents the study flow diagram, adhering to CONSORT guidelines. This enhances transparency in reporting participant enrollment, allocation, and analysis.
Section: Results
Comparison with LLM Alone (graphical-figure)
Figure 3 includes a comparison with the LLM used alone, providing valuable insight into its potential as an independent tool. However, this arm comprises only 25 observations (the model prompted five times on each of five cases), which limits the statistical power of the comparison.
Section: Results
Incomplete Discussion of Risks (written-content)
The Discussion section does not fully address the potential for bias in the LLM's output or the risk of over-reliance on the LLM by physicians. A more balanced discussion of risks and benefits is needed.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | Study flow diagram. The study included 92 practicing attending physicians and residents with training in internal medicine, family medicine or emergency medicine. Five expert-developed cases were presented, with scoring rubrics created using a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT plus in addition to conventional resources (for example, UpToDate, Google), or conventional resources alone. The primary outcome was the difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • The study included 92 physicians with various training.: The study flow diagram, often called a CONSORT diagram, shows how participants moved through the study. Ninety-two practicing attending physicians and residents (doctors still in training) took part. They were trained in internal medicine (adult healthcare), family medicine (healthcare for all ages), or emergency medicine (immediate medical care). The doctors worked through five cases prepared by experts. Responses were scored with rubrics, or detailed scoring guides, developed using a 'Delphi process': a way to bring a group of experts to consensus by having them answer questions and then giving them anonymous feedback over repeated rounds. The doctors were randomly assigned to one of two groups: one used GPT-4 (a large language model) via ChatGPT Plus in addition to conventional resources such as UpToDate (an online medical resource) and Google; the other used only the conventional resources. The primary outcome was the difference in total score between the groups on the expert-developed rubrics. Secondary outcomes included scores on specific domains of the cases and the time spent on each case.
  • Physicians were randomized evenly between the LLM and conventional resources groups.: The physicians were randomized to either GPT-4 with conventional resources or conventional resources alone. Randomization is a process of randomly assigning participants to different groups in a study, such as a treatment group or a control group. This helps to ensure that the groups are as similar as possible at the beginning of the study, so that any differences in outcomes can be attributed to the intervention being studied. This is often done by using a computer-generated random number sequence to assign participants to groups.
  • 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents: The reference text from the Results section notes that 73% (67 out of 92) of the participants were attending physicians, meaning fully qualified doctors, while the remaining 27% (25 out of 92) were residents, or doctors still undergoing training. These percentages are used to describe the study population.
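As a rough illustration of even randomization, the sketch below shuffles 92 participant IDs and splits them into two arms of 46. This is an assumption for illustration only: the review notes that the paper does not detail its randomization method, and the function name `randomize_evenly`, the seed, and the ID range are hypothetical.

```python
import random

def randomize_evenly(participant_ids, seed=0):
    """Shuffle participants and split them into two equal-sized arms."""
    rng = random.Random(seed)      # fixed seed so the allocation is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"llm": shuffled[:half], "conventional": shuffled[half:]}

# 92 hypothetical participant IDs, matching the study's sample size
arms = randomize_evenly(range(1, 93))
print(len(arms["llm"]), len(arms["conventional"]))
```

Real trials often use blocked or stratified schemes (for example, stratifying by attending versus resident status) rather than a single shuffle; this sketch only shows the basic idea of chance-based, equal allocation.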
Scientific Validity
  • The study flow diagram is a standard element in RCTs.: The study flow diagram is a standard element in RCTs, adhering to CONSORT guidelines, and is essential for transparency. The mention of the Delphi process for rubric development strengthens the validity by highlighting a rigorous approach to outcome assessment.
  • Even randomization minimizes selection bias.: The even randomization, as stated in the reference text and implied by the flow diagram, is crucial for minimizing selection bias. The percentages of attending physicians and residents are descriptive and relevant for characterizing the study population.
Communication
  • The caption provides a good overview of the study.: The caption effectively summarizes the key aspects of the study design and outcomes, providing a concise overview for readers to quickly understand the figure's content. It clearly states the number of participants, their backgrounds, the intervention, and the primary and secondary outcomes.
  • The caption could be more explicit about the purpose of a CONSORT diagram.: While the caption is informative, it could benefit from a more explicit mention of the CONSORT diagram's purpose, which is to visually represent the flow of participants through each stage of the randomized controlled trial.
Table 1 | Participant characteristics according to randomized group
First Reference in Text
Participants were randomized evenly between the LLM and conventional resources groups (Fig. 1); 73% (67 of 92) were attending physicians while 27% (25 of 92) were residents (Table 1).
Description
  • Table 1 describes participant characteristics split by randomized group.: Table 1 is all about describing the people who took part in the study. It shows key details about them, split into two groups based on how they were randomly assigned: one group used GPT-4 along with regular resources, and the other used only regular resources. The table lists things like what stage of their career they're at (attending physician or resident), their medical specialty (internal medicine, emergency medicine, or family medicine), how many years they've been in medical training, and their past experience with using GPT-like tools. It also includes a number called the 'standardized mean difference' or SMD. The SMD is a way to measure how different the two groups are for each of these characteristics. An SMD close to zero means the groups are very similar, which is what you want when you randomly assign people to groups in a study.
  • The table provides numbers, percentages, means and standard deviations.: The table provides specific numbers and percentages for each characteristic within each group. For instance, it shows that in the overall study population (n=92), 73% were attending physicians. This is broken down to 74% in the GPT-4 group and 72% in the conventional resources group. The table also gives the mean (average) and standard deviation (a measure of spread) for continuous variables like years in medical training.
Scientific Validity
  • Describing participant characteristics is essential for assessing comparability.: The inclusion of participant characteristics is essential for assessing the comparability of the randomized groups at baseline. This is a standard practice in RCTs to ensure that any observed differences in outcomes are attributable to the intervention rather than pre-existing differences between the groups.
  • The use of SMD is appropriate for assessing balance.: The use of the standardized mean difference (SMD) is appropriate for assessing the balance between the groups. While there is no universally agreed-upon threshold, an SMD below 0.1 is often taken to indicate negligible imbalance and an SMD below 0.2 is generally considered small, so values in this range suggest that randomization produced adequately balanced groups.
Communication
  • The table is well-organized and facilitates comparison.: The table's clear labeling and organization facilitate easy comparison of baseline characteristics between the two randomized groups. The inclusion of the standardized mean difference (SMD) provides a quick assessment of any imbalances.
  • A definition of SMD would improve understanding for some readers.: While the table is generally clear, providing a brief definition of SMD in the caption or a footnote would enhance understanding for readers unfamiliar with this statistic.
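To make the SMD concrete, here is a minimal sketch of the usual formulas for a binary and a continuous characteristic. The attending counts (34 of 46 and 33 of 46) are an assumption chosen to be consistent with the reported 74%, 72%, and 67 of 92 totals; the years-of-training figures are purely hypothetical, and the paper's exact SMD computation is not stated here.

```python
import math

def smd_binary(p1, p2):
    """SMD for a proportion: difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd

def smd_continuous(mean1, sd1, mean2, sd2):
    """SMD for a continuous variable using the average of the two variances."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / pooled_sd

# Assumed counts: 34/46 attendings in the GPT-4 arm, 33/46 in the conventional arm
smd_attending = smd_binary(34 / 46, 33 / 46)

# Hypothetical means and standard deviations for years in training
smd_years = smd_continuous(9.0, 6.0, 8.5, 6.5)

print(f"attending SMD = {smd_attending:.3f}, years SMD = {smd_years:.3f}")
```

Under these assumptions both values fall well below 0.1, which is the kind of result a reader would expect from a well-balanced randomized trial.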
Table 2 | Comparisons of the primary and secondary outcomes for physicians with LLM and with conventional resources only (scores standardized to 0-100)
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Table 2 compares outcomes between LLM and conventional resources groups.: Table 2 presents a head-to-head comparison of how well doctors performed on the cases, depending on whether they had access to the GPT-4 language model or only conventional resources. The table covers both the main (primary) outcome and other related (secondary) outcomes. To make the scores easier to compare, they have been converted to a scale from 0 to 100. The table shows two summary values for each outcome: the average score (mean) and the middle score (median). The mean can be pulled up or down by very high or low scores, while the median is the value that splits the data in half. The table also includes the interquartile range (IQR), which is the range between the 25th and 75th percentiles and represents the spread of the middle 50% of the data. Finally, it provides the difference between the two groups, a confidence interval (CI), and a P-value. The confidence interval is a range of values that is likely to contain the true difference between the groups. The P-value is the probability of observing a difference at least as large as the one found if there were really no difference between the groups.
  • LLM group had an average total score of 43.0 vs 35.7 for conventional resources (p<0.001).: The table shows that the average total score for physicians using the LLM was 43.0 (out of 100), while for those using conventional resources it was 35.7. The reported difference was 6.5 points, with a 95% confidence interval ranging from 2.7 to 10.2 and a p-value less than 0.001. The reported difference is smaller than the simple gap between the two averages (7.3 points), presumably because it is the estimate from the mixed-effects model noted in the table footnote. The result indicates that the LLM group performed significantly better than the conventional resources group.
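The relationship between a point estimate, its confidence interval, and its p-value can be checked with a short back-of-envelope calculation. The sketch below assumes a normal approximation (the paper's exact test is not specified here) and uses the three estimate-and-interval triples quoted in this review; the helper name `p_from_ci` is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal

def p_from_ci(estimate, ci_low, ci_high):
    """Back out the standard error, z statistic and two-sided p-value
    from a point estimate and its 95% CI (normal approximation)."""
    se = (ci_high - ci_low) / (2 * Z_95)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se, z, p

# (estimate, CI low, CI high) as quoted in the review
reported = {
    "total score, LLM vs conventional": (6.5, 2.7, 10.2),
    "time per case in seconds": (119.3, 17.4, 221.2),
    "total score, GPT alone vs conventional": (7.3, -0.7, 15.4),
}
results = {name: p_from_ci(*values) for name, values in reported.items()}
for name, (se, z, p) in results.items():
    print(f"{name}: SE={se:.2f}, z={z:.2f}, p={p:.4f}")
```

Under this approximation the implied p-values land close to the reported ones (below 0.001, about 0.022, and about 0.074), and each reported difference sits at the midpoint of its interval, which is consistent with the differences being model estimates rather than raw gaps between means.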
Scientific Validity
  • The table provides comprehensive statistical information.: The table provides essential statistical information for evaluating the study's findings. The inclusion of means, medians, interquartile ranges, differences, confidence intervals, and p-values allows for a comprehensive assessment of the magnitude and statistical significance of the observed effects.
  • The use of a generalized mixed-effects model is appropriate.: The use of a generalized mixed-effects model, as mentioned in the table footnote, is appropriate for accounting for the potential correlation between cases for a given participant. This approach strengthens the validity of the statistical analysis.
Communication
  • Score standardization enhances interpretability.: The standardization of scores to a 0-100 scale enhances interpretability and allows for easier comparison across different outcome measures. The inclusion of both means and medians provides a more complete picture of the data distribution, as means can be influenced by outliers.
  • Clarity could be improved by specifying the level of confidence for CIs in the table header.: While the table is generally well-organized, the presentation of confidence intervals (CI) could be improved. Specifying the level of confidence (95% CI) in the column header, rather than only in the reference text, would provide clarity and avoid ambiguity.
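One intuition for why repeated cases per participant matter: when participants complete different numbers of cases, a pooled average over all cases weights prolific participants more heavily than an average of per-participant averages. The sketch below uses entirely hypothetical scores and is not the paper's mixed-effects model; it only shows how the two summaries can diverge.

```python
from statistics import mean

# Hypothetical per-case scores keyed by participant; case counts differ on purpose
llm_group = {"p1": [50, 52, 48, 50, 50], "p2": [30, 34]}
conventional_group = {"p3": [40, 42, 38], "p4": [28, 30, 32, 30]}

def pooled_mean(group):
    """Average over every case, ignoring which participant produced it."""
    return mean(score for scores in group.values() for score in scores)

def mean_of_participant_means(group):
    """Average each participant first, then average across participants."""
    return mean(mean(scores) for scores in group.values())

raw_gap = pooled_mean(llm_group) - pooled_mean(conventional_group)
clustered_gap = (mean_of_participant_means(llm_group)
                 - mean_of_participant_means(conventional_group))

print(f"pooled gap = {raw_gap:.2f}, participant-level gap = {clustered_gap:.2f}")
```

A mixed-effects model with a random effect per participant behaves somewhere between these two extremes, which is one plausible reason a model-based difference can differ from the simple gap between group means.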
Fig. 2 | Comparison of the primary outcome for physicians with LLM and with conventional resources only (total score standardized to 0-100). Ninety-two physicians (46 randomized to the LLM group and 46 randomized to conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
Physicians randomized to use the LLM performed better than the control group (43.0% compared to 35.7%, difference = 6.5%, 95% confidence interval (CI) = 2.7% to 10.2%, P < 0.001) (Table 2 and Fig. 2).
Description
  • Figure 2 uses a box plot to compare total scores between the two groups.: Figure 2 uses a box plot to show how the total scores compare between the two groups of doctors. The scores have already been standardized to a 0-100 scale, making it easier to compare. A box plot is a way of summarizing data using five key numbers: the median (the middle value), the first quartile (the value below which 25% of the data falls), the third quartile (the value below which 75% of the data falls), and the minimum and maximum values. The box itself shows the range between the first and third quartiles, representing the middle 50% of the data. The line inside the box shows the median. The 'whiskers' extend out from the box to the furthest data points that are within 1.5 times the interquartile range (IQR) from the box. Any points beyond the whiskers are considered outliers.
  • The caption specifies the number of physicians and cases in each group.: The caption specifies that 92 physicians participated in the study, with 46 in each group. The LLM group completed 178 cases, while the conventional resources group completed 197 cases. The box plot visually represents the distribution of these scores for each group.
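The box plot elements described above can be computed directly. This is a minimal sketch on hypothetical scores, assuming the common Tukey convention in which whiskers reach the furthest points within 1.5 times the IQR of the box edges; the plotting software used in the paper may compute quartiles slightly differently.

```python
from statistics import median, quantiles

def box_plot_stats(values):
    """Return the median, quartiles, whisker ends and outliers for a box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # furthest points within the fences
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical standardized scores on the 0-100 scale
scores = [10, 25, 30, 35, 40, 45, 50, 95]
stats = box_plot_stats(scores)
print(stats)
```

With these made-up scores, the value 95 lies beyond the upper fence and would be drawn as an individual outlier point rather than being covered by the whisker.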
Scientific Validity
  • The use of a box plot is appropriate for visualizing the distribution of scores.: The use of a box plot is appropriate for visualizing the distribution of scores and comparing the central tendency and spread between the two groups. The caption clearly defines the elements of the box plot, ensuring accurate interpretation.
  • The reference to Table 2 provides complementary statistical information.: The reference to Table 2 provides the statistical information (mean, confidence interval, p-value) that complements the visual representation in the box plot. This combination of visual and numerical data enhances the rigor and interpretability of the results.
Communication
  • The box plot effectively visualizes the distribution of scores.: The box plot effectively visualizes the distribution of total scores for each group, making it easy to compare the median, quartiles, and range of scores. The clear labeling and concise caption contribute to the figure's understandability.
  • Adding a visual indicator of statistical significance could enhance the figure's impact.: While the box plot is informative, adding a visual indicator of statistical significance (e.g., an asterisk) could further enhance its impact. Also, reporting the exact p-value within the figure or caption would provide more precise information than simply stating 'P < 0.001' in the reference text.
Fig. 3 | Comparison of the primary outcome according to GPT alone versus physician with GPT-4 and with conventional resources only (total score standardized to 0-100). The GPT-alone arm represents the model being prompted by the study team to complete the five cases, with the models prompted five times for each case for a total of 25 observations. The physicians with GPT-4 group included 46 participants that completed 178 cases, while the physician with conventional resources group included 46 participants that completed 197 cases. The center of the box plot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
While trending toward scoring higher than humans using conventional resources (43.7% versus 35.7%, difference = 7.3%, 95% CI = -0.7% to 15.4%, P = 0.074) (Fig. 3).
Description
  • Figure 3 compares primary outcomes across three groups using a box plot.: Figure 3 presents a comparison of the primary outcome (total score standardized to 0-100) across three groups: GPT-4 used alone, physicians using GPT-4, and physicians using conventional resources. It uses a box plot to visualize the distribution of scores for each group. The box plot shows the median (the middle value), the first and third quartiles (the 25th and 75th percentiles), and the range of the data (with whiskers extending to 1.5 times the interquartile range).
  • The caption explains the methodology for the GPT-alone arm.: The caption explains that the 'GPT-alone arm' represents the model being prompted by the study team to complete the five cases. Each case was prompted five times, resulting in a total of 25 observations. The physicians with GPT-4 group included 46 participants who completed 178 cases, and the physician with conventional resources group included 46 participants who completed 197 cases.
Scientific Validity
  • The inclusion of a GPT-alone arm is a strength of the study.: The inclusion of a GPT-alone arm is a strength of the study, as it allows for a direct comparison of the model's performance against human physicians. This provides valuable information about the potential for standalone AI applications in this domain.
  • The small sample size of the GPT-alone arm should be considered.: The relatively small number of observations in the GPT-alone arm (25) compared to the physician groups (178 and 197) should be considered when interpreting the results. This difference in sample size may limit the statistical power to detect significant differences.
  • The p-value of 0.074 indicates a trend, but is not statistically significant.: The p-value of 0.074 indicates a trend towards higher scores for GPT alone compared to physicians using conventional resources, but this difference is not statistically significant at the conventional alpha level of 0.05. Therefore, caution is warranted when interpreting this result.
Communication
  • The figure effectively compares GPT alone against human physicians.: The figure effectively compares the performance of GPT alone against human physicians, both with and without GPT assistance. The inclusion of the GPT-alone arm provides valuable insight into the potential of the model as a standalone tool.
  • The caption clearly explains the methodology for the GPT-alone arm.: The caption clearly explains the methodology for the GPT-alone arm, including the number of prompts and observations. This transparency is crucial for understanding the context of the GPT-alone results.
  • The p-value of 0.074 should be interpreted cautiously.: The p-value of 0.074 is marginal and should be interpreted cautiously. While the text notes a 'trend', it's important to avoid overstating the significance of this result.
Fig. 4 | Comparison of the time spent per case by physicians using GPT-4 and physicians using conventional resources only. Ninety-two physicians (46 randomized to the LLM group and 46 randomized to the conventional resources) completed 375 cases (178 in the LLM group, 197 in the conventional resources group). The center of the boxplot represents the median, with the boundaries representing the first and third quartiles. The whiskers represent the furthest data points from the center within 1.5 times the IQR.
First Reference in Text
Physicians randomized to use the LLM spent 111.3 s more on each case (801.5 s versus 690.2 s, difference = 119.3 s, 95% CI = 17.4 to 221.2, P = 0.022) (Fig. 4).
Description
  • Figure 4 compares time spent per case between the two groups using a box plot.: Figure 4 uses a box plot to compare the amount of time doctors spent on each case, depending on whether they were using GPT-4 or conventional resources. The time is measured in seconds. As in Figure 2, the box plot shows the median, first and third quartiles, and the range of the data, with whiskers extending to 1.5 times the interquartile range. The caption indicates that the 92 physicians (46 in each group) completed a total of 375 cases (178 in the GPT-4 group and 197 in the conventional resources group).
  • The reference text states that the LLM group spent 111.3 seconds more on each case (p=0.022).: The reference text states that physicians using GPT-4 spent 111.3 seconds more on each case, on average, than those using conventional resources (801.5 seconds versus 690.2 seconds). It also reports a difference of 119.3 seconds with a 95% confidence interval of 17.4 to 221.2 seconds and a p-value of 0.022, indicating a statistically significant difference. The interval is centered on 119.3 rather than 111.3, so the confidence interval and p-value appear to describe a model-based estimate rather than the simple gap between the averages.
Scientific Validity
  • The use of a box plot is appropriate, and the difference is statistically significant.: The use of a box plot is appropriate for visualizing and comparing the distribution of time spent per case between the two groups. The statistical significance of the difference (p = 0.022) suggests that the observed difference is unlikely to be due to chance.
  • The extended time spent in the LLM group warrants further investigation.: The extended time spent in the LLM group warrants further investigation. It would be valuable to explore whether this increased time was associated with improved decision-making or simply reflected a more time-consuming process.
Communication
  • The box plot clearly illustrates the difference in time spent.: The box plot clearly illustrates the difference in time spent per case between the two groups. This visual representation effectively communicates the finding that physicians using GPT-4 spent more time on each case.
  • The caption provides sufficient information to understand the figure.: The caption provides sufficient information to understand the figure, including the number of participants and cases in each group. The description of the box plot elements (median, quartiles, whiskers) is also helpful for readers unfamiliar with this type of graph.
Extended Data Fig. 1 | Correlation between Time Spent in Seconds and Total Score. This figure demonstrates a sample medical management case with multi-part assessment questions, scoring rubric and example responses. The case presents a 72-year-old post-cholecystectomy patient with new-onset atrial fibrillation. The rubric (23 points total) evaluates clinical decision-making across key areas: initial workup, anticoagulation decisions, and outpatient monitoring strategy. Sample high-scoring (21/23) and low-scoring (8/23) responses illustrate varying depths of clinical reasoning and management decisions.
First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case. Results were similar for subdomains. We further examined the unadjusted correlation between time spent and total scores with a positive association between time spent and total scores for both groups (Extended Data Table 2). Overall, we observed that for each additional minute spent on a case, there was a small but statistically significant increase of 0.6 points in the score per case (95% CI = 0.4 to 0.8, P < 0.001) using a mixed-effects model (Extended Data Fig. 1).
Description
  • Extended Data Figure 1 shows the correlation between time spent and total score.: According to its title and the in-text reference, Extended Data Figure 1 shows the relationship between how much time the doctors spent on a case (in seconds) and their total score on that case. As an 'extended data' figure, it likely provides supporting information that is not essential to the main findings but still useful. The caption text, however, describes a sample case used in the study rather than the plot itself: a 72-year-old patient with new-onset atrial fibrillation (an irregular heartbeat) after gallbladder removal (cholecystectomy). The doctors were asked how they would manage this patient, and their answers were scored using a rubric. The rubric, worth 23 points in total, covered key areas: the initial workup (tests and procedures to establish what is going on), decisions about anticoagulation (medication to prevent blood clots), and the plan for monitoring the patient after discharge.
  • The figure includes sample high-scoring and low-scoring responses.: The caption also mentions that the figure includes sample responses, both high-scoring (21/23) and low-scoring (8/23). These examples are used to illustrate the different levels of clinical reasoning and management decisions demonstrated by the participants.
Scientific Validity
  • The post hoc sensitivity analysis addresses potential confounding.: The reference text indicates that a post hoc sensitivity analysis was performed to adjust for the potential confounding effect of time spent on the case. This is a rigorous approach that strengthens the validity of the findings by addressing a potential source of bias.
  • Examining the correlation between time spent and total scores is valuable.: Examining the correlation between time spent and total scores is valuable for understanding the relationship between effort and performance. The reported statistically significant increase of 0.6 points per minute suggests a positive association, but the small magnitude of this effect should be considered.
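As a rough back-of-the-envelope check, the per-minute association can be combined with the between-group time difference reported in the main results (119.3 s) to estimate how much of the score gap the extra time alone might account for. The sketch below assumes that the 0.6-point slope is on the same percentage-point scale as the group difference and that the association is linear; the function and variable names are illustrative, not taken from the paper.

```python
# Figures quoted from the paper; names are illustrative assumptions.
POINTS_PER_MINUTE = 0.6   # score increase per additional minute (mixed-effects estimate)
EXTRA_SECONDS = 119.3     # additional time per case in the LLM group
UNADJUSTED_DIFF = 6.5     # score difference before adjusting for time
ADJUSTED_DIFF = 5.4       # score difference after adjusting for time


def score_explained_by_time(points_per_minute: float, extra_seconds: float) -> float:
    """Score gain expected from extra time alone, assuming a linear association."""
    return points_per_minute * extra_seconds / 60.0


explained = score_explained_by_time(POINTS_PER_MINUTE, EXTRA_SECONDS)
attenuation = UNADJUSTED_DIFF - ADJUSTED_DIFF

print(f"Expected gain from extra time: {explained:.2f} points")
print(f"Drop in estimate after adjustment: {attenuation:.2f} points")
```

Under these assumptions, the extra two minutes correspond to roughly 1.2 points, which is close to the 1.1-point drop from the unadjusted to the adjusted estimate. This is consistent with time explaining only a small share of the observed difference.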
Communication
  • The caption provides valuable context about the case and scoring rubric.: While the figure itself is not presented, the detailed caption offers valuable context by describing a sample case, assessment questions, and scoring rubric. This provides insight into the nature of the management reasoning tasks and the criteria used for evaluation.
  • The caption should explicitly state the type of graph (scatter plot).: The caption could be improved by explicitly stating the type of graph being referenced (scatter plot). This would help readers quickly understand the relationship being examined (correlation between time spent and total score).
Extended Data Table 1 | Post-hoc Analysis Adjusted for Time Spent in Each Case
Figure/Table Image (Page 10)
Extended Data Table 1 | Post-hoc Analysis Adjusted for Time Spent in Each Case
First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case.
Description
  • Extended Data Table 1 presents a post-hoc analysis adjusting for time spent.: Extended Data Table 1 presents the results of a 'post-hoc analysis'. In research, a post-hoc analysis is one that's done *after* the main experiment, usually to explore the data in more detail or to address unexpected findings. In this case, the researchers were concerned that the amount of time doctors spent on each case might have affected their scores. To address this, they performed an analysis that 'adjusted for' time spent, meaning they compared the groups while statistically holding time constant to see if the main results still held true. The table shows the results of this adjusted analysis, including the change in score, the 95% confidence interval (a range of values consistent with the data; across repeated studies, 95% of intervals built this way would contain the true value), and the p-value (the probability of seeing a difference at least this large if there were truly no effect).
  • The adjusted analysis showed a 5.4 percentage point increase in score (p=0.004).: The reference text highlights that this adjusted analysis showed a 5.4 percentage point increase in score per case (with a 95% confidence interval of 1.7 to 9.0 and a p-value of 0.004) even after accounting for the time spent. This suggests that the positive effect of LLM assistance was not simply due to doctors spending more time on the cases.
Scientific Validity
  • The post-hoc analysis strengthens the validity of the study.: Conducting a post-hoc sensitivity analysis to adjust for time spent is a rigorous methodological approach that strengthens the validity of the study. It addresses the potential confounding effect of time and provides a more accurate estimate of the intervention's effect.
  • Reporting of CI and p-value is essential for evaluating statistical significance.: The reporting of the confidence interval and p-value is essential for evaluating the statistical significance and precision of the adjusted results. The statistically significant p-value (P = 0.004) provides evidence that the observed effect is unlikely to be due to chance.
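One way to see how the reported interval and p-value fit together is to recover an approximate standard error from the confidence interval and recompute the p-value. The sketch below uses a normal approximation, which is an assumption: the paper's mixed-effects model may rely on a different reference distribution, so small discrepancies would be expected. The function names are hypothetical.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def se_from_ci(lower: float, upper: float) -> float:
    """Approximate standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z_95)


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p-value for estimate / se under a standard normal reference."""
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


# Adjusted estimate quoted in the paper: 5.4 (95% CI = 1.7 to 9.0)
se = se_from_ci(1.7, 9.0)
p = two_sided_p(5.4, se)
print(f"SE ~ {se:.2f}, two-sided p ~ {p:.4f}")
```

With these inputs the recomputed p-value rounds to 0.004, matching the reported figure, so the estimate, interval, and p-value are mutually consistent under the approximation.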
Communication
  • The table's clear labeling facilitates understanding.: The table's clear labeling facilitates understanding of the adjusted results. The inclusion of confidence intervals and p-values allows for a thorough assessment of the statistical significance of the findings.
  • Explaining the rationale for the analysis would enhance the table's context.: While the table is well-structured, briefly explaining the rationale for conducting this post-hoc analysis in the caption would enhance its context and significance for the reader.
Extended Data Table 2 | Post-hoc Analysis for the Associations between the...
Extended Data Table 2 | Post-hoc Analysis for the Associations between the Primary and Secondary Outcomes Overall

Figure/Table Image (Page 11)
First Reference in Text
We performed an additional post hoc sensitivity analysis adjusting for time spent on each case (Extended Data Table 1), which showed a 5.4 percentage point (95% CI = 1.7 to 9.0, P = 0.004) increase in score per case even after adjustment for time spent on the case. Results were similar for subdomains. We further examined the unadjusted correlation between time spent and total scores with a positive association between time spent and total scores for both groups (Extended Data Table 2).
Description
  • Extended Data Table 2 presents a post-hoc analysis of associations between time spent and outcomes.: Extended Data Table 2 presents a 'post-hoc analysis', meaning it was done after the main experiment to explore the relationships between different factors. Specifically, it looks at how the amount of time spent on each case relates to the primary outcome (total score) and secondary outcomes (different aspects of the case, like management decisions or factual recall). The table shows these relationships for the entire group of doctors, as well as separately for the doctors using LLM and those using conventional resources. These relationships are quantified using a measure that describes the direction and size of a linear relationship between two variables. A positive value indicates that as time spent increases, the score tends to increase as well; a negative value would indicate the opposite.
  • The table provides the difference in scores per minute spent on the case.: The table provides the 'difference in the scores by one minute increase of time spent on the case'. The reference text notes that the researchers examined the 'unadjusted correlation between time spent and total scores', meaning they looked at the raw relationship without statistically removing the effect of any other variables.
Scientific Validity
  • Examining associations between time spent and outcomes is valuable.: Examining the associations between time spent and outcomes is a valuable exploratory analysis that can provide insights into the cognitive processes involved in the decision-making tasks. Reporting these associations separately for the LLM and conventional resources groups allows for a comparison of how time influences performance in each group.
  • Correlation does not equal causation.: It's important to note that correlation does not equal causation. While the table may reveal positive or negative associations between time spent and outcomes, it does not prove that spending more or less time directly causes changes in performance. Other factors may be influencing both time spent and scores.
Communication
  • The table is clearly labeled and separates the analysis by group.: The table's clear labeling helps in understanding the associations between time spent and various outcomes. Separating the analysis by group (LLM vs. conventional resources) provides valuable insights into potential differences in these associations.
  • The table should clarify what the reported values represent (e.g., correlation coefficients).: The table could benefit from a clearer explanation of what the reported values represent. Specifying that these are correlation coefficients or regression coefficients in the caption or a footnote would improve interpretability.

