Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
Science Advances
Hertie Institute for AI in Brain Health, University of Tübingen, 72076 Tübingen, Germany

Overall Summary

Study Background and Main Findings

This paper investigates the growing use of Large Language Models (LLMs) in scientific writing, focusing on biomedical research publications. The authors address the challenge of quantifying LLM usage, given the limitations of existing detection methods that rely on potentially biased training data. They introduce a novel, unbiased approach inspired by epidemiological studies of 'excess mortality.' Instead of directly detecting LLM-generated text, they analyze vocabulary changes over time, tracking the 'excess usage' of specific words after the release of LLMs like ChatGPT.

The study analyzes over 15 million biomedical abstracts from PubMed, spanning 2010 to 2024. By comparing word frequencies in 2024 to a projected baseline based on pre-LLM trends (2021-2022), they identify a significant increase in the use of certain 'style words' – terms like 'delving,' 'pivotal,' and 'intricate' that are characteristic of LLM-generated text. This vocabulary shift is not only substantial but also qualitatively different from previous changes, such as those seen during the COVID-19 pandemic, which primarily involved content-related words (e.g., 'coronavirus,' 'pandemic').

The authors estimate that at least 13.5% of biomedical abstracts published in 2024 were processed with LLMs. This lower-bound estimate is derived using two independent methods, one based on a large set of rare style words and the other on a smaller, manually curated set of common style words. Importantly, this estimated LLM usage in 2024 is more than double the proportion of COVID-related abstracts at the pandemic's peak in 2021. Furthermore, the study reveals significant variation in LLM usage across scientific disciplines, countries, and journals, with some sub-groups showing estimates as high as 40%.

The research concludes that LLMs have had an unprecedented impact on scientific writing, surpassing even major global events like the COVID-19 pandemic in terms of their linguistic influence. The authors acknowledge limitations, such as the inability to distinguish between direct LLM use and authors adopting an LLM-influenced style. However, they emphasize the strengths of their approach, particularly its unbiased nature and its ability to provide historical context. The findings raise important questions about the implications of widespread LLM adoption for research integrity, scientific discourse, and the future of academic publishing.

Research Impact and Future Directions

This paper presents a compelling and rigorous analysis of the impact of Large Language Models (LLMs) on scientific writing, specifically within the biomedical research domain. The authors' novel 'excess vocabulary' approach offers a clever way to quantify LLM influence without relying on potentially biased ground-truth datasets, a significant methodological advantage over previous studies. The study's large scale, spanning over 15 million PubMed abstracts, and its longitudinal design, covering the period from 2010 to 2024, allow for robust detection of linguistic shifts and meaningful comparisons against established baselines, such as the COVID-19 pandemic's impact on scientific vocabulary.

The key finding of at least 13.5% LLM processing in 2024 biomedical abstracts is striking, especially when contextualized against the peak of COVID-related literature. The study's rigorous methodology, including the use of two independent heuristics ('rare' and 'common' word sets) and the blinded annotation of words, enhances the credibility of this estimate. The observed heterogeneity in LLM usage across disciplines, countries, and journals raises important questions about varying adoption rates, editorial practices, and the potential for more sophisticated, less detectable LLM use. While the study acknowledges its limitations, such as the inability to distinguish between direct LLM generation and stylistic mimicry, it nonetheless provides a valuable and timely contribution to the ongoing discussion about the role of AI in scientific communication.

The study's implications extend beyond mere quantification. By highlighting the potential for linguistic homogenization and the risk of perpetuating biases, it underscores the need for careful consideration of LLM use in scientific writing. The authors' suggestion to adapt their framework to track these risks further strengthens the paper's contribution and opens avenues for future research. Overall, this work provides a robust, data-driven foundation for understanding the transformative impact of LLMs on scientific communication and offers a valuable tool for monitoring and shaping the evolving relationship between AI and academic discourse.

Critical Analysis and Recommendations

Clear, Comprehensive, and Impactful Summary (written-content)
The abstract effectively summarizes the study's scope, providing specific, quantifiable findings (a 13.5% lower bound, reaching 40% in some subgroups) and impactful contextualization (comparison to the COVID-19 pandemic). This clarity and precision immediately convey the significance of the work.
Section: Abstract
Lack of Examples for 'Style Words' (written-content)
The abstract mentions 'style words' without examples. Briefly exemplifying these (e.g., 'delving,' 'pivotal') would enhance clarity and accessibility for non-experts.
Section: Abstract
Effective Problem Context and Literature Review (written-content)
The introduction effectively scaffolds the problem by linking language change to world events and focusing on the LLM disruption. It also provides a critical literature review, highlighting the limitations of existing LLM detection methods.
Section: Introduction
Undeveloped 'Equity' Point (written-content)
The introduction mentions the 'equity' hope for LLMs without elaboration. Briefly explaining the potential mechanisms (e.g., aiding non-native speakers) would provide a more balanced perspective.
Section: Introduction
Lack of Examples for 'Stylistic Words' (written-content)
The introduction mentions 'stylistic words' without examples. Providing examples (e.g., 'delving,' 'intricate') would clarify the linguistic phenomenon being investigated.
Section: Introduction
Exceptional Data Visualization (graphical-figure)
Figures 1-5 systematically visualize the data, building the argument from individual word trends to lower-bound estimates and heterogeneity. This clear visual communication makes complex findings accessible.
Section: Results
Opaque 'Common Set' Selection (written-content)
The 'common set' of words was manually selected. Detailing the selection criteria (e.g., maximizing combined frequency gap) would enhance transparency and address potential bias concerns.
Section: Results
Nuanced Interpretation of Heterogeneity (written-content)
The discussion thoughtfully interprets the observed heterogeneity, considering factors beyond adoption rates (e.g., differential editing, publication timelines). This nuanced perspective strengthens the analysis.
Section: Discussion
Untested 'Naïve vs. Advanced' Hypothesis (written-content)
The discussion hypothesizes about 'naïve vs. advanced' LLM usage. Suggesting a future study (e.g., correlating estimates with linguistic complexity) would add empirical validation.
Section: Discussion
Unquantified Homogenization Risk (written-content)
The discussion mentions linguistic homogenization as a risk. Proposing a method to measure it (e.g., tracking vocabulary diversity) would strengthen the paper's contribution.
Section: Discussion
Transparent and Reproducible Methods (written-content)
The methods section provides clear, specific details (e.g., software parameters, libraries used), enhancing reproducibility.
Section: Materials and Methods
Unjustified Extrapolation Formula (written-content)
The counterfactual projection formula lacks justification. Explaining its rationale (e.g., why this non-linear model) would enhance methodological transparency.
Section: Materials and Methods

Section Analysis

Results

Non-Text Elements

Fig. 1. Frequencies of PubMed abstracts containing several example words.
First Reference in Text
Some words strongly increased their occurrence frequency in 2023-2024 (Fig. 1).
Description
  • Nine line charts showing word frequency over time: The figure displays a grid of nine separate line charts, each tracking the frequency of a specific word's appearance in the abstracts of biomedical papers from the PubMed database between 2012 and 2024. The vertical axis represents frequency (how common the word is), and the horizontal axis represents the year.
  • Actual frequency vs. projected 'counterfactual' trend: For each word, a solid blue line shows its actual frequency year by year. A dashed black line starting from 2022 shows a 'counterfactual extrapolation'—a statistical projection of what the frequency would likely have been in 2023 and 2024 based on the trend from 2021-2022. This projection serves as a baseline to measure any unexpected changes.
  • Sharp increase in frequency for LLM-associated words post-2022: The first six words (e.g., 'delves', 'crucial', 'potential') show a sharp increase in frequency in 2023 and 2024, far exceeding the projected trend. For example, the frequency of 'potential' increased from approximately 0.20 (20% of abstracts) in 2022 to about 0.25 (25% of abstracts) in 2024, while its projected frequency was only around 0.21.
  • Comparison words illustrate trends from other major events: The last three words ('pandemic', 'ebola', 'convolutional') are included for comparison to show how word frequencies change due to other major real-world or scientific events. The word 'pandemic' spiked dramatically around 2020-2021, reaching a frequency of about 0.04, while 'ebola' had a smaller spike around 2015. These examples provide context for the magnitude of the changes seen in the other words.
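The counterfactual baseline described above can be sketched in a few lines. This is an illustrative linear extrapolation from the two pre-LLM years; the paper's actual projection formula is defined in its Materials and Methods, so treat the functional form here as an assumption.

```python
def counterfactual(p_2021: float, p_2022: float, year: int) -> float:
    """Project a word's expected frequency for a post-2022 year by
    linearly extending the 2021 -> 2022 trend.

    Assumption: a simple linear trend; the paper uses its own
    projection formula (see its Materials and Methods).
    """
    slope = p_2022 - p_2021
    # Frequencies cannot go below zero, so clip declining trends.
    return max(0.0, p_2022 + slope * (year - 2022))

# Illustration with round numbers close to the 'potential' panel:
# 0.195 in 2021 and 0.20 in 2022 project to 0.21 in 2024, so an
# observed frequency of 0.25 would be roughly 0.04 in excess.
projected = counterfactual(0.195, 0.20, 2024)
```

Any deviation of the observed 2023/2024 frequency above this projection is what the figure visualizes as the gap between the solid and dashed lines.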
Scientific Validity
  • ✅ Inclusion of external event words as a baseline: The inclusion of words whose frequency is tied to known external events ('pandemic', 'ebola') provides an excellent informal control. It allows the reader to benchmark the magnitude and suddenness of the LLM-associated changes against other significant shifts in scientific discourse, strengthening the claim that the post-2022 effect is 'unprecedented'.
  • ✅ Use of a counterfactual model to quantify the effect: Visually representing a counterfactual baseline, even a simple one, is a methodologically sound way to quantify the 'excess' usage. It moves the argument from a purely qualitative observation ('the line goes up') to a quantitative one ('the line is this much higher than expected').
  • 💡 Potential for selection bias in 'example words': The words presented are described as 'manually selected'. This introduces a risk of cherry-picking—selecting only the examples that most strongly support the hypothesis. While this figure is illustrative, its evidentiary power is limited by this potential bias. The overall conclusion of the paper must rely on the more systematic analysis shown in later figures.
  • 💡 Simplicity of the counterfactual model: The counterfactual is a linear extrapolation based on only two preceding years (2021-2022). This is a very simple model that may not capture more complex, non-linear trends or natural volatility in word usage. A model incorporating a longer time series for the baseline might provide a more robust estimate, though the authors are transparent about their method.
Communication
  • ✅ Effective use of small multiples: The use of a 3x3 grid of small multiples is an excellent design choice. It allows for the direct and efficient comparison of trends across nine different words without cluttering a single plot, which is a key principle of effective data visualization.
  • ✅ Clear visual encoding of data vs. counterfactual: The visual distinction between the actual data (solid blue line) and the projected trend (dashed black line) is clear and intuitive, effectively highlighting the deviation that is central to the authors' argument.
  • 💡 Inconsistent Y-axis scales may be misleading: The y-axis scales differ significantly between plots (e.g., 'Delves' maxes out at 0.003, while 'These' reaches 0.36). While this maximizes the visual detail within each plot, it can mislead the viewer about the relative magnitude of the frequency changes. For instance, the slope for 'Delves' appears very steep, but the absolute change is orders of magnitude smaller than for 'Potential'. Consider adding a note to the caption about the varying scales to prevent misinterpretation.
Fig. 2. Words showing increased frequency in 2024.
First Reference in Text
Across all 26,657 words, we found many with strong excess usage in 2024 (Fig. 2).
Description
  • Two scatter plots showing excess word usage in 2024: The figure consists of two scatter plots, labeled (A) and (B), that visualize the increased usage of 26,657 words in biomedical abstracts in 2024. Each point on the plots represents a single word. The horizontal axis on both plots shows the word's overall frequency in 2024 on a logarithmic scale, ranging from appearing in 0.01% of abstracts (10⁻⁴) to 100% of abstracts (10⁰).
  • Panel A: Relative increase (Frequency Ratio 'r'): Panel (A) plots the 'excess frequency ratio' (r) on the vertical axis, also on a logarithmic scale. This ratio measures how many times more frequent a word was in 2024 compared to its expected frequency based on 2022 data. For example, the word 'delves' shows a ratio near 30, meaning it was used about 30 times more often than expected. This metric is particularly useful for highlighting large relative changes in less common words.
  • Panel B: Absolute increase (Frequency Gap 'δ'): Panel (B) plots the 'excess frequency gap' (δ) on the vertical axis, on a linear scale. This gap measures the absolute difference between a word's actual and expected frequency. For instance, the word 'potential' has a gap of about 0.05, meaning it appeared in 5% more abstracts in 2024 than expected. This metric is better for showing the impact on very common words.
  • Color coding distinguishes 'style' vs. 'content' words: In both panels, words are color-coded as either 'content words' (blue) or 'style words' (orange). The plots visually demonstrate that the words with the highest excess usage, by both relative and absolute measures, are predominantly style words.
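The two metrics in panels A and B are straightforward to compute once a word's observed and expected frequencies are in hand. A minimal sketch (the example numbers are rounded from the figures, not exact values from the paper):

```python
def excess_metrics(observed: float, expected: float) -> tuple:
    """Return (r, delta) for a single word.

    r     = observed / expected  -- excess frequency ratio (panel A),
            best for spotting large relative jumps in rare words.
    delta = observed - expected  -- excess frequency gap (panel B),
            best for measuring absolute impact on common words.
    """
    return observed / expected, observed - expected

# 'potential' (rounded from the figures): observed ~0.25 vs.
# expected ~0.21 -> modest ratio but a large absolute gap, which is
# why the gap metric is the informative one for common words.
r, delta = excess_metrics(0.25, 0.21)
```

A rare word such as 'delves' shows the opposite pattern: a tiny absolute gap but a ratio an order of magnitude above 1, which is why both metrics are needed.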
Scientific Validity
  • ✅ Use of complementary metrics is a major strength: The dual-metric approach is methodologically robust. Using both a relative measure (frequency ratio 'r') for rare words and an absolute measure (frequency gap 'δ') for common words provides a comprehensive and nuanced view of the vocabulary shift, avoiding the pitfalls of relying on a single metric.
  • ✅ Systematic analysis of the entire vocabulary: Unlike the illustrative examples in Fig. 1, these plots are based on the full dataset of 26,657 words. This systematic approach strongly supports the claim of widespread change and avoids the potential for cherry-picking, adding significant weight to the paper's conclusions.
  • 💡 Potential subjectivity in content/style annotation: The manual annotation of words into 'content' and 'style' categories is a key step that enables the main conclusion. However, this categorization can be subjective. The authors should ideally provide the criteria for this classification or make the full annotated list available to ensure transparency and allow for independent verification of this crucial analytical step.
  • ✅ Clear and quantitative definition of 'excess words': The figure and text clearly define the thresholds for 'excess words' (dashed lines). This provides a clear, quantitative basis for which words are considered significant outliers, making the analysis reproducible and less arbitrary.
Communication
  • ✅ Excellent use of a two-panel layout: The use of a side-by-side two-panel plot is highly effective. It allows for a direct comparison of two different but complementary metrics (relative vs. absolute increase), telling a more complete story than either plot could alone.
  • ✅ Highly effective color coding: The color-coding of words into 'content' (blue) and 'style' (orange) is a powerful visual device. It immediately and clearly communicates the central finding that the words with the most significant excess usage are overwhelmingly stylistic rather than content-related.
  • ✅ Useful annotation of key data points: The annotations labeling the most extreme data points (e.g., 'delves', 'potential') are crucial for interpreting the plots and linking them back to the specific examples discussed in the text.
  • ✅ Informative and detailed caption: The detailed figure caption is very helpful, explaining what each panel shows, defining the axes, and noting the capping of values. This makes the figure more self-contained.
  • 💡 Capping of y-axis values hides true maxima: While necessary for visualization, the capping of the y-axis values (r > 90 shown at 90, δ > 0.05 shown at 0.05) obscures the true magnitude of the change for the most extreme outliers. A note in the caption clarifies this, but it remains a visual limitation.
Fig. 3. Number of excess words per year.
First Reference in Text
during the COVID pandemic consisted almost entirely of content words (such as respiratory, remdesivir, etc.), whereas the excess vocabulary in 2024 consisted almost entirely of style words (Fig. 3A).
Description
  • Stacked bar chart of excess words per year: This panel presents a stacked bar chart tracking the number of 'excess words' each year from 2014 to 2024. An 'excess word' is a word used more frequently than statistically expected. The height of each bar represents the total count of such words for that year.
  • Categorization of words by type: Each bar is divided into colored segments representing different categories of words: blue for 'content words' (terms related to the topic, like 'respiratory'), orange for 'style words' (general academic words like 'delves'), and gray for 'other'.
  • Shift from content-driven to style-driven excess vocabulary: The chart shows a significant peak in excess words around 2020-2021, reaching nearly 200 words, which is composed almost entirely of blue 'content words', corresponding to the COVID-19 pandemic. In contrast, 2024 shows the largest peak in the entire period, with over 400 excess words, and this bar is overwhelmingly dominated by orange 'style words'.
  • Exemplar word annotations: Specific words are used to label some of the bars, such as 'Sars', 'Covid', 'Delves', and 'Chatgpt', providing examples of the types of words that became excessively common in those years.
Scientific Validity
  • ✅ Strong comparative evidence: The direct visual comparison between the pandemic-era vocabulary shift (topic-driven, content words) and the post-2022 shift (LLM-driven, style words) provides powerful, intuitive evidence for the paper's central claim that the nature of the change in scientific writing is unprecedented.
  • ✅ Effectively summarizes and visualizes a key trend: This plot effectively visualizes the output of the systematic analysis from Figure 2, aggregating the data over time to show a clear trend. It strongly supports the reference text's claim that the 2024 excess vocabulary is almost entirely style words.
  • 💡 Classification subjectivity: The entire analysis rests on the manual classification of words into 'content' and 'style'. While this is a key contribution, it is also a potential source of subjectivity. The criteria for this classification should be explicitly stated in the methods to ensure transparency and reproducibility.
Communication
  • ✅ Appropriate chart type: The use of a stacked bar chart is an effective choice to simultaneously display the total number of excess words over time and the changing composition of that total (content vs. style).
  • ✅ Effective and consistent color coding: The color scheme is consistent with Figure 2 (blue for content, orange for style), which aids in building a coherent visual narrative across figures. The contrast between the predominantly blue bars in the COVID years and the predominantly orange bar in 2024 is stark and immediately communicates the core finding.
  • ✅ Illustrative annotations: Annotating the bars with a single, highly representative word for that year (e.g., 'Covid' for 2021, 'Chatgpt' for 2024) is a clever way to anchor the abstract data to a concrete and memorable example, enhancing reader engagement.
  • 💡 X-axis labeling could be more complete: The x-axis only labels every other year. While this reduces clutter, labeling every year would improve readability and make it easier to pinpoint specific years like 2021 and 2023 without interpolation.
Fig. 4. Combining excess style words yields a larger frequency gap.
First Reference in Text
To create the rare set, we grouped all 2024 excess style words with frequency p < T and computed the frequency gap Δrare as a function of threshold T (Fig. 4).
Description
  • Two-panel figure explaining the calculation of the LLM usage estimate: The figure consists of two related plots, (A) and (B), that demonstrate how the authors arrived at their primary estimate of LLM usage. The analysis involves creating a set of 'rare style words' and varying a 'Threshold' (horizontal axis) to determine which words to include in the set. A lower threshold means only the rarest words are included.
  • Panel A: Observed vs. Expected frequency of a word set: Panel (A) shows two lines. The top line ('Observed frequency') is the actual fraction of 2024 abstracts containing at least one word from the set defined by the threshold. The bottom line ('Expected frequency') is the projected fraction based on pre-LLM trends. The visible gap between these lines represents the 'excess' usage.
  • Panel B: The 'Frequency gap' as a function of the threshold: Panel (B) plots the size of the gap from Panel (A) directly. The vertical axis, 'Frequency gap (Δ)', represents this excess usage. The plot shows that the gap is maximized at a certain threshold.
  • Key finding: A maximum frequency gap of 13.6%: The peak of the curve in Panel (B) is explicitly labeled with the value 0.136. This indicates that the maximum frequency gap found is 0.136, or 13.6%, which the authors use as their main lower-bound estimate for the percentage of 2024 abstracts processed with LLMs.
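The threshold sweep that produces the curve in panel B can be sketched as follows. In the paper, the expected union frequency comes from the counterfactual projection; here it is computed from a stand-in baseline corpus, which is a simplifying assumption of this sketch.

```python
def union_frequency(abstracts: list, markers: set) -> float:
    """Fraction of abstracts containing at least one marker word.
    `abstracts` is a list of lowercase token sets."""
    return sum(1 for tokens in abstracts if tokens & markers) / len(abstracts)

def best_frequency_gap(observed_corpus, baseline_corpus,
                       style_word_freqs, thresholds):
    """Sweep candidate thresholds T; for each, take the excess style
    words with 2024 frequency p < T as markers and compute
    Delta(T) = observed union frequency - baseline union frequency.
    Returns (best_T, max_gap).

    Assumption: the baseline is an actual corpus here; in the paper
    the expected frequency is a counterfactual projection instead.
    """
    best_T, max_gap = None, 0.0
    for T in thresholds:
        markers = {w for w, p in style_word_freqs.items() if p < T}
        if not markers:
            continue
        gap = (union_frequency(observed_corpus, markers)
               - union_frequency(baseline_corpus, markers))
        if gap > max_gap:
            best_T, max_gap = T, gap
    return best_T, max_gap
```

Counting abstracts that contain at least one marker word (rather than summing per-word frequencies) is what makes the result a valid lower bound without assuming the marker words occur independently.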
Scientific Validity
  • ✅ Data-driven optimization of the marker word set: The approach of optimizing a threshold (T) to find the set of marker words that maximizes the frequency gap (Δ) is a rigorous and transparent method for deriving a robust lower-bound estimate. It avoids arbitrarily selecting a set of words and instead uses a data-driven approach.
  • ✅ Correctly handles word co-occurrence: The calculation of the frequency of the union of words (i.e., abstracts containing at least one marker word) is a methodologically sound way to handle co-occurrence. It correctly establishes a lower bound without making the incorrect assumption that the usages of different marker words are independent events.
  • ✅ Finding appears robust to minor changes in threshold: The plot in Panel B shows a relatively broad peak around the maximum. This suggests that the final estimate of 13.6% is not highly sensitive to the exact choice of the threshold T, which adds to the robustness of the finding.
  • 💡 Analysis for the 'common set' is not shown: The analysis shown is for the 'rare set' of words. The text mentions a second, independent 'common set' was also used. Showing a similar plot for the 'common set' would further strengthen the paper's claim of robustness by demonstrating that two independent heuristics lead to a similar result.
Communication
  • ✅ Excellent narrative structure: The two-panel structure is highly effective. Panel A clearly shows the components of the calculation (observed and expected frequencies), while Panel B displays the derived result (the gap). This guides the reader logically from the raw data to the final conclusion.
  • ✅ Clear highlighting of the main finding: Explicitly annotating the peak of the curve in Panel B with the key result, '0.136', is a superb communication choice. It immediately draws the reader's eye to the single most important number derived from this analysis.
  • ✅ Appropriate use of a logarithmic scale: The use of a logarithmic scale for the x-axis ('Threshold') is appropriate, as it effectively displays the function's behavior across several orders of magnitude of word frequency, which is where the optimal threshold is being sought.
  • 💡 Y-axis label could be more descriptive: The y-axis label in Panel B, 'Frequency gap (Δ)', is technically correct but could be more intuitive for broader comprehension. Consider a more descriptive label like 'Estimated LLM Usage (Δ)', as this value is presented as the lower-bound estimate throughout the text.
Fig. 5. Frequency gaps estimated for various subcorpora.
First Reference in Text
We performed the same analysis as above by various subgroups of PubMed papers. We computed frequency gaps Δcommon and Δrare for different biomedical fields, affiliation countries, journals, and men and women among the first and the last authors, inferred from their first names (see Materials and Methods). Note that we based all our Δ estimates on the same two sets of excess style words as before.
Description
  • Composite figure analyzing LLM usage across paper subgroups: This figure is a composite of five panels (A-E) that analyze the 'frequency gap'—the estimated lower bound of LLM usage—across various subgroups of scientific papers (subcorpora).
  • Panel A: Time-series frequency of marker word sets: Panel A is a line chart showing the frequency of abstracts from 2012-2024 that contain at least one word from two distinct marker word sets: a 'rare' set (291 words) and a 'common' set (10 words). Both show a sharp increase after 2022, with the derived frequency gaps being 0.136 (13.6%) and 0.134 (13.4%), respectively.
  • Panels B-D: Correlation of estimates from two different word sets: Panels B, C, and D are scatter plots that compare the frequency gap estimates from the 'rare' set (x-axis) versus the 'common' set (y-axis) for individual subcorpora. Each point represents a specific subgroup, such as a scientific field (B), a country (C), or a journal (D). Most points cluster along the diagonal, indicating that both word sets produce similar estimates.
  • Heterogeneity in LLM usage estimates across subgroups: These scatter plots reveal significant variation. For instance, the 'Computation' field (Panel B) and the country 'Taiwan' (Panel C) show high frequency gaps of around 0.20 (20%), while the 'Nature + Science + Cell' journal group (Panel D) shows a much lower gap of about 0.07 (7%).
  • Panel E: Highlighting extreme usage estimates in specific subcorpora: Panel E consists of several bar charts that display the final averaged frequency gap for specific, highly impacted subgroups. It highlights extreme cases, such as an estimated LLM usage rate of 0.41 (41%) in 'Computation' papers from China, and 0.34 (34%) in papers from South Korea published in the journal 'Sensors'.
Scientific Validity
  • ✅ Robustness check using two independent marker sets: Using two independently derived marker word sets ('rare' and 'common') and showing that they yield highly correlated estimates (Panels B-D) is a powerful validation of the method's robustness. It demonstrates that the findings are not an artifact of a particular choice of marker words.
  • ✅ Stratified analysis reveals important heterogeneity: The stratified analysis across different fields, countries, and journals is a significant strength. It provides a nuanced picture of LLM adoption, revealing heterogeneity that would be missed by a single aggregate analysis. This is a key scientific contribution of the paper.
  • ✅ Consistent methodology across subgroups: The text confirms that the same marker word sets were used for all subcorpora. This is a methodologically sound choice that ensures the comparisons between groups are fair and consistent.
  • 💡 Lack of uncertainty estimates: The figure presents point estimates for the frequency gaps without any indication of uncertainty, such as error bars or confidence intervals. Given that these are estimates derived from data, providing a measure of statistical uncertainty would substantially increase the rigor of the claims.
  • 💡 Potential for confounding variables: The figure shows strong correlations between LLM usage estimates and certain subgroups (e.g., computational fields, specific countries). While the paper frames this as differences in LLM adoption, the data shown cannot rule out confounding factors. For example, fields with faster publication cycles might show the effect earlier, or journals with different editorial practices might be more or less likely to publish text with LLM-style words. These alternative interpretations are plausible based on the figure alone.
Communication
  • ✅ Excellent multi-panel narrative structure: The figure's multi-panel layout is highly effective. It logically progresses from establishing the analysis method (Panel A), to showing the consistency of results across broad categories (Panels B-D), and finally to highlighting the most extreme and specific findings (Panel E). This creates a clear and compelling visual narrative.
  • ✅ Appropriate use of varied chart types: The use of different but appropriate chart types for different data aspects—line charts for time series, scatter plots for correlations, and bar charts for direct comparisons—is a strong design choice that enhances clarity.
  • ✅ Clear and effective labeling of key points: Key data points in panels B, C, D, and E are clearly labeled (e.g., 'Computation', 'Taiwan', 'Sensors') or have their values printed on them. This is crucial for interpreting the figure and connecting it to the text's specific claims.
  • 💡 Repetitive axis labels could be streamlined: The axis labels in Panels B, C, and D ('Frequency gap Δ based on rare words') are very long and repetitive. A global title for these three panels could state the comparison, allowing for more concise axis labels like 'Δ (rare set)' and 'Δ (common set)'.
  • 💡 Caption could be more descriptive to improve self-containment: The caption is quite concise. While the figure is complex, a slightly more detailed caption explaining what each panel represents (e.g., 'A: Frequency of marker word sets over time. B-D: Comparison of Δ estimates... E: Final Δ estimates for selected subgroups.') would improve its ability to be understood independently from the main text.
Figs. S1 to S7
First Reference in Text
For comparison, we did the same analysis for all years from 2013 to 2023 (figs. S1 to S4).
Description
  • Historical baseline of 'excess words' from 2013-2023: Figures S1 through S4 provide a year-by-year historical analysis of 'excess words' in PubMed abstracts from 2013 to 2023. Each year is represented by two scatter plots, identical in format to the main Figure 2, showing the 'excess frequency ratio' (a measure of relative increase) and the 'excess frequency gap' (a measure of absolute increase). This series of plots serves as a crucial baseline to contextualize the findings for 2024.
  • Excess words before 2023 are primarily content-driven: In the pre-LLM era (2013-2022), the figures consistently show that the words with the highest excess usage are overwhelmingly 'content words'—terms related to specific topics. For example, 'ebola' shows a large frequency ratio spike in 2015, 'zika' spikes in 2017 with a ratio of about 40, and words like 'covid', 'coronavirus', and 'pandemic' dominate from 2020 to 2021, with 'covid' reaching a frequency gap of approximately 0.05 in 2021.
  • Low background level of excess words in pre-LLM years: The plots for pre-LLM years show that very few words cross the defined threshold for excess usage, and those that do are tied to clear real-world events or scientific breakthroughs (e.g., disease outbreaks, new drugs like 'pembrolizumab' in 2016, or technologies like 'CRISPR' in 2015).
  • The year 2023 shows the beginning of the shift towards style words: The plot for 2023 (in Figure S4) marks a transition. While still showing content words like 'monkeypox' and 'omicron', it also reveals the emergence of excess 'style words' such as 'intricate' and 'transformer', foreshadowing the large-scale shift observed in 2024.
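The year-by-year analysis described above rests on two simple quantities per word: the excess frequency ratio (observed frequency divided by the expected, counterfactual frequency) and the excess frequency gap (observed minus expected). A minimal sketch of this idea, assuming frequencies are given as fractions of abstracts per year and the counterfactual is a linear extrapolation from the two preceding years (the function names, thresholds, and toy data below are illustrative, not the authors' actual code or values):

```python
def expected_frequency(freq_two_years_ago: float, freq_last_year: float) -> float:
    """Linearly extrapolate a word's frequency into the target year,
    clipped at a small floor to avoid division by zero."""
    projected = freq_last_year + (freq_last_year - freq_two_years_ago)
    return max(projected, 1e-6)

def excess_metrics(observed: float, expected: float) -> tuple[float, float]:
    """Return (excess frequency ratio, excess frequency gap):
    relative and absolute increase over the counterfactual."""
    return observed / expected, observed - expected

def flag_excess_words(freqs: dict[str, tuple[float, float, float]],
                      ratio_threshold: float = 10.0,
                      gap_threshold: float = 0.01) -> dict[str, tuple[float, float]]:
    """Flag words whose usage exceeds either threshold.

    `freqs` maps each word to (frequency in year t-2, frequency in
    year t-1, observed frequency in target year t), all as fractions
    of abstracts containing the word.
    """
    flagged = {}
    for word, (f2, f1, observed) in freqs.items():
        expected = expected_frequency(f2, f1)
        ratio, gap = excess_metrics(observed, expected)
        if ratio > ratio_threshold or gap > gap_threshold:
            flagged[word] = (ratio, gap)
    return flagged

# Toy data: a content word with a large relative spike but tiny absolute
# footprint, a style word with a large absolute gap, and a stable word.
toy = {
    "zika":     (0.00001, 0.00001, 0.0004),  # big ratio, negligible gap
    "delves":   (0.0001,  0.0001,  0.02),    # big ratio and big gap
    "patients": (0.10,    0.10,    0.101),   # within expected range
}
flagged = flag_excess_words(toy)
print(flagged)
```

This separation is exactly what the two panels visualize: content words like 'zika' dominate the ratio panel while leaving the gap panel empty, whereas only a broad shift (COVID content in 2020, style words in 2024) produces words crossing the 0.01 gap threshold.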
Scientific Validity
  • ✅ Provides a crucial and robust historical baseline: Presenting the analysis for every year from 2013 to 2023 is a major methodological strength. It establishes a robust historical baseline, demonstrating what 'normal' vocabulary shifts look like and allowing the authors to compellingly argue that the 2024 changes are unprecedented in both scale and nature.
  • ✅ Consistent methodology ensures fair comparison: The consistent application of the same analytical thresholds (the dashed lines) across all years ensures that the comparison is fair and rigorous. This supports the conclusion that the increase in the number and type of excess words in 2024 is a genuine phenomenon, not an artifact of changing criteria.
  • ✅ Strongly supports the paper's central claims: These figures strongly support the paper's narrative. They visually confirm that before the widespread availability of LLMs, large-scale shifts in scientific vocabulary were driven by new topics and discoveries (content), whereas the post-LLM era is characterized by a shift in writing style.
  • 💡 Lack of content/style color-coding: The main Figure 2 uses color to distinguish 'content' from 'style' words, which is a powerful visual aid. This color-coding is absent in the supplementary figures. While one can infer the category of the labeled words, applying the same color scheme here would have made the visual argument for the historical baseline even more direct and compelling.
Communication
  • ✅ Consistent and effective layout: The consistent two-panel layout used for each year, mirroring the main Figure 2, is highly effective. This consistency allows the reader to quickly learn how to interpret the plots and easily compare findings across different years.
  • ✅ Appropriate use of supplementary materials: Placing this extensive year-by-year analysis in the supplementary materials is an appropriate choice. It provides crucial supporting evidence for the main claims without cluttering the primary narrative, which correctly focuses on the 2024 data.
  • 💡 Captions could be more descriptive: The captions for each figure are minimal (e.g., "Figure S1: Excess words in 2013 and 2014. See Figure 2 for explanations."). While this is efficient, adding a brief summary sentence to each caption (e.g., "...showing primarily content-related words linked to specific research trends.") would improve their ability to be understood at a glance.
  • 💡 Overlapping text labels reduce readability: In several plots, the text labels for outlier words overlap, making them difficult to read (e.g., Fig. S2, 2015 plot). Employing a text repulsion algorithm or staggering the labels would significantly improve the clarity of these annotations.
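The overlapping-label problem flagged above is typically solved with repulsion-style placement (e.g., the adjustText package for matplotlib). A minimal one-dimensional sketch of the staggering idea, assuming crowded labels share an x position and only need to be pushed apart vertically (the function and data are illustrative, not from the paper):

```python
def stagger_labels(ys: list[float], min_gap: float) -> list[float]:
    """Push label y-positions apart until consecutive labels are at
    least `min_gap` apart, preserving their vertical order."""
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    adjusted = list(ys)
    # Walk labels bottom-to-top, nudging each one up just enough
    # to clear the label below it.
    for prev, cur in zip(order, order[1:]):
        if adjusted[cur] - adjusted[prev] < min_gap:
            adjusted[cur] = adjusted[prev] + min_gap
    return adjusted

# Three labels crowded around y = 1.0 get spread out:
positions = stagger_labels([1.00, 1.01, 0.99], min_gap=0.05)
print(positions)
```

Full 2-D repulsion additionally moves labels horizontally and draws leader lines back to the data points, but even this simple vertical staggering would resolve most of the collisions in the supplementary panels.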

Discussion


Materials and Methods


Non-Text Elements

Figure S1: Excess words in 2013 and 2014. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Baseline analysis of excess words for 2013 and 2014: This figure presents data for the years 2013 and 2014, serving as a baseline for 'excess word' usage before the major events analyzed in the main paper. It consists of four scatter plots in total, with two for each year.
  • Relative increases are seen in technical, content-specific words: The left-hand plots show the 'excess frequency ratio' on the y-axis, which measures the relative increase in a word's usage compared to what was expected. For both 2013 and 2014, the words with high ratios (around 10) are technical, content-specific terms like 'gabaa' (a neurotransmitter receptor) and 'cmax' (a measure in pharmacology).
  • Absolute increases are negligible across all words: The right-hand plots show the 'excess frequency gap' on the y-axis, which measures the absolute increase in the percentage of abstracts using a word. Critically, for both 2013 and 2014, virtually no words cross the significance threshold (the dashed line at 0.01). This indicates that no single word's usage increased by even one percentage point, establishing a 'quiet' baseline of vocabulary change.
Scientific Validity
  • ✅ Establishes a crucial historical baseline: Presenting this baseline data for 'quiet' years is a methodologically crucial step. It provides a strong point of comparison that allows the authors to convincingly argue that the vocabulary shifts seen in the COVID and LLM eras are statistically significant and historically unusual.
  • ✅ Demonstrates methodological consistency and validity: The figure demonstrates the consistent application of the analytical method across different years. By showing that the method identifies only a few, topic-specific outliers in these baseline years, it builds confidence in the method's validity and its ability to detect genuine signals of change.
  • ✅ Data strongly supports the paper's core premise: The data shown strongly supports the paper's underlying premise: that prior to recent major events, large-scale, coordinated shifts in scientific vocabulary were rare, and the few 'excess words' that appeared were directly tied to specific research topics rather than general writing style.
Communication
  • ✅ Consistent and effective visual layout: The figure maintains a consistent two-panel layout for each year, mirroring the main Figure 2. This is an excellent design choice as it reduces the cognitive load on the reader, who can apply their understanding from the main figure to this supplementary data.
  • 💡 Caption is not self-contained: The caption is overly sparse, simply directing the reader to Figure 2. While efficient, it reduces the figure's ability to stand alone. A more descriptive caption summarizing the key takeaway—for instance, 'Note the low number of excess words and negligible frequency gaps, establishing a pre-COVID/pre-LLM baseline'—would significantly improve clarity and self-containment.
  • 💡 Overlapping text labels: In the 2014 panel, some of the text labels for the data points overlap (e.g., 'gabaa' and 'synonymy'), which hinders readability. Employing a text repulsion algorithm or manually adjusting label positions would make the annotations clearer.
Figure S2: Excess words in 2015–2017. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Historical analysis of excess words from 2015-2017: This figure presents a historical analysis of 'excess words' for the years 2015, 2016, and 2017, arranged in three rows. Each row corresponds to a year and contains two scatter plots, mirroring the format of Figure 2, to show relative and absolute increases in word usage.
  • Vocabulary spikes are tied to major public health events: The plots for 2015 and 2017 are dominated by words related to major viral outbreaks. In 2015, 'ebola' shows an 'excess frequency ratio' (a measure of relative increase) of approximately 10. In 2017, 'zika' and 'zikv' show a much larger ratio, with 'zika' being about 40 times more frequent than expected.
  • Outlier words are content-specific and topic-driven: The outlier words are consistently content-specific terms related to emerging scientific topics. Besides disease names, these include 'crispr' in 2015, new cancer drugs ('pembrolizumab', 'nivolumab') in 2016, and the technical term 'convolutional' in 2017, which is associated with the rise of deep learning.
  • Absolute frequency changes remain negligible: Across all three years, the right-hand panels, which show the 'excess frequency gap' (absolute increase), are largely empty. Almost no words cross the significance threshold of 0.01, indicating that even the words with large relative increases did not become frequent enough to cause a substantial absolute shift in the overall vocabulary.
Scientific Validity
  • ✅ Provides a strong historical baseline for comparison: This figure is crucial for establishing the paper's central argument. By showing that historical vocabulary spikes were infrequent, topic-specific, and driven by content words, it provides a strong baseline against which the widespread, stylistic changes of 2024 can be judged as truly unprecedented.
  • ✅ Validates the method's ability to detect real-world signals: The figure demonstrates the sensitivity and validity of the 'excess word' detection method. It correctly identifies and reflects the impact of known real-world events (Ebola, Zika) and major scientific advances (CRISPR, immunotherapy) on the scientific literature of the time.
  • ✅ Reinforces the localized nature of past vocabulary shifts: The consistent finding of negligible absolute frequency gaps (the right-hand panels) across these years is a key piece of evidence. It reinforces that past vocabulary shifts, while notable, were confined to niche topics and did not represent a broad change in the language of science, unlike the effect attributed to LLMs.
Communication
  • ✅ Effective and consistent multi-year layout: The consistent three-row, two-panel layout is effective for comparing vocabulary shifts across multiple years. It allows the reader to quickly grasp the patterns (or lack thereof) in the historical data by maintaining a uniform visual structure.
  • 💡 Caption is not self-contained: The caption is too minimal, merely pointing to Figure 2 for context. It misses the opportunity to guide the reader's interpretation. A more informative caption, such as "Excess words from 2015-2017, showing that vocabulary spikes were driven by specific content words related to disease outbreaks (Ebola, Zika) and new technologies (CRISPR), with negligible changes in absolute frequency gaps," would greatly enhance the figure's self-containment.
  • 💡 Overlapping text labels reduce clarity: In several panels, text labels for data points overlap, hindering readability (e.g., 'miseq' and 'https' in 2015; 'psma' and 'microcephaly' in 2017). Using a text repulsion algorithm or manually adjusting label positions would improve the clarity of these important annotations.
Figure S3: Excess words in 2018–2020. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Historical analysis of excess words from 2018-2020: This figure continues the historical analysis of 'excess words' for the years 2018, 2019, and 2020. It uses the same two-panel format per year, showing relative increase ('excess frequency ratio') on the left and absolute increase ('excess frequency gap') on the right.
  • 2018-2019 show a continued quiet baseline: The years 2018 and 2019 continue the baseline trend seen in previous figures. The few outlier words are content-specific technical terms (e.g., 'convolutional', 'circrnas', 'adversarial'), and critically, the right-hand panels show that no words cross the absolute frequency gap threshold of 0.01.
  • Explosion of COVID-19 content words in 2020: The year 2020 marks a dramatic shift. The left panel shows an explosion of excess words, all clearly related to the COVID-19 pandemic, such as 'wuhan', 'sars', 'coronavirus', and 'pandemic'.
  • First significant absolute frequency gap emerges in 2020: Most importantly, the right-hand panel for 2020 is the first in the historical series to show a significant signal. Multiple words, including 'coronavirus', 'sars', 'disease', and 'patients', cross the 0.01 absolute frequency gap threshold. This signifies a major, widespread shift in the biomedical vocabulary driven by a single, dominant topic.
Scientific Validity
  • ✅ Validates the method on a known major event: This figure powerfully demonstrates the method's ability to detect and quantify the impact of a massive, real-world event on scientific literature. The clear signal in 2020 validates the entire analytical approach.
  • ✅ Establishes a crucial example of a content-driven shift: The 2020 data provides a perfect 'control' case for a content-driven vocabulary shift. It establishes what a large-scale change looks like when it's about a topic. This is essential for the paper's ultimate argument that the LLM-driven shift is different because it's about style.
  • ✅ Provides strong evidence for the impact of the COVID-19 pandemic: The stark contrast between 2019 and 2020 provides extremely strong evidence supporting the claim that the COVID-19 pandemic had an unprecedented effect on biomedical publishing, a key secondary finding mentioned in the paper.
Communication
  • ✅ Clear narrative progression: The figure's layout, presenting each year sequentially, effectively builds a narrative. The visual contrast between the quiet plots of 2018-2019 and the explosive plot of 2020 is stark and immediately communicates the onset of a major event.
  • 💡 Caption lacks self-containment: The caption is minimal and relies on the reader referring back to Figure 2. To improve its self-containment, it could briefly summarize the key finding, for instance: "Excess words from 2018-2020, culminating in the 2020 emergence of a large, content-driven frequency gap associated with the COVID-19 pandemic."
  • 💡 Overlapping text labels in the 2020 panel: In the 2020 panel showing the frequency ratio, the numerous labels for COVID-related terms are densely packed and overlap significantly, making them difficult to read. Employing a text repulsion algorithm or staggering the labels would greatly improve the clarity of this key panel.
Figure S4: Excess words in 2021–2023. See Figure 2 for explanations.