Delving into LLM-assisted writing in biomedical publications through excess vocabulary

Dmitry Kobak, Rita González-Márquez, Emőke-Ágnes Horvát, Jan Lause
Science Advances
Hertie Institute for AI in Brain Health, University of Tübingen, 72076 Tübingen, Germany

Overall Summary

Study Background and Main Findings

This paper investigates the growing use of Large Language Models (LLMs) in scientific writing, focusing on biomedical research publications. The authors address the challenge of quantifying LLM usage, given the limitations of existing detection methods that rely on potentially biased training data. They introduce a novel, unbiased approach inspired by epidemiological studies of 'excess mortality.' Instead of directly detecting LLM-generated text, they analyze vocabulary changes over time, tracking the 'excess usage' of specific words after the release of LLMs like ChatGPT.

The study analyzes over 15 million biomedical abstracts from PubMed, spanning 2010 to 2024. By comparing word frequencies in 2024 to a projected baseline based on pre-LLM trends (2021-2022), they identify a significant increase in the use of certain 'style words' – terms like 'delving,' 'pivotal,' and 'intricate' that are characteristic of LLM-generated text. This vocabulary shift is not only substantial but also qualitatively different from previous changes, such as those seen during the COVID-19 pandemic, which primarily involved content-related words (e.g., 'coronavirus,' 'pandemic').

The authors estimate that at least 13.5% of biomedical abstracts published in 2024 were processed with LLMs. This lower-bound estimate is derived using two independent methods, one based on a large set of rare style words and the other on a smaller, manually curated set of common style words. Importantly, this estimated LLM usage in 2024 is more than double the proportion of COVID-related abstracts at the pandemic's peak in 2021. Furthermore, the study reveals significant variation in LLM usage across scientific disciplines, countries, and journals, with some sub-groups showing estimates as high as 40%.

The research concludes that LLMs have had an unprecedented impact on scientific writing, surpassing even major global events like the COVID-19 pandemic in terms of their linguistic influence. The authors acknowledge limitations, such as the inability to distinguish between direct LLM use and authors adopting an LLM-influenced style. However, they emphasize the strengths of their approach, particularly its unbiased nature and its ability to provide historical context. The findings raise important questions about the implications of widespread LLM adoption for research integrity, scientific discourse, and the future of academic publishing.

Research Impact and Future Directions

This paper presents a compelling and rigorous analysis of the impact of Large Language Models (LLMs) on scientific writing, specifically within the biomedical research domain. The authors' novel 'excess vocabulary' approach offers a clever way to quantify LLM influence without relying on potentially biased ground-truth datasets, a significant methodological advantage over previous studies. The study's large scale, spanning over 15 million PubMed abstracts, and its longitudinal design, covering the period from 2010 to 2024, allow for robust detection of linguistic shifts and meaningful comparisons against established baselines, such as the COVID-19 pandemic's impact on scientific vocabulary.

The key finding of at least 13.5% LLM processing in 2024 biomedical abstracts is striking, especially when contextualized against the peak of COVID-related literature. The study's rigorous methodology, including the use of two independent heuristics ('rare' and 'common' word sets) and the blinded annotation of words, enhances the credibility of this estimate. The observed heterogeneity in LLM usage across disciplines, countries, and journals raises important questions about varying adoption rates, editorial practices, and the potential for more sophisticated, less detectable LLM use. While the study acknowledges its limitations, such as the inability to distinguish between direct LLM generation and stylistic mimicry, it nonetheless provides a valuable and timely contribution to the ongoing discussion about the role of AI in scientific communication.

The study's implications extend beyond mere quantification. By highlighting the potential for linguistic homogenization and the risk of perpetuating biases, it underscores the need for careful consideration of LLM use in scientific writing. The authors' suggestion to adapt their framework to track these risks further strengthens the paper's contribution and opens avenues for future research. Overall, this work provides a robust, data-driven foundation for understanding the transformative impact of LLMs on scientific communication and offers a valuable tool for monitoring and shaping the evolving relationship between AI and academic discourse.

Critical Analysis and Recommendations

Clear, Comprehensive, and Impactful Summary (written-content)
The abstract effectively summarizes the study's scope, providing specific, quantifiable findings (a 13.5% lower bound, reaching 40% in some subgroups) and impactful contextualization (comparison to the COVID-19 pandemic). This clarity and precision immediately convey the significance of the work.
Section: Abstract
Lack of Examples for 'Style Words' (written-content)
The abstract mentions 'style words' without examples. Briefly exemplifying these (e.g., 'delving,' 'pivotal') would enhance clarity and accessibility for non-experts.
Section: Abstract
Effective Problem Context and Literature Review (written-content)
The introduction effectively scaffolds the problem by linking language change to world events and focusing on the LLM disruption. It also provides a critical literature review, highlighting the limitations of existing LLM detection methods.
Section: Introduction
Undeveloped 'Equity' Point (written-content)
The introduction mentions the 'equity' hope for LLMs without elaboration. Briefly explaining the potential mechanisms (e.g., aiding non-native speakers) would provide a more balanced perspective.
Section: Introduction
Lack of Examples for 'Stylistic Words' (written-content)
The introduction mentions 'stylistic words' without examples. Providing examples (e.g., 'delving,' 'intricate') would clarify the linguistic phenomenon being investigated.
Section: Introduction
Exceptional Data Visualization (graphical-figure)
Figures 1-5 systematically visualize the data, building the argument from individual word trends to lower-bound estimates and heterogeneity. This clear visual communication makes complex findings accessible.
Section: Results
Opaque 'Common Set' Selection (written-content)
The 'common set' of words was manually selected. Detailing the selection criteria (e.g., maximizing combined frequency gap) would enhance transparency and address potential bias concerns.
Section: Results
Nuanced Interpretation of Heterogeneity (written-content)
The discussion thoughtfully interprets the observed heterogeneity, considering factors beyond adoption rates (e.g., differential editing, publication timelines). This nuanced perspective strengthens the analysis.
Section: Discussion
Untested 'Naïve vs. Advanced' Hypothesis (written-content)
The discussion hypothesizes about 'naïve vs. advanced' LLM usage. Suggesting a future study (e.g., correlating estimates with linguistic complexity) would add empirical validation.
Section: Discussion
Unquantified Homogenization Risk (written-content)
The discussion mentions linguistic homogenization as a risk. Proposing a method to measure it (e.g., tracking vocabulary diversity) would strengthen the paper's contribution.
Section: Discussion
Transparent and Reproducible Methods (written-content)
The methods section provides clear, specific details (e.g., software parameters, libraries used), enhancing reproducibility.
Section: Materials and Methods
Unjustified Extrapolation Formula (written-content)
The counterfactual projection formula lacks justification. Explaining its rationale (e.g., why this non-linear model) would enhance methodological transparency.
Section: Materials and Methods

Section Analysis

Results

Non-Text Elements

Fig. 1. Frequencies of PubMed abstracts containing several example words.
First Reference in Text
Some words strongly increased their occurrence frequency in 2023-2024 (Fig. 1).
Description
  • Nine line charts showing word frequency over time: The figure displays a grid of nine separate line charts, each tracking the frequency of a specific word's appearance in the abstracts of biomedical papers from the PubMed database between 2012 and 2024. The vertical axis represents frequency (how common the word is), and the horizontal axis represents the year.
  • Actual frequency vs. projected 'counterfactual' trend: For each word, a solid blue line shows its actual frequency year by year. A dashed black line starting from 2022 shows a 'counterfactual extrapolation'—a statistical projection of what the frequency would likely have been in 2023 and 2024 based on the trend from 2021-2022. This projection serves as a baseline to measure any unexpected changes.
  • Sharp increase in frequency for LLM-associated words post-2022: The first six words (e.g., 'delves', 'crucial', 'potential') show a sharp increase in frequency in 2023 and 2024, far exceeding the projected trend. For example, the frequency of 'potential' increased from approximately 0.20 (20% of abstracts) in 2022 to about 0.25 (25% of abstracts) in 2024, while its projected frequency was only around 0.21.
  • Comparison words illustrate trends from other major events: The last three words ('pandemic', 'ebola', 'convolutional') are included for comparison to show how word frequencies change due to other major real-world or scientific events. The word 'pandemic' spiked dramatically around 2020-2021, reaching a frequency of about 0.04, while 'ebola' had a smaller spike around 2015. These examples provide context for the magnitude of the changes seen in the other words.
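The counterfactual baseline described above can be sketched in a few lines. This is an illustrative linear extrapolation from the two pre-LLM years; the paper's actual projection formula is defined in its Materials and Methods, so treat the functional form here as an assumption.

```python
def counterfactual(p_2021: float, p_2022: float, year: int) -> float:
    """Project a word's expected frequency for a post-2022 year by
    linearly extending the 2021 -> 2022 trend.

    Assumption: a simple linear trend; the paper uses its own
    projection formula (see its Materials and Methods).
    """
    slope = p_2022 - p_2021
    # Frequencies cannot go below zero, so clip declining trends.
    return max(0.0, p_2022 + slope * (year - 2022))

# Illustration with round numbers close to the 'potential' panel:
# 0.195 in 2021 and 0.20 in 2022 project to 0.21 in 2024, so an
# observed frequency of 0.25 would be roughly 0.04 in excess.
projected = counterfactual(0.195, 0.20, 2024)
```

Any deviation of the observed 2023/2024 frequency above this projection is what the figure visualizes as the gap between the solid and dashed lines.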
Scientific Validity
  • ✅ Inclusion of external event words as a baseline: The inclusion of words whose frequency is tied to known external events ('pandemic', 'ebola') provides an excellent informal control. It allows the reader to benchmark the magnitude and suddenness of the LLM-associated changes against other significant shifts in scientific discourse, strengthening the claim that the post-2022 effect is 'unprecedented'.
  • ✅ Use of a counterfactual model to quantify the effect: Visually representing a counterfactual baseline, even a simple one, is a methodologically sound way to quantify the 'excess' usage. It moves the argument from a purely qualitative observation ('the line goes up') to a quantitative one ('the line is this much higher than expected').
  • 💡 Potential for selection bias in 'example words': The words presented are described as 'manually selected'. This introduces a risk of cherry-picking—selecting only the examples that most strongly support the hypothesis. While this figure is illustrative, its evidentiary power is limited by this potential bias. The overall conclusion of the paper must rely on the more systematic analysis shown in later figures.
  • 💡 Simplicity of the counterfactual model: The counterfactual is a linear extrapolation based on only two preceding years (2021-2022). This is a very simple model that may not capture more complex, non-linear trends or natural volatility in word usage. A model incorporating a longer time series for the baseline might provide a more robust estimate, though the authors are transparent about their method.
Communication
  • ✅ Effective use of small multiples: The use of a 3x3 grid of small multiples is an excellent design choice. It allows for the direct and efficient comparison of trends across nine different words without cluttering a single plot, which is a key principle of effective data visualization.
  • ✅ Clear visual encoding of data vs. counterfactual: The visual distinction between the actual data (solid blue line) and the projected trend (dashed black line) is clear and intuitive, effectively highlighting the deviation that is central to the authors' argument.
  • 💡 Inconsistent Y-axis scales may be misleading: The y-axis scales differ significantly between plots (e.g., 'Delves' maxes out at 0.003, while 'These' reaches 0.36). While this maximizes the visual detail within each plot, it can mislead the viewer about the relative magnitude of the frequency changes. For instance, the slope for 'Delves' appears very steep, but the absolute change is orders of magnitude smaller than for 'Potential'. Consider adding a note to the caption about the varying scales to prevent misinterpretation.
Fig. 2. Words showing increased frequency in 2024.
First Reference in Text
Across all 26,657 words, we found many with strong excess usage in 2024 (Fig. 2).
Description
  • Two scatter plots showing excess word usage in 2024: The figure consists of two scatter plots, labeled (A) and (B), that visualize the increased usage of 26,657 words in biomedical abstracts in 2024. Each point on the plots represents a single word. The horizontal axis on both plots shows the word's overall frequency in 2024 on a logarithmic scale, ranging from appearing in 0.01% of abstracts (10⁻⁴) to 100% of abstracts (10⁰).
  • Panel A: Relative increase (Frequency Ratio 'r'): Panel (A) plots the 'excess frequency ratio' (r) on the vertical axis, also on a logarithmic scale. This ratio measures how many times more frequent a word was in 2024 compared to its expected frequency based on 2022 data. For example, the word 'delves' shows a ratio near 30, meaning it was used about 30 times more often than expected. This metric is particularly useful for highlighting large relative changes in less common words.
  • Panel B: Absolute increase (Frequency Gap 'δ'): Panel (B) plots the 'excess frequency gap' (δ) on the vertical axis, on a linear scale. This gap measures the absolute difference between a word's actual and expected frequency. For instance, the word 'potential' has a gap of about 0.05, meaning it appeared in 5% more abstracts in 2024 than expected. This metric is better for showing the impact on very common words.
  • Color coding distinguishes 'style' vs. 'content' words: In both panels, words are color-coded as either 'content words' (blue) or 'style words' (orange). The plots visually demonstrate that the words with the highest excess usage, by both relative and absolute measures, are predominantly style words.
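The two metrics in panels A and B are straightforward to compute once a word's observed and expected frequencies are in hand. A minimal sketch (the example numbers are rounded from the figures, not exact values from the paper):

```python
def excess_metrics(observed: float, expected: float) -> tuple:
    """Return (r, delta) for a single word.

    r     = observed / expected  -- excess frequency ratio (panel A),
            best for spotting large relative jumps in rare words.
    delta = observed - expected  -- excess frequency gap (panel B),
            best for measuring absolute impact on common words.
    """
    return observed / expected, observed - expected

# 'potential' (rounded from the figures): observed ~0.25 vs.
# expected ~0.21 -> modest ratio but a large absolute gap, which is
# why the gap metric is the informative one for common words.
r, delta = excess_metrics(0.25, 0.21)
```

A rare word such as 'delves' shows the opposite pattern: a tiny absolute gap but a ratio an order of magnitude above 1, which is why both metrics are needed.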
Scientific Validity
  • ✅ Use of complementary metrics is a major strength: The dual-metric approach is methodologically robust. Using both a relative measure (frequency ratio 'r') for rare words and an absolute measure (frequency gap 'δ') for common words provides a comprehensive and nuanced view of the vocabulary shift, avoiding the pitfalls of relying on a single metric.
  • ✅ Systematic analysis of the entire vocabulary: Unlike the illustrative examples in Fig. 1, these plots are based on the full dataset of 26,657 words. This systematic approach strongly supports the claim of widespread change and avoids the potential for cherry-picking, adding significant weight to the paper's conclusions.
  • 💡 Potential subjectivity in content/style annotation: The manual annotation of words into 'content' and 'style' categories is a key step that enables the main conclusion. However, this categorization can be subjective. The authors should ideally provide the criteria for this classification or make the full annotated list available to ensure transparency and allow for independent verification of this crucial analytical step.
  • ✅ Clear and quantitative definition of 'excess words': The figure and text clearly define the thresholds for 'excess words' (dashed lines). This provides a clear, quantitative basis for which words are considered significant outliers, making the analysis reproducible and less arbitrary.
Communication
  • ✅ Excellent use of a two-panel layout: The use of a side-by-side two-panel plot is highly effective. It allows for a direct comparison of two different but complementary metrics (relative vs. absolute increase), telling a more complete story than either plot could alone.
  • ✅ Highly effective color coding: The color-coding of words into 'content' (blue) and 'style' (orange) is a powerful visual device. It immediately and clearly communicates the central finding that the words with the most significant excess usage are overwhelmingly stylistic rather than content-related.
  • ✅ Useful annotation of key data points: The annotations labeling the most extreme data points (e.g., 'delves', 'potential') are crucial for interpreting the plots and linking them back to the specific examples discussed in the text.
  • ✅ Informative and detailed caption: The detailed figure caption is very helpful, explaining what each panel shows, defining the axes, and noting the capping of values. This makes the figure more self-contained.
  • 💡 Capping of y-axis values hides true maxima: While necessary for visualization, the capping of the y-axis values (r > 90 shown at 90, δ > 0.05 shown at 0.05) obscures the true magnitude of the change for the most extreme outliers. A note in the caption clarifies this, but it remains a visual limitation.
Fig. 3. Number of excess words per year.
First Reference in Text
during the COVID pandemic consisted almost entirely of content words (such as respiratory, remdesivir, etc.), whereas the excess vocabulary in 2024 consisted almost entirely of style words (Fig. 3A).
Description
  • Stacked bar chart of excess words per year: This panel presents a stacked bar chart tracking the number of 'excess words' each year from 2014 to 2024. An 'excess word' is a word used more frequently than statistically expected. The height of each bar represents the total count of such words for that year.
  • Categorization of words by type: Each bar is divided into colored segments representing different categories of words: blue for 'content words' (terms related to the topic, like 'respiratory'), orange for 'style words' (general academic words like 'delves'), and gray for 'other'.
  • Shift from content-driven to style-driven excess vocabulary: The chart shows a significant peak in excess words around 2020-2021, reaching nearly 200 words, which is composed almost entirely of blue 'content words', corresponding to the COVID-19 pandemic. In contrast, 2024 shows the largest peak in the entire period, with over 400 excess words, and this bar is overwhelmingly dominated by orange 'style words'.
  • Exemplar word annotations: Specific words are used to label some of the bars, such as 'Sars', 'Covid', 'Delves', and 'Chatgpt', providing examples of the types of words that became excessively common in those years.
Scientific Validity
  • ✅ Strong comparative evidence: The direct visual comparison between the pandemic-era vocabulary shift (topic-driven, content words) and the post-2022 shift (LLM-driven, style words) provides powerful, intuitive evidence for the paper's central claim that the nature of the change in scientific writing is unprecedented.
  • ✅ Effectively summarizes and visualizes a key trend: This plot effectively visualizes the output of the systematic analysis from Figure 2, aggregating the data over time to show a clear trend. It strongly supports the reference text's claim that the 2024 excess vocabulary is almost entirely style words.
  • 💡 Classification subjectivity: The entire analysis rests on the manual classification of words into 'content' and 'style'. While this is a key contribution, it is also a potential source of subjectivity. The criteria for this classification should be explicitly stated in the methods to ensure transparency and reproducibility.
Communication
  • ✅ Appropriate chart type: The use of a stacked bar chart is an effective choice to simultaneously display the total number of excess words over time and the changing composition of that total (content vs. style).
  • ✅ Effective and consistent color coding: The color scheme is consistent with Figure 2 (blue for content, orange for style), which aids in building a coherent visual narrative across figures. The contrast between the predominantly blue bars in the COVID years and the predominantly orange bar in 2024 is stark and immediately communicates the core finding.
  • ✅ Illustrative annotations: Annotating the bars with a single, highly representative word for that year (e.g., 'Covid' for 2021, 'Chatgpt' for 2024) is a clever way to anchor the abstract data to a concrete and memorable example, enhancing reader engagement.
  • 💡 X-axis labeling could be more complete: The x-axis only labels every other year. While this reduces clutter, labeling every year would improve readability and make it easier to pinpoint specific years like 2021 and 2023 without interpolation.
Fig. 4. Combining excess style words yields a larger frequency gap.
First Reference in Text
To create the rare set, we grouped all 2024 excess style words with frequency p < T and computed the frequency gap Δrare as a function of threshold T (Fig. 4).
Description
  • Two-panel figure explaining the calculation of the LLM usage estimate: The figure consists of two related plots, (A) and (B), that demonstrate how the authors arrived at their primary estimate of LLM usage. The analysis involves creating a set of 'rare style words' and varying a 'Threshold' (horizontal axis) to determine which words to include in the set. A lower threshold means only the rarest words are included.
  • Panel A: Observed vs. Expected frequency of a word set: Panel (A) shows two lines. The top line ('Observed frequency') is the actual fraction of 2024 abstracts containing at least one word from the set defined by the threshold. The bottom line ('Expected frequency') is the projected fraction based on pre-LLM trends. The visible gap between these lines represents the 'excess' usage.
  • Panel B: The 'Frequency gap' as a function of the threshold: Panel (B) plots the size of the gap from Panel (A) directly. The vertical axis, 'Frequency gap (Δ)', represents this excess usage. The plot shows that the gap is maximized at a certain threshold.
  • Key finding: A maximum frequency gap of 13.6%: The peak of the curve in Panel (B) is explicitly labeled with the value 0.136. This indicates that the maximum frequency gap found is 0.136, or 13.6%, which the authors use as their main lower-bound estimate for the percentage of 2024 abstracts processed with LLMs.
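The threshold sweep that produces the curve in panel B can be sketched as follows. In the paper, the expected union frequency comes from the counterfactual projection; here it is computed from a stand-in baseline corpus, which is a simplifying assumption of this sketch.

```python
def union_frequency(abstracts: list, markers: set) -> float:
    """Fraction of abstracts containing at least one marker word.
    `abstracts` is a list of lowercase token sets."""
    return sum(1 for tokens in abstracts if tokens & markers) / len(abstracts)

def best_frequency_gap(observed_corpus, baseline_corpus,
                       style_word_freqs, thresholds):
    """Sweep candidate thresholds T; for each, take the excess style
    words with 2024 frequency p < T as markers and compute
    Delta(T) = observed union frequency - baseline union frequency.
    Returns (best_T, max_gap).

    Assumption: the baseline is an actual corpus here; in the paper
    the expected frequency is a counterfactual projection instead.
    """
    best_T, max_gap = None, 0.0
    for T in thresholds:
        markers = {w for w, p in style_word_freqs.items() if p < T}
        if not markers:
            continue
        gap = (union_frequency(observed_corpus, markers)
               - union_frequency(baseline_corpus, markers))
        if gap > max_gap:
            best_T, max_gap = T, gap
    return best_T, max_gap
```

Counting abstracts that contain at least one marker word (rather than summing per-word frequencies) is what makes the result a valid lower bound without assuming the marker words occur independently.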
Scientific Validity
  • ✅ Data-driven optimization of the marker word set: The approach of optimizing a threshold (T) to find the set of marker words that maximizes the frequency gap (Δ) is a rigorous and transparent method for deriving a robust lower-bound estimate. It avoids arbitrarily selecting a set of words and instead uses a data-driven approach.
  • ✅ Correctly handles word co-occurrence: The calculation of the frequency of the union of words (i.e., abstracts containing at least one marker word) is a methodologically sound way to handle co-occurrence. It correctly establishes a lower bound without making the incorrect assumption that the usages of different marker words are independent events.
  • ✅ Finding appears robust to minor changes in threshold: The plot in Panel B shows a relatively broad peak around the maximum. This suggests that the final estimate of 13.6% is not highly sensitive to the exact choice of the threshold T, which adds to the robustness of the finding.
  • 💡 Analysis for the 'common set' is not shown: The analysis shown is for the 'rare set' of words. The text mentions a second, independent 'common set' was also used. Showing a similar plot for the 'common set' would further strengthen the paper's claim of robustness by demonstrating that two independent heuristics lead to a similar result.
Communication
  • ✅ Excellent narrative structure: The two-panel structure is highly effective. Panel A clearly shows the components of the calculation (observed and expected frequencies), while Panel B displays the derived result (the gap). This guides the reader logically from the raw data to the final conclusion.
  • ✅ Clear highlighting of the main finding: Explicitly annotating the peak of the curve in Panel B with the key result, '0.136', is a superb communication choice. It immediately draws the reader's eye to the single most important number derived from this analysis.
  • ✅ Appropriate use of a logarithmic scale: The use of a logarithmic scale for the x-axis ('Threshold') is appropriate, as it effectively displays the function's behavior across several orders of magnitude of word frequency, which is where the optimal threshold is being sought.
  • 💡 Y-axis label could be more descriptive: The y-axis label in Panel B, 'Frequency gap (Δ)', is technically correct but could be more intuitive for broader comprehension. Consider a more descriptive label like 'Estimated LLM Usage (Δ)', as this value is presented as the lower-bound estimate throughout the text.
Fig. 5. Frequency gaps estimated for various subcorpora.
First Reference in Text
We performed the same analysis as above by various subgroups of PubMed papers. We computed frequency gaps Δcommon and Δrare for different biomedical fields, affiliation countries, journals, and men and women among the first and the last authors, inferred from their first names (see Materials and Methods). Note that we based all our Δ estimates on the same two sets of excess style words as before.
Description
  • Composite figure analyzing LLM usage across paper subgroups: This figure is a composite of five panels (A-E) that analyze the 'frequency gap'—the estimated lower bound of LLM usage—across various subgroups of scientific papers (subcorpora).
  • Panel A: Time-series frequency of marker word sets: Panel A is a line chart showing the frequency of abstracts from 2012-2024 that contain at least one word from two distinct marker word sets: a 'rare' set (291 words) and a 'common' set (10 words). Both show a sharp increase after 2022, with the derived frequency gaps being 0.136 (13.6%) and 0.134 (13.4%), respectively.
  • Panels B-D: Correlation of estimates from two different word sets: Panels B, C, and D are scatter plots that compare the frequency gap estimates from the 'rare' set (x-axis) versus the 'common' set (y-axis) for individual subcorpora. Each point represents a specific subgroup, such as a scientific field (B), a country (C), or a journal (D). Most points cluster along the diagonal, indicating that both word sets produce similar estimates.
  • Heterogeneity in LLM usage estimates across subgroups: These scatter plots reveal significant variation. For instance, the 'Computation' field (Panel B) and the country 'Taiwan' (Panel C) show high frequency gaps of around 0.20 (20%), while the 'Nature + Science + Cell' journal group (Panel D) shows a much lower gap of about 0.07 (7%).
  • Panel E: Highlighting extreme usage estimates in specific subcorpora: Panel E consists of several bar charts that display the final averaged frequency gap for specific, highly impacted subgroups. It highlights extreme cases, such as an estimated LLM usage rate of 0.41 (41%) in 'Computation' papers from China, and 0.34 (34%) in papers from South Korea published in the journal 'Sensors'.
Scientific Validity
  • ✅ Robustness check using two independent marker sets: Using two independently derived marker word sets ('rare' and 'common') and showing that they yield highly correlated estimates (Panels B-D) is a powerful validation of the method's robustness. It demonstrates that the findings are not an artifact of a particular choice of marker words.
  • ✅ Stratified analysis reveals important heterogeneity: The stratified analysis across different fields, countries, and journals is a significant strength. It provides a nuanced picture of LLM adoption, revealing heterogeneity that would be missed by a single aggregate analysis. This is a key scientific contribution of the paper.
  • ✅ Consistent methodology across subgroups: The text confirms that the same marker word sets were used for all subcorpora. This is a methodologically sound choice that ensures the comparisons between groups are fair and consistent.
  • 💡 Lack of uncertainty estimates: The figure presents point estimates for the frequency gaps without any indication of uncertainty, such as error bars or confidence intervals. Given that these are estimates derived from data, providing a measure of statistical uncertainty would substantially increase the rigor of the claims.
  • 💡 Potential for confounding variables: The figure shows strong correlations between LLM usage estimates and certain subgroups (e.g., computational fields, specific countries). While the paper frames this as differences in LLM adoption, the data shown cannot rule out confounding factors. For example, fields with faster publication cycles might show the effect earlier, or journals with different editorial practices might be more or less likely to publish text with LLM-style words. These alternative interpretations are plausible based on the figure alone.
Communication
  • ✅ Excellent multi-panel narrative structure: The figure's multi-panel layout is highly effective. It logically progresses from establishing the analysis method (Panel A), to showing the consistency of results across broad categories (Panels B-D), and finally to highlighting the most extreme and specific findings (Panel E). This creates a clear and compelling visual narrative.
  • ✅ Appropriate use of varied chart types: The use of different but appropriate chart types for different data aspects—line charts for time series, scatter plots for correlations, and bar charts for direct comparisons—is a strong design choice that enhances clarity.
  • ✅ Clear and effective labeling of key points: Key data points in panels B, C, D, and E are clearly labeled (e.g., 'Computation', 'Taiwan', 'Sensors') or have their values printed on them. This is crucial for interpreting the figure and connecting it to the text's specific claims.
  • 💡 Repetitive axis labels could be streamlined: The axis labels in Panels B, C, and D ('Frequency gap Δ based on rare words') are very long and repetitive. A global title for these three panels could state the comparison, allowing for more concise axis labels like 'Δ (rare set)' and 'Δ (common set)'.
  • 💡 Caption could be more descriptive to improve self-containment: The caption is quite concise. While the figure is complex, a slightly more detailed caption explaining what each panel represents (e.g., 'A: Frequency of marker word sets over time. B-D: Comparison of Δ estimates... E: Final Δ estimates for selected subgroups.') would improve its ability to be understood independently from the main text.
Figs. S1 to S7
First Reference in Text
For comparison, we did the same analysis for all years from 2013 to 2023 (figs. S1 to S4).
Description
  • Historical baseline of 'excess words' from 2013-2023: Figures S1 through S4 provide a year-by-year historical analysis of 'excess words' in PubMed abstracts from 2013 to 2023. Each year is represented by two scatter plots, identical in format to the main Figure 2, showing the 'excess frequency ratio' (a measure of relative increase) and the 'excess frequency gap' (a measure of absolute increase). This series of plots serves as a crucial baseline to contextualize the findings for 2024.
  • Excess words before 2023 are primarily content-driven: In the pre-LLM era (2013-2022), the figures consistently show that the words with the highest excess usage are overwhelmingly 'content words'—terms related to specific topics. For example, 'ebola' shows a large frequency ratio spike in 2015, 'zika' spikes in 2017 with a ratio of about 40, and words like 'covid', 'coronavirus', and 'pandemic' dominate from 2020 to 2021, with 'covid' reaching a frequency gap of approximately 0.05 in 2021.
  • Low background level of excess words in pre-LLM years: The plots for pre-LLM years show that very few words cross the defined threshold for excess usage, and those that do are tied to clear real-world events or scientific breakthroughs (e.g., disease outbreaks, new drugs like 'pembrolizumab' in 2016, or technologies like 'CRISPR' in 2015).
  • The year 2023 shows the beginning of the shift towards style words: The plot for 2023 (in Figure S4) marks a transition. While still showing content words like 'monkeypox' and 'omicron', it also reveals the emergence of excess 'style words' such as 'intricate' and 'transformer', foreshadowing the large-scale shift observed in 2024.
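The year-by-year analysis described above rests on two simple quantities per word: the excess frequency ratio (observed frequency divided by the expected, counterfactual frequency) and the excess frequency gap (observed minus expected). A minimal sketch of this idea, assuming frequencies are given as fractions of abstracts per year and the counterfactual is a linear extrapolation from the two preceding years (the function names, thresholds, and toy data below are illustrative, not the authors' actual code or values):

```python
def expected_frequency(freq_two_years_ago: float, freq_last_year: float) -> float:
    """Linearly extrapolate a word's frequency into the target year,
    clipped at a small floor to avoid division by zero."""
    projected = freq_last_year + (freq_last_year - freq_two_years_ago)
    return max(projected, 1e-6)

def excess_metrics(observed: float, expected: float) -> tuple[float, float]:
    """Return (excess frequency ratio, excess frequency gap):
    relative and absolute increase over the counterfactual."""
    return observed / expected, observed - expected

def flag_excess_words(freqs: dict[str, tuple[float, float, float]],
                      ratio_threshold: float = 10.0,
                      gap_threshold: float = 0.01) -> dict[str, tuple[float, float]]:
    """Flag words whose usage exceeds either threshold.

    `freqs` maps each word to (frequency in year t-2, frequency in
    year t-1, observed frequency in target year t), all as fractions
    of abstracts containing the word.
    """
    flagged = {}
    for word, (f2, f1, observed) in freqs.items():
        expected = expected_frequency(f2, f1)
        ratio, gap = excess_metrics(observed, expected)
        if ratio > ratio_threshold or gap > gap_threshold:
            flagged[word] = (ratio, gap)
    return flagged

# Toy data: a content word with a large relative spike but tiny absolute
# footprint, a style word with a large absolute gap, and a stable word.
toy = {
    "zika":     (0.00001, 0.00001, 0.0004),  # big ratio, negligible gap
    "delves":   (0.0001,  0.0001,  0.02),    # big ratio and big gap
    "patients": (0.10,    0.10,    0.101),   # within expected range
}
flagged = flag_excess_words(toy)
print(flagged)
```

This separation is exactly what the two panels visualize: content words like 'zika' dominate the ratio panel while leaving the gap panel empty, whereas only a broad shift (COVID content in 2020, style words in 2024) produces words crossing the 0.01 gap threshold.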
Scientific Validity
  • ✅ Provides a crucial and robust historical baseline: Presenting the analysis for every year from 2013 to 2023 is a major methodological strength. It establishes a robust historical baseline, demonstrating what 'normal' vocabulary shifts look like and allowing the authors to compellingly argue that the 2024 changes are unprecedented in both scale and nature.
  • ✅ Consistent methodology ensures fair comparison: The consistent application of the same analytical thresholds (the dashed lines) across all years ensures that the comparison is fair and rigorous. This supports the conclusion that the increase in the number and type of excess words in 2024 is a genuine phenomenon, not an artifact of changing criteria.
  • ✅ Strongly supports the paper's central claims: These figures strongly support the paper's narrative. They visually confirm that before the widespread availability of LLMs, large-scale shifts in scientific vocabulary were driven by new topics and discoveries (content), whereas the post-LLM era is characterized by a shift in writing style.
  • 💡 Lack of content/style color-coding: The main Figure 2 uses color to distinguish 'content' from 'style' words, which is a powerful visual aid. This color-coding is absent in the supplementary figures. While one can infer the category of the labeled words, applying the same color scheme here would have made the visual argument for the historical baseline even more direct and compelling.
Communication
  • ✅ Consistent and effective layout: The consistent two-panel layout used for each year, mirroring the main Figure 2, is highly effective. This consistency allows the reader to quickly learn how to interpret the plots and easily compare findings across different years.
  • ✅ Appropriate use of supplementary materials: Placing this extensive year-by-year analysis in the supplementary materials is an appropriate choice. It provides crucial supporting evidence for the main claims without cluttering the primary narrative, which correctly focuses on the 2024 data.
  • 💡 Captions could be more descriptive: The captions for each figure are minimal (e.g., "Figure S1: Excess words in 2013 and 2014. See Figure 2 for explanations."). While this is efficient, adding a brief summary sentence to each caption (e.g., "...showing primarily content-related words linked to specific research trends.") would improve their ability to be understood at a glance.
  • 💡 Overlapping text labels reduce readability: In several plots, the text labels for outlier words overlap, making them difficult to read (e.g., Fig. S2, 2015 plot). Employing a text repulsion algorithm or staggering the labels would significantly improve the clarity of these annotations.
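The overlapping-label problem flagged above is typically solved with repulsion-style placement (e.g., the adjustText package for matplotlib). A minimal one-dimensional sketch of the staggering idea, assuming crowded labels share an x position and only need to be pushed apart vertically (the function and data are illustrative, not from the paper):

```python
def stagger_labels(ys: list[float], min_gap: float) -> list[float]:
    """Push label y-positions apart until consecutive labels are at
    least `min_gap` apart, preserving their vertical order."""
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    adjusted = list(ys)
    # Walk labels bottom-to-top, nudging each one up just enough
    # to clear the label below it.
    for prev, cur in zip(order, order[1:]):
        if adjusted[cur] - adjusted[prev] < min_gap:
            adjusted[cur] = adjusted[prev] + min_gap
    return adjusted

# Three labels crowded around y = 1.0 get spread out:
positions = stagger_labels([1.00, 1.01, 0.99], min_gap=0.05)
print(positions)
```

Full 2-D repulsion additionally moves labels horizontally and draws leader lines back to the data points, but even this simple vertical staggering would resolve most of the collisions in the supplementary panels.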

Discussion


Materials and Methods


Non-Text Elements

Figure S1: Excess words in 2013 and 2014. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Baseline analysis of excess words for 2013 and 2014: This figure presents data for the years 2013 and 2014, serving as a baseline for 'excess word' usage before the major events analyzed in the main paper. It consists of four scatter plots in total, with two for each year.
  • Relative increases are seen in technical, content-specific words: The left-hand plots show the 'excess frequency ratio' on the y-axis, which measures the relative increase in a word's usage compared to what was expected. For both 2013 and 2014, the words with high ratios (around 10) are technical, content-specific terms like 'gabaa' (a neurotransmitter receptor) and 'cmax' (a measure in pharmacology).
  • Absolute increases are negligible across all words: The right-hand plots show the 'excess frequency gap' on the y-axis, which measures the absolute increase in the percentage of abstracts using a word. Critically, for both 2013 and 2014, virtually no words cross the significance threshold (the dashed line at 0.01). This indicates that no single word's usage increased by even one percentage point, establishing a 'quiet' baseline of vocabulary change.
Scientific Validity
  • ✅ Establishes a crucial historical baseline: Presenting this baseline data for 'quiet' years is a methodologically crucial step. It provides a strong point of comparison that allows the authors to convincingly argue that the vocabulary shifts seen in the COVID and LLM eras are statistically significant and historically unusual.
  • ✅ Demonstrates methodological consistency and validity: The figure demonstrates the consistent application of the analytical method across different years. By showing that the method identifies only a few, topic-specific outliers in these baseline years, it builds confidence in the method's validity and its ability to detect genuine signals of change.
  • ✅ Data strongly supports the paper's core premise: The data shown strongly supports the paper's underlying premise: that prior to recent major events, large-scale, coordinated shifts in scientific vocabulary were rare, and the few 'excess words' that appeared were directly tied to specific research topics rather than general writing style.
Communication
  • ✅ Consistent and effective visual layout: The figure maintains a consistent two-panel layout for each year, mirroring the main Figure 2. This is an excellent design choice as it reduces the cognitive load on the reader, who can apply their understanding from the main figure to this supplementary data.
  • 💡 Caption is not self-contained: The caption is overly sparse, simply directing the reader to Figure 2. While efficient, it reduces the figure's ability to stand alone. A more descriptive caption summarizing the key takeaway—for instance, 'Note the low number of excess words and negligible frequency gaps, establishing a pre-COVID/pre-LLM baseline'—would significantly improve clarity and self-containment.
  • 💡 Overlapping text labels: In the 2014 panel, some of the text labels for the data points overlap (e.g., 'gabaa' and 'synonymy'), which hinders readability. Employing a text repulsion algorithm or manually adjusting label positions would make the annotations clearer.
Figure S2: Excess words in 2015–2017. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Historical analysis of excess words from 2015-2017: This figure presents a historical analysis of 'excess words' for the years 2015, 2016, and 2017, arranged in three rows. Each row corresponds to a year and contains two scatter plots, mirroring the format of Figure 2, to show relative and absolute increases in word usage.
  • Vocabulary spikes are tied to major public health events: The plots for 2015 and 2017 are dominated by words related to major viral outbreaks. In 2015, 'ebola' shows an 'excess frequency ratio' (a measure of relative increase) of approximately 10. In 2017, 'zika' and 'zikv' show a much larger ratio, with 'zika' being about 40 times more frequent than expected.
  • Outlier words are content-specific and topic-driven: The outlier words are consistently content-specific terms related to emerging scientific topics. Besides disease names, these include 'crispr' in 2015, new cancer drugs ('pembrolizumab', 'nivolumab') in 2016, and the technical term 'convolutional' in 2017, which is associated with the rise of deep learning.
  • Absolute frequency changes remain negligible: Across all three years, the right-hand panels, which show the 'excess frequency gap' (absolute increase), are largely empty. Almost no words cross the significance threshold of 0.01, indicating that even the words with large relative increases did not become frequent enough to cause a substantial absolute shift in the overall vocabulary.
Scientific Validity
  • ✅ Provides a strong historical baseline for comparison: This figure is crucial for establishing the paper's central argument. By showing that historical vocabulary spikes were infrequent, topic-specific, and driven by content words, it provides a strong baseline against which the widespread, stylistic changes of 2024 can be judged as truly unprecedented.
  • ✅ Validates the method's ability to detect real-world signals: The figure demonstrates the sensitivity and validity of the 'excess word' detection method. It correctly identifies and reflects the impact of known real-world events (Ebola, Zika) and major scientific advances (CRISPR, immunotherapy) on the scientific literature of the time.
  • ✅ Reinforces the localized nature of past vocabulary shifts: The consistent finding of negligible absolute frequency gaps (the right-hand panels) across these years is a key piece of evidence. It reinforces that past vocabulary shifts, while notable, were confined to niche topics and did not represent a broad change in the language of science, unlike the effect attributed to LLMs.
Communication
  • ✅ Effective and consistent multi-year layout: The consistent three-row, two-panel layout is effective for comparing vocabulary shifts across multiple years. It allows the reader to quickly grasp the patterns (or lack thereof) in the historical data by maintaining a uniform visual structure.
  • 💡 Caption is not self-contained: The caption is too minimal, merely pointing to Figure 2 for context. It misses the opportunity to guide the reader's interpretation. A more informative caption, such as "Excess words from 2015-2017, showing that vocabulary spikes were driven by specific content words related to disease outbreaks (Ebola, Zika) and new technologies (CRISPR), with negligible changes in absolute frequency gaps," would greatly enhance the figure's self-containment.
  • 💡 Overlapping text labels reduce clarity: In several panels, text labels for data points overlap, hindering readability (e.g., 'miseq' and 'https' in 2015; 'psma' and 'microcephaly' in 2017). Using a text repulsion algorithm or manually adjusting label positions would improve the clarity of these important annotations.
Figure S3: Excess words in 2018–2020. See Figure 2 for explanations.
First Reference in Text
Not explicitly referenced in main text
Description
  • Historical analysis of excess words from 2018-2020: This figure continues the historical analysis of 'excess words' for the years 2018, 2019, and 2020. It uses the same two-panel format per year, showing relative increase ('excess frequency ratio') on the left and absolute increase ('excess frequency gap') on the right.
  • 2018-2019 show a continued quiet baseline: The years 2018 and 2019 continue the baseline trend seen in previous figures. The few outlier words are content-specific technical terms (e.g., 'convolutional', 'circrnas', 'adversarial'), and critically, the right-hand panels show that no words cross the absolute frequency gap threshold of 0.01.
  • Explosion of COVID-19 content words in 2020: The year 2020 marks a dramatic shift. The left panel shows an explosion of excess words, all clearly related to the COVID-19 pandemic, such as 'wuhan', 'sars', 'coronavirus', and 'pandemic'.
  • First significant absolute frequency gap emerges in 2020: Most importantly, the right-hand panel for 2020 is the first in the historical series to show a significant signal. Multiple words, including 'coronavirus', 'sars', 'disease', and 'patients', cross the 0.01 absolute frequency gap threshold. This signifies a major, widespread shift in the biomedical vocabulary driven by a single, dominant topic.
Scientific Validity
  • ✅ Validates the method on a known major event: This figure powerfully demonstrates the method's ability to detect and quantify the impact of a massive, real-world event on scientific literature. The clear signal in 2020 validates the entire analytical approach.
  • ✅ Establishes a crucial example of a content-driven shift: The 2020 data provides a perfect 'control' case for a content-driven vocabulary shift. It establishes what a large-scale change looks like when it's about a topic. This is essential for the paper's ultimate argument that the LLM-driven shift is different because it's about style.
  • ✅ Provides strong evidence for the impact of the COVID-19 pandemic: The stark contrast between 2019 and 2020 provides extremely strong evidence supporting the claim that the COVID-19 pandemic had an unprecedented effect on biomedical publishing, a key secondary finding mentioned in the paper.
Communication
  • ✅ Clear narrative progression: The figure's layout, presenting each year sequentially, effectively builds a narrative. The visual contrast between the quiet plots of 2018-2019 and the explosive plot of 2020 is stark and immediately communicates the onset of a major event.
  • 💡 Caption lacks self-containment: The caption is minimal and relies on the reader referring back to Figure 2. To improve its self-containment, it could briefly summarize the key finding, for instance: "Excess words from 2018-2020, culminating in the 2020 emergence of a large, content-driven frequency gap associated with the COVID-19 pandemic."
  • 💡 Overlapping text labels in the 2020 panel: In the 2020 panel showing the frequency ratio, the numerous labels for COVID-related terms are densely packed and overlap significantly, making them difficult to read. Employing a text repulsion algorithm or staggering the labels would greatly improve the clarity of this key panel.
Figure S4: Excess words in 2021–2023. See Figure 2 for explanations.