Paper Review: The Impact of Large Language Models on Scientific Writing: Evidence from a Large-Scale Analysis of PubMed Abstracts

Table of Contents

  1. Abstract
  2. Introduction
  3. Results
  4. Discussion
  5. Methods
  6. References
  7. Acknowledgements

Overall Summary

Overview

This study investigates the influence of Large Language Models (LLMs), specifically ChatGPT, on scientific writing by analyzing changes in vocabulary usage within a dataset of 14 million PubMed abstracts from 2010 to 2024. The researchers employed a novel "excess word usage" approach, inspired by excess mortality studies, to identify words whose frequency increased unusually sharply after ChatGPT's release. This method allows for an unbiased assessment of LLM impact without relying on pre-existing assumptions about LLM usage patterns.

Key Findings

  • The study estimates that at least 10% of PubMed abstracts in 2024 were likely processed with LLMs, with variations observed across disciplines, countries, and journals.
  • A significant increase in the frequency of certain stylistic words, primarily verbs and adjectives, was observed in 2024, suggesting a potential influence of LLMs on scientific writing style.
  • The 2024 excess vocabulary consisted almost entirely of style words, contrasting with previous years, including the COVID-19 pandemic period, where content words dominated.
  • Higher estimated LLM usage was found in computational fields, certain non-English speaking countries, and journals with expedited review processes.
  • The impact of LLMs on scientific writing is unprecedented, surpassing the effect of major world events like the COVID-19 pandemic, as measured by changes in vocabulary usage.

Strengths

  • The study utilizes a novel and unbiased "excess word usage" approach, avoiding the limitations of existing LLM detection methods that rely on ground-truth training sets.
  • The analysis is based on a vast dataset of 14 million PubMed abstracts, providing a comprehensive and robust assessment of LLM impact on scientific writing.
  • The researchers meticulously documented their methodology, ensuring transparency and reproducibility of the findings.
  • The study provides a balanced discussion of both the potential benefits and risks associated with LLM usage in scientific writing.
  • The findings highlight the significant and unprecedented influence of LLMs on scientific writing, prompting a critical discussion on the ethical and practical implications of this emerging trend.

Areas for Improvement

  • While the linear extrapolation method for calculating counterfactual frequencies is conservative, a more detailed discussion of its limitations and potential biases would strengthen the analysis.
  • A more systematic analysis of the factors contributing to the heterogeneity in estimated LLM usage across different categories would provide more concrete insights.
  • A deeper exploration of the ethical implications of widespread LLM usage in science, particularly regarding authorship, plagiarism, and bias reinforcement, would enhance the practical relevance of the findings.

Significant Elements

  • Figure 2 (Scatter Plot): Visualizes the significant increase in frequency of specific words in 2024, particularly style words, highlighting the magnitude of these changes and allowing for easy identification of the most affected words.
  • Figure 5 (Combination Chart): Presents a comprehensive overview of estimated LLM usage across various categories (fields, countries, journals), highlighting significant variations and reinforcing the argument that LLM usage is widespread but context-dependent.

Conclusion

This study provides compelling evidence for the significant and unprecedented impact of LLMs on scientific writing. The findings raise important questions about the future of academic publishing and the need for guidelines and regulations surrounding LLM use. Continued research and monitoring are crucial to understanding the evolving role of LLMs in shaping scientific communication and ensuring the integrity and quality of published research.

Abstract

Summary

This abstract presents a study investigating the usage of Large Language Models (LLMs), particularly ChatGPT, in academic writing. The researchers analyzed a vast dataset of 14 million PubMed abstracts spanning from 2010 to 2024, focusing on vocabulary changes. They observed a significant increase in the frequency of certain stylistic words after the release of ChatGPT, indicating a potential influence of LLMs on scientific writing. The study estimates that at least 10% of 2024 abstracts were likely processed with LLMs, with variations across disciplines, countries, and journals. The authors highlight the unprecedented impact of LLM-based writing assistants on scientific literature, surpassing the effect of major events like the Covid pandemic.

Strengths

  • The abstract effectively summarizes the key findings of the study, including the estimated prevalence of LLM usage in PubMed abstracts.

    'Our analysis based on excess words usage suggests that at least 10% of 2024 abstracts were processed with LLMs.' (p. 1)
  • The abstract highlights the novelty of the approach, emphasizing its unbiased and data-driven nature.

    'To answer this question, we use an unbiased, large-scale approach, free from any assumptions on academic LLM usage.' (p. 1)
  • The abstract clearly states the significance of the findings, emphasizing the unprecedented impact of LLMs on scientific writing.

    'We show that the appearance of LLM-based writing assistants has had an unprecedented impact in the scientific literature, surpassing the effect of major world events such as the Covid pandemic.' (p. 1)

Suggestions for Improvement

  • While the abstract mentions variations in LLM usage across different categories, it could briefly mention specific examples of these variations to provide a more concrete understanding of the findings.

    'This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora.' (p. 1)

Introduction

Summary

The introduction of this paper discusses the impact of major events and technological advancements on human writing, particularly in the context of scientific literature. It highlights the emergence of LLMs, specifically ChatGPT, and their growing integration into academic writing tasks. The authors acknowledge both the potential benefits and concerns associated with LLM usage, including improved writing quality, potential factual errors, bias reinforcement, and plagiarism. The introduction reviews existing approaches to quantify LLM usage in scientific papers, such as LLM detectors, word frequency distribution models, and marker word lists. It identifies limitations of these approaches, particularly their reliance on ground-truth training sets and assumptions about LLM usage patterns. The authors propose a novel, data-driven approach called "excess word usage" to track LLM influence, inspired by studies of excess mortality during the Covid pandemic. This approach aims to identify words with unusually high frequency increases after the release of ChatGPT compared to pre-LLM years, without relying on pre-existing assumptions about LLM usage.

Strengths

  • The introduction effectively establishes the context of the study by discussing the impact of historical events and technological advancements on writing styles.

    'When the world changes, human-written text changes. Major events like wars and revolutions affect word frequency distributions in text corpora (Bochkarev et al., 2014).' (p. 1)
  • The introduction provides a comprehensive overview of existing methods for detecting LLM usage in scientific papers, highlighting their strengths and limitations.

    'Recent approaches attempting to quantify the increasing use of LLMs in scientific papers fall in three groups.' (p. 1)
  • The introduction clearly articulates the limitations of existing approaches and justifies the need for a novel, data-driven method.

    'All of these approaches share a common limitation: they require a ground-truth training set of LLM- and human-written texts.' (p. 1)
  • The introduction clearly introduces the proposed "excess word usage" approach and explains its inspiration from excess mortality studies.

    'Here, we suggest a novel, data-driven, and unbiased approach to track LLM usage in academic texts without these limitations: excess word usage.' (p. 2)

Suggestions for Improvement

  • The introduction could benefit from a more explicit statement of the research questions or hypotheses guiding the study.

    'This begs the question if the nature and magnitude of the observed changes are comparable to changes that regularly occur due to changing fashions, rising research topics, and global events such as the Covid-19 pandemic—or if LLMs impact scientific writing in an unprecedented way.' (p. 2)

Visual Elements Analysis

Figure 1

Type: Figure

Visual Type: Line Graph

Description: Figure 1 presents nine line graphs arranged in a 3x3 grid, each depicting the frequency of a specific word in PubMed abstracts over time (2012-2024). The words are: 'delves', 'crucial', 'potential', 'these', 'significant', 'important', 'pandemic', 'ebola', and 'convolutional'. Each graph shows a blue line for the observed frequency and a black line representing the counterfactual extrapolation of the 2021-2022 trend into 2023-2024. The first six words are identified as potentially affected by ChatGPT, while the last three are related to major events. Notably, words like 'delves', 'crucial', and 'potential' show a sharp increase in frequency in 2023-2024, deviating from the extrapolated trend. In contrast, words like 'pandemic' and 'ebola' show peaks corresponding to specific events but follow the extrapolated trend in 2023-2024.

Relevance: Figure 1 supports the introduction's argument about the impact of events and technologies on word usage. It visually demonstrates how certain words experience a sudden surge in frequency after the release of ChatGPT, suggesting a potential LLM influence. The comparison with event-related words highlights the distinct pattern observed for words potentially affected by LLMs.

Visual Critique

Appropriateness: Line graphs are an appropriate choice for showing word-frequency trends over time: they convey the continuous nature of the data and make deviations from the extrapolated trend in 2023-2024 easy to spot.

Strengths
  • Clear labeling of axes and data series
  • Use of different colors to distinguish observed and extrapolated trends
  • Arrangement in a grid for easy comparison across words
Suggestions for Improvement
  • Consider using a logarithmic scale for the y-axis to better visualize differences in less frequent words
  • Add gridlines to facilitate more precise reading of values
  • Provide a brief explanation in the caption about the choice of words and their relevance to the study

Detailed Critique

Analysis Of Presented Data: The figure presents frequency trends for multiple words over time (2012-2024) and clearly shows the sudden increase in frequency for certain words (e.g., 'delves') after 2022, coinciding with the release of ChatGPT.

Statistical Methods: The use of frequency counts and counterfactual extrapolation is a reasonable approach to visualize potential changes in word usage. However, the figure could benefit from the inclusion of confidence intervals or error bars to indicate the precision of these frequency estimates.

Assumptions And Limitations: The analysis assumes that changes in word frequency are primarily due to the influence of LLMs. It's limited by the lack of control for other factors that might influence word usage in scientific writing. The counterfactual extrapolation assumes a linear trend, which may not always hold true.

Improvements And Alternatives: 1) Include a statistical test for the significance of frequency changes pre- and post-2022. 2) Consider using a normalized frequency measure (e.g., words per million) to account for potential changes in the total volume of abstracts over time. 3) Explore alternative extrapolation methods or statistical models to account for non-linear trends.
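The significance test suggested in point 1 could be as simple as a two-proportion z-test on yearly occurrence counts; the sketch below uses made-up counts purely for illustration, not figures from the paper.

```python
from math import sqrt

def two_proportion_z(k1, n1, k2, n2):
    """z-statistic comparing a word's occurrence proportion in two years:
    k occurrences among n abstracts in each year (pooled-variance version)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                   # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p2 - p1) / se

# Hypothetical counts: 50 of 100,000 abstracts in 2022 vs 1,500 of 100,000 in 2024
z = two_proportion_z(50, 100_000, 1500, 100_000)  # far beyond any usual cutoff
```

Any such test applied across the full vocabulary would of course need a multiple-comparison correction, since tens of thousands of words are screened.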

Consistency And Comparisons: The presentation is consistent across different words, facilitating easy comparison. However, it would be beneficial to include a baseline trend of overall abstract volume or total word count for reference.

Sample Size And Reliability: While the figure doesn't explicitly state the sample size, it's based on a large corpus of PubMed abstracts, which likely provides reliable estimates. However, including the number of abstracts analyzed each year would strengthen the reliability assessment.

Interpretation And Context: The figure effectively illustrates the dramatic increase in usage of certain words post-2022, supporting the hypothesis of LLM influence on scientific writing. However, it's important to note that correlation doesn't imply causation, and other factors could contribute to these trends.

Confidence Rating: 4

Confidence Explanation: This analysis is based on standard practices in corpus linguistics and trend analysis. The large dataset and clear trends provide high confidence in the observations. However, the lack of statistical significance testing and control for confounding factors prevents a perfect confidence score.

Results

Summary

The Results section details the researchers' analysis of a vast dataset of 14.4 million PubMed abstracts spanning from 2010 to 2024. They calculated the frequency of each word per year, focusing on words with a frequency above 10^-4 in both 2023 and 2024. To quantify the increase in word frequency after the release of ChatGPT, they calculated the counterfactual expected frequency in 2024 based on a linear extrapolation of word frequencies from 2021 and 2022. They defined two measures of excess usage: excess frequency gap (δ) and excess frequency ratio (r). The analysis revealed a significant increase in the frequency of certain words in 2024, particularly style words like 'delves', 'showcasing', and 'underscores'. This increase was unprecedented compared to previous years, including the period of the Covid pandemic, which mainly saw an increase in content words. The researchers categorized excess words into content words and style words, finding that the 2024 excess vocabulary consisted almost entirely of style words, primarily verbs and adjectives. This finding suggests a potential influence of LLMs like ChatGPT on scientific writing.
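In code, the two excess measures reduce to a few lines. This is a sketch under the stated assumption that the counterfactual simply extends the 2021-to-2022 slope linearly to 2024; the paper's Methods section gives the exact procedure.

```python
def excess_usage(p):
    """Excess frequency gap (delta) and ratio (r) for one word,
    given its yearly occurrence frequencies p[year]."""
    slope = p[2022] - p[2021]               # per-year linear trend
    q2024 = max(p[2022] + 2 * slope, 1e-9)  # counterfactual 2024 frequency
    delta = p[2024] - q2024                 # excess frequency gap
    r = p[2024] / q2024                     # excess frequency ratio
    return delta, r

# Illustrative numbers only: a word rising from 0.0010 (2022) to 0.0100 (2024)
delta, r = excess_usage({2021: 0.0009, 2022: 0.0010, 2024: 0.0100})
```

Words scoring high on either measure relative to the threshold line in Figure 2 are the "excess words" the rest of the analysis builds on.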

Strengths

  • The Results section clearly presents the methodology for calculating word frequencies and excess usage, making the analysis transparent and reproducible.

    'We then computed the matrix of word occurrences that shows which abstracts contain which words, resulting in a 14.4 M × 2.4 M sparse binary matrix. For each word, we obtained its occurrence frequency per year by normalizing with the total number of papers published in that year.' (p. 2)
  • The section effectively uses visual representations (Figure 2) to illustrate the distribution of excess words and their frequency changes, making the findings more accessible and impactful.

    'Across all these words, we found many with strong excess usage in 2024 (Figure 2).' (p. 2)
  • The researchers acknowledge the potential influence of the Covid pandemic on word usage and provide a comparative analysis, strengthening the validity of their findings about the unique impact of LLMs.

    'This changed during the Covid pandemic: in 2020–2022 words like coronavirus, lockdown, and pandemic showed very large excess usages (up to r > 1000 and δ = 0.037), in agreement with the observation that the Covid pandemic had an unprecedented effect on biomedical publishing (González-Márquez et al., 2024).' (p. 2)

Suggestions for Improvement

  • While the section mentions a linear extrapolation for calculating counterfactual frequencies, it could benefit from a more detailed explanation of this method and its potential limitations. Providing a formula or a visual illustration of the extrapolation process would enhance clarity.

    'To quantify this increase, we calculated counterfactual expected frequency in 2024 based on the linear extrapolation of word frequencies in 2021 and 2022 (see Methods).' (p. 2)
  • The section could include a statistical analysis of the significance of the observed frequency changes. Performing hypothesis tests to compare pre- and post-ChatGPT word frequencies would provide stronger evidence for the claim of LLM influence.

    'We found that we could obtain a very similar lower bound using a non-overlapping group of only ten excess style words with high individual δ values: across, additionally, comprehensive, crucial, enhancing, exhibited, insights, notably, particularly, within.' (p. 4)
  • The analysis focuses on words with a frequency above 10^-4. While this threshold is justified, it could be informative to explore the frequency changes of less common words as well. Analyzing a wider range of word frequencies might reveal additional insights into the impact of LLMs.

    'In the following analysis, we focused on 26.6 K words with frequency p above 10−4 in both 2023 and 2024.' (p. 2)

Visual Elements Analysis

Figure 2

Type: Figure

Visual Type: Scatter Plot

Description: Figure 2 consists of two scatter plots, labeled (a) and (b), showing words with increased frequency in 2024. Plot (a) has word frequency in 2024 on the x-axis (logarithmic scale) and the excess frequency ratio r (observed 2024 frequency over its counterfactual expectation) on the y-axis (logarithmic scale). A dashed line represents the threshold for defining excess words. Words with r > 90 are shown at r = 90. Plot (b) has the same x-axis but shows the excess frequency gap δ (observed minus counterfactual frequency) on the y-axis. Both plots highlight specific words with significant frequency increases, such as 'delves', 'showcasing', 'underscores', 'potential', 'findings', and 'crucial'.

Relevance: Figure 2 directly supports the key findings of the Results section by visually demonstrating the significant increase in frequency of certain words in 2024, particularly style words. It highlights the magnitude of these changes and allows for easy identification of words with the most pronounced increases.

Visual Critique

Appropriateness: The use of scatter plots is appropriate for showing the relationship between word frequency and excess usage measures. It allows for the visualization of individual data points and the identification of potential trends or outliers.

Strengths
  • Clear labeling of axes and data points
  • Use of logarithmic scales to accommodate a wide range of values
  • Highlighting of specific words for emphasis
  • Inclusion of a threshold line to define excess words
Suggestions for Improvement
  • Consider adding a color gradient to the data points to represent the frequency of words in 2024, providing additional visual information.
  • Include a brief explanation in the caption about the choice of words labeled on the plots and their relevance to the study.
  • Add gridlines to facilitate more precise reading of values on both axes.

Detailed Critique

Analysis Of Presented Data: The figure effectively presents the distribution of words with increased frequency in 2024, showing a clear trend of higher frequency ratios and gaps for certain style words. The logarithmic scales allow for the visualization of both frequent and infrequent words.

Statistical Methods: The use of frequency ratios and gaps is a straightforward approach to quantify changes in word usage. However, the figure could benefit from the inclusion of confidence intervals or error bars to indicate the precision of these estimates.

Assumptions And Limitations: The analysis assumes that the observed frequency changes are primarily due to the influence of LLMs. It's limited by the lack of control for other factors that might influence word usage in scientific writing.

Improvements And Alternatives: 1) Perform statistical tests to assess the significance of frequency changes pre- and post-ChatGPT. 2) Consider using a normalized frequency measure (e.g., words per million) to account for potential changes in the total volume of abstracts over time.

Consistency And Comparisons: The presentation is consistent with the textual description of the results, providing a clear visual representation of the findings. However, it would be beneficial to include a reference point or baseline trend for comparison.

Sample Size And Reliability: The figure is based on a large dataset of PubMed abstracts, which likely provides reliable estimates. However, including the number of abstracts analyzed each year would strengthen the reliability assessment.

Interpretation And Context: The figure supports the hypothesis of LLM influence on scientific writing by showing a significant increase in the frequency of certain style words in 2024. However, it's important to note that correlation doesn't imply causation, and other factors could contribute to these trends.

Confidence Rating: 4

Confidence Explanation: The figure effectively visualizes the data and highlights key findings. However, the lack of statistical significance testing and control for confounding factors prevents a perfect confidence score.

Figure 3

Type: Figure

Visual Type: Combination Chart (Line Graph and Bar Chart)

Description: Figure 3 comprises two subplots: (a) shows the number of excess words per year, categorized into content words, style words, and other words. It also highlights the word with the highest frequency ratio (r) among excess words with p > 10^-3 and r > 3 for each year. (b) presents the number of excess words per year, categorized into nouns, verbs, adjectives, and other parts of speech. The figure highlights a significant increase in excess style words in 2024, primarily verbs and adjectives, compared to previous years, which were dominated by content words, mostly nouns.

Relevance: Figure 3 provides further evidence for the impact of LLMs on scientific writing by showing the unprecedented increase in excess style words in 2024. It supports the researchers' claim that the 2024 excess vocabulary is qualitatively different from previous years, suggesting a distinct influence of LLMs on writing style.

Visual Critique

Appropriateness: The use of a combination chart is appropriate for showing both the trend of excess words over time and their categorical breakdown. The line graph effectively visualizes the overall trend, while the bar chart allows for easy comparison across categories.

Strengths
  • Clear labeling of axes and data series
  • Use of different colors to distinguish word categories
  • Highlighting of the word with the highest frequency ratio for each year
  • Separate subplots for different categorization schemes
Suggestions for Improvement
  • Consider using a stacked bar chart in subplot (b) to show the cumulative number of excess words per year, providing a more comprehensive view of the data.
  • Add a legend to subplot (a) to identify the word categories represented by different colors.
  • Include a brief explanation in the caption about the criteria used to select the highlighted word for each year.

Detailed Critique

Analysis Of Presented Data: The figure effectively illustrates the dramatic increase in excess style words in 2024, particularly verbs and adjectives, compared to previous years, which were dominated by content words, mostly nouns. This pattern supports the hypothesis of LLM influence on scientific writing.

Statistical Methods: The use of frequency counts and categorization based on word type is a reasonable approach to analyze changes in vocabulary. However, the figure could benefit from the inclusion of statistical tests to assess the significance of these changes.

Assumptions And Limitations: The analysis assumes that the observed changes in word usage are primarily due to the influence of LLMs. It's limited by the lack of control for other factors that might influence word choice in scientific writing.

Improvements And Alternatives: 1) Perform statistical tests to compare the proportion of content words and style words in excess vocabulary pre- and post-ChatGPT. 2) Explore alternative categorization schemes based on word function or semantic categories.

Consistency And Comparisons: The presentation is consistent with the textual description of the results, providing a clear visual representation of the findings. The comparison across different years highlights the unique pattern observed in 2024.

Sample Size And Reliability: The figure is based on a large dataset of PubMed abstracts, which likely provides reliable estimates. However, including the number of abstracts analyzed each year would strengthen the reliability assessment.

Interpretation And Context: The figure supports the hypothesis of LLM influence on scientific writing by showing a significant shift in the type of excess words in 2024. However, it's important to note that correlation doesn't imply causation, and other factors could contribute to these trends.

Confidence Rating: 4

Confidence Explanation: The figure effectively visualizes the data and highlights key findings. However, the lack of statistical significance testing and control for confounding factors prevents a perfect confidence score.

Figure 4

Type: Figure

Visual Type: Line Graph

Description: Figure 4 consists of two line plots, labeled (a) and (b). Plot (a) shows the observed frequency (P) and counterfactual expected frequency (Q) in 2024 of abstracts containing at least one of the excess style words from 2024 with frequency p below a given threshold. The x-axis represents the threshold for word frequency (logarithmic scale), and the y-axis represents the frequency of abstracts. Plot (b) shows the frequency gap (Δ = P - Q) as a function of the threshold. The figure demonstrates that the frequency gap is maximized at a threshold of approximately 0.01, resulting in a frequency gap of 0.111.

Relevance: Figure 4 introduces the concept of using excess style words as markers of ChatGPT usage and demonstrates how combining multiple words can increase the lower bound on LLM usage estimation. It provides a visual representation of the method used to calculate the lower bound and highlights the optimal threshold for maximizing the frequency gap.
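The scan in Figure 4 can be mimicked on toy data. The corpus, per-word frequencies, and the independence assumption used here to combine per-word counterfactuals into Q are all ours, for illustration only; the paper computes P and Q directly at the abstract level.

```python
def combined_gap(abstracts, q_word, group):
    """Delta = P - Q for a word group: P is the observed fraction of abstracts
    containing at least one group word; Q combines the per-word counterfactual
    frequencies q_word assuming independent occurrences (a toy simplification)."""
    group = set(group)
    P = sum(1 for a in abstracts if group & a) / len(abstracts)
    none_expected = 1.0
    for w in group:
        none_expected *= 1.0 - q_word[w]
    return P - (1.0 - none_expected)

def best_threshold(abstracts, p_word, q_word, thresholds):
    """Restrict the group to words with 2024 frequency below each threshold t
    and return the t that maximizes the gap (cf. Figure 4b)."""
    scored = []
    for t in thresholds:
        group = [w for w in p_word if p_word[w] < t]
        scored.append((combined_gap(abstracts, q_word, group), t))
    gap, t = max(scored)
    return t, gap

# Toy corpus: four abstracts represented as sets of candidate marker words
abstracts = [{"delves"}, {"crucial", "results"}, {"results"}, {"underscores"}]
p_word = {"delves": 0.25, "crucial": 0.25, "underscores": 0.25, "results": 0.50}
q_word = {"delves": 0.01, "crucial": 0.05, "underscores": 0.02, "results": 0.45}
t, gap = best_threshold(abstracts, p_word, q_word, [0.3, 0.6])
```

In this toy, including the common word 'results' inflates Q more than P, so the lower threshold wins, which mirrors why the figure's gap peaks at an intermediate threshold rather than growing with every added word.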

Visual Critique

Appropriateness: The use of line graphs is appropriate for showing the relationship between the threshold and the frequency gap. It effectively visualizes the continuous nature of the data and allows for easy identification of the optimal threshold.

Strengths
  • Clear labeling of axes and data series
  • Use of different line styles to distinguish observed and expected frequencies
  • Inclusion of a vertical line to mark the optimal threshold
Suggestions for Improvement
  • Consider adding a shaded area to represent the confidence interval around the observed frequency, providing a visual indication of the uncertainty in the estimate.
  • Include a brief explanation in the caption about the meaning of the optimal threshold and its relevance to LLM usage estimation.
  • Add gridlines to facilitate more precise reading of values on both axes.

Detailed Critique

Analysis Of Presented Data: The figure effectively demonstrates how the frequency gap between observed and expected frequencies varies with the threshold for word frequency. It highlights the optimal threshold for maximizing the frequency gap, which is used to estimate the lower bound on LLM usage.

Statistical Methods: The use of frequency counts and counterfactual extrapolation is a reasonable approach to estimate LLM usage. However, the figure could benefit from the inclusion of statistical tests to assess the significance of the observed frequency gap.

Assumptions And Limitations: The analysis assumes that the observed frequency gap is primarily due to the influence of LLMs. It's limited by the lack of control for other factors that might influence word usage in scientific writing. The counterfactual extrapolation assumes a linear trend, which may not always hold true.

Improvements And Alternatives: 1) Perform statistical tests to compare the observed frequency with the expected frequency at the optimal threshold. 2) Explore alternative extrapolation methods or statistical models to account for non-linear trends.

Consistency And Comparisons: The presentation is consistent with the textual description of the method, providing a clear visual representation of the analysis. However, it would be beneficial to include a reference point or baseline trend for comparison.

Sample Size And Reliability: The figure is based on a large dataset of PubMed abstracts, which likely provides reliable estimates. However, including the number of abstracts analyzed each year would strengthen the reliability assessment.

Interpretation And Context: The figure supports the hypothesis of LLM influence on scientific writing by showing a significant frequency gap at the optimal threshold. However, it's important to note that this is only a lower bound estimate, and the true LLM usage is likely higher.

Confidence Rating: 4

Confidence Explanation: The figure effectively visualizes the data and highlights key findings. However, the lack of statistical significance testing and control for confounding factors prevents a perfect confidence score.

Discussion

Summary

The Discussion section delves into the implications of the findings, highlighting the unprecedented impact of LLMs on scientific writing. The authors acknowledge the potential benefits of LLMs, such as improving grammar and readability, but also emphasize the risks, including factual errors, bias reinforcement, and plagiarism. They discuss the heterogeneity in estimated LLM usage across fields, countries, and journals, suggesting possible explanations related to LLM adoption, linguistic backgrounds, and publication timelines. The authors compare their findings to previous studies, emphasizing the novelty and comprehensiveness of their approach. They conclude by discussing the potential consequences of widespread LLM usage in science and call for a reassessment of policies and regulations surrounding LLM use in academic writing.

Strengths

  • The Discussion section effectively contextualizes the findings by comparing them to previous studies and highlighting the novelty and comprehensiveness of the current research.

    'Our results go beyond other studies on detecting LLM fingerprints in academic writing.' (p. 5)
  • The section provides a balanced discussion of both the potential benefits and risks associated with LLM usage in scientific writing, acknowledging the complexities of this emerging trend.

    'Scientists use LLM-assisted writing because LLMs can improve grammar, rhetoric, and overall readability of their texts, help translate to English, and quickly generate summaries (Van Veen et al., 2024; Zhang et al., 2024). However, LLMs are infamous for making up references (Walters and Wilder, 2023), providing inaccurate summaries (Tang et al., 2024; Kim et al., 2024), and making false claims that sound authoritative and convincing (Mittelstadt et al., 2023; Ji et al., 2023; Zhang et al., 2023; Zheng and Zhan, 2023).' (p. 5)
  • The section explores potential explanations for the observed heterogeneity in LLM usage across different categories, considering factors beyond simple adoption rates and highlighting the need for further investigation.

    'However, the heterogeneity in lower bounds could also point to other factors beyond actual differences in LLM adoption.' (p. 5)
  • The authors conclude with a call for action, urging the academic community to reassess policies and regulations surrounding LLM use in science, and emphasizing the need for ongoing monitoring and research.

    'This trend calls for a reassessment of current policies and regulations around the use of LLMs for science.'p. 5

Suggestions for Improvement

  • While the section discusses potential explanations for the heterogeneity in LLM usage, it could benefit from a more systematic analysis of these factors. Exploring the relationship between LLM usage and variables like author demographics, journal policies, and research funding could provide more concrete insights.

    'This heterogeneity could correspond to actual differences in LLM adoption.'p. 5
  • The section could delve deeper into the ethical implications of widespread LLM usage in science, particularly regarding issues of authorship, plagiarism, and the potential for bias reinforcement. Discussing strategies for mitigating these risks would enhance the practical relevance of the findings.

    'Even worse, it is likely that malign actors such as paper mills will employ LLMs to produce fake publications (Kendall and Teixeira da Silva, 2024).'p. 6
  • The section briefly mentions the potential for LLMs to homogenize scientific writing. Expanding on this point and discussing the implications for creativity, innovation, and the diversity of ideas in scientific discourse would enrich the discussion.

    'Such homogenisation can degrade the quality of scientific writing.'p. 5

Visual Elements Analysis

Figure 5

Type: Figure

Visual Type: Combination Chart (Line Graph and Scatter Plot)

Description: Figure 5 presents a comprehensive overview of the estimated LLM usage across various categories. Subplot (a) shows the frequency of abstracts containing at least one word from two word groups: 'common words' and 'rare words'. Subplots (b), (c), and (d) display scatter plots comparing the frequency gap based on 'rare words' and 'common words' for different fields, countries, and journals, respectively. Subplot (e) shows the frequency of abstracts containing specific words for various PubMed subsets, with the average frequency gap (Δ) indicated for each subset. The figure highlights significant variations in estimated LLM usage, with higher usage in computational fields, certain non-English speaking countries, and journals with expedited review processes.

Relevance: Figure 5 directly supports the discussion on the heterogeneity in estimated LLM usage across different categories. It provides a visual representation of the variations in frequency gaps, allowing for easy comparison across fields, countries, and journals. The figure reinforces the authors' argument that LLM usage is widespread but varies significantly depending on contextual factors.

Visual Critique

Appropriateness: The use of a combination chart is appropriate for presenting both the overall trend of word usage and the comparative analysis across categories. The line graph in subplot (a) effectively shows the increasing frequency of abstracts containing specific word groups, while the scatter plots in subplots (b), (c), and (d) allow for easy comparison of frequency gaps across different categories.

Strengths
  • Clear labeling of axes and data points
  • Use of different colors and symbols to distinguish categories
  • Inclusion of average frequency gap values in subplot (e)
  • Separate subplots for different categories and analysis types
Suggestions for Improvement
  • Consider using a map in subplot (c) to visualize the geographical distribution of LLM usage, providing a more intuitive representation of country-level variations.
  • Add a legend to subplots (b), (c), and (d) to identify the categories represented by different colors or symbols.
  • Include a brief explanation in the caption about the criteria used to select the specific PubMed subsets shown in subplot (e).

Detailed Critique

Analysis Of Presented Data: The figure effectively presents the estimated LLM usage across various categories, highlighting significant variations in frequency gaps. The line graph in subplot (a) shows a clear upward trend for both 'common words' and 'rare words', indicating an increasing prevalence of LLM-influenced abstracts. The scatter plots in subplots (b), (c), and (d) reveal distinct patterns of LLM usage across fields, countries, and journals.

Statistical Methods: The use of frequency gaps based on two word groups ('common words' and 'rare words') is a reasonable approach to estimate LLM usage. However, the figure could benefit from the inclusion of confidence intervals or error bars to indicate the precision of these estimates.
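To illustrate the suggested addition of confidence intervals, a percentile bootstrap over abstracts is one simple option. The sketch below uses simulated 0/1 indicator data and hypothetical usage rates; it is not taken from the paper:

```python
import random

def bootstrap_gap_ci(flags_a, flags_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in the fraction of
    abstracts containing a marker word between two groups.

    flags_a, flags_b: lists of 0/1 indicators (1 = abstract contains the word).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement and record the gap
        ra = [rng.choice(flags_a) for _ in flags_a]
        rb = [rng.choice(flags_b) for _ in flags_b]
        gaps.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Simulated example: 8% marker-word usage in one field vs 3% in another,
# 1000 abstracts each (rates and sizes are invented for illustration).
rng = random.Random(42)
group_a = [1 if rng.random() < 0.08 else 0 for _ in range(1000)]
group_b = [1 if rng.random() < 0.03 else 0 for _ in range(1000)]
lo, hi = bootstrap_gap_ci(group_a, group_b)
```

An interval that excludes zero would directly support claims that one subgroup's frequency gap genuinely exceeds another's.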

Assumptions And Limitations: The analysis assumes that the observed frequency gaps are primarily due to the influence of LLMs, and it does not control for other factors that might have shifted word usage in scientific writing. In addition, the estimation method relies on the selection of specific word groups, which may not be fully representative of all LLM-influenced writing.

Improvements And Alternatives: 1) Perform statistical tests to assess the significance of the observed frequency gaps across different categories. 2) Explore alternative word groups or develop a more comprehensive method for identifying LLM-influenced writing.

Consistency And Comparisons: The presentation is consistent with the textual description of the findings, providing a clear visual representation of the variations in LLM usage. The comparison across different categories highlights the complex interplay of factors influencing LLM adoption.

Sample Size And Reliability: The figure is based on a large dataset of PubMed abstracts, which likely provides reliable estimates. However, including the number of abstracts analyzed for each category would strengthen the reliability assessment.

Interpretation And Context: The figure supports the discussion on the heterogeneity in LLM usage, suggesting that LLM adoption is influenced by factors such as field of research, author demographics, and journal policies. It is important to note, however, that these are lower-bound estimates, so true LLM usage is likely higher.

Confidence Rating: 4

Confidence Explanation: The figure effectively visualizes the data and highlights key findings. However, the lack of statistical significance testing and the reliance on specific word groups prevent a perfect confidence score.

Methods

Summary

The Methods section of the paper provides a detailed account of the data collection, pre-processing, statistical analysis, word annotations, subgroup analysis, LLM usage, and data and code availability. The authors describe how they acquired the PubMed dataset, cleaned the abstracts from contaminating strings, computed a binary word occurrence matrix, and performed linear extrapolation to calculate counterfactual word frequencies. They also explain their method for annotating excess words as content or style words and their approach to analyzing subgroups based on countries, journals, fields, and inferred genders. The authors emphasize that they did not use any LLMs for writing the manuscript or performing the data analysis, ensuring the unbiased nature of their study. They also provide information on the availability of their analysis code and the original PubMed data.

Strengths

  • The Methods section provides a clear and comprehensive description of the data collection process, including the source of the data, the time period covered, and the criteria used for selecting abstracts.

    'We downloaded all PubMed abstracts until the end of June 2024 and used all 14.4 M English-language abstracts from 2010 onwards, with only minimal filtering (see Methods).'p. 2
  • The authors meticulously describe the pre-processing steps taken to clean the data, ensuring the removal of extraneous text strings and errata notices, which could have contaminated the analysis.

    'Many abstracts in PubMed data contain strings, usually either in the beginning or in the end, that are not technically part of the abstract text.'p. 6
  • The statistical analysis is well-explained, including the formula used for calculating word frequencies, the method for linear extrapolation, and the thresholds used for defining excess words.

    'To avoid possible divisions by zero, all frequencies were always computed as p = (a + 1)/(b + 1), where a is the number of abstracts in a given year containing a given word, and b is the total number of abstracts in that year.'p. 6
  • The authors provide a transparent account of their word annotation process, explaining how they categorized excess words as content or style words and how they addressed ambiguous cases.

'We identified 829 unique excess words (surpassing thresholds on r or δ) from 2013 to 2024.'p. 6
  • The subgroup analysis is detailed, outlining the specific categories analyzed (countries, journals, fields, and inferred genders) and the criteria used for selecting subgroups.

'For the analysis presented in Figure 5 we separately analysed the following subgroups: 50 countries with the most papers in our dataset; 100 journals with the most papers in our dataset in 2024; all 39 domains taken from González-Márquez et al. (2024) (where domains were assigned based on the journal names, e.g., assigning all papers from The Journal of Neuroscience to the 'neuroscience' domain); male and female inferred genders of the first and of the last authors (inferred via the gender package, Blevins and Mullen, 2015).'p. 7
  • The authors explicitly state that they did not use any LLMs for writing the manuscript or performing the data analysis, reinforcing the objectivity and unbiased nature of their study.

    'We did not use ChatGPT or any other LLMs for writing the manuscript or for performing the data analysis.'p. 7
  • The authors provide clear information on the availability of their analysis code and the original PubMed data, facilitating reproducibility and further research.

    'Our analysis code in Python is available at https://github.com/berenslab/chatgpt-excess-words.'p. 7
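The two formulas quoted above — the smoothed frequency p = (a + 1)/(b + 1) and the linear counterfactual projection — can be illustrated in a few lines of Python. This is a minimal sketch with invented counts, not the authors' implementation (which is available in their linked repository):

```python
def smoothed_frequency(n_containing: int, n_total: int) -> float:
    """p = (a + 1) / (b + 1), avoiding possible divisions by zero."""
    return (n_containing + 1) / (n_total + 1)

def counterfactual_frequency(p_minus3: float, p_minus2: float) -> float:
    """Linear extrapolation from years Y-3 and Y-2 to year Y:
    q = p_-2 + 2 * max(p_-2 - p_-3, 0)."""
    return p_minus2 + 2 * max(p_minus2 - p_minus3, 0.0)

# Example with made-up counts for one word, out of one million abstracts per year:
p3 = smoothed_frequency(1_000, 1_000_000)   # year Y-3
p2 = smoothed_frequency(1_200, 1_000_000)   # year Y-2
p  = smoothed_frequency(4_000, 1_000_000)   # observed in year Y

q = counterfactual_frequency(p3, p2)        # expected frequency in year Y
excess_ratio = p / q                        # how far the word exceeds expectation
excess_gap = p - q                          # absolute excess frequency
```

The ratio p/q and the gap p − q correspond to the r and δ thresholds the paper uses to flag excess words; the max{·, 0} term prevents projecting a declining trend downward.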

Suggestions for Improvement

  • The Methods section could benefit from a more detailed discussion of the limitations of the linear extrapolation method used for calculating counterfactual word frequencies. While the authors acknowledge that this method is conservative, they could elaborate on the potential biases introduced by assuming a linear trend, especially given the rapid evolution of language models and their impact on writing styles.

    'To do the linear extrapolation, we took the frequencies p₋₃ in year Y − 3 and p₋₂ in year Y − 2 and computed the counterfactual projection q = p₋₂ + 2 · max{p₋₂ − p₋₃, 0}.'p. 6
  • The authors could provide more information on the accuracy of the gender inference method used for subgroup analysis. They acknowledge the limitations of this method, but a quantitative assessment of its accuracy would strengthen the reliability of the findings related to gender differences in LLM usage.

    'Our gender inference aims to capture perceived gender based on first name and is only approximate.'p. 7
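The concern about the linear-trend assumption can be made concrete with a small numerical check: for a word whose frequency was already growing exponentially for reasons unrelated to LLMs, the linear counterfactual undershoots and produces spurious excess. The frequencies below are invented for illustration, not drawn from the paper:

```python
# Suppose a word's frequency has doubled every year for purely organic
# reasons (an invented exponential trend, unrelated to LLMs).
organic = {2021: 0.001, 2022: 0.002, 2023: 0.004, 2024: 0.008}

# Linear counterfactual for 2024 from the 2021 and 2022 frequencies,
# following the quoted formula q = p_-2 + 2 * max(p_-2 - p_-3, 0).
p_minus3, p_minus2 = organic[2021], organic[2022]
q = p_minus2 + 2 * max(p_minus2 - p_minus3, 0.0)

# The observed 2024 frequency exceeds the linear projection even though
# no LLM effect is present, so some "excess" can be spurious for
# organically accelerating words; conversely, the same conservatism makes
# the excess a lower bound for genuinely LLM-affected words.
spurious_excess = organic[2024] - q
```

A brief discussion of which excess words might plausibly follow such accelerating organic trends would strengthen the authors' conservativeness argument.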

References

Summary

This section lists the bibliographic references cited throughout the research paper. It includes a diverse range of sources, reflecting the interdisciplinary nature of the study. The references cover topics such as the impact of large language models on academic writing, the detection of AI-generated text, the ethical considerations surrounding LLM use, and the historical analysis of language change in scientific literature.

Strengths

  • The References section is comprehensive and includes a wide range of sources relevant to the study's topic.

Suggestions for Improvement

  • The References section is already well-structured and comprehensive, providing a solid foundation for the research presented in the paper. No specific suggestions for improvement are necessary.

Acknowledgements

Summary

The Acknowledgements section expresses gratitude to various individuals and organizations for their contributions to the research. It acknowledges the support of the Dagstuhl seminar 24122, funded by the Leibniz Center for Informatics. The authors also thank the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for funding through Germany's Excellence Strategy, the German Ministry of Science and Education (BMBF) for funding the Tübingen AI Center, and the Gemeinnützige Hertie-Stiftung. Additionally, they acknowledge the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Rita González-Márquez and the NSF CAREER Grant No IIS-1943506 for supporting Emőke-Ágnes Horvát.

Strengths

  • The Acknowledgements section is concise and effectively expresses gratitude to all individuals and organizations that contributed to the research.

Suggestions for Improvement

  • The Acknowledgements section is already well-written and comprehensive, acknowledging all relevant contributions. No specific suggestions for improvement are necessary.