This study investigates the growing use of Large Language Models (LLMs), specifically GPT-3.5-turbo-0125 (a version of ChatGPT), in scientific writing. The researchers analyzed nearly a million papers from arXiv, bioRxiv, and Nature portfolio journals published between January 2020 and February 2024, using a distributional quantification framework to estimate the prevalence of LLM-modified content. This framework compares word usage patterns between human-written and LLM-generated text to estimate the overall proportion of LLM influence. The study found a steady increase in LLM usage after ChatGPT's release in late 2022, particularly in Computer Science, and explored correlations with first-author preprint posting frequency, research area crowdedness, and paper length. These findings raise concerns about the potential impact of LLMs on scientific integrity and call for further research into responsible LLM use.
Description: Figure 1 displays the estimated fraction of LLM-modified sentences in abstracts across different academic venues over time. It visually demonstrates the increase in LLM usage after the release of ChatGPT, particularly in Computer Science.
Relevance: This figure directly visualizes the central finding of the study: the increasing prevalence of LLM-modified content in scientific writing.
Description: Figure 7, similar to Figure 1, shows the estimated fraction of LLM-modified sentences in introductions across different academic venues over time. This figure reinforces the findings from the abstract analysis and extends them to another crucial section of scientific papers.
Relevance: This figure provides further evidence of the widespread adoption of LLMs in scientific writing, showing consistent trends across different sections of papers.
This study provides compelling evidence for the increasing use of LLMs, particularly ChatGPT, in scientific writing. The observed correlations between LLM usage and factors like preprint posting frequency, research area crowdedness, and paper length raise important questions about the potential impact of LLMs on scientific practice, including concerns about homogenization of research, potential bias, and the need for transparency. Future research should focus on establishing causal relationships, exploring the ethical implications of LLM use, and developing guidelines for responsible LLM integration in scientific writing. This research is crucial for navigating the evolving landscape of scientific communication and ensuring the integrity and diversity of scientific knowledge production in the age of AI.
This paper investigates the increasing use of Large Language Models (LLMs) like ChatGPT in scientific writing. The authors conducted a large-scale analysis of nearly a million papers published on arXiv, bioRxiv, and in Nature portfolio journals between January 2020 and February 2024. They used a statistical framework to estimate the prevalence of LLM-modified content over time and across different academic fields. The study found a steady increase in LLM usage, with the highest growth in Computer Science papers. Furthermore, the analysis revealed that higher LLM modification correlates with more frequent first-author preprint posting, more crowded research areas, and shorter papers.
The abstract clearly states the research question: to measure the extent of LLM use in academic writing and its potential impact on scientific practices. This focus provides a strong foundation for the study.
The abstract effectively summarizes the methodology, including the statistical framework used and the large dataset analyzed. This transparency strengthens the credibility of the findings.
The abstract clearly presents the key findings, including the overall increase in LLM usage and its association with specific factors. This concise presentation of results makes the study's contributions readily apparent.
While the abstract mentions the increase in LLM usage, providing specific percentages for the growth observed in different fields would strengthen the impact.
Rationale: Quantifying the findings would provide a more concrete understanding of the scale of LLM adoption in different fields.
Implementation: Include specific percentage increases observed in Computer Science and other fields, e.g., 'up to X% in Computer Science compared to Y% in Mathematics'.
The abstract could briefly mention the potential implications of the findings for scientific practices. This would broaden the context and highlight the study's significance.
Rationale: Highlighting the implications would underscore the importance of the research and its potential impact on the future of scientific writing.
Implementation: Add a sentence briefly discussing the potential implications, e.g., 'These findings raise important questions about the future of academic writing and the role of LLMs in scientific research.'
While the abstract mentions LLMs like ChatGPT, explicitly stating which LLM (e.g., GPT-3.5) was used in the study would enhance clarity and reproducibility.
Rationale: Specifying the LLM used would provide crucial information for researchers interested in replicating or building upon the study.
Implementation: Replace "LLMs like ChatGPT" with the specific model used, e.g., "GPT-3.5-turbo-0125".
The introduction discusses the growing use of LLMs like ChatGPT in academic writing and the need to measure its prevalence. It highlights the challenges of detecting LLM-generated text at the individual level and emphasizes the importance of a large-scale analysis to understand the structural factors motivating LLM use and its impact on scientific publishing. The introduction also mentions concerns about accuracy, plagiarism, and ownership, leading some institutions to restrict LLM use in publications. Finally, it introduces the paper's goal: to conduct a systematic, large-scale analysis to quantify the prevalence of LLM-modified content across multiple academic platforms.
The introduction effectively sets the context by mentioning anecdotal examples of LLM use in papers and peer reviews, highlighting both the humor and concern surrounding this practice.
The introduction clearly explains the rationale for a large-scale analysis, emphasizing its importance in understanding structural motivations and capturing subtle shifts that individual-level analysis might miss.
The introduction clearly states the paper's objective: to conduct a systematic, large-scale analysis to quantify LLM-modified content. This provides a clear direction for the reader.
While the introduction defines "LLM-modified," providing more specific examples of what constitutes substantial modification would enhance clarity.
Rationale: A more precise definition would help readers understand the scope of the analysis and the types of modifications being considered.
Implementation: Include more concrete examples of substantial modifications, such as rewriting paragraphs, adding new sections, or significantly altering the argumentation.
While the introduction focuses on concerns, briefly acknowledging potential benefits of LLMs in scientific writing could provide a more balanced perspective.
Rationale: Acknowledging potential benefits would demonstrate a nuanced understanding of the issue and avoid presenting LLMs solely as a threat.
Implementation: Add a sentence or two acknowledging potential benefits, such as improved clarity, conciseness, or language editing, while still emphasizing the need to address the associated risks.
The introduction could briefly preview the specific methods and datasets used in the study. This would create a smoother transition to the subsequent sections.
Rationale: Previewing the methods and datasets would provide a roadmap for the reader and enhance the overall flow of the paper.
Implementation: Add a brief sentence mentioning the specific datasets (arXiv, bioRxiv, Nature portfolio) and the statistical framework used in the analysis.
This figure shows the estimated percentage of sentences in academic paper abstracts that were significantly changed by a Large Language Model (LLM) over time. It compares subject areas on arXiv (Computer Science, Electrical Engineering, Mathematics, Physics, and Statistics) as well as bioRxiv and Nature portfolio journals. There's a noticeable increase in LLM use after ChatGPT was released in late 2022, especially in Computer Science. Think of it like measuring how much of a cake was made by a machine versus a human baker, but for sentences in research papers.
Text: "Figure 1: Estimated Fraction of LLM-Modified Sentences across Academic Writing Venues over Time."
Context: This figure displays the fraction (α) of sentences estimated to have been substantially modified by an LLM in abstracts from various academic writing venues.
Relevance: This figure is crucial because it directly addresses the main research question: how much are LLMs being used in scientific writing? It shows a clear trend of increasing LLM use after ChatGPT's release.
This figure is similar to Figure 1, but instead of abstracts, it looks at the introduction sections of papers. It shows the estimated percentage of sentences modified by LLMs over time, again comparing different academic areas. It also highlights the increase after ChatGPT's launch, mirroring the trend seen in abstracts. It's like comparing the machine-made versus human-made parts of a different layer of the cake.
Text: "Figure 7: Estimated Fraction of LLM-Modified Sentences in Introductions Across Academic Writing Venues Over Time."
Context: Further analysis of paper introductions is presented in Figure 7.
Relevance: This figure adds to the main finding by showing that the increased use of LLMs isn't limited to abstracts but extends to introductions as well. This strengthens the argument for widespread LLM adoption in scientific writing.
This section discusses existing methods for detecting LLM-generated text, including zero-shot and training-based approaches. It highlights the limitations of these methods, such as overfitting, vulnerability to attacks, and bias. The section also mentions LLM watermarking as a detection method. Finally, it emphasizes the advantage of the distributional GPT quantification framework used in this paper, which estimates the fraction of LLM-modified content at the population level, avoiding the need to classify individual documents.
The section provides a good overview of various LLM detection methods, including zero-shot, training-based, and watermarking techniques. This breadth demonstrates a thorough understanding of the field.
The section critically evaluates the limitations of existing methods, highlighting their weaknesses and justifying the need for a different approach. This critical perspective strengthens the paper's argument.
The section clearly explains why the distributional GPT quantification framework is chosen, emphasizing its advantages over existing methods. This justification reinforces the rationale for the study's methodology.
While watermarking is mentioned, briefly discussing specific techniques would provide a more complete picture.
Rationale: Elaborating on specific techniques would enhance the reader's understanding of watermarking and its potential for LLM detection.
Implementation: Briefly describe different watermarking methods, such as synonym substitution, syntactic restructuring, or embedding watermarks in the decoding process.
The section could briefly address the ethical implications of using LLM detection methods, particularly regarding potential misuse or bias.
Rationale: Discussing ethical implications would demonstrate a responsible approach to the topic and acknowledge the potential societal impact of these technologies.
Implementation: Add a sentence or two discussing potential ethical concerns, such as false accusations of plagiarism or discriminatory use of detection tools.
While the advantages of the distributional framework are mentioned, providing a more concise explanation of how it works would be beneficial.
Rationale: A brief explanation of the framework's underlying principles would enhance the reader's understanding of its strengths and limitations.
Implementation: Add a sentence or two summarizing the core idea behind the distributional framework, such as its focus on population-level statistics and its ability to handle temporal distribution shifts.
This section describes the distributional LLM quantification framework adapted from Liang et al. (2024). This framework estimates the proportion of AI-modified content in a corpus of documents. It works by comparing the probability distributions of words in human-written and LLM-modified documents. The framework doesn't analyze individual documents but looks at word usage patterns across the entire collection to estimate the overall fraction of AI-influenced text.
The section starts by clearly defining the problem of estimating the fraction of AI-modified documents, setting the stage for the framework description.
The framework is explained in a clear, step-by-step manner, making it easy to follow the logic and understand each component.
The use of mathematical equations provides a precise and rigorous description of the framework, allowing for a clear understanding of the underlying calculations.
While the mathematical formulation is precise, a simple example would make the framework more accessible to a broader audience.
Rationale: An example would help readers grasp the practical application of the framework and visualize how it works with real data.
Implementation: Include a simple example with a small set of words and documents to illustrate the calculation of alpha.
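For instance, a minimal sketch of such an example might look like the following, using a simplified per-token Bernoulli occurrence model with synthetic probabilities and data (all values and names here are hypothetical illustrations, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical document-level occurrence probabilities for three tokens,
# estimated from reference corpora.
p_human = np.array([0.02, 0.10, 0.30])  # from pre-LLM human-written docs
p_llm   = np.array([0.15, 0.25, 0.28])  # from LLM-modified docs

# Synthetic target corpus: one row per document, one column per token,
# True if the token occurs in that document.
rng = np.random.default_rng(0)
true_alpha = 0.2
mix = (1 - true_alpha) * p_human + true_alpha * p_llm
X = rng.random((5000, 3)) < mix  # 5,000 documents

def neg_log_likelihood(alpha):
    # Mixture occurrence probability under the framework's model.
    q = (1 - alpha) * p_human + alpha * p_llm
    # Bernoulli log-likelihood summed over documents and tokens.
    return -np.sum(X * np.log(q) + (~X) * np.log(1 - q))

est = minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0), method="bounded")
print(f"estimated alpha = {est.x:.3f}")  # lands near the true 0.2
```

The estimate recovers the mixing proportion because the target corpus's token occurrence rates are, marginally, a convex combination of the human and LLM rates.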
The section mentions using token occurrence probabilities but doesn't explain how the tokens (words) are chosen. Clarifying this would strengthen the methodology.
Rationale: Explaining the token selection process would address potential biases and improve the transparency of the method.
Implementation: Add a sentence or two explaining how the tokens are selected, considering factors like frequency, distinctiveness, or relevance to the domain.
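As a hedged illustration of what such an explanation could accompany, here is one plausible selection heuristic: ranking tokens by how strongly their document frequency shifts between the human and LLM reference corpora. The frequencies and cutoff below are placeholders, not the paper's actual procedure:

```python
import numpy as np

def select_tokens(df_human, df_llm, vocab, k=100, eps=1e-6):
    """Rank tokens by the absolute log-odds shift of their document
    frequency between the LLM and human reference corpora, and keep
    the k most discriminative ones."""
    log_odds = (np.log((df_llm + eps) / (1 - df_llm + eps))
                - np.log((df_human + eps) / (1 - df_human + eps)))
    top = np.argsort(-np.abs(log_odds))[:k]
    return [vocab[i] for i in top]

# Tiny illustration with made-up document frequencies.
vocab = ["realm", "graph", "pivotal", "theorem"]
print(select_tokens(np.array([0.01, 0.20, 0.02, 0.15]),
                    np.array([0.12, 0.21, 0.14, 0.13]), vocab, k=2))
# -> ['realm', 'pivotal']
```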
The section could briefly mention how the framework is implemented in the study, creating a smoother transition to the next section.
Rationale: Connecting the theoretical framework to its practical implementation would enhance the overall coherence of the paper.
Implementation: Add a sentence briefly mentioning the specific implementation details, such as the choice of programming language or software libraries used.
This section details the implementation of the distributional LLM quantification framework and the validation process used in the study. The authors describe the data collection process from arXiv, bioRxiv, and the Nature portfolio, sampling up to 2,000 papers per month. They explain how the data is split for training and validation, how the model is fitted, and the evaluation metrics used. The validation process involved using pre-ChatGPT papers to assess the model's accuracy in estimating the proportion of LLM-modified content. The results showed good performance, with low prediction error across different levels of LLM modification.
The section clearly identifies the data sources (arXiv, bioRxiv, Nature portfolio) and the sampling strategy (2,000 papers per month), providing transparency and reproducibility.
The validation process is well-described, including the temporal data split, the construction of validation sets with varying LLM modification proportions, and the use of pre-ChatGPT data.
The validation results demonstrate the model's good performance with low prediction error, strengthening the reliability of the study's findings.
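To illustrate the validation protocol described above, a self-contained sketch might mix synthetic "human" and "LLM" occurrence vectors at known proportions and check how closely the estimator recovers them. This reuses the simplified Bernoulli model from the earlier sketch; the probabilities and sample sizes are illustrative, not the authors' exact pipeline:

```python
import numpy as np
from scipy.optimize import minimize_scalar

p_human = np.array([0.02, 0.10, 0.30])  # hypothetical reference estimates
p_llm   = np.array([0.15, 0.25, 0.28])

def estimate_alpha(X):
    def nll(a):
        q = (1 - a) * p_human + a * p_llm
        return -np.sum(X * np.log(q) + (~X) * np.log(1 - q))
    return minimize_scalar(nll, bounds=(0.0, 1.0), method="bounded").x

rng = np.random.default_rng(1)
n = 4000
for alpha in np.arange(0.0, 0.30, 0.05):  # 0% to 25% in 5% increments
    n_llm = int(round(alpha * n))
    # Mix occurrence vectors: n_llm "LLM" documents, the rest "human".
    X = np.vstack([rng.random((n_llm, 3)) < p_llm,
                   rng.random((n - n_llm, 3)) < p_human])
    print(f"true alpha {alpha:.2f} -> estimated {estimate_alpha(X):.3f}")
```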
The section mentions focusing on introduction sections but doesn't fully explain why. Providing a stronger rationale would improve the justification.
Rationale: A clearer explanation of the focus on introductions would strengthen the methodological choices.
Implementation: Elaborate on why introductions are more suitable than other sections, such as abstracts or methods, for this specific analysis. Consider factors like length, content consistency, or relevance to LLM usage.
The section mentions fitting separate models but lacks details about the specific fitting process. Providing more information would enhance reproducibility.
Rationale: A more detailed description of the model fitting process would allow other researchers to replicate the study and validate the findings.
Implementation: Describe the specific algorithms, parameters, and software used for model fitting. Explain how the models were trained and optimized, including any hyperparameter tuning or cross-validation procedures.
While Section 3 is referenced, briefly summarizing the process of generating the LLM-modified corpus in this section would improve clarity and flow.
Rationale: Repeating the key aspects of the LLM-generated corpus creation would make this section more self-contained and easier to understand.
Implementation: Summarize the two-stage approach used to generate LLM-produced text, including the abstractive summarization and paragraph generation steps. Briefly mention the prompts used and the rationale behind this approach.
This figure validates how well the model estimates the fraction of LLM-modified content (alpha) when there's a time gap between the training data (up to 2020) and the validation data (from early 2022, before ChatGPT). Each small graph within the figure represents a different academic area (like Computer Science, Physics, etc.) and shows how the model's estimate of alpha compares to the actual alpha. It uses different sets of words (full vocabulary, adjectives, adverbs, verbs) to see which works best. Imagine you're trying to guess how much of a cookie is chocolate chips (alpha) based on how sweet it tastes. This figure is like checking if your sweetness-based guess is accurate by comparing it to cookies with known chocolate chip percentages.
Text: "Figure 3: Fine-grained Validation of Model Performance Under Temporal Distribution Shift."
Context: We construct validation sets with LLM-modified content proportions (α) ranging from 0% to 25%, in 5% increments, and compare the model’s estimated α with the ground truth α (Figure 3).
Relevance: This figure is important because it shows how reliable the model is, even when dealing with data from different time periods. This reliability is key for trusting the model's estimates of LLM use over time.
This section presents the main findings of the study regarding the prevalence and trends of LLM-modified content in scientific writing. Key findings include a steady increase in LLM usage after the release of ChatGPT, with the most significant growth in Computer Science. The analysis also reveals correlations between LLM usage and factors like first-author preprint posting frequency, paper similarity (indicating crowded research areas), and paper length.
The results are presented clearly, with specific data points and trends highlighted for each key finding. This clarity makes the findings easily understandable.
The section effectively uses figures (referenced but not included in the text content analysis) to illustrate the key findings, making the data more accessible and engaging.
The section provides specific data points, such as the estimated α values for different fields and time points, which adds weight and credibility to the findings.
While the section presents correlations, it would be beneficial to discuss potential confounding factors that might influence the observed relationships. For example, the correlation between preprint frequency and LLM usage could be influenced by other factors related to research productivity or field-specific practices.
Rationale: Addressing potential confounding factors would strengthen the analysis and provide a more nuanced interpretation of the results.
Implementation: Add a paragraph discussing potential confounding factors for each correlation presented. For example, consider factors like career stage, research area, or institutional policies that might influence both preprint frequency and LLM usage.
The section notes the correlation between paper similarity and LLM usage but doesn't fully explore its implications. Discussing the potential impact on research diversity and originality would be valuable.
Rationale: Exploring the implications of increased similarity would highlight the potential risks of widespread LLM adoption for the advancement of scientific knowledge.
Implementation: Add a paragraph discussing the potential consequences of increased similarity, such as reduced diversity of research approaches, stifled innovation, or increased difficulty in identifying truly novel contributions.
The section mentions the correlation between shorter papers and higher LLM usage but could benefit from more context. Discussing the potential reasons behind this correlation, such as time constraints or the nature of shorter papers (e.g., conference papers vs. journal articles), would enhance the analysis.
Rationale: Providing more context would help readers understand the factors contributing to the observed correlation and its implications for different types of scientific publications.
Implementation: Add a paragraph discussing potential reasons for the correlation, considering factors like time pressure, page limits for conference papers, or the nature of research presented in shorter versus longer papers. Also, consider exploring whether the use of LLMs in shorter papers is primarily for generating core content or for polishing existing text.
This figure examines whether authors who post more preprints on arXiv also tend to have a higher percentage of LLM-modified content in their Computer Science papers. The papers are split into two groups based on how many preprints the first author published in a year: two or fewer, or three or more. The figure then shows the estimated fraction of LLM-modified sentences in both abstracts and introductions for each group over time. It's like checking if students who write more practice essays also use grammar-checking software more often.
Text: "Figure 4: Papers authored by first authors who post preprints more frequently tend to have a higher fraction of LLM-modified content."
Context: Papers in arXiv Computer Science are stratified into two groups based on the preprint posting frequency of their first author, as measured by the number of first-authored preprints in the year.
Relevance: This figure explores a potential link between preprint posting frequency and LLM use, suggesting that authors who publish more preprints might be more inclined to use LLMs for writing assistance.
This figure investigates whether papers in more 'crowded' research areas (those with abstracts more similar to other abstracts) have more LLM-modified content. They measure similarity by converting abstracts into numerical vectors and calculating the distance between these vectors. Papers are grouped into 'more similar' (closer vectors) and 'less similar' (further vectors). The figure then shows how LLM use changes over time for these two groups. Imagine research papers as points on a map; closer points represent similar research. This figure checks if points clustered together have more LLM use.
Text: "Figure 5: Papers in more crowded research areas tend to have a higher fraction of LLM-modified content."
Context: Papers in arXiv Computer Science are divided into two groups based on their abstract's embedding distance to their closest peer: papers more similar to their closest peer (below median distance) and papers less similar to their closest peer (above median distance).
Relevance: This figure explores the relationship between research area 'crowdedness' and LLM use, suggesting that LLM use might be higher in areas where papers are more similar to each other.
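A brief sketch of how such a median split on closest-peer embedding distance could be computed follows; the embeddings here are random placeholders, and the nearest-neighbor tooling is one possible choice rather than the paper's documented pipeline:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder embeddings: one 384-dim vector per abstract.
emb = np.random.default_rng(0).normal(size=(1000, 384))

# Query two neighbors per point: the nearest is the point itself
# (distance 0), so the second column is the distance to the closest peer.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(emb)
dist, _ = nn.kneighbors(emb)
closest_peer = dist[:, 1]

median = np.median(closest_peer)
more_similar = closest_peer < median   # "crowded" papers
less_similar = ~more_similar
print(more_similar.sum(), less_similar.sum())
```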
This figure explores if shorter papers tend to have more LLM-modified content. arXiv Computer Science papers are divided into two groups: those shorter than 5,000 words and those longer. The figure shows the estimated LLM use in abstracts and introductions for both groups over time. It's like checking if shorter student essays use grammar-checking tools more than longer essays.
Text: "Figure 6: Shorter papers tend to have a higher fraction of LLM-modified content."
Context: arXiv Computer Science papers are stratified by their full text word count, including appendices, into two bins: below or above 5,000 words (the rounded median).
Relevance: This figure investigates the relationship between paper length and LLM use, suggesting that shorter papers might have a higher proportion of LLM-generated content.
This section discusses the key findings of the study, highlighting the increase in LLM-modified content in academic writing since the release of ChatGPT, particularly in Computer Science. It also summarizes the correlations found between LLM usage and factors like preprint posting frequency, research area crowdedness, and paper length. The discussion briefly touches on potential risks associated with LLM use in scientific publishing and calls for further research on promoting transparency and diversity in academic writing.
The discussion effectively summarizes the main findings of the study in a clear and concise manner, making it easy for readers to grasp the key takeaways.
The discussion provides context for the findings, linking the increased LLM usage in Computer Science to factors like researchers' familiarity with LLMs and the pressure to publish quickly.
The discussion briefly touches on the potential risks associated with widespread LLM use in scientific publishing, raising important questions about the security and independence of scientific practice.
While the discussion mentions potential risks, it would be beneficial to elaborate on these risks in more detail. For example, the discussion could explore the potential for bias in LLM-generated text, the ethical implications of using LLMs without proper attribution, and the potential impact on the integrity of the scientific record.
Rationale: A more detailed discussion of the risks would provide a more comprehensive understanding of the potential downsides of LLM use in scientific writing and inform future discussions on responsible AI practices.
Implementation: Add a paragraph specifically addressing the potential risks of LLM use, including bias, lack of transparency, and ethical concerns related to authorship and intellectual property.
The discussion could benefit from addressing potential mitigation strategies for the identified risks. These could include guidelines for responsible LLM use, transparency requirements for authors, or the development of more sophisticated detection methods.
Rationale: Discussing mitigation strategies would provide a more proactive approach to the issue and offer concrete steps for promoting responsible LLM use in scientific writing.
Implementation: Add a paragraph discussing potential mitigation strategies, such as guidelines for transparent LLM use, requirements for disclosing LLM assistance in publications, or the development of tools and methods for detecting and mitigating bias in LLM-generated text.
While the discussion calls for further research, it would be beneficial to be more specific about the directions future research should take. This could include investigating the impact of LLMs on specific aspects of scientific writing, exploring the effectiveness of different detection methods, or developing guidelines for responsible LLM use.
Rationale: A more specific call for future research would provide a clearer roadmap for researchers interested in contributing to this important area of study.
Implementation: Expand on the call for future research by outlining specific research questions, such as: How do LLMs impact the quality and originality of scientific writing? What are the most effective methods for detecting and mitigating bias in LLM-generated text? How can we develop guidelines and policies for responsible LLM use in academic publishing?
This section acknowledges the limitations of the study. The focus on ChatGPT, while being the dominant LLM, excludes other large language models used in academic writing. The study also addresses the potential for false positives in identifying LLM-generated text, particularly among non-native English writers, but asserts that the low false positive rate in 2022 supports the validity of their findings. Finally, the section recognizes that the observed correlations between LLM usage and paper characteristics are not necessarily causal and suggests further research to explore these relationships.
The section explicitly recognizes that ChatGPT is not the only LLM used for academic writing, demonstrating awareness of the broader landscape of language models.
The section directly addresses the concern of false positives, particularly regarding non-native English writers, and provides evidence from their 2022 data to support the validity of their findings.
The section clearly distinguishes between correlation and causation, acknowledging that the observed relationships between LLM usage and paper characteristics might be influenced by other factors.
While the limitation of focusing on ChatGPT is acknowledged, the section could briefly discuss the potential impact of including other LLMs in the analysis. This would provide a more nuanced perspective on the generalizability of the findings.
Rationale: Discussing the potential influence of other LLMs would strengthen the discussion of the study's limitations and provide insights for future research.
Implementation: Add a sentence or two speculating on how the results might change if other popular LLMs were included in the analysis. For example, consider whether the observed trends might be amplified or attenuated if different models were considered.
While future research is mentioned, providing more specific research questions or directions would be beneficial. This would guide future work and provide a clearer roadmap for addressing the limitations.
Rationale: More specific suggestions for future research would make the limitations section more actionable and encourage further investigation.
Implementation: Expand on the mention of future research by suggesting specific research questions or methodologies. For example, suggest exploring causal relationships using experimental designs or quasi-experimental methods with control groups. Also, suggest investigating the impact of specific LLMs other than ChatGPT.
Where possible, quantifying the limitations would provide a more concrete understanding of their potential impact. For example, estimating the usage share of other LLMs compared to ChatGPT would provide a clearer picture of the scope of this limitation.
Rationale: Quantifying the limitations would provide a more precise assessment of their potential influence on the study's findings.
Implementation: If possible, provide estimates or data on the usage share of other LLMs in academic writing. This would help contextualize the focus on ChatGPT and quantify the potential impact of excluding other models.
This appendix section presents Figure 7, a graph illustrating the estimated fraction of sentences modified by LLMs in the introductions of scientific papers across different academic venues over time. The inclusion of introductions complements the analysis of abstracts presented earlier in the paper (Figure 1). The figure shows trends similar to those observed in abstracts, with a notable increase in LLM usage after the launch of ChatGPT, especially in Computer Science. bioRxiv introductions were not included due to the unavailability of bulk PDF downloads.
Analyzing introductions alongside abstracts provides a more comprehensive view of LLM usage in scientific writing, strengthening the overall analysis.
The consistent trends observed in both abstracts and introductions reinforce the validity and generalizability of the findings regarding the increasing use of LLMs.
The clear explanation for excluding bioRxiv introductions due to data limitations enhances transparency and addresses potential questions about the scope of the analysis.
While consistency is mentioned, a direct visual or numerical comparison between the results for abstracts and introductions would strengthen the claim of similar trends.
Rationale: A direct comparison would provide more compelling evidence for the consistent trends and allow readers to easily identify any subtle differences between the two sections.
Implementation: Include a table or a combined plot showing the estimated alpha values for both abstracts and introductions side-by-side, or calculate and report the correlation between the alpha values for the two sections across different venues and time points.
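As a concrete illustration of the suggested numerical comparison, the correlation between the two monthly alpha series could be computed as follows; the values below are hypothetical placeholders, not the study's estimates:

```python
from scipy.stats import pearsonr

# Hypothetical monthly alpha estimates for one venue; the real values
# would come from the fitted models for abstracts and introductions.
alpha_abstract = [0.02, 0.03, 0.05, 0.09, 0.14, 0.17]
alpha_intro    = [0.02, 0.02, 0.04, 0.08, 0.12, 0.15]

r, p = pearsonr(alpha_abstract, alpha_intro)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```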
The section could discuss the specific implications of LLM usage in introductions, considering the unique role of introductions in scientific papers.
Rationale: Discussing the specific implications for introductions would provide a more nuanced understanding of how LLMs are being used in this crucial section of a paper.
Implementation: Add a paragraph discussing the potential impact of LLM usage on the quality, clarity, and originality of introductions. Consider how LLM-generated introductions might influence readers' perception of the research and its contribution to the field.
While the lack of bulk downloads is mentioned, exploring potential reasons for this limitation or alternative data acquisition methods would enhance the discussion.
Rationale: Exploring the reasons for data limitations and potential solutions would demonstrate a proactive approach to addressing these limitations and pave the way for future research.
Implementation: Add a sentence or two discussing the reasons behind the unavailability of bulk downloads for bioRxiv introductions. Explore alternative methods for accessing this data, such as contacting bioRxiv directly or using web scraping techniques (with appropriate ethical considerations). If alternative methods are not feasible, discuss the potential impact of this limitation on the generalizability of the findings.
This appendix section details the LLM prompts used in the study to generate LLM-modified text. The process involves a two-stage approach: first, summarizing a human-written paragraph into bullet points (a skeleton) and then expanding that skeleton back into a full paragraph using an LLM. This approach simulates a potential author workflow and allows the researchers to control for content while examining the stylistic differences between human and LLM-generated text. A third prompt for proofreading is also included.
The prompts are described clearly and concisely, making it easy to understand the purpose and instructions of each stage.
The rationale for the two-stage approach is well-explained, highlighting its purpose in simulating author workflow and controlling for content.
The inclusion of a proofreading prompt adds another layer of realism to the simulation of author workflow and addresses the potential use of LLMs for polishing text.
While the prompts are described, providing the exact wording of the prompts used in the study would enhance reproducibility.
Rationale: Providing the full prompts would allow other researchers to replicate the study precisely and validate the findings.
Implementation: Include the full text of each prompt used in the study, including any specific instructions or parameters provided to the LLM. Consider using a separate appendix section or a supplementary file for this purpose if space is limited.
The section could benefit from a discussion of the prompt engineering choices made, such as the specific instructions, constraints, or parameters used. This would provide insights into the process of generating realistic LLM-modified text.
Rationale: Discussing prompt engineering choices would enhance the transparency of the methodology and allow readers to understand the potential influence of different prompt variations on the generated text.
Implementation: Add a paragraph discussing the specific prompt engineering choices made, such as the use of instructions, constraints, or parameters. Explain the rationale behind these choices and how they contribute to generating realistic LLM-modified text. Consider discussing any challenges encountered during prompt engineering and how they were addressed.
Including examples of both the skeleton outlines and the LLM-generated paragraphs would provide a more concrete understanding of the process and its outcomes.
Rationale: Providing examples would allow readers to see the actual output of the two-stage process and assess the quality and realism of the LLM-generated text.
Implementation: Include a few examples of human-written paragraphs, their corresponding skeleton outlines, and the final LLM-generated paragraphs. This would illustrate the transformation process and allow readers to evaluate the effectiveness of the two-stage approach.
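To make the two-stage workflow concrete, here is a hedged sketch of what the pipeline could look like, using the decoding parameters reported in the paper's appendix (gpt-3.5-turbo-0125, temperature 1.0, max tokens 2048, top_p 1.0, zero penalties). The prompt wording below is paraphrased for illustration and is not the paper's exact prompt text, which appears in Figures 8-10:

```python
from openai import OpenAI

client = OpenAI()
PARAMS = dict(model="gpt-3.5-turbo-0125", temperature=1.0, max_tokens=2048,
              top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0)

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}], **PARAMS)
    return resp.choices[0].message.content

def llm_rewrite(paragraph: str) -> str:
    # Stage 1: condense the human-written paragraph into a skeleton outline.
    skeleton = chat("Summarize the following paragraph into a concise "
                    f"bullet-point outline of its main ideas:\n\n{paragraph}")
    # Stage 2: expand the skeleton back into a full paragraph.
    return chat("Expand the following outline into a complete, coherent "
                f"paragraph suitable for a scientific paper:\n\n{skeleton}")
```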
This figure presents an example prompt used to instruct an LLM to summarize a paragraph from a human-written paper into a skeleton outline. This process mimics how an author might extract the core ideas and information from a piece of text and condense it into a structured, concise form. It's like creating a blueprint or summary of the original paragraph, highlighting the main points and their relationships.
Text: "Figure 8: Example prompt for summarizing a paragraph from a human-authored paper into a skeleton"
Context: This process simulates how an author might first write only the main ideas and core information as a concise outline. The goal is to capture the essence of the paragraph in a structured and succinct manner, serving as a foundation for the subsequent expansion prompt.
Relevance: This figure is relevant because it illustrates the first stage of the two-stage process used to generate realistic LLM-produced training data. By summarizing human-written paragraphs into skeleton outlines, the researchers can then use these outlines to prompt the LLM to generate new text, simulating a potential use case of LLMs in scientific writing.
This figure shows an example prompt used to instruct an LLM to expand a skeleton outline into a full text paragraph. This simulates the second stage of the training data generation process, where the LLM elaborates on the concise outline to create a more detailed and fleshed-out piece of writing. Think of it like taking a blueprint and using it to construct the actual building.
Text: "Figure 9: Example prompt for expanding the skeleton into a full text"
Context: The aim here is to simulate the process of using the structured outline as a basis to generate comprehensive and coherent text. This step mirrors the way an author might flesh out the outline into detailed paragraphs, effectively transforming the condensed ideas into a fully articulated section of a paper.
Relevance: This figure is relevant because it illustrates the second stage of the two-stage process for creating LLM-generated training data. This expansion step simulates how scientists might use LLMs to generate text based on their own outlines, providing a more realistic representation of LLM-assisted writing.
This figure presents an example prompt for instructing an LLM to proofread a sentence. The goal is to ensure grammatical accuracy while minimizing changes to the original content. This is like using a grammar-checking tool to polish a sentence without altering its meaning.
Text: "Figure 10: Example prompt for proofreading."
Context: Your task is to proofread the provided sentence for grammatical accuracy. Ensure that the corrections introduce minimal distortion to the original content.
Relevance: This figure is relevant because it demonstrates the prompt used for the proofreading step in the training data generation process. While not as central as the summarization and expansion steps, proofreading ensures the grammatical correctness of both human-written and LLM-generated text used in the analysis.
This appendix provides supplementary information about the data collection, the LLMs used, and their parameter settings. The data was collected from arXiv, bioRxiv, and 15 Nature portfolio journals, sampling up to 2,000 papers per month from January 2020 to February 2024. The gpt-3.5-turbo-0125 model, trained on data up to September 2021, was used to generate the training data. The rationale for focusing on ChatGPT is its dominant market share and strong performance in understanding scientific papers. The decoding temperature was set to 1.0, maximum decoding length to 2048 tokens, Top P to 1.0, and both frequency and presence penalties to 0.0. No specific stop sequences were configured.
The section provides comprehensive information about the data collection process, including sources, sampling strategy, and timeframe, enhancing transparency and reproducibility.
The section clearly explains the rationale for choosing ChatGPT, citing its market share and performance in understanding scientific papers, which strengthens the study's methodology.
Providing the specific parameter settings used for the LLM ensures reproducibility and allows other researchers to replicate the study's data generation process.
While Section 3 is referenced for the AI corpus generation procedure, specifying whether this data was generated for each month or for the entire period would enhance clarity.
Rationale: Clarifying the timing of AI data generation would provide a more precise understanding of the data generation process and its alignment with the paper sampling.
Implementation: Specify whether the AI corpus data was generated for each month independently or for the entire period at once. Explain the rationale behind the chosen approach.
The section mentions including all available papers when the 2,000 target wasn't met, but it doesn't explain how this might affect the analysis. Discussing the potential impact of varying sample sizes would enhance the discussion.
Rationale: Addressing the potential impact of varying sample sizes would strengthen the analysis and acknowledge any potential limitations arising from this practice.
Implementation: Add a sentence or two discussing the potential impact of varying sample sizes across different months or venues. Consider whether this might introduce bias or affect the reliability of the results. If possible, quantify the variation in sample sizes and discuss its potential influence on the findings.
While the parameter settings are listed, providing justifications for these specific choices would strengthen the methodology. For example, explain why a temperature of 1.0 was chosen or the rationale behind the penalty settings.
Rationale: Justifying the parameter choices would provide a more robust methodological foundation and allow other researchers to understand the reasoning behind the specific LLM configuration.
Implementation: For each parameter setting, provide a brief explanation of the rationale behind the chosen value. Refer to relevant literature or best practices for LLM parameter tuning if applicable. Discuss how different parameter values might affect the generated text and the overall analysis.
This appendix section presents a figure (Figure 11) showing the shift in word frequency within the introductions of arXiv Computer Science papers over the past two years. The figure focuses on the same four words highlighted in Figure 2: "realm," "intricate," "showcasing," and "pivotal." The analysis reveals a similar trend to Figure 2, where these words, after being relatively infrequent for over a decade, experienced a sudden increase in usage starting in 2023. The section notes that data from 2010-2020 is omitted due to the computational cost of processing a large volume of arXiv papers.
Using the same words as in the abstract analysis (Figure 2) ensures consistency and allows for direct comparison between different sections of the papers.
The figure (Figure 11, referenced but not included in the text content analysis) provides a clear visual representation of the word frequency shift, making the trend easily discernible.
Openly acknowledging the limitation of the time frame due to computational constraints enhances transparency and sets realistic expectations for the scope of the analysis.
While computational constraints are understandable, exploring ways to extend the analysis to include data from 2010-2020 would provide a more complete picture of the long-term word frequency trends.
Rationale: A longer time frame would allow for a more comprehensive analysis of the word usage trends and provide a stronger baseline for comparison with the post-2020 period.
Implementation: Explore strategies for optimizing the parsing process or using more efficient computational resources to enable the inclusion of older arXiv papers in the analysis. If extending the timeframe is not feasible, discuss the potential impact of this limitation on the interpretation of the results.
While the section mentions a similar trend to Figure 2, providing a quantitative comparison (e.g., correlation coefficients, statistical tests) between the word frequency shifts in abstracts and introductions would strengthen the analysis.
Rationale: A quantitative comparison would provide more compelling evidence for the similarity of the trends and allow for a more nuanced understanding of any differences between the two sections.
Implementation: Calculate and report correlation coefficients or perform statistical tests to compare the word frequency shifts observed in abstracts and introductions. Discuss the statistical significance of the findings and any potential differences in the magnitude or timing of the shifts.
The section could benefit from a discussion of why these specific four words are significant and what their increased usage might indicate about the influence of LLMs on scientific writing.
Rationale: Explaining the significance of the chosen words would provide a deeper understanding of the implications of the observed frequency shifts and their connection to LLM usage.
Implementation: Add a paragraph discussing the potential reasons why these specific words might be more common in LLM-generated text. Explore linguistic characteristics, stylistic patterns, or semantic connotations associated with these words that might explain their increased usage. Consider whether these words reflect specific writing conventions or biases present in LLM training data.
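For reference, the per-year frequency tracking behind a figure like Figure 11 could be sketched as follows; only the four focal words come from the paper, while the corpus loading and helper function are hypothetical:

```python
import re
from collections import defaultdict

FOCAL_WORDS = ["realm", "intricate", "showcasing", "pivotal"]  # from the paper

def word_frequency_by_year(intros_by_year):
    """intros_by_year: dict mapping year -> list of introduction texts."""
    freq = defaultdict(dict)
    for year, texts in intros_by_year.items():
        for w in FOCAL_WORDS:
            pattern = re.compile(rf"\b{w}\b", re.IGNORECASE)
            hits = sum(1 for t in texts if pattern.search(t))
            freq[year][w] = hits / max(len(texts), 1)  # fraction of intros
    return freq

# Tiny fake corpus just to show the output shape.
demo = {2022: ["We study sparse graphs.", "A pivotal step is pruning."],
        2023: ["In the realm of intricate models...", "Showcasing our results..."]}
print(word_frequency_by_year(demo))
```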
This figure shows how often certain words appeared in the introductions of arXiv Computer Science papers over the past two years. The words 'realm,' 'intricate,' 'showcasing,' and 'pivotal' are tracked. These words were used infrequently before 2023 but saw a sudden increase in 2023, suggesting a potential shift in language use, possibly related to the rise of LLMs. It's like noticing that certain ingredients started showing up more often in recipes after a new cooking gadget became popular.
Text: "Figure 11: Word Frequency Shift in sampled arXiv Computer Science introductions in the past two years."
Context: Appendix D provides additional results on word frequency shifts in introductions.
Relevance: This figure provides additional evidence supporting the main finding of increased LLM usage in Computer Science. The sudden increase in the frequency of specific words after the release of ChatGPT suggests a potential link between LLM use and changes in writing style.
This appendix section provides further details about the main findings by presenting supporting figures. Figure 12 shows that the relationship between first-author preprint posting frequency and LLM usage holds across different Computer Science sub-categories (cs.CV, cs.LG, cs.CL). Figure 13 demonstrates that the relationship between paper similarity and LLM usage also holds across these sub-categories. Figure 14 examines the relationship between paper length and LLM usage, showing it holds for cs.CV and cs.LG but not for cs.CL, possibly due to limited sample size in cs.CL.
Breaking down the findings by sub-category provides a more detailed and nuanced understanding of the relationships between LLM usage and paper characteristics.
The consistent findings across sub-categories for preprint frequency and paper similarity strengthen the overall conclusions of the study.
Acknowledging the potential limitation of sample size for cs.CL in the paper length analysis demonstrates transparency and careful interpretation of the results.
While the figures visually suggest trends, reporting statistical significance (e.g., p-values, confidence intervals) would strengthen the findings.
Rationale: Adding statistical significance would provide a more rigorous evaluation of the observed relationships and allow readers to assess the strength of the evidence.
Implementation: Calculate and report p-values or confidence intervals for the observed correlations in each sub-category. Discuss the statistical significance of the findings and any potential variations in significance across sub-categories.
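One way to operationalize this suggestion is a nonparametric bootstrap over documents, sketched below. It assumes an `estimate_alpha` function like the earlier MLE sketch, so the helper and inputs are illustrative rather than the authors' method:

```python
import numpy as np

def bootstrap_alpha_ci(X, estimate_alpha, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for alpha over documents (rows of X)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Resample documents with replacement and re-estimate alpha each time.
    stats = [estimate_alpha(X[rng.integers(0, n, size=n)])
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi
```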
While limited sample size is suggested as a possible reason for the inconsistent finding in cs.CL regarding paper length, exploring other potential explanations would enhance the analysis.
Rationale: Exploring alternative explanations would demonstrate a thorough consideration of the findings and provide a more nuanced understanding of the relationship between paper length and LLM usage in different sub-categories.
Implementation: Discuss other potential factors that might explain the inconsistent finding in cs.CL, such as differences in writing practices, research topics, or the types of papers typically submitted to this sub-category. Consider whether the use of LLMs in cs.CL might differ from other sub-categories in terms of the specific tasks or purposes for which they are employed.
The section could benefit from a clearer explanation of how these fine-grained findings contribute to the overall conclusions of the study. Explicitly connecting these results to the main findings presented earlier would enhance the coherence of the paper.
Rationale: Connecting the fine-grained findings to the main findings would strengthen the overall narrative of the paper and demonstrate the value of the sub-category analysis.
Implementation: Add a concluding paragraph summarizing the key takeaways from the fine-grained analysis and explicitly linking them to the main findings presented in Section 5. Explain how these detailed results support or refine the broader conclusions of the study.
This figure investigates the relationship between how often a paper's first author posts preprints on arXiv and the use of LLMs in their Computer Science papers. The papers are divided into two groups: those whose first authors posted two or fewer preprints in a year, and those who posted three or more. It then shows the estimated fraction of LLM-modified sentences over time for three different Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Think of it like checking if students who submit more draft essays also use grammar-checking software more often, but specifically within different areas of study like literature, history, or science.
Text: "Figure 12: The relationship between first-author preprint posting frequency and LLM usage holds across arXiv Computer Science sub-categories."
Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are stratified into two groups based on the preprint posting frequency of their first author, as measured by the number of first-authored preprints in the year: those with ≤ 2 preprints and those with ≥ 3 preprints.
Relevance: This figure supports the broader finding that researchers who post more preprints tend to use LLMs more, showing this trend holds across different areas within Computer Science. This suggests that the relationship isn't specific to just one type of Computer Science research.
This figure explores whether papers in more 'crowded' research areas (those with abstracts more similar to other abstracts within the same subcategory) have more LLM-modified content. Similarity is measured by converting abstracts into numerical vectors and calculating the distance between them. Papers are grouped into 'more similar' (closer vectors) and 'less similar' (further vectors). The figure shows how LLM use changes over time for these two groups across three Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Imagine research papers as points on a map, with closer points representing similar research. This figure checks if points clustered together in different academic neighborhoods show more LLM use.
Text: "Figure 13: The relationship between paper similarity and LLM usage holds across arXiv Computer Science sub-categories."
Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are divided into two groups based on their abstract's embedding distance to their closest peer within the respective sub-category: papers more similar to their closest peer (below median distance) and papers less similar to their closest peer (above median distance).
Relevance: This figure further investigates the relationship between research area 'crowdedness' and LLM use, showing that the trend observed in the main analysis holds across different Computer Science subcategories. This suggests the relationship isn't limited to a specific area of Computer Science.
This figure explores if shorter papers have more LLM-modified content within different Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Papers are divided into two groups: shorter than 5,000 words and longer than 5,000 words. It then shows the estimated LLM use for both groups over time. It's like seeing if shorter student essays use grammar-checking tools more than longer essays, but specifically for essays on different scientific topics.
Text: "Figure 14: The relationship between paper length and LLM usage holds for cs.CV and cs.LG, but not for cs.CL."
Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are stratified by their full text word count, including appendices, into two bins: below or above 5,000 words (the rounded median).
Relevance: This figure provides a more detailed look at the relationship between paper length and LLM use by examining it across different Computer Science subcategories. It also highlights an exception in cs.CL, where the relationship doesn't hold, suggesting other factors might be at play.
This appendix section investigates the impact of using LLMs for proofreading on the detection of LLM-generated text. It presents a figure (Figure 15) showing a slight increase in the estimated fraction of LLM-modified content after proofreading across various arXiv categories. This suggests that the method used in the study is robust to minor edits introduced by proofreading, as it can still detect the underlying LLM-generated content.
The section clearly focuses on a specific aspect of LLM detection: the impact of proofreading. This focused approach allows for a detailed investigation of this particular challenge.
The section directly addresses a potential challenge to the main findings of the study: the possibility that LLM-generated text could be masked by proofreading. Demonstrating robustness to proofreading strengthens the validity of the overall results.
The use of a pre-ChatGPT dataset for this analysis is appropriate, as it isolates the impact of proofreading without the confounding effect of widespread ChatGPT usage.
While the section mentions a "slight increase," quantifying this increase with specific percentage points or effect sizes would strengthen the analysis.
Rationale: Quantifying the increase would provide a more precise understanding of the impact of proofreading and allow for a more objective assessment of the method's robustness.
Implementation: Report the specific percentage point increase or effect size observed in each arXiv category after proofreading. Consider including a table or adding numerical labels to Figure 15 to display these values directly.
The section focuses on "simple proofreading." Exploring different levels of proofreading intensity (e.g., light, moderate, heavy editing) would provide a more comprehensive understanding of the method's robustness.
Rationale: Analyzing different proofreading intensities would reveal whether the method's robustness holds across a range of editing scenarios, from minor corrections to more substantial revisions.
Implementation: Design experiments with varying levels of proofreading intensity. This could involve using different prompts that instruct the LLM to perform more extensive edits or manually editing the text to simulate different levels of proofreading. Compare the estimated LLM content before and after each level of proofreading to assess the method's robustness.
The section could discuss the implications of these findings for real-world scenarios where proofreading is common practice. This would connect the analysis to the broader context of LLM detection in academic publishing.
Rationale: Discussing real-world implications would enhance the relevance of the findings and provide insights into the challenges and opportunities of LLM detection in practical settings.
Implementation: Add a paragraph discussing how these findings might affect the detection of LLM-generated text in submitted manuscripts or published papers. Consider the challenges of distinguishing between legitimate proofreading and attempts to mask LLM usage. Discuss the potential need for more sophisticated detection methods or guidelines for authors regarding the use of LLMs and proofreading tools.
This figure investigates how proofreading with Large Language Models (LLMs) affects the estimation of LLM-generated content in scientific papers. It takes abstracts from different arXiv categories (Computer Science, Electrical Engineering, Mathematics, Physics, Statistics) and measures the estimated fraction of LLM-modified content (alpha) before and after using LLMs for proofreading. The slight increase in alpha after proofreading suggests that the method can detect even small edits made by LLMs, demonstrating its robustness. It's like checking if a tool that measures how much of a cake was made by a machine can still work even if the machine only added the frosting.
Text: "Figure 15: Robustness of estimations to proofreading."
Context: Appendix F presents the proofreading results, which are relevant for assessing the method's robustness.
Relevance: This figure is important because it addresses a potential weakness of the method: its sensitivity to minor edits. By showing that the method can still detect LLM use even after proofreading, it strengthens the validity of the overall findings.
This section provides a detailed overview of existing methods for detecting LLM-generated text. It covers zero-shot detection methods, training-based detection methods, and LLM watermarking. The section highlights the limitations of each approach, emphasizing challenges like access to LLM internals, overfitting, bias, and the need for model owner involvement in watermarking. It concludes by contrasting these methods with the distributional GPT quantification framework used in the paper, which offers advantages in stability, accuracy, and independence from model owners.
The section thoroughly covers various LLM detection methods, providing a comprehensive overview of the current landscape and demonstrating a strong understanding of the field.
The section clearly explains the limitations of each method, providing specific examples and referencing relevant studies, which strengthens the justification for the chosen framework.
The section effectively contrasts the limitations of existing methods with the advantages of the distributional GPT quantification framework, highlighting its independence from model owners and its focus on population-level analysis.
While various watermarking techniques are mentioned, providing more concrete examples of how these techniques work would enhance understanding for a broader audience.
Rationale: Concrete examples would make the concept of watermarking more accessible and easier to grasp for readers unfamiliar with the technical details.
Implementation: Include brief examples illustrating how specific watermarking techniques, such as the Gumbel watermark or the red-green list approach, work in practice. Consider using simple illustrations or analogies to explain the underlying mechanisms.
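As an illustration of the red-green list idea, a toy detection sketch follows: the previous token seeds a pseudorandom partition of the vocabulary, and detection computes a z-score for how often observed tokens land on the "green" side. The vocabulary size and γ are illustrative; this is a simplified teaching example, not a production watermark detector:

```python
import math
import numpy as np

VOCAB_SIZE, GAMMA = 1000, 0.5  # toy vocabulary; green-list fraction

def is_green(prev_token: int, token: int) -> bool:
    # The previous token seeds a pseudorandom partition of the vocabulary;
    # a watermarking decoder would have boosted the "green" half.
    rng = np.random.default_rng(prev_token)
    return rng.random(VOCAB_SIZE)[token] < GAMMA

def watermark_z_score(token_ids):
    # Count green tokens and compare against the no-watermark expectation
    # (a binomial with success probability GAMMA).
    hits = sum(is_green(a, b) for a, b in zip(token_ids, token_ids[1:]))
    t = len(token_ids) - 1
    return (hits - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

# Unwatermarked text should score near 0; watermarked text scores high.
rng = np.random.default_rng(42)
print(watermark_z_score(rng.integers(0, VOCAB_SIZE, size=200).tolist()))
```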
The section could briefly discuss potential countermeasures that could be used to circumvent watermarking techniques, providing a more balanced perspective on the effectiveness of this approach.
Rationale: Discussing potential countermeasures would acknowledge the limitations of watermarking and provide a more realistic assessment of its long-term viability as a detection method.
Implementation: Add a sentence or two discussing potential ways to remove or obscure watermarks, such as paraphrasing, text manipulation, or adversarial attacks. Consider the ongoing arms race between watermarking techniques and countermeasures.
While the section mentions implications for pretraining data quality, expanding on the potential consequences of using LLM-generated text for training would strengthen the discussion.
Rationale: A more detailed discussion of the potential consequences would highlight the importance of addressing this issue and motivate further research on data curation and filtering strategies.
Implementation: Elaborate on the potential pitfalls of using LLM-generated text for training, such as the reinforcement of biases, the homogenization of language, and the potential for model collapse. Discuss the long-term implications for the development and deployment of LLMs, and the potential impact on the quality and reliability of future models.