The Rise of Large Language Models in Scientific Writing: A Large-Scale Analysis

Overall Summary

Overview

This study investigates the growing use of Large Language Models (LLMs), specifically GPT-3.5-turbo-0125 (a version of ChatGPT), in scientific writing. The researchers applied a distributional quantification framework to nearly a million papers from arXiv, bioRxiv, and Nature portfolio journals published between January 2020 and February 2024 to estimate the prevalence of LLM-modified content. This framework compares word usage patterns between human-written and LLM-generated text to estimate the overall proportion of LLM influence. The study found a steady increase in LLM usage after ChatGPT's release in late 2022, particularly in Computer Science, and explored correlations with author preprint posting frequency, research area crowdedness, and paper length. These findings raise concerns about the potential impact of LLMs on scientific integrity and call for further research into responsible LLM use.

Significant Elements

Figure 1

Description: Figure 1 displays the estimated fraction of LLM-modified sentences in abstracts across different academic venues over time. It visually demonstrates the increase in LLM usage after the release of ChatGPT, particularly in Computer Science.

Relevance: This figure directly visualizes the central finding of the study: the increasing prevalence of LLM-modified content in scientific writing.

Figure 7

Description: Figure 7, similar to Figure 1, shows the estimated fraction of LLM-modified sentences in introductions across different academic venues over time. This figure reinforces the findings from the abstract analysis and extends them to another crucial section of scientific papers.

Relevance: This figure provides further evidence of the widespread adoption of LLMs in scientific writing, showing consistent trends across different sections of papers.

Conclusion

This study provides compelling evidence for the increasing use of LLMs, particularly ChatGPT, in scientific writing. The observed correlations between LLM usage and factors like preprint posting frequency, research area crowdedness, and paper length raise important questions about the potential impact of LLMs on scientific practice, including concerns about homogenization of research, potential bias, and the need for transparency. Future research should focus on establishing causal relationships, exploring the ethical implications of LLM use, and developing guidelines for responsible LLM integration in scientific writing. This research is crucial for navigating the evolving landscape of scientific communication and ensuring the integrity and diversity of scientific knowledge production in the age of AI.

Section Analysis

Abstract

Overview

This paper investigates the increasing use of Large Language Models (LLMs) like ChatGPT in scientific writing. The authors conducted a large-scale analysis of nearly a million papers published on arXiv, bioRxiv, and Nature portfolio journals between January 2020 and February 2024. They used a statistical framework to estimate the prevalence of LLM-modified content over time and across different academic fields. The study found a steady increase in LLM usage, with the highest growth in Computer Science papers. Furthermore, the analysis revealed correlations between higher LLM modification and factors such as first-author preprint posting frequency, research area crowdedness, and shorter paper length.

Introduction

Overview

The introduction discusses the growing use of LLMs like ChatGPT in academic writing and the need to measure its prevalence. It highlights the challenges of detecting LLM-generated text at the individual level and emphasizes the importance of a large-scale analysis to understand the structural factors motivating LLM use and its impact on scientific publishing. The introduction also mentions concerns about accuracy, plagiarism, and ownership, leading some institutions to restrict LLM use in publications. Finally, it introduces the paper's goal: to conduct a systematic, large-scale analysis to quantify the prevalence of LLM-modified content across multiple academic platforms.

Non-Text Elements

Figure 1

This figure shows the estimated percentage of sentences in academic paper abstracts that were significantly changed by a Large Language Model (LLM) over time. It compares arXiv subject areas (Computer Science, Electrical Engineering, Mathematics, Physics, and Statistics) with bioRxiv and Nature portfolio journals. There's a noticeable increase in LLM use after ChatGPT was released in late 2022, especially in Computer Science. Think of it like measuring how much of a cake was made by a machine versus a human baker, but for sentences in research papers.

First Mention

Text: "Figure 1: Estimated Fraction of LLM-Modified Sentences across Academic Writing Venues over Time."

Context: This figure displays the fraction (α) of sentences estimated to have been substantially modified by LLM in abstracts from various academic writing venues.

Relevance: This figure is crucial because it directly addresses the main research question: how much are LLMs being used in scientific writing? It shows a clear trend of increasing LLM use after ChatGPT's release.

Critique
Visual Aspects
  • The y-axis label 'Estimated alpha' could be more descriptive, like 'Estimated Fraction of LLM-Modified Sentences'.
  • The figure could benefit from a title summarizing the main takeaway, such as 'Increasing LLM Use in Scientific Abstracts After ChatGPT Release'.
  • Using a logarithmic scale for the y-axis might better visualize the changes, especially the smaller percentages before ChatGPT.
Analytical Aspects
  • The figure focuses on abstracts, but it would be helpful to see similar data for other sections of papers.
  • The figure doesn't show the actual number of papers analyzed, which would provide more context.
  • It would be interesting to see a breakdown of LLM use within specific subfields of Computer Science.
Numeric Data
  • Maximum Estimated Alpha (Computer Science): 0.175 fraction
  • Minimum Estimated Alpha (Mathematics): 0.049 fraction

Figure 7

This figure is similar to Figure 1, but instead of abstracts, it looks at the introduction sections of papers. It shows the estimated percentage of sentences modified by LLMs over time, again comparing different academic areas. It also highlights the increase after ChatGPT's launch, mirroring the trend seen in abstracts. It's like comparing the machine-made versus human-made parts of a different layer of the cake.

First Mention

Text: "Figure 7: Estimated Fraction of LLM-Modified Sentences in Introductions Across Academic Writing Venues Over Time."

Context: Further analysis of paper introductions is presented in Figure 7.

Relevance: This figure adds to the main finding by showing that the increased use of LLMs isn't limited to abstracts but extends to introductions as well. This strengthens the argument for widespread LLM adoption in scientific writing.

Critique
Visual Aspects
  • Similar to Figure 1, the y-axis label could be more descriptive and a summarizing title could be added.
  • A direct comparison with Figure 1 (e.g., a combined plot or side-by-side presentation) would highlight the similarities and differences between abstracts and introductions.
  • A logarithmic scale for the y-axis could improve visualization of smaller percentages.
Analytical Aspects
  • The exclusion of bioRxiv introductions due to data limitations should be explicitly mentioned in the figure caption.
  • Providing the actual number of papers analyzed would enhance the context.
  • A breakdown of LLM use within specific subfields, especially in Computer Science, would be informative.
Numeric Data
  • Maximum Estimated Alpha (Computer Science): 0.155 fraction
  • Minimum Estimated Alpha (Mathematics): 0.039 fraction

Related Work

Overview

This section discusses existing methods for detecting LLM-generated text, including zero-shot and training-based approaches. It highlights the limitations of these methods, such as overfitting, vulnerability to attacks, and bias. The section also mentions LLM watermarking as a detection method. Finally, it emphasizes the advantage of the distributional GPT quantification framework used in this paper, which estimates the fraction of LLM-modified content at the population level, avoiding the need to classify individual documents.

Background: the distributional LLM quantification framework

Overview

This section describes the distributional LLM quantification framework adapted from Liang et al. (2024). This framework estimates the proportion of AI-modified content in a corpus of documents. It works by comparing the probability distributions of words in human-written and LLM-modified documents. The framework doesn't analyze individual documents but looks at word usage patterns across the entire collection to estimate the overall fraction of AI-influenced text.
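
The core idea can be written compactly: the observed word distribution in a corpus is modeled as a mixture P(w) = (1 − α) · P_human(w) + α · P_LLM(w), and α is chosen to maximize the likelihood of the observed word occurrences. The code below is a minimal sketch of this estimation, assuming per-word probabilities have already been estimated from human-written and LLM-generated reference corpora; the paper's actual estimator also models word non-occurrence and is fit separately per venue and time period.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_alpha(word_counts, p_human, p_llm):
    """Maximum-likelihood estimate of alpha under the mixture model
    P(word) = (1 - alpha) * P_human(word) + alpha * P_llm(word).

    word_counts: observed count of each vocabulary word in the corpus
    p_human, p_llm: per-word occurrence probabilities estimated from
    human-written and LLM-generated reference corpora.
    """
    word_counts = np.asarray(word_counts, dtype=float)
    p_human = np.asarray(p_human, dtype=float)
    p_llm = np.asarray(p_llm, dtype=float)

    def neg_log_likelihood(alpha):
        mix = (1.0 - alpha) * p_human + alpha * p_llm
        # Small constant guards against log(0) for unseen words.
        return -np.sum(word_counts * np.log(mix + 1e-12))

    res = minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0), method="bounded")
    return res.x
```

Note that α is a population-level quantity: nothing in this estimation labels any individual document as LLM-modified, which is the framework's key difference from per-document detectors.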

Implementation and Validations

Overview

This section details the implementation of the distributional LLM quantification framework and the validation process used in the study. The authors describe the data collection process from arXiv, bioRxiv, and the Nature portfolio, sampling up to 2,000 papers per month. They explain how the data is split for training and validation, how the model is fitted, and the evaluation metrics used. The validation process involved using pre-ChatGPT papers to assess the model's accuracy in estimating the proportion of LLM-modified content. The results showed good performance, with low prediction error across different levels of LLM modification.
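
As a rough illustration of that validation protocol, the sketch below mixes held-out human-written and LLM-generated documents at known proportions (0% to 25% in 5% increments, as reported) and records the estimator's absolute error at each level. The sample size and the `estimate_alpha_fn` interface are illustrative assumptions; see the estimator sketch in the Background section.

```python
import numpy as np

def validate(human_docs, llm_docs, estimate_alpha_fn, n=1000, seed=0):
    """Mix held-out human and LLM documents at known proportions and
    record the estimator's absolute error at each alpha level."""
    rng = np.random.default_rng(seed)
    errors = {}
    for true_alpha in np.linspace(0.0, 0.25, 6):  # 0%..25% in 5% steps
        n_llm = int(round(true_alpha * n))
        sample = (list(rng.choice(human_docs, size=n - n_llm, replace=False))
                  + list(rng.choice(llm_docs, size=n_llm, replace=False)))
        errors[round(float(true_alpha), 2)] = abs(estimate_alpha_fn(sample) - true_alpha)
    return errors
```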

Non-Text Elements

Figure 3

This figure validates how well the model estimates the fraction of LLM-modified content (alpha) when there's a time gap between the training data (up to 2020) and the validation data (from early 2022, before ChatGPT). Each small graph within the figure represents a different academic area (like Computer Science, Physics, etc.) and shows how the model's estimate of alpha compares to the actual alpha. It uses different sets of words (full vocabulary, adjectives, adverbs, verbs) to see which works best. Imagine you're trying to guess how much of a cookie is chocolate chips (alpha) based on how sweet it tastes. This figure is like checking if your sweetness-based guess is accurate by comparing it to cookies with known chocolate chip percentages.

First Mention

Text: "Figure 3: Fine-grained Validation of Model Performance Under Temporal Distribution Shift."

Context: We construct validation sets with LLM-modified content proportions (α) ranging from 0% to 25%, in 5% increments, and compared the model’s estimated α with the ground truth α (Figure 3).

Relevance: This figure is important because it shows how reliable the model is, even when dealing with data from different time periods. This reliability is key for trusting the model's estimates of LLM use over time.

Critique
Visual Aspects
  • The figure is a bit crowded with 13 small graphs. Separating them into two figures, one for abstracts and one for introductions, might improve readability.
  • The labels on the x and y axes could be larger for easier reading.
  • The caption could explain what the different colors of the bars represent (full vocabulary, adjectives, etc.).
Analytical Aspects
  • While the caption mentions a prediction error less than 3.5%, the figure itself doesn't visually represent this error. Adding a line or shaded area to represent this threshold would be helpful.
  • The figure doesn't show how the model performs with data after ChatGPT's release. Including a similar validation with post-ChatGPT data would strengthen the analysis.
  • The caption mentions excluding bioRxiv introductions due to data limitations. Explaining this limitation in more detail (e.g., why the data was unavailable) would be helpful.
Numeric Data
  • Maximum Estimation Error: 0.035 fraction

Main Results and Findings

Overview

This section presents the main findings of the study regarding the prevalence and trends of LLM-modified content in scientific writing. Key findings include a steady increase in LLM usage after the release of ChatGPT, with the most significant growth in Computer Science. The analysis also reveals correlations between LLM usage and factors like first-author preprint posting frequency, paper similarity (indicating crowded research areas), and paper length.

Non-Text Elements

Figure 4

This figure examines whether authors who post more preprints on arXiv also tend to have a higher percentage of LLM-modified content in their Computer Science papers. The papers are split into two groups based on how many preprints the first author published in a year: two or fewer, or three or more. The figure then shows the estimated fraction of LLM-modified sentences in both abstracts and introductions for each group over time. It's like checking if students who write more practice essays also use grammar-checking software more often.
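
A minimal sketch of this stratification for one time bin, assuming hypothetical record fields (`first_author_preprints`, `abstract`) and a population-level estimator like the one sketched in the Background section:

```python
def stratified_alpha(papers, estimate_alpha_fn):
    """Split one time bin of papers by first-author preprint count and
    estimate alpha separately for each group (hypothetical field names)."""
    frequent = [p["abstract"] for p in papers if p["first_author_preprints"] >= 3]
    occasional = [p["abstract"] for p in papers if p["first_author_preprints"] <= 2]
    return {
        ">=3 preprints": estimate_alpha_fn(frequent),
        "<=2 preprints": estimate_alpha_fn(occasional),
    }
```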

First Mention

Text: "Figure 4: Papers authored by first authors who post preprints more frequently tend to have a higher fraction of LLM-modified content."

Context: Papers in arXiv Computer Science are stratified into two groups based on the preprint posting frequency of their first author, as measured by the number of first-authored preprints in the year.

Relevance: This figure explores a potential link between preprint posting frequency and LLM use, suggesting that authors who publish more preprints might be more inclined to use LLMs for writing assistance.

Critique
Visual Aspects
  • The figure could benefit from a more descriptive title, such as 'LLM Use and Preprint Posting Frequency'.
  • The x-axis labels could be made more readable by shortening the time period representations (e.g., '22.1-3' instead of '2022.1-3').
  • Adding a legend directly on the graphs would improve clarity.
Analytical Aspects
  • The figure only shows correlation, not causation. It doesn't prove that posting more preprints *causes* higher LLM use.
  • Other factors, such as the specific research area or the author's experience, could influence both preprint frequency and LLM use.
  • The use of 2023 author groupings for 2024 data, due to incomplete data, should be more clearly explained and justified in the caption.
Numeric Data
  • Estimated Alpha (Abstracts, >=3 preprints, Feb 2024): 0.193 fraction
  • Estimated Alpha (Abstracts, <=2 preprints, Feb 2024): 0.156 fraction

Figure 5

This figure investigates whether papers in more 'crowded' research areas (those with abstracts more similar to other abstracts) have more LLM-modified content. They measure similarity by converting abstracts into numerical vectors and calculating the distance between these vectors. Papers are grouped into 'more similar' (closer vectors) and 'less similar' (further vectors). The figure then shows how LLM use changes over time for these two groups. Imagine research papers as points on a map; closer points represent similar research. This figure checks if points clustered together have more LLM use.
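
The "crowdedness" split can be sketched as follows, assuming the abstract embeddings are already computed (the embedding model itself is not restated here): each paper's distance to its nearest peer is found, and papers at or below the median distance form the "more similar" group.

```python
import numpy as np

def crowdedness_split(embeddings):
    """Return True where a paper falls in the 'more similar' group: its
    abstract embedding lies at or below the median distance to its
    nearest peer. Pairwise distances are O(n^2) in memory; a KNN index
    would be used at scale."""
    emb = np.asarray(embeddings, dtype=float)
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # ignore each paper's self-distance
    nearest = dist.min(axis=1)       # distance to closest peer
    return nearest <= np.median(nearest)
```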

First Mention

Text: "Figure 5: Papers in more crowded research areas tend to have a higher fraction of LLM-modified content."

Context: Papers in arXiv Computer Science are divided into two groups based on their abstract's embedding distance to their closest peer: papers more similar to their closest peer (below median distance) and papers less similar to their closest peer (above median distance).

Relevance: This figure explores the relationship between research area 'crowdedness' and LLM use, suggesting that LLM use might be higher in areas where papers are more similar to each other.

Critique
Visual Aspects
  • A more descriptive title, such as 'LLM Use and Research Area Crowdedness', would be beneficial.
  • Shortening the x-axis labels and adding a legend directly on the graphs would improve readability.
  • Using different colors or patterns for the bars representing 'more similar' and 'less similar' papers would enhance visual distinction.
Analytical Aspects
  • The figure only shows correlation, not causation. It's unclear whether LLM use *causes* similarity or if similar research areas are more prone to LLM use.
  • The method for calculating similarity (embedding distance) could be explained more clearly for a broader audience.
  • The implications of the findings, such as potential homogenization of scientific writing, could be discussed further.
Numeric Data
  • Estimated Alpha (More similar, Feb 2024): 0.222 fraction
  • Estimated Alpha (Less similar, Feb 2024): 0.147 fraction

Figure 6

This figure explores if shorter papers tend to have more LLM-modified content. arXiv Computer Science papers are divided into two groups: those shorter than 5,000 words and those longer. The figure shows the estimated LLM use in abstracts and introductions for both groups over time. It's like checking if shorter student essays use grammar-checking tools more than longer essays.

First Mention

Text: "Figure 6: Shorter papers tend to have a higher fraction of LLM-modified content."

Context: arXiv Computer Science papers are stratified by their full text word count, including appendices, into two bins: below or above 5,000 words (the rounded median).

Relevance: This figure investigates the relationship between paper length and LLM use, suggesting that shorter papers might have a higher proportion of LLM-generated content.

Critique
Visual Aspects
  • A more descriptive title, such as 'LLM Use and Paper Length', would improve clarity.
  • The x-axis labels could be shortened for better readability.
  • Adding a legend directly on the graphs would be helpful.
Analytical Aspects
  • Correlation does not equal causation. Shorter papers might have more LLM use, but the figure doesn't explain why.
  • Other factors, like the paper's topic or the author's resources, could influence both length and LLM use.
  • The caption mentions a robustness check and limitations for cs.CL. Explaining these in more detail would strengthen the analysis.
Numeric Data
  • Estimated Alpha (Shorter papers, Feb 2024): 0.177 fraction
  • Estimated Alpha (Longer papers, Feb 2024): 0.136 fraction

Discussion

Overview

This section discusses the key findings of the study, highlighting the increase in LLM-modified content in academic writing since the release of ChatGPT, particularly in Computer Science. It also summarizes the correlations found between LLM usage and factors like preprint posting frequency, research area crowdedness, and paper length. The discussion briefly touches on potential risks associated with LLM use in scientific publishing and calls for further research on promoting transparency and diversity in academic writing.

Limitations

Overview

This section acknowledges the limitations of the study. The focus on ChatGPT, the dominant LLM at the time, excludes other large language models used in academic writing. The study also addresses the potential for false positives in flagging LLM-modified text, particularly for non-native English writers, and argues that the near-zero estimates on pre-ChatGPT papers from early 2022 support the validity of its findings. Finally, the section recognizes that the observed correlations between LLM usage and paper characteristics are not necessarily causal and suggests further research to explore these relationships.

Estimated Fraction of LLM-Modified Sentences in Introductions

Overview

This appendix section presents Figure 7, a graph illustrating the estimated fraction of sentences modified by LLMs in the introductions of scientific papers across different academic venues over time. The inclusion of introductions complements the analysis of abstracts presented earlier in the paper (Figure 1). The figure shows trends similar to those observed in abstracts, with a notable increase in LLM usage after the launch of ChatGPT, especially in Computer Science. Introductions from bioRxiv were excluded because bulk PDF downloads were unavailable.

LLM prompts used in the study

Overview

This appendix section details the LLM prompts used in the study to generate LLM-modified text. The process involves a two-stage approach: first, summarizing a human-written paragraph into bullet points (a skeleton) and then expanding that skeleton back into a full paragraph using an LLM. This approach simulates a potential author workflow and allows the researchers to control for content while examining the stylistic differences between human and LLM-generated text. A third prompt for proofreading is also included.
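
A hypothetical sketch of this two-stage workflow using the OpenAI chat API is shown below; the prompt wording is abbreviated placeholder text, while the study's full prompts appear in Figures 8-10.

```python
from openai import OpenAI

client = OpenAI()

def chat(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def llm_rewrite(human_paragraph: str) -> str:
    # Stage 1: condense the human-written paragraph into a bullet-point skeleton.
    skeleton = chat("Summarize the following paragraph into a concise "
                    "bullet-point outline:\n\n" + human_paragraph)
    # Stage 2: expand the skeleton back into a fully fleshed-out paragraph.
    return chat("Expand the following outline into a full paragraph of "
                "academic prose:\n\n" + skeleton)
```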

Non-Text Elements

Figure 8

This figure presents an example prompt used to instruct an LLM to summarize a paragraph from a human-written paper into a skeleton outline. This process mimics how an author might extract the core ideas and information from a piece of text and condense it into a structured, concise form. It's like creating a blueprint or summary of the original paragraph, highlighting the main points and their relationships.

First Mention

Text: "Figure 8: Example prompt for summarizing a paragraph from a human-authored paper into a skeleton"

Context: This process simulates how an author might first only write the main ideas and core information into a concise outline. The goal is to capture the essence of the paragraph in a structured and succinct manner, serving as a foundation for the previous prompt.

Relevance: This figure is relevant because it illustrates the first stage of the two-stage process used to generate realistic LLM-produced training data. By summarizing human-written paragraphs into skeleton outlines, the researchers can then use these outlines to prompt the LLM to generate new text, simulating a potential use case of LLMs in scientific writing.

Critique
Visual Aspects
  • Instead of just presenting the prompt as plain text, consider using a visual representation, such as a flowchart or diagram, to illustrate the summarization process.
  • Highlighting key phrases within the prompt, such as 'summarize the goal' and 'reverse-engineer it into a list of bullet points', would improve readability.
  • Consider adding a visual example of a paragraph being summarized into a skeleton outline to further clarify the process.
Analytical Aspects
  • The prompt could be more specific about the desired level of detail in the skeleton outline. Providing examples of different levels of summarization would be helpful.
  • The prompt mentions 'reverse-engineering' which might be confusing to a non-technical audience. Rephrasing this as 'summarizing' or 'condensing' would improve clarity.
  • The prompt could benefit from a clearer explanation of why this summarization step is important for generating realistic LLM-produced text.

Figure 9

This figure shows an example prompt used to instruct an LLM to expand a skeleton outline into a full text paragraph. This simulates the second stage of the training data generation process, where the LLM elaborates on the concise outline to create a more detailed and fleshed-out piece of writing. Think of it like taking a blueprint and using it to construct the actual building.

First Mention

Text: "Figure 9: Example prompt for expanding the skeleton into a full text"

Context: The aim here is to simulate the process of using the structured outline as a basis to generate comprehensive and coherent text. This step mirrors the way an author might flesh out the outline into detailed paragraphs, effectively transforming the condensed ideas into a fully articulated section of a paper.

Relevance: This figure is relevant because it illustrates the second stage of the two-stage process for creating LLM-generated training data. This expansion step simulates how scientists might use LLMs to generate text based on their own outlines, providing a more realistic representation of LLM-assisted writing.

Critique
Visual Aspects
  • Similar to Figure 8, a visual representation of the expansion process, such as a diagram or example, would enhance understanding.
  • Highlighting key phrases like 'expand upon the concise version' and 'develop it into a fully fleshed-out text' would improve visual clarity.
  • Consider showing a visual example of a skeleton outline being expanded into a full text paragraph to demonstrate the process.
Analytical Aspects
  • The prompt could be more specific about the desired style and tone of the generated text. Should the LLM aim for a formal academic style or a more informal tone?
  • The prompt could benefit from a clearer explanation of how this expansion step contributes to generating realistic LLM-produced text and why this realism is important for the study.
  • Providing examples of different expansion styles would be helpful for illustrating the range of possible outputs.

Figure 10

This figure presents an example prompt for instructing an LLM to proofread a sentence. The goal is to ensure grammatical accuracy while minimizing changes to the original content. This is like using a grammar-checking tool to polish a sentence without altering its meaning.

First Mention

Text: "Figure 10: Example prompt for proofreading."

Context: Your task is to proofread the provided sentence for grammatical accuracy. Ensure that the corrections introduce minimal distortion to the original content.

Relevance: This figure is relevant because it demonstrates the prompt used for the proofreading step in the training data generation process. While not as central as the summarization and expansion steps, proofreading ensures the grammatical correctness of both human-written and LLM-generated text used in the analysis.

Critique
Visual Aspects
  • Presenting the prompt as plain text is sufficient, but adding a simple visual element, such as an icon representing proofreading, could enhance visual appeal.
  • Highlighting key phrases like 'grammatical accuracy' and 'minimal distortion' would emphasize the main goals of the proofreading task.
  • Consider adding a before-and-after example of a sentence being proofread to illustrate the process.
Analytical Aspects
  • The prompt could be more specific about the types of grammatical errors to be addressed. Should the LLM focus on punctuation, syntax, or other specific aspects of grammar?
  • The phrase 'minimal distortion' could be further clarified. Providing examples of acceptable and unacceptable changes would be helpful.
  • The prompt could benefit from a clearer explanation of why proofreading is important for the overall analysis and how it contributes to the reliability of the results.

Additional Information on Implementation and Validations

Overview

This appendix provides supplementary information about the data collection, the LLMs used, and their parameter settings. The data was collected from arXiv, bioRxiv, and 15 Nature portfolio journals, sampling up to 2,000 papers per month from January 2020 to February 2024. The gpt-3.5-turbo-0125 model, trained on data up to September 2021, was used to generate the training data. The rationale for focusing on ChatGPT is its dominant market share and strong performance in understanding scientific papers. The decoding temperature was set to 1.0, maximum decoding length to 2048 tokens, Top P to 1.0, and both frequency and presence penalties to 0.0. No specific stop sequences were configured.
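
Expressed as an API call, the reported decoding settings would look like the following sketch (the surrounding prompt construction is assumed; see Figures 8-10 for the prompts themselves):

```python
from openai import OpenAI

client = OpenAI()
prompt = "..."  # one of the prompts from Figures 8-10

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,        # decoding temperature
    max_tokens=2048,        # maximum decoding length
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    stop=None,              # no stop sequences configured
)
print(response.choices[0].message.content)
```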

Word Frequency Shift in arXiv Computer Science introductions

Overview

This appendix section presents a figure (Figure 11) showing the shift in word frequency within the introductions of arXiv Computer Science papers over the past two years. The figure focuses on the same four words highlighted in Figure 2: "realm," "intricate," "showcasing," and "pivotal." The analysis reveals a similar trend to Figure 2, where these words, after being relatively infrequent for over a decade, experienced a sudden increase in usage starting in 2023. The section notes that data from 2010-2020 is omitted due to the computational cost of processing a large volume of arXiv papers.
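
The underlying measurement is a simple frequency count. The sketch below computes occurrences per million words for the four tracked words over a set of introductions from one time bin; the tokenization is an illustrative assumption.

```python
import re
from collections import Counter

TRACKED = ("realm", "intricate", "showcasing", "pivotal")

def frequency_per_million(introductions):
    """Occurrences per million words of each tracked word, across a list
    of introduction texts from one time period."""
    counts = Counter()
    total_words = 0
    for text in introductions:
        tokens = re.findall(r"[a-z]+", text.lower())
        total_words += len(tokens)
        counts.update(t for t in tokens if t in TRACKED)
    return {w: counts[w] * 1_000_000 / max(total_words, 1) for w in TRACKED}
```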

Non-Text Elements

Figure 11

This figure shows how often certain words appeared in the introductions of arXiv Computer Science papers over the past two years. The words 'realm,' 'intricate,' 'showcasing,' and 'pivotal' are tracked. These words were used infrequently before 2023 but saw a sudden increase in 2023, suggesting a potential shift in language use, possibly related to the rise of LLMs. It's like noticing that certain ingredients started showing up more often in recipes after a new cooking gadget became popular.

First Mention

Text: "Figure 11: Word Frequency Shift in sampled arXiv Computer Science introductions in the past two years."

Context: Appendix D, on page 19 of the paper, provides additional results on word frequency shifts in introductions.

Relevance: This figure provides additional evidence supporting the main finding of increased LLM usage in Computer Science. The sudden increase in the frequency of specific words after the release of ChatGPT suggests a potential link between LLM use and changes in writing style.

Critique
Visual Aspects
  • The figure could benefit from a more descriptive title, such as 'Increased Frequency of Specific Words in arXiv CS Introductions After ChatGPT'.
  • Adding the actual word frequencies (per million words) as data labels on the lines would improve readability and allow for precise comparisons.
  • Extending the time axis back to 2020, or even earlier if data is available, would provide a clearer picture of the long-term trend and the impact of ChatGPT.
Analytical Aspects
  • While the figure shows a correlation between the increase in word frequency and the release of ChatGPT, it doesn't establish causation. Other factors could be contributing to this shift in language use.
  • The figure only focuses on four words. Analyzing a larger set of words or using different selection criteria (e.g., words disproportionately used by LLMs) would provide a more comprehensive view of language changes.
  • The figure focuses on introductions. Comparing these trends with similar analyses of abstracts or other sections would be informative.

Fine-grained Main Findings

Overview

This appendix section provides further details about the main findings by presenting supporting figures. Figure 12 shows that the relationship between first-author preprint posting frequency and LLM usage holds across different Computer Science sub-categories (cs.CV, cs.LG, cs.CL). Figure 13 demonstrates that the relationship between paper similarity and LLM usage also holds across these sub-categories. Figure 14 examines the relationship between paper length and LLM usage, showing it holds for cs.CV and cs.LG but not for cs.CL, possibly due to limited sample size in cs.CL.

Non-Text Elements

Figure 12

This figure investigates the relationship between how often a paper's first author posts preprints on arXiv and the use of LLMs in their Computer Science papers. The papers are divided into two groups: those whose first authors posted two or fewer preprints in a year, and those who posted three or more. It then shows the estimated fraction of LLM-modified sentences over time for three different Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Think of it like checking if students who submit more draft essays also use grammar-checking software more often, but specifically within different areas of study like literature, history, or science.

First Mention

Text: "Figure 12: The relationship between first-author preprint posting frequency and LLM usage holds across arXiv Computer Science sub-categories."

Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are stratified into two groups based on the preprint posting frequency of their first author, as measured by the number of first-authored preprints in the year: those with ≤ 2 preprints and those with ≥ 3 preprints.

Relevance: This figure supports the broader finding that researchers who post more preprints tend to use LLMs more, showing this trend holds across different areas within Computer Science. This suggests that the relationship isn't specific to just one type of Computer Science research.

Critique
Visual Aspects
  • The figure could have a more informative title, like 'LLM Use and Preprint Frequency Across CS Subcategories'.
  • Shortening the x-axis labels (e.g., '22.1-3' for '2022.1-3') would improve readability.
  • Adding a legend directly onto each subgraph would make it easier to interpret the bars.
Analytical Aspects
  • The figure shows correlation, not causation. It doesn't prove that posting more preprints *causes* more LLM use.
  • Other factors, like the researcher's career stage or institutional pressures, could influence both preprint frequency and LLM use.
  • It would be helpful to explain why the 2023 author grouping was used for the 2024.1-2 data.

Figure 13

This figure explores whether papers in more 'crowded' research areas (those with abstracts more similar to other abstracts within the same subcategory) have more LLM-modified content. Similarity is measured by converting abstracts into numerical vectors and calculating the distance between them. Papers are grouped into 'more similar' (closer vectors) and 'less similar' (further vectors). The figure shows how LLM use changes over time for these two groups across three Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Imagine research papers as points on a map, with closer points representing similar research. This figure checks if points clustered together in different academic neighborhoods show more LLM use.

First Mention

Text: "Figure 13: The relationship between paper similarity and LLM usage holds across arXiv Computer Science sub-categories."

Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are divided into two groups based on their abstract's embedding distance to their closest peer within the respective sub-category: papers more similar to their closest peer (below median distance) and papers less similar to their closest peer (above median distance).

Relevance: This figure further investigates the relationship between research area 'crowdedness' and LLM use, showing that the trend observed in the main analysis holds across different Computer Science subcategories. This suggests the relationship isn't limited to a specific area of Computer Science.

Critique
Visual Aspects
  • A more descriptive title like 'LLM Use and Research Area Crowdedness Across CS Subcategories' would be better.
  • Shortening the x-axis labels and adding a legend directly on the graphs would improve readability.
  • Using different colors or patterns for the bars representing 'more similar' and 'less similar' papers would make the graphs easier to understand.
Analytical Aspects
  • The figure shows correlation, not causation. It's unclear whether LLM use causes similarity or if similar research areas are simply more likely to use LLMs.
  • The explanation of how similarity is measured could be clearer for a non-technical audience.
  • The potential implications of these findings, such as the homogenization of scientific writing, could be discussed more thoroughly.

Figure 14

This figure explores if shorter papers have more LLM-modified content within different Computer Science subcategories (cs.CV, cs.LG, and cs.CL). Papers are divided into two groups: shorter than 5,000 words and longer than 5,000 words. It then shows the estimated LLM use for both groups over time. It's like seeing if shorter student essays use grammar-checking tools more than longer essays, but specifically for essays on different scientific topics.

First Mention

Text: "Figure 14: The relationship between paper length and LLM usage holds for cs.CV and cs.LG, but not for cs.CL."

Context: Papers in each arXiv Computer Science sub-category (cs.CV, cs.LG, and cs.CL) are stratified by their full text word count, including appendices, into two bins: below or above 5,000 words (the rounded median).

Relevance: This figure provides a more detailed look at the relationship between paper length and LLM use by examining it across different Computer Science subcategories. It also highlights an exception in cs.CL, where the relationship doesn't hold, suggesting other factors might be at play.

Critique
Visual Aspects
  • A title like 'LLM Use and Paper Length Across CS Subcategories' would be more informative.
  • Shortening x-axis labels would improve readability.
  • Adding a legend directly on each subgraph would make interpretation easier.
Analytical Aspects
  • The figure shows correlation, not causation. It doesn't explain *why* shorter papers might have more LLM use.
  • Other factors, like the paper's topic or the author's resources, could influence both length and LLM use.
  • The caption mentions a limited sample size for cs.CL. Quantifying this limitation (e.g., stating the actual sample size) and discussing its potential impact on the results would be helpful.

Proofreading Results on arXiv data

Overview

This appendix section investigates the impact of using LLMs for proofreading on the detection of LLM-generated text. It presents a figure (Figure 15) showing a slight increase in the estimated fraction of LLM-modified content after proofreading across various arXiv categories. This suggests that the method used in the study is robust to minor edits introduced by proofreading, as it can still detect the underlying LLM-generated content.

Non-Text Elements

Figure 15

This figure investigates how proofreading with Large Language Models (LLMs) affects the estimation of LLM-generated content in scientific papers. It takes abstracts from different arXiv categories (Computer Science, Electrical Engineering, Mathematics, Physics, Statistics) and measures the estimated fraction of LLM-modified content (alpha) before and after using LLMs for proofreading. The slight increase in alpha after proofreading suggests that the method can detect even small edits made by LLMs, demonstrating its robustness. It's like checking if a tool that measures how much of a cake was made by a machine can still work even if the machine only added the frosting.

First Mention

Text: "Figure 15: Robustness of estimations to proofreading."

Context: Appendix F, on page 23 of the paper, presents the proofreading results used to assess the method's robustness.

Relevance: This figure is important because it addresses a potential weakness of the method: its sensitivity to minor edits. By showing that the method can still detect LLM use even after proofreading, it strengthens the validity of the overall findings.

Critique
Visual Aspects
  • A more descriptive title like 'Impact of Proofreading on LLM Detection' would be clearer.
  • Adding the numerical values of the estimated alpha before and after proofreading on top of the bars would improve readability.
  • The y-axis could be labeled 'Estimated Fraction of LLM-Modified Content' instead of just 'Estimated Alpha' for better understanding by a broader audience.
Analytical Aspects
  • The figure only uses abstracts. Showing similar results for other sections of papers (e.g., introductions) would provide a more complete picture.
  • The choice of arXiv categories could be explained. Are these representative of all scientific fields?
  • The specific proofreading prompt used should be described in more detail. What instructions were given to the LLM?

Extended Related Work

Overview

This section provides a detailed overview of existing methods for detecting LLM-generated text. It covers zero-shot detection methods, training-based detection methods, and LLM watermarking. The section highlights the limitations of each approach, emphasizing challenges like access to LLM internals, overfitting, bias, and the need for model owner involvement in watermarking. It concludes by contrasting these methods with the distributional GPT quantification framework used in the paper, which offers advantages in stability, accuracy, and independence from model owners.
