People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text

Table of Contents

Overall Summary

Study Background and Main Findings

This paper investigates human detection of AI-generated text, finding that "expert" annotators (frequent LLM users) significantly outperform both non-experts and most automatic detectors, achieving 99.3% accuracy compared to the best automatic detector's 98%. The study reveals that experts rely on cues such as "AI vocabulary," formulaic structures, and originality, while non-experts perform at chance level. Notably, the majority vote of five experts correctly classified every article in Experiment 1, even under evasion tactics such as paraphrasing. However, humanization, particularly with the o1-pro model, reduced expert confidence, indicating the challenges posed by more advanced AI.
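To make the majority-vote aggregation concrete, the following is a minimal Python sketch, not the authors' code, of how five binary judgments can be combined into a single decision; the labels shown are hypothetical.

```python
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most common label among an odd number of annotator judgments."""
    # With five annotators and binary labels ("AI" / "Human"), ties cannot occur.
    return Counter(labels).most_common(1)[0][0]

# Example: five hypothetical expert judgments for one article.
print(majority_vote(["AI", "AI", "Human", "AI", "AI"]))  # -> "AI"
```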

Research Impact and Future Directions

The study provides compelling evidence that humans who frequently use LLMs for writing tasks, termed "experts," can effectively detect AI-generated text, outperforming most existing automatic detectors. The results suggest that this enhanced detection ability stems from hands-on experience with LLMs rather than from general writing expertise alone, although the correlational design means this attribution should be made cautiously. In particular, the study does not establish a causal link between specific training methods and improved detection accuracy, as training was not explicitly manipulated.

The practical utility of these findings is substantial. The identification of specific clues used by experts, such as "AI vocabulary" and formulaic structures, offers valuable insights for developing more effective detection methods and training programs. The study also highlights the potential of human-machine collaboration, where human expertise can complement and enhance automated detection systems. These findings are particularly relevant in contexts where the integrity of information is crucial, such as journalism, academia, and online content moderation.

While the study provides clear guidance on the potential of human expertise in detecting AI-generated text, it also acknowledges key uncertainties. The effectiveness of the identified clues may evolve as LLMs become more sophisticated, and the study's focus on American English articles limits the generalizability of the findings to other languages and writing styles. Furthermore, the study primarily focuses on the detection of AI-generated text and does not delve deeply into the ethical implications of increasingly sophisticated evasion techniques.

Several critical questions remain unanswered. For instance, what specific training methods are most effective in enhancing human detection capabilities? How can human-machine collaboration be optimized to maximize detection accuracy and efficiency? Additionally, the study's methodology has limitations that could affect the conclusions. The reliance on self-reported LLM usage to define expertise could introduce bias, and the relatively small sample size of expert annotators limits the generalizability of the findings. Future research should address these limitations by employing more objective measures of expertise, using larger and more diverse samples, and exploring the effectiveness of different training interventions.

Critical Analysis and Recommendations

Clear Research Question (written-content)
The paper clearly defines the research question, focusing on human detection of AI-generated text from modern LLMs. This provides a solid foundation for the study and allows for focused investigation.
Section: Abstract
Well-Defined Methodology (written-content)
The methodology, employing human annotators and collecting paragraph-length explanations, is concisely described and provides a clear framework for the study. This enhances the replicability and validity of the research.
Section: Abstract
Significant Findings on Expert Performance (written-content)
The finding that expert annotators outperform automatic detectors, even with evasion tactics, is significant and highlights the potential of human expertise in this domain. This has practical implications for developing more robust detection strategies.
Section: Abstract
Comprehensive Evaluation Metrics (written-content)
Using both TPR and FPR provides a balanced evaluation of human and automatic detectors, allowing for a thorough assessment of their performance. This ensures a nuanced understanding of the strengths and weaknesses of each approach.
Section: How good are humans at detecting AI-generated text?
Detailed Annotator Analysis (written-content)
The analysis of differences between expert and nonexpert annotators, including clues used and accuracy rates, sheds light on cognitive processes involved in detecting AI-generated text. This provides valuable insights for developing training programs and improving detection methods.
Section: How good are humans at detecting AI-generated text?
Quantify Performance Difference (written-content)
The abstract lacks a quantitative measure of the performance difference between expert annotators and automatic detectors. Including specific accuracy rates would significantly strengthen the abstract by providing concrete evidence of the experts' superior performance.
Section: Abstract
Address Potential Bias in Article Selection (written-content)
Generating AI articles based on human-written counterparts may introduce a bias favoring human detection. Acknowledging and discussing this potential bias would enhance the study's external validity and provide a more balanced perspective.
Section: How good are humans at detecting AI-generated text?
Expand on Limitations of Automatic Detectors (written-content)
While the paper highlights human expert performance, elaborating on the specific limitations of automatic detectors would provide a more balanced comparison. This would further strengthen the paper's contribution to the field by providing a clearer understanding of the current state of automated detection.
Section: Fine-grained analysis of expert performance
Elaborate on Training Implications (written-content)
Providing a more detailed discussion of how the findings could inform the design of specific training programs would enhance the paper's practical implications. This would offer concrete guidance for improving human detection capabilities.
Section: Fine-grained analysis of expert performance
Visual Clarity of Heatmaps (graphical-figure)
The heatmaps effectively visualize frequency data, but adding a brief explanation of what a heatmap is and providing clearer axis labels and a color scale legend would enhance their clarity. This would make the figures more accessible to a wider audience.
Section: Fine-grained analysis of expert performance

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: A human expert's annotations of an article generated by OpenAI's...
Full Caption

Figure 1: A human expert's annotations of an article generated by OpenAI's o1-pro with humanization. The expert provides a judgment on whether the text is written by a human or AI, a confidence score, and an explanation (including both free-form text and highlighted spans) of their decision.

Figure/Table Image (Page 1)
Figure 1: A human expert's annotations of an article generated by OpenAI's o1-pro with humanization. The expert provides a judgment on whether the text is written by a human or AI, a confidence score, and an explanation (including both free-form text and highlighted spans) of their decision.
First Reference in Text
We hire human annotators to read non-fiction English articles, label them as written by either a human or by AI, and provide a paragraph-length explanation of their decision-making process (Figure 1).
Description
  • Overview of Figure 1: Figure 1 shows an example of how a human expert evaluated and marked up a piece of text. This text is a news article that was produced by a Large Language Model, specifically OpenAI's o1-pro, which is a type of artificial intelligence designed to generate human-like text. The term "humanization" here likely refers to techniques used to make the AI-generated text appear more like it was written by a human. The expert's job is to determine whether the article was written by a human or a machine, and to explain their reasoning.
  • Content of the Article: The article discusses a pilot in Alaska who drops turkeys to rural homes for Thanksgiving. This scenario sets a real-world context for evaluating the AI's ability to generate text that could be mistaken for human writing in a practical, everyday situation.
  • Expert's Annotation Components: The expert provides three pieces of information: a judgment, a confidence score, and an explanation. The "judgment" is a binary classification, meaning the expert decides if the text is "AI-generated" or "human-written." The "confidence score" is a rating on a scale of 1 to 5, where 1 means the expert is not very sure about their judgment, and 5 means they are very sure. The "explanation" is a written justification for their decision, and it can include highlighting parts of the text that influenced their judgment. Highlighting parts of the text means visually marking specific words or phrases, like using a yellow marker on a printed page, to show which parts were particularly important in making their decision. A minimal sketch of how such an annotation record could be represented appears after this list.
  • Expert's Decision and Rationale: In this specific example, the expert has labeled the text as "AI-generated" with a confidence score of 4 out of 5, indicating a high level of certainty. The expert's explanation points out that some of the quotes in the article feel realistic, but others are presented in a way that seems unnatural. For instance, the expert notes that the phrase "He shrugged as if that were the most ordinary idea, then laughed" could be simplified. They also suggest that the article could provide more factual details about the challenges faced by people in Alaska and the limited transportation options, indicating that the AI-generated text may lack the depth and detail typically found in human-written articles on similar topics. The expert also mentions that the article gets "sentimental and corny at times," which could be another clue that it was generated by an AI, as AI models might overuse emotional language or clichés in an attempt to mimic human writing.
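The annotation components described above can be pictured as a simple record. The following is an illustrative sketch; the field names and values are assumptions for illustration, not the study's actual data schema.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    judgment: str              # "AI-generated" or "Human-written"
    confidence: int            # 1 (not very sure) to 5 (very sure)
    explanation: str           # free-form rationale written by the annotator
    highlighted_spans: list[tuple[int, int]] = field(default_factory=list)  # character offsets of marked text

# Hypothetical record mirroring the Figure 1 example.
example = Annotation(
    judgment="AI-generated",
    confidence=4,
    explanation="Some quotes feel realistic, but others read unnaturally; the piece gets sentimental and corny at times.",
    highlighted_spans=[(120, 178)],
)
```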
Scientific Validity
  • Clarity of Methodology: The figure and its caption, along with the reference text, provide a reasonably clear overview of the methodology used for human annotation. The task given to the annotators is well-defined, and the criteria for their judgments are outlined. However, the scientific validity would be enhanced by providing more details about the selection criteria for "expert" annotators and the specific instructions given to them beyond the general task description.
  • Subjectivity in Annotation: The annotation process inherently involves subjective judgment, which can introduce bias. The reliance on human judgment for determining whether a text is AI-generated or human-written is a potential source of variability. The use of a confidence score helps to quantify this subjectivity to some extent, but the explanation provided by the annotator is crucial for understanding the basis of their judgment. The scientific validity of the study could be improved by having multiple annotators evaluate the same texts and measuring the inter-rater reliability to assess the consistency of these subjective judgments. A sketch of one such agreement measure, Cohen's kappa, follows this list.
  • Representation of AI-Generated Text: The figure represents an example of text generated by OpenAI's o1-pro with humanization. It is important to note that this is just one example, and the quality and characteristics of AI-generated text can vary widely depending on the model and the specific techniques used for humanization. The validity of the study's findings would be strengthened by including a diverse set of examples from different AI models and humanization methods.
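As one concrete way to carry out the inter-rater reliability check suggested above, Cohen's kappa for two annotators could be computed as in the following sketch; the judgments shown are hypothetical, and this calculation is not part of the paper's reported analysis.

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two annotators on six articles.
a = ["AI", "AI", "Human", "AI", "Human", "AI"]
b = ["AI", "Human", "Human", "AI", "Human", "AI"]
print(round(cohens_kappa(a, b), 3))  # -> 0.667
```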
Communication
  • Visual Presentation: The figure is visually well-organized, with clear sections for the article text, the annotator's decision, confidence score, and explanation. The use of color-coding and highlighting helps to distinguish between different parts of the annotation. However, the specific colors used for highlighting are not explained in the caption or the figure itself, which could be a minor point of confusion for the reader.
  • Clarity of Explanation: The expert's explanation is concise and provides specific examples from the text to support their judgment. The language used is accessible and avoids overly technical jargon, making it understandable to a reader who may not be familiar with AI-generated text analysis. The explanation effectively communicates the thought process behind the expert's decision.
  • Effectiveness in Demonstrating Annotation Process: The figure serves as an effective illustration of the annotation process described in the text. It demonstrates how an expert evaluates an article, makes a judgment, assigns a confidence score, and provides a rationale. This helps the reader to understand the methodology used in the study and provides a concrete example of the kind of data collected from human annotators.
  • Potential for Misinterpretation: While the figure is generally clear, there is a potential for misinterpretation regarding the term "humanization." A reader unfamiliar with the concept might assume it refers to a process of making the text more humane or ethical, rather than more human-like in style. A brief definition or clarification of this term within the caption or the figure itself could improve understanding.
Table 5: Survey of Annotators, specifically their backgrounds relating to LLM...
Full Caption

Table 5: Survey of Annotators, specifically their backgrounds relating to LLM usage and field of work. Note that expert Annotator #1 was one of the original 5 annotators (along with the 4 non-experts) and remained an annotator for all expert trials.

Figure/Table Image (Page 17)
Table 5: Survey of Annotators, specifically their backgrounds relating to LLM usage and field of work. Note that expert Annotator #1 was one of the original 5 annotators (along with the 4 non-experts) and remained an annotator for all expert trials.
First Reference in Text
Our five expert annotators are native English speakers hailing from the US, UK, and South Africa. Most work as editors, writers, and proofreaders and have extensively used AI assistants. See Table 5 for more information about annotators.
Description
  • Purpose of the Table: This table provides information about the people who participated in the study as annotators, specifically the "expert" annotators. Annotators are the people who labeled the text as either human-written or AI-generated. The table focuses on their backgrounds, particularly their experience with using Large Language Models (LLMs) - a type of AI that can generate text - and their professional fields. This is like a brief introduction to the people involved in the study, similar to how a news article might introduce the key people involved in a story.
  • Content of the Table: The table likely includes information such as the annotators' level of education, their native language (which the reference text indicates is English), their nationality (US, UK, and South Africa), their profession, and their experience with using LLMs. For example, it might show that one annotator has a Master's degree, is from the UK, works as an editor, and uses LLMs daily. By providing this information, the researchers are giving us a better understanding of who these "expert" annotators are and what kind of expertise they bring to the task.
  • Expert vs. Non-expert: The caption highlights that one of the expert annotators, "Annotator #1," was also part of the initial group of five annotators, which included four non-experts. This suggests that this particular expert was involved in the study from the beginning and participated in all the "expert trials," meaning they have a lot of experience with the task. The distinction between "expert" and "non-expert" is based on their self-reported usage of LLMs, as established earlier in the paper.
  • Relevance of Background Information: The annotators' backgrounds are important because they can influence their ability to detect AI-generated text. For example, someone who works as a professional writer or editor and frequently uses LLMs might be better at spotting the nuances of AI-generated text compared to someone with less experience in these areas. By providing information about the annotators' backgrounds, the researchers are adding context to the study's findings and helping us understand the qualifications of the people making the judgments.
Scientific Validity
  • Selection of Expert Annotators: The validity of using "expert" annotators depends on the criteria used for defining expertise. The reference text suggests that expertise is based on being a native English speaker, working in a relevant field (editors, writers, proofreaders), and having extensive experience using AI assistants. These are reasonable criteria, but the specific thresholds for "extensive experience" should be clearly defined. Additionally, while self-reported expertise is a practical approach, it would be beneficial to include some objective measure of their ability to detect AI-generated text, perhaps through a pre-test or calibration task.
  • Sample Size and Diversity: The reference text mentions five expert annotators. While this is a relatively small sample size, it's important to consider the trade-off between the number of annotators and the depth of their expertise. Having a small group of highly qualified experts might be more valuable than a larger group with less specialized knowledge. However, the authors should acknowledge the limitations of a small sample size in terms of generalizability. The diversity of the annotators' backgrounds (US, UK, South Africa) is a strength, as it reduces the potential for cultural or linguistic biases in the annotations.
  • Potential for Bias: The annotators' backgrounds could potentially introduce biases. For example, editors or writers might be more attuned to certain stylistic features or errors that are common in AI-generated text. The authors should discuss the potential for such biases and how they might have mitigated them (e.g., through clear guidelines for annotation, training, or inter-rater reliability checks).
  • Transparency of Information: Providing a detailed survey of the annotators' backgrounds enhances the transparency of the study. This allows readers to better understand the qualifications of the experts and evaluate the potential impact of their backgrounds on the findings. The authors should ensure that the table includes all relevant information about the annotators' demographics, education, profession, and LLM usage, while also respecting their privacy.
Communication
  • Caption Clarity: The caption is clear and concisely explains the purpose of the table, which is to provide information about the annotators' backgrounds in relation to LLM usage and their field of work. It also highlights the unique status of Annotator #1, which is helpful for understanding the continuity of expertise across the study.
  • Table Organization and Content: Assuming a standard tabular format with clear labels for rows (annotators) and columns (background characteristics), the table should be easy to read and understand. The specific characteristics included in the table (e.g., education, profession, LLM usage) should be clearly defined and relevant to the study's focus on detecting AI-generated text.
  • Reference Text Support: The reference text provides additional context about the expert annotators, including their native language and professional backgrounds. This information complements the table and helps to paint a more complete picture of the experts' qualifications.
  • Potential Improvements: The communication effectiveness could be improved by including a brief explanation in the caption of *why* the annotators' backgrounds are important in the context of this study. Additionally, the table itself could include a column summarizing each annotator's overall level of expertise or experience with AI-generated text, perhaps based on a composite score derived from their background characteristics. This would provide a more direct measure of their qualifications for the task.
Figure 4: Guidelines provided to the annotators for the annotation task. The...
Full Caption

Figure 4: Guidelines provided to the annotators for the annotation task. The annotators were also provided additional examples and guidance during the data collection process.

Figure/Table Image (Page 21)
Figure 4: Guidelines provided to the annotators for the annotation task. The annotators were also provided additional examples and guidance during the data collection process.
First Reference in Text
All annotators were required to read the guidelines (Figure 4) and sign a consent form (Figure 5) prior to the labeling task.
Description
  • Purpose of the Figure: This figure displays the instructions given to the people who participated in the study as "annotators." Annotators were responsible for reading articles and deciding whether they were written by a human or an AI. The guidelines are like a rule book or a training manual for the annotators, explaining how to perform the task. Think of it like the instructions you'd get before playing a new board game - they tell you what you need to do, what the goal is, and any special rules you need to follow.
  • Content of the Guidelines: The guidelines likely explain the task in detail, including what criteria to use when deciding if a text is human or AI-generated, how to label or mark the text, and how to record their decisions. The caption mentions that annotators were also given "additional examples and guidance." This means they probably saw examples of human-written and AI-generated text, along with explanations of why they were classified that way. It is like showing someone examples of correctly solved math problems before asking them to solve new ones. The guidelines likely also include instructions for using the annotation interface and details about the rating scale used for indicating confidence in their judgments.
  • Importance of Guidelines: Providing clear and comprehensive guidelines is crucial for ensuring that the annotators understand the task and perform it consistently. This is important for the quality and reliability of the data collected in the study. If the annotators don't understand the instructions or apply them differently, it can introduce errors or inconsistencies into the data, making it harder to draw valid conclusions. The guidelines help to standardize the annotation process, making it more like a controlled experiment.
  • Consent and Ethics: The reference text mentions that annotators were also required to sign a consent form. This is an important ethical requirement in research involving human participants. The consent form ensures that the annotators understand the purpose of the study, what they will be asked to do, any potential risks or benefits, and that their participation is voluntary. By reading the guidelines and signing the consent form, the annotators are indicating that they understand the task and agree to participate.
Scientific Validity
  • Standardization of Annotation Task: Providing guidelines to annotators is a crucial step in ensuring the standardization and consistency of the annotation task. By providing clear instructions and examples, the researchers are attempting to minimize variability in how different annotators approach the task and interpret the criteria for distinguishing between human-written and AI-generated text. This standardization is essential for the scientific validity of the study, as it helps to ensure that the data collected is reliable and comparable across annotators.
  • Clarity and Completeness of Guidelines: The scientific validity of the annotation process depends on the clarity and completeness of the guidelines. The guidelines should clearly define the task, provide specific criteria for distinguishing between human and AI-generated text, and address any potential ambiguities or edge cases. The authors should demonstrate that the guidelines were sufficiently detailed and unambiguous to ensure that annotators understood the task and applied the criteria consistently. The mention of "additional examples and guidance" suggests an effort to further clarify the task, which is a positive aspect.
  • Potential for Subjectivity: Despite the use of guidelines, the task of distinguishing between human and AI-generated text inherently involves subjective judgment. Different annotators might interpret the criteria differently or have varying levels of sensitivity to certain linguistic features. The authors should acknowledge this potential for subjectivity and discuss how they attempted to minimize its impact (e.g., through training, pilot testing, or inter-rater reliability checks).
  • Ethical Considerations: The reference to a consent form demonstrates attention to ethical considerations, which is essential when conducting research involving human participants. The authors should provide more details about the consent process and the information provided to annotators, ensuring that participants were fully informed about the study's purpose, procedures, and their rights before agreeing to participate.
Communication
  • Caption Clarity: The caption clearly states the purpose of the figure: to present the "Guidelines provided to the annotators for the annotation task." It also mentions that additional examples and guidance were provided, which is important information about the annotation process. The caption is concise and easy to understand.
  • Visual Presentation: The provided image of the guidelines is well-structured and easy to read. It uses headings, bullet points, and numbered lists to organize the information, making it easy to follow. The language is clear and direct, and the instructions are broken down into manageable steps. The use of bolding and different font sizes helps to highlight important information and distinguish between different sections of the guidelines.
  • Reference Text Support: The reference text provides important context by mentioning the consent form, emphasizing the ethical considerations involved in the study. It also reinforces the importance of the guidelines in the annotation process.
  • Potential Improvements: While the guidelines are generally well-written, the communication effectiveness could be further improved by providing a brief rationale in the figure caption for *why* guidelines are important in this study (e.g., to ensure consistency and reliability of annotations). Additionally, while the figure mentions examples, it could be beneficial to include a few key examples directly in the figure or in an appendix to illustrate the application of the guidelines. The specific criteria used for distinguishing between human and AI-generated text could also be made more explicit, either in the figure or in an accompanying table. Finally, the instructions could be made more concise by removing redundant phrases like "You do this in 2 ways" and simplifying some of the explanations.
Figure 5: Consent form which the annotators were asked to sign via GoogleForms...
Full Caption

Figure 5: Consent form which the annotators were asked to sign via GoogleForms before collecting the data.

Figure/Table Image (Page 26)
Figure 5: Consent form which the annotators were asked to sign via GoogleForms before collecting the data.
First Reference in Text
All annotators were required to read the guidelines (Figure 4) and sign a consent form (Figure 5) prior to the labeling task.
Description
  • Purpose of the Figure: This figure shows the consent form that the annotators (the people participating in the study) were required to sign before they could start labeling articles as human-written or AI-generated. A consent form is a document that explains the study's purpose, procedures, risks, and benefits to potential participants. It's a way of ensuring that people understand what they're agreeing to before they participate in a research study. It is a standard ethical requirement in research involving human subjects.
  • Content of the Consent Form: The consent form likely includes information about the study's goals, what the annotators will be asked to do (e.g., read articles, label them as human or AI, provide explanations), how much time it will take, any potential risks or benefits of participating, how their data will be used and protected, and their rights as participants (e.g., the right to withdraw from the study at any time). It also includes a statement that participation is voluntary and that they are over 18 years of age. The form in the figure appears to use a question-and-answer format to convey this information, which can make it easier for participants to understand.
  • Use of Google Forms: The caption mentions that the consent form was administered via Google Forms. This is a common online tool for creating surveys and forms. Using an online platform like Google Forms makes it easy for the researchers to collect and manage consent from participants remotely. It also ensures that all participants receive the same information and that their responses are recorded electronically.
  • Importance of Informed Consent: Obtaining informed consent from participants is a fundamental ethical principle in research involving human subjects. It means that participants must be fully informed about the study and voluntarily agree to participate, without any coercion or undue influence. By signing the consent form, the annotators are indicating that they have read and understood the information provided and that they agree to participate in the study under the stated conditions. This helps to protect the rights and welfare of the participants.
Scientific Validity
  • Ethical Requirement: Obtaining informed consent is a standard ethical requirement for research involving human participants. It is essential for ensuring that participants are treated ethically and that their rights are protected. By including the consent form in the paper, the authors are demonstrating their commitment to ethical research practices.
  • Completeness of Information: The scientific validity of the consent process depends on the completeness and clarity of the information provided in the consent form. The form should adequately explain the study's purpose, procedures, risks, benefits, data handling practices, and participant rights in a way that is understandable to the target audience. The provided image suggests that the form covers these key elements, but the full text would need to be reviewed to assess its completeness.
  • Voluntary Participation: The consent form should emphasize that participation is voluntary and that participants can withdraw at any time without penalty. This is crucial for ensuring that participants are not coerced into participating and that they have the autonomy to make their own decisions about involvement in the study. The provided form clearly states this.
  • Documentation and Record-Keeping: Using Google Forms to collect consent provides a clear record of participant agreement. This is important for documentation and accountability. The authors should describe how they stored and managed the consent forms to ensure confidentiality and data security.
Communication
  • Caption Clarity: The caption clearly states the purpose of the figure: to present the consent form used in the study. It also specifies that the form was administered via Google Forms, which is helpful information about the data collection process.
  • Visual Presentation: The provided image of the consent form is well-structured and easy to read. It uses clear headings, bullet points, and a question-and-answer format to organize the information. The language is generally straightforward and accessible, although there are a few instances where the phrasing could be simplified for a broader audience. The form is visually appealing and uses formatting effectively to highlight important information.
  • Reference Text Support: The reference text reinforces the importance of the consent form by stating that all annotators were required to sign it before participating in the labeling task. This highlights the ethical considerations involved in the study.
  • Potential Improvements: While the consent form is generally well-designed, there are a few areas where the communication effectiveness could be improved. Some of the language could be simplified to make it more accessible to a lay audience (e.g., "a risk of breach of confidentiality" could be rephrased as "a risk that your personal information could be revealed"). Additionally, the form could explicitly state that participants should be 18 or older *before* describing the compensation. The form could also benefit from more explicit statements about data privacy and security, assuring participants that their data will be handled responsibly. Finally, while the form mentions that participation will "help inform if people can detect whether text they are reading is written by another human or by AI models," it could elaborate on the potential societal benefits of this research (e.g., helping to address concerns about misinformation or the misuse of AI).
Table 15: Prompt Template for Story Generation, where STORY PROMPT is the...
Full Caption

Table 15: Prompt Template for Story Generation, where STORY PROMPT is the writing prompt from r/WritingPrompts that the human-written story was written about and WORD COUNT is the length of the story to generate.

Figure/Table Image (Page 26)
Table 15: Prompt Template for Story Generation, where STORY PROMPT is the writing prompt from r/WritingPrompts that the human-written story was written about and WORD COUNT is the length of the story to generate.
First Reference in Text
We collect 30 stories from r/WritingPrompts. We generate corresponding AI-generated stories with the prompt in Table 15.
Description
  • Purpose of the Table: This table presents the template used for generating AI-written stories in the study. The template is a set of instructions, or a "prompt," given to an AI model to produce a story. The prompt is designed to mimic the kind of creative writing prompts found on the website Reddit, in a section called r/WritingPrompts. This subreddit is a place where people share ideas and write stories based on prompts. Essentially, the researchers are using a similar approach to get the AI to generate stories for their experiment.
  • Content of the Table: The table shows the structure of the prompt, which includes placeholders for two key pieces of information: "STORY PROMPT" and "WORD COUNT." The "STORY PROMPT" is the actual writing prompt taken from r/WritingPrompts. It is the creative idea or scenario that the story should be based on. The "WORD COUNT" specifies the desired length of the generated story. For example, a prompt might be "STORY PROMPT: A detective discovers a hidden room in a library" and "WORD COUNT: 500". This would instruct the AI to write a 500-word story based on that prompt.
  • Connection to Human-Written Stories: The caption explains that the "STORY PROMPT" used in the template is the same prompt that a human writer used to write a story on r/WritingPrompts. This means that the researchers are creating AI-generated stories that are directly comparable to human-written stories based on the same prompts. This is important for the study because it allows the researchers to compare human and AI writing under similar conditions, making it easier to isolate the differences between them.
  • Generation of AI Stories: The reference text states that the researchers collected 30 stories from r/WritingPrompts and generated corresponding AI-generated stories using the prompt in Table 15. This means that for each human-written story, they used the same prompt to generate an AI version. The AI is essentially being asked to write a story on the same topic and with the same length constraint as a human writer. This allows for a direct comparison between human and AI writing abilities. A hypothetical illustration of filling in this template appears after this list.
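To illustrate how the two placeholders might be filled in practice, here is a hypothetical sketch; the template wording below is a stand-in, since the exact prompt text appears only in the paper's Table 15.

```python
# Hypothetical stand-in for the Table 15 template; the real wording is in the paper.
PROMPT_TEMPLATE = (
    "Write a short story of about {word_count} words in response to the "
    "following writing prompt:\n\n{story_prompt}"
)

def build_story_prompt(story_prompt: str, word_count: int) -> str:
    """Fill the STORY PROMPT and WORD COUNT placeholders for one generation request."""
    return PROMPT_TEMPLATE.format(story_prompt=story_prompt, word_count=word_count)

# Example using the illustrative prompt mentioned in the description above.
print(build_story_prompt("A detective discovers a hidden room in a library.", 500))
```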
Scientific Validity
  • Controlled Comparison: Using the same writing prompts for both human-written and AI-generated stories is a scientifically sound approach for creating a controlled comparison. By holding the prompt constant, the researchers can isolate the effects of the writer (human vs. AI) on the resulting story. This allows for a more direct assessment of the differences between human and AI writing capabilities.
  • Relevance of r/WritingPrompts: Using prompts from r/WritingPrompts is a reasonable choice, as it provides a source of diverse and creative writing prompts that are likely to be representative of the kinds of prompts that humans might use to inspire their own writing. However, the authors should acknowledge that r/WritingPrompts is a specific online community with its own norms and conventions, which might not be fully representative of all types of creative writing.
  • Word Count as a Constraint: Specifying the desired word count in the prompt is important for controlling the length of the generated stories. This helps to ensure that the AI-generated stories are comparable in length to the human-written stories from r/WritingPrompts. However, the authors should discuss how they determined the appropriate word count for each prompt and whether they considered the natural variation in story length that might occur among human writers.
  • Dependence on AI Model Capabilities: The quality and characteristics of the AI-generated stories will depend on the capabilities of the specific AI model used. The authors should clearly state which AI model they used for story generation and discuss any known limitations of the model that might affect the results. They should also consider the potential impact of different model parameters or settings on the generated stories.
Communication
  • Caption Clarity: The caption clearly explains the purpose of the table, which is to present the "Prompt Template for Story Generation." It also defines the placeholders "STORY PROMPT" and "WORD COUNT," which is helpful for understanding how the template is used. The connection to r/WritingPrompts is clearly stated.
  • Table Organization and Content: The provided image of the table is very simple and easy to understand. It presents the prompt template in a clear and concise manner. The use of placeholders is intuitive, and the instructions within the prompt are straightforward.
  • Reference Text Support: The reference text provides important context by explaining that 30 stories were collected from r/WritingPrompts and that corresponding AI-generated stories were created using the prompt in Table 15. This helps readers understand how the prompt template was used in the study.
  • Potential Improvements: The communication effectiveness could be improved by adding a brief explanation in the caption of *why* this prompt template is being used and how it relates to the overall goals of the study. Additionally, while the table is clear, it could benefit from a more descriptive title that reflects its purpose, such as "Prompt Template for Generating AI-Generated Stories Based on r/WritingPrompts Prompts." The table could also include an example of a filled-in prompt to further illustrate how it is used in practice. Finally, it would be helpful to clarify whether the prompt shown in the table is the *exact* prompt used for all story generations or if it was modified slightly for each prompt based on the specific content of the "STORY PROMPT" and "WORD COUNT" placeholders.
Table 16: Story performance of initial 5 annotators, which included 4 nonexpert...
Full Caption

Table 16: Story performance of initial 5 annotators, which included 4 nonexpert annotators and expert annotator #1, who was used in all article experiments.

Figure/Table Image (Page 26)
Table 16: Story performance of initial 5 annotators, which included 4 nonexpert annotators and expert annotator #1, who was used in all article experiments.
First Reference in Text
The nonexperts had a TPR of 62.5% not including our expert annotator, as shown in Table 16.
Description
  • Purpose of the Table: This table shows how well a group of five people performed when they tried to distinguish between stories written by humans and stories generated by AI. This was an initial test to see how good people are at this task before doing the main experiments. It is like a practice round before the actual competition. The table compares the performance of four "nonexpert" annotators and one "expert" annotator, who was involved in all the experiments in the study. The term "story performance" likely refers to their accuracy in identifying whether a story was written by a human or an AI.
  • Content of the Table: The table likely shows the True Positive Rate (TPR) and False Positive Rate (FPR) for each of the five annotators, and possibly their average scores. The TPR measures how often they correctly identified AI-generated stories as AI-generated. The FPR measures how often they incorrectly identified human-written stories as AI-generated. The reference text mentions that the nonexperts had an average TPR of 62.5%, which suggests that they were able to correctly identify AI-generated stories only a little better than chance (which would be 50%). The table probably also includes the expert's individual performance metrics, allowing for a comparison between the expert and the nonexperts. A short sketch showing how TPR and FPR are computed from labels follows this list.
  • Expert vs. Nonexpert: The distinction between "expert" and "nonexpert" is based on their experience with AI and writing, as established earlier in the paper. The expert annotator is likely someone who has a lot of experience with AI language models and is skilled at detecting AI-generated text. The nonexperts, on the other hand, have less experience with AI. Comparing their performance helps to understand how much of a difference expertise makes in this task. The fact that the expert was used in all article experiments suggests that their performance in this initial test was considered reliable enough to use them throughout the study.
  • Initial Test Before Main Experiments: This table presents the results of an initial test, or a pilot study, conducted before the main experiments. Pilot studies are often used to refine the experimental design, test the procedures, and get a preliminary sense of the results. In this case, the initial test with the five annotators likely served to evaluate the difficulty of the task, assess the performance of the expert, and potentially identify any issues with the instructions or materials before proceeding with the larger study. The results of this initial test may have informed decisions about the selection of annotators or the design of subsequent experiments.
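For reference, the TPR and FPR figures discussed here can be computed from binary labels as in the following minimal sketch, treating the AI-generated class as positive; the judgments shown are hypothetical.

```python
def tpr_fpr(y_true: list[str], y_pred: list[str], positive: str = "AI") -> tuple[float, float]:
    """True positive rate and false positive rate, treating `positive` as the AI class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical judgments on four AI-generated and four human-written stories.
truth = ["AI"] * 4 + ["Human"] * 4
preds = ["AI", "AI", "Human", "AI", "Human", "Human", "AI", "Human"]
print(tpr_fpr(truth, preds))  # -> (0.75, 0.25)
```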
Scientific Validity
  • Small Sample Size: The initial test involved only five annotators, which is a very small sample size. This limits the generalizability of the findings and makes it difficult to draw strong conclusions about the performance of experts vs. nonexperts. The authors should acknowledge this limitation and interpret the results with caution. A larger sample size would be needed to make more robust claims about the differences between these groups.
  • Selection of Expert: The validity of the comparison between experts and nonexperts depends on how the expert was selected and whether their expertise is truly representative of experts in the field. The authors should provide more details about the criteria used to identify and select the expert annotator. They should also discuss any potential biases that might have been introduced by using only one expert.
  • Task Difficulty: The performance of the annotators depends on the difficulty of the task, which in turn depends on the quality of the AI-generated stories. If the AI-generated stories were relatively easy to detect, even nonexperts might perform well. Conversely, if the stories were very difficult to detect, even the expert might struggle. The authors should discuss the difficulty of the task and provide more information about the characteristics of the stories used in this initial test.
  • Potential Learning Effect: Since expert annotator #1 was used in all subsequent experiments, there's a potential for a learning effect. This annotator might have improved their performance over time simply due to repeated exposure to the task. The authors should consider this potential learning effect when interpreting the results and discuss whether it might have influenced the expert's performance in later experiments.
Communication
  • Caption Clarity: The caption clearly states the purpose of the table: to present the "Story performance of initial 5 annotators." It also specifies the composition of the group (4 nonexperts and 1 expert) and highlights the expert's involvement in all article experiments. The caption is concise and informative.
  • Table Organization and Content: The provided image of the table is well-organized and easy to understand. It clearly labels each annotator and presents their TPR and FPR scores. The table also includes an "Average" row, which is helpful for summarizing the overall performance of the nonexperts. The use of bolding for the "Average" row makes it stand out. The table effectively communicates the performance data in a clear and concise manner.
  • Reference Text Support: The reference text provides additional context by stating the average TPR of the nonexperts (62.5%), which is consistent with the information presented in the table. This helps to reinforce the finding that nonexperts perform slightly better than chance but still relatively poorly.
  • Potential Improvements: The communication effectiveness could be improved by adding a brief explanation in the caption of *why* this initial test was conducted and how it relates to the overall goals of the study. Additionally, while the table is generally clear, it could benefit from a more descriptive title that reflects its purpose, such as "Initial Test of Annotator Performance on Story Detection Task." The table could also include a column for the overall accuracy of each annotator, in addition to TPR and FPR, to provide a more comprehensive measure of performance. Finally, explaining why the expert was chosen to be part of all article experiments would be helpful.
Figure 6: Interface for annotators, with an example annotation from Annotator...
Full Caption

Figure 6: Interface for annotators, with an example annotation from Annotator #4 with a humanized article from §2.5. This is the same article displayed in Figure 1. An annotator can highlight texts, make their decision, put confidence, and write an explanation. This AI-generated article was based off of In Alaska, a pilot drops turkeys to rural homes for Thanksgiving, written by Mark Thiessen & Becky Bohrer and originally published by the Associated Press on November 28, 2024.

Figure/Table Image (Page 27)