One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

Zhikun Wu, Thomas Weber, Florian Müller
30th International Conference on Intelligent User Interfaces (IUI '25)
KTH Royal Institute of Technology

Overall Summary

Study Background and Main Findings

This research investigates the potential of Large Language Models (LLMs) as collaborative partners in generating humorous content, specifically internet memes. The study addresses a gap in existing research, which has primarily focused on LLMs in tasks like writing or narrative creation, by exploring their capabilities in a humor-rich and culturally nuanced domain. The objective was to compare the creative output of three groups: individuals creating memes without AI assistance, individuals collaborating with an LLM, and an LLM generating memes autonomously.

The study employed a two-phase experimental design. In the first phase, participants in the human-only and human-AI collaboration groups were tasked with generating captions for pre-selected meme templates, focusing on topics like work, food, and sports. The human-AI group interacted with an LLM through a chat interface. In the second phase, a separate group of participants rated a random sample of the generated memes on three dimensions: humor, creativity, and shareability. Statistical tests, including the Mann-Whitney U test and ANOVA, were used to analyze the data.
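
As a concrete illustration of the analysis described above, the following is a minimal sketch of how such group comparisons could be run in Python. The data, group sizes, and rating scale are simulated; the paper's actual analysis scripts are not reproduced here.

```python
# Hypothetical sketch of the reported statistical comparisons (one-way ANOVA
# and Mann-Whitney U); the data below are simulated, not the study's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated 1-5 ratings for one dimension (e.g., humor) per condition.
human_only = rng.integers(1, 6, size=150)
human_ai = rng.integers(1, 6, size=150)
ai_only = rng.integers(1, 6, size=150)

# Omnibus comparison across the three conditions.
f_stat, p_anova = stats.f_oneway(human_only, human_ai, ai_only)

# Non-parametric pairwise comparison between two conditions.
u_stat, p_mwu = stats.mannwhitneyu(human_only, ai_only, alternative="two-sided")

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")
print(f"Mann-Whitney U (human vs. AI): U = {u_stat:.0f}, p = {p_mwu:.3f}")
```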

Key findings revealed that participants using the LLM generated significantly more ideas and reported lower perceived effort compared to those working alone. However, memes created entirely by the LLM were rated significantly higher, on average, across all three evaluation metrics (humor, creativity, and shareability). Despite this, when analyzing the top-performing memes, human-created memes excelled in humor, while human-AI collaborations were superior in creativity and shareability. The study concludes that while LLMs can boost productivity and generate content with broad appeal, human creativity remains crucial for achieving high levels of humor. The findings also highlight the complexities of human-AI collaboration and the need for further research to optimize this interaction in creative domains.

Research Impact and Future Directions

This study provides valuable insights into the potential and limitations of using Large Language Models (LLMs) for co-creative tasks, specifically in the domain of humor and meme generation. The findings demonstrate that while LLMs can significantly boost productivity and reduce perceived effort, they do not necessarily lead to higher-quality output when collaborating with humans. Surprisingly, AI-generated memes were rated higher on average across all dimensions (humor, creativity, and shareability) compared to human-generated or human-AI collaborative memes. However, it is crucial to note that the funniest memes were predominantly created by humans, highlighting the continued importance of human creativity in achieving high levels of comedic impact.

The research underscores the complex interplay between human and artificial intelligence in creative endeavors. While AI can serve as a powerful tool for idea generation and content production, human creativity remains essential for nuanced humor and deeper engagement. The study's limitations, such as the short-term interaction and limited collaboration, point to the need for further research to optimize human-AI partnerships in creative domains.

The findings suggest a need for a more nuanced approach to human-AI collaboration, one that leverages the strengths of both. Future research should focus on developing interfaces and interaction paradigms that better facilitate iterative collaboration, explore different prompting strategies, and investigate the long-term effects of LLM assistance on human creativity. The study also raises important questions about the nature of humor and creativity in the digital age, and how AI can be used to enhance, rather than replace, human creative expression.

Critical Analysis and Recommendations

Clear Research Question (written-content)
The abstract clearly defines the research question, focusing on LLMs in co-creating humor-rich content (memes). This is important because it immediately informs the reader of the study's specific focus, setting it apart from broader research on LLMs and creativity. This clarity allows readers to quickly assess the relevance of the study to their interests.
Section: Abstract
Missing Broader Context (written-content)
The abstract lacks sufficient contextualization within the broader field of human-AI collaboration. Adding a sentence or two about the growing interest in this area and the challenges of creative tasks would better capture the reader's attention and convey the study's importance. This would highlight the rationale behind the research, making it more compelling.
Section: Abstract
Effective Contextualization (written-content)
The introduction effectively establishes the context by highlighting the prevalence of collaboration and its benefits for creativity, referencing established research. This grounding in prior work provides a solid foundation for the study's rationale and helps readers understand the significance of investigating human-AI collaboration.
Section: Introduction
Missing Meme Definition (written-content)
The introduction could benefit from a more explicit definition of "internet memes." While most readers likely have some familiarity with memes, a concise definition would ensure a shared understanding and enhance clarity, particularly for those less familiar with internet culture.
Section: Introduction
Well-Defined Experimental Groups (written-content)
The methodology clearly defines the three experimental groups (baseline, human-AI collaboration, and AI-only). This allows for a direct comparison of meme creation methods, enabling researchers to isolate the effects of LLM assistance. This rigorous design is crucial for drawing valid conclusions about the impact of AI on the creative process.
Section: Methodology
Missing LLM Specification (written-content)
The methodology does not specify which LLM was used, beyond mentioning GPT-4.0 later in the paper. Specifying the model (and ideally, version) is crucial for reproducibility and comparison with future research. Different LLMs have different capabilities, and this detail is essential for interpreting the findings.
Section: Methodology
Increased Idea Generation (written-content)
Participants using the LLM generated significantly more ideas than those in the baseline group (Mann-Whitney U test, p < 0.001). This quantitative finding, derived from a controlled experiment, demonstrates the potential of LLMs to enhance idea generation in creative tasks. This has practical implications for individuals and teams seeking to overcome creative blocks or increase productivity.
Section: Results
Missing Effect Sizes (written-content)
The Results section often omits effect sizes, reporting only p-values. Including effect sizes (e.g., Cohen's d, r) is crucial for understanding the magnitude of the observed differences, not just their statistical significance. This provides a more complete picture of the practical importance of the findings.
Section: Results
Good Connection to Prior Research (written-content)
The discussion effectively summarizes the main findings and connects them to prior research. This contextualization helps readers understand how the study contributes to the existing body of knowledge on human-AI collaboration and creative tasks. It also highlights both consistencies and discrepancies with previous work.
Section: Discussion
Underemphasized Human Strengths (written-content)
The discussion acknowledges that human-AI collaboration did not lead to overall quality improvements, but it doesn't sufficiently emphasize the specific strengths of human contributions, particularly in generating the funniest memes. This nuance is important for a balanced interpretation of the results, highlighting that while AI excels in average performance, human creativity remains crucial for top-tier humor.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration...
Full Caption

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.

Figure/Table Image (Page 1)
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
First Reference in Text
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
Description
  • Overall Structure and Content: This figure presents a collection of 12 images, formatted as internet memes, arranged in a 3x4 grid. Each row represents a different metric used to evaluate the memes: 'Humor,' 'Creativity,' and 'Shareability.' Each column appears to correspond to a different generation method: the first column seems to be AI-generated, the second and third human-generated, and the fourth a human-AI collaboration. A meme is a humorous image, video, or piece of text that is copied (often with slight variations) and spread rapidly by Internet users. Each meme includes an image macro (a picture with superimposed text) and a short text caption. For example, one meme shows a surprised cartoon character with the text "When it's only 10 AM but you've already been at work for 5 hours," implying that the workday feels long even early in the morning, a relatable and potentially humorous situation.
  • Comparison of Generation Methods: The figure showcases examples of memes created under different conditions: entirely by AI, entirely by humans, and through human-AI collaboration. The exact process of 'collaboration' isn't defined in the figure itself, but presumably, a human and an AI system worked together in some way to create the meme, perhaps with the AI suggesting text and the human choosing the image, or vice-versa.
  • Evaluation Metrics: The figure implies a qualitative assessment based on three subjective metrics: humor (how funny the meme is), creativity (how original and novel the meme is), and shareability (how likely someone would be to share the meme with others). There are no numerical scores or quantitative data presented directly within the figure, suggesting a ranking based on some form of subjective evaluation.
Scientific Validity
  • Selection Methodology: The figure presents examples of generated memes, but lacks methodological details regarding how these specific memes were selected as the 'top 4'. Without information on the selection process (e.g., sample size, rating scales, statistical analysis), it's impossible to assess the validity of the claim that these are indeed the 'top' memes. The figure serves as an illustrative example rather than strong empirical evidence.
  • Subjectivity of Metrics: The metrics (Humor, Creativity, Shareability) are inherently subjective. While relevant to the study's focus, the figure doesn't provide information on how these qualities were assessed or quantified, making it difficult to judge the scientific rigor of the evaluation. Were these ratings obtained from a panel of judges? If so, what were their demographics, and what instructions were they given?
  • Illustrative, not comprehensive: The Figure presents example outputs, which is useful, but it doesn't represent the entirety of results. For example, it is unclear how many memes were created in total, and what the distribution of scores was across all memes.
Communication
  • Clarity and Readability: The figure uses a visually engaging format (memes) which is appropriate for the subject matter. However, the criteria for selecting the 'top 4' are not explicitly stated in the caption or figure itself, which could lead to ambiguity. The use of a grid layout is effective for comparison, but the small size of the text within each meme may hinder readability, especially in a printed format. The categorization into Humor, Creativity, and Shareability is clear, but the rationale for choosing these specific metrics is not given in the caption, although it is likely elaborated in the main body of the paper.
  • Contextualization: While the figure is self-contained in presenting the memes, it lacks context without the accompanying text. A reader unfamiliar with the study might not fully grasp the significance of these specific memes or how they were generated and evaluated without more details in the caption.

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Methodology

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
Figure/Table Image (Page 3)
Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
First Reference in Text
Ideation In the first step, we displayed one of six background images of popular memes (Figure 2) to the participants and asked them to come up with as many captions as they could within five minutes.
Description
  • Meme Templates and Topics: This figure shows six common internet meme templates, five of which were used in the study. A 'meme template' is a recognizable image or series of images that people use as a base to create their own memes, usually by adding text. The figure connects these templates to three different topics: Work, Food, and Sports. For example, the 'baby' meme template, showing a baby with a determined expression, is linked to the 'Work' topic. This suggests that participants in the study were asked to create memes related to work using this specific template.
  • Exclusion of one template: The figure presents a visual representation of the association between six meme templates and three predetermined topics. One template, labeled 'choice,' is indicated as excluded, leaving five templates for analysis.
  • Identification of Meme Templates: The five meme templates used are: 'baby,' 'boromir,' 'doge,' 'futurama,' and 'toy-story.' These are all well-known and widely used meme formats. 'boromir' refers to a scene from the Lord of the Rings movie; 'doge' is a picture of a Shiba Inu dog; 'futurama' is a still from the animated TV show Futurama; and 'toy-story' refers to a scene from the movie Toy Story.
  • Topic Categories: The three topics are Work, Food, and Sports. These are broad categories, and the figure suggests that participants were asked to generate captions for the provided meme templates that related to one of these three topics. The connection between a template and a topic is visually represented by an arrow.
Scientific Validity
  • Controlled Stimuli: The figure illustrates the stimuli used in the ideation phase of the study. Providing participants with pre-selected meme templates introduces a level of control and standardization to the meme creation process. This approach allows for a more focused comparison between different groups (human-only, human-AI, AI-only) as they are all working with the same basic materials. However, it may also limit the range of creativity compared to allowing participants to choose their own templates.
  • Template Selection: The choice of popular meme templates increases the likelihood that participants will be familiar with the formats, potentially leading to more fluent idea generation. However, the specific criteria for selecting these particular templates (beyond being 'popular') are not described. A more rigorous justification for the selection process would strengthen the methodology.
  • Data Exclusion: The exclusion of one template ('choice') is clearly indicated, but the reason for this exclusion is not provided in the figure itself or the provided reference text. Transparency regarding data exclusion is crucial for scientific validity. The reason should be explained within the main text.
  • Presentation of Experimental Setup: The figure only shows the templates and topics; it doesn't present any results. As such, there's no statistical analysis to assess, just the presentation of the experimental setup.
Communication
  • Clarity and Visual Organization: The figure effectively uses a visual representation to connect meme templates with specific topics. The layout is clear and easy to follow, with distinct sections for each topic and corresponding meme examples. The use of color-coding (red, green, blue) for each topic enhances visual distinction and aids in quick comprehension. The images are large and recognizable meme templates, allowing for immediate identification. The use of arrows to connect templates to labels is intuitive.
  • Caption Conciseness and Accuracy: The caption clearly states the figure's purpose: to show the mapping between meme templates and topics. The inclusion of "(Work, Food, Sports)" explicitly defines the topics covered, providing context. However, the term "Mapping" could be slightly ambiguous to a non-technical reader. A more descriptive term like "Association" or "Relationship" might improve clarity.
  • Handling of Excluded Data: The meme labeled 'choice' with an asterisk and the note '* This image was excluded from the study' is clearly marked and explained, preventing misinterpretation. The red 'X' over the image reinforces its exclusion.
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat...
Full Caption

Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.

Figure/Table Image (Page 4)
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.
First Reference in Text
The interface (Figure 3) displayed a blank meme template as well as the instructions to the user.
Description
  • Overview of the User Interface: This figure showcases the different screens a participant would see when using the system to create memes. It's like a series of snapshots showing each step of the process. There are four main parts: coming up with ideas without help (Baseline Ideation), coming up with ideas using a chatbot (Ideation with Chat Interface), choosing their favorite ideas (Favorite Selection), and finally, creating the finished meme image (Image Creation).
  • Ideation Stages: In the 'Baseline Ideation' stage, users see a blank meme template and a space to type in their ideas. In the 'Ideation with Chat Interface' stage, users see the same thing, but also have a chat window where they can interact with an AI to get help generating ideas. An 'AI' or 'Artificial Intelligence' is a computer program designed to mimic human intelligence, in this case, to help generate meme captions. The chatbot is a type of AI that you can talk to (or in this case, type to).
  • Selection and Creation Stages: The 'Favorite Selection' stage shows a list of all the ideas the user generated, and they can pick their top three. The 'Image Creation' stage shows a meme editor where the user can add their chosen text to the meme image and position it how they like.
  • Interface Components: The interface includes elements like text input boxes, buttons (though their specific functions are hard to discern due to image resolution), a chat window, and a meme image editor. These are common components of web-based applications.
Scientific Validity
  • Reproducibility and Implementation Details: The figure provides a visual representation of the experimental setup, allowing for a better understanding of how participants interacted with the system. This aids in assessing the reproducibility of the study, as other researchers can see the interface used. However, the figure doesn't provide details about the underlying implementation, such as the specific chatbot technology used or the algorithms for generating meme captions.
  • Comparison of Conditions: The inclusion of both a baseline ideation interface (without AI assistance) and an ideation interface with a chat interface allows for a direct comparison between the two conditions. This strengthens the study's ability to isolate the impact of AI assistance on meme creation.
  • Complete Workflow Representation: By showing the entire workflow (ideation, selection, creation), the figure helps ensure that all stages of the meme-generation process are accounted for. This improves the overall validity of the experimental design.
Communication
  • Clear Visual Representation of UI: The figure provides a visual walkthrough of the user interface (UI) used in the study, showing the different stages of the meme creation process. The use of screenshots is effective in conveying the actual look and feel of the interface. The four stages (Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Image Creation) are clearly labeled, providing a logical flow that mirrors the user's experience.
  • Informative Labeling and Workflow: The labeling of each stage is concise and informative. The inclusion of a brief description under each screenshot (e.g., "Ideation with Chat Interface") further clarifies the purpose of each step. The arrows connecting the different stages visually represent the progression of the task, enhancing understanding of the workflow.
  • Readability of Details: While the figure shows the overall structure, the smaller details within each screenshot are difficult to read. This limits the ability to fully understand the specific functionalities and elements within each interface component. Zooming capabilities or higher-resolution images would improve readability.
  • Missing Chatbot Prompt: The chat interface is shown, but neither the user prompt nor any system prompt is visible. Including an example interaction would help readers better understand the chatbot's capabilities.
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration,...
Full Caption

Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.

Figure/Table Image (Page 5)
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
First Reference in Text
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
Description
  • Overview of Workflow: This figure presents a diagram showing the steps involved in creating memes in three different ways: by a human alone (Baseline), by a human working with an AI (Human-AI Collaboration), and by an AI alone (AI-driven Creation). It's like a recipe or a set of instructions for each method.
  • Human and Human-AI Workflow: For the human-only (Baseline) method, the steps are: coming up with ideas (Ideation), choosing favorite ideas (Favorite Selection), and creating the final meme image (Image Creation). The same three steps are present for the Human-AI collaboration method, but the 'Ideation' step involves interacting with a chatbot.
  • AI-Driven Workflow: For the AI-driven Creation method, the process involves providing a prompt to the AI ('Generate 20 meme captions for this <image> about the topic of <topic>'), and the AI then generates the memes. A 'prompt' is a set of instructions given to an AI to tell it what to do. The <image> and <topic> placeholders would be replaced with specific image descriptions and topics, respectively. A hedged sketch of issuing such a prompt through an API appears at the end of this list.
  • Inclusion of Experience Survey: Each condition includes a connection to 'Experience Survey'. This means that after each method, participants were asked to complete a survey about their experience.
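  • Illustrative API Sketch: A hedged sketch of how the AI-driven creation prompt described above might be issued to a multimodal chat model, using the OpenAI Python client as an assumed example. The model name, image URL, and calling code are placeholders; the paper does not specify them in this figure.
```python
# Hypothetical sketch only; model name, image URL, and topic are placeholders,
# and the study's actual API usage may differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
topic = "work"
template_url = "https://example.com/templates/doge.jpg"  # placeholder URL

response = client.chat.completions.create(
    model="gpt-4o",  # assumption; the paper refers to GPT-4.0
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Generate 20 meme captions for this image about the topic of {topic}."},
            {"type": "image_url", "image_url": {"url": template_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```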
Scientific Validity
  • Clear Delineation of Experimental Conditions: The figure clearly outlines the different experimental conditions, which is crucial for understanding the study's design and comparing the results across groups. The separation of the workflows allows for a clear isolation of the independent variable (method of meme generation).
  • Inclusion of Baseline Condition: The inclusion of a baseline (human-only) condition provides a crucial point of comparison for evaluating the impact of AI assistance. This allows the researchers to determine whether AI collaboration or AI-driven creation leads to different outcomes compared to human-only meme generation.
  • Level of Detail in Protocol: The figure provides a high-level overview of the workflow, but lacks details about the specific instructions given to participants in each condition. For example, were participants in the human-AI collaboration group given any guidance on how to interact with the chatbot? More detailed information about the experimental protocol would enhance the reproducibility of the study.
  • AI Prompt Specification: The prompt shown for AI-driven creation is clear and well-defined. Providing this level of detail increases the transparency and replicability of the study, as others can use the same prompt to generate similar results.
Communication
  • Clear Visual Representation of Workflow: The figure effectively uses a flowchart to illustrate the workflow for each of the three experimental conditions: Human (Baseline), Human-AI Collaboration, and AI-driven Creation. The use of different colors for each condition (green, teal, and gray, respectively) helps to visually distinguish them. The arrows clearly indicate the flow of the process, and the icons within each step provide a quick visual summary of the action (e.g., writing, selecting, editing). The inclusion of example prompts and system output for the AI-driven creation enhances understanding.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Ideation," "Favorite Selection," "Image Creation"). The use of consistent terminology across all three conditions facilitates comparison. The caption accurately describes the content of the figure.
  • Readability of Details: While the overall flow is clear, the details within some of the icons and smaller text boxes are difficult to read, particularly the prompt example for generating image captions. This reduces the ability to fully understand the specifics of the AI interaction. Larger or higher-resolution images, and perhaps a separate figure detailing the prompt structure, would be beneficial.
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation...
Full Caption

Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.

Figure/Table Image (Page 5)
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
First Reference in Text
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
Description
  • Overall Workflow Description: This figure is a flowchart showing how the memes created in the study were evaluated. It breaks down the process into several steps, starting from collecting the initial ideas and ending with an online survey where people rated the memes.
  • Different Creation Conditions: There are three main branches in the flowchart, each representing a different way memes were created: by humans alone (Human), by humans and AI working together (Human-AI Collaboration), and by AI alone (AI-driven). Each branch shows how many initial ideas or memes were collected (e.g., 415 Favorite Ideas for the human baseline, 300 Meme Images for the AI-driven approach).
  • Sampling Process: The 'Generation & Curation' step involves collecting the initial ideas or memes. Then, a 'Random Sample' is taken, reducing the number of memes to 150 for each condition. 'Random sampling' means selecting a smaller group from a larger group in a way that each member of the larger group has an equal chance of being chosen. This helps ensure the smaller group is representative of the larger group.
  • Online Survey and Rating Metrics: Finally, an 'Online Survey for Rating The Images' was conducted. Participants from a platform called 'Prolific' were shown a random sample of 50 images and asked to rate them on three aspects: Humor, Creativity, and Shareability. Prolific is a website where researchers can recruit participants for online studies.
  • Sampling Details: The figure includes notes explaining the sampling process. It states that 10 images were randomly sampled for each combination of background picture and topic, and that each participant in the survey saw a random sample of 50 images.
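  • Illustrative Sampling Sketch: A minimal sketch of the stratified random sampling described above, assuming 10 memes drawn per template-topic combination (5 templates × 3 topics × 10 = 150 per condition); the data structure and field names are hypothetical.
```python
# Hypothetical sketch; the field names ('template', 'topic') are assumptions.
import random

TEMPLATES = ["baby", "boromir", "doge", "futurama", "toy-story"]
TOPICS = ["work", "food", "sports"]

def sample_condition(memes, per_cell=10, seed=42):
    """Draw up to `per_cell` memes for each (template, topic) cell of one condition."""
    rng = random.Random(seed)
    sampled = []
    for template in TEMPLATES:
        for topic in TOPICS:
            cell = [m for m in memes if m["template"] == template and m["topic"] == topic]
            sampled.extend(rng.sample(cell, min(per_cell, len(cell))))
    return sampled  # up to 5 * 3 * 10 = 150 memes
```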
Scientific Validity
  • Clear Delineation of Evaluation Process: The figure clearly outlines the evaluation process, which is crucial for understanding how the quality of the memes was assessed. The separation of the workflows for each condition allows for a clear comparison of the evaluation results.
  • Transparency in Data Collection: The inclusion of specific numbers (e.g., 415 Favorite Ideas, 335 Meme Images) provides transparency regarding the data collection and sampling process. This allows for a better assessment of the sample size and potential biases.
  • Use of Random Sampling: The use of random sampling is a standard practice in research to reduce bias and ensure that the selected memes are representative of the larger pool of generated memes. The figure clearly indicates that random sampling was employed.
  • Participant Information: The figure mentions using participants from Prolific for the online survey, which is a common platform for recruiting participants. However, it doesn't provide details about the participant demographics or any inclusion/exclusion criteria. This information is important for assessing the generalizability of the findings.
  • Operationalization of Evaluation Metrics: The figure specifies the evaluation metrics (Humor, Creativity, Shareability) but doesn't provide details on how these were measured (e.g., rating scales, instructions to participants). More information about the operationalization of these constructs would strengthen the scientific validity.
Communication
  • Clear Visual Representation and Flow: The figure effectively uses a flowchart to depict the evaluation process of the generated memes. The use of distinct sections for each condition (Human, Human-AI Collaboration, AI-driven) and color-coding (orange, teal, and gray) aids in visual differentiation. The arrows clearly indicate the flow of the process, from generation and curation to sampling and finally to the online survey. The inclusion of specific numbers (e.g., 415 Favorite Ideas, 150 Meme Images) provides a quantitative overview of the data.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Generation & Curation," "Random Sample," "Online Survey for Rating The Images"). The use of consistent terminology across all conditions facilitates comparison. The caption accurately describes the content and purpose of the figure.
  • Detailed Evaluation Methodology: The inclusion of an "Online Survey for Rating The Images" section with details about the rating process (Humor, Creativity, Shareability) and the use of participants from Prolific enhances the understanding of the evaluation methodology. The note about randomly sampling 10 images for each combination and displaying 50 images to each participant clarifies the sampling strategy.
  • Readability of Details: While the overall flow is clear, the smaller text within some of the boxes is difficult to read, particularly the numbers indicating the quantity of memes at each stage. Larger font sizes or a zoomed-in view of specific sections would improve readability.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 6: Participants using the LLM were able to produce significantly more...
Full Caption

Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test (***: p < 0.001)

Figure/Table Image (Page 6)
Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test (***: p < 0.001)
First Reference in Text
As seen in Figure 7, participants that were able to use the LLM created noticeably more ideas than the participants in the baseline group.
Description
  • Description of the Boxplot: This figure is a boxplot comparing the number of ideas generated by two groups of participants: those who used a Large Language Model (LLM), which is a type of AI, and those who did not (Human only). A boxplot is a way to visually represent the distribution of a dataset. The box shows the middle 50% of the data, the line inside the box represents the median (the middle value), and the 'whiskers' extend to the furthest data point within 1.5 times the interquartile range (the height of the box). Any points outside the whiskers are considered outliers. A plotting sketch appears at the end of this list.
  • Axis Labels: The y-axis (vertical axis) shows the 'Number of ideas,' indicating the quantity of ideas generated by each participant. The x-axis (horizontal axis) shows the two groups: 'Human only' and 'Human-AI collaboration.'
  • Comparison of Groups: The boxplot for the 'Human-AI collaboration' group is noticeably higher than the boxplot for the 'Human only' group. This visually indicates that participants who used the LLM generally generated more ideas than those who did not. The caption states that this difference is statistically significant, meaning it's unlikely to have occurred by chance.
  • Statistical Significance: The caption mentions a Mann-Whitney U test, which is a statistical test used to compare two groups when the data is not normally distributed (i.e., it doesn't follow a bell curve). The result 'p < 0.001' means there is less than a 0.1% chance that the observed difference between the groups is due to random variation.
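  • Illustrative Plotting Sketch: A minimal sketch of producing a comparable boxplot from simulated idea counts (the study's raw data are not reproduced here); matplotlib's default whiskers extend 1.5 times the interquartile range, matching the description above.
```python
# Simulated idea counts; not the study's data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
human_only = rng.poisson(5, size=25)
human_ai = rng.poisson(9, size=25)

fig, ax = plt.subplots()
ax.boxplot([human_only, human_ai])
ax.set_xticks([1, 2], ["Human only", "Human-AI collaboration"])
ax.set_ylabel("Number of ideas")
plt.show()
```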
Scientific Validity
  • Statistical Analysis: The figure presents a clear statistical comparison between two groups, supporting the claim of a significant difference in idea generation. The use of the Mann-Whitney U test is appropriate given the non-normal distribution of the data (mentioned in the text). The p-value (p < 0.001) indicates strong statistical significance.
  • Data Visualization and Generalizability: The boxplot provides a visual representation of the data distribution, allowing for a quick assessment of the central tendency and spread of each group. However, the figure alone doesn't provide information about the sample size or the specific characteristics of the participants in each group. This information is crucial for assessing the generalizability of the findings.
  • Conclusion Support: The figure, along with the provided statistical data, supports the conclusion that using the LLM led to significantly more ideas being produced.
Communication
  • Clarity and Statistical Reporting: The caption clearly states the main finding: that participants using the LLM generated significantly more ideas. The use of "significantly" is appropriate given the statistical test result (p < 0.001). The reference to the Mann-Whitney U test provides the statistical justification for the claim, and the inclusion of the p-value allows readers to assess the strength of the evidence. However, the figure itself is a boxplot, and the caption doesn't explicitly mention this.
  • Visual Presentation: The graph visually presents the difference in the number of ideas generated between the two groups (Human only and Human-AI collaboration). The use of a boxplot is appropriate for comparing distributions. The y-axis is clearly labeled ("Number of ideas"), and the x-axis distinguishes between the two groups. The difference between the groups is visually apparent.
  • Significance Indication: The use of asterisks (***) to indicate statistical significance is a standard practice, and the corresponding p-value threshold (p < 0.001) is provided. This allows for a quick visual assessment of the significance level.
  • Missing Statistical Details: While the caption mentions the Mann-Whitney U test, it does not provide the test statistic (U value) or the effect size. Including these would provide a more complete picture of the results. Also, the figure could include median values to better inform the reader.
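  • Illustrative Effect-Size Sketch: A minimal sketch of the effect sizes recommended above, computed from simulated idea counts; the rank-biserial correlation is a standard effect size for the Mann-Whitney U test, and Cohen's d for mean differences. None of the numbers come from the paper.
```python
# Simulated data; the formulas are standard, the results are not the study's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
baseline = rng.poisson(5, size=25)   # hypothetical idea counts, human only
with_llm = rng.poisson(9, size=25)   # hypothetical idea counts, human-AI

u, p = stats.mannwhitneyu(with_llm, baseline, alternative="two-sided")
rank_biserial = 1 - (2 * u) / (len(with_llm) * len(baseline))  # effect size for U

pooled_sd = np.sqrt((with_llm.var(ddof=1) + baseline.var(ddof=1)) / 2)
cohens_d = (with_llm.mean() - baseline.mean()) / pooled_sd

print(f"U = {u:.0f}, p = {p:.4f}, rank-biserial r = {rank_biserial:.2f}, d = {cohens_d:.2f}")
```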
Figure 7: While there were no significant differences in overall workload, the...
Full Caption

Figure 7: While there were no significant differences in overall workload, the "Effort" subscale of the NASA TLX was significantly different according to the Mann-Whitney-U test (*: p < 0.05)

Figure/Table Image (Page 6)
Figure 7: While there were no significant differences in overall workload, the "Effort" subscale of the NASA TLX was significantly different according to the Mann-Whitney-U test (*: p < 0.05)
First Reference in Text
As seen in Figure 7, participants that were able to use the LLM created noticeably more ideas than the participants in the baseline group.
Description
  • Overview of NASA TLX and Groups: This figure shows the results of a questionnaire called the NASA TLX, which measures how demanding a task is. It compares the results for two groups: people who did the task on their own ('Human only') and people who did the task with help from an AI ('Human-AI collaboration'). The results are shown for different aspects of workload, like mental demand, physical demand, and effort. The figure uses boxplots.
  • Boxplot Explanation and Subscales: Each boxplot shows the distribution of scores for a particular aspect of workload. The box represents the middle 50% of the scores, the line inside the box is the median (the middle score), and the 'whiskers' extend to the furthest data point within 1.5 times the interquartile range. The figure also includes an 'Average' NASA TLX score. A sketch of how such an average is typically computed appears at the end of this list.
  • Key Finding: Effort Subscale: The caption highlights that there was no significant difference in the overall workload between the two groups. However, there was a significant difference in the 'Effort' subscale, meaning that the group working with the AI reported significantly lower effort. 'Significant' in this context, means the difference is unlikely due to random chance.
  • Statistical Test and p-value: The caption mentions a Mann-Whitney U test, a statistical test used to compare two groups when the data isn't normally distributed. The result 'p < 0.05' means there is less than a 5% chance that the observed difference in effort between the groups is due to random variation.
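  • Illustrative Scoring Sketch: A minimal sketch of how an unweighted ("raw") NASA TLX average is commonly computed from the six subscales; the example ratings are invented, and the paper's exact scoring procedure is assumed rather than quoted.
```python
# Hypothetical ratings on the conventional 0-100 NASA TLX scale.
SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration"]

def raw_tlx_average(ratings):
    """ratings: dict mapping each subscale name to a score."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

example = {"Mental Demand": 55, "Physical Demand": 10, "Temporal Demand": 40,
           "Performance": 30, "Effort": 45, "Frustration": 25}
print(round(raw_tlx_average(example), 2))  # 34.17
```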
Scientific Validity
  • Use of Standardized Instrument and Appropriate Statistical Test: The figure presents a comparison of NASA TLX scores between two groups, providing a measure of perceived workload. The use of the NASA TLX, a well-established and validated instrument, adds to the scientific validity of the assessment. The statistical analysis using the Mann-Whitney U test is appropriate given the likely non-normal distribution of subjective rating data.
  • Relevance of "Effort" Subscale: The focus on the "Effort" subscale is relevant to the study's research question, as it investigates whether AI assistance reduces the perceived effort required for meme generation. The statistically significant difference (p < 0.05) supports the claim that AI assistance reduced perceived effort.
  • Effect Size Reporting: While the figure shows a significant difference in perceived effort, it doesn't provide information about the effect size. Reporting the effect size (e.g., Cohen's d or Cliff's delta) would provide a more complete picture of the magnitude of the difference.
  • Comparison of Groups: The figure compares only two groups. The results are consistent with the interpretation that the AI reduced effort.
Communication
  • Visual Presentation and Clarity: The figure presents a series of boxplots comparing different subscales of the NASA TLX (Task Load Index) between two groups: Human only and Human-AI collaboration. The use of boxplots is appropriate for visualizing distributions. The y-axis represents the NASA TLX scores, and the x-axis shows the different subscales (Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, Frustration) and the overall average.
  • Caption Accuracy and Statistical Reporting: The caption highlights a key finding: no significant difference in overall workload, but a significant difference in the "Effort" subscale. The reference to the Mann-Whitney U test and the p-value (p < 0.05) provides statistical support for this claim. However, similar to Figure 6's caption, this one also does not mention the type of graph being presented.
  • Significance Indication: The use of an asterisk (*) to indicate statistical significance is conventional, and the corresponding p-value threshold (p < 0.05) is provided. This allows for quick visual identification of significant differences.
  • Labeling and Caption Conciseness: The figure could benefit from explicitly labeling the y-axis as "NASA TLX Score" or similar, to improve clarity. The caption could be more concise by focusing on the key finding related to effort, rather than stating the non-significant overall workload result, which is visually evident in the graph.
  • Missing Units on Axes: The axes are labeled, but no units or score range are shown. Since the NASA-TLX is a standard questionnaire, it might make sense to state the possible score range.
Figure 8: Pairwise comparison of how participants rated the memes with respect...
Full Caption

Figure 8: Pairwise comparison of how participants rated the memes with respect to the three scales "funny", "creative", and "shareable".

Figure/Table Image (Page 7)
Figure 8: Pairwise comparison of how participants rated the memes with respect to the three scales "funny", "creative", and "shareable".
First Reference in Text
According to these tests, each condition showed significant differences, as shown in Table 1.
Description
  • Overview of Scales and Conditions: This figure shows how people rated the memes on three different scales: how funny they were ('funny'), how original they were ('creative'), and how likely they would be to share them ('shareable'). It compares the ratings for memes created in three different ways: by humans alone, by humans and AI together, and by AI alone.
  • Explanation of Violin Plots: The figure uses violin plots to show the distribution of ratings. A violin plot is like a boxplot, but it also shows the probability density of the data at different values - the wider the plot, the more common that rating is. The white dot represents the median rating, and the thick black bar represents the interquartile range (the middle 50% of the ratings). The thinner black lines extend to the furthest data point within 1.5 times the interquartile range. A plotting sketch appears at the end of this list.
  • Comparison of Ratings and Significance: For each scale (Funny, Creative, Shareable), the violin plots show the distribution of ratings for each of the three conditions (Human only, Human-AI Collaboration, AI only). The asterisks (*, **, ***) above the plots indicate statistically significant differences between the groups, meaning the differences are unlikely to be due to random chance. The more asterisks, the stronger the statistical significance.
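  • Illustrative Plotting Sketch: A minimal sketch of a comparable violin plot built from simulated 1-5 ratings (the study's actual rating scale and raw data are not given in the figure); significance markers would be added separately.
```python
# Simulated ratings; not the study's data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
conditions = ["Human only", "Human-AI Collaboration", "AI only"]
ratings = [rng.integers(1, 6, size=150) for _ in conditions]

fig, ax = plt.subplots()
ax.violinplot(ratings, showmedians=True)
ax.set_xticks([1, 2, 3], conditions)
ax.set_ylabel("Rating")
plt.show()
```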
Scientific Validity
  • Statistical Comparisons: The figure presents a visual comparison of ratings across three conditions and three scales. The use of pairwise comparisons suggests that appropriate statistical tests were conducted to assess the differences between groups (although the specific tests are not mentioned in the figure caption). The reference text mentions Table 1, suggesting that the statistical details are provided elsewhere in the paper.
  • Subjective Ratings and Scale Information: The figure focuses on subjective ratings (funny, creative, shareable), which are relevant to the study's research question. However, without information on the rating scale used (e.g., 1-5, 1-7) and the instructions given to participants, it's difficult to fully assess the validity of the ratings.
  • Data Visualization and Statistical Details: The use of violin plots allows for visualization of the data distribution, but it doesn't provide specific statistical values (e.g., means, standard deviations, test statistics, effect sizes). While the asterisks indicate significant differences, the magnitude of these differences is not immediately apparent. Referring to Table 1 for details is acceptable, but including key statistical values in the figure or caption would be beneficial.
  • Interpretation Support: The reference text points to Table 1 for details of the statistical tests. The figure supports the interpretation of the results, but is more of a visual summary. The scientific validity is tied to the methods described in the text, and the statistics reported in Table 1.
Communication
  • Visual Presentation and Clarity: The figure presents three separate violin plots, each comparing the ratings for a different scale (Funny, Creative, Shareable) across three conditions (Human only, Human-AI Collaboration, AI only). The use of violin plots is appropriate for visualizing distributions, and allows comparison of the shape, central tendency, and spread of the data for each group. The y-axis represents the rating scale, while x-axis shows the different conditions.
  • Caption Accuracy and Clarity: The caption clearly states the purpose of the figure: to compare ratings across the three scales and conditions. The use of "pairwise comparison" implies that statistical tests were performed to compare each pair of conditions. The scales are clearly named and enclosed in quotes, indicating they were presented to participants this way.
  • Significance Indication and Statistical Test Information: The use of asterisks (*, **, ***) to indicate statistical significance is conventional. However, unlike previous figures, the key for these symbols is not provided in the caption, requiring the reader to refer to the text. This makes the figure less self-contained. Also, the caption does not mention the statistical tests that were used for these comparisons.
  • Labeling and Key for Symbols: The figure could benefit from explicitly labeling the y-axis (e.g., "Rating Score") and providing a key for the asterisks within the figure itself or the caption.

Discussion

Key Aspects

Strengths

Suggestions for Improvement
