One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

Zhikun Wu, Thomas Weber, Florian Müller
30th International Conference on Intelligent User Interfaces (IUI '25)
KTH Royal Institute of Technology

Overall Summary

Study Background and Main Findings

This research investigates the potential of Large Language Models (LLMs) as collaborative partners in generating humorous content, specifically internet memes. The study addresses a gap in existing research, which has primarily focused on LLMs in tasks like writing or narrative creation, by exploring their capabilities in a humor-rich and culturally nuanced domain. The objective was to compare the creative output of three groups: individuals creating memes without AI assistance, individuals collaborating with an LLM, and an LLM generating memes autonomously.

The study employed a two-phase experimental design. In the first phase, participants in the human-only and human-AI collaboration groups were tasked with generating captions for pre-selected meme templates, focusing on topics like work, food, and sports. The human-AI group interacted with an LLM through a chat interface. In the second phase, a separate group of participants rated a random sample of the generated memes on three dimensions: humor, creativity, and shareability. Statistical tests, including the Mann-Whitney U test and ANOVA, were used to analyze the data.

Key findings revealed that participants using the LLM generated significantly more ideas and reported lower perceived effort compared to those working alone. However, memes created entirely by the LLM were rated significantly higher, on average, across all three evaluation metrics (humor, creativity, and shareability). Despite this, when analyzing the top-performing memes, human-created memes excelled in humor, while human-AI collaborations were superior in creativity and shareability. The study concludes that while LLMs can boost productivity and generate content with broad appeal, human creativity remains crucial for achieving high levels of humor. The findings also highlight the complexities of human-AI collaboration and the need for further research to optimize this interaction in creative domains.

Research Impact and Future Directions

This study provides valuable insights into the potential and limitations of using Large Language Models (LLMs) for co-creative tasks, specifically in the domain of humor and meme generation. The findings demonstrate that while LLMs can significantly boost productivity and reduce perceived effort, they do not necessarily lead to higher-quality output when collaborating with humans. Surprisingly, AI-generated memes were rated higher on average across all dimensions (humor, creativity, and shareability) compared to human-generated or human-AI collaborative memes. However, it is crucial to note that the funniest memes were predominantly created by humans, highlighting the continued importance of human creativity in achieving high levels of comedic impact.

The research underscores the complex interplay between human and artificial intelligence in creative endeavors. While AI can serve as a powerful tool for idea generation and content production, human creativity remains essential for nuanced humor and deeper engagement. The study's limitations, such as the short-term interaction and limited collaboration, point to the need for further research to optimize human-AI partnerships in creative domains.

The findings suggest a need for a more nuanced approach to human-AI collaboration, one that leverages the strengths of both. Future research should focus on developing interfaces and interaction paradigms that better facilitate iterative collaboration, explore different prompting strategies, and investigate the long-term effects of LLM assistance on human creativity. The study also raises important questions about the nature of humor and creativity in the digital age, and how AI can be used to enhance, rather than replace, human creative expression.

Critical Analysis and Recommendations

Clear Research Question (written-content)
The abstract clearly defines the research question, focusing on LLMs in co-creating humor-rich content (memes). This is important because it immediately informs the reader of the study's specific focus, setting it apart from broader research on LLMs and creativity. This clarity allows readers to quickly assess the relevance of the study to their interests.
Section: Abstract
Missing Broader Context (written-content)
The abstract lacks sufficient contextualization within the broader field of human-AI collaboration. Adding a sentence or two about the growing interest in this area and the particular challenges that creative tasks pose would better capture the reader's attention and convey the study's importance, making the rationale behind the research explicit and the work more compelling.
Section: Abstract
Effective Contextualization (written-content)
The introduction effectively establishes the context by highlighting the prevalence of collaboration and its benefits for creativity, referencing established research. This grounding in prior work provides a solid foundation for the study's rationale and helps readers understand the significance of investigating human-AI collaboration.
Section: Introduction
Missing Meme Definition (written-content)
The introduction could benefit from a more explicit definition of "internet memes." While most readers likely have some familiarity with memes, a concise definition would ensure a shared understanding and enhance clarity, particularly for those less familiar with internet culture.
Section: Introduction
Well-Defined Experimental Groups (written-content)
The methodology clearly defines the three experimental groups (baseline, human-AI collaboration, and AI-only). This allows for a direct comparison of meme creation methods, enabling researchers to isolate the effects of LLM assistance. This rigorous design is crucial for drawing valid conclusions about the impact of AI on the creative process.
Section: Methodology
Missing LLM Specification (written-content)
The methodology does not specify which LLM was used, beyond mentioning GPT-4.0 later in the paper. Specifying the model (and ideally, version) is crucial for reproducibility and comparison with future research. Different LLMs have different capabilities, and this detail is essential for interpreting the findings.
Section: Methodology
Increased Idea Generation (written-content)
Participants using the LLM generated significantly more ideas than those in the baseline group (Mann-Whitney U test, p < 0.001); a minimal sketch of such a test appears after this item. This quantitative finding, derived from a controlled experiment, demonstrates the potential of LLMs to enhance idea generation in creative tasks and has practical implications for individuals and teams seeking to overcome creative blocks or increase productivity.
Section: Results
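
To make the reported statistic concrete, the comparison could be run as in the minimal sketch below. The idea counts are illustrative placeholders, not the study's data; only the choice of test comes from the paper.

    from scipy.stats import mannwhitneyu

    # Illustrative idea counts per participant (placeholder data, not the study's)
    baseline_ideas = [4, 5, 3, 6, 4, 5, 7, 3]
    llm_assisted_ideas = [9, 7, 10, 8, 11, 9, 12, 8]

    # Two-sided Mann-Whitney U test, the test reported in the paper
    u_stat, p_value = mannwhitneyu(llm_assisted_ideas, baseline_ideas,
                                   alternative="two-sided")
    print(f"U = {u_stat}, p = {p_value:.4f}")
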
Missing Effect Sizes (written-content)
The Results section often omits effect sizes, reporting only p-values. Including effect sizes (e.g., Cohen's d, r) is crucial for understanding the magnitude of the observed differences, not just their statistical significance. This provides a more complete picture of the practical importance of the findings; a sketch of one such conversion appears after this item.
Section: Results
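
One such measure, the rank-biserial correlation, can be derived directly from the U statistic. The helper below is a minimal sketch of that conversion, not the authors' analysis code; the example values are hypothetical.

    def rank_biserial(u: float, n1: int, n2: int) -> float:
        """Rank-biserial correlation: r = 1 - 2U / (n1 * n2).

        Values range from -1 to 1; larger magnitudes indicate a larger
        group difference, complementing the p-value.
        """
        return 1 - (2 * u) / (n1 * n2)

    # Hypothetical example: U = 12 with 8 participants per group
    print(rank_biserial(12, 8, 8))  # 0.625
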
Good Connection to Prior Research (written-content)
The discussion effectively summarizes the main findings and connects them to prior research. This contextualization helps readers understand how the study contributes to the existing body of knowledge on human-AI collaboration and creative tasks. It also highlights both consistencies and discrepancies with previous work.
Section: Discussion
Underemphasized Human Strengths (written-content)
The discussion acknowledges that human-AI collaboration did not lead to overall quality improvements, but it doesn't sufficiently emphasize the specific strengths of human contributions, particularly in generating the funniest memes. This nuance is important for a balanced interpretation of the results, highlighting that while AI excels in average performance, human creativity remains crucial for top-tier humor.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration...
Full Caption

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.

Figure/Table Image (Page 1)
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
First Reference in Text
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
Description
  • Overall Structure and Content: This figure presents a collection of 12 images, formatted as internet memes, arranged in a 3x4 grid. Each row represents a metric used to evaluate the memes: 'Humor,' 'Creativity,' and 'Shareability.' Each column appears to correspond to a generation method: the first AI-generated, the second and third human-generated, and the fourth human-AI collaboration. A meme is a humorous image, video, or piece of text that is copied (often with slight variations) and spread rapidly by Internet users. Each meme includes an image macro (a picture with superimposed text) and a short text caption. For example, one meme shows a surprised cartoon character with the text "When it's only 10 AM but you've already been at work for 5 hours," implying that the workday feels long even early in the morning, a relatable and potentially humorous situation.
  • Comparison of Generation Methods: The figure showcases examples of memes created under different conditions: entirely by AI, entirely by humans, and through human-AI collaboration. The exact process of 'collaboration' isn't defined in the figure itself, but presumably, a human and an AI system worked together in some way to create the meme, perhaps with the AI suggesting text and the human choosing the image, or vice-versa.
  • Evaluation Metrics: The figure implies a qualitative assessment based on three subjective metrics: humor (how funny the meme is), creativity (how original and novel the meme is), and shareability (how likely someone would be to share the meme with others). There are no numerical scores or quantitative data presented directly within the figure, suggesting a ranking based on some form of subjective evaluation.
Scientific Validity
  • Selection Methodology: The figure presents examples of generated memes, but lacks methodological details regarding how these specific memes were selected as the 'top 4'. Without information on the selection process (e.g., sample size, rating scales, statistical analysis), it's impossible to assess the validity of the claim that these are indeed the 'top' memes. The figure serves as an illustrative example rather than strong empirical evidence.
  • Subjectivity of Metrics: The metrics (Humor, Creativity, Shareability) are inherently subjective. While relevant to the study's focus, the figure doesn't provide information on how these qualities were assessed or quantified, making it difficult to judge the scientific rigor of the evaluation. Were these ratings obtained from a panel of judges? If so, what were their demographics, and what instructions were they given?
  • Illustrative, not comprehensive: The figure presents example outputs, which is useful, but it does not represent the entirety of the results. For example, it is unclear how many memes were created in total, or what the distribution of scores was across all memes.
Communication
  • Clarity and Readability: The figure uses a visually engaging format (memes) which is appropriate for the subject matter. However, the criteria for selecting the 'top 4' are not explicitly stated in the caption or figure itself, which could lead to ambiguity. The use of a grid layout is effective for comparison, but the small size of the text within each meme may hinder readability, especially in a printed format. The categorization into Humor, Creativity, and Shareability is clear, but the rationale for choosing these specific metrics is not given in the caption, although it is likely elaborated in the main body of the paper.
  • Contextualization: While the figure is self-contained in presenting the memes, it lacks context without the accompanying text. A reader unfamiliar with the study might not fully grasp the significance of these specific memes or how they were generated and evaluated without more details in the caption.

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Methodology

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
Figure/Table Image (Page 3)
Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
First Reference in Text
Ideation In the first step, we displayed one of six background images of popular memes (Figure 2) to the participants and asked them to come up with as many captions as they could within five minutes.
Description
  • Meme Templates and Topics: This figure shows six common internet meme templates, one of which is marked as excluded. A 'meme template' is a recognizable image or series of images that people use as a base to create their own memes, usually by adding text. The figure connects these templates to three different topics: Work, Food, and Sports. For example, the 'baby' meme template, showing a baby with a determined expression, is linked to the 'Work' topic. This suggests that participants in the study were asked to create memes related to work using this specific template.
  • Exclusion of one template: The figure visually associates the six meme templates with three predetermined topics. One template, labeled 'choice', is indicated as excluded, leaving five templates for the study.
  • Identification of Meme Templates: The five meme templates used are: 'baby,' 'boromir,' 'doge,' 'futurama,' and 'toy-story.' These are all well-known and widely used meme formats. 'boromir' refers to a scene from the Lord of the Rings movie; 'doge' is a picture of a Shiba Inu dog; 'futurama' is a still from the animated TV show Futurama; and 'toy-story' refers to a scene from the movie Toy Story.
  • Topic Categories: The three topics are Work, Food, and Sports. These are broad categories, and the figure suggests that participants were asked to generate captions for the provided meme templates that related to one of these three topics. The connection between a template and a topic is visually represented by an arrow.
Scientific Validity
  • Controlled Stimuli: The figure illustrates the stimuli used in the ideation phase of the study. Providing participants with pre-selected meme templates introduces a level of control and standardization to the meme creation process. This approach allows for a more focused comparison between different groups (human-only, human-AI, AI-only) as they are all working with the same basic materials. However, it may also limit the range of creativity compared to allowing participants to choose their own templates.
  • Template Selection: The choice of popular meme templates increases the likelihood that participants will be familiar with the formats, potentially leading to more fluent idea generation. However, the specific criteria for selecting these particular templates (beyond being 'popular') are not described. A more rigorous justification for the selection process would strengthen the methodology.
  • Data Exclusion: The exclusion of one template ('choice') is clearly indicated, but the reason for this exclusion is not provided in the figure itself or the provided reference text. Transparency regarding data exclusion is crucial for scientific validity. The reason should be explained within the main text.
  • Presentation of Experimental Setup: The figure only shows the templates and topics; it doesn't present any results. As such, there's no statistical analysis to assess, just the presentation of the experimental setup.
Communication
  • Clarity and Visual Organization: The figure effectively uses a visual representation to connect meme templates with specific topics. The layout is clear and easy to follow, with distinct sections for each topic and corresponding meme examples. The use of color-coding (red, green, blue) for each topic enhances visual distinction and aids in quick comprehension. The images are large and recognizable meme templates, allowing for immediate identification. The use of arrows to connect templates to labels is intuitive.
  • Caption Conciseness and Accuracy: The caption clearly states the figure's purpose: to show the mapping between meme templates and topics. The inclusion of "(Work, Food, Sports)" explicitly defines the topics covered, providing context. However, the term "Mapping" could be slightly ambiguous to a non-technical reader. A more descriptive term like "Association" or "Relationship" might improve clarity.
  • Handling of Excluded Data: The meme labeled 'choice' with an asterisk and the note '* This image was excluded from the study' is clearly marked and explained, preventing misinterpretation. The red 'X' over the image reinforces its exclusion.
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat...
Full Caption

Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.

Figure/Table Image (Page 4)
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.
First Reference in Text
The interface (Figure 3) displayed a blank meme template as well as the instructions to the user.
Description
  • Overview of the User Interface: This figure showcases the different screens a participant would see when using the system to create memes. It's like a series of snapshots showing each step of the process. There are four main parts: coming up with ideas without help (Baseline Ideation), coming up with ideas using a chatbot (Ideation with Chat Interface), choosing their favorite ideas (Favorite Selection), and finally, creating the finished meme image (Image Creation).
  • Ideation Stages: In the 'Baseline Ideation' stage, users see a blank meme template and a space to type in their ideas. In the 'Ideation with Chat Interface' stage, users see the same thing, but also have a chat window where they can interact with an AI to get help generating ideas. An 'AI' or 'Artificial Intelligence' is a computer program designed to mimic human intelligence, in this case, to help generate meme captions. The chatbot is a type of AI that you can talk to (or in this case, type to).
  • Selection and Creation Stages: The 'Favorite Selection' stage shows a list of all the ideas the user generated, and they can pick their top three. The 'Image Creation' stage shows a meme editor where the user can add their chosen text to the meme image and position it how they like.
  • Interface Components: The interface includes elements like text input boxes, buttons (though their specific functions are hard to discern due to image resolution), a chat window, and a meme image editor. These are common components of web-based applications.
Scientific Validity
  • Reproducibility and Implementation Details: The figure provides a visual representation of the experimental setup, allowing for a better understanding of how participants interacted with the system. This aids in assessing the reproducibility of the study, as other researchers can see the interface used. However, the figure doesn't provide details about the underlying implementation, such as the specific chatbot technology used or the algorithms for generating meme captions.
  • Comparison of Conditions: The inclusion of both a baseline ideation interface (without AI assistance) and an ideation interface with a chat interface allows for a direct comparison between the two conditions. This strengthens the study's ability to isolate the impact of AI assistance on meme creation.
  • Complete Workflow Representation: By showing the entire workflow (ideation, selection, creation), the figure helps ensure that all stages of the meme-generation process are accounted for. This improves the overall validity of the experimental design.
Communication
  • Clear Visual Representation of UI: The figure provides a visual walkthrough of the user interface (UI) used in the study, showing the different stages of the meme creation process. The use of screenshots is effective in conveying the actual look and feel of the interface. The four stages (Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Image Creation) are clearly labeled, providing a logical flow that mirrors the user's experience.
  • Informative Labeling and Workflow: The labeling of each stage is concise and informative. The inclusion of a brief description under each screenshot (e.g., "Ideation with Chat Interface") further clarifies the purpose of each step. The arrows connecting the different stages visually represent the progression of the task, enhancing understanding of the workflow.
  • Readability of Details: While the figure shows the overall structure, the smaller details within each screenshot are difficult to read. This limits the ability to fully understand the specific functionalities and elements within each interface component. Zooming capabilities or higher-resolution images would improve readability.
  • Missing Chatbot Prompt: The chat interface is shown, but the specific prompt used or any system prompt is not visible. Including an example interaction would help readers better understand the capabilities.
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration,...
Full Caption

Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.

Figure/Table Image (Page 5)
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
First Reference in Text
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
Description
  • Overview of Workflow: This figure presents a diagram showing the steps involved in creating memes in three different ways: by a human alone (Baseline), by a human working with an AI (Human-AI Collaboration), and by an AI alone (AI-driven Creation). It's like a recipe or a set of instructions for each method.
  • Human and Human-AI Workflow: For the human-only (Baseline) method, the steps are: coming up with ideas (Ideation), choosing favorite ideas (Favorite Selection), and creating the final meme image (Image Creation). The same three steps are present for the Human-AI collaboration method, but the 'Ideation' step involves interacting with a chatbot.
  • AI-Driven Workflow: For the AI-driven Creation method, the process involves providing a prompt to the AI ('Generate 20 meme captions for this <image> about the topic of <topic>') and the AI then generates the memes. A 'prompt' is a set of instructions given to an AI to tell it what to do. The <image> and <topic> would be replaced with specific image descriptions and topics, respectively.
  • Inclusion of Experience Survey: Each condition includes a connection to 'Experience Survey'. This means that after each method, participants were asked to complete a survey about their experience.
Scientific Validity
  • Clear Delineation of Experimental Conditions: The figure clearly outlines the different experimental conditions, which is crucial for understanding the study's design and comparing the results across groups. The separation of the workflows allows for a clear isolation of the independent variable (method of meme generation).
  • Inclusion of Baseline Condition: The inclusion of a baseline (human-only) condition provides a crucial point of comparison for evaluating the impact of AI assistance. This allows the researchers to determine whether AI collaboration or AI-driven creation leads to different outcomes compared to human-only meme generation.
  • Level of Detail in Protocol: The figure provides a high-level overview of the workflow, but lacks details about the specific instructions given to participants in each condition. For example, were participants in the human-AI collaboration group given any guidance on how to interact with the chatbot? More detailed information about the experimental protocol would enhance the reproducibility of the study.
  • AI Prompt Specification: The prompt shown for AI-driven creation is clear and well-defined. Providing this level of detail increases the transparency and replicability of the study, as others can use the same prompt to generate similar results (a hedged sketch of such a call follows this figure's analysis).
Communication
  • Clear Visual Representation of Workflow: The figure effectively uses a flowchart to illustrate the workflow for each of the three experimental conditions: Human (Baseline), Human-AI Collaboration, and AI-driven Creation. The use of different colors for each condition (green, teal, and gray, respectively) helps to visually distinguish them. The arrows clearly indicate the flow of the process, and the icons within each step provide a quick visual summary of the action (e.g., writing, selecting, editing). The inclusion of example prompts and system output for the AI-driven creation enhances understanding.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Ideation," "Favorite Selection," "Image Creation"). The use of consistent terminology across all three conditions facilitates comparison. The caption accurately describes the content of the figure.
  • Readability of Details: While the overall flow is clear, the details within some of the icons and smaller text boxes are difficult to read, particularly the prompt example for generating image captions. This reduces the ability to fully understand the specifics of the AI interaction. Larger or higher-resolution images, and perhaps a separate figure detailing the prompt structure, would be beneficial.
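
For illustration, the prompt shown in Figure 4 could be issued to a chat-based LLM roughly as sketched below. This is an assumption-laden sketch, not the authors' implementation: the figure does not specify the model version, decoding parameters, or how the template image was attached, so the model name and the text-only placeholder handling here are guesses.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Prompt as shown in Figure 4. In practice, <image> would be supplied as an
    # attached image to a vision-capable model rather than as literal text.
    prompt = "Generate 20 meme captions for this <image> about the topic of <topic>"

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the paper only mentions "GPT-4.0"
        messages=[{"role": "user", "content": prompt.replace("<topic>", "work")}],
    )
    print(response.choices[0].message.content)
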
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation...
Full Caption

Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.

Figure/Table Image (Page 5)
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
First Reference in Text
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
Description
  • Overall Workflow Description: This figure is a flowchart showing how the memes created in the study were evaluated. It breaks down the process into several steps, starting from collecting the initial ideas and ending with an online survey where people rated the memes.
  • Different Creation Conditions: There are three main branches in the flowchart, each representing a different way memes were created: by humans alone (Human), by humans and AI working together (Human-AI Collaboration), and by AI alone (AI-driven). Each branch shows how many initial ideas or memes were collected (e.g., 415 Favorite Ideas for the human baseline, 300 Meme Images for the AI-driven approach).
  • Sampling Process: The 'Generation & Curation' step involves collecting the initial ideas or memes. Then, a 'Random Sample' is taken, reducing the number of memes to 150 for each condition. 'Random sampling' means selecting a smaller group from a larger group in a way that each member of the larger group has an equal chance of being chosen. This helps ensure the smaller group is representative of the larger group.
  • Online Survey and Rating Metrics: Finally, an 'Online Survey for Rating The Images' was conducted. Participants from a platform called 'Prolific' were shown a random sample of 50 images and asked to rate them on three aspects: Humor, Creativity, and Shareability. Prolific is a website where researchers can recruit participants for online studies.
  • Sampling Details: The figure includes notes explaining the sampling process: 10 images were randomly sampled for each combination of background picture and topic, and each participant in the survey saw a random sample of 50 images (a minimal sketch of this stratified draw follows this figure's analysis).
Scientific Validity
  • Clear Delineation of Evaluation Process: The figure clearly outlines the evaluation process, which is crucial for understanding how the quality of the memes was assessed. The separation of the workflows for each condition allows for a clear comparison of the evaluation results.
  • Transparency in Data Collection: The inclusion of specific numbers (e.g., 415 Favorite Ideas, 335 Meme Images) provides transparency regarding the data collection and sampling process. This allows for a better assessment of the sample size and potential biases.
  • Use of Random Sampling: The use of random sampling is a standard practice in research to reduce bias and ensure that the selected memes are representative of the larger pool of generated memes. The figure clearly indicates that random sampling was employed.
  • Participant Information: The figure mentions using participants from Prolific for the online survey, which is a common platform for recruiting participants. However, it doesn't provide details about the participant demographics or any inclusion/exclusion criteria. This information is important for assessing the generalizability of the findings.
  • Operationalization of Evaluation Metrics: The figure specifies the evaluation metrics (Humor, Creativity, Shareability) but doesn't provide details on how these were measured (e.g., rating scales, instructions to participants). More information about the operationalization of these constructs would strengthen the scientific validity.
Communication
  • Clear Visual Representation and Flow: The figure effectively uses a flowchart to depict the evaluation process of the generated memes. The use of distinct sections for each condition (Human, Human-AI Collaboration, AI-driven) and color-coding (orange, teal, and gray) aids in visual differentiation. The arrows clearly indicate the flow of the process, from generation and curation to sampling and finally to the online survey. The inclusion of specific numbers (e.g., 415 Favorite Ideas, 150 Meme Images) provides a quantitative overview of the data.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Generation & Curation," "Random Sample," "Online Survey for Rating The Images"). The use of consistent terminology across all conditions facilitates comparison. The caption accurately describes the content and purpose of the figure.
  • Detailed Evaluation Methodology: The inclusion of an "Online Survey for Rating The Images" section with details about the rating process (Humor, Creativity, Shareability) and the use of participants from Prolific enhances the understanding of the evaluation methodology. The note about randomly sampling 10 images for each combination and displaying 50 images to each participant clarifies the sampling strategy.
  • Readability of Details: While the overall flow is clear, the smaller text within some of the boxes is difficult to read, particularly the numbers indicating the quantity of memes at each stage. Larger font sizes or a zoomed-in view of specific sections would improve readability.
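
The stratified draw noted in Figure 5 can be made concrete with a short sketch. Assuming the generated memes are grouped by (background, topic) cell, drawing 10 per cell across five templates and three topics yields the 150 images per condition shown in the figure; the function and variable names below are hypothetical.

    import random

    def sample_condition(memes_by_cell, per_cell=10, seed=0):
        """Draw `per_cell` memes at random from each (background, topic) cell.

        With 5 templates x 3 topics x 10 images per cell, this yields the
        150 images per condition reported in Figure 5.
        """
        rng = random.Random(seed)
        sample = []
        for memes in memes_by_cell.values():
            sample.extend(rng.sample(memes, min(per_cell, len(memes))))
        return sample
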

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 6: Participants using the LLM were able to produce significantly more...
Full Caption

Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test (***: p < 0.001)

Figure/Table Image (Page 6)