One Does Not Simply Meme Alone: Evaluating Co-Creativity Between LLMs and Humans in the Generation of Humor

Zhikun Wu, Thomas Weber, Florian Müller
30th International Conference on Intelligent User Interfaces (IUI '25)
KTH Royal Institute of Technology

Overall Summary

Study Background and Main Findings

This research investigates the potential of Large Language Models (LLMs) as collaborative partners in generating humorous content, specifically internet memes. The study addresses a gap in existing research, which has primarily focused on LLMs in tasks like writing or narrative creation, by exploring their capabilities in a humor-rich and culturally nuanced domain. The objective was to compare the creative output of three groups: individuals creating memes without AI assistance, individuals collaborating with an LLM, and an LLM generating memes autonomously.

The study employed a two-phase experimental design. In the first phase, participants in the human-only and human-AI collaboration groups were tasked with generating captions for pre-selected meme templates, focusing on topics like work, food, and sports. The human-AI group interacted with an LLM through a chat interface. In the second phase, a separate group of participants rated a random sample of the generated memes on three dimensions: humor, creativity, and shareability. Statistical tests, including the Mann-Whitney U test and ANOVA, were used to analyze the data.
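
As a concrete illustration of the analysis described above, the following is a minimal sketch of how such group comparisons could be run in Python. The data, group sizes, and rating scale are simulated; the paper's actual analysis scripts are not reproduced here.

```python
# Hypothetical sketch of the reported statistical comparisons (one-way ANOVA
# and Mann-Whitney U); the data below are simulated, not the study's results.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated 1-5 ratings for one dimension (e.g., humor) per condition.
human_only = rng.integers(1, 6, size=150)
human_ai = rng.integers(1, 6, size=150)
ai_only = rng.integers(1, 6, size=150)

# Omnibus comparison across the three conditions.
f_stat, p_anova = stats.f_oneway(human_only, human_ai, ai_only)

# Non-parametric pairwise comparison between two conditions.
u_stat, p_mwu = stats.mannwhitneyu(human_only, ai_only, alternative="two-sided")

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")
print(f"Mann-Whitney U (human vs. AI): U = {u_stat:.0f}, p = {p_mwu:.3f}")
```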

Key findings revealed that participants using the LLM generated significantly more ideas and reported lower perceived effort compared to those working alone. However, memes created entirely by the LLM were rated significantly higher, on average, across all three evaluation metrics (humor, creativity, and shareability). Despite this, when analyzing the top-performing memes, human-created memes excelled in humor, while human-AI collaborations were superior in creativity and shareability. The study concludes that while LLMs can boost productivity and generate content with broad appeal, human creativity remains crucial for achieving high levels of humor. The findings also highlight the complexities of human-AI collaboration and the need for further research to optimize this interaction in creative domains.

Research Impact and Future Directions

This study provides valuable insights into the potential and limitations of using Large Language Models (LLMs) for co-creative tasks, specifically in the domain of humor and meme generation. The findings demonstrate that while LLMs can significantly boost productivity and reduce perceived effort, they do not necessarily lead to higher-quality output when collaborating with humans. Surprisingly, AI-generated memes were rated higher on average across all dimensions (humor, creativity, and shareability) compared to human-generated or human-AI collaborative memes. However, it is crucial to note that the funniest memes were predominantly created by humans, highlighting the continued importance of human creativity in achieving high levels of comedic impact.

The research underscores the complex interplay between human and artificial intelligence in creative endeavors. While AI can serve as a powerful tool for idea generation and content production, human creativity remains essential for nuanced humor and deeper engagement. The study's limitations, such as the short-term interaction and limited collaboration, point to the need for further research to optimize human-AI partnerships in creative domains.

The findings suggest a need for a more nuanced approach to human-AI collaboration, one that leverages the strengths of both. Future research should focus on developing interfaces and interaction paradigms that better facilitate iterative collaboration, explore different prompting strategies, and investigate the long-term effects of LLM assistance on human creativity. The study also raises important questions about the nature of humor and creativity in the digital age, and how AI can be used to enhance, rather than replace, human creative expression.

Critical Analysis and Recommendations

Clear Research Question (written-content)
The abstract clearly defines the research question, focusing on LLMs in co-creating humor-rich content (memes). This is important because it immediately informs the reader of the study's specific focus, setting it apart from broader research on LLMs and creativity. This clarity allows readers to quickly assess the relevance of the study to their interests.
Section: Abstract
Missing Broader Context (written-content)
The abstract lacks sufficient contextualization within the broader field of human-AI collaboration. Adding a sentence or two about the growing interest in this area and the challenges of creative tasks would better capture the reader's attention and convey the study's importance. This would highlight the rationale behind the research, making it more compelling.
Section: Abstract
Effective Contextualization (written-content)
The introduction effectively establishes the context by highlighting the prevalence of collaboration and its benefits for creativity, referencing established research. This grounding in prior work provides a solid foundation for the study's rationale and helps readers understand the significance of investigating human-AI collaboration.
Section: Introduction
Missing Meme Definition (written-content)
The introduction could benefit from a more explicit definition of "internet memes." While most readers likely have some familiarity with memes, a concise definition would ensure a shared understanding and enhance clarity, particularly for those less familiar with internet culture.
Section: Introduction
Well-Defined Experimental Groups (written-content)
The methodology clearly defines the three experimental groups (baseline, human-AI collaboration, and AI-only). This allows for a direct comparison of meme creation methods, enabling researchers to isolate the effects of LLM assistance. This rigorous design is crucial for drawing valid conclusions about the impact of AI on the creative process.
Section: Methodology
Missing LLM Specification (written-content)
The methodology does not specify which LLM was used, beyond mentioning GPT-4.0 later in the paper. Specifying the model (and ideally, version) is crucial for reproducibility and comparison with future research. Different LLMs have different capabilities, and this detail is essential for interpreting the findings.
Section: Methodology
Increased Idea Generation (written-content)
Participants using the LLM generated significantly more ideas than those in the baseline group (Mann-Whitney U test, p < 0.001). This quantitative finding, derived from a controlled experiment, demonstrates the potential of LLMs to enhance idea generation in creative tasks. This has practical implications for individuals and teams seeking to overcome creative blocks or increase productivity.
Section: Results
Missing Effect Sizes (written-content)
The Results section often omits effect sizes, reporting only p-values. Including effect sizes (e.g., Cohen's d, r) is crucial for understanding the magnitude of the observed differences, not just their statistical significance. This provides a more complete picture of the practical importance of the findings.
Section: Results
Good Connection to Prior Research (written-content)
The discussion effectively summarizes the main findings and connects them to prior research. This contextualization helps readers understand how the study contributes to the existing body of knowledge on human-AI collaboration and creative tasks. It also highlights both consistencies and discrepancies with previous work.
Section: Discussion
Underemphasized Human Strengths (written-content)
The discussion acknowledges that human-AI collaboration did not lead to overall quality improvements, but it doesn't sufficiently emphasize the specific strengths of human contributions, particularly in generating the funniest memes. This nuance is important for a balanced interpretation of the results, highlighting that while AI excels in average performance, human creativity remains crucial for top-tier humor.
Section: Discussion

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration...
Full Caption

Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.

Figure/Table Image (Page 1)
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
First Reference in Text
Figure 1: Top 4 Memes Generated by AI, Humans, and Human-AI Collaboration Across Humor, Creativity, and Shareability Metric.
Description
  • Overall Structure and Content: This figure presents a collection of 12 images, formatted as internet memes, arranged in a 3x4 grid. Each row represents a different metric used to evaluate the memes: 'Humor,' 'Creativity,' and 'Shareability.' Each column appears to correspond to a different generation method: the first column seems to be AI-generated, the second and third human-generated, and the fourth a human-AI collaboration. A meme is a humorous image, video, or piece of text that is copied (often with slight variations) and spread rapidly by Internet users. Each meme includes an image macro (a picture with superimposed text) and a short text caption. For example, one meme shows a surprised cartoon character with the text "When it's only 10 AM but you've already been at work for 5 hours," implying that the workday feels long even early in the morning, a relatable and potentially humorous situation.
  • Comparison of Generation Methods: The figure showcases examples of memes created under different conditions: entirely by AI, entirely by humans, and through human-AI collaboration. The exact process of 'collaboration' isn't defined in the figure itself, but presumably, a human and an AI system worked together in some way to create the meme, perhaps with the AI suggesting text and the human choosing the image, or vice-versa.
  • Evaluation Metrics: The figure implies a qualitative assessment based on three subjective metrics: humor (how funny the meme is), creativity (how original and novel the meme is), and shareability (how likely someone would be to share the meme with others). There are no numerical scores or quantitative data presented directly within the figure, suggesting a ranking based on some form of subjective evaluation.
Scientific Validity
  • Selection Methodology: The figure presents examples of generated memes, but lacks methodological details regarding how these specific memes were selected as the 'top 4'. Without information on the selection process (e.g., sample size, rating scales, statistical analysis), it's impossible to assess the validity of the claim that these are indeed the 'top' memes. The figure serves as an illustrative example rather than strong empirical evidence.
  • Subjectivity of Metrics: The metrics (Humor, Creativity, Shareability) are inherently subjective. While relevant to the study's focus, the figure doesn't provide information on how these qualities were assessed or quantified, making it difficult to judge the scientific rigor of the evaluation. Were these ratings obtained from a panel of judges? If so, what were their demographics, and what instructions were they given?
  • Illustrative, not comprehensive: The Figure presents example outputs, which is useful, but it doesn't represent the entirety of results. For example, it is unclear how many memes were created in total, and what the distribution of scores was across all memes.
Communication
  • Clarity and Readability: The figure uses a visually engaging format (memes) which is appropriate for the subject matter. However, the criteria for selecting the 'top 4' are not explicitly stated in the caption or figure itself, which could lead to ambiguity. The use of a grid layout is effective for comparison, but the small size of the text within each meme may hinder readability, especially in a printed format. The categorization into Humor, Creativity, and Shareability is clear, but the rationale for choosing these specific metrics is not given in the caption, although it is likely elaborated in the main body of the paper.
  • Contextualization: While the figure is self-contained in presenting the memes, it lacks context without the accompanying text. A reader unfamiliar with the study might not fully grasp the significance of these specific memes or how they were generated and evaluated without more details in the caption.

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Methodology

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
Figure/Table Image (Page 3)
Figure 2: Mapping of Meme Templates to Topics (Work, Food, Sports) in the Study.
First Reference in Text
Ideation In the first step, we displayed one of six background images of popular memes (Figure 2) to the participants and asked them to come up with as many captions as they could within five minutes.
Description
  • Meme Templates and Topics: This figure shows six common internet meme templates, five of which were used in the study. A 'meme template' is a recognizable image or series of images that people use as a base to create their own memes, usually by adding text. The figure connects these templates to three different topics: Work, Food, and Sports. For example, the 'baby' meme template, showing a baby with a determined expression, is linked to the 'Work' topic. This suggests that participants in the study were asked to create memes related to work using this specific template.
  • Exclusion of one template: The figure presents a visual representation of the association between six meme templates and three predetermined topics. One template, labeled 'choice,' is indicated as excluded, leaving five templates for analysis.
  • Identification of Meme Templates: The five meme templates used are: 'baby,' 'boromir,' 'doge,' 'futurama,' and 'toy-story.' These are all well-known and widely used meme formats. 'boromir' refers to a scene from the Lord of the Rings movie; 'doge' is a picture of a Shiba Inu dog; 'futurama' is a still from the animated TV show Futurama; and 'toy-story' refers to a scene from the movie Toy Story.
  • Topic Categories: The three topics are Work, Food, and Sports. These are broad categories, and the figure suggests that participants were asked to generate captions for the provided meme templates that related to one of these three topics. The connection between a template and a topic is visually represented by an arrow.
Scientific Validity
  • Controlled Stimuli: The figure illustrates the stimuli used in the ideation phase of the study. Providing participants with pre-selected meme templates introduces a level of control and standardization to the meme creation process. This approach allows for a more focused comparison between different groups (human-only, human-AI, AI-only) as they are all working with the same basic materials. However, it may also limit the range of creativity compared to allowing participants to choose their own templates.
  • Template Selection: The choice of popular meme templates increases the likelihood that participants will be familiar with the formats, potentially leading to more fluent idea generation. However, the specific criteria for selecting these particular templates (beyond being 'popular') are not described. A more rigorous justification for the selection process would strengthen the methodology.
  • Data Exclusion: The exclusion of one template ('choice') is clearly indicated, but the reason for this exclusion is not provided in the figure itself or the provided reference text. Transparency regarding data exclusion is crucial for scientific validity. The reason should be explained within the main text.
  • Presentation of Experimental Setup: The figure only shows the templates and topics; it doesn't present any results. As such, there's no statistical analysis to assess, just the presentation of the experimental setup.
Communication
  • Clarity and Visual Organization: The figure effectively uses a visual representation to connect meme templates with specific topics. The layout is clear and easy to follow, with distinct sections for each topic and corresponding meme examples. The use of color-coding (red, green, blue) for each topic enhances visual distinction and aids in quick comprehension. The images are large and recognizable meme templates, allowing for immediate identification. The use of arrows to connect templates to labels is intuitive.
  • Caption Conciseness and Accuracy: The caption clearly states the figure's purpose: to show the mapping between meme templates and topics. The inclusion of "(Work, Food, Sports)" explicitly defines the topics covered, providing context. However, the term "Mapping" could be slightly ambiguous to a non-technical reader. A more descriptive term like "Association" or "Relationship" might improve clarity.
  • Handling of Excluded Data: The meme labeled 'choice' with an asterisk and the note '* This image was excluded from the study' is clearly marked and explained, preventing misinterpretation. The red 'X' over the image reinforces its exclusion.
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat...
Full Caption

Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.

Figure/Table Image (Page 4)
Figure 3: User Interface Overview: Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Final Image Creation.
First Reference in Text
The interface (Figure 3) displayed a blank meme template as well as the instructions to the user.
Description
  • Overview of the User Interface: This figure showcases the different screens a participant would see when using the system to create memes. It's like a series of snapshots showing each step of the process. There are four main parts: coming up with ideas without help (Baseline Ideation), coming up with ideas using a chatbot (Ideation with Chat Interface), choosing their favorite ideas (Favorite Selection), and finally, creating the finished meme image (Image Creation).
  • Ideation Stages: In the 'Baseline Ideation' stage, users see a blank meme template and a space to type in their ideas. In the 'Ideation with Chat Interface' stage, users see the same thing, but also have a chat window where they can interact with an AI to get help generating ideas. An 'AI' or 'Artificial Intelligence' is a computer program designed to mimic human intelligence, in this case, to help generate meme captions. The chatbot is a type of AI that you can talk to (or in this case, type to).
  • Selection and Creation Stages: The 'Favorite Selection' stage shows a list of all the ideas the user generated, and they can pick their top three. The 'Image Creation' stage shows a meme editor where the user can add their chosen text to the meme image and position it how they like.
  • Interface Components: The interface includes elements like text input boxes, buttons (though their specific functions are hard to discern due to image resolution), a chat window, and a meme image editor. These are common components of web-based applications.
Scientific Validity
  • Reproducibility and Implementation Details: The figure provides a visual representation of the experimental setup, allowing for a better understanding of how participants interacted with the system. This aids in assessing the reproducibility of the study, as other researchers can see the interface used. However, the figure doesn't provide details about the underlying implementation, such as the specific chatbot technology used or the algorithms for generating meme captions.
  • Comparison of Conditions: The inclusion of both a baseline ideation interface (without AI assistance) and an ideation interface with a chat interface allows for a direct comparison between the two conditions. This strengthens the study's ability to isolate the impact of AI assistance on meme creation.
  • Complete Workflow Representation: By showing the entire workflow (ideation, selection, creation), the figure helps ensure that all stages of the meme-generation process are accounted for. This improves the overall validity of the experimental design.
Communication
  • Clear Visual Representation of UI: The figure provides a visual walkthrough of the user interface (UI) used in the study, showing the different stages of the meme creation process. The use of screenshots is effective in conveying the actual look and feel of the interface. The four stages (Baseline Ideation, Ideation with Chat Interface, Favorite Selection, and Image Creation) are clearly labeled, providing a logical flow that mirrors the user's experience.
  • Informative Labeling and Workflow: The labeling of each stage is concise and informative. The inclusion of a brief description under each screenshot (e.g., "Ideation with Chat Interface") further clarifies the purpose of each step. The arrows connecting the different stages visually represent the progression of the task, enhancing understanding of the workflow.
  • Readability of Details: While the figure shows the overall structure, the smaller details within each screenshot are difficult to read. This limits the ability to fully understand the specific functionalities and elements within each interface component. Zooming capabilities or higher-resolution images would improve readability.
  • Missing Chatbot Prompt: The chat interface is shown, but neither the user prompt nor any system prompt is visible. Including an example interaction would help readers better understand the chatbot's capabilities.
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration,...
Full Caption

Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.

Figure/Table Image (Page 5)
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
First Reference in Text
Figure 4: Meme Generation Workflow: Human (Baseline), Human-AI Collaboration, and AI-Driven Creation.
Description
  • Overview of Workflow: This figure presents a diagram showing the steps involved in creating memes in three different ways: by a human alone (Baseline), by a human working with an AI (Human-AI Collaboration), and by an AI alone (AI-driven Creation). It's like a recipe or a set of instructions for each method.
  • Human and Human-AI Workflow: For the human-only (Baseline) method, the steps are: coming up with ideas (Ideation), choosing favorite ideas (Favorite Selection), and creating the final meme image (Image Creation). The same three steps are present for the Human-AI collaboration method, but the 'Ideation' step involves interacting with a chatbot.
  • AI-Driven Workflow: For the AI-driven Creation method, the process involves providing a prompt to the AI ('Generate 20 meme captions for this <image> about the topic of <topic>'), and the AI then generates the memes. A 'prompt' is a set of instructions given to an AI to tell it what to do. The <image> and <topic> placeholders would be replaced with specific image descriptions and topics, respectively. A hedged sketch of issuing such a prompt through an API appears at the end of this list.
  • Inclusion of Experience Survey: Each condition includes a connection to 'Experience Survey'. This means that after each method, participants were asked to complete a survey about their experience.
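  • Illustrative API Sketch: A hedged sketch of how the AI-driven creation prompt described above might be issued to a multimodal chat model, using the OpenAI Python client as an assumed example. The model name, image URL, and calling code are placeholders; the paper does not specify them in this figure.
```python
# Hypothetical sketch only; model name, image URL, and topic are placeholders,
# and the study's actual API usage may differ.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
topic = "work"
template_url = "https://example.com/templates/doge.jpg"  # placeholder URL

response = client.chat.completions.create(
    model="gpt-4o",  # assumption; the paper refers to GPT-4.0
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Generate 20 meme captions for this image about the topic of {topic}."},
            {"type": "image_url", "image_url": {"url": template_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```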
Scientific Validity
  • Clear Delineation of Experimental Conditions: The figure clearly outlines the different experimental conditions, which is crucial for understanding the study's design and comparing the results across groups. The separation of the workflows allows for a clear isolation of the independent variable (method of meme generation).
  • Inclusion of Baseline Condition: The inclusion of a baseline (human-only) condition provides a crucial point of comparison for evaluating the impact of AI assistance. This allows the researchers to determine whether AI collaboration or AI-driven creation leads to different outcomes compared to human-only meme generation.
  • Level of Detail in Protocol: The figure provides a high-level overview of the workflow, but lacks details about the specific instructions given to participants in each condition. For example, were participants in the human-AI collaboration group given any guidance on how to interact with the chatbot? More detailed information about the experimental protocol would enhance the reproducibility of the study.
  • AI Prompt Specification: The prompt shown for AI-driven creation is clear and well-defined. Providing this level of detail increases the transparency and replicability of the study, as others can use the same prompt to generate similar results.
Communication
  • Clear Visual Representation of Workflow: The figure effectively uses a flowchart to illustrate the workflow for each of the three experimental conditions: Human (Baseline), Human-AI Collaboration, and AI-driven Creation. The use of different colors for each condition (green, teal, and gray, respectively) helps to visually distinguish them. The arrows clearly indicate the flow of the process, and the icons within each step provide a quick visual summary of the action (e.g., writing, selecting, editing). The inclusion of example prompts and system output for the AI-driven creation enhances understanding.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Ideation," "Favorite Selection," "Image Creation"). The use of consistent terminology across all three conditions facilitates comparison. The caption accurately describes the content of the figure.
  • Readability of Details: While the overall flow is clear, the details within some of the icons and smaller text boxes are difficult to read, particularly the prompt example for generating image captions. This reduces the ability to fully understand the specifics of the AI interaction. Larger or higher-resolution images, and perhaps a separate figure detailing the prompt structure, would be beneficial.
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation...
Full Caption

Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.

Figure/Table Image (Page 5)
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
First Reference in Text
Figure 5: Meme Evaluation Workflow: This diagram illustrates the evaluation process of memes created by humans, human-AI collaboration, and AI-driven approaches.
Description
  • Overall Workflow Description: This figure is a flowchart showing how the memes created in the study were evaluated. It breaks down the process into several steps, starting from collecting the initial ideas and ending with an online survey where people rated the memes.
  • Different Creation Conditions: There are three main branches in the flowchart, each representing a different way memes were created: by humans alone (Human), by humans and AI working together (Human-AI Collaboration), and by AI alone (AI-driven). Each branch shows how many initial ideas or memes were collected (e.g., 415 Favorite Ideas for the human baseline, 300 Meme Images for the AI-driven approach).
  • Sampling Process: The 'Generation & Curation' step involves collecting the initial ideas or memes. Then, a 'Random Sample' is taken, reducing the number of memes to 150 for each condition. 'Random sampling' means selecting a smaller group from a larger group in a way that each member of the larger group has an equal chance of being chosen. This helps ensure the smaller group is representative of the larger group.
  • Online Survey and Rating Metrics: Finally, an 'Online Survey for Rating The Images' was conducted. Participants from a platform called 'Prolific' were shown a random sample of 50 images and asked to rate them on three aspects: Humor, Creativity, and Shareability. Prolific is a website where researchers can recruit participants for online studies.
  • Sampling Details: The figure includes notes explaining the sampling process. It states that 10 images were randomly sampled for each combination of background picture and topic, and that each participant in the survey saw a random sample of 50 images.
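  • Illustrative Sampling Sketch: A minimal sketch of the stratified random sampling described above, assuming 10 memes drawn per template-topic combination (5 templates × 3 topics × 10 = 150 per condition); the data structure and field names are hypothetical.
```python
# Hypothetical sketch; the field names ('template', 'topic') are assumptions.
import random

TEMPLATES = ["baby", "boromir", "doge", "futurama", "toy-story"]
TOPICS = ["work", "food", "sports"]

def sample_condition(memes, per_cell=10, seed=42):
    """Draw up to `per_cell` memes for each (template, topic) cell of one condition."""
    rng = random.Random(seed)
    sampled = []
    for template in TEMPLATES:
        for topic in TOPICS:
            cell = [m for m in memes if m["template"] == template and m["topic"] == topic]
            sampled.extend(rng.sample(cell, min(per_cell, len(cell))))
    return sampled  # up to 5 * 3 * 10 = 150 memes
```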
Scientific Validity
  • Clear Delineation of Evaluation Process: The figure clearly outlines the evaluation process, which is crucial for understanding how the quality of the memes was assessed. The separation of the workflows for each condition allows for a clear comparison of the evaluation results.
  • Transparency in Data Collection: The inclusion of specific numbers (e.g., 415 Favorite Ideas, 335 Meme Images) provides transparency regarding the data collection and sampling process. This allows for a better assessment of the sample size and potential biases.
  • Use of Random Sampling: The use of random sampling is a standard practice in research to reduce bias and ensure that the selected memes are representative of the larger pool of generated memes. The figure clearly indicates that random sampling was employed.
  • Participant Information: The figure mentions using participants from Prolific for the online survey, which is a common platform for recruiting participants. However, it doesn't provide details about the participant demographics or any inclusion/exclusion criteria. This information is important for assessing the generalizability of the findings.
  • Operationalization of Evaluation Metrics: The figure specifies the evaluation metrics (Humor, Creativity, Shareability) but doesn't provide details on how these were measured (e.g., rating scales, instructions to participants). More information about the operationalization of these constructs would strengthen the scientific validity.
Communication
  • Clear Visual Representation and Flow: The figure effectively uses a flowchart to depict the evaluation process of the generated memes. The use of distinct sections for each condition (Human, Human-AI Collaboration, AI-driven) and color-coding (orange, teal, and gray) aids in visual differentiation. The arrows clearly indicate the flow of the process, from generation and curation to sampling and finally to the online survey. The inclusion of specific numbers (e.g., 415 Favorite Ideas, 150 Meme Images) provides a quantitative overview of the data.
  • Consistent and Informative Labeling: The labeling of each step is concise and informative (e.g., "Generation & Curation," "Random Sample," "Online Survey for Rating The Images"). The use of consistent terminology across all conditions facilitates comparison. The caption accurately describes the content and purpose of the figure.
  • Detailed Evaluation Methodology: The inclusion of an "Online Survey for Rating The Images" section with details about the rating process (Humor, Creativity, Shareability) and the use of participants from Prolific enhances the understanding of the evaluation methodology. The note about randomly sampling 10 images for each combination and displaying 50 images to each participant clarifies the sampling strategy.
  • Readability of Details: While the overall flow is clear, the smaller text within some of the boxes is difficult to read, particularly the numbers indicating the quantity of memes at each stage. Larger font sizes or a zoomed-in view of specific sections would improve readability.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Figure 6: Participants using the LLM were able to produce significantly more...
Full Caption

Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test (***: p < 0.001)

Figure/Table Image (Page 6)
Figure 6: Participants using the LLM were able to produce significantly more ideas than participants who had no external support, according to the Mann-Whitney-U test (***: p < 0.001)
First Reference in Text
As seen in Figure 7, participants that were able to use the LLM created noticeably more ideas than the participants in the baseline group.
Description
  • Description of the Boxplot: This figure is a boxplot comparing the number of ideas generated by two groups of participants: those who used a Large Language Model (LLM), which is a type of AI, and those who did not (Human only). A boxplot is a way to visually represent the distribution of a dataset. The box shows the middle 50% of the data, the line inside the box represents the median (the middle value), and the 'whiskers' extend to the furthest data point within 1.5 times the interquartile range (the height of the box). Any points outside the whiskers are considered outliers. A plotting sketch appears at the end of this list.
  • Axis Labels: The y-axis (vertical axis) shows the 'Number of ideas,' indicating the quantity of ideas generated by each participant. The x-axis (horizontal axis) shows the two groups: 'Human only' and 'Human-AI collaboration.'
  • Comparison of Groups: The boxplot for the 'Human-AI collaboration' group is noticeably higher than the boxplot for the 'Human only' group. This visually indicates that participants who used the LLM generally generated more ideas than those who did not. The caption states that this difference is statistically significant, meaning it's unlikely to have occurred by chance.
  • Statistical Significance: The caption mentions a Mann-Whitney U test, which is a statistical test used to compare two groups when the data is not normally distributed (i.e., it doesn't follow a bell curve). The result 'p < 0.001' means there is less than a 0.1% chance that the observed difference between the groups is due to random variation.
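  • Illustrative Plotting Sketch: A minimal sketch of producing a comparable boxplot from simulated idea counts (the study's raw data are not reproduced here); matplotlib's default whiskers extend 1.5 times the interquartile range, matching the description above.
```python
# Simulated idea counts; not the study's data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
human_only = rng.poisson(5, size=25)
human_ai = rng.poisson(9, size=25)

fig, ax = plt.subplots()
ax.boxplot([human_only, human_ai])
ax.set_xticks([1, 2], ["Human only", "Human-AI collaboration"])
ax.set_ylabel("Number of ideas")
plt.show()
```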
Scientific Validity
  • Statistical Analysis: The figure presents a clear statistical comparison between two groups, supporting the claim of a significant difference in idea generation. The use of the Mann-Whitney U test is appropriate given the non-normal distribution of the data (mentioned in the text). The p-value (p < 0.001) indicates strong statistical significance.
  • Data Visualization and Generalizability: The boxplot provides a visual representation of the data distribution, allowing for a quick assessment of the central tendency and spread of each group. However, the figure alone doesn't provide information about the sample size or the specific characteristics of the participants in each group. This information is crucial for assessing the generalizability of the findings.
  • Conclusion Support: The figure, along with the provided statistical data, supports the conclusion that using the LLM led to significantly more ideas being produced.
Communication
  • Clarity and Statistical Reporting: The caption clearly states the main finding: that participants using the LLM generated significantly more ideas. The use of "significantly" is appropriate given the statistical test result (p < 0.001). The reference to the Mann-Whitney U test provides the statistical justification for the claim, and the inclusion of the p-value allows readers to assess the strength of the evidence. However, the figure itself is a boxplot, and the caption doesn't explicitly mention this.
  • Visual Presentation: The graph visually presents the difference in the number of ideas generated between the two groups (Human only and Human-AI collaboration). The use of a boxplot is appropriate for comparing distributions. The y-axis is clearly labeled ("Number of ideas"), and the x-axis distinguishes between the two groups. The difference between the groups is visually apparent.
  • Significance Indication: The use of asterisks (***) to indicate statistical significance is a standard practice, and the corresponding p-value threshold (p < 0.001) is provided. This allows for a quick visual assessment of the significance level.
  • Missing Statistical Details: While the caption mentions the Mann-Whitney U test, it does not provide the test statistic (U value) or the effect size. Including these would provide a more complete picture of the results. Also, the figure could include median values to better inform the reader.
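  • Illustrative Effect-Size Sketch: A minimal sketch of the effect sizes recommended above, computed from simulated idea counts; the rank-biserial correlation is a standard effect size for the Mann-Whitney U test, and Cohen's d for mean differences. None of the numbers come from the paper.
```python
# Simulated data; the formulas are standard, the results are not the study's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
baseline = rng.poisson(5, size=25)   # hypothetical idea counts, human only
with_llm = rng.poisson(9, size=25)   # hypothetical idea counts, human-AI

u, p = stats.mannwhitneyu(with_llm, baseline, alternative="two-sided")
rank_biserial = 1 - (2 * u) / (len(with_llm) * len(baseline))  # effect size for U

pooled_sd = np.sqrt((with_llm.var(ddof=1) + baseline.var(ddof=1)) / 2)
cohens_d = (with_llm.mean() - baseline.mean()) / pooled_sd

print(f"U = {u:.0f}, p = {p:.4f}, rank-biserial r = {rank_biserial:.2f}, d = {cohens_d:.2f}")
```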
Figure 7: While there were no significant differences in overall workload, the...
Full Caption

Figure 7: While there were no significant differences in overall workload, the "Effort" subscale of the NASA TLX was significantly different according to the Mann-Whitney-U test (*: p < 0.05)

Figure/Table Image (Page 6)
Figure 7: While there were no significant differences in overall workload, the "Effort" subscale of the NASA TLX was significantly different according to the Mann-Whitney-U test (*: p < 0.05)
First Reference in Text
As seen in Figure 7, participants that were able to use the LLM created noticeably more ideas than the participants in the baseline group.
Description
  • Overview of NASA TLX and Groups: This figure shows the results of a questionnaire called the NASA TLX, which measures how demanding a task is. It compares the results for two groups: people who did the task on their own ('Human only') and people who did the task with help from an AI ('Human-AI collaboration'). The results are shown for different aspects of workload, like mental demand, physical demand, and effort. The figure uses boxplots.
  • Boxplot Explanation and Subscales: Each boxplot shows the distribution of scores for a particular aspect of workload. The box represents the middle 50% of the scores, the line inside the box is the median (the middle score), and the 'whiskers' extend to the furthest data point within 1.5 times the interquartile range. The figure also includes an 'Average' NASA TLX score. A sketch of how such an average is typically computed appears at the end of this list.
  • Key Finding: Effort Subscale: The caption highlights that there was no significant difference in the overall workload between the two groups. However, there was a significant difference in the 'Effort' subscale, meaning that the group working with the AI reported significantly lower effort. 'Significant' in this context, means the difference is unlikely due to random chance.
  • Statistical Test and p-value: The caption mentions a Mann-Whitney U test, a statistical test used to compare two groups when the data isn't normally distributed. The result 'p < 0.05' means there is less than a 5% chance that the observed difference in effort between the groups is due to random variation.
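  • Illustrative Scoring Sketch: A minimal sketch of how an unweighted ("raw") NASA TLX average is commonly computed from the six subscales; the example ratings are invented, and the paper's exact scoring procedure is assumed rather than quoted.
```python
# Hypothetical ratings on the conventional 0-100 NASA TLX scale.
SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration"]

def raw_tlx_average(ratings):
    """ratings: dict mapping each subscale name to a score."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)

example = {"Mental Demand": 55, "Physical Demand": 10, "Temporal Demand": 40,
           "Performance": 30, "Effort": 45, "Frustration": 25}
print(round(raw_tlx_average(example), 2))  # 34.17
```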
Scientific Validity
  • Use of Standardized Instrument and Appropriate Statistical Test: The figure presents a comparison of NASA TLX scores between two groups, providing a measure of perceived workload. The use of the NASA TLX, a well-established and validated instrument, adds to the scientific validity of the assessment. The statistical analysis using the Mann-Whitney U test is appropriate given the likely non-normal distribution of subjective rating data.
  • Relevance of "Effort" Subscale: The focus on the "Effort" subscale is relevant to the study's research question, as it investigates whether AI assistance reduces the perceived effort required for meme generation. The statistically significant difference (p < 0.05) supports the claim that AI assistance reduced perceived effort.
  • Effect Size Reporting: While the figure shows a significant difference in perceived effort, it doesn't provide information about the effect size. Reporting the effect size (e.g., Cohen's d or Cliff's delta) would provide a more complete picture of the magnitude of the difference.
  • Comparison of Groups: The figure compares only two groups. The results are consistent with the interpretation that the AI reduced effort.
Communication
  • Visual Presentation and Clarity: The figure presents a series of boxplots comparing different subscales of the NASA TLX (Task Load Index) between two groups: Human only and Human-AI collaboration. The use of boxplots is appropriate for visualizing distributions. The y-axis represents the NASA TLX scores, and the x-axis shows the different subscales (Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, Frustration) and the overall average.
  • Caption Accuracy and Statistical Reporting: The caption highlights a key finding: no significant difference in overall workload, but a significant difference in the "Effort" subscale. The reference to the Mann-Whitney U test and the p-value (p < 0.05) provides statistical support for this claim. However, similar to Figure 6's caption, this one also does not mention the type of graph being presented.
  • Significance Indication: The use of an asterisk (*) to indicate statistical significance is conventional, and the corresponding p-value threshold (p < 0.05) is provided. This allows for quick visual identification of significant differences.
  • Labeling and Caption Conciseness: The figure could benefit from explicitly labeling the y-axis as "NASA TLX Score" or similar, to improve clarity. The caption could be more concise by focusing on the key finding related to effort, rather than stating the non-significant overall workload result, which is visually evident in the graph.
  • Missing Units on Axes: The axes are labeled, but no units or score range are shown. Since the NASA-TLX is a standard questionnaire, it might make sense to state the possible score range.
Figure 8: Pairwise comparison of how participants rated the memes with respect...
Full Caption

Figure 8: Pairwise comparison of how participants rated the memes with respect to the three scales "funny", "creative", and "shareable".

Figure/Table Image (Page 7)
Figure 8: Pairwise comparison of how participants rated the memes with respect to the three scales "funny", "creative", and "shareable".
First Reference in Text
According to these tests, each condition showed significant differences, as shown in Table 1.
Description
  • Overview of Scales and Conditions: This figure shows how people rated the memes on three different scales: how funny they were ('funny'), how original they were ('creative'), and how likely they would be to share them ('shareable'). It compares the ratings for memes created in three different ways: by humans alone, by humans and AI together, and by AI alone.
  • Explanation of Violin Plots: The figure uses violin plots to show the distribution of ratings. A violin plot is like a boxplot, but it also shows the probability density of the data at different values - the wider the plot, the more common that rating is. The white dot represents the median rating, and the thick black bar represents the interquartile range (the middle 50% of the ratings). The thinner black lines extend to the furthest data point within 1.5 times the interquartile range. A plotting sketch appears at the end of this list.
  • Comparison of Ratings and Significance: For each scale (Funny, Creative, Shareable), the violin plots show the distribution of ratings for each of the three conditions (Human only, Human-AI Collaboration, AI only). The asterisks (*, **, ***) above the plots indicate statistically significant differences between the groups, meaning the differences are unlikely to be due to random chance. The more asterisks, the stronger the statistical significance.
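  • Illustrative Plotting Sketch: A minimal sketch of a comparable violin plot built from simulated 1-5 ratings (the study's actual rating scale and raw data are not given in the figure); significance markers would be added separately.
```python
# Simulated ratings; not the study's data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
conditions = ["Human only", "Human-AI Collaboration", "AI only"]
ratings = [rng.integers(1, 6, size=150) for _ in conditions]

fig, ax = plt.subplots()
ax.violinplot(ratings, showmedians=True)
ax.set_xticks([1, 2, 3], conditions)
ax.set_ylabel("Rating")
plt.show()
```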
Scientific Validity
  • Statistical Comparisons: The figure presents a visual comparison of ratings across three conditions and three scales. The use of pairwise comparisons suggests that appropriate statistical tests were conducted to assess the differences between groups (although the specific tests are not mentioned in the figure caption). The reference text mentions Table 1, suggesting that the statistical details are provided elsewhere in the paper.
  • Subjective Ratings and Scale Information: The figure focuses on subjective ratings (funny, creative, shareable), which are relevant to the study's research question. However, without information on the rating scale used (e.g., 1-5, 1-7) and the instructions given to participants, it's difficult to fully assess the validity of the ratings.
  • Data Visualization and Statistical Details: The use of violin plots allows for visualization of the data distribution, but it doesn't provide specific statistical values (e.g., means, standard deviations, test statistics, effect sizes). While the asterisks indicate significant differences, the magnitude of these differences is not immediately apparent. Referring to Table 1 for details is acceptable, but including key statistical values in the figure or caption would be beneficial.
  • Interpretation Support: The reference text points to Table 1 for details of the statistical tests. The figure supports the interpretation of the results, but is more of a visual summary. The scientific validity is tied to the methods described in the text, and the statistics reported in Table 1.
Communication
  • Visual Presentation and Clarity: The figure presents three separate violin plots, each comparing the ratings for a different scale (Funny, Creative, Shareable) across three conditions (Human only, Human-AI Collaboration, AI only). The use of violin plots is appropriate for visualizing distributions, and allows comparison of the shape, central tendency, and spread of the data for each group. The y-axis represents the rating scale, while x-axis shows the different conditions.
  • Caption Accuracy and Clarity: The caption clearly states the purpose of the figure: to compare ratings across the three scales and conditions. The use of "pairwise comparison" implies that statistical tests were performed to compare each pair of conditions. The scales are clearly named and enclosed in quotes, indicating they were presented to participants this way.
  • Significance Indication and Statistical Test Information: The use of asterisks (*, **, ***) to indicate statistical significance is conventional. However, unlike previous figures, the key for these symbols is not provided in the caption, requiring the reader to refer to the text. This makes the figure less self-contained. Also, the caption does not mention the statistical tests that were used for these comparisons.
  • Labeling and Key for Symbols: The figure could benefit from explicitly labeling the y-axis (e.g., "Rating Score") and providing a key for the asterisks within the figure itself or the caption.

Discussion

Key Aspects

Strengths

Suggestions for Improvement
