LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, Thomas V. Wiecki
arXiv: 2510.08338v1
PyMC Labs, Tallinn, Estonia

Overall Summary

Study Background and Main Findings

This paper addresses a critical challenge in using Large Language Models (LLMs) for consumer research: their tendency to produce unrealistic response distributions when asked for direct numerical ratings. Traditional consumer surveys are costly and prone to biases, and while LLMs offer a scalable alternative by simulating 'synthetic consumers,' this fundamental flaw has limited their utility. The authors' primary objective was to develop and validate a new method that could elicit more human-like survey responses from LLMs.

The core of the study is the introduction of a novel two-step method called Semantic Similarity Rating (SSR). First, an LLM, prompted to impersonate a consumer with specific demographic traits, generates a free-text response expressing its purchase intent for a given product concept. Second, instead of asking the model to convert this text to a single number, the system transforms the text into a high-dimensional vector (an embedding) that captures its meaning. The similarity of this vector is then mathematically compared to pre-defined 'anchor statements' for each point on a 5-point Likert scale. This process yields a full probability distribution across the scale, reflecting the nuance and ambiguity of the original text.
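
To make the mapping concrete, the following is a minimal Python sketch of the SSR step; the anchor statements and the normalization are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch of the SSR mapping: embed a free-text response, score it
# against one anchor per Likert point, and normalize into a distribution.
# The anchor texts and the normalization below are illustrative assumptions.
import numpy as np

ANCHORS = [
    "I would definitely not buy this product.",   # 1
    "I would probably not buy this product.",     # 2
    "I might or might not buy this product.",     # 3
    "I would probably buy this product.",         # 4
    "I would definitely buy this product.",       # 5
]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ssr_distribution(response_vec: np.ndarray,
                     anchor_vecs: list[np.ndarray]) -> np.ndarray:
    """Turn a response embedding into a probability distribution over 1-5."""
    sims = np.array([cosine(response_vec, v) for v in anchor_vecs])
    weights = sims - sims.min()  # one plausible normalization; the paper's may differ
    return weights / weights.sum()

# Toy demonstration with random vectors standing in for real text embeddings.
rng = np.random.default_rng(0)
anchor_vecs = [rng.normal(size=384) for _ in ANCHORS]
response_vec = 0.7 * anchor_vecs[3] + 0.3 * rng.normal(size=384)  # "leans 4"
print(ssr_distribution(response_vec, anchor_vecs))  # most mass near rating 4
```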

The authors rigorously tested the SSR method against simpler baselines using an extensive dataset of 57 real-world consumer surveys on personal care products, involving 9,300 human participants. The results were compelling: the SSR method successfully replicated the relative ranking of products with over 90% of human test-retest reliability. Critically, it also generated response distributions that were highly similar to human patterns, achieving a Kolmogorov-Smirnov (KS) similarity score of over 0.85, whereas the direct rating baseline failed dramatically on this metric (KS similarity ≈ 0.26). The study also demonstrated that providing demographic context to the LLMs was essential for achieving this high level of performance.
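
Assuming the paper's KS similarity is one minus the Kolmogorov-Smirnov statistic (the maximum gap between the two cumulative response distributions), the metric can be sketched as follows:

```python
# Sketch of a KS-similarity computation between two 5-point Likert
# distributions, assuming KS similarity = 1 - max |CDF_human - CDF_synth|.
import numpy as np

def ks_similarity(p_human: np.ndarray, p_synth: np.ndarray) -> float:
    """Inputs are probability vectors over the ratings 1..5 (each sums to 1)."""
    gap = np.abs(np.cumsum(p_human) - np.cumsum(p_synth)).max()
    return 1.0 - float(gap)

human  = np.array([0.05, 0.10, 0.25, 0.40, 0.20])  # illustrative human shares
narrow = np.array([0.00, 0.05, 0.80, 0.15, 0.00])  # overly narrow, DLR-style
print(ks_similarity(human, narrow))  # ~0.55: penalizes the compressed distribution
```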

The paper concludes that the shortcomings of LLMs in survey simulation are not inherent limitations of the models but rather artifacts of flawed elicitation techniques. The SSR framework offers a robust, training-free solution that preserves the quantitative metrics of traditional surveys while adding rich, interpretable qualitative feedback. This work presents a significant methodological advancement with the potential to make early-stage product research faster, more affordable, and more insightful.

Research Impact and Future Directions

This paper presents a robust and well-validated solution to a significant obstacle in the use of LLMs for consumer research. The Semantic Similarity Rating (SSR) method is a clear and substantial improvement over direct numerical prompting, enabling synthetic survey panels to replicate not only the relative ranking of products but also the nuanced distribution of human responses with high fidelity. The study's methodological rigor, particularly its use of a large real-world dataset and the introduction of the noise-aware 'correlation attainment' metric, lends strong credibility to its findings.

The primary conclusion—that the failures of LLMs as survey respondents are artifacts of elicitation methods rather than fundamental model incapacities—is well-supported and represents a key conceptual advance for the field. The practical implications are significant: the SSR framework offers a scalable, cost-effective tool for rapid screening of product concepts, potentially accelerating innovation and democratizing access to consumer insights. The dual output of quantitative ratings and qualitative rationales is a particularly powerful feature, combining the strengths of two traditionally separate research modalities.

However, the study's simulation-based design necessarily limits the scope of its conclusions. While the method successfully reproduces human survey responses, it cannot claim to reproduce actual purchasing behavior in a real-world market. The findings are also contingent on the specific domain (personal care products), where a wealth of relevant discussion likely exists in the LLMs' training data; its applicability to more niche or novel domains remains an open question. Furthermore, the observed differences in performance between LLMs (e.g., the paradoxical effect of removing demographics for Gem-2f) underscore that these methods are not universally 'plug-and-play' and require careful, model-specific validation. Despite these limitations, the paper provides a credible and powerful framework that fundamentally changes the prospects for using LLMs in consumer research.

Critical Analysis and Recommendations

Clear Problem Framing and Quantifiable Claims (written-content)
The abstract immediately establishes credibility by framing a clear problem (unrealistic LLM survey ratings) and presenting its solution with strong, specific performance metrics. Citing '90% of human test–retest reliability' and 'KS similarity > 0.85' provides concrete, impressive evidence of the method's efficacy, elevating the paper's claims beyond qualitative description and demonstrating a rigorous validation process from the outset.
Section: Abstract
Lack of Concrete Examples Hinders Initial Comprehension (written-content)
The introduction describes the core SSR mechanism but leaves the crucial concept of 'predefined anchor statements' abstract. Providing a brief, intuitive example (e.g., 'reference texts for each point, such as `I would definitely buy it` for a score of 5') would have immediately demystified this linchpin of the method, grounding the technical description and significantly enhancing reader comprehension from the start.
Section: Introduction
Systematic Literature Review Establishes a Clear Research Gap (written-content)
The Related Works section methodically reviews prior approaches, such as direct numeric elicitation and demographic conditioning, and identifies their shared, fundamental limitation: the failure to produce realistic response distributions. This structured critique effectively carves out a well-defined and necessary research gap that the proposed SSR method is perfectly positioned to address, strongly justifying the paper's contribution.
Section: Related Works
Introduction of a Noise-Aware Success Metric (written-content)
The 'Correlation Attainment' metric is a significant methodological innovation that demonstrates a sophisticated understanding of survey data. By benchmarking the LLM's performance against a theoretical maximum derived from human test-retest reliability, the authors account for the inherent noise in the ground-truth data. This provides a far more rigorous and realistic assessment of the model's capabilities than a simple correlation score, substantially strengthening the validity of the paper's conclusions.
Section: Methods
Visual Evidence Powerfully Demonstrates Methodological Superiority (graphical-figure)
Figure 3 provides compelling visual evidence for the paper's central claim by plotting the distribution of similarity scores for all three tested methods. The clear and substantial separation of the distributions, with SSR (mean KS similarity = 0.88) far outperforming the intermediate FLR (0.72) and the baseline DLR (0.26), offers an immediate and intuitive confirmation of the SSR method's superior ability to replicate human response patterns.
Section: Results
Ablation Study Reveals Critical Role of Demographics (written-content)
A key experiment demonstrated that removing demographic prompts from the synthetic consumers caused correlation attainment to collapse from 92% to 50%. This rigorous ablation study provides powerful evidence that persona conditioning is not merely an enhancement but is essential for generating a meaningful, product-differentiating signal, proving that the LLMs are effectively leveraging this contextual information.
Section: Results
Reframing the Problem is a Key Conceptual Contribution (written-content)
The discussion elevates the paper's contribution beyond a simple technical fix by powerfully reframing the discourse. The authors argue that failures of LLMs in surveys are not intrinsic model limitations but artifacts of poor elicitation methods. This conceptual shift is a significant and impactful takeaway that changes how researchers should approach the problem of synthetic data generation.
Section: Discussion and Conclusion

Section Analysis

Methods

Non-Text Elements

Figure 1: Different response generation procedures and SSR response-likelihood...
Full Caption

Figure 1: Different response generation procedures and SSR response-likelihood mapping.

Figure/Table Image (Page 2)
First Reference in Text
We evaluated three response strategies (see Fig. 1A):
Description
  • Flowchart of Three Methods for Eliciting Ratings from LLMs: This diagram illustrates three different methods for getting a rating from a 'synthetic consumer,' which is a Large Language Model (LLM) instructed to act like a person in a survey. The process begins by giving the LLM a role ('impersonate a consumer') and a product concept to evaluate. The core of the diagram compares three strategies for how the LLM provides its rating: (1) Direct Likert Rating, where the LLM must choose a single number from 1 to 5; (2) Follow-up Likert Rating, where the LLM first gives a written opinion, and then a second LLM instance, acting as an 'expert,' converts that text into a single number from 1 to 5; and (3) Semantic Similarity Rating (SSR), where the LLM's written opinion is converted into a special numerical code called an 'embedding vector.' This vector, which captures the meaning of the text, is then mathematically compared to pre-defined reference statements for each point on the 1-5 scale to produce a full probability distribution, rather than a single number.
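
For orientation, here is a schematic sketch of the three strategies in code form; the `chat`/`embed` helpers and the prompt wording are hypothetical placeholders, not the paper's actual implementation.

```python
# Schematic contrast of the three elicitation strategies in Fig. 1A.
# `chat` and `embed` are hypothetical stand-ins for any LLM chat and
# text-embedding API; the prompt wording is illustrative only.

def chat(system: str, user: str) -> str:
    raise NotImplementedError("plug in an LLM chat client here")

def embed(text: str):
    raise NotImplementedError("plug in a text-embedding client here")

PERSONA = "Impersonate a 45-year-old consumer from the US Midwest."  # example

def direct_likert_rating(concept: str) -> str:
    # (1) DLR: force a single number 1-5 in one call.
    return chat(PERSONA, f"{concept}\nRate your purchase intent from 1 to 5. "
                         "Reply with a single number.")

def followup_likert_rating(concept: str) -> str:
    # (2) FLR: free text first; a second call (the 'Likert expert') converts it.
    opinion = chat(PERSONA, f"{concept}\nWould you buy this? Answer briefly.")
    return chat("You convert consumer statements into 1-5 Likert ratings.",
                f"Rate the purchase intent of this statement from 1 to 5: {opinion}")

def semantic_similarity_rating(concept: str):
    # (3) SSR: free text is embedded and mapped to a probability distribution
    # over the scale via anchor similarities (see the SSR sketch above).
    opinion = chat(PERSONA, f"{concept}\nWould you buy this? Answer briefly.")
    return embed(opinion)  # then compare against embedded anchor statements
```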
Scientific Validity
  • ✅ Clear depiction of experimental conditions: The flowchart provides a clear and logical overview of the three distinct response elicitation strategies being compared. This serves as an excellent methodological summary, allowing readers to grasp the core experimental design at a glance.
  • 💡 Ambiguity of 'Likert expert' role: The term 'Likert expert' in method (2) is slightly ambiguous within the figure itself. While the main text clarifies this is a new instance of the same model with a different prompt, the diagram alone does not make this clear. For full self-containment, a brief note could be added, e.g., 'Same LLM, new prompt,' to specify the nature of this component.
  • ✅ Inclusion of prompt context: Showing the initial 'System prompt' and 'User prompts' is a methodological strength. It provides crucial context about how the synthetic agents were set up, which is essential for the reproducibility and interpretation of the results.
Communication
  • ✅ Effective flowchart structure: The use of a flowchart is highly effective for illustrating the sequential and branching nature of the response generation process. The visual flow from prompts to the three distinct methods is intuitive and easy to follow.
  • ✅ Use of a concrete example: The inclusion of an example 'Synthetic response' ('I'm somewhat interested...') helps ground the abstract processes in a tangible example, making the different pathways easier for the reader to understand.
  • 💡 Visual linkage could be improved: The arrow from 'Elicit brief textual responses' leads into a general space from which methods (2) and (3) branch. Method (1), however, bypasses this step. The diagram could be slightly reorganized to make it more visually explicit that the textual response is an input for methods (2) and (3) only, perhaps by having the arrow split more directly to those two boxes.
Figure 5: A surrogate product concept similar to those used in the 57 concept...
Full Caption

Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.

Figure/Table Image (Page 11)
First Reference in Text
When we refer to “image stimulus” in the main text, an image like this, including either both an illustration and the concept description or only a concept description was supplied to an LLM synthetic consumer (see App.
Description
  • Example of a product concept stimulus: This figure displays an example of the marketing material, or 'stimulus,' shown to survey participants to evaluate. It features a fictional product called 'AURAFOAM™ Mood-Infused Body Wash.' The image is split into two parts: on the left is a picture of the product bottle, and on the right is a text description. The text highlights key selling points, such as 'mood-coded fragrance capsules' (different scents intended to create feelings like 'Energize' or 'Calm'), 'clinically inspired neuro-aroma blends' (a marketing term suggesting scents are designed to affect mood), a 'gentle, skin-first formula,' and 'sustainable design' using recycled packaging.
Scientific Validity
  • ✅ Enhances methodological transparency: Providing a concrete example of the stimulus material is a significant strength. It allows the reader to understand the nature and complexity of the information provided to the LLMs, which is crucial for interpreting the results and assessing the task's ecological validity. This moves the methodology from an abstract description to a tangible illustration.
  • ✅ Plausible and representative example: The surrogate concept is well-designed to be representative of modern personal care marketing. It includes a realistic combination of product imagery, branding, and textual features (e.g., emotional benefits, scientific-sounding terms like 'neuro-aroma', sustainability claims). This makes it a suitable and relevant test case for the study's domain.
  • 💡 Clarification on stimulus variability would be beneficial: The caption states the concept is 'similar' to those used. While the reference text mentions some variation (image+text vs. text only), it would strengthen the methods section to briefly describe the general range of variability across the 57 actual stimuli. For example, were the description lengths and number of features generally consistent? This would help readers understand the robustness of the findings across different concept styles.
Communication
  • ✅ Highly effective and clear: The figure is exceptionally clear and serves its purpose perfectly. It immediately illustrates what the authors mean by 'product concept' and 'image stimulus,' making a key component of the experimental design easy to understand for any reader.
  • ✅ Self-contained and illustrative: Combined with its straightforward caption, the figure is entirely self-contained. It effectively communicates the nature of the experimental input without requiring the reader to refer to lengthy descriptions in the methods section.
  • ✅ Professional and clean design: The visual design of the surrogate concept is professional and aesthetically pleasing, which adds to the credibility of the experimental setup by showing that the stimuli were of high quality.

Results

Non-Text Elements

Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM =...
Full Caption

Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM = 0.5.

Figure/Table Image (Page 4)
First Reference in Text
Both LLMs yielded a correlation attainment of about p = 80% (cf. Fig. 2A.i and Fig. 6A.i).
Description
  • Correlation between real and synthetic purchase intent scores: This scatter plot compares the average purchase intent scores from real human surveys (horizontal axis) with those from a simulated survey using the GPT-4o AI model (vertical axis). Each dot represents a single product concept. The AI was prompted using the 'Direct Likert Rating' (DLR) method, where it provides a single numerical score. The plot shows a positive trend: as the average score from humans increases, the average score from the AI also tends to increase. Key statistics are reported directly on the plot: a 'correlation attainment' (ρ) of 81.7%, a Pearson correlation (R) of 0.66, and a p-value of less than 10^-7. Correlation attainment is a custom metric used by the authors to compare the AI's performance against the theoretical maximum agreement between two groups of humans.
  • Response range compression: A key visual feature of the plot is the difference in the range of scores. The real human scores on the x-axis span from approximately 3.25 to 4.50, while the synthetic AI scores on the y-axis are compressed into a much narrower range, from about 2.6 to 3.2. This indicates that while the AI follows the general trend, its responses show less variance and are clustered more tightly around a central value compared to human responses.
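
As a concrete sketch of how such a noise-aware benchmark can be computed, the following assumes correlation attainment is the observed synthetic-vs-human correlation divided by a test-retest ceiling estimated from split halves of the human panel; the paper's exact estimator may differ.

```python
# Sketch of a correlation-attainment computation. Assumes the metric is the
# synthetic-vs-human correlation divided by a human reliability ceiling
# estimated from two random halves of the human panel (an assumption; the
# paper's exact estimator may differ).
import numpy as np

def correlation_attainment(human_a, human_b, synthetic, human_full) -> float:
    r_max = np.corrcoef(human_a, human_b)[0, 1]       # test-retest ceiling
    r_obs = np.corrcoef(synthetic, human_full)[0, 1]  # model-vs-human correlation
    return r_obs / r_max

# Toy data: 57 concepts with a latent purchase-intent signal plus noise.
rng = np.random.default_rng(1)
truth  = rng.normal(3.8, 0.30, size=57)        # latent per-concept mean PI
half_a = truth + rng.normal(0, 0.15, size=57)  # human panel, half 1
half_b = truth + rng.normal(0, 0.15, size=57)  # human panel, half 2
synth  = truth + rng.normal(0, 0.25, size=57)  # noisier synthetic panel
print(correlation_attainment(half_a, half_b, synth, (half_a + half_b) / 2))
```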
Scientific Validity
  • ✅ Strong quantitative support for the textual claim: The plot provides direct evidence for the claim made in the reference text. The text states a correlation attainment of 'about p = 80%', and the figure explicitly reports ρ = 81.7%, strongly supporting the manuscript's finding for the DLR method.
  • ✅ Appropriate visualization choice: A scatter plot is the standard and most appropriate method for visualizing the relationship and correlation between two continuous variables, in this case, the mean scores from two different populations (human and synthetic).
  • ✅ Inclusion of relevant statistics: Displaying the correlation attainment (ρ), Pearson's R, and the p-value directly on the plot is a methodologically sound practice. It provides a comprehensive statistical summary that allows for immediate interpretation of the relationship's strength and significance.
  • 💡 Visual data reveals important limitations: The plot clearly visualizes the compression of the synthetic responses (y-axis) compared to the human responses (x-axis). This is a crucial finding that highlights a key limitation of the DLR method—that it produces distributions that are 'overly narrow,' as mentioned later in the text. This visual evidence of a methodological artifact is a strength of the data presentation.
Communication
  • ✅ Effective integration of statistics: Placing the key statistical results directly within the plot area is an efficient way to communicate the main takeaway, making the panel largely self-contained and easy to interpret.
  • 💡 Missing axis labels: The horizontal and vertical axes lack explicit titles (e.g., 'Real Mean Purchase Intent', 'Synthetic Mean Purchase Intent'). While the panel title and overall figure caption provide context, adding explicit axis labels is a standard best practice that would improve clarity and make the plot fully self-explanatory.
  • 💡 Minor notational inconsistency: The reference text uses 'p' to denote correlation attainment, whereas the figure uses the Greek letter 'ρ'. While this is a minor point, ensuring consistent notation throughout the manuscript would prevent any potential reader confusion.
Figure 3: Comparison of purchase intent distribution similarity between real...
Full Caption

Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs) and semantic similarity ratings (SSRs).

Figure/Table Image (Page 6)
First Reference in Text
With Kxy = 0.88 for GPT-4o (see Fig. 3) and Kxy = 0.8 for Gem-2f (see Fig. 7).
Description
  • Comparison of three AI response methods: This graph compares how well three different methods for generating survey responses from an AI (GPT-4o) match the response patterns of real humans. The horizontal axis shows the 'Kolmogorov-Smirnov (KS) similarity' score, a measure where 1.0 indicates a perfect match between the AI and human response distributions, and 0.0 indicates no match. The height of each curve (vertical axis) represents how frequently a particular similarity score was observed across 57 different surveys.
  • Performance of each method: The three colored distributions clearly show the performance of each method. The pink distribution, for 'Direct Likert Rating' (DLR), is clustered on the far left, with a low average similarity score (Kxy) of 0.26, indicating it performed poorly. The light blue distribution, for 'Follow-up Likert Rating' (FLR), is in the middle, showing a significant improvement with an average score of 0.72. The green distribution, for the proposed 'Semantic Similarity Rating' (SSR) method, is concentrated on the far right, achieving a very high average similarity of 0.88, demonstrating it was the most effective at replicating human survey data.
Scientific Validity
  • ✅ Strong visual and quantitative evidence: The figure provides compelling evidence supporting the paper's central claim. The clear and substantial separation of the three distributions, with SSR's mean Kxy of 0.88 far exceeding that of FLR (0.72) and DLR (0.26), strongly validates the superiority of the SSR method. The large effect size makes the finding highly convincing.
  • ✅ Appropriate choice of metric and visualization: The use of Kolmogorov-Smirnov similarity is methodologically sound for comparing the ordinal distributions of Likert scale data. Visualizing the distribution of these similarity scores, rather than just presenting the mean, is a strength as it shows the consistency of each method's performance across the 57 surveys.
  • 💡 Suggestion for formal statistical testing: While the visual difference between the methods is stark, the argument could be further strengthened by reporting the results of formal statistical tests (e.g., a Kruskal-Wallis test with post-hoc comparisons) to confirm that the distributions of Kxy scores are significantly different from one another. This would add a layer of statistical rigor to the visual interpretation.
Communication
  • ✅ Highly effective visualization choice: The use of overlaid probability density plots is an excellent choice for this data. It allows for an immediate, intuitive comparison of the performance of the three methods, effectively communicating the main finding that SSR is superior.
  • ✅ Informative legend: The legend is well-executed. By including the mean Kxy value for each method directly alongside its name and color, it reinforces the key takeaway and makes the figure highly self-contained.
  • 💡 Improve axis label clarity: The x-axis label 'response distribution KS similarity' is technically accurate but could be made more immediately understandable for a broader audience. Suggest relabeling to something like 'Similarity to Human Response Distribution (KS sim.)'. The y-axis 'pdf (arbitrary units)' is standard but could also be clarified, for instance, as 'Density' or 'Frequency of Occurrence'.
  • 💡 Consider accessibility in design: While the colors provide good visual separation, it is best practice to ensure they are distinguishable for readers with color vision deficiency. Augmenting the color coding with distinct line styles (e.g., solid, dashed, dotted) would guarantee the figure's accessibility.
Figure 4: Mean purchase intent stratified by five demographic and product...
Full Caption

Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f).

Figure/Table Image (Page 6)
First Reference in Text
To this end, we measure mean purchase intent across all products, stratified by demographics and product features and present the results in Fig. 4.
Description
  • Purchase intent varies with age: This line graph shows how the average purchase intent (PI), a score of how likely someone is to buy a product, changes across different age groups. The graph plots data for real human participants and two AI models, GPT-4o and Gem-2f. The real human data (black line) shows a distinct curve: purchase intent is lower for younger (age 20-30) and older (age 70+) participants, and peaks for middle-aged participants (around 40-50), reaching a mean PI of about 4.1. The GPT-4o model (orange line) successfully replicates this hump-shaped pattern. The Gem-2f model (blue line) captures the initial increase in PI for younger groups but fails to show the decrease for older participants.
Scientific Validity
  • ✅ Effective test of demographic replication: This panel provides a strong test of the LLMs' ability to replicate nuanced demographic trends. The successful mirroring of the non-linear, concave relationship by GPT-4o is a significant finding that supports the validity of persona-based conditioning.
  • ✅ Inclusion of error bars: The presence of error bars (representing standard errors) is a methodologically sound practice, as it provides an indication of the uncertainty around the mean purchase intent for each group.
  • 💡 Highlights model-specific limitations: The divergence of the Gem-2f model from the human trend for older age cohorts is an important result, suggesting that the ability to replicate demographic patterns is not uniform across all LLMs and may represent a limitation or specific bias in Gem-2f's training data.
Communication
  • ✅ Appropriate graph type: A line plot is an effective choice for visualizing trends across an ordered variable like age, making the concave pattern immediately apparent.
  • ✅ Clear legend: The legend clearly distinguishes between the real data and the two synthetic models, making the comparison straightforward.
  • 💡 Inconsistent y-axis scales: Each panel in Figure 4 uses a different y-axis scale. While this maximizes the visual space for each plot, it makes direct visual comparison of the magnitude of effects across different features (e.g., age vs. income) more difficult. Using a consistent y-axis scale across all panels would aid in this comparison.
Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM =...
Full Caption

Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM = 0.5.

Figure/Table Image (Page 13)
First Reference in Text
Both LLMs yielded a correlation attainment of about p = 80% (cf. Fig. 2A.i and Fig. 6A.i).
Description
  • Correlation between real and synthetic purchase intent scores (Gem-2f model): This scatter plot compares the average purchase intent scores from real human surveys (horizontal axis) with those from a simulated survey using the Gem-2f AI model (vertical axis). Each pink square represents a single product concept being evaluated. The AI was prompted using the 'Direct Likert Rating' (DLR) method, where it must provide a single number score. The plot shows a positive relationship, with key statistics reported directly on the graph: a 'correlation attainment' (ρ) of 80.2%, a Pearson correlation (R) of 0.64, and a p-value of less than 10^-7. Correlation attainment is a custom metric the authors use to benchmark the AI's performance against the theoretical maximum agreement that could be expected between two separate groups of human surveyors.
  • Distribution of responses: The plot shows that while the human scores (x-axis) range from approximately 3.25 to 4.50, the synthetic scores from the Gem-2f model (y-axis) cover a wider range than the GPT-4o model in Figure 2, from about 2.0 to 4.5. This indicates that the Gem-2f model's responses, while correlated with human responses, have a different pattern of variance.
Scientific Validity
  • ✅ Direct validation of the textual claim: The figure provides strong quantitative support for the claim in the reference text. The text states a correlation attainment of 'about p = 80%', and the plot explicitly shows ρ = 80.2% for the Gem-2f model using the DLR method, confirming the finding.
  • ✅ Appropriate visualization for the research question: Using a scatter plot is the standard and most effective method to visualize and assess the correlation between two continuous variables, in this case, the mean ratings from human and synthetic respondents.
  • ✅ Inclusion of key statistical metrics: The practice of embedding the correlation attainment (ρ), Pearson's R, and the p-value directly onto the plot is commendable. It provides a complete and immediate statistical summary, allowing for a robust interpretation of the data's significance and the strength of the relationship.
  • ✅ Reveals important model-specific behavior: By presenting the results for Gem-2f separately from GPT-4o (in Fig. 2), the study effectively demonstrates that different LLMs exhibit distinct behaviors even under the same conditions. The wider response variance of Gem-2f compared to GPT-4o's DLR is a scientifically valuable observation about model-specific artifacts.
Communication
  • ✅ Clear and uncluttered presentation: The plot is clean, with a high data-ink ratio. The statistical information is presented concisely without overwhelming the visual representation of the data points.
  • 💡 Add explicit axis labels: The plot is missing explicit labels for the x and y axes. To improve clarity and make the figure fully self-contained, labels such as 'Real Mean Purchase Intent' and 'Synthetic Mean Purchase Intent (Gem-2f)' should be added.
  • 💡 Ensure consistent notation: The reference text uses 'p' to denote correlation attainment, while the figure uses the Greek letter 'ρ'. Adopting a single, consistent symbol for this custom metric throughout the manuscript would prevent any potential confusion for the reader.
Figure 7: Comparison of purchase intent distribution similarity between real...
Full Caption

Figure 7: Comparison of purchase intent distribution similarity between real and synthetic surveys based on Gem-2f with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), semantic similarity ratings (SSRs), and best-set SSRs for an experiment where synthetic consumers were prompted without demographic markers.

Figure/Table Image (Page 13)
First Reference in Text
With Kxy = 0.88 for GPT-4o (see Fig. 3) and Kxy = 0.8 for Gem-2f (see Fig. 7).
Description
  • Comparison of four AI response generation methods using the Gem-2f model: This figure presents histograms comparing how well four different methods for generating AI responses match real human survey data. The horizontal axis represents the 'Kolmogorov-Smirnov (KS) similarity' score, which measures how closely the AI's response pattern matches the humans'; a score of 1.0 is a perfect match. The vertical axis shows the frequency of each similarity score across 57 surveys. The figure compares four conditions: Direct Likert Rating (DLR), Follow-up Likert Rating (FLR), Semantic Similarity Rating (SSR) with demographic prompts, and SSR without demographic prompts.
  • Performance ranking of the methods: The plot shows a clear performance hierarchy. The DLR method (pink) performs poorly, with an average similarity (Kxy) of 0.39. The FLR method (light teal) is better, with Kxy = 0.59. The standard SSR method (dark teal) is substantially better still, achieving Kxy = 0.80, which aligns with the value cited in the text. Surprisingly, the best performance comes from the SSR method when demographic information is removed from the AI's prompt (blue), which achieves a very high average similarity of Kxy = 0.91.
Scientific Validity
  • ✅ Strong comparative evidence: The figure provides compelling evidence for the relative performance of the different elicitation methods with the Gem-2f model. The clear separation of the distributions validates the paper's central thesis about the superiority of SSR and provides robust data for model comparison.
  • ✅ Inclusion of a critical ablation study: The 'SSR (w/o dem.)' condition is a methodologically strong inclusion. It serves as an ablation study that isolates the effect of demographic prompting. The resulting finding—that removing demographics improves distributional similarity for Gem-2f—is a significant and non-obvious result that deepens the paper's contribution.
  • 💡 Suggestion for statistical testing: While the visual differences are striking, the scientific argument would be strengthened by including formal statistical tests to confirm that the Kxy distributions are significantly different from one another (e.g., using a non-parametric test like Kruskal-Wallis followed by post-hoc tests). This would formally quantify the observed differences.
Communication
  • ✅ Effective visualization for comparison: Using overlaid histograms is an excellent choice for this data, as it allows for a direct and intuitive comparison of the performance distributions of the four methods. The main takeaway is immediately apparent.
  • ✅ Informative and self-contained legend: The legend is highly effective because it includes the mean Kxy score for each condition directly next to its label. This provides the key summary statistic without requiring the reader to search the text, making the figure largely self-contained.
  • 💡 Color choice could be improved: The use of two similar shades of teal for 'FLR' and 'SSR' slightly reduces the visual distinction between these two important conditions. Suggest using a more distinct color palette to improve readability, particularly for readers with color vision deficiency.
  • 💡 Overly long and complex caption: The caption is very detailed, listing every condition shown in the plot. While accurate, it is long and somewhat unwieldy. Suggest simplifying the caption to focus on the main message, for instance: 'Distributional similarity for four response elicitation methods using Gem-2f. The SSR method, particularly without demographic prompts, demonstrates the highest similarity to human data.'
Figure 8: Mean purchase intent stratified by respondents' gender and dwelling...
Full Caption

Figure 8: Mean purchase intent stratified by respondents' gender and dwelling region (shown are results from the SSR method for both GPT-4o and Gem-2f).

Figure/Table Image (Page 15)
First Reference in Text
SCs replicated the response behavior less well for gender and dwelling region (see Fig. 8).
Description
  • Comparison of purchase intent across demographic groups: This figure contains two line plots (panels A and B) that compare the average purchase intent (PI) scores between real human survey participants and two AI models (GPT-4o and Gem-2f), broken down by demographic categories. Purchase intent is a score indicating how likely a person is to buy a product. Panel A shows the comparison by gender (Female vs. Male). Panel B shows the comparison by four U.S. dwelling regions (Mid West, North East, South, West). The plots are designed to test if the AI models can replicate the subtle differences in purchasing preferences seen across these human subgroups.
  • AI models struggle to replicate human demographic trends: The plots visually support the text's claim that the AI models ('SCs' or synthetic consumers) did not replicate human behavior well for these categories. In Panel A (Gender), real data shows males have slightly higher PI than females, but the Gem-2f model reverses this trend, and GPT-4o exaggerates it. In Panel B (Dwelling Region), the patterns are even more divergent; for example, both AI models predict the lowest PI for the 'South', whereas in the real data, the 'South' has the second-highest PI.
Scientific Validity
  • ✅ Strong evidence for the stated claim: The figure provides clear, direct visual evidence to support the reference text's claim that the synthetic consumers replicated response behavior 'less well' for gender and dwelling region. The visible discrepancies between the lines for real and synthetic data are compelling.
  • ✅ Inclusion of multiple models strengthens the conclusion: By showing results for both GPT-4o and Gem-2f, the figure demonstrates that the failure to replicate these specific demographic trends is not an idiosyncrasy of a single model, which strengthens the overall conclusion about the current limitations of this technique.
  • ✅ Appropriate inclusion of uncertainty: The use of error bars (representing standard errors) is a methodologically sound practice that provides a visual guide to the statistical uncertainty of the mean PI for each group. This helps in judging the significance of the observed differences.
  • ✅ Nuanced interpretation of a weak signal: The text correctly notes that the influence of these features on PI is not strong in the human data. The plots confirm this, as the vertical differences between points are small. This is a crucial piece of context: the models are failing to replicate a weak signal, which is a different and less severe limitation than failing to replicate a strong, clear trend. The analysis is therefore nuanced and well-supported by the visualization.
Communication
  • ✅ Clear and direct comparison: The use of overlaid line plots with a clear legend allows for an easy and direct comparison between the human data and the two AI models within each panel.
  • 💡 Inappropriate graph type for categorical data: While line plots are used, a bar chart would be more appropriate for visualizing this data. Both gender and dwelling region are discrete, unordered categorical variables. The lines connecting the points on the graph incorrectly imply a continuous or ordered relationship between the categories (e.g., a trend from 'Mid West' to 'West'), which does not exist. Suggest replacing the line plots with grouped bar charts for a more conventional and accurate representation.
  • 💡 Inconsistent y-axis scales: The two panels use different y-axis ranges (Panel A: 3.2-4.04; Panel B: 3.44-4.1). While this maximizes the visual detail within each plot, it hinders the direct visual comparison of the magnitude of PI differences between the gender and region categories. Using a consistent y-axis scale for both panels would facilitate a more accurate cross-panel comparison.
Figure 9: Survey histograms for direct Likert ratings at T_LLM = 0.5 for GPT-4o.
Figure/Table Image (Page 15)