LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, Thomas V. Wiecki
arXiv: 2510.08338v1
PyMC Labs, Tallinn, Estonia

Overall Summary

Study Background and Main Findings

This paper addresses a critical challenge in using Large Language Models (LLMs) for consumer research: their tendency to produce unrealistic response distributions when asked for direct numerical ratings. Traditional consumer surveys are costly and prone to biases, and while LLMs offer a scalable alternative by simulating 'synthetic consumers,' this fundamental flaw has limited their utility. The authors' primary objective was to develop and validate a new method that could elicit more human-like survey responses from LLMs.

The core of the study is the introduction of a novel two-step method called Semantic Similarity Rating (SSR). First, an LLM, prompted to impersonate a consumer with specific demographic traits, generates a free-text response expressing its purchase intent for a given product concept. Second, instead of asking the model to convert this text to a single number, the system transforms the text into a high-dimensional vector (an embedding) that captures its meaning. This vector is then mathematically compared to the embeddings of pre-defined 'anchor statements' for each point on a 5-point Likert scale. This process yields a full probability distribution across the scale, reflecting the nuance and ambiguity of the original text.
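To make the SSR mapping concrete, here is a minimal Python sketch. The toy 3-dimensional vectors stand in for real embedding-model output, and the softmax-over-cosine-similarity normalization is an assumption of this sketch rather than the paper's exact formula:

```python
import numpy as np

def ssr_distribution(response_emb, anchor_embs, temperature=1.0):
    """Map a free-text response embedding to a probability distribution
    over Likert points via cosine similarity to anchor-statement
    embeddings (sketch; the paper's exact normalization may differ)."""
    response_emb = np.asarray(response_emb, dtype=float)
    anchor_embs = np.asarray(anchor_embs, dtype=float)
    # Cosine similarity between the response and each anchor statement
    sims = anchor_embs @ response_emb / (
        np.linalg.norm(anchor_embs, axis=1) * np.linalg.norm(response_emb)
    )
    # Softmax turns the similarities into a probability distribution
    logits = sims / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

# Toy 3-d "embeddings" for the five anchors, from
# "definitely would not buy" (1) to "definitely would buy" (5)
anchors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.6, 0.8],
    [0.0, 0.0, 1.0],
])
# A mildly positive response ("I'm somewhat interested...")
response = [0.0, 0.5, 0.9]
dist = ssr_distribution(response, anchors, temperature=0.2)
print(dist.round(3))  # most mass on points 4 and 5, but not a single number
```

The key property is that the output is a full distribution over the 1-5 scale, so an ambivalent response spreads mass across adjacent points instead of collapsing to a single integer.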

The authors rigorously tested the SSR method against simpler baselines using an extensive dataset of 57 real-world consumer surveys on personal care products, involving 9,300 human participants. The results were compelling: the SSR method successfully replicated the relative ranking of products with over 90% of human test-retest reliability. Critically, it also generated response distributions that were highly similar to human patterns, achieving a Kolmogorov-Smirnov (KS) similarity score of over 0.85, whereas the direct rating baseline failed dramatically on this metric (KS similarity ≈ 0.26). The study also demonstrated that providing demographic context to the LLMs was essential for achieving this high level of performance.
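On a natural reading, the KS similarity reported above is one minus the Kolmogorov-Smirnov statistic (the maximum gap between the two cumulative response distributions); treating it that way, a brief sketch with made-up Likert distributions:

```python
import numpy as np

def ks_similarity(p, q):
    """1 minus the maximum absolute gap between the two response CDFs,
    so 1.0 means the distributions over the 5-point scale coincide."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - np.abs(np.cumsum(p) - np.cumsum(q)).max()

human  = [0.05, 0.10, 0.25, 0.40, 0.20]  # share of 1..5 ratings
narrow = [0.00, 0.10, 0.80, 0.10, 0.00]  # overly narrow, DLR-style output
close  = [0.06, 0.12, 0.22, 0.38, 0.22]  # human-like, SSR-style output
print(ks_similarity(human, narrow))  # ~0.50
print(ks_similarity(human, close))   # ~0.97
```

A distribution piled onto the scale midpoint scores poorly even when its mean is close to the human mean, which is exactly the failure mode this metric exposes in the direct-rating baseline.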

The paper concludes that the shortcomings of LLMs in survey simulation are not inherent limitations of the models but rather artifacts of flawed elicitation techniques. The SSR framework offers a robust, training-free solution that preserves the quantitative metrics of traditional surveys while adding rich, interpretable qualitative feedback. This work presents a significant methodological advancement with the potential to make early-stage product research faster, more affordable, and more insightful.

Research Impact and Future Directions

This paper presents a robust and well-validated solution to a significant obstacle in the use of LLMs for consumer research. The Semantic Similarity Rating (SSR) method is a clear and substantial improvement over direct numerical prompting, enabling synthetic survey panels to replicate not only the relative ranking of products but also the nuanced distribution of human responses with high fidelity. The study's methodological rigor, particularly its use of a large real-world dataset and the introduction of the noise-aware 'correlation attainment' metric, lends strong credibility to its findings.

The primary conclusion—that the failures of LLMs as survey respondents are artifacts of elicitation methods rather than fundamental model incapacities—is well-supported and represents a key conceptual advance for the field. The practical implications are significant: the SSR framework offers a scalable, cost-effective tool for rapid screening of product concepts, potentially accelerating innovation and democratizing access to consumer insights. The dual output of quantitative ratings and qualitative rationales is a particularly powerful feature, combining the strengths of two traditionally separate research modalities.

However, the study's simulation-based design necessarily limits the scope of its conclusions. While the method successfully reproduces human survey responses, it cannot claim to reproduce actual purchasing behavior in a real-world market. The findings are also contingent on the specific domain (personal care products), where a wealth of relevant discussion likely exists in the LLMs' training data; its applicability to more niche or novel domains remains an open question. Furthermore, the observed differences in performance between LLMs (e.g., the paradoxical effect of removing demographics for Gem-2f) underscore that these methods are not universally 'plug-and-play' and require careful, model-specific validation. Despite these limitations, the paper provides a credible and powerful framework that fundamentally changes the prospects for using LLMs in consumer research.

Critical Analysis and Recommendations

Clear Problem Framing and Quantifiable Claims (written-content)
The abstract immediately establishes credibility by framing a clear problem (unrealistic LLM survey ratings) and presenting its solution with strong, specific performance metrics. Citing '90% of human test–retest reliability' and 'KS similarity > 0.85' provides concrete, impressive evidence of the method's efficacy, elevating the paper's claims beyond qualitative description and demonstrating a rigorous validation process from the outset.
Section: Abstract
Lack of Concrete Examples Hinders Initial Comprehension (written-content)
The introduction describes the core SSR mechanism but leaves the crucial concept of 'predefined anchor statements' abstract. Providing a brief, intuitive example (e.g., 'reference texts for each point, such as `I would definitely buy it` for a score of 5') would have immediately demystified this linchpin of the method, grounding the technical description and significantly enhancing reader comprehension from the start.
Section: Introduction
Systematic Literature Review Establishes a Clear Research Gap (written-content)
The Related Works section methodically reviews prior approaches, such as direct numeric elicitation and demographic conditioning, and identifies their shared, fundamental limitation: the failure to produce realistic response distributions. This structured critique effectively carves out a well-defined and necessary research gap that the proposed SSR method is perfectly positioned to address, strongly justifying the paper's contribution.
Section: Related Works
Introduction of a Noise-Aware Success Metric (written-content)
The 'Correlation Attainment' metric is a significant methodological innovation that demonstrates a sophisticated understanding of survey data. By benchmarking the LLM's performance against a theoretical maximum derived from human test-retest reliability, the authors account for the inherent noise in the ground-truth data. This provides a far more rigorous and realistic assessment of the model's capabilities than a simple correlation score, substantially strengthening the validity of the paper's conclusions.
Section: Methods
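The metric as described might be sketched as follows. The exact ceiling used by the authors is not reproduced here, so the attenuation-style correction (dividing by the square root of test-retest reliability) is an assumption of this sketch, and all data below are simulated:

```python
import numpy as np

def correlation_attainment(human_means, synthetic_means, test_retest_r):
    """Express the synthetic-vs-human correlation as a fraction of the
    ceiling implied by human test-retest reliability (sketch only)."""
    r = np.corrcoef(human_means, synthetic_means)[0, 1]
    ceiling = np.sqrt(test_retest_r)  # assumed attenuation-style ceiling
    return r / ceiling

# Simulated example: 57 product concepts with a latent "true" appeal
rng = np.random.default_rng(0)
true_appeal = rng.normal(3.8, 0.3, size=57)
human = true_appeal + rng.normal(0.0, 0.15, size=57)      # survey wave 1
retest = true_appeal + rng.normal(0.0, 0.15, size=57)     # survey wave 2
synthetic = true_appeal + rng.normal(0.0, 0.20, size=57)  # synthetic panel

test_retest_r = np.corrcoef(human, retest)[0, 1]
rho = correlation_attainment(human, synthetic, test_retest_r)
print(f"correlation attainment: {rho:.0%}")
```

The point of the normalization is that even a second human panel would not correlate perfectly with the first, so raw correlation understates how well a synthetic panel is doing.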
Visual Evidence Powerfully Demonstrates Methodological Superiority (graphical-figure)
Figure 3 provides compelling visual evidence for the paper's central claim by plotting the distribution of similarity scores for all three tested methods. The clear and substantial separation of the distributions, with SSR (mean KS similarity = 0.88) far outperforming the intermediate FLR (0.72) and the baseline DLR (0.26), offers an immediate and intuitive confirmation of the SSR method's superior ability to replicate human response patterns.
Section: Results
Ablation Study Reveals Critical Role of Demographics (written-content)
A key experiment demonstrated that removing demographic prompts from the synthetic consumers caused correlation attainment to collapse from 92% to 50%. This rigorous ablation study provides powerful evidence that persona conditioning is not merely an enhancement but is essential for generating a meaningful, product-differentiating signal, proving that the LLMs are effectively leveraging this contextual information.
Section: Results
Reframing the Problem is a Key Conceptual Contribution (written-content)
The discussion elevates the paper's contribution beyond a simple technical fix by powerfully reframing the discourse. The authors argue that failures of LLMs in surveys are not intrinsic model limitations but artifacts of poor elicitation methods. This conceptual shift is a significant and impactful takeaway that changes how researchers should approach the problem of synthetic data generation.
Section: Discussion and Conclusion

Section Analysis

Methods

Non-Text Elements

Figure 1: Different response generation procedures and SSR response-likelihood...
Full Caption

Figure 1: Different response generation procedures and SSR response-likelihood mapping.

Figure/Table Image (Page 2)
Figure 1: Different response generation procedures and SSR response-likelihood mapping.
First Reference in Text
We evaluated three response strategies (see Fig. 1A):
Description
  • Flowchart of Three Methods for Eliciting Ratings from LLMs: This diagram illustrates three different methods for getting a rating from a 'synthetic consumer,' which is a Large Language Model (LLM) instructed to act like a person in a survey. The process begins by giving the LLM a role ('impersonate a consumer') and a product concept to evaluate. The core of the diagram compares three strategies for how the LLM provides its rating: (1) Direct Likert Rating, where the LLM must choose a single number from 1 to 5; (2) Follow-up Likert Rating, where the LLM first gives a written opinion, and then a second LLM instance, acting as an 'expert,' converts that text into a single number from 1 to 5; and (3) Semantic Similarity Rating (SSR), where the LLM's written opinion is converted into a special numerical code called an 'embedding vector.' This vector, which captures the meaning of the text, is then mathematically compared to pre-defined reference statements for each point on the 1-5 scale to produce a full probability distribution, rather than a single number.
Scientific Validity
  • ✅ Clear depiction of experimental conditions: The flowchart provides a clear and logical overview of the three distinct response elicitation strategies being compared. This serves as an excellent methodological summary, allowing readers to grasp the core experimental design at a glance.
  • 💡 Ambiguity of 'Likert expert' role: The term 'Likert expert' in method (2) is slightly ambiguous within the figure itself. While the main text clarifies this is a new instance of the same model with a different prompt, the diagram alone does not make this clear. For full self-containment, a brief note could be added, e.g., 'Same LLM, new prompt,' to specify the nature of this component.
  • ✅ Inclusion of prompt context: Showing the initial 'System prompt' and 'User prompts' is a methodological strength. It provides crucial context about how the synthetic agents were set up, which is essential for the reproducibility and interpretation of the results.
Communication
  • ✅ Effective flowchart structure: The use of a flowchart is highly effective for illustrating the sequential and branching nature of the response generation process. The visual flow from prompts to the three distinct methods is intuitive and easy to follow.
  • ✅ Use of a concrete example: The inclusion of an example 'Synthetic response' ('I'm somewhat interested...') helps ground the abstract processes in a tangible example, making the different pathways easier for the reader to understand.
  • 💡 Visual linkage could be improved: The arrow from 'Elicit brief textual responses' leads into a general space from which methods (2) and (3) branch. Method (1), however, bypasses this step. The diagram could be slightly reorganized to make it more visually explicit that the textual response is an input for methods (2) and (3) only, perhaps by having the arrow split more directly to those two boxes.
Figure 5: A surrogate product concept similar to those used in the 57 concept...
Full Caption

Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.

Figure/Table Image (Page 11)
Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.
First Reference in Text
When we refer to “image stimulus” in the main text, an image like this, including either both an illustration and the concept description or only a concept description was supplied to an LLM synthetic consumer (see App.
Description
  • Example of a product concept stimulus: This figure displays an example of the marketing material, or 'stimulus,' shown to survey participants to evaluate. It features a fictional product called 'AURAFOAM™ Mood-Infused Body Wash.' The image is split into two parts: on the left is a picture of the product bottle, and on the right is a text description. The text highlights key selling points, such as 'mood-coded fragrance capsules' (different scents intended to create feelings like 'Energize' or 'Calm'), 'clinically inspired neuro-aroma blends' (a marketing term suggesting scents are designed to affect mood), a 'gentle, skin-first formula,' and 'sustainable design' using recycled packaging.
Scientific Validity
  • ✅ Enhances methodological transparency: Providing a concrete example of the stimulus material is a significant strength. It allows the reader to understand the nature and complexity of the information provided to the LLMs, which is crucial for interpreting the results and assessing the task's ecological validity. This moves the methodology from an abstract description to a tangible illustration.
  • ✅ Plausible and representative example: The surrogate concept is well-designed to be representative of modern personal care marketing. It includes a realistic combination of product imagery, branding, and textual features (e.g., emotional benefits, scientific-sounding terms like 'neuro-aroma', sustainability claims). This makes it a suitable and relevant test case for the study's domain.
  • 💡 Clarification on stimulus variability would be beneficial: The caption states the concept is 'similar' to those used. While the reference text mentions some variation (image+text vs. text only), it would strengthen the methods section to briefly describe the general range of variability across the 57 actual stimuli. For example, were the description lengths and number of features generally consistent? This would help readers understand the robustness of the findings across different concept styles.
Communication
  • ✅ Highly effective and clear: The figure is exceptionally clear and serves its purpose perfectly. It immediately illustrates what the authors mean by 'product concept' and 'image stimulus,' making a key component of the experimental design easy to understand for any reader.
  • ✅ Self-contained and illustrative: Combined with its straightforward caption, the figure is entirely self-contained. It effectively communicates the nature of the experimental input without requiring the reader to refer to lengthy descriptions in the methods section.
  • ✅ Professional and clean design: The visual design of the surrogate concept is professional and aesthetically pleasing, which adds to the credibility of the experimental setup by showing that the stimuli were of high quality.

Results

Non-Text Elements

Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM =...
Full Caption

Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM = 0.5.

Figure/Table Image (Page 4)
Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM = 0.5.
First Reference in Text
Both LLMs yielded a correlation attainment of about p = 80% (cf. Fig. 2A.i and Fig. 6A.i).
Description
  • Correlation between real and synthetic purchase intent scores: This scatter plot compares the average purchase intent scores from real human surveys (horizontal axis) with those from a simulated survey using the GPT-4o AI model (vertical axis). Each dot represents a single product concept. The AI was prompted using the 'Direct Likert Rating' (DLR) method, where it provides a single numerical score. The plot shows a positive trend: as the average score from humans increases, the average score from the AI also tends to increase. Key statistics are reported directly on the plot: a 'correlation attainment' (ρ) of 81.7%, a Pearson correlation (R) of 0.66, and a p-value of less than 10^-7. Correlation attainment is a custom metric used by the authors to compare the AI's performance against the theoretical maximum agreement between two groups of humans.
  • Response range compression: A key visual feature of the plot is the difference in the range of scores. The real human scores on the x-axis span from approximately 3.25 to 4.50, while the synthetic AI scores on the y-axis are compressed into a much narrower range, from about 2.6 to 3.2. This indicates that while the AI follows the general trend, its responses show less variance and are clustered more tightly around a central value compared to human responses.
Scientific Validity
  • ✅ Strong quantitative support for the textual claim: The plot provides direct evidence for the claim made in the reference text. The text states a correlation attainment of 'about p = 80%', and the figure explicitly reports ρ = 81.7%, strongly supporting the manuscript's finding for the DLR method.
  • ✅ Appropriate visualization choice: A scatter plot is the standard and most appropriate method for visualizing the relationship and correlation between two continuous variables, in this case, the mean scores from two different populations (human and synthetic).
  • ✅ Inclusion of relevant statistics: Displaying the correlation attainment (ρ), Pearson's R, and the p-value directly on the plot is a methodologically sound practice. It provides a comprehensive statistical summary that allows for immediate interpretation of the relationship's strength and significance.
  • 💡 Visual data reveals important limitations: The plot clearly visualizes the compression of the synthetic responses (y-axis) compared to the human responses (x-axis). This is a crucial finding that highlights a key limitation of the DLR method—that it produces distributions that are 'overly narrow,' as mentioned later in the text. This visual evidence of a methodological artifact is a strength of the data presentation.
Communication
  • ✅ Effective integration of statistics: Placing the key statistical results directly within the plot area is an efficient way to communicate the main takeaway, making the panel largely self-contained and easy to interpret.
  • 💡 Missing axis labels: The horizontal and vertical axes lack explicit titles (e.g., 'Real Mean Purchase Intent', 'Synthetic Mean Purchase Intent'). While the panel title and overall figure caption provide context, adding explicit axis labels is a standard best practice that would improve clarity and make the plot fully self-explanatory.
  • 💡 Minor notational inconsistency: The reference text uses 'p' to denote correlation attainment, whereas the figure uses the Greek letter 'ρ'. While this is a minor point, ensuring consistent notation throughout the manuscript would prevent any potential reader confusion.
Figure 3: Comparison of purchase intent distribution similarity between real...
Full Caption

Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs) and semantic similarity ratings (SSRs).

Figure/Table Image (Page 6)
Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs) and semantic similarity ratings (SSRs).
First Reference in Text
With Kxy = 0.88 for GPT-4o (see Fig. 3) and Kxy = 0.8 for Gem-2f (see Fig. 7).
Description
  • Comparison of three AI response methods: This graph compares how well three different methods for generating survey responses from an AI (GPT-4o) match the response patterns of real humans. The horizontal axis shows the 'Kolmogorov-Smirnov (KS) similarity' score, a measure where 1.0 indicates a perfect match between the AI and human response distributions, and 0.0 indicates no match. The height of each curve (vertical axis) represents how frequently a particular similarity score was observed across 57 different surveys.
  • Performance of each method: The three colored distributions clearly show the performance of each method. The pink distribution, for 'Direct Likert Rating' (DLR), is clustered on the far left, with a low average similarity score (Kxy) of 0.26, indicating it performed poorly. The light blue distribution, for 'Follow-up Likert Rating' (FLR), is in the middle, showing a significant improvement with an average score of 0.72. The green distribution, for the proposed 'Semantic Similarity Rating' (SSR) method, is concentrated on the far right, achieving a very high average similarity of 0.88, demonstrating it was the most effective at replicating human survey data.
Scientific Validity
  • ✅ Strong visual and quantitative evidence: The figure provides compelling evidence supporting the paper's central claim. The clear and substantial separation of the three distributions, with SSR's mean Kxy of 0.88 far exceeding that of FLR (0.72) and DLR (0.26), strongly validates the superiority of the SSR method. The large effect size makes the finding highly convincing.
  • ✅ Appropriate choice of metric and visualization: The use of Kolmogorov-Smirnov similarity is methodologically sound for comparing the ordinal distributions of Likert scale data. Visualizing the distribution of these similarity scores, rather than just presenting the mean, is a strength as it shows the consistency of each method's performance across the 57 surveys.
  • 💡 Suggestion for formal statistical testing: While the visual difference between the methods is stark, the argument could be further strengthened by reporting the results of formal statistical tests (e.g., a Kruskal-Wallis test with post-hoc comparisons) to confirm that the distributions of Kxy scores are significantly different from one another. This would add a layer of statistical rigor to the visual interpretation.
Communication
  • ✅ Highly effective visualization choice: The use of overlaid probability density plots is an excellent choice for this data. It allows for an immediate, intuitive comparison of the performance of the three methods, effectively communicating the main finding that SSR is superior.
  • ✅ Informative legend: The legend is well-executed. By including the mean Kxy value for each method directly alongside its name and color, it reinforces the key takeaway and makes the figure highly self-contained.
  • 💡 Improve axis label clarity: The x-axis label 'response distribution KS similarity' is technically accurate but could be made more immediately understandable for a broader audience. Suggest relabeling to something like 'Similarity to Human Response Distribution (KS sim.)'. The y-axis 'pdf (arbitrary units)' is standard but could also be clarified, for instance, as 'Density' or 'Frequency of Occurrence'.
  • 💡 Consider accessibility in design: While the colors provide good visual separation, it is best practice to ensure they are distinguishable for readers with color vision deficiency. Augmenting the color coding with distinct line styles (e.g., solid, dashed, dotted) would guarantee the figure's accessibility.
Figure 4: Mean purchase intent stratified by five demographic and product...
Full Caption

Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f).

Figure/Table Image (Page 6)
Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f).
First Reference in Text
To this end, we measure mean purchase intent across all products, stratified by demographics and product features and present the results in Fig. 4.
Description
  • Purchase intent varies with age: This line graph shows how the average purchase intent (PI), a score of how likely someone is to buy a product, changes across different age groups. The graph plots data for real human participants and two AI models, GPT-4o and Gem-2f. The real human data (black line) shows a distinct curve: purchase intent is lower for younger (age 20-30) and older (age 70+) participants, and peaks for middle-aged participants (around 40-50), reaching a mean PI of about 4.1. The GPT-4o model (orange line) successfully replicates this hump-shaped pattern. The Gem-2f model (blue line) captures the initial increase in PI for younger groups but fails to show the decrease for older participants.
Scientific Validity
  • ✅ Effective test of demographic replication: This panel provides a strong test of the LLMs' ability to replicate nuanced demographic trends. The successful mirroring of the non-linear, concave relationship by GPT-4o is a significant finding that supports the validity of persona-based conditioning.
  • ✅ Inclusion of error bars: The presence of error bars (representing standard errors) is a methodologically sound practice, as it provides an indication of the uncertainty around the mean purchase intent for each group.
  • 💡 Highlights model-specific limitations: The divergence of the Gem-2f model from the human trend for older age cohorts is an important result, suggesting that the ability to replicate demographic patterns is not uniform across all LLMs and may represent a limitation or specific bias in Gem-2f's training data.
Communication
  • ✅ Appropriate graph type: A line plot is an effective choice for visualizing trends across an ordered variable like age, making the concave pattern immediately apparent.
  • ✅ Clear legend: The legend clearly distinguishes between the real data and the two synthetic models, making the comparison straightforward.
  • 💡 Inconsistent y-axis scales: Each panel in Figure 4 uses a different y-axis scale. While this maximizes the visual space for each plot, it makes direct visual comparison of the magnitude of effects across different features (e.g., age vs. income) more difficult. Using a consistent y-axis scale across all panels would aid in this comparison.
Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM =...
Full Caption

Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM = 0.5.

Figure/Table Image (Page 13)
Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM = 0.5.
First Reference in Text
Both LLMs yielded a correlation attainment of about p = 80% (cf. Fig. 2A.i and Fig. 6A.i).
Description
  • Correlation between real and synthetic purchase intent scores (Gem-2f model): This scatter plot compares the average purchase intent scores from real human surveys (horizontal axis) with those from a simulated survey using the Gem-2f AI model (vertical axis). Each pink square represents a single product concept being evaluated. The AI was prompted using the 'Direct Likert Rating' (DLR) method, where it must provide a single number score. The plot shows a positive relationship, with key statistics reported directly on the graph: a 'correlation attainment' (ρ) of 80.2%, a Pearson correlation (R) of 0.64, and a p-value of less than 10^-7. Correlation attainment is a custom metric the authors use to benchmark the AI's performance against the theoretical maximum agreement that could be expected between two separate groups of human surveyors.
  • Distribution of responses: The plot shows that while the human scores (x-axis) range from approximately 3.25 to 4.50, the synthetic scores from the Gem-2f model (y-axis) cover a wider range than the GPT-4o model in Figure 2, from about 2.0 to 4.5. This indicates that the Gem-2f model's responses, while correlated with human responses, have a different pattern of variance.
Scientific Validity
  • ✅ Direct validation of the textual claim: The figure provides strong quantitative support for the claim in the reference text. The text states a correlation attainment of 'about p = 80%', and the plot explicitly shows ρ = 80.2% for the Gem-2f model using the DLR method, confirming the finding.
  • ✅ Appropriate visualization for the research question: Using a scatter plot is the standard and most effective method to visualize and assess the correlation between two continuous variables, in this case, the mean ratings from human and synthetic respondents.
  • ✅ Inclusion of key statistical metrics: The practice of embedding the correlation attainment (ρ), Pearson's R, and the p-value directly onto the plot is commendable. It provides a complete and immediate statistical summary, allowing for a robust interpretation of the data's significance and the strength of the relationship.
  • ✅ Reveals important model-specific behavior: By presenting the results for Gem-2f separately from GPT-4o (in Fig. 2), the study effectively demonstrates that different LLMs exhibit distinct behaviors even under the same conditions. The wider response variance of Gem-2f compared to GPT-4o's DLR is a scientifically valuable observation about model-specific artifacts.
Communication
  • ✅ Clear and uncluttered presentation: The plot is clean, with a high data-ink ratio. The statistical information is presented concisely without overwhelming the visual representation of the data points.
  • 💡 Add explicit axis labels: The plot is missing explicit labels for the x and y axes. To improve clarity and make the figure fully self-contained, labels such as 'Real Mean Purchase Intent' and 'Synthetic Mean Purchase Intent (Gem-2f)' should be added.
  • 💡 Ensure consistent notation: The reference text uses 'p' to denote correlation attainment, while the figure uses the Greek letter 'ρ'. Adopting a single, consistent symbol for this custom metric throughout the manuscript would prevent any potential confusion for the reader.
Figure 7: Comparison of purchase intent distribution similarity between real...
Full Caption

Figure 7: Comparison of purchase intent distribution similarity between real and synthetic surveys based on Gem-2f with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), semantic similarity ratings (SSRs), and best-set SSRs for an experiment where synthetic consumers were prompted without demographic markers.

Figure/Table Image (Page 13)
Figure 7: Comparison of purchase intent distribution similarity between real and synthetic surveys based on Gem-2f with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), semantic similarity ratings (SSRs), and best-set SSRs for an experiment where synthetic consumers were prompted without demographic markers.
First Reference in Text
With Kxy = 0.88 for GPT-4o (see Fig. 3) and Kxy = 0.8 for Gem-2f (see Fig. 7).
Description
  • Comparison of four AI response generation methods using the Gem-2f model: This figure presents histograms comparing how well four different methods for generating AI responses match real human survey data. The horizontal axis represents the 'Kolmogorov-Smirnov (KS) similarity' score, which measures how closely the AI's response pattern matches the humans'; a score of 1.0 is a perfect match. The vertical axis shows the frequency of each similarity score across 57 surveys. The figure compares four conditions: Direct Likert Rating (DLR), Follow-up Likert Rating (FLR), Semantic Similarity Rating (SSR) with demographic prompts, and SSR without demographic prompts.
  • Performance ranking of the methods: The plot shows a clear performance hierarchy. The DLR method (pink) performs poorly, with an average similarity (Kxy) of 0.39. The FLR method (light teal) is better, with Kxy = 0.59. The standard SSR method (dark teal) is substantially better still, achieving Kxy = 0.80, which aligns with the value cited in the text. Surprisingly, the best performance comes from the SSR method when demographic information is removed from the AI's prompt (blue), which achieves a very high average similarity of Kxy = 0.91.
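The KS similarity score used throughout these figures can be illustrated with a short sketch. This is an assumption about the metric's construction (similarity defined as 1 minus the KS statistic over the discrete Likert CDFs), not the authors' exact implementation:

```python
import numpy as np

def ks_similarity(human, synthetic, scale=(1, 2, 3, 4, 5)):
    """Return 1 - D, where D is the Kolmogorov-Smirnov statistic:
    the maximum absolute difference between the two empirical CDFs
    evaluated at each point of the discrete Likert scale."""
    human, synthetic = np.asarray(human), np.asarray(synthetic)
    cdf_h = np.array([(human <= s).mean() for s in scale])
    cdf_s = np.array([(synthetic <= s).mean() for s in scale])
    return 1.0 - np.abs(cdf_h - cdf_s).max()

# A human-like sample peaking at 4 vs. a synthetic sample spiking at '3'
human = [4, 5, 4, 3, 4, 5, 2, 4]
synthetic = [3] * 8
print(ks_similarity(human, synthetic))  # → 0.25
```

Identical distributions score 1.0, while a neutral spike against a positively skewed human distribution scores much lower, which matches the gap between the DLR and SSR conditions in the figure.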
Scientific Validity
  • ✅ Strong comparative evidence: The figure provides compelling evidence for the relative performance of the different elicitation methods with the Gem-2f model. The clear separation of the distributions validates the paper's central thesis about the superiority of SSR and provides robust data for model comparison.
  • ✅ Inclusion of a critical ablation study: The 'SSR (w/o dem.)' condition is a methodologically strong inclusion. It serves as an ablation study that isolates the effect of demographic prompting. The resulting finding—that removing demographics improves distributional similarity for Gem-2f—is a significant and non-obvious result that deepens the paper's contribution.
  • 💡 Suggestion for statistical testing: While the visual differences are striking, the scientific argument would be strengthened by including formal statistical tests to confirm that the Kxy distributions are significantly different from one another (e.g., using a non-parametric test like Kruskal-Wallis followed by post-hoc tests). This would formally quantify the observed differences.
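The suggested omnibus test could be run directly on the per-survey Kxy scores. A minimal sketch using simulated scores (centered on the reported means; the real 57 per-survey values are not reproduced here):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Simulated per-survey Kxy scores for each elicitation method;
# in practice these would be the 57 observed values per condition.
dlr = rng.normal(0.39, 0.05, 57)
flr = rng.normal(0.59, 0.05, 57)
ssr = rng.normal(0.80, 0.05, 57)
ssr_no_dem = rng.normal(0.91, 0.03, 57)

stat, p = kruskal(dlr, flr, ssr, ssr_no_dem)
print(f"Kruskal-Wallis H = {stat:.1f}, p = {p:.2e}")
```

A significant omnibus result would then be followed by pairwise post-hoc comparisons (e.g., `scipy.stats.mannwhitneyu` with a multiple-comparison correction) to identify which methods differ.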
Communication
  • ✅ Effective visualization for comparison: Using overlaid histograms is an excellent choice for this data, as it allows for a direct and intuitive comparison of the performance distributions of the four methods. The main takeaway is immediately apparent.
  • ✅ Informative and self-contained legend: The legend is highly effective because it includes the mean Kxy score for each condition directly next to its label. This provides the key summary statistic without requiring the reader to search the text, making the figure largely self-contained.
  • 💡 Color choice could be improved: The use of two similar shades of teal for 'FLR' and 'SSR' slightly reduces the visual distinction between these two important conditions. Suggest using a more distinct color palette to improve readability, particularly for readers with color vision deficiency.
  • 💡 Overly long and complex caption: The caption is very detailed, listing every condition shown in the plot. While accurate, it is long and somewhat unwieldy. Suggest simplifying the caption to focus on the main message, for instance: 'Distributional similarity for four response elicitation methods using Gem-2f. The SSR method, particularly without demographic prompts, demonstrates the highest similarity to human data.'
Figure 8: Mean purchase intent stratified by respondents' gender and dwelling...
Full Caption

Figure 8: Mean purchase intent stratified by respondents' gender and dwelling region (shown are results from the SSR method for both GPT-4o and Gem-2f).

Figure/Table Image (Page 15)
Figure 8: Mean purchase intent stratified by respondents' gender and dwelling region (shown are results from the SSR method for both GPT-4o and Gem-2f).
First Reference in Text
SCs replicated the response behavior less well for gender and dwelling region (see Fig. 8).
Description
  • Comparison of purchase intent across demographic groups: This figure contains two line plots (panels A and B) that compare the average purchase intent (PI) scores between real human survey participants and two AI models (GPT-4o and Gem-2f), broken down by demographic categories. Purchase intent is a score indicating how likely a person is to buy a product. Panel A shows the comparison by gender (Female vs. Male). Panel B shows the comparison by four U.S. dwelling regions (Mid West, North East, South, West). The plots are designed to test if the AI models can replicate the subtle differences in purchasing preferences seen across these human subgroups.
  • AI models struggle to replicate human demographic trends: The plots visually support the text's claim that the AI models ('SCs' or synthetic consumers) did not replicate human behavior well for these categories. In Panel A (Gender), real data shows males have slightly higher PI than females, but the Gem-2f model reverses this trend, and GPT-4o exaggerates it. In Panel B (Dwelling Region), the patterns are even more divergent; for example, both AI models predict the lowest PI for the 'South', whereas in the real data, the 'South' has the second-highest PI.
Scientific Validity
  • ✅ Strong evidence for the stated claim: The figure provides clear, direct visual evidence to support the reference text's claim that the synthetic consumers replicated response behavior 'less well' for gender and dwelling region. The visible discrepancies between the lines for real and synthetic data are compelling.
  • ✅ Inclusion of multiple models strengthens the conclusion: By showing results for both GPT-4o and Gem-2f, the figure demonstrates that the failure to replicate these specific demographic trends is not an idiosyncrasy of a single model, which strengthens the overall conclusion about the current limitations of this technique.
  • ✅ Appropriate inclusion of uncertainty: The use of error bars (representing standard errors) is a methodologically sound practice that provides a visual guide to the statistical uncertainty of the mean PI for each group. This helps in judging the significance of the observed differences.
  • ✅ Nuanced framing of a weak human signal: The text correctly notes that the influence of these features on PI is not strong in the human data. The plots confirm this, as the vertical differences between points are small. This is a crucial piece of context: the models are failing to replicate a weak signal, which is a different and less severe limitation than failing to replicate a strong, clear trend. The analysis is therefore nuanced and well-supported by the visualization.
Communication
  • ✅ Clear and direct comparison: The use of overlaid line plots with a clear legend allows for an easy and direct comparison between the human data and the two AI models within each panel.
  • 💡 Inappropriate graph type for categorical data: While line plots are used, a bar chart would be more appropriate for visualizing this data. Both gender and dwelling region are discrete, unordered categorical variables. The lines connecting the points on the graph incorrectly imply a continuous or ordered relationship between the categories (e.g., a trend from 'Mid West' to 'West'), which does not exist. Suggest replacing the line plots with grouped bar charts for a more conventional and accurate representation.
  • 💡 Inconsistent y-axis scales: The two panels use different y-axis ranges (Panel A: 3.2-4.04; Panel B: 3.44-4.1). While this maximizes the visual detail within each plot, it hinders the direct visual comparison of the magnitude of PI differences between the gender and region categories. Using a consistent y-axis scale for both panels would facilitate a more accurate cross-panel comparison.
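The grouped-bar alternative suggested above could be sketched as follows; all mean-PI values here are hypothetical placeholders, not the paper's data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

regions = ["Mid West", "North East", "South", "West"]
# Placeholder means for illustration; real values would come from Fig. 8
sources = {
    "Real": [3.70, 3.62, 3.68, 3.60],
    "GPT-4o": [3.75, 3.70, 3.55, 3.65],
    "Gem-2f": [3.60, 3.58, 3.50, 3.62],
}

x = np.arange(len(regions))
width = 0.25
fig, ax = plt.subplots()
for i, (label, values) in enumerate(sources.items()):
    # Offset each source's bars so the groups sit side by side per region
    ax.bar(x + (i - 1) * width, values, width, label=label)
ax.set_xticks(x)
ax.set_xticklabels(regions)
ax.set_ylabel("Mean purchase intent")
ax.legend()
fig.savefig("fig8_grouped_bars.png")
```

Because the bars are not connected, this layout avoids implying any ordering or trend across the discrete region categories.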
Figure 9: Survey histograms for direct Likert ratings at TLLM = 0.5 for GPT-4o.
Figure/Table Image (Page 15)
Figure 9: Survey histograms for direct Likert ratings at TLLM = 0.5 for GPT-4o.
First Reference in Text
Upon detailed inspection of the Likert response distributions, models typically replied with response '3', i.e. a "safe" regression to the center of the scale (cf. Figs 9-12).
Description
  • Grid of 57 survey response comparisons: This figure is a large grid containing 57 individual mini-graphs, where each graph represents a separate product survey. Each mini-graph is a histogram, a type of bar chart that shows the distribution of answers. The horizontal axis shows the possible answers on a 5-point Likert scale, a common survey format where 1 means 'definitely not' and 5 means 'definitely yes' to a question like 'would you purchase this product?'. The vertical axis shows the proportion of respondents who gave each answer.
  • Contrasting human and AI response patterns: Within each of the 57 graphs, two different response patterns are overlaid. The solid black line shows the distribution of answers from real human participants. These patterns are diverse; in many surveys, the most common answers are 4 or 5. In contrast, the orange lines show the answers from the GPT-4o AI model. Two orange lines are shown, one for when the AI saw only text ('--Text') and one for when it saw an image and text ('--Image'), though these two lines are almost always identical.
  • AI model exhibits strong 'regression to the center': The most striking feature of the figure is the AI's response behavior. In nearly every one of the 57 surveys, the orange lines form a single, sharp spike at the number 3, the exact middle of the 1-to-5 scale. This indicates that the AI, when forced to give a direct numerical rating, overwhelmingly defaults to a neutral, 'safe' answer, regardless of the product concept. This visually confirms the paper's claim of a 'regression to the center of the scale' for this method.
Scientific Validity
  • ✅ Overwhelming evidence for the central claim: The figure provides exceptionally strong and direct evidence for the claim in the reference text. The consistency of the AI's 'response 3' behavior across all 57 diverse surveys robustly demonstrates that this is a systemic artifact of the direct Likert rating (DLR) method and not a random occurrence.
  • ✅ Granular presentation enhances transparency: Presenting the results for every individual survey, rather than an aggregated average, is a major methodological strength. It allows the reader to verify that the 'regression to the mean' phenomenon is pervasive and not an artifact of averaging out more varied responses. This transparency greatly increases confidence in the finding.
  • 💡 Lack of a quantitative summary: While visually powerful, the figure would be scientifically stronger if complemented by a quantitative summary. For instance, stating the overall percentage of GPT-4o DLR responses that were '3' across all surveys would provide a single, impactful statistic to anchor the visual impression.
Communication
  • ✅ Highly effective use of small multiples: The 'small multiples' layout (a grid of many small, similar graphs) is an excellent visualization choice here. It powerfully communicates the consistency and repetitiveness of the AI's response pattern across a large dataset, a message that would be lost in an aggregated plot.
  • 💡 Redundant information in the legend: The legend distinguishes between 'Text' and 'Image' stimuli for the synthetic data. However, the corresponding orange lines are perfectly superimposed in nearly every panel, indicating no difference in the AI's response. This distinction adds clutter without providing new information. Suggest either removing one line and stating in the caption that stimulus type had no effect, or using a single color/linestyle for both.
  • 💡 High information density impacts readability: The figure presents a large amount of data, and the individual plots are quite small. This makes it difficult to inspect the details of the human (black line) distributions. While the main takeaway about the AI is clear, the trade-off is a loss of detail. This is an acceptable design choice given the figure's primary goal, but it is a limitation.
  • ✅ Clear titles and labeling: The main title at the top of the grid, which specifies the model and parameters ('Direct Likert Rating, Model: gpt-4o, TLLM = 0.5'), and the individual survey ID numbers above each plot provide excellent context and are crucial for interpretation.
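A small-multiples layout of this kind is straightforward to reproduce. The sketch below uses synthetic placeholder distributions (a DLR-style spike at '3' against varied human-like histograms), not the paper's survey data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
scale = np.arange(1, 6)  # 5-point Likert scale

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(8, 5))
for idx, ax in enumerate(axes.flat):
    human = rng.dirichlet(2 * np.ones(5))              # varied human pattern
    spike = np.array([0.02, 0.03, 0.90, 0.03, 0.02])   # spike at '3'
    ax.step(scale, human, where="mid", color="black", label="Real")
    ax.step(scale, spike, where="mid", color="tab:orange", label="Synthetic")
    ax.set_title(f"Survey {idx + 1}", fontsize=8)
axes.flat[0].legend(fontsize=7)
fig.suptitle("Direct Likert Rating (sketch)")
fig.savefig("dlr_small_multiples.png")
```

Shared axes (`sharex`/`sharey`) keep all panels comparable, which is what makes the repetitive spike at '3' so visually striking in the original figure.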
Figure 10: Survey histograms for direct Likert ratings at TLLM = 1.5 for GPT-4o.
Figure/Table Image (Page 16)
Figure 10: Survey histograms for direct Likert ratings at TLLM = 1.5 for GPT-4o.
First Reference in Text
Upon detailed inspection of the Likert response distributions, models typically replied with response '3', i.e. a "safe" regression to the center of the scale (cf. Figs 9-12).
Description
  • Grid of 57 survey response distributions: This figure is a grid of 57 small histograms, each representing one product survey. Each plot compares the distribution of responses from real humans (solid black line) against those from the GPT-4o AI model (orange lines) on a 5-point Likert scale, where 1 is a very negative response and 5 is a very positive one. This experiment was run with a high 'temperature' setting for the AI (TLLM = 1.5), a parameter that typically encourages more varied and random responses.
  • Persistent regression to the mean by the AI model: Despite the high temperature setting intended to increase response diversity, the AI's behavior remains unchanged from the lower temperature experiment shown in Figure 9. In nearly every one of the 57 surveys, the AI's response is a single, sharp spike at the number 3, the neutral midpoint of the scale. This contrasts sharply with the human data, which shows varied distributions often peaking at scores of 4 or 5. This result strongly confirms the paper's claim that the AI model defaults to a 'safe' central response when using this direct rating method.
Scientific Validity
  • ✅ Robustly confirms the central claim: The figure provides powerful evidence supporting the reference text's claim about 'regression to the center'. The fact that this behavior persists even at a high temperature setting (TLLM=1.5) strengthens the conclusion that this is a fundamental artifact of the Direct Likert Rating (DLR) method, rather than a result of a specific hyperparameter choice.
  • ✅ Excellent use of a parameter sweep for validation: By presenting the results for TLLM=1.5, this figure serves as an important validation experiment. It demonstrates that a simple and common approach to increasing model variability (raising the temperature) is ineffective for this problem, thereby justifying the need for the more complex methods (FLR and SSR) proposed by the authors.
  • ✅ Granular presentation enhances transparency: Showing the results for all 57 individual surveys, rather than an aggregate, is a major strength. It demonstrates the pervasiveness of the AI's response pattern and increases confidence that the finding is not an artifact of averaging.
Communication
  • ✅ Effective use of small multiples: The grid layout (small multiples) is highly effective at conveying the main message: the AI's response pattern is monotonously consistent across a large and diverse set of surveys. This would be impossible to show in a single, aggregated graph.
  • ✅ Clear and informative title: The title is excellent because it clearly specifies the key experimental condition that distinguishes this figure from Figure 9—namely, the temperature setting `TLLM = 1.5`. This is crucial for correct interpretation.
  • 💡 Redundant legend information: The legend distinguishes between 'Text' and 'Image' stimuli for the AI responses, but the two orange lines are visually identical in nearly every plot. This adds unnecessary clutter. It would be clearer to use a single line and note in the caption that stimulus type did not affect the results for this method.
  • 💡 High information density reduces readability: While effective for showing the overall pattern, the small size of the individual plots makes it difficult to discern the details of the human response distributions (the black lines). This is an inherent trade-off of the small multiples approach when displaying a large number of plots.
Figure 11: Survey histograms for direct Likert ratings at TLLM = 0.5 for Gem-2f.
Figure/Table Image (Page 16)
Figure 11: Survey histograms for direct Likert ratings at TLLM = 0.5 for Gem-2f.
First Reference in Text
Upon detailed inspection of the Likert response distributions, models typically replied with response '3', i.e. a "safe" regression to the center of the scale (cf. Figs 9-12).
Description
  • Grid of 57 survey response distributions for the Gem-2f model: This figure is a large grid of 57 small graphs, each representing a single product survey. The graphs are histograms, which show the distribution of answers on a 1-to-5 scale (a Likert scale), where 1 is very negative and 5 is very positive. Each plot compares the responses of real humans (solid black line) to the responses of the Gem-2f AI model (orange lines).
  • AI model consistently chooses the neutral middle option: The most prominent pattern in the figure is the behavior of the Gem-2f model. In almost every one of the 57 surveys, the orange lines form a single, tall spike at the number 3, the exact center of the scale. This shows that the AI model, when asked to provide a direct numerical rating, overwhelmingly defaults to a neutral, 'safe' response. This is in stark contrast to the human data, which is much more varied and often shows peaks at higher ratings like 4 or 5.
Scientific Validity
  • ✅ Provides strong cross-model validation: This figure is crucial as it demonstrates that the 'regression to the center' phenomenon is not unique to the GPT-4o model (shown in Fig. 9) but is also exhibited by the Gem-2f model. This strengthens the paper's central argument that the issue lies with the direct Likert rating (DLR) method itself, rather than being an artifact of a single LLM.
  • ✅ Overwhelming visual evidence for the reference text's claim: The figure provides clear and compelling visual support for the claim that models typically reply with '3'. The consistency of this behavior across all 57 surveys for a second major LLM robustly validates the finding.
  • ✅ Granular data presentation increases transparency: By showing the results for each of the 57 surveys individually, the authors allow for a transparent assessment of the finding. This 'small multiples' approach demonstrates that the phenomenon is pervasive and not simply an artifact of averaging, which significantly increases confidence in the conclusion.
Communication
  • ✅ Effective use of small multiples to show consistency: The grid layout is an excellent visualization choice. It powerfully communicates the monotonous consistency of the AI's response pattern across a large and diverse set of stimuli, a point that would be entirely lost in an aggregated summary plot.
  • ✅ Clear and specific title: The title is very effective as it clearly states the model ('Gem-2f') and the key parameter ('TLLM = 0.5'), allowing for easy and direct comparison with the other similar figures in the paper (e.g., Fig. 9, which uses GPT-4o).
  • 💡 Redundant information in the legend could be simplified: The legend distinguishes between 'Text' and 'Image' stimuli for the AI, but the two corresponding orange lines are perfectly superimposed in virtually every panel. This distinction adds visual clutter without providing new information. Suggest simplifying by using a single line for the AI and noting in the caption that stimulus type had no effect on the DLR results.
Figure 12: Survey histograms for direct Likert ratings at TLLM = 1.5 for Gem-2f.
Figure/Table Image (Page 17)
Figure 12: Survey histograms for direct Likert ratings at TLLM = 1.5 for Gem-2f.
First Reference in Text
Upon detailed inspection of the Likert response distributions, models typically replied with response '3', i.e. a "safe" regression to the center of the scale (cf. Figs 9-12).
Description
  • Grid of 57 survey response distributions under specific AI settings: This figure is a large grid of 57 small graphs, each representing a different product survey. Each graph is a histogram, a chart showing the frequency of different answers. The answers are on a 1-to-5 Likert scale, where 1 is a very negative response and 5 is very positive. The plots compare the responses of real humans (solid black line) to those of the Gem-2f AI model (orange lines). The AI was run with a high 'temperature' (TLLM = 1.5), a setting that is meant to make its responses more creative and varied.
  • AI model persistently chooses the neutral middle option: Despite the high temperature setting intended to increase diversity, the AI's behavior is highly consistent and different from humans. In nearly every one of the 57 surveys, the AI's response is a single, sharp spike at the number 3, the neutral midpoint of the scale. This directly contrasts with the human data, which shows a wide variety of response patterns, often with the most frequent answers being 4 or 5. This visual evidence strongly supports the paper's claim that this method causes the AI to give a 'safe' central response.
Scientific Validity
  • ✅ Robustly validates the paper's central finding: This figure provides crucial validation by showing that the 'regression to the center' phenomenon persists across different models (Gem-2f vs. GPT-4o) and different hyperparameters (TLLM=1.5 vs. 0.5). This demonstrates that the issue is a fundamental artifact of the direct Likert rating method itself, not a quirk of one specific model or setting.
  • ✅ Strong evidence against a simple solution: By testing a high-temperature setting, the authors proactively address a potential counterargument—that the lack of diversity could be fixed by simply making the model more 'creative'. The clear failure of this approach, as shown in the figure, strengthens the paper's justification for developing a more sophisticated method like SSR.
  • ✅ Transparent presentation of raw data: The choice to display the distributions for all 57 surveys individually, rather than showing an average, is a major strength. This transparency allows the reader to verify the pervasiveness and consistency of the AI's behavior, which builds significant confidence in the finding.
Communication
  • ✅ Effective use of a small multiples grid: The grid layout is an excellent visualization choice for this data. It powerfully communicates the monotonous consistency of the AI's response pattern across a large and varied set of surveys, a message that would be lost in a single, aggregated plot.
  • ✅ Clear and specific figure title: The title is highly effective because it clearly specifies the exact conditions being shown: the model ('Gem-2f') and the temperature setting ('TLLM = 1.5'). This is essential for allowing readers to accurately compare this figure with Figures 9, 10, and 11.
  • 💡 Redundant legend information: The legend distinguishes between 'Text' and 'Image' stimuli for the AI, but the two corresponding orange lines are visually identical in nearly every plot. This adds unnecessary visual clutter. It would be clearer to use a single line for the AI and state in the caption that stimulus type did not affect the results for this method.
Figure 13: Success metrics for direct Likert ratings at TLLM = 0.5 for GPT-4o.
Figure/Table Image (Page 17)
Figure 13: Success metrics for direct Likert ratings at TLLM = 0.5 for GPT-4o.
First Reference in Text
and KXY = 0.39 for Gem-2f (cf. Figs. 2B, 3, 6B, 3, and 13-16).
Description
  • Performance of AI with text-only product descriptions: This set of plots evaluates the performance of the GPT-4o AI model when it only reads a text description of a product and is forced to give a direct numerical rating. The top scatter plot compares the average score from the AI (vertical axis) to the average score from humans (horizontal axis). It shows a moderate positive relationship (Pearson's R = 0.72), but the AI's scores are compressed into a narrower range than the human scores. The bottom histogram shows the distribution of a similarity score (KS similarity) that measures how well the AI's full response pattern matched the humans' for each survey. The average similarity was low, at 0.36.
Scientific Validity
  • ✅ Appropriate metrics for evaluation: The use of two distinct metrics is a methodological strength. Pearson's R on the mean scores evaluates the model's ability to rank products correctly on average, while the KS similarity of the distributions evaluates its ability to produce realistic, human-like response patterns. This provides a comprehensive assessment.
  • ✅ Strong evidence of method's limitations: The combination of a compressed response range in the scatter plot and a low mean KS similarity (0.36) provides strong, quantitative evidence for the paper's claim that the direct Likert rating (DLR) method produces unrealistic response distributions, even when the average ratings show some correlation with human data.
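The two metrics can be computed side by side. This sketch uses simulated per-survey means with a deliberately compressed synthetic range, mimicking the pattern described above, rather than the actual survey data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
# Simulated per-survey mean PI: the synthetic means track the human
# means but over a compressed range, as in the scatter plot.
human_means = rng.uniform(3.0, 4.5, 57)
synthetic_means = 3.2 + 0.3 * (human_means - 3.0) + rng.normal(0, 0.05, 57)

r, p = pearsonr(human_means, synthetic_means)
print(f"Pearson R = {r:.2f}")
print(f"Range: human {np.ptp(human_means):.2f} "
      f"vs synthetic {np.ptp(synthetic_means):.2f}")
```

The point of the dual evaluation is visible here: the correlation of means can be high (good ranking) while the synthetic range stays much narrower than the human one, so a distributional metric such as KS similarity is still needed to assess realism.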
Communication
  • ✅ Clear multi-plot layout: Presenting both the correlation of means and the distribution of similarity scores provides a more complete picture than either plot alone. The layout effectively communicates two different aspects of model performance for the same condition.
  • ✅ Key statistics are displayed on the plots: Including the R-value, p-value, and mean KS similarity directly on the respective plots makes them highly self-contained and easy to interpret, reinforcing the main takeaways without requiring the reader to hunt for these values in the text.
  • 💡 Inconsistent axis labeling: The axes on the scatter plot are labeled 'LLM mean (likert)' and 'real mean'. More descriptive and parallel labels, such as 'Synthetic Mean Purchase Intent' and 'Human Mean Purchase Intent', would improve clarity.
Figure 14: Success metrics for direct Likert ratings at TLLM = 1.5 for GPT-4o.
Figure/Table Image (Page 18)
Figure 14: Success metrics for direct Likert ratings at TLLM = 1.5 for GPT-4o.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of AI with text-only product descriptions at high 'temperature': This set of plots evaluates the GPT-4o AI model using the 'Direct Likert Rating' method when its 'temperature' (a setting that encourages creativity) is set to a high value of 1.5 and it only sees a text description of the product. The top scatter plot compares the average AI rating (vertical axis) to the average human rating (horizontal axis), showing a positive correlation (Pearson's R = 0.72). The bottom histogram displays the distribution of a similarity score (KS similarity) measuring how well the AI's response pattern matched the humans' for each survey. The average similarity score was low, at 0.37.
Scientific Validity
  • ✅ Important robustness check: This figure serves as a valuable robustness check by testing a different hyperparameter (TLLM = 1.5). It demonstrates that even when attempting to induce more response variability, the distributional similarity to human data remains poor (mean KS = 0.37), which reinforces the conclusion that the direct Likert rating method is fundamentally flawed.
  • ✅ Comprehensive evaluation using dual metrics: The use of two distinct metrics—correlation of means (for ranking) and distributional similarity (for realism)—provides a thorough and nuanced assessment of the model's performance. This dual approach correctly identifies that while the model can capture some of the product ranking signal, it fails to generate human-like response patterns.
Communication
  • ✅ Clear two-plot layout: The format of presenting a scatter plot for correlation and a histogram for distributional similarity is effective. It allows for a clear and comprehensive summary of the two key success metrics for this experimental condition.
  • ✅ On-plot statistics aid interpretation: Including the R-value, p-value, and mean KS similarity directly on the respective plots is excellent practice. It makes the figure's main takeaways immediately accessible and easy to interpret.
  • 💡 Not explicitly referenced in the main text: This figure, while containing important validation data, is not explicitly referenced or discussed in the main body of the text. This is a significant communication gap, as it forces the reader to interpret the results and their implications without guidance from the authors.
  • 💡 Axis labels could be more descriptive: The axis labels ('LLM mean (likert)', 'real mean') are minimal. Using more descriptive labels like 'Synthetic Mean Purchase Intent' and 'Human Mean Purchase Intent' would improve clarity and make the plot more self-explanatory.
Figure 15: Success metrics for direct Likert ratings at TLLM = 0.5 for Gem-2f.
Figure/Table Image (Page 18)
Figure 15: Success metrics for direct Likert ratings at TLLM = 0.5 for Gem-2f.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of Gem-2f AI with text-only product descriptions: This set of plots evaluates the performance of the Gem-2f AI model when it only reads a text description of a product and is forced to give a direct numerical rating. The top scatter plot compares the average score from the AI (vertical axis) to the average score from humans (horizontal axis). It shows a moderate positive relationship with a Pearson's R of 0.54. The bottom histogram shows the distribution of a similarity score (KS similarity) that measures how well the AI's full response pattern matched the humans' for each survey. The average similarity was low, at 0.45.
Scientific Validity
  • ✅ Provides important cross-model validation: This figure is scientifically valuable because it presents results for a second model (Gem-2f), allowing for direct comparison with GPT-4o (shown in Figure 13). This helps establish that the poor performance of the direct Likert rating (DLR) method is a general issue, not an idiosyncrasy of a single model.
  • ✅ Dual-metric approach gives a comprehensive view: The use of two distinct metrics is a methodological strength. The correlation of mean scores assesses the model's ability to rank products, while the KS similarity of distributions assesses its ability to generate realistic, human-like response patterns. This provides a more complete picture of performance.
Communication
  • ✅ Clear two-plot layout: The format of presenting a scatter plot for correlation and a histogram for distributional similarity is effective. It allows for a clear and comprehensive summary of the two key success metrics for this experimental condition.
  • ✅ On-plot statistics aid interpretation: Including the R-value, p-value, and mean KS similarity directly on the respective plots is excellent practice, making the figure's main takeaways immediately accessible.
  • 💡 Not explicitly referenced in the main text: This figure contains important validation data but is not explicitly referenced or discussed in the main body of the text. This is a significant communication gap, as it leaves the reader to interpret the results and their implications without guidance from the authors.
  • 💡 Axis labels could be more descriptive: The axis labels ('LLM mean (likert)', 'real mean') are minimal. Using more descriptive labels like 'Synthetic Mean Purchase Intent' and 'Human Mean Purchase Intent' would improve clarity and make the plot more self-explanatory.
Figure 16: Success metrics for direct Likert ratings at TLLM = 1.5 for Gem-2f.
Figure/Table Image (Page 19)
Figure 16: Success metrics for direct Likert ratings at TLLM = 1.5 for Gem-2f.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of Gem-2f AI with text-only product descriptions at high 'temperature': This set of plots evaluates the Gem-2f AI model using the 'Direct Likert Rating' method, where the AI gives a single number rating. The AI's 'temperature' (a setting that encourages creativity) was set to a high value of 1.5, and it only saw a text description of each product. The top scatter plot compares the average AI rating (vertical axis) to the average human rating (horizontal axis), showing a moderate positive correlation (Pearson's R = 0.55). The bottom histogram displays the distribution of a similarity score (KS similarity) measuring how well the AI's response pattern matched the humans' for each survey. The average similarity score was low, at 0.46.
Scientific Validity
  • ✅ Important robustness check: This figure serves as a valuable robustness check by testing a different model (Gem-2f) under a different hyperparameter setting (TLLM = 1.5). It demonstrates that the poor distributional similarity of the direct Likert rating (DLR) method is a consistent finding, not an artifact of a specific model or setting, which strengthens the paper's overall conclusions.
  • ✅ Comprehensive evaluation with dual metrics: The use of two distinct metrics—correlation of means (for ranking) and distributional similarity (for realism)—provides a thorough and nuanced assessment of model performance. This dual approach is methodologically sound and effectively shows that the model fails to generate human-like response patterns even if it captures some of the product ranking signal.
Communication
  • ✅ Clear two-plot layout: The format of presenting a scatter plot for correlation and a histogram for distributional similarity is effective. It allows for a clear and comprehensive summary of the two key success metrics for this experimental condition.
  • ✅ On-plot statistics aid interpretation: Including the R-value, p-value, and mean KS similarity directly on the respective plots is excellent practice, making the figure's main takeaways immediately accessible and easy to interpret.
  • 💡 Not explicitly referenced in the main text: This figure, while containing important validation data, is not explicitly referenced or discussed in the main body of the text. This is a significant communication gap, as it forces the reader to interpret the results and their implications without guidance from the authors.
  • 💡 Axis labels could be more descriptive: The axis labels ('LLM mean (likert)', 'real mean') are minimal. Using more descriptive labels like 'Synthetic Mean Purchase Intent' and 'Human Mean Purchase Intent' would improve clarity and make the plot more self-explanatory.
Figure 17: First set of survey histograms for textual elicitation with GPT-4o...
Full Caption

Figure 17: First set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 19)
Figure 17: First set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid of survey response distributions for two advanced AI methods: This figure is a large grid of small graphs, each representing a single product survey. The graphs are histograms, showing the distribution of answers on a 1-to-5 Likert scale. The figure compares the responses of real humans (solid black line) to the responses of the GPT-4o AI model (orange line) using two different methods. The top row for each survey shows the results for the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the results for the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how well each method replicates human response patterns.
  • AI models show much more realistic response patterns: Unlike the AI's behavior in previous figures (e.g., Figure 9), where it consistently chose the neutral middle option '3', here the AI's responses (orange lines) are much more varied and human-like. For both the SSR and FLR methods, the AI generates full distributions of responses that often follow the general shape of the human data, including peaks at higher ratings like 4 and 5. This visually demonstrates that these textual elicitation methods are far superior to the direct rating method.
  • SSR method appears to be a closer match than FLR: By visually comparing the top row (SSR) to the bottom row (FLR) for each survey, the SSR method generally appears to provide a closer match to the human data. The orange lines in the SSR plots often capture the peaks, valleys, and overall shape of the black human data lines more accurately than the FLR plots do, which sometimes appear slightly less aligned.
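The SSR mechanism that produces the orange distributions in the top rows can be sketched roughly as follows. The anchor statements, the toy 5-d embeddings, and the softmax normalization are all illustrative assumptions; the paper's actual anchor wording, embedding model, and normalization are not reproduced here.

```python
import math

# Hypothetical anchor statements, one per point of the 5-point scale:
ANCHORS = [
    "I would definitely not buy this product.",
    "I would probably not buy this product.",
    "I might or might not buy this product.",
    "I would probably buy this product.",
    "I would definitely buy this product.",
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def ssr_distribution(response_vec, anchor_vecs, temperature=0.1):
    """Map one embedded free-text response to a probability
    distribution over the Likert scale via its similarity to each
    embedded anchor. Softmax with a temperature is one plausible
    normalization; the paper may use a different one."""
    sims = [cosine(response_vec, a) for a in anchor_vecs]
    exps = [math.exp(s / temperature) for s in sims]
    return [e / sum(exps) for e in exps]

# Toy demo with 5-d stand-in embeddings (a real pipeline would embed
# the response text and each ANCHORS entry with an embedding model):
toy_anchors = [[1 if i == j else 0 for j in range(5)] for i in range(5)]
toy_response = [0.0, 0.0, 0.1, 0.9, 0.2]  # leans toward "probably buy"
dist = ssr_distribution(toy_response, toy_anchors)
```

Because each response contributes a full distribution rather than a single integer, aggregating many such responses naturally yields the smooth, human-like histograms seen in the SSR rows.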
Scientific Validity
  • ✅ Strong evidence for the superiority of textual elicitation: The figure provides compelling, granular evidence that the textual elicitation methods (SSR and FLR) are vastly superior to the direct Likert rating (DLR) method. The stark visual contrast between the realistic distributions here and the 'spike at 3' in Figures 9-12 robustly supports the paper's central claims.
  • ✅ Transparent, per-survey presentation: Presenting the results for each survey individually is a major methodological strength. It allows the reader to see that the improved performance of SSR and FLR is consistent across many different product concepts and is not just an artifact of averaging. This transparency greatly increases confidence in the findings.
  • ✅ Allows for qualitative comparison between SSR and FLR: The side-by-side presentation provides strong visual evidence that SSR generally outperforms FLR in replicating the nuances of the human response distributions, which supports the quantitative summary presented earlier in Figure 3.
Communication
  • ✅ Detailed and informative caption: The caption is excellent, as it clearly and comprehensively describes the specific experimental conditions being visualized: the model (GPT-40), the methods (SSR and FLR), the temperature (TLLM=0.5), and the stimulus/demography setup. This level of detail is crucial for correct interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains the detailed, raw distributional evidence that underpins the summary statistics in Figure 3, yet it is not directly referenced in the main text. This is a significant communication oversight. The authors should explicitly point readers to this figure to show the granular data supporting their aggregate claims.
  • 💡 Layout could be slightly clearer: The figure alternates rows between SSR and FLR, with labels only on the far-left y-axis. This format is functional but could be slightly confusing. Suggest adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') to make the structure immediately obvious.
  • ✅ Effective use of small multiples: The grid layout is a very effective way to demonstrate the consistency of the methods' performance across a large number of different surveys. This visual proof of robustness is a key strength of the presentation.
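The row-labeling suggestion above could be implemented with a labeled small-multiples grid along these lines. This is a minimal matplotlib sketch with made-up placeholder data; the authors' actual plotting code is not available, and panel counts are reduced for brevity.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

random.seed(0)
n_surveys = 4  # the real figures show many more panels
scale = range(1, 6)

fig, axes = plt.subplots(2, n_surveys, figsize=(3 * n_surveys, 5),
                         sharex=True)
for row, method in enumerate(["SSR Method", "FLR Method"]):
    for col in range(n_surveys):
        ax = axes[row][col]
        human = [random.random() for _ in scale]  # placeholder data
        synth = [random.random() for _ in scale]
        ax.plot(scale, [h / sum(human) for h in human], "k-", label="human")
        ax.plot(scale, [s / sum(synth) for s in synth],
                color="tab:orange", label="LLM")
        if row == 0:
            ax.set_title(f"Survey {col + 1}")
    # Label each row with its method, per the suggestion above:
    axes[row][0].set_ylabel(method, fontsize="large")
axes[0][0].legend()
fig.savefig("small_multiples.png", dpi=100)
```

Here the method label is attached to the leftmost panel of each row, which keeps the grid compact while making the alternating SSR/FLR structure immediately legible.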
Figure 18: Second set of survey histograms for textual elicitation with GPT-4o...
Full Caption

Figure 18: Second set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 20)
Figure 18: Second set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Continuation of survey response distribution comparisons: This figure is a continuation of Figure 17, presenting a grid of histograms for the remaining product surveys. Each small graph shows the distribution of answers on a 1-to-5 Likert scale, a standard survey format where 1 is a strong negative response and 5 is a strong positive one. The figure compares the response patterns of real humans (solid black line) against those of the GPT-4o AI model (orange line).
  • Comparison of two advanced AI methods: Two different AI methods are visualized in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, while the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This layout allows for a direct visual comparison of how well each method replicates the shape of the human response distribution for a given product survey.
  • AI models generate realistic, human-like response patterns: In stark contrast to the direct rating method shown in earlier figures (e.g., Figure 9), both the SSR and FLR methods produce varied and realistic response distributions. The AI's responses (orange lines) are spread across the 1-5 scale and often mimic the general shape of the human data (black lines), including capturing peaks at higher ratings like 4 and 5. This provides further visual evidence that these textual elicitation methods are far superior.
Scientific Validity
  • ✅ Provides comprehensive evidence for the paper's claims: This figure, in conjunction with Figure 17, provides the full, granular dataset that underpins the summary statistics presented earlier in the paper (e.g., in Figure 3). Showing the results for every single survey demonstrates the consistency and robustness of the findings, adding significant weight to the conclusion that textual elicitation methods, especially SSR, are superior.
  • ✅ Transparent presentation of data: The choice to present the distribution for every survey individually is a major strength. It prevents any concerns that the good performance is an artifact of averaging and allows for a transparent assessment of the methods' consistency across different stimuli.
  • ✅ Supports qualitative comparison of SSR and FLR: The side-by-side (row-by-row) presentation allows for a qualitative visual comparison between the SSR and FLR methods, which generally supports the quantitative finding that SSR provides a better fit to the human data.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of the methods' performance across the entire dataset, which would be impossible to convey in a single aggregated plot.
  • ✅ Clear and comprehensive caption: The caption is very well-written, clearly stating all the specific experimental conditions being visualized (model, methods, temperature, stimulus, demographics). This level of detail is crucial for correct interpretation and comparison with other figures.
  • 💡 Not explicitly referenced in the main text: This figure contains the detailed evidence supporting the paper's main claims, yet it is not directly referenced in the main text. This is a significant communication oversight. The authors should explicitly point readers to Figures 17 and 18 to show the granular data that supports their aggregate claims, strengthening the link between the summary statistics and the underlying distributions.
  • 💡 Layout could be slightly improved for clarity: The figure alternates rows between SSR and FLR, with labels only on the far-left y-axis. While functional, this could be slightly confusing. Suggest adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') to make the structure immediately obvious to the reader.
Figure 19: First set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 19: First set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 20)
Figure 19: First set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid of survey response distributions for the Gem-2f model: This figure is a large grid of small graphs, each representing a single product survey. The graphs are histograms, which show the distribution of answers on a 1-to-5 Likert scale (where 1 is a strong negative response and 5 is a strong positive one). Each plot compares the responses of real humans (solid black line) to the responses of the Gem-2f AI model (orange line).
  • Comparison of two advanced AI methods: Two different AI methods are visualized in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, while the bottom row shows the 'Follow-up Likert Rating' (FLR) method. Both methods are based on the AI first generating a textual response, which is then converted into a rating. This layout allows for a direct visual comparison of how well each method replicates the human response pattern for a given survey.
  • AI models generate realistic, human-like response patterns: In stark contrast to the direct rating method shown in earlier figures (e.g., Figure 11), both the SSR and FLR methods produce varied and realistic response distributions. The AI's responses (orange lines) are spread across the 1-5 scale and often mimic the general shape of the human data (black lines), including capturing peaks at higher ratings like 4 and 5. This provides strong visual evidence that these textual elicitation methods are far superior to direct numerical prompting for this model as well.
Scientific Validity
  • ✅ Provides crucial cross-model validation: This figure is scientifically important because it demonstrates that the superiority of textual elicitation methods is not limited to the GPT-4o model but also holds for Gem-2f. This replication across different models strengthens the paper's central claim that the methodology itself is the key factor.
  • ✅ Transparent, per-survey presentation: Presenting the results for each survey individually, rather than as an aggregate, is a major methodological strength. It allows the reader to verify that the improved performance of SSR and FLR is consistent across many different product concepts and is not just an artifact of averaging. This transparency greatly increases confidence in the findings.
  • ✅ Supports quantitative summary data: This figure provides the detailed, underlying distributional data that supports the aggregated summary statistics shown for Gem-2f in Figure 7. It serves as the visual proof for the quantitative claims made about the methods' performance.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of the methods' performance across a large number of different surveys, a message that would be lost in a single, aggregated plot.
  • ✅ Clear and comprehensive caption: The caption is very well-written, clearly stating all the specific experimental conditions being visualized (model, methods, temperature, stimulus, demographics). This level of detail is crucial for correct interpretation and comparison with other figures.
  • 💡 Not explicitly referenced in the main text: This figure contains the detailed evidence supporting the paper's claims about Gem-2f, yet it is not directly referenced in the main text. This is a significant communication oversight. The authors should explicitly point readers to Figures 19 and 20 to show the granular data that supports their aggregate claims, strengthening the link between the summary statistics and the underlying distributions.
  • 💡 Layout could be slightly improved for clarity: The figure alternates rows between SSR and FLR, with labels only on the far-left y-axis. While functional, this could be slightly confusing. Suggest adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') to make the structure immediately obvious to the reader.
Figure 20: Second set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 20: Second set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 21)
Figure 20: Second set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at TLLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Continuation of survey response distribution comparisons for the Gem-2f model: This figure is a continuation of Figure 19, presenting a grid of histograms for the remaining product surveys. Each small graph shows the distribution of answers on a 1-to-5 Likert scale, a standard survey format where 1 is a strong negative response and 5 is a strong positive one. The figure compares the response patterns of real humans (solid black line) against those of the Gem-2f AI model (orange line).
  • Comparison of two advanced AI methods: Two different AI methods are visualized in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, while the bottom row shows the 'Follow-up Likert Rating' (FLR) method. Both methods are based on the AI first generating a textual response, which is then converted into a rating. This layout allows for a direct visual comparison of how well each method replicates the human response pattern for a given survey.
  • AI models generate realistic, human-like response patterns: In stark contrast to the direct rating method shown in earlier figures (e.g., Figure 11), both the SSR and FLR methods produce varied and realistic response distributions. The AI's responses (orange lines) are spread across the 1-5 scale and often mimic the general shape of the human data (black lines), including capturing peaks at higher ratings like 4 and 5. This provides further visual evidence that these textual elicitation methods are far superior.
Scientific Validity
  • ✅ Provides comprehensive evidence for the paper's claims: This figure, in conjunction with Figure 19, provides the full, granular dataset that underpins the summary statistics presented earlier in the paper (e.g., in Figure 7). Showing the results for every single survey demonstrates the consistency and robustness of the findings, adding significant weight to the conclusion that textual elicitation methods are superior for the Gem-2f model as well.
  • ✅ Transparent presentation of data: The choice to present the distribution for every survey individually is a major strength. It prevents any concerns that the good performance is an artifact of averaging and allows for a transparent assessment of the methods' consistency across different stimuli.
  • ✅ Supports qualitative comparison of SSR and FLR: The side-by-side (row-by-row) presentation allows for a qualitative visual comparison between the SSR and FLR methods, which generally supports the quantitative finding that SSR provides a better fit to the human data for the Gem-2f model.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of the methods' performance across the entire dataset, which would be impossible to convey in a single aggregated plot.
  • ✅ Clear and comprehensive caption: The caption is very well-written, clearly stating all the specific experimental conditions being visualized (model, methods, temperature, stimulus, demographics). This level of detail is crucial for correct interpretation and comparison with other figures.
  • 💡 Not explicitly referenced in the main text: This figure contains the detailed evidence supporting the paper's main claims, yet it is not directly referenced in the main text. This is a significant communication oversight. The authors should explicitly point readers to Figures 19 and 20 to show the granular data that supports their aggregate claims, strengthening the link between the summary statistics and the underlying distributions.
  • 💡 Layout could be slightly improved for clarity: The figure alternates rows between SSR and FLR, with labels only on the far-left y-axis. While functional, this could be slightly confusing. Suggest adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') to make the structure immediately obvious to the reader.
Figure 21: Success metrics for textual elicitation at TLLM = 0.5 with GPT-4o,...
Full Caption

Figure 21: Success metrics for textual elicitation at TLLM = 0.5 with GPT-4o, with image stimulus and full demography setup.

Figure/Table Image (Page 21)
Figure 21: Success metrics for textual elicitation at TLLM = 0.5 with GPT-4o, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of the Semantic Similarity Rating (SSR) method: This set of plots evaluates the performance of the GPT-4o AI model using the 'Semantic Similarity Rating' (SSR) method. The top scatter plot compares the average purchase intent score from the AI (y-axis) to the average from real humans (x-axis). It shows a strong positive correlation (Pearson's R = 0.72). The bottom histogram shows the distribution of a similarity score (KS similarity) that measures how well the AI's response pattern matched the humans' for each survey. The distribution is heavily concentrated toward high values, indicating very high similarity, with an average score of 0.88.
Scientific Validity
  • ✅ Strong quantitative evidence for SSR effectiveness: The figure provides robust evidence for the success of the SSR method. The combination of a strong correlation in mean ratings (R=0.72) and a very high mean distributional similarity (KS sim = 0.88) strongly supports the paper's central claim that this method can reproduce human-like survey outcomes.
  • ✅ Comprehensive evaluation with dual metrics: The use of two distinct metrics is a methodological strength. The correlation of means assesses the model's ability to rank products correctly, while the KS similarity assesses its ability to produce realistic response distributions. This provides a more complete and convincing evaluation than either metric alone.
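The two on-plot summary numbers can be reproduced from per-survey response distributions roughly as follows. This is a sketch: the function names are assumptions, and the KS similarity is assumed to be one minus the Kolmogorov-Smirnov statistic.

```python
import math
from itertools import accumulate

def mean_rating(dist):
    """Mean Likert rating of a distribution over scores 1-5."""
    return sum((i + 1) * p for i, p in enumerate(dist)) / sum(dist)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ks_similarity(p_a, p_b):
    """Assumed to be 1 minus the KS statistic between the CDFs."""
    cdf_a = list(accumulate(x / sum(p_a) for x in p_a))
    cdf_b = list(accumulate(x / sum(p_b) for x in p_b))
    return 1.0 - max(abs(a - b) for a, b in zip(cdf_a, cdf_b))

def evaluate(human_dists, llm_dists):
    """Per-survey distributions in, the figure's two metrics out:
    (1) Pearson R of per-survey mean ratings (ranking signal),
    (2) mean per-survey KS similarity (distributional realism)."""
    r = pearson_r([mean_rating(d) for d in llm_dists],
                  [mean_rating(d) for d in human_dists])
    ks = sum(ks_similarity(h, l)
             for h, l in zip(human_dists, llm_dists)) / len(human_dists)
    return r, ks
```

The separation matters: a model could rank products well (high R) while producing unrealistic per-survey shapes (low mean KS similarity), which is exactly the failure mode of the direct Likert rating baseline in the earlier figures.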
Communication
  • ✅ Effective two-plot summary: Presenting both the scatter plot for mean correlation and the histogram for distributional similarity is an effective way to summarize the two key aspects of performance for this experimental condition.
  • 💡 Jargon on y-axis label: The y-axis label on the scatter plot, 'LLM mean (embed)', is technical and may not be immediately clear to all readers. A more descriptive label, such as 'Synthetic Mean (SSR Method)', would improve clarity.
  • 💡 Not explicitly referenced in the main text: This figure, which provides the key summary metrics for the paper's best-performing method with GPT-4o, is not directly referenced in the main text. This is a significant communication gap, as it contains crucial data that should be explicitly highlighted to the reader.
Figure 22: Success metrics for textual elicitation at TLLM = 0.5 with Gem-2f,...
Full Caption

Figure 22: Success metrics for textual elicitation at TLLM = 0.5 with Gem-2f, with image stimulus and full demography setup.

Figure/Table Image (Page 22)
Figure 22: Success metrics for textual elicitation at TLLM = 0.5 with Gem-2f, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of the Semantic Similarity Rating (SSR) method with Gem-2f model: This set of plots evaluates the performance of the Gem-2f AI model using the 'Semantic Similarity Rating' (SSR) method. The top scatter plot compares the average purchase intent score from the AI (y-axis) to the average from real humans (x-axis). It shows a strong positive correlation (Pearson's R = 0.72). The bottom histogram shows the distribution of a similarity score (KS similarity) that measures how well the AI's response pattern matched the humans' for each survey. The distribution is heavily concentrated toward high values, indicating very high similarity, with an average score of 0.80.
Scientific Validity
  • ✅ Provides crucial cross-model validation: This figure is scientifically important as it demonstrates that the high performance of the SSR method is not limited to the GPT-4o model but is also achieved with Gem-2f. This replication across different models strengthens the paper's central claim that the methodology itself is the key factor.
  • ✅ Strong quantitative evidence for SSR effectiveness: The figure provides robust evidence for the success of the SSR method with Gem-2f. The combination of a strong correlation in mean ratings (R=0.72) and a very high mean distributional similarity (KS sim = 0.80) strongly supports the paper's claims.
Communication
  • ✅ Effective two-plot summary: Presenting both the scatter plot for mean correlation and the histogram for distributional similarity is an effective way to summarize the two key aspects of performance for this experimental condition.
  • 💡 Jargon on y-axis label: The y-axis label on the scatter plot, 'LLM mean (embed)', is technical and may not be immediately clear to all readers. A more descriptive label, such as 'Synthetic Mean (SSR Method)', would improve clarity.
  • 💡 Not explicitly referenced in the main text: This figure, which provides the key summary metrics for the paper's best-performing method with Gem-2f, is not directly referenced in the main text. This is a significant communication gap, as it contains crucial data that should be explicitly highlighted to the reader.
Figure 23: First set of survey histograms for textual elicitation with GPT-4o...
Full Caption

Figure 23: First set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.

Figure/Table Image (Page 22)
Figure 23: First set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing AI responses with and without demographic information: This figure is a large grid of small graphs (histograms), each representing a single product survey. The graphs show the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the GPT-4o AI model under two different conditions: when the AI is given demographic information to impersonate a specific person (orange line, 'w/ demographics'), and when it is given no demographic information (green line, 'w/o demographics').
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how the presence or absence of demographic information affects each method.
  • Demographic information is crucial for realistic AI responses: The key takeaway is that demographic information is critical for generating human-like responses with this model. The orange lines (with demographics) generally track the shape of the human data (black lines) reasonably well. In contrast, the green lines (without demographics) are very different and unrealistic; they typically show a single, massive spike at the highest ratings (4 or 5), indicating that without a persona, the AI model tends to be overly positive and less nuanced.
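The visual contrast described here can be made quantitative per survey, e.g. with the KS similarity used elsewhere in the paper (assumed here to be one minus the KS statistic). The toy counts below are illustrative stand-ins, not data from the figure:

```python
from itertools import accumulate

def ks_similarity(p_a, p_b):
    """Assumed to be 1 minus the largest CDF gap; 1.0 = identical."""
    cdf_a = list(accumulate(x / sum(p_a) for x in p_a))
    cdf_b = list(accumulate(x / sum(p_b) for x in p_b))
    return 1.0 - max(abs(a - b) for a, b in zip(cdf_a, cdf_b))

# Illustrative counts over ratings 1-5 for one hypothetical survey:
human = [5, 10, 25, 40, 20]
with_demo = [4, 12, 22, 42, 20]    # tracks the human shape
without_demo = [0, 0, 5, 15, 80]   # massive spike at the top ratings

print(ks_similarity(human, with_demo))     # near 1: close match
print(ks_similarity(human, without_demo))  # far lower: poor match
```

Computing this score for each panel, separately for the orange and green lines, would summarize the figure's qualitative takeaway as a single paired comparison per survey.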
Scientific Validity
  • ✅ Excellent use of an ablation study: This figure represents a well-designed ablation study. By systematically removing the demographic information ('w/o demographics' condition), the authors effectively isolate and demonstrate the causal impact of persona conditioning on the model's output. This is a strong methodological approach that provides clear, interpretable results.
  • ✅ Strong, granular evidence for the importance of persona: The figure provides compelling evidence that persona conditioning is not just helpful but essential for achieving realistic response distributions with GPT-4o. The stark and consistent difference between the orange and green lines across all 57 surveys robustly supports the paper's claims about the necessity of using detailed prompts.
  • ✅ Transparent data presentation: Showing the results for each survey individually, rather than in an aggregated form, is a major strength. This transparency allows the reader to verify that the observed effect is pervasive across different product concepts and not an artifact of averaging, which greatly increases confidence in the conclusion.
Communication
  • ✅ Effective use of color to highlight the key comparison: The use of distinct orange and green lines for the two key experimental conditions ('w/ demographics' vs. 'w/o demographics') makes the central comparison of the figure immediately clear and easy to follow across all the small plots.
  • ✅ Detailed and clear caption: The caption is excellent. It comprehensively and accurately describes the complex experimental setup being visualized, including the model, methods, stimulus type, and the crucial comparison between full and zero demographic information. This level of detail is essential for proper interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains the critical experimental evidence for the importance of demographic prompting, yet it is not directly referenced in the main text. This is a significant communication failure. The authors should explicitly cite this figure in the results section to guide the reader to the detailed visual data that supports their claims.
  • 💡 Layout could be slightly improved: The figure alternates rows between the SSR and FLR methods, with labels only on the far-left y-axis. While functional, adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') would make the structure more immediately obvious to the reader.
Figure 24: Second set of survey histograms for textual elicitation with GPT-4o...
Full Caption

Figure 24: Second set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.

Figure/Table Image (Page 23)
Figure 24: Second set of survey histograms for textual elicitation with GPT-4o and follow-up ratings at TLLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing AI responses with and without demographic information (continuation): This figure is a continuation of Figure 23, presenting a large grid of small graphs (histograms) for the remaining product surveys. Each graph shows the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the GPT-4o AI model under two different conditions: when the AI is given demographic information to impersonate a specific person (orange line, 'w/ demographics'), and when it is given no demographic information (green line, 'w/o demographics').
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how the presence or absence of demographic information affects each method.
  • Demographic information is consistently shown to be crucial for realistic AI responses: The key takeaway, consistent with Figure 23, is that demographic information is critical for generating human-like responses with this model. The orange lines (with demographics) generally track the shape of the human data (black lines) reasonably well. In contrast, the green lines (without demographics) are very different and unrealistic; they typically show a single, massive spike at the highest ratings (4 or 5), indicating that without a persona, the AI model tends to be overly positive and less nuanced.
Scientific Validity
  • ✅ Provides comprehensive evidence for the paper's claims: This figure, in conjunction with Figure 23, provides the full, granular dataset that underpins the claims about the importance of demographic conditioning. Showing the results for every single survey demonstrates the consistency and robustness of the findings, adding significant weight to the conclusion.
  • ✅ Strong ablation study design: The comparison between the 'with demographics' and 'without demographics' conditions is a methodologically sound ablation study. It clearly isolates the effect of persona conditioning and demonstrates its critical importance across the entire dataset.
  • ✅ Transparent presentation of data: The choice to present the distribution for every survey individually is a major strength. It prevents any concerns that the observed effect is an artifact of averaging and allows for a transparent assessment of the methods' consistency across different stimuli.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of the methods' performance and the effect of demographic conditioning across the entire dataset, which would be impossible to convey in a single aggregated plot.
  • ✅ Clear and comprehensive caption: The caption is very well-written, clearly stating all the specific experimental conditions being visualized (model, methods, stimulus, and the key demographic comparison). This level of detail is crucial for correct interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains critical experimental evidence, yet it is not directly referenced in the main text. This is a significant communication oversight. The authors should explicitly point readers to Figures 23 and 24 to show the granular data that supports their aggregate claims about the importance of demographic conditioning.
  • 💡 Layout could be slightly improved for clarity: The figure alternates rows between SSR and FLR, with labels only on the far-left y-axis. While functional, this could be slightly confusing. Suggest adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') to make the structure immediately obvious to the reader.
Figure 25: First set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 25: First set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at T_LLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.

Figure/Table Image (Page 23)
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing Gem-2f AI responses with and without demographic information: This figure is a large grid of small graphs (histograms), each representing a single product survey. The graphs show the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the Gem-2f AI model under two different conditions: when the AI is given demographic information to impersonate a specific person (orange line, 'w/ demographics'), and when it is given no demographic information (green line, 'w/o demographics').
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how the presence or absence of demographic information affects each method.
  • Surprising finding: AI without demographics is more realistic for Gem-2f: The key takeaway from this figure is surprising and contrasts with the results for the other AI model (GPT-4o). For Gem-2f, the green lines (without demographics) often provide a closer match to the human data (black lines) than the orange lines (with demographics). This suggests that for this specific model, providing a detailed persona may not be necessary or could even be detrimental to achieving a realistic distribution of survey responses.
Scientific Validity
  • ✅ Crucial cross-model validation: This figure is scientifically vital because it shows a result for Gem-2f that is the opposite of the result for GPT-4o (seen in Fig. 23). This demonstrates that the effect of demographic prompting is highly model-dependent, which is a significant and nuanced finding. It prevents overgeneralization from the results of a single model.
  • ✅ Strong evidence for a surprising interaction: The figure provides the granular, per-survey evidence for the surprising conclusion (summarized in Fig. 7) that removing demographic prompts improves distributional similarity for Gem-2f. The consistency of this pattern across the many surveys shown here makes the finding robust.
  • ✅ Transparent data presentation: Showing the results for each survey individually, rather than in an aggregated form, is a major strength. This transparency allows the reader to verify the consistency of this counterintuitive result, which greatly increases confidence in the conclusion.
Communication
  • ✅ Effective use of color and small multiples: The use of distinct orange and green lines clearly highlights the key experimental comparison, and the small multiples grid layout is an excellent way to show the consistency of the finding across the dataset.
  • ✅ Detailed and clear caption: The caption is excellent. It comprehensively and accurately describes the complex experimental setup being visualized, including the model, methods, stimulus type, and the crucial comparison between full and zero demographic information. This level of detail is essential for proper interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains the critical experimental evidence for a surprising and important model-specific finding, yet it is not directly referenced in the main text. This is a significant communication failure. The authors should explicitly cite this figure in the results section to guide the reader to the detailed visual data that supports their claims.
  • 💡 Layout could be slightly improved: The figure alternates rows between the SSR and FLR methods, with labels only on the far-left y-axis. While functional, adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') would make the structure more immediately obvious to the reader.
Figure 26: Second set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 26: Second set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at T_LLM = 0.5, with text stimulus and alternating between prompting the LLM with full demographic information and zero demographic information.

Figure/Table Image (Page 24)
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing Gem-2f AI responses with and without demographic information (continuation): This figure is a continuation of Figure 25, presenting a large grid of small graphs (histograms) for the remaining product surveys. Each graph shows the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the Gem-2f AI model under two different conditions: when the AI is given demographic information to impersonate a specific person (orange line, 'w/ demographics'), and when it is given no demographic information (green line, 'w/o demographics').
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how the presence or absence of demographic information affects each method.
  • Surprising finding is confirmed: AI without demographics is more realistic for Gem-2f: The key takeaway, consistent with Figure 25, is surprising and contrasts with the results for the other AI model (GPT-4o). For Gem-2f, the green lines (without demographics) often provide a closer match to the human data (black lines) than the orange lines (with demographics). This suggests that for this specific model, providing a detailed persona may not be necessary or could even be detrimental to achieving a realistic distribution of survey responses.
Scientific Validity
  • ✅ Provides comprehensive evidence for a key finding: This figure, in conjunction with Figure 25, provides the full, granular dataset that underpins the surprising finding that removing demographic prompts improves distributional similarity for Gem-2f. Showing the results for every single survey demonstrates the consistency and robustness of this counterintuitive result.
  • ✅ Crucial for cross-model comparison: The results shown here are scientifically vital because they contrast directly with the results for GPT-4o (seen in Figs. 23-24). This demonstrates that the effect of demographic prompting is highly model-dependent, a significant and nuanced finding that prevents overgeneralization from the results of a single model.
  • ✅ Transparent data presentation: Showing the results for each survey individually, rather than in an aggregated form, is a major strength. This transparency allows the reader to verify the consistency of the surprising result, which greatly increases confidence in the conclusion.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of the methods' performance and the surprising effect of demographic conditioning across the entire dataset, a message that would be lost in a single aggregated plot.
  • ✅ Detailed and clear caption: The caption is excellent. It comprehensively and accurately describes the complex experimental setup being visualized, including the model, methods, stimulus type, and the crucial comparison between full and zero demographic information. This level of detail is essential for proper interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains critical experimental evidence for a surprising and important model-specific finding, yet it is not directly referenced in the main text. This is a significant communication failure. The authors should explicitly cite this figure in the results section to guide the reader to the detailed visual data that supports their claims.
  • 💡 Layout could be slightly improved: The figure alternates rows between the SSR and FLR methods, with labels only on the far-left y-axis. While functional, adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') would make the structure more immediately obvious to the reader.
Figure 27: Success metrics for textual elicitation and demography experiments,...
Full Caption

Figure 27: Success metrics for textual elicitation and demography experiments, at T_LLM = 0.5 with GPT-4o and with text stimulus.

Figure/Table Image (Page 24)
First Reference in Text
Not explicitly referenced in main text
Description
  • Comparison of AI performance with and without demographic information: This figure presents four sets of plots evaluating the GPT-4o AI model's performance under different conditions. It compares two methods—Semantic Similarity Rating (SSR) and Follow-up Likert Rating (FLR)—and for each method, it shows the results when the AI was given demographic information to impersonate a person ('w/ dem.') versus when it was not ('w/o dem.'). Each set includes a scatter plot measuring how well the AI's average product ratings correlate with human ratings, and a histogram measuring how similar the AI's response patterns were to human patterns (KS similarity, where 1.0 is a perfect match).
  • Demographics improve product ranking ability: The scatter plots show that providing demographic information is crucial for the AI to rank products correctly. For the SSR method, the correlation (R) between AI and human average scores was 0.68 with demographics but dropped to 0.41 without them. A comparable drop occurred for the FLR method, from R=0.56 with demographics to R=0.32 without them.
  • Demographics have a mixed effect on response pattern realism: The histograms reveal a more complex story. For the FLR method, removing demographics slightly worsened the realism of the response patterns, with the average KS similarity dropping from 0.64 to 0.57. However, for the SSR method, removing demographics surprisingly improved the realism, with the average KS similarity increasing from an already high 0.84 to an even higher 0.91.
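The two metrics used throughout these panels can be stated compactly: KS similarity is 1 minus the maximum absolute gap between the two response CDFs, and ranking ability is the Pearson correlation of per-survey mean ratings. A sketch with made-up numbers (not the paper's data):

```python
import numpy as np

def ks_similarity(p, q):
    """1 minus the max absolute CDF gap between two Likert histograms;
    1.0 means a perfect distributional match."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return 1.0 - np.abs(np.cumsum(p) - np.cumsum(q)).max()

human = np.array([0.05, 0.10, 0.20, 0.40, 0.25])     # hypothetical human histogram
ssr_like = np.array([0.04, 0.12, 0.18, 0.42, 0.24])  # close match -> near 1.0
spiky = np.array([0.00, 0.00, 0.05, 0.15, 0.80])     # top-box spike -> low

ks_good = ks_similarity(human, ssr_like)
ks_bad = ks_similarity(human, spiky)

# Ranking ability: Pearson R between hypothetical per-survey mean ratings.
human_means = np.array([3.2, 3.8, 2.9, 4.1, 3.5, 3.0])
llm_means = np.array([3.4, 3.9, 3.1, 4.0, 3.3, 3.2])
r = np.corrcoef(human_means, llm_means)[0, 1]
```

The two numbers can move in opposite directions, which is exactly the trade-off these panels surface: a synthetic distribution can match the human shape closely (high KS similarity) while its per-survey means correlate poorly, and vice versa.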
Scientific Validity
  • ✅ Excellent use of an ablation study: The direct comparison between the 'with demographics' and 'without demographics' conditions constitutes a well-designed ablation study. It effectively isolates the impact of persona conditioning and reveals a nuanced, method-dependent effect, which is a significant scientific contribution.
  • ✅ Reveals a complex and important interaction: The figure uncovers a critical trade-off: for the SSR method with GPT-4o, removing demographic information degrades the model's ability to rank products correctly (lower correlation) but paradoxically improves its ability to generate realistic overall response distributions (higher KS similarity). This is a non-obvious and important finding about the behavior of these systems.
  • ✅ Comprehensive evaluation with dual metrics: The use of two distinct success metrics (correlation of means and KS similarity of distributions) is methodologically sound. It allows the authors to disentangle two different aspects of performance—ranking accuracy and distributional realism—which is essential for the nuanced interpretation that this figure supports.
Communication
  • ✅ Effective layout for comparison: The 2x2 grid of plot sets allows for a clear and direct comparison of the key experimental conditions. Placing the 'with' and 'without' demography conditions side-by-side effectively highlights the main experimental manipulation.
  • 💡 Not explicitly referenced in the main text: This figure contains the core evidence for a complex and important finding regarding the role of demographics with GPT-4o, yet it is not referenced in the main text. This is a major communication gap. The authors must cite this figure and discuss the trade-off between correlation and distributional similarity in the results section.
  • 💡 High information density: The figure is very dense, presenting eight individual plots. While this is comprehensive, it can be overwhelming for the reader. The authors could consider splitting this into two separate figures, one for the SSR method and one for the FLR method, to improve readability.
  • 💡 Inconsistent and jargon-heavy axis labels: The y-axis labels on the scatter plots ('LLM mean (embed)' vs. 'LLM mean (likert)') are inconsistent and use internal jargon. Using clearer, more descriptive labels like 'Synthetic Mean (SSR Method)' and 'Synthetic Mean (FLR Method)' would significantly improve clarity and make the figure more self-contained.
Figure 28: Success metrics for textual elicitation and demography experiments,...
Full Caption

Figure 28: Success metrics for textual elicitation and demography experiments, at T_LLM = 0.5 with Gem-2f and with text stimulus.

Figure/Table Image (Page 25)
First Reference in Text
Not explicitly referenced in main text
Description
  • Comparison of AI performance with and without demographic information for the Gem-2f model: This figure presents four sets of plots evaluating the Gem-2f AI model's performance under different conditions. It compares two methods—Semantic Similarity Rating (SSR) and Follow-up Likert Rating (FLR)—and for each method, it shows the results when the AI was given demographic information to impersonate a person ('w/ dem.') versus when it was not ('w/o dem.'). Each set includes a scatter plot measuring how well the AI's average product ratings correlate with human ratings, and a histogram measuring how similar the AI's response patterns were to human patterns (KS similarity, where 1.0 is a perfect match).
  • Demographics have a mixed effect on product ranking ability: The scatter plots show that providing demographic information has a small positive effect on the AI's ability to rank products correctly. For the SSR method, the correlation (R) between AI and human average scores was 0.58 with demographics, but dropped slightly to 0.48 without them. A similar drop occurred for the FLR method, from R=0.59 with demographics to R=0.48 without them.
  • Surprising finding: Removing demographics improves realism for the SSR method: The histograms reveal a complex and surprising result. For the FLR method, removing demographics had no effect on the realism of the response patterns (average KS similarity was 0.62 in both cases). However, for the more advanced SSR method, removing demographics paradoxically improved the realism, with the average KS similarity increasing from an already high 0.80 to an even higher 0.91. This suggests that for this specific model and method, a generic prompt may be better at replicating human-like response distributions than a specific persona-based prompt.
Scientific Validity
  • ✅ Crucial cross-model validation: This figure is scientifically vital because it shows a result for Gem-2f that is the opposite of the result for GPT-4o (seen in Fig. 27). This demonstrates that the effect of demographic prompting is highly model-dependent, which is a significant and nuanced finding that correctly prevents overgeneralization from the results of a single model.
  • ✅ Strong evidence for a surprising model-method interaction: The figure uncovers a critical trade-off for Gem-2f with SSR: removing demographic information slightly degrades the model's ability to rank products correctly (lower correlation) but significantly improves its ability to generate realistic overall response distributions (higher KS similarity). This is a non-obvious and important finding about the behavior of these systems.
  • ✅ Comprehensive evaluation with dual metrics: The use of two distinct success metrics (correlation of means and KS similarity of distributions) is methodologically sound. It allows the authors to disentangle two different aspects of performance—ranking accuracy and distributional realism—which is essential for the nuanced interpretation that this figure supports.
Communication
  • ✅ Effective layout for comparison: The 2x2 grid of plot sets allows for a clear and direct comparison of the four key experimental conditions. Placing the 'with' and 'without' demography conditions side-by-side effectively highlights the main experimental manipulation.
  • 💡 Not explicitly referenced in the main text: This figure contains the core evidence for a complex and surprising finding regarding the role of demographics with Gem-2f, yet it is not referenced in the main text. This is a major communication gap. The authors must cite this figure and discuss the trade-off between correlation and distributional similarity in the results section.
  • 💡 Inconsistent and jargon-heavy axis labels: The y-axis labels on the scatter plots ('LLM mean (embed)' vs. 'LLM mean (likert)') are inconsistent and use internal jargon. Using clearer, more descriptive labels like 'Synthetic Mean (SSR Method)' and 'Synthetic Mean (FLR Method)' would significantly improve clarity and make the figure more self-contained.
Figure 29: Success metrics for textual elicitation and demography experiments,...
Full Caption

Figure 29: Success metrics for textual elicitation and demography experiments, at T_LLM = 0.5 with Gem-2f and image stimulus.

Figure/Table Image (Page 25)
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of AI without demographic information: This figure evaluates the performance of the Gem-2f AI model when it is not given any demographic information to impersonate a person. It compares two different methods for generating ratings: Semantic Similarity Rating (SSR) on the left, and Follow-up Likert Rating (FLR) on the right. Each side has a scatter plot on top, which measures how well the AI's average product ratings correlate with human ratings, and a histogram below, which measures how similar the AI's overall response patterns are to human patterns (KS similarity, where 1.0 is a perfect match).
  • SSR method fails at product ranking but excels at distributional realism: The plots on the left show a critical trade-off for the SSR method. The scatter plot shows a very weak, non-significant correlation (R = 0.14), meaning the AI completely fails to rank products correctly. However, the histogram shows that the response patterns it generates are extremely realistic, with a very high average KS similarity of 0.91.
  • FLR method retains some ranking ability but with less realism: The plots on the right show that the FLR method performs differently. The scatter plot shows a moderate, statistically significant correlation (R = 0.53), indicating that it retains some ability to rank products correctly, unlike the SSR method in this condition. However, this comes at the cost of less realistic response patterns, as shown by the lower average KS similarity of 0.67 in the histogram.
Scientific Validity
  • ✅ Crucial ablation study revealing a key trade-off: This figure represents a well-designed ablation study (removing demographic information). It uncovers a critical and non-obvious trade-off: without a persona, the SSR method loses its ability to rank products (correlation) while paradoxically achieving near-perfect distributional similarity. This is a significant scientific finding about the behavior of these models.
  • ✅ Strong evidence for the importance of persona for ranking: The extremely low correlation (R=0.14) for the SSR method without demographics provides powerful evidence that persona conditioning is essential for the model to leverage product information and produce a meaningful ranking signal. This robustly supports the claims made in the main text (Section 4.3).
  • 💡 Ambiguity from using the 'best reference set': The caption states that the results for SSR are from the 'best reference set (4)'. This raises concerns about potential cherry-picking. While the main text often refers to averages over all reference sets, relying on the 'best' set for this specific figure could exaggerate the reported KS similarity. The authors should either justify this choice or present the average results to ensure the finding is robust.
Communication
  • ✅ Effective comparative layout: The side-by-side comparison of the SSR and FLR methods, each with its two corresponding metrics, is a clear and effective layout that allows for a direct visual assessment of their relative performance and trade-offs.
  • 💡 Not explicitly referenced in the main text: This figure contains the core evidence for the important claims made in Section 4.3 about the model's performance without demographic prompts, yet it is not cited in the text. This is a major communication failure, as it disconnects the authors' claims from the supporting data.
  • 💡 Inconsistent and jargon-heavy axis labels: The y-axis labels on the scatter plots ('LLM mean (embed)' vs. 'LLM mean (likert)') are inconsistent and use internal jargon. Using clearer, more descriptive, and parallel labels such as 'Synthetic Mean (SSR Method)' and 'Synthetic Mean (FLR Method)' would significantly improve clarity and make the figure more self-contained.
  • 💡 Caption detail could be clearer: The note about using the 'best reference set (4)' is placed at the end of the caption. It should be clarified that this applies specifically to the SSR results to avoid ambiguity, for instance: 'For semantic similarity rating (SSR), results from the best reference set (4) are shown.'
Figure 30: First set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 30: First set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and prompting the LLM with zero demographic information.

Figure/Table Image (Page 26)
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing Gem-2f AI responses without demographic information to human responses: This figure is a large grid of small graphs (histograms), each representing a single product survey. The graphs show the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the Gem-2f AI model (orange line) under a specific condition: the AI was given no demographic information to impersonate a person.
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how each method performs without persona-based prompting.
  • AI without demographics shows surprisingly high realism: The key takeaway from this figure is that for the Gem-2f model, the AI's responses (orange lines) are remarkably similar to the human data (black lines) even without any demographic guidance. The SSR method in particular (top rows) shows a very close match in the shape and peaks of the distributions across many of the surveys. This provides the visual evidence for the surprising quantitative results shown in Figures 7 and 29.
Scientific Validity
  • ✅ Crucial cross-model validation: This figure is scientifically vital because it shows a result for Gem-2f that is the opposite of the result for GPT-4o (seen in Fig. 23). This demonstrates that the effect of demographic prompting is highly model-dependent, which is a significant and nuanced finding that correctly prevents overgeneralization from the results of a single model.
  • ✅ Strong evidence for a surprising finding: The figure provides the granular, per-survey evidence for the surprising conclusion that removing demographic prompts improves distributional similarity for Gem-2f. The consistency of this pattern across the many surveys shown here makes the finding robust.
  • 💡 Ambiguity from using the 'best reference set': The caption states that the results for SSR are from the 'best reference set (4)'. This raises concerns about potential cherry-picking. While the main text often refers to averages over all reference sets, relying on the 'best' set for this specific figure could exaggerate the reported KS similarity. The authors should either justify this choice or present the average results to ensure the finding is robust.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of this surprising result across many different surveys, a message that would be lost in a single, aggregated plot.
  • ✅ Detailed and clear caption: The caption is excellent. It comprehensively and accurately describes the complex experimental setup being visualized, including the model, methods, stimulus type, and the crucial condition of zero demographic information. This level of detail is essential for proper interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains the critical experimental evidence for a surprising and important model-specific finding, yet it is not directly referenced in the main text. This is a significant communication failure. The authors should explicitly cite this figure in the results section to guide the reader to the detailed visual data that supports their claims.
  • 💡 Layout could be slightly improved: The figure alternates rows between the SSR and FLR methods, with labels only on the far-left y-axis. While functional, adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') would make the structure more immediately obvious to the reader.
Figure 31: Second set of survey histograms for textual elicitation with Gem-2f...
Full Caption

Figure 31: Second set of survey histograms for textual elicitation with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and prompting the LLM with zero demographic information.

Figure/Table Image (Page 26)
First Reference in Text
Not explicitly referenced in main text
Description
  • Grid comparing Gem-2f AI responses without demographic information to human responses (continuation): This figure is a continuation of Figure 30, presenting a large grid of small graphs (histograms) for the remaining product surveys. Each graph shows the distribution of answers on a 1-to-5 Likert scale. Each plot compares the responses of real humans (solid black line) to the Gem-2f AI model (orange line) under a specific condition: the AI was given no demographic information to impersonate a person.
  • Two different AI methods are shown: The figure visualizes two different methods for getting ratings from the AI, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. This allows for a direct visual comparison of how each method performs without persona-based prompting.
  • AI without demographics shows surprisingly high realism: The key takeaway from this figure is that for the Gem-2f model, the AI's responses (orange lines) are remarkably similar to the human data (black lines) even without any demographic guidance. The SSR method in particular (top rows) shows a very close match in the shape and peaks of the distributions across many of the surveys. This provides the visual evidence for the surprising quantitative results shown in Figures 7 and 29.
Scientific Validity
  • ✅ Provides comprehensive evidence for a key finding: This figure, in conjunction with Figure 30, provides the full, granular dataset that underpins the surprising finding that removing demographic prompts improves distributional similarity for Gem-2f. Showing the results for every single survey demonstrates the consistency and robustness of this counterintuitive result.
  • ✅ Crucial for cross-model comparison: The results shown here are scientifically vital because they contrast directly with the results for GPT-4o (seen in Figs. 23-24). This demonstrates that the effect of demographic prompting is highly model-dependent, a significant and nuanced finding that prevents overgeneralization from the results of a single model.
  • 💡 Ambiguity from using the 'best reference set': The caption states that the results for SSR are from the 'best reference set (4)'. This raises concerns about potential cherry-picking. While the main text often refers to averages over all reference sets, relying on the 'best' set for this specific figure could exaggerate the reported KS similarity. The authors should either justify this choice or present the average results to ensure the finding is robust.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice. It powerfully communicates the consistency of this surprising result across many different surveys, a message that would be lost in a single, aggregated plot.
  • ✅ Detailed and clear caption: The caption is excellent. It comprehensively and accurately describes the complex experimental setup being visualized, including the model, methods, stimulus type, and the crucial condition of zero demographic information. This level of detail is essential for proper interpretation.
  • 💡 Not explicitly referenced in the main text: This figure contains the critical experimental evidence for a surprising and important model-specific finding, yet it is not directly referenced in the main text. This is a significant communication failure. The authors should explicitly cite this figure in the results section to guide the reader to the detailed visual data that supports their claims.
  • 💡 Layout could be slightly improved: The figure alternates rows between the SSR and FLR methods, with labels only on the far-left y-axis. While functional, adding a small title above the first plot in each row (e.g., 'SSR Method', 'FLR Method') would make the structure more immediately obvious to the reader.
Figure 32: Scan over post-elicitation temperature T values and change in...
Full Caption

Figure 32: Scan over post-elicitation temperature T values and change in success metrics for textual elicitation at T_LLM = 0.5 with GPT-4o and image stimulus, with full demography setup.

Figure/Table Image (Page 27)
Figure 32: Scan over post-elicitation temperature T values and change in success metrics for textual elicitation at T_LLM = 0.5 with GPT-4o and image stimulus, with full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Hyperparameter tuning for the Semantic Similarity Rating (SSR) method: This figure shows how the performance of the authors' proposed method, Semantic Similarity Rating (SSR), changes when a key setting called 'post-elicitation temperature T' is adjusted. This 'temperature' controls how concentrated or spread out the AI's probabilistic answer is; a low temperature forces a confident choice, while a high temperature spreads its belief across multiple options. The figure contains seven plots: six small ones for different 'reference sets' (the anchor statements used to define the 1-5 scale) and one large plot showing the average result.
  • Two performance metrics are evaluated: Each plot tracks two different success metrics. The black lines (left y-axis) show 'Pearson R', which measures how well the AI's average product rankings correlate with human rankings. The orange lines (right y-axis) show 'KS similarity', which measures how well the overall pattern of the AI's answers matches the pattern of human answers. For both metrics, higher values are better. The solid lines represent the SSR method, while the dashed horizontal lines show the performance of a simpler method, 'Follow-up Likert rating' (Likert), for comparison.
  • Optimal performance is found around T=1.0: The plots show that the SSR method's performance (solid lines) is consistently better than the simpler Likert method (dashed lines) across a wide range of temperatures. The performance of SSR generally peaks when the temperature T is around 1.0. For example, in the main 'mean' plot, the KS similarity (orange line) is highest around T=0.75, while the Pearson R (black line) is highest around T=1.25. This shows that the authors' choice of T=1.0 for their main experiments is a good, near-optimal compromise that balances both performance metrics.
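The temperature mechanism described above can be sketched as a softmax over anchor similarities. This is a minimal illustration of the general technique, not the authors' code; the function name and similarity values are made up:

```python
import numpy as np

def temperature_scale(similarities, T):
    """Turn anchor-similarity scores into a Likert probability
    distribution via a softmax with post-elicitation temperature T.
    Low T concentrates mass on the best-matching anchor; high T
    spreads it across the scale."""
    z = np.asarray(similarities, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Illustrative cosine similarities of one free-text response to the
# five Likert anchor statements (values invented for the example):
sims = [0.55, 0.62, 0.70, 0.81, 0.78]

sharp = temperature_scale(sims, T=0.1)    # mass concentrated on rating 4
balanced = temperature_scale(sims, T=1.0)
flat = temperature_scale(sims, T=10.0)    # close to uniform
```

As the scan in this figure suggests, T trades off confidence against spread: too low and the synthetic distribution collapses toward a single rating, too high and it flattens toward uniform regardless of the response text.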
Scientific Validity
  • ✅ Demonstrates excellent methodological rigor: This figure shows a thorough hyperparameter sweep for a key parameter (T) of the proposed SSR method. This analysis demonstrates that the authors have rigorously investigated the sensitivity of their results and provides a strong empirical justification for the parameter choice used in the main analysis.
  • ✅ Includes a robust validation across reference sets: By testing six different reference sets and presenting the average, the authors demonstrate that their findings are robust and not an artifact of a single, fortuitously chosen set of anchor statements. This significantly strengthens the generalizability of their conclusions about the optimal temperature T.
  • ✅ Reveals a nuanced performance trade-off: The figure reveals a scientifically interesting and nuanced trade-off: the temperature that maximizes distributional similarity (KS sim) is slightly different from the temperature that maximizes ranking correlation (Pearson R). This is an important finding about the behavior of the method and justifies the choice of a compromise value.
Communication
  • ✅ Clear and effective visualization design: The use of dual-axis line plots is an effective way to simultaneously visualize the impact of a single parameter on two different performance metrics. The inclusion of the FLR method's performance as horizontal baselines provides an immediate and easy-to-interpret comparison.
  • 💡 Not explicitly referenced in the main text: This figure contains the crucial justification for a key methodological choice (the value of T), yet it is not referenced anywhere in the main body of the paper. This is a significant communication failure. The authors must cite this figure in the methods or results section to support their choice of T=1 and to show the robustness of their method.
  • ✅ Comprehensive and clear caption: The caption is excellent. It clearly explains what is being plotted, the different lines, the experimental conditions, and the meaning of the horizontal lines, making the figure largely self-contained and easy to understand.
Figure 33: First set of survey histograms for textual elicitation to question...
Full Caption

Figure 33: First set of survey histograms for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 27)
Figure 33: First set of survey histograms for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Referenced only in passing in Section 4.4: '(cf. Figs. 33-34)'
Description
  • Testing the AI methods on a new survey question: This figure shows a grid of small graphs (histograms) that test how well the AI methods work on a different type of survey question: 'How relevant is this concept for you?' instead of the 'purchase intent' question used in the rest of the paper. Each graph shows the distribution of answers on a 1-to-5 scale, comparing real human responses (solid black line) to those from the Gem-2f AI model (orange line).
  • Comparison of two advanced AI methods: The figure visualizes two different methods for generating AI ratings, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. Both methods are based on the AI first generating a textual response, which is then converted into a rating.
  • AI models successfully generate realistic responses for the new question: The key finding is that both AI methods successfully generate realistic, human-like response patterns for this new 'relevance' question. The AI's responses (orange lines) are spread across the 1-5 scale and generally follow the shape of the human data (black lines), demonstrating that the techniques are not limited to a single question type. Visually, the SSR method (top rows) often appears to be a closer match to the human data than the FLR method (bottom rows).
Scientific Validity
  • ✅ Crucial generalization experiment: This figure is scientifically important as it serves as a generalization test. By demonstrating that the methods work well for a different construct ('relevance') beyond the main focus ('purchase intent'), the authors significantly strengthen their claim about the general applicability of their framework. This shows the methods are not narrowly overfitted to a single task.
  • ✅ Provides strong visual support for quantitative claims: This figure provides the detailed, per-survey visual evidence for the aggregate distributional similarity scores (Kxy) reported in Section 4.4 of the main text. The visual superiority of SSR over FLR in many of the plots aligns with the reported quantitative scores (Kxy = 0.81 for SSR vs. 0.62 for FLR).
  • ✅ Transparent data presentation: The choice to present the distribution for every survey individually is a major strength. It allows the reader to verify the consistency of the methods' performance on this new task, which increases confidence in the generalization claim.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice for this data. It powerfully communicates the consistency of the methods' performance across many different surveys for this new question.
  • ✅ Detailed and clear caption: The caption is very well-written, clearly stating that this figure addresses a new question ('How relevant...?') and detailing the specific experimental conditions. This is essential for understanding that this is a generalization test.
  • 💡 Weakly referenced in the main text: While the main text in Section 4.4 mentions '(cf. Figs. 33-34)', this is a weak reference. Given the importance of this generalization test, the authors should explicitly point the reader to this figure to see the detailed distributions that support the quantitative Kxy values, strengthening the connection between the text and the evidence.
  • 💡 Y-axis labels use jargon: The y-axis labels ('Embeddings' for SSR and 'LLM-Likert' for FLR) are internal jargon. Using clearer, more descriptive labels like 'SSR Method' and 'FLR Method' would be much more accessible to a broader audience and consistent with the text.
Figure 34: Second set of survey histograms for textual elicitation to question...
Full Caption

Figure 34: Second set of survey histograms for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 28)
Figure 34: Second set of survey histograms for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Referenced only in passing in Section 4.4: '(cf. Figs. 33-34)'
Description
  • Testing the AI methods on a new survey question (continuation): This figure is a continuation of Figure 33, showing the remaining survey results for a test of how well the AI methods work on a different type of survey question: 'How relevant is this concept for you?' instead of the 'purchase intent' question used in the rest of the paper. Each small graph (histogram) shows the distribution of answers on a 1-to-5 scale, comparing real human responses (solid black line) to those from the Gem-2f AI model (orange line).
  • Comparison of two advanced AI methods: The figure visualizes two different methods for generating AI ratings, shown in alternating rows. The top row for each survey shows the 'Semantic Similarity Rating' (SSR) method, and the bottom row shows the 'Follow-up Likert Rating' (FLR) method. Both methods are based on the AI first generating a textual response, which is then converted into a rating.
  • AI models successfully generate realistic responses for the new question: The key finding, consistent with Figure 33, is that both AI methods successfully generate realistic, human-like response patterns for this new 'relevance' question. The AI's responses (orange lines) are spread across the 1-5 scale and generally follow the shape of the human data (black lines), demonstrating that the techniques are not limited to a single question type. Visually, the SSR method (top rows) often appears to be a closer match to the human data than the FLR method (bottom rows).
Scientific Validity
  • ✅ Crucial generalization experiment: This figure, in conjunction with Figure 33, completes a vital generalization test. By demonstrating that the methods work well for a different construct ('relevance') across the entire dataset, the authors significantly strengthen their claim about the general applicability of their framework. This shows the methods are not narrowly overfitted to a single task.
  • ✅ Provides comprehensive visual support for quantitative claims: This figure provides the detailed, per-survey visual evidence for the aggregate distributional similarity scores (Kxy) reported in Section 4.4 of the main text. The visual superiority of SSR over FLR in many of the plots aligns with the reported quantitative scores (Kxy = 0.81 for SSR vs. 0.62 for FLR).
  • ✅ Transparent data presentation: The choice to present the distribution for every survey individually is a major strength. It allows the reader to verify the consistency of the methods' performance on this new task, which increases confidence in the generalization claim.
Communication
  • ✅ Effective use of small multiples: The grid layout is an excellent visualization choice for this data. It powerfully communicates the consistency of the methods' performance across many different surveys for this new question.
  • ✅ Detailed and clear caption: The caption is very well-written, clearly stating that this figure addresses a new question ('How relevant...?') and detailing the specific experimental conditions. This is essential for understanding that this is a generalization test.
  • 💡 Weakly referenced in the main text: While the main text in Section 4.4 mentions '(cf. Figs. 33-34)', this is a weak reference. Given the importance of this generalization test, the authors should explicitly point the reader to this figure to see the detailed distributions that support the quantitative Kxy values, strengthening the connection between the text and the evidence.
  • 💡 Y-axis labels use jargon: The y-axis labels ('Embeddings' for SSR and 'LLM-Likert' for FLR) are internal jargon. Using clearer, more descriptive labels like 'SSR Method' and 'FLR Method' would be much more accessible to a broader audience and consistent with the text.
Figure 35: Success metrics for textual elicitation to question "How relevant is...
Full Caption

Figure 35: Success metrics for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.

Figure/Table Image (Page 28)
Figure 35: Success metrics for textual elicitation to question "How relevant is this concept for you?" with Gem-2f and follow-up ratings at T_LLM = 0.5, with image stimulus and full demography setup.
First Reference in Text
Not explicitly referenced in main text
Description
  • Performance of the Semantic Similarity Rating (SSR) method on a 'relevance' question: This set of plots evaluates how well the Gem-2f AI model, using the 'Semantic Similarity Rating' (SSR) method, can answer the question 'How relevant is this concept for you?'. The top scatter plot compares the average AI score (y-axis) to the average human score (x-axis), showing a moderately strong positive correlation with a Pearson's R of 0.66. The bottom histogram shows the distribution of a similarity score (KS similarity) that measures how well the AI's full response pattern matched the humans' for each survey. The distribution is skewed to the right, indicating high similarity, with an average score of 0.81.
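One plausible reading of the KS similarity score used here (assumed below to be 1 minus the two-sample Kolmogorov-Smirnov statistic over the 5-point distributions; the paper's exact definition may differ) can be sketched as:

```python
import numpy as np

def ks_similarity(p_human, p_llm):
    """Assumed definition: 1 minus the largest absolute gap between
    the two cumulative Likert distributions (the KS statistic).
    1.0 means identical distributions; 0.0 means maximally different."""
    cdf_h = np.cumsum(p_human)
    cdf_l = np.cumsum(p_llm)
    return 1.0 - np.abs(cdf_h - cdf_l).max()

# Hypothetical 5-point response distributions (fractions of respondents):
human = np.array([0.10, 0.15, 0.25, 0.30, 0.20])
llm   = np.array([0.08, 0.17, 0.27, 0.28, 0.20])
score = ks_similarity(human, llm)   # close to 1, i.e. very similar shapes
```

Under this reading, the reported mean of 0.81 means the synthetic and human cumulative distributions never diverge by more than about 0.19 on average.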
Scientific Validity
  • ✅ Crucial generalization experiment: This figure is scientifically important as it serves as a generalization test. By demonstrating that the SSR method works well for a different construct ('relevance') beyond the main focus ('purchase intent'), the authors significantly strengthen their claim about the general applicability of their framework.
  • ✅ Strong quantitative support for generalization: The combination of a significant correlation (R=0.66) and high mean distributional similarity (KS sim = 0.81) provides robust evidence that the SSR method's effectiveness is not limited to a single question type.
  • 💡 Inconsistency between text and figure metrics: The accompanying text reports a 'correlation attainment ρ = 82%' for this condition, but the figure itself only displays the Pearson correlation 'R=0.66'. While ρ is derived from R, the relationship is not defined in the figure. For full clarity, the figure should either display the ρ value or the caption should clarify how R relates to ρ.
Communication
  • ✅ Effective two-plot summary: Presenting both the scatter plot for mean correlation and the histogram for distributional similarity is an effective way to summarize the two key aspects of performance for this experimental condition.
  • 💡 Jargon on y-axis label: The y-axis label on the scatter plot, 'LLM mean (embed)', is technical and may not be immediately clear to all readers. A more descriptive label, such as 'Synthetic Mean (SSR Method)', would improve clarity.
  • ✅ On-plot statistics aid interpretation: Including the R-value, p-value, and mean KS similarity directly on the respective plots is excellent practice, making the figure's main takeaways immediately accessible.
Table 1: Metrics for all experiments on purchase intent.
Figure/Table Image (Page 14)
Table 1: Metrics for all experiments on purchase intent.
First Reference in Text
For detailed results, see Figs. 17-22 and Tab. 1.
Description
  • Comprehensive summary of all experimental results: This table is a detailed summary of all the different experiments conducted in the study. Each row represents a specific experiment, defined by the AI model used (GPT-4o or Gem-2f), the elicitation method ('Direct' numerical rating vs. 'Textual' response), the type of prompt ('Stimulus': Text or Image), whether demographic information was used ('Dem.'), and a randomness setting for the AI ('T_LLM').
  • Evaluation using multiple performance metrics: The table reports several key performance metrics. 'Correlation attainment (ρ)' is a custom score from 0% to 100% that measures how well the AI's ranking of products matches the human ranking, relative to the best possible performance. 'Kxy' is the Kolmogorov-Smirnov similarity, a score from 0 to 1 measuring how well the shape of the AI's response distribution matches the human distribution. 'Rxy' is the standard Pearson correlation between average scores. The final columns show the average purchase intent score ('E[PIs]') generated by the AI.
  • Key performance highlights: The table highlights the best-performing condition: using the GPT-4o model with a 'Textual' (SSR) method, full demographic information, and an image stimulus achieved the highest correlation attainment (ρ = 90.2%) and excellent distributional similarity (Kxy = 0.88). In contrast, 'Direct' rating methods consistently produced poor distributional similarity, with Kxy scores as low as 0.26. The table also shows the surprising result that for the Gem-2f model, removing demographic information ('None') led to the highest distributional similarity (Kxy = 0.91), although it reduced the correlation attainment.
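Assuming 'correlation attainment' expresses the LLM-human ranking correlation as a fraction of the human test-retest ceiling (an assumption consistent with the paper's summary of "over 90% of human test-retest reliability", not a confirmed formula), the metric can be sketched as:

```python
def correlation_attainment(r_llm_human, r_test_retest):
    """Assumed formula for correlation attainment rho: the LLM-human
    Pearson correlation as a percentage of the human test-retest
    correlation, which caps achievable performance. The paper may
    normalize differently."""
    return 100.0 * r_llm_human / r_test_retest

# E.g. an LLM-human correlation of 0.76 against a human
# test-retest correlation of 0.84 yields roughly 90%:
rho = correlation_attainment(0.76, 0.84)
```

The point of such a normalization is that no synthetic panel can be expected to correlate with humans more strongly than humans correlate with themselves on repeated measurement.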
Scientific Validity
  • ✅ High degree of transparency and completeness: The table's primary strength is its comprehensiveness. By reporting the results for all experimental permutations, the authors provide a high level of transparency, allowing readers to scrutinize the data and verify the claims made in the text. This is a hallmark of rigorous research.
  • ✅ Use of multiple, complementary metrics: The evaluation relies on a suite of metrics (ρ, Kxy, Rxy, Cxy) that capture different aspects of performance (ranking vs. distributional shape). This multi-faceted approach is methodologically sound and provides a more nuanced and robust assessment than any single metric could.
  • 💡 Ambiguity of 'avg' rows: The 'avg' rows for the 'Textual' elicitation method are crucial for the main claims, but the table does not explicitly state what is being averaged over (presumably, the different reference sets Σ). A footnote clarifying the nature of this averaging would improve methodological clarity.
  • 💡 Potential for perceived cherry-picking: The table reports results for both the 'avg' and the 'Best Σ' (best reference set). While the main text appropriately focuses on the average results, presenting the 'best' result in the main summary table could be misconstrued as cherry-picking. It might be clearer to relegate the 'best set' results to the appendix to maintain focus on the more robust average performance.
Communication
  • ✅ Logical and structured organization: The table is well-organized, with columns systematically grouped by experimental conditions on the left and performance metrics on the right. This logical flow aids in navigating the large amount of information.
  • 💡 Extremely high information density: The table is very dense, which can make it overwhelming and difficult to parse. The sheer number of columns and rows makes it hard to quickly identify the key takeaways. Suggest using visual cues, such as bolding the rows or cells with the best performance for each key metric (e.g., highest ρ and Kxy), to guide the reader's attention to the most important results.
  • 💡 Undefined abbreviations and complex headers: The table uses several abbreviations ('Elicit.', 'Dem.', 'Stim.', 'Lik.') that are not defined in the caption or a footnote. Furthermore, the nested column headers for 'SSR' and 'Lik.' under each metric are efficient but visually complex. Suggest adding a footnote to define all abbreviations and simplifying the headers where possible, for example, by explicitly labeling the 'Lik.' column as 'FLR' to match the text.
  • 💡 Caption is incomplete: The caption in the original paper is very brief. The expanded caption provided for this critique is much better, as it explains the meaning of the different columns and metrics. A good table should be as self-contained as possible, and a detailed caption is essential for that. The authors should expand their caption to define the key terms and abbreviations used.

Discussion and Conclusion

Key Aspects

Strengths

Suggestions for Improvement