This paper addresses a critical challenge in using Large Language Models (LLMs) for consumer research: their tendency to produce unrealistic response distributions when asked for direct numerical ratings. Traditional consumer surveys are costly and prone to biases, and while LLMs offer a scalable alternative by simulating 'synthetic consumers,' this fundamental flaw has limited their utility. The authors' primary objective was to develop and validate a new method that could elicit more human-like survey responses from LLMs.
The core of the study is the introduction of a novel two-step method called Semantic Similarity Rating (SSR). First, an LLM, prompted to impersonate a consumer with specific demographic traits, generates a free-text response expressing its purchase intent for a given product concept. Second, instead of asking the model to convert this text to a single number, the system transforms the text into a high-dimensional vector (an embedding) that captures its meaning. This vector is then compared, via cosine similarity, to the embeddings of predefined 'anchor statements' representing each point on a 5-point Likert scale. This process yields a full probability distribution across the scale, reflecting the nuance and ambiguity of the original text.
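For concreteness, the mapping step can be sketched as follows. This is a minimal illustration that assumes the response and anchor embeddings have already been computed; the anchor texts and the softmax normalization are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Hypothetical anchor statements, one per point of the 5-point Likert scale
# (illustrative only; the paper's actual anchor texts may differ).
ANCHORS = [
    "I would definitely not buy this product.",   # 1
    "I would probably not buy this product.",     # 2
    "I might or might not buy this product.",     # 3
    "I would probably buy this product.",         # 4
    "I would definitely buy this product.",       # 5
]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ssr_distribution(response_emb, anchor_embs, temperature=0.1):
    """Map a free-text response embedding to a probability distribution
    over the 5-point scale via its similarity to the anchor embeddings.
    The temperature-scaled softmax is an assumed normalization."""
    sims = np.array([cosine(response_emb, a) for a in anchor_embs])
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```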
The authors rigorously tested the SSR method against simpler baselines using an extensive dataset of 57 real-world consumer surveys on personal care products, involving 9,300 human participants. The results were compelling: the SSR method reproduced the relative ranking of products, attaining over 90% of human test-retest reliability. Critically, it also generated response distributions that were highly similar to human patterns, achieving a Kolmogorov-Smirnov (KS) similarity score of over 0.85, whereas the direct rating baseline failed dramatically on this metric (KS similarity ≈ 0.26). The study also demonstrated that providing demographic context to the LLMs was essential for achieving this high level of performance.
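For readers unfamiliar with the distributional metric, one plausible reading of "KS similarity" is one minus the Kolmogorov-Smirnov distance between the human and synthetic Likert histograms; a minimal sketch under that assumption follows, though the paper's exact definition may differ.

```python
import numpy as np

def ks_similarity(p_human, p_synth):
    """1 - KS distance between two discrete Likert distributions,
    where the KS distance is the maximum absolute difference between
    their cumulative distribution functions.
    (Assumed reading of the paper's 'KS similarity' metric.)"""
    p_human = np.asarray(p_human, dtype=float) / np.sum(p_human)
    p_synth = np.asarray(p_synth, dtype=float) / np.sum(p_synth)
    ks_distance = np.max(np.abs(np.cumsum(p_human) - np.cumsum(p_synth)))
    return 1.0 - ks_distance
```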
The paper concludes that the shortcomings of LLMs in survey simulation are not inherent limitations of the models but rather artifacts of flawed elicitation techniques. The SSR framework offers a robust, training-free solution that preserves the quantitative metrics of traditional surveys while adding rich, interpretable qualitative feedback. This work presents a significant methodological advancement with the potential to make early-stage product research faster, more affordable, and more insightful.
This paper presents a robust and well-validated solution to a significant obstacle in the use of LLMs for consumer research. The Semantic Similarity Rating (SSR) method is a clear and substantial improvement over direct numerical prompting, enabling synthetic survey panels to replicate not only the relative ranking of products but also the nuanced distribution of human responses with high fidelity. The study's methodological rigor, particularly its use of a large real-world dataset and the introduction of the noise-aware 'correlation attainment' metric, lends strong credibility to its findings.
The primary conclusion—that the failures of LLMs as survey respondents are artifacts of elicitation methods rather than fundamental model incapacities—is well-supported and represents a key conceptual advance for the field. The practical implications are significant: the SSR framework offers a scalable, cost-effective tool for rapid screening of product concepts, potentially accelerating innovation and democratizing access to consumer insights. The dual output of quantitative ratings and qualitative rationales is a particularly powerful feature, combining the strengths of two traditionally separate research modalities.
However, the study's simulation-based design necessarily limits the scope of its conclusions. While the method successfully reproduces human survey responses, it cannot claim to reproduce actual purchasing behavior in a real-world market. The findings are also contingent on the specific domain (personal care products), where a wealth of relevant discussion likely exists in the LLMs' training data; its applicability to more niche or novel domains remains an open question. Furthermore, the observed differences in performance between LLMs (e.g., the paradoxical effect of removing demographics for Gem-2f) underscore that these methods are not universally 'plug-and-play' and require careful, model-specific validation. Despite these limitations, the paper provides a credible and powerful framework that fundamentally changes the prospects for using LLMs in consumer research.
The abstract excels at establishing a significant real-world problem (costly, biased consumer research), identifying a specific technical barrier in a promising new approach (unrealistic LLM ratings), and presenting a novel, well-defined solution (SSR). This logical flow immediately conveys the paper's relevance and contribution to the reader.
The abstract substantiates its claims with specific, impressive metrics. Citing "90% of human test–retest reliability" and "KS similarity > 0.85" provides concrete evidence of the method's efficacy, elevating the claims beyond qualitative description and demonstrating a rigorous validation process.
The abstract effectively communicates that the SSR method not only solves the primary quantitative problem of unrealistic distributions but also generates a valuable secondary output: "rich qualitative feedback." This dual advantage of preserving traditional metrics while adding new interpretative depth makes the proposed framework highly compelling.
High impact. The abstract refers to "Large language models (LLMs)" generically. Given the rapid evolution in this field, specifying the class or generation of models used (e.g., GPT-4 class, Gemini class) would provide immediate context on the technological baseline and the generalizability of the findings. This detail is crucial for reproducibility and helps readers benchmark the results against the current state-of-the-art.
Implementation: In the first sentence mentioning LLMs, add a brief parenthetical to ground the research. For example, modify "Large language models (LLMs) offer an alternative..." to "Large language models (LLMs), such as those of the GPT-4 class, offer an alternative..." to provide necessary context without adding significant length.
The introduction excels at guiding the reader through a logical funnel, starting with the broad, high-value problem of flawed consumer research, narrowing to the specific technical gap of unrealistic LLM-generated ratings, and clearly positioning the proposed SSR method as the targeted solution. This classic structure makes the paper's purpose and contribution immediately understandable and compelling.
The paper introduces and explains the concept of Semantic Similarity Rating (SSR) within the introduction itself, rather than deferring the core idea to the methods section. This upfront clarity provides the reader with a solid conceptual foundation, allowing them to understand the paper's central mechanism from the outset and better contextualize the results that follow.
The introduction builds significant credibility by signaling a robust validation strategy. It specifies the use of a large, industry-sourced dataset ("57 consumer research surveys") and introduces a bespoke success metric ("correlation attainment") tailored to the problem, demonstrating methodological thoughtfulness. This reassures the reader that the paper's claims will be substantiated by extensive, real-world evidence.
High impact. The concept of "predefined anchor statements" is the linchpin of the SSR method but remains abstract in the introduction. Providing a brief, intuitive example would immediately demystify this core component for the reader. This would ground the technical description in a concrete idea, enhancing comprehension of the proposed mechanism without preempting the detailed explanation in the Methods section (as confirmed in Appendix C.1).
Implementation: In the sentence explaining the SSR projection, add a parenthetical example after the mention of anchor statements. Modify the sentence to read: "...projected onto a 5-point (5pt) Likert scale by computing the cosine similarity of embeddings with those of predefined anchor statements (e.g., reference texts for each point, such as 'I would definitely buy it')."
Medium impact. The introduction effectively implies its objectives, but formally stating them as 2-3 explicit research questions would improve the section's structure and align with scientific reporting conventions. This would provide readers with a clear and direct roadmap, sharpening their focus on the specific hypotheses being tested and making it easier to evaluate the subsequent sections against the paper's stated goals.
Implementation: At the end of the fourth paragraph, after introducing the SSR approach and its novelty, insert a sentence to frame the investigation. For example: "This leads to our primary research questions: (1) Can the SSR method produce more human-like purchase intent distributions from LLMs than direct numerical elicitation? (2) Does this approach enable LLMs to replicate the relative ranking of product concepts found in human surveys?"
The section is structured as a compelling logical funnel. It begins by identifying the most common method (direct numeric elicitation) and its core flaw (narrow distributions), then systematically examines other approaches (textual mapping, demographic conditioning) and demonstrates how they also fail to resolve this fundamental issue. This methodical process effectively carves out a well-defined and necessary research gap that the paper's proposed SSR method is perfectly positioned to address.
The review avoids becoming an exhaustive list of citations. Instead, it remains highly focused on the key themes and limitations directly relevant to the paper's central argument. By concentrating on the specific problems of response distribution and elicitation methods, it efficiently builds a strong, coherent case for the necessity of a new approach without extraneous detail.
High impact. To enhance clarity and reader comprehension, the section would benefit from a comparative table summarizing the different approaches discussed. Such a table would visually synthesize the key attributes of each method (e.g., Direct Numeric, Textual-to-Numeric, Demographic Conditioning), list representative citations, and explicitly state the primary limitation identified in the text. This would provide a powerful, at-a-glance overview that reinforces the paper's argument and the specific research gap it aims to fill.
Implementation: At the end of the section, insert a table with columns such as: 'Approach', 'Description', 'Key Citations', and 'Identified Limitation'. For example, the row for 'Direct Numeric Elicitation' would describe it as 'Prompting LLMs for a single integer rating' and list 'Produces overly narrow, low-variance response distributions' as the limitation.
Medium impact. The section correctly identifies that prior textual methods reduce responses to a single number. However, the argument could be made more forceful by explicitly contrasting this with the paper's own approach. Directly foreshadowing that the proposed SSR method preserves ambiguity by mapping text to a distribution rather than a single point would more sharply delineate the paper's specific methodological innovation from the closest related work, making its novelty more immediately apparent to the reader.
Implementation: At the end of the second paragraph, after the final sentence, add a transitional sentence to sharpen the contrast. For instance: "...they ultimately reduce them back to single numbers, failing to capture the full probabilistic nature of a response that our proposed method is designed to retain."
The 'Correlation Attainment' metric is a significant methodological strength. By establishing a performance ceiling based on human test-retest reliability, the authors account for the inherent noise and narrow distribution of the ground-truth data. This provides a far more rigorous and realistic assessment of the model's capabilities than a simple correlation score would, demonstrating a sophisticated understanding of the problem domain.
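One plausible formulation of this metric, assuming it is the ratio of the synthetic-vs-human correlation to the human test-retest correlation, is sketched below; the exact estimator of the ceiling, and whether Pearson or Spearman correlation is used, are assumptions rather than details taken from the paper.

```python
import numpy as np

def correlation_attainment(synthetic_means, human_means, human_means_retest):
    """Ratio of the synthetic-vs-human correlation to the human
    test-retest correlation, treated as the performance ceiling.
    All inputs are per-concept mean purchase-intent scores.
    (Assumed formulation; Pearson correlation used for illustration.)"""
    rho_synth = np.corrcoef(synthetic_means, human_means)[0, 1]
    rho_ceiling = np.corrcoef(human_means, human_means_retest)[0, 1]
    return rho_synth / rho_ceiling
```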
The methodology is exceptionally clear due to its systematic comparison of three well-defined response generation strategies (DLR, FLR, and SSR). This design effectively isolates the impact of the elicitation technique, allowing the authors to convincingly demonstrate the superiority of their proposed SSR method not just in isolation, but in direct contrast to a simple baseline and a plausible, more complex alternative.
The section provides specific and crucial details about the implementation, which is essential for scientific reproducibility. The authors explicitly name the LLMs (GPT-4o, Gemini-2.0-flash), the embedding model used for SSR (OpenAI’s “text-embedding-3-small”), and the temperature parameter for the reported results (T_LLM = 0.5), giving other researchers a clear blueprint of the experimental setup.
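To make the embedding step concrete, the sketch below shows how vectors from "text-embedding-3-small" can be obtained with the OpenAI Python SDK. This is a generic illustration rather than the authors' pipeline, and it assumes an API key is configured in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Return one embedding vector per input text using the
    embedding model named in the paper."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

# Example: embed a synthetic consumer's free-text answer.
vectors = embed(["I like the scent, but I'd want to see the price first."])
```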
High impact. The Follow-up Likert Rating (FLR) method is a critical baseline for comparison. The paper states that the prompt for this method included examples of sentiment-to-rating mappings. Including a brief, illustrative example of such a mapping directly within the Methods section would significantly improve clarity and reproducibility. This would allow readers to more concretely grasp the mechanism of this baseline and better appreciate the conceptual leap to the probabilistic SSR approach.
Implementation: After the sentence describing the expert prompt, add a parenthetical example. For instance: "...we included examples of what kind of statements can lead to which rating (e.g., 'A statement like `I'm intrigued but would need to know the price` corresponds to a rating of 3')."
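To illustrate what such a few-shot mapping might look like, a hypothetical prompt fragment is sketched below; the variable name, wording, and example mappings are invented for illustration and are not taken from the paper.

```python
# Hypothetical few-shot fragment for the FLR follow-up prompt; the authors'
# actual wording is not reproduced in the review and may differ substantially.
FLR_FOLLOWUP_PROMPT = """You previously wrote the response below about a product concept.
Map it to a single purchase-intent rating on a 1-5 Likert scale.

Examples of how statements map to ratings:
- "I would never use something like this." -> 1
- "I'm intrigued but would need to know the price." -> 3
- "This is exactly what I've been looking for; I'd buy it today." -> 5

Response: {response_text}
Rating (1-5):"""

def build_flr_prompt(response_text: str) -> str:
    """Fill the hypothetical follow-up template with the free-text answer."""
    return FLR_FOLLOWUP_PROMPT.format(response_text=response_text)
```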
Medium impact. The paper specifies that GPT-4o and Gemini-2.0-flash were used for production runs after initial experiments. To further strengthen the methodological rigor, a single sentence explaining the rationale for selecting these particular models would be beneficial. Clarifying whether they were chosen for being state-of-the-art, for their specific multimodal capabilities, or for other reasons would provide important context about the technological baseline of the study and the potential generalizability of the findings.
Implementation: In the paragraph specifying the models used, add a brief clause or sentence clarifying the selection criteria. For example: "We used two models (GPT-4o and Gemini-2.0-flash...), selected for their state-of-the-art performance in language understanding and generation at the time of the experiments..."
Figure 1: Different response generation procedures and SSR response-likelihood mapping.
Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.
The results are presented in a highly effective and logical sequence. The section begins by establishing the clear failure of a simple baseline (DLR), then demonstrates the incremental improvement of an intermediate method (FLR), and culminates in showing the superior performance of the proposed SSR method. This structure makes the contribution and advantages of SSR exceptionally clear and persuasive.
The section makes excellent use of figures to convey complex quantitative comparisons. The combination of scatter plots to show correlation (Fig. 2A, 6A) and overlaid distribution plots to show similarity (Fig. 2B, 3, 7) provides a powerful and intuitive visual summary of the core findings. This allows readers to quickly grasp the performance differences between DLR, FLR, and SSR without needing to parse dense statistical text.
Beyond presenting the primary findings, the authors include several additional experiments that substantially strengthen the paper's claims. The ablation study removing demographics, the comparison against a strong supervised ML baseline (LightGBM), and the test of generalization to a new question ('relevance') all serve as powerful robustness checks. This demonstrates a thorough and rigorous approach to validation.
Medium impact. The core performance metrics for GPT-4o and Gem-2f are distributed across different paragraphs and figures (e.g., Fig. 2 for GPT-4o, Fig. 6 for Gem-2f), requiring the reader to synthesize information from multiple locations. A small, consolidated table within the main text summarizing the key outcomes—correlation attainment (𝜌) and distributional similarity (K_xy) for DLR, FLR, and SSR across both LLMs—would significantly enhance clarity and provide an immediate, at-a-glance comparison of the main experimental results.
Implementation: At the end of section 4.2, insert a new table (e.g., Table 2) that summarizes the main results. The table could have rows for each method (DLR, FLR, SSR) and columns for each model's key metrics (e.g., 'GPT-4o 𝜌', 'GPT-4o K_xy', 'Gem-2f 𝜌', 'Gem-2f K_xy'), pulling the headline numbers from the text and figures into one location.
Medium impact. The text makes an interesting observation that synthetic mean purchase intents are more spread out than human ones, suggesting LLMs may be less prone to positivity bias. This is a potentially significant finding regarding the characteristics of synthetic respondents. However, this point is only mentioned in passing. Visualizing this phenomenon, for instance with a simple box plot comparing the distribution of the 57 mean PI values for humans versus each LLM, would provide direct evidence for this claim and add a valuable dimension to the comparison of synthetic and real data.
Implementation: In section 4.2, after the sentence making this claim, add a reference to a new figure or a new panel in an existing figure (e.g., "(see Fig. X)"). This new visual could consist of three box plots side-by-side, showing the distribution of the 57 mean purchase intent scores for the real human data, the GPT-4o SSR results, and the Gem-2f SSR results, clearly illustrating the difference in variance.
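Should the authors adopt this suggestion, the panel is straightforward to produce; the sketch below assumes three arrays of 57 per-concept mean purchase-intent values (human, GPT-4o SSR, Gem-2f SSR) under hypothetical names, since the underlying data are not reproduced here.

```python
import matplotlib.pyplot as plt

def plot_mean_pi_spread(human_means, gpt4o_means, gem2f_means):
    """Side-by-side box plots comparing the spread of per-concept mean
    purchase intent for human and synthetic (SSR) respondents.
    Inputs are hypothetical arrays of 57 mean scores each."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.boxplot([human_means, gpt4o_means, gem2f_means])
    ax.set_xticks([1, 2, 3])
    ax.set_xticklabels(["Human", "GPT-4o (SSR)", "Gem-2f (SSR)"])
    ax.set_ylabel("Mean purchase intent (5-pt scale)")
    ax.set_title("Spread of per-concept mean purchase intent")
    fig.tight_layout()
    return fig
```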
Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM = 0.5.
Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), and semantic similarity ratings (SSRs).
Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f).
Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM = 0.5.
Figure 7: Comparison of purchase intent distribution similarity between real and synthetic surveys based on Gem-2f with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), semantic similarity ratings (SSRs), and best-set SSRs for an experiment where synthetic consumers were prompted without demographic markers.
Figure 8: Mean purchase intent stratified by respondents' gender and dwelling region (shown are results from the SSR method for both GPT-4o and Gem-2f).