This paper addresses a critical challenge in using Large Language Models (LLMs) for consumer research: their tendency to produce unrealistic response distributions when asked for direct numerical ratings. Traditional consumer surveys are costly and prone to biases, and while LLMs offer a scalable alternative by simulating 'synthetic consumers,' this fundamental flaw has limited their utility. The authors' primary objective was to develop and validate a new method that could elicit more human-like survey responses from LLMs.
The core of the study is the introduction of a novel two-step method called Semantic Similarity Rating (SSR). First, an LLM, prompted to impersonate a consumer with specific demographic traits, generates a free-text response expressing its purchase intent for a given product concept. Second, instead of asking the model to convert this text to a single number, the system transforms the text into a high-dimensional vector (an embedding) that captures its meaning. This vector is then compared, via cosine similarity, to the embeddings of predefined 'anchor statements' representing each point on a 5-point Likert scale. This process yields a full probability distribution across the scale, reflecting the nuance and ambiguity of the original text.
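For concreteness, the mapping step can be sketched as follows. This is a minimal illustration that assumes the response and anchor embeddings have already been computed; the anchor texts and the softmax normalization are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Hypothetical anchor statements, one per point of the 5-point Likert scale
# (illustrative only; the paper's actual anchor texts may differ).
ANCHORS = [
    "I would definitely not buy this product.",   # 1
    "I would probably not buy this product.",     # 2
    "I might or might not buy this product.",     # 3
    "I would probably buy this product.",         # 4
    "I would definitely buy this product.",       # 5
]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ssr_distribution(response_emb, anchor_embs, temperature=0.1):
    """Map a free-text response embedding to a probability distribution
    over the 5-point scale via its similarity to the anchor embeddings.
    The temperature-scaled softmax is an assumed normalization."""
    sims = np.array([cosine(response_emb, a) for a in anchor_embs])
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```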
The authors rigorously tested the SSR method against simpler baselines using an extensive dataset of 57 real-world consumer surveys on personal care products, involving 9,300 human participants. The results were compelling: the SSR method reproduced the relative ranking of products, attaining over 90% of human test-retest reliability. Critically, it also generated response distributions that were highly similar to human patterns, achieving a Kolmogorov-Smirnov (KS) similarity score of over 0.85, whereas the direct rating baseline failed dramatically on this metric (KS similarity ≈ 0.26). The study also demonstrated that providing demographic context to the LLMs was essential for achieving this high level of performance.
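For readers unfamiliar with the distributional metric, one plausible reading of "KS similarity" is one minus the Kolmogorov-Smirnov distance between the human and synthetic Likert histograms; a minimal sketch under that assumption follows, though the paper's exact definition may differ.

```python
import numpy as np

def ks_similarity(p_human, p_synth):
    """1 - KS distance between two discrete Likert distributions,
    where the KS distance is the maximum absolute difference between
    their cumulative distribution functions.
    (Assumed reading of the paper's 'KS similarity' metric.)"""
    p_human = np.asarray(p_human, dtype=float) / np.sum(p_human)
    p_synth = np.asarray(p_synth, dtype=float) / np.sum(p_synth)
    ks_distance = np.max(np.abs(np.cumsum(p_human) - np.cumsum(p_synth)))
    return 1.0 - ks_distance
```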
The paper concludes that the shortcomings of LLMs in survey simulation are not inherent limitations of the models but rather artifacts of flawed elicitation techniques. The SSR framework offers a robust, training-free solution that preserves the quantitative metrics of traditional surveys while adding rich, interpretable qualitative feedback. This work presents a significant methodological advancement with the potential to make early-stage product research faster, more affordable, and more insightful.
This paper presents a robust and well-validated solution to a significant obstacle in the use of LLMs for consumer research. The Semantic Similarity Rating (SSR) method is a clear and substantial improvement over direct numerical prompting, enabling synthetic survey panels to replicate not only the relative ranking of products but also the nuanced distribution of human responses with high fidelity. The study's methodological rigor, particularly its use of a large real-world dataset and the introduction of the noise-aware 'correlation attainment' metric, lends strong credibility to its findings.
The primary conclusion—that the failures of LLMs as survey respondents are artifacts of elicitation methods rather than fundamental model incapacities—is well-supported and represents a key conceptual advance for the field. The practical implications are significant: the SSR framework offers a scalable, cost-effective tool for rapid screening of product concepts, potentially accelerating innovation and democratizing access to consumer insights. The dual output of quantitative ratings and qualitative rationales is a particularly powerful feature, combining the strengths of two traditionally separate research modalities.
However, the study's simulation-based design necessarily limits the scope of its conclusions. While the method successfully reproduces human survey responses, it cannot claim to reproduce actual purchasing behavior in a real-world market. The findings are also contingent on the specific domain (personal care products), where a wealth of relevant discussion likely exists in the LLMs' training data; its applicability to more niche or novel domains remains an open question. Furthermore, the observed differences in performance between LLMs (e.g., the paradoxical effect of removing demographics for Gem-2f) underscore that these methods are not universally 'plug-and-play' and require careful, model-specific validation. Despite these limitations, the paper provides a credible and powerful framework that fundamentally changes the prospects for using LLMs in consumer research.
The abstract excels at establishing a significant real-world problem (costly, biased consumer research), identifying a specific technical barrier in a promising new approach (unrealistic LLM ratings), and presenting a novel, well-defined solution (SSR). This logical flow immediately conveys the paper's relevance and contribution to the reader.
The abstract substantiates its claims with specific, impressive metrics. Citing "90% of human test–retest reliability" and "KS similarity > 0.85" provides concrete evidence of the method's efficacy, elevating the claims beyond qualitative description and demonstrating a rigorous validation process.
The abstract effectively communicates that the SSR method not only solves the primary quantitative problem of unrealistic distributions but also generates a valuable secondary output: "rich qualitative feedback." This dual advantage of preserving traditional metrics while adding new interpretative depth makes the proposed framework highly compelling.
High impact. The abstract refers to "Large language models (LLMs)" generically. Given the rapid evolution in this field, specifying the class or generation of models used (e.g., GPT-4 class, Gemini class) would provide immediate context on the technological baseline and the generalizability of the findings. This detail is crucial for reproducibility and helps readers benchmark the results against the current state-of-the-art.
Implementation: In the first sentence mentioning LLMs, add a brief parenthetical to ground the research. For example, modify "Large language models (LLMs) offer an alternative..." to "Large language models (LLMs), such as those of the GPT-4 class, offer an alternative..." to provide necessary context without adding significant length.
The introduction excels at guiding the reader through a logical funnel, starting with the broad, high-value problem of flawed consumer research, narrowing to the specific technical gap of unrealistic LLM-generated ratings, and clearly positioning the proposed SSR method as the targeted solution. This classic structure makes the paper's purpose and contribution immediately understandable and compelling.
The paper introduces and explains the concept of Semantic Similarity Rating (SSR) within the introduction itself, rather than deferring the core idea to the methods section. This upfront clarity provides the reader with a solid conceptual foundation, allowing them to understand the paper's central mechanism from the outset and better contextualize the results that follow.
The introduction builds significant credibility by signaling a robust validation strategy. It specifies the use of a large, industry-sourced dataset ("57 consumer research surveys") and introduces a bespoke success metric ("correlation attainment") tailored to the problem, demonstrating methodological thoughtfulness. This reassures the reader that the paper's claims will be substantiated by extensive, real-world evidence.
High impact. The concept of "predefined anchor statements" is the linchpin of the SSR method but remains abstract in the introduction. Providing a brief, intuitive example would immediately demystify this core component for the reader. This would ground the technical description in a concrete idea, enhancing comprehension of the proposed mechanism without preempting the detailed explanation in the Methods section (as confirmed in Appendix C.1).
Implementation: In the sentence explaining the SSR projection, add a parenthetical example after the mention of anchor statements. Modify the sentence to read: "...projected onto a 5-point (5pt) Likert scale by computing the cosine similarity of embeddings with those of predefined anchor statements (e.g., reference texts for each point, such as 'I would definitely buy it')."
Medium impact. The introduction effectively implies its objectives, but formally stating them as 2-3 explicit research questions would improve the section's structure and align with scientific reporting conventions. This would provide readers with a clear and direct roadmap, sharpening their focus on the specific hypotheses being tested and making it easier to evaluate the subsequent sections against the paper's stated goals.
Implementation: At the end of the fourth paragraph, after introducing the SSR approach and its novelty, insert a sentence to frame the investigation. For example: "This leads to our primary research questions: (1) Can the SSR method produce more human-like purchase intent distributions from LLMs than direct numerical elicitation? (2) Does this approach enable LLMs to replicate the relative ranking of product concepts found in human surveys?"
The section is structured as a compelling logical funnel. It begins by identifying the most common method (direct numeric elicitation) and its core flaw (narrow distributions), then systematically examines other approaches (textual mapping, demographic conditioning) and demonstrates how they also fail to resolve this fundamental issue. This methodical process effectively carves out a well-defined and necessary research gap that the paper's proposed SSR method is perfectly positioned to address.
The review avoids becoming an exhaustive list of citations. Instead, it remains highly focused on the key themes and limitations directly relevant to the paper's central argument. By concentrating on the specific problems of response distribution and elicitation methods, it efficiently builds a strong, coherent case for the necessity of a new approach without extraneous detail.
High impact. To enhance clarity and reader comprehension, the section would benefit from a comparative table summarizing the different approaches discussed. Such a table would visually synthesize the key attributes of each method (e.g., Direct Numeric, Textual-to-Numeric, Demographic Conditioning), list representative citations, and explicitly state the primary limitation identified in the text. This would provide a powerful, at-a-glance overview that reinforces the paper's argument and the specific research gap it aims to fill.
Implementation: At the end of the section, insert a table with columns such as: 'Approach', 'Description', 'Key Citations', and 'Identified Limitation'. For example, the row for 'Direct Numeric Elicitation' would describe it as 'Prompting LLMs for a single integer rating' and list 'Produces overly narrow, low-variance response distributions' as the limitation.
Medium impact. The section correctly identifies that prior textual methods reduce responses to a single number. However, the argument could be made more forceful by explicitly contrasting this with the paper's own approach. Directly foreshadowing that the proposed SSR method preserves ambiguity by mapping text to a distribution rather than a single point would more sharply delineate the paper's specific methodological innovation from the closest related work, making its novelty more immediately apparent to the reader.
Implementation: At the end of the second paragraph, after the final sentence, add a transitional sentence to sharpen the contrast. For instance: "...they ultimately reduce them back to single numbers, failing to capture the full probabilistic nature of a response that our proposed method is designed to retain."
The 'Correlation Attainment' metric is a significant methodological strength. By establishing a performance ceiling based on human test-retest reliability, the authors account for the inherent noise and narrow distribution of the ground-truth data. This provides a far more rigorous and realistic assessment of the model's capabilities than a simple correlation score would, demonstrating a sophisticated understanding of the problem domain.
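One plausible formulation of this metric, assuming it is the ratio of the synthetic-vs-human correlation to the human test-retest correlation, is sketched below; the exact estimator of the ceiling, and whether Pearson or Spearman correlation is used, are assumptions rather than details taken from the paper.

```python
import numpy as np

def correlation_attainment(synthetic_means, human_means, human_means_retest):
    """Ratio of the synthetic-vs-human correlation to the human
    test-retest correlation, treated as the performance ceiling.
    All inputs are per-concept mean purchase-intent scores.
    (Assumed formulation; Pearson correlation used for illustration.)"""
    rho_synth = np.corrcoef(synthetic_means, human_means)[0, 1]
    rho_ceiling = np.corrcoef(human_means, human_means_retest)[0, 1]
    return rho_synth / rho_ceiling
```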
The methodology is exceptionally clear due to its systematic comparison of three well-defined response generation strategies (DLR, FLR, and SSR). This design effectively isolates the impact of the elicitation technique, allowing the authors to convincingly demonstrate the superiority of their proposed SSR method not just in isolation, but in direct contrast to a simple baseline and a plausible, more complex alternative.
The section provides specific and crucial details about the implementation, which is essential for scientific reproducibility. The authors explicitly name the LLMs (GPT-4o, Gemini-2.0-flash), the embedding model used for SSR (OpenAI’s “text-embedding-3-small”), and the temperature parameter for the reported results (T_LLM = 0.5), giving other researchers a clear blueprint of the experimental setup.
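To make the embedding step concrete, the sketch below shows how vectors from "text-embedding-3-small" can be obtained with the OpenAI Python SDK. This is a generic illustration rather than the authors' pipeline, and it assumes an API key is configured in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Return one embedding vector per input text using the
    embedding model named in the paper."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

# Example: embed a synthetic consumer's free-text answer.
vectors = embed(["I like the scent, but I'd want to see the price first."])
```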
High impact. The Follow-up Likert Rating (FLR) method is a critical baseline for comparison. The paper states that the prompt for this method included examples of sentiment-to-rating mappings. Including a brief, illustrative example of such a mapping directly within the Methods section would significantly improve clarity and reproducibility. This would allow readers to more concretely grasp the mechanism of this baseline and better appreciate the conceptual leap to the probabilistic SSR approach.
Implementation: After the sentence describing the expert prompt, add a parenthetical example. For instance: "...we included examples of what kind of statements can lead to which rating (e.g., 'A statement like `I'm intrigued but would need to know the price` corresponds to a rating of 3')."
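To illustrate what such a few-shot mapping might look like, a hypothetical prompt fragment is sketched below; the variable name, wording, and example mappings are invented for illustration and are not taken from the paper.

```python
# Hypothetical few-shot fragment for the FLR follow-up prompt; the authors'
# actual wording is not reproduced in the review and may differ substantially.
FLR_FOLLOWUP_PROMPT = """You previously wrote the response below about a product concept.
Map it to a single purchase-intent rating on a 1-5 Likert scale.

Examples of how statements map to ratings:
- "I would never use something like this." -> 1
- "I'm intrigued but would need to know the price." -> 3
- "This is exactly what I've been looking for; I'd buy it today." -> 5

Response: {response_text}
Rating (1-5):"""

def build_flr_prompt(response_text: str) -> str:
    """Fill the hypothetical follow-up template with the free-text answer."""
    return FLR_FOLLOWUP_PROMPT.format(response_text=response_text)
```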
Medium impact. The paper specifies that GPT-4o and Gemini-2.0-flash were used for production runs after initial experiments. To further strengthen the methodological rigor, a single sentence explaining the rationale for selecting these particular models would be beneficial. Clarifying whether they were chosen for being state-of-the-art, for their specific multimodal capabilities, or for other reasons would provide important context about the technological baseline of the study and the potential generalizability of the findings.
Implementation: In the paragraph specifying the models used, add a brief clause or sentence clarifying the selection criteria. For example: "We used two models (GPT-4o and Gemini-2.0-flash...), selected for their state-of-the-art performance in language understanding and generation at the time of the experiments..."
Figure 1: Different response generation procedures and SSR response-likelihood mapping.
Figure 5: A surrogate product concept similar to those used in the 57 concept surveys.
The results are presented in a highly effective and logical sequence. The section begins by establishing the clear failure of a simple baseline (DLR), then demonstrates the incremental improvement of an intermediate method (FLR), and culminates in showing the superior performance of the proposed SSR method. This structure makes the contribution and advantages of SSR exceptionally clear and persuasive.
The section makes excellent use of figures to convey complex quantitative comparisons. The combination of scatter plots to show correlation (Fig. 2A, 6A) and overlaid distribution plots to show similarity (Fig. 2B, 3, 7) provides a powerful and intuitive visual summary of the core findings. This allows readers to quickly grasp the performance differences between DLR, FLR, and SSR without needing to parse dense statistical text.
Beyond presenting the primary findings, the authors include several additional experiments that substantially strengthen the paper's claims. The ablation study removing demographics, the comparison against a strong supervised ML baseline (LightGBM), and the test of generalization to a new question ('relevance') all serve as powerful robustness checks. This demonstrates a thorough and rigorous approach to validation.
Medium impact. The core performance metrics for GPT-4o and Gem-2f are distributed across different paragraphs and figures (e.g., Fig. 2 for GPT-4o, Fig. 6 for Gem-2f), requiring the reader to synthesize information from multiple locations. A small, consolidated table within the main text summarizing the key outcomes—correlation attainment (𝜌) and distributional similarity (K_xy) for DLR, FLR, and SSR across both LLMs—would significantly enhance clarity and provide an immediate, at-a-glance comparison of the main experimental results.
Implementation: At the end of section 4.2, insert a new table (e.g., Table 2) that summarizes the main results. The table could have rows for each method (DLR, FLR, SSR) and columns for each model's key metrics (e.g., 'GPT-4o 𝜌', 'GPT-4o K_xy', 'Gem-2f 𝜌', 'Gem-2f K_xy'), pulling the headline numbers from the text and figures into one location.
Medium impact. The text makes an interesting observation that synthetic mean purchase intents are more spread out than human ones, suggesting LLMs may be less prone to positivity bias. This is a potentially significant finding regarding the characteristics of synthetic respondents. However, this point is only mentioned in passing. Visualizing this phenomenon, for instance with a simple box plot comparing the distribution of the 57 mean PI values for humans versus each LLM, would provide direct evidence for this claim and add a valuable dimension to the comparison of synthetic and real data.
Implementation: In section 4.2, after the sentence making this claim, add a reference to a new figure or a new panel in an existing figure (e.g., "(see Fig. X)"). This new visual could consist of three box plots side-by-side, showing the distribution of the 57 mean purchase intent scores for the real human data, the GPT-4o SSR results, and the Gem-2f SSR results, clearly illustrating the difference in variance.
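Should the authors adopt this suggestion, the panel is straightforward to produce; the sketch below assumes three arrays of 57 per-concept mean purchase-intent values (human, GPT-4o SSR, Gem-2f SSR) under hypothetical names, since the underlying data are not reproduced here.

```python
import matplotlib.pyplot as plt

def plot_mean_pi_spread(human_means, gpt4o_means, gem2f_means):
    """Side-by-side box plots comparing the spread of per-concept mean
    purchase intent for human and synthetic (SSR) respondents.
    Inputs are hypothetical arrays of 57 mean scores each."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.boxplot([human_means, gpt4o_means, gem2f_means])
    ax.set_xticks([1, 2, 3])
    ax.set_xticklabels(["Human", "GPT-4o (SSR)", "Gem-2f (SSR)"])
    ax.set_ylabel("Mean purchase intent (5-pt scale)")
    ax.set_title("Spread of per-concept mean purchase intent")
    fig.tight_layout()
    return fig
```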
Figure 2: Comparison of real and synthetic surveys based on GPT-4o with T_LLM = 0.5.
Figure 3: Comparison of purchase intent distribution similarity between real and synthetic surveys based on GPT-4o with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), and semantic similarity ratings (SSRs).
Figure 4: Mean purchase intent stratified by five demographic and product features (shown are results from the SSR method for both GPT-4o and Gem-2f).
Figure 6: Comparison of real and synthetic surveys based on Gem-2f with T_LLM = 0.5.
Figure 7: Comparison of purchase intent distribution similarity between real and synthetic surveys based on Gem-2f with T_LLM = 0.5 for direct Likert ratings (DLRs), textual elicitation with follow-up Likert ratings (FLRs), semantic similarity ratings (SSRs), and best-set SSRs for an experiment where synthetic consumers were prompted without demographic markers.
Figure 8: Mean purchase intent stratified by respondents' gender and dwelling region (shown are results from the SSR method for both GPT-4o and Gem-2f).