When ELIZA meets therapists: A Turing test for the heart and mind

S. Gabe Hatch, Zachary T. Goodman, Laura Vowels, H. Dorian Hatch, Alyssa L. Brown, Shayna Guttman, Yunying Le, Benjamin Bailey, Russell J. Bailey, Charlotte R. Esplin, Steven M. Harris, D. Payton Holt, Jr., Merranda McLaughlin, Patrick O'Connell, Karen Rothman, Lane Ritchie, D. Nicholas Top, Jr., Scott R. Braithwaite
PLOS Mental Health
Hatch Data and Mental Health, Orem, Utah, United States of America

Overall Summary

Study Background and Main Findings

This study investigated whether ChatGPT could generate responses to couple therapy vignettes that were indistinguishable from those of human therapists, and whether those responses aligned with the common factors of therapy. Participants (N = 830) were unable to reliably distinguish therapist-written from ChatGPT-generated responses. ChatGPT responses were rated substantially higher on the common factors (d = 1.63, 95% CI [1.49, 1.78]). Linguistic analysis nonetheless revealed measurable differences between the two sources, with ChatGPT using more positive sentiment. Post-hoc analyses also revealed an attribution bias: responses believed to be from therapists were rated more favorably, regardless of actual authorship.

Research Impact and Future Directions

The study provides compelling evidence that ChatGPT can generate responses to couple therapy vignettes that are indistinguishable from those written by experienced therapists, and that are even rated higher on key therapeutic factors. However, it is crucial to distinguish rated response quality from therapeutic outcome. The experimental design, comparing ChatGPT to therapists, supports causal inferences about the relative quality of single written responses in this specific context, but it does not show that ChatGPT produces better therapeutic outcomes, nor does it speak to the model's effectiveness as a therapist in real-world settings.

The practical utility of these findings is significant, suggesting that GenAI could potentially play a valuable role in expanding access to mental health resources, particularly in areas like web-based relationship interventions. The study places these findings within the context of existing research on AI in mental health, referencing Turing's work and previous studies on human-machine indistinguishability. However, the study could benefit from a more thorough discussion of how its findings specifically advance the broader field and address existing gaps in the literature.

While the study offers promising insights, it also acknowledges key uncertainties. The authors appropriately caution against overgeneralizing the findings, emphasizing the limitations of using vignettes and the need for further research in real-world therapeutic settings. They provide clear guidance for future research, suggesting the need to investigate the long-term effects of GenAI-based interventions and to explore the potential for bias in AI models. The authors also highlight the importance of ethical considerations, such as data privacy and informed consent.

The study raises critical unanswered questions, particularly regarding the long-term effectiveness and safety of using GenAI in therapy. While the methodological limitations, such as the use of vignettes and a brief outcome measure, are acknowledged, their potential impact on the conclusions could be discussed in more detail. Specifically, the study could benefit from a more thorough analysis of whether these limitations fundamentally affect the interpretation of the findings. For example, how might the results differ in a real-world therapy setting, where the interaction is dynamic and ongoing, rather than a single exchange of written responses?

Critical Analysis and Recommendations

Indistinguishability of Responses (written-content)
Participants were unable to reliably distinguish between therapist-written and ChatGPT-generated responses. This supports the growing body of evidence suggesting AI can mimic human communication, raising questions about the future role of AI in traditionally human-dominated fields.
Section: Results
Superior ChatGPT Ratings (written-content)
ChatGPT-generated responses were rated *higher* on therapeutic common factors than therapist-written responses (d = 1.63, 95% CI [1.49, 1.78]). This suggests that, in this specific context, AI may be capable of generating text that aligns more closely with established therapeutic principles than that of human experts.
Section: Results
Preregistration and Open Science (written-content)
The study was preregistered and the authors provided links to data, code, and materials. This enhances transparency and reproducibility, allowing other researchers to verify and build upon the findings.
Section: Method
Bayesian Statistical Approach (written-content)
The study used Bayesian statistical methods, focusing on effect sizes and credible intervals rather than null hypothesis significance testing. This provides a more nuanced understanding of the results, since credible intervals support direct probability statements about effects (a toy illustration appears after this item).
Section: Method
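
To make this concrete, here is a toy sketch of how posterior draws yield a credible interval and a direct probability statement for a standardized mean difference. The ratings are simulated and all numbers (group means, spread, sample sizes) are invented for illustration; the study's actual model is richer than this flat-prior approximation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated common-factor ratings on a 1-5 scale -- illustrative, not the study's data.
chatgpt   = rng.normal(4.2, 0.6, 400).clip(1, 5)
therapist = rng.normal(3.6, 0.6, 400).clip(1, 5)

def posterior_mean_draws(x, n_draws=50_000):
    """Approximate flat-prior posterior of a group mean: N(xbar, s^2/n)."""
    return rng.normal(x.mean(), x.std(ddof=1) / np.sqrt(len(x)), n_draws)

# Posterior of the mean difference, standardized by a plug-in pooled SD.
diff = posterior_mean_draws(chatgpt) - posterior_mean_draws(therapist)
d = diff / np.sqrt((chatgpt.var(ddof=1) + therapist.var(ddof=1)) / 2)

lo, hi = np.percentile(d, [2.5, 97.5])
print(f"posterior mean d = {d.mean():.2f}, 95% CrI = [{lo:.2f}, {hi:.2f}]")
print(f"P(d > 0) = {(d > 0).mean():.3f}")  # direct probability statement, unlike a p-value
```

The last line illustrates what this approach buys over significance testing: a direct statement of how probable the effect is, given the data and priors.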
Limited Generalizability (written-content)
The study only examined responses to *vignettes*, not real therapy sessions. This limits the generalizability of the findings to actual therapeutic practice.
Section: Discussion
Vignette Content Underspecified (written-content)
The study did not include a detailed description of the *content* of the vignettes. Providing examples or a more thorough characterization of the vignettes would allow readers to better assess the relevance and applicability of the findings.
Section: Method
Insufficient Ethical Discussion (written-content)
The study did not deeply explore the ethical considerations of using GenAI in therapy. A more thorough discussion of issues like data privacy, informed consent, and potential bias is needed.
Section: Discussion
Research Questions Not Explicitly Stated (written-content)
The study did not explicitly state the research questions or aims within the Introduction section itself. Including a clear statement of the research questions would improve clarity and provide a stronger link between the background and the study's objectives.
Section: Introduction

Section Analysis

Method

Non-Text Elements

Table 1. Engineered prompt.
First Reference in Text
Patterns of prompt engineering (not fine-tuning) were used to sculpt responses from the GenAI models (see Table 1) [33].
Description
  • Key aspect of what is shown: Table 1 presents the engineered prompt used to guide the ChatGPT 4.0 model in generating responses to couple therapy vignettes. The prompt is designed to simulate the behavior of a couple therapist and to optimize responses with respect to five common factors of therapy: therapeutic alliance (establishing a collaborative relationship), empathy (recognizing and validating emotions), professionalism (adhering to ethical codes), cultural competence (respecting cultural differences), and therapeutic technique and efficacy (relying on established therapeutic forms). The prompt also includes an example vignette to guide the model's responses.
Scientific Validity
  • Representation of Therapeutic Principles: The engineered prompt is a crucial part of the methodology, as it dictates how the GenAI model will respond. The validity of the study depends on the prompt's ability to accurately represent the key principles of couple therapy. The prompt incorporates established therapeutic factors, but the extent to which it captures the nuances of actual therapeutic practice is open to interpretation.
  • Prompt Engineering Approach: The authors relied on prompt engineering rather than fine-tuning, which may limit the model's ability to adapt to specific clinical scenarios. Avoiding fine-tuning is justified as a way to evaluate ChatGPT's potential with minimal intervention, but it also makes the results heavily dependent on the exact wording of the prompt (a minimal sketch of a prompt-engineered request appears below).
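
For readers unfamiliar with the distinction, the following is a minimal sketch of a prompt-engineered request to a GPT-4-class model via the OpenAI Python SDK. The system prompt is paraphrased from the common factors described above, not copied from the study; the model name and vignette text are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative system prompt -- paraphrased, not the study's exact wording.
system_prompt = (
    "You are a couple therapist. When responding to a vignette, optimize for "
    "the five common factors of therapy: therapeutic alliance, empathy, "
    "professionalism, cultural competence, and therapeutic technique and efficacy."
)

vignette = "My partner and I keep having the same argument about money..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4",  # the study used ChatGPT 4.0; this endpoint name is an assumption
    messages=[
        {"role": "system", "content": system_prompt},  # behavior shaped by prompt alone
        {"role": "user", "content": vignette},         # no fine-tuning of model weights
    ],
)
print(response.choices[0].message.content)
```

The key point is that all therapeutic behavior is induced at inference time through the system message; the model's weights are untouched, which is what makes the approach reproducible but also sensitive to prompt wording.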
Communication
  • Placement and Relevance: The table is referenced in the main text, indicating its importance. Providing the full prompt in a table enhances transparency and allows other researchers to replicate the methodology. However, the table would be more effective if it were placed closer to where it is first referenced in the text.
  • Clarity of Structure: The prompt's structure, with numbered instructions followed by an example, is easy to follow. However, the note at the end regarding the accuracy of information about the number of participating therapists could be confusing. It should be rephrased for clarity.

Results

Non-Text Elements

Table 2. Posterior probabilities of attributional accuracy.
First Reference in Text
Aim 1 examined if participants could accurately identify whether responses were written by therapists or ChatGPT. Overall, participants performed poorly in accurate identification regardless of the author. Identification within authors was poor with participants correctly guessing that the therapist was the author 56.1% of the time and participants correctly guessing that ChatGPT was the author 51.2% of the time. Between authors, participants were only able to correctly identify therapists 5% more often than ChatGPT (56.1% versus 51.2%, respectively). Although this difference was statistically reliable, accurate identification within groups was only marginally better than chance and accurate identification in the between group comparison was close to zero (i.e., 5%; see Table 2).
Description
  • Key aspect of what is shown: Table 2 presents the posterior probabilities of participants correctly attributing responses to either a human therapist or ChatGPT across the 18 couple therapy vignettes. For each vignette, the table shows \( \Phi_1 \), the posterior probability of correctly identifying a therapist's response as written by a therapist, and \( \Phi_2 \), the posterior probability of correctly identifying a ChatGPT response as written by ChatGPT, along with \( \sigma_{\Phi_1} \) and \( \sigma_{\Phi_2} \), the standard deviations of the corresponding posterior distributions. The difference between the posterior probabilities (\( \Delta\Phi_{1,2} \)) is also shown with its 95% credible interval (CI). Finally, the table includes Cohen's d, with its 95% credible interval, as a measure of the magnitude of the difference in attributional accuracy between therapist and ChatGPT responses.
Scientific Validity
  • Bayesian Analysis: The use of posterior probabilities and credible intervals is appropriate for Bayesian analysis, providing a measure of uncertainty around the estimates. However, the choice of prior distributions can influence the results, so a sensitivity analysis assessing the impact of different priors would strengthen the validity of the findings (a minimal illustration appears after this list).
  • Effect Size Measure: Cohen's d is a suitable effect size measure for quantifying the magnitude of the difference between attributional accuracy for therapist and ChatGPT responses. The inclusion of 95% credible intervals for Cohen's d provides a range of plausible values for the effect size. However, it's important to note that Cohen's d is a standardized effect size, and its interpretation depends on the context of the study.
  • Vignette Representativeness: The table presents results for each vignette, allowing for a detailed examination of the variability in attributional accuracy across different scenarios. However, it is important to consider whether the vignettes are representative of real-world couple therapy situations. If the vignettes are not representative, the generalizability of the findings may be limited.
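
To illustrate the prior-sensitivity point, here is a minimal sketch assuming a simple Beta-Binomial model of per-author identification accuracy. The counts are hypothetical, scaled to roughly match the reported 56.1% and 51.2% rates; the study's actual model is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 830               # hypothetical number of judgments per author
k_therapist = 466     # ~56.1% correct "therapist" identifications
k_chatgpt = 425       # ~51.2% correct "ChatGPT" identifications

# Compare several Beta(a, b) priors to see how much they move the posterior.
for a, b in [(1, 1), (0.5, 0.5), (2, 2)]:
    phi1 = rng.beta(a + k_therapist, b + n - k_therapist, 50_000)
    phi2 = rng.beta(a + k_chatgpt, b + n - k_chatgpt, 50_000)
    delta = phi1 - phi2                     # posterior of the accuracy gap
    lo, hi = np.percentile(delta, [2.5, 97.5])
    print(f"Beta({a},{b}) prior: mean diff = {delta.mean():.3f}, "
          f"95% CrI = [{lo:.3f}, {hi:.3f}]")
```

With samples of this size the three priors should yield nearly identical intervals, which is exactly the kind of evidence a sensitivity analysis would report.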
Communication
  • Clarity and Organization: The table effectively summarizes the results of the Turing test for each vignette, allowing a clear comparison of attributional accuracy between therapist and ChatGPT responses. It could be improved by explicitly flagging, for each vignette, whether the 95% credible interval for the difference excludes zero (e.g., with an asterisk).
  • Interpretation of Effect Sizes: While the table includes effect sizes (Cohen's d) and credible intervals, a brief interpretation of these values in the caption or a footnote would help readers quickly grasp the practical significance of the findings.
Table 3. Posterior predictions of common factor ratings drawn from a multilevel model including vignette, author, attribution, and an author-by-attribution interaction.
First Reference in Text
A stark attribution bias was observed (see Table 2).
Description
  • Key aspect of what is shown: Table 3 presents the posterior predictions of common factor ratings, which are numerical assessments of key therapeutic elements like alliance and empathy, derived from a multilevel model. The model accounts for the specific vignette, the author of the response (human therapist or ChatGPT), the participant's attribution (whether they believed it was written by a therapist or by ChatGPT), and the author-by-attribution interaction. For each vignette, the table reports the mean (\( \mu \)) and standard deviation (\( \sigma \)) of the posterior distribution of the common factor ratings in each of the four author-by-attribution cells: therapist-authored responses attributed to a therapist, ChatGPT-authored responses attributed to ChatGPT, ChatGPT-authored responses attributed to a therapist, and therapist-authored responses attributed to ChatGPT.
Scientific Validity
  • Multilevel Modeling: The use of a multilevel model is appropriate for analyzing data with hierarchical structure (vignettes, authors, attributions). However, the validity of the results depends on the assumptions of the model being met. A thorough assessment of model fit and diagnostics is crucial.
  • Interaction Term: The inclusion of an author-by-attribution interaction is valuable for investigating attribution biases. However, it is important to consider whether this interaction is theoretically justified and whether it meaningfully improves model fit; comparing models with and without the interaction term would strengthen the analysis (one plausible specification is sketched after this list).
  • Posterior Predictions: The interpretation of posterior predictions should be cautious, as they are conditional on the model and the priors used. A sensitivity analysis to assess the impact of different prior distributions would enhance the robustness of the findings.
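
For concreteness, one plausible specification consistent with the caption is sketched below. The study's exact parameterization, priors, and coding of the predictors may differ, so treat this as an assumption-laden illustration:

\[
y_{ij} = \beta_0 + u_j + \beta_1\,\mathrm{Author}_{ij} + \beta_2\,\mathrm{Attr}_{ij} + \beta_3\,(\mathrm{Author}_{ij} \times \mathrm{Attr}_{ij}) + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \tau^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2),
\]

where \( y_{ij} \) is the common factor rating for response \( i \) nested in vignette \( j \), \( u_j \) is a vignette-level random intercept, and \( \beta_3 \) captures the attribution bias: the degree to which the effect of believed authorship differs by actual authorship. Comparing this model against the constrained model with \( \beta_3 = 0 \) (e.g., via approximate leave-one-out cross-validation or Bayes factors) would directly address whether the interaction improves fit.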
Communication
  • Data Overload: The table provides a detailed breakdown of common factor ratings based on the multilevel model. However, the sheer volume of data presented in the table could be overwhelming for readers. Consider visually highlighting key findings or patterns within the table to improve readability.
  • Caption Clarity: The table caption is descriptive but could be more concise. Consider rephrasing the caption to emphasize the key finding of an attribution bias and to more clearly indicate the purpose of the table.
