When ELIZA meets therapists: A Turing test for the heart and mind

S. Gabe Hatch, Zachary T. Goodman, Laura Vowels, H. Dorian Hatch, Alyssa L. Brown, Shayna Guttman, Yunying Le, Benjamin Bailey, Russell J. Bailey, Charlotte R. Esplin, Steven M. Harris, D. Payton Holt, Jr., Merranda McLaughlin, Patrick O'Connell, Karen Rothman, Lane Ritchie, D. Nicholas Top, Jr., Scott R. Braithwaite
PLOS Mental Health
Hatch Data and Mental Health, Orem, Utah, United States of America

Overall Summary

Study Background and Main Findings

This study investigated whether ChatGPT could generate responses to couple therapy vignettes that were indistinguishable from those of human therapists, and whether those responses aligned with the common factors of therapy. Participants (N = 830) were unable to reliably distinguish therapist-written from ChatGPT-generated responses. ChatGPT responses were rated substantially higher on the common factors (d = 1.63, 95% CI [1.49, 1.78]). Linguistic analysis nonetheless revealed measurable differences between the two sources, with ChatGPT using more positive sentiment. Post-hoc analyses also revealed an attribution bias: responses believed to be from therapists were rated more favorably, regardless of actual authorship.

Research Impact and Future Directions

The study provides compelling evidence that ChatGPT can generate responses to couple therapy vignettes that are indistinguishable from those written by experienced therapists, and that are even rated higher on key therapeutic factors. However, it is crucial to distinguish rated response quality from therapeutic outcome. The experimental design, comparing ChatGPT to therapists, supports causal inferences about the relative quality of single written responses in this specific context, but it does not show that ChatGPT produces better therapeutic outcomes, nor does it speak to the model's effectiveness as a therapist in real-world settings.

The practical utility of these findings is significant, suggesting that GenAI could potentially play a valuable role in expanding access to mental health resources, particularly in areas like web-based relationship interventions. The study places these findings within the context of existing research on AI in mental health, referencing Turing's work and previous studies on human-machine indistinguishability. However, the study could benefit from a more thorough discussion of how its findings specifically advance the broader field and address existing gaps in the literature.

While the study offers promising insights, it also acknowledges key uncertainties. The authors appropriately caution against overgeneralizing the findings, emphasizing the limitations of using vignettes and the need for further research in real-world therapeutic settings. They provide clear guidance for future research, suggesting the need to investigate the long-term effects of GenAI-based interventions and to explore the potential for bias in AI models. The authors also highlight the importance of ethical considerations, such as data privacy and informed consent.

The study raises critical unanswered questions, particularly regarding the long-term effectiveness and safety of using GenAI in therapy. While the methodological limitations, such as the use of vignettes and a brief outcome measure, are acknowledged, their potential impact on the conclusions could be discussed in more detail. Specifically, the study could benefit from a more thorough analysis of whether these limitations fundamentally affect the interpretation of the findings. For example, how might the results differ in a real-world therapy setting, where the interaction is dynamic and ongoing, rather than a single exchange of written responses?

Critical Analysis and Recommendations

Indistinguishability of Responses (written-content)
Participants were unable to reliably distinguish between therapist-written and ChatGPT-generated responses. This supports the growing body of evidence suggesting AI can mimic human communication, raising questions about the future role of AI in traditionally human-dominated fields.
Section: Results
Superior ChatGPT Ratings (written-content)
ChatGPT-generated responses were rated *higher* on therapeutic common factors than therapist-written responses (d = 1.63, 95% CI [1.49, 1.78]). This suggests that, in this specific context, AI may be capable of generating text that aligns more closely with established therapeutic principles than that of human experts.
Section: Results
Preregistration and Open Science (written-content)
The study was preregistered and the authors provided links to data, code, and materials. This enhances transparency and reproducibility, allowing other researchers to verify and build upon the findings.
Section: Method
Bayesian Statistical Approach (written-content)
The study used Bayesian statistical methods, focusing on effect sizes and credible intervals rather than null hypothesis significance testing. This provides a more nuanced understanding of the results, since credible intervals support direct probability statements about effects (a toy illustration appears after this item).
Section: Method
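
To make this concrete, here is a toy sketch of how posterior draws yield a credible interval and a direct probability statement for a standardized mean difference. The ratings are simulated and all numbers (group means, spread, sample sizes) are invented for illustration; the study's actual model is richer than this flat-prior approximation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated common-factor ratings on a 1-5 scale -- illustrative, not the study's data.
chatgpt   = rng.normal(4.2, 0.6, 400).clip(1, 5)
therapist = rng.normal(3.6, 0.6, 400).clip(1, 5)

def posterior_mean_draws(x, n_draws=50_000):
    """Approximate flat-prior posterior of a group mean: N(xbar, s^2/n)."""
    return rng.normal(x.mean(), x.std(ddof=1) / np.sqrt(len(x)), n_draws)

# Posterior of the mean difference, standardized by a plug-in pooled SD.
diff = posterior_mean_draws(chatgpt) - posterior_mean_draws(therapist)
d = diff / np.sqrt((chatgpt.var(ddof=1) + therapist.var(ddof=1)) / 2)

lo, hi = np.percentile(d, [2.5, 97.5])
print(f"posterior mean d = {d.mean():.2f}, 95% CrI = [{lo:.2f}, {hi:.2f}]")
print(f"P(d > 0) = {(d > 0).mean():.3f}")  # direct probability statement, unlike a p-value
```

The last line illustrates what this approach buys over significance testing: a direct statement of how probable the effect is, given the data and priors.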
Limited Generalizability (written-content)
The study only examined responses to *vignettes*, not real therapy sessions. This limits the generalizability of the findings to actual therapeutic practice.
Section: Discussion
Vignette Content Underspecified (written-content)
The study did not include a detailed description of the *content* of the vignettes. Providing examples or a more thorough characterization of the vignettes would allow readers to better assess the relevance and applicability of the findings.
Section: Method
Insufficient Ethical Discussion (written-content)
The study did not deeply explore the ethical considerations of using GenAI in therapy. A more thorough discussion of issues like data privacy, informed consent, and potential bias is needed.
Section: Discussion
Research Questions Not Explicitly Stated (written-content)
The study did not explicitly state the research questions or aims within the Introduction section itself. Including a clear statement of the research questions would improve clarity and provide a stronger link between the background and the study's objectives.
Section: Introduction

Section Analysis

Method

Non-Text Elements

Table 1. Engineered prompt.
First Reference in Text
Patterns of prompt engineering (not fine-tuning) were used to sculpt responses from the GenAI models (see Table 1) [33].
Description
  • Key aspect of what is shown: Table 1 presents the engineered prompt used to guide the ChatGPT 4.0 model in generating responses to couple therapy vignettes. The prompt is designed to simulate the behavior of a couple therapist and to optimize responses with respect to five common factors of therapy: therapeutic alliance (establishing a collaborative relationship), empathy (recognizing and validating emotions), professionalism (adhering to ethical codes), cultural competence (respecting cultural differences), and therapeutic technique and efficacy (relying on established therapeutic forms). The prompt also includes an example vignette to guide the model's responses.
Scientific Validity
  • Representation of Therapeutic Principles: The engineered prompt is a crucial part of the methodology, as it dictates how the GenAI model will respond. The validity of the study depends on the prompt's ability to accurately represent the key principles of couple therapy. The prompt incorporates established therapeutic factors, but the extent to which it captures the nuances of actual therapeutic practice is open to interpretation.
  • Prompt Engineering Approach: The authors relied on prompt engineering rather than fine-tuning, which may limit the model's ability to adapt to specific clinical scenarios. Avoiding fine-tuning is justified as a way to evaluate ChatGPT's potential with minimal intervention, but it also makes the results heavily dependent on the exact wording of the prompt (a minimal sketch of a prompt-engineered request appears below).
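
For readers unfamiliar with the distinction, the following is a minimal sketch of a prompt-engineered request to a GPT-4-class model via the OpenAI Python SDK. The system prompt is paraphrased from the common factors described above, not copied from the study; the model name and vignette text are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative system prompt -- paraphrased, not the study's exact wording.
system_prompt = (
    "You are a couple therapist. When responding to a vignette, optimize for "
    "the five common factors of therapy: therapeutic alliance, empathy, "
    "professionalism, cultural competence, and therapeutic technique and efficacy."
)

vignette = "My partner and I keep having the same argument about money..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4",  # the study used ChatGPT 4.0; this endpoint name is an assumption
    messages=[
        {"role": "system", "content": system_prompt},  # behavior shaped by prompt alone
        {"role": "user", "content": vignette},         # no fine-tuning of model weights
    ],
)
print(response.choices[0].message.content)
```

The key point is that all therapeutic behavior is induced at inference time through the system message; the model's weights are untouched, which is what makes the approach reproducible but also sensitive to prompt wording.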
Communication
  • Placement and Relevance: The table is referenced in the main text, indicating its importance. Providing the full prompt in a table enhances transparency and allows other researchers to replicate the methodology. However, the table would be more effective if it were placed closer to where it is first referenced in the text.
  • Clarity of Structure: The prompt's structure, with numbered instructions followed by an example, is easy to follow. However, the note at the end regarding the accuracy of information about the number of participating therapists could be confusing. It should be rephrased for clarity.

Results

Non-Text Elements

Table 2. Posterior probabilities of attributional accuracy.
First Reference in Text
Aim 1 examined if participants could accurately identify whether responses were written by therapists or ChatGPT. Overall, participants performed poorly in accurate identification regardless of the author. Identification within authors was poor with participants correctly guessing that the therapist was the author 56.1% of the time and participants correctly guessing that ChatGPT was the author 51.2% of the time. Between authors, participants were only able to correctly identify therapists 5% more often than ChatGPT (56.1% versus 51.2%, respectively). Although this difference was statistically reliable, accurate identification within groups was only marginally better than chance and accurate identification in the between group comparison was close to zero (i.e., 5%; see Table 2).
Description
  • Key aspect of what is shown: Table 2 presents the posterior probabilities of participants correctly attributing responses to either a human therapist or ChatGPT across the 18 couple therapy vignettes. For each vignette, the table shows \( \Phi_1 \), the posterior probability of correctly identifying a therapist's response as written by a therapist, and \( \Phi_2 \), the posterior probability of correctly identifying a ChatGPT response as written by ChatGPT, along with \( \sigma_{\Phi_1} \) and \( \sigma_{\Phi_2} \), the standard deviations of the corresponding posterior distributions. The difference between the posterior probabilities (\( \Delta\Phi_{1,2} \)) is also shown with its 95% credible interval (CI). Finally, the table includes Cohen's d, with its 95% credible interval, as a measure of the magnitude of the difference in attributional accuracy between therapist and ChatGPT responses.
Scientific Validity
  • Bayesian Analysis: The use of posterior probabilities and credible intervals is appropriate for Bayesian analysis, providing a measure of uncertainty around the estimates. However, the choice of prior distributions can influence the results, so a sensitivity analysis assessing the impact of different priors would strengthen the validity of the findings (a minimal illustration appears after this list).
  • Effect Size Measure: Cohen's d is a suitable effect size measure for quantifying the magnitude of the difference between attributional accuracy for therapist and ChatGPT responses. The inclusion of 95% credible intervals for Cohen's d provides a range of plausible values for the effect size. However, it's important to note that Cohen's d is a standardized effect size, and its interpretation depends on the context of the study.
  • Vignette Representativeness: The table presents results for each vignette, allowing for a detailed examination of the variability in attributional accuracy across different scenarios. However, it is important to consider whether the vignettes are representative of real-world couple therapy situations. If the vignettes are not representative, the generalizability of the findings may be limited.
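
To illustrate the prior-sensitivity point, here is a minimal sketch assuming a simple Beta-Binomial model of per-author identification accuracy. The counts are hypothetical, scaled to roughly match the reported 56.1% and 51.2% rates; the study's actual model is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 830               # hypothetical number of judgments per author
k_therapist = 466     # ~56.1% correct "therapist" identifications
k_chatgpt = 425       # ~51.2% correct "ChatGPT" identifications

# Compare several Beta(a, b) priors to see how much they move the posterior.
for a, b in [(1, 1), (0.5, 0.5), (2, 2)]:
    phi1 = rng.beta(a + k_therapist, b + n - k_therapist, 50_000)
    phi2 = rng.beta(a + k_chatgpt, b + n - k_chatgpt, 50_000)
    delta = phi1 - phi2                     # posterior of the accuracy gap
    lo, hi = np.percentile(delta, [2.5, 97.5])
    print(f"Beta({a},{b}) prior: mean diff = {delta.mean():.3f}, "
          f"95% CrI = [{lo:.3f}, {hi:.3f}]")
```

With samples of this size the three priors should yield nearly identical intervals, which is exactly the kind of evidence a sensitivity analysis would report.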
Communication
  • Clarity and Organization: The table effectively summarizes the results of the Turing test for each vignette, allowing a clear comparison of attributional accuracy between therapist and ChatGPT responses. It could be improved by explicitly flagging, for each vignette, whether the 95% credible interval for the difference excludes zero (e.g., with an asterisk).
  • Interpretation of Effect Sizes: While the table includes effect sizes (Cohen's d) and credible intervals, a brief interpretation of these values in the caption or a footnote would help readers quickly grasp the practical significance of the findings.
Table 3. Posterior predictions of common factor ratings drawn from a multilevel model including vignette, author, attribution, and an author-by-attribution interaction.
First Reference in Text
A stark attribution bias was observed (see Table 2).
Description
  • Key aspect of what is shown: Table 3 presents the posterior predictions of common factor ratings, which are numerical assessments of key therapeutic elements like alliance and empathy, derived from a multilevel model. The model accounts for the specific vignette, the author of the response (human therapist or ChatGPT), the participant's attribution (whether they believed it was written by a therapist or by ChatGPT), and the author-by-attribution interaction. For each vignette, the table reports the mean (\( \mu \)) and standard deviation (\( \sigma \)) of the posterior distribution of the common factor ratings in each of the four author-by-attribution cells: therapist-authored responses attributed to a therapist, ChatGPT-authored responses attributed to ChatGPT, ChatGPT-authored responses attributed to a therapist, and therapist-authored responses attributed to ChatGPT.
Scientific Validity
  • Multilevel Modeling: The use of a multilevel model is appropriate for analyzing data with hierarchical structure (vignettes, authors, attributions). However, the validity of the results depends on the assumptions of the model being met. A thorough assessment of model fit and diagnostics is crucial.
  • Interaction Term: The inclusion of an author-by-attribution interaction is valuable for investigating attribution biases. However, it is important to consider whether this interaction is theoretically justified and whether it meaningfully improves model fit; comparing models with and without the interaction term would strengthen the analysis (one plausible specification is sketched after this list).
  • Posterior Predictions: The interpretation of posterior predictions should be cautious, as they are conditional on the model and the priors used. A sensitivity analysis to assess the impact of different prior distributions would enhance the robustness of the findings.
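
For concreteness, one plausible specification consistent with the caption is sketched below. The study's exact parameterization, priors, and coding of the predictors may differ, so treat this as an assumption-laden illustration:

\[
y_{ij} = \beta_0 + u_j + \beta_1\,\mathrm{Author}_{ij} + \beta_2\,\mathrm{Attr}_{ij} + \beta_3\,(\mathrm{Author}_{ij} \times \mathrm{Attr}_{ij}) + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \tau^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2),
\]

where \( y_{ij} \) is the common factor rating for response \( i \) nested in vignette \( j \), \( u_j \) is a vignette-level random intercept, and \( \beta_3 \) captures the attribution bias: the degree to which the effect of believed authorship differs by actual authorship. Comparing this model against the constrained model with \( \beta_3 = 0 \) (e.g., via approximate leave-one-out cross-validation or Bayes factors) would directly address whether the interaction improves fit.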
Communication
  • Data Overload: The table provides a detailed breakdown of common factor ratings based on the multilevel model. However, the sheer volume of data presented in the table could be overwhelming for readers. Consider visually highlighting key findings or patterns within the table to improve readability.
  • Caption Clarity: The table caption is descriptive but could be more concise. Consider rephrasing the caption to emphasize the key finding of an attribution bias and to more clearly indicate the purpose of the table.
