This study investigated whether ChatGPT could generate responses to couple therapy vignettes that were indistinguishable from those of human therapists, and whether those responses aligned with therapeutic common factors. Participants (N=830) were unable to reliably distinguish between therapist and ChatGPT responses. ChatGPT responses were rated significantly higher on common factors (d = 1.63, 95% CI [1.49, 1.78]). Linguistic analysis revealed differences, with ChatGPT using more positive sentiment. Post-hoc analyses revealed an attribution bias favoring responses believed to be from therapists.
The study provides compelling evidence that ChatGPT can generate responses to couple therapy vignettes that are indistinguishable from those written by experienced therapists, and are even rated higher on key therapeutic factors. However, it is crucial to distinguish what the design can and cannot show. Because ChatGPT and therapist responses were compared under controlled conditions, the design allows for causal inferences about the *relative* perceived quality of the responses in this specific context. It does not prove that ChatGPT *causes* better therapeutic outcomes, and it does not speak to the effectiveness of ChatGPT as a therapist in real-world settings.
The practical utility of these findings is significant, suggesting that GenAI could potentially play a valuable role in expanding access to mental health resources, particularly in areas like web-based relationship interventions. The study places these findings within the context of existing research on AI in mental health, referencing Turing's work and previous studies on human-machine indistinguishability. However, the study could benefit from a more thorough discussion of how its findings specifically advance the broader field and address existing gaps in the literature.
While the study offers promising insights, it also acknowledges key uncertainties. The authors appropriately caution against overgeneralizing the findings, emphasizing the limitations of using vignettes and the need for further research in real-world therapeutic settings. They provide clear guidance for future research, suggesting the need to investigate the long-term effects of GenAI-based interventions and to explore the potential for bias in AI models. The authors also highlight the importance of ethical considerations, such as data privacy and informed consent.
The study raises critical unanswered questions, particularly regarding the long-term effectiveness and safety of using GenAI in therapy. While the methodological limitations, such as the use of vignettes and a brief outcome measure, are acknowledged, their potential impact on the conclusions could be discussed in more detail. Specifically, the study could benefit from a more thorough analysis of whether these limitations fundamentally affect the interpretation of the findings. For example, how might the results differ in a real-world therapy setting, where the interaction is dynamic and ongoing, rather than a single exchange of written responses?
The abstract concisely summarizes the research question, methodology, key findings, and implications of the study, providing a clear overview of the entire work.
The abstract clearly states the central research question, addressing the capability of machines to act as therapists, which is a timely and relevant topic given the advancements in generative artificial intelligence.
The abstract succinctly outlines the three main aims of the study, providing a structured approach to investigating the research question.
The abstract highlights the potential implications of the research, suggesting that ChatGPT could improve psychotherapeutic processes and lead to the development of new methods for testing and creating interventions.
High impact. While the abstract mentions "key psychotherapy principles," it would enhance clarity and impact to briefly specify which principles are being referred to. This provides more context for readers unfamiliar with the specific framework used in the study. This belongs in the abstract to give a more complete, yet still concise, overview of the study.
Implementation: Specifically mention the common factors (therapeutic alliance, empathy, expectations, cultural competence, and therapist effects) either directly in the sentence describing aim (b) or in the subsequent sentence summarizing the findings. For example: '...rated higher in key psychotherapy principles, such as therapeutic alliance and empathy...' or '...rated higher on measures of therapeutic alliance, empathy, expectations, cultural competence, and therapist effects.'
Medium impact. Although the abstract mentions limitations, it does so very briefly at the end. Adding a slightly more specific, yet still concise, phrase about the *type* of limitations would provide a more balanced overview. This belongs in the abstract to offer a complete picture of the study's scope and boundaries.
Implementation: Instead of just stating "Further, we discuss limitations," add a brief phrase indicating the nature of the limitations. For example: 'Further, we discuss limitations related to the use of vignettes and sample size.' or 'Further, we discuss limitations concerning the generalizability of findings from vignette-based responses.'
Low impact. The abstract uses "ChatGPT" consistently, but it might be beneficial for clarity to include "(GenAI)" after the first mention of ChatGPT, to reinforce the broader category of technology being investigated. This is a minor point, but would improve understanding for those less familiar with specific AI models.
Implementation: After the first mention of "ChatGPT," add "(GenAI)" in parentheses. For example, change the opening of the abstract to: '"Can machines be therapists?" is a question receiving increased attention given the relative ease of working with generative artificial intelligence. Although recent (and decades-old) research has found that humans struggle to tell the difference between responses from machines and humans, recent findings suggest that artificial intelligence can write empathically and the generated content is rated highly by therapists and outperforms professionals. It is uncertain whether, in a preregistered competition where therapists and ChatGPT (GenAI) respond to therapeutic vignettes about couple therapy...'
The introduction effectively establishes the historical context of the research by referencing Alan Turing's work and the development of ELIZA, one of the first chatbots. This provides a strong foundation for understanding the evolution of AI in therapeutic settings.
The introduction clearly presents the growing body of evidence suggesting the potential of GenAI in psychotherapy, both as an adjunct to human services and as an independent solution. It cites several relevant studies, providing a comprehensive overview of the current state of research.
The introduction identifies a specific practical use-case for GenAI: web-based relationship interventions for couples. This provides a focused application area for the research and highlights the potential for GenAI to expand the reach and accessibility of evidence-based programs.
The introduction clearly outlines the limitations of the current literature, identifying theoretical and applied gaps. This demonstrates a critical understanding of the existing research and sets the stage for the current study's contributions.
The introduction introduces the common factors metatheory and its relevance to the research. This provides a theoretical framework for understanding the mechanisms of change in psychotherapy and for evaluating GenAI's capabilities.
High impact. The introduction could be improved by more explicitly stating the research question(s) or aims of the *current* study *within* the Introduction itself. While the Abstract outlines the aims, and the Introduction builds a strong case, a concise statement of the research questions near the end of the Introduction would improve clarity and provide a stronger transition to the Methods section. This belongs in the Introduction to provide a clear and direct link between the background and the current study's objectives.
Implementation: Add a paragraph or a few sentences towards the end of the Introduction, before the "The current study" section, explicitly stating the research questions. For example: 'Based on the limitations of existing research, the current study aims to address the following questions: 1) Can participants distinguish between responses written by therapists and ChatGPT in a couple therapy context? 2) How do therapist-written and ChatGPT-generated responses compare in terms of alignment with key therapeutic principles? 3) Are there linguistic differences between the two types of responses?'
Medium impact. While the introduction mentions the common factors, it could benefit from briefly listing them directly. This would enhance clarity for readers unfamiliar with the framework. This belongs in the Introduction to provide a more complete understanding of the theoretical underpinnings of the study.
Implementation: Add a sentence after introducing the common factors metatheory, briefly listing the five factors. For example: 'These common ingredients include therapeutic alliance, empathy, expectations, cultural competence, and therapist effects [27, 28, 31–33].'
Low impact. The section "A specific use-case: Web-based relationship interventions" could be slightly more concise. While it provides important context, streamlining it would improve the overall flow of the Introduction.
Implementation: Consider condensing the paragraph by removing some of the specific details about effect sizes of existing programs, focusing instead on the general potential of GenAI in this area. The specific details can be elaborated upon in the Discussion section if necessary.
The Method section clearly states the ethical approval and preregistration of the study, which enhances transparency and reproducibility. Providing links to the preregistration, data, code, and study materials further strengthens the study's credibility.
The section provides a detailed description of the procedure used to generate expert responses, including the recruitment of experts, their qualifications, and the process of assigning vignettes and selecting the best responses. This level of detail allows for a thorough understanding of how the human-generated responses were obtained.
The section clearly outlines the procedure for creating GenAI vignette responses, including the use of ChatGPT 4.0, the prompt engineering approach, and the selection of the best responses by the research team. This provides a clear understanding of how the AI-generated responses were created and curated.
The section describes the panel procedure and sample description, including the use of CloudResearch, the random assignment of participants to conditions, and the tasks performed by the panel. This provides sufficient information about the data collection process.
The section specifies the measures used in the study, including the Turing test and the common factors of therapy measure. The description of the common factors measure includes its development, the items used, and its psychometric properties, providing a clear understanding of how these constructs were operationalized.
The Method section details the data analysis plan, including the use of Bayesian techniques, the specific statistical tests employed, and the software packages used. This level of detail enhances the transparency and reproducibility of the analysis.
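To make the Bayesian framing of the Turing-test question concrete, here is a minimal sketch of how identification accuracy could be summarized with a posterior and a 95% credible interval. It is not the authors' analysis: the Beta-Binomial model, the uniform prior, the function names, and the counts are all assumptions chosen for illustration, and the counts are hypothetical rather than taken from the study.

```python
import random


def beta_binomial_posterior(correct, total, prior_a=1.0, prior_b=1.0):
    """Return the (a, b) parameters of the Beta posterior for accuracy.

    Assumes a Beta(prior_a, prior_b) prior and a binomial likelihood.
    """
    return prior_a + correct, prior_b + (total - correct)


def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Approximate an equal-tailed credible interval by Monte Carlo sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int(((1 - level) / 2) * draws)
    hi_idx = int((1 - (1 - level) / 2) * draws) - 1
    return samples[lo_idx], samples[hi_idx]


# Hypothetical counts for illustration only (NOT the study's data):
# 520 correct author identifications out of 1000 judgments.
a, b = beta_binomial_posterior(correct=520, total=1000)
posterior_mean = a / (a + b)
low, high = credible_interval(a, b)

# If the interval contains 0.5, accuracy is not credibly different from guessing.
includes_chance = low <= 0.5 <= high
print(f"posterior mean = {posterior_mean:.3f}, 95% CrI = [{low:.3f}, {high:.3f}]")
print("interval includes chance (0.5):", includes_chance)
```

A model like this ignores the nesting of judgments within participants and vignettes, which a full analysis would need to account for.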
Medium impact. The Method section could be improved by providing more detail about the *content* of the vignettes. It describes *how* the vignettes were assigned and selected and notes that they varied in length, difficulty, and subject content, but not *what* they contained. Because this concerns the materials used in the research, it belongs in the Method section. Examples or a fuller description of the topics covered would let readers better evaluate the relevance and generalizability of the findings, and would facilitate replication efforts by other researchers.
Implementation: Add a paragraph or a few sentences describing the content of the vignettes. Include examples of the types of relationship issues or conflicts depicted. Consider including a supplementary table with summaries of each vignette. For instance: 'The vignettes covered a range of common couple therapy issues, including communication difficulties, conflict resolution, intimacy concerns, and external stressors. For example, one vignette depicted a couple struggling with communication breakdown after a job loss, while another portrayed a couple dealing with differing expectations regarding intimacy.'
Medium impact. The Method section should clarify the specific instructions given to the *panel participants* regarding the common factors ratings. It lists the items used to measure the common factors but does not state what participants were told about them, which matters for understanding how they interpreted and applied the rating scale. This belongs in the 'Panel procedure and sample description' subsection, as it directly relates to the task participants performed. Stating the instructions would make the results easier to interpret and allow a more accurate assessment of the validity of the common factors ratings.
Implementation: Add a sentence or two clarifying the instructions given to participants regarding the common factors ratings. For example: 'Participants were asked to rate the extent to which each response demonstrated the following characteristics, based on their understanding of these concepts: understanding of the speaker, caring and understanding, appropriateness for the therapy setting, relevance for different backgrounds and cultures, and whether it is something a good therapist would say.'
Low impact. While Table 1 presents the engineered prompt, the Method section could benefit from briefly summarizing the key elements of the prompt *within the text itself*. This belongs in the 'Procedure to create GenAI vignette responses' subsection, as it directly pertains to the creation of the AI-generated responses. A short in-text summary would make the description of the methodology more cohesive and self-contained, and would spare readers from repeatedly referring to the table.
Implementation: Add a sentence or two summarizing the key elements of the prompt. For example: 'The prompt instructed ChatGPT to behave as a couple therapist, optimizing for the five common factors of therapy: therapeutic alliance, empathy, professionalism, cultural competence, and therapeutic technique and efficacy. It also specified that responses should adhere to relevant ethical codes and draw upon established therapeutic approaches.'
The Results section clearly and concisely summarizes the findings for each of the three main aims, providing a structured presentation of the study's outcomes.
The section effectively uses quantitative data, including means, standard deviations, effect sizes, and credible intervals, to support the findings. This allows for a precise and objective evaluation of the results.
The section includes post hoc analyses that provide additional insights and explore potential explanations for the observed findings. These analyses add depth to the interpretation of the results.
The Results section appropriately utilizes tables (Table 2 and Table 3) to present detailed statistical data in a clear and organized manner. These tables enhance the readability and comprehensibility of the findings.
The section maintains a clear distinction between the main aims and the post hoc analyses, preventing confusion and ensuring that the exploratory nature of the latter is appropriately acknowledged.
Medium impact. The Results section would benefit from including brief interpretive statements *alongside* the statistical results. The section currently focuses almost exclusively on reporting the statistical findings, leaving interpretation entirely to the Discussion section. Adding a sentence or two after each key finding to explain its *meaning* would bridge the gap between reporting the data and interpreting its significance, provide a smoother transition to the Discussion section, and help readers less familiar with statistical terminology grasp the results.
Implementation: After presenting the statistical results for each aim (and post hoc analysis), add a sentence or two summarizing the meaning of the findings. For example, after reporting the results for Aim 1, you could add: 'This indicates that participants were largely unable to distinguish between responses generated by ChatGPT and those written by experienced therapists.' After reporting the main finding for Aim 2: 'This suggests that, in this context, ChatGPT's responses were perceived as more aligned with therapeutic common factors than those of human therapists.'
Low impact. The Results section could be improved by explicitly stating the *direction* of the effects in the text, even when tables are provided. While Table 2 shows the differences, adding phrases like "higher for ChatGPT" or "lower for therapists" within the text would make the direction of each finding immediately clear, especially for readers who skim the tables. This is a minor change that would make the results easier to understand at a glance.
Implementation: When reporting results, explicitly state the direction of the effect. For example, instead of just saying '...differences in accurate identification...', say '...participants were slightly *more* accurate at identifying therapists, though this difference was small...' or 'ChatGPT responses had *more* positive sentiment...'.
Low impact. While the section mentions the use of 95% credible intervals, it could briefly reiterate *why* these are used instead of p-values. The Method section explains the use of Bayesian techniques, but a brief reminder in the Results section would reinforce that methodological choice, give context to readers unfamiliar with Bayesian statistics, and help those who skipped or forgot that detail.
Implementation: Add a brief phrase reiterating the rationale for using credible intervals. For example: '...indicating a large and reliable difference (d = 1.63, 95% CI [1.49, 1.78]), with the credible interval suggesting a true effect favoring ChatGPT, consistent with our focus on effect sizes rather than p-values.'
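For readers less familiar with standardized effect sizes such as the d = 1.63 discussed above, the following is a minimal sketch of how a Cohen's d can be computed from two sets of ratings. It uses the classical pooled-standard-deviation formula rather than the paper's Bayesian estimate, and the function name and rating values are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical ratings for illustration only (NOT the study's data).
chatgpt_ratings = [4, 5, 4, 5, 4, 5]
therapist_ratings = [3, 4, 3, 4, 3, 4]

d = cohens_d(chatgpt_ratings, therapist_ratings)
print(f"Cohen's d = {d:.2f}")  # positive values favor the first group
```

The sign convention matters when stating the direction of an effect in the text: swapping the argument order flips the sign of d without changing its magnitude.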
Table 3. Posterior predictions of common factor ratings drawn from a multilevel model including vignette, author, attribution, and an author-by-attribution interaction.
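One way to read the model named in this caption is as a regression with an interaction term. The form below is a sketch under stated assumptions, not the authors' specification: the symbols are introduced here for illustration, the 0/1 coding of author and attribution is assumed, and treating vignette as a random intercept is an assumption (the paper may handle vignette, or participant-level variation, differently).

```latex
% y_r    : common factors rating for observation r
% A_r    : author of the response (0 = therapist, 1 = ChatGPT), assumed coding
% B_r    : attributed author as believed by the participant (0 = therapist, 1 = ChatGPT), assumed coding
% u_{v[r]} : intercept for the vignette v that observation r belongs to (assumed random effect)
\[
y_{r} = \beta_0 + \beta_1 A_{r} + \beta_2 B_{r} + \beta_3 A_{r} B_{r} + u_{v[r]} + \varepsilon_{r},
\qquad
u_{v} \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{r} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this reading, \(\beta_2\) captures the attribution effect for therapist-written responses, and \(\beta_3\) captures how much that attribution effect differs when the response was actually written by ChatGPT.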
The Discussion section effectively summarizes the study's three main aims and their corresponding findings, providing a clear and concise overview of the research.
The section connects the findings back to the original hypothesis and previous research, highlighting the consistency of the results with Turing's prediction and other studies showing the difficulty in differentiating between human and machine-generated responses.
The Discussion section acknowledges potential critiques and limitations of the study, demonstrating a critical and balanced evaluation of the research. This includes addressing concerns about generalizability, the brief nature of the outcome measure, and the clinician sampling plan.
The section explores potential explanations for the findings, such as the possibility that ChatGPT contextualizes better than therapists due to its use of more nouns and adjectives. This demonstrates a thoughtful consideration of the underlying mechanisms driving the results.
The section discusses the implications of the findings for mental health providers and researchers, highlighting the potential for GenAI to be integrated into mental health settings and the need for careful monitoring and supervision. This demonstrates a consideration of the broader impact of the research.
The section appropriately qualifies the interpretation of the post hoc aims, acknowledging their limitations due to small sample size, the exploratory nature of the analyses, and the need for replication. This demonstrates a cautious and responsible approach to interpreting the findings.
Medium impact. The Discussion section could be improved by more explicitly connecting the findings of *this* study to the *broader* literature on AI in mental health. While it references some previous work, a more thorough integration of the current results with existing research would strengthen the paper's contribution to the field. This would involve comparing and contrasting the results with those of other studies, highlighting areas of agreement and disagreement, and discussing how the current work advances the field's understanding of GenAI's potential in therapeutic settings.
Implementation: Add a paragraph or two specifically addressing the broader literature on AI in mental health. Discuss how the current findings compare to those of other studies, highlighting similarities and differences. For example: 'These findings build upon previous research demonstrating the potential of AI in mental health (cite relevant studies). While some studies have focused on AI as an adjunct to human therapists (cite examples), our work suggests that GenAI can also generate responses that are perceived as therapeutically valuable on their own. This aligns with findings by X et al. (year) but contrasts with Y et al. (year), who found that...'
Medium impact. The Discussion section could benefit from a more nuanced discussion of the *ethical considerations* surrounding the use of GenAI in therapy. While it mentions the need for supervision, it could delve deeper into issues such as data privacy, informed consent, and the potential for bias in AI models, and the Discussion section is the appropriate place to do so. This would involve a more thorough exploration of the potential risks and benefits of using GenAI in therapeutic settings, as well as the safeguards needed to ensure responsible and ethical implementation.
Implementation: Add a paragraph or two specifically addressing the ethical considerations of using GenAI in therapy. Discuss issues such as data privacy, informed consent, and the potential for bias in AI models. For example: 'The use of GenAI in therapy raises important ethical considerations. Data privacy is paramount, and robust measures must be in place to protect sensitive client information. Informed consent is also crucial, with clients needing to be fully aware of the nature of the AI system and its limitations. Furthermore, it is essential to address the potential for bias in AI models, ensuring that they do not perpetuate or exacerbate existing inequalities.'
Low impact. The Discussion section could be slightly more concise in its summary of the results. While it is important to reiterate the key findings, the current version repeats some details that could be condensed. Removing redundant phrases and focusing on the most essential findings would improve the overall flow and leave more space for the discussion of implications and limitations.
Implementation: Review the first two paragraphs of the Discussion section and identify any redundant phrases or sentences. Condense the summary of the results to focus on the most important findings. For example, instead of repeating the statistical results in detail, simply state the main conclusions: 'Participants were unable to reliably distinguish between therapist and ChatGPT responses. ChatGPT responses were rated higher on common factors. Linguistic differences were observed, with ChatGPT using more positive sentiment and a greater number of nouns and adjectives.'