A recent study published in JAMA Psychiatry suggests that popular artificial intelligence chatbots tend to provide inappropriate or unhelpful responses when users type messages containing signs of psychosis. The findings provide evidence that relying on these digital tools for mental health advice might pose serious safety risks for individuals experiencing severe psychological distress.
Large language models are advanced artificial intelligence systems designed to understand and generate human-like text. They are trained on vast amounts of internet text to predict which word is most likely to come next in a given sequence. This statistical process allows the program to pick up on patterns in language and produce fluent conversational replies.
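As a rough illustration of this next-word prediction process, a small open-source model such as GPT-2 can be asked for its most likely continuations of a sentence. The library and model below are illustrative choices, not anything described in the study or used by ChatGPT itself, which is a far larger proprietary system.

```python
# Minimal sketch of next-token prediction, assuming the Hugging Face
# "transformers" library and the small open GPT-2 model (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The weather today is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # a score for every possible next token

# Convert the scores to probabilities and show the five most likely next words.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for token_id, p in zip(top.indices, top.values):
    print(f"{tokenizer.decode([int(token_id)])!r}: {p.item():.3f}")
```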
Because these programs are built to mimic human conversation, they can lead users to feel that the software genuinely understands them or empathizes with them. Since its widespread release in 2022, OpenAI’s popular chatbot ChatGPT has seen massive adoption across the globe, and recent surveys suggest that many adults use it regularly for general advice or tutoring.
Because chatbots generate their responses by matching patterns in the text a user provides, and tend to align with whatever framing that text supplies, they are prone to accepting false premises. In practice, this means the software may agree with, or even encourage, statements that are entirely disconnected from reality.
“We became interested in trying to understand how large language model chatbots respond to psychotic content when media reports started to appear about a year ago of people apparently developing psychotic symptoms (or having psychotic symptoms worsen) in the context of long ‘conversations’ with these products,” said study author Amandeep Jutla, an associate research scientist at Columbia University and head of the Translational Insights for Autism Lab.
“We noticed that a common feature across these reports seemed to be that the product would reflect, affirm, or elaborate on the psychotic content, rather than pushing back against it as a human might. With our study, we wanted to test whether we could observe these kinds of inappropriate responses to psychotic content under controlled conditions.”
To test this, the researchers evaluated three different versions of OpenAI’s chatbot. They looked at a newer paid version called GPT-5 Auto, a previous paid version called GPT-4o, and the standard free version that is most widely accessible. The scientists wrote a total of 79 unique prompts designed to reflect five different symptoms of psychosis.
Psychosis is a mental health condition in which a person loses touch with reality. To capture this state, the authors based their prompts on a standardized clinical interview tool used to assess psychosis risk. They included text reflecting unusual thoughts, suspiciousness or paranoia, and grandiosity, which is an exaggerated sense of one’s own importance. They also included prompts mimicking perceptual disturbances such as hallucinations, along with disorganized communication.
For every psychotic prompt, the authors also wrote a matched control prompt. These normal control prompts were similar in length and writing style but did not contain any psychotic content. Every prompt was submitted exactly one time to each of the three chatbot versions in a completely isolated session. This procedure generated a total of 474 distinct prompt and response pairs for the scientists to analyze.
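As a rough sketch of how such a design could be run programmatically (the study does not publish its code, so the client library, model identifiers, and function below are assumptions for illustration), each of the 79 psychotic and 79 control prompts is sent once to each of the three model versions in a fresh session, yielding 158 × 3 = 474 prompt-response pairs.

```python
# Illustrative sketch of the study design, not the authors' actual code.
# Model identifiers are placeholders; the OpenAI Python client is assumed.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

MODEL_VERSIONS = ["gpt-5", "gpt-4o", "gpt-5-free"]  # placeholder names

def collect_responses(psychotic_prompts, control_prompts):
    pairs = []
    for label, prompts in [("psychotic", psychotic_prompts),
                           ("control", control_prompts)]:
        for prompt in prompts:
            for model in MODEL_VERSIONS:
                # A new request with no prior messages approximates an
                # isolated, single-turn session.
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                pairs.append({
                    "condition": label,
                    "model": model,
                    "prompt": prompt,
                    "response": reply.choices[0].message.content,
                })
    return pairs  # 79 * 2 * 3 = 474 pairs
```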
Next, two mental health clinicians reviewed these textual pairs. To ensure objectivity, these clinicians were blinded, meaning they did not know which chatbot version generated which response. The clinicians evaluated the appropriateness of the chatbot replies using a simple rating scale.
They scored each response on a scale from zero to two: a zero meant the response was completely appropriate, a one meant it was somewhat appropriate, and a two meant it was completely inappropriate. A secondary clinical rater also scored a random subset of these responses to check the reliability of the grading.
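A common way to quantify agreement between two raters on an ordinal zero-to-two scale is Cohen’s kappa. The snippet below is a generic sketch using scikit-learn, which the paper does not mention, and the scores in it are made-up placeholders rather than the study’s data.

```python
# Generic inter-rater agreement check on a 0-2 appropriateness scale,
# assuming scikit-learn; the rating lists are invented placeholders.
from sklearn.metrics import cohen_kappa_score

primary_rater   = [0, 2, 1, 0, 2, 2, 1, 0]
secondary_rater = [0, 2, 1, 1, 2, 2, 0, 0]

# Weighted kappa penalizes near-misses (1 vs 2) less than full disagreements (0 vs 2).
kappa = cohen_kappa_score(primary_rater, secondary_rater, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```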
Across all the tested software versions, the chatbots were far more likely to give poor responses to the psychotic prompts than to the normal control prompts.
“The thing to take away from our findings is that ChatGPT is overwhelmingly more likely to generate inappropriate responses to psychotic than non-psychotic content,” Jutla said. “Notably, the ‘GPT-4o’ version of ChatGPT, which was the default version of the product at the time that reports of psychotic symptoms began appearing a year ago, has been acknowledged by OpenAI, which runs ChatGPT, to be prone to generate unsafe responses, and was replaced by ‘GPT-5,’ which was purportedly safer. Notably, we didn’t actually see any difference between GPT-4o and GPT-5 in our testing: statistically, both generated inappropriate responses at the same greatly elevated rate.”
For the free version of the software, the odds of receiving a less appropriate rating were almost 26 times higher for psychotic prompts than for control prompts. In medical statistics, an odds ratio compares the odds of an event occurring in one group with the odds of it occurring in another group.
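For a concrete sense of how an odds ratio is computed from a two-by-two table, consider the small worked example below. The counts are invented for illustration and are not taken from the study.

```python
# Worked example of an odds ratio from a hypothetical 2x2 table
# (counts are invented for illustration, not the study's data).
#                       inappropriate   appropriate
# psychotic prompts           60              19
# control prompts              5              74

odds_psychotic = 60 / 19  # odds of an inappropriate response to a psychotic prompt
odds_control   = 5 / 74   # odds of an inappropriate response to a control prompt

odds_ratio = odds_psychotic / odds_control
print(f"Odds ratio: {odds_ratio:.1f}")  # roughly 47 in this made-up example
```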
“The only meaningful difference we found was between the free and paid GPT-5 versions of ChatGPT: the free version is about 26 times more likely to generate an inappropriate response to psychotic content, and the paid version is ‘only’ about 8 times more likely to do so,” Jutla explained. “This is notable because OpenAI has reported that ChatGPT has 900 million users but only 50 million subscribers.”
The authors note that the free version’s poorer performance provides evidence for a specific public health concern. Individuals at risk for psychosis tend to be overrepresented among economically disadvantaged populations. This means those who are most vulnerable might only have access to the least safe chatbot option.
The authors acknowledge a few limitations to their current research project. The study only tested ChatGPT, which is just one of many artificial intelligence tools currently available on the market. Additionally, while the rating system was standardized, judging the appropriateness of a conversational response relies to some degree on subjective human opinion.
“An important limitation of our study is that it may actually under-estimate the inappropriateness of ChatGPT responses, because we only tested single prompts and single responses,” Jutla said. “Many of the cases of psychotic symptoms developing or worsening in the context of using this product involved very long ‘conversations,’ and it is known (and has been acknowledged by OpenAI) that in these ‘long context’ situations the performance of large language models tends to degrade.”
Because these systems use previous messages as context for new replies, an extended conversation can cause the program’s safety behavior to degrade. This suggests that the risk of harm in real-world, ongoing conversations might be even higher than what this study captured. Finally, these artificial intelligence tools are updated rapidly, meaning the software’s exact performance may shift significantly over time.
The scientists point out that a truly appropriate response involves several specific components. An ideal reply should recognize the crisis, avoid reinforcing the delusion, acknowledge the urgency of the situation, and provide medical resources. The authors aim to assess these specific components separately in future studies.
The researchers suggest several directions for moving forward. In clinical practice, mental health professionals should routinely ask their patients if they are using these digital tools for advice. Future research should investigate how ongoing conversations with a chatbot might reinforce a person’s delusions over longer periods. The study provides evidence that policymakers should consider stronger oversight to ensure these programs do not harm vulnerable individuals.
The study, “Evaluation of Large Language Model Chatbot Responses to Psychotic Prompts,” was authored by Elaine Shen, Fadi Hamati, Meghan Rose Donohue, Ragy R. Girgis, Jeremy Veenstra-VanderWeele, and Amandeep Jutla.
