New York, NY, Dec. 09, 2025 (GLOBE NEWSWIRE) -- Sword Health, the world’s leading AI Health company, today unveiled MindEval, the industry’s first benchmark designed to evaluate how large language models behave in realistic, multi-turn mental health conversations. Built with licensed clinical psychologists and grounded in American Psychological Association (APA) supervision guidelines, MindEval sets a new standard for assessing the clinical quality and safety of AI systems being used to support people’s mental health.
MindEval arrives at a critical moment. Around the world, people are increasingly turning to AI chatbots for emotional support, self-help coaching, and therapy-like conversations, often without any understanding of how these systems perform in real mental health interactions. Until now, there has been no rigorous way to measure whether AI behaves safely, consistently, and competently across a full conversation, not just a single reply.
A new clinical benchmark for the AI era
Developed in partnership with PhD-level licensed clinical psychologists, MindEval evaluates models across five dimensions essential to safe and effective mental health support: clinical accuracy, ethics and professional conduct, assessment quality, therapeutic alliance, and AI-specific communication behaviors. Unlike existing tests that rely on trivia or isolated answers, MindEval evaluates models the same way clinicians are evaluated, with multi-turn conversations that unfold over time, including complex scenarios involving elevated depressive or anxious symptoms.
"Around the world, people are increasingly turning to AI for emotional support and therapy-like conversations, often without any understanding of how these systems actually perform," said Virgilio Bento, founder and CEO of Sword Health. "Until now, there has been no rigorous way to measure whether AI behaves safely and competently across a full therapeutic conversation. MindEval changes that."
MindEval reveals that state-of-the-art models fall short
In its initial evaluation of 12 leading LLMs, Sword found that all models scored below 4 out of 6 on average across clinical domains, with the weakest performance in areas that matter most in real conversations:
- AI-specific communication issues such as excessive verbosity, over-validation, and generic or superficial advice
- Difficulty supporting patients presenting with severe symptoms
- Quality degradation over longer interactions, where clinical failures compound over time
The study also shows that increased model size and reasoning capabilities do not reliably improve therapeutic behavior, underscoring a growing gap between general-purpose AI optimization and what safe, clinically aligned mental health support requires.
Why MindEval was created
MindEval was built to address an urgent and growing need for transparency and safety in AI mental health tools. According to Sword:
- Current benchmarks don’t measure real therapy. Most tests focus on trivia or single answers, but therapy is multi-turn, contextual, and relational. MindEval evaluates models the way clinicians are evaluated: over the whole interaction.
- LLMs are already being used as quasi-therapists. People use chatbots for support, coaching, and venting, yet there has been no standard way to measure their clinical competence. MindEval is built to measure that competence in realistic conversations with mental health patients and expose the associated risks.
- The worst failures only show up over time. Dependency, over-reassurance, boundary erosion, and hallucinated guidance rarely appear in one reply; they emerge over several turns. That is exactly what MindEval tests.
- Larger models and reasoning do not guarantee improved clinical competence. While increased model size and reasoning capabilities typically yield better performance on other tasks, that does not hold in mental health: optimizing powerful models toward "helpfulness" can come at a cost in this domain.
- The ecosystem needs a common yardstick. MindEval gives regulators, clinicians, and builders a shared, APA-aligned benchmark for safety reviews, model comparison, and ongoing audits, filling the role FinanceBench plays in finance, but for therapy-grade interaction quality.
"AI has enormous potential to close gaps in access to high-quality mental health care, but only if we hold these systems to standards that reflect how care actually works. We tested 12 state-of-the-art models with MindEval, and all of them struggle, with performance deteriorating as conversations get longer and symptoms more severe," said Bento. "MindEval gives the entire ecosystem, from regulators to researchers to builders, a common benchmark for ensuring that AI built for mental health behaves safely at every turn of a conversation. And that is why we are open-sourcing everything. We believe MindEval can become a foundation for building safer AI mental health support across the industry."
Sword Health is releasing MindEval as an open benchmark, including code, prompts, and human evaluation data. This allows researchers, developers, and clinicians worldwide to test their own systems, experiment with new safety techniques, and collaboratively improve AI models intended for mental health contexts.
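For teams planning to evaluate their own systems against the released benchmark, the aggregation step can be pictured as follows. This is a minimal, hypothetical sketch, not MindEval's actual code or API: the dimension names mirror the five axes described above and the 1-6 scale matches the reported scoring, but the data structures and the `aggregate_scores` function are illustrative assumptions.

```python
# Illustrative sketch of rubric-score aggregation for one multi-turn session.
# Dimension names follow the five MindEval axes; the 1-6 scale matches the
# press release. The data layout and function below are hypothetical.
from statistics import mean

DIMENSIONS = [
    "clinical_accuracy",
    "ethics_and_professional_conduct",
    "assessment_quality",
    "therapeutic_alliance",
    "ai_specific_communication",
]

def aggregate_scores(turn_scores):
    """Average each dimension's 1-6 rubric score across all assistant turns."""
    return {dim: mean(turn[dim] for turn in turn_scores) for dim in DIMENSIONS}

# Example: three scored assistant turns from one simulated conversation.
turns = [
    dict(zip(DIMENSIONS, scores))
    for scores in ([4, 5, 3, 4, 2], [4, 4, 3, 3, 2], [3, 4, 2, 3, 1])
]
per_dimension = aggregate_scores(turns)
overall = mean(per_dimension.values())
print(per_dimension)
print(round(overall, 2))
```

Averaging per turn rather than scoring a single reply reflects the benchmark's core claim: failures such as over-validation or boundary erosion only become visible when a whole conversation is scored.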
MindEval represents the latest step in Sword’s mission to build a safer, more effective AI-powered healthcare system, one where intelligent tools and licensed clinicians work together to deliver high-quality care at planetary scale.
For more information and access to the benchmark, visit the following link: MindEval: Benchmarking Language Models on Multi-turn Mental Health Support.
About Sword Health
Sword Health is shifting healthcare from human-first to AI-first through its AI care platform, making world-class healthcare available anytime, anywhere, while significantly reducing costs for payers, self-insured employers, national health systems, and other healthcare organizations. Sword began by reinventing physical health with AI at its core, and has since expanded into women’s health and mental health. Since 2020, more than 700,000 members across three continents have completed 10 million AI sessions, helping Sword’s 1,000+ clients avoid over $1 billion in unnecessary healthcare costs.
Building on this foundation, Sword launched Sword Intelligence to bring its proven AI capabilities to governments and health systems worldwide, helping them streamline operations, expand capacity, and make healthcare more responsive.
Backed by 43 clinical studies and over 45 patents, Sword Health has raised more than $500 million from leading investors, including Khosla Ventures, General Catalyst, Transformation Capital, and Founders Fund. Learn more at www.swordhealth.com.