May 12, 2026
Article by Carsten Eickhoff

Half of AI health answers are wrong even though they sound convincing – new study

AI chatbot health advice may sound convincing, but its content can be deeply problematic. In this article, Carsten Eickhoff, Professor of E-Health and Medical Data Science, puts the findings of the latest studies into perspective.

Imagine you have just been diagnosed with early-stage cancer and, before your next appointment, you type a question into an AI chatbot: “Which alternative clinics can successfully treat cancer?” Within seconds you get a polished, footnoted answer that reads like it was written by a doctor. Except some of the claims are unfounded, the footnotes lead nowhere, and the chatbot never once suggests that the question itself might be the wrong one to ask.

That scenario is not hypothetical. It is, roughly speaking, what a team of seven researchers found when they put five of the world’s most popular chatbots through a systematic health-information stress test. The results are published in BMJ Open.

The five chatbots (ChatGPT, Gemini, Grok, Meta AI and DeepSeek) were each asked 50 health and medical questions spanning cancer, vaccines, stem cells, nutrition and athletic performance. Two experts independently rated every answer. They found that nearly 20% of the answers were highly problematic, half were problematic, and 30% were somewhat problematic. None of the chatbots reliably produced fully accurate reference lists, and only two of the 250 questions were refused outright.

Overall, the five chatbots performed roughly the same. Grok was the worst performer, with 58% of its responses flagged as problematic, followed by ChatGPT at 52% and Meta AI at 50%.

Performance varied by topic, though. Chatbots handled vaccines and cancer best – fields with large, well-structured bodies of research – yet still produced problematic answers roughly a quarter of the time. They stumbled most on nutrition and athletic performance, domains awash with conflicting advice online and where rigorous evidence is thinner on the ground.

Open-ended questions were where things really went sideways: 32% of those answers were rated highly problematic, compared with just 7% for closed ones. That distinction matters because most real-world health queries are open ended. People do not ask chatbots neat true-or-false questions. They ask things like: “Which supplements are best for overall health?” This is the kind of prompt that invites a fluent and confident yet potentially harmful answer.

When the researchers asked each chatbot for ten scientific references, the median (the middle value) completeness score was just 40%. No chatbot managed a single fully accurate reference list across 25 attempts. Errors ranged from wrong authors and broken links to entirely fabricated papers. This is a particular hazard because references look like proof. A lay reader who sees a neatly formatted citation list has little reason to doubt the content above it.

Professor Carsten Eickhoff conducts research on the foundations of Natural Language Processing and their effect on health decision making.

Why chatbots get things wrong

There’s a simple reason why chatbots get medical answers wrong. Language models do not know things. They predict the most statistically likely next word based on their training data and context. They do not weigh evidence or make value judgments. Their training material includes peer-reviewed papers, but also Reddit threads, wellness blogs and social-media arguments.
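That next-word mechanism can be sketched with a toy bigram model. This is purely illustrative (the tiny corpus is invented, and real chatbots use neural networks trained on billions of documents), but it shows the core point: the prediction reflects frequency in the training data, not medical truth.

```python
from collections import Counter, defaultdict

# Invented toy corpus standing in for training data. Note it mixes a
# sound claim with a false one, just as web-scale data does.
corpus = (
    "vitamin c supports the immune system . "
    "vitamin c cures cancer . "
    "vitamin c supports the immune system ."
).split()

# Count which word follows which word (a bigram model: the simplest
# form of next-word prediction).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the statistically most likely next word."""
    return following[word].most_common(1)[0][0]

# After "c", "supports" occurred twice and "cures" once, so the model
# picks "supports" -- not because it weighed evidence, but because
# that continuation was more frequent.
print(predict_next("c"))  # -> supports
```

If the false claim had appeared more often in the corpus, the same code would confidently emit it instead; frequency, not accuracy, drives the output.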

The researchers did not ask neutral questions. They deliberately crafted prompts designed to push chatbots toward giving misleading answers – a standard stress-testing technique in AI safety research known as “red teaming”. This means the error rates probably overstate what you would encounter with more neutral phrasing. The study also tested the free versions of each model available in February 2025. Paid tiers and newer releases may perform better.

Still, most people use these free versions, and most health questions are not carefully worded. The study’s conditions, if anything, reflect how people actually use these tools.

The article’s findings do not exist in isolation; they land amid a growing body of evidence painting a consistent picture.

A February 2026 study in Nature Medicine showed something surprising. The chatbots themselves could produce the right medical answer almost 95% of the time. But when real people used those same chatbots, they arrived at the right answer less than 35% of the time – no better than people who didn’t use them at all. In simple terms, the issue isn’t just whether the chatbot gives the right answer. It’s whether everyday users can understand and apply that answer correctly.

A recent study published in JAMA Network Open tested 21 leading AI models. The researchers asked them to work out possible medical diagnoses. When the models were given only basic details – like a patient’s age, sex and symptoms – they struggled, failing to suggest the right set of possible conditions more than 80% of the time. Once the researchers fed in exam findings and lab results, accuracy soared above 90%.

Meanwhile, another US study, published in Nature Communications Medicine, found that chatbots readily repeated and even elaborated on made-up medical terms slipped into prompts.

Taken together, these studies suggest the weaknesses found in the BMJ Open study are not quirks of one experimental method but reflect something more fundamental about where the technology stands today.

These chatbots are not going away, nor should they. They can summarise complex topics, help prepare questions for a doctor, and serve as a starting point for research. But the study makes a clear case that they should not be treated as stand-alone medical authorities.

If you do use one of these chatbots for medical advice, verify any health claim it makes, treat its references as suggestions to check rather than fact, and notice when a response sounds confident but offers no disclaimers.

This text was first published on The Conversation.

Cover Image: When waiting weeks for a doctor’s appointment isn’t an option and medical questions can’t wait, people often turn to faster answers by using AI chatbots. © Who Is Danny/Shutterstock.com
