Clinical Reliability of Generative AI is Under Scrutiny in Healthcare
Researchers warn AI chatbots remain unreliable for medical advice, with studies showing they can repeat misinformation and provide confusing guidance to users seeking health information.
Artificial intelligence chatbots are increasingly used for health information, yet new research suggests they remain unreliable when asked for medical advice. Two recent studies have found that large language models (LLMs) can repeat inaccurate information, misinterpret symptoms, and provide guidance that may mislead users.
One study titled "Evaluating Large Language Models for Medical Misinformation", published in The Lancet Digital Health last month, examined how AI models respond to false medical advice embedded in prompts designed to resemble real clinical material.
The study, conducted by researchers at the Icahn School of Medicine at Mount Sinai in New York, tested 20 open-source and proprietary LLMs.
Researchers evaluated the systems using 3.4 million prompts drawn from hospital discharge notes, simulated clinical scenarios, and social media posts. Each prompt included fabricated medical advice to test whether the models would detect errors or repeat them.
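As a rough illustration of how a repeat rate of this kind could be tallied, the sketch below groups scored responses by prompt source and computes the share in which the planted claim was repeated. It is not the study's code: the function name `repeat_rates`, the source labels, and the toy records are all assumptions made up for this example, and deciding whether a response actually "repeated" a claim is the hard part that is left out here.

```python
from collections import defaultdict


def repeat_rates(records):
    """Return {source: fraction of responses that repeated the planted claim}.

    `records` is an iterable of (source, repeated) pairs, where `repeated`
    is True if the model echoed the fabricated advice. Hypothetical format.
    """
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for source, repeated in records:
        totals[source] += 1
        if repeated:
            repeats[source] += 1
    return {source: repeats[source] / totals[source] for source in totals}


# Toy, made-up records for illustration only (not data from the study).
toy_records = [
    ("discharge_note", True),
    ("discharge_note", False),
    ("social_media", False),
    ("social_media", False),
    ("social_media", True),
    ("simulated_scenario", False),
]

rates_by_source = repeat_rates(toy_records)
overall_rate = sum(repeated for _, repeated in toy_records) / len(toy_records)

for source, rate in rates_by_source.items():
    print(f"{source}: {rate:.0%}")
print(f"overall: {overall_rate:.0%}")
```

Breaking the rate out by source is what allows a comparison such as discharge notes versus the overall average.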
The results showed that the systems often failed to identify incorrect information. On average, the models repeated false medical claims in about 32% of cases. When the same misinformation appeared inside realistic hospital discharge notes, that rate rose to nearly 47%.
"Data provenance is a governance imperative the AI industry has yet to systematically address," said Greg Killian, SVP & Head of Business for Life Sciences and Healthcare at EPAM. "As AI-powered chatbots become a go-to resource for health information, disclosure around training data is becoming key. LLMs also lack reliable ways to express uncertainty in real time. They are trained on static datasets in a domain where clinical evidence evolves continuously, and that gap is a risk currently being absorbed by the end user."
Killian said testing standards must extend beyond internal validation to include independent audits, adversarial red-teaming, and domain-specific benchmarks before deployment. He added that oversight should be risk-based rather than restrictive, pointing to regulatory sandboxes, model cards, and clearer liability frameworks as practical starting points. Unlike traditional medical systems built on verified databases, LLMs generate responses based on patterns learned from large datasets that include sources of uneven quality.