Artificial intelligence (AI), specifically large language models (LLMs), is rapidly advancing and holds significant potential to transform healthcare. These technologies may expand access to medical knowledge, allowing patients to conduct preliminary health assessments on their own. The public increasingly uses AI chatbots to seek health-related information, including understanding symptoms and exploring possible diagnoses. Models such as GPT-4o, Llama 3, and Command R+ have shown promising results on medical knowledge benchmarks. However, there is limited evidence of their effectiveness in real-world health decision-making, where outcomes depend on effective communication between patients and AI and on how users interpret the advice provided. Understanding how people interact with AI in healthcare is therefore essential before these tools are broadly implemented.
A study published in Nature Medicine aimed to assess whether LLM-powered chatbots can reliably assist members of the public in making medical self-assessments and determining appropriate healthcare actions. The researchers investigated whether people using AI chatbots would perform better than those relying on conventional resources when identifying possible medical conditions and deciding on an appropriate level of medical care. The study also examined how human-AI interaction influences decision-making accuracy and whether existing AI evaluation methods, such as medical benchmarks or simulated patient interactions, accurately predict real-world performance.
Researchers conducted a randomized experimental study involving 1298 adult participants in the United Kingdom. Participants were recruited online and randomly assigned to one of four groups: three groups that used different AI chatbots and a control group that relied on standard information sources, such as internet searches or trusted medical websites like the NHS. Each participant was presented with two clinical scenarios describing common health problems encountered in everyday life. The scenarios were developed from clinical guidance issued by the National Institute for Health and Care Excellence and validated by experienced physicians. For each scenario, participants were asked to select the most appropriate healthcare action on a 5-level scale ranging from self-care to calling an ambulance. They were also asked to list medical conditions that might explain the symptoms and to explain their reasoning. Participants in the experimental groups could interact with their assigned AI model as many times as they wished to help make their decisions. Researchers evaluated responses based on whether participants correctly identified relevant medical conditions and selected the correct level of medical care according to physician-established standards.
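To make the evaluation concrete, the sketch below shows one way such responses could be scored. The five disposition levels and the matching logic are illustrative assumptions, not the study's actual rubric.

```python
# Illustrative sketch of a scoring scheme like the one described above.
# The level names and the string-matching logic are assumptions, not
# the study's actual rubric.

TRIAGE_LEVELS = [
    "self-care",
    "see a pharmacist",
    "see a GP",
    "urgent care",
    "call an ambulance",
]

def score_response(chosen_level: str,
                   listed_conditions: list[str],
                   gold_level: str,
                   relevant_conditions: set[str]) -> dict:
    """Score one scenario response against physician-established standards."""
    normalized = {c.strip().lower() for c in listed_conditions}
    return {
        # Correct disposition: the chosen action matches the gold level.
        "disposition_correct": chosen_level == gold_level,
        # Condition identified: at least one listed condition is relevant.
        "condition_identified": bool(normalized & relevant_conditions),
    }

# Example: a participant under-triages a scenario whose gold answer is urgent care.
print(score_response("see a GP", ["Migraine", "tension headache"],
                     "urgent care", {"subarachnoid hemorrhage", "migraine"}))
# -> {'disposition_correct': False, 'condition_identified': True}
```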
The results showed that although the AI models themselves demonstrated strong medical knowledge, their use did not significantly improve participants’ decision-making. When the AI systems were tested directly, without human interaction, they suggested at least one relevant medical condition in over 90% of cases and correctly recommended the appropriate healthcare action in about half of the scenarios. However, participants who used AI chatbots performed worse at detecting relevant medical conditions than those in the control group. Individuals using traditional resources were about 1.76 times more likely to identify a correct condition and 1.57 times more likely to recognize serious “red flag” conditions. In choosing the correct healthcare service, participants using AI performed similarly to the control group, with an overall accuracy rate of about 43%, meaning most participants selected an incorrect course of action. The study also revealed that participants often underestimated the seriousness of their symptoms.
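For readers unfamiliar with relative likelihoods, the short calculation below shows how a ratio such as 1.76 arises. The group sizes and counts are hypothetical, chosen only so the arms sum to the study's 1298 participants; they are not the study's reported data.

```python
# Hypothetical counts, for illustration only: the study reports the ratios
# (1.76 and 1.57), but these group sizes and hit counts are invented.
control_hits, control_n = 147, 325   # participants identifying a relevant condition
ai_hits, ai_n = 250, 973             # pooled across the three chatbot arms (assumed)

# Relative likelihood = rate in the control group / rate in the AI groups.
rr = (control_hits / control_n) / (ai_hits / ai_n)
print(f"relative likelihood = {rr:.2f}")  # -> relative likelihood = 1.76
```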
Further analysis of the conversations between users and AI models revealed significant communication challenges. Users frequently failed to provide sufficient detail about their symptoms, limiting the AI’s ability to generate accurate advice. The AI systems often presented multiple possible conditions, only about one-third of which were correct, making it difficult for users to determine which information was reliable. In many cases, users ignored correct suggestions made by the AI or misunderstood the model’s recommendations. Additional findings showed that AI benchmarks and simulated patient tests, such as those using the MedQA dataset, greatly overestimated the effectiveness of AI systems compared with real human interactions.
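To illustrate why a static benchmark can overestimate real-world performance, the sketch below scores a model on MedQA-style multiple-choice questions in a single pass. The `ask_model` helper and the prompt format are assumptions for illustration, not the study's evaluation code.

```python
# Minimal sketch of a MedQA-style static evaluation, assuming a generic
# ask_model(prompt) -> str helper (hypothetical; not from the study).
# A static benchmark hands the model a complete, well-formed vignette and
# its answer options in one shot; none of the back-and-forth, incomplete
# symptom reporting, or interpretation steps that real users introduce.

def evaluate_static(questions, ask_model):
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt).strip()
        if reply and reply[0].upper() == q["answer"]:
            correct += 1
    return correct / len(questions)
```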
Overall, this study showed that modern AI language models possess substantial medical knowledge, but their usefulness to patients depends heavily on the quality of human-AI interaction. Participants using AI chatbots did not outperform those using traditional health information resources and were less accurate in identifying medical conditions. Key limitations included communication gaps, incomplete symptom reporting, and difficulty interpreting AI suggestions. These results suggest that AI chatbots need significant improvements in reliability and user interaction design before they can be safely deployed in patient-facing healthcare applications.
Reference: Bean AM, Payne RE, Parsons G, et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine. 2026;32:609-615. doi:10.1038/s41591-025-04074-y