Evaluating AI for Better Health: Insights from OpenAI’s HealthBench Benchmark

Large language models (LLMs) are increasingly used to access health information, assist health care providers, and promote healthier decisions. Most current assessments rely on static, multiple-choice formats, which cannot capture the dynamic character of real-world interactions. HealthBench instead contains 5,000 multi-turn conversations between LLMs and users, either laypeople or medical professionals, and scores responses with a rubric-based system comprising 48,562 unique criteria.

The study aims to help the model development ecosystem measure the direct benefits of AI systems for human health. For each rubric criterion, a response earns the criterion's points if it satisfies the guideline and none if it does not; each criterion's nonzero point value ranges from -10 to 10, with negative points assigned to criteria describing undesired behaviour. The dataset was created by 262 physicians spanning 26 medical specialties and 60 countries, who collaborated to record expectations for high-quality model behaviour and discussed specific examples. HealthBench was used to evaluate a range of LLMs, including OpenAI's GPT-series models. Smaller recent models such as GPT-4.1 nano now outperform older models while being roughly 25 times cheaper. Physicians were also asked, both with and without AI assistance, to write best-possible answers to benchmark examples.
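The rubric mechanics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the criterion names and point values below are hypothetical, and the normalization (points earned divided by the maximum achievable positive points, clipped at zero) follows the general description of rubric-based scoring.

```python
# Sketch of rubric-based scoring. Criteria and point values are
# hypothetical examples, not taken from the actual HealthBench rubrics.

def score_response(criteria_met: dict[str, bool], points: dict[str, int]) -> float:
    """Score one model response against a rubric.

    criteria_met: whether the response satisfies each criterion.
    points: point value per criterion, nonzero and in [-10, 10];
            negative values mark undesired behaviours.
    """
    earned = sum(points[c] for c, met in criteria_met.items() if met)
    max_positive = sum(p for p in points.values() if p > 0)
    # Normalize to [0, 1]; a negative total is clipped to 0.
    return max(0.0, earned / max_positive)

rubric = {
    "states_emergency_referral": 8,   # desired behaviour
    "asks_clarifying_question": 5,    # desired behaviour
    "gives_unsafe_dosage": -10,       # undesired behaviour
}
met = {
    "states_emergency_referral": True,
    "asks_clarifying_question": True,
    "gives_unsafe_dosage": False,
}
print(score_response(met, rubric))  # 13 / 13 = 1.0
```

A response that triggers the negative criterion can score below zero before clipping, which is why the floor at 0 matters.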

Overall, score was not related to response length, although physician-written responses were often shorter on low-rated examples. For meta-evaluation, the researchers compared model graders' judgments against physician judgments on a collection of physician-annotated cases; GPT-4.1 was selected as the default grader because its macro F1 scores were comparable to those of individual physicians. HealthBench thus offers a new benchmark for assessing AI in healthcare: it enables advanced open-ended evaluation and highlights critical areas where model performance should improve. The work also surfaced genuine disagreement among physicians about what constitutes a good model response and how such models should be used.
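Macro F1, the metric mentioned above for comparing model graders with physicians, averages the F1 score over both label classes (criterion met / not met) so that the rarer class counts equally. A minimal sketch, assuming binary met/unmet judgments; the example labels are illustrative, not real study data:

```python
# Minimal macro F1 for binary grader meta-evaluation.
# Labels: 1 = criterion met, 0 = not met. Data below are illustrative.

def f1(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    # Average F1 over both classes so each class counts equally,
    # regardless of how imbalanced the labels are.
    return (f1(y_true, y_pred, 1) + f1(y_true, y_pred, 0)) / 2

physician = [1, 0, 1, 1, 0, 0]   # physician judgments (reference)
grader    = [1, 0, 1, 0, 0, 1]   # model-grader judgments
print(round(macro_f1(physician, grader), 3))  # 0.667
```

In practice such agreement would be computed per rubric criterion over many examples, then aggregated; a library implementation such as scikit-learn's `f1_score(average="macro")` does the same calculation.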

The study demonstrates the quality of model responses and their potential impact on human health, time, and cost savings. Physicians contributed ideal reference answers and iterative feedback to produce expert responses of the highest quality. Future research should include real-world investigations within specific clinical workflows. Building awareness of current and future models' capabilities and limitations requires presenting evidence of model performance to the healthcare community. Shared evaluation standards can help the AI research community move closer to models that serve human beings in the real world.

Reference: Arora RK, Wei J, Hicks RS, et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI; 2025. Accessed May 14, 2025.
