Evaluating AI for Better Health: Insights from OpenAI’s HealthBench Benchmark

Large language models (LLMs) are increasingly used to access health information, assist health care providers, and promote healthier decisions. Most current assessments rely on static, multiple-choice formats, which cannot capture the dynamic character of real-world interactions. HealthBench instead contains 5,000 multi-turn conversations between LLMs and users, either laypeople or medical professionals, and scores responses with a rubric-based system comprising 48,562 unique criteria.

The study aims to help the model development ecosystem measure the direct benefits of AI systems for human health. For each rubric criterion, a response earns the criterion's points if it satisfies the guideline and none if it does not; each criterion's nonzero point value ranges from -10 to 10, with negative points assigned to criteria describing undesired behaviour. The dataset was created by 262 physicians spanning 26 medical specialties and 60 countries, who collaborated to record expectations for high-quality model behaviour and discussed specific examples. HealthBench was used to evaluate a range of LLMs, including OpenAI's GPT-series models. Smaller recent models such as GPT-4.1 nano now outperform older models while being roughly 25 times cheaper. Physicians were also asked, both with and without AI assistance, to write best-possible answers to benchmark examples.
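The rubric mechanics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the criterion names and point values below are hypothetical, and the normalization (points earned divided by the maximum achievable positive points, clipped at zero) follows the general description of rubric-based scoring.

```python
# Sketch of rubric-based scoring. Criteria and point values are
# hypothetical examples, not taken from the actual HealthBench rubrics.

def score_response(criteria_met: dict[str, bool], points: dict[str, int]) -> float:
    """Score one model response against a rubric.

    criteria_met: whether the response satisfies each criterion.
    points: point value per criterion, nonzero and in [-10, 10];
            negative values mark undesired behaviours.
    """
    earned = sum(points[c] for c, met in criteria_met.items() if met)
    max_positive = sum(p for p in points.values() if p > 0)
    # Normalize to [0, 1]; a negative total is clipped to 0.
    return max(0.0, earned / max_positive)

rubric = {
    "states_emergency_referral": 8,   # desired behaviour
    "asks_clarifying_question": 5,    # desired behaviour
    "gives_unsafe_dosage": -10,       # undesired behaviour
}
met = {
    "states_emergency_referral": True,
    "asks_clarifying_question": True,
    "gives_unsafe_dosage": False,
}
print(score_response(met, rubric))  # 13 / 13 = 1.0
```

A response that triggers the negative criterion can score below zero before clipping, which is why the floor at 0 matters.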

Overall, score was not related to response length, although physician-written responses were often shorter on low-rated examples. For meta-evaluation, the researchers compared model graders' judgments against physician judgments on a collection of physician-annotated cases; GPT-4.1 was selected as the default grader because its macro F1 scores were comparable to those of individual physicians. HealthBench thus offers a new benchmark for assessing AI in healthcare: it enables advanced open-ended evaluation and highlights critical areas where model performance should improve. The work also surfaced genuine disagreement among physicians about what constitutes a good model response and how such models should be used.
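Macro F1, the metric mentioned above for comparing model graders with physicians, averages the F1 score over both label classes (criterion met / not met) so that the rarer class counts equally. A minimal sketch, assuming binary met/unmet judgments; the example labels are illustrative, not real study data:

```python
# Minimal macro F1 for binary grader meta-evaluation.
# Labels: 1 = criterion met, 0 = not met. Data below are illustrative.

def f1(y_true, y_pred, positive):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    # Average F1 over both classes so each class counts equally,
    # regardless of how imbalanced the labels are.
    return (f1(y_true, y_pred, 1) + f1(y_true, y_pred, 0)) / 2

physician = [1, 0, 1, 1, 0, 0]   # physician judgments (reference)
grader    = [1, 0, 1, 0, 0, 1]   # model-grader judgments
print(round(macro_f1(physician, grader), 3))  # 0.667
```

In practice such agreement would be computed per rubric criterion over many examples, then aggregated; a library implementation such as scikit-learn's `f1_score(average="macro")` does the same calculation.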

The study demonstrates the quality of model responses and their potential impact on human health, time, and cost savings. Physicians contributed ideal reference answers and iterative feedback to produce expert responses of the highest quality. Future research should include real-world investigations within specific clinical workflows. Building awareness of current and future models' capabilities and limitations requires presenting evidence of model performance to the healthcare community. Shared evaluation standards can help the AI research community move closer to models that serve human beings in the real world.

Reference: Arora RK, Wei J, Hicks RS, et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. OpenAI; 2025. Accessed May 14, 2025.
