OpenAI releases HealthBench to evaluate AI in healthcare
OpenAI, the maker of ChatGPT, released an open-source benchmark designed to measure the performance and safety of large language models in healthcare.
The large data set, called HealthBench, goes beyond exam-style queries and tests how well artificial intelligence models perform in realistic health scenarios, based on what physician experts say matters most, the company said in a blog post Monday.
“Improving human health will be one of the defining impacts of AGI. If developed and deployed effectively, large language models have the potential to expand access to health information, support clinicians in delivering high-quality care, and help people advocate for their health and that of their communities,” the company wrote in the post.
“Evaluations are essential to understanding how models perform in health settings,” company executives said in the post. “Significant efforts have already been made across academia and industry, yet many existing evaluations do not reflect realistic scenarios, lack rigorous validation against expert medical opinion, or leave no room for state-of-the-art models to improve.”
OpenAI has published the HealthBench paper as a PDF and released the accompanying code publicly.
The evaluation framework was built in partnership with 262 physicians who have practiced in 60 countries, the company said.
HealthBench has 5,000 realistic health conversations baked in and grades model responses against physician-written rubrics, assessing safety, appropriateness and accuracy.
The conversations in HealthBench simulate interactions between AI models and individual users or clinicians, the company said, and they were produced via both synthetic generation and human adversarial testing. The conversations were “created to be realistic and similar to real-world use of large language models: they are multi-turn and multilingual, capture a range of layperson and healthcare provider personas, span a range of medical specialties and contexts, and were selected for difficulty,” OpenAI said.
HealthBench evaluates 48,562 unique rubric criteria spanning several health contexts and behavioral dimensions, such as accuracy, instruction following and communication.
Model responses are evaluated by a model-based grader that assesses whether each rubric criterion is met; a response then receives an overall score equal to the points earned across met criteria divided by the maximum possible score for that rubric.
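The scoring described above can be sketched in a few lines of Python. This is an illustrative reconstruction based on the description in the post, not OpenAI's actual grading code: the class names, point values, and clipping behavior here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One physician-written criterion. The schema is illustrative, not OpenAI's."""
    description: str
    points: int  # positive for desired behavior, negative for harmful behavior

def score_response(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Overall score: points earned on met criteria divided by the maximum
    achievable (the sum of positive points), clipped to the [0, 1] range."""
    earned = sum(c.points for c, was_met in zip(criteria, met) if was_met)
    max_possible = sum(c.points for c in criteria if c.points > 0)
    if max_possible == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_possible))

# Hypothetical rubric for an emergency-themed conversation.
rubric = [
    RubricCriterion("Advises seeking emergency care for red-flag symptoms", 5),
    RubricCriterion("Asks a clarifying question about symptom duration", 3),
    RubricCriterion("Recommends a specific prescription dose unprompted", -4),
]

# In HealthBench, a model-based grader decides which criteria were met;
# here we supply that judgment by hand.
print(score_response(rubric, [True, False, False]))  # 5 / 8 = 0.625
```

Negative-point criteria mean a response can lose credit for unsafe behavior even while meeting other criteria, which is why the score is clipped at zero rather than allowed to go negative.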
HealthBench conversations are split into seven themes, such as emergency situations, handling uncertainty or global health. Each theme contains its own rubric for grading.
Karan Singhal, who runs OpenAI’s health AI team, said in a LinkedIn post that HealthBench was developed for two audiences: the AI research community to “shape shared standards and incentivize models that benefit humanity” and healthcare organizations to provide “high-quality evidence, towards a better understanding of current and future use cases and limitations.”
OpenAI said HealthBench was developed to evaluate AI systems in health with several core principles in mind. For one, the scores should reflect real-world impact, the company said. “This should go beyond exam questions to capture complex, real-life scenarios and workflows that mirror the ways individuals and clinicians interact with models,” OpenAI said in the post.
Evaluations should reflect the standards and priorities of healthcare professionals, providing a rigorous foundation for improving AI systems, the company said. And, benchmarks support progress. “Current models should show substantial room for improvement, offering model developers incentives to continuously improve performance,” the company noted.
Ethan Goh, M.D., executive director of Stanford AI Research and Science Evaluation, told Fierce Healthcare that HealthBench is a step in the right direction toward advancing the evaluation of healthcare AI performance.
“Many prior benchmarks (e.g., MedQA, MultiMedQA, MedMCQA, USMLE) rely on multiple-choice questions, often taken from doctor exams. These are now saturated and less useful for measuring improvement (i.e., AI models are scoring close to 100%),” Goh wrote in a LinkedIn post. “HealthBench addresses this gap with a benchmark for task-level evaluation, covering patient and clinician use cases.”
“Many industry players were already using their models for various healthcare applications, and frankly not doing a great job with robust evaluation of AI responses in the rush to deploy a working prototype, which can be incredibly high stakes for a consumer- or provider-facing use case,” Goh said. “So this does help fill the gap somewhat.”
OpenAI has been ramping up its partnerships with healthcare and life sciences organizations, but HealthBench marks its first healthcare-focused benchmark.
The company is working with Sanofi and Formation Bio to build an AI-powered tool designed to improve drug development by speeding up clinical trial recruitment. Iodine Software is working with the company to integrate generative AI and large language models, including GPT-4, across the breadth of its solutions for clinical administration and revenue cycle management.
Color Health partnered with OpenAI to build gen AI tools, including an AI-powered cancer copilot app, and is working with the company to test computer-generated personalized care plans for cancer patients. UTHealth Houston also partners with OpenAI to build and deploy algorithms for use in medical training and at the patient’s bedside.
