Health systems band together to test and publicly rank top AI models
Since the launch of ChatGPT in 2022, technology companies have been racing to bring generative AI-powered tools to the healthcare market. However, providers have faced a quandary about what — and whether — to buy.
As Google, Amazon, Microsoft and OpenAI rapidly expand their suite of artificial intelligence offerings, providers say they don’t know how to compare the efficacy of products or determine which tool might best meet their specific needs.
A group of health systems, led by Boston-based Mass General Brigham, is hoping to solve that problem.
On Wednesday, the academic medical center launched the Healthcare AI Challenge Collaborative, which will allow participating clinicians to test the latest AI offerings in simulated clinical settings. Clinicians will pit models against each other in head-to-head competition and produce public rankings of the commercial tools by the end of the year.
Participating health systems say that the chance to directly compare AI products is overdue.
Despite the rapid proliferation of AI in healthcare, the industry has been slow to agree on how to assess quality. Industry groups have taken stabs at rolling out evaluation frameworks, but the guidelines remain in draft form.
Without standardized evaluation metrics, it’s difficult to compare even the most similar tools, said Richard Bruce, associate professor of radiology and the vice chair of informatics at the University of Wisconsin School of Medicine and Public Health.
“Are there any [common] metrics that directly compare them? Currently, to the best of my knowledge, other than surveys of users and anecdotes, tools are not directly benchmarked against each other,” he said. “There’s no easy way to get an apples to apples comparison.”
So far, Emory Healthcare, the radiology departments at the University of Wisconsin School of Medicine and Public Health and the University of Washington School of Medicine, and industry group the American College of Radiology are participating in the collaborative. MGB said it plans to expand the program.
The health systems will initially test nine models, according to an MGB spokesperson, including products from Microsoft, Google, Amazon Web Services, OpenAI and Harrison.AI.
The clinicians will assess the models on draft report generation, key findings, differential diagnosis and other factors, according to MGB.
The metrics for evaluating the models are “evolving,” Bruce said, and may depend on the tool’s clinical use case. For example, while model accuracy will always be highly weighted, there are some instances, such as when the model is used to produce a text report, where readability might be more important.
“Some of those are going to be very much on a subjective quality,” Bruce said. “Like, do I feel that the style in which this text is presented is more easily readable or amenable to patients?”
Ultimately, the health systems will create a “leaderboard” of tools, according to Dushyant Sahani, professor and chair of the department of radiology at the University of Washington.
The leaderboard will be used both to provide feedback to technology companies and to help health systems shop for tech, according to MGB.
Health systems that do not directly participate in the challenge may be able to use the rankings to decide what tools to purchase, according to Sahani — which he views as a win for health equity.
In the race to implement AI, experts have raised concerns that providers with fewer resources, who may not have time to research new tools, might be left behind.
“Health systems can use the transparent rankings to inform decision-making and establish benchmarking standards,” Sahani said. “Insights and best practices from the consortium can be adopted by non-participating health systems.”
Google and Microsoft declined to comment for this article.