Generative AI is transforming healthcare. However, to apply GenAI in healthcare practice, it is essential to first understand how accurately LLMs can predict in specific healthcare tasks. Traditionally, LLMs are benchmarked against standard medical Q&A datasets, including medical licensing exams. These tests generally consist of multiple-choice questions, which do not represent the real-world clinical situations that doctors face.
To more realistically benchmark healthcare tasks, we have developed our own ELHS Benchmarking System, starting with the systematic evaluation of top LLMs for key healthcare tasks across various specialties and diseases. The resulting benchmarking scoreboards will gradually provide the baseline for how the top LLMs perform in patient symptom checking, diagnostic prediction, and treatment selection prediction. These scoreboards are shared publicly here and will be gradually updated as more datasets are created and LLM technologies evolve.
The benchmarking uses simple top-2 scores for predictions to calculate the overall percentage accuracy. The broader the disease coverage, the more reliable the overall accuracy becomes. Accuracy varies across individual diseases. For more details, see our JAMIA paper published in collaboration with Professor Tian at Stanford University.
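As an illustration of how a top-2 score can be turned into a percentage accuracy, here is a minimal Python sketch. It assumes that a case counts as correct when the true diagnosis appears among the model's first two ranked predictions, and that labels are compared by exact string match. Both are assumptions made for this example and may differ from the scoring rules used in the actual benchmarking. The function name `top2_accuracy` and the sample cases are hypothetical.

```python
from collections import defaultdict


def top2_accuracy(cases):
    """Return (overall_percent, percent_by_disease) for a list of cases.

    Each case is a pair (true_label, ranked_predictions), where
    ranked_predictions is ordered from most to least likely.
    Assumption: a case is a hit if true_label is in the first two predictions.
    """
    hits = 0
    per_disease = defaultdict(lambda: [0, 0])  # disease -> [hits, total]
    for truth, ranked in cases:
        hit = truth in ranked[:2]
        hits += hit
        per_disease[truth][0] += hit
        per_disease[truth][1] += 1
    overall = 100.0 * hits / len(cases) if cases else 0.0
    by_disease = {d: 100.0 * h / n for d, (h, n) in per_disease.items()}
    return overall, by_disease


# Hypothetical example data, not taken from the scoreboards.
example_cases = [
    ("asthma", ["asthma", "copd", "bronchitis"]),
    ("asthma", ["copd", "bronchitis", "asthma"]),
    ("migraine", ["tension headache", "migraine"]),
    ("migraine", ["migraine"]),
]
overall, by_disease = top2_accuracy(example_cases)
print(f"overall: {overall:.1f}%")
for disease, pct in sorted(by_disease.items()):
    print(f"{disease}: {pct:.1f}%")
```

The per-disease breakdown shows why the overall figure depends on coverage: with only a few diseases in the dataset, one hard disease can shift the overall percentage substantially.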
Our initial benchmarking results suggest that general-purpose LLMs have achieved such high accuracy across a broad range of tasks that they are ready for evaluation by doctors and medical students in real-world healthcare settings to assess their clinical benefits.
From the scoreboards, Copilot users can understand which diseases and clinical tasks may benefit from GenAI, helping them choose the target problems in care delivery for GenAI to solve. Furthermore, if users find that GenAI cannot match their expertise or best practices, they may consider converting that expertise into GenAI capabilities by fine-tuning open-source LLMs, thereby producing their own LLMs to optimize healthcare delivery and broaden dissemination across medical communities.