To provide reliable evidence for using generative AI (GenAI) in clinical learning and research, we systematically evaluate leading GenAI tools on key healthcare tasks across a range of specialties and diseases using our ELHS Benchmarking System. The resulting benchmarking scoreboards are publicly available here and will be updated continuously as large language model (LLM) technologies evolve. Each benchmark scores a model's predictions with a simple top-2 score and reports the overall accuracy as a percentage. The broader the disease coverage, the more reliable the overall accuracy becomes. Accuracy varies between individual diseases, so the overall figure should not be read as the accuracy for any single disease. See our JAMIA paper, published in collaboration with Stanford University.
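
As a minimal sketch of how an overall top-2 accuracy could be computed, the Python below assumes that a case counts as correct when the reference diagnosis appears among a model's first two ranked predictions. The exact scoring rule is not spelled out in the text above, so this rule, the function names `normalize` and `top2_accuracy`, and the example cases are all illustrative assumptions rather than the ELHS Benchmarking System's actual implementation.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def top2_accuracy(cases):
    """Return the percentage of cases whose reference diagnosis is in the top 2.

    `cases` is a list of (reference_diagnosis, ranked_predictions) pairs.
    Assumed rule (not confirmed by the source): a case is a hit when the
    reference diagnosis is among the first two predictions.
    """
    if not cases:
        raise ValueError("at least one case is required")
    hits = 0
    for reference, ranked_predictions in cases:
        top_two = {normalize(p) for p in ranked_predictions[:2]}
        if normalize(reference) in top_two:
            hits += 1
    return 100.0 * hits / len(cases)


# Hypothetical cases, invented for illustration only.
example_cases = [
    ("migraine", ["Migraine", "tension headache", "cluster headache"]),  # hit at rank 1
    ("epilepsy", ["syncope", "Epilepsy"]),                               # hit at rank 2
    ("multiple sclerosis", ["stroke", "migraine", "multiple sclerosis"]),  # miss (rank 3)
    ("stroke", ["stroke"]),                                              # hit at rank 1
]

overall = top2_accuracy(example_cases)
```

Exact string matching after normalization is a deliberate simplification; real disease names would likely need synonym handling or manual review before being compared.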

DxB: Diagnostic Prediction Benchmarking
Diagnostic prediction benchmarking tests GenAI models on predicting the likely causative diseases from all available information, including symptoms and diagnostic test results.
Dataset    Diseases  OpenAI ChatGPT-4  Google Gemini-1.5  Baidu Ernie-4  Date (YYYYMMDD)
Neurology  63        93.22%            92.14%             90.56%         20240509
Oncology   112       85.98%            86.22%             89.88%         20240404
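
The introduction notes that broader disease coverage makes the overall accuracy more reliable. One rough way to see why is the standard error of a proportion, sketched below in Python. It assumes each disease behaves like an independent pass/fail trial, which is a simplification of ours and not a calculation described by the benchmark; the function name `standard_error` is likewise hypothetical.

```python
import math


def standard_error(accuracy_percent, n_items):
    """Approximate standard error, in percentage points, of an accuracy.

    Uses the normal approximation sqrt(p * (1 - p) / n), treating each of the
    n_items as an independent pass/fail trial (an assumption, not a property
    confirmed by the benchmark).
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    p = accuracy_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


# Neurology row above: 93.22% accuracy over 63 diseases.
neurology_se = standard_error(93.22, 63)

# Same accuracy with four times as many diseases: the uncertainty halves.
larger_se = standard_error(93.22, 252)
```

Because the standard error shrinks with the square root of the number of diseases, quadrupling coverage halves the uncertainty under this approximation.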

ScB: Symptom Checking Benchmarking
Symptom checking benchmarking tests GenAI models on predicting the likely causative diseases from symptoms alone. MCSC stands for Mayo Clinic Symptom Checker.
Dataset        Diseases  OpenAI ChatGPT-4  Google Gemini-1.0  Baidu Ernie-4  Date (YYYYMMDD)
MCSC Diseases  181       90.5%             81.38%             82.38%         20240404
MCSC Symptoms 194 78.71% 20230815

