Date: 2023-08-03

A study by researchers at Harvard Medical School (HMS) has found that automated scoring systems are worse than human radiologists at evaluating AI-generated radiology reports. The researchers tested various scoring metrics on reports generated by artificial intelligence (AI) tools and found that the automated systems misinterpreted and overlooked clinical errors made by the AI tools.

Recognizing the importance of accurately evaluating AI systems in order to generate clinically useful and trustworthy radiology reports, the team designed a new method called RadGraph F1 to evaluate the performance of AI tools that generate reports from medical images. Additionally, they developed a composite evaluation tool known as RadCliQ, which combines multiple metrics into a single score to mimic how a human radiologist would assess the performance of an AI model.
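To make the idea of overlap-based and composite scoring concrete, here is a minimal sketch. It is not the study's implementation: the article does not give the definition of RadGraph F1 or the way RadCliQ weights its component metrics, so the (finding, status) pairs, the function names `set_f1` and `composite_score`, the `text_similarity` value, and the equal weights are all illustrative assumptions.

```python
from typing import Dict, Set, Tuple

Finding = Tuple[str, str]  # hypothetical (finding, status) pair


def set_f1(predicted: Set[Finding], reference: Set[Finding]) -> float:
    """F1 overlap between two sets of findings (illustrative, not RadGraph F1 itself)."""
    if not predicted and not reference:
        return 1.0  # two empty reports agree trivially
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of individual metric scores; the weights are assumptions."""
    return sum(weights[name] * scores[name] for name in weights)


# Reference report written by a radiologist (made-up example data).
reference = {
    ("effusion", "present"),
    ("pneumothorax", "absent"),
    ("cardiomegaly", "present"),
}
# AI-generated report with one clinically important error.
generated = {
    ("effusion", "present"),
    ("pneumothorax", "present"),
    ("cardiomegaly", "present"),
}

overlap_f1 = set_f1(generated, reference)

# Combine the overlap score with a placeholder text-similarity score.
scores = {"overlap_f1": overlap_f1, "text_similarity": 0.9}
weights = {"overlap_f1": 0.5, "text_similarity": 0.5}
combined = composite_score(scores, weights)
```

In this made-up example the two reports differ by a single word, so a purely textual metric would rate them as very similar, while the finding-level overlap drops to 2/3 because the pneumothorax status is wrong. A composite score in this style blends both signals into one number.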

When they used these new scoring tools to evaluate several state-of-the-art AI models, the researchers found a significant gap between the models’ actual scores and the highest possible score. This kind of analysis is crucial for advancing AI in medicine, according to the study’s co-first author, Feiyang ‘Kathy’ Yu.

Looking ahead, the researchers envision building generalist medical AI models that are capable of performing a range of complex tasks, including problem-solving in unprecedented scenarios. This would enable the AI models to communicate fluently with radiologists and physicians, assisting them in diagnosis and treatment decisions.

Furthermore, the team aims to develop AI assistants that can directly explain and contextualize imaging findings to patients in plain language. The researchers believe that their new metrics, by aligning more closely with radiologists' assessments, will help AI integrate seamlessly into the clinical workflow and contribute to improving patient care.