LLMs to Augment Medical Data: A Proof Of Concept

September 5, 2024

Advances in machine learning have led to the rise of pre-trained large language models (LLMs) that can be used to interpret, understand, and process medical data. As a demonstration of this technology, we have built a proof-of-concept (POC) tool that transforms unstructured doctors’ notes and chest X-rays into a structured database.

What is the opportunity for healthcare providers?

According to some studies, up to 97% of medical data sits unused within hospital premises. This is a huge untapped resource that could help providers optimise pathways, improve patient care and unlock financial gains through better coding and saving clinical time.

The main obstacle, however, is that much of this data sits in unstructured formats, such as images and free-text notes. Extracting meaningful data from these can be time-consuming and expensive, but our proof of concept demonstrates how the latest AI developments could overcome these barriers, allowing hospitals to make greater use of their data.

Proof of concept using chest X-rays

Our tool harnesses a state-of-the-art open-source pretrained language model (Zephyr-7B) and an image-to-text model (llava-1.5-7b) that extracts information from DICOM images. We have tested the feasibility of our approach by applying it to the MIMIC-CXR public dataset of chest X-rays and doctors’ notes.

Our POC is able to reliably take this complex data source, which contains rich medical information, and distil the details into predetermined fields of a table, leading to a simplified version of the original dataset.
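As a rough illustration of what distilling a note into predetermined fields can look like, the sketch below takes a JSON reply from a language model and keeps only the columns of a fixed schema. The field names, the `to_row` helper and the sample reply are illustrative assumptions, not the POC's actual schema or code, and the model call itself is left out.

```python
import json

# Hypothetical target schema; the POC's real fields are not listed in this post.
FIELDS = ("patient_id", "finding", "severity", "follow_up_required")


def to_row(llm_reply: str) -> dict:
    """Parse a model's JSON reply and keep only the predetermined fields.

    Missing fields become None so that every row has the same columns.
    Replies that are not a JSON object yield an all-None row.
    """
    try:
        parsed = json.loads(llm_reply)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    return {field: parsed.get(field) for field in FIELDS}


# Made-up reply that a language model might return for one note.
sample_reply = (
    '{"patient_id": "p001", "finding": "pleural effusion", '
    '"severity": "moderate", "comment": "free text the schema ignores"}'
)
row = to_row(sample_reply)
```

Forcing every reply through the same fixed set of fields is what makes the output tabular: extra keys are dropped and gaps are made explicit rather than silently omitted.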

Processing pipeline for X-ray images and doctors’ notes.

Why are structured formats, such as tables and knowledge graphs, more useful than their unstructured counterparts?

By compressing rich clinical information into salient indicators and storing them in tables, we can create structured datasets that relay important details about patient care, such as diagnoses, procedures and outcomes.

In turn, researchers, auditors, and clinicians can easily draw insights from large numbers of patients using simple data analysis approaches, allowing them to understand how interventions impact patient outcomes and determine methods to improve patient care.
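To show the kind of simple analysis that structured rows make possible, here is a minimal sketch that computes a readmission rate per finding. The rows, the column names and the `readmission_rate_by_finding` helper are made up for illustration and do not come from MIMIC-CXR or from the POC's output.

```python
from collections import Counter

# Made-up structured rows, standing in for a table produced from notes and scans.
rows = [
    {"patient_id": "p001", "finding": "pleural effusion", "readmitted": True},
    {"patient_id": "p002", "finding": "pneumonia", "readmitted": False},
    {"patient_id": "p003", "finding": "pneumonia", "readmitted": True},
    {"patient_id": "p004", "finding": "pneumonia", "readmitted": False},
]


def readmission_rate_by_finding(rows):
    """Return {finding: share of patients with that finding who were readmitted}."""
    totals = Counter(r["finding"] for r in rows)
    readmitted = Counter(r["finding"] for r in rows if r["readmitted"])
    return {finding: readmitted[finding] / totals[finding] for finding in totals}


rates = readmission_rate_by_finding(rows)
```

The same question asked of free-text notes would require reading each note; once the fields exist, it is a few lines of counting.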

Transformed knowledge graph output from the discharge notes and multiple X-rays of a single patient.

Plenty of use cases

A tool like this could be employed to validate secondary-use datasets that are currently created manually by clinical coding experts, such as Hospital Episode Statistics (HES), a nationally mandated dataset in English hospitals. Manual coding is error-prone and subject to variation in coding methodology between hospitals and countries. By cross-referencing the manually coded data with the structured datasets generated by this tool, we can improve the accuracy and completeness of the information, unlocking deeper insights into patient care.
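One way such cross-referencing could work is a per-episode comparison of the diagnoses recorded by coders against those extracted from the notes, flagging any episode where the two disagree. The sketch below assumes both sources can be reduced to a mapping from an episode identifier to a set of diagnosis labels; the identifiers, labels and `find_discrepancies` helper are hypothetical.

```python
def find_discrepancies(coded: dict, extracted: dict) -> dict:
    """Compare two {episode_id: set of diagnosis labels} mappings.

    Returns only the episodes where the two sources disagree, with the
    labels that appear in one source but not the other.
    """
    report = {}
    for episode_id in sorted(coded.keys() | extracted.keys()):
        coded_labels = coded.get(episode_id, set())
        extracted_labels = extracted.get(episode_id, set())
        if coded_labels != extracted_labels:
            report[episode_id] = {
                "only_in_coded": coded_labels - extracted_labels,
                "only_in_extracted": extracted_labels - coded_labels,
            }
    return report


# Made-up example: manually coded episodes versus tool-extracted episodes.
coded = {"e1": {"pneumonia"}, "e2": {"pleural effusion"}}
extracted = {
    "e1": {"pneumonia", "pleural effusion"},
    "e2": {"pleural effusion"},
    "e3": {"cardiomegaly"},
}
report = find_discrepancies(coded, extracted)
```

A disagreement is a prompt for human review rather than proof that either source is wrong, since the extracted data can contain errors too.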

A more operational use case could involve the scheduling of further scans or treatment. The structured output produced by the tool could be fed into another model to automatically triage patients on hospital waitlists based on the severity of their conditions. This would ensure that patients most in need of urgent care or at risk of deterioration are seen quickly.
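As a simplified stand-in for such a triage model, the sketch below orders a waitlist with a fixed rule: most severe first, then longest wait. The severity scale, the `days_waiting` field and the `triage` helper are assumptions made for illustration; a real system would need clinically validated criteria.

```python
# Hypothetical severity ordering (lower rank is seen sooner).
SEVERITY_RANK = {"severe": 0, "moderate": 1, "mild": 2}


def triage(waitlist):
    """Return the waitlist ordered by severity, then by longest wait.

    Unknown severity labels are placed after all known ones.
    """
    return sorted(
        waitlist,
        key=lambda p: (
            SEVERITY_RANK.get(p["severity"], len(SEVERITY_RANK)),
            -p["days_waiting"],
        ),
    )


# Made-up waitlist built from structured rows.
waitlist = [
    {"patient_id": "p001", "severity": "mild", "days_waiting": 30},
    {"patient_id": "p002", "severity": "severe", "days_waiting": 2},
    {"patient_id": "p003", "severity": "moderate", "days_waiting": 10},
    {"patient_id": "p004", "severity": "severe", "days_waiting": 5},
]
ordered_ids = [p["patient_id"] for p in triage(waitlist)]
```

Here the two severe cases come first, with the one who has waited longer ahead, followed by the moderate and mild cases.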

Tom Michaelis

Tom is a Lead Data Scientist at Edge Health with experience creating AI-powered products for the life sciences sector. He has led the development and deployment of machine learning and generative AI algorithms to solve pain points within the private and public sectors.