Introduction:
Integrating large language models (LLMs) into healthcare is set to transform medical research. Most clinical research relies on data manually extracted by data managers, a laborious and time-consuming process. To streamline such tasks, the National Institutes of Health Integrated Data Analysis Platform (NIDAP) Text Extraction Program (NTEP) was developed. This artificial intelligence data aggregation platform, powered by LLMs, can output a collection of data within seconds once a clinician has completed a prompt engineering process. In this study, we aimed to compare the accuracy of data extracted by NTEP with data manually extracted by NIH data managers for patients with prostate cancer enrolled in our institution’s prospective trial.
Methods:
We conducted a comparative analysis of datasets extracted by data managers and by NTEP. Both were tasked with extracting four MRI-related variables for patients enrolled in the prostate cancer natural history trial (NCT02594202): prostate volume, PSA density, number of lesions, and PI-RADS score. Urologists built custom GPT-4 prompts designed to extract the data directly from electronic medical record (EMR) documents. Both datasets then underwent minor processing and formatting to allow comparison between extraction methods. Prostate volumes were rounded to the appropriate absolute value, PSA density was rounded to three decimal places, and only the highest PI-RADS lesion reported by data managers was evaluated. Statistical analysis was performed with SPSS 29.0: Spearman's rho was used to evaluate the correlation between paired observations for continuous variables, and Cohen's kappa was used to quantify the level of agreement for categorical variables.
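As an illustration of the two statistics named above, the following is a minimal Python sketch. It is not the analysis pipeline used in the study, which was run in SPSS 29.0. The function names and the sample values are hypothetical and chosen only to show how each statistic is computed.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of pairs with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical category throughout
    return (observed - expected) / (1 - expected)


def _ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical paired observations (not study data).
manager_pirads = [3, 4, 5, 4, 2, 5]
llm_pirads = [3, 4, 5, 3, 2, 5]
manager_volume = [41, 55, 38, 72]
llm_volume = [41, 56, 38, 72]

kappa = cohen_kappa(manager_pirads, llm_pirads)
rho = spearman_rho(manager_volume, llm_volume)
```

Only reports with a value from both sources can enter either statistic, so missing entries would need to be dropped before calling these functions.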
Results:
A total of 1728 MRIs from 1289 patients were evaluated. Comparing the datasets extracted by NIDAP and by the data managers, values agreed in 1598 cases (92.5%) for prostate volume, 1705 cases (98.7%) for PSA density, 1221 cases (70.7%) for number of lesions, and 1577 cases (91.3%) for PI-RADS score. In reports with paired observations, the NIDAP and data manager results were highly concordant; however, the results of the two groups differed, with differences ranging from 0.5% to 6.8%. There were also cases where the datasets were missing data entirely; notably, for the number of lesions on MRI, the data managers did not report data in 488 (28.2%) instances. (Table 1)
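The percentages reported above follow directly from the counts and the 1728 evaluated MRIs. A short Python sketch of that arithmetic, with hypothetical variable names, is:

```python
TOTAL_MRIS = 1728  # MRIs evaluated, as reported in the Results

# Agreement counts per variable, as reported in the Results.
agreement_counts = {
    "prostate volume": 1598,
    "PSA density": 1705,
    "number of lesions": 1221,
    "PI-RADS score": 1577,
}


def percent(count, total):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


agreement_rates = {
    name: percent(count, TOTAL_MRIS) for name, count in agreement_counts.items()
}

# Instances where data managers reported no lesion count.
missing_lesion_rate = percent(488, TOTAL_MRIS)
```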
Conclusion:
NTEP is a useful tool to facilitate data extraction from EMRs. Although concordance was high when data were reported by both NIDAP and data managers, NIDAP was able to extract more information, leading to fewer missing variables. Future research should involve larger cohorts to validate the platform’s scalability and efficiency compared to traditional manual extraction methods, and the quality of data extracted by NTEP should be further assessed. We anticipate that the integration of LLMs will significantly enhance and transform the data extraction process.
Funding: N/A
EVALUATING ARTIFICIAL INTELLIGENCE DATA EXTRACTION FROM PROSTATE MRI REPORTS: A COMPARATIVE STUDY WITH TRADITIONAL METHODS
Category
Prostate Cancer > Other
Description
Poster #92
Presented By: Eugene Lee
Authors:
Eugene Lee
Ruben Blachman-Braun
Charles Hesswani
William S. Azar
Braden Millan
Mitchell J. Hwang
Dylan M. Junkin
Christopher R. Koller
Sahil H. Parikh
Kyle C. Schuppe
Daniel Nethala
Neil Mendhiratta
Alexander P. Kenigsberg
Baris Turkbey
Maria J. Merino
George Zaki
Janelle Cortner
Sandeep Gurram
Peter A. Pinto