Introduction:
The identification of treatment-related toxicity is essential for clinical research but is highly labor-intensive and subject to substantial variation when events are classified into pertinent categories. Large language models (LLMs) are promising tools for generating language and classifying unstructured text. Given their novelty, the application of LLMs to the interpretation of adverse event (AE) reports in medicine remains underexplored. In this study, we examined the performance of LLMs in characterizing the safety profile of an injectable hydrogel perirectal spacer used during radiation treatment of prostate cancer.
Methods:
We queried the US Food and Drug Administration's MAUDE database to identify reported events related to perirectal hydrogel spacers between August and November 2023. Reports were manually classified by three trained reviewers using the following criteria: primary and secondary AEs (from a preselected category list), presence of symptoms, radiation plan status, and adverse event severity grade. Disagreements between reviewers were resolved by a board-certified urologist to create a gold-standard document. Next, text prompts were created to elicit these classifications from the reports. Two prompting strategies were used: in the first (joint-prompt), all classifications for a report were requested in a single prompt; in the second (split-prompt), each classification was requested in a separate prompt. The LLMs GPT-3.5-Turbo, GPT-4-Turbo, and GPT-4o were evaluated. Each prompt-model combination was run three separate times per report, and its outputs, along with the human reviewers' classifications, were compared against the gold standard.
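For illustration, a minimal sketch of the two prompting strategies is shown below, using the OpenAI Python SDK. The prompt wording, category list, and function names are assumptions made for this sketch; the study's actual prompts are not reproduced here.

# Sketch of the joint-prompt and split-prompt strategies (illustrative;
# the prompt text and categories are assumed, not the study's actual prompts).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["malpositioned gel", "pain", "infection/inflammation/abscess"]  # abridged

FIELDS = [
    f"the primary AE (one of: {', '.join(CATEGORIES)})",
    "any secondary AEs",
    "whether symptoms are present (yes/no)",
    "the radiation plan status",
    "an adverse event severity grade",
]

def classify_joint(report_text: str, model: str) -> str:
    # Joint prompt: all classifications requested in a single completion.
    numbered = "; ".join(f"({i + 1}) {f}" for i, f in enumerate(FIELDS))
    prompt = f"From the adverse event report below, identify: {numbered}.\n\nReport:\n{report_text}"
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def classify_split(report_text: str, model: str) -> list[str]:
    # Split prompt: each classification requested in its own completion.
    answers = []
    for field in FIELDS:
        prompt = f"From the adverse event report below, identify {field}.\n\nReport:\n{report_text}"
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        answers.append(resp.choices[0].message.content)
    return answers

Because each prompt-model combination was run three times per report, default sampling temperature allows run-to-run variation, which the Fleiss' kappa analysis in the Results quantifies.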
Results:
Among the 97 cases reviewed, the most common primary AEs identified in the gold standard were malpositioned gel (47/97), pain (9/97), and infection/inflammation/abscess (8/97). Average F1 scores (a measure of precision and recall) for primary problem identification were: 0.782 (human, 95% CI: 0.756-0.810), 0.777 (GPT-4o joint, 95% CI: 0.761-0.792), 0.739 (GPT-4o split, 95% CI: 0.723-0.755), 0.749 (GPT-4-Turbo joint, 95% CI: 0.743-0.755), 0.746 (GPT-4-Turbo split, 95% CI: 0.730-0.761), 0.777 (GPT-3.5-Turbo joint, 95% CI: 0.765-0.789), and 0.711 (GPT-3.5-Turbo split, 95% CI: 0.691-0.732). The corresponding Fleiss' kappa scores for the primary problem (a measure of inter-rater or intra-model reliability, where higher scores indicate greater agreement across raters) were: 0.609 (human), 0.894 (GPT-4o joint), 0.916 (GPT-4o split), 0.901 (GPT-4-Turbo joint), 0.885 (GPT-4-Turbo split), 0.823 (GPT-3.5-Turbo joint), and 0.872 (GPT-3.5-Turbo split).
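As a sketch of how these two metrics can be computed, the snippet below uses scikit-learn's f1_score against the gold standard and statsmodels' fleiss_kappa across repeated runs. The labels are made up, and the macro-averaging choice is an assumption, since the abstract does not state the averaging scheme used.

# Illustrative computation of the two reported metrics; labels are
# fabricated examples and macro averaging is an assumption.
from sklearn.metrics import f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

gold = ["malpositioned gel", "pain", "malpositioned gel", "infection"]
run1 = ["malpositioned gel", "pain", "pain", "infection"]
run2 = ["malpositioned gel", "pain", "malpositioned gel", "infection"]
run3 = ["malpositioned gel", "pain", "malpositioned gel", "pain"]

# F1 against the gold standard, averaged over AE categories.
print(f1_score(gold, run1, average="macro"))

# Fleiss' kappa across the three runs: reports in rows, runs (raters) in
# columns, converted to a report-by-category count table.
table, _ = aggregate_raters(list(zip(run1, run2, run3)))
print(fleiss_kappa(table))

The same fleiss_kappa call applies to the three human reviewers' labels, which is how inter-rater and intra-model reliability can be compared on a common scale.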
Conclusion:
These findings suggest that LLMs could be deployed to assist in the characterization of prostate cancer treatment-related adverse events, offering accuracy similar to manual classification with reduced inter-rater variability.
Funding: Yale School of Medicine
CLASSIFICATION OF ADVERSE EVENTS AFTER PROSTATE CANCER HYDROGEL PERIRECTAL SPACER INSERTION USING LARGE LANGUAGE MODELS
Category: Prostate Cancer > Potentially Localized
Poster #154
Presented By: Nishan Sohoni
Authors:
Nishan Sohoni
Nimit S. Sohoni
Ryan A. Sutherland
Vinaik M. Sundaresan
Julia E. Olivieri
Michael S. Leapman