Leveraging Large Language Models for Cancer Research and Care

Summer Han Chair
Stanford University
 
Fatma Gunturkun Organizer
Stanford University
 
Aparajita Khan Organizer
Stanford University
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
0817 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-201A 

Applied

Yes

Main Sponsor

Section on Text Analysis

Co Sponsors

Health Policy Statistics Section
Korean International Statistical Society

Presentations

CancerLLM: A Large Language Model in Cancer Domain

Dr. Zhang will introduce CancerLLM, a large language model trained on local health care data to conduct multiple downstream tasks such as phenotype extraction, diagnosis generation, and treatment suggestion. CancerLLM outperformed other state-of-the-art models on these cancer domain-specific tasks. CancerLLM can also be used for disease prognosis tasks, such as cardiotoxicity prediction.

Speaker

Rui Zhang, University of Minnesota School of Public Health

Detecting Cancer Recurrence from Radiology Imaging Reports Using Supervised Deep Learning and Large Language Models

Cancer recurrences are critical events that negatively impact patients' quality of life and often lead to mortality, yet they are not systematically recorded in population-based cancer registries. Studies have shown the potential of traditional Natural Language Processing (NLP) techniques to extract the presence and timing of recurrence from clinical notes, pathology reports, and radiology reports. Recent developments in large language models (LLMs) within generative Artificial Intelligence (AI) have shown even greater promise for information extraction from narrative text. Nevertheless, research is still lacking on the use of deep learning methods and LLMs for extracting recurrence information from population-based electronic radiology imaging reports (E-RAD).

In this study, we aim to bridge this gap by comparing various AI frameworks for detecting cancer recurrence from population-based E-RAD. The study includes 13,703 radiology reports from 4,498 patients diagnosed between 2011 and 2018 with 12 primary cancer types (e.g., breast, prostate, lung), linked to the Surveillance, Epidemiology, and End Results (SEER) program. We developed a deep learning-based framework with a rule-based negation detection algorithm to identify recurrence or metastasis instances, and compared its performance with that of a traditional keyword-searching algorithm and various general and medical-specific LLMs. Our proposed approach outperformed the best-performing comparison model on a validation dataset, achieving higher accuracy (0.99 vs. 0.97), sensitivity (0.88 vs. 0.50), precision (0.78 vs. 0.60), and F1 score (0.82 vs. 0.55), while maintaining similar specificity (0.99 vs. 0.99).
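The rule-based negation component pairs naturally with a NegEx-style look-back window over recurrence mentions. The sketch below illustrates that idea only; the trigger list, window size, and function names are assumptions for illustration, not the authors' algorithm, and the deep learning classifier itself is not reproduced here.

```python
import re

# Hypothetical negation cues and target terms; the study's actual lexicons are not published here.
NEGATION_TRIGGERS = ["no evidence of", "without evidence of", "negative for", "no definite"]
RECURRENCE_TERMS = ["recurrence", "recurrent", "metastasis", "metastatic", "metastases"]

def mentions_recurrence(sentence: str, window: int = 6) -> bool:
    """Return True if a recurrence/metastasis term appears and is NOT
    preceded by a negation cue within `window` tokens (a NegEx-style rule)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    for i, tok in enumerate(tokens):
        if any(term in tok for term in RECURRENCE_TERMS):
            # Look back `window` tokens for a negation trigger.
            context = " ".join(tokens[max(0, i - window):i])
            if not any(trig in context for trig in NEGATION_TRIGGERS):
                return True
    return False

print(mentions_recurrence("No evidence of metastatic disease."))        # False
print(mentions_recurrence("Findings concerning for local recurrence."))  # True
```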
 

Co-Author(s)

Stephen Schwartz, Fred Hutchinson Cancer Research Center
Ruth Etzioni, Fred Hutchinson Cancer Research Center

Speaker

Lucas Liu, Fred Hutchinson Cancer Center

Evaluating Large Language Models to Accelerate Cancer Clinical Trials

Background: Adequate patient awareness and understanding of cancer clinical trials is essential for trial recruitment, informed decision-making, and protocol adherence. While Large Language Models (LLMs) have shown promise for patient education, their role in enhancing patient awareness of clinical trials remains unexplored. This study examined the performance and risks of LLMs in generating trial-specific educational content for potential participants.

Methods: GPT-4 was prompted to generate short clinical trial summaries and multiple-choice question-answer pairs from informed consent forms (ICFs) obtained from ClinicalTrials.gov. Zero-shot learning was used for summaries, with both a direct summarization approach and a sequential extraction-and-summarization approach. One-shot learning was used to develop question-answer pairs. We evaluated performance through patient surveys of summary effectiveness and crowdsourced annotation of question-answer pair accuracy, using held-out cancer trial ICFs not used in prompt development.
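As a rough illustration of the zero-shot direct-summarization setup, the sketch below uses the OpenAI Python SDK; the prompt wording, reading-level target, and word count are invented for illustration and are not the study's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_icf(icf_text: str) -> str:
    """Zero-shot, direct-summarization prompt (illustrative wording only)."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study used GPT-4; the exact model version is not specified
        messages=[
            {"role": "system",
             "content": "You write plain-language summaries of cancer clinical "
                        "trial informed consent forms for prospective participants."},
            {"role": "user",
             "content": "Summarize this informed consent form in about 150 words, "
                        "at an 8th-grade reading level, covering purpose, "
                        "procedures, risks, and benefits:\n\n" + icf_text},
        ],
    )
    return response.choices[0].message.content
```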

Results: For summaries, both prompting approaches achieved comparable readability and core-content coverage. Patients found the summaries understandable and reported that they improved clinical trial comprehension and interest in learning more about trials. The generated multiple-choice questions achieved high accuracy and agreement with crowdsourced annotators. For both summaries and multiple-choice questions, GPT-4 was most likely to include inaccurate information when prompted to provide information that was not adequately described in the ICFs.

Conclusions: LLMs such as GPT-4 show promise in generating patient-friendly educational content for clinical trials with minimal trial-specific engineering. The findings serve as a proof of concept for the role of LLMs in improving patient education and engagement in clinical trials, while underscoring the need for ongoing human oversight.

Speaker

Lizhou Fan, Harvard University

Harnessing Large Language Models in Extracting Longitudinal Smoking History from Unstructured Clinical Notes in Electronic Health Records for Improved Cancer Surveillance

Accurate smoking documentation in electronic health records is essential for effective risk assessment, screening, prevention, and patient monitoring. However, key smoking information is often absent or inaccurately recorded in structured data, contributing to inconsistencies in longitudinal data arising from recall bias and reporting errors. Large language models (LLMs) offer a promising solution for interpreting clinical text narratives. We developed a framework that uses LLMs to extract and harmonize longitudinal smoking histories, incorporating rule-based smoothing techniques. These techniques improve the quality of post-deployment smoking data by resolving conflicts and inconsistencies in key variables through trend analysis and back calculation.

We compared BERT-based models against generative AI models (Gemini 1.5 Flash, PaLM 2 Text-Bison, GPT-4) using a dataset of 1,683 manually annotated clinical notes from 500 patients across academic and community healthcare systems, and deployed them on 80,037 notes from 4,988 patients to extract seven smoking variables, including status, pack-years, duration, and cessation. We assessed the clinical utility of the curated longitudinal smoking data in evaluating the effectiveness of different post-treatment cancer surveillance strategies for detecting second malignancies, where smoking is a critical prognostic factor.
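The abstract mentions back calculation among the rule-based smoothing techniques but does not spell it out. One plausible reading, sketched below as an assumption rather than the authors' implementation, is recovering a missing quantity from the identity pack-years = packs per day × years smoked, plus a simple longitudinal consistency rule for smoking status.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SmokingRecord:
    note_date: str                       # ISO date of the clinical note
    status: Optional[str] = None         # "current", "former", or "never"
    pack_years: Optional[float] = None
    packs_per_day: Optional[float] = None
    years_smoked: Optional[float] = None

def back_calculate(rec: SmokingRecord) -> SmokingRecord:
    """Fill one missing quantity from pack_years = packs_per_day * years_smoked.
    Illustrative only; the study's smoothing rules (trend analysis, conflict
    resolution across notes) are more involved."""
    if rec.pack_years is None and rec.packs_per_day and rec.years_smoked:
        rec.pack_years = rec.packs_per_day * rec.years_smoked
    elif rec.years_smoked is None and rec.pack_years and rec.packs_per_day:
        rec.years_smoked = rec.pack_years / rec.packs_per_day
    elif rec.packs_per_day is None and rec.pack_years and rec.years_smoked:
        rec.packs_per_day = rec.pack_years / rec.years_smoked
    return rec

def smooth_status(records: list[SmokingRecord]) -> list[SmokingRecord]:
    """Resolve one common longitudinal conflict: a 'never' status appearing
    after any 'current'/'former' mention is treated as a recording error."""
    ever_smoked = False
    for rec in sorted(records, key=lambda r: r.note_date):
        if rec.status in ("current", "former"):
            ever_smoked = True
        elif rec.status == "never" and ever_smoked:
            rec.status = "former"  # a patient cannot revert to never-smoker
    return records
```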

Co-Author(s)

Anna Graber-Naidich, Stanford University
Aparajita Khan, Indian Institute of Technology Roorkee
Fatma Gunturkun, Stanford University
Summer Han, Stanford University

Speaker

Ingrid Luo, Stanford University

Automated Detection of Distant Metastasis in Lung Cancer from Longitudinal Reports Using State-of-the-Art Large Language Models

Timely and accurate identification of distant recurrence in non-small cell lung cancer (NSCLC) is critical for prognostic assessment and treatment optimization. Traditional methods relying on structured data often fail to capture the nuanced clinical details embedded in unstructured radiology and pathology reports. Recent advancements in large language models (LLMs) offer a promising approach for automating information extraction, enabling a more comprehensive and scalable analysis of recurrence patterns.

This study aims to systematically compare and evaluate the performance of multiple state-of-the-art LLMs—including GPT-4o, o1, DeepSeek-R1, LLaMA 3.3 (70B), Gemini 1.5 Pro, and MedLM-large—in detecting distant recurrence. The dataset comprises 30,161 radiology and pathology reports (collected between 2020 and 2022) from 2,116 lung cancer patients. A subset of 7,083 notes from 500 patients was manually annotated to establish a test dataset.

Zero-shot prompting was employed with standardized prompts across all models. A sample of errors was manually reviewed to identify common failure patterns. Fairness analysis was conducted to assess potential biases across demographic subgroups. The final model was deployed on 23,078 unannotated reports from additional lung cancer patients.
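The fairness analysis is described only at a high level. One standard approach, sketched below under the assumption of a labeled validation set, is to stratify sensitivity and precision by demographic subgroup and compare; the column names (y_true, y_pred, race_ethnicity) are hypothetical.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def subgroup_metrics(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-subgroup sensitivity (recall) and precision for a binary
    distant-recurrence label; large gaps across groups would flag
    potential bias. Column names are hypothetical."""
    rows = []
    for group, sub in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(sub),
            "sensitivity": recall_score(sub["y_true"], sub["y_pred"], zero_division=0),
            "precision": precision_score(sub["y_true"], sub["y_pred"], zero_division=0),
        })
    return pd.DataFrame(rows)

# Example usage on the annotated test set:
# subgroup_metrics(annotated_df, group_col="race_ethnicity")
```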

This study provides a structured framework for evaluating the performance of cutting-edge LLMs in clinical information extraction and underscores their potential to enhance the identification of distant recurrence in NSCLC.
 

Co-Author(s)

Summer Han, Stanford University
Aparajita Khan, Indian Institute of Technology Roorkee
Chloe Su

Speaker

Fatma Gunturkun, Stanford University