AI-driven Information Extraction from Unstructured Documents to Facilitate Decision Making in Clinical Development

Ryumei Nakada Co-Author
Rutgers University
 
Michelle Ngo Co-Author
Merck & Co., Inc.
 
Xiang Peng Co-Author
Merck
 
Junshui Ma Co-Author
Merck
 
Sabrina Shuyan Wan Co-Author
Merck
 
Thomas Jemielita Co-Author
 
Yulia Sidi Co-Author
Merck
 
Federico Ferrari Co-Author
 
Michelle Ngo Speaker
Merck & Co., Inc.
 
Tuesday, Aug 5: 11:15 AM - 11:35 AM
Invited Paper Session 
Music City Center 

Description

This project aims to develop an AI-driven, automated database to facilitate decision-making in oncology clinical trials. The downstream decision-making framework requires extensive historical data, such as tumor indication, biomarker information, Objective Response Rate (ORR), Overall Survival (OS), and Progression-Free Survival (PFS). Currently, the historical data available in-house come from manual collection, which is inefficient and laborious.
To automate this process, we developed a variable extraction tool that retrieves essential information from various sources, including external websites and internal documents. The tool leverages recent developments in large language models to transform unstructured data into structured data through key steps such as data pre-processing, context compression, multiple extraction phases, and extraction validation.
This approach yields high-quality data extraction comparable to manual human effort. The presentation will focus on the pipeline for automated data collection.
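
As a rough illustration of how the stages named above (pre-processing, context compression, extraction, validation) might fit together, here is a minimal Python sketch. It is not the tool described in this abstract. Every function name, the sample text, and the plausibility checks are hypothetical, and simple regular expressions stand in for the large language model calls so that the example runs on its own. The multiple extraction phases are collapsed into a single pass for brevity.

```python
import re
from typing import Dict, List, Optional

# Hypothetical stand-ins for LLM extraction prompts: one regex per variable.
PATTERNS = {
    "ORR": re.compile(r"\bORR\b[^.0-9]*?(\d+(?:\.\d+)?)\s*%"),
    "median_OS_months": re.compile(r"\bOS\b[^.0-9]*?(\d+(?:\.\d+)?)\s*months"),
    "median_PFS_months": re.compile(r"\bPFS\b[^.0-9]*?(\d+(?:\.\d+)?)\s*months"),
}
# Abbreviations used to decide which sentences are worth keeping.
TARGET = re.compile(r"\b(?:ORR|OS|PFS)\b")


def preprocess(raw: str) -> List[str]:
    """Pre-processing: normalize whitespace and split into sentences."""
    text = re.sub(r"\s+", " ", raw).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def compress_context(sentences: List[str]) -> str:
    """Context compression: keep only sentences that mention a target variable."""
    return " ".join(s for s in sentences if TARGET.search(s))


def extract(context: str, variable: str) -> Optional[float]:
    """Extraction: a regex here; an LLM call would go in its place (assumption)."""
    match = PATTERNS[variable].search(context)
    return float(match.group(1)) if match else None


def validate(variable: str, value: Optional[float]) -> bool:
    """Validation: illustrative plausibility checks on the extracted value."""
    if value is None:
        return False
    if variable == "ORR":
        return 0.0 <= value <= 100.0  # a response rate is a percentage
    return value > 0.0  # survival times must be positive


def run_pipeline(raw: str) -> Dict[str, float]:
    """Run all stages and return a structured record of validated values."""
    context = compress_context(preprocess(raw))
    record: Dict[str, float] = {}
    for variable in PATTERNS:
        value = extract(context, variable)
        if validate(variable, value):
            record[variable] = value
    return record


# Made-up sample text, not real trial data.
SAMPLE = (
    "Study page header.   The ORR was 45.2% in the treated arm. "
    "Median PFS was 6.9 months. Median OS was 18.5 months. "
    "Contact the sponsor for details."
)
record = run_pipeline(SAMPLE)
print(record)
```

Keeping each stage as a separate function means the regex-based `extract` could be swapped for a model call without touching the compression or validation steps.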

Keywords

LLM, genAI, clinical trials, structured database, information retrieval, variable extraction