Big Data Analysis with Applications to Biostatistics

Yichuan Zhao Chair
Georgia State University
 
Ding-Geng Chen Discussant
Arizona State University, College of Health Solutions
 
Yichuan Zhao Organizer
Georgia State University
 
Monday, Aug 4: 8:30 AM - 10:20 AM
0586 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-101C 

Applied: Yes

Main Sponsor

International Chinese Statistical Association

Co-Sponsors

International Statistical Institute
Section on Statistical Learning and Data Science

Presentations

Cancer Human Disease Networks (cHDNs) via Deep Learning SEER-Medicare

Cancer patients often also suffer from other disease conditions. For more effective management and treatment, it is crucial to understand the "big picture". Human disease network (HDN) analysis provides an effective way to describe the interrelationships among diseases. The goal of this study is to mine the SEER-Medicare data and construct HDNs for major cancer types among the elderly. For network construction, we adopt penalized deep neural networks (pDNNs). DNNs can be more flexible than regression-based and other analyses, and penalization can effectively distinguish important disease interconnections from noise. As a "byproduct", we establish the asymptotic properties of pDNNs. The constructed cHDNs are carefully analyzed in terms of node, module, and network properties.

Keywords

human disease network

deep learning

SEER-Medicare

cancer 

Speaker

Shuangge Ma

Improve Fairness with Shift-Adjusted Neyman–Pearson Classifiers via Single Index Modeling (SACSI)

Neyman-Pearson (NP) classifiers, which aim to maximize the clinical benefit while adhering to risk constraints, are crucial in many practical fields, including early cancer detection. However, applying these classifiers can be challenging due to discrepancies between the data distributions of the source and target populations. The potential impact can be disproportionately severe on under-represented groups. We propose a semi-parametric model-based approach for adapting NP classifier decision rules to different populations while equitably controlling classification errors specific to clinical applications. Our method involves a shift-adjustment strategy that leverages a small unlabeled sample and minimal auxiliary information from the target population, alongside the labeled source data. This approach enhances the fairness of the learned decision rules and ensures they are consistently tailored for the target population. We demonstrate the performance through theoretical studies and simulations and illustrate the approach with an example from a prostate cancer study.

Keywords

Algorithm fairness

Data shift

Neyman-Pearson Classifier 

Speaker

Yingqi Zhao, Fred Hutchinson Cancer Research Center

Dynamic Propensity Trajectory Modeling and Matching with Time-Dependent Covariates for Causal Inference

In observational studies, propensity score (PS)-based causal inference techniques are commonly used to address selection bias in treatment assignment. Most existing PS research focuses on time-invariant treatments within a cross-sectional design. Limited attention has been given to PS processes in a longitudinal context involving survival endpoints, and even less work exists regarding time-varying treatments. Time-varying propensity score matching methods, as proposed by Lu (2005), have addressed time-dependent treatment receipt but have primarily been limited to continuous outcome measures, with only modest extensions. These methods consider pretreatment characteristics at a specific time point t without fully leveraging historical hazard information preceding time t. To bridge this gap, we introduce the dynamic propensity trajectory (DPT) framework and DPT-based matching (DPTM) techniques. These approaches achieve covariate balance across the entire study period, encompassing both time-invariant and time-varying covariates leading up to treatment initiation. In the primary analysis after matching, we quantify the causal treatment effects for time-to-event outcomes following treatment initiation. We apply the proposed methods to the Chronic Renal Insufficiency Cohort (CRIC) study to investigate the effects of antihypertensive medications in reducing the risk of cardiovascular disease among patients with chronic kidney disease. Additionally, we evaluate these methods in simulation studies, where our approaches outperform existing ones and yield the smallest bias.

Keywords

Causal Treatment Effect

Cox Proportional Hazard model

Observational Study

Propensity Score

Time-dependent Confounders 

Speaker

Ming Wang, Case Western Reserve University

Promising Tools for Integrating Information from Secondary Outcomes to Improve Primary Data Analysis: A New Usage of Secondary Outcomes in the Era of Big Data

In addition to the primary outcome, secondary outcomes are gaining prominence in contemporary biomedical research. These can be easily derived from traditional endpoints in clinical trials (source 1) and from compound or risk prediction scores in large-scale cohort studies or real-world data analysis (source 2). Despite being termed 'secondary,' these outcomes have significant potential to enhance estimation and inference in primary outcome analysis. This is particularly true when the primary outcome is a summary score derived from secondary outcomes, which may lack the detailed information specific to each secondary outcome. This talk will summarize the challenges of integrating information from secondary outcomes into primary data analysis and will describe recently developed tools to address these challenges. We will begin with an early version that considers only one secondary outcome (Tool1.0) and then move on to an updated version that can handle multiple secondary outcomes (Tool2.0). Building on the first two versions, we will describe the latest version (Tool3.0), which facilitates more robust information integration in a data-driven manner and has great potential for applications in the era of big data. Real data examples will be provided, and future directions toward Tool4.0 will be discussed at the end of the talk.

Keywords

Data integration

Statistical learning

Secondary outcomes 

Speaker

Chixiang Chen, University of Maryland School of Medicine