Precision Medicine and Machine Learning Methods

Sean McGrath, Chair
Harvard Medical School and Harvard Pilgrim Health Care Institute
 
Thursday, Aug 7: 8:30 AM - 10:20 AM
4214 
Contributed Papers 
Music City Center 
Room: CC-201A 

Main Sponsor

Section on Statistics in Epidemiology

Presentations

A tree-based scan statistic for database studies with time-to-event outcomes

Tree-based scan statistics (TBSSs) are machine learning methods for disproportionality analyses in database studies. They simultaneously scan thousands of hierarchically related outcomes to detect potential signals of harm from health products while controlling for multiplicity, and they have been used extensively in pharmacoepidemiology. Current TBSS implementations do not allow comparative safety evaluation with time-to-event outcomes, which are available in most database studies. Explicitly accounting for person-time can improve the power to detect signals compared with methods that use only the number of events. We propose three novel TBSSs for time-to-event data. The first assumes proportional hazard rates (HRs) for each node and uses a permutation scheme for inference. The second builds on exponential survival models for the terminal nodes of the hierarchy, implying a constant HR for each node, and uses a parametric bootstrap for inference. The third uses robust asymptotic approximations of the HRs to build an approximate parametric bootstrap. We compare the proposed methods with standard TBSSs in various simulation scenarios and in a database study.
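To illustrate why person-time matters under a constant-hazard (exponential) model, the following is a minimal sketch of a one-sided log-likelihood ratio for a single node and of the maximum over nodes. It is not the authors' implementation: the function names, the node-level input format, and the example counts are all hypothetical, and the reference distribution needed for inference (permutation or bootstrap, as described in the abstract) is not shown.

```python
import math


def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def exp_llr(d1, t1, d0, t0):
    """One-sided log-likelihood ratio for a higher constant hazard in the exposed group.

    d1, d0: event counts in the exposed and comparator groups.
    t1, t0: person-time in the exposed and comparator groups (must be > 0).
    """
    if d1 / t1 <= d0 / t0:
        return 0.0  # no evidence of an elevated hazard in the exposed group
    d, t = d1 + d0, t1 + t0
    # Exponential log-likelihood at the MLE is d*log(d/t) - d, so the "-d" terms cancel.
    return xlogy(d1, d1 / t1) + xlogy(d0, d0 / t0) - xlogy(d, d / t)


def top_node(nodes):
    """Return (name, LLR) for the node with the largest LLR.

    nodes maps a node name to a (d1, t1, d0, t0) tuple (hypothetical format).
    """
    return max(((name, exp_llr(*v)) for name, v in nodes.items()), key=lambda p: p[1])


# Illustrative, made-up counts and person-time for two outcome nodes.
example = {"cardiac": (20, 100.0, 10, 100.0), "renal": (5, 100.0, 6, 100.0)}
best_name, best_llr = top_node(example)
```

The maximum LLR over all nodes plays the role of the scan statistic; its significance would have to be assessed against replicates generated under the null hypothesis.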

Keywords

Data mining

Epidemiology

Multiple testing

Scan statistics

Tree variable

Co-Author(s)

Georg Hahn
Shirley Wang

First Author

Massimiliano Russo, The Ohio State University

Presenting Author

Massimiliano Russo, The Ohio State University

Comparing machine learning to existing risk scores when predicting CVD in type 2 diabetes patients

Type 2 diabetes (T2D) increases the risk of cardiovascular disease (CVD). Several calculators have been developed to estimate CVD risk; however, they may underestimate risk in populations such as people with T2D. The Look AHEAD randomized clinical trial tested a behavioral weight-loss intervention in overweight/obese adults with T2D. Fatal and non-fatal CVD was the primary outcome, and congestive heart failure (CHF) was a secondary outcome. We use repository data from 4685 Look AHEAD participants to build models that predict survival probability for the time to the primary outcome (number of events, ne=763) and to CHF (ne=201) using different machine learning (ML) algorithms. The best ML model is chosen by comparing discrimination, calibration, and overall accuracy using the C-index, the D-calibration index, and the integrated Brier score, respectively. We then use data from the ACCORD study to validate our ML model and to check whether it predicts these two outcomes better than the PREVENT calculator, the Framingham risk score, and the ACC/AHA pooled cohort equations calculator. This lets us determine whether these calculators need to be improved for people with T2D.
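As an illustration of the discrimination measure mentioned above, here is a minimal sketch of Harrell's C-index for right-censored data. It is a plain O(n²) version written for clarity, not the authors' code; the function name and the toy inputs are hypothetical, and pairs with tied event times are simply skipped.

```python
def c_index(times, events, risk):
    """Harrell's C: share of comparable pairs in which the higher risk score
    belongs to the subject with the earlier observed event.

    times:  observed follow-up times.
    events: 1 if the event was observed, 0 if censored.
    risk:   predicted risk scores (higher means earlier expected event).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Toy data (made up): the third subject is censored.
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 1, 0, 1]
perfect = c_index(toy_times, toy_events, [4, 3, 2, 1])
uninformative = c_index(toy_times, toy_events, [1, 1, 1, 1])
```

A value of 1 indicates perfect ranking and 0.5 indicates ranking no better than chance; calibration and overall accuracy need separate measures, as the abstract notes.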

Keywords

survival outcome prediction

model selection

model validation

Risk prediction 

Co-Author(s)

Emma Stinson, National Institute of Diabetes and Digestive and Kidney Diseases
William Knowler, National Institute of Diabetes and Digestive and Kidney Diseases
Jonathan Krakoff, National Institute of Diabetes and Digestive and Kidney Diseases
Robert Hanson, National Institute of Diabetes and Digestive and Kidney Diseases

First Author

Elsa Vazquez Arreola, National Institute of Diabetes and Digestive and Kidney Diseases

Presenting Author

Elsa Vazquez Arreola, National Institute of Diabetes and Digestive and Kidney Diseases

Development of an R Package to Predict Item Responses Accounting for Differential Item Functioning

Patient-reported outcome measures (PROMs) are multi-item scales that capture patients' appraisals of their quality of life. Differential item functioning (DIF) is a potential source of measurement bias that occurs when patients with the same health status interpret PROM items differently because of characteristics such as demographics and comorbid conditions. DIF can be detected on multiple covariates using item-focused tree (IFT) models, which combine item-specific logistic regression with recursive partitioning based on structural change tests. The DIFtree package in R fits IFT models but lacks tools to evaluate model performance. We developed an R package, IFTPredictor, to predict item responses, adjusted for DIF, using the IFT model. The package accepts a fitted IFT model, a dataset for prediction, and total item scores as inputs. It generates logistic regression equations, predicted probabilities, and predicted responses for each item, incorporating subgroup-specific covariates for DIF items. Predicted responses are used to evaluate the IFT model on accuracy, precision, and calibration. Our package will enhance the usability of the IFT model for DIF analysis on multiple covariates.

Keywords

Model-based recursive partitioning

Measurement invariance

Item responses

Machine-learning

Logistic regression 

Co-Author(s)

Barret Monchka, George and Fay Yee Centre for Healthcare Innovation, University of Manitoba
Lisa Lix, University of Manitoba

First Author

Bodawatte Gedara Muditha Lakmali, University of Manitoba

Presenting Author

Bodawatte Gedara Muditha Lakmali, University of Manitoba

Functional Regression Model with Autocorrelation: Applications to Cancer Mortality Rates

This report extends the Generalized Least Squares (GLS) method to accommodate functional regression models with dependent errors. Specifically, we apply an AR(1) autocorrelation structure to effectively model and forecast age-adjusted lung cancer mortality rates across nine U.S. registries. Utilizing data recorded from 1975 to 2015 for various age groups, we investigate the intrinsic functional structure of these mortality rates. Our study further evaluates the predictive performance of the functional regression model in comparison to classical time series methods, such as ARIMA. 
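For readers unfamiliar with GLS under AR(1) errors, the following is a minimal scalar sketch using the Prais-Winsten transformation for a model y_t = a + b·x_t + e_t with a known autocorrelation ρ. It is a simplified analogue only: the functional regression extension described in the abstract is not shown, ρ would normally be estimated, and the function name and data are hypothetical.

```python
import math


def prais_winsten(y, x, rho):
    """GLS estimates (a, b) for y_t = a + b*x_t + e_t with AR(1) errors, |rho| < 1.

    The data are whitened so that ordinary least squares on the transformed
    series is equivalent to GLS on the original series.
    """
    n = len(y)
    w = math.sqrt(1.0 - rho ** 2)
    # First observation is rescaled; later ones are quasi-differenced.
    ys = [w * y[0]] + [y[t] - rho * y[t - 1] for t in range(1, n)]
    c0 = [w] + [1.0 - rho] * (n - 1)  # transformed intercept column
    c1 = [w * x[0]] + [x[t] - rho * x[t - 1] for t in range(1, n)]
    # Solve the 2x2 normal equations directly.
    s00 = sum(u * u for u in c0)
    s01 = sum(u * v for u, v in zip(c0, c1))
    s11 = sum(v * v for v in c1)
    r0 = sum(u * z for u, z in zip(c0, ys))
    r1 = sum(v * z for v, z in zip(c1, ys))
    det = s00 * s11 - s01 ** 2
    a = (s11 * r0 - s01 * r1) / det
    b = (s00 * r1 - s01 * r0) / det
    return a, b


# Noise-free toy series (made up): the true line is recovered for any rho.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys_toy = [2.0 + 3.0 * v for v in xs]
a_hat, b_hat = prais_winsten(ys_toy, xs, rho=0.5)
```

Setting rho to 0 leaves the data unchanged, so the same function reduces to ordinary least squares.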

Keywords

Functional Data

Autocorrelation

ARIMA

Registries

Time Series

Regression 

First Author

Keshav Pokhrel, University of Michigan-Dearborn

Presenting Author

Keshav Pokhrel, University of Michigan-Dearborn

Pooled analysis of nested case-control and case-cohort studies for risk assessment and prediction

Pooled analyses across multiple cohort studies are increasingly common because they offer greater statistical power. However, in prospective biomarker studies, full-cohort measurements of the biomarkers of interest are often unavailable due to logistical and financial constraints, requiring nested case-control or case-cohort designs. While methods exist for pooling nested case-control samples, combining studies with different sampling designs remains a challenge. Motivated by the B2RISK consortium, which includes both designs, we propose methods for relative risk evaluation and risk prediction that pool multiple studies with different designs. For relative risk evaluation, we use inverse probability weighting with a robust variance estimator for consistent estimation. For risk prediction, we employ a pseudo-likelihood to incorporate parent cohort data from nested case-control studies, leading to consistent and efficient risk prediction rules. Through extensive simulations, we evaluate our methods' performance and demonstrate their advantages over standard approaches, including ad hoc methods commonly used in practice. We apply our methods to the B2RISK breast cancer study for illustration.

Keywords

biomarker data

combined analysis

conditional likelihood

unconditional likelihood

logistic regression

selection bias 

Co-Author(s)

Susan Hankinson, School of Public Health and Health Sciences, University of Massachusetts, Amherst
Jing Qian, University of Massachusetts Amherst

First Author

Jinghan Cui, University of Massachusetts, Amherst

Presenting Author

Jinghan Cui, University of Massachusetts, Amherst

Validity of absolute risk estimates derived from matched case-control studies and population rates

Absolute risk prediction models are important tools for disease prevention. They are best studied in prospective cohorts. However, when the disease incidence rate is low, synthesizing data and information from multiple sources is an important strategy. Previously, we exemplified this strategy by proposing a two-stage procedure to estimate a logistic regression model for predicting lung cancer occurrence among never-smoking females in Taiwan, based on age-matched case-control studies and age-specific lung cancer incidence rates in that population. With additional information on the age-specific population distribution of the risk factors, in this presentation we establish the asymptotic theory of this procedure, use it to construct confidence intervals, examine its numerical performance in simulation studies, and apply it to estimate the numbers, with confidence intervals, of Taiwanese never-smoking women whose lung cancer risk is higher than the thresholds discussed in the literature on low-dose computed tomography lung cancer screening. Such estimates are useful in health policy decision making.
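One common way to combine case-control odds ratios with a population rate is to calibrate the logistic intercept so that the population-average predicted risk matches the known incidence. The sketch below illustrates that idea for a single binary risk factor within one age stratum; it is an assumption-laden illustration, not the authors' two-stage procedure, and the function names and all numbers are hypothetical.

```python
import math


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_risk(alpha, beta, x_values, x_probs):
    """Population-average risk under logit(risk) = alpha + beta * x."""
    return sum(p * expit(alpha + beta * x) for x, p in zip(x_values, x_probs))


def calibrate_intercept(beta, x_values, x_probs, incidence, lo=-30.0, hi=30.0):
    """Find alpha by bisection so that the average risk equals the incidence.

    beta:      log odds ratio, e.g. estimated from a case-control study.
    x_values:  risk-factor levels; x_probs: their population proportions.
    incidence: known population incidence for this stratum.
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_risk(mid, beta, x_values, x_probs) < incidence:
            lo = mid  # average risk too low: raise the intercept
        else:
            hi = mid
    return (lo + hi) / 2.0


# Made-up inputs: odds ratio 2, 30% exposed, incidence of 1 per 1000.
beta_toy = math.log(2.0)
alpha_toy = calibrate_intercept(beta_toy, [0, 1], [0.7, 0.3], 0.001)
risk_unexposed = expit(alpha_toy)
risk_exposed = expit(alpha_toy + beta_toy)
```

Bisection works here because the average risk is strictly increasing in the intercept; the resulting absolute risks keep the case-control odds ratio while reproducing the population rate.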

Keywords

Absolute risk prediction model

Matched case-control studies

Data synthesis

Low-dose computed tomography lung cancer screening

Co-Author(s)

Hsiao-Han Hung, National Health Research Institutes, Taiwan
Ting-Wai Chang, National Health Research Institutes, Taiwan
Hsin-Fang Jiang, National Health Research Institutes, Taiwan
Chao Hsiung, National Health Research Institutes
I-Shou Chang, National Health Research Institutes

First Author

Li-Hsin Chien, National Dong Hwa University

Presenting Author

Li-Hsin Chien, National Dong Hwa University