Frontiers in the Analysis of Contaminated Data in Decision Making

Gen Li Chair
University of Michigan
 
Tianying Wang Organizer
Colorado State University
 
Sunday, Aug 4: 4:00 PM - 5:50 PM
1011 
Invited Paper Session 
Oregon Convention Center 
Room: CC-255 

Applied

Yes

Main Sponsor

ENAR

Co Sponsors

Caucus for Women in Statistics
Section on Statistics in Epidemiology

Presentations

Causal Learning of Paired Vectors with Label Noise: Impact and Correction Methods

Causal inference involves determining whether a cause-effect relationship exists between two sets of interest, a task that can be framed as a binary classification problem. When dealing with a sequence of independent and identically distributed paired vectors, the kernel mean embedding of the probability distribution can be utilized to map the empirical distribution to a feature space. Subsequently, a classifier is trained in this feature space to predict causation for future vector pairs. However, this approach is susceptible to mislabeling of causal relationships, a common challenge in causation studies. In this talk, I will discuss the impact of mislabeled outputs on the training results. Moreover, I will present robust learning methods that take into account the mislabeling effects and offer theoretical justifications for the validity of these proposed methods.  
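The embedding step mentioned in this abstract can be sketched as follows. This is a minimal illustration only, not the speaker's method: it assumes a Gaussian kernel approximated by random Fourier features, and the names `draw_fourier_params` and `mean_embedding`, the bandwidth, and the toy data are all hypothetical. Each sample of paired vectors is reduced to one fixed-length feature vector, which a classifier could then take as input together with a (possibly mislabeled) causal label.

```python
import math
import random


def draw_fourier_params(dim, n_features, bandwidth, rng):
    """Draw random frequencies and phases that approximate a Gaussian kernel."""
    omegas = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
              for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return omegas, phases


def mean_embedding(pairs, omegas, phases):
    """Average the random Fourier features of the concatenated pairs (x, y)."""
    n_features = len(omegas)
    scale = math.sqrt(2.0 / n_features)
    total = [0.0] * n_features
    for x, y in pairs:
        z = list(x) + list(y)  # treat the pair as one joint observation
        for j, (w, b) in enumerate(zip(omegas, phases)):
            inner = sum(wk * zk for wk, zk in zip(w, z))
            total[j] += scale * math.cos(inner + b)
    return [t / len(pairs) for t in total]


rng = random.Random(0)
omegas, phases = draw_fourier_params(dim=2, n_features=50, bandwidth=1.0, rng=rng)

# Hypothetical toy sample in which y depends on x plus a small noise term.
pairs = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    pairs.append(([x], [x ** 2 + 0.1 * rng.gauss(0.0, 1.0)]))

# One feature vector summarizing the empirical distribution of the sample.
embedding = mean_embedding(pairs, omegas, phases)
```

In a full pipeline, one such vector would be computed per training sample, and the same `omegas` and `phases` would be reused across samples so that the embeddings are comparable.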

Speaker

Grace Yi, University of Western Ontario

Cox Model with Left-truncation, Complex Censoring, and Error-prone Survival Outcomes

Time-to-event analysis is one of the most popular tools for modeling disease process data. Measuring the true survival outcome (e.g., time to abnormality of cerebrospinal fluid biomarkers) can be expensive or invasive, so the true outcome is often available for only a small group of participants, which limits the sample size and the estimation efficiency. An inexpensive or less invasive auxiliary outcome that measures the true outcome with error may be collected instead. We propose a likelihood-based method and an EM algorithm for Cox regression models that incorporate the error-prone auxiliary outcomes and improve efficiency. The proposed method accommodates left-truncation of the event time of interest and complex censoring, both of which further complicate the problem. We show that the proposed regression coefficient estimator is consistent and asymptotically normally distributed. We evaluate the finite-sample performance of the proposed method through simulation studies and illustrate it using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.

Speaker

Sharon Xie, University of Pennsylvania, Perelman School of Medicine

Prediction in Measurement Error Models

We study the notoriously difficult problem of prediction in measurement error models. By directly targeting the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval, which achieves a pre-determined prediction level. The construction procedure requires a working model for the distribution of the error-prone variable. If the working model is correct, the prediction interval estimator attains the smallest variability in estimating the true center and length. If the working model is incorrect, the prediction interval estimator is still consistent. We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining a minimal prediction interval length. Numerical experiments are conducted to illustrate the performance, and we apply our method to predict the concentration of $A\beta_{1-42}$ in cerebrospinal fluid in Alzheimer's disease data.

Speaker

Yanyuan Ma, Penn State University

A Pseudo-simulation Extrapolation Method for Misspecified Models with Errors-in-variables in Epidemiological Studies

In epidemiological studies, it is often of interest to consider a misspecified model that categorizes a continuous variable, for example when analyzing the risk of obesity, because such a model is easier to interpret. When the continuous variable is contaminated with measurement error, ignoring this issue and performing a regular statistical analysis leads to severely biased point estimators and invalid confidence intervals. However, most existing methods that address measurement error either do not consider model misspecification or rely on strong parametric distributional assumptions. We propose a flexible pseudo-simulation extrapolation method, which provides valid and robust statistical inference under various models and makes no distributional assumptions on the observed data. We demonstrate that the proposed method can provide unbiased point estimation and valid confidence intervals under various regression models. By analyzing the Food Frequency Questionnaire in UK Biobank data, we show that ignoring measurement error underestimates the impact of a high-fat diet on BMI and obesity by at least 30% and 60%, respectively, compared with the results obtained after correcting for measurement error with the proposed method.
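For readers unfamiliar with simulation extrapolation, the sketch below illustrates the classical SIMEX idea on a simple linear regression. It is not the pseudo-simulation extrapolation method proposed in this talk. It assumes a known measurement error standard deviation `sigma_u`, a three-point grid for the added-noise level, and an exact quadratic extrapolation through those three points; all function names and simulated data are hypothetical.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def lagrange_extrapolate(grid, values, target):
    """Evaluate the polynomial through (grid, values) at the point target."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(grid, values)):
        weight = 1.0
        for j, xj in enumerate(grid):
            if j != i:
                weight *= (target - xj) / (xi - xj)
        total += weight * yi
    return total


def simex_slope(w, y, sigma_u, lambdas=(0.0, 1.0, 2.0), n_rep=20, rng=None):
    """Classical SIMEX: add extra noise, track the naive slope, extrapolate to -1."""
    rng = rng or random.Random(0)
    averaged = []
    for lam in lambdas:
        if lam == 0.0:
            averaged.append(slope(w, y))  # naive fit on the observed data
            continue
        sd = (lam ** 0.5) * sigma_u  # extra error variance is lam * sigma_u^2
        reps = [slope([wi + rng.gauss(0.0, sd) for wi in w], y)
                for _ in range(n_rep)]
        averaged.append(sum(reps) / n_rep)
    # lambda = -1 corresponds to the hypothetical case of no measurement error.
    return lagrange_extrapolate(lambdas, averaged, -1.0)


# Hypothetical simulated data: true slope 1.0, covariate observed with error.
rng = random.Random(1)
sigma_u = 0.5
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [xi + rng.gauss(0.0, 0.3) for xi in x]
w = [xi + rng.gauss(0.0, sigma_u) for xi in x]

naive = slope(w, y)                       # attenuated toward zero
corrected = simex_slope(w, y, sigma_u, rng=rng)
```

In this setup the naive slope is attenuated toward zero, while the extrapolated estimate typically lies closer to the true value of 1.0. A quadratic extrapolant removes most but not all of the bias.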

Speaker

Tianying Wang, Colorado State University