Wednesday, Aug 6: 8:30 AM - 10:20 AM
4140
Contributed Papers
Music City Center
Room: CC-103A
This session features presenters sharing the latest research in high-dimensional regression, including new methods for longitudinal data, diurnal data, and spatial data with high-dimensional covariates.
Main Sponsor
Biometrics Section
Presentations
There is growing interest in longitudinal omics data paired with a longitudinal clinical outcome. Given a large set of continuous omics variables and a continuous clinical outcome, each measured for a few subjects at only a few time points, we seek to identify those variables that co-vary over time with the outcome in one or more treatment groups. To motivate this problem, we study a dataset with hundreds of urinary metabolites along with tuberculosis mycobacterial load as our clinical outcome, with the objective of identifying potential biomarkers for disease progression in two treatment groups. For such data, clinicians usually apply simple linear mixed-effects models, which often lack power given the low number of replicates and time points. Our previous method, PROLONG, combines group lasso and network Laplacian penalties on first-differenced data, increasing power and utilizing the variance across both time and omics features. We extend the PROLONG model to multiple treatment groups by debiasing the group lasso plus Laplacian model and performing inference on the debiased estimator.
Keywords
Omics
High Dimensional
Metabolomics
Regression
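To make the combination of penalties in the abstract above concrete, the following is a minimal sketch of first-differencing and of a generic group lasso plus network Laplacian penalty. It is not the PROLONG implementation: the array layout, the function names, and the tuning parameters lam_group and lam_net are assumptions made only for illustration.

```python
import numpy as np

def first_difference(x):
    """Difference along the time axis.

    x has shape (subjects, timepoints, features); the result has one
    fewer time point. The layout is an assumption for this sketch.
    """
    return np.diff(x, axis=1)

def group_lasso_laplacian_penalty(beta, laplacian, lam_group, lam_net):
    """Generic penalty: group lasso over rows plus a Laplacian quadratic form.

    beta: (features, n_coef) matrix, one row (group) per omics feature.
    laplacian: (features, features) graph Laplacian over the features.
    lam_group, lam_net: hypothetical tuning parameters.
    """
    # Group lasso: sum of Euclidean norms of each feature's coefficients
    group_term = np.sum(np.linalg.norm(beta, axis=1))
    # Network term: penalizes coefficient differences between linked features
    net_term = np.trace(beta.T @ laplacian @ beta)
    return lam_group * group_term + lam_net * net_term
```

The group term can set all coefficients of a feature to zero at once, while the Laplacian term encourages connected features to have similar coefficients.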
A discrete spatial lattice can be cast as a network structure over which spatially correlated outcomes are observed. A second network structure may also capture similarities among measured features, when such information is available. Incorporating the network structures when analyzing such doubly structured data can improve predictive power and lead to better identification of important features in the data-generating process. Motivated by applications in spatial disease mapping, we develop a new doubly regularized regression framework to incorporate these network structures for analyzing high-dimensional datasets. Our estimators can be easily implemented with standard convex optimization algorithms. In addition, we describe a procedure to obtain asymptotically valid confidence intervals and hypothesis tests for our model parameters. We show empirically that our framework provides improved predictive accuracy and inferential power compared to existing high-dimensional spatial methods. These advantages hold given fully accurate network information, and also with networks that are partially misspecified or uninformative.
Keywords
high-dimensional data
penalized regression
spatial data
networks
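As an illustration of how a discrete spatial lattice can be cast as a network, as described in the abstract above, the sketch below builds the graph Laplacian of a rectangular grid. The rook-adjacency rule (cells sharing an edge are neighbors) and the function name are assumptions; the abstract does not state which neighborhood structure the authors use.

```python
import numpy as np

def lattice_laplacian(n_rows, n_cols):
    """Graph Laplacian (degree minus adjacency) of an n_rows x n_cols grid.

    Cells are indexed row by row; two cells are linked when they share
    an edge (rook adjacency, an assumption for this sketch).
    """
    n = n_rows * n_cols
    adj = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:  # neighbor to the right
                adj[i, i + 1] = adj[i + 1, i] = 1.0
            if r + 1 < n_rows:  # neighbor below
                adj[i, i + n_cols] = adj[i + n_cols, i] = 1.0
    return np.diag(adj.sum(axis=1)) - adj
```

For a vector y of outcomes on the lattice, the quadratic form y @ L @ y equals the sum of squared differences across neighboring cells, which is the usual way a Laplacian encodes spatial smoothness.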
Improving strength of routine is a target for many therapies and treatments of mood and affective disorders. Smartphone usage data enables us to model person-specific diurnal patterns of usage that provide useful insight into a person's routine and behavior. Considering phone usage as a point process, existing approaches focus on capturing self-exciting behavior, the phenomenon where the rate of usage is heightened during and immediately after using one's phone. While this self-exciting phenomenon is important, there are limited methods that also allow for flexible modeling of diurnal effects on the rate of smartphone usage. We propose a framework that can combine the self-exciting Hawkes process with a penalized Fourier series to capture important diurnal trends. Through simulation experiments and an application to a cohort of patients with affective disorders, we show the benefit of models that account for self-exciting and diurnal patterns concurrently.
Keywords
mobile health
longitudinal and correlated data
point processes
diurnal patterns
event data
mental health
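The idea in the abstract above of pairing a self-exciting Hawkes process with a Fourier series can be sketched as a conditional intensity function. This is an illustrative form only: the additive baseline, the exponential excitation kernel, the 24-hour period, and all parameter names are assumptions, and the penalization of the Fourier coefficients is not shown.

```python
import numpy as np

def diurnal_baseline(t, mu, a, b, period=24.0):
    """Truncated Fourier series baseline with cosine (a) and sine (b) terms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    k = np.arange(1, len(a) + 1)  # harmonic indices 1..K
    angle = 2.0 * np.pi * k * t / period
    return mu + np.sum(a * np.cos(angle) + b * np.sin(angle))

def hawkes_intensity(t, events, mu, a, b, alpha, beta, period=24.0):
    """Baseline plus exponentially decaying excitation from past events.

    alpha scales the excitation and beta is the decay rate (both
    hypothetical names). Only events strictly before t contribute.
    """
    events = np.asarray(events, dtype=float)
    past = events[events < t]
    excitation = alpha * beta * np.sum(np.exp(-beta * (t - past)))
    return diurnal_baseline(t, mu, a, b, period) + excitation
```

An additive Fourier baseline can become negative for some coefficient values, so a practical model would need a constraint or a positive link (for example, exponentiating the baseline); that choice is left open here.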
Extensive literature explores the modeling of dynamic and functional responses using functional regression approaches that apply smoothing techniques to capture complex data trends in functional covariates and time-varying scalar predictors. However, there has been relatively less focus on understanding how longitudinal and functional predictors prone to measurement errors influence dynamic functional outcomes. Addressing this gap, we propose a functional dynamic modeling framework that accounts for measurement errors in both functional and scalar predictors. This approach aims to enhance our understanding of how self-reported mealtimes, which serve as longitudinal measures, influence glycemic dynamics over time. Additionally, our model incorporates actigraphy-measured physical activity, which is prone to measurement errors, to provide a more comprehensive analysis. Finite sample properties were established through simulations. We applied the methods to data from a prospective cohort study of 277 healthy pregnant women to determine optimal meal timing and its association with dynamic glycemic outcomes in pregnancy.
Keywords
Functional data
Glycemic dynamics
Measurement Error
Optimal Meal Timing
Physical Activity
Meal Type
Multimodal statistical models have gained much attention in recent years, yet rigorous statistical tools for inferring the significance of a single modality within a multimodal model are lacking. This inference problem is particularly challenging in high-dimensional multimodal models. In high-dimensional multimodal generalized linear models, we propose a novel entropy-based metric, called the Expected Relative Entropy (ERE), to quantify the information gain of one modality in addition to all other modalities in the model. We then propose a deviance-based statistic to estimate the ERE. We prove that the deviance-based statistic is a consistent estimator of the ERE and derive its asymptotic distribution, which enables the calculation of confidence intervals and p-values to assess the significance of a given modality. We numerically evaluate the empirical performance of our proposed inference tool on various high-dimensional multimodal generalized linear models and demonstrate its good performance. We also apply our method to a multimodal neuroimaging dataset to demonstrate its capability to infer the significance of imaging modalities, which is crucial for neuroscience studies.
Keywords
High-dimensional inference
Multimodal data
Relative Entropy
Sure Independence Screening
Transfer learning has been proven useful for leveraging information from multiple similar source datasets to enhance the performance of the target model. A fundamental challenge in transfer learning is avoiding negative transfer when there is heterogeneity among the sources and between the source and target datasets. Traditional methods are typically based on identifying informative sources. This creates a binary all-in or all-out decision, potentially resulting in the loss of useful information. In this paper, we introduce Targeted-IFS, a new transfer learning framework for high-dimensional Generalized Linear Models (GLMs) under heterogeneous sources. To avoid negative transfer and ensure effective transfer of useful information from sources, Targeted-IFS employs a pre-transfer debiasing step to correct estimates of selected informative features across all sources, rather than selecting the informative sources. We theoretically show that the Targeted-IFS method avoids negative transfer, achieving a convergence rate no worse than the classical LASSO using only target data, regardless of source heterogeneity. Simulations confirm its robustness to complex source heterogeneity and imp
Keywords
Generalized linear model
heterogeneity
informative support
negative transfer
robust transfer learning
Co-Author(s)
Yudong Wang, University of Pennsylvania, Perelman School of Medicine
Tingyin Wang
Yumou Qiu, Peking University
Yang Ning, Cornell University
Yong Chen, University of Pennsylvania, Perelman School of Medicine
First Author
Jie Hu, University of Pennsylvania
Presenting Author
Jie Hu, University of Pennsylvania
Conditional density estimation in high-dimensional data has been studied extensively in recent times. In this talk, we propose a model to estimate the conditional density of responses, which varies spatially, given a covariate vector and a specific location. By utilizing a variation of logspline models, we nonparametrically approximate the unknown link using a triangular basis expansion and assuming a Gaussian prior on the coefficients. We show that the posterior contracts to the true density at a minimax optimal (up to a logarithmic factor) rate. We evaluate the performance of our method with numerous simulations and compare the results with related high-dimensional density estimation techniques. We illustrate our method on a summary measure, namely the Fractional Anisotropy, collected from 213 subjects at 83 brain locations in a dataset generated by the Alzheimer's Disease Neuroimaging Initiative, to identify the functional relationship between the various covariates and the response at the various locations.
Keywords
Density estimation
Posterior Contraction
Spline
Scalable approximation
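The logspline idea with a triangular basis mentioned in the abstract above can be illustrated in its simplest form: a density on an interval whose logarithm is a linear combination of hat functions. This sketch is a plain one-dimensional density, not the authors' model; it omits the dependence on covariates and spatial location as well as the Gaussian prior, and the equally spaced knots and function names are assumptions.

```python
import numpy as np

def triangular_basis(y, knots):
    """Evaluate hat functions centered at equally spaced knots.

    Returns an array of shape (len(y), len(knots)). Inside the knot
    range the basis values in each row sum to one.
    """
    y = np.asarray(y, dtype=float)
    knots = np.asarray(knots, dtype=float)
    h = knots[1] - knots[0]  # common knot spacing (assumed equal)
    return np.clip(1.0 - np.abs(y[:, None] - knots[None, :]) / h, 0.0, None)

def logspline_density(theta, knots, grid):
    """Density proportional to exp(basis @ theta), normalized on the grid."""
    log_f = triangular_basis(grid, knots) @ np.asarray(theta, dtype=float)
    unnorm = np.exp(log_f - log_f.max())  # subtract max for stability
    # Trapezoidal rule for the normalizing constant
    dx = np.diff(grid)
    mass = np.sum(dx * (unnorm[:-1] + unnorm[1:]) / 2.0)
    return unnorm / mass
```

Working on the log scale keeps the density positive for any coefficient vector, which is one reason priors such as a Gaussian can be placed directly on the coefficients.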