Harnessing the power of large-scale and heterogeneous data with integrative analysis

Jiuchen Zhang, Chair
University of California, Irvine

Qi Xu, Organizer
Carnegie Mellon University
Sunday, Aug 3: 4:00 PM - 5:50 PM
0656 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-103C 

Applied: Yes

Main Sponsor

Section on Statistical Learning and Data Science

Co-Sponsors

International Chinese Statistical Association
Section on Statistics in Genomics and Genetics

Presentations

Federated Transfer Learning with Differential Privacy

Federated learning is gaining increasing popularity, with data heterogeneity and privacy being two prominent challenges. In this paper, we address both issues within a federated transfer learning framework, aiming to enhance learning on a target data set by leveraging information from multiple heterogeneous source data sets while adhering to privacy constraints. We rigorously formulate the notion of federated differential privacy, which offers privacy guarantees for each data set without assuming a trusted central server. Under this privacy constraint, we study three classical statistical problems, namely univariate mean estimation, low-dimensional linear regression, and high-dimensional linear regression. By investigating the minimax rates and identifying the costs of privacy for these problems, we show that federated differential privacy is an intermediate privacy model between the well-established local and central models of differential privacy. Our analyses incorporate data heterogeneity and privacy, highlighting the fundamental costs of both in federated learning and underscoring the benefit of knowledge transfer across data sets. 
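
To make the federated privacy model concrete, here is a minimal illustrative sketch of the simplest of the three problems, univariate mean estimation: each data set clips its observations and releases a Laplace-noised local mean, so privacy holds without a trusted central server. The function and parameters below are hypothetical and are not the estimator analyzed in the paper.

import numpy as np

def federated_dp_mean(datasets, epsilon, clip=1.0, seed=None):
    """Toy federated DP mean: each data set releases a noisy clipped mean.

    Noise is added locally, before anything is shared, so no trusted central
    server is needed. Illustrative only; not the estimator from the paper.
    """
    rng = np.random.default_rng(seed)
    noisy_means, weights = [], []
    for x in datasets:
        n = len(x)
        local_mean = np.clip(x, -clip, clip).mean()
        # Changing one point moves the clipped mean by at most 2*clip/n,
        # so Laplace noise at this scale gives epsilon-DP for this data set.
        noisy_means.append(local_mean + rng.laplace(scale=2.0 * clip / (n * epsilon)))
        weights.append(n)
    # Aggregate the per-data-set releases, weighting by sample size.
    return np.average(noisy_means, weights=weights)

# Example: three heterogeneous sources with different means and sizes.
rng = np.random.default_rng(0)
sources = [rng.normal(0.2, 1, 500), rng.normal(0.0, 1, 2000), rng.normal(0.1, 1, 800)]
print(federated_dp_mean(sources, epsilon=1.0, seed=1))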

Co-Author(s)

Ye Tian, Columbia University, Department of Statistics
Yang Feng, New York University
Yi Yu, University of Warwick

Speaker

Mengchu Li

Pretraining and the Lasso

Pretraining is a popular and powerful paradigm in machine learning to pass information from one dataset to another. For example, suppose we have a modest-sized dataset of images of cats and dogs, and we plan to fit a neural network to classify them. With pretraining, we start with a neural network trained on a large corpus of images, consisting of not just cats and dogs but hundreds of other image types. We then fix all network weights except the top layer(s), which perform the final classification, and fine-tune those on our dataset. This often results in dramatically better performance than the network trained solely on our smaller dataset.
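
For readers unfamiliar with that recipe, the sketch below shows the standard freeze-and-fine-tune step in PyTorch. It is illustrative background only, not code from the talk; the choice of backbone, optimizer, and hyperparameters is arbitrary.

import torch
import torch.nn as nn
from torchvision import models

# Start from a backbone pretrained on a large, diverse image corpus.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pretrained weights ...
for p in model.parameters():
    p.requires_grad = False

# ... and replace the top layer with a new cat-vs-dog classifier,
# the only part that will be fine-tuned on the small dataset.
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()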

In this talk, I will present a framework for pretraining the lasso, which allows us to enjoy the performance benefits of pretraining while retaining the interpretability and simplicity of sparse linear modeling. Suppose for example we wish to predict cancer survival time using a dataset that spans multiple cancer types. With lasso pretraining, we start by fitting a lasso model using the entire dataset, then we use this to guide the fitting of a specific model for each cancer type. Importantly, we have a hyperparameter which determines the influence of the overall model on the specific models. This process also reveals which features are predictive for most or all classes, and which are predictive for one or just a few. This latter set will often be of most interest to the scientist.

Lasso pretraining is a general framework with a wide variety of applications, including stratified models, multi-response models, and conditional average treatment effect estimation, and I will demonstrate its use with real-world biomedical examples.
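
The following is a stylized sketch of the overall-then-specific idea, under my own simplification in which each group's model is fit to what remains after discounting the overall fit by a mixing hyperparameter; the penalty structure used in the talk differs, and all names below are illustrative.

import numpy as np
from sklearn.linear_model import Lasso

def pretrained_lasso(X, y, groups, mix=0.5, lam=0.1):
    """Stylized 'overall model guides group-specific models' sketch.

    Fit one lasso on all of the data, then fit each group's lasso to the
    response left after discounting the overall prediction by `mix`
    (mix=1 trusts the overall model fully, mix=0 ignores it). Illustrative
    only; not the exact procedure presented in the talk.
    """
    overall = Lasso(alpha=lam).fit(X, y)
    group_models = {}
    for g in np.unique(groups):
        idx = groups == g
        offset = mix * overall.predict(X[idx])
        group_models[g] = Lasso(alpha=lam).fit(X[idx], y[idx] - offset)
    return overall, group_models

# Features that matter for most classes show up in overall.coef_; features
# specific to one group show up only in that group's model.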

Co-Author(s)

Mert Pilanci, Stanford University
Balasubramanian Narasimhan, Stanford University
Julia Salzman, Stanford University
Jonathan Taylor, Stanford University
Robert Tibshirani, Stanford University

Speaker

Erin Craig

Recovery of Biological Signals Lost in Single-cell Batch Integration

Data integration is essential for aligning cells across batches in single-cell analyses, yet current methods often blur the line between technical and biological variation—sometimes erasing meaningful biological signals. In this talk, I will introduce a statistical framework and computational method that leverages experimental design to disentangle batch effects from true biological differences. Using a "pool-of-controls" approach, the method enables the recovery of biological signals that would otherwise be lost during integration. 
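
As a toy illustration of the pool-of-controls idea, suppose the same pool of control cells is profiled in every batch and batch effects act as additive mean shifts; the shift seen by the controls can then be estimated and removed from the remaining cells. This is a strong simplification, not the statistical framework of the talk, and all names below are hypothetical.

import numpy as np

def pool_of_controls_correct(expr, batch, is_control):
    """Toy batch correction: estimate each batch's technical shift from the
    shared control cells and subtract it, leaving biological differences
    between non-control cells intact. Illustration only.
    """
    expr = np.asarray(expr, dtype=float)
    # Reference profile: control cells averaged over all batches.
    ref = expr[is_control].mean(axis=0)
    corrected = expr.copy()
    for b in np.unique(batch):
        in_batch = batch == b
        # The shift seen by this batch's controls is attributed to technical effects.
        shift = expr[in_batch & is_control].mean(axis=0) - ref
        corrected[in_batch] -= shift
    return corrected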

Co-Author(s)

Nancy Zhang, University of Pennsylvania
Zongming Ma

Speaker

Zhaojun Zhang

Sparse Principal Component Analysis via Double Thresholding with Applications in Pseudo-Bulk Expression Data

We study the problem of principal component estimation in high-dimensional settings, where the leading principal components exhibit both group and individual sparsity. This simultaneous sparsity structure is commonly observed in multi-cell-type gene expression data, where the same genes are often expressed across related cell subtypes in biological processes. To incorporate this structure into PCA, we propose a double-thresholding algorithm that first filters out group-level signals via group thresholding, then applies individual thresholding within each selected group to enforce individual sparsity. Our algorithm is computationally efficient and scalable, making it well-suited for high-dimensional gene expression analysis. Furthermore, we establish the consistency and convergence rate of the resulting estimator. Experiments on both simulated and real datasets demonstrate the effectiveness of our approach. 
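
A minimal sketch of the double-thresholding idea as described above, with thresholds taken as given; the paper's data-driven threshold choices and theoretical guarantees are not reproduced here, and the function below is illustrative only.

import numpy as np

def double_threshold_pc(X, groups, group_thresh, indiv_thresh):
    """Illustrative double-thresholded estimate of the leading principal component.

    1. Compute the leading eigenvector of the sample covariance.
    2. Group step: zero out whole groups whose loading norm is small.
    3. Individual step: zero out small loadings within the kept groups.
    """
    X = X - X.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1].copy()                        # leading eigenvector
    for g in np.unique(groups):
        idx = groups == g
        if np.linalg.norm(v[idx]) < group_thresh:    # group thresholding
            v[idx] = 0.0
    v[np.abs(v) < indiv_thresh] = 0.0                # individual thresholding
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v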

Co-Author(s)

Jing Lei, Carnegie Mellon University
Kathryn Roeder, Carnegie Mellon University

Speaker

Qi Xu, Carnegie Mellon University