Sparse principal component analysis via double thresholding with applications in pseudo-bulk expression data

Jing Lei Co-Author
Carnegie Mellon University
 
Kathryn Roeder Co-Author
Carnegie Mellon University
 
Qi Xu Speaker
Carnegie Mellon University
 
Sunday, Aug 3: 5:05 PM - 5:25 PM
Topic-Contributed Paper Session 
Music City Center 
We study the problem of principal component estimation in high-dimensional settings, where the leading principal components exhibit both group and individual sparsity. This simultaneous sparsity structure is commonly observed in multi-cell-type gene expression data, where the same genes are often expressed across related cell subtypes in biological processes. To incorporate this structure into PCA, we propose a double-thresholding algorithm that first screens for group-level signals via group thresholding, then applies individual thresholding within each selected group to enforce individual sparsity. The algorithm is computationally efficient and scalable, making it well suited to high-dimensional gene expression analysis. We also establish the consistency and convergence rate of the resulting estimator. Experiments on both simulated and real datasets demonstrate the effectiveness of our approach.
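The two-stage thresholding idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the thresholding rule, the group definition, or the iteration scheme, so everything below is an assumption made for illustration. In particular, the function names `double_threshold` and `sparse_pc`, the use of hard thresholding, the Euclidean norm as the group-level statistic, the thresholds `group_tau` and `entry_tau`, and the embedding of the step in a power iteration are all hypothetical choices.

```python
import numpy as np


def double_threshold(v, groups, group_tau, entry_tau):
    """Zero out weak groups, then weak entries inside surviving groups.

    Assumed rule (not from the abstract): a group survives if its
    Euclidean norm is at least `group_tau`; within a surviving group,
    an entry survives if its absolute value is at least `entry_tau`.
    """
    v = np.asarray(v, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(v)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        block = v[idx]
        # Group step: discard the whole group if its norm is too small.
        if np.linalg.norm(block) < group_tau:
            continue
        # Individual step: hard-threshold entries within the kept group.
        out[idx] = np.where(np.abs(block) >= entry_tau, block, 0.0)
    return out


def sparse_pc(S, groups, group_tau, entry_tau, n_iter=50):
    """Thresholded power iteration for one leading sparse component.

    `S` is a symmetric covariance-like matrix. The power-iteration
    wrapper is an assumption chosen for illustration only.
    """
    # Initialise from the ordinary leading eigenvector of S
    # (eigh returns eigenvalues in ascending order).
    v = np.linalg.eigh(S)[1][:, -1]
    for _ in range(n_iter):
        w = double_threshold(S @ v, groups, group_tau, entry_tau)
        norm = np.linalg.norm(w)
        if norm == 0:  # thresholds too aggressive; nothing survives
            return w
        v = w / norm
    return v
```

In a pseudo-bulk setting one might, for example, let each group collect the coordinates of one gene across cell types, so that the group step selects genes and the individual step selects the cell types in which a selected gene is active. That mapping is also an assumption; the abstract does not state how groups are defined.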