IMS Lawrence D. Brown Ph.D. Student Award

Linjun Zhang Chair
Rutgers University
 
Stefan Wager Organizer
Stanford University
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
0339 
Invited Paper Session 
Music City Center 
Room: CC-208B 

Applied: No

Main Sponsor: IMS

Presentations

Alignment of Untargeted Data through their Covariances: A Novel Perspective on a Classical Tool in Optimal Transport

Feature alignment is a core challenge in statistics and machine learning, with critical applications in biostatistics, particularly in the alignment of untargeted metabolomics, proteomics, and lipidomics studies. These studies measure unlabeled compounds across patient cohorts, allowing for novel biomarker discovery but presenting complex feature matching problems when comparing, pooling, or annotating datasets. Traditional alignment methods from computer science often fail to capture the biological constraints required in such tasks. To address this, we explore the use of optimal transport—specifically, the Gromov–Wasserstein (GW) algorithm—for aligning features across biological datasets. We introduce GromovMatcher, a constrained GW solver, which demonstrates robust and accurate feature matching in real-world metabolomic studies of liver and pancreatic cancer, highlighting its utility in metabolomic data analysis.

Motivated by these results, we propose a new statistical framework for feature alignment between two unlabeled datasets whose features follow a Gaussian distribution with an unknown covariance structure. The key challenge is to recover the permutation of the features of one dataset relative to the other. We develop both a quasi-maximum likelihood estimator (QMLE) and a GW-based approach to solve this "covariance alignment" problem, framing it as a quadratic assignment problem. We demonstrate experimentally that computation of the GW estimator scales favorably via Sinkhorn optimization. Our theoretical analysis shows that both the QMLE and GW estimators achieve minimax-optimal statistical rates, offering the first statistical justification for using GW in feature alignment.
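As a minimal sketch of the covariance alignment problem (not the QMLE or Sinkhorn-based GW solvers from the talk), the following simulates two Gaussian datasets whose features differ by a hidden permutation and recovers that permutation by brute-force quadratic assignment on the sample covariances. All dimensions, seeds, and variable names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Two unlabeled Gaussian datasets sharing a covariance structure; the
# second dataset's features are secretly permuted.
d, n = 4, 5000
A = rng.normal(size=(d, d))
Sigma = A @ A.T  # unknown true covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
perm_true = np.array([2, 0, 3, 1])
Y = rng.multivariate_normal(np.zeros(d), Sigma, size=n)[:, perm_true]

Sx = np.cov(X, rowvar=False)
Sy = np.cov(Y, rowvar=False)

# Quadratic assignment by brute force: find the permutation p that best
# aligns the two sample covariances in Frobenius norm.
best_perm, best_cost = None, np.inf
for p in itertools.permutations(range(d)):
    p = np.array(p)
    cost = np.linalg.norm(Sx[np.ix_(p, p)] - Sy)  # Frobenius norm
    if cost < best_cost:
        best_perm, best_cost = p, cost
```

Brute force is only feasible for a handful of features; making this search tractable at realistic dimensions is precisely where the GW relaxation with Sinkhorn optimization comes in.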

This work is part of my PhD research with Philippe Rigollet and Yanjun Han at MIT, in collaboration with Vivian Viallon's group at IARC in Lyon.
 

Speaker

George Stepaniants

Model-free selective inference with conformal prediction

Artificial Intelligence (AI) has revolutionized decision-making and scientific discovery in fields like drug discovery, marketing, and healthcare. To ensure the reliability of these models, uncertainty quantification methods such as conformal prediction build prediction sets that cover the unknown labels of new data. These methods provide on-average (marginal) guarantees which, despite being useful, can be insufficient in decision-making processes that are inherently selective. For instance, early stages of drug discovery aim to identify a subset of promising drug candidates rather than assessing an "average" instance.

We introduce Conformal Selection, a novel framework that adds selective inference capabilities to conformal prediction to address these challenges. We focus on applications where predictions from black-box models are used to shortlist unlabeled test samples whose unobserved outcomes satisfy a desired property, such as identifying drug candidates with high binding affinity. Existing methods based on conformal prediction sets can neglect the selection bias, leading to a high fraction of false leads among the shortlisted candidates.

Leveraging a set of labeled data that are exchangeable with the unlabeled test points, our method constructs conformal p-values that quantify the evidence that an unobserved outcome is large. It then uses the Benjamini–Hochberg (BH) procedure to select the promising candidates whose p-values fall below a data-dependent threshold. We show that this procedure provides finite-sample, distribution-free FDR control.
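The two ingredients described above, conformal p-values built from exchangeable calibration scores and the BH step-up procedure, can be sketched as follows. This is a schematic illustration with made-up scores, not the talk's full Conformal Selection procedure.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """One-sided conformal p-values: a small p-value is evidence that a
    test unit's score is large relative to the calibration scores.
    p_j = (1 + #{i : cal_scores[i] >= test_scores[j]}) / (n + 1)."""
    n = len(cal_scores)
    exceed = (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1)
    return (1 + exceed) / (n + 1)

def benjamini_hochberg(pvals, alpha=0.1):
    """Indices selected by the BH step-up procedure at level alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    return order[: below[-1] + 1]

# Toy demo: ten calibration scores, two test units, one clearly large.
cal = np.arange(10.0)
pv = conformal_pvalues(cal, np.array([100.0, -5.0]))
selected = benjamini_hochberg(pv, alpha=0.2)  # selects only the first unit
```

In the actual method, the scores would come from a black-box model's conformity function, and the FDR guarantee rests on the exchangeability between calibration and test points.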

In addition, I will talk about extensions of the conformal selection method that address the challenges of distribution shift and of model selection for optimal performance. One extension, called weighted conformal selection, achieves FDR control when there is a covariate shift between the calibration and test data. Another extension, called optimized conformal selection, maintains FDR control even when the data are reused to select a data-dependent, best-performing conformity score. I'll also demonstrate practical applications of the framework in drug discovery and in the alignment of large language models.

This is based on my PhD work with Emmanuel Candès and on follow-up work completed shortly afterward.
 

Speaker

Ying Jin, Stanford University

Dynamic Networks with Possibly Erratic Changes over Time

Dynamic analysis is a fundamental problem in network analysis. Real networks often exhibit a dynamic or multi-layered nature. Unlike static network analysis, which relies on a single snapshot, dynamic network analysis focuses on the mechanisms driving the time evolution of network properties. For example, trade relationships between countries or gene regulatory networks are expected to change over time or over the course of cell development. Historically considered a blind spot within network science due to its complexity and limited data availability, dynamic network analysis has recently become an active area of research holding great potential for applications in the social sciences, biology, and many other disciplines.

We start by introducing the dynamic degree-corrected mixed-membership stochastic block model. Consider a dynamic network setting where we have a total of T mixed-membership networks for the same set of n nodes and K communities. We assume that, in each snapshot, there may be severe degree heterogeneity, and that across time the degrees of a node may change erratically while its memberships evolve slowly. We are interested in estimating the memberships of all n nodes across all T snapshots. The problem is complex: multiple challenges must be addressed simultaneously, and fixing one issue often introduces another. We have explored several seemingly plausible approaches and found all of them to be non-optimal.
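A toy generator for a single snapshot of a degree-corrected mixed-membership network illustrates the model ingredients named above: degree heterogeneity through node weights theta, and mixed memberships through a membership matrix Pi. The parameter values and names here are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# One snapshot of a degree-corrected mixed-membership (DCMM) network:
# P_ij = theta_i * theta_j * pi_i^T B pi_j, and A_ij ~ Bernoulli(P_ij).
n, K = 50, 2
theta = rng.uniform(0.2, 1.0, size=n)   # degree heterogeneity weights
Pi = rng.dirichlet(np.ones(K), size=n)  # mixed memberships (rows sum to 1)
B = np.array([[1.0, 0.2],
              [0.2, 1.0]])              # community connectivity matrix
P = np.outer(theta, theta) * (Pi @ B @ Pi.T)  # edge probabilities, all <= 1

A = (rng.uniform(size=(n, n)) < P).astype(int)
A = np.triu(A, 1)
A = A + A.T  # symmetric adjacency matrix with no self-loops
```

In the dynamic setting of the talk there would be T such snapshots, with theta allowed to change erratically across time while Pi drifts slowly.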

Leveraging the insights gained from these studies, we propose dyn-MSCORE as a new approach to estimating mixed memberships. Our method combines kernel smoothing with Mixed-SCORE and incorporates several new ideas, which enable us to resolve all major issues (e.g., temporal misalignment, nonlinearity, and severe degree heterogeneity) simultaneously without introducing a new major issue. We establish sharp bounds for the error rates of dyn-MSCORE and demonstrate that the rates are optimal. Additionally, we identify an interesting phase transition, depicting how the error rates and optimal kernel bandwidths depend on T and on network sparsity. We further investigate the benefit of kernel smoothing, identifying two sub-regions where kernel smoothing is and is not helpful, respectively. Our method is supported by simulation studies and real-data examples, including representing and analyzing trading patterns between countries using international trade data.
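The kernel-smoothing ingredient can be illustrated in isolation: average the adjacency snapshots near a target time t using kernel weights. This is only a sketch of one component, with the Gaussian kernel and the bandwidth as assumed choices; dyn-MSCORE combines such smoothing with Mixed-SCORE and further corrections.

```python
import numpy as np

def kernel_smooth(adjs, t, bandwidth):
    """Kernel-smoothed adjacency at time t: a weighted average of the T
    snapshots in adjs (shape (T, n, n)) with Gaussian kernel weights
    proportional to exp(-((s - t) / bandwidth)^2 / 2), normalized to
    sum to one."""
    T = adjs.shape[0]
    s = np.arange(T)
    w = np.exp(-0.5 * ((s - t) / bandwidth) ** 2)
    w = w / w.sum()
    return np.tensordot(w, adjs, axes=(0, 0))
```

A larger bandwidth averages more snapshots, reducing noise but blurring genuine membership changes; the phase transition described above characterizes how the optimal bandwidth balances this trade-off against T and network sparsity.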

This talk is based on my PhD work with Professor Tracy Ke at Harvard University and Professor Jiashun Jin of Carnegie Mellon University.

Speaker

Louis Cammarata, Harvard University