Imperfect data, constrained algorithms

Chair: Richard Samworth, University of Cambridge
Organizer: Richard Samworth, University of Cambridge

Thursday, Aug 7: 8:30 AM - 10:20 AM
Session 0143: Invited Paper Session
Music City Center, Room: CC-101A

Applied: No

Main Sponsor

IMS

Co-Sponsors

Section on Nonparametric Statistics
Section on Statistical Learning and Data Science

Presentations

Adaptive density estimation under low-rank constraints

In this talk, we address the challenge of bivariate probability density
estimation under low-rank constraints, for both discrete and continuous
distributions. For discrete distributions, we model the target as a
low-rank probability matrix. In the continuous case, we assume that the
density is Lipschitz continuous over an unknown compact rectangular
support and decomposes into a sum of K separable components, each a
product of two one-dimensional functions. We introduce an estimator
that leverages these low-rank constraints and achieves significantly
improved convergence rates. We also derive lower bounds for both the
discrete and continuous cases, showing that our estimators are minimax
optimal up to logarithmic factors.
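The discrete setting can be illustrated with a simple plug-in scheme: truncate the SVD of the empirical probability matrix to rank k, then project back to the probability simplex. This is only a schematic sketch of the low-rank idea, not the adaptive estimator from the talk; the function name and the clip-and-renormalise projection are illustrative choices.

```python
import numpy as np

def lowrank_pmf_estimate(counts, k):
    """Rank-k plug-in estimate of a discrete bivariate pmf.

    Illustrative sketch only: best rank-k approximation (in Frobenius
    norm) of the empirical pmf, clipped and renormalised to a pmf.
    """
    p_hat = counts / counts.sum()                  # empirical pmf
    U, s, Vt = np.linalg.svd(p_hat, full_matrices=False)
    low = (U[:, :k] * s[:k]) @ Vt[:k, :]           # truncated SVD
    low = np.clip(low, 0.0, None)                  # enforce nonnegativity
    return low / low.sum()                         # renormalise

rng = np.random.default_rng(0)
# rank-2 ground-truth pmf on a 20 x 20 grid
A = rng.random((20, 2)); B = rng.random((2, 20))
truth = A @ B; truth /= truth.sum()
counts = rng.multinomial(5000, truth.ravel()).reshape(20, 20)
est = lowrank_pmf_estimate(counts, k=2)
```

The clip-and-renormalise step is a crude simplex projection; the rank k is taken as known here, whereas the talk's estimator adapts to it.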

Keywords

Low-rank models

Density estimation

Adaptive estimation 

Speaker

Olga Klopp

WITHDRAWN Bounding the false discovery rate with pooled controls

In a multiple testing problem where the number of hypotheses being tested is large, requiring control of the false discovery rate (FDR) means that we will likely only be able to make discoveries if some p-values are extremely small. In some settings, however, only coarse p-values are available due to limited sample size or limited computational budget. By pooling data across multiple tests, we can regain power even in the presence of these constraints---but traditionally, ensuring validity of this type of procedure would require assumptions, such as a Bayesian model, to allow for sharing information across many tests. In this work we show that, with some mild modifications to the procedure, FDR control can hold without any assumptions across the tests.
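For context, the classical Benjamini-Hochberg procedure below shows why coarse p-values are a problem: a permutation p-value based on B permutations can be no smaller than 1/(B+1), which may sit above every BH threshold. This is a baseline sketch only, not the pooled-control procedure of the talk, and the simulation parameters are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Classical BH step-up procedure (shown as a baseline)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m       # BH thresholds alpha*k/m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

rng = np.random.default_rng(1)
B = 19       # permutations: attainable p-values are multiples of 1/(B+1) = 0.05
m = 1000
null_p = rng.integers(1, B + 2, size=m - 50) / (B + 1)  # coarse null p-values
signal_p = np.full(50, 1 / (B + 1))                     # signals stuck at 0.05
p = np.concatenate([signal_p, null_p])
rej = benjamini_hochberg(p, alpha=0.1)
print(rej.sum())  # 0: even p = 0.05 exceeds every reachable BH threshold
```

With m = 1000 and alpha = 0.1, rejecting at p = 0.05 would need 500 p-values at or below 0.05, far more than are available; this is the power loss that pooling data across tests aims to recover.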
 

Keywords

False discovery rate

Permutation tests 

Co-Author

Rina Barber

Density Ratio Permutation Tests with connections to distributional shifts and conditional two-sample testing

We introduce novel hypothesis tests for statistical inference on density ratios. Specifically, we propose the Density Ratio Permutation Test (DRPT) for testing $H_0: g \propto r f$ based on independent data drawn from distributions with densities $f$ and $g$, where the hypothesised density ratio $r$ is a fixed function. The proposed test employs an efficient Markov Chain Monte Carlo algorithm to draw permutations of the combined dataset according to a distribution determined by $r$, producing exchangeable versions of the whole sample and thereby establishing finite-sample validity.

Regarding the test's behaviour under the alternative hypothesis, we first show that if the test statistic is chosen as an Integral Probability Metric (IPM), the DRPT is consistent under mild assumptions on the function class that defines the IPM. We then focus on the setting where the function class is a Reproducing Kernel Hilbert Space, and introduce a generalisation of the classical Maximum Mean Discrepancy (MMD), which we term Shifted-MMD. For continuous data, assuming that a normalised version of $g - rf$ lies in a Sobolev ball, we establish the minimax optimality of the DRPT based on the Shifted-MMD.

For discrete data with finite support, we characterise the permutation sampling distribution using a noncentral hypergeometric distribution, significantly reducing computational costs. We further extend our approach to scenarios with an unknown shift factor $r$, estimating it from part of the data using Density Ratio Estimation techniques, and derive Type-I error bounds based on the estimation error. Additionally, we demonstrate how the DRPT can be adapted for conditional two-sample testing, establishing it as a versatile tool for assessing modelling assumptions on importance weights, covariate shifts and related scenarios, which arise frequently in contexts such as transfer learning and causal inference.
Finally, we validate our theoretical findings through experiments on both simulated and real-world datasets. This is joint work with Alberto Bordino (University of Warwick). 
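In the special case $r \equiv 1$, the null becomes $f = g$, all permutations are equally weighted, and uniform shuffling replaces the MCMC sampler, so the DRPT reduces to an ordinary two-sample permutation test. The sketch below illustrates that special case with a biased Gaussian-kernel MMD statistic; the bandwidth, sample sizes and statistic are illustrative choices, not the paper's Shifted-MMD.

```python
import numpy as np

def mmd2(x, y, bw=1.0):
    """Biased squared MMD between 1-d samples with a Gaussian kernel."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-(d ** 2) / (2 * bw ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def permutation_test(x, y, n_perm=500, seed=0):
    """Uniform-permutation p-value: the r = 1 special case of the DRPT."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    n = len(x)
    obs = mmd2(x, y)
    count = 1  # include the observed statistic for finite-sample validity
    for _ in range(n_perm):
        rng.shuffle(z)
        count += mmd2(z[:n], z[n:]) >= obs
    return count / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 100)
y = rng.normal(1, 1, 100)   # mean shift, so the test should reject
print(permutation_test(x, y) < 0.05)  # True
```

Counting the observed statistic among the permuted ones gives exact finite-sample validity under exchangeability, the same guarantee the DRPT extends to general $r$ via weighted permutations.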

Keywords

Efficient estimation

Semi-supervised learning

Missing data

Data fusion

ANOVA decompositions 

Speaker

Thomas Berrett, University of Warwick

Low-degree phase transitions for detecting a planted clique in sublinear time

As the scale of data grows, natural questions arise concerning the tension between computation and statistics, and one of the most striking examples of this phenomenon is the existence of a statistical-computational gap. The prototypical problem exhibiting such a gap is planted clique detection, in which we aim to detect the presence of a clique of size k in a random graph on n vertices. In this talk, we revisit the question of planted clique detection and consider finer-grained, namely sublinear, notions of computational complexity. Under this lens, we show that much more intricate statistical-computational behavior emerges, and we demonstrate sharp tradeoffs between signal strength and computational complexity. Joint work with Jay Mardia and Alex Wein.
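As a toy illustration of the detection problem (not of the sublinear-time algorithms in the talk), one can plant a k-clique in G(n, 1/2) and run a crude polynomial-time edge-count test: the clique forces all k(k-1)/2 of its pairs to be edges, about k(k-1)/4 more than expected under the null, which becomes detectable once k is on the order of sqrt(n). All parameter choices below are illustrative.

```python
import numpy as np

def planted_clique(n, k, rng):
    """Adjacency matrix of G(n, 1/2) with a planted k-clique."""
    A = rng.integers(0, 2, size=(n, n))
    A = np.triu(A, 1)
    A = A + A.T                                  # symmetric, zero diagonal
    idx = rng.choice(n, size=k, replace=False)
    A[np.ix_(idx, idx)] = 1                      # force the clique edges
    A[idx, idx] = 0                              # no self-loops
    return A

def edge_count_test(A):
    """Crude detector: z-score of the total edge count under G(n, 1/2).
    Polynomial time, reads the whole graph; not a sublinear algorithm."""
    n = A.shape[0]
    m = A.sum() / 2
    mean = n * (n - 1) / 4                       # E[edges] under the null
    sd = np.sqrt(n * (n - 1) / 8)                # Var[edges] = C(n,2)/4
    return (m - mean) / sd

rng = np.random.default_rng(3)
null_z = edge_count_test(planted_clique(400, 0, rng))   # k = 0: no clique
alt_z = edge_count_test(planted_clique(400, 80, rng))
print(null_z, alt_z)  # the planted clique visibly inflates the edge count
```

Sublinear-time algorithms, by contrast, may only probe a small fraction of the adjacency matrix, which is where the finer-grained tradeoffs of the talk arise.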

Keywords

Statistical-computational gaps

Incomplete data

Sublinear algorithms 

Speaker

Kabir Verchand, Georgia Institute of Technology