Explorations in Online Learning and Time Series

Shi Bo, Chair
Boston University
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
4090 
Contributed Papers 
Music City Center 
Room: CC-212 

Main Sponsor

IMS

Presentations

Online Tensor Learning: Computational and Statistical Trade-offs

Large tensor learning algorithms are typically computationally expensive and require storing vast amounts of data. In this paper, we propose a unified online Riemannian gradient descent (oRGrad) algorithm for tensor learning, which is computationally efficient, requires far less memory, and can handle sequentially arriving data while making timely predictions. The algorithm is applicable to both linear and generalized linear models. If the time horizon T is known, oRGrad achieves statistical optimality with an appropriately chosen fixed step size. We find that noisy tensor completion particularly benefits from online algorithms, which avoid the trimming procedure and ensure sharp entry-wise statistical error, something that is often technically challenging for offline methods. The regret of oRGrad is analyzed, revealing a fascinating trilemma among the computational convergence rate, statistical error, and regret bound. By selecting an appropriate constant step size, oRGrad achieves an O(T^{1/2}) regret. We then introduce the adaptive-oRGrad algorithm, which achieves the optimal O(log T) regret by adaptively selecting step sizes, regardless of whether the time horizon is known.
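To make the online update concrete, the minimal Python sketch below runs online gradient descent with a low-rank retraction on a matrix-valued linear model. It is an illustration under simplifying assumptions (a matrix rather than a tensor, squared loss, a truncated-SVD retraction, and a user-chosen fixed step size eta), not the oRGrad algorithm of the paper; the function names are ours.

import numpy as np

def retract_rank_r(M, r):
    # Truncated SVD as a simple retraction back onto rank-r matrices.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def online_low_rank_regression(stream, shape, r, eta):
    # One pass over a stream of (X_t, y_t) with y_t roughly <X_t, Theta_star> + noise:
    # predict, suffer the loss, take a gradient step, retract to rank r.
    Theta = np.zeros(shape)
    cum_loss = 0.0
    for X, y in stream:
        pred = np.sum(X * Theta)
        cum_loss += (pred - y) ** 2       # cumulative loss; regret compares this to the best fixed rank-r Theta
        grad = 2.0 * (pred - y) * X       # gradient of the squared loss
        Theta = retract_rank_r(Theta - eta * grad, r)
    return Theta, cum_loss

In this sketch the fixed step size eta is the knob the abstract describes: tuning it with the horizon T in mind trades off computational convergence, statistical error, and regret, and the adaptive variant in the talk removes the need to know T.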

Keywords

Tensor

High dimensional statistics

Online learning

Regret analysis 

Co-Author(s)

Dong Xia, Hong Kong University of Science and Technology
Yang Chen, University of Michigan
Jian-Feng Cai, Hong Kong University of Science and Technology

First Author

Jingyang Li

Presenting Author

Jingyang Li

Online Testing of Grouped Hypotheses

Recently, there has been growing interest in "online" testing of hypotheses, where hypotheses are generated sequentially, potentially over an infinite period. Online testing procedures make real-time decisions for each hypothesis, before future hypotheses are available, with the goal of controlling an overall error measure related to the False Discovery Rate at every decision point. We consider an online testing problem in which a group of hypotheses arrives at every time point, and such groups arrive sequentially, possibly indefinitely. Testing such grouped hypotheses involves combining online testing procedures with offline procedures that leverage the grouping structure of the hypotheses. Our proposed online grouped testing method is based on the local false discovery rate and is inspired by the online procedure of Gang, Sun and Wang (2021) and the grouped hypothesis testing procedure of Sarkar and Zhao (2022). This talk will introduce our method, discuss the role of alpha investing, a key strategy for the controlled allocation of the significance level to each test, and present theoretical guarantees of control of an overall error measure while optimizing power.
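For intuition only, here is a small Python sketch of how alpha investing can gate group-level decisions driven by estimated local false discovery rates. The wealth rule (spend half the current wealth, earn a fixed reward after any discovery) and the within-group rule (reject the largest prefix of sorted lfdr values whose running average stays below the allocated level) are illustrative choices, not the procedure of the talk; the function name and defaults are ours.

import numpy as np

def toy_grouped_alpha_investing(groups_lfdr, w0=0.05, reward=0.05):
    # groups_lfdr: a list of 1-d arrays of estimated local FDRs,
    # one array per time point (one group of hypotheses per array).
    wealth = w0
    decisions = []
    for lfdr in groups_lfdr:
        alpha_t = wealth / 2.0                   # spend half the current wealth on this group
        wealth -= alpha_t
        order = np.argsort(lfdr)
        running_avg = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
        k = int(np.sum(running_avg <= alpha_t))  # largest prefix with average lfdr <= alpha_t
        reject = np.zeros(lfdr.size, dtype=bool)
        reject[order[:k]] = True
        if k > 0:
            wealth += reward                     # earn wealth back after a discovery
        decisions.append(reject)
    return decisions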

Keywords

Multiple Hypotheses Testing

Online Hypotheses Testing

Grouped Testing

Local FDR 

First Author

Shinjini Nandi, Montana State University

Presenting Author

Shinjini Nandi, Montana State University

Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling

Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data are a nonuniform (i.e., weighted, stratified, or clustered) sample and when an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model.
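A minimal sketch of the Predict-Then-Debias idea for a population mean, with a plain i.i.d. bootstrap, may help fix ideas. The talk's contribution is precisely to go beyond this uniform-sampling, single-imputed-variable setting; the array names, resampling scheme, and defaults below are assumptions for illustration.

import numpy as np

def predict_then_debias_mean(yhat_unlabeled, yhat_labeled, y_labeled,
                             n_boot=2000, alpha=0.05, seed=None):
    # Point estimate: mean of predictions on the large sample, corrected by
    # the prediction bias measured on the small complete (labeled) sample.
    rng = np.random.default_rng(seed)
    point = yhat_unlabeled.mean() - (yhat_labeled.mean() - y_labeled.mean())
    boots = np.empty(n_boot)
    for b in range(n_boot):
        iu = rng.integers(0, yhat_unlabeled.size, yhat_unlabeled.size)
        il = rng.integers(0, y_labeled.size, y_labeled.size)
        boots[b] = (yhat_unlabeled[iu].mean()
                    - (yhat_labeled[il].mean() - y_labeled[il].mean()))
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)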

Keywords

prediction-powered inference

synthetic data

missing data

measurement error

two-phase sampling designs

bootstrap 

Co-Author(s)

Sherrie Wang, MIT
Kerri Lu, MIT
Tijana Zrnic, University of California
Stephen Bates, Stanford University

First Author

Dan Kluger, MIT

Presenting Author

Dan Kluger, MIT

Sequentializing a Test: Anytime Validity is Free

An anytime valid sequential test permits us to peek at observations as they arrive. This means we can stop, continue, or adapt the testing process based on the current data, without invalidating the inference. Given a maximum number of observations N, one may believe that this benefit must be paid for in terms of power when compared to a conventional test that waits until all N observations have arrived. Our key contribution is to show that this is false: for any valid test based on N observations, we derive an anytime valid sequential test that matches it after N observations. In addition, we show that the value of the sequential test before a rejection is attained can be directly used as a significance level for a subsequent test. We illustrate this for the z-test. There, we find that the current state of the art based on log-optimal e-values can be obtained as a special limiting case that replicates a z-test with level alpha shrinking as N grows.
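As a generic reminder of how anytime validity arises from e-values (not the construction of the talk), the sketch below tracks a likelihood-ratio e-process for a Gaussian mean; rejecting the first time it exceeds 1/alpha is valid at any stopping time by Ville's inequality. The alternative mean mu_alt and the simulated data are illustrative assumptions.

import numpy as np

def lr_e_process(x, mu_alt=0.5):
    # Running likelihood ratio of N(mu_alt, 1) against H0: N(0, 1).
    # Under H0 this product is a nonnegative martingale with mean 1,
    # so P(sup_n E_n >= 1/alpha) <= alpha by Ville's inequality.
    return np.exp(np.cumsum(mu_alt * x - 0.5 * mu_alt ** 2))

# usage: observe data one by one, stop at the first crossing of 1/alpha
rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=200)   # simulated data with a small true effect
e = lr_e_process(x)
alpha = 0.05
crossings = np.nonzero(e >= 1 / alpha)[0]
print("first rejection at n =", crossings[0] + 1 if crossings.size else "never")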

Keywords

Hypothesis Testing

Sequential Testing

E-value

Anytime valid

Optional stopping 

Co-Author

Sam van Meer, Erasmus University Rotterdam

First Author

Nick Koning

Presenting Author

Nick Koning

Single-Cell RNA Sequencing Data in Forensic Science

While DNA remains the cornerstone of forensic science, RNA offers significant potential for additional applications. Single-cell RNA sequencing provides exceptional resolution to address complicated mixed samples, but it suffers from sparsity due to both biological zeros and technical dropouts. We begin by discussing the role of RNA in forensic science, then investigate imputation methods to recover missing gene expression values and enhance the reliability of RNA evidence. Focusing on different categories of imputation, we discuss the advantages and limitations of each. Simulation studies demonstrate the potential of these techniques to improve data quality, ultimately paving the way for more robust RNA-based forensic analyses. 
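As one concrete example of the low-rank style of imputation the talk compares, the Python sketch below iteratively fills zero entries of a cells-by-genes count matrix with a truncated-SVD fit. The rank, iteration count, and the decision to treat every zero as missing are simplifying assumptions for illustration, not recommendations from the talk; the function name is ours.

import numpy as np

def svd_impute(counts, rank=10, n_iter=50):
    # Toy low-rank imputation of a cells-by-genes count matrix.
    # Zeros are treated as missing here, even though some are biological;
    # distinguishing the two cases is part of what the talk discusses.
    X = np.log1p(counts.astype(float))
    observed = counts > 0
    filled = np.where(observed, X, X[observed].mean())
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        filled = np.where(observed, X, low_rank)   # never overwrite observed values
    return np.expm1(filled)                        # return to the count scale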

Keywords

Missing data

Matrix completion

Data imputation

Clustering

Dimension reduction

Single-cell RNA sequencing 

Co-Author

Giuseppe Vinci, University of Notre Dame

First Author

Xiangyu Xu, University of Notre Dame

Presenting Author

Xiangyu Xu, University of Notre Dame

Vertex Alignment and Localizing First-order Changepoints in Time Series of Graphs

We consider localization of changepoints in a time series of networks. Existing methodologies rely on correctly specified vertex alignment between networks across time. We consider the impact of vertex misalignment on inference for dynamic networks, and describe two models for network evolution as illustrative cases: one in which vertex misalignment is comparatively inconsequential, and another in which it renders localization effectively impossible. We characterize when changepoints in network evolutionary processes can be successfully localized without alignment and prove an identifiability theorem on when certain changepoints cannot be localized at all. We also describe how procedures such as graph matching and optimal transport can be used to mitigate error from misalignment in some cases, and provide simulations and real data analyses demonstrating their efficacy.
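To illustrate the kind of statistic such localization can build on, the sketch below embeds each network spectrally and records Procrustes-aligned distances between consecutive embeddings; a spike suggests a candidate changepoint. It presumes symmetric adjacency matrices and, crucially, correctly aligned vertices, which is exactly the assumption whose failure the talk studies; the function names and embedding dimension are illustrative.

import numpy as np

def spectral_embed(A, d=2):
    # Adjacency spectral embedding: top-|eigenvalue| eigenvectors, scaled.
    vals, vecs = np.linalg.eigh(A)                 # assumes a symmetric adjacency matrix
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def consecutive_distances(graphs, d=2):
    # Procrustes-aligned distances between embeddings of consecutive networks;
    # large values flag candidate changepoints. Vertex labels are assumed to
    # match across time, which is the alignment assumption under study.
    embeds = [spectral_embed(A, d) for A in graphs]
    dists = []
    for X, Y in zip(embeds[:-1], embeds[1:]):
        U, _, Vt = np.linalg.svd(X.T @ Y)          # orthogonal Procrustes rotation
        dists.append(np.linalg.norm(X @ (U @ Vt) - Y))
    return np.array(dists)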

Keywords

Time series of networks

changepoint localization

Euclidean mirror 

Co-Author(s)

Zachary Lubberts, University of Virginia
Avanti Athreya, Johns Hopkins University
Youngser Park, Johns Hopkins University
Carey Priebe, Johns Hopkins University

First Author

Tianyi Chen, Johns Hopkins University

Presenting Author

Tianyi Chen, Johns Hopkins University