Thursday, Aug 7: 10:30 AM - 12:20 PM
0322
Invited Paper Session
Music City Center
Room: CC-210
metric-space-valued data
statistical inferences
shape data
functional data analysis
Applied
No
Main Sponsor
IMS
Co Sponsors
Section on Nonparametric Statistics
Section on Statistical Learning and Data Science
Presentations
We introduce a new framework to analyze shape descriptors that capture
the geometric features of an ensemble of point clouds. At the core of
our approach is the point of view that the data arises as sampled
recordings from a metric space-valued stochastic process, possibly of
nonstationary nature, thereby integrating geometric data analysis into
the realm of functional time series analysis. Our framework allows
for natural incorporation of spatial-temporal dynamics, heterogeneous
sampling, and the study of convergence rates. Further, we derive
complete invariants for classes of metric space-valued stochastic
processes in the spirit of Gromov, and relate these invariants to
so-called ball volume processes. Under mild dependence conditions, a
weak invariance principle in $D([0,1]\times [0,\mathscr{R}])$ is
established for sequential empirical versions of the latter, assuming
the probabilistic structure possibly changes over time. Finally, we
use this result to introduce novel test statistics for topological
change, which are distribution-free in the limit under the hypothesis
of stationarity. We explore these test statistics on time series of
single-cell mRNA expression data, using shape descriptors coming from
topological data analysis.
Keywords
topological data analysis
functional data analysis
persistent homology
locally stationary processes
U-statistics
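The ball volume processes mentioned in the abstract above can be illustrated for a single point cloud. The sketch below is only an assumption about the basic building block: it computes the fraction of points of a finite cloud that fall in a closed ball of radius r around a chosen center, evaluated over a grid of radii. The function names `ball_volume` and `ball_volume_curve` are hypothetical, and the sequential, time-indexed versions studied in the talk are not reproduced here.

```python
import math


def ball_volume(points, center, radius, dist=math.dist):
    """Fraction of `points` within the closed ball of `radius` around `center`."""
    return sum(dist(center, p) <= radius for p in points) / len(points)


def ball_volume_curve(points, center, radii, dist=math.dist):
    """Empirical ball volume as a function of the radius (hypothetical helper)."""
    return [ball_volume(points, center, r, dist) for r in radii]


# Toy point cloud in the plane; any metric can be passed through `dist`.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
curve = ball_volume_curve(cloud, (0.0, 0.0), [0.5, 1.0, 6.0])
print(curve)
```

Because only pairwise distances enter, the same function applies to any metric space once a suitable `dist` is supplied.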
We introduce a powerful scan statistic and a corresponding test for detecting the presence and pinpointing the location of a change point in the distribution of a data sequence whose elements reside in a separable metric space (Ω, d). These change points mark abrupt shifts in the distribution of the data sequence, as characterized using distance profiles, where the distance profile of an element ω ∈ Ω is the distribution of distances from ω as dictated by the data. This approach is tuning-parameter-free, fully nonparametric, and universally applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. We obtain an explicit characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under fixed and local alternatives, and near-optimal convergence of the estimated change point location, all under practicable settings. To compare with state-of-the-art methods, we conduct simulations covering multivariate data, bivariate distributional data, and sequences of graph Laplacians, and we illustrate our method on real data sequences of U.S. electricity generation compositions and Bluetooth proximity networks.
Keywords
change point detection test
distance profiles
scan statistics
random objects
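The notion of a distance profile used in the change point abstract above can be made concrete with a short sketch. The code below computes the empirical distance profile of an element (the empirical CDF of its distances to a sample) and then runs a deliberately naive scan over candidate split points. The contrast used here, a mean maximal gap between before and after profiles on a radius grid, is a simplified stand-in chosen for illustration and is not the authors' test statistic; the names `distance_profile`, `profile_gap`, and `naive_scan` are hypothetical.

```python
from bisect import bisect_right


def distance_profile(omega, sample, dist):
    """Empirical CDF of the distances from omega to the elements of sample."""
    ds = sorted(dist(omega, x) for x in sample)
    return lambda t: bisect_right(ds, t) / len(ds)


def profile_gap(omega, left, right, dist, grid):
    """Largest gap on the grid between omega's profiles w.r.t. two segments."""
    f_left = distance_profile(omega, left, dist)
    f_right = distance_profile(omega, right, dist)
    return max(abs(f_left(t) - f_right(t)) for t in grid)


def naive_scan(seq, dist, grid, min_seg=2):
    """Return (split index, contrast) maximizing the mean profile gap.

    Illustrative assumption only: not the scan statistic from the talk.
    """
    best_k, best_val = None, -1.0
    for k in range(min_seg, len(seq) - min_seg + 1):
        left, right = seq[:k], seq[k:]
        val = sum(profile_gap(w, left, right, dist, grid) for w in seq) / len(seq)
        if val > best_val:
            best_k, best_val = k, val
    return best_k, best_val


# Toy real-valued sequence whose level shifts after the fifth element.
seq = [0.0, 0.1, 0.2, 0.1, 0.0, 5.0, 5.1, 5.2, 5.1, 5.0]
grid = [0.5 * i for i in range(13)]
best_k, best_val = naive_scan(seq, lambda a, b: abs(a - b), grid)
print(best_k, best_val)
```

On this toy sequence the scan selects the split after the fifth element. Only the function `dist` depends on the data type, which is the sense in which a profile-based approach carries over to distributions or networks.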
An inferential toolkit for analyzing object-valued responses, i.e., data situated in general metric spaces, paired with Euclidean predictors, is of interest for many statistical applications. We develop a conformal approach that utilizes conditional optimal transport costs for distance profiles. Distance profiles correspond to one-dimensional distributions of probability mass falling into balls of increasing radius. The average cost of transporting a given distance profile to all other distance profiles is the basis for the proposed conditional profile scores. The distribution of conditional profile average transport costs serves as a conformity score for general metric space-valued responses, facilitating the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity for the resulting prediction sets. The utility of the proposed conditional profile score is demonstrated through its finite sample performance in various metric spaces, with network data from New York taxi trips and compositional data on energy sourcing of U.S. states. This talk is based on joint work with Hang Zhou, UC Davis.
Keywords
Conformity Score
Metric Statistics
Random Objects
Distance Profiles
Optimal Transport
Networks
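The conformal abstract above combines two ingredients: a transport cost between one-dimensional distance profiles and the split conformal quantile step. The sketch below shows both in a simplified, unconditional form. It is an assumption-laden illustration: conditioning on Euclidean predictors is omitted entirely, the score is used as a nonconformity-style score (larger means more atypical), and all function names are hypothetical rather than taken from the talk.

```python
import math


def profile(omega, sample, dist):
    """Sorted distances from omega to the sample (an empirical distance profile)."""
    return sorted(dist(omega, x) for x in sample)


def transport_cost(p, q):
    """1-Wasserstein cost between two equal-size empirical distributions on the line."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def average_transport_cost(omega, sample, dist):
    """Mean cost of moving omega's profile onto the profile of each sample element."""
    p = profile(omega, sample, dist)
    costs = [transport_cost(p, profile(x, sample, dist)) for x in sample]
    return sum(costs) / len(costs)


def split_conformal_threshold(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(scores)[k - 1]


def abs_dist(a, b):
    return abs(a - b)


# Toy real-valued data split into a training part and a calibration part.
train = [0.0, 1.0, 2.0, 3.0, 4.0]
calibration = [0.5, 1.5, 2.5, 3.5]
scores = [average_transport_cost(y, train, abs_dist) for y in calibration]
threshold = split_conformal_threshold(scores, alpha=0.5)


def in_prediction_set(candidate):
    """A candidate is kept when its score does not exceed the threshold."""
    return average_transport_cost(candidate, train, abs_dist) <= threshold
```

A candidate far from the bulk of the training data, such as 10.0 here, has a much larger average transport cost than a central one and is excluded from the prediction set.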
Statistical optimal transport has emerged as a topic for the analysis of complex and geometric data. A fundamental assumption is that data are drawn randomly according to probability distributions. This, however, is often challenged in applications where modifications of optimal transport (unbalanced optimal transport, UOT) are successfully applied to situations in which the underlying data do not come from a probability measure. This hinders a statistical analysis due to the lack of a valid random mechanism. In this talk we provide several statistical models where UOT becomes meaningful and develop a first statistical theory for it. Specifically, we analyze extensions of the Kantorovich-Rubinstein (KR) transport for finitely supported measures. The KR transport depends on a penalty which serves as a relaxation from finding true couplings between the marginal measures. The main result is a non-asymptotic bound on the expected error for the empirical KR distance as well as for its barycenters. Depending on the penalty we find phase transitions, in analogy to the unbalanced case. Our approach justifies simple randomized computational schemes for UOT, which can be used for fast approximate computations in combination with any exact solver. Using synthetic and real datasets, we analyze the empirical UOT in simulation studies and investigate the validity of our theoretical bounds. Finally, UOT-based inference is applied to protein colocalization in cell biology.
Keywords
Kantorovich-Rubinstein transport
protein colocalization
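The role of the penalty in the KR transport described in the abstract above can be seen in a two-point example. The display below is a sketch under an assumption: the abstract does not state the definition, so one common penalized formulation for finitely supported measures is used, in which $\Pi_{\le}(\mu,\nu)$ denotes nonnegative plans whose marginals are dominated by $\mu$ and $\nu$, $M(\cdot)$ denotes total mass, $C > 0$ is the penalty, and $p \ge 1$. The symbols are illustrative and may differ from those used in the talk.

```latex
% Assumed formulation (not stated in the abstract).
\[
\mathrm{KR}_{p,C}^{p}(\mu,\nu)
  = \min_{\pi \in \Pi_{\le}(\mu,\nu)}
    \sum_{x,y} d^{p}(x,y)\,\pi(x,y)
    + C^{p}\Bigl(\tfrac{M(\mu)+M(\nu)}{2} - M(\pi)\Bigr)
\]
% Worked case: \mu = a\,\delta_x and \nu = b\,\delta_y, so that
% \pi = t\,\delta_{(x,y)} with 0 \le t \le \min(a,b). The objective
% t\,(d^{p}(x,y) - C^{p}) + C^{p}(a+b)/2 is linear in t, which gives
\[
\mathrm{KR}_{p,C}^{p}(a\,\delta_x, b\,\delta_y)
  = \min(a,b)\,\min\bigl(d^{p}(x,y),\, C^{p}\bigr)
    + C^{p}\,\frac{|a-b|}{2}
\]
```

Under this assumed formulation, mass is transported only when $d(x,y) < C$; otherwise it is cheaper to pay the penalty, and any excess mass $|a-b|$ is always charged at rate $C^{p}/2$.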