Analyzing Complex Data in Non-Euclidean Spaces: Networks and Beyond

Weijing Tang Chair
Carnegie Mellon University
 
Satarupa Bhattacharjee Organizer
University of Florida
 
Cornelius Fritz Organizer
Trinity College Dublin
 
Tuesday, Aug 5: 8:30 AM - 10:20 AM
0439 
Invited Paper Session 
Music City Center 
Room: CC-208A 

Applied

No

Main Sponsor

Section on Nonparametric Statistics

Co Sponsors

IMS
Social Statistics Section

Presentations

Autoregressive Networks with Dependent Edges

We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models that accommodate, for example, transitivity, density dependence, and other stylized features often observed in real network data. By assuming that the edges of the network at each time are conditionally independent given their lagged values, the models, which exhibit a close connection with temporal ERGMs, facilitate both simulation and maximum likelihood estimation in a straightforward manner. Because the models may contain a large number of parameters, the initial MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed, based on an iteration that uses a projection to mitigate the impact of the other parameters. Based on a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the stationarity assumption. The limiting distribution is not normal in general, and it reduces to normal when the underlying process satisfies some mixing conditions. The approach is illustrated with a transitivity model, both in simulation and on a real network data set. 
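
As a rough illustration of the conditional-independence structure described in this abstract, the following Python sketch simulates a dynamic network in which every edge at time t is drawn independently given the network at time t-1. The transition probabilities used here (a logistic formation probability that increases with the number of common neighbours, and a constant dissolution probability) and the parameter names `a0`, `a1`, and `b0` are assumptions chosen for illustration; they are not taken from the talk.

```python
import math
import random

def common_neighbours(adj, i, j):
    """Count nodes linked to both i and j in the adjacency matrix adj."""
    n = len(adj)
    return sum(1 for k in range(n) if k != i and k != j and adj[i][k] and adj[j][k])

def step(adj, a0, a1, b0, rng):
    """One transition: edges are drawn independently given the previous network."""
    n = len(adj)
    new = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] == 0:
                # Hypothetical transitivity effect: more common neighbours at
                # time t-1 make it more likely that the edge forms at time t.
                score = a0 + a1 * common_neighbours(adj, i, j)
                p_form = 1.0 / (1.0 + math.exp(-score))
                x = 1 if rng.random() < p_form else 0
            else:
                # An existing edge dissolves with constant probability b0.
                x = 0 if rng.random() < b0 else 1
            new[i][j] = new[j][i] = x
    return new

def simulate(n, T, a0=-3.0, a1=1.0, b0=0.2, seed=0):
    """Return the list of T + 1 adjacency matrices, starting from an empty graph."""
    rng = random.Random(seed)
    adj = [[0] * n for _ in range(n)]
    path = [adj]
    for _ in range(T):
        adj = step(adj, a0, a1, b0, rng)
        path.append(adj)
    return path

path = simulate(n=6, T=5)
```

Because each edge depends on the past only through the previous network, the likelihood of such a model factorizes over edges and time points, which is what makes both simulation and likelihood evaluation simple in this toy setting.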

Keywords

conditional independence

dynamic networks

maximum likelihood estimation

stylized features of network data

transitivity 

Co-Author(s)

Jinyuan Chang, Southwestern University of Finance and Economics
Qin Fang
Peter MacDonald, University of Waterloo
Qiwei Yao, London School of Economics
Eric Kolaczyk, McGill University

Speaker

Peter MacDonald, University of Waterloo

Robust clustering and testing for large complex networks using rank statistics

This talk presents new methods and theory for robust spectral clustering and hypothesis testing in large edge-weighted random graphs using rank statistics. The proposed approach brings together contemporary developments in spectral methods and classical developments in the theory of rank-based nonparametric tests. Unlike non-robust approaches, our methodology remains effective in the presence of outliers, heavy-tailed distributions, and heterogeneous noise variances. Applications to human connectome data are provided and suggest directions for future work. 

Keywords

Clustering

Robust statistics

Network analysis

Random graph

Nonparametric statistics 

Speaker

Joshua Cape, University of Wisconsin-Madison

Nonlinear PCA: Estimation of Algebraic Varieties

An algebraic variety is defined as the set of solutions of a system of polynomial equations over the reals. In this paper we consider the goal of recovering an unknown algebraic variety from noisy measurements of latent variables lying on the algebraic variety. Note that estimation of an algebraic variety, which generalizes the notion of solutions to a linear system of equations, generalizes the concept of principal component analysis (PCA) to nonlinear structures (i.e., solutions of a system of polynomial equations). Our estimation strategy proceeds in three steps: (i) construction of the moment matrix from the Vandermonde matrix associated with the data set and the degree of the fitted polynomial, (ii) debiasing of the moment matrix, and (iii) eigenvalue decomposition of the moment matrix to recover the underlying algebraic variety. We present theoretical recovery guarantees for the underlying algebraic variety. We illustrate the power and usefulness of our methodology via simulation and real data examples.
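
The moment-matrix idea in this abstract can be sketched on a toy example: points near the unit circle, fitted with a polynomial of degree 2. The Python code below (using NumPy) covers steps (i) and (iii) only. The debiasing step (ii) is deliberately omitted, because the abstract does not specify it; the sketch instead assumes a low noise level so that the uncorrected moment matrix is adequate. The function names `monomials_deg2` and `fit_variety` are hypothetical.

```python
import numpy as np

def monomials_deg2(points):
    """Vandermonde-type matrix with columns 1, x, y, x^2, xy, y^2."""
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

def fit_variety(points):
    """Coefficients of a degree-2 polynomial that nearly vanishes on the data."""
    V = monomials_deg2(points)
    M = V.T @ V / len(points)      # (i) empirical moment matrix
    # (ii) debiasing is omitted in this sketch (low-noise assumption)
    _, eigvecs = np.linalg.eigh(M)  # (iii) eigenvalues returned in ascending order
    return eigvecs[:, 0]            # eigenvector of the smallest eigenvalue

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
noisy = circle + 0.01 * rng.normal(size=circle.shape)

coef = fit_variety(noisy)
coef = coef / coef[3]  # normalise so that the x^2 coefficient equals 1
```

At this noise level the normalised coefficients are expected to be close to those of x^2 + y^2 - 1, the polynomial whose zero set is the unit circle. With larger noise the smallest eigenvector of the raw moment matrix becomes biased, which is the motivation for a debiasing step.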
 

Keywords

debiasing, moment matrix, singular value decomposition 

Speaker

Bodhisattva Sen, Columbia University

Nonparametric Data Analysis on Stratified Spaces

This talk is part of joint work with Robert L. Paige, Mihaela Pricop Jeckstadt and their collaborators. The primary goal of our presentation is to disseminate results from the books (i) NonpArametric Statistics on Stratified Spaces and their Applications (NASSSA), coauthored with Daniel E. Osborne, and (ii) Geometric-Topological Statistical Methods for the Analysis of Image Data with Applications (GETOSMAIDA), coauthored with Robert L. Paige. Part of the collaborative work presented addresses aspects of Optimal Transport on object spaces.
NASSSA is organized as follows. The first section of Part I is dedicated to key examples of complex data from which one extracts data representable as points on a stratified space, and to a review of data analysis on manifolds. A separate chapter here is dedicated to a summary of results on nonparametric methods on manifolds. In Part II we address some key results on asymptotics and the nonparametric bootstrap on some stratified spaces, where we feature certain object spaces with a manifold stratification arising in Statistics, and analyze data on them. In Part III, we apply this methodology to concrete examples of Object Data Analysis (ODA). Part IV consists of three more applied sections: one on MANOVA on stratified spaces, one on an application to linguistics, and the last on extrinsic PCA and other future topics of data analysis on stratified spaces.
Chapter 1 provides an overview of object data on stratified spaces extracted from various sources, presenting a number of examples of such data in practical applications. The key types of data presented here are RNA phylogenetic trees, with emphasis on the SARS-CoV-2 virus, and protein data. Another important data example is that of planar graphs. Magnetic Resonance Angiography (MRA) data are also included here, with their important ramifications for 3D image reconstruction of brain arteries. Alphabet-based tree data, used in the last chapter, are also introduced. Last but not least, digital camera face imaging data, used in 3D projective shape data analysis, are presented here.
Chapter 2 is dedicated to a review of nonparametric statistical methods on manifolds, that is, nonparametrics for smooth stratified spaces.

Chapter 3 introduces the notion of a stratified space: a metric space that admits a certain dimension-decreasing filtration by manifolds glued to each other, such that the boundary of each manifold in this filtration is a union of lower-dimensional manifolds. The median and the mean of the probability measure associated with a random point on a Riemannian manifold (M, ρ) were introduced in the case of an empirical distribution by Cartan (1928) and, in the general case of a population, by Fréchet (1948). Statistics on stratified spaces are also defined here, and the asymptotic behavior of the sample intrinsic mean on the simplest nontrivial type of stratified space is described.
Chapter 4 is dedicated to an analysis of extrinsic Fréchet mean sets, for particular stratified spaces.

Following Fréchet's original ideas, in Chapter 5 we consider a probability distribution on an open book and define the concept of a sticky intrinsic mean. This new phenomenon is quantified by an LLN stating that the empirical mean of a random object with a sticky mean eventually almost surely lies on the spine of the open book. A CLT stating that the limiting distribution is Gaussian and supported on the spine is also given here, as well as versions of the LLN and CLT for the cases where the intrinsic mean is nonsticky or partly sticky.

In Chapter 6, we consider a connected graph G with a distance function d, so that each pair of points x, y can be connected by a geodesic whose length is exactly d(x,y). Given a probability distribution Q on G associated with the random object X, we are interested in the Fréchet function F: G → [0,∞), where F(x) is the expected value of the squared distance d(X,x)^2. Here we suppose that the Fréchet function attains its minimum at a unique point µ ∈ G and, under the additional assumption that a small neighborhood of the cut locus of this point has Q-measure zero, we derive a CLT for i.i.d. random objects from Q, including cases of stickiness. This central limit theorem for random samples on a graph builds on the results of the previous chapter.
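
The Fréchet function described for Chapter 6 can be illustrated with a small Python sketch. It makes two simplifying assumptions that are not in the abstract: the random object takes values only at the vertices of a weighted graph, and the minimiser is searched over vertices only (points in the interior of edges are ignored). Geodesic distances are computed with the Floyd-Warshall algorithm, and the sample Fréchet function is the average of squared distances to the observations. All function names are hypothetical.

```python
import math

def shortest_paths(n, edges):
    """All-pairs geodesic distances; edges are (u, v, length) triples."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    # Floyd-Warshall relaxation over intermediate vertices k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def sample_frechet_function(d, sample):
    """F_n(x) = average of d(X_i, x)^2 over the sample, for every vertex x."""
    return [sum(d[x][s] ** 2 for s in sample) / len(sample) for x in range(len(d))]

def sample_frechet_mean(d, sample):
    """Vertices that minimise the sample Frechet function (may be several)."""
    F = sample_frechet_function(d, sample)
    best = min(F)
    return [x for x, fx in enumerate(F) if math.isclose(fx, best)]

# Star graph: vertex 0 is the centre, vertices 1, 2, 3 are leaves at distance 1.
star = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)]
d = shortest_paths(4, star)
sample = [1, 2, 3, 1, 2, 3]  # observations spread evenly over the leaves
F = sample_frechet_function(d, sample)
mean_set = sample_frechet_mean(d, sample)
```

In this example the centre vertex has F_n(0) = 1, while each leaf has a larger value, so the vertex-restricted sample Fréchet mean is the centre of the star.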

Chapter 7 is dedicated to an analysis of brain artery trees. Such trees do not have a natural common set of leaves, so they are matched using cortical correspondence: landmarks are placed on the cortical surface based on anatomical shape and spatial location, and each landmark is projected from the cortical surface to the closest point on the brain artery tree. All trees in the sample then have the same set of labeled leaves, which makes it possible to represent these artery trees as points in a space of phylogenetic trees.
This representation of MRA images of brain artery trees as points in a phylogenetic tree space enables the use of tree space geodesics to quantify and visualize their differences, and of a notion of center called the Fréchet mean.
High-dimensional structure in the data is explored using multidimensional scaling, minimum spanning trees based on geodesic distances, and tree space triangles. The effect of gender and age on the brain artery system is studied, noting that the distances from the cortical surface to the closest brain arteries increase with age and tend to be smaller in females than in males.

Chapter 8 is dedicated to a CLT on stratified spaces, with an application to the analysis of SARS-CoV-2 phylogenetic trees. Note that such trees are built from RNA sequences aligned with Clustal Omega, a computer program used for multiple sequence alignment in bioinformatics.

In Chapter 9, we consider CLTs on certain stratified spaces in dimensions one and two, and on open books.

In Chapter 10 we investigate the critical question of two possible origins of the COVID-19 pandemic, using functional data on rooted RNA-based phylogenetic trees, regarded as points on open books.
Chapter 11 addresses a new theme: comparing human interaction in writing based on an alphabet. Here we focus on Indo-European languages, developing a historical perspective on the genesis of West European languages via an alphabet-based clustering of these languages. Three-leaf trees are built using a single-linkage clustering method based on distances between samples from languages that use the Latin script. Taking three languages at a time, the mean is determined. If the mean is non-sticky, then one of the languages may come from a different ancestor than the other two. If the mean is sticky, then the languages may share a common ancestor, or all of them may have different ancestry.

Chapter 12 addresses the problem of MANOVA on smooth stratified spaces, with an application to face analysis based on 3D projective shapes of facial configurations extracted from digital camera images.

In Chapter 13 we introduce additional topics that will be explored in the future, including extrinsic PCA on manifolds and network analysis.

GETOSMAIDA is primarily focused on introducing various types of shapes as points on object spaces with a manifold structure, with the extraction of shape data from image data and their analysis. Topological Data Analysis is one of the aspects featured in this part of the talk.

Paige and Patrangenaru thank the National Science Foundation for awards NSF-DMS:2311058 and NSF-DMS:2311059, respectively. Pricop Jeckstadt acknowledges support from M-ERA Net Project SMILE, Grant number 315/2022.
She also thanks the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme "Discretization and recovery in high-dimensional spaces", where part of work for this talk was undertaken; her work was partially supported by EPSRC grant EP/R014604/1.

SELECTED REFERENCES.

1. Élie Cartan (1928). Leçons sur la Géométrie des Espaces de Riemann (in French), Gauthier-Villars, Paris, France.

2. Maurice Fréchet (1948). Les éléments aléatoires de nature quelconque dans un espace distancié (in French). Ann. Inst. H. Poincaré, 10, 215-310.

Keywords

statistics on stratified spaces

statistical image analysis

medical imaging

sticky CLT

Latin alphabet based language clustering

extrinsic PCA 

Speaker

Victor Patrangenaru, Florida State University