Statistical methods for network data

Peter MacDonald, Chair
University of Waterloo
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
4198 
Contributed Papers 
Music City Center 
Room: CC-202C 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

Community Detection for Signed Networks

Community detection, the task of discovering the underlying communities within a network from observed connections, is a fundamental problem in network analysis that has been studied extensively across many domains. In signed networks, both the connections and their signs play a crucial role in community identification. In particular, the empirical support for balance theory in real-world signed networks makes it a compelling property for this purpose. In this work, we propose a novel balanced stochastic block model, which has a hierarchical community structure induced by balance theory. We also develop a fast maximum pseudo-likelihood estimation approach for community detection that achieves exact recovery. Our proposed method is used to detect meaningful node clusters for downstream applications. 
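
As a small illustration of the balance property mentioned above (not the authors' balanced stochastic block model or their estimator), the following sketch counts the fraction of triangles in a signed network whose edge signs multiply to +1, which is the usual notion of a balanced triangle. The input format and the toy network are assumptions made for this example.

```python
from itertools import combinations

def balanced_triangle_fraction(signed_edges):
    """Fraction of triangles whose edge signs multiply to +1.

    signed_edges: dict mapping frozenset({u, v}) to +1 or -1
    (a hypothetical input format chosen for this sketch).
    """
    nodes = sorted({u for edge in signed_edges for u in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        sides = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if all(s in signed_edges for s in sides):
            total += 1
            product = 1
            for s in sides:
                product *= signed_edges[s]
            balanced += product > 0
    return balanced / total if total else float("nan")

# Toy network: two friendly groups {0, 1} and {2, 3} with hostile ties between them.
edges = {
    frozenset((0, 1)): +1,
    frozenset((2, 3)): +1,
    frozenset((0, 2)): -1,
    frozenset((0, 3)): -1,
    frozenset((1, 2)): -1,
    frozenset((1, 3)): -1,
}
fraction = balanced_triangle_fraction(edges)
```

In this toy network every triangle contains exactly two negative edges, so all triangles are balanced; flipping a single cross-group sign breaks balance in the triangles that contain that edge.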

Keywords

community detection

signed networks

stochastic block model 

Co-Author(s)

Weijing Tang, Carnegie Mellon University
Ji Zhu, University of Michigan

First Author

Yichao Chen, University of Michigan

Presenting Author

Yichao Chen, University of Michigan

Embedding Network Autoregression for time series analysis and causal peer effect inference

We propose an Embedding Network Autoregressive Model (ENAR) for multivariate networked longitudinal data. We assume the network is generated from a latent variable model, and these unobserved variables are included in a structural peer effect model or a time series network autoregressive model as additive effects. This approach takes a unified view of two related yet fundamentally different problems: (1) modeling and predicting multivariate networked time series data and (2) causal peer influence estimation in the presence of homophily from finite time longitudinal data. We show that the estimated momentum and peer effect parameters are consistent and asymptotically normal as the number of network vertices N grows, both when the number of time points T also grows (the time series case) and when T is finite (the peer effect case). Our theoretical results cover networks modeled with the random dot product graph (RDPG) model as well as a more general latent space model. We also develop a selection criterion for the case where the number of latent variables K is unknown; the criterion provably does not under-select, and we show that the theoretical guarantees continue to hold with the selected K. 

Keywords

Network peer effect

Network time series

Social influence

Latent homophily

Network embedding

Social network 

First Author

Jae Ho Chang, The Ohio State University

Presenting Author

Subhadeep Paul, The Ohio State University

Minority Representation in Network Rankings: Methods for Estimation, Testing, and Fairness

Networks, composed of nodes and connections, are widely used to model relationships across various fields. Centrality metrics, vital for assessing node importance, inform decisions such as identifying key nodes or prioritizing resources. However, networks often suffer from noise, such as missing or incorrect edges, which can distort centrality-based decisions. Specifically, if edge noise is driven by label information, it can lead to unfair decision-making, distorting the representation of certain groups, such as minorities. To address this, we focus on networks with label information and introduce a formal definition of minority representation, defined as the proportion of minority nodes among the top-ranked nodes. We model systematic bias using label-related missing edge errors. We propose methods to estimate and test bias parameters under various noisy scenarios. Asymptotic limits of minority representation statistics are derived under specific network models and used to uncover de-biased representations. Simulation results demonstrate the effectiveness of our estimation, testing, and correction procedures. We apply our methods to a contact network, showcasing applicability. 
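
The definition of minority representation given above (the proportion of minority nodes among the top-ranked nodes) can be sketched in a few lines. This is only an illustration: degree is used as the centrality metric, ties are broken by node id, and the toy graph and edge removals are assumptions, not the authors' data or bias model.

```python
def minority_representation(edges, is_minority, top_k):
    """Share of minority nodes among the top_k nodes ranked by degree."""
    degree = {node: 0 for node in is_minority}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Rank by degree (descending); break ties by node id for reproducibility.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    top = ranked[:top_k]
    return sum(is_minority[node] for node in top) / len(top)

# Hypothetical labels: nodes 3 and 4 belong to the minority group.
is_minority = {0: False, 1: False, 2: False, 3: True, 4: True}
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (1, 3)]
full = minority_representation(edges, is_minority, top_k=3)

# Remove the edges linking minority node 3 to majority nodes,
# mimicking label-related missing edges.
observed = [(0, 1), (0, 2), (1, 2), (3, 4)]
noisy = minority_representation(observed, is_minority, top_k=3)
```

In the full graph one of the top three nodes is a minority node, while in the graph with label-related missing edges none is, which is the kind of distortion the abstract describes.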

Keywords

Contact networks

Graphon model

Noisy networks

Stochastic block model

Systematic bias 

Co-Author(s)

Peter MacDonald, University of Waterloo
Eric Kolaczyk, McGill University

First Author

Hui Shen, McGill, Statistics

Presenting Author

Hui Shen, McGill, Statistics

Network Goodness-of-Fit for the block-model family

The block-model family includes four popular network models: SBM, DCBM, MMSBM, and DCMM. To evaluate how well these four models fit real networks, we propose GoF-MSCORE as a new Goodness-of-Fit metric for DCMM, based on two main ideas. The first is to use cycle count statistics as a general framework for GoF. The second is a novel network fitting scheme. Extending GoF-MSCORE to SBM, DCBM, and MMSBM results in a series of GoF metrics covering each of the four models in the block-model family. We show that for the four models, if the assumed model is correct, then as the network size diverges, the corresponding GoF metric converges to N(0,1), a parameter-free null limiting distribution. We also analyze the power of these metrics and demonstrate that they are optimal in many settings. For 12 frequently used real networks, we apply the proposed GoF metrics and find that DCMM fits well with almost all of them, whereas SBM, DCBM, and MMSBM fail to fit many of these networks, particularly when the networks are relatively large. We also show that DCMM is nearly as broad as the rank-K network model. Based on these results, we recommend DCMM as a promising model for undirected networks. 
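
For readers unfamiliar with cycle counts, the simplest case is the number of 3-cycles (triangles), which for a simple undirected graph with adjacency matrix A equals trace(A^3) / 6. The sketch below computes only this raw count; it is not GoF-MSCORE, whose fitting scheme and normalization are not specified in the abstract.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangle_count(adj):
    """Number of 3-cycles in a simple undirected graph: trace(A^3) / 6."""
    cube = matmul(matmul(adj, adj), adj)
    return sum(cube[i][i] for i in range(len(adj))) // 6

# Complete graph on 4 nodes: every triple of nodes forms a triangle.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
# Path 0-1-2-3 contains no cycles at all.
path = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
```

Each triangle is counted six times in the trace (three starting nodes, two directions), which is why the trace is divided by 6.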

Keywords

Network analysis

Goodness-of-Fit

Block model

Community detection

Mixed membership

Cycle-Count statistics 

Co-Author(s)

Tracy Ke, Harvard University
Jingming Wang

First Author

Jiashun Jin, Carnegie Mellon University

Presenting Author

Jiajun Tang

Network Inference for non-Gaussian Data

Networks provide a powerful framework for capturing complex interactions among variables and analyzing their unified functions. In this study, we propose a novel method for inferring undirected networks in parametric models, specifically designed for non-Gaussian data. Our approach assumes a flexible distribution family for each variable that accommodates heavy tails and skewness, two common data features that lead to deviations from normality. The method constructs an undirected network by sequentially inferring network structure and then edge strength. The network structure is estimated from Gaussian-transformed data, adapting the nonparanormal framework by integrating parametric statistics to improve stability and computational efficiency. Edge strengths within the estimated network are then evaluated by quantifying conditional independence with parametric statistics derived from the assumed distribution family, facilitating efficient and precise calculations. By addressing challenges in modeling complex data, our method offers enhanced flexibility and provides new insights for non-Gaussian network construction. 
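
To illustrate what a Gaussian transformation of a single variable can look like, here is a minimal sketch of the standard rank-based normal-score transform associated with the nonparanormal framework. It is the generic nonparametric version, not the parametric adaptation proposed in this abstract, and it assumes the sample has no ties.

```python
from statistics import NormalDist

def gaussianize(values):
    """Map a sample to approximate normal scores via its ranks (no ties assumed)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1).
        scores[index] = NormalDist().inv_cdf(rank / (n + 1))
    return scores

# A small skewed sample with one extreme value (made-up numbers).
skewed = [0.1, 0.4, 0.2, 25.0, 1.3]
z = gaussianize(skewed)
```

The transform preserves the ordering of the observations while pulling the extreme value in to an ordinary normal quantile, so downstream Gaussian tools are less affected by heavy tails and skewness.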

Keywords

Conditional independence

Network inference

Non-Gaussian data

Nonparanormal transformation

Parametric statistics

Undirected networks 

Co-Author(s)

Xianzheng Huang, University of South Carolina
Hongmei Zhang, University of Memphis

First Author

Jiasong Duan, University of South Carolina

Presenting Author

Jiasong Duan, University of South Carolina

Spectral Embeddings of Correlation Networks

In many applications, weighted networks are constructed based on time series data in order to facilitate the application of network analysis tools. Most typically, a time series is associated with each vertex, and edge weights are given by correlations or other measures of dependence between time series. The result is a network that violates the additive, independent noise assumptions of most common network models. Nonetheless, it is common to apply embedding methods to networks built from correlations. In this work, we consider a setup in which a collection of time series is observed subject to noise, and a network is constructed based on correlations between the noisy series. We prove that, under suitable conditions, applying the adjacency spectral embedding to the network of noisily measured correlations recovers the embeddings of the true time series in the large-network limit. Additionally, we show that the resulting embedding encodes, up to orthogonal rotation, the Fourier coefficients of the true time series. This observation is novel to the networks literature, to the best of our knowledge. 
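
The pipeline described above (noisy time series, then a correlation network, then an adjacency spectral embedding) can be sketched as follows. This is a minimal illustration assuming NumPy is available; the sinusoidal signals, noise level, and embedding dimension are hypothetical choices, and the sketch makes no attempt to reproduce the theoretical results.

```python
import numpy as np

def adjacency_spectral_embedding(weights, dim):
    """Embed each vertex as a row of U_d |S_d|^{1/2} from the top-dim eigenpairs."""
    eigenvalues, eigenvectors = np.linalg.eigh(weights)
    # Keep the dim eigenvalues that are largest in magnitude.
    keep = np.argsort(np.abs(eigenvalues))[::-1][:dim]
    return eigenvectors[:, keep] * np.sqrt(np.abs(eigenvalues[keep]))

rng = np.random.default_rng(0)
t = np.arange(200)
# Two groups of five series, each group sharing a sinusoid at its own frequency.
signals = np.vstack(
    [np.sin(2 * np.pi * 3 * t / 200)] * 5 + [np.sin(2 * np.pi * 7 * t / 200)] * 5
)
noisy = signals + 0.3 * rng.standard_normal(signals.shape)  # measurement noise
corr = np.corrcoef(noisy)  # 10 x 10 weighted network of correlations
embedding = adjacency_spectral_embedding(corr, dim=2)
```

Each row of `embedding` is the two-dimensional position of one vertex; series that share an underlying signal land close together, while series from different groups are well separated.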

Keywords

Networks

Time series

Embeddings 

First Author

Keith Levin, University of Wisconsin

Presenting Author

Keith Levin, University of Wisconsin

Statistical inference for core-periphery structure

Core-periphery (CP) structure is an important meso-scale network property where nodes group into a small, densely interconnected core and a sparse periphery whose members primarily connect to the core rather than to each other. While this structure has been observed in numerous real-world networks, there has been minimal statistical formalization of it. In this work, we develop a statistical framework for CP structures by introducing a model-agnostic and generalizable population parameter which quantifies the strength of a CP structure at the level of the data-generating mechanism. We study this parameter under four canonical random graph models and establish theoretical guarantees for label recovery, including exact label recovery. Next, we construct intersection tests for validating the presence and strength of a CP structure under multiple null models, and prove theoretical guarantees for type I error and power. These tests provide a formal distinction between exogenous (or induced) and endogenous (or intrinsic) CP structure in heterogeneous networks, enabling a level of structural resolution that goes beyond merely detecting the presence of CP structure. The proposed methods show excellent performance on synthetic data, and our applications demonstrate that statistically significant CP structure is somewhat rare in real-world networks. 
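
The verbal description of CP structure above suggests a simple descriptive check: given core labels, compare the edge density within the core, between core and periphery, and within the periphery. The sketch below computes these three densities; it is not the population parameter or the tests proposed in the abstract, which are not specified here, and the toy graph and labels are assumptions. It also assumes the core and the periphery each contain at least two nodes.

```python
from itertools import combinations

def block_densities(n, edges, core):
    """Return (core-core, core-periphery, periphery-periphery) edge densities."""
    edge_set = {frozenset(e) for e in edges}
    present = {"cc": 0, "cp": 0, "pp": 0}
    possible = {"cc": 0, "cp": 0, "pp": 0}
    for u, v in combinations(range(n), 2):
        # Number of endpoints in the core selects the block: 0 -> pp, 1 -> cp, 2 -> cc.
        key = ("pp", "cp", "cc")[(u in core) + (v in core)]
        possible[key] += 1
        present[key] += frozenset((u, v)) in edge_set
    return tuple(present[k] / possible[k] for k in ("cc", "cp", "pp"))

# Nodes 0-2 form the core; nodes 3-6 are peripheral and attach only to core nodes.
core = {0, 1, 2}
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 3)]
p_cc, p_cp, p_pp = block_densities(7, edges, core)
```

A pattern with the core-core density above the core-periphery density, which in turn exceeds the periphery-periphery density, matches the verbal description; whether such a pattern is statistically significant is the question the abstract's tests address.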

Keywords

Networks

Core-periphery

Random graph models 

Co-Author(s)

Srijan Sengupta, North Carolina State University
Diganta Mukherjee, Indian Statistical Institute, Kolkata

First Author

Eric Yanchenko, Akita International University

Presenting Author

Eric Yanchenko, Akita International University