Inference Aided by Computing

Chair

Youngseok Song, West Virginia University
 
Wednesday, Aug 7: 8:30 AM - 10:20 AM
5138 
Contributed Papers 
Oregon Convention Center 
Room: CC-E147 

Main Sponsor

Section on Statistical Computing

Presentations

WITHDRAWN: Efficient Two-Sample Hypothesis Testing for Large Networks: A Nonparametric Approach

This paper provides an analysis of random networks, particularly in the context of two-sample hypothesis testing within the Random Dot Product Graph (RDPG) framework. We differentiate between semiparametric and nonparametric testing setups, focusing on the latter, which is known for its versatility and for not requiring the two networks to have vertex sets of the same size. The nonparametric setup starts with the assumption that the vertices have a set of exchangeable latent distances that determine the interactions between them. The key question investigated here is how the two sets of latent distances from the two networks compare. Working with a U-statistic-based nonparametric test statistic that approximates the maximum mean discrepancy, we address computational challenges through a network subsampling method. Subsampling is a divide-and-conquer method that reduces computation by analyzing smaller networks and then combining the results. Our objectives include designing a subsampling-based method for estimating latent positions and validating the accuracy of a bootstrap-based testing procedure. 
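The kernel of the approach is a U-statistic estimate of the maximum mean discrepancy between two samples. The sketch below is a generic unbiased MMD estimator applied to, say, estimated latent positions; the Gaussian kernel and the bandwidth are illustrative assumptions, not the authors' exact statistic or their subsampling scheme.

```python
import numpy as np

def mmd_u_statistic(X, Y, bandwidth=1.0):
    """Unbiased U-statistic estimate of squared maximum mean discrepancy
    between two samples, using a Gaussian kernel. X and Y are (n, d) and
    (m, d) arrays, e.g. estimated latent positions of the two networks."""
    def gram(A, B):
        d2 = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    n, m = len(X), len(Y)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```

The statistic is near zero when both samples come from the same distribution and grows when the distributions differ; the two samples may have different sizes, matching the size-independence noted above.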

Keywords

Two-sample hypothesis testing

Subsampling

Nonparametric

Random Dot Product Graph 

Co-Author(s)

Srijan Sengupta, North Carolina State University
Yuguo Chen, University of Illinois at Urbana-Champaign

First Author

Kaustav Chakraborty

Presenting Author

Kaustav Chakraborty

Power Calculations in Meta-Analysis

Power calculations in the environment of testing hypotheses are well hashed out. To carry out these calculations, we need: 1. a family of models; 2. a null hypothesis; 3. an alternative hypothesis; 4. data; 5. a test statistic T; 6. the distribution of T under the null and alternative hypotheses; 7. a significance level alpha; 8. a test. Power calculations let us choose a test and a sample size. The main goal of this presentation is to bring this entire modus operandi into the realm of Meta-Analysis. The overarching purpose of Meta-Analysis is to synthesize several studies all focusing on the same testing problem. Let m be the number of studies chosen for synthesis. Information on the studies is collected in two ways: 1. relevant summary statistics from each study; 2. p-values from each study. Some studies report only p-values, and these studies are the focus of this presentation. Scores of tests have been proposed and used in the literature on synthesis; Tippett's test, Fisher's test, and Pearson's test are some examples. We initiate power calculations for these tests as a function of the number of studies m and show how power calculations help us make comparisons between the tests. 
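The power of a combined test as a function of m can be estimated by simulation. The sketch below compares Fisher's and Tippett's combiners for a hypothetical per-study design (a one-sided z-test with an assumed effect size and sample size); the function name and all parameter choices are illustrative, not from the presentation.

```python
import numpy as np
from math import erfc, exp, factorial, sqrt

def chi2_sf_even(x, df):
    """Chi-square survival function for even df (closed form)."""
    half = df // 2
    return exp(-x / 2) * sum((x / 2) ** k / factorial(k) for k in range(half))

def power_fisher_tippett(m, effect=0.1, n=50, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo power of Fisher's and Tippett's combined tests versus the
    number of studies m. Each study is a one-sided z-test of H0: mu = 0
    based on n observations with true mean `effect` (illustrative choices)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(effect * sqrt(n), 1.0, size=(reps, m))
    p = 0.5 * np.vectorize(erfc)(z / sqrt(2))         # one-sided p-values
    fisher_stat = -2.0 * np.log(p).sum(axis=1)        # ~ chi2_{2m} under H0
    fisher_power = np.mean([chi2_sf_even(s, 2 * m) < alpha for s in fisher_stat])
    # Tippett: reject when min p < 1 - (1 - alpha)^{1/m}
    tippett_power = np.mean(p.min(axis=1) < 1 - (1 - alpha) ** (1 / m))
    return fisher_power, tippett_power
```

Running this over a grid of m values traces the power curves of the two combiners and makes the comparison between tests concrete.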

Keywords

Power

Sample Size

Meta-Analysis

Tippett test

Fisher test

Number of Studies for Meta-Analysis 

View Abstract 3150

Co-Author(s)

Anand Seth, SK Patent Associates LLC
Nisha Sheshashayee
Suyang Gao, University of Cincinnati
Neelakshi Chatterjee
Zhaochong Yu, Division of Biostatistics and Bioinformatics, DEPHS, University of Cincinnati

First Author

Marepalli Rao, University of Cincinnati

Presenting Author

Marepalli Rao, University of Cincinnati

Learning from peers: Evolutionary Stochastic Gradient Langevin Dynamic

Although stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms are often used to solve non-convex learning problems, few attempts have been made to develop a population SGMCMC algorithm. Such a population algorithm, involving a group of Markov chains, can improve mixing through interactions between the chains. In this paper, we propose an Evolutionary Stochastic Gradient Langevin Dynamic (ESGLD) algorithm: a population SGMCMC algorithm that takes advantage of the evolutionary operators that have proven powerful in overcoming local traps in Monte Carlo simulations with the Metropolis-Hastings algorithm. We prove the convergence of the ESGLD algorithm and demonstrate, through synthetic and real data experiments, that it outperforms other SGMCMC algorithms in terms of convergence speed and effective sample generation. 
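A minimal sketch of the population idea, assuming a tempered-chain design: several Langevin chains run in parallel and occasionally exchange states. The exchange move here stands in for the evolutionary operators of ESGLD; this is not the authors' algorithm, and all names and parameters are illustrative.

```python
import numpy as np

def population_sgld(grad_log_post, log_post, temps, theta0, step=0.1,
                    n_iter=5000, swap_every=10, seed=0):
    """Toy population Langevin sampler: one SGLD chain per temperature in
    `temps`, with occasional Metropolis exchange moves between neighbouring
    temperatures. Returns post-burn-in draws from the coldest chain."""
    rng = np.random.default_rng(seed)
    chains = [np.array(theta0, dtype=float) for _ in temps]
    samples = []
    for t in range(n_iter):
        for i, temp in enumerate(temps):
            g = grad_log_post(chains[i]) / temp          # tempered gradient
            noise = rng.normal(0.0, np.sqrt(step), size=chains[i].shape)
            chains[i] = chains[i] + 0.5 * step * g + noise
        if t % swap_every == 0 and len(temps) > 1:       # exchange (swap) move
            i = rng.integers(len(temps) - 1)
            log_r = (1 / temps[i] - 1 / temps[i + 1]) * (
                log_post(chains[i + 1]) - log_post(chains[i]))
            if np.log(rng.uniform()) < log_r:
                chains[i], chains[i + 1] = chains[i + 1], chains[i]
        if t >= n_iter // 2:                             # keep cold-chain draws
            samples.append(chains[0].copy())
    return np.array(samples)
```

Hotter chains flatten the posterior and move between modes more easily; accepted exchanges pass those discoveries down to the cold chain, which is how interaction improves mixing.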

Keywords

evolutionary Monte Carlo

Stochastic gradient Langevin Dynamic

non-convex learning

population Markov chain Monte Carlo

local trap 

View Abstract 2468

Co-Author

Faming Liang, Purdue University

First Author

Yushu Huang

Presenting Author

Yushu Huang

Assessing the Robustness of Mediation Analysis in the Presence of Outliers: A Simulation Study

Causal inference is an analytical framework used to identify and understand cause-and-effect relationships between variables in observational studies. In the health sciences, where controlled experiments are often difficult or unethical to conduct, causal inference plays a crucial role in drawing conclusions about the impact of various factors on outcomes. Most existing causal inference methods rely on the assumption that the distributional requirements of the underlying models are met. However, outliers may violate the assumptions of statistical models, such as normality or linearity; when causal inference methods depend on these assumptions, the presence of outliers can lead to model misspecification and biased results. Outliers can distort the estimation of causal relationships between variables and influence the sensitivity of causal inference analyses. Thus, in this study, we evaluate the effect of outliers on mediation analysis using an extensive simulation study, demonstrating how they pose challenges by distorting estimates, introducing bias, and affecting the generalizability of results. 
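The kind of distortion studied here can be reproduced in a few lines. The sketch below simulates a simple X -> M -> Y mediation model, estimates the indirect effect by the product-of-coefficients method, and optionally contaminates the outcome with gross outliers at high-leverage points; the model, coefficients, and contamination scheme are illustrative assumptions, not the authors' simulation design.

```python
import numpy as np

def simulate_mediation(n=500, a=0.5, b=0.7, outlier_frac=0.0, seed=0):
    """Simulate a toy mediation model X -> M -> Y and estimate the indirect
    effect a*b by the product-of-coefficients method, optionally adding
    gross outliers to Y at high-leverage points (largest mediator values)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    m = a * x + rng.normal(size=n)               # mediator model: M = a X + e1
    y = b * m + 0.3 * x + rng.normal(size=n)     # outcome model: Y = b M + c X + e2
    k = int(outlier_frac * n)
    if k > 0:
        idx = np.argsort(m)[-k:]                 # contaminate where leverage is high
        y[idx] += 20.0
    ones = np.ones(n)
    a_hat = np.linalg.lstsq(np.column_stack([x, ones]), m, rcond=None)[0][0]
    b_hat = np.linalg.lstsq(np.column_stack([m, x, ones]), y, rcond=None)[0][0]
    return a_hat * b_hat                          # estimated indirect effect
```

With no contamination the estimate sits near the true indirect effect (0.35 under these settings); a 10% contamination at high-leverage points inflates it severely, illustrating the bias the study quantifies.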

Keywords

Causal inference

Outliers

Robustness 

View Abstract 3768

Co-Author(s)

Evrim Oral, LSUHSC School of Public Health
Yaqi Zou
Ece Oral

First Author(s)

Yaqi Zou
Evrim Oral, LSUHSC School of Public Health

Presenting Author

Evrim Oral, LSUHSC School of Public Health

Choosing methods of approximating and combining discrete p-values: an optimal transport approach

Combining p-values is a popular meta-analysis method when test data are unavailable or challenging to merge into a single global significance assessment. A variety of methods with different statistical properties exist in the continuous case (when the null distribution of each p-value is uniform). Heard and Rubin-Delanchy (2018) reframed each method as a likelihood ratio test, guiding the selection of a most powerful combiner for a specific alternative. Discrete p-values present additional challenges, as their null distributions vary significantly, making the distribution of each combiner intractable. We first present a testing framework based on a Wasserstein-closest modification of a p-value toward a target distribution and show that, under very mild conditions, it produces asymptotically consistent tests. We then give closed-form approximate statistics for common methods (Fisher, Pearson, Edgington, Stouffer, George), identify the most powerful discrete combiner in many alternative-hypothesis settings, and present applications in public health, weak and sparse signal detection, and genetic and genomic association tests. 
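For reference, the continuous-case forms of several of the classical combiners named above can be computed in closed form. The sketch below covers Fisher, Pearson, Stouffer, and Edgington (George's logit method needs a t quantile and is omitted here); it implements the standard continuous-uniform versions only, not the discrete modifications proposed in this work.

```python
import numpy as np
from math import comb, factorial, exp, sqrt, erfc
from statistics import NormalDist

def _chi2_sf_even(x, df):
    """Chi-square survival function for even df (closed form)."""
    half = df // 2
    return exp(-x / 2) * sum((x / 2) ** k / factorial(k) for k in range(half))

def combine_pvalues(p, method="fisher"):
    """Combined p-value for independent continuous (uniform-null) p-values
    under several classical combiners."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    if method == "fisher":        # -2 sum log p ~ chi2_{2m}; large = evidence
        return _chi2_sf_even(-2.0 * np.log(p).sum(), 2 * m)
    if method == "pearson":       # -2 sum log(1 - p); small = evidence
        return 1.0 - _chi2_sf_even(-2.0 * np.log1p(-p).sum(), 2 * m)
    if method == "stouffer":      # sum of z-scores, scaled to N(0, 1)
        z = sum(NormalDist().inv_cdf(1 - pi) for pi in p) / sqrt(m)
        return 0.5 * erfc(z / sqrt(2))
    if method == "edgington":     # sum of p-values; Irwin-Hall null CDF
        s = p.sum()
        return sum((-1) ** k * comb(m, k) * (s - k) ** m
                   for k in range(int(s) + 1)) / factorial(m)
    raise ValueError(method)
```

With discrete p-values the null distributions above no longer hold, which is exactly the gap the Wasserstein-closest modification addresses.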

Keywords

p-value combination

Meta-Analysis

Stouffer's Method

Edgington’s method

Fisher’s method

George’s method 

View Abstract 2803

First Author

Gonzalo Contador

Presenting Author

Gonzalo Contador

Clustering Singular and Non-Singular Covariance Matrices for Classification

In high-dimensional classification problems with a large number of classes and few observations per class, linear discriminant analysis (LDA) requires the strong assumption of a covariance matrix shared by all classes, while quadratic discriminant analysis leads to singular or unstable covariance matrix estimates. Either can result in lower than desired classification performance. We introduce a novel model-based clustering method that relaxes the shared-covariance assumption of LDA by clustering sample covariance matrices, whether singular or non-singular. This leads to covariance matrix estimates that are pooled within each cluster. We show, using simulated and real data, that our method tends to yield better discrimination than competing methods. 
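A rough sketch of the cluster-then-pool idea, under strong simplifying assumptions: instead of the finite-mixture/EM machinery of the actual method, this stand-in runs k-means in Frobenius geometry on the vectorized sample covariance matrices and averages within clusters. All names and choices are hypothetical.

```python
import numpy as np

def cluster_and_pool_covariances(cov_list, k, n_iter=50, seed=0):
    """Toy stand-in for model-based clustering of sample covariance matrices:
    k-means in Frobenius geometry on the vectorized matrices, followed by
    pooling (averaging) within each cluster. Works for singular inputs too,
    since no inverses or matrix logarithms are required."""
    rng = np.random.default_rng(seed)
    X = np.stack([np.asarray(C, dtype=float).ravel() for C in cov_list])
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    p = np.asarray(cov_list[0]).shape[0]
    pooled = [centers[j].reshape(p, p) for j in range(k)]
    return labels, pooled
```

Pooling within clusters sits between LDA's single shared covariance and QDA's one-per-class estimates, which is the trade-off the abstract describes.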

Keywords

Finite Mixture Models

EM-algorithm

Model Based Clustering

Classification

Singular Covariance Matrices

Pattern Recognition 

View Abstract 3080

Co-Author

Semhar Michael, South Dakota State University

First Author

Andrew Simpson

Presenting Author

Andrew Simpson

Efficient inference for start-up demonstration tests

Auxiliary Markov chains have been used as a mechanism to efficiently compute the distribution of a pattern statistic in a Markovian sequence. However, if distributions are needed for many values of the input probabilities rather than just one set of values, the entire computation must be repeated. In this work, a method is put forward that reduces the computational burden in this scenario: counts of data strings with various values of the sufficient statistics are updated instead of probabilities, and the final counts are then used to reconstruct the probabilities for the many sets of input probabilities, improving efficiency. In this talk, the methodology is illustrated by computing the probability of accepting a unit in start-up demonstration tests for many different start-up probabilities. 
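The count-then-evaluate idea can be sketched for one common start-up demonstration test variant: accept on k consecutive successful start-ups, reject on d total failures. The enumeration below tracks counts of accepted sequences by their sufficient statistic (number of successes s, number of failures f); the acceptance probability for any start-up probability p is then a cheap polynomial evaluation, so many p values reuse one enumeration. This is an illustrative sketch, not the authors' sparse-Markov machinery.

```python
from collections import defaultdict

def acceptance_counts(k, d):
    """Enumerate accepted start-up sequences by their sufficient statistic:
    counts[(s, f)] = number of sequences with s successes and f failures
    that end in acceptance (k consecutive successes before d total failures)."""
    counts = defaultdict(int)
    frontier = defaultdict(int)
    frontier[(0, 0, 0)] = 1                  # (current run, failures, successes)
    while frontier:
        nxt = defaultdict(int)
        for (run, f, s), c in frontier.items():
            if run + 1 == k:                 # a success completes the run: accept
                counts[(s + 1, f)] += c
            else:
                nxt[(run + 1, f, s + 1)] += c
            if f + 1 < d:                    # a failure resets the run
                nxt[(0, f + 1, s)] += c      # (f + 1 == d would mean rejection)
        frontier = nxt
    return dict(counts)

def acceptance_prob(counts, p):
    """Reconstruct P(accept) for any start-up probability p from the counts."""
    return sum(c * p ** s * (1 - p) ** f for (s, f), c in counts.items())
```

For example, with k = 2 and d = 2 the accepted sequences are SS, FSS, and SFSS, so P(accept) = p^2 + p^2(1-p) + p^3(1-p), and sweeping p over a grid costs only the polynomial evaluations.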

Keywords

minimal deterministic finite automaton

sequence alignment

sequential computation

spaced seeds

sparse Markov models

start-up demonstration tests 

View Abstract 2870

Co-Author(s)

Laurent Noe, CRIStAL (UMR 9189 Lille University/CNRS) - INRIA Lille Nord-Europe
Elie Alhajjar, RAND Corporation
Nonhle Mdziniso, Rochester Institute of Technology

First Author

Donald Martin, NC State University

Presenting Author

Nonhle Mdziniso, Rochester Institute of Technology