Wednesday, Aug 7: 8:30 AM - 10:20 AM
5138
Contributed Papers
Oregon Convention Center
Room: CC-E147
Main Sponsor
Section on Statistical Computing
Presentations
This paper provides an analysis of random networks, particularly in the context of two-sample hypothesis testing within the Random Dot Product Graph (RDPG) framework. We differentiate between semiparametric and nonparametric testing setups, focusing on the latter, which is known for its versatility and for not requiring the two networks' vertex sets to be of equal size. The nonparametric setup starts with the assumption that all the vertices have a set of exchangeable latent distances that determine the interactions between them. The key question investigated here is the comparison of the two sets of latent distances from the two networks. Working with a U-statistic-based nonparametric test statistic that approximates the maximum mean discrepancy, we address computational challenges through a network subsampling method. Subsampling is a divide-and-conquer method that reduces computation by analyzing smaller networks and then combining the results. Our objectives include designing a subsampling-based method for estimating latent positions and validating the accuracy of a bootstrap-based testing procedure.
Keywords
Two-sample hypothesis testing
Subsampling
Nonparametric
Random Dot Product Graph
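The two core ingredients of the abstract above — a spectral embedding of each network's vertices and a U-statistic estimate of the squared maximum mean discrepancy between the two embeddings — can be sketched as follows. This is an illustrative sketch only, not the authors' subsampling method; the function names (`ase`, `mmd2_u`) and the Gaussian-kernel bandwidth are assumptions for the example.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: scale the top-d eigenvectors of a
    symmetric adjacency matrix by the square roots of the eigenvalue
    magnitudes, giving estimated latent positions."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def mmd2_u(X, Y, sigma=1.0):
    """Unbiased U-statistic estimate of the squared maximum mean
    discrepancy between two point clouds, Gaussian kernel of width sigma."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    np.fill_diagonal(Kxx, 0.0)   # drop i == j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()
```

The subsampling idea of the abstract would then apply `mmd2_u` to embeddings of small random vertex subsets and aggregate the resulting statistics, rather than embedding the full networks.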
Abstracts
Power calculations in the context of testing hypotheses are well established. To carry out these calculations, we need: 1. a family of models; 2. a null hypothesis; 3. an alternative hypothesis; 4. data; 5. a test statistic T; 6. the distribution of T under the null and alternative hypotheses; 7. a significance level alpha; 8. a test. Power calculations let us choose a test and a sample size. The main goal of this presentation is to bring this entire modus operandi into the realm of meta-analysis. The overarching purpose of meta-analysis is to synthesize several studies, all focusing on the same testing problem. Let m be the number of studies chosen for synthesis. Information on the studies is collected in two ways: 1. relevant summary statistics from each study; 2. p-values from each study. Some studies report only p-values; these studies are the focus of this presentation. Scores of tests have been proposed and used in the literature on synthesis; Tippett's test, Fisher's test, and Pearson's test are some examples. We initiate power calculations for these tests as a function of the number of studies m, and we show how power calculations help us make comparisons between the tests.
Keywords
Power
Sample Size
Meta-Analysis
Tippett test
Fisher test
Number of Studies for Meta-Analysis
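Power as a function of the number of studies m can be estimated by Monte Carlo for the classical combiners named above. The sketch below does this for Fisher's test (reject when -2 Σ log p_i exceeds the chi-squared(2m) critical value) and Tippett's test (reject when min p_i < 1 - (1 - α)^{1/m}), assuming each study is a one-sided z-test with a common effect size delta; the per-study model and all parameter names are illustrative assumptions, not the presentation's exact setup.

```python
import numpy as np
from scipy import stats

def power_fisher(m, delta, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo power of Fisher's combiner for m one-sided z-tests
    with common effect size delta."""
    rng = np.random.default_rng(seed)
    z = rng.normal(loc=delta, size=(reps, m))
    p = stats.norm.sf(z)                       # one-sided p-values
    T = -2.0 * np.log(p).sum(axis=1)           # ~ chi2(2m) under H0
    return (T > stats.chi2.ppf(1 - alpha, 2 * m)).mean()

def power_tippett(m, delta, alpha=0.05, reps=2000, seed=0):
    """Monte Carlo power of Tippett's minimum-p combiner."""
    rng = np.random.default_rng(seed)
    z = rng.normal(loc=delta, size=(reps, m))
    p = stats.norm.sf(z)
    return (p.min(axis=1) < 1 - (1 - alpha) ** (1.0 / m)).mean()
```

Evaluating these functions over a grid of m values gives exactly the kind of power-versus-number-of-studies comparison the abstract describes.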
Though stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms are often used to solve non-convex learning problems, few attempts have been made to develop a population SGMCMC algorithm. Such a population algorithm, involving a group of Markov chains, can improve mixing through interactions between the chains. In this paper, we propose an Evolutionary Stochastic Gradient Langevin Dynamics (ESGLD) algorithm: a population SGMCMC algorithm that takes advantage of the evolutionary operators that have proven powerful in overcoming local traps in Monte Carlo simulations with the Metropolis-Hastings algorithm. We prove the convergence of the ESGLD algorithm and demonstrate, through synthetic and real data experiments, that it outperforms other SGMCMC algorithms in terms of convergence speed and effective sample generation.
Keywords
evolutionary Monte Carlo
Stochastic gradient Langevin Dynamic
non-convex learning
population Markov chain Monte Carlo
local trap
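The building block that ESGLD extends is the stochastic gradient Langevin dynamics update, theta ← theta + (ε/2) ∇ log π(theta) + N(0, ε). A minimal sketch of plain SGLD on a toy target is shown below; the evolutionary operators (e.g. crossover moves between chains in the population) that distinguish ESGLD are not reproduced here, and the function signature is an assumption for illustration.

```python
import numpy as np

def sgld(grad_log_post, theta0, step, n_iter, seed=0):
    """Plain stochastic gradient Langevin dynamics:
    theta <- theta + (step/2) * grad log pi(theta) + N(0, step) noise."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_iter):
        theta = (theta + 0.5 * step * grad_log_post(theta)
                 + rng.normal(scale=np.sqrt(step), size=theta.shape))
        samples.append(theta.copy())
    return np.array(samples)

# Toy target: standard normal, so grad log pi(x) = -x.
draws = sgld(lambda x: -x, np.zeros(1), step=0.1, n_iter=5000)
```

In a full population scheme, several such chains would run in parallel, with evolutionary exchanges between them helping the population escape local traps.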
Causal inference is an analytical framework used to identify and understand cause-and-effect relationships between variables in observational studies. In health sciences, where controlled experiments are often difficult or unethical to conduct, causal inference plays a crucial role in drawing conclusions about the impact of various factors on outcomes. Most existing causal inference methods rely on the assumption that the distributional requirements of the models are satisfied. However, outliers may violate the assumptions of statistical models, such as normality or linearity. If causal inference methods depend on these assumptions, the presence of outliers can lead to model misspecification and biased results. Outliers can significantly impact inference, as they have the potential to distort the estimation of causal relationships between variables and to influence the sensitivity of causal inference analyses. Thus, in this study, we evaluate the effect of outliers on mediation analysis through an extensive simulation study, demonstrating how they pose challenges by distorting estimates, introducing bias, and limiting the generalizability of results.
Keywords
Causal inference
Outliers
Robustness
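The kind of distortion the abstract describes is easy to reproduce in a toy simulation: estimate the indirect effect a*b in a simple mediation model X → M → Y by the product-of-coefficients method, then contaminate a small fraction of the mediator with gross outliers and watch the estimate collapse. The model, coefficient values, and contamination scheme below are illustrative assumptions, not the study's actual simulation design.

```python
import numpy as np

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate of the indirect effect a*b,
    from OLS fits of M ~ X and Y ~ M + X."""
    a = np.polyfit(x, m, 1)[0]                      # slope of M on X
    Z = np.column_stack([m, x, np.ones_like(x)])    # Y ~ M + X + const
    b = np.linalg.lstsq(Z, y, rcond=None)[0][0]     # coefficient on M
    return a * b

# Simple mediation model with a = b = 0.5 (true indirect effect 0.25).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.5 * m + 0.2 * x + rng.normal(size=n)
clean = indirect_effect(x, m, y)

# Contaminate 5% of the mediator with gross outliers.
m_out = m.copy()
idx = rng.choice(n, size=n // 20, replace=False)
m_out[idx] += 15.0
contaminated = indirect_effect(x, m_out, y)
```

The contaminated mediator acts as an error-laden predictor in the Y regression, attenuating the estimated b and hence biasing the indirect effect toward zero.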
Combining p-values is a popular meta-analysis method when test data are unavailable or challenging to merge into a global significance measure. A variety of methods with different statistical properties exist in the continuous case (when the null distribution of each p-value is uniform). Heard and Rubin-Delanchy (2018) reframed each method as a likelihood ratio test, guiding the selection of a most powerful combiner for a specific alternative. Discrete p-values present additional challenges, as their null distribution varies significantly, making the distribution of each combiner intractable. We first present a testing framework based on a Wasserstein-closest modification of a p-value toward a target distribution and show that, under very mild conditions, it produces asymptotically consistent tests. We then present closed-form approximation statistics for common methods (Fisher, Pearson, Edgington, Stouffer, George) and the optimal choice of a most powerful discrete combiner in many alternative hypothesis settings, with applications in public health, weak and sparse signal detection, and genetic and genomic association tests.
Keywords
p-value combination
Meta-Analysis
Stouffer's Method
Edgington’s method
Fisher’s method
George’s method
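For reference, three of the classical continuous-case combiners named above can be written in a few lines each: Fisher (-2 Σ log p_i against chi-squared(2m)), Stouffer (sum of z-scores scaled by √m), and Edgington (sum of p-values against the Irwin-Hall distribution). These are the standard uniform-null formulas only; the paper's Wasserstein-based modification for discrete p-values is not reproduced here.

```python
import numpy as np
from math import comb, factorial
from scipy import stats

def fisher(p):
    """Fisher: -2 * sum(log p_i) ~ chi2(2m) under H0."""
    p = np.asarray(p, dtype=float)
    return stats.chi2.sf(-2.0 * np.log(p).sum(), 2 * p.size)

def stouffer(p):
    """Stouffer: sum of z-scores Phi^{-1}(1 - p_i), scaled by sqrt(m)."""
    p = np.asarray(p, dtype=float)
    return stats.norm.sf(stats.norm.isf(p).sum() / np.sqrt(p.size))

def edgington(p):
    """Edgington: S = sum(p_i), combined p = Irwin-Hall CDF at S."""
    p = np.asarray(p, dtype=float)
    s, m = p.sum(), p.size
    return sum((-1) ** k * comb(m, k) * (s - k) ** m
               for k in range(int(np.floor(s)) + 1)) / factorial(m)
```

With discrete p-values these null distributions no longer hold exactly, which is precisely the gap the abstract's framework addresses.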
In classification problems with high-dimensional data, a large number of classes, and few observations per class, linear discriminant analysis (LDA) requires the strong assumption of a covariance matrix shared across all classes, while quadratic discriminant analysis leads to singular or unstable covariance matrix estimates. Both issues can lower classification performance. We introduce a novel model-based clustering method that relaxes the shared-covariance assumption of LDA by clustering the sample covariance matrices, whether singular or non-singular. This yields covariance matrix estimates that are pooled within each cluster. Using simulated and real data, we show that our method tends to yield better discrimination than competing methods.
Keywords
Finite Mixture Models
EM-algorithm
Model Based Clustering
Classification
Singular Covariance Matrices
Pattern Recognition
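The pooling step that follows the clustering can be sketched directly: given cluster labels for the class-specific sample covariance matrices, replace each one with the weighted average of the matrices in its cluster (the same weighting LDA uses for its single pooled estimate). This sketch assumes labels are already available and omits the mixture-model/EM machinery the abstract refers to; the function name is illustrative.

```python
import numpy as np

def pooled_covariances(covs, ns, labels):
    """Pool class sample covariance matrices within each cluster,
    weighted by class degrees of freedom (n_g - 1), as LDA pools
    across all classes.

    covs: array of shape (G, p, p); ns: class sample sizes (G,);
    labels: cluster assignment of each class (G,)."""
    covs, ns, labels = np.asarray(covs), np.asarray(ns), np.asarray(labels)
    pooled = {}
    for g in np.unique(labels):
        idx = labels == g
        w = ns[idx] - 1
        # Weighted sum of the (p, p) matrices in this cluster.
        pooled[g] = np.tensordot(w, covs[idx], axes=1) / w.sum()
    return pooled
```

Classes in the same cluster then share one (better-conditioned) covariance estimate, interpolating between LDA's single pooled matrix and QDA's fully separate ones.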
Auxiliary Markov chains have been used as a mechanism to efficiently compute the distribution of a pattern statistic in a Markovian sequence. However, if distributions are needed for many values of the input probabilities rather than just one set, the entire computation must be repeated. In this work, a method is put forward that reduces the computational burden in this scenario. Counts of data strings with various values of sufficient statistics are updated instead of probabilities. The final counts are then used to reconstruct probabilities for the many sets of input probabilities, improving efficiency. In this talk, the methodology is illustrated by computing the probability of accepting a unit in start-up demonstration tests for many different start-up probabilities.
Keywords
minimal deterministic finite automaton
sequence alignment
sequential computation
spaced seeds
sparse Markov models
start-up demonstration tests
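To make the start-up demonstration example concrete, here is a direct computation of the acceptance probability under one common rule (accept on k consecutive successful start-ups, reject on d consecutive failures), via an absorbing Markov chain on the current success/failure run lengths. Note that this whole linear solve must be redone for every start-up probability p, which is exactly the repeated cost the talk's count-updating method avoids; the specific rule and function name are illustrative assumptions.

```python
import numpy as np

def accept_prob(k, d, p):
    """P(accept) for a start-up demonstration test that accepts on k
    consecutive successful start-ups and rejects on d consecutive
    failures, with i.i.d. Bernoulli(p) start-ups."""
    q = 1.0 - p
    # Unknowns: a[0..k-1], acceptance prob with current success run i,
    # and b[1..d-1], acceptance prob with current failure run j.
    # Recursions: a[i] = p*a[i+1] + q*b[1]   (a[k] = 1: accepted)
    #             b[j] = p*a[1]   + q*b[j+1] (b[d] = 0: rejected)
    n = k + d - 1
    A = np.eye(n)
    rhs = np.zeros(n)
    for i in range(k):                # rows for a[i], index i
        if i + 1 < k:
            A[i, i + 1] -= p
        else:
            rhs[i] += p               # next success reaches a[k] = 1
        if d > 1:
            A[i, k] -= q              # b[1] sits at index k
    for j in range(1, d):             # rows for b[j], index k - 1 + j
        r = k - 1 + j
        if k > 1:
            A[r, 1] -= p              # a success restarts the run at 1
        else:
            rhs[r] += p               # k == 1: a single success accepts
        if j + 1 < d:
            A[r, r + 1] -= q          # reaching b[d] means rejection
    return np.linalg.solve(A, rhs)[0]
```

Sweeping `accept_prob` over a grid of p values illustrates the many-input-probabilities scenario the talk targets.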