Tuesday, Aug 5: 8:30 AM - 10:20 AM
4088
Contributed Papers
Music City Center
Room: CC-104E
Main Sponsor
Section on Statistical Computing
Presentations
Verifying that inference using a parametric regression model is reliable is a crucial step in statistical model building. It helps avoid invalid modeling conclusions based on false assumptions. For example, the p-value associated with a coefficient in a linear model is unreliable if the mean function being used is incorrect. Until now, there has been no easy and reliable way in R to test whether or not the mean function is correct.
In my presentation, I shall introduce my new R package, "distfreereg", that I have written to implement the distribution-free testing procedure for parametric regression models introduced by Estate Khmaladze in 2021. I shall outline Khmaladze's algorithm, discuss the main features of the package, and illustrate its use with an example.
Keywords
goodness-of-fit testing
regression
distribution-free testing
R package
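As a hypothetical illustration of the intended workflow (the function name distfreereg() and the argument names below are assumptions made for illustration, not the package's documented interface), the test might be invoked along these lines:

```r
# Hypothetical usage sketch: the call signature below is assumed, not taken
# from the package documentation.
library(distfreereg)

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)                                 # data with a linear mean

mean_fun <- function(theta, x) theta[1] + theta[2] * x    # proposed mean function

test <- distfreereg(test_mean = mean_fun,
                    Y = y, X = matrix(x, ncol = 1),
                    theta_init = c(0, 1))
test   # reports a distribution-free p-value for the adequacy of mean_fun
```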
We examine two-sample hypothesis testing in random networks within the Random Dot Product Graph (RDPG) framework and develop a time-efficient algorithm. We distinguish between semiparametric and nonparametric testing, emphasizing the latter for its flexibility and independence from network size. The nonparametric approach assumes that vertex interactions are governed by exchangeable latent distances, and the central question is whether the latent distance distributions differ between two networks. To address this, we use a U-statistic-based test statistic that approximates the maximum mean discrepancy but is computationally expensive for large networks. To overcome this challenge, we introduce a subsampling-based method that partitions large networks, analyzes the smaller subgraphs, and aggregates the results. Our contributions include designing a subsampling-based latent position estimator and validating a bootstrap-based testing procedure, as well as developing several faster divide-and-conquer testing methods. This work advances efficient and consistent network analysis, with broad applicability across diverse domains.
Keywords
Two-sample hypothesis testing
Network model
Nonparametric testing
Subsampling
Time efficient algorithm
Random Dot Product Graph
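A schematic sketch of the subsample-and-aggregate idea described above, not the authors' implementation: adjacency spectral embedding is assumed as the latent position estimator, a (biased) Gaussian-kernel MMD estimate stands in for the U-statistic, and subgraph-level statistics are simply averaged.

```r
# Schematic sketch (assumed details, not the authors' code).
ase <- function(A, d) {
  # adjacency spectral embedding: top-d scaled eigenvectors
  e <- eigen(A, symmetric = TRUE)
  e$vectors[, 1:d] %*% diag(sqrt(abs(e$values[1:d])), d)
}

mmd2 <- function(X, Y, sigma = 1) {
  # biased Gaussian-kernel MMD^2 between two sets of estimated latent positions
  K <- exp(-as.matrix(dist(rbind(X, Y)))^2 / (2 * sigma^2))
  n <- nrow(X); m <- nrow(Y)
  mean(K[1:n, 1:n]) + mean(K[(n + 1):(n + m), (n + 1):(n + m)]) -
    2 * mean(K[1:n, (n + 1):(n + m)])
}

subsampled_stat <- function(A1, A2, d = 2, size = 200, reps = 10) {
  # divide-and-conquer shortcut: analyze random vertex subsamples and aggregate
  stats <- replicate(reps, {
    v1 <- sample(nrow(A1), size)
    v2 <- sample(nrow(A2), size)
    mmd2(ase(A1[v1, v1], d), ase(A2[v2, v2], d))
  })
  mean(stats)
}
```

In practice the null distribution of such a statistic would still need to be calibrated, for example by the bootstrap procedure mentioned in the abstract; that step is omitted here.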
Multiple testing procedures that control the false discovery rate (FDR) have been widely adopted for testing a large number of hypotheses. The Benjamini and Hochberg multiple testing procedure (the BH procedure) was the first procedure introduced to control the FDR. However, as the total number of false null hypotheses increases, the BH procedure becomes overly conservative and thus lacks power. In this paper, we present a Two-Stage BH procedure with a tuning parameter. In stage I, the procedure estimates the total number of true null hypotheses m0, which is then used to adjust the level of significance when applying the BH procedure in stage II. The tuning parameter provides tighter control of the FDR and enhances statistical power. Theoretical properties of the proposed procedure and its power performance will be presented.
Keywords
False discovery rate
Multiple testing procedures
Benjamini-Hochberg
Tuning parameter
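For illustration only, a generic two-stage adaptive sketch in the spirit described above, where lambda plays the role of the tuning parameter and a Storey-type estimator of m0 is assumed (the paper's stage-I estimator may differ):

```r
# Generic two-stage BH sketch; the m0 estimator and the role of the tuning
# parameter are assumptions, not the paper's exact procedure.
two_stage_bh <- function(p, alpha = 0.05, lambda = 0.5) {
  m <- length(p)
  # Stage I: estimate the number of true nulls m0 using the tuning parameter
  m0_hat <- min(m, (1 + sum(p > lambda)) / (1 - lambda))
  # Stage II: apply the BH procedure at the adjusted level alpha * m / m0_hat
  adj_level <- alpha * m / m0_hat
  rejected <- which(p.adjust(p, method = "BH") <= adj_level)
  list(m0_hat = m0_hat, adj_level = adj_level, rejected = rejected)
}
```

When m0_hat is close to m this reduces to the ordinary BH procedure; when many null hypotheses are false, the adjusted level is larger and power increases.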
The problem of testing/estimating a common explanatory variable based on combined information from independent calibration models is addressed. The response variables are measured using different instruments, different methods, or at different laboratories. It is assumed that the calibration model at each source is a simple linear regression model and that the model parameters differ across sources. In this scenario, the problem of constructing a confidence interval (CI) for the common unknown value of the explanatory variable is considered. Confidence intervals for the unknown explanatory variable, obtained by inverting some popular combined tests, are proposed. These CIs are exact and more precise than an existing CI in the literature. All CIs are compared with respect to precision, and some recommendations are made. The interval estimation methods are illustrated using two examples.
Keywords
Combined tests
Controlled calibration
Fisher's test
Maximum likelihood estimates
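A schematic grid-inversion sketch of the idea, with Fisher's combination assumed as the combined test (the per-source calibration statistics and the paper's exact combining rules may differ): a candidate value x0 belongs to the CI when the combined test does not reject H0: x = x0.

```r
# Sketch: invert Fisher's combined test over a grid of candidate x0 values.
# p_funs is a list of functions, one per laboratory/instrument, each mapping
# a candidate x0 to that source's p-value for H0: x = x0 (e.g., from the
# source's calibration t-statistic); these per-source tests are assumed here.
combined_ci <- function(p_funs, grid, alpha = 0.05) {
  k <- length(p_funs)
  accept <- vapply(grid, function(x0) {
    fisher <- -2 * sum(vapply(p_funs, function(f) log(f(x0)), numeric(1)))
    fisher <= qchisq(1 - alpha, df = 2 * k)   # fail to reject at level alpha
  }, logical(1))
  range(grid[accept])   # assumes the set of non-rejected values is an interval
}
```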
A modified likelihood ratio test and confidence intervals for the mean of a two-parameter negative binomial (NB) distribution are proposed and compared with available ones. The problems of testing/estimating the ratio or the difference of the means of two NB distributions are also considered.
Assuming that the dispersion parameters are equal, an improved version of the likelihood ratio test for the ratio of means of two NB distributions is proposed. Methods of variance estimate recovery (MOVER) are used to find confidence intervals for the ratio or the difference of two means when the dispersion parameters are unknown and arbitrary. The tests and interval estimation methods are illustrated using an example with count data on seizures from two groups of patients.
Keywords
over-dispersion
powers
score test
standardized LRT
type I error
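A minimal MOVER sketch for the difference of the two NB means, assuming Wald intervals from MASS::glm.nb as the per-group inputs (how the paper recovers the per-group variance estimates may differ):

```r
# MOVER sketch for mu1 - mu2 with negative binomial counts; the per-group
# intervals below are Wald intervals from glm.nb fits, an assumption made
# only for illustration.
library(MASS)

mover_nb_diff <- function(y1, y2, level = 0.95) {
  ci_mean <- function(y) {
    fit <- glm.nb(y ~ 1)                                # intercept-only NB fit
    ci  <- exp(confint.default(fit, level = level))     # Wald CI, back-transformed
    c(est = unname(exp(coef(fit)[1])), l = ci[1], u = ci[2])
  }
  a <- ci_mean(y1); b <- ci_mean(y2)
  d <- a["est"] - b["est"]
  lower <- d - sqrt((a["est"] - a["l"])^2 + (b["u"] - b["est"])^2)
  upper <- d + sqrt((a["u"] - a["est"])^2 + (b["est"] - b["l"])^2)
  c(diff = unname(d), lower = unname(lower), upper = unname(upper))
}
```

With the seizure data mentioned in the abstract, y1 and y2 would be the counts from the two patient groups.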
In data analysis, unexpected results often prompt researchers to revisit their procedures to identify potential issues. While some researchers may struggle to identify the root causes, experienced researchers can often quickly diagnose problems by checking a few key assumptions. These checked assumptions, or expectations, are typically informal, difficult to trace, and rarely discussed in publications. In this paper, we introduce the term analysis validation checks to formalize and externalize these informal assumptions. We then introduce a procedure to identify a subset of checks that best predict the occurrence of unexpected outcomes, based on simulations of the original data. The checks are evaluated in terms of accuracy, determined by binary classification metrics, and independence, which measures the shared information among checks. We demonstrate this approach with a toy example using step count data and a generalized linear model example examining the effect of particulate matter air pollution on daily mortality.
Keywords
data analysis
data validation
diagnostics
Co-Author
Roger Peng, University of Texas, Austin
First Author
Sherry Zhang, The University of Texas at Austin
Presenting Author
Sherry Zhang, The University of Texas at Austin
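A schematic sketch of how such checks might be scored, not the authors' implementation: each check is treated as a binary function of a simulated data set, and its agreement with an "unexpected outcome" flag is summarized with simple classification metrics (the independence criterion among checks is omitted here).

```r
# Schematic sketch: score candidate analysis validation checks against an
# unexpected-outcome flag across simulated data sets (assumed formalization).
evaluate_checks <- function(sims, checks, unexpected) {
  # sims:       list of simulated data sets
  # checks:     named list of functions, each returning TRUE if the check fails
  # unexpected: function returning TRUE if the analysis result is unexpected
  y <- vapply(sims, unexpected, logical(1))
  sapply(checks, function(chk) {
    x <- vapply(sims, chk, logical(1))
    c(accuracy    = mean(x == y),
      sensitivity = mean(x[y]),     # check fails when the outcome is unexpected
      specificity = mean(!x[!y]))   # check passes when the outcome is expected
  })
}
```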
The accurate and efficient estimation of Bayes factors is critical for Bayesian model comparison, particularly when evaluating competing hypotheses in complex statistical models. Traditional computational approaches often suffer from inefficiency, instability, and poor scalability, especially when dealing with non-conjugate priors. In this work, we propose MCMC-CE, an advanced method that extends the cross-entropy (CE) technique—originally developed for rare-event probability estimation—to improve the computation of marginal likelihoods in Bayesian hypothesis testing and linear regression models. Our approach integrates the CE method within a Markov chain Monte Carlo (MCMC) framework to optimize proposal distributions and efficiently approximate the marginal likelihood. We apply MCMC-CE to both hypothesis testing via Bayes factors and Bayesian model averaging. Extensive simulation studies and real-world data applications demonstrate that MCMC-CE significantly outperforms existing methods in terms of computational speed, numerical stability, and estimation accuracy. These results suggest that MCMC-CE provides a powerful and scalable solution for Bayesian inference in challenging modeling scenarios.
Keywords
Marginal likelihood
Cross-entropy method
Markov chain Monte Carlo
Bayes factor
Bayesian model averaging
Bayesian linear regression
Co-Author(s)
Devin Lundy, Augusta University
Vy Ong, Wayne State University
Yin Wan, Wayne State University
First Author
Yang Shi, Wayne State University
Presenting Author
Yang Shi, Wayne State University
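A minimal sketch of the general cross-entropy-plus-importance-sampling idea, not the authors' MCMC-CE algorithm: posterior MCMC draws are used to fit a multivariate normal proposal (the cross-entropy-optimal member of the Gaussian family), and the log marginal likelihood is then estimated by importance sampling from that proposal.

```r
# Sketch only; the proposal family, fitting step, and stopping rules of the
# actual MCMC-CE method may differ.
library(mvtnorm)

log_marglik_ce <- function(log_post_unnorm, mcmc_draws, n_is = 5000) {
  # Stage 1 (CE-style step): fit a Gaussian proposal to the posterior draws
  mu    <- colMeans(mcmc_draws)
  Sigma <- cov(mcmc_draws)
  # Stage 2: importance sampling from the fitted proposal
  theta <- rmvnorm(n_is, mu, Sigma)
  lw <- apply(theta, 1, log_post_unnorm) - dmvnorm(theta, mu, Sigma, log = TRUE)
  m <- max(lw)
  m + log(mean(exp(lw - m)))   # log marginal likelihood via log-sum-exp
}
```

Here log_post_unnorm(theta) is the log likelihood plus log prior; a Bayes factor is then the exponentiated difference of two models' log marginal likelihoods.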