WNAR Contributed Session

Shinjini Nandi Chair
Montana State University
 
Monday, Aug 4: 8:30 AM - 10:20 AM
4044 
Contributed Papers 
Music City Center 
Room: CC-207C 

Main Sponsor

WNAR

Presentations

A Projected Normal Distribution for Improved Characterization of the Orbital Poles of the Plutinos

The distribution of the orbital planes of the small bodies in the solar system has been of long-standing interest to astronomers. We propose the projected normal (PN) distribution to characterize the widely dispersed orbital poles of the Plutinos in the Kuiper belt. The PN distribution describes both the mean pole and the correlations between the directional components of the orbital pole unit vectors. These correlations, which are ignored by the von Mises-Fisher model used in recent work, provide a more comprehensive understanding of the astrophysical characteristics governing the orbital distribution of the small bodies in the solar system. The fitted PN mean pole of the Plutino samples describes the data distribution more precisely than traditional methods, including the von Mises-Fisher distribution and the debiased mean method. The correlations between the directional components of the Plutinos' orbital unit vectors, estimated for the first time by our fitted PN model, are found to be significant. 
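A projected normal draw is obtained by normalizing a multivariate normal vector to unit length. The minimal sketch below (with purely illustrative parameters, not the fitted Plutino values) shows how PN-distributed orbital-pole unit vectors with correlated components can be simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_projected_normal(mu, cov, n):
    """Draw unit vectors X/||X|| where X ~ N(mu, cov)."""
    x = rng.multivariate_normal(mu, cov, size=n)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Illustrative parameters: a mean pole tilted slightly from the
# z-axis, with correlated x- and y-components.
mu = np.array([0.2, 0.1, 1.0])
cov = np.array([[0.05, 0.02, 0.00],
                [0.02, 0.05, 0.00],
                [0.00, 0.00, 0.01]])
poles = sample_projected_normal(mu, cov, 5000)

# All draws lie on the unit sphere; the renormalized sample mean
# direction estimates the PN mean pole.
mean_pole = poles.mean(axis=0)
mean_pole /= np.linalg.norm(mean_pole)
```

The off-diagonal entries of `cov` are what the von Mises-Fisher model cannot represent: it forces the dispersion around the mean pole to be isotropic.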

Keywords

Directional variables

Astrostatistics

Numerical integration 

Co-Author

Ranjan Maitra, Iowa State University

First Author

Fan Dai, Michigan Technological University

Presenting Author

Fan Dai, Michigan Technological University

GLM Inference with AI-Generated Synthetic Data Using Misspecified Linear Regression

Privacy concerns in data analysis have led to growing interest in synthetic data, which aim to preserve the statistical properties of the original dataset while ensuring privacy by excluding real records. Recent advances in deep neural networks and generative artificial intelligence have facilitated the generation of synthetic data. However, although prediction with synthetic data has been the focus of recent research, statistical inference with synthetic data remains underdeveloped. In particular, in many settings, including generalized linear models (GLMs), the estimator obtained from synthetic data converges much more slowly than in standard settings. To address these limitations, we propose a method that leverages summary statistics from the original data. Using a misspecified linear regression estimator, we then develop inference that greatly improves the convergence rate and restores the standard root-n behavior for GLMs. 
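The general idea of leveraging summary statistics can be illustrated with a sketch (this is not the proposed GLM procedure): for a linear working regression, the fit depends on the data only through the cross-products X'X and X'y, so the estimator can be recomputed from those summaries without access to any individual record:

```python
import numpy as np

rng = np.random.default_rng(1)

# Original (private) data: never released directly.
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

# Summary statistics a data holder could release instead of records.
# (Illustrative: the privacy properties of releasing cross-products
# depend on the setting.)
XtX = X.T @ X
Xty = X.T @ y

# A linear working regression fitted from the summaries alone
# reproduces the full-data OLS estimate exactly.
beta_from_summaries = np.linalg.solve(XtX, Xty)
beta_full_data = np.linalg.lstsq(X, y, rcond=None)[0]
```

The abstract's contribution is to combine such original-data summaries with synthetic data to restore root-n inference for GLMs, where the working linear model is deliberately misspecified.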

Keywords

privacy

generalized linear models

synthetic data

summary statistics

misspecified model

inference 

Co-Author

Ali Shojaie, University of Washington

First Author

Nir Keret

Presenting Author

Nir Keret

Rethinking OLS: Direct Estimation of Average First-Order Trends in the Conditional Mean Function

A key to valid and reproducible inference is a priori model specification. In such a framework, practitioners often specify simple regression models in which most covariate effects are modeled linearly, driven by the desire for inference on estimands with simple interpretations. However, these simplified models are nearly guaranteed to be misspecified relative to the true model, in which case the standard interpretations of their parameters no longer hold. We therefore argue that this approach of starting with a model and defining target estimands from it is not ideal. Instead, we advocate starting with a model-robust estimand whose existence rests on minimal assumptions about the underlying data-generating mechanism. As an alternative to OLS with linear covariate effects, we propose estimating the average slopes of the conditional mean function as simple, interpretable first-order trends for summarizing continuous covariate effects. We propose a cubic B-spline-based estimator and give analytical and empirical results showing its effectiveness. We then apply our method to data from a recruitment registry for Alzheimer's disease clinical research and compare the results to an OLS-based analysis. 
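The estimand can be illustrated with a small simulation. As a dependency-free stand-in for the cubic B-spline estimator described in the abstract, the sketch below fits a cubic polynomial, averages its derivative over the observed covariate values, and contrasts the result with the OLS slope, which targets a different (best-linear-approximation) quantity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with nonlinear conditional mean m(x) = sin(x).
n = 2000
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Stand-in smoother: a cubic polynomial (the abstract's estimator
# uses a cubic B-spline basis instead).
coefs = np.polyfit(x, y, deg=3)
deriv = np.polyder(coefs)

# Model-robust estimand: the average slope of the conditional mean,
# estimated by averaging the fitted derivative over the observed x.
avg_slope = np.polyval(deriv, x).mean()

# Population target E[cos(X)] for X ~ Uniform(-2, 2).
true_avg_slope = np.sin(2.0) / 2.0

# For contrast: OLS with a linear covariate effect.
ols_slope = np.polyfit(x, y, deg=1)[0]
```

Here the OLS slope and the average slope differ noticeably, illustrating why the two estimands should not be conflated under misspecification.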

Keywords

Model-robust regression

Non-parametric regression

Robust statistical methods

Model misspecification 

Co-Author

Daniel Gillen, University of California-Irvine

First Author

Adam Birnbaum

Presenting Author

Adam Birnbaum

A Novel Statistical Approach for Replicating Multi-Omics Networks Across Study Groups and Cohorts

Multi-omics data provide researchers with a more comprehensive understanding of the mechanisms underlying complex diseases. Network-based approaches are effective at integrating multiple omics data while simultaneously capturing interactions between different molecules. Understanding how multi-omics networks replicate across experimental conditions, demographic groups, and study cohorts can uncover conserved and differential biological changes associated with disease outcomes. While replication analyses are well established for single biomarkers, there is a lack of methods specifically addressing the replication of intermolecular interactions in biological networks. To bridge this gap, we propose a novel approach that facilitates network replication across study groups and cohorts while leveraging machine learning to identify the consistent molecular signatures most relevant to outcomes of interest. To demonstrate the utility of the proposed method, we will use multi-omics data from two chronic obstructive pulmonary disease (COPD) cohorts: the Genetic Epidemiology of COPD (COPDGene) study and the Study of COPD Subgroups and Biomarkers (SPIROMICS). 
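As a toy illustration of what a "replicated" interaction means (not the proposed method), one can build a simple correlation network in each cohort and count the edges that appear in both; the cohort simulator and threshold below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

def edges(data, threshold=0.3):
    """Adjacency of a naive correlation network: |r| > threshold."""
    r = np.corrcoef(data, rowvar=False)
    adj = np.abs(r) > threshold
    np.fill_diagonal(adj, False)
    return adj

def simulate_cohort(n):
    """Four molecular features; features 0 and 1 truly interact."""
    z = rng.normal(size=(n, 4))
    z[:, 1] = 0.8 * z[:, 0] + 0.6 * rng.normal(size=n)
    return z

adj_a = edges(simulate_cohort(300))   # cohort A network
adj_b = edges(simulate_cohort(300))   # cohort B network

# Replicated edges: present in both cohorts' networks.
replicated = adj_a & adj_b
n_replicated = int(np.triu(replicated).sum())
```

A real replication method must go beyond this edge overlap, accounting for estimation uncertainty in each network and for differences in cohort composition.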

Keywords

multi-omics data

network analysis

network replication

COPD 

First Author

Thao Vu, University of Colorado, Denver

Presenting Author

Thao Vu, University of Colorado, Denver

Sample Size in Cancer Prognostic Studies and Clinical Staging of the Disease

In classical approaches to cancer staging studies, sample size is computed from differentials in the risk of death between stages of the disease. Risk can be expressed as risk differences, risk ratios, hazard ratios, or similar measures, and sample size is derived accordingly (e.g., based on the hazard ratio between adjacent stages in terms of survival). This approach has several drawbacks: (i) in its simplest formulations, it assumes independence among stages; (ii) it most often requires proportional hazards between stages, with difficulties in handling crossing survival curves; and (iii) it does not incorporate information on side variables such as biomarkers, omics, and other relevant or even latent factors. The latter aspects are handled by alternative approaches (e.g., Riley 2019, 2021) that concentrate on overall model precision. However, there is no a priori guarantee that all models converge toward a unique indication of the sample size needed. Our study compares the different approaches using Monte Carlo simulations based on lung cancer staging data derived from the published literature. In particular, we evaluate the loss of power and the limitations in detecting a reasonable number of relevant covariates. A concrete example of a lung cancer staging system using a composite approach is presented.  
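The simplest hazard-ratio-based formulation is Schoenfeld's formula for the number of events required in a two-group comparison under proportional hazards; the sketch below uses an illustrative hazard ratio, not a value from the staging data:

```python
from math import log, ceil
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.05, power=0.80, p1=0.5):
    """Events required for a two-group log-rank/Cox comparison
    under proportional hazards (Schoenfeld's formula).
    p1 is the allocation proportion of the first group."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # two-sided alpha quantile
    zb = z.inv_cdf(power)           # power quantile
    p2 = 1 - p1
    return ceil((za + zb) ** 2 / (p1 * p2 * log(hr) ** 2))

# Illustrative HR between two adjacent stages.
events = schoenfeld_events(hr=1.5)   # -> 191 events
```

Sample size follows by dividing the required events by the expected event probability over follow-up; note the formula embodies exactly the drawbacks listed above (pairwise comparison, proportional hazards, no side variables).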

Keywords

Sample Size

Time-to-event outcome

External validation

Monte Carlo simulation 

Co-Author(s)

Gloria Brigiari, Unit of Biostatistics, Epidemiology and Public Health Department of Cardiac, Thoracic, Vascular Sciences, and Public Health University of Padova
Dario Gregori, University of Padova

First Author

Ester Rosa, University of Padua

Presenting Author

Gloria Brigiari, Unit of Biostatistics, Epidemiology and Public Health Department of Cardiac, Thoracic, Vascular Sciences, and Public Health University of Padova

Evaluating the Impact of Various Analytics Metrics on YouTube Viewership Using Statistical Methods

Millions of videos are uploaded to YouTube daily, but only a fraction gain widespread viewership. Understanding key analytics is therefore essential for creators optimizing reach and engagement. This study analyzed YouTube data to identify factors influencing video performance using correlation analysis and multiple linear regression. We examined the effects of watch time, average percentage viewed, average view duration, impressions, and click-through rate on video views. Results showed that impressions, click-through rate, and audience retention had little impact in the first 24 hours but gained significance over time, suggesting a shift from engagement-driven to algorithm-driven exposure.

To account for variations among videos, we applied a linear mixed-effects model with a random intercept per video and a random slope for days after posting. This approach captured individual growth patterns, explaining 98% of variance in views. Our findings underscore the role of fixed and random effects in video performance trends, providing actionable insights for creators optimizing long-term reach. 
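The random-intercept, random-slope structure can be sketched with a small simulation (illustrative parameter values, not estimates from the YouTube data). Fitting a separate OLS line per video exposes the video-specific growth patterns whose spread a mixed-effects model summarizes in one step as random-effect variances:

```python
import numpy as np

rng = np.random.default_rng(4)

# Model structure from the abstract: views grow with days since
# posting, with a video-specific intercept and slope.
n_videos, n_days = 50, 30
beta0, beta1 = 2.0, 0.5                      # fixed effects
b0 = rng.normal(scale=1.0, size=n_videos)    # random intercepts
b1 = rng.normal(scale=0.2, size=n_videos)    # random slopes
days = np.arange(n_days)

log_views = ((beta0 + b0)[:, None]
             + (beta1 + b1)[:, None] * days
             + rng.normal(scale=0.3, size=(n_videos, n_days)))

# Per-video OLS fits recover each video's growth pattern; their
# mean estimates the fixed slope and their spread reflects the
# random-slope variance a mixed model would estimate directly.
slopes = np.array([np.polyfit(days, log_views[i], 1)[0]
                   for i in range(n_videos)])
mean_slope = slopes.mean()
slope_sd = slopes.std(ddof=1)
```

A mixed-effects fit (e.g., via `statsmodels`' `MixedLM`) additionally pools information across videos and separates within-video noise from between-video slope variation.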

Keywords

Click-through rate (CTR)

YouTube

Audience retention

Watch time

Average percentage viewed

Linear mixed effects model 

Co-Author

Nahid Hasan, East Texas A&M University

First Author

Numan Ahmad

Presenting Author

Numan Ahmad