SPEED 2: Statistical Design of Experiments and Sample Size Considerations, Part 1

Aubrey Odom, Chair
Boston University
 
Monday, Aug 4: 8:30 AM - 10:20 AM
4041 
Contributed Speed 
Music City Center 
Room: CC-104A 

Presentations

Active multiple testing with proxy p-values and e-values

Researchers often lack the resources to test every hypothesis of interest directly or to compute test statistics comprehensively, but they often possess auxiliary data from which an estimate of the experimental outcome can be computed. We introduce a novel approach for deciding, in a hypothesis testing setup, which hypotheses' true statistics to query, leveraging these estimates to compute proxy statistics. Our framework allows a scientist to propose a proxy statistic and then query the true statistic with some probability based on the value of the proxy. We make no assumptions about how the proxy is derived, and it can be arbitrarily dependent on the true statistic. If the true statistic is not queried, the proxy is used
in its place. We characterize "active" methods that produce valid p-values and e-values in this
setting and utilize this framework to create procedures with false
discovery rate (FDR) control. Through simulations and real data analysis of causal effects in
scCRISPR screen experiments, we empirically demonstrate that our proxy framework has both
high power and low resource usage when our proxies are accurate estimates of the respective true statistics. 
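As a point of reference for the FDR guarantee mentioned above, the sketch below shows only the classical Benjamini-Hochberg step-up procedure on a vector of p-values; it is not the authors' active proxy-statistic method, which additionally decides which true statistics to query. The toy data are invented.

```python
# Classical Benjamini-Hochberg (BH) step-up procedure: the standard FDR
# baseline, NOT the authors' active proxy method.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Return a boolean rejection mask controlling FDR at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])      # largest rank meeting its threshold
        reject[order[: k + 1]] = True          # reject all hypotheses up to rank k
    return reject

# Toy usage: 90 nulls, 10 strong signals
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
print(benjamini_hochberg(p, alpha=0.1).sum(), "rejections")
```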

Keywords

multiple testing

e-values

false discovery rate (FDR)

active sampling 

Co-Author(s)

Catherine Wang, Carnegie Mellon University
Kathryn Roeder, Carnegie Mellon University
Larry Wasserman, Carnegie Mellon University
Aaditya Ramdas, Carnegie Mellon University

First Author

Ziyu Xu

Presenting Author

Ziyu Xu

Adapting Cochran’s sample-size rule to an estimate from a complex sample

Given the mean of a simple random sample without replacement, Cochran claimed that calculating a two-sided 95% confidence interval for the population mean using a standard normal-distribution table is often reasonable when the sample size is 25 times the square of the skewness coefficient of the population. We adapt a variant of this crude rule to a two-sided confidence interval for a parameter estimated from a complex probability sample. We conjecture that the standard two-sided 95% confidence interval for an estimated parameter is usually reasonable when the absolute value of the skewness coefficient of the nearly unbiased parameter estimate is less than 0.2. Conversely, it is not reasonable to use the standard two-sided 95% confidence interval for an estimated parameter when the absolute value of the skewness coefficient of its estimate is greater than 0.2. This warning is particularly germane for estimated proportions and differences between proportions. When applying our conjecture, an estimate's skewness coefficient will rarely be known. Instead, it will usually need to be estimated, often in an ad hoc manner. 
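A back-of-the-envelope sketch of the crude rule quoted above (minimum n of 25 times the squared skewness coefficient) appears below; the simulated "population" and its skewness are purely illustrative, not from this work.

```python
# Illustration of Cochran's rule of thumb for a simple random sample:
# minimum n ~ 25 * (population skewness)^2.  The data are illustrative only.
import numpy as np
from scipy import stats

def cochran_min_n(skewness):
    """Minimum n for the normal-theory 95% CI per Cochran's crude rule."""
    return int(np.ceil(25 * skewness**2))

y = stats.expon.rvs(size=5000, random_state=0)   # a deliberately skewed "population"
g1 = stats.skew(y)                               # skewness coefficient
print(f"skewness = {g1:.2f}, suggested minimum n = {cochran_min_n(g1)}")
```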

Keywords

Effective sample size

Third central moment

Confidence interval

Skewness coefficient

Clopper-Pearson confidence interval 

First Author

Phillip Kott, Semi-retired

Presenting Author

Phillip Kott, Semi-retired

Adaptive Multi-fidelity Optimization via Online EM with Applications to Digital Design Selection

This work addresses the challenge of optimal resource allocation in digital experimentation, where computational budgets must be efficiently distributed across competing model configurations to identify the best-performing design.
Building upon the multi-fidelity framework of Peng et al. (2019), which integrates low- and high-fidelity observations in ranking and selection procedures, we propose significant methodological advances through online stochastic approximation techniques.
Our key innovation is an online variant of the Expectation-Maximization (EM) algorithm that incorporates stochastic approximation principles, enabling efficient parameter estimation for latent variable models in streaming settings.
Unlike previous batch EM methods, our approach processes observations sequentially while maintaining theoretical convergence guarantees, established through rigorous martingale-based analysis.
We prove the algorithm achieves asymptotic efficiency equivalent to the maximum likelihood estimator while requiring substantially less computational overhead. 
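To make the idea of stochastic-approximation EM updates concrete, here is a generic online-EM sketch in the style of Cappé and Moulines for a two-component Gaussian mixture, processing one observation at a time. It is not the authors' multi-fidelity algorithm; the mixture, step-size exponent, and initial values are assumptions for illustration.

```python
# Generic online EM for a 2-component Gaussian mixture: the E-step sufficient
# statistics are updated with a Robbins-Monro step, then mapped back to
# parameters.  Illustrative only; NOT the authors' multi-fidelity procedure.
import numpy as np

rng = np.random.default_rng(0)
# Streaming data (truth: weights 0.4/0.6, means -2/3, unit variances)
z = rng.random(20000) < 0.4
y_stream = np.where(z, rng.normal(-2.0, 1.0, 20000), rng.normal(3.0, 1.0, 20000))

# Initial parameters and running sufficient statistics s = (E[r], E[r*y], E[r*y^2])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
s0, s1, s2 = w.copy(), w * mu, w * (var + mu**2)

for t, y in enumerate(y_stream, start=1):
    # E-step for the new observation: responsibilities r_k(y) under current params
    dens = w * np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum()
    # Stochastic-approximation update of the sufficient statistics
    gamma = t ** -0.6
    s0 += gamma * (r - s0)
    s1 += gamma * (r * y - s1)
    s2 += gamma * (r * y**2 - s2)
    # M-step: map running statistics back to parameters
    w = s0 / s0.sum()
    mu = s1 / s0
    var = np.maximum(s2 / s0 - mu**2, 1e-6)

print("weights:", w.round(3), "means:", mu.round(3), "vars:", var.round(3))
```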

Keywords

Digital experimentation

ranking and selection

online learning 

Co-Author

Jiayang Sun, George Mason University

First Author

Wei Dai, George Mason University

Presenting Author

Wei Dai, George Mason University

Calculating sample size for methylation sequencing studies

Background: DNA methylation regulates the expression of genes and can therefore be utilized for several applications, including detection of differentially methylated sites or regions. Little guidance is available for determining the sample size needed to adequately power such a study.
Methods: To calculate sample size and power, an over-dispersed binomial model was utilized. We performed an empirical review of sequencing studies conducted at our institution between 2011 and 2018 to estimate the overdispersion parameter and median read depth across four disease types, plus normal tissue, for a total of 352 samples.
Results: The median overdispersion parameter was 2.4 [IQR, 1.8-3.6], and the median read depth was 31 [IQR, 26-34]. Assuming no overdispersion, 12 samples per group are required to detect a difference between 2% in controls and 6% in cases; with an overdispersion of 2.4, 29 samples per group are required to achieve 80% power.
Conclusion: The overdispersion parameter differed between tissues and platforms. These empirical results can help provide guidance in calculating sample size and power. We recommend that methylation studies account for inflated variances. 
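The rough calculation below illustrates how the reported numbers arise under simplifying assumptions: a normal-approximation two-proportion formula with reads treated as Bernoulli trials, the variance inflated by the overdispersion parameter, and total reads converted to samples at the median depth. The authors' exact overdispersed-binomial calculation may differ in detail.

```python
# Back-of-the-envelope sample size for comparing methylation proportions,
# assuming a normal-approximation two-proportion formula, variance inflation by
# the overdispersion parameter phi, and reads-to-samples conversion by depth.
from scipy.stats import norm

def samples_per_group(p1, p2, depth, phi=1.0, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    reads = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) * phi / (p1 - p2) ** 2
    return reads / depth        # total reads needed, expressed as samples at this depth

print(samples_per_group(0.02, 0.06, depth=31))            # roughly 12 per group
print(samples_per_group(0.02, 0.06, depth=31, phi=2.4))   # roughly 29 per group
```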

Keywords

Methylation Sequencing

Sample Size Calculation

Binomial Overdispersion 

Co-Author(s)

Jeanette Eckel-Passow, Mayo Clinic
William Taylor, Mayo Clinic
John Kisiel, Mayo Clinic
Douglas Mahoney, Mayo Clinic

First Author

Seth Slettedahl

Presenting Author

Seth Slettedahl

Causal Effects of School Closures on Mental Health During the COVID-19 Pandemic in Ontario, Canada

In Canada, public health strategies aimed at reducing SARS-CoV-2 transmission during the COVID-19 pandemic included school closures, resulting in widespread education disruption. It is now widely understood that the pandemic precipitated an unprecedented mental health crisis among children and youth in Canada. However, there is little to no longitudinal evidence on the causal impact that school closures may have had on this crisis. As the world navigates the post-pandemic recovery, governments and researchers have already begun critically evaluating the various pandemic response policies that were enacted, an exercise that is essential to better prepare for and more effectively respond to future pandemics. Using linked electronic health record (EHR) data across various healthcare sectors, obtained from IntelliHealth Ontario (ON), we identify all healthcare resource utilization records among children and youth in ON for a comprehensive array of adverse mental health-related conditions and outcomes. Our analytic strategy exploits the staggered returns to the classroom in ON following the 2020/21 Winter Break to estimate the causal effects of interest. This is ongoing work, and we will summarize our findings thus far. 

Keywords

COVID-19

Mental Health

Health Policy

School Closures

Children and Youth

Public Health 

Co-Author(s)

Kuan Liu, University of Toronto
Geoffrey Anderson, University of Toronto

First Author

Jay Xu

Presenting Author

Jay Xu

Controlling Type I Error Rates for Secondary Endpoints through an Alpha Spending Method

At the interim analysis (IA) of a study, if superiority of the investigational product (IP) over the control on the primary efficacy endpoint is established, superiority of the IP over the control on secondary efficacy endpoints will be tested sequentially at the significance level corresponding to the O'Brien-Fleming boundary using an information fraction (IF) of 60%. However, the true IF value for a specific secondary efficacy endpoint depends on the final number of events for that endpoint collected at the end of the study, which is unknown at the time of the IA. We propose adjusting the alpha level at the final analysis to control the overall type I error rate in testing the secondary efficacy endpoints. A simulation study demonstrates that the proposed alpha spending method is expected to effectively control the type I error rates for secondary efficacy endpoints. It also shows that the differences between true IF values and the benchmark value of 60% for secondary efficacy endpoints are limited regardless of treatment effect sizes, which can be attributed to the similarity of the risk-over-time patterns. 
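For context on the boundary referenced above, the sketch below evaluates only the standard Lan-DeMets O'Brien-Fleming-type spending function, showing the cumulative two-sided alpha spent at a 60% information fraction; the authors' proposed adjustment of the final-analysis alpha is not reproduced here.

```python
# Cumulative two-sided alpha spent by the Lan-DeMets O'Brien-Fleming-type
# spending function: alpha*(t) = 2 - 2*Phi(z_{1-alpha/2} / sqrt(t)).
from scipy.stats import norm

def obf_spent(t, alpha=0.05):
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / t**0.5))

print(f"alpha spent at IF = 0.60: {obf_spent(0.60):.4f}")   # about 0.011
print(f"alpha spent at IF = 1.00: {obf_spent(1.00):.4f}")   # equals 0.05
```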

Keywords

Overall Type I Error Control

Alpha Spending Method

Interim Analysis

Information Fraction 

Co-Author

Wentao Lu, Southern Methodist University

First Author

Biju Wang, Johnson & Johnson

Presenting Author

Biju Wang, Johnson & Johnson

Deep Learning Survival Analysis for Competing Risk with Functional Covariates and Missing Imputation

Discharging patients from the intensive care unit (ICU) marks an important moment in their recovery: a transition from acute care to a lower level of dependency. However, even after leaving the ICU, many patients still face serious risks of adverse outcomes, such as ICU readmission due to complications or subsequent in-hospital death. In this study, we develop a unified deep-learning framework for competing risk modeling to improve prediction performance for ICU patient outcomes. The proposed approach integrates discrete-time survival models with specially designed neural network architectures that can handle complex data structures consisting of functional covariates and missing data. By incorporating gradient-based imputation techniques with discrete-time modeling, this framework allows for precise interval-based risk estimation, explicitly addressing the complexities arising from competing events, such as disease progression and mortality, which may censor each other. In addition, we validate the effectiveness of our method using two large ICU datasets and several simulated datasets, demonstrating improved prediction accuracy and generalizability over traditional models. The proposed framework effectively captures the dynamic interactions between various risk factors and their impact on patient outcomes. 
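To illustrate the basic discrete-time competing-risks ingredient, here is a minimal PyTorch sketch in which a small network outputs a joint probability mass function over (time interval, cause) via a softmax, with a likelihood term for events and for censoring. This is a generic DeepHit-style construction under assumed inputs, not the authors' architecture, and it omits their functional-covariate and gradient-based imputation components.

```python
# Minimal discrete-time competing-risks likelihood: softmax over (interval, cause).
import torch
import torch.nn as nn

class DiscreteCompetingRisksNet(nn.Module):
    def __init__(self, n_features, n_intervals, n_causes, hidden=32):
        super().__init__()
        self.T, self.K = n_intervals, n_causes
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_intervals * n_causes),
        )

    def forward(self, x):
        # pmf[i, t, k] = P(event in interval t of cause k | covariates of subject i)
        return torch.softmax(self.net(x), dim=1).view(-1, self.T, self.K)

def neg_log_likelihood(pmf, interval, cause, observed, eps=1e-8):
    """interval, cause: long tensors (0-based); observed: float tensor,
    1 = event of `cause` in `interval`, 0 = censored at end of `interval`
    (pass cause = 0 for censored rows; it is ignored)."""
    idx = torch.arange(pmf.shape[0])
    event_prob = pmf[idx, interval, cause]                 # P(T = t, cause = k)
    cum = pmf.sum(dim=2).cumsum(dim=1)                     # P(T <= t, any cause)
    surv_prob = (1.0 - cum[idx, interval]).clamp_min(eps)  # survived past interval t
    ll = observed * torch.log(event_prob + eps) + (1 - observed) * torch.log(surv_prob)
    return -ll.mean()
```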

Keywords

Competing Risks

Discrete-time Survival

Deep Learning

functional data

Missing Imputation

Critical care 

Co-Author(s)

Penglei Gao, The Cleveland Clinic Foundation
Abhijit Duggal, The Cleveland Clinic Foundation
Shuaiqi Huang, The Cleveland Clinic Foundation
Faming Liang, Purdue University
Xiaofeng Wang, The Cleveland Clinic Foundation

First Author

Yan Zou, The Cleveland Clinic Foundation

Presenting Author

Yan Zou, The Cleveland Clinic Foundation

Enhancing Clinical and Real World Evidence Outcomes: A Macro for Advanced Multi-Block Randomization

Randomization integrity is essential in clinical trials to ensure equitable treatment allocation and valid study outcomes. As trials become more complex and diverse in participant makeup, traditional randomization techniques often fall short. This paper introduces a sophisticated SAS macro that addresses these challenges by facilitating advanced multi-block randomization, essential for managing varied trial designs and patient groups. The macro simplifies the creation and management of multiple block sizes, ensuring robust and flexible study designs.
This abstract details the macro's implementation within a clinical trial setting, illustrating its efficiency and effectiveness in complex randomization scenarios. Additionally, the discussion extends to the macro's applicability in real world evidence (RWE) settings, underscoring its potential to adapt to varying data landscapes and contribute to more generalizable research findings. By enhancing randomization techniques, this macro serves as a critical tool for researchers and statisticians in both controlled and observational study environments, promoting improved rigor and relevance in biostatistical research. 
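As a language-agnostic companion to the SAS macro described above, the sketch below shows generic permuted-block randomization with a mix of block sizes in Python; it is not the authors' macro, and the treatments and block sizes are placeholders.

```python
# Generic permuted-block randomization with randomly mixed block sizes;
# allocation stays balanced within each block.  Illustrative only.
import random

def multi_block_schedule(n_subjects, treatments=("A", "B"), block_sizes=(4, 6), seed=42):
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        size = rng.choice(block_sizes)                    # vary block size to reduce predictability
        block = list(treatments) * (size // len(treatments))
        rng.shuffle(block)                                # permute within the block
        schedule.extend(block)
    return schedule[:n_subjects]

print(multi_block_schedule(12))
```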

Keywords

Randomization Techniques

Clinical Trial Design

SAS Macro

Real World Evidence

Multi-Block Randomization 

First Author

Fengzheng Zhu

Presenting Author

Fengzheng Zhu

Innovations in Accelerating Dose Finding Designs with Backfilling

Dose finding studies are used to determine the optimal dose for a treatment in early clinical development. In these studies, ad hoc backfilling occurs to increase the sample size for investigating key doses of interest. Recently, formal frameworks for backfilling have been developed to improve upon ad hoc backfilling for dose finding designs for complete cohorts. We have worked to improve the flexibility of these designs to make dose assignment decisions with partial information to handle late-onset toxicity or fast accrual. We compare our new innovation with previous methods to demonstrate the benefits of formal backfilling without large increases in study duration, as well as demonstrate how such designs can be integrated into a seamless framework. 

Keywords

Dose Finding

Backfilling

Clinical Trials

MTD

Dose Escalation

Seamless Design 

First Author

Frank Shen, BMS

Presenting Author

Frank Shen, BMS

Multiple imputation analysis of Miettinen-Nurminen for binary endpoints

Cochran-Mantel-Haenszel (CMH) and Miettinen-Nurminen (MN) methods are widely used for comparing proportions among treatment groups in clinical trial data analysis. When data are missing, multiple imputation (MI) can be used to handle the missingness. The CMH method with Rubin's rules is commonly used for estimating the confidence interval (CI) and for hypothesis testing when MI is applied. Previous publications have proposed an approach to estimate CIs using the MN method with MI. However, an algorithm for simultaneously estimating the CI and performing hypothesis testing with the MN method under MI is not available. In addition, the difference in performance between the proposed MN method and the CMH method with Rubin's rules remains unexplored. In this study, we extend the algorithm for the MN method with MI and demonstrate, through comprehensive simulations, that the proposed MN method performs comparably to the CMH method. 
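For reference, the combining step mentioned above is sketched below as the standard Rubin's-rules pooling of per-imputation estimates and variances; it is only the generic step (with a normal-based interval and no degrees-of-freedom correction), not the authors' MN-specific CI algorithm, and the inputs are invented.

```python
# Standard Rubin's-rules pooling across m imputed datasets (generic step only).
import numpy as np
from scipy.stats import norm

def rubin_pool(estimates, variances, alpha=0.05):
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    m = len(estimates)
    qbar = estimates.mean()                     # pooled point estimate
    w = variances.mean()                        # within-imputation variance
    b = estimates.var(ddof=1)                   # between-imputation variance
    total = w + (1 + 1 / m) * b                 # Rubin's total variance
    half = norm.ppf(1 - alpha / 2) * np.sqrt(total)   # df correction omitted for brevity
    return qbar, (qbar - half, qbar + half)

print(rubin_pool([0.12, 0.10, 0.15, 0.11, 0.13], [0.004, 0.005, 0.004, 0.006, 0.005]))
```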

Keywords

Missing data

Multiple imputation

Binary endpoints

Miettinen-Nurminen Method

Cochran-Mantel-Haenszel Method 

Co-Author(s)

Qing Li, Merck & Co., Inc.
Jia Hua, Merck
Bin Dong, Merck & Co.

First Author

Jinhao Zou, Merck

Presenting Author

Jinhao Zou, Merck

Optimal Number of Replicates to Ensure Reproducibility in Pre-Clinical Studies

Reproducibility in pre-clinical research has gained attention, especially concerning the application of statistical methods. This awareness underscores the need for increased rigor and reproducibility, particularly in experimental replications. Factors such as the experimenter, animal batch, and environmental conditions can influence experimental outcomes, causing findings from a single experiment to be potentially non-reproducible under different conditions. Typically, two independent experiments are conducted with a possible third if initial results conflict. Despite frequent replication, a formal framework for assessing and optimizing replication numbers is lacking. This project aims to quantify reproducibility and establish necessary replication numbers. We simulated a 2x3 factorial design experiment (2 groups, 3 conditions) and replicated it 2 to 8 times. Data were analyzed using a linear mixed-effects model (LMEM), allowing group effects to vary across experiments. The LMEM quantified between-replication variation, serving as a measure of reproducibility. Our simulations showed that reproducibility reaches nearly 100% at 4 replications regardless of group effect size. 
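A minimal sketch of the simulation idea follows: one 2x3 factorial experiment replicated several times, analyzed with a linear mixed-effects model in which the group effect varies across replications. All parameter values are invented for illustration and are not the authors' simulation settings.

```python
# Simulate a 2x3 factorial experiment replicated R times and fit an LMEM with a
# random intercept and a random group slope across replications; the slope
# variance quantifies between-replication variation.  Illustrative settings only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
R, n_per_cell = 4, 6                                   # replications, animals per cell
rows = []
for rep in range(R):
    grp_eff = 1.0 + rng.normal(0, 0.3)                 # group effect varies by replication
    for grp in (0, 1):
        for cond in (0, 1, 2):
            for y in 10 + grp * grp_eff + 0.5 * cond + rng.normal(0, 1, n_per_cell):
                rows.append(dict(rep=rep, group=grp, cond=cond, y=y))
df = pd.DataFrame(rows)

fit = smf.mixedlm("y ~ group * C(cond)", df, groups=df["rep"], re_formula="~group").fit()
print(fit.summary())
```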

Keywords

Quantify reproducibility in pre-clinical research

optimal number of replications

incorporate all replications in data analysis

covariance parameter estimates

Linear Mixed Effects Model

Random Effects 

Co-Author(s)

Meredith Akerman, Northwell Health
Sujith Rajan, NYU Grossman Long Island School of Medicine
Chandana Prakashmurthy, NYU Grossman Long Island School of Medicine
M. Mahmood Hussain, NYU Grossman Long Island School of Medicine
Cristina Sison, Biostatistics Unit, Office of Academic Affairs At Northwell Health

First Author

Shahidul Islam, Biostatistics Unit, Northwell Health, New Hyde Park, NY

Presenting Author

Shahidul Islam, Biostatistics Unit, Northwell Health, New Hyde Park, NY

Phase II Decision-Making Framework: Case Studies and Proof of Concept

Phase II clinical trials are crucial for advancing therapeutics by identifying signals of futility, safety, and efficacy under limited data conditions. Traditional designs struggle with finite sample sizes and complex decisions. To address this, we proposed a flexible, multi-metric Bayesian framework for de-risking interim decision-making. It integrates point estimates, uncertainty, and evidence toward desired thresholds (e.g., a Target Product Profile [TPP]), ensuring transparency and interpretability. While prior evaluations used parametric multilevel model simulations, real-world applicability remained untested.

In this study, we assess the framework using real trial data from REMoxTB. By resampling data from 1,931 observed participants to emulate Phase II conditions, we evaluate Type I error, power, and sample size needs. Results show its potential to streamline decision-making, reduce sample sizes, and identify non-inferior regimens earlier. This work underscores Bayesian methodologies' value in optimizing decision-making for tuberculosis and beyond. 
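To make the notion of "evidence toward a desired threshold" concrete, the sketch below computes the posterior probability that a response rate exceeds a TPP value under a simple Beta-Binomial model. The numbers are invented, and this is not the authors' multi-metric framework or the REMoxTB analysis.

```python
# Beta-Binomial posterior probability of exceeding a hypothetical TPP threshold.
from scipy.stats import beta

successes, n = 34, 50                  # hypothetical Phase II arm
a0, b0 = 1, 1                          # uniform prior
posterior = beta(a0 + successes, b0 + n - successes)

tpp = 0.60                             # hypothetical TPP response-rate target
print(f"posterior mean       = {posterior.mean():.3f}")
print(f"Pr(rate > TPP = 0.6) = {posterior.sf(tpp):.3f}")
```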

Keywords

Bayesian statistics

Tuberculosis

Clinical trials

Decision-making

Biostatistics

Phase II trials 

First Author

Taimoor Qureshi, UCSF

Presenting Author

Taimoor Qureshi, UCSF

Power Analysis of Simulated Organ Weight Data

In research involving laboratory animals, adhering to ethical guidelines that rationalize the required number of animals is essential for ensuring responsible research practices. Using reference rat organ weight data from the National Toxicology Program (NTP), we conducted a simulation and power analysis to determine the statistical power of Jonckheere's trend test with Williams and Dunnett multiple comparison procedures, the NTP standard approach for evaluating these data. We also determined the appropriate sample size to achieve desired power (≥80%) with an α=0.05 significance threshold. The simulation evaluated select organ weight endpoints across varying effect sizes of biological interest, across four simulated dose groups, under assumptions of normality and heteroscedasticity between groups. Results for both power and sample size varied across endpoints due to differences in means and standard deviations observed in the pilot data. These findings prompt broader discussions regarding 1) the varying power requirements across endpoints within a single study, and 2) marrying sample sizes derived from power analyses with considerations of time, cost, and ethical constraints. 
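The sketch below shows simulation-based power for Dunnett many-to-one comparisons across four dose groups, one component of the testing strategy described above; the Jonckheere/Williams steps of the NTP approach are not shown, scipy.stats.dunnett requires SciPy 1.11 or later, and the effect sizes, SD, and group size are invented.

```python
# Simulation-based power for Dunnett comparisons of three dose groups vs control.
import numpy as np
from scipy.stats import dunnett

def dunnett_power(n, means, sd, n_sim=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        groups = [rng.normal(m, sd, n) for m in means]
        res = dunnett(*groups[1:], control=groups[0])
        hits += np.any(res.pvalue < alpha)     # any dose declared different from control
    return hits / n_sim

# Control mean 1.00; top dose shifted by ~10% with SD 0.12 (all illustrative)
print(dunnett_power(n=10, means=[1.00, 1.02, 1.05, 1.10], sd=0.12))
```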

Keywords

Statistical power

Simulations

Effect size

Sample size 

Co-Author(s)

Kathryn Konrad, DLH
Gary Larson
Katherine Allen, DLH
Helen Cunny, Division of Translational Toxicology, NIEHS
Keith Shockley, National Institutes of Health

First Author

Angela Jeffers

Presenting Author

Angela Jeffers

Reconsidering False Positive Rates for a Portfolio of Carcinogenicity Studies

Hundreds of chemicals have been evaluated for their potential carcinogenicity via two-year rodent studies. Each study includes two rodent species, two sexes, three or more dose groups, and over 40 tumor types. The data are binary, with tumors being either present or not. With over 480 dose-related trend and pairwise tests per study, there are concerns about the overall false positive rate (FPR). While statistical significance is not the only consideration for declaring the carcinogenicity of a chemical, it is an important contributing factor. Previous work has examined FPRs, but the data and methods have changed over time: tumor background rates have shifted, and more sophisticated models are sometimes needed. Newer methods include adjustments for differential survivability among the dose groups. Here, we use simulations to assess a study's FPR using the Poly-3 test. These simulations use current historical controls data and assist in estimating the FPR of a study. This work seeks to emphasize the real-world impact of statistical modeling and enhance confidence in science. This research was supported in part by the Intramural Research Program of the NIH including 75N96022F00055. 
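The simplified simulation below conveys the idea of an empirical false positive rate for a dose-related trend test on binary tumor data under the null, using a hand-rolled Cochran-Armitage test; the survival-adjusted Poly-3 test actually studied here is not implemented, and the background rate, group size, and dose scores are illustrative.

```python
# Empirical FPR of a Cochran-Armitage trend test on null binary tumor data.
import numpy as np
from scipy.stats import norm

def cochran_armitage_p(events, n, scores):
    events, n, scores = map(np.asarray, (events, n, scores))
    pbar = events.sum() / n.sum()
    t = np.sum(scores * (events - n * pbar))
    var = pbar * (1 - pbar) * (np.sum(n * scores**2) - np.sum(n * scores) ** 2 / n.sum())
    return 2 * norm.sf(abs(t / np.sqrt(var)))          # two-sided p-value

rng = np.random.default_rng(3)
doses, n_per_group, background = np.array([0.0, 1.0, 2.0, 3.0]), 50, 0.10
n_sim, alpha, fp = 5000, 0.05, 0
for _ in range(n_sim):
    events = rng.binomial(n_per_group, background, size=doses.size)  # no true dose effect
    fp += cochran_armitage_p(events, np.full(doses.size, n_per_group), doses) < alpha
print("empirical FPR per endpoint:", fp / n_sim)
```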

Keywords

False Positive Rate

Simulation

Carcinogenicity

Binary Data 

Co-Author(s)

Laura Betz, DLH
Shawn Harris, DLH
Katherine Allen, DLH
Helen Cunny, Division of Translational Toxicology, NIEHS
Keith Shockley, National Institutes of Health

First Author

Kathryn Konrad, DLH

Presenting Author

Kathryn Konrad, DLH

Sample size estimation for the ratio of count outcomes in a cluster randomized trial

Count outcomes frequently occur in cluster randomized trials, where the ratio of count outcomes is often used to assess the effectiveness of an intervention. However, in practice, cluster sizes typically vary across clusters, and sample size estimation based on a constant cluster size assumption may lead to underpowered studies. To address this issue, we propose using the generalized estimating equation (GEE) method to test the ratio of two count outcomes and introduce closed-form formulas for sample size and power calculation. In particular, the ratio of count outcomes has been frequently used to evaluate vaccine efficacy in cluster randomized trials, where the aim is to demonstrate that the vaccine reduces the incidence rate of a disease compared to placebo or active control. To illustrate the application of the proposed method in a vaccine efficacy cluster randomized trial, we present sample size calculation for the ratio of two count outcomes, accounting for randomly varying cluster sizes. 
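The sketch below simulates the setting described above: count outcomes in clusters of randomly varying size, with the log rate ratio tested by a Poisson GEE with an exchangeable working correlation (statsmodels). It illustrates the design only; it is not the authors' closed-form sample-size formula, and every parameter value is invented.

```python
# Simulation-based power for a cluster randomized trial with count outcomes,
# varying cluster sizes, gamma cluster effects, and a Poisson GEE Wald test.
import numpy as np
import statsmodels.api as sm

def simulate_power(n_clusters_per_arm=20, mean_size=25, rate_ratio=0.6,
                   base_rate=0.10, gamma_shape=10.0, n_sim=100, seed=4):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        y, trt, cid = [], [], []
        for arm in (0, 1):
            for c in range(n_clusters_per_arm):
                size = max(rng.poisson(mean_size), 2)            # randomly varying cluster size
                u = rng.gamma(gamma_shape, 1 / gamma_shape)      # shared cluster effect
                lam = base_rate * (rate_ratio ** arm) * u
                y += list(rng.poisson(lam, size))
                trt += [arm] * size
                cid += [arm * n_clusters_per_arm + c] * size
        X = sm.add_constant(np.array(trt, dtype=float))
        fit = sm.GEE(np.array(y), X, groups=np.array(cid),
                     family=sm.families.Poisson(),
                     cov_struct=sm.cov_struct.Exchangeable()).fit()
        rejections += fit.pvalues[1] < 0.05                      # test of the log rate ratio
    return rejections / n_sim

print("empirical power:", simulate_power())
```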

Keywords

vaccine efficacy

cluster randomized trial

sample size 

First Author

Jijia Wang, UT Southwestern Medical Center

Presenting Author

Jijia Wang, UT Southwestern Medical Center

Selection Rules for Exponential Population Threshold Parameters

This article constructs statistical selection procedures for exponential populations that may differ only in their threshold parameters. The scale parameters of the populations are assumed common and known. The independent samples drawn from the populations are taken to be of the same size. The best population is defined as the one associated with the largest threshold parameter. Two procedures are developed for choosing a subset of the populations having the property that the chosen subset contains the best population with a prescribed probability. One procedure is based on the sample minimum values drawn from the populations, and another is based on the sample means from the populations. An "Indifference Zone" (IZ) selection procedure is also developed based on the sample minimum values. The IZ procedure asserts that the population with the largest test statistic (e.g., the sample minimum) is the best population. With this approach, the sample size is chosen to guarantee a prescribed probability of correct selection over the parameter region in which the largest threshold exceeds the remaining thresholds by at least a prescribed amount. Numerical examples and R code are given in the appendices. 
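As a quick illustration of the IZ idea, the Monte Carlo sketch below estimates the probability of correct selection when the population with the largest sample minimum is declared best, for shifted exponential populations with a known common scale; the separation and sample size are illustrative and do not reproduce the paper's tabulated constants.

```python
# Monte Carlo probability of correct selection (PCS) for the largest-sample-
# minimum rule with shifted exponential populations.  Illustrative settings only.
import numpy as np

def pcs_min_rule(k=4, n=10, scale=1.0, delta=0.3, n_sim=20000, seed=5):
    rng = np.random.default_rng(seed)
    thresholds = np.zeros(k)
    thresholds[-1] = delta                      # best population exceeds the rest by delta
    samples = rng.exponential(scale, size=(n_sim, k, n)) + thresholds[None, :, None]
    picks = samples.min(axis=2).argmax(axis=1)  # IZ rule: largest sample minimum wins
    return np.mean(picks == k - 1)

print("estimated PCS:", pcs_min_rule())
```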

Keywords

Weibull Distribution

Probability of Correct Selection

Minimum Statistic Selection Procedure

Means Selection Procedure

Subset Size

Indifference Zone Selection Rule 

Co-Author

Jezerca Hodaj, Oakland University

First Author

Gary McDonald, Oakland University

Presenting Author

Jezerca Hodaj, Oakland University

Simulation Using Target Trial Emulation For Causal Analysis With Applications to RCT Data for AVM

Target trial emulation is a two-step procedure that helps articulate the causal relationship between a treatment and a health outcome. A notable limitation is the lack of true randomization and the confounding that arises from noncomparable groups. However, the more closely the baseline data match a true randomization from a homogeneous population of potential trial participants, the more closely the emulated results should reflect those of an RCT. This study explores how a target trial emulation approach applied to clinical trial data differs from individual patient data meta-analysis under settings with increasingly heterogeneous patient populations. We demonstrate how each approach performs with respect to bias and variability of the true treatment effect under various simulated settings, ranging from an ideal setting (perfectly harmonized trials) to examples in which trials were performed with incomplete or incorrectly applied randomization, as can happen in a poorly run trial. We explore this in a simulated setting and then apply the methods to real data from two large trials of similar treatments in order to estimate the true treatment effect. 

Keywords

Simulation

RCT

Causality

Target Trial 

Co-Author

Sixia Chen

First Author

Michael Machiorlatti, University of Oklahoma Health Sciences Center

Presenting Author

Michael Machiorlatti, University of Oklahoma Health Sciences Center

Statistical graph for the evaluation of the primary endpoint in a clinical trial

An assumed treatment effect for a primary endpoint in a randomized clinical trial is determined during the sample size determination process in the planning stage. This effect is an interpretable alternative to the null hypothesis, which posits no effect. To interpret trial results, it is important to evaluate the primary endpoint based on the alternative hypothesis and estimated value of the treatment effect. Nevertheless, most researchers conducting randomized clinical trials on superiority have focused on statistical significance. We propose a quantitative statistical graph, referred to as the "ABC plot," which represents the alternative hypothesis, Bayes factor comparing the null hypothesis with the alternative hypothesis, and confidence interval function for the treatment effect, enhancing the visual evaluation of the treatment effect based on the results of the primary endpoint. We extend it to incorporate the minimum treatment effect used in clinical practice. We apply the proposed graph to a clinical study and demonstrate its usefulness as a statistical tool. 
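The sketch below computes, for a normal-theory treatment-effect estimate, two of the plot's ingredients: the confidence interval (p-value) function over candidate effect values and a simple-vs-simple Bayes factor of the planned alternative against the null. The estimate, standard error, and alternative are invented, and this is not the authors' ABC plot code.

```python
# Numeric illustration of a confidence-curve ordinate and a simple-vs-simple
# Bayes factor for a normal treatment-effect estimate.  All values are invented.
import numpy as np
from scipy.stats import norm

theta_hat, se, theta_alt = 0.35, 0.15, 0.40        # hypothetical estimate, SE, planned effect

def ci_function(theta0):
    """Two-sided p-value for H0: theta = theta0 (confidence-curve ordinate)."""
    return 2 * norm.sf(abs(theta_hat - theta0) / se)

def bayes_factor_alt_vs_null():
    """BF of the simple alternative (theta = theta_alt) against the null (theta = 0)."""
    return norm.pdf(theta_hat, theta_alt, se) / norm.pdf(theta_hat, 0.0, se)

grid = np.linspace(-0.2, 0.9, 5)
print({round(t, 2): round(ci_function(t), 3) for t in grid})
print("BF(alternative vs null):", round(bayes_factor_alt_vs_null(), 1))
```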

Keywords

alternative hypothesis

Bayes factor

confidence interval function

clinical trial

primary endpoint

statistical graph 

Co-Author(s)

Takashi Omori
Hiroshi Yadohisa, Doshisha University

First Author

Yumi Takagi, Kyoto University

Presenting Author

Yumi Takagi, Kyoto University