Bridging Statistical Theory and Practice: Tools and Techniques for Effective Consulting and Collaboration

Sonja Ziniel, Chair
University of Colorado School of Medicine - Department of Pediatrics
 
Monday, Aug 4: 2:00 PM - 3:50 PM
4065 
Contributed Papers 
Music City Center 
Room: CC-101C 

Main Sponsor

Section on Statistical Consulting

Presentations

Research Questions: A Key to Developing Shared Understanding in Statistical Consulting

This study examines the role of formulating research questions in the early stages of a statistical consulting project. Research questions are fundamental to designing studies, and well-structured questions help clarify the plans for the investigation. In prior work, we proposed guidelines for formulating quantitative research questions.

Clients and consultants with differing types of expertise and communication styles may have difficulty aligning on a project's aims and requirements. Framing research questions in simple, well-structured terms can help develop a shared understanding of the project's goals, shape the plans for statistical analyses, and build trust in the working relationship. The research questions can then serve as the basis of the initial consultations and inform the working agreement.

The presentation will discuss examples of research questions and how consultants can use these questions to communicate with clients at different stages of the consulting project. This can be an important tool in developing more effective collaborations. 

Keywords

Statistical consulting

research questions

communication 

Co-Author

Nicole Lorenzetti, The City College of New York

First Author

David Shilane, Columbia University

Presenting Author

David Shilane, Columbia University

What we talk about when we talk about statistical power

Study proposals typically include a sample size justification to communicate that the study has been designed rigorously, but all too often a biostatistician is engaged at the final stage of proposal development with the sole purpose of providing this justification. We argue that a power calculation is the wrong deliverable for a biostatistician. Instead, biostatisticians should be integrated as full collaborators, working closely with the study team throughout the entire design process. We argue in two steps. First, practically speaking, there is no such thing as a "quick and easy" power calculation: these calculations depend on a comprehensive understanding of the intricacies and limitations of the study, which often requires extensive discussion with the study team. Second, including a study design and statistics expert as an equal partner produces more informative and reliable studies overall. We draw on our experiences as directors of collaborative statistical cores and offer specific recommendations for biostatisticians, clinical investigators, funding agencies, and research institutions to support this culture shift.
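The abstract gives no formulas, but the sensitivity it alludes to can be illustrated with a minimal normal-approximation sample-size sketch (an illustration of the general point, not the authors' method): halving the assumed effect size roughly quadruples the required sample size, which is why the "quick" calculation hinges on assumptions only the full study team can pin down.

```python
import math
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided,
    two-sample comparison of means (standardized effect size)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# The "easy" part is plugging numbers in; the hard part is
# justifying the assumed effect size in the first place.
print(n_per_arm(0.50))  # 63 per arm for a moderate effect
print(n_per_arm(0.25))  # 252 per arm: half the effect, ~4x the sample
```

The quadratic dependence on the assumed effect size is exactly why the deliverable cannot be separated from the design conversation.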

Keywords

Statistical power

Study design

Collaborative statistics

Statistical consulting

Informative trials 

First Author

Alex Dahlen, New York University, School of Global Public Health

Presenting Author

Alex Dahlen, New York University, School of Global Public Health

Techniques in Team Science: The Preponderance of Evidence for Good Decision-Making in Biomechanics

Statisticians use a variety of evidence to inform decisions about analytic strategies, whether assessing whether a regression model meets its parametric assumptions or identifying the optimal solution in a principal components analysis. The analogous legal terminology refers to the compilation of evidence allowing a "more likely than not" decision as the "preponderance of evidence." In the team science context, statisticians must help their collaborators understand the relative contribution and meaning of each source of evidence, both statistically and conceptually, when selecting and specifying models. This presentation outlines this approach, first with a simple example of assessing normality in a single variable, then by describing the decision-making process in a clustering algorithm used to identify subgroups within high-dimensional biomechanical measures. When there is no optimal cluster solution (i.e., no preponderance of evidence), we discuss the requisite dialogue between the statistical evidence and domain evidence needed to arrive at a reasonable and useful conclusion. These examples are leveraged to provide recommendations for statisticians working as team scientists.
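A toy version of the single-variable normality example might combine several numeric summaries, each treated as a separate piece of evidence, rather than lean on any one test. The thresholds and the "more pieces for than against" rule below are illustrative assumptions, not the presenters' procedure.

```python
import random
from statistics import fmean, pstdev

def normality_evidence(x, skew_tol=0.5, kurt_tol=1.0):
    """Combine two pieces of evidence for approximate normality:
    sample skewness and excess kurtosis, each checked against a
    rule-of-thumb bound (tolerances are illustrative assumptions)."""
    m, s = fmean(x), pstdev(x)
    z = [(v - m) / s for v in x]
    skew = fmean([v ** 3 for v in z])
    ex_kurt = fmean([v ** 4 for v in z]) - 3.0
    evidence = {"skew_ok": abs(skew) < skew_tol,
                "kurt_ok": abs(ex_kurt) < kurt_tol}
    # "Preponderance": more pieces of evidence for than against.
    verdict = sum(evidence.values()) > len(evidence) / 2
    return skew, ex_kurt, verdict

random.seed(7)
sample = [random.gauss(0, 1) for _ in range(2000)]
skew, ex_kurt, looks_normal = normality_evidence(sample)
```

In practice one would weigh additional sources (plots, formal tests, domain knowledge), which is precisely the dialogue the presentation describes.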

Keywords

Team Science

Collaborative Research

Statistical Decision-Making

Cluster Analysis

Biomechanics 

Co-Author(s)

Christine Kim, University of Kentucky
Michael Samaan, University of Kentucky
Kate Jochimsen, Harvard Medical School

First Author

Anthony Mangino, University of Kentucky, Department of Biostatistics

Presenting Author

Anthony Mangino, University of Kentucky, Department of Biostatistics

Reinforcing data from controlled experiments with synthetic data to increase predictive accuracy

Product development teams collect data from controlled experiments to optimize products or processes; e.g., ingredient levels are optimized for maximum consumer appeal, or process settings are optimized for maximum yield. Physical experiments can sometimes be costly; hence, designs that require a minimal number of runs (e.g., D-optimal designs) are often used. Such designs, however, may not provide adequate coverage of certain parts of the input space, which may impact a model's predictive performance. To this end, this paper explores the use of synthetically generated data to reinforce real data and enhance predictive performance. The synthetic data points are designed to provide better coverage of the input space while preserving key statistical properties of the original data. A specific use case that showed notable improvements in predictive performance will be presented: RMSE on a held-out test set decreased markedly when models trained on the combined real and synthetic data were compared with models trained on real data alone. This approach allows for a more comprehensive exploration of the input space without the need to physically collect more data.
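The abstract does not specify how the synthetic points are placed; one simple, hypothetical way to fill in poorly covered regions is a greedy maximin rule over a candidate grid: repeatedly add the candidate farthest from everything already in the design. This sketch is my assumption of one plausible mechanism, not the authors' method.

```python
import itertools
import math

def maximin_augment(design, k, grid_steps=11):
    """Greedily add k synthetic points in [0,1]^2, each chosen to
    maximize its distance to the nearest existing point
    (real runs plus previously added synthetic points)."""
    pts = list(design)
    grid = [(i / (grid_steps - 1), j / (grid_steps - 1))
            for i, j in itertools.product(range(grid_steps), repeat=2)]
    added = []
    for _ in range(k):
        best = max(grid, key=lambda c: min(math.dist(c, p) for p in pts))
        pts.append(best)
        added.append(best)
    return added

# Hypothetical 4-run design clustered along one edge of the region:
real_runs = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.1), (0.9, 0.1)]
synthetic = maximin_augment(real_runs, k=2)
```

The synthetic points land in the empty upper region of the input space, which is the coverage gap the abstract describes; preserving the statistical properties of the original data would require an additional modeling step.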

Keywords

Synthetic Data

Prediction Accuracy

Controlled Experiments 

Co-Author

Lochana Palayangoda, University of Nebraska Omaha

First Author

Jason Parcon, PepsiCo

Presenting Author

Jason Parcon, PepsiCo

Clustering and Inference for Ballot Models for VRA Analysis

Analysis of alternative election systems often requires modeling of ballots in settings where the available data is not perfectly aligned with the potential mechanism. In this talk I will discuss both empirical and theoretical questions that arise in this modeling, motivated by applications of state-level Voting Rights Act legislation. In particular, I will consider questions in supervised learning, including how past data can provide information about the likely impacts of a new voting system and what the related inference problem looks like for recently introduced slate-based models. I will also discuss the related unsupervised problem of clustering ballots from preference profiles and identifying party membership.
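The abstract gives no algorithmic detail, but any clustering of ranked ballots needs a dissimilarity between rankings; Kendall tau distance (the number of candidate pairs ranked in opposite order) is one standard choice and is shown here purely as a hypothetical building block, not as the method of the talk.

```python
from itertools import combinations

def kendall_tau_distance(ballot_a, ballot_b):
    """Number of candidate pairs ranked in opposite order by two
    full-ranking ballots, e.g. ("A", "B", "C") means A is ranked first."""
    pos_a = {c: i for i, c in enumerate(ballot_a)}
    pos_b = {c: i for i, c in enumerate(ballot_b)}
    return sum(
        1
        for x, y in combinations(ballot_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

d = kendall_tau_distance(("A", "B", "C"), ("C", "B", "A"))  # full reversal -> 3
```

A pairwise distance matrix built this way can feed any standard clustering routine; handling partial or truncated ballots, as real preference profiles require, needs further modeling choices.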

Keywords

Voting Methods

Ranked Ballots

Clustering

Statistical Consulting 

First Author

Daryl DeFord, Washington State University

Presenting Author

Daryl DeFord, Washington State University

ZIM Regression under Complex Sampling Designs, with an Application to Hospital Inpatient Charges Data

An underlying population may contain a large proportion of zero values, causing the population distribution to spike at zero; such a population is referred to as zero-inflated. Zero-inflated populations arise in many applications and are analyzed via two-component mixture models. I will present some examples of zero-inflated populations and explain the estimation problem in generalized linear regression models. I will describe the zero-inflated mixture (ZIM) regression model under complex probability sampling designs via two-component mixture models in which the probability distribution of the non-zero component is assumed to be parametric. A maximum pseudo-likelihood procedure is proposed to estimate the expected responses at "future" covariate values/vectors. Simulation results show that, under some complex probability sampling designs, new confidence intervals based on the pseudo-likelihood function perform significantly better than standard procedures. The proposed procedure is applied to hospital data on inpatient charges in dollars.
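As a simplified, intercept-only sketch of the two-component idea (no covariates and no design-based variance, so far simpler than the model in the talk): estimate the zero proportion and a lognormal positive part with survey weights, then combine them for the expected response. The lognormal choice and the toy data are my assumptions for illustration.

```python
import math

def zim_expected_response(y, w):
    """Weighted two-component estimate of E[Y] for zero-inflated data:
    E[Y] = (1 - pi) * E[Y | Y > 0], with a lognormal positive part.
    y: responses containing exact zeros; w: survey weights."""
    wsum = sum(w)
    pi_hat = sum(wi for yi, wi in zip(y, w) if yi == 0) / wsum  # P(Y = 0)
    pos = [(math.log(yi), wi) for yi, wi in zip(y, w) if yi > 0]
    wpos = sum(wi for _, wi in pos)
    mu = sum(li * wi for li, wi in pos) / wpos
    sigma2 = sum(wi * (li - mu) ** 2 for li, wi in pos) / wpos
    # Lognormal mean: exp(mu + sigma^2 / 2)
    return pi_hat, (1 - pi_hat) * math.exp(mu + sigma2 / 2)

charges = [0, 0, 2.0, 4.0, 8.0]  # hypothetical inpatient charges
weights = [1.0] * 5              # equal weights for illustration
pi_hat, expected_charge = zim_expected_response(charges, weights)
```

With unequal weights the same formulas give a design-weighted (pseudo-likelihood-style) estimate; adding covariates turns each component into a regression, which is the setting of the talk.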

Keywords

Zero-Inflated

Sampling Designs

Regression

Simulation

Hospital Inpatient Charges (in Dollars) 

First Author

Khyam Paneru, University of Tampa

Presenting Author

Khyam Paneru, University of Tampa

Automating Codebase Translation from SAS to Python with LLMs

Code translation from SAS to Python remains a challenge for organizations migrating their codebases. Classical rule-based methods built on abstract syntax trees rely on handcrafted rules that can be time-consuming to develop and inflexible. Unsupervised learning approaches have shown improvements but require massive parallel corpora for training, which are unavailable for SAS and Python. Large Language Models (LLMs) overcome these barriers through parametric knowledge retrieval and offer more promising results, despite quality issues such as syntax and semantic errors. This presentation explores strategies for automating SAS-to-Python translation on complex codebases. We discuss managing context window limitations and nested dependencies, incorporating rule-based approaches, and reducing model "laziness" on tedious code. We also detail specific challenges in adapting SAS to Python, such as sentinel values, vectorized operations, and macros. This presentation highlights practical approaches for migrating proprietary software to open-source languages more quickly, reducing the resource burden on organizations while preserving critical business logic.
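One concrete instance of the "sentinel values" challenge mentioned above (a toy sketch, not the authors' pipeline): SAS encodes ordinary missing numerics as "." and special missing values as ".A" through ".Z" and "._", which a naive token-level translation would pass through as strings, while idiomatic Python wants float('nan').

```python
import math

def sas_numeric_to_python(token):
    """Translate a SAS numeric literal (as text) to a Python float,
    mapping SAS missing-value sentinels to NaN.
    Note: the special-missing letter codes are collapsed to plain NaN,
    so the distinction between ".A" and ".B" is lost."""
    t = token.strip().upper()
    # "." is the ordinary missing value; ".A"-".Z" and "._" are
    # special missing values.  (".5" is a real number and falls through.)
    if t == "." or (len(t) == 2 and t[0] == "."
                    and (t[1] == "_" or t[1].isalpha())):
        return float("nan")
    return float(t)

values = [sas_numeric_to_python(t) for t in ["3.5", ".", ".A", "-7"]]
```

Even this tiny rule illustrates why hybrid approaches help: the sentinel mapping is trivially rule-based, leaving the LLM to handle the logic that rules cannot capture.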

Keywords

Large Language Models (LLMs)

Code Translation

Federal Statistics

Natural Language Processing 

Co-Author(s)

Ellie Mamantov, Reveal Global Consulting
John Lynagh, Reveal Global Consulting

First Author

Cameron Milne, Reveal Global Consulting

Presenting Author

Cameron Milne, Reveal Global Consulting