Speed 6: Data Science and Statistics: Theory and Applications, Part 1

Miguel Fudolig, Chair
University of Nevada, Las Vegas
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
4113 
Contributed Speed 
Music City Center 
Room: CC-104A 

Presentations

A family of instruments for assessing statistics and data science learning experiences

Attitudes matter in education! The Motivational Attitudes in Statistics and Data Science Education (MASDER) project has developed a family of four attitudinal instruments and two learning environment inventories to measure student attitudes toward statistics, instructor attitudes toward teaching statistics, and the learning environment in both statistics and data science. Each MASDER instrument was developed using a staggered, iterative design process. Validity for the four attitudinal instruments is established by the design process and psychometric analyses: CFA and multidimensional IRT were employed across multiple pilot studies to produce robust final instruments. For the student statistics instrument, based on 15,000+ responses across several pilot studies, psychometric analyses support measuring 11 constructs using a 38-item instrument with good internal consistency reliability. The surveys will be freely available through our website (portal.sdsattitudes.com) and as Qualtrics and PDF files, and can be used to study evidence-based best practices in our disciplines. Researchers can access their data and receive customized, automated reports comparing their sample to national results. 

Keywords

attitudes

statistics education

data science education

learning environment

undergraduate

psychometrics 

Co-Author(s)

Douglas Whitaker
Leyla Batakci, Elizabethtown College
Marjorie Bond, The Pennsylvania State University
April Kerby-Helm, Winona State University
Michael Posner, Villanova University

First Author

Alana Unfried, California State University, Monterey Bay

Presenting Author

Alana Unfried, California State University, Monterey Bay

A Simple Test for Technological Change in the Input-Output Model

The input-output (IO) model is widely used to predict the impact of a sectoral demand shock on other sectors of the economy. The model is basically a linear transformation of the demand into a corresponding output vector, with the transformation matrix being a function of the so-called technological matrix. The technological matrix is usually computed from the Make and Use Tables (MUT) of the System of National Accounts and is commonly updated every time the National Accounts Office updates the MUT. So far, however, no statistical test has been proposed to compare whether the differences between two alternative technological matrices are large enough to justify the replacement of one by the other. The paper proposes such a test based on Wald's chi-square statistic and performs it on simulated MUT sets. The types of (technological) hypotheses that might be tested with the new test and some of its limitations are discussed at the end of the paper. 
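
A minimal sketch of the kind of Wald comparison described, assuming a diagonal covariance for the coefficient differences (a simplifying assumption; the paper's actual covariance structure and MUT-based estimation may differ, and all numbers below are hypothetical):

```python
import numpy as np
from scipy import stats

def wald_matrix_test(a1, a2, var):
    """Wald chi-square test for equality of two technological matrices.

    a1, a2 : coefficient matrices of the same shape
    var    : estimated variance of each coefficient difference
             (diagonal covariance assumed, a simplification)
    """
    d = a1.ravel() - a2.ravel()
    w = np.sum(d**2 / var.ravel())   # Wald statistic under independence
    df = d.size                      # one degree of freedom per coefficient
    p = stats.chi2.sf(w, df)
    return w, p

# Hypothetical old vs updated 3x3 technological matrices
rng = np.random.default_rng(0)
a_old = rng.uniform(0.0, 0.3, size=(3, 3))
a_new = a_old + rng.normal(0, 0.01, size=(3, 3))
var = np.full((3, 3), 0.01**2)
w, p = wald_matrix_test(a_old, a_new, var)
print(f"W = {w:.2f}, p = {p:.3f}")
```

A large W (small p) would justify replacing the old matrix with the new one; otherwise the update may not be statistically meaningful.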

Keywords

Input-Output model

technological change

National Accounts

Make and Use Tables

Wald's statistic

chi-square 

First Author

Luis Frank

Presenting Author

Luis Frank

An Interpretable Approximation for the Threshold Parameter of Weibull Probability Distributions

The location or threshold parameter of the three-parameter Weibull distribution is often the most important parameter in many applications that require an estimate of a minimum value. However, traditional methods for estimating this parameter rely on complex numerical procedures that hinder interpretability. In this work, we propose a novel, closed-form approximation that expresses the Weibull threshold as a function of the first three statistical moments: mean, standard deviation, and skewness. This approach enhances understanding of how these common statistical measures influence Weibull threshold behavior and simplifies computation. By prioritizing interpretability, a framework is provided that reveals fundamental relationships between statistical moments and Weibull threshold values, offering insights into the behavior of Weibull random variables across a wide range of skew. This proposed approximation is compared to classical estimation methods, demonstrating its effectiveness in capturing the threshold behavior with minimal mathematical complexity and high interpretability. This work serves as a step toward developing practical methods for minimum value estimation. 
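
The paper's closed-form approximation is not reproduced here, but the underlying moment-matching idea can be sketched: Weibull skewness depends only on the shape, so the shape can be inverted from the skewness, the scale from the standard deviation, and the threshold from the mean. A numerical sketch of that standard method-of-moments route (not the authors' closed form):

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def weibull_threshold_mom(mean, sd, skew):
    """Method-of-moments estimate of the three-parameter Weibull
    threshold from the first three moments (a standard moment-matching
    sketch, not the paper's closed-form approximation)."""
    def skew_of_shape(k):
        g1, g2, g3 = gamma(1 + 1/k), gamma(1 + 2/k), gamma(1 + 3/k)
        return (g3 - 3*g1*g2 + 2*g1**3) / (g2 - g1**2)**1.5
    # Weibull skewness depends only on the shape k; invert numerically.
    k = brentq(lambda k: skew_of_shape(k) - skew, 0.5, 20.0)
    g1, g2 = gamma(1 + 1/k), gamma(1 + 2/k)
    lam = sd / np.sqrt(g2 - g1**2)   # scale recovered from the sd
    return mean - lam * g1           # threshold = mean - scale * Gamma(1 + 1/k)

# Demo: exact moments of a Weibull with shape 2, scale 1, threshold 5
m = 5 + gamma(1.5)
s = np.sqrt(gamma(2.0) - gamma(1.5)**2)
g = (gamma(2.5) - 3*gamma(1.5)*gamma(2.0) + 2*gamma(1.5)**3) / (gamma(2.0) - gamma(1.5)**2)**1.5
print(round(weibull_threshold_mom(m, s, g), 4))   # recovers 5.0
```

The closed-form approximation in the paper replaces the numerical inversion of the shape with an interpretable expression in the same three moments.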

Keywords

mean, μ

standard deviation, σ

skew coefficient

threshold parameter, γ

random variable

Weibull distribution 

Presenting Author

Frederic Holland, NASA Glenn Research Center

Are the Gospels and Acts historical? The female names compared to the historical distribution...

In June 2024, we published evidence that the 82 male Palestinian Jewish names in the four Gospels and Acts in the Bible fit the distribution of the 2,185 historical reference names pretty well. The method used was the chi-squared goodness-of-fit test (Van de Weghe, Luuk and Jason Wilson. 2024. Why Name Popularity is a Good Test of Historicity. Journal for the Study of the Historical Jesus. June 26, 2024. DOI: 10.1163/17455197-bja10035). This is contra Gregor & Blais (2023; Is Name Popularity a Good Test of Historicity? A Statistical Evaluation of Richard Bauckham's Onomastic Argument; JSHJ; Brill; 21:171-202. DOI: 10.1163/17455197-BJA10023), whose methodological flaws we highlighted. In this follow-up study, we apply the same methodology to the 21 female names against the 341 historical reference names. While the conclusions are not as strong, due to the smaller sample size, the fit is still better than that of any historical work examined. We verified the goodness-of-fit test results with a simulation. We chose the speed poster format precisely to invite feedback in order to improve future work – please stop by! 
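
The chi-squared goodness-of-fit machinery can be illustrated in a few lines; the counts and categories below are hypothetical placeholders, not the actual onomastic data:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts: how often each name-popularity band appears in a
# text sample, versus expectations from a historical reference corpus.
observed = np.array([18, 12, 7, 3])               # most popular ... rare names
reference_props = np.array([0.45, 0.30, 0.15, 0.10])
expected = reference_props * observed.sum()       # scale to the sample size

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")          # large p: no evidence of misfit
```

A non-significant result, as in this toy example, is consistent with the sample having been drawn from the reference distribution; the study additionally verified the test behavior by simulation because of the small female-name sample.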

Keywords

Bible

Gospel

goodness-of-fit test

historical

text analysis 

First Author

Jason Wilson, Biola University

Presenting Author

Jason Wilson, Biola University

Assessing Home-Field Advantage in the Presidents Cup: Impact on Competitive Balance and Team Perform

The Presidents Cup provides a unique setting to analyze home-field advantage (HFA) in a two-team competition with zero-sum scoring. Our analysis confirms a statistically significant HFA of 2.065 points for the home team, translating to a 4.13-point swing in differential. While the event alternates host locations, HFA does not fully offset the overall point differential when controlling for ability differences, suggesting that other contextual factors, namely team cohesion on Team United States, may influence competitive balance.
Controlling for HFA and ability differences, we estimate that Team United States holds a significant 3.235-point edge over Team International, corresponding to a 6.47-point swing in differential. These findings underscore the consistent impact of HFA in the Presidents Cup and provide insights into how venue effects shape team-based golf competitions. Additionally, they offer evidence that team chemistry and leadership play a significant role in team performance. 

Keywords

golf

home field advantage

team dynamics

sport analytics

presidents cup

tournament 

Co-Author(s)

Hunter Geise, Syracuse University
Charlotte Howland, Syracuse University
Justin Ehrlich

First Author

Collin Kneiss, Syracuse University

Presenting Author

Collin Kneiss, Syracuse University

Asymptotic properties of Impulse Indicator Saturation under outlier contamination

Impulse Indicator Saturation (IIS) is an outlier robust algorithm for estimating linear regression models. It begins by splitting a sample into two halves. Initial least squares estimators from each half are used to classify observations into "good" observations and "outliers", depending on whether their residuals exceed a predetermined cut-off. The IIS estimator is then equal to the least squares estimator on the retained set of "good" observations. I study asymptotic properties of IIS in data generating processes that include outliers. My approach departs from existing literature, where IIS has only been studied without contamination. I write down an asymptotic representation of IIS in terms of an infeasible least squares estimator that perfectly removes all outliers. As a consequence, asymptotic inference with IIS can proceed along the lines of standard least squares theory, and the distributions of test statistics are free of nuisance parameters. I further analyse the False Outlier Discovery Rate (FODR) of IIS, and find a Poisson approximation to its distribution. Simulations and an empirical illustration using macroeconomic time series data are provided. 
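
A simplified one-pass sketch of the split-half procedure described above (the full IIS algorithm iterates and uses more refined cut-offs; the cut-off value and data below are illustrative assumptions):

```python
import numpy as np

def iis_split_half(X, y, c=2.576):
    """One-pass split-half Impulse Indicator Saturation (simplified sketch).

    Fit OLS on each half, flag observations in the *other* half whose
    residuals exceed c residual standard deviations, then re-fit OLS on
    the retained "good" observations.
    """
    n = len(y)
    halves = np.array_split(np.arange(n), 2)
    keep = np.ones(n, dtype=bool)
    for fit_idx, test_idx in [(halves[0], halves[1]), (halves[1], halves[0])]:
        beta, *_ = np.linalg.lstsq(X[fit_idx], y[fit_idx], rcond=None)
        resid = y[test_idx] - X[test_idx] @ beta
        sigma = np.std(y[fit_idx] - X[fit_idx] @ beta, ddof=X.shape[1])
        keep[test_idx] = np.abs(resid) <= c * sigma
    # Final estimator: least squares on the retained observations
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta, keep

# Demo: y = 1 + 2x with one gross outlier injected at index 5
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
X = np.column_stack([np.ones(100), x])
y = 1 + 2 * x + rng.normal(0, 1, 100)
y[5] += 50
beta, keep = iis_split_half(X, y)
print(f"slope {beta[1]:.2f}; outlier retained? {bool(keep[5])}")
```

The abstract's asymptotic result says inference based on the final OLS fit on the retained set behaves, in large samples, like the infeasible fit that drops exactly the true outliers.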

Keywords

Outlier detection

Robust estimation and inference

Linear models and regression

Time series 

First Author

Otso Hao, University of Oxford

Presenting Author

Otso Hao, University of Oxford

Biostat PRODIGY

Biostatistics Program for Research Outreach and Development in this Generation of Youth (Biostat PRODIGY) is a one-week summer workshop designed to enhance the knowledge and statistical programming skills of high school students, with a focus on underrepresented minority students interested in STEM fields. The program features a comprehensive curriculum covering introductory statistical programming, biostatistics, epidemiology, and interactions with professionals in these fields. Students engage in hands-on learning experiences, culminating in the creation of their own Shiny application for impactful data visualization. The program aims to instill a genuine interest in biostatistics and epidemiology careers among participants. Collaborating with a local school district, the program breaks barriers by offering workshops at local high schools with district-provided Chromebooks. This removes location and equipment barriers and facilitates interactions with professionals from various industries and academic areas, enriching the learning experience. 

Keywords

Biostatistics Education

STEM Workforce Diversity

Statistical Programming

Community Engagement 

First Author

Kristen McQuerry

Presenting Author

Kristen McQuerry

Bridging Data Science and Product Thinking: Creating High-Impact Data Products

Statisticians and data scientists develop powerful models and analyses, yet teams often struggle to operationalize them into scalable, impactful solutions. A data product mindset bridges this gap, combining statistical rigor, data science, and product thinking to create solutions that are usable, maintainable, and designed for long-term adoption.
In this speed session, we'll break down the key principles of building high-impact data products, including identifying the right use cases, designing for great experiences, and ensuring repeatability. We'll explore how organizations move beyond one-off analyses to create production-ready data assets, such as automated forecasting models, intelligent recommendation systems, and self-service analytics tools that drive measurable business value and create delight for users. Attendees will get practical takeaways on how to apply product thinking to data science, helping data science teams deliver valuable insights across an organization.
Through a few real-world examples, attendees will also learn how to structure data products for maximum impact and drive their adoption. 

Keywords

Data Products

Data Science and Analytics

Scalable Machine Learning Models

Self-service analytics

Product thinking

Adoption and impact of Data Science solutions 

First Author

Rajat Verma

Presenting Author

Rajat Verma

Confidence Interval Coverage in a Multicollinear Logistic Regression Model: A Simulation Study

Multicollinearity refers to the condition where two or more independent variables show a strong correlation. Analyzing multicollinear data using generalized linear models (GLM) presents significant challenges. Highly correlated predictors cause the standard error estimates to inflate, resulting in wide confidence intervals, lower predictive power, and less reliable results for the maximum likelihood estimator (MLE) in GLM. Researchers have developed many methods to address the multicollinearity problem. Typically, the performance of various methods is compared based on their mean squared error (MSE). We aim to expand the research in this field for logistic regression (LR), focusing on the confidence intervals based on ridge, Liu, and Kibria–Lukman (KL) estimators. A simulation study examined the confidence intervals for estimates based on coverage probability and interval width in logistic regression under various conditions and for a range of shrinkage parameters. This paper is the first in the field to conduct a comparative study based on the coverage probability of confidence intervals in logistic regression. 
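
The coverage-probability evaluation at the heart of such a study can be sketched directly; the version below uses ordinary maximum likelihood with Wald intervals as the baseline (the ridge, Liu, and KL estimators compared in the paper would replace the plain MLE fit), with hypothetical parameter values:

```python
import numpy as np
from scipy import stats

def logit_fit(X, y, iters=25):
    """Logistic regression MLE via Newton-Raphson; also returns the
    inverse observed information, whose diagonal gives Wald variances."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        H = (X.T * (p * (1 - p))) @ X            # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta, np.linalg.inv(H)

# Empirical coverage of the 95% Wald CI for b1 under strong collinearity
rng = np.random.default_rng(42)
true_beta = np.array([0.5, 1.0, 1.0])            # hypothetical truth
n, reps, rho = 200, 200, 0.9                     # rho controls collinearity
z = stats.norm.ppf(0.975)
hits = 0
for _ in range(reps):
    Xp = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    X = np.column_stack([np.ones(n), Xp])
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
    b, cov = logit_fit(X, y)
    half = z * np.sqrt(cov[1, 1])
    hits += b[1] - half <= true_beta[1] <= b[1] + half
print(f"empirical coverage of b1: {hits / reps:.3f}")   # near nominal 0.95
```

For a shrinkage estimator, coverage and interval width would be recorded over a grid of shrinkage parameters and collinearity levels, which is the comparison the paper performs.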

Keywords

Multicollinearity

logistic regression

shrinkage parameter

confidence interval

coverage probability 

Co-Author(s)

Zoran Bursac, Florida International University
B.M. Kibria, Florida International University

First Author

Sultana Mubarika Chowdhury, Florida International University

Presenting Author

Sultana Mubarika Chowdhury, Florida International University

Differential Privacy in the Survey Context: The Impact of Weighting Class Adjustments on the Sensitivity of a Population Total

The concept of differential privacy (DP) aims at limiting the impact that any single record can have on the analysis of interest. To optimally control this impact, DP generally requires computing the global sensitivity, which measures the maximum possible change of the statistic if a single record is changed in the database. When applying DP in the context of survey data, one needs to consider that preprocessing steps like nonresponse adjustments or calibration have been applied to the data before the analysis. These adjustment steps typically increase the global sensitivity, as changing one record in the database will also change the results of the adjustments. In this work, we specifically focus on the effects of weighting class adjustments, a common strategy to correct for unit nonresponse in surveys. We comprehensively examine how different scenarios affect the sensitivity of weighted population totals under both bounded DP (changing values of a single record while keeping the dataset size fixed) and unbounded DP (adding or removing a single record) frameworks. Our analysis further distinguishes between the response status of the changed record to identify worst-case scenarios for sensitivity calculations. We derive explicit sensitivity formulas for all possible scenarios and identify which combinations produce maximum sensitivity. Our results show that with weighting class adjustments DP loses its symmetric property, i.e., the sensitivity of adding one record differs from that of removing one. 
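
A toy numerical illustration of why the adjustment complicates sensitivity: the class adjustment factor itself changes when a record is added or removed, so a one-record perturbation moves the weighted total through both the record's own contribution and everyone else's adjusted weight. All weights, values, and bounds below are hypothetical, and this is not the paper's derivation:

```python
import numpy as np

def adjusted_total(w, y, resp, cls):
    """Weighted total after a weighting-class nonresponse adjustment:
    within each class, respondent weights are scaled up so that they
    also carry the weight of the nonrespondents."""
    total = 0.0
    for c in np.unique(cls):
        in_c = cls == c
        r = in_c & resp
        if r.any():
            factor = w[in_c].sum() / w[r].sum()   # class adjustment factor
            total += (w[r] * factor * y[r]).sum()
    return total

# Toy data: one weighting class, base weight 2, y bounded in [0, 10]
w    = np.array([2.0, 2.0, 2.0, 2.0])
y    = np.array([10.0, 4.0, 6.0, 0.0])
resp = np.array([True, True, True, False])   # last unit is a nonrespondent
cls  = np.zeros(4, dtype=int)

t = adjusted_total(w, y, resp, cls)
# Unbounded DP: removing the nonrespondent changes the adjustment factor
# for every respondent, not just the removed record's own contribution.
t_remove = adjusted_total(w[:3], y[:3], resp[:3], cls[:3])
print(t, t_remove, abs(t - t_remove))
```

Here removing a record that contributes nothing directly (a nonrespondent) still shifts the total by more than 13 units, which is the mechanism behind the asymmetric add-versus-remove sensitivities the abstract describes.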

Keywords

Differential privacy

Nonresponse bias

Post-stratification

Sensitivity

Survey statistics

Data confidentiality 

Co-Author(s)

Jörg Drechsler, Institute for Employment Research, Germany
Soumojit Das, University of Michigan

First Author

Srijeeta Mitra, University of Maryland College Park

Presenting Author

Srijeeta Mitra, University of Maryland College Park

Efficient Dynamic Prediction of High-density Multilevel Generalized Functional Data

Dynamic prediction, which typically refers to the prediction of future outcomes using historical records, is often of interest in biomedical research. For datasets with large sample sizes, high measurement density, and multilevel structures, traditional methods are often infeasible because of the computational burden associated with both data scale and model complexity. Moreover, many models do not directly facilitate out-of-sample predictions for multilevel generalized outcomes. To address these issues, we develop a novel approach for dynamic predictions based on a recently developed method estimating complex patterns of variation for exponential family data: Generalized Multilevel Functional Principal Components Analysis (gmFPCA). Our method is able to handle large-scale, high-density multilevel repeated measures much more efficiently, with its implementation feasible even on personal computational resources. The proposed method makes highly flexible and accurate predictions of future trajectories for data that exhibits high degrees of nonlinearity, and allows for out-of-sample predictions to be obtained without reestimating any parameters. 

Keywords

Dynamic Prediction

Functional Data

Longitudinal Data

Wearable Device

Generalized Functional Data

Mixed effect models 

First Author

Ying Jin, National Institute of Environmental Health Sciences

Presenting Author

Ying Jin, National Institute of Environmental Health Sciences

Enhancing Trading Performance with Optimized MACD and Multi-Indicator Integration

The Moving Average Convergence Divergence (MACD) indicator is a widely used technical analysis tool in the stock market. One of the most common strategies, the MACD Signal Line Crossover Strategy, generates buy and sell signals based on crossover points. However, it is often affected by false signals, reducing prediction accuracy. To enhance overall returns, this study optimizes the traditional MACD parameters and integrates additional indicators such as KD, RSI, and volume. By evaluating performance across multiple time frames (15-minute, 1-hour, and daily intervals), the proposed algorithm increases the winning rate and improves the effectiveness of the MACD strategy. 
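
The baseline signal-line crossover strategy can be sketched as follows; the optimized parameters and the KD/RSI/volume filters described in the abstract are not shown, and the price series is simulated:

```python
import numpy as np
import pandas as pd

def macd_signals(close, fast=12, slow=26, signal=9):
    """MACD signal-line crossover signals with the classic (12, 26, 9)
    defaults (the baseline strategy the study starts from)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow            # MACD line
    sig = macd.ewm(span=signal, adjust=False).mean()   # signal line
    cross = np.sign(macd - sig)
    # +1 where MACD crosses above the signal line (buy), -1 below (sell)
    return cross.diff().fillna(0).div(2).astype(int)

# Simulated price path (random walk) purely for illustration
rng = np.random.default_rng(7)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())
signals = macd_signals(close)
print((signals == 1).sum(), "buy signals,", (signals == -1).sum(), "sell signals")
```

Backtesting the study's optimized variant would replace the default spans and drop any crossover not confirmed by the additional indicators, which is how false signals are filtered out.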

Keywords

MACD

Backtesting

False Signal Reduction

Parameter Optimization 

Co-Author

Yiqing Wang, Independent Researcher

First Author

Luyun Lin

Presenting Author

Luyun Lin

Multifaceted Gender Identity Measurement (M-GIM): A Viable Alternative to Forced-Choice Assessments

While suggesting specific question wording for surveys collecting data on gender identity and sexual orientation, a 2022 National Academies of Sciences, Engineering, and Medicine (NASEM) report on "Measuring Sex, Gender Identity, and Sexual Orientation" recognized limitations of "forced-choice measurement" using multiple-choice items and recommended further research into representing sexual orientation and gender identity (SOGI). Extending the NASEM recommendations, the "Multifaceted Gender Identity Measurement" (M-GIM) pilot study asked respondents about the extent to which they agree or disagree with a series of SOGI characterizations using ordinal scales alongside mental health questionnaires such as the PHQ-8 or AQ-10. This presentation will present the results of a cluster analysis conducted on M-GIM pilot survey data currently being collected at UCLA. Respondents have been recruited from the UCLA student population and the Los Angeles Black LGBTQ+ Network. Discussion will also focus on the analysis of disparities in quality-of-life outcomes across population subgroups characterized by similar gender-identity or sexual-orientation profiles. 

Keywords

gender identity

sexual orientation

ordinal data cluster analysis

nonbinary

gender fluidity 

Co-Author(s)

Thomas Belin, University of California-Los Angeles
Donatello Telesca, UCLA School of Public Health
Zichen Liu, University of California, Los Angeles

First Author

Andrew Chuang

Presenting Author

Andrew Chuang

Parallel Universes: Decision Analysis and Data Science

In data science, we identify patterns in data and discover what useful, interesting information we can extract. In decision analysis, we utilize that information to make informed choices. Data science primarily functions as a producer of information, while decision analysis serves as a consumer. By leveraging facilitation and decision analysis skills, we broaden our horizons to engage in project framing and collaborate with our colleagues, domain experts, and organization leaders to support their decision-making processes.

Understanding how our identities and beliefs shape our perspective on a problem is quite challenging. However, we can align conflicting sides on a decision by agreeing on the problem framework, establishing values and objectives, determining the data and model that will guide us, generating alternatives, and considering the trade-offs.

It can be challenging to step outside our own beliefs and biases. It is humbling to listen openly to other sides, but doing so makes us more well-rounded and effective statisticians and data scientists. 

Keywords

decision analysis

structured decision-making

collaboration 

First Author

Mark Otto, Independent

Presenting Author

Mark Otto, Independent

Performances of Some Improved Estimators and their Robust Version with Outliers

This paper introduces enhanced estimators designed to address the issue of multicollinearity in multiple linear regression models. In addition to multicollinearity, the presence of outliers also presents a challenge in multiple linear regression analysis. To tackle these issues, this paper proposes several improved estimators, along with their robust counterparts, and compares their performance. The evaluation of these estimators is based on both Monte Carlo simulations and real-life data, under various outlier scenarios, including no outliers, one outlier, and two outliers. The mean squared error (MSE) is used as the performance criterion. The simulation results show that, when no outliers are present, the improved estimators outperform most of their robust versions. However, in the presence of one or two outliers, all robust versions of the improved estimators perform better than the conventional improved estimators. 

Keywords

Linear Regression

Mean Square Error

M-estimator

Multicollinearity

outliers

Ridge Regression 

Co-Author(s)

Zoran Bursac, Florida International University
B.M. Kibria, Florida International University

First Author

Nusrat Yasmin, Florida International University

Presenting Author

Nusrat Yasmin, Florida International University

Power BI app for Tolerance Intervals and Reduced Major Axis Regression

In-line inspection (ILI) tools that run internally in the pipe detect and characterize threats on a pipeline but are subject to inherent variability. The results of these ILI surveys are used to assess the criticality of reported anomalies, but the ILI runs should also be compared to actual field excavations, among other comparisons. To manage threats effectively, correlation between successive ILI runs, and between ILI and field excavation results, is important. Key to assessing how well results compare are tolerance intervals, which cover a given proportion of the population with a specified confidence level. This Power BI (PBI) app uses 95% confidence and 80% population coverage, which meshes well with industry requirements. Another newly added feature is least squares regression; however, for error-prone data, reduced major axis regression often gives a better result. Both regression approaches are used to assess the relative fits of the various data types in the model.
These enhancements have been added to a PBI bias-assessment app to provide solid, quality results for pipeline companies. This study used representative ILI results to create interactive, statistical, and visual analyses. 
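
Both statistical ingredients have standard textbook forms, sketched here with hypothetical ILI-versus-field measurements; the tolerance factor uses Howe's approximation, and the app's exact computations may differ:

```python
import numpy as np
from scipy import stats

def tolerance_factor(n, coverage=0.80, conf=0.95):
    """Two-sided normal tolerance factor (Howe's approximation):
    x_bar +/- k*s covers `coverage` of the population with confidence `conf`."""
    nu = n - 1
    z = stats.norm.ppf((1 + coverage) / 2)
    chi2 = stats.chi2.ppf(1 - conf, nu)
    return z * np.sqrt(nu * (1 + 1 / n) / chi2)

def rma_fit(x, y):
    """Reduced major axis regression: slope = sign(r) * sd(y)/sd(x),
    appropriate when both variables carry measurement error."""
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)
    return slope, np.mean(y) - slope * np.mean(x)

# Hypothetical paired depths (percent wall thickness): ILI vs field
rng = np.random.default_rng(3)
field = rng.uniform(10, 60, 40)
ili = 1.05 * field + rng.normal(0, 3, 40)
slope, intercept = rma_fit(field, ili)
k = tolerance_factor(len(field), coverage=0.80, conf=0.95)
print(f"RMA slope {slope:.2f}, tolerance factor k = {k:.2f}")
```

An RMA slope near 1 with a narrow tolerance band around the differences indicates the ILI tool sizes anomalies consistently with field measurements, which is the comparison the app visualizes.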

Keywords

ILI

pipeline

correlation

oil

RMA

tolerance 

First Author

William Harper, DNV

Presenting Author

William Harper, DNV

Restricted Mean Survival Time-Based Adjustment of Hazard Ratio Confidence Intervals

Restricted mean survival time ratios (RMSTRs) are widely used as an alternative to hazard ratios (HRs) from Cox proportional hazards models when the proportional hazards (PH) assumption is violated. However, RMSTRs are sensitive to censoring and prone to bias, whereas HRs are generally more robust. Thus, it is desirable to preserve the HR estimate from Cox models while using inference from RMSTRs. When the PH violation is mild or moderate, HRs remain interpretable, though their confidence intervals (CIs) may widen due to increased variation from crossing survival curves. We introduce a method to adjust HR CIs using RMSTR-derived variance estimates, ensuring inference consistency. Additionally, we propose a guideline to classify PH assumption violations as mild. Through simulations, we compare our approach with weighted Cox regression, demonstrating that HRs exhibit a linear relationship with weighted average HRs, supporting their robustness. Finally, we apply our method to real-world data with mild PH violation, showcasing its practical utility. 
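
The RMST building block, the area under the Kaplan-Meier curve up to a horizon tau, can be computed directly; this is a minimal sketch of that ingredient only (not the paper's variance transfer or CI adjustment), with illustrative data:

```python
import numpy as np

def rmst(time, event, tau):
    """Restricted mean survival time: area under the Kaplan-Meier
    survival curve up to tau. `event` is 1 for an event, 0 for censoring."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s, prev_t, area = 1.0, 0.0, 0.0
    n_at_risk = len(time)
    for t, d in zip(time, event):
        if t > tau:
            break
        area += s * (t - prev_t)      # survival is flat between times
        if d:
            s *= 1 - 1 / n_at_risk    # Kaplan-Meier step at an event
        n_at_risk -= 1                # censored units leave the risk set
        prev_t = t
    area += s * (tau - prev_t)        # flat tail out to tau
    return area

# With no censoring, RMST equals the mean of min(T, tau)
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 1, 1, 1])
print(rmst(time, event, tau=5.0))     # 2.5, the mean of the four times
```

The RMST ratio between arms, and the variance of its log, are what the proposed method borrows to re-scale the width of the Cox hazard ratio's confidence interval under mild non-proportionality.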

Keywords

Restricted mean survival time ratio

Cox proportional hazards assumption

Weighted Cox model 

Co-Author(s)

Deepti Jain, Wayne State University
Marya Wahidi, University of Michigan
Radhika Gogoi, Wayne State University
Rouba Ali-Fehmi, Wayne State University
Seongho Kim, Wayne State University

First Author

Hyejeong Jang, Wayne State University

Presenting Author

Hyejeong Jang, Wayne State University

Testing the Coefficients for the Two-Parameter Multicollinear Linear Regression Model

In linear regression analysis, the assumption of independence among explanatory variables is crucial, with the ordinary least squares (OLS) estimator typically regarded as the Best Linear Unbiased Estimator (BLUE). However, multicollinearity poses challenges by distorting the estimation of individual variable effects and impeding reliable statistical inference. To address this issue, various two-parameter estimators have been proposed in the literature. This paper aims to compare the t-test statistics used to assess the significance of regression coefficients when employing two-parameter biased estimators. A Monte Carlo simulation study is conducted to evaluate their performance, focusing on the maintenance of the empirical type I error rate and power properties, in line with standard testing practices. The findings indicate that some two-parameter estimators offer significant power improvements while preserving the nominal 5% significance level. 
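
The simulation design can be sketched as follows, using the OLS t-test as the baseline under a true null coefficient (the two-parameter biased estimators compared in the paper would replace the OLS fit, with correspondingly adjusted test statistics); all parameter values are illustrative:

```python
import numpy as np
from scipy import stats

def empirical_size(n=50, rho=0.95, reps=2000, alpha=0.05, seed=0):
    """Empirical type I error of the t-test for b1 = 0 under strong
    multicollinearity, using OLS as a baseline estimator."""
    rng = np.random.default_rng(seed)
    crit = stats.t.ppf(1 - alpha / 2, n - 3)
    rejections = 0
    for _ in range(reps):
        Xp = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
        X = np.column_stack([np.ones(n), Xp])
        y = 1.0 + 0.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1, n)  # b1 = 0
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        s2 = resid @ resid / (n - 3)
        t = beta[1] / np.sqrt(s2 * XtX_inv[1, 1])   # t-statistic for b1
        rejections += abs(t) > crit
    return rejections / reps

print(f"empirical type I error: {empirical_size():.3f}")   # near nominal 0.05
```

Repeating this loop with a nonzero b1 gives the empirical power, and the paper's comparison asks whether a biased two-parameter estimator can raise that power without pushing the type I error above the nominal 5%.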

Keywords

Empirical power

Linear Regression

Type I error rate

Multicollinearity

Ridge Regression estimator

Simulation study 

Co-Author(s)

Zoran Bursac, Florida International University
B.M. Kibria, Florida International University

First Author

Md Ariful Hoque, Florida International University

Presenting Author

Md Ariful Hoque, Florida International University

Toward a robust and simple guideline for checking the Central Limit Theorem

In statistical practice, many introductory statistical procedures require the sampling distribution of means to be approximately normal. Most students learn a simplified check of this condition as "n ≥ 30", which often becomes a black-and-white mantra replacing visual inspection of the data. A slightly more detailed version might be "n ≥ 30 as long as the population distribution is not too skewed." Our research seeks to clarify a guideline that incorporates measures of skewness along with sample size. We used simulation to explore the consequences of skewed populations with different sample sizes. We hope to provide students and practitioners with a slightly more refined rule that allows a way to operationalize the degree of skewness in statistical analysis. 
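
The simulation idea can be sketched directly: draw repeated samples from a skewed population and watch the skewness of the sample means shrink (roughly like the population skewness divided by the square root of n). The exponential population below is an illustrative choice, not the study's specific design:

```python
import numpy as np
from scipy import stats

def mean_sampling_skewness(population, n, reps=10_000, seed=0):
    """Skewness of the sampling distribution of the mean: draw `reps`
    samples of size n (with replacement) and compute the skewness
    of the resulting sample means."""
    rng = np.random.default_rng(seed)
    means = rng.choice(population, size=(reps, n)).mean(axis=1)
    return stats.skew(means)

rng = np.random.default_rng(1)
population = rng.exponential(scale=1.0, size=100_000)   # skewness near 2
for n in (10, 30, 100):
    print(n, round(mean_sampling_skewness(population, n), 3))
```

A skewness-aware guideline would read such a table in reverse: given the population's skewness, find the n at which the sampling distribution is close enough to normal for the intended procedure.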

Keywords

Applied Statistics

Statistical Pedagogy

Simulation

Central Limit Theorem

Skewness

Normality Check 

Co-Author

Beth Chance, California Polytechnic State University

First Author

Visruth Srimath Kandali, California Polytechnic State University

Presenting Author

Visruth Srimath Kandali, California Polytechnic State University

Variants of Regression Tests for Assessing Publication Bias

Publication bias has long been a critical issue in meta-analysis that compromises the certainty of synthesized evidence. Egger's regression test is one of the most popular methods to detect the presence of publication bias by examining the asymmetry of the funnel plot. We proposed five variants of Egger-type regression tests incorporating different assumptions for the error term in the model, within both fixed-effect and random-effects settings. This work aims to empirically evaluate the performance of the Egger regression variants. We applied the five Egger-type regressions to a collection of 51 high-quality meta-analyses from BMJ papers focusing on medical research. Cohen's kappa was utilized to assess the pairwise agreement among the different regression tests, with kappa values varying from approximately 50% to over 90%, indicating moderate to almost perfect agreement. Given the variation observed among the Egger-type regressions in this empirical evaluation, it is crucial for meta-analysts to choose and specify the error term employed in the Egger regression test when assessing publication bias in practice. 
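
The classical Egger regression, the anchor point for the five variants, can be sketched as follows; the specific error-term variants compared in the work are not reproduced, and the meta-analytic data below are simulated:

```python
import numpy as np
from scipy import stats

def egger_test(effects, ses):
    """Classical Egger regression: regress the standardized effect
    (effect/SE) on precision (1/SE); a nonzero intercept suggests
    funnel-plot asymmetry. This is one of several error-term variants."""
    snd, precision = effects / ses, 1 / ses
    res = stats.linregress(precision, snd)
    t = res.intercept / res.intercept_stderr
    p = 2 * stats.t.sf(abs(t), len(effects) - 2)
    return res.intercept, p

# Simulated meta-analysis: 15 studies, true effect 0.3, no publication bias
rng = np.random.default_rng(5)
ses = rng.uniform(0.05, 0.4, 15)
effects = 0.3 + rng.normal(0, ses)
intercept, p = egger_test(effects, ses)
print(f"Egger intercept {intercept:.2f}, p = {p:.3f}")
```

The five variants change the assumed error structure of this regression (e.g., fixed-effect versus random-effects weighting), which is why their verdicts can disagree on the same funnel plot.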

Keywords

meta-analysis

publication bias

Egger's regression 

Co-Author

Lifeng Lin

First Author

Linyu Shi, AbbVie Inc.

Presenting Author

Linyu Shi, AbbVie Inc.