SPEED 6: Social Statistics, Mental Health Statistics, and Survey Methods, Part 1

Elizabeth Petraglia Chair
Westat
 
Tuesday, Aug 4: 8:30 AM - 10:20 AM
7108 
Contributed Speed 
Thomas M. Menino Convention & Exhibition Center 
Room: CC-102B 

Presentations

A Comparison of Linear, Ensemble, and Neural Network Architectures for Estimating Healthcare Costs

Accurate estimation of individual medical costs is a cornerstone of business analytics in the insurance industry; however, the relationship between demographic factors and actual expenditures is often non-linear. This study uses a publicly available medical cost dataset containing health indicators such as age, BMI, and smoking status to evaluate the predictive performance of three models. We compare a Multiple Linear Regression model, a Random Forest, and a multi-layered Neural Network to determine whether deep learning models provide a statistically significant improvement in Mean Absolute Error (MAE) and R-squared values over traditional frequentist approaches. While linear models offer high interpretability, they may fail to capture the non-linear cost interactions between high BMI and smoking status that are better explained by non-linear models. The findings aim to provide a practical framework for selecting the most efficient model for predicting health costs based on personal demographics, while balancing the trade-offs between model complexity and interpretability.
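As a minimal sketch of the two comparison metrics named in the abstract, the following computes MAE and R-squared for one set of held-out predictions. The function names and the toy cost figures are illustrative assumptions, not taken from the study.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute gap between observed and predicted costs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 - SS_res / SS_tot: share of variance in observed costs explained."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out charges and one model's predictions (not study data).
observed = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
```

Running the same two functions on each model's predictions for the same held-out rows gives directly comparable numbers; any significance test on the difference would be a separate step.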

Keywords

Regression Analysis

Machine Learning

Healthcare Analytics

Neural Networks

Business Intelligence 

Speaker

Sabine Esmaili

A Stochastic Neoclassical Input-Output Model with Random Agricultural Supply

Leontief's input-output model is widely used in economics to predict the impact of a sector-specific demand shock on other sectors of the economy. This model was later reformulated by ten Raa and Mohnen to align with neoclassical assumptions regarding production functions. However, this revised version omits natural resources, which introduces biases in the model's technological coefficients matrix. This paper corrects the ten Raa-Mohnen model and adapts it to the case of an economy dependent on the provision of water for agriculture, where supply is essentially random. The new model is formulated as a stochastic quadratic programming problem that maximizes value added across all sectors of the economy, except for the agricultural sector, whose output is treated as exogenous and stochastic. The objective function also penalizes production in sectors with highly variable prices. The program is subject to various constraints - such as those from the original Leontief model itself and labor and capital endowments - which are satisfied not with certainty but with a given probability. Finally, the proposed model is calibrated using real data from Argentina's System of National Accounts. 

Keywords

input-output model

stochastic programming

agriculture

natural resources

ten Raa-Mohnen model

Argentina 

Speaker

Luis Frank

Advancing College Access: Factor Analysis of Grade 11 College-Going Self-Efficacy

Onward We Learn prepares Rhode Island students to be the first in their families to attend and complete college. As a provider of the U.S. Department of Education's GEAR UP program, Onward We Learn needs to understand the social-emotional development indicators that support college enrollment and persistence. This study uses factor analysis to examine Grade 11 student responses (n = 113) to the College-Going Self-Efficacy Scale (Gibbons & Borders, 2010), which measures students' beliefs in their ability to successfully attend college. Data were screened for normality and sampling adequacy, with results supporting factor analysis (KMO = .87; Bartlett's χ² = 1832.02, p < .001). Maximum likelihood factor analysis with oblique rotation identified a three-factor structure accounting for 43.01% of the variance. A preliminary confirmatory factor analysis (CFA) was also conducted to evaluate the fit of the retained three-factor model within the study sample. Findings highlight key dimensions of perceived student abilities relevant for strengthening confidence and readiness for post-secondary transitions as well as degree completion. Given high immediate college enrollment (73%) and second-year persistence (71%) rates of Onward students, understanding students' beliefs in their ability to enroll and succeed in college is critical for program accountability and continuous improvement.

Keywords

Exploratory Factor Analysis

Confirmatory Factor Analysis

Psychometrics

College Going Self-Efficacy

Postsecondary Readiness

Construct Validity 

Speaker

Erin R. Twomey-Wilson, Onward We Learn

An application of Bayesian joint modeling to develop a new personalized VA Women CVD risk score

Precision medicine applies advanced statistical modeling that integrates diverse data, such as polygenic risk scores (PRS), electronic health records data, and longitudinal biomarkers, to improve the accuracy of disease prediction. An appropriate framework for linking longitudinal biomarkers and female-specific conditions from electronic health records with time-to-event outcomes is joint modeling. However, its application in prediction modeling has been limited to small data sets. The current study is one of the first to apply Bayesian joint models (Rizopoulos et al., 2016) to the large, representative, and comprehensive Veterans Affairs (VA) Health Care System records, biomarkers, female sex-specific conditions, and cardiovascular disease (CVD) PRS to develop a new personalized VA Women CVD risk score. The new joint models include multiple submodels (a time-varying Cox model and general linear models for longitudinal biomarkers and female sex-specific conditions) linked via time of visit. The new personalized VA Women CVD risk score improved accuracy in predicting CVD events (change in C-statistic of +0.05) compared with the original VA Women CVD risk score (Jeon-Slaughter et al., 2021).

Keywords

Bayesian Joint Model application to a large-scale data set

Precision Medicine

Prediction model

Women's Health

Clinical decision making

VA Women CVD risk score 

Speaker

Haekyung Jeon-Slaughter, University of Texas Southwestern Medical Center

Co-Author(s)

Callum Doyle, Southern Methodist University
Sy Han Chiou
Xiaofei Chen
Erum Whyne, VA North Texas Health Care System
MinJae Lee, UTHealth-Houston

Connecting the Data: Population-Scale Linkage to Study Prescription Stimulants and Overdose Risk

Background: Evaluating the population-level impact of prescription stimulants on overdose risk requires longitudinal data spanning prescribing, healthcare utilization, and mortality; these sources are rarely integrated at scale. Legal, administrative, and technical barriers often limit linkage of prescription drug monitoring program (PDMP) data with claims, hospital, emergency department, and vital records. We describe a probabilistic linkage framework developed to support an NIH-funded study of prescription stimulants and overdose outcomes.

Linkage Methods: We used fastLink to probabilistically link records across the administrative health data sets. Individuals were assigned persistent, study-specific anonymous identifiers, enabling longitudinal follow-up across insurance transitions and care settings.

Conclusion: The linked data support time-varying definitions of stimulant exposure and capture fatal and non-fatal overdose outcomes, including polysubstance involvement. This framework demonstrates a scalable, reproducible approach for linking administrative health data to support pharmacoepidemiologic surveillance of controlled substances beyond opioids.

Keywords

Probabilistic linkage

PDMP

all-payer claims databases

prescription stimulant

overdose surveillance

population health analytics 

Speaker

Sanae El Ibrahimi, Comagine Health

Co-Author(s)

Mary Gary, Comagine Health
Kendra Blalock, Comagine Health
Carson Deahl, Comagine Health
Ryan Zamora, McLean Hospital, Harvard Medical School
Yeaonsoo Park, Data Science and Computational Medicine, Simches Division of Child and Adolescent
Alessandro De Nadai, McLean Hospital/Harvard Medical School

Early and Late Birds: Response Timing Patterns as Predictors of Panel Attrition in Online Surveys

To improve the effectiveness of field management strategies and to identify optimal field timing, it is crucial to understand the differences between early responders (early birds) and those who only respond after receiving reminders (late birds).
This study examines whether systematic biases exist in early and late birds, how their response behavior influences panel participation, and whether their data quality differs (e.g., item nonresponse, speeding).
We analyze 20,817 respondents, aged 18 to 65, across eight waves of a push-to-web survey of the German labor force (IAB-OPAL).
Using a discrete hazard model, we investigate the relationship between days until response to the survey and panel dropout risk.
Preliminary findings indicate that late birds exhibit a significantly higher dropout risk in subsequent waves. A nonlinear relationship exists between the number of panel participations and dropout probability. Our findings have practical implications for reducing panel attrition and systematic bias, improving data quality and optimizing field management. 
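A discrete-time hazard model is usually fit to person-period data, with one row per respondent per wave at risk and an indicator for dropout in that wave. The sketch below builds that layout and tabulates raw dropout proportions by responder type and wave. The field names and toy records are hypothetical, and the regression step the authors use is not shown.

```python
from collections import defaultdict


def person_period(respondents):
    """Expand each respondent into one row per wave in which they were at risk.

    'last_wave' is the last wave at risk; 'dropped_out' says whether the
    respondent dropped out at that wave (True) or was censored (False).
    """
    rows = []
    for r in respondents:
        for wave in range(1, r["last_wave"] + 1):
            event = int(r["dropped_out"] and wave == r["last_wave"])
            rows.append({"id": r["id"], "wave": wave, "late": r["late"], "event": event})
    return rows


def empirical_hazard(rows):
    """Dropouts divided by respondents at risk, per (late, wave) cell."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for row in rows:
        key = (row["late"], row["wave"])
        at_risk[key] += 1
        events[key] += row["event"]
    return {key: events[key] / at_risk[key] for key in at_risk}


# Hypothetical respondents (not IAB-OPAL data).
respondents = [
    {"id": 1, "late": False, "last_wave": 3, "dropped_out": False},
    {"id": 2, "late": True, "last_wave": 2, "dropped_out": True},
    {"id": 3, "late": True, "last_wave": 3, "dropped_out": False},
    {"id": 4, "late": False, "last_wave": 1, "dropped_out": True},
]

rows = person_period(respondents)
hazard = empirical_hazard(rows)
```

A binary regression of `event` on wave indicators and response timing, fit to `rows`, would then give a discrete-time hazard model; the link function and covariates are left open here because the abstract does not specify them.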

Keywords

Response timing in online survey

Panel attrition

Data quality

Field management

Bias

Item non-response 

Speaker

Valentina Prospero, Institute for Employment Research (IAB) of the Federal Employment Agency (BA) - Research Department Panel Study Labour Market and Social Security (PASS)

Co-Author(s)

Mustafa Coban, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS
Christine Distler, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS
Marcel Müller, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS

Estimating Customer Wallet Size and Allocation without Surveys: Evidence from E-Commerce Transaction

A single company typically supplies only a fraction of a customer's total demand, making both Size-of-Wallet (SioW) and Share-of-Wallet (SoW) unobservable from the firm's perspective. Existing studies often rely on survey data in which customers self-estimate their Share-of-Wallet, an approach that is impractical for scalable and periodic estimation. This study proposes a survey-free methodology based on statistical modeling and machine learning to estimate SoW and SioW directly from transactional data. The approach infers latent wallet size and allocation behavior without customer self-reports and is illustrated using purchase data from Adidas customers on the Amazon platform. 

Keywords

Share-of-Wallet

Size-of-Wallet

Machine Learning

Latent variable

Pogit model

Customer Analytics 

Speaker

Sandra Ramirez, Pontificia Universidad Javeriana Cali

Co-Author(s)

Iván Gutiérrez, Universidad Andrés Bello, Santiago, Chile
Leonardo Jofré, Pontificia Universidad Católica de Chile

Estimating network clustering with a latent space model from Respondent-Driven Sampling data

The social network structure of hidden and hard-to-reach populations can have important implications for epidemiology and public health. However, collecting full or even moderately dense network data is typically infeasible due to the lack of a sampling frame, privacy concerns, and limited resources. Respondent-Driven Sampling (RDS), which leverages a chain-referral pattern, is widely used in these cases, but standard RDS data provide only tree-structured network data. To address this challenge, we incorporated a recently developed token-based strategy that supplements coupon referral with token distribution, yielding additional observed ties. Inspired by work using aggregated relational data (ARD), we propose a latent space model to effectively describe the network structure. We use the fitted model to predict unobserved ties and estimate network statistics from RDS data, with particular focus on network clustering, an important network feature that cannot be recovered from standard RDS data. This work is motivated by and applied to a rich dataset of RDS samples collected among People Who Inject Drugs across 17 sites in Kenya in the TLC-IDU study (NCT01557998). 

Keywords

Respondent-Driven Sampling

RDS

Hard-to-reach population

Social networks

Network clustering

Latent space model 

Speaker

Yun Jiang, University of Massachusetts Amherst

Co-Author(s)

Krista Gile, University of Massachusetts Amherst
Mercy Nyakowa, Kenya Ministry of Health
Daniel Fedha, Kenya Ministry of Health
Hannah Manley, Albert Einstein College of Medicine
Lindsey Riback, Albert Einstein College of Medicine
Matthew Akiyama, Albert Einstein College of Medicine

Exposure to Personal Care Products and Endocrine-Disrupting Chemicals in SELF

Exposure to endocrine-disrupting chemicals (EDCs), including parabens, phthalates, and phenols, through personal care products (PCPs) has been linked to adverse health outcomes. Using data from the Study of Environment, Lifestyle and Fibroids (n=434), we conducted an exploratory analysis of 31 urinary EDC biomarkers and self-reported recent (24-hour) and long-term (12-month) PCP use. Associations were analyzed using log-normal accelerated failure time models including season and product-by-season interactions. Recent and long-term use of nail, skincare, and makeup products was associated with higher urinary paraben and phenol concentrations. Sunscreen use was associated with benzophenone-3, particularly in summer compared with winter, reflecting seasonal use. Overall, use of several PCPs was associated with higher urinary EDC concentrations, with associations influenced by season and frequency of use. This work was supported by the NIH and NIEHS. Contributions by NIH authors are Works of the United States Government. The findings and conclusions do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.

Keywords

Epidemiology

Exploratory analysis

Endocrine disrupting chemicals

Personal care products

Season analysis

Accelerated failure time model 

Speaker

Angela Jeffers, DLH

Co-Author(s)

Caroll Co, DLH
Lauren A. Wise, Department of Epidemiology, Boston University School of Public Health
Samantha Schildroth, Department of Epidemiology, Boston University School of Public Health
Amelia K. Wesselink, Department of Epidemiology, Boston University School of Public Health
Traci Bethea, Georgetown Lombardi Comprehensive Cancer Center, Georgetown University Medical Center
Anne Marie Jukic, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Quaker E. Harmon, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Donna D. Baird, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Kyla W. Taylor, Division of Translational Toxicology, National Institute of Environmental Health Sciences

Inference for Mean Differences for Skewed Responses: Psychological Well-Being in College Students

The Diener Subjective Psychological Well-Being Scale is commonly used in the social sciences to assess self-perceived psychological flourishing. The scale ranges from 8 to 56, with higher scores indicating greater well-being. The goal of this study was to examine potential differences in mean well-being scores associated with time spent on unpaid caregiving responsibilities among college students in Florida. Prior studies have either used normal approximations for the scale distribution or treated it as a binary outcome. In our study, using the 2025 American College Health Association-National College Health Assessment data, the scale had a left-skewed distribution. Several data transformations and models were explored, and the final model was a generalized linear regression with a Gamma distribution and log link function fitted to the reflected well-being score. The model controlled for students' sociodemographic and academic characteristics. Analyses were performed in SAS® 9.4 software. The model-based confidence interval indicated that the mean well-being score was higher for students who spent more time on caregiving compared with students who spent less time. 
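Reflecting a left-skewed score turns it into a positive, right-skewed quantity that a Gamma model with a log link can accommodate. The sketch below shows one common reflection and the back-transformation of a fitted mean. The reflection constant (scale maximum plus one) is an assumption, since the abstract does not state which constant was used.

```python
import math

SCALE_MIN, SCALE_MAX = 8, 56          # scale range stated in the abstract
REFLECT_CONSTANT = SCALE_MAX + 1      # assumed constant; keeps reflected scores >= 1


def reflect(score: float) -> float:
    """Map a well-being score to its reflected value (high well-being -> small value)."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError("score outside the 8-56 scale range")
    return REFLECT_CONSTANT - score


def mean_on_original_scale(linear_predictor: float) -> float:
    """Undo the log link and the reflection for a fitted linear predictor."""
    return REFLECT_CONSTANT - math.exp(linear_predictor)
```

Under this setup the sign of a coefficient flips in interpretation: a positive coefficient raises the mean of the reflected score and therefore lowers the mean well-being score.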

Keywords

Skewed outcomes

Generalized linear model

Gamma regression

Log link

NLMEANS macro 

Speaker

Julia Soulakova, University of Central Florida College of Medicine

Co-Author(s)

Kevin Zou, University of Central Florida
Mary Schmidt-Owens, University of Central Florida College of Medicine
Joanna Mackie, University of Central Florida College of Medicine

Measurement and development of digital platform work

Digital platform work is an increasingly important component of labour markets worldwide, yet its measurement poses statistical challenges due to heterogeneous work arrangements and rapidly evolving platforms. Reliable and comparable statistics are essential for understanding the scale, characteristics, and dynamics of this form of employment.

Since 2016, Singapore's Ministry of Manpower (MOM) has compiled official statistics on digital platform employment using labour force surveys and administrative data. MOM has also collaborated with the International Labour Organization (ILO) to design a standardised measurement framework for digital platform employment, aimed at improving conceptual clarity and cross-country comparability. Building on this work, MOM and ILO convened a Global Dialogue on Digital Platform Work in 2025, bringing together statisticians, policymakers and platform operators to discuss measurement approaches, data gaps, and implementation issues.

The paper presents the concepts, processes and challenges underpinning these initiatives, and illustrates how international collaboration can strengthen the production of official statistics on digital platform work. 

Keywords

digital platform work

official statistics

international collaboration 

Speaker

Jeremy Heng

Modeling Adolescent Mental Health, Suicide Risk, and Academic Performance Using YRBSS 2023 Data

Adolescent mental health, suicidal ideation, and academic performance are shaped by complex behavioral and environmental factors among U.S. teenagers. Using 2023 Youth Risk Behavior Survey data, we examine how feeling unsafe at school, unfair treatment, sexual assault, parent conflict, alcohol and drug use, and physical activity relate to suicidal thoughts, suicide attempts with injury, self-reported mental health, and grades. A central objective of this study is to improve the sensitivity of predictive models in correctly identifying adolescents who report suicidal ideation, experience mental health difficulties, or exhibit poor academic performance, outcomes that are relatively rare yet critically important. We fit logistic regression, decision trees, random forests, and Naive Bayes classifiers to identify important predictors and potential interactions, using logistic models for interpretable associations and tree-based methods to highlight key decision pathways. Because suicidal outcomes are relatively rare, we apply resampling strategies, including SMOTE and distance-based undersampling tailored to binary predictors, and compare their impact on sensitivity, specificity, and accuracy. Our findings demonstrate how the choice of sampling strategy meaningfully shapes model performance, with implications for the reliable identification of at-risk adolescents in survey-based research settings.
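The three evaluation metrics in the abstract can be sketched as follows, to show why accuracy alone can hide poor detection of a rare outcome. The labels are a hypothetical toy example, and the function assumes both classes appear in the true labels.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = rare outcome)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # share of true cases that were flagged
        "specificity": tn / (tn + fp),   # share of non-cases correctly cleared
        "accuracy": (tp + tn) / len(y_true),
    }


# Hypothetical labels: 2 positives among 10 (not YRBSS data).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
some_detection = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

baseline = classification_metrics(y_true, always_negative)
improved = classification_metrics(y_true, some_detection)
```

Both toy classifiers reach the same accuracy, yet only the second identifies any positive case, which is the gap that resampling the training data is meant to address.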

Keywords

Adolescent mental health

Suicidal ideation and behavior

Supervised machine learning

Distance‑based under sampling

Predictive modeling 

Speaker

Sujith Reddy Ganta

Multitask Learning in Tweedie Generalized Linear Models for Insurance Ratemaking

Insurance ratemaking often requires fitting separate generalized linear models for multiple loss outcomes, such as perils or coverage types, leading to duplicated effort and limited insight into dependence across outcomes. We propose a multitask learning framework for jointly modeling multiple insurance loss responses within a single Tweedie generalized linear model. The approach embeds a regularized multivariate Gaussian regression step within the Iteratively Reweighted Least Squares algorithm (IRLS), allowing dependence across responses to be captured while accommodating semicontinuous loss data. An elastic net penalty is incorporated to address correlated predictors and perform variable selection. Through simulation studies, we demonstrate improved predictive performance and parameter recovery relative to fitting independent models, particularly under multicollinearity. An application to reinsurance data illustrates how the proposed framework enhances interpretability of relationships among loss types while substantially reducing computational time and modeling effort. 

Keywords

ratemaking

multiple perils

Tweedie

elastic net

generalized linear model 

Speaker

Melody Denhere, University of Mary Washington

Co-Author(s)

Guy-vanie Miakonkana, Guardian Life Insurance Company
Emmanuel Thompson, Southeast Missouri State University

Statistical Analysis of Crime Occurrence: Crimestatistics

Violent crime remains a major societal concern and a long-standing focus of intervention by the Chicago Police Department (CPD). While aggregated crime rates have generated important findings in quantitative criminology, crime events are inherently localized in space and time. Reliance on fixed administrative units and static population denominators induces substantial bias and instability, particularly for small areas and short time windows. We propose an adaptive estimation approach that addresses the denominator problem by fixing the number of events using an equal-count algorithm rather than a moving spatial window, and by using a Hilbert curve to preserve continuity in multidimensional spatiotemporal space. The resulting estimates are flexible, interpretable, and amenable to downstream modeling and visualization. The method is applied to CPD's violence-reduction dashboard data, particularly shootings and ShotSpotter detections. Results reveal clear seasonal patterns and high-resolution heat maps of mass shooting events. This approach provides a principled, reproducible alternative for high-resolution crime-rate estimation with broad applicability to spatiotemporal event data. 
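The combination of a Hilbert curve and equal-count grouping can be sketched in two dimensions: each event is mapped to its position along the curve, events are sorted by that position, and consecutive runs of k events form the groups. This is only an illustration under simplifying assumptions (a 2-D integer grid whose side is a power of two, and hypothetical event coordinates); the authors' spatiotemporal version and their rate calculation are not specified in the abstract.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two; this is the standard iterative conversion.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def equal_count_groups(events, n: int, k: int):
    """Sort events along the curve and split them into consecutive groups of k.

    The last group may hold fewer than k events.
    """
    ordered = sorted(events, key=lambda p: hilbert_index(n, p[0], p[1]))
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]


# Hypothetical event locations on a 4-by-4 grid.
events = [(0, 0), (3, 0), (1, 1), (0, 3), (2, 2), (3, 3)]
groups = equal_count_groups(events, n=4, k=2)
```

Because neighboring positions on the curve are neighboring cells on the grid, each group tends to cover a compact region, and the region adapts in size to how dense the events are.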

Keywords

Spatiotemporal point processes

Adaptive rate estimation

Equal-count algorithms

Spatial indexing (Hilbert curve) 

Speaker

Yining Ding, Purdue University

Co-Author(s)

William Cleveland, Purdue University
Wen-wen Tung, Purdue University

The Recipe for Response: Data Collection Methodologies of a National Food Diary Survey

The National Household Food Acquisition and Purchase Survey (FoodAPS) is a complex data collection initiative with high respondent burden. The survey begins with a primary household respondent but then, where applicable, expands to invite additional household members. Historically, the survey captures seven days of food acquisition data from all members of a household, in addition to extensive demographics of the household and a debriefing survey.
To optimize data collection, we experimentally implemented an alternative incentive structure and a shorter reporting period. Our analyses assess the causal impact of these interventions on response rates, data completeness and quality, and survey cost. Additionally, we examine the impacts of the non-experimental use of proxy reporting throughout the survey design. This research aims to provide practical guidance for survey methodologists on optimizing incentive strategies, determining ideal reporting length, and effectively utilizing proxy reporting to enhance data quality and participation in complex diary surveys. 

Keywords

survey methodology

experimental design

data quality

diary survey

proxy reporting

respondent burden 

Speaker

Kayla Higgins, U.S. Census Bureau

Co-Author(s)

Christine Bottini, U.S. Census Bureau
Joseph Rodhouse, USDA Economic Research Service (ERS)

Using Community Learning Through Data-Driven Discovery to Monitor Substance Use in Rural Communities

Rural communities often lack public health surveillance systems due to small populations, limited staffing and resources, unstable rates, and data suppression rules, despite being significantly impacted by substance use outbreaks. Using the Community Learning Through Data-Driven Discovery (CLD3) framework, we partner with rural stakeholders to co-identify priority substance use concerns, relevant data sources, and feasible strategies for local monitoring and prevention.

Our presentation describes CLD3 implementation in data-constrained rural contexts, including integrating nontraditional and administrative data, interpreting trends under uncertainty, and converting insights into actionable local knowledge. We highlight how participatory data design and interpretation address statistical limitations while enhancing the legitimacy and usability of surveillance outputs.

This work provides rural communities with a sustainable, community-governed approach to substance use monitoring that supports prevention, treatment, and recovery, aligns with local data capacity and literacy, and offers statisticians practical insights into community-engaged decision-making with sparse data. 

Keywords

Community Learning Through Data-Driven Discovery

Rural communities

Substance misuse

Data visualization

Administrative data 

Speaker

Matthew Voss, Iowa State University

Co-Author(s)

Shawn Dorius, Iowa State University
Kelsey Van Selous, Iowa State University

Weighted Causal Forests with Complex Survey Data for Heterogeneous Treatment Effect Estimation

Causal forests (CFs) have been recently developed to estimate heterogeneous treatment effects using observational data. However, their application in survey studies, particularly population-based surveys with complex designs, has not been evaluated. To address this gap, we develop a weighted CF (wCF) framework by embedding a composite weight that incorporates the propensity score (PS) and accounts for survey design features. We conduct extensive simulations to compare wCF with two other methods: an unweighted CF that ignores survey design features, and a naïve weighted CF that incorporates sampling weights but does not account for the other design features. We consider a range of scenarios by varying the degrees of model misspecification, intra-class correlation among observations, and PS overlap. Method performance is evaluated using the average out-of-sample mean squared error and coverage probability. Using data from the Medicare Current Beneficiary Survey (2018–2022), we further illustrate the application of wCF by examining the impact of financial hardship on hospice enrollment among US older adults (≥65 years) with serious illness.

Keywords

Causal forest

Complex survey data

Machine Learning

Propensity Score

Medicare Current Beneficiary Survey 

Speaker

Chen Yang, Icahn School of Medicine at Mount Sinai

Co-Author(s)

Bian Liu, Icahn School of Medicine at Mount Sinai
Madhu Mazumdar, Icahn School of Medicine at Mount Sinai
Lihua Li, Icahn School of Medicine at Mount Sinai

When Trends Change: Joinpoint Regression in Public Health Surveillance

Joinpoint regression fits segmented linear models to time-series data, identifying "joinpoints" where trends change significantly. We applied it to annual stimulant prescription rates in Oregon (2014–2022). Trends were summarized with segment-specific annual percent change (APC) and average APC (AAPC).
We detected several significant joinpoints. For example, a joinpoint was detected in 2020, when rates shifted from moderate (APC ≈5%) to accelerated (>13%), with subgroup APCs >20% among young/middle-aged adults and stimulant-naïve patients. Average prescriptions per patient remained stable, showing growth was driven by new initiations.
This approach is valuable for noisy public health data: it identifies meaningful shifts invisible to simple linear trends, quantifies speed of change, and helps link trends to policy, clinical, or social events. Our session will describe this application, compare it to other contemporary piecewise and spline regression approaches, and illustrate its utility in pharmacoepidemiology and public health surveillance. 
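The two summary measures follow from the slopes of the log-linear segments: APC is 100 × (exp(slope) − 1), and AAPC applies the same transformation to the average of the segment slopes weighted by segment length. A minimal sketch, with hypothetical slopes and segment lengths chosen only to resemble the magnitudes mentioned above:

```python
import math


def apc(slope: float) -> float:
    """Annual percent change implied by a slope of log(rate) on year."""
    return 100.0 * (math.exp(slope) - 1.0)


def aapc(segments) -> float:
    """Average APC from (slope, length_in_years) pairs.

    Slopes are averaged with weights equal to segment length, then
    transformed back to a percent change.
    """
    total_years = sum(length for _, length in segments)
    mean_slope = sum(slope * length for slope, length in segments) / total_years
    return 100.0 * (math.exp(mean_slope) - 1.0)


# Hypothetical segments: about 5% per year for 6 years, then about 13% for 2 years.
segments = [(math.log(1.05), 6), (math.log(1.13), 2)]
overall = aapc(segments)
```

Averaging on the log scale means the AAPC always lies between the smallest and largest segment APCs, and it equals the APC when there is a single segment.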

Keywords

Joinpoint regression

Segmented trend detection

Public health research 

Speaker

Sanae El Ibrahimi, Comagine Health

Co-Author(s)

Mary Gary, Comagine Health
Kendra Blalock, Comagine Health
Carson Deahl, Comagine Health
Ryan Zamora, McLean Hospital, Harvard Medical School
Alessandro De Nadai, McLean Hospital/Harvard Medical School

Who is the Statistical Consultant? An Analysis of the CNSL Directory

In 2026, the ASA Statistical Consulting Section announced it is redesigning its Consultant Directory. Before any data was lost, I scraped all 949 active profiles from the site, containing affiliations, skills, specializations and languages.

These data were analyzed using k-medoids to identify archetypical consultants. This allows the Section to better understand its membership, and also provides a map of specializations in statistical consulting as a distinct profession. The silhouette criterion suggested an unusually high number of clusters (k=23) relative to the size of the data set.

Investigating further, I calculated the second-nearest medoid for each profile and found that only 10 of the 23 occurred, revealing a many-to-many relationship between primary and secondary specialties. Key archetypes identified include a dominant Academic Biostatistician group (84% PhD) focused on clinical trials, and distinct specialties such as Industrial Quality Engineers (Six Sigma) and Private-Sector Survey Specialists. These results suggest a professional divide between academics and independent practitioners; a one-size-fits-all directory may fit none.
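The second-nearest-medoid check can be sketched as follows: rank the medoids by distance from each profile, keep the two closest, and count how many distinct medoids ever appear in second place. The Jaccard distance on skill sets and the toy profiles are assumptions for illustration; the abstract does not state which dissimilarity was used.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union|; 0 for identical sets, 1 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def two_nearest_medoids(profile: set, medoids: list) -> tuple:
    """Indices of the nearest and second-nearest medoid for one profile."""
    ranked = sorted(range(len(medoids)), key=lambda i: jaccard_distance(profile, medoids[i]))
    return ranked[0], ranked[1]


# Hypothetical medoid profiles and consultant profiles (not directory data).
medoids = [
    {"clinical trials", "sas"},
    {"six sigma", "quality"},
    {"surveys", "sampling"},
]
profiles = [
    {"clinical trials", "sas", "surveys"},
    {"six sigma", "quality", "sampling"},
    {"surveys", "sampling", "sas"},
]

assignments = [two_nearest_medoids(p, medoids) for p in profiles]
second_choices = {second for _, second in assignments}
```

In this toy case only two of the three medoids ever occur as a second choice, which mirrors the kind of asymmetry the abstract reports at a larger scale.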

Keywords

Clustering

Statistical Consulting 

Speaker

Neal Fultz