Tuesday, Aug 4: 8:30 AM - 10:20 AM
7108
Contributed Speed
Thomas M. Menino Convention & Exhibition Center
Room: CC-102B
Presentations
Accurate estimation of individual medical costs is a cornerstone of business analytics in the insurance industry; however, the relationship between demographic factors and actual expenditures is often non-linear. This study uses a publicly available medical cost dataset containing health indicators such as age, BMI, and smoking status to evaluate the predictive performance of three models. We compare a Multiple Linear Regression model, a Random Forest, and a multi-layered Neural Network to determine whether deep learning models provide a statistically significant improvement in Mean Absolute Error (MAE) and R-squared values over traditional frequentist approaches. While linear models offer high interpretability, they may fail to capture the non-linear cost interactions between high BMI and smoking status that are better explained by non-linear models. The findings aim to provide a practical framework for selecting the most efficient model for predicting health costs based on personal demographics, while balancing the trade-offs between model complexity and interpretability.
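The two comparison metrics named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the cost values and the names `actual`, `linear_pred`, and `forest_pred` are hypothetical, chosen only to show how MAE and R-squared are computed on held-out predictions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted costs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of outcome variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out costs and predictions (not data from the study).
actual = [1000.0, 2000.0, 3000.0, 4000.0]
linear_pred = [1500.0, 1500.0, 3500.0, 3500.0]
forest_pred = [1100.0, 1900.0, 3100.0, 3900.0]

for name, pred in [("linear", linear_pred), ("forest", forest_pred)]:
    print(name, mean_absolute_error(actual, pred), r_squared(actual, pred))
```

A lower MAE and a higher R-squared on the same held-out set favor a model on accuracy alone; whether a gap is statistically significant needs a separate test, for example across cross-validation folds.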
Keywords
Regression Analysis
Machine Learning
Healthcare Analytics
Neural Networks
Business Intelligence
Leontief's input-output model is widely used in economics to predict the impact of a sector-specific demand shock on other sectors of the economy. This model was later reformulated by ten Raa and Mohnen to align with neoclassical assumptions regarding production functions. However, this revised version omits natural resources, which introduces biases in the model's technological coefficients matrix. This paper corrects the ten Raa-Mohnen model and adapts it to the case of an economy dependent on the provision of water for agriculture, where supply is essentially random. The new model is formulated as a stochastic quadratic programming problem that maximizes value added across all sectors of the economy, except for the agricultural sector, whose output is treated as exogenous and stochastic. The objective function also penalizes production in sectors with highly variable prices. The program is subject to various constraints - such as those from the original Leontief model itself and labor and capital endowments - which are satisfied not with certainty but with a given probability. Finally, the proposed model is calibrated using real data from Argentina's System of National Accounts.
Keywords
input-output model
stochastic programming
agriculture
natural resources
ten Raa-Mohnen model
Argentina
Onward We Learn prepares Rhode Island students to be the first in their families to attend and complete college. Because the organization is a provider of the U.S. Department of Education's GEAR UP program, it is critical to understand the social-emotional development indicators that support college enrollment and persistence. This study uses factor analysis to examine Grade 11 student responses (n = 113) to the College-Going Self-Efficacy Scale (Gibbons & Borders, 2010), which measures students' beliefs in their ability to successfully attend college. Data were screened for normality and sampling adequacy, with results supporting factor analysis (KMO = .87; Bartlett's χ² = 1832.02, p < .001). Maximum likelihood factor analysis with oblique rotation identified a three-factor structure accounting for 43.01% of the variance. A preliminary confirmatory factor analysis (CFA) was also conducted to evaluate the fit of the retained three-factor model within the study sample. Findings highlight key dimensions of perceived student abilities relevant for strengthening confidence and readiness for post-secondary transitions as well as degree completion. Given the high immediate college enrollment (73%) and second-year persistence (71%) rates of Onward students, understanding students' beliefs in their ability to enroll and succeed in college is critical for program accountability and continuous improvement.
Keywords
Exploratory Factor Analysis
Confirmatory Factor Analysis
Psychometrics
College Going Self-Efficacy
Postsecondary Readiness
Construct Validity
Precision medicine applies advanced statistical modeling that integrates diverse data, such as polygenic risk scores (PRS), electronic health records data, and longitudinal biomarkers, to improve the accuracy of disease prediction models. Joint modeling is an appropriate framework for linking longitudinal biomarkers and female-specific conditions from electronic health records with time-to-event outcomes. However, its application in prediction modeling has been limited to small data sets. The current study is one of the first to apply Bayesian joint models (Rizopoulos et al., 2016) to the large, representative, and comprehensive Veterans Affairs (VA) Health Care System records, biomarkers, female sex-specific conditions, and cardiovascular disease (CVD) PRS, and to develop a new personalized VA Women CVD risk score. The new joint models include multiple submodels (a time-varying Cox model and general linear models for longitudinal biomarkers and female sex-specific conditions) linked via time of visit. The new personalized VA Women CVD risk score improved model accuracy in predicting CVD events (Δ C statistic +0.05) compared with the original VA Women CVD risk score (Jeon-Slaughter et al., 2021).
Keywords
Bayesian Joint Model application to a large-scale data set
Precision Medicine
Prediction model
Women's Health
Clinical decision making
VA Women CVD risk score
Background: Evaluating the population-level impact of prescription stimulants on overdose risk requires longitudinal data spanning prescribing, healthcare utilization, and mortality, sources that are rarely integrated at scale. Legal, administrative, and technical barriers often limit linkage of prescription drug monitoring program (PDMP) data with claims, hospital, emergency department, and vital records. We describe a probabilistic linkage framework developed to support an NIH-funded study of prescription stimulants and overdose outcomes.
Linkage Methods: We used fastLink to probabilistically link records across the administrative health data sets. Individuals were assigned persistent, study-specific anonymous identifiers, enabling longitudinal follow-up across insurance transitions and care settings.
Conclusion: The linked data support time-varying definitions of stimulant exposure and capture fatal and non-fatal overdose outcomes, including polysubstance involvement. This framework demonstrates a scalable, reproducible approach for linking administrative health data to support pharmacoepidemiologic surveillance of controlled substances beyond opioids.
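Probabilistic linkage of the kind described above is commonly built on Fellegi-Sunter match weights. The sketch below is illustrative only: the field names and the m- and u-probabilities in `FIELD_PROBS` are hypothetical, and it does not reproduce the interface of the linkage software used in the study, which would typically estimate these probabilities from the data.

```python
import math

# Hypothetical per-field probabilities (assumptions, not study estimates):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birth_date": (0.98, 0.005),
    "zip_code": (0.90, 0.05),
}


def match_weight(agreement):
    """Sum of log2 likelihood ratios over fields for one record pair."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if agreement[field]:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement counts against
    return total


all_agree = match_weight({"last_name": True, "birth_date": True, "zip_code": True})
zip_differs = match_weight({"last_name": True, "birth_date": True, "zip_code": False})
none_agree = match_weight({"last_name": False, "birth_date": False, "zip_code": False})
print(all_agree, zip_differs, none_agree)
```

Pairs with a high total weight are treated as links and pairs with a low weight as non-links. A changed ZIP code lowers the weight without ruling out a match, which is what allows follow-up across moves and insurance transitions.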
Keywords
Probabilistic linkage
PDMP
all-payer claims databases
prescription stimulant
overdose surveillance
population health analytics
To improve the effectiveness of field management strategies and to identify optimal field timing, it is crucial to understand the differences between early responders (early birds) and those who only respond after receiving reminders (late birds).
This study examines whether systematic biases exist in early and late birds, how their response behavior influences panel participation, and whether their data quality differs (e.g., item nonresponse, speeding).
We analyze 20,817 respondents, aged 18 to 65, across eight waves of a push-to-web survey of the German labor force (IAB-OPAL).
Using a discrete hazard model, we investigate the relationship between days until response to the survey and panel dropout risk.
Preliminary findings indicate that late birds exhibit a significantly higher dropout risk in subsequent waves. A nonlinear relationship exists between the number of panel participations and dropout probability. Our findings have practical implications for reducing panel attrition and systematic bias, improving data quality and optimizing field management.
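The quantity at the center of a discrete hazard model is the probability of dropping out at a given wave among those still in the panel. The sketch below computes a simple life-table estimate of that hazard by group. It is an illustration under stated assumptions: the records are made up, the early/late grouping is a simplification of "days until response", and the study's own model would additionally include covariates.

```python
def discrete_hazard(records, waves):
    """Life-table hazard per wave.

    records: list of (wave, dropped) pairs. `wave` is the wave at which the
    respondent dropped out (dropped=True) or was last observed while still
    participating (dropped=False, i.e. censored).
    """
    hazard = {}
    for w in waves:
        at_risk = sum(1 for wave, _ in records if wave >= w)
        events = sum(1 for wave, dropped in records if wave == w and dropped)
        hazard[w] = events / at_risk if at_risk else float("nan")
    return hazard


# Hypothetical panel histories (not IAB-OPAL data).
early_birds = [(2, True), (3, True), (4, False), (4, False), (4, False), (4, False)]
late_birds = [(2, True), (2, True), (3, True), (4, False), (4, False), (4, False)]

early_hazard = discrete_hazard(early_birds, waves=[2, 3, 4])
late_hazard = discrete_hazard(late_birds, waves=[2, 3, 4])
print(early_hazard)
print(late_hazard)
```

In a regression version of this idea, each respondent contributes one row per wave at risk and the dropout indicator is modeled with a logit or complementary log-log link.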
Keywords
Response timing in online survey
Panel attrition
Data quality
Field management
Bias
Item non-response
Speaker
Valentina Prospero, Institute for Employment Research (IAB) of the Federal Employment Agency (BA) - Research Department Panel Study Labour Market and Social Security (PASS)
Co-Author(s)
Mustafa Coban, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS
Christine Distler, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS
Marcel Müller, Institut für Arbeitsmarkt- und Berufsforschung (IAB) der Bundesagentur für Arbeit (BA) - Bereich PAS
A single company typically supplies only a fraction of a customer's total demand, making both Size-of-Wallet (SioW) and Share-of-Wallet (SoW) unobservable from the firm's perspective. Existing studies often rely on survey data in which customers self-estimate their Share-of-Wallet, an approach that is impractical for scalable and periodic estimation. This study proposes a survey-free methodology based on statistical modeling and machine learning to estimate SoW and SioW directly from transactional data. The approach infers latent wallet size and allocation behavior without customer self-reports and is illustrated using purchase data from Adidas customers on the Amazon platform.
Keywords
Share-of-Wallet
Size-of-Wallet
Machine Learning
Latent variable
Pogit model
Customer Analytics
The social network structure of hidden and hard-to-reach populations can have important implications for epidemiology and public health. However, collecting full or even moderately dense network data is typically infeasible due to the lack of a sampling frame, privacy concerns, and limited resources. Respondent-Driven Sampling (RDS), which leverages a chain-referral pattern, is widely used in these cases, but standard RDS data provide only tree-structured network data. To address this challenge, we incorporated a recently developed token-based strategy that supplements coupon referral with token distribution, yielding additional observed ties. Inspired by work using aggregated relational data (ARD), we propose a latent space model to effectively describe the network structure. We use the fitted model to predict unobserved ties and estimate network statistics from RDS data, with particular focus on network clustering, an important network feature that cannot be recovered from standard RDS data. This work is motivated by and applied to a rich dataset of RDS samples collected among People Who Inject Drugs across 17 sites in Kenya in the TLC-IDU study (NCT01557998).
Keywords
Respondent-Driven Sampling
RDS
Hard-to-reach population
Social networks
Network clustering
Latent space model
Exposure to endocrine-disrupting chemicals (EDCs), including parabens, phthalates, and phenols, through personal care products (PCPs) has been linked to adverse health outcomes. Using data from the Study of Environment, Lifestyle and Fibroids (n=434), we conducted an exploratory analysis of 31 urinary EDC biomarkers and self-reported recent (24-hour) and long-term (12-month) PCP use. Associations were analyzed using log-normal accelerated failure time models including season and product-by-season interactions. Recent and long-term use of nail, skincare, and makeup products was associated with higher urinary paraben and phenol concentrations. Sunscreen use was associated with benzophenone-3, particularly in summer compared with winter, reflecting seasonal use. Overall, use of several PCPs was associated with higher urinary EDC concentrations, with associations influenced by season and frequency of use. This work was supported by the NIH and NIEHS. Contributions by NIH authors are Works of the United States Government. The findings and conclusions do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.
Keywords
Epidemiology
Exploratory analysis
Endocrine disrupting chemicals
Personal care products
Season analysis
Accelerated failure time model
Speaker
Angela Jeffers, DLH
Co-Author(s)
Caroll Co, DLH
Lauren A. Wise, Department of Epidemiology, Boston University School of Public Health
Samantha Schildroth, Department of Epidemiology, Boston University School of Public Health
Amelia K. Wesselink, Department of Epidemiology, Boston University School of Public Health
Traci Bethea, Georgetown Lombardi Comprehensive Cancer Center, Georgetown University Medical Center
Anne Marie Jukic, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Quaker E. Harmon, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Donna D. Baird, Epidemiology Branch, National Institute of Environmental Health Sciences, NIH
Kyla W. Taylor, Division of Translational Toxicology, National Institute of Environmental Health Sciences
The Diener Subjective Psychological Well-Being Scale is commonly used in the social sciences to assess self-perceived psychological flourishing. The scale ranges from 8 to 56, with higher scores indicating greater well-being. The goal of this study was to examine potential differences in mean well-being scores associated with time spent on unpaid caregiving responsibilities among college students in Florida. Prior studies have either used normal approximations for the scale distribution or treated it as a binary outcome. In our study, using the 2025 American College Health Association-National College Health Assessment data, the scale had a left-skewed distribution. Several data transformations and models were explored, and the final model was a generalized linear regression with a Gamma distribution and log link function fitted to the reflected well-being score. The model controlled for students' sociodemographic and academic characteristics. Analyses were performed in SAS® 9.4 software. The model-based confidence interval indicated that the mean well-being score was higher for students who spent more time on caregiving compared with students who spent less time.
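The reflection step mentioned in this abstract can be illustrated as follows. The sketch assumes the common choice of reflecting around the scale maximum plus one, so that a left-skewed score on 8 to 56 becomes a positive, right-skewed value on 1 to 49 suitable for a Gamma model. The abstract does not state the constant the authors used, and the baseline mean and coefficient below are hypothetical.

```python
import math

SCALE_MIN, SCALE_MAX = 8, 56  # score range stated in the abstract


def reflect(score):
    """Flip the scale so high well-being maps to small positive values."""
    return (SCALE_MAX + 1) - score  # assumed reflection constant: max + 1


def unreflect_mean(reflected_mean):
    """Map a fitted mean on the reflected scale back to the original scale."""
    return (SCALE_MAX + 1) - reflected_mean


# With a log link, exp(coefficient) multiplies the mean of the REFLECTED score,
# so a negative coefficient corresponds to HIGHER well-being.
baseline_reflected_mean = 12.0  # hypothetical fitted mean, reference group
caregiving_coef = -0.1          # hypothetical coefficient on the log scale

caregiver_reflected_mean = baseline_reflected_mean * math.exp(caregiving_coef)
print(unreflect_mean(baseline_reflected_mean))
print(unreflect_mean(caregiver_reflected_mean))
```

Because the direction flips under reflection, coefficients should be interpreted after mapping fitted means back to the original 8 to 56 scale.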
Keywords
Skewed outcomes
Generalized linear model
Gamma regression
Log link
NLMEANS macro
Digital platform work is an increasingly important component of labour markets worldwide, yet its measurement poses statistical challenges due to heterogeneous work arrangements and rapidly evolving platforms. Reliable and comparable statistics are essential for understanding the scale, characteristics, and dynamics of this form of employment.
Since 2016, Singapore's Ministry of Manpower (MOM) has compiled official statistics on digital platform employment using labour force surveys and administrative data. MOM has also collaborated with the International Labour Organization (ILO) to design a standardised measurement framework for digital platform employment, aimed at improving conceptual clarity and cross-country comparability. Building on this work, MOM and ILO convened a Global Dialogue on Digital Platform Work in 2025, bringing together statisticians, policymakers and platform operators to discuss measurement approaches, data gaps, and implementation issues.
The paper presents the concepts, processes and challenges underpinning these initiatives, and illustrates how international collaboration can strengthen the production of official statistics on digital platform work.
Keywords
digital platform work
official statistics
international collaboration
Adolescent mental health, suicidal ideation, and academic performance are shaped by complex behavioral and environmental factors among U.S. teenagers. Using 2023 Youth Risk Behavior Survey data, we examine how feeling unsafe at school, unfair treatment, sexual assault, parent conflict, alcohol and drug use, and physical activity relate to suicidal thoughts, suicide attempts with injury, self-reported mental health, and grades. A central objective of this study is to improve the sensitivity of predictive models in correctly identifying adolescents who report suicidal ideation, experience mental health difficulties, or exhibit poor academic performance, outcomes that are relatively rare yet critically important. We fit logistic regression, decision trees, random forests, and Naive Bayes classifiers to identify important predictors and potential interactions, using logistic models for interpretable associations and tree-based methods to highlight key decision pathways. Because suicidal outcomes are relatively rare, we apply resampling strategies, including SMOTE and distance-based undersampling tailored to binary predictors, and compare their impact on sensitivity, specificity, and accuracy. Our findings demonstrate how the choice of sampling strategy meaningfully shapes model performance, with implications for the reliable identification of at‑risk adolescents in survey‑based research settings.
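A small worked example shows why sensitivity, rather than accuracy, is the natural target when the outcome is rare. The confusion-matrix counts below are hypothetical and are not results from the study; they contrast a classifier that never flags anyone with one trained after rebalancing.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that are flagged
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical sample of 1,000 adolescents with 50 true cases (5% prevalence).
never_flags = classification_metrics(tp=0, fn=50, fp=0, tn=950)
after_resampling = classification_metrics(tp=40, fn=10, fp=190, tn=760)

print(never_flags)
print(after_resampling)
```

The first classifier looks excellent on accuracy while missing every case. The second trades some specificity and accuracy for much higher sensitivity, which is the kind of trade-off that resampling strategies such as SMOTE or undersampling are meant to shift.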
Keywords
Adolescent mental health
Suicidal ideation and behavior
Supervised machine learning
Distance‑based under sampling
Predictive modeling
Insurance ratemaking often requires fitting separate generalized linear models for multiple loss outcomes, such as perils or coverage types, leading to duplicated effort and limited insight into dependence across outcomes. We propose a multitask learning framework for jointly modeling multiple insurance loss responses within a single Tweedie generalized linear model. The approach embeds a regularized multivariate Gaussian regression step within the Iteratively Reweighted Least Squares algorithm (IRLS), allowing dependence across responses to be captured while accommodating semicontinuous loss data. An elastic net penalty is incorporated to address correlated predictors and perform variable selection. Through simulation studies, we demonstrate improved predictive performance and parameter recovery relative to fitting independent models, particularly under multicollinearity. An application to reinsurance data illustrates how the proposed framework enhances interpretability of relationships among loss types while substantially reducing computational time and modeling effort.
Keywords
ratemaking
multiple perils
Tweedie
elastic net
generalized linear model
Violent crime remains a major societal concern and a long-standing focus of intervention by the Chicago Police Department (CPD). While aggregated crime rates have generated important findings in quantitative criminology, crime events are inherently localized in space and time. Reliance on fixed administrative units and static population denominators induces substantial bias and instability, particularly for small areas and short time windows. We propose an adaptive estimation approach that addresses the denominator problem by fixing the number of events using an equal-count algorithm rather than a moving spatial window, and by using a Hilbert curve to preserve continuity in multidimensional spatiotemporal space. The resulting estimates are flexible, interpretable, and amenable to downstream modeling and visualization. The method is applied to CPD's violence-reduction dashboard data, particularly shootings and ShotSpotter detections. Results reveal clear seasonal patterns and high-resolution heat maps of mass shooting events. This approach provides a principled, reproducible alternative for high-resolution crime-rate estimation with broad applicability to spatiotemporal event data.
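The combination of a Hilbert curve with equal-count binning can be sketched as below. This is a simplified two-dimensional illustration, whereas the abstract works in spatiotemporal space. The grid size, the event coordinates, and the function names `hilbert_index` and `equal_count_bins` are assumptions made for the example, not the authors' implementation.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n-by-n grid.

    n must be a power of two. Cells that are close along the curve are also
    close in space, which is what makes the ordering useful for binning.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def equal_count_bins(events, n, k):
    """Order events along the curve and cut them into groups of k events."""
    ordered = sorted(events, key=lambda e: hilbert_index(n, e[0], e[1]))
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]


# Hypothetical event locations on a 4-by-4 grid (not CPD data).
events = [(0, 0), (3, 3), (1, 0), (2, 2), (0, 3), (3, 0)]
bins = equal_count_bins(events, n=4, k=2)
print(bins)
```

Each bin holds the same number of events, so a rate can be formed by dividing that fixed count by the area (or area and time) the bin covers, rather than counting a variable number of events in a fixed window.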
Keywords
Spatiotemporal point processes
Adaptive rate estimation
Equal-count algorithms
Spatial indexing (Hilbert curve)
The National Household Food Acquisition and Purchase Survey (FoodAPS) is a complex data collection initiative with high respondent burden. The survey begins with a primary household respondent but then, where applicable, expands to invite additional household members. Historically, the survey captures seven days of food acquisition data from all members of a household, in addition to extensive demographics of the household and a debriefing survey.
To optimize data collection, we experimentally implemented an alternative incentive structure and a shorter reporting period. Our analyses assess the causal impact of these interventions on response rates, data completeness and quality, and survey cost. Additionally, we examine the impacts of the non-experimental use of proxy reporting throughout the survey design. This research aims to provide practical guidance for survey methodologists on optimizing incentive strategies, determining ideal reporting length, and effectively utilizing proxy reporting to enhance data quality and participation in complex diary surveys.
Keywords
survey methodology
experimental design
data quality
diary survey
proxy reporting
respondent burden
Rural communities often lack public health surveillance systems due to small populations, limited staffing and resources, unstable rates, and data suppression rules, despite being significantly impacted by substance use outbreaks. Using the Community Learning Through Data-Driven Discovery (CLD3) framework, we partner with rural stakeholders to co-identify priority substance use concerns, relevant data sources, and feasible strategies for local monitoring and prevention.
Our presentation describes CLD3 implementation in data-constrained rural contexts, including integrating nontraditional and administrative data, interpreting trends under uncertainty, and converting insights into actionable local knowledge. We highlight how participatory data design and interpretation address statistical limitations while enhancing the legitimacy and usability of surveillance outputs.
This work provides rural communities with a sustainable, community-governed approach to substance use monitoring that supports prevention, treatment, and recovery, aligns with local data capacity and literacy, and offers statisticians practical insights into community-engaged decision-making with sparse data.
Keywords
Community Learning Through Data-Driven Discovery
Rural communities
Substance misuse
Data visualization
Administrative data
Causal forests (CFs) were recently developed to estimate heterogeneous treatment effects using observational data. However, their application in survey studies, particularly population-based surveys with complex designs, has not been evaluated. To address this gap, we develop a weighted CF (wCF) framework by embedding a composite weight that incorporates the propensity score (PS) and accounts for survey design. We conduct extensive simulations to compare wCF with two other methods: an unweighted CF that ignores survey design features, and a naïve weighted CF that incorporates sampling weights but does not account for the other design features. We consider a range of scenarios by varying the degrees of model misspecification, intra-class correlation among observations, and PS overlap. Method performance is evaluated using the average out-of-sample mean squared error and coverage probability. Using data from the Medicare Current Beneficiary Survey (2018–2022), we further illustrate the application of wCF by examining the impact of financial hardship on hospice enrollment among US older adults (≥65 years) with serious illness.
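The abstract does not give the form of its composite weight. One construction often used when combining survey data with propensity scores multiplies the sampling weight by an inverse-probability-of-treatment weight, and the sketch below shows that construction only as an assumption. The function name and the numbers are hypothetical.

```python
def composite_weight(sampling_weight, propensity, treated):
    """Sampling weight times inverse-probability-of-treatment weight.

    ASSUMED form, for illustration; the authors' composite weight may differ.
    """
    if not 0.0 < propensity < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    iptw = 1.0 / propensity if treated else 1.0 / (1.0 - propensity)
    return sampling_weight * iptw


# A hypothetical respondent who represents 100 people, with estimated PS = 0.25.
w_treated = composite_weight(100.0, 0.25, treated=True)
w_control = composite_weight(100.0, 0.25, treated=False)
print(w_treated, w_control)
```

Weights of this kind can grow large when the propensity score is near 0 or 1, which is one reason PS overlap is a natural simulation factor.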
Keywords
Causal forest
Complex survey data
Machine Learning
Propensity Score
Medicare Current Beneficiary Survey
Speaker
Chen Yang, Icahn School of Medicine at Mount Sinai
Co-Author(s)
Bian Liu, Icahn School of Medicine at Mount Sinai
Madhu Mazumdar, Icahn School of Medicine at Mount Sinai
Lihua Li, Icahn School of Medicine at Mount Sinai
Joinpoint regression fits segmented linear models to time-series data, identifying "joinpoints" where trends change significantly. We applied it to annual stimulant prescription rates in Oregon (2014–2022). Trends were summarized with segment-specific annual percent change (APC) and average APC (AAPC).
We detected several significant joinpoints. For example, a joinpoint was detected in 2020, when rates shifted from moderate (APC ≈5%) to accelerated (>13%), with subgroup APCs >20% among young/middle-aged adults and stimulant-naïve patients. Average prescriptions per patient remained stable, showing growth was driven by new initiations.
This approach is valuable for noisy public health data: it identifies meaningful shifts invisible to simple linear trends, quantifies speed of change, and helps link trends to policy, clinical, or social events. Our session will describe this application, compare it to other contemporary piecewise and spline regression approaches, and illustrate its utility in pharmacoepidemiology and public health surveillance.
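The APC and AAPC summaries can be sketched once the segments are known. The example below takes the joinpoint location as given, whereas joinpoint software estimates it from the data, and it uses synthetic rates built to grow 5% and then 13% per year. These are not the Oregon data, and the function names are illustrative.

```python
import math


def log_linear_slope(years, rates):
    """Least-squares slope of log(rate) on year within one segment."""
    n = len(years)
    mean_x = sum(years) / n
    logs = [math.log(r) for r in rates]
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx


def annual_percent_change(slope):
    """APC = (exp(slope) - 1) * 100 for a log-linear segment."""
    return (math.exp(slope) - 1.0) * 100.0


def average_apc(slopes_and_widths):
    """AAPC: average the segment slopes weighted by segment length, then transform."""
    total = sum(width for _, width in slopes_and_widths)
    mean_slope = sum(slope * width for slope, width in slopes_and_widths) / total
    return annual_percent_change(mean_slope)


# Synthetic series: +5% per year through 2020, then +13% per year.
years = list(range(2014, 2023))
rates = [100.0 * 1.05 ** (y - 2014) for y in years if y <= 2020]
rates += [rates[-1] * 1.13 ** (y - 2020) for y in years if y > 2020]

slope_a = log_linear_slope(years[:7], rates[:7])  # 2014 to 2020
slope_b = log_linear_slope(years[6:], rates[6:])  # 2020 to 2022
print(annual_percent_change(slope_a), annual_percent_change(slope_b))
print(average_apc([(slope_a, 6), (slope_b, 2)]))
```

The AAPC lies between the two segment APCs and sits closer to the longer segment, which is why it is reported alongside, not instead of, the segment-specific values.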
Keywords
Joinpoint regression
Segmented trend detection
Public health research
In 2026, the ASA Statistical Consulting Section announced it is redesigning its Consultant Directory. Before any data was lost, I scraped all 949 active profiles from the site, containing affiliations, skills, specializations and languages.
These data were analyzed using k-medoids clustering to identify archetypical consultants. This allows the Section to better understand its membership, and also provides a map of specializations in statistical consulting as a distinct profession. The Silhouette criterion suggested an unusually high number of clusters (k=23) relative to the size of the data set.
Investigating further, I calculated the second-nearest medoid for each profile and found that only 10 of the 23 occurred, revealing a many-to-many relationship between primary and secondary specialties. Key archetypes identified include a dominant Academic Biostatistician group (84% PhD) focused on clinical trials, and distinct specialties such as Industrial Quality Engineers (Six Sigma) and Private-Sector Survey Specialists. These results suggest a professional divide between academics and independent practitioners; a one-size-fits-all directory may fit none.
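The second-nearest-medoid check described above can be sketched on a toy example. The one-dimensional positions and the chosen medoids are hypothetical stand-ins for the profile dissimilarities, which in the actual analysis would come from the scraped skills and specializations.

```python
def nearest_two_medoids(point, medoids, distance):
    """Return (nearest, second-nearest) medoid indices for one point."""
    ranked = sorted(medoids, key=lambda m: distance(point, m))
    return ranked[0], ranked[1]


# Toy data: six profiles placed on a line, three of them acting as medoids.
positions = [0, 1, 2, 10, 11, 30]
medoids = [1, 4, 5]  # indices into `positions`


def distance(i, j):
    """Absolute difference between two profiles' positions."""
    return abs(positions[i] - positions[j])


assignments = [nearest_two_medoids(i, medoids, distance) for i in range(len(positions))]
second_nearest = {second for _, second in assignments}
print(assignments)
print(sorted(second_nearest))
```

Here the isolated medoid at index 5 is never anyone's second choice, so only two of the three medoids appear as secondary clusters. That is the same pattern as a subset of the 23 medoids serving as second-nearest for all profiles.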
Keywords
Clustering
Statistical Consulting