Weighting, variance estimation, and modeling

Xin Wang, Chair
San Diego State University
 
Wednesday, Aug 6: 2:00 PM - 3:50 PM
4201 
Contributed Papers 
Music City Center 
Room: CC-Davidson Ballroom A2 

Main Sponsor

Survey Research Methods Section

Presentations

Moving toward best practice when using propensity score weighting in survey observational studies

Propensity score weighting is a common method for estimating treatment effects with survey data. The method is applied to minimize confounding using measured covariates that often differ between individuals in the treatment and control groups. However, the existing literature has not reached a consensus on the optimal use of survey weights for population-level inference in propensity score weighting analyses. Under the balancing weights framework, we provide a unified solution for incorporating survey weights in both the propensity score estimation and the outcome regression model. We derive estimators for different target populations, including the combined, treated, control, and overlap populations. We provide a unified expression of the sandwich variance estimator and demonstrate that the survey-weighted estimator is asymptotically normal, as established through the theory of M-estimators. Through an extensive series of simulation studies, we examine the performance of our derived estimators and compare the results to those of alternative methods. We further carry out two case studies to illustrate the application of the different methods of propensity score analysis.
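As a rough illustration of the balancing weights framework named in the abstract, the sketch below computes weights for the four target populations from a propensity score and forms a difference in weighted means. The tilting functions (1, e, 1 - e, and e(1 - e)) are the standard ones for the combined, treated, control, and overlap populations. Multiplying the balancing weight by the survey weight is an assumption made here for illustration and is not necessarily how the authors combine them; the function names are hypothetical.

```python
def balancing_weight(e, z, target):
    """Balancing weight for one unit with propensity score e and treatment z (1 or 0)."""
    tilt = {
        "combined": 1.0,            # whole population
        "treated": e,               # treated population
        "control": 1.0 - e,         # control population
        "overlap": e * (1.0 - e),   # overlap population
    }[target]
    # Treated units are weighted by tilt / e, controls by tilt / (1 - e).
    return tilt / e if z == 1 else tilt / (1.0 - e)


def weighted_effect(y, z, e, d, target="overlap"):
    """Difference in weighted outcome means (Hajek-type ratio form).

    y: outcomes, z: treatment indicators, e: propensity scores,
    d: survey weights. Combining d and the balancing weight by
    multiplication is an illustrative assumption.
    """
    num1 = den1 = num0 = den0 = 0.0
    for yi, zi, ei, di in zip(y, z, e, d):
        w = di * balancing_weight(ei, zi, target)
        if zi == 1:
            num1 += w * yi
            den1 += w
        else:
            num0 += w * yi
            den0 += w
    return num1 / den1 - num0 / den0
```

With the overlap target, a treated unit receives weight 1 - e and a control unit receives weight e, so units with propensity scores near 0 or 1 contribute little.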

Keywords

complex survey

propensity score weighting


survey weights

overlap weights 

Co-Author(s)

Fan Li, Yale School of Public Health
Yukang Zeng, Yale

First Author

Guangyu Tong, Yale University

Presenting Author

Guangyu Tong, Yale University

Sample-split regression estimation with high dimensional covariates in survey sampling

In finite population survey sampling, model-assisted regression estimation is used to incorporate auxiliary information efficiently. When the auxiliary data are high-dimensional, adding too many auxiliary variables may increase the estimation error and lead to biased estimation. Particularly under informative sampling, the bias of the high-dimensional regression estimator may not be negligible. In this paper, we present a novel application of the sample-split estimation method for regression estimation under informative sampling. The proposed method is shown to be consistent even when the auxiliary variables are high-dimensional and the sampling design is informative. Variance estimation for the sample-split estimator is discussed. Results from a limited simulation study are also presented.

Keywords

Sample-split estimation

Informative sampling

Model-assisted estimation

High-dimensional regression 

Co-Author(s)

Jae-Kwang Kim, Iowa State University
Shu Yang, North Carolina State University, Department of Statistics

First Author

Yonghyun Kwon, Korea Military Academy

Presenting Author

Yonghyun Kwon, Korea Military Academy

Enhancing Iterative Proportional Fitting for Efficient and Scalable Synthetic Population Generation

The Iterative Proportional Fitting (IPF) algorithm is widely used for survey weighting and synthetic population generation. While efficient in low-dimensional settings, IPF struggles with zero-cell issues in sparse contingency tables and becomes computationally infeasible as dimensionality increases. To address these challenges, we propose a block-wise IPF framework that partitions variables into smaller, correlated feature groups, applying IPF independently within each group. Simulation studies and real-world synthetic population experiments demonstrate that this approach significantly improves computational efficiency and scalability in high-dimensional settings while maintaining a reasonable fit to marginal distributions and preserving inter-variable dependencies comparable to standard IPF. Furthermore, we introduce a hybrid framework that integrates IPF-synthesized data with generative models such as Bayesian networks and Tabular Variational Autoencoders. This approach ensures accurate marginal fitting while enhancing realism and diversity in synthetic populations. Our contributions improve upon standard IPF and generative models, advancing synthetic population modeling.
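For readers unfamiliar with the baseline, the following is a minimal sketch of standard two-way IPF, not the block-wise or hybrid variants proposed in the abstract. It alternately rescales the rows and columns of a seed table until the row sums match their targets. The function name, the tolerance, and the iteration cap are illustrative choices.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Standard two-way iterative proportional fitting.

    seed: list of rows (non-negative numbers); row_targets and col_targets
    are the desired margins and should share the same grand total.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        for i, row in enumerate(table):
            s = sum(row)
            if s > 0:
                factor = row_targets[i] / s
                table[i] = [v * factor for v in row]
        # Column step: scale each column so that it sums to its target.
        for j in range(len(col_targets)):
            s = sum(row[j] for row in table)
            if s > 0:
                factor = col_targets[j] / s
                for row in table:
                    row[j] *= factor
        # Columns are exact after the column step, so check the rows.
        gap = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if gap < tol:
            break
    return table
```

Because each step only multiplies cells by a factor, a cell that is zero in the seed stays zero in the fitted table, which is one way the zero-cell issue mentioned in the abstract shows up in practice.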

Keywords

Iterative Proportional Fitting (IPF), block-wise IPF, synthetic population generation, high-dimensional data, contingency tables, marginal constraints, scalability

Zero-cell issues, computational efficiency, survey weighting, generative models, Bayesian networks, tabular variational autoencoders (TVAEs)

Co-Author

Amy Wagler, University of Texas At El Paso

First Author

William Agyapong, University of Texas At El Paso

Presenting Author

William Agyapong, University of Texas At El Paso

WITHDRAW: Spatio-temporal Bayesian modeling for estimating CPIs in small domains

There is interest in estimating Consumer Price Indexes (CPIs) for small Core-Based Statistical Areas (CBSAs) and states. Currently, consumer prices are sampled in select CBSAs with the goal of providing reliable index estimates at the national level, at the Census division level, and for CBSAs with sufficiently large populations. We use hierarchical Bayesian models and incorporate covariates and spatio-temporal correlations of consumer prices, with the idea that accounting for these correlations will compensate for the sparseness of the collected data and allow for reliable predictions in the small areas. Our research presented at JSM 2024 demonstrated the utility of accounting for spatial correlations. We are currently investigating whether a series of temporal estimates in CBSAs will compensate for the sparseness of direct cross-sectional estimates. We check our model assumptions by comparing estimated and predicted fuel prices with estimates from large administrative datasets.

Keywords

small area estimation

hierarchical Bayesian models

spatio-temporal correlations

STAN

Gaussian processes 

Co-Author

Terrance Savitsky, US Bureau of Labor Statistics

First Author

Vladislav Beresovsky, U.S. Bureau of Labor Statistics

Presenting Author

Vladislav Beresovsky, U.S. Bureau of Labor Statistics

Optimal Use of Survey Weights for Causal Inference under Informative Sampling

The increasing availability of survey data for causal inference on treatment effects presents new opportunities, yet most methods assume ignorability of treatment and non-informative sampling. In practice, survey data often include survey weights, but the sampling is frequently informative (that is, dependent on the outcome given covariates and treatment), especially when design details are undisclosed. The optimal use of survey weights for causal inference under such sampling is an open problem. We show how survey weights can enhance the efficiency of Horvitz-Thompson estimators. Specifically, we derive the efficient influence function within the class of regular asymptotically linear estimators and propose a novel estimator based on it. Using a super-population framework, we establish its doubly robust property and, via M-estimation, prove its root-N asymptotic normality under parametric nuisance modeling. To enable flexible ML methods, we extend the theory to show our estimator ensures faster-than-root-N rates when the product of nuisance function rates exceeds root-N. We support our theoretical findings through extensive simulations and analysis of the Medical Expenditure Panel Survey data.
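To make the term "doubly robust" concrete, the sketch below implements a textbook survey-weighted augmented inverse probability weighting (AIPW) estimator of the population average treatment effect. It is not the efficient-influence-function estimator proposed in the abstract, whose form is not given here. The function name and the inputs (fitted outcome predictions m1 and m0, propensity scores e, survey weights d) are assumptions for illustration.

```python
def survey_weighted_aipw(y, z, e, m1, m0, d):
    """Textbook AIPW estimate of the average treatment effect with survey weights.

    y: outcomes, z: treatment indicators (1 or 0), e: propensity scores,
    m1, m0: predicted outcomes under treatment and control,
    d: survey weights. All inputs are equal-length sequences.
    """
    total = 0.0
    weight_sum = 0.0
    for yi, zi, ei, m1i, m0i, di in zip(y, z, e, m1, m0, d):
        # Outcome-model contrast plus an inverse-probability-weighted residual correction.
        correction = zi * (yi - m1i) / ei - (1 - zi) * (yi - m0i) / (1.0 - ei)
        total += di * (m1i - m0i + correction)
        weight_sum += di
    return total / weight_sum
```

The residual correction has mean zero when the outcome model is correct, and it removes the outcome-model bias when the propensity model is correct, which is the usual sense of double robustness.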

Keywords

Complex survey


Data-Adaptive Method

Doubly Robust Estimation

Empirical process

Population average treatment effect 

Co-Author

Shu Yang, North Carolina State University, Department of Statistics

First Author

Shubhajit Sen

Presenting Author

Shubhajit Sen

Testing interaction effects with weighted interval-censoring Cox proportional hazard models

The Population Assessment of Tobacco and Health (PATH) Study is a national longitudinal study of tobacco use (2013-2021) that requires balanced repeated replicate weights for analysis. Crude and multivariable weighted interval-censoring Cox proportional hazard models were used to estimate two interaction effects on the age of asthma onset: (1) sex and years since first hookah use, and (2) ethnicity and years since first hookah use. After controlling for covariates, women, Hispanics, and non-Hispanic Black adults who reported one or more years since first hookah use had increased risks of asthma onset at earlier ages in comparison to men and non-Hispanic White adults who reported never using hookah (HR = 4.93, 95% CI 2.10-11.58; HR = 5.18, 95% CI 2.21-12.16; and HR = 1.63, 95% CI 1.09-2.43, respectively). The interaction of sex and race/ethnicity with past 30-day (P30D) electronic cigarette (ENDS) use on the age of asthma onset was also estimated. Disseminating the results among health providers and the public about the interaction effect of sex and race/ethnicity with years since first hookah use or P30D ENDS use on earlier ages of asthma onset may encourage users to stop.
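The replicate-weight variance calculation behind such analyses can be sketched as follows. This is the general formula for balanced repeated replication with Fay's adjustment: the estimate is recomputed under each set of replicate weights, and the squared deviations from the full-sample estimate are averaged and rescaled. The function name is hypothetical, and the Fay coefficient is left as a parameter because the abstract does not state the value used.

```python
import math


def fay_brr_variance(full_estimate, replicate_estimates, fay_coef):
    """Fay's BRR variance: sum((theta_r - theta)^2) / (R * (1 - fay_coef)^2).

    full_estimate: estimate computed with the full-sample weights.
    replicate_estimates: the same estimate recomputed with each replicate weight set.
    fay_coef: Fay coefficient in [0, 1); 0 gives ordinary BRR.
    """
    r = len(replicate_estimates)
    squared_dev = sum((t - full_estimate) ** 2 for t in replicate_estimates)
    return squared_dev / (r * (1.0 - fay_coef) ** 2)


def fay_brr_standard_error(full_estimate, replicate_estimates, fay_coef):
    """Standard error as the square root of Fay's BRR variance."""
    return math.sqrt(fay_brr_variance(full_estimate, replicate_estimates, fay_coef))
```

For a hazard ratio, this calculation would typically be applied to the log hazard ratio, with the interval endpoints exponentiated afterward.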

Keywords

Sampling weights

Fay's variance estimation

Balanced Repeated Replicate Weights

Interval-Censoring Hazard Function

hazard risk

age of onset 

Co-Author

Sarah Valencia, Michael and Susan Dell Center for Healthy Living

First Author

Adriana Perez, University of Texas At Houston, Health Science Center

Presenting Author

Adriana Perez, University of Texas At Houston, Health Science Center