Missing Data, Outlier Detection, and Confidentiality

Chair: Hou-Cheng Yang
Edwards Lifesciences
 
Thursday, Aug 7: 10:30 AM - 12:20 PM
4223 
Contributed Papers 
Music City Center 
Room: CC-Davidson Ballroom A1 

Main Sponsor

Survey Research Methods Section

Presentations

Construction of Tolerance Intervals for Randomized Response Designs

The randomized response technique has been a cornerstone in survey methodology for eliciting truthful responses
on sensitive subjects, spanning numerous domains such as behavioral science, socio-economic studies,
psychology, epidemiology, and public health. Since its inception by Warner (1965), the technique has undergone
significant methodological enhancements to increase its reliability and application breadth. Despite its prevalent
use and the passing of nearly six decades, the exploration of tolerance intervals within randomized response
remains limited. This paper aims to extend the statistical toolkit for randomized response by introducing exact
tolerance intervals, building on the foundational confidence interval analysis by Frey and Pérez (2012). 

Keywords

Tolerance Intervals

Confidence Intervals

Randomized Response Techniques

Applied Survey

Sensitive Survey Data 

Co-Author

Derek Young, University of Kentucky

First Author

Daniel Tuyisenge

Presenting Author

Daniel Tuyisenge

How to create a histogram for censored data?

Summary statistics cannot be obtained directly from a censored dataset, nor can a histogram or box plot be constructed. In such cases we can fit a parametric distribution to the data; if the fit is good, summary statistics such as the mean, median, standard deviation, and percentiles can be calculated from the fitted distribution. The pivotal point of this paper is obtaining a histogram for censored data. Let U be a censored observation and let T be the time-to-death variable. Using the fitted distribution, we can obtain the conditional distribution of T given T > U and calculate its expected value. We replace U with this conditional mean and repeat the procedure for every censored observation, thus obtaining a complete dataset from which a histogram can be constructed. The histogram can then be superimposed with the density function of the fitted distribution. We extend the idea to assess the goodness of fit of the fitted distribution. 
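The replacement step has a closed form when, for example, an exponential distribution is fitted: by memorylessness, E[T | T > U] = U + 1/λ. A sketch under that assumed exponential fit (the abstract does not commit to a particular parametric family, and the simulated data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated survival data: true times exponential, independent uniform censoring.
true_t = rng.exponential(scale=10.0, size=500)
cens_c = rng.uniform(0, 25, size=500)
time = np.minimum(true_t, cens_c)
event = true_t <= cens_c            # True = death observed, False = censored

# MLE of the exponential rate under right censoring: events / total follow-up.
lam_hat = event.sum() / time.sum()

# For an exponential, E[T | T > U] = U + 1/lambda (memorylessness),
# so each censored time U is replaced by its conditional mean.
completed = np.where(event, time, time + 1.0 / lam_hat)

# `completed` is the "complete" dataset whose histogram can be overlaid
# with the fitted density lam_hat * exp(-lam_hat * t).
```

For other fitted families the conditional mean generally requires numerical integration of t·f(t)/S(U) over (U, ∞), but the imputation logic is identical.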

Keywords

censored data

survival analysis

histogram

conditional distribution 

Co-Author(s)

Zhaochong Yu
Marepalli Rao, University of Cincinnati

First Author

Neelakshi Chatterjee

Presenting Author

Neelakshi Chatterjee

Variability in Causal Effects, Moderation, and Noncompliance from Data MAR in a Multisite Trial

We extend methods for estimating hierarchical linear models with incomplete data to study complier-average causal effects (CACE) in a multi-site trial. Individuals at each site are assigned to either treatment or control. Compliers adhere to their assigned condition and would have done so had they been assigned to the other. Under the assumptions of monotonicity (treatment assignment does not decrease participation), random treatment assignment, and treatment affecting outcomes only if participants comply, compliance is missing at random. We study the mean and variance of CACE across sites, site-specific average CACE, and the association between site-level and within-site covariates and CACE. To estimate these, we factorize the complete data likelihood into the distribution of the outcome and compliance, conditional on selected random effects, and the marginal distribution of these random effects. Assuming the random effects are provisionally known, we integrate out the missing data, then numerically integrate out the random effects, maximizing the likelihood using the EM algorithm. We illustrate this approach with data from a large-scale study on charter school effectiveness. 
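As background, in a single site with one-sided noncompliance the complier-average causal effect reduces to the familiar Wald/Bloom ratio of intention-to-treat effects; the abstract's hierarchical EM machinery generalizes this across sites with incomplete compliance data. A toy sketch of the one-site estimator (all numbers invented; this is not the paper's multilevel method):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 5000
z = rng.integers(0, 2, n)              # randomized assignment
complier = rng.random(n) < 0.6         # latent complier status
d = (z == 1) & complier                # treatment received (one-sided noncompliance)
y = 2.0 * d + rng.normal(0, 1, n)      # assignment affects Y only through receipt

# Wald/Bloom estimator: ITT effect on the outcome divided by
# the ITT effect on treatment receipt (the compliance rate here).
itt_y = y[z == 1].mean() - y[z == 0].mean()
itt_d = d[z == 1].mean() - d[z == 0].mean()
cace_hat = itt_y / itt_d
```

The hierarchical extension in the abstract lets both the compliance rate and the complier effect vary by site, with compliance only partially observed.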

Keywords

Complier average causal effects (CACE)

Site-specific average CACE

mean and variance of CACE across sites

provisionally known random effects

the EM algorithm

adaptive Gauss Hermite quadrature 

Co-Author(s)

Stephen Raudenbush, The University of Chicago
Ruhi Baichwal, University of Chicago

First Author

Yongyun Shin, Virginia Commonwealth University

Presenting Author

Yongyun Shin, Virginia Commonwealth University

Resampling methods with multiply imputed data

Resampling techniques have become increasingly popular for estimation of uncertainty in a wide range of data, including those collected via surveys. Such data are often fraught with missing values, which are commonly imputed to facilitate analysis. This article addresses the issue of using resampling methods such as a jackknife or bootstrap in conjunction with imputations that have been sampled stochastically (e.g., in the vein of multiple imputation). We derive the theory needed to illustrate two key points. First, imputations must be independently generated multiple times within each replicate group of a jackknife or bootstrap. Second, the number of multiply imputed datasets per replicate group must dramatically exceed the number of replicate groups for a jackknife; however, this is not the case in a bootstrap approach. A simulation study is provided to support these theoretical conclusions. 
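The first conclusion above can be sketched directly: within each bootstrap replicate, M stochastic imputations are generated independently and their estimates averaged before the replicate-to-replicate variance is taken. A minimal illustration using a simple stochastic hot-deck imputation (the estimator, missingness mechanism, and choices of B and M are all invented; the paper's derivations concern general survey estimators):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data with values missing completely at random.
x = rng.normal(50, 10, size=200)
miss = rng.random(200) < 0.2

B, M = 200, 20           # bootstrap replicates; imputations per replicate
boot_means = np.empty(B)
for b in range(B):
    idx = rng.integers(0, x.size, x.size)   # resample indices with replacement
    xb, mb = x[idx], miss[idx]
    # Independently generate M stochastic (hot-deck) imputations
    # within this replicate and average the resulting estimates.
    ests = [np.where(mb, rng.choice(xb[~mb], xb.size), xb).mean()
            for _ in range(M)]
    boot_means[b] = np.mean(ests)

point = boot_means.mean()
se = boot_means.std(ddof=1)     # bootstrap standard error of the mean
```

Generating the imputations inside each replicate, rather than bootstrapping a single imputed dataset, is what keeps the imputation variability inside the resampling variance.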

Keywords

Missing data

Multiple imputation

Jackknife

Bootstrap 

Co-Author

Lane Burgette, RAND

First Author

Michael Robbins, RAND Corporation

Presenting Author

Michael Robbins, RAND Corporation

Balancing Privacy and Precision: The Impact of Data Perturbation Methods on Small Area Estimation

Microdata poses privacy risks, especially in small geographic areas. Perturbation reduces these risks, but balancing privacy and utility remains challenging, particularly in Small Area Estimation (SAE). This study examines how data perturbation affects the accuracy of SAE, aiming to optimize privacy protection and data utility. Using data from the 2018–2022 American Community Survey Public Use Microdata Sample, we estimate income and poverty at the state and Public Use Microdata Area (PUMA) levels. Six covariates (age, gender, race/ethnicity, education, occupation, and health insurance) are used for prediction and perturbed. Records are first classified by three privacy levels. Random Swapping, the Post Randomization Method, and Multiple Imputation are then applied at the national, state, and PUMA levels. For each perturbation scenario, we generate small area estimates at the state and PUMA levels using the Fay-Herriot model and evaluate outcomes within the Risk-Utility (R-U) framework. We hypothesize that greater privacy protection and smaller geographic areas reduce utility, leading to less accurate estimates. 
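Of the three perturbation methods named, random swapping is the simplest to sketch: values of a covariate are exchanged between randomly paired records, which preserves the marginal distribution exactly while breaking record-level linkage. A generic illustration (the swap rate and data are invented; the study's actual implementation is not described in the abstract):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy microdata: one categorical covariate (e.g., an education code 1-5).
educ = rng.integers(1, 6, size=1000)

def random_swap(values, swap_rate, rng):
    """Swap the covariate between randomly chosen, disjoint pairs of records."""
    out = values.copy()
    n_pairs = int(swap_rate * len(values) / 2)
    idx = rng.choice(len(values), size=2 * n_pairs, replace=False)
    a, b = idx[:n_pairs], idx[n_pairs:]
    out[a], out[b] = values[b], values[a]
    return out

perturbed = random_swap(educ, swap_rate=0.10, rng=rng)
```

Because swapping is a permutation, univariate tabulations are unchanged; the utility loss appears in joint distributions with the unswapped variables, which is what small area models depend on.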

Keywords

Data Privacy

Data Perturbation

Small Area Estimation

American Community Survey 

Co-Author

Trivellore Raghunathan, Institute for Social Research

First Author

Chendi Zhao

Presenting Author

Chendi Zhao

WITHDRAWN - Determining Edit Limits for an Agricultural Survey Using Outlier Detection Methods

The United States Department of Agriculture's (USDA's) National Agricultural Statistics Service (NASS) conducts hundreds of surveys each year. Many of these surveys rely on pre-assigned upper and lower limits to identify questionable reported values that may require editing. The limits are currently assigned manually by subject matter experts, and values outside of the limit range are flagged for editing. NASS has developed a system to minimize manual editing by automating most editing and imputation actions. Recent research focuses on evaluating several outlier detection methods to create hard edit limits using data-driven methods. The resulting limits must identify extreme anomalies to be subsequently corrected in automated edits. This paper evaluates four outlier detection methods: Cook's Distance, local outlier probability (LoOP), isolation forest (IF), and FuzzyHRT (historical, relational, and tail anomalies). We explore their possible application in determining edit limits using a case study from the Agricultural Production Survey. We summarize the characteristics of each method, review the current edit limits, present application results, and discuss next steps. 
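Of the four methods, Cook's Distance is the most classical: it measures how much a regression's fitted values change when a single record is deleted, D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)². A generic sketch on invented data (regressing current reports on prior reports is an assumed setup for illustration, not NASS's actual edit model):

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented example: current-year reported values regressed on prior-year values.
prior = rng.normal(100, 15, 200)
current = 1.05 * prior + rng.normal(0, 5, 200)
current[0] = 600.0                     # one grossly anomalous report

X = np.column_stack([np.ones_like(prior), prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
resid = current - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
p = X.shape[1]
s2 = (resid**2).sum() / (len(current) - p)      # residual variance estimate
cooks = resid**2 / (p * s2) * h / (1 - h)**2

flagged = cooks > 4 / len(current)     # common rule-of-thumb cutoff
```

Records exceeding a cutoff such as 4/n would be candidates for the "extreme anomaly" side of a data-driven edit limit; the other three methods trade this regression framing for density-based or tree-based notions of anomaly.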

Keywords

Outlier detection

Automatic editing

Imputation

Machine learning

Survey modernization 

Co-Author(s)

Megan Lipke, USDA/NASS
Darcy Miller, USDA/NASS
Luca Sartore, National Institute of Statistical Sciences
Kay Lee Turner, USDA/National Agricultural Stats Service
Denise Abreu, USDA/NASS

First Author

Yumiko Siegfried, USDA/NASS