Thursday, Aug 7: 10:30 AM - 12:20 PM
4223
Contributed Papers
Music City Center
Room: CC-Davidson Ballroom A1
Main Sponsor
Survey Research Methods Section
Presentations
The randomized response technique has been a cornerstone in survey methodology for eliciting truthful responses
on sensitive subjects, spanning numerous domains such as behavioral science, socio-economic studies,
psychology, epidemiology, and public health. Since its inception by Warner (1965), the technique has undergone
significant methodological enhancements to increase its reliability and application breadth. Despite its prevalent
use and the passing of nearly six decades, the exploration of tolerance intervals within randomized response
remains limited. This paper aims to extend the statistical toolkit for randomized response by introducing exact
tolerance intervals, building on the foundational confidence interval analysis by Frey and Pérez (2012).
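As a point of reference for the setting (not the exact tolerance intervals proposed in the paper), the sketch below works through Warner's (1965) randomized response estimator and a simple transformed Clopper-Pearson confidence interval; the design probability p, the sample counts, and the use of scipy are illustrative assumptions.

import numpy as np
from scipy.stats import beta

def warner_estimate(n_yes, n, p=0.7):
    """Warner (1965) design: each respondent answers the sensitive question with
    probability p and its complement with probability 1 - p, so
    P(yes) = p*pi + (1 - p)*(1 - pi). Requires p != 0.5."""
    lam_hat = n_yes / n
    pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)
    return float(np.clip(pi_hat, 0.0, 1.0))

def warner_ci(n_yes, n, p=0.7, alpha=0.05):
    """Naive interval: Clopper-Pearson bounds for P(yes), back-transformed to pi.
    This is NOT the exact tolerance interval developed in the paper."""
    lo = 0.0 if n_yes == 0 else beta.ppf(alpha / 2, n_yes, n - n_yes + 1)
    hi = 1.0 if n_yes == n else beta.ppf(1 - alpha / 2, n_yes + 1, n - n_yes)
    bounds = [(b - (1 - p)) / (2 * p - 1) for b in (lo, hi)]
    lo_pi, hi_pi = min(bounds), max(bounds)      # order flips when p < 0.5
    return max(lo_pi, 0.0), min(hi_pi, 1.0)

# Hypothetical sample: 430 "yes" answers out of 1000 with p = 0.7.
print(warner_estimate(430, 1000), warner_ci(430, 1000))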
Keywords
Tolerance Intervals
Confidence Intervals
Randomized Response Techniques
Applied Survey
Sensitive Survey Data
Summary statistics cannot be obtained when working with a censored dataset, and a histogram or box plot is likewise not possible. In such cases we can fit a parametric distribution to the data; if the fit is good, summary statistics such as the mean, median, standard deviation, and percentiles can be calculated from the fitted distribution. The pivotal point of this paper is obtaining a histogram for censored data. Let U be a censored observation and let T be the time-to-death variable. Using the fitted distribution, we can obtain the conditional distribution of T given T > U and calculate the expected value of that conditional distribution. We replace U with this conditional mean and repeat the process for every censored observation, thus obtaining a complete dataset from which a histogram can be drawn. We can superimpose the histogram with the density function of the fitted distribution, and we extend the idea to checking the goodness of fit of the fitted distribution.
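A minimal sketch of the conditional-mean replacement idea, assuming right-censored survival times and a Weibull working model; the Weibull choice, the simulated data, and the use of scipy/matplotlib are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy import stats, optimize, integrate
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t_true = stats.weibull_min.rvs(c=1.5, scale=10, size=300, random_state=rng)
c_time = stats.expon.rvs(scale=15, size=300, random_state=rng)   # censoring times
time = np.minimum(t_true, c_time)
event = (t_true <= c_time).astype(int)        # 1 = death observed, 0 = censored

def neg_loglik(params):
    """Censored Weibull log-likelihood: density for events, survival for censored."""
    shape, scale = np.exp(params)             # keep parameters positive
    ll = np.where(event == 1,
                  stats.weibull_min.logpdf(time, c=shape, scale=scale),
                  stats.weibull_min.logsf(time, c=shape, scale=scale))
    return -ll.sum()

res = optimize.minimize(neg_loglik, x0=np.log([1.0, np.median(time)]))
shape_hat, scale_hat = np.exp(res.x)
fitted = stats.weibull_min(c=shape_hat, scale=scale_hat)

def cond_mean(u):
    """E[T | T > U] under the fitted distribution."""
    num, _ = integrate.quad(lambda t: t * fitted.pdf(t), u, np.inf)
    return num / fitted.sf(u)

# Replace each censored observation U by E[T | T > U] to complete the dataset.
completed = time.copy()
cens = event == 0
completed[cens] = [cond_mean(u) for u in time[cens]]

# Histogram of the completed data with the fitted density superimposed.
plt.hist(completed, bins=30, density=True, alpha=0.5)
grid = np.linspace(0, completed.max(), 200)
plt.plot(grid, fitted.pdf(grid))
plt.show()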
Keywords
censored data
survival analysis
histogram
conditional distribution
We extend methods for estimating hierarchical linear models with incomplete data to study complier-average causal effects (CACE) in a multi-site trial. Individuals at each site are assigned to either treatment or control. Compliers adhere to their assigned condition and would have done so had they been assigned to the other. Under the assumptions of monotonicity (treatment assignment does not decrease participation), random treatment assignment, and treatment affecting outcomes only if participants comply, compliance is missing at random. We study the mean and variance of CACE across sites, site-specific average CACE, and the association between site-level and within-site covariates and CACE. To estimate these, we factorize the complete data likelihood into the distribution of the outcome and compliance, conditional on selected random effects, and the marginal distribution of these random effects. Assuming the random effects are provisionally known, we integrate out the missing data, then numerically integrate out the random effects, maximizing the likelihood using the EM algorithm. We illustrate this approach with data from a large-scale study on charter school effectiveness.
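To illustrate one ingredient of this approach, the sketch below numerically integrates a site-level random effect out of a likelihood using (non-adaptive) Gauss-Hermite quadrature for a much simpler random-intercept logistic model; the model, the toy data, and the quadrature order are illustrative assumptions, not the CACE likelihood or EM algorithm described above.

import numpy as np
from scipy.special import expit

nodes, weights = np.polynomial.hermite.hermgauss(15)   # physicists' Hermite rule

def site_marginal_loglik(y, x, beta, sigma_b):
    """log of int prod_j p(y_j | x_j, b) N(b; 0, sigma_b^2) db for one site,
    approximated by sum_k (w_k / sqrt(pi)) * prod_j p(y_j | x_j, sqrt(2)*sigma_b*x_k)."""
    b = np.sqrt(2.0) * sigma_b * nodes                 # transformed quadrature points
    eta = x[:, None] * beta + b[None, :]               # linear predictor at each node
    p = expit(eta)
    loglik_at_nodes = np.sum(y[:, None] * np.log(p) + (1 - y[:, None]) * np.log1p(-p), axis=0)
    return np.log(np.sum(weights / np.sqrt(np.pi) * np.exp(loglik_at_nodes)))

# Toy usage: one site with a binary outcome and one covariate.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.binomial(1, expit(0.5 * x + 0.3))
print(site_marginal_loglik(y, x, beta=0.5, sigma_b=0.3))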
Keywords
Complier average causal effects (CACE)
Site-specific average CACE
mean and variance of CACE across sites
provisionally known random effects
the EM algorithm
adaptive Gauss-Hermite quadrature
Resampling techniques have become increasingly popular for estimating uncertainty in a wide range of data, including those collected via surveys. Such data are often fraught with missing values, which are commonly imputed to facilitate analysis. This article addresses the issue of using resampling methods such as the jackknife or bootstrap in conjunction with imputations that have been sampled stochastically (e.g., in the vein of multiple imputation). We derive the theory needed to illustrate two key points. First, imputations must be independently generated multiple times within each replicate group of a jackknife or bootstrap. Second, the number of multiply imputed datasets per replicate group must dramatically exceed the number of replicate groups for a jackknife; however, this is not the case for a bootstrap approach. A simulation study is provided to support these theoretical conclusions.
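A minimal sketch of the first point under a deliberately simple setup: missing values are imputed stochastically M times inside every bootstrap replicate, the point estimate is averaged over imputations, and the bootstrap variance is taken across replicates. The normal imputation model and the choices of B and M are illustrative assumptions, not the paper's derivation.

import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(10, 2, size=200)
y[rng.random(200) < 0.3] = np.nan                 # roughly 30% missing completely at random

def impute_once(v, rng):
    """One stochastic imputation: draw missing values from a normal fitted to the observed part."""
    obs = v[~np.isnan(v)]
    out = v.copy()
    out[np.isnan(v)] = rng.normal(obs.mean(), obs.std(ddof=1), size=np.isnan(v).sum())
    return out

def bootstrap_mean_se(v, B=500, M=20, rng=rng):
    reps = np.empty(B)
    for b in range(B):
        boot = rng.choice(v, size=v.size, replace=True)            # resample rows first
        ests = [impute_once(boot, rng).mean() for _ in range(M)]   # then impute M times inside
        reps[b] = np.mean(ests)
    return reps.std(ddof=1)

print("bootstrap SE of the mean:", bootstrap_mean_se(y))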
Keywords
Missing data
Multiple imputation
Jackknife
Bootstrap
Microdata poses privacy risks, especially in small geographic areas. Perturbation reduces these risks, but balancing privacy and utility remains challenging, particularly in Small Area Estimation (SAE). This study examines how data perturbation affects the accuracy of SAE, aiming to optimize privacy protection and data utility. Using data from the 2018-2022 American Community Survey Public Use Microdata Sample, we estimate income and poverty at the state and Public Use Microdata Area (PUMA) levels. Six covariates (age, gender, race/ethnicity, education, occupation, and health insurance) are used for prediction and perturbed. Records are first classified by three privacy levels. Random Swapping, Post Randomization Method, and Multiple Imputation are then applied at the national, state, and PUMA levels. For each perturbation scenario, we generate small area estimates at the state and PUMA levels using the Fay-Herriot model and evaluate outcomes within the Risk-Utility (R-U) framework. We hypothesize that greater privacy protection and smaller geographic areas reduce utility, leading to less accurate estimates.
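For context, the sketch below implements the area-level Fay-Herriot estimator used to combine (possibly perturbed) direct estimates with covariates; the moment estimator of the model variance, the toy inputs, and the variable names are simplifying assumptions rather than the study's actual specification.

import numpy as np
from scipy.optimize import brentq

def _gls_beta(y, X, D, A):
    # Generalized least squares coefficients given the current model variance A.
    w = 1.0 / (A + D)
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def fay_herriot(y, X, D):
    """EBLUPs under the area-level model y_i = x_i'beta + v_i + e_i,
    with v_i ~ N(0, A) and known sampling variances D_i."""
    m, p = X.shape
    def moment_gap(A):
        resid = y - X @ _gls_beta(y, X, D, A)
        return np.sum(resid ** 2 / (A + D)) - (m - p)
    # Fay-Herriot moment estimator of A: root of moment_gap, truncated at zero.
    A = 0.0 if moment_gap(0.0) <= 0 else brentq(moment_gap, 0.0, 100 * np.var(y, ddof=1))
    beta = _gls_beta(y, X, D, A)
    gamma = A / (A + D)                         # shrinkage toward the synthetic estimate
    return gamma * y + (1 - gamma) * (X @ beta), gamma

# Toy usage with hypothetical area-level direct estimates and one covariate.
rng = np.random.default_rng(3)
m = 40
X = np.column_stack([np.ones(m), rng.normal(size=m)])
D = rng.uniform(0.5, 2.0, size=m)               # known sampling variances
y = X @ np.array([5.0, 1.0]) + rng.normal(0, 1.0, size=m) + rng.normal(0, np.sqrt(D))
estimates, gamma = fay_herriot(y, X, D)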
Keywords
Data Privacy
Data Perturbation
Small Area Estimation
American Community Survey
The United States Department of Agriculture's (USDA's) National Agricultural Statistics Service (NASS) conducts hundreds of surveys each year. Many of these surveys rely on pre-assigned upper and lower limits to identify questionable reported values that may require editing. The limits are currently assigned manually by subject matter experts, and values outside the limit range are flagged for editing. NASS has developed a system to minimize manual editing by automating most editing and imputation actions. Recent research focuses on evaluating several outlier detection methods to create hard edit limits using data-driven approaches. The resulting limits must identify extreme anomalies to be subsequently corrected in automated edits. This paper evaluates four outlier detection methods – Cook's Distance, local outlier probability (LoOP), isolation forest (IF), and FuzzyHRT (historical, relational, and tail anomalies) – and explores their possible application in determining edit limits using a case study from the Agricultural Production Survey. We summarize the characteristics of each method, review the current edit limits, present application results, and discuss next steps.
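As a rough illustration of how data-driven edit limits might be derived, the sketch below applies two of the four methods named above (Cook's distance and an isolation forest) to toy reported values; LoOP and FuzzyHRT are omitted, and the data, thresholds, and use of statsmodels/scikit-learn are illustrative assumptions, not NASS's production system.

import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(11)
acres = rng.gamma(shape=2.0, scale=100.0, size=500)            # hypothetical reported acres
production = 40 * acres + rng.normal(0, 400, size=500)          # hypothetical reported production
production[:5] *= 10                                            # inject a few gross reporting errors

# Cook's distance from a simple production-on-acres regression; a common rule
# flags observations with distance above 4/n for review.
ols = sm.OLS(production, sm.add_constant(acres)).fit()
cooks_d = ols.get_influence().cooks_distance[0]
flag_cooks = cooks_d > 4 / len(production)

# Isolation forest on the (acres, production) pairs; the contamination rate is
# the expected share of anomalies and is a tuning assumption.
iso = IsolationForest(contamination=0.02, random_state=0)
flag_iforest = iso.fit_predict(np.column_stack([acres, production])) == -1

# Candidate hard edit limits could then be set from the non-flagged records,
# e.g. as wide empirical percentiles of the cleaned distribution.
clean = production[~(flag_cooks | flag_iforest)]
lower, upper = np.percentile(clean, [0.5, 99.5])
print(f"candidate edit limits for production: [{lower:.0f}, {upper:.0f}]")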
Keywords
Outlier detection
Automatic editing
Imputation
Machine learning
Survey modernization