Tuesday, Aug 5: 2:00 PM - 3:50 PM
4136
Contributed Papers
Music City Center
Room: CC-205C
Main Sponsor
Survey Research Methods Section
Presentations
The USDA's National Agricultural Statistics Service (NASS) is committed to ensuring comprehensive coverage and representation of farms across the United States by maintaining a list frame of all known and potential U.S. farms. This comprehensive database serves as the foundation for data collection for agricultural surveys and censuses. A key tool in updating and refining the list frame is the National Agricultural Classification Survey (NACS). NACS is conducted in four phases leading up to the quinquennial Census of Agriculture (COA), which is conducted in years ending in 2 and 7. NACS evaluates whether operations have agricultural activity, and NASS adds eligible operations to the Census Mailing List (CML). However, budget constraints and rising nonresponse rates challenge the accuracy and representativeness of the NASS list frame. This study addresses these challenges by analyzing the integration of data from the most recent phase of NACS with both administrative and auxiliary data sources from the American Community Survey (ACS). The findings aim to inform strategies to enhance the list frame, improve sampling efficiency, and optimize resource allocation.
Keywords
Machine Learning
Non-response
USDA
In real-world applications, datasets may contain observations with multiple labels that are not necessarily mutually exclusive. Sampling methods must therefore account for label dependencies. We propose a novel sampling algorithm designed for multi-label datasets. Our algorithm uses the observed label frequencies to estimate the parameters of a multivariate Bernoulli distribution. Using optimization constrained to the target distribution, we calculate the weights for each combination of labels. This approach ensures that after weighted sampling, our sub-sample acquires the characteristics of the target distribution while accounting for the label dependencies. Our use case included a broad sample of research articles from Scopus labeled with 66 biomedical topic categories, with an imbalanced distribution typical of multi-label data. We needed to sample from the literature in a way that 1) preserved category frequency order, 2) decreased the difference in frequency between the most and least frequent categories, and 3) accounted for the category dependencies. With this approach, we produced a more balanced sub-sample, thereby enhancing the representation of minority categories.
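As a rough illustration of weighting by label combination, the sketch below uses a simpler stand-in for the constrained optimization described in the abstract: it tempers the observed combination frequencies with an exponent between 0 and 1, which keeps their order but shrinks the gap between common and rare combinations. The function names, the `alpha` parameter, and the tempering rule are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import Counter


def combination_weights(label_sets, alpha=0.5):
    """Per-observation sampling weights from tempered combination frequencies.

    Assumption: the target distribution is proportional to p ** alpha,
    where p is the observed frequency of a label combination.
    """
    combos = [frozenset(s) for s in label_sets]
    counts = Counter(combos)
    n = len(combos)
    observed = {c: k / n for c, k in counts.items()}
    tempered = {c: p ** alpha for c, p in observed.items()}
    total = sum(tempered.values())
    target = {c: t / total for c, t in tempered.items()}
    # Spread each combination's target mass evenly over its observations.
    weights = [target[c] / counts[c] for c in combos]
    return weights, observed, target


def weighted_subsample(label_sets, size, alpha=0.5, seed=0):
    """Draw a weighted sub-sample (with replacement) of the observations."""
    weights, _, _ = combination_weights(label_sets, alpha)
    rng = random.Random(seed)
    idx = rng.choices(range(len(label_sets)), weights=weights, k=size)
    return [label_sets[i] for i in idx]
```

Because whole label combinations are weighted rather than individual labels, co-occurring labels stay together in the sub-sample. Note that `random.choices` samples with replacement; a without-replacement design would need a different draw step.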
Keywords
Multivariate Bernoulli Distribution
Constrained optimization
Weighted Sampling
Co-Author(s)
Colby Vorland
Donna Maney, Emory University, Dept. Psychology
Andrew Brown, University of Arkansas for Medical Sciences
First Author
Simon Chung, University of Arkansas for Medical Sciences, Department of Biostatistics
Presenting Author
Simon Chung, University of Arkansas for Medical Sciences, Department of Biostatistics
We consider model-based optimal sampling designs for multipurpose surveys with multiple measures of size when coordinating samples among multiple surveys. The problem is motivated by crop surveys conducted by the United States National Agricultural Statistics Service (NASS), in which estimates of interest include planted and harvested acres of different crops as well as crop yields, and historical acreages are available on the frame as measures of size. Further, there is a need to coordinate three disjoint samples to minimize respondent burden. We use a subframe design to coordinate samples paired with convex optimization to find the inclusion probabilities that minimize expected sample size subject to target precision requirements for different study variables, along with other inequality constraints to select disjoint samples for multiple surveys. The precision requirements are computed as anticipated coefficients of variation under models relating study variables to frame measures of size.
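To make the optimization concrete, here is a minimal sketch of a much simpler special case than the one in the abstract: one survey, one study variable, and Poisson sampling. Under those assumptions the anticipated variance is the sum of (1/pi_i - 1) * sigma_i^2, and the inclusion probabilities that minimize expected sample size take the form pi_i = min(1, c * sigma_i). The function names and the bisection search are illustrative assumptions; the multi-variable, multi-survey problem in the abstract would need a general convex solver.

```python
def anticipated_variance(pi, sigma):
    """Anticipated variance under Poisson sampling (an assumed design)."""
    return sum((1.0 / p - 1.0) * s * s for p, s in zip(pi, sigma))


def optimal_inclusion_probs(sigma, target_variance, iterations=200):
    """Minimize the expected sample size sum(pi) subject to
    anticipated_variance(pi, sigma) <= target_variance.

    sigma: model standard deviations per unit (hypothetical inputs,
    e.g. derived from frame measures of size).
    A CV target translates to target_variance = (cv * anticipated_total) ** 2.
    """
    if target_variance <= 0:
        raise ValueError("target_variance must be positive")
    if any(s <= 0 for s in sigma):
        raise ValueError("all sigma values must be positive")

    def probs(c):
        return [min(1.0, c * s) for s in sigma]

    # At c = 1 / min(sigma) every pi equals 1 and the variance is zero.
    lo, hi = 0.0, 1.0 / min(sigma)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if anticipated_variance(probs(mid), sigma) > target_variance:
            lo = mid  # too little sample: raise the scale
        else:
            hi = mid  # constraint met: try a smaller scale
    return probs(hi)
```

For example, `optimal_inclusion_probs([1, 2, 3, 4], 30)` returns probabilities proportional to the model standard deviations, and their sum is the expected sample size.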
Keywords
Sample Coordination
Optimal Sample Designs
Balanced Sampling
Establishment Surveys
Responsive and adaptive designs have emerged as a framework for targeting and reallocating resources during the data collection period in order to improve survey data collection efficiency. Here, we report on the implementation and evaluation of a responsive design experiment in the National Survey of College Graduates that optimizes the cost-quality tradeoff by minimizing a function of data collection costs and the root mean squared error (RMSE) of a key survey measure, self-reported salary. At three points during the data collection process, we predict outcomes and costs for remaining non-respondents and combine them with data from respondents to optimize effort on remaining cases with respect to cost and the RMSE of mean self-reported salary. This process allowed us to reduce data collection costs by nearly 10%, without a statistically or practically significant increase in the RMSE of mean salary or decrease in the unweighted response rate. This experiment demonstrates the potential for these types of designs to more effectively target data collection resources in order to reach survey quality goals.
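The cost-quality tradeoff can be sketched as a small decision rule. The abstract does not give the exact objective, so the weighted sum of cost and RMSE below is an assumption, as are the function names, the candidate strategies, and the reference value against which RMSE is measured. Predicted draws of mean salary stand in for output from a predictive model.

```python
import math


def rmse(draws, reference):
    """Root mean squared error of predicted draws around a reference value."""
    return math.sqrt(sum((d - reference) ** 2 for d in draws) / len(draws))


def choose_strategy(strategies, reference, cost_weight=1.0, rmse_weight=1.0):
    """Pick the strategy with the lowest combined cost and RMSE.

    strategies: dict mapping a name to (predicted_cost, predicted_mean_draws).
    Assumption: the objective is cost_weight * cost + rmse_weight * RMSE.
    """
    scores = {
        name: cost_weight * cost + rmse_weight * rmse(draws, reference)
        for name, (cost, draws) in strategies.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

Raising `rmse_weight` makes quality losses more expensive relative to cost savings, so the chosen strategy shifts toward higher-effort options.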
Keywords
Responsive design
National Survey of College Graduates
Posterior predictive distribution