Improving Sexual Identity Measures in Health Disparity Studies with Machine Learning and Resampling

Brady West, Co-Author
Institute for Social Research

Rona Hu, First Author

Rona Hu, Presenting Author

Wednesday, Aug 7: 11:20 AM - 11:35 AM
2296 
Contributed Papers 
Oregon Convention Center 
Survey research on sexual identity often categorizes respondents as heterosexual, homosexual, or bisexual, which may miss more nuanced identities. Prior work has shown that introducing a "something else" response option can affect health disparity estimates. However, many surveys lack this option. We propose a machine learning approach to infer "something else" responses in existing surveys that do not offer it. Leveraging a split-ballot experiment in the 2015-2019 National Survey of Family Growth, we use the half-sample that included "something else" as a training dataset and apply a set of supervised machine learning algorithms to develop a classifier for sexual identity. We then use the half-sample that excluded "something else" as a test dataset, predicting responses on the four-category version of sexual identity and computing revised estimates of disparities based on these predictions. We repeat this process using bootstrap resampling to generate an empirical distribution of revised disparity estimates, and compare these estimates to those based on the original half-sample used for training. We conclude with implications of this work for future surveys measuring sexual identity.
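
The train, predict, and resample workflow described in the abstract can be sketched roughly as follows. This is only an illustration under loud assumptions, not the authors' implementation: the data are synthetic (no NSFG variables or survey weights are used), the two covariates and the binary health outcome are hypothetical, a simple nearest-centroid rule stands in for the unspecified "set of supervised machine learning algorithms", and the disparity is taken to be a difference in outcome prevalence between the predicted "something else" group and the predicted heterosexual group.

```python
import random
from collections import Counter, defaultdict

CATEGORIES = ["heterosexual", "homosexual", "bisexual", "something else"]

def make_half_sample(n, rng):
    """Synthetic respondents: ((x1, x2), identity, binary outcome). Hypothetical data."""
    rows = []
    for _ in range(n):
        identity = rng.choices(CATEGORIES, weights=[85, 5, 7, 3])[0]
        shift = CATEGORIES.index(identity)
        x = (rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0))
        y = 1 if rng.random() < 0.10 + 0.05 * shift else 0
        rows.append((x, identity, y))
    return rows

def fit_centroids(rows):
    """Stand-in classifier: mean covariate vector for each identity category."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = Counter()
    for (x1, x2), identity, _ in rows:
        sums[identity][0] += x1
        sums[identity][1] += x2
        counts[identity] += 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in counts}

def predict(centroids, x):
    """Assign the category whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda c: (x[0] - centroids[c][0]) ** 2 + (x[1] - centroids[c][1]) ** 2,
    )

def disparity(identities, outcomes, group="something else", ref="heterosexual"):
    """Difference in outcome prevalence (group minus reference), or None if a group is empty."""
    def prevalence(label):
        ys = [y for i, y in zip(identities, outcomes) if i == label]
        return sum(ys) / len(ys) if ys else None
    g, r = prevalence(group), prevalence(ref)
    return None if g is None or r is None else g - r

def bootstrap_disparities(train, test, n_boot, rng):
    """Refit on resampled training rows, predict on the test half, collect disparities."""
    test_outcomes = [y for _, _, y in test]
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(train, k=len(train))  # sample with replacement
        centroids = fit_centroids(resample)
        # The test half's simulated identity is ignored, mimicking a ballot
        # that never offered the "something else" option.
        predicted = [predict(centroids, x) for x, _, _ in test]
        d = disparity(predicted, test_outcomes)
        if d is not None:
            estimates.append(d)
    return estimates

rng = random.Random(0)
train = make_half_sample(2000, rng)  # half-sample with the four-category item
test = make_half_sample(2000, rng)   # half-sample treated as lacking the option

estimates = sorted(bootstrap_disparities(train, test, 200, rng))
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates)) - 1]
baseline = disparity([i for _, i, _ in train], [y for _, _, y in train])

print(f"training half-sample disparity: {baseline:.3f}")
print(f"bootstrap 95% percentile interval on test half: [{lower:.3f}, {upper:.3f}]")
```

The percentile interval here is one simple way to summarize the empirical distribution of revised estimates. An analysis of real survey data would additionally need to account for the complex sample design, which this sketch omits.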

Keywords

Sexual Identity Measurement

Machine Learning

Health Disparity Estimates

Survey Research

National Survey of Family Growth (NSFG)

Bootstrap Resampling 

Main Sponsor

Survey Research Methods Section