Multivariate Bernoulli-Based Sampling Method for Multi-label Data with Application to Meta-Research

Colby Vorland Co-Author
 
Donna Maney Co-Author
Emory University, Dept. Psychology
 
Andrew Brown Co-Author
University of Arkansas for Medical Sciences
 
Simon Chung First Author
University of Arkansas for Medical Sciences, Department of Biostatistics
 
Simon Chung Presenting Author
University of Arkansas for Medical Sciences, Department of Biostatistics
 
Tuesday, Aug 5: 2:20 PM - 2:35 PM
2547 
Contributed Papers 
Music City Center 
In real-world applications, datasets may contain observations with multiple labels that are not mutually exclusive, so sampling methods must account for dependencies among labels. We propose a novel sampling algorithm designed for multi-label datasets. The algorithm uses the observed label frequencies to estimate the parameters of a multivariate Bernoulli distribution. Using optimization constrained to a target distribution, we calculate a weight for each combination of labels. This approach ensures that, after weighted sampling, the sub-sample takes on the characteristics of the target distribution while accounting for label dependencies. Our use case was a broad sample of research articles from Scopus labeled with 66 biomedical topic categories, with the imbalanced distribution typical of multi-label data. We needed to sample from the literature in a way that 1) preserved the frequency order of the categories, 2) reduced the difference in frequency between the most and least frequent categories, and 3) accounted for dependencies among categories. With this approach, we produced a more balanced sub-sample, thereby enhancing the representation of minority categories.
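The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the objective function, the solver, or how the target distribution is defined, so each of those choices here is an assumption. The toy data, the `shrink` factor that pulls label frequencies halfway toward their mean, the squared-error objective on label marginals, and the exponentiated-gradient solver (which keeps the combination weights non-negative and summing to one) are all hypothetical stand-ins.

```python
import math
import random
from collections import Counter

# Hypothetical toy data: each tuple holds one article's 0/1 indicators for 3 labels.
data = ([(1, 0, 0)] * 50 + [(1, 1, 0)] * 20 + [(0, 1, 0)] * 15
        + [(0, 1, 1)] * 5 + [(0, 0, 1)] * 5 + [(1, 0, 1)] * 5)
n_labels = len(data[0])

# Empirical probability of each observed label combination
# (the multivariate Bernoulli pmf restricted to combinations seen in the data).
counts = Counter(data)
combos = sorted(counts)
q = [counts[c] / len(data) for c in combos]


def marginals(weights):
    """Label frequencies implied by a distribution over label combinations."""
    return [sum(w * c[k] for w, c in zip(weights, combos)) for k in range(n_labels)]


def loss(weights, target):
    """Squared distance between implied and target label frequencies."""
    return sum((m - t) ** 2 for m, t in zip(marginals(weights), target))


# Assumed target: shrink each label frequency halfway toward the mean frequency.
# This keeps the frequency order and narrows the gap between labels.
observed = marginals(q)
mean = sum(observed) / n_labels
shrink = 0.5
target = [mean + shrink * (p - mean) for p in observed]

# Assumed solver: exponentiated-gradient updates started from the empirical
# distribution, renormalized each step so the weights stay on the simplex.
w = q[:]
eta = 0.2
for _ in range(5000):
    resid = [m - t for m, t in zip(marginals(w), target)]
    grad = [2 * sum(r * c[k] for k, r in enumerate(resid)) for c in combos]
    w = [wi * math.exp(-eta * g) for wi, g in zip(w, grad)]
    total = sum(w)
    w = [wi / total for wi in w]

loss_before = loss(q, target)
loss_after = loss(w, target)

# Weighted sampling (with replacement): each observation is weighted by the
# ratio of its combination's fitted weight to its empirical frequency.
ratio = {c: wi / qi for c, wi, qi in zip(combos, w, q)}
rng = random.Random(0)
subsample = rng.choices(data, weights=[ratio[row] for row in data], k=40)
```

Because the weights are attached to whole label combinations rather than to individual labels, co-occurrence patterns in the data carry over into the sub-sample. A real application with 66 categories would likely need a dedicated constrained solver and sampling without replacement.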

Keywords

Multivariate Bernoulli Distribution

Constrained Optimization

Weighted Sampling 

Main Sponsor

Survey Research Methods Section