
Multivariate Bernoulli-Based Sampling Method for Multi-label Data with Application to Meta-Research

Presented During: Sample Design and Non-response Modeling

Colby Vorland Co-Author

Donna Maney Co-Author
Emory University, Dept. Psychology

Andrew Brown Co-Author
University of Arkansas for Medical Sciences

Simon Chung First Author
University of Arkansas for Medical Sciences, Department of Biostatistics

Simon Chung Presenting Author
University of Arkansas for Medical Sciences, Department of Biostatistics

Tuesday, Aug 5: 2:20 PM - 2:35 PM
2547
Contributed Papers

Music City Center

In real-world applications, datasets may contain observations with multiple labels that are not necessarily mutually exclusive, so sampling methods must account for label dependencies. We propose a novel sampling algorithm designed for multi-label datasets. The algorithm uses the observed label frequencies to estimate the parameters of a multivariate Bernoulli distribution. Using optimization constrained to the target distribution, we calculate a weight for each combination of labels. This approach ensures that, after weighted sampling, the sub-sample acquires the characteristics of the target distribution while accounting for label dependencies. Our use case was a broad sample of research articles from Scopus labeled with 66 biomedical topic categories, with the imbalanced distribution typical of multi-label data. We needed to sample from the literature in a way that 1) preserved the category frequency order, 2) decreased the difference in frequency between the most and least frequent categories, and 3) accounted for category dependencies. With this approach, we produced a more balanced sub-sample, thereby enhancing the representation of minority categories.
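The idea of weighting label combinations toward a target distribution can be illustrated with a small sketch. The abstract does not specify the optimizer, so the code below swaps in iterative proportional fitting (raking) over binary label margins as a stand-in; it is not the authors' algorithm. The function names `marginal` and `rake_combinations`, the three labels, the toy counts, and the target frequencies are all hypothetical, chosen only so that the target keeps the frequency order while narrowing the gap between the most and least frequent labels.

```python
import random
from collections import Counter

def marginal(prob, label):
    """Probability mass of all label combinations that contain `label`."""
    return sum(p for combo, p in prob.items() if label in combo)

def rake_combinations(counts, target, n_iter=500):
    """Reweight observed label combinations so label marginals approach `target`.

    Assumption: iterative proportional fitting is used here as a stand-in for
    the constrained optimization mentioned in the abstract.
    """
    total = sum(counts.values())
    # Start from the observed joint distribution over label combinations,
    # which carries the dependencies between labels.
    prob = {combo: c / total for combo, c in counts.items()}
    for _ in range(n_iter):
        for label, t in target.items():
            cur = marginal(prob, label)
            if not 0.0 < cur < 1.0:
                continue  # margin cannot be rescaled; skip this label
            up, down = t / cur, (1.0 - t) / (1.0 - cur)
            for combo in prob:
                prob[combo] *= up if label in combo else down
    return prob

# Hypothetical toy data: 100 articles over three non-exclusive labels.
# Observed label frequencies: A = 0.80, B = 0.40, C = 0.20.
articles = ([{"A"}] * 50 + [{"A", "B"}] * 20 + [{"B"}] * 10 + [{"A", "C"}] * 5
            + [{"C"}] * 5 + [{"B", "C"}] * 5 + [{"A", "B", "C"}] * 5)
counts = Counter(frozenset(s) for s in articles)

# Hypothetical target: same frequency order, smaller spread.
target = {"A": 0.60, "B": 0.45, "C": 0.35}
prob = rake_combinations(counts, target)

# Spread each combination's probability evenly over the articles that carry it,
# then draw a weighted sub-sample (with replacement, for simplicity).
weights = [prob[frozenset(s)] / counts[frozenset(s)] for s in articles]
rng = random.Random(0)
subsample = rng.choices(articles, weights=weights, k=40)
```

Because the weights are attached to whole label combinations rather than to individual labels, the co-occurrence pattern observed in the data shapes the reweighted distribution. A real application would likely need sampling without replacement and a check that the target is attainable given the combinations actually observed.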

Keywords

Multivariate Bernoulli Distribution

Constrained Optimization

Weighted Sampling

Main Sponsor

Survey Research Methods Section