Multivariate Bernoulli-Based Sampling Method for Multi-label Data with Application to Meta-Research
Donna Maney
Co-Author
Emory University, Dept. Psychology
Andrew Brown
Co-Author
University of Arkansas for Medical Sciences
Simon Chung
First Author
University of Arkansas for Medical Sciences, Department of Biostatistics
Simon Chung
Presenting Author
University of Arkansas for Medical Sciences, Department of Biostatistics
Tuesday, Aug 5: 2:20 PM - 2:35 PM
2547
Contributed Papers
Music City Center
In real-world applications, datasets may contain observations with multiple labels that are not mutually exclusive. Sampling methods for such data must therefore account for dependencies among labels. We propose a novel sampling algorithm designed for multi-label datasets. Our algorithm uses the observed label frequencies to estimate the parameters of a multivariate Bernoulli distribution. Using optimization constrained to a target distribution, we calculate a weight for each combination of labels. This approach ensures that, after weighted sampling, the sub-sample acquires the characteristics of the target distribution while accounting for the label dependencies. Our use case was a broad sample of research articles from Scopus labeled with 66 biomedical topic categories, with the imbalanced distribution typical of multi-label data. We needed to sample from the literature in a way that 1) preserved the rank order of category frequencies, 2) reduced the gap in frequency between the most and least common categories, and 3) accounted for the dependencies among categories. With this approach, we produced a more balanced sub-sample, thereby enhancing the representation of minority categories.
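The general idea of weighting label combinations toward a target can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the optimizer or the target, so the example below swaps in iterative proportional fitting (raking) as a stand-in for the constrained optimization, treats the observed combination proportions as a simple empirical estimate of the joint distribution, and uses a hypothetical target that shrinks each label frequency halfway toward the mean. The toy data, the function names (`shrink_toward_mean`, `label_frequencies`, `rake`), and the parameter `lam` are all assumptions made for illustration.

```python
import random
from collections import Counter


def label_frequencies(weights):
    """Marginal frequency of each label under a distribution over label combinations."""
    freq = Counter()
    for combo, w in weights.items():
        for label in combo:
            freq[label] += w
    return dict(freq)


def shrink_toward_mean(freq, lam=0.5):
    """Hypothetical target: move each frequency a fraction lam toward the mean.

    For 0 <= lam < 1 this keeps the rank order and narrows the spread.
    """
    mean = sum(freq.values()) / len(freq)
    return {k: p + lam * (mean - p) for k, p in freq.items()}


def rake(observed, target, sweeps=1000):
    """Iterative proportional fitting on the label marginals (a stand-in optimizer).

    Rescales combination weights until each label's marginal matches its target.
    Assumes the target is attainable on the observed combinations and that every
    label marginal stays strictly between 0 and 1.
    """
    w = dict(observed)
    for _ in range(sweeps):
        for label, t in target.items():
            inside = sum(v for c, v in w.items() if label in c)
            for c in w:
                w[c] *= t / inside if label in c else (1 - t) / (1 - inside)
    return w


# Toy multi-label data (assumed): each row is the set of labels on one article.
rows = (
    [frozenset("A")] * 50 + [frozenset("AB")] * 20 + [frozenset("B")] * 10
    + [frozenset("AC")] * 5 + [frozenset("C")] * 5 + [frozenset("BC")] * 10
)

counts = Counter(rows)
observed = {combo: k / len(rows) for combo, k in counts.items()}
freq = label_frequencies(observed)

target = shrink_toward_mean(freq, lam=0.5)
combo_w = rake(observed, target)

# Spread each combination's weight evenly over the rows that carry it,
# then draw a weighted sub-sample (with replacement, for simplicity).
row_w = [combo_w[r] / counts[r] for r in rows]
sample = random.Random(0).choices(rows, weights=row_w, k=40)
```

Because the weights are attached to whole label combinations rather than to individual labels, co-occurring labels are up- or down-weighted together, which is one simple way to respect label dependencies. A real application with many categories would likely need a more careful choice of target and a check that the target is attainable.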
Multivariate Bernoulli Distribution
Constrained optimization
Weighted Sampling
Main Sponsor
Survey Research Methods Section