06/06/2024: 1:15 PM - 2:45 PM EDT
Refereed
Room: Roanoke
Chair
Michael Pokojovy, Old Dominion University
Tracks
Statistical Data Science
Symposium on Data Science and Statistics (SDSS) 2024
Presentations
kNN-based ensemble methods minimise the effect of outliers by identifying the data points in the given feature space that are nearest to an unseen observation and predicting its response by majority voting. Ordinary kNN-based ensembles find the k nearest observations within a region (bounded by a sphere) for a predefined value of k. This approach, however, might not work when the test observation follows the pattern of the closest same-class data points that lie on a path not contained in that sphere. This paper proposes a k nearest neighbour ensemble in which the neighbours are determined in k steps. Starting from the observation nearest to the test point, the algorithm identifies, at each step, the single observation that is closest to the observation selected at the previous step. In each base learner of the ensemble, this search is run for k steps on a random bootstrap sample with a random subset of features selected from the feature space. The final predicted class of the test point is determined by a majority vote over the classes predicted by all base models. The new ensemble method is applied to 20 benchmark datasets and compared with other classical methods, including kNN-based models, using classification accuracy, kappa and Brier score as performance metrics. Boxplots are also used to illustrate the differences between the results of the proposed method and those of other state-of-the-art methods. The proposed method outperformed the considered classical methods in the majority of cases. It is further assessed through a detailed simulation study.
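The stepwise neighbour search described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, the rule that a row is never visited twice, and the default of about sqrt(p) features per learner are all assumptions made for the example.

```python
import numpy as np


def k_step_neighbours(X, x_test, k):
    """Return indices of up to k rows of X found by a stepwise nearest-neighbour walk."""
    remaining = np.ones(len(X), dtype=bool)    # rows not yet visited
    current = x_test
    path = []
    for _ in range(min(k, len(X))):
        dists = np.linalg.norm(X - current, axis=1)
        dists[~remaining] = np.inf             # assumption: never revisit a row
        nxt = int(np.argmin(dists))
        path.append(nxt)
        remaining[nxt] = False
        current = X[nxt]                       # next step starts from this row
    return path


def majority(labels):
    """Most frequent label; ties go to the smallest label value."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]


def predict_ensemble(X, y, x_test, k=3, n_learners=25, n_features=None, seed=0):
    """Majority vote over base learners built on bootstrap rows and random feature subsets."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_features = n_features or max(1, int(np.sqrt(p)))   # assumed default
    votes = []
    for _ in range(n_learners):
        rows = rng.integers(0, n, size=n)                      # bootstrap sample
        cols = rng.choice(p, size=n_features, replace=False)   # random feature subset
        Xb, yb = X[rows][:, cols], y[rows]
        path = k_step_neighbours(Xb, x_test[cols], k)
        votes.append(majority(yb[path]))
    return majority(np.array(votes))
```

For training points at 1.0, 2.5, 4.5 and -2.0 on a line and a test point at 0.0, the walk with k = 3 visits 1.0, 2.5 and 4.5 in turn, whereas an ordinary 3-nearest-neighbour search would pick 1.0, -2.0 and 2.5. In this sketch, duplicated rows in a bootstrap sample count as separate steps.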
Presenting Author
Saeed Aldahmani
First Author
Saeed Aldahmani
CoAuthor(s)
Zardad Khan
Naz Gul, Abdul Wali Khan University
Amjad Ali, United Arab Emirates University
Encompassing numerous nationwide, statewide, and institutional initiatives in the United States, provider profiling has evolved into a major health care undertaking with ubiquitous applications, profound implications, and high-stakes consequences. Reflecting this importance, a large body of literature is dedicated to improving the statistical methodology of provider profiling. Although they tackle wide-ranging profiling issues, these methods typically adjust for risk factors using linear predictors. While this simple approach generally leads to reasonable assessments, it can be too restrictive to characterize complex and dynamic factor-outcome associations in certain contexts. One such example arises in evaluating dialysis facilities that treat Medicare beneficiaries with end-stage renal disease, based on 30-day unplanned readmissions in 2020. In this context, the impact of in-hospital COVID-19 on the risk of readmission varied dramatically across pandemic phases. To efficiently capture this variation while profiling facilities, we develop a generalized partially linear model (GPLM) that incorporates a feedforward neural network. To account for provider-level clustering, we fit the GPLM with a stratified sampling-based stochastic optimization algorithm that features accelerated convergence. Furthermore, an exact test is designed to identify under- and over-performing facilities, with an accompanying funnel plot visualizing the profiling results. The advantages of the proposed methods are demonstrated through simulation experiments and the profiling of dialysis facilities using 2020 Medicare claims sourced from the United States Renal Data System.
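As a rough illustration of the model structure named in this abstract, the sketch below combines a facility-specific intercept, a linear predictor, and a small feedforward network, and draws a minibatch by sampling within each facility. The abstract does not give the specification, so the logit link, the single tanh hidden layer, the equal sampling fraction per facility, and every name and dimension here are assumptions; the data are synthetic.

```python
import numpy as np


def gplm_probability(facility, X, Z, gamma, beta, W1, b1, w2):
    """Outcome probability: facility effect + linear part + network part (assumed logit link)."""
    hidden = np.tanh(Z @ W1 + b1)              # one hidden layer (assumed architecture)
    eta = gamma[facility] + X @ beta + hidden @ w2
    return 1.0 / (1.0 + np.exp(-eta))


def stratified_minibatch(facility, fraction, rng):
    """Sample the same fraction of records within every facility (stratum)."""
    picked = []
    for f in np.unique(facility):
        idx = np.flatnonzero(facility == f)
        size = max(1, int(round(fraction * len(idx))))
        picked.append(rng.choice(idx, size=size, replace=False))
    return np.concatenate(picked)


# Synthetic stand-in data: 5 hypothetical facilities with 40 discharges each.
rng = np.random.default_rng(0)
facility = np.repeat(np.arange(5), 40)
X = rng.normal(size=(200, 3))                  # risk factors entering linearly
Z = rng.normal(size=(200, 2))                  # inputs to the network, e.g. COVID-19 status and time

gamma = rng.normal(size=5)                     # facility effects
beta = rng.normal(size=3)                      # linear coefficients
W1, b1, w2 = rng.normal(size=(2, 4)), np.zeros(4), rng.normal(size=4)

batch = stratified_minibatch(facility, 0.25, rng)
p = gplm_probability(facility[batch], X[batch], Z[batch], gamma, beta, W1, b1, w2)
```

Sampling within facilities keeps every provider represented in each minibatch, so every facility effect receives a gradient signal at each stochastic update.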
Presenting Author
Wenbo Wu, New York University Grossman School of Medicine
First Author
Wenbo Wu, New York University Grossman School of Medicine
CoAuthor(s)
Fan Li, Yale School of Public Health
Richard Liu, NYU Grossman School of Medicine
Yiting Li, NYU Grossman School of Medicine
Mara McAdams-DeMarco, NYU Grossman School of Medicine
Krzysztof Geras, NYU Grossman School of Medicine
Douglas Schaubel, University of Pennsylvania
Iván Díaz, NYU Grossman School of Medicine
Online platform businesses can be identified from web-scraped texts. This is a classification problem that combines elements of natural language processing and rare event detection. Because online platforms are rare, accurately identifying them with machine learning algorithms is challenging. Here, we describe the development of a machine learning-based text classification approach that reduces the number of false positives as much as possible. By using calibrated probabilities and ensembles, it also greatly reduces the bias in the resulting estimates.
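The following sketch illustrates, on simulated data, why false positives bias a count of a rare class and how calibrated probabilities can reduce that bias. It is not the authors' pipeline: the prevalence, the true and false positive rates, and the two-bin calibration on a labelled hold-out set are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n, prevalence=0.01, tpr=0.9, fpr=0.05):
    """Simulate true labels and the flags of an imperfect classifier (hypothetical rates)."""
    y = rng.random(n) < prevalence
    flagged = np.where(y, rng.random(n) < tpr, rng.random(n) < fpr)
    return y, flagged


y_cal, flagged_cal = simulate(50_000)          # labelled hold-out set used for calibration
y_new, flagged_new = simulate(50_000)          # target set; labels used only for checking

# Two-bin calibration: observed share of true positives among flagged and unflagged items.
p_if_flagged = y_cal[flagged_cal].mean()
p_if_not_flagged = y_cal[~flagged_cal].mean()
calibrated = np.where(flagged_new, p_if_flagged, p_if_not_flagged)

true_count = int(y_new.sum())
naive_count = int(flagged_new.sum())           # counts every flagged item as a platform
calibrated_count = float(calibrated.sum())     # sum of calibrated probabilities

print(true_count, naive_count, round(calibrated_count, 1))
```

Because the positive class is rare, most flagged items are false positives, so counting flags overstates the number of platforms, while summing calibrated probabilities stays close to the true count.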
Presenting Author
Piet Daas, Statistics Netherlands & Eindhoven University of Technology
First Author
Luuk Gubbels, Eindhoven University of Technology
CoAuthor(s)
Marco Puts, Statistics Netherlands
Piet Daas, Statistics Netherlands & Eindhoven University of Technology