CS019 Classification and Modeling

Conference: Symposium on Data Science and Statistics (SDSS) 2024
06/06/2024: 1:15 PM - 2:45 PM EDT
Refereed 
Room: Roanoke 

Chair

Michael Pokojovy, Old Dominion University

Tracks

Statistical Data Science
Symposium on Data Science and Statistics (SDSS) 2024

Presentations

A k nearest neighbour ensemble via extended neighbourhood rule and feature subsets

kNN-based ensemble methods reduce the effect of outliers by identifying the data points in the given feature space that are nearest to an unseen observation and predicting its response by majority voting. Ordinary kNN-based ensembles find the k nearest observations in a region (bounded by a sphere) for a predefined value of k. This approach, however, might not work when the test observation follows the pattern of the closest data points of the same class that lie on a path not contained in that sphere. This paper proposes a k nearest neighbour ensemble in which the neighbours are determined in k steps. Starting from the observation nearest to the test point, the algorithm at each step identifies the single observation closest to the one selected at the previous step. In each base learner of the ensemble, this search is carried out for k steps on a random bootstrap sample with a random subset of features selected from the feature space. The final predicted class of the test point is determined by a majority vote over the classes predicted by all base models. The new ensemble method is applied to 20 benchmark datasets and compared with other classical methods, including kNN-based models, using classification accuracy, kappa and Brier score as performance metrics. Boxplots are also used to illustrate the differences between the results of the proposed and other state-of-the-art methods. The proposed method outperformed the considered classical methods in the majority of cases. It is further assessed through a detailed simulation study.
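
The k-step neighbour search and the ensemble vote described in this abstract can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names (`extended_neighbours`, `predict_ensemble`), the Euclidean distance, the default of about sqrt(p) features per learner, and the handling of ties and repeated bootstrap rows are all assumptions.

```python
import math
import random
from collections import Counter


def _dist(a, b, feats):
    """Euclidean distance between rows a and b, using only the features in feats."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in feats))


def extended_neighbours(X, x_new, k, feats):
    """Pick k rows of X as a chain: step 1 takes the row nearest to x_new,
    every later step takes the unused row nearest to the previously chosen row."""
    chosen = []
    current = x_new
    remaining = set(range(len(X)))
    for _ in range(min(k, len(X))):
        # Ties in distance are broken by the smaller row index (an assumption).
        nxt = min(remaining, key=lambda i: (_dist(X[i], current, feats), i))
        chosen.append(nxt)
        remaining.remove(nxt)
        current = X[nxt]
    return chosen


def predict_ensemble(X, y, x_new, k=3, n_learners=25, n_feats=None, seed=0):
    """Majority vote over base learners, each built on a bootstrap sample
    and a random feature subset (hypothetical defaults)."""
    rng = random.Random(seed)
    p = len(X[0])
    n_feats = n_feats or max(1, int(math.sqrt(p)))
    votes = []
    for _ in range(n_learners):
        boot = [rng.randrange(len(X)) for _ in range(len(X))]
        feats = rng.sample(range(p), n_feats)
        # Repeated bootstrap rows are kept as separate rows (an assumption).
        Xb = [X[i] for i in boot]
        yb = [y[i] for i in boot]
        chain = extended_neighbours(Xb, x_new, k, feats)
        votes.append(Counter(yb[i] for i in chain).most_common(1)[0][0])
    return Counter(votes).most_common(1)[0][0]


# One-dimensional toy data: the chain walks 1.0 -> 2.0 -> 3.0, whereas an
# ordinary 3-nearest-neighbour search around 0.0 would pick 1.0, -1.5 and 2.0.
line = [[1.0], [2.0], [3.0], [-1.5]]
chain = extended_neighbours(line, [0.0], k=3, feats=[0])
```

In the toy example the chain follows the points lying along one direction instead of staying inside a sphere around the test point, which is the behaviour the abstract motivates.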

Presenting Author

Saeed Aldahmani

First Author

Saeed Aldahmani

CoAuthor(s)

Zardad Khan
Naz Gul, Abdul Wali Khan University
Amjad Ali, United Arab Emirates University

Leveraging Neural Networks to Profile Health Care Providers with Application to Medicare Claims

Encompassing numerous nationwide, statewide, and institutional initiatives in the United States, provider profiling has evolved into a major health care undertaking with ubiquitous applications, profound implications, and high-stakes consequences. In line with such a significant profile, the literature has accumulated an enormous collection of articles dedicated to enhancing the statistical paradigm of provider profiling. Tackling wide-ranging profiling issues, these methods typically adjust for risk factors using linear predictors. While this simple approach generally leads to reasonable assessments, it can be too restrictive to characterize complex and dynamic factor-outcome associations in certain contexts. One such example arises from evaluating dialysis facilities treating Medicare beneficiaries with end-stage renal disease based on 30-day unplanned readmissions in 2020. In this context, the impact of in-hospital COVID-19 on the risk of readmission varied dramatically across pandemic phases. To efficiently capture the variation while profiling facilities, we develop a generalized partially linear model (GPLM) that incorporates a feedforward neural network. Considering provider-level clustering, we implement the GPLM as a stratified sampling-based stochastic optimization algorithm that features accelerated convergence. Furthermore, an exact test is designed to identify under- and over-performing facilities, with an accompanying funnel plot visualizing profiling results. The advantages of the proposed methods are demonstrated through simulation experiments and the profiling of dialysis facilities using 2020 Medicare claims sourced from the United States Renal Data System.
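
The abstract does not write out the model, so the following is only a generic sketch of how a GPLM with provider effects is commonly expressed. All notation here is an assumption: $Y_{ij}$ is the readmission indicator for discharge $j$ at facility $i$, $\gamma_i$ a facility effect, $Z_{ij}$ the risk factors entering linearly, $X_{ij}$ the covariates passed through a feedforward network $f_{\theta}$, and $g$ a link function such as the logit.

```latex
g\bigl(\mathbb{E}[\,Y_{ij} \mid Z_{ij}, X_{ij}\,]\bigr)
  = \gamma_i + Z_{ij}^{\top}\beta + f_{\theta}(X_{ij})
```

Under this reading, the linear term keeps the usual risk adjustment interpretable, while $f_{\theta}$ can absorb effects that change over time, such as the varying impact of in-hospital COVID-19 mentioned in the abstract.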

Presenting Author

Wenbo Wu, New York University Grossman School of Medicine

First Author

Wenbo Wu, New York University Grossman School of Medicine

CoAuthor(s)

Fan Li, Yale School of Public Health
Richard Liu, NYU Grossman School of Medicine
Yiting Li, NYU Grossman School of Medicine
Mara McAdams-DeMarco, NYU Grossman School of Medicine
Krzysztof Geras, NYU Grossman School of Medicine
Douglas Schaubel, University of Pennsylvania
Iván Díaz, NYU Grossman School of Medicine

Bias Correction in Machine Learning-based Classification of Rare Events

Online platform businesses can be identified by using web-scraped texts. This is a classification problem that combines elements of natural language processing and rare event detection. Because online platforms are rare, accurately identifying them with machine learning algorithms is challenging. Here, we describe the development of a machine learning-based text classification approach that reduces the number of false positives as much as possible. By using calibrated probabilities and ensembles, it greatly reduces the bias in the resulting estimates.
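
The abstract does not specify the estimator. One common way calibrated probabilities are used to reduce bias when counting rare positives is to sum the probabilities instead of counting thresholded predictions. The sketch below illustrates that idea on made-up scores; the function names and the numbers are hypothetical and are not results from the presentation.

```python
def classify_and_count(probs, threshold=0.5):
    """Estimate the number of positives by counting scores at or above a threshold."""
    return sum(1 for p in probs if p >= threshold)


def probabilistic_count(probs):
    """Estimate the number of positives as the sum of calibrated probabilities."""
    return sum(probs)


# Hypothetical calibrated scores for 1000 websites (illustrative values only):
# a few confident positives, some uncertain cases, and many near-certain negatives.
probs = [0.9] * 10 + [0.3] * 20 + [0.02] * 970

hard_estimate = classify_and_count(probs)   # counts only the 10 confident cases
soft_estimate = probabilistic_count(probs)  # also credits the uncertain and low-score cases
```

If the scores are well calibrated, the thresholded count ignores the positives hidden among low-scoring items, which matters most when the positive class is rare; the summed probabilities account for them in expectation.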

Presenting Author

Piet Daas, Statistics Netherlands & Eindhoven University of Technology

First Author

Luuk Gubbels, Eindhoven University of Technology

CoAuthor(s)

Marco Puts, Statistics Netherlands
Piet Daas, Statistics Netherlands & Eindhoven University of Technology