Recent Advances in Machine Learning and Data Science

Kellin Rumsey, Chair
Los Alamos National Laboratory
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
4024 
Contributed Papers 
Music City Center 
Room: CC-Davidson Ballroom A2 

Main Sponsor

IMS

Presentations

Beyond the Cutoff: a Graphical Approach to Combining Multiple Fuzzy Regression Discontinuity Designs

Regression discontinuity design (RDD) allows for robust estimation of the local average treatment effect at the cutoff. However, the effect has limited generalizability beyond that cutoff. To address this limitation, we propose a method to combine data from multiple fuzzy RDDs that share the same score and outcome variables. Our work is motivated by the Dutch Arthroplasty Register dataset, which contains data on primary total hip arthroplasty (THA) from Dutch hospitals. Because hospitals use varying age-based cutoff points to decide on the fixation type in THA, we can estimate the treatment effect for a broader population. The key challenge in integrating the data is the presence of compliance groups, which are an inherent part of any fuzzy RDD. We take a rigorous, novel graphical approach, which has not yet been exploited in the context of regression discontinuity design. We model the compliance types as depending on population characteristics rather than on specific hospitals. This approach allows us to view hospital selection as a conditional instrumental variable. Finally, we propose a doubly robust estimator of the treatment effect that exploits local estimates at the cutoff points. 
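
As background for readers unfamiliar with fuzzy RDD (this is not the authors' doubly robust estimator), the standard local Wald estimator at a single cutoff divides the jump in the outcome by the jump in the treatment probability, each estimated by local linear regression. A minimal sketch, with a triangular kernel and a hand-picked bandwidth as illustrative assumptions:

```python
import numpy as np

def local_linear_jump(x, v, c, h):
    """Jump in E[v | x] at cutoff c: local linear fits on each side,
    triangular kernel with bandwidth h, each evaluated at the cutoff."""
    sides = []
    for mask in (x >= c, x < c):
        xs, vs = x[mask], v[mask]
        w = np.clip(1 - np.abs(xs - c) / h, 0, 1)      # triangular kernel weights
        X = np.column_stack([np.ones_like(xs), xs - c])
        Xw = X * w[:, None]                            # X^T diag(w) without forming diag(w)
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ vs)    # weighted least squares
        sides.append(beta[0])                          # fitted value at the cutoff
    return sides[0] - sides[1]                         # right limit minus left limit

def fuzzy_rdd_wald(x, y, d, c, h):
    """Fuzzy RDD effect at the cutoff: outcome jump divided by the
    jump in treatment probability (a local Wald ratio)."""
    return local_linear_jump(x, y, c, h) / local_linear_jump(x, d, c, h)
```

The abstract's contribution is to pool such local estimates across hospitals with different cutoffs `c`; the sketch above only covers the single-cutoff building block.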

Keywords

Regression discontinuity design

Causal inference

Multi-site observational study

Extrapolation

Instrumental variable 

Co-Author(s)

Stéphanie van der Pas, Vrije Universiteit Amsterdam
Mark Van De Wiel, VU University Medical Center

First Author

Julia Kowalska, Vrije Universiteit Amsterdam

Presenting Author

Julia Kowalska, Vrije Universiteit Amsterdam

Challenges of the Transition from Teaching Business Statistics to Teaching Business Analytics

With the growing volume and variety of data, knowledge of business analytics is essential for organizations to make data-driven decisions that can significantly improve the performance, efficiency, and competitiveness of the business. As the popularity of analytics increases, so does the need for people with the knowledge and skills to convert big data into the actionable insights that managers and other decision makers need in the real business world. Colleges and universities are increasing analytics-related offerings at both the graduate and undergraduate levels. The instructors tasked with teaching these new courses are generally the same instructors relied upon for business statistics education. This paper offers recommendations on classroom teaching, available software choices, and course materials based on our experience teaching the business analytics course. 

Keywords

Business Statistics

Business Analytics

Big Data

Teaching 

Co-Author

Eric Howington, Valdosta State University

First Author

Mitra Devkota, University of North Georgia

Presenting Author

Mitra Devkota, University of North Georgia

Diversifying conformal selections

When selecting from a list of potential candidates, it is important to ensure not only that those selected are of high quality, but also that they are diverse. For instance, in drug discovery, scientists aim to select potent drugs from a library of unsynthesized candidates, but recognize that it is wasteful to repeatedly synthesize highly similar compounds. In contrast to prior works, which study the problem of making many selections subject to a false discovery rate (FDR) constraint, this paper considers the problem of making a diverse set of selections subject to the same constraint. Our method, diversity-aware conformal selection (DACS), works with a user-specified notion of diversity and runs an optimization procedure to construct a maximally diverse selection set subject to a simple constraint involving certain stopping-time-based conformal e-values. The practitioner has flexibility in the choice of e-values, and DACS's key insight is to use optimal stopping theory to make this choice in a way that (approximately) maximizes diversity. We demonstrate the empirical performance of our method both in simulation and on real datasets. 
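
For readers new to e-value-based selection, the standard device for turning e-values into an FDR-controlling selection set is the e-BH procedure; DACS's constraint builds on this machinery (the stopping-time construction of the e-values themselves is the paper's contribution and is not sketched here). A minimal e-BH implementation:

```python
import numpy as np

def e_bh(e_values, alpha):
    """e-BH procedure: given n e-values (large = strong evidence),
    select the k largest where k is the biggest index i such that the
    i-th largest e-value is at least n / (alpha * i). This controls
    the FDR at level alpha for arbitrary dependence among e-values."""
    e = np.asarray(e_values, dtype=float)
    n = len(e)
    order = np.argsort(-e)                          # indices, descending e-value
    thresholds = n / (alpha * np.arange(1, n + 1))  # n/(alpha*1), n/(alpha*2), ...
    passed = e[order] >= thresholds
    if not passed.any():
        return np.array([], dtype=int)              # nothing selected
    k = np.max(np.where(passed)[0]) + 1
    return np.sort(order[:k])
```

For example, with e-values `[50, 1, 0.5, 40, 2]` and `alpha = 0.2`, the thresholds are `25, 12.5, ...`, so the two large e-values (indices 0 and 3) are selected.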

Keywords

Conformal prediction

E-values

Optimal stopping 

Co-Author(s)

Ying Jin, Stanford University
James Yang
Emmanuel Candes, Stanford University

First Author

Yash Nair, Stanford University

Presenting Author

Yash Nair, Stanford University

Effects of sample weights on the performance of machine learning models using complex survey data

Recent studies have applied machine learning (ML) methods while ignoring sample weights determined by a complex survey design. This inadequacy stems from the lack of ML software packages that handle sample-weight adjustment. We have developed the R-MLSurvey package, which suitably incorporates sample weights into ML algorithms via replicate-weight methods in the cross-validation (CV) step, as an extension of weighted LASSO regression. The ML models considered are penalized logistic regression, i.e., L_1 and Elastic Net (EN), Random Forest (RF), and extreme gradient boosting (XGBoost), developed with design-based K-fold cross-validation (dCV) and Jackknife repeated replication (JKn) for weighted ML models. The final models were evaluated with weighted performance metrics. We discuss the effects of sample weights on prediction and variable selection with two class-imbalance examples, hypertension and diabetes, using National Health and Nutrition Examination Survey (NHANES) data. Two under-sampling approaches were utilized to balance the classes ad hoc. 
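
The core idea of survey-weighted fitting and evaluation, independent of the authors' R-MLSurvey package, is that each respondent's contribution to both the loss and the performance metric is scaled by their design weight. A minimal numpy-only sketch of weighted logistic regression (the learning rate and iteration count are illustrative assumptions, and the replicate-weight CV machinery is omitted):

```python
import numpy as np

def fit_weighted_logit(X, y, w, lr=0.1, n_iter=500):
    """Logistic regression by gradient ascent on the survey-weighted
    log-likelihood: each unit's score contribution is scaled by w."""
    X1 = np.column_stack([np.ones(len(X)), X])    # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X1 @ beta))          # predicted probabilities
        grad = X1.T @ (w * (y - p)) / w.sum()     # weighted score, normalized
        beta += lr * grad
    return beta

def weighted_accuracy(X, y, w, beta):
    """Design-weighted performance metric: weighted share correct."""
    X1 = np.column_stack([np.ones(len(X)), X])
    pred = (X1 @ beta > 0).astype(float)
    return np.average(pred == y, weights=w)
```

Evaluating with `weighted_accuracy` rather than the unweighted share correct is what makes the estimated performance refer to the survey's target population rather than to the sample.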

Keywords

complex survey, replicate weights, NHANES, sample weights, machine learning, class imbalance 

Co-Author(s)

Paul Rogers, FDA-NCTR
Dong Wang, FDA National Center for Toxicological Research (NCTR)

First Author

Hyeonju Kim, NCTR

Presenting Author

Hyeonju Kim, NCTR

WITHDRAWN - Learning Counterfactual Distributions via Kernel Nearest Neighbors

Consider a setting with multiple units (e.g., individuals, cohorts, geographic locations) and outcomes (e.g., treatments, times, items), where the goal is to learn a multivariate distribution for each unit-outcome entry, such as the distribution of a user's weekly spend and engagement under a specific mobile app version. A common challenge is the prevalence of missing-not-at-random data, where the missingness can be correlated with properties of the distributions themselves, i.e., there is unobserved confounding. An additional challenge is that for any observed unit-outcome entry, we only have a finite number of samples from the underlying distribution. We tackle these two challenges by casting the problem into a novel distributional matrix completion framework and introducing a kernel-based distributional generalization of nearest neighbors to estimate the underlying distributions. By leveraging maximum mean discrepancies and a suitable factor model on the kernel mean embeddings of the underlying distributions, we establish consistent recovery of the underlying distributions even when data is missing not at random and positivity constraints are violated. 
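
The distance underlying such a kernel nearest-neighbor scheme is the maximum mean discrepancy (MMD) between the sample sets attached to two entries. A minimal sketch for one-dimensional samples with an RBF kernel (the abstract's factor model and debiasing are not reproduced here; the V-statistic MMD estimate and the fixed bandwidth are simplifying assumptions):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """RBF (Gaussian) kernel matrix between two 1-D sample vectors."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Plug-in (V-statistic) estimate of the squared MMD between the
    distributions that generated samples x and y."""
    return (rbf(x, x, sigma).mean() + rbf(y, y, sigma).mean()
            - 2 * rbf(x, y, sigma).mean())

def nearest_units(target, others, sigma=1.0, k=2):
    """Indices of the k units whose empirical distributions are closest
    to the target in MMD -- the neighbors a distributional NN method
    would average (in embedding space) to fill in a missing entry."""
    d = [mmd2(target, o, sigma) for o in others]
    return np.argsort(d)[:k]
```

Averaging the kernel mean embeddings of the selected neighbors, rather than their raw samples, is what lets the method estimate a whole distribution per missing entry.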

Keywords

Distribution recovery

Kernel methods

Missing-not-at-random

Nearest neighbors

Mean embedding factor model 

Co-Author(s)

Jacob Feitelberg, Columbia University
Caleb Chin, Cornell University
Anish Agarwal, Columbia University
Raaz Dwivedi, UC Berkeley

First Author

Kyuseong Choi

On improved matrix estimators in high-dimensional data

In this talk, we introduce a class of improved estimators for the mean parameter matrix of a multivariate normal distribution with an unknown variance-covariance matrix. In particular, some recent results are established in their full generality, and we revisit some results that are useful in studying the risk dominance of shrinkage estimators. We generalize the existing methods in three ways. First, we consider a parametric estimation problem that includes as a special case the problem for a vector parameter. Second, we propose a class of James-Stein matrix estimators, and we establish a necessary condition and a sufficient condition for any member of the proposed class to have a finite risk function. Third, we present conditions for the proposed class of estimators to dominate the maximum likelihood estimator. On top of these contributions, an additional novelty is that we extend methods suited to the vector-parameter case, and the derived results hold in the classical setting as well as for high- and ultra-high-dimensional data. 
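
The vector-parameter special case the talk generalizes is the classical James-Stein estimator: for X ~ N_p(theta, sigma^2 I) with p >= 3, shrinking X toward the origin by a data-dependent factor dominates the maximum likelihood estimator X under squared-error loss. A minimal sketch of the positive-part variant (known sigma^2; the talk's matrix and unknown-covariance setting is more general):

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimator of a p-dimensional normal
    mean (p >= 3): shrink the observation x toward 0 by a factor that
    is larger the smaller ||x||^2 is, floored at full shrinkage."""
    p = len(x)
    shrink = 1 - (p - 2) * sigma2 / np.sum(x**2)
    return max(shrink, 0.0) * x
```

A quick simulation at theta = 0 shows the risk dominance: the average squared error of the James-Stein estimate is far below the MLE's risk of p.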

Keywords

Invariant quadratic loss

James-Stein estimation

Location parameter

Minimax estimation

Moore-Penrose inverse

Risk function 

Co-Author

Arash Foroushani, University of Windsor

First Author

Severien Nkurunziza, University of Windsor

Presenting Author

Severien Nkurunziza, University of Windsor

Statistical properties of the rectified transport

The problem of finding a transformation mapping one distribution into another is a relevant mathematical problem with applications in physics, genomics, and beyond. When this transformation is assumed to be monotonic, the problem corresponds to finding the so-called optimal transport map, for which a rich mathematical regularity theory is available and for which a non-parametric estimation theory has recently been established. These statistical results indicate that plug-in estimators of such maps converge faster than expected for kernel density estimators, a consequence of the extra degree of smoothness of the optimal map compared to the original densities. Moreover, a central limit theorem has been established for such estimators under suitable bandwidth selection, enabling uncertainty quantification. The main drawback is that their computation is typically intractable, as it relies on solving an optimal transport problem in the continuum, for which we can only obtain approximate solutions.


To deal with these issues, we propose rectified transport as an alternative to optimal transport. The rectified map (Liu et al., 2022) is a relaxation of optimal transport. 
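
To see what a plug-in transport-map estimator looks like in the one case where it is computationally trivial, recall that in one dimension the monotone (optimal) map is T = G^{-1} o F, the target quantile function composed with the source CDF. A minimal empirical sketch (this illustrates the plug-in estimators discussed above, not the rectified map itself):

```python
import numpy as np

def plugin_transport_1d(source, target):
    """Plug-in estimate of the monotone transport map in 1-D:
    push x through the empirical CDF of the source sample, then
    through the empirical quantile function of the target sample."""
    src = np.sort(source)
    tgt = np.sort(target)
    def T(x):
        u = np.searchsorted(src, x, side='right') / len(src)  # empirical F(x)
        u = np.clip(u, 1e-9, 1 - 1e-9)                        # keep quantile level valid
        return np.quantile(tgt, u)                            # empirical G^{-1}(u)
    return T
```

In higher dimensions no such closed form exists, which is exactly the computational obstruction the abstract describes and the rectified map is meant to sidestep.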

Keywords

Optimal Transport

Nonparametric estimation

Nonparametric Regression

Statistical rates 

Co-Author(s)

Arun Kumar Kuchibhotla, Carnegie Mellon University
Larry Wasserman, Carnegie Mellon University

First Author

Gonzalo Mena, Carnegie Mellon University

Presenting Author

Gonzalo Mena, Carnegie Mellon University