SPEED 5: Machine Learning, Visualization, and Non-Parametric Statistical Approaches, Part 1

Yuhua Zhang Chair
Harvard University
 
Tuesday, Aug 6: 8:30 AM - 10:20 AM
5078 
Contributed Speed 
Oregon Convention Center 
Room: CC-D135 

Presentations

Analyzing Spatial Dependence in Functional Data and Shapes of 2D Curves

In this work, we model the shapes of spatially dependent functional data or boundaries of two-dimensional (2D) objects, i.e., spatially dependent shapes of parameterized curves. Functional data is often composed of two confounded sources of variation: amplitude and phase. Amplitude captures shape differences among functions while phase captures timing differences in these shape features. Similarly, boundaries of 2D objects represented as parameterized curves exhibit variation in terms of their shape, translation, scale, orientation and parameterization. We study the spatial dependence among functions or curves by first decomposing given data into the different sources of variation. The proposed framework leverages a modified definition of the trace-variogram, which is commonly used to capture spatial dependence in functional data. We propose different types of trace-variograms that capture different components of variation in functional or shape data, and use them to define a functional/shape mark-weighted K function by considering their locations in the spatial domain as random. This statistical summary then allows us to study the spatial dependence in each source of variation separately. Efficacy of the proposed framework is demonstrated through extensive simulation studies and real data applications.  
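
Editor's note: a minimal sketch of the kind of spatial summary this abstract builds on, namely an empirical trace-variogram for functional data observed at spatial sites, assuming curves sampled on a common grid. The modified, component-specific trace-variograms and the mark-weighted K function proposed in the work are not implemented here; all names and settings are illustrative.

```python
import numpy as np

def empirical_trace_variogram(coords, curves, bin_edges):
    """Empirical trace-variogram for functional data on a common grid.

    coords    : (n, 2) spatial locations
    curves    : (n, T) functions evaluated on a shared grid over [0, 1]
    bin_edges : edges of the spatial-distance bins
    Returns bin centers and trace-semivariogram estimates gamma(h).
    """
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Riemann approximation of the integrated squared difference between each pair of curves
    isd = ((curves[:, None, :] - curves[None, :, :]) ** 2).mean(axis=2)
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)
    centers, gamma = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = upper & (d > lo) & (d <= hi)
        if mask.any():
            centers.append(0.5 * (lo + hi))
            gamma.append(0.5 * isd[mask].mean())   # semivariogram convention: half the mean
    return np.array(centers), np.array(gamma)

# toy example: 30 sites with curves whose phase drifts with the first coordinate
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))
grid = np.linspace(0, 1, 50)
curves = np.sin(2 * np.pi * grid[None, :] + 0.2 * coords[:, :1]) + 0.1 * rng.normal(size=(30, 50))
h, g = empirical_trace_variogram(coords, curves, np.linspace(0, 10, 6))
print(np.round(h, 1), np.round(g, 3))
```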

Keywords

shape

clustering

trace-variogram

kriging 

Co-Author(s)

Karthik Bharath, University of Nottingham
Sebastian Kurtek, The Ohio State University

First Author

Ye Jin Choi

Presenting Author

Ye Jin Choi

Application of Machine Learning Models to Blood Metal Exposures in the NHANES Data

Identifying high blood metal exposure levels in humans is important because medical interventions or recommendations can be provided to reduce and prevent future exposures. We aimed to use machine learning to develop identification models. Five machine learning models (Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF)) were applied to NHANES 2015-2016 blood metal data. For blood cadmium (BCd) and lead (BPb) exposures, sex, poverty income ratio (PIR), race, age group, and cotinine level were used as model attributes, while for total mercury (THg) exposure we used sex, PIR, race, age group, and shellfish consumption. Blood metal concentrations greater than or equal to the 75th percentile were considered "higher exposure." Five metrics were used to evaluate model performance: accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. The KNN model had the best performance in predicting BCd and THg exposures, while the LDA model was best for predicting BPb exposure. 
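
A hedged sketch of the modeling workflow described above, using scikit-learn on synthetic stand-in data; the variable names and the simulated relationship to cotinine are placeholders, not the authors' code or the NHANES files.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
n = 1000
# synthetic stand-in for the NHANES attributes used for blood cadmium / lead
X = pd.DataFrame({
    "sex": rng.integers(0, 2, n),
    "pir": rng.uniform(0, 5, n),          # poverty income ratio
    "race": rng.integers(0, 5, n),
    "age_group": rng.integers(0, 4, n),
    "cotinine": rng.lognormal(0, 1, n),   # serum cotinine
})
bcd = 0.3 * X["cotinine"] + 0.1 * X["age_group"] + rng.lognormal(0, 0.5, n)
y = (bcd >= np.quantile(bcd, 0.75)).astype(int)   # "higher exposure" = >= 75th percentile

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, m.predict(X_te)).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    print(f"{name}: acc={acc:.2f} sens={sens:.2f} spec={spec:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```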

Keywords

machine learning

metal exposure

NHANES

lead

cadmium

mercury 

Co-Author(s)

Jeffery Jarrett, CDC
Cynthia Ward, CDC
Liza Valentin-Blasini, CDC

First Author

Po-Yung Cheng, CDC

Presenting Author

Po-Yung Cheng, CDC

Comparison of Test Statistics in the Ridge Probit Regression Model: Simulation and Application

Ridge regression is a method that has been proposed to solve the multicollinearity problem in both linear and non-linear regression models. This paper studies different Ridge regression z-type tests of the individual coefficients for the Probit regression model. A simulation study was conducted to evaluate and compare the performance of the test statistics with respect to their empirical size and power under different simulation conditions. Our simulations identified which of the proposed tests maintain type I error rates close to the 5% nominal level while simultaneously showing gains in statistical power over the standard Wald z-test commonly used in Probit regression models. Our paper is the first of its kind to compare z-type tests for these different shrinkage approaches to estimation in Probit Ridge regression. The results will be valuable for applied statisticians and researchers in the area of regression models. 
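
A rough sketch of one ingredient of such a comparison: a ridge-penalized probit fit and a Wald-type z statistic for each coefficient. The penalty value, the simulated collinear design, and the use of an expected-information sandwich as the approximate covariance are illustrative assumptions; the Liu and Kibria-Lukman estimators studied in the paper are not shown.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n, p, k = 200, 4, 0.5                      # k = ridge penalty (illustrative)
# collinear design: last column nearly duplicates the first
Z = rng.normal(size=(n, p))
Z[:, -1] = Z[:, 0] + 0.05 * rng.normal(size=n)
beta_true = np.array([0.8, 0.0, 0.5, 0.0])
y = (rng.uniform(size=n) < norm.cdf(Z @ beta_true)).astype(float)

def neg_penalized_loglik(b):
    eta = Z @ b
    ll = y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta)
    return -(ll.sum() - 0.5 * k * b @ b)

b_ridge = minimize(neg_penalized_loglik, np.zeros(p), method="BFGS").x

# one common choice of approximate covariance: probit expected information in a ridge sandwich
eta = Z @ b_ridge
w = norm.pdf(eta) ** 2 / (norm.cdf(eta) * norm.cdf(-eta))
info = Z.T @ (w[:, None] * Z)
bread = np.linalg.inv(info + k * np.eye(p))
cov = bread @ info @ bread
z = b_ridge / np.sqrt(np.diag(cov))
print(np.round(z, 2))                      # Wald-type z statistics, one per coefficient
```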

Keywords

Probit regression

Ridge regression

Liu regression

Kibria-Lukman regression

Empirical power

Type I error rate 

Co-Author(s)

Zoran Bursac, Florida International University
BM Golam Kibria, Florida International University

First Author

Sergio Perez Melo

Presenting Author

Sergio Perez Melo

Differentially private kernel empirical risk minimization via K-means Nyström approximation

Since differential privacy has become the standard concept for privacy guarantees, considerable work has been invested in differentially private kernel learning. However, with rare exceptions, most of this work is restricted to differentially private kernel learning with translation-invariant kernels. In addition, many proposed frameworks release a differentially private kernel learner with fixed hyperparameters, which excludes the hyperparameter tuning procedure from the framework. In this work, we propose a framework for differentially private kernel empirical risk minimization that supports kernel learning with general kernels via the K-means Nyström approximation, with theoretical guarantees. We also propose a differentially private kernel mean embedding for general kernels. Additionally, we provide differentially private kernel ridge regression and logistic regression methods that can learn various regularization parameters simultaneously. 
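
A non-private sketch of the K-means Nyström approximation on which the framework builds, assuming an RBF kernel; the differential-privacy mechanisms, the private kernel mean embedding, and the simultaneous learning of regularization parameters described above are beyond this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel
from scipy.linalg import sqrtm

def kmeans_nystrom_features(X, m=50, gamma=1.0, seed=0):
    """Approximate RBF kernel features using K-means centers as Nystrom landmarks."""
    centers = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    K_nm = rbf_kernel(X, centers, gamma=gamma)        # n x m cross-kernel
    K_mm = rbf_kernel(centers, centers, gamma=gamma)  # m x m landmark kernel
    # Phi @ Phi.T approximates the full kernel matrix, with Phi = K_nm K_mm^{-1/2}
    K_mm_inv_sqrt = np.linalg.pinv(np.real(sqrtm(K_mm)))
    return K_nm @ K_mm_inv_sqrt

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
Phi = kmeans_nystrom_features(X, m=50, gamma=0.5)
K_exact = rbf_kernel(X, X, gamma=0.5)
print(np.linalg.norm(K_exact - Phi @ Phi.T) / np.linalg.norm(K_exact))  # relative error
```

Downstream, any empirical risk minimizer (e.g., ridge or logistic regression) can be run on Phi; in the work above, the privacy guarantee would come from perturbing this pipeline, which is not shown here.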

Keywords

differential privacy

kernel learning

empirical risk minimization

Nyström approximation

K-means Nyström approximation

Kernel mean embedding 

Co-Author(s)

Cheolwoo Park, KAIST
Jeongyoun Ahn, University of Georgia

First Author

Bonwoo Lee

Presenting Author

Bonwoo Lee

Estimation of Non-stationary Covariance Function with/without Replications

Spatial statistics frequently involve modeling (transformed) data by a Gaussian process F, but the covariance of F may not be stationary. In the context of replicated observations of a Gaussian spatial field, this study introduces a nonparametric approach applicable to a general covariance function that may be non-stationary. For situations where only a single observation is available with no replication, a local block bootstrap procedure is proposed to generate additional observations. We compare the covariance estimated by the empirical method with estimates from parametric and nonparametric methodologies, and we discuss the strengths and limitations of each approach in capturing the complex covariance structure inherent in spatial data. 
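
An illustrative sketch of the replication-based empirical covariance estimate mentioned above, assuming R independent replicates observed at the same n sites; the local block bootstrap for the single-replicate case is not implemented here.

```python
import numpy as np

def empirical_covariance(fields):
    """fields : (R, n) array, R replicates of a spatial field at n fixed sites.
    Returns the n x n empirical covariance, with no stationarity assumption."""
    R = fields.shape[0]
    centered = fields - fields.mean(axis=0, keepdims=True)
    return centered.T @ centered / (R - 1)

# toy non-stationary example: marginal variance grows with the site coordinate
rng = np.random.default_rng(4)
sites = np.linspace(0, 1, 40)
scale = 0.5 + sites                           # non-constant standard deviation
base_cov = np.exp(-np.abs(sites[:, None] - sites[None, :]) / 0.2)
cov_true = scale[:, None] * base_cov * scale[None, :]
L = np.linalg.cholesky(cov_true + 1e-10 * np.eye(40))
fields = rng.normal(size=(200, 40)) @ L.T     # 200 replicates with covariance cov_true
print(np.abs(empirical_covariance(fields) - cov_true).max())
```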

Keywords

Gaussian process

non-stationary

nonparametric

bootstrap

Covariance Function 

Co-Author

Jiayang Sun, George Mason University

First Author

Yiying Fan, Cleveland State University

Presenting Author

Yiying Fan, Cleveland State University

Evaluating Heartbeat Segmentation Methods for the Analysis of Electrocardiogram Data

Electrocardiogram (ECG) data can provide physicians with valuable clinical information related to a patient's heart health. The analysis of ECG data poses many practical challenges, and the analyst must make several decisions related to data preparation. One important consideration is how to segment raw ECG data – which can include hours of ECG recordings containing thousands of individual heartbeats – into analyzable pieces. Many options exist, but popular methods include looking at time series representations of multiple beats or performing an individual beat-by-beat analysis. Furthermore, there are additional researcher degrees of freedom associated with each approach, such as determining the length of the time series or how individual beats should be segmented. This work investigates the performance of several techniques used to process ECG data into individual heartbeats when used to classify arrhythmias. Data are taken from the MIT-BIH Arrhythmia Database and used to train several different types of arrhythmia classifiers. The results are then compared to explore the possibility of developing a general recommendation for an ECG individual-beat segmentation procedure. 
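
A minimal sketch of one beat-by-beat segmentation option discussed above: detect R-peaks and cut a fixed-length window around each. The synthetic signal, window lengths, and peak-detection settings are illustrative assumptions; the study itself works with the MIT-BIH recordings and annotations.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 360                                     # sampling rate (Hz), as in MIT-BIH
t = np.arange(0, 10, 1 / fs)
# crude synthetic ECG stand-in: sharp periodic spikes plus noise
ecg = np.exp(-((t % 0.8) / 0.02) ** 2) + 0.05 * np.random.default_rng(5).normal(size=t.size)

# R-peak detection, then a fixed window of 250 ms before / 400 ms after each peak
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))
pre, post = int(0.25 * fs), int(0.40 * fs)
beats = np.array([ecg[p - pre:p + post]
                  for p in peaks if p - pre >= 0 and p + post <= ecg.size])
print(beats.shape)   # (number of beats, samples per beat): rows feed a beat-level classifier
```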

Keywords

Electrocardiogram

Classification 

Co-Author(s)

Katie Ma, University of Central Oklahoma
Emily Hendryx Lyons, University of Central Oklahoma

First Author

Tyler Cook, University of Central Oklahoma

Presenting Author

Tyler Cook, University of Central Oklahoma

High-Dimensional Genetic Survival Analysis with Kernel-Based Neural Networks

Survival data analysis is pivotal in statistics and biostatistics, where the Cox Proportional Hazards (Cox PH) model stands out as a widely embraced approach. Recent technological and genetic advancements have broadened our understanding of disease-related genes, unveiling over 1800 identified disease-related genes. However, the complexity of identifying numerous genetic variants influencing disease progression arises from the intricate interplay between genetic and environmental factors, coupled with nonlinear and multifaceted relationships. To meet these challenges, we introduce a kernel-based neural network model. Similar to traditional neural networks, this model utilizes its hierarchical structure to learn complex features and interactions within genetic data. Simulations demonstrate that the kernel-based neural network model outperforms both the traditional Cox model and the Cox prediction model with PyTorch (PyCox) in terms of estimation and prediction accuracy, especially when handling nonlinear high-dimensional covariate effects. The advantages of our model over the Cox model and PyCox are further illustrated through real-world applications. 

Keywords

survival analysis

Cox proportional hazards model

kernel-based neural networks

high dimensional data

genetic analysis 

Co-Author(s)

Qing Lu
Chenxi Li, Michigan State University

First Author

Rongzi Liu

Presenting Author

Rongzi Liu

Kernel Density Estimation for Compositional Data with Zeros using Reflection on Sphere

Compositional data are data that carry information on the relative proportions of components in each observation. This type of data arises in various fields, such as chemometrics and bioinformatics. Estimating the density of compositional data is crucial for gaining insight into the underlying patterns; for example, an estimated density can be used to compare compositional structure between distinct groups. Despite its significance, there has been little focus on nonparametric density estimation for compositional data. Furthermore, many prior works assume that the compositional data contain no zeros, even though many real-world datasets do contain zero components. In this work, we propose a kernel density estimation (KDE) method for compositional data that naturally handles zero components. We leverage the topological equivalence between the simplex and the first orthant of a sphere and reflect the data into all orthants, establishing a connection with spherical KDE. We investigate asymptotic properties of the proposed KDE method, including consistency, and compare it with existing methods through simulation and real data analysis. 
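
A toy sketch of the reflection idea, under editorial assumptions: compositions are mapped to the first orthant of the sphere via the square-root transform, the sample is reflected into all 2^d orthants, and a von Mises-Fisher-type spherical KDE is evaluated. The concentration (bandwidth) parameter is arbitrary, and the change of variables back to the simplex is omitted.

```python
import numpy as np
from itertools import product
from scipy.special import iv

def vmf_const(kappa, d):
    """Normalizing constant of the von Mises-Fisher density on the (d-1)-sphere."""
    return kappa ** (d / 2 - 1) / ((2 * np.pi) ** (d / 2) * iv(d / 2 - 1, kappa))

def reflected_spherical_kde(comp_data, comp_eval, kappa=50.0):
    """KDE for compositions (rows sum to 1, zeros allowed) via the square-root map
    to the sphere and reflection of the sample into all orthants."""
    d = comp_data.shape[1]
    X = np.sqrt(comp_data)                                    # first-orthant points on the sphere
    E = np.sqrt(comp_eval)
    signs = np.array(list(product([-1.0, 1.0], repeat=d)))    # 2^d reflections
    refl = (X[:, None, :] * signs[None, :, :]).reshape(-1, d)
    c = vmf_const(kappa, d)
    dens = c * np.exp(kappa * (E @ refl.T)).mean(axis=1)      # spherical KDE of reflected sample
    return (2 ** d) * dens                                    # fold mass back onto the first orthant

rng = np.random.default_rng(6)
raw = rng.dirichlet([2.0, 1.0, 0.5], size=200)
raw[rng.uniform(size=200) < 0.2, 2] = 0.0                     # introduce zeros in one component
comp = raw / raw.sum(axis=1, keepdims=True)
print(reflected_spherical_kde(comp, comp[:5]))
```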

Keywords

Compositional data

Zero components

Kernel density estimation

Spherical KDE 

Co-Author(s)

Hyunbin Choi
Jeongyoun Ahn, Korea Advanced Institute of Science and Technology

First Author

Changwon Yoon, Department of Industrial & Systems Engineering, KAIST

Presenting Author

Changwon Yoon, Department of Industrial & Systems Engineering, KAIST

Low-rank, Orthogonally Decomposable Tensor Regression

Multi-dimensional tensor data have gained increasing attention in recent years. We consider the problem of fitting a generalized linear model with a three-dimensional image covariate, such as one obtained by functional magnetic resonance imaging (fMRI). Many classical penalized regression techniques do not account for the spatial structure in imaging data. We assume the parameter tensor is orthogonally decomposable, enabling us to penalize the tensor singular values and avoid a priori specification of the rank. Under this assumption, we additionally propose to penalize the internal variation of the parameter tensor. Our approach provides an effective method to reduce dimensionality and control piecewise smoothness of imaging data. The effectiveness of our method is demonstrated on synthetic data and real MRI data. 
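
A small sketch of the building block implied by penalizing singular values: the proximal (soft-thresholding) operator of the nuclear norm inside a proximal-gradient loop, shown here for a matrix-valued coefficient. The unfolding choice, step size, and penalty level are illustrative; the orthogonal-decomposability structure and the internal-variation penalty of the abstract are not implemented.

```python
import numpy as np

def prox_nuclear(B, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||B||_*."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# proximal gradient for least squares with a nuclear-norm penalty on a p1 x p2 coefficient matrix
rng = np.random.default_rng(7)
n, p1, p2 = 300, 10, 10
B_true = np.outer(rng.normal(size=p1), rng.normal(size=p2))        # rank-1 signal
X = rng.normal(size=(n, p1 * p2))
y = X @ B_true.ravel() + 0.1 * rng.normal(size=n)

B, step, lam = np.zeros((p1, p2)), 1.0 / np.linalg.norm(X, 2) ** 2, 5.0
for _ in range(200):
    grad = (X.T @ (X @ B.ravel() - y)).reshape(p1, p2)   # gradient of the squared-error loss
    B = prox_nuclear(B - step * grad, step * lam)
print(np.linalg.matrix_rank(B, tol=1e-6),
      np.linalg.norm(B - B_true) / np.linalg.norm(B_true))
```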

Keywords

Low-rank approximation

Tensor regression

Nuclear norm

Internal variation 

Co-Author(s)

Cheolwoo Park, KAIST
Jeongyoun Ahn, University of Georgia

First Author

Jungmin Kwon

Presenting Author

Jungmin Kwon

Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test

Maximum mean discrepancy (MMD) refers to a class of nonparametric two-sample tests based on maximizing the mean difference between samples from distribution P versus Q over all data transformations f in a function space F. Inspired by recent work connecting functions of Radon bounded variation (RBV) and neural networks (NNs), we study the MMD taking F to be the unit ball in the RBV space of a given smoothness degree k ≥ 0. This test, named the Radon-Kolmogorov-Smirnov (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to NNs: we prove that the RKS test's witness – the function f achieving the MMD – is always a ridge spline of degree k, i.e., a single neuron in an NN. We can thus leverage modern NN optimization toolkits to (approximately) maximize the criterion underlying the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any P ≠ Q, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test. 
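
For reference, a short sketch of the traditional kernel MMD permutation test that the abstract uses as a comparator (not the RKS test itself); the Gaussian kernel and its bandwidth are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(X, Y, gamma):
    """Biased squared-MMD estimate with an RBF kernel."""
    Kxx = rbf_kernel(X, X, gamma=gamma)
    Kyy = rbf_kernel(Y, Y, gamma=gamma)
    Kxy = rbf_kernel(X, Y, gamma=gamma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def mmd_perm_test(X, Y, gamma=0.5, B=500, seed=0):
    rng = np.random.default_rng(seed)
    obs = mmd2(X, Y, gamma)
    Z, n = np.vstack([X, Y]), len(X)
    null = []
    for _ in range(B):                            # permutation null: shuffle sample labels
        idx = rng.permutation(len(Z))
        null.append(mmd2(Z[idx[:n]], Z[idx[n:]], gamma))
    pval = (1 + np.sum(np.array(null) >= obs)) / (B + 1)
    return obs, pval

rng = np.random.default_rng(8)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.3, 1.0, size=(100, 2))           # mean-shifted alternative
print(mmd_perm_test(X, Y))
```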

Keywords

nonparametric two-sample testing

maximum mean discrepancy (integral probability metric)

neural network based test

ridge spline

Kolmogorov-Smirnov test 

Co-Author(s)

Michael Celentano
Alden Green
Ryan Tibshirani, Carnegie Mellon University

First Author

Seunghoon Paik

Presenting Author

Seunghoon Paik

Mind your zeros: accurate p-value approximation in permutation testing

Permutation procedures are essential for hypothesis testing when the distributional assumptions about the considered test statistic are not met or unknown, but are challenging in scenarios with limited permutations, such as complex biomedical studies. P-values may either be zero, making multiple testing adjustment problematic, or too large to remain significant after adjustment. A common heuristic solution is to approximate extreme p-values by fitting a Generalized Pareto Distribution (GPD) to the tail of the distribution of the permutation test statistics. In practice, an estimated negative shape parameter combined with extreme test statistics can again result in zero p-values. To address this issue, we present a comprehensive workflow for accurate permutation p-value approximation that fits a constrained GPD and strictly avoids zero p-values. We also propose new methods that address the challenges of determining an optimal GPD threshold and correcting for multiple testing. Through extensive simulations, our approach demonstrates considerably higher accuracy than existing methods. The computational framework will be available as the open-source R package "permAprox". 
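
A hedged sketch of the basic GPD tail-approximation step for a permutation p-value, using scipy; the threshold choice, the constrained fit, and the multiple-testing corrections that are the paper's actual contributions are not reproduced here and are what the permAprox package is intended to provide.

```python
import numpy as np
from scipy import stats

def gpd_perm_pvalue(t_obs, perm_stats, n_exceed=250):
    """Approximate a permutation p-value by fitting a GPD to the largest permutation statistics.

    t_obs      : observed test statistic (larger = more extreme)
    perm_stats : statistics from B permutations
    n_exceed   : number of top order statistics used as the tail sample
    """
    B = len(perm_stats)
    srt = np.sort(perm_stats)
    u = srt[-n_exceed - 1]                        # threshold just below the tail sample
    exc = srt[-n_exceed:] - u
    shape, _, scale = stats.genpareto.fit(exc, floc=0.0)
    # P(T >= t_obs) ~= P(T > u) * P(T - u >= t_obs - u | T > u)
    # note: with a negative fitted shape, sf() can still return exactly 0 for very extreme
    # t_obs, which is the failure mode the constrained fit described above avoids
    tail_p = stats.genpareto.sf(t_obs - u, shape, loc=0.0, scale=scale)
    return (n_exceed / B) * tail_p

rng = np.random.default_rng(9)
perm_stats = rng.chisquare(df=3, size=2000)       # stand-in for permutation statistics
print(gpd_perm_pvalue(t_obs=25.0, perm_stats=perm_stats))   # beyond all 2000 permutations
```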

Keywords

Non-parametric hypothesis testing

Permutation test

Generalized Pareto Distribution (GPD)

Multiple testing correction

Differential abundance and differential association testing in microbiome studies

R package 

Co-Author(s)

Martin Depner, Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München, Munich
Erika von Mutius, Institute of Asthma and Allergy Prevention, Helmholtz Zentrum München
Anne-Laure Boulesteix, Institute for Medical Information Processing, Biometry and Epidemiology, LMU München
Christian L. Müller, Helmholtz Munich

First Author

Stefanie Peschel, Munich Center for Machine Learning, Munich, Germany

Presenting Author

Stefanie Peschel, Munich Center for Machine Learning, Munich, Germany

Nonlinear outlier detection leveraging high dimensionality of kernel-induced feature space

Classical linear methods frequently exhibit limited effectiveness for detecting outliers in real-world datasets. To overcome this limitation, we propose a nonlinear outlier detection method that exploits the high dimensionality of kernel-induced feature space. When the data dimension exceeds the sample size, we can calculate the orthogonal distance between each data point and the hyperplane spanned by the other data points. We demonstrate that we can calculate DH (Distance to Hyperplane) in kernel-induced space (kernelized DH) by treating the induced space as a high-dimensional space. Utilizing kernelized DH as a measure of outlyingness, we conduct a permutation test using kernelized DH as a test statistic to determine whether each data point is an outlier or not. Since the model uses only the kernel matrix, we can detect outliers in various data types for which an appropriate kernel can be defined. The experimental results, based on simulated data and real datasets, demonstrate the competitive performance of the proposed method. 
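
An illustrative sketch of the distance-to-hyperplane idea in a kernel-induced feature space: the squared distance from each point to the affine span of the remaining points, computed from the kernel matrix alone. The RBF kernel, the jitter added for numerical stability, and the omission of the permutation-test step are editorial assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernelized_dh(X, gamma=0.5, jitter=1e-8):
    """Squared distance from phi(x_i) to the affine span of {phi(x_j), j != i},
    computed only from the kernel matrix."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = K.shape[0]
    d2 = np.empty(n)
    for i in range(n):
        idx = np.delete(np.arange(n), i)
        Ki = K[np.ix_(idx, idx)] + jitter * np.eye(n - 1)
        ki = K[idx, i]
        # KKT system for min_c ||phi(x_i) - sum_j c_j phi(x_j)||^2  s.t.  sum_j c_j = 1
        A = np.block([[2 * Ki, np.ones((n - 1, 1))],
                      [np.ones((1, n - 1)), np.zeros((1, 1))]])
        sol = np.linalg.solve(A, np.concatenate([2 * ki, [1.0]]))
        c = sol[:-1]
        d2[i] = K[i, i] - 2 * c @ ki + c @ Ki @ c
    return d2

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(size=(60, 2)), np.array([[6.0, 6.0]])])   # one gross outlier
scores = kernelized_dh(X)
print(np.argsort(-scores)[:3])   # the outlier should rank among the largest distances
```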

Keywords

Outlier detection

Distance to hyperplane

Kernel method

Kernel-induced feature space 

Co-Author

Jeongyoun Ahn, Korea Advanced Institute of Science and Technology

First Author

Giwon Kim

Presenting Author

Giwon Kim

Nonparametric Density Estimation using Predictive Recursion

We built a novel statistical machine learning framework for nonparametric density estimation using predictive recursion (PR). In a mixture model, to estimate the unknown mixing density one can use finite mixture models, which require estimating the number of mixture components, or a Dirichlet process, which requires a prior assumption on the form of the mixing density. We proposed PR, which (i) does not require estimating the number of components, (ii) makes no assumptions on the form of the mixing density, and (iii) is fast, unlike MCMC-based methods. PR is capable of capturing spatial characteristics of the data, distinguishing high- and low-density regions. We then extended our approach by integrating deep learning into the conditional nonparametric density estimation setting, using kernel functions to capture complex, high-dimensional relationships. Finally, we showed the capability of our method in terms of density estimation and predictive performance by comparing its results to state-of-the-art algorithms. 
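
A compact sketch of the classical predictive recursion update for a Gaussian-kernel mixture on a grid, which the framework above builds on; the weight sequence, kernel bandwidth, and grid are illustrative, and the deep-learning extension for conditional densities is not shown.

```python
import numpy as np
from scipy.stats import norm

def predictive_recursion(y, grid, sigma=0.3, seed=0):
    """Predictive recursion estimate of the mixing density f(u) in the mixture
    m(y) = int N(y; u, sigma^2) f(u) du, represented on a fixed grid."""
    rng = np.random.default_rng(seed)
    y = rng.permutation(y)                        # PR is order-dependent; one random order here
    du = grid[1] - grid[0]
    f = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))   # flat initial guess
    for i, yi in enumerate(y, start=1):
        w = (i + 1) ** (-0.67)                    # slowly decaying weight sequence
        k = norm.pdf(yi, loc=grid, scale=sigma)
        m = np.sum(k * f) * du                    # predictive density of y_i
        f = (1 - w) * f + w * k * f / m
    return f

def mixture_density(y_eval, grid, f, sigma=0.3):
    du = grid[1] - grid[0]
    return np.array([np.sum(norm.pdf(y, loc=grid, scale=sigma) * f) * du for y in y_eval])

rng = np.random.default_rng(11)
y = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1.5, 0.7, 300)])   # bimodal sample
grid = np.linspace(-5, 5, 400)
f_hat = predictive_recursion(y, grid)
print(mixture_density(np.array([-2.0, 0.0, 1.5]), grid, f_hat).round(3))
```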

Keywords

Nonparametric Density Estimation

Machine Learning

Deep Learning

Kernel Functions 

Co-Author(s)

Christopher Tosh, Memorial Sloan Kettering Cancer Centertoshc@mskcc.org
Gemma Moran, Rutgers University
Mithat Gonen, Memorial Sloan-Kettering Cancer Center
Wesley Tansey, Memorial Sloan Kettering Cancer Center

First Author

Ayyuce Begum Bektas, Memorial Sloan Kettering Cancer Center

Presenting Author

Ayyuce Begum Bektas, Memorial Sloan Kettering Cancer Center

Predictive Modeling of Microbiome Data with Interaction Effects

In microbiome research, predicting an outcome of interest from microbial abundances via sparse regression models is a common task. However, models linear in the features might be too simple to capture dynamics in communities, as microbial species tend to interact with one another. To address this, we propose a framework that includes strategies for modeling interaction effects in presence-absence data of microbial species, absolute abundance data, and compositional microbial 16S rRNA sequencing data, where only relative abundance information is available. Our framework incorporates an extension of the constrained lasso for compositional data to interaction effects, as well as the statistical concept of hierarchy to enhance the interpretability of interaction effects. Based on synthetic data, we demonstrate the conditions under which true effects can be statistically detected, considering varying sparsity of features and varying noise levels. For a selection of real-world microbiome datasets, we show that robust interaction effects between microbial species can be detected and that predictive accuracy can be improved when modeling interaction effects compared with purely additive effects. 
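
A simplified sketch of interaction modeling with a sparse regression: pairwise interaction features built from log relative abundances, followed by a lasso fit. The constrained log-contrast lasso for compositional data and the hierarchy constraints that are central to the framework above are not implemented in this illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(12)
n, p = 300, 20
counts = rng.poisson(lam=rng.gamma(2.0, 2.0, size=(n, p))) + 1    # pseudo-count avoids log(0)
rel = counts / counts.sum(axis=1, keepdims=True)                  # compositional (relative) data
Z = np.log(rel)

# outcome driven by two main effects and one interaction between taxa 0 and 3
y = 1.5 * Z[:, 0] - 1.0 * Z[:, 3] + 2.0 * Z[:, 0] * Z[:, 3] + 0.5 * rng.normal(size=n)

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z_int = poly.fit_transform(Z)                                     # main effects + pairwise products
fit = LassoCV(cv=5).fit(Z_int, y)
names = poly.get_feature_names_out([f"taxon{j}" for j in range(p)])
selected = [(nm, round(c, 2)) for nm, c in zip(names, fit.coef_) if abs(c) > 0.05]
print(selected)
```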

Keywords

interaction modeling

microbial interactions

compositional data

sparsity

lasso

hierarchical interactions 

Co-Author(s)

Christian L. Müller, Helmholtz Munich
Jacob Bien, University of Southern California

First Author

Mara Stadler, Helmholtz Center Munich

Presenting Author

Mara Stadler, Helmholtz Center Munich

Random Forests and Clustering for Identifying Clinical Phenotypes

Random Forests can be used for both classification and clustering. In a supervised Random Forest used for classification, each subject has a known group label. In an unsupervised Random Forest used for clustering, the proximity matrix needed for clustering can be estimated. Clustering algorithms use data to form groups of similar subjects that share distinct properties. Phenotypes can be identified using a proximity matrix generated by the unsupervised Random Forest and subsequent clustering by the Partitioning Around Medoids (PAM) algorithm. PAM uses the dissimilarity matrix in its partitioning (clustering) algorithm and is more robust to noise and outliers than the more commonly used k-means algorithm.

We present results that identify distinct phenotypes or groups of subjects that are Hispanic/Latino with chronic low back pain. Data consisted of sensor-based measures of posture and movement, pain behavior, and psychological measures. Groupings may provide a basis for a more personalized plan of care, including pain management strategies that encourage movement and rest periods. 
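
A hedged sketch of the proximity-then-PAM pipeline described in the abstract, on synthetic data: an unsupervised random forest (real vs. column-permuted observations) yields proximities, which are converted to dissimilarities and clustered with k-medoids. The use of the scikit-learn-extra KMedoids implementation and all tuning choices are editorial assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn_extra.cluster import KMedoids   # from the scikit-learn-extra package (assumed installed)

def rf_proximity(X, n_trees=500, seed=0):
    """Unsupervised random-forest proximities: real data vs. column-permuted synthetic data."""
    rng = np.random.default_rng(seed)
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    Z = np.vstack([X, X_synth])
    y = np.r_[np.ones(len(X)), np.zeros(len(X))]
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(Z, y)
    leaves = rf.apply(X)                                   # (n, n_trees) terminal-node ids
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

rng = np.random.default_rng(13)
X = np.vstack([rng.normal(0, 1, size=(50, 6)), rng.normal(3, 1, size=(50, 6))])   # two latent groups
prox = rf_proximity(X)
dissim = np.sqrt(1.0 - prox)                               # common proximity-to-dissimilarity transform
labels = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit_predict(dissim)
print(np.bincount(labels))
```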

Keywords

random forests

chronic low back pain 

First Author

Barbara Bailey, San Diego State University

Presenting Author

Barbara Bailey, San Diego State University

Recalibration of time-varying covariate Cox model in external validation

External validation assesses a prediction model on external population data different from the original development cohort and often requires recalibration to preserve the accuracy of the model's outcome predictions. Assessing the calibration of a Cox model in external validation requires not only visualizing calibration plots but also testing whether the original and recalibrated models differ significantly with respect to the intercept and slope. In the Cox model, however, there is no intercept (γ0) to estimate. As an alternative to the existing approach for logistic regression models (Vergouwe et al., 2017), which recalibrates the linear predictor as γ0 + γ1(Xβ̂) and tests γ0 = 0 and γ1 = 1, we conducted a log-likelihood test comparing two models: the calibration-in-the-large model (γ1 = 1) versus the recalibrated model (γ1 = γ̃), where γ̃ is a newly estimated coefficient and β̂ contains the original coefficients of all risk factors X. The study applies these methods to externally validate the Veterans Affairs (VA) women's cardiovascular disease (CVD) risk score in non-Veteran women: civilians and active-duty military service members. 
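
A hedged sketch of the likelihood ratio comparison described above, using statsmodels' PHReg on a simulated external cohort: the original linear predictor Xβ̂ enters as the single covariate, and the calibration-in-the-large model (slope fixed at 1) is compared with the recalibrated-slope model. The data, variable names, and the absence of time-varying covariates are simplifications, not the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2
from statsmodels.duration.hazard_regression import PHReg

rng = np.random.default_rng(14)
n = 500
lp = rng.normal(size=n)                     # linear predictor X @ beta_hat from the original model
# external cohort where the true effect of lp is attenuated (slope 0.6), so recalibration matters
time = rng.exponential(scale=np.exp(-0.6 * lp))
censor = rng.exponential(scale=2.0, size=n)
status = (time <= censor).astype(int)
time = np.minimum(time, censor)

mod = PHReg(time, lp.reshape(-1, 1), status=status)
res = mod.fit()
gamma_tilde = res.params[0]                                # recalibrated slope
ll_recal = mod.loglike(res.params)                         # recalibrated model (gamma1 = gamma_tilde)
ll_large = mod.loglike(np.array([1.0]))                    # calibration-in-the-large (gamma1 = 1)
lr = 2 * (ll_recal - ll_large)
print(round(gamma_tilde, 2), round(lr, 2), round(chi2.sf(lr, df=1), 4))
```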

Keywords

External validation

Recalibration

Women

Cardiovascular Disease

Risk Score

Veterans 

Co-Author

Xiaofei Chen

First Author

Haekyung Jeon-Slaughter, University of Texas Southwestern Medical Center

Presenting Author

Haekyung Jeon-Slaughter, University of Texas Southwestern Medical Center

Subsampling Winner Algorithm 2 for Feature Selection from Large Data

The Subsampling Winner Algorithm (SWA; Fan and Sun, 2021) provides a novel alternative to penalized methods and random forest procedures. This paper introduces SWA2, the next generation of SWA, designed for more general data that may exhibit heteroskedasticity or interactions. The performance of SWA2 is compared with benchmark methods, including LASSO and SCAD. A new, faster algorithm will be demonstrated through examples. 

Keywords

feature selection

ensemble method

regression

heteroskedasticity

high dimensional data 

Co-Author(s)

Jiayang Sun, George Mason University
Yiying Fan, Cleveland State University

First Author

Wei Dai, George Mason University

Presenting Author

Wei Dai, George Mason University

Text Cluster Profiling using Generative Language Models and Vector Search

Text clustering is a common tool used to identify natural groupings in a set of documents. But once you have the clusters, how do you know what they represent? The answer is often manual review by subject matter experts, which introduces a bottleneck. In prior work, we showed how generative language models can be used to name and describe text clusters. Here, we add a vector search step that assesses the quality of both the cluster and the cluster's description. First, we use a generative language model to generate a brief description of each cluster. Next, we query a vector database of document embeddings to identify the documents most similar to each cluster description. Finally, we calculate F1 for the query results relative to the documents in each cluster. As a proof of concept, we fit five HDBSCAN models to the 20 Newsgroups dataset: one with the correct number of clusters (20) and others with 5, 10, 40, and 80 clusters. We ran this pipeline for each clustering model, as well as for the true 20 Newsgroups classes. The results show how our approach can be used to profile clusters and compare models, and what values to expect relative to a ground truth. 
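
A schematic sketch of the evaluation loop: given document embeddings, cluster labels, and an embedding of each generated cluster description, retrieve the nearest documents and score the retrieval against cluster membership with F1. The random embeddings stand in for the outputs of an embedding model and a generative LM, and a plain cosine-similarity search substitutes for a vector database.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(15)
n_docs, dim, n_clusters = 600, 64, 5
# stand-ins: document embeddings with cluster structure, one "description" embedding per cluster
centers = rng.normal(size=(n_clusters, dim))
labels = rng.integers(0, n_clusters, size=n_docs)
doc_emb = normalize(centers[labels] + 0.8 * rng.normal(size=(n_docs, dim)))
desc_emb = normalize(centers + 0.2 * rng.normal(size=(n_clusters, dim)))   # "generated descriptions"

for c in range(n_clusters):
    members = labels == c
    k = members.sum()                                 # retrieve as many docs as the cluster holds
    sims = doc_emb @ desc_emb[c]                      # cosine similarity (embeddings are unit-norm)
    retrieved = np.zeros(n_docs, dtype=bool)
    retrieved[np.argsort(-sims)[:k]] = True
    print(f"cluster {c}: F1 = {f1_score(members, retrieved):.2f}")
```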

Keywords

Machine Learning

Generative AI

Large Language Models

Vector Databases

Natural Language Processing 

Co-Author(s)

Peter Baumgartner, RTI International
Anthony Berghammer, RTI International

First Author

Alexander Preiss, RTI International

Presenting Author

Alexander Preiss, RTI International

Where to invest? ROI analysis with continuous treatments using doubly robust machine learning

In sales operations, a business makes tradeoff decisions about where to invest and divest in order to optimize revenue. For instance, in advertising sales, a customer segment (e.g., by vertical) responds differently to treatments, such as the number of sales pitches about a new advertising platform. To optimize revenue impacts, the business needs to know where to make investments and divestments. We apply causal inference methods to estimate impacts of continuous treatments (i.e., dollars invested) and support tradeoff decision-making. Specifically, we implement doubly robust machine learning methods on observational sales data to (1) analyze sales treatment mechanisms and (2) estimate their impacts on revenue outcomes. Through simulation and real data analysis, we demonstrate the potential for doubly robust methods to mitigate bias in ROI decision-making in business problems about investments and resourcing. 
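
A hedged sketch of one double machine learning building block for a continuous treatment: cross-fitted residual-on-residual regression in a partially linear model (Robinson-style), implemented with scikit-learn. The nuisance models, the simulated data, and the constant-effect target are illustrative; the estimators used in the work above may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(16)
n = 2000
X = rng.normal(size=(n, 5))                                   # customer-segment covariates
T = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)        # continuous treatment (dollars invested)
theta_true = 1.5                                              # revenue lift per unit of investment
Y = theta_true * T + 3.0 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

# cross-fitted nuisance estimates to avoid overfitting bias
res_Y, res_T = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m_y = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], Y[train])
    m_t = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], T[train])
    res_Y[test] = Y[test] - m_y.predict(X[test])              # outcome residuals
    res_T[test] = T[test] - m_t.predict(X[test])              # treatment residuals

theta_hat = np.sum(res_T * res_Y) / np.sum(res_T ** 2)        # residual-on-residual slope
print(round(theta_hat, 3))
```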

Keywords

Double robustness

ROI analysis

Machine learning

Advertising sales

Decision support 

First Author

Eray Turkel, Stanford University

Presenting Author

Frank Yoon, Google

Multiplier Effects in the Input-Output Model with Two Exogenous Sectors

The study extends the input-output (IO) model to economies in which the government intervenes exogenously in industrial output by setting the import exchange rate or by expanding or contracting public spending. The model is then extended to the more realistic situation in which the economy is driven not only by fiscal stimulus but also by agricultural productivity shocks. The multiplier effects of both the fiscal stimulus and the agricultural shocks on industrial output are computed for 103 industries of Argentina's System of National Accounts (SNA) and compared with elasticities obtained through other econometric methods. Throughout the paper, guidelines are given for computing the empirical direct requirements matrices (needed in the aforementioned models) from the Make and Use Tables of the SNA. 
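
An editor's toy sketch of the standard Leontief calculation underlying such multiplier analyses: with a direct requirements matrix A and an exogenous final-demand shock f, total output is x = (I - A)^{-1} f, and column sums of (I - A)^{-1} give simple output multipliers. The two-exogenous-sector extension and the construction of A from Make and Use Tables are not reproduced here.

```python
import numpy as np

# toy 3-industry direct requirements matrix A (column j = inputs per unit of output of industry j)
A = np.array([[0.10, 0.30, 0.05],
              [0.20, 0.05, 0.25],
              [0.05, 0.10, 0.15]])
L = np.linalg.inv(np.eye(3) - A)           # Leontief inverse (total requirements matrix)

multipliers = L.sum(axis=0)                # simple output multipliers, one per industry
shock = np.array([0.0, 1.0, 0.0])          # exogenous one-unit increase in final demand, industry 2
print(np.round(multipliers, 3), np.round(L @ shock, 3))
```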

Keywords

Input-Output model

fiscal multipliers

agriculture

Make and Use Tables

System of National Accounts

Argentina 

Presenting Author

Luis Frank