New Advances in Random Forests

Lucas Mentch, Chair
University of Pittsburgh
 
Giles Hooker, Organizer
University of Pennsylvania
 
Wednesday, Aug 6: 8:30 AM - 10:20 AM
0665 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-104E 

Applied: Yes

Main Sponsor

Section on Statistical Learning and Data Science

Co-Sponsors

Section on Nonparametric Statistics
Section on Statistical Computing

Presentations

Statistical-computational Trade-offs for Recursive Adaptive Partitioning Estimators

Recursive adaptive partitioning estimators, such as decision trees and their ensembles, are effective for high-dimensional regression but are usually trained greedily, which can become stuck at suboptimal solutions. We study this phenomenon in estimating sparse regression functions over binary features, showing that when the true function satisfies a certain structural property, the Merged Staircase Property (MSP) of Abbe et al. (2022), greedy training achieves low estimation error with a sample size only logarithmic in the number of features. In contrast, when the MSP fails to hold, estimation becomes exponentially harder. Interestingly, this dichotomy between efficient and inefficient estimation mirrors the behavior of two-layer neural networks trained with SGD in the mean-field regime. Meanwhile, recursive adaptive partitioning estimators trained by empirical risk minimization (ERM) achieve low estimation error with logarithmically many samples regardless of the MSP, revealing a fundamental statistical-computational trade-off for greedy training.
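The dichotomy above can be seen in a toy example: for a pure parity target, which lacks staircase structure, no single greedy axis split at the root reduces impurity at all, whereas a staircase-type target rewards the very first split. A minimal sketch (illustrative only, not the paper's construction):

```python
import numpy as np
from itertools import product

# All 2^3 binary feature vectors (toy illustration, not the paper's setup).
X = np.array(list(product([0, 1], repeat=3)), dtype=float)

def best_split_gain(X, y):
    """Largest variance reduction achievable by one greedy axis split."""
    gains = []
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        parent = np.var(y) * len(y)
        children = np.var(left) * len(left) + np.var(right) * len(right)
        gains.append(parent - children)
    return max(gains)

# Parity (no staircase structure): no coordinate split helps at the root.
y_parity = (X[:, 0] != X[:, 1]).astype(float)
# Staircase-type target x1 + x1*x2: splitting on x1 pays off immediately.
y_stair = X[:, 0] + X[:, 0] * X[:, 1]

print(best_split_gain(X, y_parity))  # 0.0: greedy training stalls
print(best_split_gain(X, y_stair))   # positive: greedy makes progress
```

Greedy training sees no signal at the root for the parity target even though a depth-two tree fits it exactly, which is the flavor of suboptimality the abstract describes.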

Keywords

decision trees, random forests, neural networks, greedy algorithms, gradient descent 

Speaker

Jason Klusowski, Princeton University

Tree-Transformers: Improving Tabular Deep Learning by Integrating Random Forests and Transformers

Deep Learning (DL) models excel in domains with unstructured data, such as text and images, but underperform tree-based ensembles like Random Forests (RFs) on tabular data. Recent studies attribute this gap to three key limitations: (1) inability to adapt to sparsity, (2) excessive bias toward smooth solutions, and (3) reliance on rotationally invariant representations, which do not align with real-world data. To address these challenges, we propose Tree-Transformers (TTs), a novel architecture that integrates RFs with transformers. TTs first grow a random forest and extract node-based features from each tree. A transformer is then trained on these representations. To enhance computational efficiency, we employ a mixture-of-experts model that dynamically routes test examples to the most relevant tree-transformer at inference time. Our experiments demonstrate that TTs effectively mitigate the inductive biases of DL models and achieve state-of-the-art performance on real-world tabular benchmarks. 
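As a rough illustration of the first stage, leaf indices from a fitted forest can serve as per-tree tokens for a downstream transformer. The abstract does not specify the exact node-based features, so the leaf-index encoding below is an assumption, not the authors' method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical sketch of the feature-extraction step: encode each example
# as one token per tree, namely the id of the leaf it falls into.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=8, max_depth=3, random_state=0).fit(X, y)

leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf id per tree

# Offset leaf ids by each tree's node count so every (tree, leaf) pair
# gets a globally unique token id, ready for an embedding layer.
counts = [t.tree_.node_count for t in rf.estimators_]
offsets = np.concatenate([[0], np.cumsum(counts)[:-1]])
tokens = leaves + offsets  # (n_samples, n_trees) token sequences
```

A transformer would then attend over each row of `tokens`; the mixture-of-experts routing in the abstract would sit on top of several such forest-transformer pairs.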

Speaker

Abhineet Agarwal

Global Quantile Learning with Censored Data Based on Random Forest

Quantiles of survival time are frequently reported in biomedical studies for their straightforward interpretation as well as their superior flexibility and identifiability in the presence of censoring. Censored quantile regression has served as the main tool for predicting survival quantiles. However, existing work on censored quantile regression generally assumes linear covariate effects, and limited attention has been paid to quantile prediction performance. In this work, we propose a Global Censored Quantile Random Forest (GCQRF) framework designed to simultaneously predict survival quantiles over a continuum of quantile indices, with an inherent capacity to accommodate complex nonlinear relationships between covariates and survival time. We quantify the variation of the prediction process without assuming an infinite forest and establish the corresponding weak convergence result. As a useful by-product, we propose feature importance ranking measures based on out-of-sample predictive accuracy. We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives through extensive numerical studies, and illustrate the utility of the proposed importance ranking measures on both simulated and real data.
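The forest-based quantile idea, shown here without the censoring adjustment that GCQRF adds, can be sketched with Meinshausen-style forest weights: leaf co-membership defines a weighted empirical distribution at a query point, from which any continuum of quantile indices can be read off.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hedged sketch of forest-weight quantile prediction (uncensored data;
# GCQRF's censoring adjustment and theory are not reproduced here).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=10,
                           random_state=0).fit(X, y)
train_leaves = rf.apply(X)  # (n_train, n_trees)

def forest_quantiles(x0, taus):
    """Weighted empirical quantiles at x0 over a continuum of indices taus."""
    leaves0 = rf.apply(x0.reshape(1, -1))         # leaves of the query point
    match = (train_leaves == leaves0)             # co-membership per tree
    w = (match / match.sum(axis=0)).mean(axis=1)  # forest weights, sum to 1
    order = np.argsort(y)
    cdf = np.cumsum(w[order])                     # estimated conditional CDF
    return np.interp(taus, cdf, y[order])

q = forest_quantiles(np.array([0.5, 0.5, 0.5]), np.linspace(0.1, 0.9, 9))
```

The returned curve `q` is nondecreasing in the quantile index by construction, which is the "global" (simultaneous over all indices) flavor the abstract targets.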

Keywords

Random Forest, Censored Quantile Regression, U-processes, Variable Importance 

Speaker

Siyu Zhou, Emory University

Random forests for binary geospatial data

Existing implementations of random forests for binary data cannot explicitly account for data correlation common in geospatial and time-series settings. For continuous outcomes, recent work has extended random forests (RF) to RF-GLS that incorporate spatial covariance using the generalized least squares (GLS) loss. However, adoption of this idea for binary data is challenging due to the use of the Gini impurity measure in classification trees, which has no known extension to model dependence. We show that for binary data, the GLS loss is also an extension of the Gini impurity measure, as the latter is exactly equivalent to the ordinary least squares (OLS) loss. This justifies using RF-GLS for non-parametric mean function estimation for binary dependent data. We then consider the special case of generalized mixed effects models, the traditional statistical model for binary geospatial data, which models the spatial random effects as a Gaussian process (GP). We propose a novel link-inversion technique that embeds the RF-GLS estimate of the mean function from the first step within the generalized mixed effects model framework, enabling estimation of non-linear covariate effects and offering spatial predictions. We establish consistency of our method, RF-GP, for both mean function and covariate effect estimation. The theory holds for a general class of stationary absolutely regular dependent processes that includes common choices like Gaussian processes with Matérn or compactly supported covariances and autoregressive processes. We demonstrate that RF-GP outperforms competing methods for estimation and prediction in both simulated and real-world data. 
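The claimed equivalence is easy to verify numerically: for binary responses with proportion p of ones in a node, the Gini impurity 2p(1-p) is exactly twice the per-observation OLS (variance) loss p(1-p), so the two criteria rank candidate splits identically. A quick check:

```python
import numpy as np

# Numeric check of the Gini/OLS equivalence for binary responses:
# with proportion p of ones, Gini = 1 - p^2 - (1-p)^2 = 2p(1-p), while the
# OLS loss around the node mean is mean((y - p)^2) = p(1-p).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000).astype(float)

p = y.mean()
gini = 1.0 - p**2 - (1.0 - p)**2  # classification impurity
ols = np.mean((y - p)**2)          # regression (OLS) node loss

print(np.isclose(gini, 2 * ols))   # True: same split rankings
```

This identity is what lets the GLS loss, which generalizes OLS to correlated errors, stand in for the Gini criterion when the binary outcomes are spatially dependent.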

Speaker

Arkajyoti Saha, University of California, Irvine

Random Forests for Time Series Data

Time series data sets in economics and finance often exhibit nonlinearities that are not well captured by traditional autoregressive integrated moving average (ARIMA) models alone. Random forests have gained popularity in these forecasting tasks largely because of their ability to capture nonlinearity and feature interactions. In their standard form, however, random forests do not leverage the autocorrelation present in time series data. In this talk, we will discuss hybrid strategies that combine the strengths of random forests with those of traditional ARIMA models and demonstrate their effectiveness on high-frequency trading data sets.
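One common hybrid of this kind, shown only as a hedged sketch since the talk's exact recipe is not given here, fits a linear AR model first and then trains a random forest on lagged residuals to pick up the remaining nonlinearity:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hedged sketch of a linear-then-forest hybrid (an assumption about the
# strategy, not the speaker's method): AR(1) handles autocorrelation,
# the forest models nonlinear structure left in the residuals.
rng = np.random.default_rng(0)
n = 600
y = np.zeros(n)
for t in range(1, n):  # AR(1) dynamics plus a nonlinear term
    y[t] = 0.6 * y[t - 1] + 0.3 * np.tanh(2 * y[t - 1]) + rng.normal(scale=0.2)

# Step 1: linear AR(1) fit by least squares (stand-in for a full ARIMA fit).
phi = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
resid = y[1:] - phi * y[:-1]

# Step 2: random forest on lagged residuals (oldest lag first per row).
p = 3
Z = np.column_stack([resid[i:len(resid) - p + i] for i in range(p)])
target = resid[p:]
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(Z, target)

# Hybrid one-step forecast = linear AR part + forest residual correction.
hybrid_next = phi * y[-1] + rf.predict(resid[-p:].reshape(1, -1))[0]
```

The same template works with a full ARIMA fit in place of the hand-rolled AR(1) step; only the residual series passed to the forest changes.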

Speaker

Sumanta Basu, Cornell University