011: A workflow for stable interaction detection in high-dimensional data

Conference: Conference on Statistical Practice (CSP) 2023
02/03/2023: 7:30 AM - 8:45 AM PST
Posters 
Room: Cyril Magnin Foyer 

Description

The advent of large-scale data (e.g., from industry or biotechnology) has made the development of suitable statistical analysis techniques a cornerstone of modern interdisciplinary research and data analysis. Often, these data sets contain many covariates but comparatively few samples (p >> n). In this data-scarce regime, standard statistical methods are no longer appropriate.
A common research question in data-driven observational studies is how individual covariates influence a readout of interest. In practice, it is unlikely that all measured covariates affect the readout independently of each other. Rather, it can be assumed that only a subset of covariates is relevant, and that these covariates potentially act in concert. Thus, a major concern is to identify, from a large number of possible combinations of covariates, a small set of reliable effects that can serve as hypotheses for further functional analyses. Possible questions in different application areas include, for example, combinatorial (e.g., synergistic or antagonistic) effects of different drugs on a biological readout, or the combinatorial behavior of different building energy efficiency measures on building energy consumption.
Studying the effects of all possible combinations of features is notoriously hard, and statisticians often have to deal with noisy data sets from incomplete experimental designs, where not all combinations of covariates have been measured. State-of-the-art techniques such as sparse linear regression and its extensions to interaction models do not yield sufficiently robust estimates of combinatorial effects between covariates. Thus, the development of robust methods is crucial to reduce the number of spurious interaction effects and to allow a reliable set of interaction effects to be communicated to clients or collaborators in the application domain.
We propose a computational workflow that robustly recovers second-order (pairwise) interactions in the data-scarce regime. As a baseline model, we use a lasso model for hierarchical interactions. Compared to the classical lasso problem with interaction effects, the hierarchical model prefers main effects over interaction effects and only selects an interaction effect if doing so considerably improves predictive accuracy.
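To make the difference between the classical interaction model and a hierarchical one concrete, the following minimal sketch builds the design matrix of main effects plus all pairwise products, and checks a fitted coefficient vector against strong hierarchy (an interaction may be nonzero only if both of its main effects are nonzero). The function names `pairwise_design` and `strong_hierarchy_violations` are hypothetical and not part of the presented workflow, and the strong form of hierarchy is an assumption made here for illustration.

```python
import itertools
import numpy as np

def pairwise_design(X):
    """Return [main effects | all pairwise products] and the list of index pairs."""
    n, p = X.shape
    pairs = list(itertools.combinations(range(p), 2))
    if pairs:
        inter = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    else:
        inter = np.empty((n, 0))  # a single covariate has no pairwise interactions
    return np.hstack([X, inter]), pairs

def strong_hierarchy_violations(main_coef, inter_coef, pairs, tol=1e-10):
    """List pairs (j, k) with a nonzero interaction but a zero main effect j or k."""
    violations = []
    for coef, (j, k) in zip(inter_coef, pairs):
        parents_active = abs(main_coef[j]) > tol and abs(main_coef[k]) > tol
        if abs(coef) > tol and not parents_active:
            violations.append((j, k))
    return violations
```

With p covariates the expanded design has p + p(p-1)/2 columns, which is why the number of candidate effects quickly exceeds the number of samples. A classical lasso fitted on this expanded matrix can return coefficient vectors with violations; a hierarchical model is constrained so that this list is empty.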
To perform robust model selection in the data-scarce regime, we combine stability selection with hierarchical interaction modeling. Using synthetic data, we show that stability selection outperforms the commonly used cross-validation procedure for lasso models in terms of minimizing the number of spurious effects.
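The idea behind stability selection can be sketched as follows: fit a sparse model on many random half-sized subsamples and keep only the covariates that are selected in a large fraction of the fits. The sketch below is an illustration under stated assumptions, not the workflow's implementation: it uses a plain lasso solved by cyclic coordinate descent instead of the hierarchical interaction lasso, and the names `lasso_cd` and `stability_selection`, the regularization strength `lam`, and the threshold of 0.7 are hypothetical choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/(2n))*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()            # residual y - Xb (b starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                  # skip constant (all-zero) columns
            r += X[:, j] * b[j]           # remove feature j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]           # add the updated feature j back
    return b

def stability_selection(X, y, lam, n_subsamples=30, threshold=0.7, seed=0):
    """Return per-feature selection frequencies and the indices above threshold."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # half-sized subsample
        Xs = X[idx] - X[idx].mean(axis=0)                 # center within subsample
        ys = y[idx] - y[idx].mean()
        counts += np.abs(lasso_cd(Xs, ys, lam)) > 1e-8
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= threshold)
```

In contrast to cross-validation, which picks one regularization strength by predictive error, this procedure ranks covariates by how consistently they are selected, which tends to suppress effects that appear only in individual subsamples. To use it for interaction detection, `X` would hold the expanded matrix of main effects and pairwise products.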
To account for potential noise in the data, our workflow includes a model-based filter algorithm that minimizes the number of spurious interaction effects caused by noisy data.
Our computational workflow is of independent interest whenever robust hierarchical statistical interactions among various types of binary, categorical, and continuous covariates need to be assessed in the data-scarce regime.
We demonstrate the generalizability of our workflow by applying it to several application fields: combinatorial effects of epigenetic modifications as per the histone code hypothesis, combinatorial drug effects on the abundance of certain microbial species in the human gut, and combinatorial effects of building energy efficiency measures on building energy consumption.
Participants will learn about statistical techniques such as the lasso, the lasso for hierarchical interactions, stability selection, and synthetic data generation. They will receive an introduction to our reproducible workflow and how to apply it in their own domains. Familiarity with the standard statistical regression model will be an advantage when attending the session.

Keywords

High-dimensional statistics


Combinatorial effects

Data scarcity

Presenting Author

Mara Stadler, Helmholtz Center Munich

First Author

Mara Stadler, Helmholtz Center Munich

CoAuthor

Christian L. Müller, Helmholtz Center Munich

Tracks

Implementation and Analysis
Conference on Statistical Practice (CSP) 2023