Mitigating Data Double Dipping in Statistical Tests Post-Unsupervised Learning: The Role of Synthetic Null Data and Data Splitting Approaches in Single-Cell and Spatial Transcriptomics

Speaker: Jingyi Jessica Li (UCLA)

Sunday, Aug 3, 2:05 PM – 2:30 PM
Invited Paper Session
Music City Center
In single-cell and spatial transcriptomic data analysis, unsupervised learning techniques such as clustering are commonly used to create new variables, which are then subjected to statistical tests for feature screening, for example to identify cell types and cell-type marker genes. However, this workflow introduces data double dipping: the same data are used both to generate and to test hypotheses, which inflates false positives. This talk will focus on five strategies to address this challenge:
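The inflation is easy to reproduce. Below is a minimal toy sketch (an illustrative simulation, not the speaker's analysis): two "genes" are pure Gaussian noise with no true cell types, the "clustering" is a simple median split, and testing the same gene that defined the clusters yields a spuriously tiny p-value, whereas testing a gene held out of the clustering does not.

```python
# Toy illustration (assumption: simplified 1-D simulation, not from the talk)
# of how double dipping inflates statistical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300
gene_a = rng.normal(size=n)  # simulated expression; no true cell types exist
gene_b = rng.normal(size=n)  # a second, independent simulated gene

# "Clustering" step: a crude two-cluster split of gene_a at its median
labels = gene_a > np.median(gene_a)

# Double dipping: test the same gene that defined the clusters
_, p_double_dip = stats.ttest_ind(gene_a[labels], gene_a[~labels])

# Feature splitting: test a gene that played no role in the clustering
_, p_valid = stats.ttest_ind(gene_b[labels], gene_b[~labels])

print(p_double_dip)  # spuriously tiny, despite the data being pure noise
print(p_valid)       # behaves like a valid p-value under the null
```

The contrast between the two p-values is the core problem the five strategies below aim to solve.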

1. Parallel Synthetic Null (Song et al., bioRxiv 2023): Generating synthetic null data, such as knockoff data, and analyzing them in parallel with the real data as a negative control.
2. Concatenated Synthetic Null (DenAdel et al., bioRxiv 2024): Concatenating synthetic null data with the real data before analysis, a technique akin to data augmentation.
3. Data Splitting: Dividing the data, by observations or by features, so that hypotheses are generated and tested on disjoint parts.
4. Data Thinning (Neufeld et al., JMLR 2024): Splitting each observation into two independent parts that sum to the original, e.g., by binomial thinning of count data.
5. Data Fission (Leiner et al., JASA 2023): Decomposing each observation, via external randomization, into two parts whose joint distribution is known, so that one part can be used for hypothesis generation and the other for valid testing.
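For count data, data thinning takes a particularly simple form: a Poisson count can be split by binomial sampling into two independent Poisson counts, one usable for clustering and the other for testing. A minimal sketch (illustrative parameters of my own choosing, not from the talk):

```python
# Data thinning for Poisson counts (assumption: illustrative lam and eps).
# If X ~ Poisson(lam) and X1 | X ~ Binomial(X, eps), then X1 and X2 = X - X1
# are independent, with X1 ~ Poisson(eps * lam) and X2 ~ Poisson((1-eps) * lam).
import numpy as np

rng = np.random.default_rng(1)
lam, eps = 10.0, 0.5
x = rng.poisson(lam, size=100_000)  # simulated counts, e.g. one gene's UMIs

x1 = rng.binomial(x, eps)  # "training" copy, e.g. for clustering
x2 = x - x1                # "test" copy, e.g. for marker-gene testing

# Empirical check: correct means and near-zero correlation between the copies
print(x1.mean(), x2.mean())       # close to eps*lam and (1-eps)*lam
corr = np.corrcoef(x1, x2)[0, 1]
print(corr)                       # close to 0, consistent with independence
```

Because the two copies are genuinely independent, clustering on `x1` and testing on `x2` avoids double dipping without discarding any observations, unlike splitting by observations.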

We will compare the advantages of these approaches, with a particular focus on their applications to single-cell and spatial transcriptomic data, examining the trade-off between false discovery rate control and discovery power.

Keywords

Double dipping, single-cell data, spatial omics, data splitting