Neural Network with Spatial Random-Effect for Failure Status and Remaining Life of GPU Prediction

Jared Clark Co-Author
 
Jie Min Co-Author
 
Yili Hong Co-Author
 
Lina Lee First Author
 
Lina Lee Presenting Author
 
Sunday, Aug 3: 4:50 PM - 5:05 PM
2812 
Contributed Papers 
Music City Center 
Neural networks typically assume that observed responses are statistically independent. However, the survival time and failure status of GPUs are known to exhibit dependencies related to GPU location. We propose a deep learning approach that incorporates random-effect embeddings to model GPU failure outcomes. By assigning each GPU location a learnable embedding with an imposed spatial structure, the model captures location-specific dependencies in both survival-time and failure-type predictions. We distinguish between physical locations (the row, column, slot, node, and cage of the GPU) and logical locations (how GPUs are interconnected through wired connections). With correlation structures imposed on the basis of both physical and logical distances, the embeddings capture strong correlations, particularly among GPUs separated by few intervening links. Our approach improves on previous deep learning models that did not incorporate spatial structure, and we present comparisons with other machine learning and parametric models.
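
As a rough illustration of how a spatial structure might be imposed on per-location embeddings, the sketch below builds pairwise weights from physical and logical distances and uses them in a smoothness penalty on the embeddings. The abstract does not specify the authors' construction, so everything here is an assumption made for illustration: the Manhattan distance over (row, column, slot, node, cage) tuples, the hop-count logical distance, the exponential decay with `rho_phys` and `rho_logic`, the penalty form, and all function names. The penalty, a weighted sum of squared differences between embeddings, is equivalent to a Gaussian prior whose precision matrix is the weighted graph Laplacian. This is one common way to encode such correlation and not necessarily the one used in the talk.

```python
import math
from collections import deque

def hop_distances(links, n):
    """All-pairs hop counts (assumed logical distance) via BFS on an undirected wiring graph."""
    adj = [[] for _ in range(n)]
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

def physical_distances(coords):
    """Manhattan distance between hypothetical (row, column, slot, node, cage) tuples."""
    n = len(coords)
    return [[sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
             for j in range(n)] for i in range(n)]

def spatial_weights(d_phys, d_logic, rho_phys=2.0, rho_logic=1.0):
    """Exponential-decay weights: GPUs that are close physically and logically get larger weights."""
    n = len(d_phys)
    return [[math.exp(-d_phys[i][j] / rho_phys - d_logic[i][j] / rho_logic)
             for j in range(n)] for i in range(n)]

def smoothness_penalty(embeddings, weights):
    """Sum over pairs i < j of w_ij * ||e_i - e_j||^2, to be added to a training loss."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            sq = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += weights[i][j] * sq
    return total

# Toy layout (made-up values): four GPUs wired in a chain 0-1-2-3.
coords = [(0, 0, 0, 0, 0), (0, 0, 1, 0, 0), (0, 1, 0, 1, 0), (1, 0, 0, 0, 1)]
links = [(0, 1), (1, 2), (2, 3)]

d_logic = hop_distances(links, len(coords))
d_phys = physical_distances(coords)
weights = spatial_weights(d_phys, d_logic)

# Embeddings that vary gradually along the chain incur a smaller penalty
# than embeddings that flip between neighbouring GPUs.
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.3, 0.1]]
rough = [[1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
penalty_smooth = smoothness_penalty(smooth, weights)
penalty_rough = smoothness_penalty(rough, weights)
```

In a full model, the embeddings would be trainable parameters fed into the network alongside the other covariates, and a penalty of this kind would be added to the survival-time and failure-type loss terms with a tuning weight.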

Keywords

GPU Reliability


Deep Learning

Spatial Embedding

Random-Effect Models

Time-to-Failure Prediction