
Neural Network with Spatial Random-Effect for Failure Status and Remaining Life of GPU Prediction

Presented During: Methods for Dynamic and Spatio-temporal Data

Jared Clark Co-Author

Jie Min Co-Author

Yili Hong Co-Author

Lina Lee First Author

Lina Lee Presenting Author

Sunday, Aug 3: 4:50 PM - 5:05 PM
2812
Contributed Papers

Music City Center

Neural networks typically assume that observed responses are statistically independent. However, the survival times and failure statuses of GPUs are known to exhibit dependencies related to GPU location. We propose a deep learning approach that incorporates random-effect embeddings to model GPU failure outcomes. By assigning each GPU location a learnable embedding with an imposed spatial structure, the model captures location-specific dependencies in both survival time and failure type predictions. We distinguish between physical locations (the row, column, slot, node, and cage of the GPU) and logical locations (how GPUs are interconnected through wired connections). Because the imposed correlation structures are based on both physical and logical distances, the embeddings capture strong correlations, particularly among GPUs with few intervening links. Our approach improves on previous deep learning models that did not incorporate spatial structure, and we present comparisons with other machine learning and parametric models.
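The idea of imposing a correlation structure on location embeddings can be illustrated with a minimal sketch. The abstract does not state the kernel, the way physical and logical distances are combined, or how the structure enters training, so everything below is an assumption made for illustration: an exponential correlation function, an elementwise product of the physical and logical correlation matrices, and a Gaussian random-effect penalty that would be added to the network loss. The function names, the length scales, and the four-GPU layout are hypothetical.

```python
import numpy as np


def exp_correlation(dist, length_scale):
    """Exponential correlation: 1 at distance 0, decaying as distance grows."""
    return np.exp(-np.asarray(dist, dtype=float) / length_scale)


def combined_correlation(phys_dist, logic_dist, phys_scale=2.0, logic_scale=1.0):
    """Assumed combination rule: elementwise product of the two correlations.

    The product of two positive semidefinite matrices is positive
    semidefinite, but an exponential kernel on hop counts is not guaranteed
    to be valid for every wiring graph; that would need checking in practice.
    """
    return (exp_correlation(phys_dist, phys_scale)
            * exp_correlation(logic_dist, logic_scale))


def embedding_penalty(embeddings, corr, jitter=1e-6):
    """Gaussian random-effect penalty 0.5 * trace(E^T R^{-1} E).

    embeddings: array of shape (n_locations, embedding_dim).
    corr: correlation matrix R of shape (n_locations, n_locations).
    A small jitter on the diagonal keeps the linear solve stable.
    """
    n = corr.shape[0]
    solved = np.linalg.solve(corr + jitter * np.eye(n), embeddings)
    return 0.5 * float(np.sum(embeddings * solved))


# Hypothetical layout: four GPUs on a 2 x 2 physical grid (row, column),
# wired logically in a chain 0 - 1 - 2 - 3.
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
phys_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
idx = np.arange(4)
logic_dist = np.abs(idx[:, None] - idx[None, :])  # hop count along the chain

corr = combined_correlation(phys_dist, logic_dist)

# Embeddings that agree across neighbouring GPUs cost less than embeddings
# that flip sign between neighbours, which is how the prior encourages
# nearby locations to share information.
smooth = np.ones((4, 1))
rough = np.array([[1.0], [-1.0], [1.0], [-1.0]])
smooth_cost = embedding_penalty(smooth, corr)
rough_cost = embedding_penalty(rough, corr)
```

In a training loop, a penalty of this kind would be computed on the learnable embedding table and added to the survival and failure-type losses; the weighting between the terms is a further modelling choice not described in the abstract.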

Keywords

GPU Reliability

Deep Learning

Spatial Embedding

Random-Effect Models

Time-to-Failure Prediction