Neural Network with Spatial Random-Effect for Failure Status and Remaining Life of GPU Prediction
Sunday, Aug 3: 4:50 PM - 5:05 PM
2812
Contributed Papers
Music City Center
Neural networks typically assume that observed responses are statistically independent. However, the survival time and failure status of GPUs are known to depend on where the GPU is located. We propose a deep learning approach that incorporates random-effect embeddings to model GPU failure outcomes. By assigning each GPU location a learnable embedding with an imposed spatial structure, the model captures location-specific dependencies in both survival time and failure type predictions. We distinguish between physical locations (the row, column, slot, node, and cage of the GPU) and logical locations (how GPUs are interconnected through wired connections). By imposing correlation structures based on both physical and logical distances, the embeddings capture strong correlations, particularly among GPUs with few intervening links. Our approach improves on previous deep learning models that did not incorporate spatial structure, and we present comparisons with other machine learning and parametric models.
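One way to picture "embeddings with an imposed spatial structure" is as a Gaussian random-effect penalty on the embedding table, where the covariance between two locations decays with their physical and logical distance. The sketch below is illustrative only and is not taken from the paper: the exponential kernel, the length scales, the coordinate encoding, and the names `spatial_kernel` and `random_effect_penalty` are all assumptions.

```python
import numpy as np

def spatial_kernel(phys_coords, hop_dist, len_phys=1.0, len_hop=1.0, jitter=1e-6):
    """Correlation matrix decaying with physical and logical distance.

    Assumed form (not from the abstract): a product of two exponential
    kernels. An exponential kernel on shortest-path hop counts is not
    guaranteed positive definite for every graph; the jitter is only a
    crude safeguard.
    """
    phys_coords = np.asarray(phys_coords, dtype=float)
    hop_dist = np.asarray(hop_dist, dtype=float)
    # Pairwise Euclidean distance between physical coordinates.
    diff = phys_coords[:, None, :] - phys_coords[None, :, :]
    d_phys = np.sqrt((diff ** 2).sum(axis=-1))
    K = np.exp(-d_phys / len_phys) * np.exp(-hop_dist / len_hop)
    return K + jitter * np.eye(len(K))

def random_effect_penalty(E, K):
    """0.5 * trace(E^T K^{-1} E): the negative log-density (up to a constant)
    of embeddings whose columns are drawn from N(0, K) across locations."""
    L = np.linalg.cholesky(K)      # K = L L^T
    Z = np.linalg.solve(L, E)      # Z = L^{-1} E, so Z^T Z = E^T K^{-1} E
    return 0.5 * float(np.sum(Z ** 2))

# Hypothetical toy layout: four GPUs in a row, wired as a chain 0-1-2-3.
coords = [[0, 0], [1, 0], [2, 0], [3, 0]]
hops = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
K = spatial_kernel(coords, hops)

smooth = np.ones((4, 2))                                   # neighbours agree
rough = np.array([[1.0], [-1.0], [1.0], [-1.0]]) * np.ones((1, 2))  # neighbours disagree

penalty_smooth = random_effect_penalty(smooth, K)
penalty_rough = random_effect_penalty(rough, K)
```

In a training loop, a term like this would be added to the survival and failure-type losses so that nearby or closely linked GPUs are pulled toward similar embeddings. In this toy layout the penalty is smaller for `smooth` than for `rough`.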
GPU Reliability
Deep Learning
Spatial Embedding
Random-Effect Models
Time-to-Failure Prediction