The affected GPU is still offline. We are working with the vendor to address the hardware issues.
Posted Jan 22, 2026 - 09:43 CST
Update
We are continuing to investigate this issue.
Posted Jan 14, 2026 - 11:57 CST
Investigating
Some jobs landing on mrudolphgpu4001 may fail with the message "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to take the affected GPU offline for further investigation and fixes, which may require brief downtime for the machine. The machine will then be brought back up with the unaffected GPUs available for use.
Posted Jan 14, 2026 - 11:56 CST
This incident affected: High Throughput Computing (HTC) System (CHTC Pool).