The affected GPU is still offline. We are working with the vendor to address the hardware issues.
Posted Jan 22, 2026 - 09:43 CST
Update
The affected GPU has been removed from the available pool so that it can be tested. The unaffected GPUs on the machine remain available for use. We are still investigating.
Posted Jan 09, 2026 - 10:35 CST
Investigating
Some jobs landing on xhuanggpu4001 may fail with the message "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to take the affected GPU offline for further investigation and fixes, which may require a brief downtime for the machine. The machine will then be brought back up with the unaffected GPUs available for use.
Posted Jan 07, 2026 - 13:23 CST
This incident affected: High Throughput Computing (HTC) System (CHTC Pool).