The affected GPU has been removed from the available pool so that it can be tested. The unaffected GPUs on the machine remain available for use. We are still investigating.
Posted Jan 09, 2026 - 10:35 CST
Investigating
Some jobs landing on xhuanggpu4001 may fail with the message "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU and plan to take that GPU offline for further investigation and fixes, which may require a brief downtime of the machine. The machine will then be brought back up with the unaffected GPUs available for use.
Posted Jan 07, 2026 - 13:23 CST
This incident affects: High Throughput Computing (HTC) System (CHTC Pool).