Update - We are continuing to investigate this issue.
Jan 29, 2026 - 10:20 CST
Investigating - Users of the licensed software CST may encounter this error: "modeler_AMD64: line 154: Aborted (core dumped) "${CST_REGSVR32}"". This occurs on most Execution Points, with the exception of build machines.
We are currently investigating.
Jan 29, 2026 - 10:19 CST
Update - The machine is offline while we run diagnostics.
Jan 27, 2026 - 16:45 CST
Update - The affected GPU is still offline. We are working with the vendor to address hardware issues.
Jan 22, 2026 - 09:43 CST
Update - The affected GPU has been removed from the available pool for testing. Unaffected GPUs on the machine are available for use. We are still investigating.
Jan 09, 2026 - 10:35 CST
Investigating - Some jobs landing on xhuanggpu4001 may fail with the message, "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to bring the affected GPU offline for further investigation and fixes, which may require a brief downtime. The machine will be brought back up with the unaffected GPUs available for use.
Jan 07, 2026 - 13:23 CST
Update - The affected GPU is still offline. We are working with the vendor to address hardware issues.
Jan 22, 2026 - 09:43 CST
Update - We are continuing to investigate this issue.
Jan 14, 2026 - 11:57 CST
Investigating - Some jobs landing on mrudolphgpu4001 may fail with the message, "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to bring the affected GPU offline for further investigation and fixes, which may require a brief downtime. The machine will be brought back up with the unaffected GPUs available for use.
Jan 14, 2026 - 11:56 CST
Resolved - A power issue temporarily took down some of the storage servers, and our team brought them back online shortly afterward. We do not anticipate further issues at this time.
Jan 27, 2026 - 10:12 CST
Monitoring - We believe we've fixed the underlying cause of the issue, and are monitoring the effect.
Jan 26, 2026 - 16:29 CST
Investigating - Our service monitoring has alerted us to degraded performance of the data system backing /staging. File transfers and other interactions with /staging may be slow and may result in job failures. This may also affect the /projects and /software spaces, as well as OSDF (osdf:///) and UWDF (pelican://chtc.wisc.edu/) file transfers.
Jan 26, 2026 - 16:15 CST
Jan 26, 2026
Jan 25, 2026
No incidents reported.
Jan 24, 2026
No incidents reported.
Jan 23, 2026
No incidents reported.
Jan 22, 2026
Unresolved incident: [HTC] ECC errors on mrudolphgpu4001.