Update - We are continuing to investigate this issue.
Jan 14, 2026 - 11:57 CST
Investigating - Some jobs landing on mrudolphgpu4001 may fail with the message, "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to bring the affected GPU offline for further investigation and fixes, which may require a brief downtime. The machine will be brought back up with the unaffected GPUs available for use.
Jan 14, 2026 - 11:56 CST
Update - The affected GPU has been removed from the available pool for testing. Unaffected GPUs on the machine are available for use. We are still investigating.
Jan 09, 2026 - 10:35 CST
Investigating - Some jobs landing on xhuanggpu4001 may fail with the message, "uncorrectable ECC error encountered". We have narrowed the issue to a specific GPU. We plan to bring the affected GPU offline for further investigation and fixes, which may require a brief downtime. The machine will be brought back up with the unaffected GPUs available for use.
Jan 07, 2026 - 13:23 CST

About This Site

This page provides information about unplanned downtimes and scheduled maintenance for services offered by the Center for High Throughput Computing.

High Throughput Computing (HTC) System: Operational (99.97% uptime over the past 90 days)
Access Points: Operational (99.89% uptime over the past 90 days)
CHTC Pool: Operational (100.0% uptime over the past 90 days)
External Pools (OSPool, Campus HTCondor Pools): Operational (100.0% uptime over the past 90 days)
Staging and Projects Space: Operational (99.99% uptime over the past 90 days)
File Transfers: Operational (99.95% uptime over the past 90 days)
High Performance Computing (HPC) System: Operational (100.0% uptime over the past 90 days)
Login Nodes: Operational (100.0% uptime over the past 90 days)
Cluster Nodes and Jobs: Operational (100.0% uptime over the past 90 days)
Central Software Installations: Operational (100.0% uptime over the past 90 days)
Home and Scratch File Systems: Operational (100.0% uptime over the past 90 days)
Data Transfer Tools: Operational (100.0% uptime over the past 90 days)
Globus Endpoint: Operational (100.0% uptime over the past 90 days)
CHTC Internal Infrastructure: Operational (100.0% uptime over the past 90 days)
Tiger Cluster: Operational (100.0% uptime over the past 90 days)
RT Email/Ticket Support System: Operational (100.0% uptime over the past 90 days)
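
To put the 90-day uptime percentages listed above in more concrete terms, the following minimal sketch converts a percentage into approximate minutes of downtime. It assumes the percentage is a simple fraction of wall-clock time over the window; how this page actually weights partial outages is not stated, and the displayed figures are rounded, so the results are rough estimates only. The function name `downtime_minutes` is hypothetical.

```python
def downtime_minutes(uptime_percent: float, window_days: int = 90) -> float:
    """Approximate downtime in minutes implied by an uptime percentage.

    Assumes uptime is measured as a plain fraction of the whole window.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - uptime_percent) / 100.0


# Percentages taken from the component list on this page.
for name, pct in [
    ("HTC System", 99.97),
    ("Access Points", 99.89),
    ("File Transfers", 99.95),
]:
    print(f"{name}: roughly {downtime_minutes(pct):.0f} minutes over 90 days")
```

Under these assumptions, 99.97% uptime over 90 days corresponds to a little under 40 minutes of downtime.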
Jan 19, 2026

No incidents reported today.

Jan 18, 2026

No incidents reported.

Jan 17, 2026

No incidents reported.

Jan 16, 2026

No incidents reported.

Jan 15, 2026
Resolved - This incident has been resolved.
Jan 15, 11:42 CST
Monitoring - The fix for the issue was surprisingly easy to deploy.
Users should no longer encounter this issue.
Please let us know at chtc@cs.wisc.edu if you do!

Jan 9, 15:27 CST
Update - Slight correction: without "max_idle", you can submit up to 10,000 jobs per submission and you can have up to 50,000 jobs total in the queue. ("Done" jobs generally do not count against the total in the queue.)
Jan 9, 14:47 CST
Identified - We've identified an issue where submit files that have "max_idle" (and related keywords) may not create jobs when submitted with "condor_submit".
We have narrowed down the source of the bug, but do not have a fix at this time (and probably not before the weekend).

If you will have fewer than 10,000 jobs in the queue, we recommend removing this option from your submit file.
If you will have more than 10,000 jobs in the queue, you may want to consider removing this option and manually submitting your jobs in batches such that the total number of jobs stays under the 10,000 job limit.
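
As an illustration of the batching workaround described above, here is a minimal Python sketch that splits a list of per-job inputs into groups of at most 10,000. The helper name `batches` and the input file names are hypothetical, and the sketch only does the splitting; submitting each batch and waiting for the queue to drain before submitting the next one is left to the user.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

MAX_JOBS_PER_BATCH = 10_000  # limit quoted in the notice above


def batches(items: Sequence[T], size: int = MAX_JOBS_PER_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical workload: 25,000 input files, one job per file.
inputs = [f"input_{i}.dat" for i in range(25_000)]
batch_sizes = [len(batch) for batch in batches(inputs)]
print(batch_sizes)  # [10000, 10000, 5000]
```

Each batch could then be written to its own list file and submitted separately, so that no single submission exceeds the limit.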

We appreciate your patience while we work to deploy a fix for this issue.

Jan 9, 13:57 CST
Jan 14, 2026
Completed - The scheduled maintenance has been completed.
Jan 14, 18:00 CST
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Jan 14, 07:00 CST
Scheduled - We will be upgrading the DSI GPUs on this date.
Jan 7, 16:26 CST
Jan 13, 2026

No incidents reported.

Jan 12, 2026

No incidents reported.

Jan 11, 2026

No incidents reported.

Jan 10, 2026

No incidents reported.

Jan 9, 2026

Unresolved incident: [HTC] ECC errors on xhuanggpu4001.

Jan 8, 2026

No incidents reported.

Jan 7, 2026
Jan 6, 2026

No incidents reported.

Jan 5, 2026

No incidents reported.