Monitoring - A fix has been implemented and we are monitoring the results.
Oct 27, 2025 - 17:41 CDT
Investigating - Users are unable to log in to wright-ap.chtc.wisc.edu. We are investigating the issue.
Oct 27, 2025 - 15:49 CDT
Identified - Jobs that use multiple GPUs with PyTorch may fail with an error indicating that GPUs are not detected. This is occurring on several GPU machines after driver updates were applied.

We have identified the issue and are rolling out fixes to our GPU machines between 10/27 and 10/31.

If you encounter this issue, here are some options:
* Wait until next week to submit multi-GPU jobs that use PyTorch.
* Use alternative resources or workflows, such as requesting a single GPU for your jobs or switching to CPU-only or non-PyTorch workflows.

We understand this incident is disruptive to researchers' workflows. Please reach out to us at chtc@cs.wisc.edu with any concerns.
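
One way to catch this problem early is to compare the number of GPUs PyTorch detects with the number the job requested before any real work starts. The following is a minimal sketch, not an official CHTC tool: the helper names `visible_gpu_count` and `check_gpus` are hypothetical, and the only PyTorch call used is `torch.cuda.device_count()`.

```python
def visible_gpu_count():
    """Return the number of GPUs PyTorch can see, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def check_gpus(expected, detected):
    """Compare the requested GPU count with the detected one.

    Returns (ok, message). `expected` is the number of GPUs the job
    requested; `detected` is the value returned by visible_gpu_count().
    """
    if detected is None:
        return False, "PyTorch is not installed; cannot check GPUs."
    if detected < expected:
        return False, f"Expected {expected} GPU(s) but PyTorch detected {detected}."
    return True, f"PyTorch detected {detected} GPU(s)."
```

Calling `check_gpus(2, visible_gpu_count())` at the top of a two-GPU job script and exiting when the first element of the result is `False` lets the job fail quickly with a clear message instead of partway through training.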

Oct 24, 2025 - 10:58 CDT

About This Site

This page provides information about unplanned downtimes and scheduled maintenance for services offered by the Center for High Throughput Computing (CHTC).

Current status and uptime over the past 90 days:

High Throughput Computing (HTC) System: Degraded Performance, 99.94% uptime
Access Points: Operational, 100.0% uptime
CHTC Pool: Degraded Performance, 100.0% uptime
External Pools (OSPool, Campus HTCondor Pools): Operational, 100.0% uptime
Staging and Projects Space: Operational, 100.0% uptime
File Transfers: Operational, 99.71% uptime
High Performance Computing (HPC) System: Operational, 99.99% uptime
Login Nodes: Operational, 99.98% uptime
Cluster Nodes and Jobs: Operational, 100.0% uptime
Central Software Installations: Operational, 100.0% uptime
Home and Scratch File Systems: Operational, 100.0% uptime
Data Transfer Tools: Operational, 100.0% uptime
Globus Endpoint: Operational, 100.0% uptime
CHTC Internal Infrastructure: Operational, 100.0% uptime
Tiger Cluster: Operational, 100.0% uptime
RT Email/Ticket Support System: Operational, 100.0% uptime

Scheduled Maintenance

[HPC] Outage, maintenance of the HPC system
Oct 29, 2025 07:00-15:00 CDT

Maintenance of the datacenter requires that the HPC system be powered off.
We may take the opportunity to install some system updates after it is powered back on.

No jobs will run or be accepted during this time. Queued jobs should remain in the queue and run once the maintenance has completed. Jobs whose requested runtime overlaps the maintenance window will not start and will show the reason "ReqNodeNotAvail, Reserved for maintenance".
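
To see whether a job would be held for this reason, check whether its start time plus requested runtime reaches into the maintenance window. The sketch below is a simplified model of that rule, not the scheduler's actual logic: the function names are hypothetical, and CDT is modeled as a fixed UTC-5 offset.

```python
from datetime import datetime, timedelta, timezone

# CDT modeled as a fixed UTC-5 offset (an assumption for this sketch).
CDT = timezone(timedelta(hours=-5), "CDT")

# Maintenance window from the announcement above.
WINDOW_START = datetime(2025, 10, 29, 7, 0, tzinfo=CDT)
WINDOW_END = datetime(2025, 10, 29, 15, 0, tzinfo=CDT)


def overlaps_maintenance(start, time_limit,
                         window_start=WINDOW_START, window_end=WINDOW_END):
    """Return True if a job starting at `start` with the requested
    `time_limit` could still be running during the maintenance window."""
    end = start + time_limit
    return start < window_end and end > window_start


def longest_safe_time_limit(start, window_start=WINDOW_START):
    """For a job starting before the window, return the longest time limit
    that ends by the time the window begins (zero if none is left)."""
    return max(window_start - start, timedelta(0))
```

For example, a job that could start at 20:00 CDT on Oct 28 with a 12-hour limit overlaps the window, while the same job with a 10-hour limit does not. Requesting a shorter time limit is therefore one way to let a job run before the outage.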

Posted on Oct 10, 2025 - 13:42 CDT

Oct 27, 2025

Unresolved incident: [HTC] Unable to access wright-ap.

Oct 26, 2025

No incidents reported.

Oct 25, 2025

No incidents reported.

Oct 24, 2025
Resolved - This incident has been resolved.
Oct 24, 10:59 CDT
Update - We are continuing to monitor for any further issues.
Oct 16, 16:22 CDT
Monitoring - Users should be able to log in again.
However, we have not yet identified the cause, so the issue may recur.

We will continue to investigate and monitor the situation.

Oct 16, 16:21 CDT
Investigating - We have confirmed reports that users are unable to log in to spark-login.chtc.wisc.edu at this time.
We are investigating and will provide updates as they become available.

Oct 16, 16:00 CDT
Oct 23, 2025

No incidents reported.

Oct 22, 2025

No incidents reported.

Oct 21, 2025

No incidents reported.

Oct 20, 2025

No incidents reported.

Oct 19, 2025

No incidents reported.

Oct 18, 2025

No incidents reported.

Oct 17, 2025

No incidents reported.

Oct 16, 2025
Oct 15, 2025

No incidents reported.

Oct 14, 2025

No incidents reported.

Oct 13, 2025

No incidents reported.