Identified - Some jobs running on gpulab2001 or gpulab2003 may fail with an error "CUDA error: failed call to cuInit: CUDA_ERROR_UNKNOWN". We are working to resolve the issue.
Jun 02, 2026 - 16:53 CDT

About This Site

This page provides information about unplanned downtimes and scheduled maintenance for services offered by the Center for High Throughput Computing.

High Throughput Computing (HTC) System: Degraded Performance (98.38% uptime over the past 90 days)
Access Points: Operational (96.79% uptime over the past 90 days)
CHTC Pool: Degraded Performance (97.06% uptime over the past 90 days)
External Pools (OSPool, Campus HTCondor Pools): Operational (98.62% uptime over the past 90 days)
Staging and Projects Space: Operational (99.91% uptime over the past 90 days)
File Transfers: Operational (99.53% uptime over the past 90 days)
High Performance Computing (HPC) System: Operational (99.02% uptime over the past 90 days)
Login Nodes: Operational (97.84% uptime over the past 90 days)
Cluster Nodes and Jobs: Operational (98.43% uptime over the past 90 days)
Central Software Installations: Operational (100.0% uptime over the past 90 days)
Home and Scratch File Systems: Operational (99.81% uptime over the past 90 days)
Data Transfer Tools: Operational (95.37% uptime over the past 90 days)
Globus Endpoint: Operational (95.37% uptime over the past 90 days)
BadgerCompute: Operational (100.0% uptime over the past 90 days)
CHTC Internal Infrastructure: Operational (99.85% uptime over the past 90 days)
Tiger Cluster: Operational (100.0% uptime over the past 90 days)
RT Email/Ticket Support System: Operational (99.71% uptime over the past 90 days)
Jun 4, 2026
Resolved - We identified the specific cause and are addressing it. Condor commands on ap2002 should be working again, though the issue may reoccur in the future.
Jun 4, 10:08 CDT
Investigating - We're seeing reports of HTCondor commands, such as `condor_submit` and `condor_q`, hanging or failing. We are investigating the cause and will update this Status Page as more information becomes available.
Jun 3, 16:32 CDT
Jun 3, 2026
Jun 2, 2026

Unresolved incident: [HTC] GPU issues on gpulab2001, gpulab2003.

Jun 1, 2026
Resolved - This particular incident is resolved.

The issue was caused by a highly specific and rare bug. We are working to address the underlying cause.

Jun 1, 11:27 CDT
Monitoring - The issue has been fixed, at least temporarily. We're still not clear on the cause, though, so there is a chance it may reoccur over the weekend.
May 29, 14:13 CDT
Update - We are still investigating this issue. Our attempted fix did not work, so we need to dig deeper into the cause.
May 29, 11:32 CDT
Investigating - The HTCondor queue on ap2001 is currently down. If you run a command like `condor_q` or `condor_submit`, you'll see a message like:

> Error: Can't find address for schedd ap2001.chtc.wisc.edu

> ERROR: Can't find address of local schedd

We are investigating why the queue is down.

May 29, 08:04 CDT
May 31, 2026

No incidents reported.

May 30, 2026

No incidents reported.

May 29, 2026
May 28, 2026

No incidents reported.

May 27, 2026
Completed - The scheduled maintenance has been completed.
May 27, 18:00 CDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 27, 14:00 CDT
Scheduled - We are bringing ap2001, ap2002, learn, and researcher-owned APs into downtime to apply important updates to the kernel, continuing this morning's maintenance event.

Users will not be able to log in or submit jobs during the downtime.

May 27, 13:43 CDT
Completed - The scheduled maintenance has been completed.
May 27, 12:00 CDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
May 27, 06:00 CDT
Scheduled - We are bringing ap2001, ap2002, learn, and researcher-owned APs into downtime to apply important updates to the kernel.

Users will not be able to log in or submit jobs during the downtime.

May 21, 12:21 CDT
May 26, 2026

No incidents reported.

May 25, 2026

No incidents reported.

May 24, 2026

No incidents reported.

May 23, 2026

No incidents reported.

May 22, 2026
Resolved - This incident has been resolved.
May 22, 10:57 CDT
Update - We are continuing to monitor for any further issues.
May 22, 10:57 CDT
Monitoring - We've implemented a fix and are monitoring.

We identified an issue with authentication that was introduced as a side-effect of our recovery from last weekend's outage.

May 21, 15:54 CDT
Investigating - There is an issue with uploading files to /staging via the OSDF (osdf:///chtc/staging/..).
Jobs uploading to /staging via the OSDF will encounter the error "Transfer output files failure ... server returned 403 Forbidden ... ".

We are investigating the issue.

May 21, 10:05 CDT
Resolved - This incident has been resolved.
May 22, 10:00 CDT
Monitoring - A fix has been implemented and we are monitoring the results.
May 21, 10:24 CDT
Investigating - Jobs or processes that fetch Docker images may fail with error messages like "Error response from daemon: toomanyrequests: You have reached your unauthenticated pull rate limit." or "While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: ... unexpected status code 500 Internal Server Error".

We are currently investigating.

May 20, 12:29 CDT
May 21, 2026