Update - Developers confirmed a typo in the code caused the issue. The team plans to deploy a fix today to address it.
Sep 06, 2024 - 11:49 CDT
Identified - We have confirmed user reports that the "condor_watch_q" command is not working on the High Throughput system.
Attempting to run the command results in an error stating "ImportError: cannot import name 'can_color' from 'htcondor._utils.ansi'".
This is a bug in the latest test release of HTCondor, and we are working with the developers to have it fixed.
In the meantime, you can still run "condor_q" to see the status of your jobs.

Note: Do NOT use "watch" and "condor_q" together - this can slow the server down to the point that no one can do anything!
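
To check whether a particular Python environment is affected, the failing import from the error message can be tested directly. The sketch below is not an official CHTC or HTCondor tool: the function name check_import and its return values are hypothetical, and it assumes the bug shows up as the name can_color missing from htcondor._utils.ansi, as the error message suggests.

```python
import importlib


def check_import(module_name="htcondor._utils.ansi", attr="can_color"):
    """Report whether `attr` can be imported from `module_name`.

    Returns "ok" if the name exists, "affected" if the module imports but
    the name is missing (the situation the ImportError above describes),
    and "unavailable" if the module itself cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this also
        # covers environments where the htcondor package is not installed.
        return "unavailable"
    return "ok" if hasattr(module, attr) else "affected"


if __name__ == "__main__":
    print(check_import())
```

A result of "affected" would indicate that the environment still has the broken release, in which case "condor_q" remains the way to check job status until the fix is deployed.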

Sep 06, 2024 - 11:42 CDT
Monitoring - We replaced the network hardware yesterday afternoon and the operation had minimal impact on the system.
We are continuing to monitor the system but are hopeful that the hardware swap will address the majority of the disconnect issues the HPC system has experienced over the last couple of weeks.
We continue to investigate the cause of the intermittent disconnects that have affected the HPC system throughout the summer.

Sep 04, 2024 - 09:14 CDT
Update - The /scratch directory has been reconnected to spark-login.chtc.wisc.edu. We believe this particular disconnect was caused by the intermittent issue that we experienced earlier this summer, and are investigating.

We are replacing the network hardware this afternoon, but the process should cause minimal to no interruption to the HPC system. We will provide further updates as they become available.

Sep 03, 2024 - 12:08 CDT
Update - We are aware that the /scratch directory has disconnected. A short maintenance period is being planned to replace hardware that may be contributing to the underlying cause of this issue. More information regarding this downtime will be shared once it is available.
Sep 03, 2024 - 05:07 CDT
Update - We have made some configuration changes that appear to have alleviated the symptoms (the disconnects appear to have become much less frequent). We are hopeful that the filesystem is more stable now, but are still working to address the underlying causes.
Aug 30, 2024 - 10:50 CDT
Update - Directories of the shared file system (/home, /scratch, /software) disconnected from the login server (spark-login.chtc.wisc.edu) sometime overnight.
We are working to reconnect the directories.

Aug 28, 2024 - 09:10 CDT
Update - We are continuing our attempts to resolve the issue. Several disconnects happened again throughout today.
An email with more information is being sent to users this afternoon, from the chtc-users@cs.wisc.edu email list.

Aug 27, 2024 - 17:08 CDT
Update - The HPC system is currently having issues with some of its networking hardware, and this in turn is causing the recent issues with the /home, /scratch, and /software directories. Unfortunately this hardware issue means that any machine in the HPC system can lose access to one or more of those directories, including during job execution.
Until we resolve the hardware issue, additional interruptions are expected. We are working with the vendors to resolve the issue, and will provide further updates as they become available.
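
Because a machine can lose access to these directories at any time, a job could check that they are reachable before doing work that depends on them. The following is a minimal sketch under stated assumptions, not a CHTC-provided tool: the function name unavailable_dirs is hypothetical, and the check only confirms that each path is currently a readable directory. A mount that hangs rather than failing outright may cause the check itself to block.

```python
import os

# Shared file system paths named in the update above.
SHARED_DIRS = ("/home", "/scratch", "/software")


def unavailable_dirs(paths=SHARED_DIRS):
    """Return the paths that are not currently readable directories."""
    missing = []
    for path in paths:
        # isdir() returns False if the path cannot be stat'ed;
        # access() confirms the directory can be read and entered.
        if not (os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)):
            missing.append(path)
    return missing


if __name__ == "__main__":
    problems = unavailable_dirs()
    if problems:
        print("Unavailable:", ", ".join(problems))
    else:
        print("All shared directories are reachable.")
```

Running such a check at the start of a job script, and again before writing results, is one way to fail early instead of partway through a write.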

Aug 26, 2024 - 16:46 CDT
Update - Clarification: spark-login.chtc.wisc.edu is currently available.
We will likely reboot it later today for an upgrade to attempt to fix the issue with the filesystem.
In the meantime, it is possible the issue making /home or /scratch unavailable will reoccur.

Aug 26, 2024 - 10:55 CDT
Identified - Users reported over the weekend that /home and /scratch were unavailable while they were logged in to spark-login.chtc.wisc.edu.
We have investigated and are attempting to fix the underlying cause. We believe the issue is isolated to the login server (jobs in the Slurm queue should be unaffected).
As such, spark-login.chtc.wisc.edu will undergo a short maintenance period and will be unavailable for several more hours.

Aug 26, 2024 - 10:46 CDT

About This Site

This page provides information about unplanned downtimes and scheduled maintenance for services offered by the Center for High Throughput Computing.

Component - Current status - Uptime over the past 90 days
High Throughput Computing (HTC) System - Degraded Performance - 99.99%
Access Points - Degraded Performance - 99.98%
CHTC Pool - Operational - 100.0%
External Pools (OSPool, Campus HTCondor Pools) - Operational - 100.0%
Staging and Projects Space - Operational - 100.0%
File Transfers - Operational - 100.0%
High Performance Computing (HPC) System - Degraded Performance - 99.09%
Login Nodes - Degraded Performance - 99.42%
Cluster Nodes and Jobs - Degraded Performance - 99.44%
Central Software Installations - Operational - 100.0%
Home and Scratch File Systems - Degraded Performance - 97.52%
Data Transfer Tools - Operational - 100.0%
Globus Endpoint - Operational - 100.0%
CHTC Internal Infrastructure - Operational - 100.0%
Tiger Cluster - Operational - 100.0%
Scheduled Maintenance
HTC Central Manager Update Sep 11, 2024 08:00-17:00 CDT
An important component of the HTC system will be upgraded on Wednesday, September 11. Jobs may be interrupted.
Posted on Aug 29, 2024 - 09:58 CDT
Past Incidents
Sep 8, 2024

No incidents reported today.

Sep 7, 2024

No incidents reported.

Sep 6, 2024

Unresolved incident: [HTC] "condor_watch_q" not working.

Sep 5, 2024

No incidents reported.

Sep 4, 2024
Completed - The scheduled maintenance has been completed.
Sep 4, 10:00 CDT
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Sep 4, 08:00 CDT
Scheduled - ap2002 will be down on Wednesday, September 4, starting around 8:00 AM. We will be swapping out some of the hardware components for the ap2002 server. Jobs will be interrupted but should remain in the queue and start when the downtime is over.
Aug 29, 09:57 CDT
Sep 3, 2024
Sep 2, 2024

No incidents reported.

Sep 1, 2024

No incidents reported.

Aug 31, 2024

No incidents reported.

Aug 30, 2024
Aug 29, 2024

No incidents reported.

Aug 28, 2024
Resolved - We've resolved the issue in our network firewall that was preventing Docker jobs from accessing the internet.
Aug 28, 13:56 CDT
Update - We believe we have a solution for the problem and will be deploying the fix over the next couple of days.
Aug 28, 08:56 CDT
Identified - We've encountered a few instances of Docker jobs on the High Throughput system being unable to connect to the internet, with an error like "Could not resolve hostname github.com: Temporary failure in name resolution".
We've identified the cause as being changes to our network firewall on particular machines. We are working to deploy a fix, but until we do the issue could become more widespread.
If you encounter this issue, please email us at chtc@cs.wisc.edu with the specific Job IDs that encountered the error.
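
To confirm that a failure is a name-resolution problem rather than something specific to one tool, a job could test DNS lookup directly. The sketch below uses Python's standard socket module. The function name can_resolve is hypothetical, and github.com appears only because it is the hostname in the error message above.

```python
import socket


def can_resolve(hostname, port=443):
    """Return True if `hostname` resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # Raised for lookup failures such as
        # "Temporary failure in name resolution".
        return False


if __name__ == "__main__":
    host = "github.com"
    status = "resolves" if can_resolve(host) else "does NOT resolve"
    print(f"{host} {status}")
```

If the lookup fails inside a Docker job, recording that result alongside the Job ID may help when reporting the problem.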

Aug 26, 11:46 CDT
Aug 27, 2024
Aug 26, 2024
Aug 25, 2024

No incidents reported.