Monitoring - Reboots were completed over the weekend; the networking issue with Docker jobs has been resolved.

Jobs may still go on hold if they fail to download the Docker container image. We will continue working on this issue as the HTC pool is upgraded to CentOS Stream 9.

May 13, 2024 - 16:11 CDT
Investigating - Reduced HTC Capacity

An unexpected power outage is impacting one of our server rooms and may shut down some of our execution points. This may reduce the size of the pool and lead to longer-than-usual queue times.

Issues with Docker Jobs

* Docker jobs on the machines running CentOS Stream 9 may not be able to access the internet due to issues with our firewall. This may produce varied and obscure error messages, depending on your program and whether it needs network access.
* To solve this issue, many of our nodes will need to be rebooted. These reboots are starting now and will cause jobs to be interrupted. Interrupted jobs should remain in the queue and be restarted by HTCondor.
* Docker jobs may fail to download the Docker container image. Such jobs go on hold with a message like "Error from slotY_ZZ@eXXX.chtc.wisc.edu: Unable to find image" followed by messages with "Pulling fs layer". This is separate from the above issues, and we are working to resolve it.
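
To tell the image-download holds apart from other holds, a hold reason can be checked against the message quoted above. The following is a minimal sketch, not an official CHTC tool: the function name `looks_like_image_pull_failure` and the sample slot name are hypothetical, and it assumes the hold reason contains the quoted text "Unable to find image" verbatim.

```python
# Marker text taken from the hold message quoted in the notice above.
# Assumption: real hold reasons contain this substring verbatim.
IMAGE_PULL_MARKER = "Unable to find image"


def looks_like_image_pull_failure(hold_reason: str) -> bool:
    """Return True if a hold reason matches the Docker image download failure."""
    return IMAGE_PULL_MARKER in hold_reason


if __name__ == "__main__":
    # Hypothetical hold reason, shaped like the message in the notice.
    sample = "Error from slot1_12@e0001.chtc.wisc.edu: Unable to find image"
    print(looks_like_image_pull_failure(sample))
```

Hold reasons for your own jobs can be listed with `condor_q -hold` and pasted into the check above.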

May 10, 2024 - 16:35 CDT

About This Site

This page provides information about unplanned downtimes and scheduled maintenance for services offered by the Center for High Throughput Computing.

High Throughput Computing (HTC) System: Operational, 99.81% uptime over the past 90 days
Access Points: Operational, 99.83% uptime over the past 90 days
CHTC Pool: Operational, 99.25% uptime over the past 90 days
External Pools (OSPool, Campus HTCondor Pools): Operational, 100.0% uptime over the past 90 days
Staging and Projects Space: Operational, 99.98% uptime over the past 90 days
File Transfers: Operational, 100.0% uptime over the past 90 days
High Performance Computing (HPC) System: Operational, 99.85% uptime over the past 90 days
Login Nodes: Operational, 99.56% uptime over the past 90 days
Cluster Nodes and Jobs: Operational, 99.84% uptime over the past 90 days
Central Software Installations: Operational, 100.0% uptime over the past 90 days
Home and Scratch File Systems: Operational, 100.0% uptime over the past 90 days
Data Transfer Tools: Operational, 99.98% uptime over the past 90 days
Globus Endpoint: Operational, 99.98% uptime over the past 90 days
CHTC Internal Infrastructure: Operational, 100.0% uptime over the past 90 days
Tiger Cluster: Operational, 100.0% uptime over the past 90 days
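
The uptime percentages above can be read as approximate downtime. The sketch below assumes each percentage is measured over the full 90-day window (the page does not say how uptime is computed); the helper name `downtime_hours` is hypothetical.

```python
def downtime_hours(uptime_percent: float, window_days: int = 90) -> float:
    """Convert an uptime percentage into hours of downtime over the window.

    Assumption: the percentage covers the whole window, 24 hours a day.
    """
    total_hours = window_days * 24
    return total_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Values taken from the component list above.
    for name, pct in [("HTC System", 99.81), ("CHTC Pool", 99.25)]:
        print(f"{name}: about {downtime_hours(pct):.1f} hours of downtime")
```

Under that assumption, 99.25% uptime corresponds to roughly 16 hours of downtime in 90 days, and 99.81% to roughly 4 hours.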
Status levels: Operational, Degraded Performance, Partial Outage, Major Outage, Maintenance
Past Incidents
May 19, 2024

No incidents reported today.

May 18, 2024

No incidents reported.

May 17, 2024

No incidents reported.

May 16, 2024

No incidents reported.

May 15, 2024
Resolved - Users can log in to ap2001.chtc.wisc.edu again.
Jobs that were running while the AP was offline may have been interrupted, but should have automatically requeued.

May 15, 13:40 CDT
Monitoring - A fix has been implemented and we are monitoring the results.
May 15, 13:22 CDT
Investigating - User reports that ap2001.chtc.wisc.edu was offline have been confirmed. The machine is not restarting and we are investigating the issue. Users will not be able to log in to ap2001 until the issue is resolved.
May 15, 09:14 CDT
May 14, 2024

No incidents reported.

May 13, 2024

Unresolved incident: [HTC] Power outage, issues with Docker jobs.

May 12, 2024

No incidents reported.

May 11, 2024

No incidents reported.

May 10, 2024
May 9, 2024

No incidents reported.

May 8, 2024

No incidents reported.

May 7, 2024

No incidents reported.

May 6, 2024

No incidents reported.

May 5, 2024

No incidents reported.