Resolved -
Most of the modern worker nodes should be back up, and the cluster is operating normally.
Mar 16, 13:33 CDT
Identified -
The HPC cluster login node (spark-login) is back up. We are bringing the worker nodes of the cluster up now.
Mar 16, 12:00 CDT
Investigating -
The HPC cluster (accessed via spark-login.chtc.wisc.edu) went down over the weekend due to a power outage. We will update this incident as the cluster comes back online.
Mar 16, 07:57 CDT
Resolved -
This incident has been resolved.
Mar 13, 09:09 CDT
Monitoring -
A fix has been implemented and we are monitoring the results.
Mar 12, 16:53 CDT
Investigating -
Users may be experiencing issues logging in to spark-login, including the session hanging after the ssh command is entered or a repeating message after a successful login (kernel: watchdog: BUG: soft lockup). We are currently investigating.
Mar 12, 16:16 CDT
Completed -
We've rebooted the machine with 7 GPUs, and the machine is currently accepting jobs.
Mar 9, 13:34 CDT
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 6, 06:00 CST
Scheduled -
We will be shutting down mrudolphgpu4001 to repair a faulty GPU. The machine will be brought back up with 7 GPUs while we get the GPU replaced.
Feb 23, 15:34 CST
Resolved -
This incident has been resolved.
Mar 6, 09:36 CST
Monitoring -
We've implemented a fix for the CUDA_VISIBLE_DEVICES issues on various GPU machines. Please resubmit your jobs and email chtc@cs.wisc.edu if you are still experiencing this issue.
Feb 6, 10:18 CST
Update -
We are continuing to investigate this issue.
Feb 4, 16:38 CST
Investigating -
We have received reports of issues with the CUDA_VISIBLE_DEVICES environment variable being set incorrectly on certain GPU jobs. We are investigating the issue and will update this page once more information is known.
Feb 2, 13:57 CST
Resolved -
This incident has been resolved.
Mar 6, 09:36 CST
Investigating -
We have received some reports of issues with file transfers from ResearchDrive in HTCondor jobs for folks with the ResearchDrive/CHTC integration.
*** Please report any issues with file transfer between CHTC and ResearchDrive to chtc@cs.wisc.edu as we are investigating the cause ***
Feb 13, 17:06 CST
Resolved -
This incident has been resolved.
Mar 6, 08:58 CST
Identified -
OSDF file transfers are currently not working. Users may see a hold message like, "Details: failed to get namespace information for remote URL ... error while querying the director". This is due to an unexpected issue with an upgrade. We are working to resolve the issue.
Mar 5, 16:31 CST