Monitoring - We replaced the network hardware yesterday afternoon and the operation had minimal impact on the system.
We are continuing to monitor the system but are hopeful that the hardware swap will address the majority of the disconnect issues the HPC system has experienced over the last couple of weeks.
We continue to investigate the cause of the intermittent disconnects that have affected the HPC system throughout the summer.
Sep 04, 2024 - 09:14 CDT
Update - The /scratch directory has been reconnected to spark-login.chtc.wisc.edu. We believe this particular disconnect was caused by the intermittent issue that we experienced earlier this summer, and are investigating.
We are replacing the network hardware this afternoon, but the process should cause minimal to no interruption to the HPC system. We will provide further updates as they become available.
Sep 03, 2024 - 12:08 CDT
Update - We are aware that the /scratch directory has disconnected. A short maintenance period is being planned to replace hardware that may be contributing to the underlying cause of this issue. More information regarding this downtime will be shared once it is available.
Sep 03, 2024 - 05:07 CDT
Update - We have made some configuration changes that appear to have alleviated the symptoms (the disconnects appear to have become much less frequent). We are hopeful that the filesystem is more stable now, but are still working to address the underlying causes.
Aug 30, 2024 - 10:50 CDT
Update - Directories of the shared file system (/home, /scratch, /software) disconnected from the login server (spark-login.chtc.wisc.edu) sometime overnight.
We are working to reconnect the directories.
Aug 28, 2024 - 09:10 CDT
Update - We are continuing our attempts to resolve the issue. Several more disconnects occurred throughout today.
An email with more information is being sent to users this afternoon via the chtc-users@cs.wisc.edu email list.
Aug 27, 2024 - 17:08 CDT
Update - The HPC system is currently experiencing problems with some of its networking hardware, which in turn are causing the recent issues with the /home, /scratch, and /software directories. Unfortunately, this hardware problem means that any machine in the HPC system can lose access to one or more of those directories, including during job execution.
Until we resolve the hardware issue, additional interruptions are expected. We are working with the vendors to resolve the issue, and will provide further updates as they become available.
Aug 26, 2024 - 16:46 CDT
Update - Clarification: spark-login.chtc.wisc.edu is currently available.
We will likely reboot it later today to apply an upgrade intended to fix the issue with the filesystem.
In the meantime, it is possible that the issue making /home or /scratch unavailable will recur.
Aug 26, 2024 - 10:55 CDT
Identified - Users reported over the weekend that /home and /scratch were unavailable when logged in to spark-login.chtc.wisc.edu.
We have investigated and are attempting to fix the underlying cause. We believe the issue is isolated to the login server (jobs in the Slurm queue should be unaffected).
As such, spark-login.chtc.wisc.edu will undergo a short maintenance period and will be unavailable for several more hours.
Aug 26, 2024 - 10:46 CDT