Monitoring -
The data recovery process for /projects is complete. We believe we have recovered close to 100% of the data that was originally present in these directories. Some of the metadata for files (like file creation date) may be incorrect; we strongly recommend validating any data that you copy from the recovered file system.
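If it helps with validation, here is a minimal sketch using standard tools; the group name, paths, and the existence of a trusted second copy of your data are assumptions, so adjust them to your own setup:

cd /projects/yourgroup && find . -type f -exec sha256sum {} + > ~/recovered.sha256   # checksum the recovered files
cd /path/to/trusted/copy && sha256sum -c ~/recovered.sha256                          # compare against a copy you trust

If no second copy exists, spot-checking file sizes and the contents of a few key files is a reasonable sanity check.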
Update -
The data recovery process for /staging and HTC /software is complete. We believe we have recovered about 50% of the data that was originally present in these directories. Some of the metadata for files (like file creation date) may be incorrect; we strongly recommend validating any data that you copy from the recovered file system.
Update -
We are nearly finished recovering data from the /staging directory and will provide more information in the next day or so as we confirm the results of the recovery. We are still working on recovering data from the /projects directory and anticipate it will be several weeks before it is ready for users to access.
Dec 10, 15:48 CST
Update -
We have created new /staging, /projects, and /software data spaces. Please email us if you need your group's /staging, /projects, or /software directories re-created. If any aspect of your jobs relied on these directories and you are currently having issues running jobs, contact us at chtc@cs.wisc.edu.
Dec 6, 11:15 CST
Update -
All HTC users should now have access to a new staging directory with a default quota of 100GB / 1000 items. This space can be used exactly like the previous /staging directories to run jobs.
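To check your usage against the default quota, one option is the standard du and find tools; the directory path below is an assumption based on the usual per-user layout, so substitute your actual /staging path:

du -sh /staging/yournetid        # total space used (default quota: 100GB)
find /staging/yournetid | wc -l  # number of items, files and directories (default quota: 1000)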
Identified -
We have identified the issue that caused the file system problems on Thursday, and we are able to prevent it from recurring. However, it resulted in significant data loss in /staging, /projects, HTC /software, and /squid before CHTC personnel were able to react.
All data in /squid is unrecoverable. Any remaining data in /projects and /staging is currently inaccessible as we work to recover whatever additional data we can. We hope to recover at least 50% of /staging and 60% of /projects.
This week (Nov 25-27), we will create a new data store to serve the “/staging” and “/projects” directories. Initially, there will be no data inside these directories. This new data backend for the /staging and /projects directories will be used for CHTC data storage moving forward and will be usable in jobs once available. We will post on this status page when these directories are available.
Resolved -
No further issues have been reported. Marking incident as resolved.
If you encounter the issue again, please let us know at chtc@cs.wisc.edu.
Apr 23, 13:13 CDT
Monitoring -
OSDF transfers on the system have been stable most of this week, so we are hopeful this issue is resolved. We'll continue to monitor in case the issue returns.
Apr 18, 13:48 CDT
Update -
The fix we deployed has addressed the cause of the "permission denied" type of OSDF hold messages. However, we are still seeing slow transfers and jobs going on hold with related messages.
We are investigating the cause of the slow transfers now.
Apr 8, 11:47 CDT
Update -
We believe we've identified the root cause of this issue and are working to deploy a fix. Transfers declared using the osdf:///chtc/staging/ syntax involving unique-per-job data are the most likely to encounter this error until the fix is deployed.
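For reference, an affected declaration looks roughly like the submit-file excerpt below; the file name and path under /chtc/staging are hypothetical:

# HTCondor submit file excerpt
transfer_input_files = osdf:///chtc/staging/yournetid/unique_per_job_input.tar.gz
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

Transfers of shared (non-unique-per-job) files declared the same way appear less likely to hit the error.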
Apr 8, 09:35 CDT
Identified -
We have confirmed user reports of jobs going on hold with a message along the lines of "Transfer input files failure at ... using protocol osdf ... Unable to read (...Path...); permission denied ...". This appears to be an intermittent issue, and we are working to identify the root cause, which we believe is related to a certain system being overwhelmed.
If you encounter this hold message, wait a few minutes, then release or resubmit the affected jobs to try again. Let us know at chtc@cs.wisc.edu if this is significantly impacting your work.
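A minimal sketch of the release workflow, using standard HTCondor commands (the job ID below is illustrative):

condor_q -hold        # list your held jobs and the hold reason for each
condor_release 12345  # release a held job or cluster by its ID to retry the transfer

If a job goes on hold again with the same message, resubmitting or waiting a bit longer before releasing may help.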
Apr 3, 11:57 CDT
Resolved -
We recently identified several misconfigured machines in the storage system backing /staging and /projects. After addressing the issues, performance of the system has been much better.
If you encounter issues with slow transfers involving /staging or /projects, let us know at chtc@cs.wisc.edu.
Apr 23, 13:11 CDT
Identified -
We've confirmed several reports of slow performance in the /staging and /projects directories. Users may encounter slow file transfers to and from /staging and /projects, and commands that query files in those directories may be slow or hang entirely.
This issue is related to heavy disk usage in these spaces as a side effect of the ongoing data recovery process. Unfortunately, there is no good workaround at this time, but users are encouraged to move or remove any recovered data (located in /recovery).
We ask that users be patient while we work to resolve this issue.
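As a sketch, assuming your recovered data was placed under a per-group path in /recovery (the exact layout and names below are hypothetical):

ls /recovery/yourgroup                                 # see what was recovered
mv /recovery/yourgroup/needed_data /staging/yourgroup  # keep what you still need
rm -r /recovery/yourgroup/unneeded_data                # remove what you no longer need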
Jan 23, 16:59 CST
Resolved -
This incident has been resolved. Some jobs may have been interrupted.
Apr 22, 15:30 CDT
Investigating -
Users on the HPC system may see "slurm_load_jobs error: Unexpected message received" when running Slurm-related commands. We are currently investigating.
Apr 22, 15:15 CDT
Completed -
The scheduled maintenance has been completed.
Apr 15, 14:02 CDT
In progress -
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Apr 15, 11:40 CDT
Scheduled -
We are draining voyles2000.chtc.wisc.edu to perform maintenance on the machine. Users' jobs will not be able to match to the machine at this time.
Apr 15, 11:38 CDT
Resolved -
This incident has been resolved.
Apr 14, 16:14 CDT
Identified -
The connection is broken between the /scratch storage system and the spark-login.chtc.wisc.edu server. The /scratch storage system is still intact and accessible from worker nodes, and data should be unaffected.
We are working to repair the connection between /scratch and spark-login.
Apr 14, 10:37 CDT
Investigating -
Users of the HPC system trying to access files or directories in /scratch will see a "Permission denied" message. We are actively investigating.
Apr 14, 09:14 CDT