Update - Sporadic reports of the "pull rate limit exceeded" hold message for Docker jobs have continued throughout the past week. "Trying again later" is still the best workaround at this time.
To avoid pulling Docker containers at all, consider creating an Apptainer .sif image of the container, as described in our guide here: https://chtc.cs.wisc.edu/uw-research-computing/htc-docker-to-apptainer. You may still run into the error when building the .sif image, but once it is built, you can use the image as the software environment for your jobs in much the same way as a Docker container. The difference is that the .sif image is a standalone file that you place in your /staging directory, so your jobs do not have to pull from DockerHub and are not subject to its pull rate limits.
We are unlikely to resolve this issue until after the holidays, so until we say otherwise, you may assume that this issue is still ongoing.
Dec 19, 2024 - 17:09 CST
Update - This afternoon, there have been multiple user reports of their Docker jobs going on hold with the "pull rate limit exceeded" message. "Trying again later" is still the best workaround at this time.
Dec 09, 2024 - 16:55 CST
Update - This issue seems to be intermittent in nature. We have not identified the root cause yet, but are still investigating. In the meantime, trying again later seems to be the workaround.
Dec 03, 2024 - 12:22 CST
Investigating - We are encountering errors on the HTC system about the Docker pull rate limit being exceeded. Jobs may be going on hold with such an error message. We are investigating the issue. Note that our system is set up such that this shouldn't be an issue, and it has nothing to do with your account at CHTC or on DockerHub.
Dec 02, 2024 - 16:13 CST
Update - The data recovery process for /staging and HTC /software is complete. We believe we have recovered about 50% of the data that was originally present in these directories. Some of the metadata for files (like file creation date) may be incorrect; we strongly recommend validating any data that you copy from the recovered file system.
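One way to carry out the validation recommended above is to compare file contents by checksum, since recovered timestamps may not be trustworthy. The following is a minimal Python sketch, assuming you have an independent reference copy of the data (for example, on a local machine). The function names `sha256_of` and `compare_trees` are hypothetical helpers, not CHTC tools.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(recovered_dir, reference_dir):
    """List files from reference_dir that are missing or differ in recovered_dir.

    Assumption: reference_dir is a copy you trust; only file contents are
    compared, not metadata such as creation dates.
    """
    recovered_dir = Path(recovered_dir)
    reference_dir = Path(reference_dir)
    problems = []
    for ref_file in sorted(reference_dir.rglob("*")):
        if not ref_file.is_file():
            continue
        rel = ref_file.relative_to(reference_dir)
        candidate = recovered_dir / rel
        if not candidate.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(candidate) != sha256_of(ref_file):
            problems.append((str(rel), "content differs"))
    return problems
```

Calling `compare_trees("/path/to/recovered", "/path/to/reference")` returns a list of (relative path, reason) pairs; an empty list means every reference file was found with identical contents. Both paths here are placeholders.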
Update - We are nearly finished recovering data from the /staging directory. We will provide more information in the next day or so as we confirm the recovery process. We are still working on recovering data from the /projects directory and anticipate it will be several weeks before it is ready for users to access.
Dec 10, 2024 - 15:48 CST
Update - We have created new /staging, /projects and /software data spaces. Please email us if you need your group /staging, /projects, or /software directories re-created. If any aspects of your jobs relied on these directories and you are currently having issues running jobs, contact us at chtc@cs.wisc.edu.
Dec 06, 2024 - 11:15 CST
Update - All HTC users should now have access to a new staging directory with a default quota of 100GB / 1000 items. This space can be used exactly like the previous /staging directories to run jobs.
Identified - We have identified the issue that was causing file system problems on Thursday. We are able to prevent it from recurring; however, it resulted in significant data loss in /staging, /projects, HTC /software and /squid before CHTC personnel were able to react.
All data in /squid is unrecoverable. Any remaining data in /projects and /staging is currently inaccessible as we work to recover whatever additional data we can. We hope to recover at least 50% of /staging and 60% of /projects.
This week (Nov 25-27), we will create a new data store to serve the “/staging” and “/projects” directories. Initially, there will be no data inside these directories. This new data backend for the /staging and /projects directories will be used for CHTC data storage moving forward and will be usable in jobs once available. We will post on this status page when these directories are available.
Resolved -
This incident should be resolved now.
Jan 13, 13:14 CST
Monitoring -
The file system performance appears to be much improved at this time. Please let us know if you continue to experience issues with /staging or /projects.
Jan 7, 15:22 CST
Investigating -
We've confirmed several reports of slow performance in the /staging and /projects directories. We are actively investigating the cause, which at this time we believe is a consequence of our efforts to recover data from the /projects directory.
Users may encounter slow file transfers to and from /staging and /projects, and commands that query files in those directories may be slow or hang entirely. Unfortunately, there is not a good workaround at this time. We ask that users be patient while we investigate and resolve this issue.
Jan 7, 13:21 CST
Resolved -
A bug in the `nvidia-smi` code was causing the command to use an unnecessarily large amount of memory. We've implemented a fix that should resolve this specific problem.
Jan 13, 13:14 CST
Investigating -
We've confirmed user reports that attempts to run the `nvidia-smi` command inside of a Docker job can cause the job to go on hold for exceeding the memory request, even for large memory jobs. This appears to be limited to certain machines, though there seems to be no correlation between them.
Until we identify and resolve the cause, we recommend that you do not use the `nvidia-smi` command inside of your Docker jobs.
Jan 8, 12:14 CST
Resolved -
We've re-deployed the OSDF origin, and testing confirmed it is functioning as expected. You can now use the "osdf:///chtc/staging/yourNetID/yourLargeFile" syntax in your submit files again to transfer files via the OSDF system. If you encounter issues, please let us know at chtc@cs.wisc.edu.
Jan 2, 10:44 CST
Identified -
We've been slowly rolling out the ability for users to transfer files to and from /staging via the OSDF, with syntax that looks like "osdf:///chtc/staging/yourNetID/yourLargeFile". The steps we've taken to mitigate the impact of https://status.chtc.wisc.edu/incidents/ngnflxddq0wh have inadvertently broken this functionality. We are working to repair it and are also taking the opportunity to update the corresponding software. We are hopeful the functionality will be available again by early next week.
In the meantime, for running jobs within CHTC, you can change "osdf:///chtc/staging/..." to "file:///staging/..." and add the requirement "requirements = (HasCHTCStaging == true)" in your submit file.
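If you have many submit files to update, the change described above can be scripted. The following is a minimal Python sketch, not an official CHTC tool: the function name `apply_staging_workaround` is hypothetical, and it assumes a simple submit file with single-line `requirements` and `queue` statements. Review the result before submitting.

```python
import re

OSDF_PREFIX = "osdf:///chtc/staging/"
FILE_PREFIX = "file:///staging/"
STAGING_REQUIREMENT = "(HasCHTCStaging == true)"


def apply_staging_workaround(submit_text):
    """Swap OSDF staging URLs for file:// URLs and require HasCHTCStaging."""
    if OSDF_PREFIX not in submit_text:
        return submit_text  # nothing to change

    lines = submit_text.replace(OSDF_PREFIX, FILE_PREFIX).splitlines()
    req_re = re.compile(r"^\s*requirements\s*=\s*(.+)$", re.IGNORECASE)

    has_requirements = False
    for i, line in enumerate(lines):
        match = req_re.match(line)
        if match:
            has_requirements = True
            # Combine with any existing expression rather than replacing it.
            if "haschtcstaging" not in line.lower():
                existing = match.group(1).strip()
                lines[i] = f"requirements = ({existing}) && {STAGING_REQUIREMENT}"

    if not has_requirements:
        # Place the new line before the first queue statement, since commands
        # that follow a queue statement do not apply to the jobs it creates.
        queue_index = next(
            (i for i, l in enumerate(lines) if l.strip().lower().startswith("queue")),
            len(lines),
        )
        lines.insert(queue_index, f"requirements = {STAGING_REQUIREMENT}")

    return "\n".join(lines) + "\n"
```

For example, passing the text of a submit file that contains "transfer_input_files = osdf:///chtc/staging/yourNetID/yourLargeFile" returns the same text with "file:///staging/yourNetID/yourLargeFile" and an added "requirements = (HasCHTCStaging == true)" line before the queue statement.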
Dec 6, 15:59 CST