[HTC] Issues with multi-GPU jobs using PyTorch

Identified

Jobs that use multiple GPUs and PyTorch may fail with an error indicating that GPUs are not detected. The error is occurring on several GPU machines after driver updates were applied.

We have identified the issue and are actively working to roll out fixes to our GPU machines between 10/27 and 10/31.

If you encounter this issue, here are some options:
* Wait until next week to submit multi-GPU jobs using PyTorch.
* Request alternative resources: a single GPU per job, a CPU-only workflow, or a workflow that does not use PyTorch.
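
If you are unsure whether a job is hitting this issue, a quick check at the start of the job can help. The following is a minimal sketch, not an official CHTC tool: it compares the number of GPUs PyTorch reports with the number the job expected. The `REQUESTED_GPUS` environment variable and the helper names are hypothetical and used only for illustration; set the expected count however suits your job.

```python
import os


def count_visible_gpus():
    """Return the number of GPUs PyTorch can see, or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    # device_count() returns 0 when no CUDA devices are detected.
    return torch.cuda.device_count()


def gpus_missing(requested, detected):
    """Return True when PyTorch detects fewer GPUs than the job requested."""
    return detected is not None and detected < requested


if __name__ == "__main__":
    # REQUESTED_GPUS is a hypothetical variable for this sketch, not set by the system.
    requested = int(os.environ.get("REQUESTED_GPUS", "2"))
    detected = count_visible_gpus()
    if detected is None:
        print("PyTorch is not installed in this environment.")
    elif gpus_missing(requested, detected):
        print(f"Expected {requested} GPUs but PyTorch detected {detected}.")
    else:
        print(f"PyTorch detected {detected} GPUs, as expected.")
```

If the script reports fewer GPUs than expected, the job is likely affected, and one of the options above may be the best path until the fixes are in place.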

We understand this incident is disruptive to researchers' workflows. Please reach out to us at chtc@cs.wisc.edu with any concerns.
Posted Oct 24, 2025 - 10:58 CDT
This incident affects: High Throughput Computing (HTC) System (CHTC Pool).