[HTC][GPUs] nvidia-smi command in Docker jobs

Incident Report for Center for High Throughput Computing

Resolved

An apparent bug in the `nvidia-smi` code was causing the command to use an unnecessarily large amount of memory.
We've implemented a fix that should resolve this specific problem.

Posted Jan 13, 2025 - 13:14 CST

Investigating

We've confirmed user reports that running the `nvidia-smi` command inside a Docker job can cause the job to go on hold for exceeding its memory request, even for jobs with large memory requests.
This appears to be limited to certain machines, though we have not yet identified a common factor among them.

Until we identify and resolve the cause, we recommend that you do not use the `nvidia-smi` command inside of your Docker jobs.

Posted Jan 08, 2025 - 12:14 CST

This incident affected: High Throughput Computing (HTC) System (CHTC Pool).