Dept of Radiology Computing Infrastructure
GPU servers including Tesla V100, Quadro RTX 8000, and GeForce RTX 2080 Ti
04/12/2024
HTCondor updated to 23.0.8. For details, see https://htcondor.org/htcondor/release-highlights/#long-term-support-channel
02/29/2024
Harbor upgraded to version 2.9.1
04/16/2021
139TB SSD RAID now available
01/16/2021
Two Lambda RTX 8000 servers are now available
11/26/2019
Local Docker registry added: https://registry.cvib.ucla.edu/v2/_catalog
Check current and average GPU usage (internal only)
Running GPU example codes with HTCondor
- ssh to a GPU server
- Clone the example templates:
git clone https://github.com/CHTC/templates-GPUs.git
- Change into the test directory:
cd templates-GPUs/test
- Submit the "submit.sub" job:
condor_submit submit.sub
- Monitor job status with:
condor_q
- Review the job output files when the job completes
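The steps above submit a ready-made template from the CHTC repository. For reference, a minimal GPU submit file of the same shape looks like the sketch below (the executable and file names here are illustrative, not taken from the repository):

```
# Minimal HTCondor GPU submit file (illustrative names)
universe       = vanilla
executable     = run_gpu_job.sh
log            = job.log
output         = job.out
error          = job.err
request_gpus   = 1
request_cpus   = 1
request_memory = 4GB
queue
```

The request_gpus line is what tells HTCondor to match the job only to machines with an available GPU.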
Links
- For cluster access, problems and questions, please email support@cvib.ucla.edu
- CHTC info on Jobs That Use GPUs
- CHTC example GPU Job Templates
- CHTC R example
- Docker registry (Internal only)
- Docker/Condor usage (Internal only)
Hostname | GPU device | GPUs | GPU mem (GB) | CPUs | Mem (GB) |
REDLRADADM23589 | Tesla V100-SXM2-32GB | 8 | 32 | 40 | 512 |
REDLRADADM14958 | Quadro RTX 8000 | 10 | 48 | 48 | 1024 |
REDLRADADM14959 | Quadro RTX 8000 | 10 | 48 | 48 | 1024 |
REDLRADADM23620 | GeForce RTX 2080 Ti | 4 | 12 | 10 | 128 |
REDLRADADM23621 | GeForce RTX 2080 Ti | 4 | 12 | 10 | 128 |
REDLRADADM23710 | GeForce RTX 2080 Ti | 4 | 12 | 10 | 128 |
REDLRADADM23712 | GeForce GTX 980 | 3 | 4 | 16 | 128 |
REDWRADMMC23199 | GeForce GTX 1080 Ti | 4 | 12 | 6 | 96 |
redlradbei05920 | GeForce RTX 2080 Ti | 4 | 12 | 10 | 128 |
- Connecting:
ssh REDLRADADM23589.ad.medctr.ucla.edu
- Using the NGC Catalog
- Choose a container by clicking the down arrow in any of the panels; this copies the pull command to your clipboard
- ssh to the DGX-1 and paste the command you obtained from the catalog to pull the image
- Running with htcondor
- See the "Getting Started" tab for examples and modify the submit file.
- For example, in templates-GPUs/docker/tensorflow_python/test_tensorflow.sub:
- Change: Requirements = (Target.CUDADriverVersion >= 10.1)
- To: Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
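Put together, a submit file pinned to a specific server would include lines like the following (the hostname is taken from the table above; the driver version matches the template's example and may need adjusting for your container):

```
# Request one GPU and restrict matching to a single machine
request_gpus = 1
requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
```

Note that the hostname must be in straight double quotes; curly "smart" quotes pasted from a web page will cause the requirements expression to fail.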
- Users can store up to 200GB in their /raid/username directory
- There are currently no enforced GPU usage policies, so users are asked not to monopolize GPUs with interactive sessions. Policies will be added over time as users condorize their applications for batch execution.
Check current Disk usage (internal only)
GPU servers have the following filesystems available for use:
Hostname | Share | Size (TB) | Backup |
REDLRADADM14901 | radraid | 147 | No |
REDLRADADM23589 | raid | 7 | No |
REDLRADADM14958 | raid | 35 | No |
REDLRADADM14959 | raid | 35 | No |
REDLRADADM18059 | scratch | 22 | No |
REDLRADADM18059 | data | 51 | No |
REDLRADADM21129 | trials | 22 | Yes |
REDLRADADM21129 | data | 30 | No |
REDLRADADM30333 | scratch | 32 | No |
REDLRADADM30333 | images | 22 | No |
REDLRADADM30333 | cvib4 | 10 | Yes |
REDLRADADM05294 | ciisraid | 70 | Yes |
REDLRADADM23716 | cvibraid | 70 | Yes |
Server | Cores | Mem (GB) | Disk (TB) | VMs |
thorserver15 | 24 | 128 | 1.63 | 11 |
thorserver17 | 32 | 512 | 2.72 | 24 |
thorserver18 | 40 | 512 | 3.27 | 33 |
thorserver19 | 24 | 512 | 2.54 | 27 |
thorserver20 | 40 | 512 | 6.11 | 25 |