Dept of Radiology Computing Infrastructure

GPU servers including Tesla V100, Quadro RTX 8000, and GeForce RTX 2080 Ti

03/15/2024
HTCondor updated to 23.0.6. For details, see https://htcondor.org/htcondor/release-highlights/#long-term-support-channel

02/29/2024
Harbor upgraded to version 2.9.1

02/28/2024
Kubernetes cluster upgraded to TKG 2.5.0 (kubernetes 1.28.4). For details, see https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.5/tkg-deploy-mc/mgmt-release-notes.html

04/16/2021
139TB SSD RAID now available

01/16/2021
Two Lambda Quadro RTX 8000 servers are now available

11/26/2019
Local Docker registry added: https://registry.cvib.ucla.edu/v2/_catalog
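
The /v2/_catalog path is the standard Docker Registry HTTP API endpoint for listing repositories, so hosted images can be enumerated from the command line (assuming anonymous read access is allowed):

  curl -s https://registry.cvib.ucla.edu/v2/_catalog
  # returns JSON like {"repositories":["name1","name2",...]} (names here are placeholders)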

Check current and average GPU usage (internal only)

Running GPU example code with HTCondor

  • ssh to a GPU server
  • Clone the example repository: git clone https://github.com/CHTC/templates-GPUs.git
  • cd templates-GPUs/test
  • Submit the "submit.sub" job with the command condor_submit submit.sub
  • Monitor job status with condor_q
  • Review the job output files when the job completes (a sketch of a typical submit file follows)
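
For orientation, a minimal GPU submit file generally looks like the sketch below; the actual contents of submit.sub in templates-GPUs/test may differ, and the wrapper script name here is a placeholder:

  # minimal HTCondor GPU submit file (illustrative sketch)
  universe       = vanilla
  executable     = run_gpu_test.sh   # placeholder wrapper script
  request_gpus   = 1
  request_cpus   = 1
  request_memory = 4GB
  log            = job.log
  output         = job.out
  error          = job.err
  queue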

GPU Servers

Hostname          GPU Device              GPUs  GPU Mem (GB)  CPUs  Mem (GB)
REDLRADADM23589   Tesla V100-SXM2-32GB       8            32    40       512
REDLRADADM14958   Quadro RTX 8000           10            48    48      1024
REDLRADADM14959   Quadro RTX 8000           10            48    48      1024
REDLRADADM23620   GeForce RTX 2080 Ti        4            11    10       128
REDLRADADM23621   GeForce RTX 2080 Ti        4            11    10       128
REDLRADADM23710   GeForce RTX 2080 Ti        4            11    10       128
REDLRADADM23712   GeForce GTX 980            3             4    16       128
REDWRADMMC01      GeForce GTX 1080 Ti        1            11     6        32
REDWRADMMC23199   GeForce GTX 1080 Ti        4            11     6        96
REDLRADADM11249   GeForce GTX Titan X        4            12     6        64
redlradbei05920   GeForce RTX 2080 Ti        4            11    10       128
  • Connecting: ssh REDLRADADM23589.ad.medctr.ucla.edu
  • Using the NGC Catalog
    • Choose a container by clicking the down arrow in any of the panels; this copies the pull command to your clipboard
    • ssh to the DGX-1 and paste the pull command obtained from the catalog
  • Running with HTCondor
    • See the "Getting Started" tab for examples and modify the submit file (a sketch combining these pieces follows this list)
      • For example, in templates-GPUs/docker/tensorflow_python/test_tensorflow.sub:
        • Change: Requirements = (Target.CUDADriverVersion >= 10.1)
        • To: Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  • Users can store up to 200GB in their /raid/username directory
  • There are currently no enforced GPU usage policies, so users are asked not to monopolize GPUs interactively. Over time, policies will be put in place as users condorize their applications for batch execution.
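
Combining the pieces above, a docker-universe submit file pinned to the DGX-1 might look roughly like this; the NGC image tag and script name are placeholders, not verified values:

  # sketch of a submit file pinned to the DGX-1 (illustrative only)
  universe       = docker
  docker_image   = nvcr.io/nvidia/tensorflow:24.03-tf2-py3   # placeholder NGC tag
  executable     = test_tensorflow.py                        # placeholder script
  requirements   = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  request_gpus   = 1
  request_memory = 8GB
  log            = tf.log
  output         = tf.out
  error          = tf.err
  queue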

Check current disk usage (internal only)

GPU servers have the following filesystems available for use:

Hostname          Share      Size (TB)  Backup
REDLRADADM14901   radraid          147  No
REDLRADADM23589   raid               7  No
REDLRADADM14958   raid              35  No
REDLRADADM14959   raid              35  No
REDLRADADM18059   scratch           22  No
REDLRADADM18059   data              51  No
REDLRADADM21129   trials            22  Yes
REDLRADADM21129   data              30  No
REDLRADADM30333   scratch           32  No
REDLRADADM30333   images            22  No
REDLRADADM30333   cvib4             10  Yes
REDLRADADM05294   ciisraid          70  Yes
REDLRADADM23716   cvibraid          70  Yes
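
From a GPU server, free space on a mounted share can also be checked directly with df; the mount point below is an assumption, as shares may be mounted under different paths per host:

  df -h /raid   # e.g. the "raid" share on REDLRADADM23589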

VM Info (internal only)

Server        Cores  Mem (GB)  Disk (TB)  VMs
thorserver15     24       128       1.63   11
thorserver17     32       512       2.72   24
thorserver18     40       512       3.27   33
thorserver19     24       512       2.54   27
thorserver20     40       512       6.11   25