Cluster

Check current and average GPU usage (internal only)

Running GPU example code with HTCondor
  •  ssh to a GPU server
  •  Clone the examples repository: git clone https://github.com/CHTC/templates-GPUs.git
  •  cd templates-GPUs/test
  •  Submit the example job: condor_submit submit.sub
  •  Monitor job status with condor_q
  •  Review the job output files when the job completes
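Put together, a complete session might look like the following; a minimal sketch, assuming an account on one of the GPU servers listed below:

  ssh REDLRADADM23589.ad.medctr.ucla.edu    # any GPU server from the table below
  git clone https://github.com/CHTC/templates-GPUs.git
  cd templates-GPUs/test
  condor_submit submit.sub                  # queue the example job
  condor_q                                  # re-run until the job leaves the queue
  ls -l                                     # output/log file names are set in submit.sub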
GPU servers
Hostname           GPU device            GPUs  GPU mem (GB)  CPUs  Mem (GB)
REDLRADADM23589    Tesla V100-SXM2-32GB  8     32            40    512
REDLRADADM14958    Quadro RTX 8000       10    48            48    1024
REDLRADADM14959    Quadro RTX 8000       10    48            48    1024
REDLRADADM23620    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23621    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23710    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23712    GeForce GTX 980       3     4             16    128
REDWRADMMC01       GeForce GTX 1080 Ti   1     12            6     32
REDWRADMMC23199    GeForce GTX 1080 Ti   4     12            6     96
REDLRADADM11249    GeForce GTX Titan X   4     12            6     64
redlradbei05920    GeForce RTX 2080 Ti   4     12            10    128
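The table above can be regenerated by querying the pool. A sketch reconstructed from a truncated command fragment in the page source; the -constraint expression is an assumption:

  condor_status -constraint 'TotalGpus > 0' \
      -af Machine CUDADeviceName TotalGpus CUDAGlobalMemoryMb CUDA0GlobalMemoryMb TotalCpus TotalMemory \
      | sed s/undefined//g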
  • Connecting to the DGX-1: ssh REDLRADADM23589.ad.medctr.ucla.edu
  • Using the NGC Catalog
    • Choose a container by clicking the down arrow in any of the panels; this copies the pull command to your clipboard
    • ssh to the DGX-1 and paste the command you copied from the catalog to pull the container (a sketch follows this list)
  • Running with HTCondor
    • See the "Getting Started" tab for examples and modify the submit file (a fuller submit-file sketch follows this list)
      • For example, in templates-GPUs/docker/tensorflow_python/test_tensorflow.sub:
        • Change: Requirements = (Target.CUDADriverVersion >= 10.1)
        • To: Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  • Users can store up to 200GB in their /raid/username directory
  • There are currently no enforced GPU usage policies, so users are asked not to monopolize GPUs interactively. Over time, policies will be put in place as users condorize their applications for batch execution.
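Pulling an NGC container on the DGX-1 might look like the following; a minimal sketch, where the image tag is an illustrative assumption and the actual pull command is whatever you copied from the catalog panel:

  ssh REDLRADADM23589.ad.medctr.ucla.edu
  # Paste the pull command copied from the NGC Catalog, for example:
  docker pull nvcr.io/nvidia/tensorflow:22.05-tf2-py3   # tag is an assumption; use the one you copied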
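And a hedged sketch of the modified submit file: only the Requirements line below is taken from this page; the remaining lines follow common HTCondor docker-universe conventions and may differ from the actual templates-GPUs file:

  # test_tensorflow.sub (sketch; see templates-GPUs for the real template)
  universe       = docker
  docker_image   = nvcr.io/nvidia/tensorflow:22.05-tf2-py3   # assumption; use the image pulled above
  executable     = test_tensorflow.py
  request_gpus   = 1
  request_cpus   = 1
  request_memory = 8GB
  # Keep the driver-version check and pin the job to the DGX-1:
  Requirements   = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  log            = test_tensorflow.log
  output         = test_tensorflow.out
  error          = test_tensorflow.err
  queue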
06/21/2022
HTCondor updated to 9.10.0. For details, see https://htcondor.readthedocs.io/en/latest/version-history/development-release-series-91.html
04/16/2021
139TB SSD RAID now available
01/16/2021
Two Lambda RTX 8000 servers are now available
11/26/2019
Local Docker registry added: https://registry.rip.ucla.edu/v2/_catalog
Check current disk usage (internal only)

GPU servers have the following filesystems available for use:
Hostname           Share     Size (TB)  Backup
REDLRADADM14901    radraid   139        No
REDLRADADM23589    raid      7          No
REDLRADADM14958    raid      7          No
REDLRADADM14959    raid      7          No
REDLRADADM18059    scratch   22         No
REDLRADADM18059    data      51         No
REDLRADADM23716    cvibraid  70         Yes
REDLRADADM21129    trials    22         Yes
REDLRADADM30333    images    22         No
REDLRADADM30333    scratch   32         No
VM Info (internal only)

Server        Cores  Mem (GB)  Disk (TB)  VMs
thorserver15  24     128       1.63       11
thorserver17  32     512       2.72       24
thorserver18  40     512       3.27       33
thorserver19  24     512       2.54       27
thorserver20  40     512       6.11       25