Cluster

Check current and average GPU usage (internal only)
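The usage page above is internal only; once logged in to one of the GPU servers listed below, current utilization can also be checked directly with nvidia-smi, for example:

    # Per-GPU utilization and memory on the current server
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv
    # Refresh the full report every 5 seconds
    nvidia-smi -l 5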

Running GPU example code with HTCondor (the commands are sketched after this list)
  •  ssh to a GPU server
  •  Run the command git clone https://github.com/CHTC/templates-GPUs.git
  •  cd templates-GPUs/test
  •  Submit the example job with the command condor_submit submit.sub
  •  Monitor job status with condor_q
  •  Review the job output files when completed
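Put together, a minimal session looks like the following (the hostname is illustrative; use whichever GPU server from the table below you have access to):

    # Clone the CHTC GPU templates and submit the test example
    ssh REDLRADADM23589.ad.medctr.ucla.edu
    git clone https://github.com/CHTC/templates-GPUs.git
    cd templates-GPUs/test
    condor_submit submit.sub   # submit the example job
    condor_q                   # monitor queued/running jobs
    # When the job leaves the queue, review the output/error/log files named in submit.sub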
GPU servers
Hostname  GPU device  GPUs  GPU mem (GB/GPU)  CPUs  Mem (GB)
REDLRADADM23589 Tesla V100-SXM2-32GB 8 32 40 512
REDLRADADM14958 Quadro RTX 8000 10 48 48 1024
REDLRADADM14959 Quadro RTX 8000 10 48 48 1024
REDLRADADM23620 GeForce RTX 2080 Ti 4 12 10 128
REDLRADADM23621 GeForce RTX 2080 Ti 4 12 10 128
REDLRADADM23710 GeForce RTX 2080 Ti 4 12 10 128
REDLRADADM23712 GeForce GTX 980 3 4 16 128
REDLRADADM23713 GeForce GTX 780 2 3 6 32
REDWRADMMC01 GeForce GTX 1080 Ti 1 12 6 32
REDWRADMMC23199 GeForce GTX 1080 Ti 4 12 6 96
REDLRADADM11249 GeForce GTX Titan X 4 12 6 64
redlradbei05920 GeForce RTX 2080 Ti 4 12 10 128
  • Connecting: ssh REDLRADADM23589.ad.medctr.ucla.edu
  • Using the NGC Catalog
    • Choose a container by clicking the down arrow on any of the panels; this copies the pull command to your clipboard
    • ssh to the DGX-1 and paste the command you obtained from the catalog to pull the image
  • Running with HTCondor (see the example session after this list)
    • See the "Getting Started" tab for examples and modify the submit file.
      • For example, in templates-GPUs/docker/tensorflow_python/test_tensorflow.sub:
        • Change: Requirements = (Target.CUDADriverVersion >= 10.1)
        • To: Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  • Users can store up to 200GB in their /raid/username directory
  • There are currently no GPU usage policies enforced, so users are asked not to monopolize GPUs interactively. Over time we'll put policies in place as users condorize their applications for batch execution.
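As a sketch of the container workflow above: pull an NGC image on the DGX-1, pin the submit file to that machine, and submit. The image tag below is only an illustrative assumption; use the pull command copied from the NGC Catalog.

    # On the DGX-1 (REDLRADADM23589), pull the container copied from the NGC Catalog
    docker pull nvcr.io/nvidia/tensorflow:21.03-tf2-py3

    # Edit the submit file so the job only matches the DGX-1, e.g. in
    # templates-GPUs/docker/tensorflow_python/test_tensorflow.sub set:
    #   Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))

    # Submit and monitor as before
    condor_submit test_tensorflow.sub
    condor_q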
News
04/16/2021
139 TB SSD RAID now available
01/28/2021
HTCondor updated to 8.9.11. For details, see https://htcondor.readthedocs.io/en/v8_9_11/version-history/development-release-series-89.html
01/16/2021
Two Lambda RTX 8000 servers are now available
11/26/2019
Local Docker registry added: https://registry.rip.ucla.edu/v2/_catalog
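The registry exposes the standard Docker Registry HTTP API v2, so (assuming anonymous read access from inside the network) its contents can be listed from the command line:

    # List repositories in the local registry
    curl -s https://registry.rip.ucla.edu/v2/_catalog
    # List tags for one repository (replace <name> with a repository from the catalog output)
    curl -s https://registry.rip.ucla.edu/v2/<name>/tags/list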
Check current disk usage (internal only)

GPU servers have the following filesystems available for use:
Hostname Share Size (TB) Backup Domain*
REDLRADADM14901 raid 139 No AD
REDLRADADM23589 raid 7 No AD
REDLRADADM14958 raid 7 No AD
REDLRADADM14959 raid 7 No AD
dingo.cvib.ucla.edu scratch 22 No LDAP
dingo.cvib.ucla.edu data 51 No LDAP
research.cvib.ucla.edu cvib2 15 Yes LDAP
skynet.cvib.ucla.edu cvib 26 Yes LDAP
synapse.cvib.ucla.edu trials 22 Yes LDAP
thorimage9.cvib.ucla.edu apps 16 Yes LDAP
thorimage11.cvib.ucla.edu images 22 No LDAP
thorimage11.cvib.ucla.edu scratch 32 No LDAP
*Migration to AD is planned
VM Info (internal only)

Server Cores Mem (GB) Disk (TB) VMs
thorserver15 24 128 1.63 11
thorserver17 32 512 2.72 24
thorserver18 40 512 3.27 33
thorserver19 24 512 2.54 27
thorserver20 40 512 6.11 25