Cluster

Check current and average GPU usage (internal only)

Running GPU example code with HTCondor
  •  ssh to a GPU server
  •  Clone the examples repository: git clone https://github.com/CHTC/templates-GPUs.git
  •  cd templates-GPUs/test
  •  Submit the example job: condor_submit submit.sub
  •  Monitor job status with condor_q
  •  Review the job output files when the job completes
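Put together, a complete session might look like the following; a minimal sketch, assuming an account on one of the GPU servers listed below:

  ssh REDLRADADM23589.ad.medctr.ucla.edu    # any GPU server from the table below
  git clone https://github.com/CHTC/templates-GPUs.git
  cd templates-GPUs/test
  condor_submit submit.sub                  # queue the example job
  condor_q                                  # re-run until the job leaves the queue
  ls -l                                     # output/log file names are set in submit.sub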
GPU servers
Hostname           GPU device            GPUs  GPU mem (GB)  CPUs  Mem (GB)
REDLRADADM23589    Tesla V100-SXM2-32GB  8     32            40    512
REDLRADADM14958    Quadro RTX 8000       10    48            48    1024
REDLRADADM14959    Quadro RTX 8000       10    48            48    1024
REDLRADADM23620    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23621    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23710    GeForce RTX 2080 Ti   4     12            10    128
REDLRADADM23712    GeForce GTX 980       3     4             16    128
REDWRADMMC01       GeForce GTX 1080 Ti   1     12            6     32
REDWRADMMC23199    GeForce GTX 1080 Ti   4     12            6     96
REDLRADADM11249    GeForce GTX Titan X   4     12            6     64
redlradbei05920    GeForce RTX 2080 Ti   4     12            10    128
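The table above can be regenerated by querying the pool. A sketch reconstructed from a truncated command fragment in the page source; the -constraint expression is an assumption:

  condor_status -constraint 'TotalGpus > 0' \
      -af Machine CUDADeviceName TotalGpus CUDAGlobalMemoryMb CUDA0GlobalMemoryMb TotalCpus TotalMemory \
      | sed s/undefined//g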
  • Connecting to the DGX-1: ssh REDLRADADM23589.ad.medctr.ucla.edu
  • Using the NGC Catalog
    • Choose a container by clicking the down arrow in any of the panels; this copies the pull command to your clipboard
    • ssh to the DGX-1 and paste the command you copied from the catalog to pull the container (a sketch follows this list)
  • Running with HTCondor
    • See the "Getting Started" tab for examples and modify the submit file (a fuller submit-file sketch follows this list)
      • For example, in templates-GPUs/docker/tensorflow_python/test_tensorflow.sub:
        • Change: Requirements = (Target.CUDADriverVersion >= 10.1)
        • To: Requirements = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  • Users can store up to 200GB in their /raid/username directory
  • There are currently no enforced GPU usage policies, so users are asked not to monopolize GPUs interactively. Over time, policies will be put in place as users condorize their applications for batch execution.
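Pulling an NGC container on the DGX-1 might look like the following; a minimal sketch, where the image tag is an illustrative assumption and the actual pull command is whatever you copied from the catalog panel:

  ssh REDLRADADM23589.ad.medctr.ucla.edu
  # Paste the pull command copied from the NGC Catalog, for example:
  docker pull nvcr.io/nvidia/tensorflow:22.05-tf2-py3   # tag is an assumption; use the one you copied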
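And a hedged sketch of the modified submit file: only the Requirements line below is taken from this page; the remaining lines follow common HTCondor docker-universe conventions and may differ from the actual templates-GPUs file:

  # test_tensorflow.sub (sketch; see templates-GPUs for the real template)
  universe       = docker
  docker_image   = nvcr.io/nvidia/tensorflow:22.05-tf2-py3   # assumption; use the image pulled above
  executable     = test_tensorflow.py
  request_gpus   = 1
  request_cpus   = 1
  request_memory = 8GB
  # Keep the driver-version check and pin the job to the DGX-1:
  Requirements   = ((Target.CUDADriverVersion >= 10.1) && (Machine == "REDLRADADM23589.ad.medctr.ucla.edu"))
  log            = test_tensorflow.log
  output         = test_tensorflow.out
  error          = test_tensorflow.err
  queue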
06/21/2022
HTCondor updated to 9.10.0. For details, see https://htcondor.readthedocs.io/en/latest/version-history/development-release-series-91.html
04/16/2021
139TB SSD RAID now available
01/16/2021
Two Lambda RTX 8000 servers are now available
11/26/2019
Local Docker registry added: https://registry.rip.ucla.edu/v2/_catalog
Check current disk usage (internal only)

GPU servers have the following filesystems available for use:
Hostname           Share     Size (TB)  Backup
REDLRADADM14901    radraid   139        No
REDLRADADM23589    raid      7          No
REDLRADADM14958    raid      7          No
REDLRADADM14959    raid      7          No
REDLRADADM18059    scratch   22         No
REDLRADADM18059    data      51         No
REDLRADADM23716    cvibraid  70         Yes
REDLRADADM21129    trials    22         Yes
REDLRADADM30333    images    22         No
REDLRADADM30333    scratch   32         No
VM Info (internal only)

Server        Cores  Mem (GB)  Disk (TB)  VMs
thorserver15  24     128       1.63       11
thorserver17  32     512       2.72       24
thorserver18  40     512       3.27       33
thorserver19  24     512       2.54       27
thorserver20  40     512       6.11       25