Using GPUs on the NCI National Facility Xe
A small number of GPUs have been installed on the Xe cluster to provide an environment for users to run GPU-enabled code, and for the National Facility to gain experience with using and managing GPUs.
The GPUs are available to all users. There is no specific charge for GPU usage but jobs are charged for associated CPU usage.
This document describes how to get started using the GPUs. For further help with using the GPUs, please email help@nf.nci.org.au.
System Configuration
The GPUs are an extension of the Xe system. Users should familiarise themselves with the Xe documentation for the basic setup and use of the system (Xe Userguide, Xe Hardware).
We have 16 Xe compute nodes (x1 to x16) deployed with nVidia Tesla S2050 GPU systems. Each of these nodes is connected to half of a Tesla S2050 unit at 8 Gbytes/s (via PCI-e Gen2). Each node (2 quad-core Xeon sockets) has access to the following:
- 2 Fermi GPUs, each with 448 CUDA cores running at 1.15GHz
- Each GPU has approximately 2.6Gbytes GDDR5 memory (ECC enabled)
- GPU Memory Bandwidth: 148GB/s and 1.55GHz Memory Clock speed.
- Increased Floating Point Performance enabled by the 2 GPUs: 1 Tflops (Double Precision) and 2.06 Tflops (Single)
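The quoted peak figures can be reproduced from the core count and clock speed. The following is a rough check, not an official calculation. It assumes that each CUDA core completes one fused multiply-add (2 floating point operations) per cycle in single precision and that double precision runs at half that rate; these per-core rates are assumptions and are not stated above. The function name peak_tflops is made up for this illustration.

```shell
#!/bin/sh
# Rough peak-performance check for one Xe GPU node (2 Fermi GPUs).
# Assumption: 2 flops per core per cycle (one fused multiply-add) in single
# precision, and half that rate (1 flop per core per cycle) in double precision.
peak_tflops() {
    # $1 = GPUs, $2 = CUDA cores per GPU, $3 = clock in GHz,
    # $4 = flops per core per cycle
    awk -v g="$1" -v c="$2" -v f="$3" -v p="$4" \
        'BEGIN { printf "%.2f\n", g * c * f * p / 1000 }'
}

echo "single precision: $(peak_tflops 2 448 1.15 2) Tflops"
echo "double precision: $(peak_tflops 2 448 1.15 1) Tflops"
```

Under these assumptions the result is 2.06 Tflops in single precision and 1.03 Tflops in double precision, which agrees with the figures above once the double precision value is rounded to 1 Tflops.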
The Fermi GPU architecture is depicted in the following diagram:
[Diagram: Fermi GPU architecture]
More details of the Tesla S2050 and the Fermi architecture are available at http://www.nvidia.com/object/product-tesla-S2050-us.html and http://www.nvidia.com/object/fermi_architecture.html.
Accessing the GPUs on Xe
Preparing Job Script
Access to GPUs is through the Xe PBS batch system. The following is a typical sample PBS job script which shows the minimum requirements for submitting a job with a GPU-enabled executable.
    #!/bin/bash
    #PBS -l ngpus=2
    #PBS -l ncpus=1
    module load nvidia
    PATH_TO_GPU_EXECUTABLE > output_file
The "-l ngpus" flag specifies the number of GPUs that will be dedicated to the job. Requested GPUs (and their attached memory) are never shared between jobs. There is a temporary constraint that both GPUs on a node must be allocated to the same job, so GPUs must be requested in multiples of 2. For jobs requiring multiple nodes, all CPUs of each node must be requested. This means that ncpus must be a multiple of 8, and ngpus must (currently) be the corresponding multiple of 2, i.e.
    #PBS -lncpus=8n
    #PBS -lngpus=2n
for some positive integer n.
As a consequence of this, the maximum number of GPU jobs that PBS will simultaneously execute is 16. We plan to address this limitation in the future.
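The whole-node rule can be sketched as a small shell helper that prints the matching directives for a given number of nodes. This is only an illustration: the function name gpu_directives is hypothetical, and the upper bound of 16 is taken from the number of GPU nodes listed under System Configuration.

```shell
#!/bin/sh
# Print the PBS resource requests for a job spanning N whole GPU nodes.
# Each Xe GPU node contributes 8 CPUs and 2 GPUs.
gpu_directives() {
    case $1 in
        ''|0*|*[!0-9]*)
            echo "node count must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$1" -gt 16 ]; then
        echo "only 16 GPU nodes are available" >&2
        return 1
    fi
    echo "#PBS -l ncpus=$(( 8 * $1 ))"
    echo "#PBS -l ngpus=$(( 2 * $1 ))"
}

gpu_directives 3
```

For three nodes this prints "#PBS -l ncpus=24" followed by "#PBS -l ngpus=6".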
If you wish to gain access to more than 2 GPUs in a job, you will need to use MPI (or pbsdsh/pbs_rsh). This is because PBS only starts the job on the first node (with the first 2 GPUs). That means you will need to request a multiple of 8 CPUs, and be aware of MPI locality when launching the MPI processes and communicating with the GPUs.
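One possible way to handle locality is to start each MPI process through a small wrapper that picks a GPU from the process's position within its node. The sketch below rests on several assumptions that are not confirmed by this document: that the MPI library is Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK), and that the installed CUDA runtime honours the CUDA_VISIBLE_DEVICES environment variable. The names pick_gpu and gpu_wrapper.sh are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper (call it gpu_wrapper.sh), launched once per MPI process.
# Assumptions: Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK (the rank's index
# within its node) and the CUDA runtime honours CUDA_VISIBLE_DEVICES.
pick_gpu() {
    # $1 = node-local rank, $2 = GPUs per node (2 on the Xe GPU nodes)
    echo $(( $1 % $2 ))
}

CUDA_VISIBLE_DEVICES=$(pick_gpu "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" 2)
export CUDA_VISIBLE_DEVICES
echo "local rank ${OMPI_COMM_WORLD_LOCAL_RANK:-0} uses GPU $CUDA_VISIBLE_DEVICES"
# A real wrapper would finish by running the GPU executable:  exec "$@"
```

With 8 processes per node, local ranks 0, 2, 4 and 6 would share the first GPU and ranks 1, 3, 5 and 7 the second. Such a wrapper might be launched as "mpirun ./gpu_wrapper.sh PATH_TO_GPU_EXECUTABLE".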
The nvidia module must be loaded to run any GPU program. Users should never need to load a specific version of the nvidia module (i.e. always use "module load nvidia", and never something like "module load nvidia/256.40").
Submit Job
After you have prepared the job script, it needs to be submitted to PBS as follows to allow your job to run on Xe. (Note that at the NF, jobs do not need to be submitted to a special queue.)
$ qsub job_script
Compiling code
The system environment has been set up to support both Cuda and OpenCL programs.
Compiling Cuda code
    $ module load cuda
    $ nvcc -o executable source.cu -lcuda -lcudart
The nVidia compiler is called nvcc, and it is provided by loading the cuda module. This module will also give access to the rest of the Cuda Toolkit. The nvidia module will be automatically loaded by the cuda module (unless it has already been loaded).
The Cuda SDK (as distinct from the Cuda Toolkit), which consists of libcutil, liboclUtil, and other utility libraries, is provided by the cuda-sdk module. Similarly, this module will auto-load the cuda module if it isn't already present. The Cuda SDK also includes a variety of example programs.
The Cuda Data Parallel Primitives (CUDPP) library is available by loading the cudpp module. This module will auto-load the cuda-sdk module if necessary.
When linking code with nvcc, both -lcuda and -lcudart should be specified. (Not doing so may still work, but may also cause error messages that are misleading and difficult to track down.)
We have installed a simple wrapper script for nvcc to extend its functionality and improve its integration into the NF systems.
Specifically, the nvcc wrapper supports using the following environment variables to automatically pass options to nvcc:
- $NVCC_FLAGS contains options that are always passed to nvcc
- $NVCC_COMPILER_FLAGS contains options that are each prepended by -Xcompiler
- $NVCC_LINKER_FLAGS contains options that are each prepended by -Xlinker
- $NVCC_OPENCC_FLAGS contains options that are each prepended by -Xopencc
- $NVCC_CUDAFE_FLAGS contains options that are each prepended by -Xcudafe
- $NVCC_PTXAS_FLAGS contains options that are each prepended by -Xptxas
- $NVCC_FATBIN_FLAGS contains options that are each prepended by -Xfatbin
If $NVCC_WRAPPER_VERBOSE is set to "y", then the wrapper will output the full nvcc command line to stderr.
The version of gcc installed on Xe has a problem with some of the nVidia system headers, and so requires the -fpermissive flag to be passed to it by nvcc. Thus, the cuda module adds "-fpermissive" to $NVCC_COMPILER_FLAGS, which is equivalent to passing "-Xcompiler -fpermissive" to every invocation of nvcc.
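The expansion of these variables can be illustrated with a few lines of shell. This is a sketch of the behaviour described above, not the installed wrapper itself, which may be implemented differently; the function name prepend_each and the extra -Wall flag are made up for the example.

```shell
#!/bin/sh
# Sketch of the flag expansion only; the installed wrapper may differ.
prepend_each() (
    # $1 = forwarding option (e.g. -Xcompiler)
    # $2 = whitespace-separated list of flags
    set -f                      # split on whitespace without expanding wildcards
    out=
    for flag in $2; do
        out="${out:+$out }$1 $flag"
    done
    printf '%s\n' "$out"
)

NVCC_COMPILER_FLAGS="-fpermissive -Wall"
prepend_each -Xcompiler "$NVCC_COMPILER_FLAGS"
# prints: -Xcompiler -fpermissive -Xcompiler -Wall
```

The same function would cover the other variables by changing the first argument, for example -Xlinker for $NVCC_LINKER_FLAGS or -Xptxas for $NVCC_PTXAS_FLAGS.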
nvcc does not understand "-Wl," compiler flags that are sometimes used, so the wrapper transparently converts such options into the corresponding "-Xlinker" flags that nvcc does understand.
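A conversion of this kind could look like the following sketch. It relies on the fact that nvcc's -Xlinker option accepts a comma-separated list, so only the "-Wl," prefix needs replacing. The exact output of the installed wrapper is not documented here and may differ; the function name convert_wl and the example paths are hypothetical.

```shell
#!/bin/sh
# Rewrite gcc-style "-Wl,opt1,opt2" arguments as "-Xlinker opt1,opt2" and
# pass every other argument through unchanged.
# Simplification: the result is printed as one space-joined line, so
# arguments containing spaces are not preserved as single words.
convert_wl() (
    out=
    for arg in "$@"; do
        case $arg in
            -Wl,*) arg="-Xlinker ${arg#-Wl,}" ;;
        esac
        out="${out:+$out }$arg"
    done
    printf '%s\n' "$out"
)

convert_wl -o executable source.cu -Wl,-rpath,/opt/lib
# prints: -o executable source.cu -Xlinker -rpath,/opt/lib
```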
nvcc uses $CPATH, but not $C_INCLUDE_PATH or $CPLUS_INCLUDE_PATH (which should be used). (For an explanation of the subtle differences between these environment variables, refer to our CanonicalUserEnvironmentVariables document.) The nvcc wrapper corrects this erroneous behaviour by adding to $CPATH any entries in $C_INCLUDE_PATH and $CPLUS_INCLUDE_PATH that aren't already there.
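The merge described above can be sketched as follows. This only illustrates the idea and is not the wrapper's actual code; the function name merge_into_cpath and the directory names are hypothetical.

```shell
#!/bin/sh
# Append to a CPATH-style list any entries from further colon-separated
# lists that are not already present, keeping the original order.
merge_into_cpath() (
    merged=$1                   # starting value, e.g. the current $CPATH
    shift
    set -f                      # no wildcard expansion while splitting
    IFS=:
    for list in "$@"; do
        for entry in $list; do
            [ -n "$entry" ] || continue
            case ":$merged:" in
                *":$entry:"*) ;;                            # already present
                *) merged="${merged:+$merged:}$entry" ;;    # append new entry
            esac
        done
    done
    printf '%s\n' "$merged"
)

merge_into_cpath "/a/include" "/a/include:/b/include" "/c/include:/b/include"
# prints: /a/include:/b/include:/c/include
```

In a wrapper this might be used as CPATH=$(merge_into_cpath "$CPATH" "$C_INCLUDE_PATH" "$CPLUS_INCLUDE_PATH") before calling the real nvcc.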
Compiling OpenCL code
    $ module load nvidia
    $ gcc -o executable source.c -lOpenCL
Cuda debugger
Cuda GDB is built on GDB, extended to support Cuda, and supports both 32-bit and 64-bit platforms. To get debug information, source files need to be compiled as follows:
    $ module load cuda
    $ nvcc -g -G ...
    $ cuda-gdb
The -g flag indicates that debugging information should be added to the (compiled) host code, and the -G flag indicates that debugging information should be added to the (compiled) device code.
Cuda-GDB supports changing focus between physical and logical coordinates, as well as switching between debugging the host and device code.
    (cuda-gdb) device sm warp lane block thread
    (cuda-gdb) cuda block 2,0 thread 256,1,1
    (cuda-gdb) p variable
Profiling GPU Programs
The Cuda Toolkit provides its own profiler: Compute Visual Profiler (CVP), which also supports text mode (http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/VisualProfiler/computeprof.html).
Currently CVP does not work with programs that also use MPI.
    $ qsub -I -v DISPLAY -lncpus=4,ngpus=2,vmem=8GB,walltime=1:00:00,other=physmem -wd
    $ module load cuda-cvp
    $ computeprof &
To use CVP in graphical mode, you will need to log in to Xe with an X display, e.g. using ssh -X or ssh -Y, or with VNC. The "-v DISPLAY" option to qsub propagates this X display to the interactive job.
To use text-based mode, a few environment variables need to be set:
    $ export COMPUTE_PROFILE=1
    $ export COMPUTE_PROFILE_CSV=1
    $ export COMPUTE_PROFILE_LOG="name_%d.log"
    $ export COMPUTE_PROFILE_CONFIG=".cp_config"
The .cp_config file can be customized. A sample .cp_config file looks like:
    $ cat .cp_config
    gridsize
    threadblocksize
    instructions
    memtransferdir
    memtransfersize
    shared_load
    shared_store
    l1_global_load_hit
    l1_global_load_miss
    l1_local_load_hit
    l1_local_load_miss
    l1_local_store_hit
    l1_local_store_miss
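For batch jobs it may be convenient to do the text-mode setup in one step before running the executable. The following sketch writes a shortened configuration file and exports the variables listed above. The function name setup_text_profiling is hypothetical, and the three options written are just a subset of the sample file.

```shell
#!/bin/sh
# Write a minimal profiler configuration file and export the text-mode
# profiling variables. The file path defaults to .cp_config.
setup_text_profiling() {
    config=${1:-.cp_config}
    # One profiler option per line; extend this list as required.
    printf '%s\n' gridsize threadblocksize instructions > "$config" || return 1
    export COMPUTE_PROFILE=1
    export COMPUTE_PROFILE_CSV=1
    export COMPUTE_PROFILE_LOG="name_%d.log"
    export COMPUTE_PROFILE_CONFIG="$config"
}

# Typical use inside a job script, after "module load cuda":
#   setup_text_profiling
#   PATH_TO_GPU_EXECUTABLE > output_file
```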
Available GPU Computing Packages
Click on each package for instructions on using it on Xe. Further details of these packages can be retrieved from http://www.nvidia.com/object/tesla_bio_workbench.html.
Useful Links
- Online GPU programming course
Programming massively parallel processors with CUDA: http://code.google.com/p/stanford-cs193g-sp2010/
GPU programming summer school: http://impact.crhc.illinois.edu/summerschool.php
High Performance Scientific Computing: http://cs.anu.edu.au/student/comp3320
CUDA Toolkit: http://developer.nvidia.com/object/cuda_3_1_downloads.html#Linux
Tesla Bio Workbench: http://www.nvidia.com/object/tesla_bio_workbench.html
