Using GPUs on the NCI National Facility Xe

A small number of GPUs have been installed on the Xe cluster to provide an environment for users to run GPU-enabled code, and for the National Facility to gain experience with using and managing GPUs.

The GPUs are available to all users. There is no specific charge for GPU usage but jobs are charged for associated CPU usage.

This document describes how to get started using the GPUs. For further help with using the GPUs, please email help@nf.nci.org.au.

System Configuration

The GPUs are an extension of the Xe system. Users should familiarise themselves with the Xe documentation for the basic setup and use of the system (Xe Userguide, Xe Hardware).

We have 16 Xe compute nodes (x1 to x16) deployed with nVidia Tesla s2050 GPU systems. Each of these Xe nodes is connected to a half unit of Tesla s2050 at 8Gbytes/s (via PCI-e Gen2). Each of these Xe nodes (2 quad core Xeon sockets) has access to the following:

The Fermi GPU architecture is depicted in the following diagram:

fermi_arch.jpg

More details of the Tesla s2050 and the Fermi architecture can be found at http://www.nvidia.com/object/product-tesla-S2050-us.html and http://www.nvidia.com/object/fermi_architecture.html.

Accessing the GPUs on Xe

Preparing Job Script

Access to GPUs is through the Xe PBS batch system. The following sample PBS job script shows the minimum requirements for submitting a job with a GPU-enabled executable.

  #!/bin/bash
  #PBS -l ngpus=2
  #PBS -l ncpus=1
  module load nvidia
  PATH_TO_GPU_EXECUTABLE > output_file

The "-l ngpus" flag specifies the number of GPUs that will be dedicated to the job. Requested GPUs (and their attached memory) are never shared between jobs. There is a temporary constraint that both GPUs on a node must be allocated to the same job, so GPUs must be requested in multiples of 2. For jobs requiring multiple nodes, all CPUs of each node must be requested. This means that ncpus must be a multiple of 8, and ngpus must (currently) be the corresponding multiple of 2, i.e.

  #PBS -lncpus=8n
  #PBS -lngpus=2n

for some positive integer n.
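The rule above can be sketched as a small bash helper that prints the two resource lines for a given number of whole GPU nodes. The function name gpu_resources is hypothetical and not part of the NF environment. The factors 8 and 2 come from the text (8 CPUs and 2 GPUs per node).

```shell
# Hypothetical helper (not provided on Xe): print the PBS resource
# requests for n whole GPU nodes, using 8 CPUs and 2 GPUs per node.
gpu_resources() {
  local nodes=${1:-}
  # Accept only a positive integer node count.
  if ! [[ $nodes =~ ^[1-9][0-9]*$ ]]; then
    echo "usage: gpu_resources <positive integer>" >&2
    return 1
  fi
  echo "#PBS -l ncpus=$((8 * nodes))"
  echo "#PBS -l ngpus=$((2 * nodes))"
}
```

For example, "gpu_resources 3" prints the directives for a three-node job, which can be pasted into the top of a job script.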

Because each GPU node can serve only one GPU job at a time, PBS will run at most 16 GPU jobs simultaneously (one per GPU node). We plan to address this limitation in the future.

If you wish to gain access to more than 2 GPUs in a job, you will need to use MPI (or pbsdsh/pbs_rsh). This is because PBS only starts the job on the first node (with the first 2 GPUs). That means you will need to request a multiple of 8 CPUs, and be aware of MPI locality when launching the MPI processes and communicating with the GPUs.
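One way to handle MPI locality is to have each MPI process pick a GPU from its node-local rank. The following is only a sketch under several assumptions: that Open MPI is the MPI library in use (it exports OMPI_COMM_WORLD_LOCAL_RANK to each process), that the installed CUDA runtime honours the CUDA_VISIBLE_DEVICES environment variable, and that processes are placed so that node-local ranks are meaningful. The helper name select_gpu is hypothetical.

```shell
# Each Xe GPU node exposes 2 GPUs (from the text above).
GPUS_PER_NODE=2

# Hypothetical helper: map this process's node-local MPI rank to a GPU index.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI (assumed to be the MPI in use);
# if it is unset, fall back to GPU 0.
select_gpu() {
  local local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
  echo $(( local_rank % GPUS_PER_NODE ))
}

# Intended use inside a per-rank wrapper script started by mpirun:
#   export CUDA_VISIBLE_DEVICES=$(select_gpu)
#   exec "$@"
```

With this arrangement, node-local ranks 0 and 1 would use GPUs 0 and 1 respectively, and any further ranks on the same node would share them in round-robin order.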

The nvidia module must be loaded to run any GPU program. Users should never need to load a specific nvidia module version (i.e. always use "module load nvidia", and never something like "module load nvidia/256.40").

Submit Job

After you have prepared the job script, it needs to be submitted to PBS as follows so that your job can run on Xe. (Note that at the NF, jobs do not need to be submitted to a special queue.)

$ qsub job_script

Compiling code

The system environment has been set up to support both Cuda and OpenCL programs.

Compiling Cuda code

$ module load cuda
$ nvcc -o executable source.cu -lcuda -lcudart

The nVidia compiler is called nvcc, and it is provided by loading the cuda module. This module will also give access to the rest of the Cuda Toolkit. The nvidia module will be automatically loaded by the cuda module (unless it has already been loaded).

The Cuda SDK (as distinct from the Cuda Toolkit), which consists of libcutil, liboclUtil, and other utility libraries, is provided by the cuda-sdk module. Similarly, this module will auto-load the cuda module if it isn't already present. The Cuda SDK also includes a variety of example programs.

The Cuda Data Parallel Primitives (CUDPP) library is available by loading the cudpp module. This module will auto-load the cuda-sdk module if necessary.

When linking code with nvcc, both -lcuda and -lcudart should be specified. (Not doing so may still work, but may also cause error messages that are misleading and difficult to track down.)

We have installed a simple wrapper script for nvcc to extend its functionality and improve its integration into the NF systems.

Specifically, the nvcc wrapper supports using the following environment variables to automatically pass options to nvcc:

If $NVCC_WRAPPER_VERBOSE is set to "y", then the wrapper will output the full nvcc command line to stderr.

The version of gcc installed on Xe has a problem with some of the nVidia system headers, and so requires the -fpermissive flag to be passed to it by nvcc. Thus, the cuda module adds "-fpermissive" to $NVCC_COMPILER_FLAGS, which is equivalent to passing "-Xcompiler -fpermissive" to every invocation of nvcc.

nvcc does not understand "-Wl," compiler flags that are sometimes used, so the wrapper transparently converts such options into the corresponding "-Xlinker" flags that nvcc does understand.
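To illustrate the kind of rewriting described above, here is a minimal bash sketch. It is not the NF wrapper itself, whose implementation is not shown in this document, and the function name convert_wl_flags is hypothetical. It relies on nvcc's -Xlinker option accepting a comma-separated list of linker options, so the text after "-Wl," can be passed through unchanged.

```shell
# Hypothetical illustration only: rewrite each "-Wl,<opts>" argument as
# "-Xlinker <opts>" and pass every other argument through untouched.
# Prints one resulting argument per line.
convert_wl_flags() {
  local arg
  local out=()
  for arg in "$@"; do
    case $arg in
      -Wl,*) out+=("-Xlinker" "${arg#-Wl,}") ;;  # strip the "-Wl," prefix
      *)     out+=("$arg") ;;
    esac
  done
  # Print nothing at all when there were no arguments.
  if (( ${#out[@]} > 0 )); then
    printf '%s\n' "${out[@]}"
  fi
}
```

For instance, the argument "-Wl,-rpath,/opt/lib" would become the two arguments "-Xlinker" and "-rpath,/opt/lib".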

nvcc uses $CPATH, but ignores $C_INCLUDE_PATH and $CPLUS_INCLUDE_PATH (which it should also use). (For an explanation of the subtle differences between these environment variables, refer to our CanonicalUserEnvironmentVariables document.) The nvcc wrapper corrects this erroneous behaviour by adding to $CPATH any entries in $C_INCLUDE_PATH and $CPLUS_INCLUDE_PATH that aren't already there.
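The merging step can be pictured with the following bash sketch. It is an illustration under assumptions, not the wrapper's actual code. The function name merge_into_cpath is hypothetical, and the ordering (existing $CPATH entries first, then new ones in the order found) is a choice made for this example.

```shell
# Hypothetical illustration only: print a CPATH value that contains the
# current $CPATH entries followed by any entries from $C_INCLUDE_PATH and
# $CPLUS_INCLUDE_PATH that are not already present.
merge_into_cpath() {
  local merged=${CPATH:-}
  local entry
  local entries=()
  # Split the two colon-separated lists into an array.
  IFS=: read -ra entries <<< "${C_INCLUDE_PATH:-}:${CPLUS_INCLUDE_PATH:-}"
  for entry in "${entries[@]}"; do
    [ -z "$entry" ] && continue              # skip empty fields
    case ":$merged:" in
      *":$entry:"*) ;;                       # already in the list
      *) merged=${merged:+$merged:}$entry ;; # append, adding ":" if needed
    esac
  done
  printf '%s\n' "$merged"
}
```

A wrapper built this way would then run something like "CPATH=$(merge_into_cpath) nvcc ..." so that nvcc sees the combined list.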

Compiling OpenCL code

$ module load nvidia
$ gcc -o executable source.c -lOpenCL

Cuda debugger

Cuda-GDB supports both 32-bit and 64-bit platforms. It is built on GDB and extended to support Cuda. To include debugging information, compile the source files as follows:

$ module load cuda
$ nvcc -g -G ...
$ cuda-gdb

The -g flag indicates that debugging information should be added to the (compiled) host code, and the -G flag indicates that debugging information should be added to the (compiled) device code.

Cuda-GDB supports changing focus between physical and logical coordinates, as well as switching between debugging the host and device code.

(cuda-gdb) cuda device sm warp lane block thread
(cuda-gdb) cuda block 2,0 thread 256,1,1
(cuda-gdb) p variable

Profiling GPU Programs

The Cuda Toolkit provides its own profiler: Compute Visual Profiler (CVP), which also supports text mode (http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/VisualProfiler/computeprof.html).

Currently CVP does not work with programs that also use MPI.

$ qsub -I -v DISPLAY -lncpus=4,ngpus=2,vmem=8GB,walltime=1:00:00,other=physmem -wd
$ module load cuda-cvp
$ computeprof &

To use CVP in graphical mode, you will need to log in to Xe with an X display, e.g. using ssh -X or ssh -Y, or with VNC. The "-v DISPLAY" option to qsub propagates this X display to the interactive job.

To use text-based mode, a few environment variables need to be set:

$ export COMPUTE_PROFILE=1
$ export COMPUTE_PROFILE_CSV=1
$ export COMPUTE_PROFILE_LOG="name_%d.log"
$ export COMPUTE_PROFILE_CONFIG=".cp_config"

The .cp_config file can be customized. A sample .cp_config file looks like:

$ cat .cp_config
gridsize
threadblocksize
instructions
memtransferdir
memtransfersize
shared_load
shared_store
l1_global_load_hit
l1_global_load_miss
l1_local_load_hit
l1_local_load_miss
l1_local_store_hit
l1_local_store_miss
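As a convenience, a configuration file like the one above can be generated from a list of counter names. The helper below is a sketch and write_cp_config is a hypothetical name. It simply writes one counter per line, which is the layout shown in the sample file.

```shell
# Hypothetical helper: write a profiler configuration file containing
# one counter name per line.
# Usage: write_cp_config <file> <counter> [<counter> ...]
write_cp_config() {
  if [ "$#" -lt 2 ]; then
    echo "usage: write_cp_config <file> <counter> [<counter> ...]" >&2
    return 1
  fi
  local path=$1
  shift
  printf '%s\n' "$@" > "$path"
}

# Example (counter names taken from the sample file above):
#   write_cp_config .cp_config gridsize threadblocksize instructions
#   export COMPUTE_PROFILE_CONFIG=".cp_config"
```

Keeping the counter list in the job script this way makes it easy to vary what is profiled from run to run.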

Available GPU Computing Packages

Click on each package for instructions on using it on Xe. Further details of these packages can be retrieved from http://www.nvidia.com/object/tesla_bio_workbench.html.

Useful Links

Xe/Gpu/Usage (last edited 2011-09-22 00:17:33 by jxc900)