NCSA Forge User Guide


NOTICE: As of 10/1/2012, Forge has been decommissioned and is no longer in production.

System Overview


Forge consists of 44 Dell PowerEdge C6145 dual-socket nodes, each with two eight-core AMD Opteron 6136 (Magny-Cours) processors and 48 GB of memory. Each node supports 6 or 8 NVIDIA Fermi M2070 GPUs.

System Configuration

Architecture: Heterogeneous
  • Dell PowerEdge C6145 servers
  • NVIDIA Fermi M2070 Accelerator Units
  • 32 nodes connected to 6 Fermi processors via three PCI-e Gen2 x16 slots
  • 12 nodes connected to 8 Fermi processors via four PCI-e Gen2 x16 slots
Number of Servers: 44
Number of CPUs (cores): 88 (704)
Number of Accelerator Units: 288
CPU:
  • AMD Opteron Magny-Cours 6136 2.4 GHz dual-socket eight-core
  • 1333 MHz front side bus (per socket)
  • L3 cache: 2 x 6 MB, shared
    (only 10 MB is visible because the HT Assist feature uses 2 MB as a directory cache)
Memory:
  • Per node (per core): 48 GB (3 GB)
  • Type: DDR3
Accelerator Units (each):
  • 448 CUDA cores
  • 1.03 teraflops single-precision performance
  • 515 gigaflops double-precision performance
  • 6 GB memory
Network Interconnect: InfiniBand QDR
Parallel Filesystem: GPFS (600 TB total)

File Systems

Home Directories

Your home directory is the default directory you are placed in when you log in. Use this space for storing files you want to keep long term, such as source code, scripts, and input data sets. NCSA HPC systems have a 50 GB home directory quota.

The command to see your disk usage and limits is quota. Home directories are backed up daily.

Scratch Directories

Scratch file systems are intended for short term use and should be considered volatile. The size of scratch file systems varies with the system.

Please note that backups are not performed on the scratch directories. In the event of a disk crash or file purge, files on the scratch directories cannot be recovered. Therefore, you should make sure to back up your files to permanent storage as significant changes are made (at least daily).

The common scratch filesystem /scratch/users is available for all interactive work. The scratch-global soft link in your home directory points to your scratch directory. For batch jobs, see the section Disk Space for Batch Jobs.

Files in the common scratch filesystem (/scratch/users) are purged on the basis of size and time since the last access:

File Size Removed after
>= 10 GB 4 days
< 10 GB 14 days
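To preview which of your files fall into these windows, a find-based check can help. The sketch below demonstrates the idea on a temporary directory; on Forge you would scan /scratch/users/$USER instead:

```shell
# Sketch: list files whose last access is more than 14 days old, i.e. purge
# candidates for the < 10 GB class above. Demonstrated on a temp directory;
# on Forge you would point find at /scratch/users/$USER.
dir=$(mktemp -d)
touch -a -d "20 days ago" "$dir/old.dat"    # simulate a long-unaccessed file
touch "$dir/new.dat"                        # freshly accessed file
stale=$(find "$dir" -type f -atime +14)
echo "purge candidates: $stale"
rm -rf "$dir"
```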

Please do not attempt to circumvent this removal scheme (e.g., with touch). Such attempts may result in the loss of access to the scratch file systems.

Files in the batch scratch filesystem (/scratch/batch) on Forge may be purged as soon as the batch job they are associated with completes. Users should use the saveafterjob utility for automated, guaranteed saving of files from batch jobs.

To opt out of receiving email notifications on purged files, add a file named .nopurgemail in your home directory:

forge$ touch $HOME/.nopurgemail

Reserved Project Space

NCSA has reserved project space available on the HPC systems. It is intended for users who, for short durations, require larger amounts of scratch space than the normal scratch file systems provide. These directories are intended for short-term use and are not backed up. Contact NCSA Consulting Services to request space. Please include your name, project (PSN), preferred start date, duration of need, approximate disk space needed, and a short description of the project, along with the reasons why regular scratch space is insufficient for your needs.

Permanent File Storage

Permanent storage of your files is available using the NCSA archival storage system (MSS). You have read-only access to your MSS account for 4 months after your last NCSA HPC account has been deleted. The NCSA MSS Page has additional information on using MSS. See the Overview section in the MSS User Guide for user and project quota information.

System Access

Methods of Access

There are four methods of access to NCSA resources:

  1. GSI-SSH with grid credential authentication (default): XSEDE users receive an XSEDE-wide login ID and password to access XSEDE resources via the XSEDE User Portal (XUP).
  2. SSH with Key Pairs and Passphrase: Initial access via XUP or Site Password required. After generating a public ssh key on a local machine, the user must copy the key to the target resource (machine) using the XUP or Site Password.
  3. SSH with NCSA Kerberos password: XSEDE users may request a NCSA Kerberos password by emailing the XSEDE Helpdesk or submitting a consulting ticket on the XSEDE Helpdesk webpage.
  4. Secure FTP client (GridFTP): provides file management (listing, moving, transfer, deletion, etc.) using the same login information provided to you as an XSEDE user.

Logging into Forge

Access methods:
  • ssh, scp, sftp
  • gsissh, gsiscp, gsisftp
  • uberftp, globus-url-copy
  • kerberized ftp


Users are provided access to XSEDE Resources via the XSEDE User Portal (XUP). Your Allocation Information Packet contains your XUP username and password that you will need to access the XSEDE User Portal.

You can change your current XUP password or reset a forgotten XUP password on the XUP site.

For information on NCSA local passwords, see the NCSA password documentation.

Managing your allocation

Charging Algorithm

Once you have an allocation, you will be charged for the amount of service units (SUs) used. SUs for the NCSA HPC systems are calculated based on wall clock time of jobs. The charging algorithm for the Dell NVIDIA Cluster is as follows:

# SUs = 16 * #nodes * WallTime

where WallTime = Total Wall Clock Hours. Note: The minimum resource allocation for a batch job is 1 node.
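For example, a 4-node job that runs for 2 wall-clock hours is charged 16 * 4 * 2 = 128 SUs. A sketch of the arithmetic (the node and hour counts are illustrative):

```shell
# Charging formula from above: SUs = 16 * nodes * wall-clock hours.
# Illustrative values; the minimum allocation for a batch job is 1 node.
nodes=4
hours=2
sus=$((16 * nodes * hours))
echo "SUs charged: $sus"    # prints: SUs charged: 128
```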

Verifying Your Account Balance

You can monitor usage of your allocation by using the XSEDE tgusage utility. The data displayed by tgusage is normally updated once each day, so SUs accrued by jobs on a given day will be reflected in tgusage the next day. Entering tgusage with no options or -h/--help displays the usage/help information.

Online Usage Information

Usage information can also be obtained online through the XSEDE User Portal on the "Allocations/Usage" page under the "My XSEDE" tab.

Setting Projects for Charging

If you have more than one project (PSN), you can charge to various projects within a login session. Most users have only one project. You do not need to define a default project unless you have multiple projects.

Setting a Default Project

You can define (or change) a default project with the defacct command. After a default project is set, you will no longer be prompted to choose one of your projects during the login process. Enter defacct at the prompt to set your default.

Enter none at the prompt to unset a default project, and enter a carriage return to leave your default project unchanged. See the defacct man page for more information. The batch_accts command lists all your accounts on the current system.

IMPORTANT: If you have a default project set, batch jobs will automatically be charged to the default project at the time that the job is submitted unless you charge the batch job to a specific project (see below).

Charging to Projects in a Batch Job

You can also charge batch jobs to a specific project (PSN) with the -A option in the PBS qsub command.
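For example, an embedded option of the following form charges the job to a specific project (the project name abc123 is purely illustrative):

```shell
# Batch-script fragment: charge this job to project abc123 (hypothetical PSN)
# rather than the default project set with defacct.
#PBS -A abc123
```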

Computing Environment

Your shell

When your account is first activated your default shell is set to tcsh, an enhanced version of the Berkeley UNIX C shell (csh). The tcsh shell is completely compatible with the standard csh, and all csh commands and scripts work unedited with tcsh. Enter "man tcsh" at the prompt for details about tcsh.

The other shell available on NCSA HPC systems is the GNU Bourne-Again SHell (bash), which is completely compatible with the Bourne shell (sh). To change your shell, send email to NCSA Helpdesk with a request.

Managing your "dot" files

When your account on an NCSA system is created, default "dot" files are placed in your home directory.

Forge uses the module system to set up the user environment. See the section Managing Your Environment (Modules) for details.

For other "dot" files, you can find copies of the default files in /etc/skel. Copy the appropriate file to your home directory and customize as needed.

Transferring your files to Forge

A variety of methods are available for transferring files between computer systems. See NCSA Data Transfers for details.

Application Development

Compilers and languages

NCSA supports the NVIDIA and PGI compilers for the Tesla Fermi GPUs. As these compilers each provide multiple and differing capabilities, they are summarized below in terms of the language or API provided.


CUDA C is the computing architecture developed by NVIDIA for its GPUs; it defines extensions to the C language for negotiating execution on the GPU and communication between host and GPU.

CUDA Fortran is an analogous extension to the Fortran language; it was developed as a collaboration between NVIDIA and the Portland Group.

CUDA-x86 is the PGI CUDA C/C++ compiler for x86; it provides a unified programming model for both multi-core and many-core architectures. Executables may be run either on the GPU, or on a non-GPU multi-core x86 architecture.


CUDA C

Environment: The module for NVIDIA's CUDA C is loaded by default upon login. The NVIDIA compiler is nvcc. One can compile on the head node, but execution on the Tesla GPUs is available only via PBS batch jobs.

SDK: Example code, documentation and several utilities can be found in the NVIDIA SDK. When porting code to CUDA C, the examples in the SDK can be quite useful, both as illustration and as templates for certain algorithms (e.g., marching cubes, Monte Carlo). In addition, there are further examples and tutorials available on the NVIDIA site at the link below.


Examples: To use the examples, copy the installer to your home directory and run it, accepting the prompted defaults:

forge$ cp /uf/ncsa/consult/nvidia_sdk/gpucomputingsdk_* $HOME
forge$ cd $HOME
forge$ sh ./gpucomputingsdk_* 

To build, cd into the "C" subdirectory and run make. In all cases, the resulting executables should be run on a compute node, accessed through the batch system.

The utility deviceQuery will list characteristics of the Tesla devices.
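As a sketch, deviceQuery can be run through the batch system like this (the queue name and the SDK build path are assumptions based on the default installation described above; adjust them to your setup):

```shell
#!/bin/sh
# Hypothetical batch script: run the SDK's deviceQuery on a GPU compute node.
#PBS -q debug
#PBS -l walltime=00:05:00,nodes=1:ppn=16
cd $HOME/NVIDIA_GPU_Computing_SDK/C/bin/linux/release   # assumed SDK build path
./deviceQuery
```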

CUDA Fortran

Environment: To use CUDA Fortran, one need only load the module for the PGI compilers:

forge$ module load pgi/2011

SDK: Examples and makefile may be found here: /usr/local/pgi/linux86-64/2011/cuda/cudaFortranSDK. In all cases, the resulting executables should be run on a compute node, accessed through the batch system.

PGI CUDA Fortran site


CUDA-x86

Environment: To use CUDA-x86, one need only load the module for the PGI compilers:

forge$ module load pgi/2011

SDK: Examples and makefile may be found here: /usr/local/pgi/linux86-64/2011/cuda/cudaX86SDK. In all cases, the resulting executables should be run on a compute node, out of deference to other users on the login node; as noted in the introduction, however, any multi-core x86 architecture is supported.

PGI CUDA-x86 site

Additional Information

NCSA CUDA tutorial

Accelerator model

In addition to CUDA Fortran, the PGI compilers support an API referred to as the "Accelerator Programming Model", which is similar in practice to OpenMP. In this model, user directives may be added to existing C or Fortran code that will automatically "accelerate" regions of code, by executing on the GPU.

Examples of use may be found here: /usr/local/pgi/linux86-64/2011/etc/samples

The makefile therein will build examples of accelerating C and Fortran code. A summary and references may be found here: PGI Accelerator Compilers


OpenCL

OpenCL is supported by the NVIDIA CUDA distribution; see the examples in the NVIDIA_GPU_Computing_SDK mentioned above, and the discussion here: OpenCL

Host-only compilers

In addition to the compilers mentioned above, the GNU and Intel compilers are available on Forge, and loaded by default.

Several implementations of MPI built with these compilers are available; MVAPICH2 is loaded by default, and versions of OpenMPI can be listed with the command module avail.

In all of these cases, the compiler wrappers have standard names: mpicc (C), mpicxx (C++), mpif77 (Fortran 77), and mpif90 (Fortran 90).

An article on debugging CUDA-x86 applications may be found here: Debugging CUDA-x86 Applications


Traditional libraries

For host-based code, the Intel Math Kernel Library (MKL) contains the complete set of functions from the basic linear algebra subprograms (BLAS), the extended (sparse) BLAS, the complete set of LAPACK routines, and a set of fast Fourier transforms; it is loaded by default with the Intel compilers.

Accelerated libraries

NVIDIA provides GPU-accelerated versions of several of the routines above, such as CUBLAS (for the BLAS) and CUFFT (for Fourier transforms).

The SDK contains examples of use for each of these libraries.

The NVIDIA Performance Primitives library (NPP) is a collection of basic algorithms accelerated for the GPU (arithmetic, filtering, imaging, geometric transforms, and more): NPP

Running your applications


The NCSA Dell NVIDIA Linux Cluster Forge uses the Torque Resource Manager with the Moab Workload Manager for running jobs. Torque is based upon OpenPBS, so the commands are the same as PBS commands.

Interactive Use

The login node is available for interactive use. It has 16 cores and 6 GPU devices. In general, interactive use should be limited to compiling and other development tasks, such as editing source and debugging; the batch system is available for all other jobs. You can compile on the head node, but access to compute nodes with GPU devices is available only via PBS batch jobs. See the section qsub -I for instructions on how to run an interactive job on the compute nodes.

Running Programs


The batch system requires no additional information for running on the GPUs. Execution of GPU kernels is controlled from within the host code, with 6 or 8 GPUs available from each node. Simply set your batch script for deployment on the host node(s), as in the sample batch scripts.
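A minimal sketch of such a host-side script (the queue name, walltime, and program name are illustrative; the authoritative templates live in the sample batch scripts):

```shell
#!/bin/sh
# Hypothetical single-node GPU job: the CUDA executable itself selects and
# drives the GPUs on the node; PBS only allocates the host node.
#PBS -q normal
#PBS -l walltime=01:00:00,nodes=1:ppn=16
cd $PBS_O_WORKDIR
./my_cuda_app        # illustrative program name
```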


MPI

The MPI implementations on Forge provide the mpirun script for running an MPI program. See the sample batch scripts for syntax details.
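A minimal batch-script sketch of an mpirun launch (the executable name is hypothetical; see the sample batch scripts for the authoritative templates):

```shell
#!/bin/sh
# Hypothetical 2-node, 32-rank MPI job launched with mpirun.
#PBS -q normal
#PBS -l walltime=01:00:00,nodes=2:ppn=16
cd $PBS_O_WORKDIR
mpirun -np 32 ./my_mpi_app   # illustrative executable name
```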


OpenMP

Before you run an OpenMP program, set the environment variable OMP_NUM_THREADS to the number of threads you want. For example, to run a program with two threads:

forge$ setenv OMP_NUM_THREADS 2

The following environment variables may also be useful in running your OpenMP programs:

OMP_SCHEDULE: sets the schedule type and (optionally) the chunk size for DO and PARALLEL DO loops declared with a schedule of RUNTIME. The default is STATIC.
KMP_LIBRARY: sets the run-time execution mode. The default is throughput, but it can be set to turnaround so worker threads do not yield while waiting for work.
KMP_STACKSIZE: sets the number of bytes to allocate for the stack of each parallel thread. You can use a suffix k, m, or g to specify kilobytes, megabytes, or gigabytes. The default is 4m.
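If you use bash rather than the default tcsh, the equivalents use export rather than setenv; a sketch with illustrative values:

```shell
# bash equivalents of the csh setenv examples above (illustrative values).
export OMP_NUM_THREADS=2            # number of OpenMP threads
export OMP_SCHEDULE="dynamic,4"     # schedule type and chunk size for RUNTIME loops
export KMP_STACKSIZE=8m             # 8 MB stack per parallel thread
echo "$OMP_NUM_THREADS $OMP_SCHEDULE $KMP_STACKSIZE"   # prints: 2 dynamic,4 8m
```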

Hybrid MPI/OpenMP

To run an MPI/OpenMP hybrid program, set the environment variable OMP_NUM_THREADS to the number of threads you want, and change the number of CPUs per node for MPI accordingly. For example, to run a program with 10 MPI ranks and 8 threads per rank, request one MPI process per node in your batch script:

#PBS -l nodes=10:ppn=1

(See the qsub section for information on PBS directives.)
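Putting the pieces together, a hybrid job script might look like the following sketch (csh syntax to match the default tcsh shell; the executable name is hypothetical):

```shell
#!/bin/csh
# Hypothetical hybrid job: 10 MPI ranks, one per node, 8 OpenMP threads each.
#PBS -q normal
#PBS -l walltime=01:00:00,nodes=10:ppn=1
setenv OMP_NUM_THREADS 8
cd $PBS_O_WORKDIR
mpirun -np 10 ./my_hybrid_app   # illustrative executable name
```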


Queues

The following queues are currently available for users:

Queue GPU configuration Walltime Max # Nodes
debug 6 or 8 GPU nodes 30 mins 4
normal 6 GPU nodes 48 hours 18
eight 8 GPU nodes 48 hours 8

NOTE: while the total number of nodes in the Forge cluster is 44, all nodes may not be available in practice due to offline nodes, etc.

For special queue requests please email

Batch Commands


qsub

The qsub command is used to submit a batch job to a queue. All options to qsub can be specified either on the command line or as a line in a script (known as an embedded option). Command line options have precedence over embedded options. Scripts can be submitted using:

qsub [list of qsub options] script_name

The sample batch scripts illustrate qsub usage; the most common options are described below.

-l resource_list: specifies resource limits, given as a comma-separated list of resource_name=value settings.


The resource_names are:

walltime: maximum wall clock time (hh:mm:ss) [default: 10 mins]
nodes: number of 16-core nodes [default: 1 node]
ppn: how many cores per node to use (1 through 16)


For example:

#PBS -l walltime=00:30:00,nodes=2:ppn=16

-q queue_name: specifies the queue name (required).

-N jobname: specifies the job name.

-o out_file: store the standard output of the job to file out_file.

-e err_file: store the standard error of the job to file err_file.

-j oe: merge standard output and standard error into standard output file.

-V: export all your environment variables to the batch job.

-m be: send mail at the beginning and end of a job.

-M email_address: send any email to the given email address.

-A project: charge your job to a specific project (XSEDE project or NCSA PSN). (for users in more than one project)

-X: enables X11 forwarding.


  • When used without the -o and -e options, the -N option generates stdout and stderr output files named <jobname>.o<jobid> and <jobname>.e<jobid>, respectively, in the directory from which the batch job was submitted.
  • While the job is running, temporary stdout/stderr files are located in the home directory, named <jobid>.fsched.OU and <jobid>.fsched.ER.

qsub -I

The -I option tells qsub you want to run an interactive job. You can also use other qsub options such as those documented in the batch sample scripts. For example, the following command:

forge$ qsub -I -V -q debug -l walltime=00:30:00,nodes=1:ppn=16

will run an interactive job with a wall clock limit of 30 minutes, using one node and sixteen cores per node.

After you enter the command, you will have to wait for Torque to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs. Once the job starts, you will see something like this:

qsub: waiting for job 914.fsched to start
qsub: job 914.fsched ready

Now you are logged into the launch node. At this point, you can use the appropriate command to start your program. When you are done with your runs, you can use the exit command to end the job.


qstat

The qstat command displays the status of batch jobs.

  • qstat -a gives the status of all jobs on the system.
  • qstat -n lists nodes allocated to a running job in addition to basic information. The first host on the list is the launch node.
  • qstat -f PBS_JOBID gives detailed information on a particular job. Note that currently PBS_JOBID must be the full job identifier, including its extension (e.g., <jobid>.fsched).
  • qstat -q provides summary information on all the queues.


qhist

qhist, a locally written tool, summarizes the raw accounting record(s) for one or more jobs. SU charges for a job are available the day after the job completes. To display information about a specific job, the syntax is qhist PBS_JOBID. See qhist --help for details.


qdel

The qdel command deletes a queued job or kills a running job. The syntax is qdel PBS_JOBID. Note: You only need to use the numeric part of the Job ID.

Sample Batch Scripts

Sample batch scripts are available in the directory /usr/local/doc/batch_scripts for use as a template.

Disk Space for Batch Jobs

Scratch space for batch jobs is provided via a per-job scratch directory that is created at the beginning of the job. This directory is created under /scratch/batch, and is based on the JobID. If the batch script uses one of the sample scripts as a template, the name of this scratch directory is available to job scripts with the $SCR environment variable.

Your job scratch directory may be deleted soon (possibly immediately) after your job completes, so you should take care to transfer results to the mass storage system (see the section Automated Saving of Files from Batch Jobs below).
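A sketch of using $SCR inside a job script (this assumes a sample-script template that defines $SCR, as described above; the program and output names are hypothetical):

```shell
# Batch-script fragment: run in the per-job scratch directory, then copy
# results somewhere permanent before the directory can be purged.
cd $SCR                        # $SCR is set by the sample-script templates
./my_app > results.out         # hypothetical program and output file
cp results.out $HOME/          # or use saveafterjob for guaranteed archiving
```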

The cdjob command can be used to change the working directory to the scratch directory of a running batch job. The syntax is

forge$ cdjob PBS_JOBID

Automated Saving of Files from Batch Jobs

The saveafterjob utility is available for automated, guaranteed saving of output files from batch jobs to the mass storage system. For details on its use, see the saveafterjob page and the sample PBS batch scripts.


Debuggers and Profilers

NVIDIA provides cuda-gdb for debugging CUDA C; it is essentially a port of gdb, with appropriate extensions, and consequently will be familiar to users of gdb.

TAU (Tuning and Analysis Utilities) provides support for GPUs. See TAU at NCSA for details on using it on Forge.

NVIDIA provides a "Visual Profiler", computeprof, for CUDA C and OpenCL. To invoke the profiler, enable X forwarding by first logging into Forge with "ssh -X forge" and then launching a batch job with the "-X" option; see the man pages for ssh and qsub.

PGI's pgprof utility enables profiling of CUDA Fortran and the PGI Accelerator directives; as with NVIDIA's visual profiler, one should enable X-forwarding as described above.



NCSA Dell NVIDIA Linux Cluster Forge

Last updated: April 4, 2012