Georgia Tech Keeneland User Guide

NOTICE: Keeneland is decommissioned from XSEDE as of December 31, 2014. Keeneland's login nodes will remain online until February 1, 2015.


System Overview

Keeneland is a hybrid CPU/GPGPU system for use with codes that can take advantage of GPU accelerators. Keeneland is a Georgia Tech machine administered by The National Institute for Computational Sciences (NICS).

The Keeneland Full Scale (KFS) system was approved by the NSF and deployed late October 2012 as a production XSEDE resource.

System Configuration

The KFS system consists of 264 HP SL250G8 compute nodes connected by a Mellanox FDR InfiniBand interconnect. Each node has two 8-core Intel Sandy Bridge (Xeon E5) processors, three NVIDIA M2090 GPU accelerators, and 32 GB of memory, for a total of 528 CPUs and 792 GPUs.

Compute jobs are charged according to the following equivalencies:

1 node-hr = 16 (KFS) CPU-hrs = 3 GPU-hrs = 3 SUs
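For example, a job that runs on 4 nodes for 2 hours is charged 4 × 2 = 8 node-hours, which corresponds to 24 SUs.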

File Systems

Users of Keeneland have access to two file systems: NFS and Lustre.

NFS

By default, each user has an NFS home directory with a 2 GB quota. The path to this directory is /nics/[a-e]/home/$USER, and the environment variable $HOME is set to it. This directory is generally available by logging in to login.nics.xsede.org even when Keeneland is not available.

Project directories are available by request for storing source code and other files that need to be shared among a group. These project directories may have larger quotas. Large input and output files should be stored on the Lustre filesystem. For more information, see NICS' Project Directories page.

Lustre

Each user has a Lustre scratch directory in /lustre/medusa/$USER. Lustre is a highly scalable cluster file system in which storage of a given file is distributed (or striped) across several hardware locations. This allows files larger than any single storage target could hold and enables much faster transfer speeds when files are accessed in parallel. Users may increase the striping width to improve I/O performance for large files. For more information, see NICS' Lustre page.
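For example, to check the current striping of your scratch directory and to set a wider stripe count on a directory where large files will be written (the stripe count of 8 is only illustrative; lfs is the standard Lustre user utility):

login1$ lfs getstripe /lustre/medusa/$USER
login1$ mkdir /lustre/medusa/$USER/wide
login1$ lfs setstripe -c 8 /lustre/medusa/$USER/wide

Files subsequently created in the "wide" directory inherit the 8-way striping.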

There is no quota limit placed on Lustre storage. However, files older than 30 days are eligible to be purged. Any attempt to circumvent the purge policy may lead to account deactivation.

System Access

Keeneland may be accessed via ssh and gsissh.

ssh

In order to ssh to Keeneland you must use a one time password (OTP) token. Tokens are mailed to users when accounts are enabled and are only usable once the user has returned a notarized NICS Token Activation form (emailed to new users). Consult NICS' OTP information page for details.

login1$ ssh userid@keeneland.gatech.xsede.org

gsissh

The alternative access method is gsissh. To access Keeneland in this manner, use GSI authentication to log in to keenelandgsi.nics.xsede.org. You will need your XSEDE password to obtain a MyProxy certificate; this is done automatically when connecting through the XSEDE User Portal.
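A minimal sketch of this workflow, run from a machine with the Globus/GSI tools installed (the MyProxy server shown is the standard XSEDE one, and the GSISSH hostname is taken from the table below; substitute your own XSEDE username):

$ myproxy-logon -s myproxy.xsede.org -l xsede_username
$ gsissh gsissh.keeneland.gatech.xsede.org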

The table below lists the hostnames for each of the above access methods.

  ssh/OTP:   keeneland.gatech.xsede.org
  GSISSH:    gsissh.keeneland.gatech.xsede.org
  GridFTP:   gridftp.keeneland.gatech.xsede.org

Logging into Keeneland

The first time you log in using OTP authentication, you are required to choose a Personal Identification Number (PIN). You will be prompted to enter your PIN followed by the numbers on your OTP token. The numbers on your token change every 30 seconds or so, and the bar on the left-hand side of the token's display shows the time remaining for the current passcode. For further information, see https://www.nics.tennessee.edu/getting-started/access#OTPAuthentication.

Tokens may occasionally become disabled for a variety of reasons. If this happens, or if you have forgotten your PIN, email help@xsede.org with "Keeneland" in the subject line.

Note: The OTP token is for the specified user only. Sharing with anyone will lead to immediate account deactivation.

User Responsibilities

Keeneland is a Georgia Tech resource supported by NICS. When using Keeneland or NICS resources, you agree to the following user responsibilities:

  • You have the responsibility to protect your account from unauthorized use. Never share login information. If you believe your account has been compromised, immediately notify the XSEDE Help Desk at 866-907-2383.
  • You have responsibility for the security of your programs and data.
  • You may not copy and/or distribute proprietary software or documentation without the permission of the software owner. Possession or use of illegally copied software is prohibited; all software must be appropriately acquired and used according to its specific licensing.
  • Keeneland resources may only be used by authorized users and only for the purpose prescribed in the project award. Use of these resources for processing proprietary information, source code, or executable code must be disclosed in the award process and is prohibited unless authorized by the project award. Use of Keeneland resources for export-controlled information, source code, or executable code is prohibited.
  • To ensure protection of data and resources, user activity and files may be monitored, intercepted, recorded, copied, audited, inspected, and disclosed to authorities. By using Keeneland or any NICS system, the user consents to such at the discretion of authorized site personnel.
  • Activities in violation of any laws may be reported to the proper authorities for investigation and prosecution. Abusive activity may be reported to your home institution for review and action.
  • Keeneland uses the NICS file systems. NICS file systems are generally very reliable, however, data may still be lost or corrupted. Users are responsible for backing up critical data.
  • Violations of Keeneland or NICS policy can result in loss of access to Keeneland and NICS resources and possible prosecution. If you have questions, you may contact NICS User Support during normal working hours, 9:00 am - 6:00 pm ET, at 865-241-1504, or contact the XSEDE Help Desk 24/7 at 866-907-2383 or help@xsede.org.

Computing Environment

The default environment for each user consists of a home directory, Lustre scratch space, and a Unix group corresponding to their project number (assigned by XSEDE).

Unix Shell

Keeneland's default shell is bash. Other shells are available: sh, csh, tcsh, and zsh. Users may change their default shell in the NICS User Portal, https://portal.nics.tennessee.edu/. You'll need your OTP token to log into the NICS portal.

Environment variables

Display the pre-set environment variables with the Unix env command. In addition, several modules are loaded by default at login, including the default Intel compiler, Moab, Torque, MKL, CUDA, and MPI libraries. Pre-set environment variables include $HOME and $SCRATCHDIR.
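For example, to inspect a few of these variables after logging in:

login1$ env | grep -i -e home -e scratch
login1$ echo $SCRATCHDIR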

Startup Scripts

Each time you log in to a resource, a number of scripts run to set up your environment. System startup scripts (which are universal for all users) define the modules command and set a number of environment variables. Note that system startup scripts run only for login shells – when you log in with ssh or when a job starts. If you want them to run when you start a new shell, make it a login shell, for instance with bash -l or newgrp -. For more information, check the Unix man pages.

Additionally, each user can define their own startup scripts, depending on which shell they use. Bash users use .bashrc for non-login shells and .bash_profile for login shells (generally, users edit .bashrc and ensure that .bash_profile sources that file). Csh shell users (including tcsh) use .cshrc and .login instead.
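A minimal sketch of this arrangement for bash users (the module command shown is only an illustration of where such setup would go):

# ~/.bash_profile -- read by login shells; delegate everything to .bashrc
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# ~/.bashrc -- read by non-login shells; put module setup and aliases here
# module load cuda    # example module setup (uncomment/adjust as needed)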

Modules

The modules software package allows you to dynamically modify your user environment using modulefiles. Modules are useful for building your applications with a specific compiler and set of libraries on Keeneland. For instance, the default modules include the programming environment PE-intel, which tells other modules that you are using the Intel compiler, as well as intel, which adds the actual Intel compiler binaries to your path. It is recommended to switch to the programming environment you want first, and then load the compiler version and other modules. Below is a short list and description of commonly used module commands. Note that if no version number is given after a package name, the default version is used.

Some modules commands

List currently loaded modules:

login1$ module list

Swap modules:

login1$ module swap packageA packageB

This replaces packageA with packageB. It is useful for changing PE- modules to switch compilers, and for switching versions of other modules.

List available modules:

login1$ module avail package

If no package is given, it will list all available modules. This command is useful to see which versions of particular software are installed. Try: module avail namd

Display module information:

login1$ module show package

This displays information about the installed software, including the setenv commands that will modify your environment if you load that module. This is useful for two main reasons. First, you can confirm which executable you should run: the executable name on Keeneland may differ slightly from that on another machine. You can run ls on the bin directory shown in the output, or, if you are using Python for example, run which python to ensure the version you want is in your path (on Keeneland, python is always in the path, so this simply confirms you are getting the intended version). Second, some modules introduce environment variables. For instance, the FFTW module provides environment variables that point to its library and include directories; use these variables in your makefile rather than hard-coding the full paths.
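For example, after inspecting and loading the FFTW module, its environment variables can be referenced in a makefile instead of hard-coded paths (the variable names FFTW_INC and FFTW_LIB are illustrative; check the output of module show fftw for the actual names):

login1$ module show fftw
login1$ module load fftw

# In a makefile, use the module-provided variables rather than full paths:
# CFLAGS  += -I$(FFTW_INC)
# LDFLAGS += -L$(FFTW_LIB) -lfftw3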

Transferring Files to Keeneland

Data transfer methods/software

NICS currently maintains the following options for file transfer:

GridFTP and globus-url-copy

Before using GridFTP and globus-url-copy, check out Getting Started with Globus. A valid MyProxy certificate and the loaded Globus module are required. Please see https://www.nics.tennessee.edu/computing-resources/data-transfer/gridftp for more instructions.
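A minimal sketch of a transfer run from the source machine, pushing a file to Keeneland's GridFTP endpoint (the local and remote paths are illustrative; -p 4 requests four parallel streams):

$ myproxy-logon -l xsede_username
$ globus-url-copy -vb -p 4 file:///path/to/data.tar \
      gsiftp://gridftp.keeneland.gatech.xsede.org/lustre/medusa/username/data.tar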

scp, sftp, ftp, rsync

These standard Unix transfer utilities (scp, sftp, ftp, and rsync) can be used to transfer files to and from NICS systems. They are usually already installed on Linux/Unix machines, and many command-line and graphical clients are available. Due to familiarity and ease of use, they may be the best choice for transferring scripts and small files; however, they can be slow in comparison and may be ill suited for transferring large amounts of data. More information on these utilities can be found on the XSEDE Data Management & Transfers page (https://portal.xsede.org/knowledge-base/-/kb/document/akqg) as well as in the NICS Kraken user guide.
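For example, to copy a small input file from your workstation to your Keeneland home directory (the file name is illustrative):

$ scp input.dat userid@keeneland.gatech.xsede.org:~/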

XSEDE File Manager

Users can also use the File Manager from the XSEDE Portal for data transfers.

Globus Online

NICS users can use the Globus Online tool to perform large file transfers and "drag and drop archiving" to move data between long-term archival storage and compute systems, making it easy to move, back up, or restore data through a visual interface. To get started, visit http://www.globusonline.org.

File Transfer Recommendations

The standard Unix tools for copying data, scp and sftp, are recommended for small transfers. For larger transfers, GridFTP is often a better choice. Note that rsync processes can be very resource intensive for the login nodes and file system; please avoid using rsync on directories with many files (it may be killed to prevent a node from failing).

Depending on how the source code for the application you want to use is hosted, various version control programs are available to download the source: commonly Subversion, Git, or Mercurial (see the modules for each).

If you have many files to transfer, pack them into a tar file and transfer that. If the data is large and resides on Lustre, you may want to ensure that the tar file has a larger stripe count; see http://keeneland.gatech.edu/support/lustre. Please do not run many simultaneous tar operations, as this can make the node and/or file system unresponsive for other users.
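A minimal sketch, assuming the archive will live on Lustre and should be striped more widely (the stripe count and paths are illustrative): pre-create the tar file with the desired striping, then write into it.

login1$ lfs setstripe -c 8 /lustre/medusa/$USER/mydata.tar
login1$ tar cf /lustre/medusa/$USER/mydata.tar mydata/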

Application Development

Compiling

The following compilers are available on the Keeneland system:

  • Intel
  • PGI
  • GNU

Each compiler vendor has a "Programming Environment" module, for example, PE-intel. This module may be checked by library modules to ensure the correct library build. There are also the compiler modules themselves, for example, intel. If you wish to use something other than the defaults, it is necessary to change the Programming Environment module first, then any library modules (MPI) or compiler versions (gcc/4.4.0).

The GNU compilers are installed in system default locations, and thus are always in the user's PATH, though the PE-gnu module is required in order for mpicc to use gcc.
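For example, to move from the default Intel environment to the GNU environment before building (the version strings after the slash are illustrative; check module avail for what is actually installed):

login1$ module swap PE-intel PE-gnu
login1$ module swap openmpi openmpi/1.6-gnu
login1$ module load gcc/4.4.0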

New compiler versions may be installed as they are released; check module avail intel or module avail pgi for new versions.

CUDA

CUDA is installed as a module; check "module avail cuda" for available versions. As with the compilers, new versions of CUDA will be installed as they are released, though there may be a lag because new CUDA versions often require driver updates. The CUDA compiler wrapper is called "nvcc".
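A minimal compile sketch (the source file is hypothetical; sm_20 matches the compute capability of the Fermi-class M2090 GPUs):

login1$ module load cuda
login1$ nvcc -arch=sm_20 -O2 -o saxpy saxpy.cu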

MPI

OpenMPI and MVAPICH2 are available on Keeneland via modules. As with the compilers, check: module avail

Select one of these MPI implementations using a command like:

login1$ module swap openmpi openmpi/1.6-intel

The MPI wrappers used to compile your code are mpicc, mpiCC, and mpif90 for C, C++, and Fortran programs, respectively.
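For example (the source files are hypothetical):

login1$ mpicc  -O2 -o hello_c   hello.c
login1$ mpif90 -O2 -o hello_f90 hello.f90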

Libraries

The common libraries available to users on Keeneland include LAPACK & MAGMA, ScaLAPACK, CUBLAS & BLAS, ACML, CUFFT & FFTW, HDF5, and netCDF. If you would like other libraries not installed, please submit a ticket to help@xsede.org.

Debuggers

GDB, DDT, and Valgrind are available on Keeneland. If your job does not require many nodes, a good practice is to debug your software problems in an interactive queue session.

Code Tuning

Use TAU to performance tune your code on Keeneland. Georgia Tech has detailed instructions on using TAU on Keeneland.

Running Applications

Once logged in to Keeneland, you are placed on a login node. This should be used for basic tasks such as file editing, code compilation, data backup, and job submission. A job is a simulation (program executable command with proper input and output files) that requests resources (number of nodes and a length of time). The login nodes should not be used to run production jobs. Production work should be performed on the compute nodes.

Job Scheduler

Resource Management

Keeneland uses Torque (an open source PBS derivative) as its batch queue software, with the Moab scheduler, similar to other systems at NICS. An example batch script is shown in the Batch jobs section below.

Launch Jobs

Jobs are submitted to the queue via the qsub command. Both batch and interactive sessions are available. Batch mode is the typical method for submitting production simulations. If you are unsure how to construct a proper job script, the interactive queue is useful for testing.

Queues

The scheduling policy on Keeneland is designed to favor jobs that take advantage of a large number of GPUs and the FDR interconnect between nodes. On Tuesdays, Keeneland is taken down for preventative maintenance (PM) if necessary, after which capability (full-machine) jobs are run. If there is demand, capability jobs may be run on Tuesdays even when there is no maintenance. Note that if there is a PM or capability period, the queue is drained on Monday evening, so only jobs with short walltimes will run during that time. Regular production jobs enter either the serial or parallel queue; since these queues are differentiated by job size, the scheduler automatically determines the queue for a submitted job. The table below outlines the queue properties:

NAME                           TIME FRAME                            NODES AVAILABLE TO THIS QUEUE                        MAX JOB TIME  MAX JOB SIZE (NODES)
Capability                     Tuesdays (following PM)               Exclusive access to compute nodes; 133-node minimum  48 hours      All available nodes
Serial/Parallel                Always, except during PM/Capability   Available nodes not part of a current reservation    48 hours      132
Preventative Maintenance (PM)  Tuesdays beginning at 8 AM            N/A                                                  N/A           N/A

Fair Share

It may be possible for a single user or project to dominate the system by submitting a large number of jobs. To prevent this, a fair share strategy is used: The priority given to a job takes into account the recently run jobs by that user or project – jobs from projects that have consumed a significant amount of processing time will have lower priority than jobs from projects that have not run many jobs in the past seven days. Thus, if users from a project submit a large number of jobs, other users can still cut in to access a portion of the machine.

Prioritization

Jobs are prioritized by (in descending order of effect):

  • Penalty for projects that have used their whole allocation
  • Number of nodes requested
  • Length of time job has been waiting in queue
  • Per-project fairshare (currently a penalty for projects that have used more than 10% of the available cycles in the last week)

Other Policies in Place

  • Only a user's five highest-priority queued jobs (and 10 per project) are considered for scheduling at any given time.
  • FIRSTFIT backfill is enabled. The scheduler starts the highest-priority jobs until it finds one that cannot start immediately, sets a reservation for that job, and then runs any remaining jobs that would not push the start time of the reserved job further into the future.

Other Policy Attributes

The following policy attributes may be adjusted as needed:

  • Maximum number of jobs per user or project
  • Fairshare targets and the relative effect of fairshare on priority
  • The threshold at which a job is considered capability, and the (human) policy for how/when capability jobs are run

Interactive jobs

For interactive jobs, PBS options are passed through qsub on the command line.

login1$ qsub -I -A XXXYYY -l walltime=01:00:00,nodes=4:ppn=16:gpus=3:shared

qsub options:

  • -I : Start an interactive session
  • -A : Charge to the "XXXYYY" project

Putting it together:

walltime=01:00:00,nodes=4:ppn=16:gpus=3:shared

will request 4 compute nodes for one hour, using 16 processors and 3 GPU accelerators in shared mode on each node.

After running this command, you will have to wait until enough compute nodes are available, just as with any other batch job. Once the job starts, however, the standard input and output of your terminal are connected directly to the head node of your allocated resource, and you can execute commands directly instead of through a batch script; issuing the exit command (or Control-d) ends the interactive job.
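For example, once the prompt on the allocated node appears, you might run your program by hand (the executable name is hypothetical):

$ cd $PBS_O_WORKDIR
$ module list
$ mpirun ./my_gpu_app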

Batch jobs

Here's an example batch queue script (see the notes afterward for some explanation). It assumes that you have set up the modules in your .bash_profile as described in the Modules section of this document.

#!/bin/sh
#PBS -N my-job
#PBS -j oe
#PBS -A UT-TENN0037

### Unused PBS options ###
## If left commented, must be specified when the job is submitted:
## 'qsub -l walltime=hh:mm:ss,nodes=12:ppn=4:gpus=3:shared'
##
##PBS -l walltime=00:30:00
##PBS -l nodes=12:ppn=4:gpus=3:shared

### End of PBS options ###

date
cd $PBS_O_WORKDIR

echo "nodefile="
cat $PBS_NODEFILE
echo "=end nodefile"

# run the program
which mpirun
mpirun /bin/hostname

date

# eof

With the PBS options in the example batch script above, the output of the job will go into a single file named "my-job.o#####" (where ##### is the job id) after the run completes. The "-N" option specifies the job name (my-job), and the "-j oe" option combines stderr and stdout; otherwise there would be a "my-job.e#####" file as well.

Notes on Batch Scripts

  • The scheduler is set up to give exclusive access to nodes, so there is no need to add a flag (such as "-l naccesspolicy=singletask") to ensure each job gets a node to itself.
  • A "-S" option to PBS is required if you want to use a shell other than bash. Adding something like #!/bin/tcsh in the first line is not enough to choose a different shell.
  • If you write batch scripts for a shell other than bash, be sure the module setup has been done as described in Modules. If you share your script, everyone who uses it must have done the same setup; since this is a burden and error prone, you may want to do the module setup explicitly in the batch script when using a non-bash shell.
  • The account number ("-A") is required. It is the same as the project(s) to which your NICS account is tied.
  • OpenMPI is integrated with Torque such that it defaults to using all the resources in your PBS request (e.g., a nodes=2:ppn=3 directive runs 6 MPI ranks), so you do not need to pass "-np" or "-hostfile $PBS_NODEFILE" to mpirun. If you have your environment set up correctly and are using the OpenMPI from /sw/keeneland/openmpi/1.5.1-intel (check the output of the "which mpirun" command in the example script), this should work automatically; if your mpirun commands do not work, your environment may be picking up a different mpirun that was not built with Torque integration.

Job control

Jobs are submitted using the qsub command.

login1$ qsub myscript.pbs

To check the status of one's queued jobs, the qstat command is available.

login1$ qstat -u username

To see all running jobs on Keeneland, pass the -r flag (qstat -r). An important column in the output of these commands is the job state column, marked "S". The job state (a.k.a. status) can be H (Held), Q (Queued), R (Running), W (Waiting), or C (recently Completed).

To delete a job from the queue:

login1$ qdel jobid

To hold a job to prevent it from being run (for instance, if you submitted a job and then realized your input file is corrupted, you can hold it until you have a chance to fix it):

login1$ qhold jobid

To release a held job so that it can run:

login1$ qrls jobid

To change the PBS request for a queued or held job (options take the same format as qsub and overwrite previous options):

login1$ qalter jobid
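For example, to change the requested walltime of a queued job (the value is illustrative):

login1$ qalter -l walltime=02:00:00 jobid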

The showq utility gives a different view of jobs in the queue:

login1$ showq

It shows jobs in the following states:

  • Active : These jobs are currently running.
  • Eligible : These jobs are currently queued awaiting resources. A user is allowed five jobs in the eligible state.
  • Blocked: These jobs are currently queued but are not eligible to run. Common reasons for jobs in this state would be jobs on hold, or the owning user currently has five jobs in the eligible state.

To view details of a job in the queue:

login1$ checkjob jobid

For example, if job 736 is currently in a blocked state, the following can be used to view the reason:

login1$ checkjob 736

The return may contain a line similar to the following:

BlockMsg: job 736 violates idle HARD MAXJOB limit of 2 for user  (Req: 1 InUse: 2)

This line indicates the job is blocked because the owning user has reached the limit on jobs allowed in the eligible state.

To get an estimate of when a queued job will start:

login1$ showstart 100315

The return may contain a line similar to the following:

job 100315 requires 16384 procs for 00:40:00 Estimated Rsv based start in 15:26:41 on Fri Sep 26 23:41:12 
Estimated Rsv based completion in 16:06:41 on Sat Sep 27 00:21:12. 

The start time may change dramatically as new jobs with higher priority are submitted. It is a very rough estimate based on the current job mix.

To see currently free resources:

login1$ showbf 

This can help you create a job that can be backfilled immediately. As such, it is primarily useful for short jobs.

Tools

Benchmarking and profiling your calculation is important on any resource. TAU is a full-featured profiler, mpiP is a lightweight profiling library, and low-level interfaces to hardware counters are available via CUPTI and PAPI. CUPTI is installed with CUDA; TAU, mpiP, and PAPI are available via modules.
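For example, to see which versions of these tools are installed (the module names shown are the usual lowercase ones and may differ on Keeneland):

login1$ module avail tau mpip papi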

Reference

Policies

All NICS policies concerning user responsibilities, directory spaces, grid services, accounting and allocation status, job scheduling, and file system purges can be found at: http://www.nics.tennessee.edu/policies.

Last update: January 5, 2015