Comet User Guide
Last update: July 3, 2017

Trial Accounts

Trial Accounts give potential users rapid access to Comet so they can evaluate it for their research. This can be a useful step in assessing the suitability of the system, allowing users to compile, run, and do initial benchmarking of their application before submitting a larger Startup or Research allocation request. Trial Accounts provide 1,000 CPU core-hours or 100 GPU hours. Requests are fulfilled within one working day.

System Overview

Comet Technical Summary

Comet is a dedicated XSEDE cluster designed by Dell and SDSC delivering ~2.0 petaflops, featuring next-generation Intel processors with AVX2, Mellanox FDR InfiniBand interconnects, and Aeon storage.

The standard compute nodes consist of Intel Xeon E5-2680v3 (formerly codenamed Haswell) processors, 128 GB DDR4 DRAM (64 GB per socket), and 320 GB of local SSD scratch storage. The GPU nodes contain four NVIDIA GPUs each. The large memory nodes contain 1.5 TB of DRAM and four Haswell processors each. The network topology is 56 Gbps FDR InfiniBand with rack-level full bisection bandwidth and 4:1 oversubscribed cross-rack bandwidth. Comet has 7 petabytes of 200 GB/second performance storage and 6 petabytes of 100 GB/second durable storage. It also has dedicated gateway hosting nodes and a Virtual Machine repository. External connectivity to Internet2 and ESNet is 100 Gbps.

As of July 1, 2017, 36 additional GPU nodes have been added to Comet, each featuring 4 NVIDIA P100 GPUs, for a total of 144 additional GPUs. Please see the GPU Nodes section for additional information on making GPU allocation requests.

Serving the Long Tail

Comet was designed and is operated on the principle that the majority of computational research is performed at modest scale. Comet also supports science gateways, which are web-based applications that simplify access to HPC resources on behalf of a diverse range of research communities and domains, typically with hundreds to thousands of users. Comet is an NSF-funded system operated by the San Diego Supercomputer Center at UC San Diego, and is available through the Extreme Science and Engineering Discovery Environment (XSEDE) program.

Comet System Configuration
System Component Configuration
Intel Haswell Standard Compute Nodes
Node count 1,944
Clock speed 2.5 GHz
Cores/node 24
DRAM/node 128 GB
SSD memory/node 320 GB
NVIDIA Kepler K80 GPU Nodes
Node count 36
CPU cores:GPUs/node 24:4
CPU:GPU DRAM/node 128 GB:40 GB
NVIDIA Pascal P100 GPU Nodes
Node count 36
CPU cores:GPUs/node 28:4
CPU:GPU DRAM/node 128 GB:40 GB
Large-memory Haswell Nodes
Node count 4
Clock speed 2.2 GHz
Cores/node 64
DRAM/node 1.5 TB
SSD memory/node 400 GB
Storage Systems
File systems Lustre, NFS
Performance Storage 7 PB
Home file system 280 TB

Resource allocation policies are designed to serve more users than traditional HPC systems

  • The maximum allocation for a Principal Investigator is 10M core-hours. Limiting the allocation size means that Comet can support more projects, even if each individual project is smaller.
  • Science Gateways, which generally serve hundreds to thousands of users, can request allocations above the 10M SU cap.
  • Comet provides Rapid Access Trial Accounts that give users 1000 SUs within 24 hours of requesting them.

Job scheduling policies are designed for user productivity

  • The maximum allowable job size on Comet is 1,728 cores, a limit that helps shorten wait times since fewer nodes sit idle waiting for a large number of nodes to become free. In practice, the average job size, weighted by core-hours, is about 400 cores (20 cores when unweighted, which reflects the large fraction of single-node jobs).
  • Comet supports long-running jobs - up to one week by special request.
  • Comet supports shared-node jobs (more than one job on a single node). Many applications are serial or can only scale to a few cores. Allowing shared nodes improves job throughput, provides higher overall system utilization, and allows more users to run on Comet.

Comet's system architecture is designed for user productivity

  • Each rack of Comet standard compute nodes provides 1,728 cores in a fully non-blocking FDR fat-tree network. Jobs can be run within a single rack to minimize latency, or allowed to span racks to minimize wait time. This ensures maximum interconnect performance when it is critical to the application, without penalizing throughput when that is the more important factor.
  • Each Comet compute node features 128 GB of DDR4 memory. This is important for both shared node jobs, and for those users with serial and threaded applications.
  • Each compute node features up to 400 GB of local SSD (flash) storage, which can be used to accelerate I/O performance for some applications.
  • Comet features 36 K80 and 36 P100 GPU nodes, supporting many community-developed applications whose GPU versions typically run much faster than their CPU counterparts.
  • Comet's 4 large memory nodes are well suited to applications such as those in genomics.
  • Comet's storage system, Data Oasis, provides high performance and high capacity, with added levels of protection via ZFS and a Durable Storage partition for periodic replication of critical project data.

Comet Technical Details

Table 1. Comet Technical Summary
System Component Configuration
1944 Standard Compute Nodes
Processor Type Intel Xeon E5-2680v3
Sockets 2
Cores/socket 12
Clock speed 2.5 GHz
Flop speed 960 GFlop/s
Memory capacity 128 GB DDR4 DRAM
Flash memory 320 GB SSD
Memory bandwidth 120 GB/s
STREAM Triad bandwidth 104 GB/s
36 K80 GPU Nodes
GPUs 4 NVIDIA K80
Cores/socket 12
Sockets 2
Clock speed 2.5 GHz
Memory capacity 128 GB DDR4 DRAM
Flash memory 400 GB SSD
Memory bandwidth 120 GB/s
STREAM Triad bandwidth 104 GB/s
36 P100 GPU Nodes
GPUs 4 NVIDIA P100
Cores/socket 14
Sockets 2
Clock speed 2.4 GHz
Memory capacity 128 GB DDR4 DRAM
Flash memory 400 GB SSD
Memory bandwidth 150 GB/s
STREAM Triad bandwidth 116 GB/s
4 Large Memory Nodes
Sockets 4
Cores/socket 16
Clock speed 2.2 GHz
Memory capacity 1.5 TB
Flash memory 400 GB
Stream Triad bandwidth 142 GB/s
Full System
Total compute nodes 1984
Total compute cores 47,776
Peak performance ~2.0 PFlop/s
Total memory 247 TB
Total memory bandwidth 228 TB/s
Total flash memory 634 TB
FDR InfiniBand Interconnect
Topology Hybrid Fat-Tree
Link bandwidth 56 Gb/s (bidirectional)
Peak bisection bandwidth TBD Gb/s (bidirectional)
MPI latency 1.03 - 1.97 µs
DISK I/O Subsystem
File Systems NFS, Lustre
Storage capacity (durable) 6 PB
Storage capacity (performance) 7 PB
I/O bandwidth (performance disk) 200 GB/s

Comet supports the XSEDE core software stack, which includes remote login, remote computation, data movement, science workflow support, and science gateway support toolkits.

Systems Software Environment

Table 2. Systems Software Environment
Software Function Description
Cluster Management Rocks
Operating System CentOS
File Systems NFS, Lustre
Scheduler and Resource Manager SLURM
XSEDE Software CTSS
User Environment Modules
Compilers Intel and PGI Fortran, C, C++
Message Passing Intel MPI, MVAPICH, Open MPI
Debugger DDT
Performance IPM, mpiP, PAPI, TAU

Supported Application Software

by Domain of Science

Table 3. Supported Applications Software
Domain Software
Biochemistry APBS
Bioinformatics BamTools, BEAGLE, BEAST, BEAST 2, bedtools, Bismark, BLAST, BLAT, Bowtie, Bowtie 2, BWA, Cufflinks, DPPDiv, Edena, FastQC, FastTree, FASTX-Toolkit, FSA, GARLI, GATK, GMAP-GSNAP, IDBA-UD, MAFFT, MrBayes, PhyloBayes, Picard, PLINK, QIIME, RAxML, SAMtools, SOAPdenovo2, SOAPsnp, SPAdes, TopHat, Trimmomatic, Trinity, Velvet
Compilers GNU, Intel, Mono, PGI
File format libraries HDF4, HDF5, NetCDF
Interpreted languages MATLAB, Octave, R
Large-scale data analysis frameworks Hadoop 1, Hadoop 2 (with YARN), Spark, RDMA-Spark
Molecular dynamics Amber, Gromacs, LAMMPS, NAMD
MPI libraries MPICH2, MVAPICH2, Open MPI
Numerical libraries ATLAS, FFTW, GSL, LAPACK, MKL, ParMETIS, PETSc, ScaLAPACK, SPRNG, Sundials, SuperLU, Trilinos
Predictive analytics KNIME, Mahout, Weka
Profiling and debugging DDT, IDB, IPM, mpiP, PAPI, TAU, Valgrind
Quantum chemistry CPMD, CP2K, GAMESS, Gaussian, MOPAC, NWChem, Q-Chem, VASP
Structural mechanics Abaqus
Visualization IDL, VisIt

System Access

As an XSEDE computing resource, Comet is accessible to XSEDE users who are given time on the system. To obtain an account, users may submit a proposal through the XSEDE Allocation Request System (XRAS) or request a Trial Account.

Interested parties may contact XSEDE User Support for help with a Comet proposal.

Logging in to Comet

Comet supports several access methods:

  • Single Sign On through the XSEDE User Portal
  • Command-line SSH login using SDSC username and XSEDE User Portal Password

To login to Comet from the command line, use the hostname:

comet.sdsc.xsede.org

The following are examples of Secure Shell (ssh) commands that may be used to log in to Comet:

ssh sdscusername@comet.sdsc.xsede.org
ssh -l sdscusername comet.sdsc.xsede.org

Notes and hints

  • When you log in to comet.sdsc.edu, you will be assigned one of the four login nodes: comet-ln[1-4].sdsc.edu. These nodes are identical in both architecture and software environment. Users should normally log in through comet.sdsc.xsede.org, but may specify one of the four nodes directly if they see poor performance.
  • You may append your public RSA key to your "~/.ssh/authorized_keys" file to enable access from authorized hosts without having to enter your password. Make sure the private key on your local machine is protected with a passphrase, and use ssh-agent or keychain to avoid repeatedly typing it; see the example below.
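
A minimal sketch of that setup, run from your local machine, is shown below; the username and key filename are placeholders:

    localhost$ ssh-keygen -t rsa                               # create a key pair; choose a passphrase
    localhost$ ssh-copy-id sdscusername@comet.sdsc.xsede.org   # appends the public key to ~/.ssh/authorized_keys on Comet
    localhost$ eval $(ssh-agent -s)                            # start an agent so the passphrase is entered only once
    localhost$ ssh-add ~/.ssh/id_rsa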

Do NOT use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.

Computing Environment

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

Modules

For example, if the Intel module and mvapich2_ib module are loaded and the user compiles with mpif90, the generated code is compiled with the Intel Fortran 90 compiler and linked with the mvapich2_ib MPI libraries.

Several modules that determine the default Comet environment are loaded at login time. These include the MVAPICH implementation of the MPI library and the Intel compilers. We strongly suggest that you use this combination whenever possible to get the best performance.

Table 3. Useful Module Commands

Command                                   Description
module list                               List the modules that are currently loaded
module avail                              List the modules that are available
module display <module_name>              Show the environment variables used by <module_name> and how they are affected
module unload <module_name>               Remove <module_name> from the environment
module load <module_name>                 Load <module_name> into the environment
module swap <module_one> <module_two>     Replace <module_one> with <module_two> in the environment

Loading and Unloading Modules

You must remove some modules before loading others. Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich are both loaded, running the command module unload intel will automatically unload mvapich. Subsequently issuing the module load intel command does not automatically reload mvapich.
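
For example, starting from the default environment with the intel and mvapich2_ib modules loaded, the following sequence illustrates this behavior:

    login1$ module list                # intel and mvapich2_ib are loaded
    login1$ module unload intel        # mvapich2_ib is unloaded as well
    login1$ module load intel          # mvapich2_ib is NOT reloaded automatically
    login1$ module load mvapich2_ib    # reload it explicitly if needed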

If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (".bashrc" for bash users, ".cshrc" for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.

    login1$ man 1 module
    login1$ man 4 modulefile
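
For example, bash users who want the suggested Intel/MVAPICH2 environment at every login could add lines such as the following to "~/.bashrc" (module names as used elsewhere in this guide):

    module load gnutools
    module load intel mvapich2_ib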

module: command not found

The error message "module: command not found" is sometimes encountered when switching from one shell to another or attempting to run the module command from within a shell script or batch job. The reason that the module command may not be inherited as expected is that it is defined as a function for your login shell. If you encounter this error execute the following from the command line (interactive shells) or add to your shell script (including Slurm batch scripts)

login1$ source /etc/profile.d/modules.sh

Managing Your Accounts

Useful Commands

The show_accounts command lists the accounts that you are authorized to use, together with a summary of the used and remaining time.

    [user@comet-login1 ~]$ show_accounts
    ID name     project     used           available        used_by_proj
    user        project     SUs by user    SUs available    SUs by proj

To charge your job to one of these projects, replace "project" with an entry from the list and put this SLURM directive in your job script:

#SBATCH -A project

Many users will have access to multiple accounts (e.g. an allocation for a research project and a separate allocation for classroom or educational use). On some systems a default account is assumed, but please get in the habit of explicitly setting an account for all batch jobs. Awards are normally made for a specific purpose and should not be used for other projects.

Adding Users to an Account

Project PIs and co-PIs can add or remove users from an account. To do this, log in to your XSEDE portal account and go to the Add User page.

Charging

The charge unit for all SDSC machines, including Comet, is the Service Unit (SU). This corresponds to the use of one compute core for one hour. Keep in mind that your charges are based on the resources that are tied up by your job and don't necessarily reflect how the resources are used. Charges are based on either the number of cores or the fraction of the memory requested, whichever is larger. The minimum charge for any job longer than 10 seconds is 1 SU.

Job Charge Considerations

  • A node-exclusive job that runs on a compute node for one hour will be charged 24 SUs (24 cores x 1 hour)
  • A serial job in the shared queue that uses less than 5 GB memory and runs for one hour will be charged 1 SU (1 core x 1 hour)
  • A P100 gpu/gpu-shared job will be charged a 1.5x premium relative to the K80, since P100 GPUs are substantially faster, achieving more than twice the performance for some applications. (Please note that SUs on the GPU resource are measured in K80 GPU-hours.)
  • A GPU is equivalent to 1/4th of a node which equals 6 cores on the K80 GPUs and 7 cores on the P100 GPUs
  • Multicore jobs will scale according to resource utilization
  • Each standard compute node has ~128 GB of memory and 24 cores
    • Each standard node core has 5 GB of memory (1/24th of the total memory on a standard compute node)
  • Each large memory node has ~1.5 TB of memory and 64 cores
    • Each large memory core has 24 GB of memory (1/64 of total memory on a large memory node)
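
As a worked example of these rules: a job in the shared partition requesting 2 cores and 20 GB of memory for 2 hours is charged on its memory footprint, since 20 GB corresponds to 4 core-equivalents (at 5 GB per core) and 4 is larger than 2:

    max(2 cores, 20 GB / 5 GB per core) x 2 hours = 4 x 2 = 8 SUs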

Application Development

Compiling

Comet provides the Intel, Portland Group (PGI), and GNU compilers along with multiple MPI implementations (MVAPICH2, MPICH2, OpenMPI). Most applications will achieve the best performance on Comet using the Intel compilers with MVAPICH2, and the majority of the libraries installed on Comet have been built using this combination. Other compilers and MPI implementations are available, but we suggest using them only when needed for compatibility.

All three compilers now support the Advanced Vector Extensions 2 (AVX2). Using AVX2, up to 16 floating point operations can be executed per cycle per core, potentially doubling the performance relative to non-AVX2 processors running at the same clock speed. Note that AVX2 support is not enabled by default and compiler flags must be set as described below.

Using the Intel Compilers (Default/Suggested)

The Intel compilers and the MVAPICH2 MPI implementation will be loaded by default. If you have modified your environment, you can reload by executing the following commands at the Linux prompt or placing in your startup file ("~/.cshrc" or "~/.bashrc")

    login1$ module purge
    login1$ module load gnutools
    login1$ module load intel mvapich2_ib

For AVX2 support, compile with the "-xHOST" option. Note that the "-xHOST" option alone does not enable aggressive optimization, so compilation with "-O3" is also suggested. The "-fast" flag invokes "-xHOST", but should be avoided since it also turns on interprocedural optimization ("-ipo"), which may cause problems in some instances.
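
For example, a typical optimized build of a serial or MPI code might look like the following; the source and executable names are placeholders:

    login1$ ifort -O3 -xHOST -o mycode.x mycode.f90
    login1$ mpicc -O3 -xHOST -o mympi.x mympi.c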

Intel MKL libraries are available as part of the "intel" modules on Comet. Once this module is loaded, the environment variable MKL_ROOT points to the location of the MKL libraries. The Intel MKL Link Line Advisor can be used to ascertain the appropriate link line (adjust the MKL_ROOT portion accordingly).

For example, to compile a C program that statically links the 64-bit ScaLAPACK libraries on Comet:

    mpicc -o pdpttr.exe pdpttr.c \
        -I$MKL_ROOT/include ${MKL_ROOT}/lib/intel64/libmkl_scalapack_lp64.a \
        -Wl,--start-group ${MKL_ROOT}/lib/intel64/libmkl_intel_lp64.a \
        ${MKL_ROOT}/lib/intel64/libmkl_core.a ${MKL_ROOT}/lib/intel64/libmkl_sequential.a \
        -Wl,--end-group ${MKL_ROOT}/lib/intel64/libmkl_blacs_intelmpi_lp64.a -lpthread -lm 
For more information on the Intel compilers:
    login1$ icc -help
    login1$ icpc -help
    login1$ ifort -help

Table 4. Intel Compilers

         Serial   MPI      OpenMP          MPI+OpenMP
Fortran  ifort    mpif90   ifort -openmp   mpif90 -openmp
C        icc      mpicc    icc -openmp     mpicc -openmp
C++      icpc     mpicxx   icpc -openmp    mpicxx -openmp

Note for C/C++ users: you may see the compiler warning "feupdateenv is not implemented and will always fail." For most users, this warning can safely be ignored. By default, the Intel C/C++ compilers link against Intel's optimized version of the C standard math library (libimf), and the warning stems from the fact that several of the newer C99 library functions related to floating point rounding and exception handling have not been implemented.

Using the PGI Compilers

The PGI compilers can be loaded by executing the following commands at the Linux prompt or placing in your startup file (~/.cshrc or ~/.bashrc)

    login1$ module purge
    login1$ module load gnutools
    login1$ module load pgi mvapich2_ib
For AVX support, compile with "-fast". For more information on the PGI compilers:
    login1$ man pgf90 
    login1$ man pgcc
    login1$ man pgCC
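
For example, a typical optimized build might look like the following; the source and executable names are placeholders:

    login1$ pgf90 -fast -o mycode.x mycode.f90
    login1$ mpif90 -fast -o mympi.x mympi.f90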
    

Table 5. PGI Compilers

         Serial   MPI      OpenMP      MPI+OpenMP
Fortran  pgf90    mpif90   pgf90 -mp   mpif90 -mp
C        pgcc     mpicc    pgcc -mp    mpicc -mp
C++      pgCC     mpicxx   pgCC -mp    mpicxx -mp

Using the GNU Compilers

The GNU compilers can be loaded by executing the following commands at the Linux prompt or placing in your startup files (~/.cshrc or ~/.bashrc)

    login1$ module purge
    login1$ module load gnutools
    login1$ module load gnu openmpi_ib
For AVX support, compile with "-mavx". Note that AVX support is only available in version 4.7 or later, so it is necessary to explicitly load the gnu/4.9.2 module until it becomes the default. For more information on the GNU compilers:
    login1$ man gfortran
    login1$ man gcc
    login1$ man g++
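
For example, a typical build with AVX enabled might look like the following; the source and executable names are placeholders:

    login1$ gfortran -O3 -mavx -o mycode.x mycode.f90
    login1$ mpicc -O3 -mavx -o mympi.x mympi.c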

Table 6. GNU Compilers

         Serial     MPI      OpenMP              MPI+OpenMP
Fortran  gfortran   mpif90   gfortran -fopenmp   mpif90 -fopenmp
C        gcc        mpicc    gcc -fopenmp        mpicc -fopenmp
C++      g++        mpicxx   g++ -fopenmp        mpicxx -fopenmp

MVAPICH2-GDR on Comet GPU Nodes

The GPU nodes on Comet have MVAPICH2-GDR available. MVAPICH2-GDR is based on the standard MVAPICH2 software stack and incorporates designs that take advantage of GPUDirect RDMA technology for inter-node data movement on NVIDIA GPU clusters with Mellanox InfiniBand interconnects. The "mvapich2-gdr" modules are also available on the login nodes for compiling purposes. An example compile and run script is provided in "/share/apps/examples/MVAPICH2GDR".

Notes and hints

  • The "mpif90", "mpicc", and "mpicxx" commands are actually wrappers that call the appropriate serial compilers and load the correct MPI libraries. While the same names are used for the Intel, PGI and GNU compilers, keep in mind that these are completely independent scripts.
  • If you use the PGI or GNU compilers or switch between compilers for different applications, make sure that you load the appropriate modules before running your executables.
  • When building OpenMP applications and moving between different compilers, one of the most common errors is to use the wrong flag to enable handling of OpenMP directives. Note that Intel, PGI, and GNU compilers use the "-openmp", "-mp", and "-fopenmp" flags, respectively.
  • Explicitly set the optimization level in your makefiles or compilation scripts. Most well written codes can safely use the highest optimization level ("-O3"), but many compilers set lower default levels (e.g. GNU compilers use the default "-O0", which turns off all optimizations).
  • Turn off debugging, profiling, and bounds checking when building executables intended for production runs as these can seriously impact performance. These options are all disabled by default. The flag used for bounds checking is compiler dependent, but the debugging ("-g") and profiling ("-pg") flags tend to be the same for all major compilers.

Running Jobs on Comet

Running Jobs on Regular Compute Nodes

Comet uses the Simple Linux Utility for Resource Management (SLURM) batch environment. When you run in batch mode, you submit jobs to be run on the compute nodes using the sbatch command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes. Comet has the following partitions available:

Table 7. Queues on Comet

Queue Name     Max Walltime   Max Nodes   Comments
compute        48 hrs         72          Used for access to regular compute nodes
gpu            48 hrs         8           Used for access to the GPU nodes
gpu-shared     48 hrs         1           Used for shared access to a partial GPU node
shared         48 hrs         1           Single-node jobs using fewer than 24 cores
large-shared   48 hrs         1           Single-node jobs using large memory, up to 1.45 TB
debug          30 mins        2           Used for access to debug nodes

Submitting Jobs Using sbatch

Jobs can be submitted to the SLURM queues using the "sbatch" command as follows:

[user@comet-ln1]$ sbatch jobscriptfile

where "jobscriptfile" is the name of a UNIX format file containing special statements (corresponding to "sbatch" options), resource specifications, and shell commands. Several example SLURM scripts are given below:

Basic MPI Job

#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 24 cores per node for a total of 48 cores.
#ibrun in verbose mode will give binding detail

ibrun -v ../hello_mpi

Basic OpenMP Job

#!/bin/bash
#SBATCH --job-name="hello_openmp"
#SBATCH --output="hello_openmp.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Set the number of OpenMP threads
export OMP_NUM_THREADS=24

#Run the OpenMP executable
./hello_openmp

Hybrid MPI-OpenMP Job

#!/bin/bash
#SBATCH --job-name="hellohybrid"
#SBATCH --output="hellohybrid.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#This job runs with 2 nodes, 24 cores per node for a total of 48 cores.
#We use 8 MPI tasks and 6 OpenMP threads per MPI task
export OMP_NUM_THREADS=6
ibrun --npernode 4 ./hello_hybrid

Basic mpirun_rsh Job

#!/bin/bash
#SBATCH --job-name="hellompirunrsh"
#SBATCH --output="hellompirunrsh.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --export=ALL
#SBATCH -t 01:30:00

#Generate a hostfile from the slurm node list
export SLURM_NODEFILE=`generate_pbs_nodefile`

#Run the job using mpirun_rsh
mpirun_rsh -hostfile $SLURM_NODEFILE -np 48 ../hello_mpi

Using the Shared Partition

#!/bin/bash
#SBATCH -p shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=40G
#SBATCH -t 01:00:00
#SBATCH -J HPL.8
#SBATCH -o HPL.8.%j.%N.out
#SBATCH -e HPL.8.%j.%N.err
#SBATCH --export=ALL

export MV2_SHOW_CPU_BINDING=1
ibrun -np 8 ./xhpl.exe

The above script will run using 8 cores and 40 GB of memory. Please note that performance in the shared partition may vary depending on how sensitive your application is to memory locality and on which cores the scheduler assigns to your job. For example, it is possible that the 8 cores will span two sockets.

SLURM No-Requeue Option

SLURM will requeue jobs if there is a node failure. However, in some cases this might be detrimental if files get overwritten. If users wish to avoid automatic requeue, the following line should be added to their script:

#SBATCH --no-requeue

License Scheduling

MATLAB MDCS, IDL, and Abaqus have restricted licenses that limit the number of cores that can be used at any given time. Please add a license request to your run script so that licenses are not oversubscribed, which causes job failures. For example, the following Abaqus job requests 24 licenses:

#!/bin/bash
#SBATCH --job-name="abaqus"
#SBATCH --output="abaqus.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -L abaqus:24
#SBATCH -t 01:00:00
module load abaqus/6.14-1
export EXE=`which abq6141`
$EXE job=s4b input=s4b scratch=/scratch/$USER/$SLURM_JOBID cpus=24 mp_mode=threads memory=120000mb interactive

Example Scripts for Applications

SDSC User Services staff have developed sample run scripts for common applications. They are available in the /share/apps/examples directory on Comet.

Job Dependencies

There are several scenarios (e.g. splitting long running jobs, workflows) where users may require jobs with dependencies on successful completions of other jobs. In such cases, SLURM's "--dependency" option can be used. The syntax is as follows:

[user@comet-ln1 ~]$ sbatch --dependency=afterok:jobid jobscriptfile
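
For example, one way to chain two scripts so that the second starts only after the first completes successfully (script names are placeholders; "--parsable" makes sbatch print just the job ID):

[user@comet-ln1 ~]$ jobid=$(sbatch --parsable first_step.sb)
[user@comet-ln1 ~]$ sbatch --dependency=afterok:$jobid second_step.sb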

Job Monitoring and Management

Users can monitor jobs using the squeue command.

[user@comet-ln1 ~]$ squeue -u user1
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    256556 compute raxml_na user1 R 2:03:57 4 comet-14-[11-14]
    256555 compute raxml_na user1 R 2:14:44 4 comet-02-[06-09]
In this example, the output lists two jobs that are running in the `compute` partition. The jobID, partition name, job names, user names, status, time, number of nodes, and the node list are provided for each job. Some common `squeue` options include:

Table 8. squeue options

Option             Result
-i <interval>      Repeatedly report at intervals (in seconds)
-j <job_list>      Displays information for the specified job(s)
-p <part_list>     Displays information for the specified partitions (queues)
-t <state_list>    Shows jobs in the specified state(s)

Users can cancel their own jobs using the "scancel" command as follows:

[user@comet-ln1 ~]$ scancel jobid

Help with ibrun

The options and arguments for ibrun are as follows:

  Usage: ibrun [options] executable [executable args]
  Options:
    -n, -np n
      launch n MPI ranks (default: use all cores provided by resource manager)
   
    -o, --offset n
      assign MPI ranks starting at the nth slot provided by the resource manager (default: 0)
   
    -no n
      assign MPI ranks starting at the nth unique node provided by the resource manager (default: 0)
   
    --npernode n
      only launch n MPI ranks per node (default: ppn from resource manager)
   
    --tpr|--tpp|--threads-per-rank|--threads-per-process n
      how many threads each MPI rank (often referred to as 'MPI process')
      will spawn. (default: $OMP_NUM_THREADS (if defined), ppn/npernode
      if ppn is divisible by npernode, or 1 otherwise)
   
    --switches 'implementation-specific'
      Pass additional command-line switches to the underlying implementation's
      MPI launcher. These WILL be overridden by any switches ibrun
      subsequently enables (default: none)
   
    -bp|--binding-policy scatter|compact|none
      Define the CPU affinity's binding policy for each MPI rank.
      'scatter' distributes ranks across each binding level,
      'compact' fills up a binding level before allocating another,
      and 'none' disables all affinity settings (default: optimized for job geometry)
   
    -bl|--binding-level core|socket|numanode|none
      Define the level of granularity for CPU affinity for each MPI rank.
      'core' binds each rank to a single core; 'socket' binds each rank to
      all cores on a single CPU socket (good for multithreaded ranks);
      'numanode' binds each rank to the subset of cores belonging to a numanode;
      'none' disables all affinity settings. (default: optimized for job geometry)
   
    --dryrun
      Do everything except actually launch the application
   
    -v|--verbose
      Print diagnostic messages
   
    -?
      Print this message
  

Transferring Data

Using Globus Endpoints, Data Movers and Mount Points

All of Comet's NFS and Lustre filesystems are accessible via the Globus endpoint "xsede#comet". The servers also mount Gordon's filesystems, so the mount points are different for each system. The following table shows the mount points on the data mover nodes (which are the backend for "xsede#comet" and "xsede#gordon").

Machine Location on machine Location on Globus/Data Movers
Comet, Gordon /home/$USER /home/$USER
Comet, Gordon /oasis/projects/nsf /oasis/projects/nsf
Comet /oasis/scratch/comet /oasis/scratch-comet
Gordon /oasis/scratch /oasis/scratch

Storage on Comet

SSD Scratch Space

The compute nodes on Comet have access to fast flash storage (SSDs); the amount of space available as local scratch varies by node type, as shown in the table below. The latency to the SSDs is several orders of magnitude lower than that of spinning disk (on the order of 100 microseconds vs. milliseconds), making them ideal for user-level checkpointing and for applications that need fast random I/O to large scratch files. Users can access the SSDs only during job execution, under the following directory local to each compute node:

/scratch/$USER/$SLURM_JOB_ID

Partition          Space Available
compute, shared    212 GB
gpu, gpu-shared    286 GB
large-shared       286 GB

A limited number of nodes in the "compute" partition have larger SSDs with a total of 1464 GB available in local scratch. They can be accessed by adding the following to the Slurm script:

#SBATCH --constraint="large_scratch"
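
A common pattern is to stage input onto the node-local SSD at the start of a job, run against it, and copy results back to the Lustre filesystem before the job ends. The following is a minimal sketch; the input, output, and executable names are placeholders:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH -t 01:00:00

#Node-local SSD scratch directory created for this job
SSD_DIR=/scratch/$USER/$SLURM_JOB_ID

#Stage input from Lustre scratch to the SSD
cp /oasis/scratch/comet/$USER/temp_project/input.dat $SSD_DIR/

#Run from the SSD, then copy results back before the job ends
cd $SSD_DIR
ibrun $SLURM_SUBMIT_DIR/my_app input.dat
cp output.dat /oasis/scratch/comet/$USER/temp_project/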

Parallel Lustre Filesystems

In addition to the local scratch storage, users will have access to global parallel filesystems on Comet. Overall, Comet has 7 petabytes of 200 GB/second performance storage and 6 petabytes of 100 GB/second durable storage.

Users can now access /oasis/projects from Comet. The two Lustre filesystems available on Comet are:

Lustre Comet scratch filesystem: /oasis/scratch/comet/$USER/temp_project
Lustre NSF projects filesystem: /oasis/projects/nsf

Virtual Clusters

VCs are not meant to replace the standard HPC batch queuing system, which is well suited for most scientific and technical workloads. In addition, a VC should not be thought of simply as a VM (virtual machine); future XSEDE resources, such as Indiana University's Jetstream, will address that need. VCs are primarily intended for users who require both fine-grained control over their software stack and access to multiple nodes. With regard to the software stack, this may include access to operating systems different from the default version of CentOS available on Comet, or to low-level libraries that are closely integrated with the Linux distribution. Science Gateways that serve large research communities and require a flexible software environment are encouraged to consider applying for a VC, as are current users of commercial clouds who want to make the transition for performance or cost reasons.

Maintaining and configuring a virtual cluster requires a certain level of technical expertise. We expect that each project will have at least one person possessing strong systems administration experience with the relevant OS since the owner of the VC will be provided with "bare metal" root level access. SDSC staff will be available primarily to address performance issues that may be related to problems with the Comet hardware and not to help users build their system images.

All VC requests must include a brief justification that addresses the following:

  • Why is a VC required for this project?
  • What expertise does the PI's team have for building and maintaining the VC?

Using GPU nodes

Starting July 1, 2017, Comet provides both NVIDIA K80 and P100 GPU-based resources. These GPU nodes are allocated as a separate resource and can no longer be accessed using your Comet CPU allocation. Current users will need to request a transfer of time from Comet CPU to Comet GPU through XRAS. The conversion rate is 14 Comet Service Units (SUs) to 1 K80 GPU-hour. The P100 GPUs are substantially faster than the K80, achieving more than twice the performance for some applications; accordingly, users will incur a 1.5x premium when running on the P100 vs. the K80.

The GPU nodes can be accessed via either the "gpu" or the "gpu-shared" partitions.

#SBATCH -p gpu

or

#SBATCH -p gpu-shared

In addition to the partition name (required), individual GPUs are scheduled as a resource and must be requested explicitly; the GPU type is optional.

#SBATCH --gres=gpu[:type]:4

GPUs will be allocated on a first-available, first-schedule basis unless a type is specified with the [type] option, where type can be k80 or p100 (the type string is case sensitive).

#SBATCH --gres=gpu:4        #first available gpu node (either type)
#SBATCH --gres=gpu:k80:4    #only K80 nodes
#SBATCH --gres=gpu:p100:4   #only P100 nodes

For example on the "gpu" partition the following lines are needed to utilize 4 P100 GPUs:

#SBATCH -p gpu
#SBATCH --gres=gpu:p100:4

Users should always set --ntasks-per-node equal to 6 x [number of GPUs requested] on all K80 "gpu-shared" jobs, and 7 x [number of GPUs requested] on all P100 "gpu-shared" jobs, to ensure proper resource distribution by the scheduler. This example requests two K80 GPUs on the "gpu-shared" partition:

#SBATCH -p gpu-shared
#SBATCH --ntasks-per-node=12
#SBATCH --gres=gpu:k80:2

Please see /share/apps/examples/GPU for more examples.

Jobs run in the gpu-shared partition are charged differently from other shared partitions on Comet to reflect fractions of a resource used, based on number of GPUs requested and the relative performance of the different GPU types. P100 GPUs are generally substantially faster than K80 nodes, achieving more than twice the performance for some applications, so we charge a 1.5x premium on P100 GPUs.

1 GPU is equivalent to 1/4th of the node, or 6 cores on K80 nodes and 7 cores on P100 nodes.

The charging equation is:

GPU SUs = ((Number of K80 GPUs) + 1.5 x (Number of P100 GPUs)) x (wallclock time in hours)
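
For example, a gpu-shared job that uses 2 P100 GPUs for 4 hours of wallclock time would be charged:

    (0 + 2 * 1.5) x 4 hours = 12 GPU SUs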

Using Large Memory Nodes

The large memory nodes can be accessed via the "large-shared" partition. Charges are based on either the number of cores or the fraction of the memory requested, whichever is larger.

For example, on the "large-shared" partition, the following job requesting 16 cores and 455 GB of memory (about 31.3% of the 1455 GB of available memory on one node) for 1 hour will be charged 20 SUs:

455/1455 (memory fraction) * 64 (cores) * 1 (hour) ~= 20

#SBATCH --ntasks=16
#SBATCH --mem=455G
#SBATCH --partition=large-shared

While there is not a separate 'large' partition, a job can still explicitly request all of the resources on a large memory node. Please note that there is no premium for using Comet's large memory nodes, but the processors are slightly slower (2.2 GHz compared to 2.5 GHz on the standard nodes). Users are advised to request the large memory nodes only if they need the extra memory.
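
For example, the following directives would request an entire large memory node through the large-shared partition (all 64 cores and the roughly 1455 GB of memory available to jobs); this is a sketch based on the figures above:

#SBATCH --partition=large-shared
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --mem=1455G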

Software on Comet

Software Packages


Table 9. Software on Comet

Software Package Compiler Suites Parallel Interface
AMBER: Assisted Model Building with Energy Refinement intel mvapich2_ib
APBS: Adaptive Poisson-Boltzmann Solver intel mvapich2_ib
Car-Parrinello 2000 (CP2K) intel mvapich2_ib
DDT    
FFTW: Fastest Fourier Transform in the West intel, pgi, gnu mvapich2_ib
GAMESS: General Atomic Molecular Electronic Structure System intel native: sockets, ip over ib
vsmp: scalemp mpich2
GAUSSIAN pgi Single node, shared memory
GROMACS: GROningen MAchine for Chemical Simulations intel mvapich2_ib
HDF4/HDF5: Hierarchical Data Format intel, pgi, gnu mvapich2_ib for hdf5
LAMMPS:Large-scale Atomic/Molecular Massively Parallel Simulator intel mvapich2_ib
NAMD: NAnoscale Molecular Dynamics intel mvapich2_ib
NCO: NetCDF Operators intel, pgi, gnu none
NetCDF: Network Common Data Format Intel, pgi, gnu none
Python modules (scipy etc.) gnu: ipython, nose, pytz; intel: matplotlib, numpy, scipy, pyfits None
RDMA-Hadoop None None
RDMA-Spark None None
Singularity: User Defined Images None None
VisIt Visualization Package intel openmpi

Software Package Descriptions

AMBER

AMBER is a package of molecular simulation programs including SANDER (Simulated Annealing with NMR-Derived Energy Restraints) and PMEMD (Particle Mesh Ewald Molecular Dynamics), a faster and more scalable version of SANDER.

APBS

APBS evaluates the electrostatic properties of solvated biomolecular systems. View the APBS documentation

Car-Parrinello 2000

CP2K is a program to perform simulations of molecular systems. It provides a general framework for different methods such as Density Functional Theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials. View the CP2K documentation

DDT

DDT is a debugging tool for scalar, multithreaded and parallel applications. DDT Debugging Guide from TACC

FFTW

FFTW is a library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data. View the FFTW documentation

GAMESS

GAMESS is a program for ab initio quantum chemistry. GAMESS can compute SCF wavefunctions, and correlation corrections to these wavefunctions as well as Density Functional Theory. GAMESS documentation, examples, etc.

GAUSSIAN

Gaussian 09 provides state-of-the-art capabilities for electronic structure modeling. Gaussian 09 User's Reference

GROMACS

GROMACS is a versatile molecular dynamics package, primarily designed for biochemical molecules like proteins, lipids and nucleic acids. GROMACS Online Manual

HDF4/HDF5

HDF is a collection of utilities, applications and libraries for manipulating, viewing, and analyzing data in HDF format. HDF 5 Resources

LAMMPS

LAMMPS is a classical molecular dynamics simulation code. LAMMPS User Manual

NAMD

NAMD is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD User Guide

NCO

NCO operates on netCDF input files (e.g. derive new data, average, print, hyperslab, manipulate metadata) and outputs results to screen or text, binary, or netCDF file formats. NCO documentation on SourceForge

netCDF

netCDF is a set of libraries that support the creation, access, and sharing of array-oriented scientific data using machine-independent data formats. netCDF documentation on UCAR's Unidata Program Center

Python modules

The Python modules under /opt/scipy consist of: nose, numpy, scipy, matplotlib, pyfits, ipython, and pytz. Video tutorial from a TACC workshop on Python

Python videos from Khan Academy

The HPC University Python resources

RDMA-Hadoop

RDMA-based Apache Hadoop 2.x is a high performance derivative of Apache Hadoop developed as part of the High-Performance Big Data (HiBD) project at the Network-Based Computing Lab of The Ohio State University. The installed release on Comet (v0.9.7) is based on Apache Hadoop 2.6.0. The design uses Comet's InfiniBand network at the native level (verbs) for HDFS, MapReduce, and RPC components, and is optimized for use with Lustre.
The design features a hybrid RDMA-based HDFS with in-memory and heterogeneous storage, including RAM disk, SSD, HDD, and Lustre. In addition, optimized MapReduce over Lustre (with RDMA-based shuffle) is also available. The implementation is fully integrated with SLURM (and PBS) on Comet, with scripts available to dynamically deploy Hadoop clusters within the SLURM scheduling framework.
Examples for various modes of usage are available in /share/apps/examples/HADOOP/RDMA. Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. Read more about the RDMA Hadoop and HiBD project.

RDMA-Spark

RDMA-based Apache Spark is a high performance derivative of Apache Spark developed as part of the High-Performance Big Data (HiBD) project at the Network-Based Computing Lab of The Ohio State University. The installed release on Comet (v0.9.1) is based on Apache Spark 1.5.1. The design uses Comet's InfiniBand network at the native level (verbs) and features RDMA-based data shuffle, a SEDA-based shuffle architecture, efficient connection management, non-blocking and chunk-based data transfer, and off-JVM-heap buffer management.
The RDMA-Spark cluster setup and usage is managed via the myHadoop framework. An example script is provided in /share/apps/examples/SPARK/sparkgraphx_rdma. Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. See details on RDMA Spark.

Singularity

Singularity is a platform to support users who have environmental needs different from those provided by the resource or service provider. While other container solutions appear to fill this niche, their current implementations focus on network service virtualization rather than on application-level virtualization aimed at the HPC space. Because of this, Singularity leverages a workflow and security model that makes it a very reasonable candidate for shared or multi-tenant HPC resources like Comet without requiring any modifications to the scheduler or system architecture. Additionally, all typical HPC functions can be leveraged within a Singularity container (e.g. InfiniBand, high performance file systems, GPUs, etc.). While Singularity supports MPI running in a hybrid model, where MPI is invoked outside the container and the MPI processes run inside the container, we have not yet tested this.
Examples for various modes of usage are available in /share/apps/examples/Singularity. Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. Read more about Singularity.

VisIt Visualization Package

The VisIt visualization package supports remote submission of parallel jobs and includes a Python interface that provides bindings to all of its plots and operators so they may be controlled by scripting. Watch the Getting Started With VisIt tutorial