SuperMIC User Guide
Last update: March 22, 2017

System Overview

SuperMIC (pronounced Super Mick) is an LSU supercomputer funded by the National Science Foundation's (NSF) Major Research Instrumentation (MRI) award to the Center for Computation & Technology. Forty percent of its computational resources are reserved for participants in the Extreme Science and Engineering Discovery Environment (XSEDE) program, a national system of leadership-class HPC machines that scientists use to share computing resources, data, and expertise.

SuperMIC is capable of a peak theoretical performance of over 925 TF. It achieved a performance of 557 TF during testing, which placed it at number 65 on the June 2014 Top500 list.

SuperMIC went operational on October 1, 2014. It contains a total of 382 nodes, each with two 10-core 2.8GHz Intel Ivy Bridge-EP processors. 380 compute nodes each have 64 GB of memory and 500 GB of local HDD storage. 360 of the compute nodes have 2 Intel Xeon Phi 7120P coprocessors. 20 of the compute nodes have 1 Intel Xeon Phi 7120P coprocessor and 1 NVIDIA Tesla K20X. The system is available to LSU and XSEDE users via an allocation process. XSEDE users may submit allocation and account requests through the XSEDE User Portal. LSU users will need to use their LSU HPC credentials to gain access to SuperMIC (see: LSU HPC account request), and require access to an LSU HPC allocation (see: LSU HPC allocation request) to run production jobs on the system.

System Configuration

  • Login Node 1
    • One Intel Xeon Phi 7120P Coprocessor
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • 128GB DDR3 1866MHz RAM
    • 1TB HD
    • 56 Gigabit/sec InfiniBand network interface
    • 10 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • Login Node 2
    • One NVIDIA Tesla K20X 6GB GPU
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • 128GB DDR3 1866MHz RAM
    • 1TB HD
    • 56 Gigabit/sec InfiniBand network interface
    • 10 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • 360 Compute Nodes
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • Two Intel Xeon Phi 7120P Coprocessors
    • 64GB DDR3 1866MHz RAM
    • 500GB HD
    • 56 Gigabit/sec InfiniBand network interface
    • 1 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • 20 Hybrid Compute Nodes
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • One Intel Xeon Phi 7120P Coprocessor
    • One NVIDIA Tesla K20X 6GB GPU with GPUDirect Support
    • 64GB DDR3 1866MHz RAM
    • 500GB HD
    • 56 Gigabit/sec InfiniBand network interface
    • 1 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • Cluster Storage
    • 840TB Lustre High-Performance disk
    • 5TB NFS-mounted /home disk storage

System Access

XSEDE users: please use gsissh or XSEDE's Single Sign On login hub to log on to SuperMIC.


To access SuperMIC, users must connect using a Secure Shell (SSH) client.

*nix and Mac Users - An SSH client is already installed and can be accessed from the command prompt using the ssh command. One would issue a command similar to the following:

localhost$ ssh -X username@hostname

Enter your password when prompted. The "-X" flag allows X11 forwarding to be set up automatically.
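For convenience, frequent users can record these options in the SSH client configuration file so that a short alias suffices. A minimal sketch; the alias "supermic" and the HostName value below are placeholders, so substitute the actual SuperMIC hostname and your own username:

```
# ~/.ssh/config -- "supermic" is a made-up alias; replace <supermic-hostname>
# with the actual SuperMIC login hostname.
Host supermic
    HostName <supermic-hostname>
    User your_username
    ForwardX11 yes
```

With such an entry in place, "ssh supermic" behaves like the full command shown above.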

Windows Users - You will need to download and install an SSH client such as the PuTTY utility. If you need to log in with X11 forwarding, an X server must be installed and running on your local Windows machine; Xming X Server is recommended. Advanced users may instead install Cygwin, which provides a command-line ssh client similar to that available to *nix and Mac users.

If you have forgotten your password, or you wish to reset it, see here (click "Forgot your password?").

GSI-OpenSSH (gsissh)

The following commands authenticate using the XSEDE MyProxy server, then connect via gsissh to SuperMIC:

    localhost$ myproxy-logon -s
    localhost$ gsissh

Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide for more info.

XSEDE also provides a Single Sign On (SSO) login hub, where a proxy certificate is generated automatically; users can then connect to XSEDE resources via gsissh. Detailed information can be found here.


To report a problem please run the ssh or gsissh command with the "-vvv" option and include the verbose information in the ticket.

Transferring your files to SuperMIC

SuperMIC supports multiple file transfer programs: common command-line utilities such as scp, sftp, and rsync, as well as services such as globus-url-copy and Globus.


scp

Using scp is the easiest method for transferring single files.

Local File to Remote Host

localhost$ scp localfile username@remotehost:/destination/dir/or/filename

Remote Host to Local File

localhost$ scp username@remotehost:/remote/filename localfile


sftp

Interactive Mode

One may find this mode very similar to the interactive interface offered by classic ftp clients. A login session may look similar to the following:

login1$ sftp user@remotehost
enter password

The commands are similar to those offered by the outmoded ftp client programs: get, put, pwd, lcd, etc. For more information on the available set of commands, consult the sftp man page.

login1$ man sftp

Batch Mode

One may use sftp non-interactively in two ways.

Case 1: Pull a remote file to the local host.

login1$ sftp username@remotehost:/remote/filename localfilename

Case 2: Create a special sftp batch file containing the set of commands one wishes to execute without any interaction.

login1$ sftp -b batchfile user@remotehost

Additional information on constructing a batch file is available in the sftp man page.
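As a sketch of what such a batch file might contain, the following creates one that changes to a remote directory, uploads a file, and lists the result; the file names and remote host are hypothetical, and the commands used (cd, put, ls, quit) are standard sftp batch commands:

```shell
# Create a hypothetical sftp batch file. The directory and file names
# are placeholders for this illustration.
cat > batchfile <<'EOF'
cd /destination/dir
put localfile
ls -l
quit
EOF

# Then run it non-interactively (remote host is a placeholder):
# sftp -b batchfile user@remotehost
```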

rsync Over SSH (preferred)

rsync is an extremely powerful program; it can synchronize entire directory trees, only sending data about files that have changed. That said, it is rather picky about the way it is used. The rsync man page has a great deal of useful information, but the basics are explained below.

Single File Synchronization

To synchronize a single file via rsync, use the following:

To send a file:

login1$ rsync --rsh=ssh --archive --stats --progress localfile \
username@remotehost:/destination/dir/

To receive a file:

login1$ rsync --rsh=ssh --archive --stats --progress \
username@remotehost:/remote/filename localfilename

Note that the "--rsh=ssh" option is not necessary with newer versions of rsync, but older installs will default to using rsh (which is not generally enabled on modern OSes).

Directory Synchronization

To synchronize an entire directory, use the following:

To send a directory:

login1$ rsync --rsh=ssh --archive --stats --progress localdir/ \
username@remotehost:/destination/localdir/


login1$ rsync --rsh=ssh --archive --stats --progress localdir \
username@remotehost:/destination/

To retrieve a directory:

login1$ rsync --rsh=ssh --archive --stats --progress \
username@remotehost:/remote/directory/ /some/localdirectory/


login1$ rsync --rsh=ssh --archive --stats --progress \
username@remotehost:/remote/directory /some/

Note the difference in the slashes. The second command will place the files in the directory "/destination/localdir"; the fourth will place them in the directory "/some/directory". rsync is very particular about the placement of slashes. Before running any significant rsync command, add the "--dry-run" option; this makes rsync show what it would do without actually transferring the files.

Synchronization with Deletion

This is very dangerous; a single mistyped character may blow away all of your data. Do not synchronize with deletion if you aren't absolutely certain you know what you're doing.

To have directory synchronization delete files on the destination system that don't exist on the source system:

login1$ rsync --rsh=ssh --archive --stats --dry-run --progress \
--delete localdir/ username@remotehost:/destination/dir/

Note that the above command will not actually delete (or transfer) anything; the "--dry-run" option must be removed from the list of parameters to actually have it work.


BBCP

Use BBCP to transfer large data files without encryption.

bbcp [opt] user@source:/path/to/data user@destination:/path/to/store/data

Possible options include:

  • "-P 2" Give a progress report every 2 seconds
  • "-w 2M" TCP window size of 2MBytes
  • "-s 16" Set the number of streams to 16 (default is 4)

Other options may be necessary if bbcp is not installed in a regular location on either end of the transfer. This can lead to rather complex command lines:

login1$ bbcp -z -T \
"ssh -x -a -oFallBackToRsh=no %I -l %U %H /home/user/Custom/bin/bbcp" \
foobar-5.4.14.tbz "ruser@"

Client Software

scp and sftp Standard Clients

The command-line scp and sftp tools come with any modern distribution of OpenSSH; this is generally installed by default on modern Linux, UNIX, and Mac OS X installs.

Windows clients include PuTTY's pscp and psftp utilities, as well as WinSCP.

Globus (XSEDE users only)

Since direct ssh is not available to them, XSEDE users must use Globus to transfer data from SuperMIC to other XSEDE resources or their own local computers.

In order to use Globus, a user first needs to create a Globus account at the Globus website. Optionally, the user can link the Globus account to the XSEDE user portal account, which enables access to Globus services using XSEDE credentials.

There are two options for transferring data with Globus: a graphical user interface (either in a browser or via the Globus Connect client) and GridFTP (a command-line interface).

Globus GUI

Users can use a browser to transfer data between two Globus endpoints at the Globus data transfer interface. All XSEDE resources have been configured as Globus endpoints.

[Screenshot: transferring data from Comet at SDSC to SuperMIC using the Globus web interface.]

Alternatively, one can use the Globus Connect Personal client to transfer data.

GridFTP (globus-url-copy)

globus-url-copy (see the XSEDE documentation) is a command-line data transfer tool that uses the GridFTP protocol.

The GridFTP endpoint for SuperMIC is:


The example below transfers a file from Comet to SuperMIC:

login1$ globus-url-copy -vb gsi gsi
Source: gsi
Dest:   gsi
2062098422 bytes        58.88 MB/sec avg        61.64 MB/sec inst

Please refer to the GridFTP User Guide for a detailed description of the various options available for globus-url-copy.

Computing Environment

Unix shell

SuperMIC's default shell is bash. Other shells are available: sh, csh, tcsh and ksh. Users may change their default shell by logging into their HPC Profile page at


Modules

SuperMIC uses Environment Modules to add software to the user's environment.

The module environment is the default mechanism for modifying your user environment on SuperMIC. Users who are familiar with the softenv environment on our other clusters should note that softenv is not installed. The following is a guide to managing your software environment with modules.

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes. Complete documentation is available in the module(1) and modulefile(4) manpages.

The default environment is defined in the ".modules" file under each user's home directory. Edit this file if you would like to change the default environment.

Command                      Description
module list                  List the modules that are currently loaded
module avail                 List the modules that are available
module display modulename    Show the environment variables used by modulename and how they are affected
module unload modulename     Remove modulename from the environment
module load modulename       Load modulename into the environment
module swap moduleA moduleB  Replace moduleA with moduleB in the environment

Loading and unloading modules

You may need to remove some modules before loading others. Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich are both loaded, running "module unload intel" will automatically unload mvapich; subsequently issuing "module load intel" does not automatically reload mvapich.

File Systems

User-owned storage on SuperMIC is provided in two directories: home and work. These are separate Lustre (global) file systems, accessible from any node in the system. The home and work directories are created automatically within an hour of first login. If these directories do not exist when you log in, please wait at least an hour before contacting the HPC helpdesk.

Home Directory

The /home file system quota on SuperMIC is 5GB. Files can be stored on /home permanently, which makes it an ideal place for your source code and executables. The /home file system is meant for interactive use such as editing and active code development. Do not use /home for batch job I/O.

Work (Scratch) Directory

The /work volume is meant for the input and output of executing batch jobs, not for long-term storage. We expect files to be copied to other locations or deleted in a timely manner, usually within 30-120 days. For performance reasons, our policy on all volumes is to limit the number of files per directory to around 10,000 and the total number of files to about 500,000.

The /work file system on SuperMIC has no quota. If it becomes over-utilized we will enforce a purge policy: we will begin deleting files, starting with the oldest last-accessed dates and the largest files, and continue until the volume has been reduced below 80% utilization. An email message is sent weekly to users who may have files subject to purge, informing them of their /work utilization. If disk space becomes critically low, more drastic measures may be required to keep the system stable.
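To see which of your own files might be purge candidates, you can look for files with old access times using standard find options. A sketch; the directory and the 60-day threshold are illustrative only, not the actual purge policy, and on SuperMIC you would point find at your own /work directory instead of the throwaway local directory used here:

```shell
# List files whose last access time is more than 60 days ago.
# purge_demo stands in for your scratch directory in this illustration.
mkdir -p purge_demo
touch -a -d '2001-01-01' purge_demo/old.dat   # back-date the access time
touch purge_demo/new.dat

find purge_demo -type f -atime +60            # prints purge_demo/old.dat
```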

Please do not attempt to circumvent the removal process by manually changing file dates. The /work volume capacity is not unlimited, and attempts to circumvent the purge process may adversely affect others and lead to access restrictions to the /work volume or even the cluster.

Project Directory

The /project file system is a quota-controlled space granted via an allocation system that allows large amounts of space to be shared for periods of 6 months or longer. The process is similar to requesting an allocation of system units, but is granted in 100 GB units for 6 months at a time, subject to renewal and demand. Visit the Storage Policy page for more details on who may apply and its intended uses. Qualified individuals may apply for one on the Storage Allocation Request page.

Application Development

The Intel, GNU and Portland Group (PGI) C, C++ and Fortran compilers are installed on SuperMIC and they can be used to create OpenMP, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.

Intel compilers are loaded by default; codes can be compiled according to the following tables:

Intel Compiler Commands

        Serial Codes  MPI Codes  OpenMP Codes   Hybrid Codes
Fortran ifort         mpif90     ifort -openmp  mpif90 -openmp
C       icc           mpicc      icc -openmp    mpicc -openmp
C++     icpc          mpiCC      icpc -openmp   mpiCC -openmp

GNU Compiler Commands

        Serial Codes  MPI Codes  OpenMP Codes       Hybrid Codes
Fortran gfortran      mpif90     gfortran -fopenmp  mpif90 -fopenmp
C       gcc           mpicc      gcc -fopenmp       mpicc -fopenmp
C++     g++           mpiCC      g++ -fopenmp       mpiCC -fopenmp

PGI Compiler Commands

        Serial Codes  MPI Codes  OpenMP Codes  Hybrid Codes
Fortran pgf90         mpif90     pgf90 -mp     mpif90 -mp
C       pgcc          mpicc      pgcc -mp      mpicc -mp
C++     pgCC          mpiCC      pgCC -mp      mpiCC -mp

Default MPI: mvapich2 2.0, compiled with Intel compiler version 14.0.2.

To compile a serial program, the syntax is:

<your choice of compiler> <compiler flags> <source file name> 

For example, the command below compiles the source file "mysource.f90" and generates the executable "myexec".

login1$ ifort -o myexec mysource.f90

To compile an MPI program, the syntax is the same, except that one replaces the serial compiler with an MPI one listed in the tables above:

login1$ mpif90 -o myexec_par my_parallel_source.f90

Coprocessor (MIC) Programming

Intel Xeon Phi cards can be used in a few different ways.

  1. Native: log into the Linux running on a Xeon Phi card and launch applications directly;
  2. Offload: use numerical libraries (e.g. Intel MKL) or compiler directives to perform a portion of the computation on Xeon Phi cards;
  3. Symmetric: run MPI programs with processes running on both the hosts and the Xeon Phi's.

Running Natively

Some applications are well suited for running directly on Xeon Phi coprocessors without offload from a host system. This is also known as running in "native mode."

To run your application natively on Xeon Phi, simply compile with the "-mmic" flag on the host and then ssh to the card to run (for OpenMP applications, include the "-openmp" flag as well):

[user@smic194 ~]$ ifort -openmp -mmic hello.f90 -o hello.mic
[user@smic194 ~]$ ssh mic0
user@smic194p-mic0:~$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/compilers/Intel/composer_xe_2013.5.192/compiler/lib/mic
user@smic194p-mic0:~$ ./hello.mic

Note: the LD_LIBRARY_PATH variable needs to point to the correct path as shown in the example above, otherwise the application will crash.


Offloading

The offloading usage model allows programmers to push a portion of the computation to the Xeon Phi devices. The compiled binary, which contains the executable code for the Xeon Phi, is launched on the host, and the offload sections are automatically executed on the devices. There are two types of offloading, automatic and compiler-assisted, described below.

Automatic Offloading

Some MKL functions (level-3 subroutines in BLAS and factorization functions in LAPACK) support the Automatic Offload (AO) feature. This feature allows programmers to port codes that already use MKL functions to the Xeon Phis without changing them. When an AO-enabled MKL function is called, the MKL runtime determines, based on the problem size and the current status of the devices, how to divide the work between the host CPUs and the Xeon Phis. For example, if the CPU and the two Xeon Phis on a node are unoccupied and the problem size is large enough, the work will be automatically divided among all three devices.

Here is an example of how to compile and run a program with AO-enabled MKL functions (SGEMM in this case):

      [user@smic194 source]$ ifort -mkl -xhost -O3 sgemm.f90 
      [user@smic194 source]$ MKL_MIC_ENABLE=1 OFFLOAD_REPORT=2 ./a.out 
      Computing SGEMM on the host
      [MKL] [MIC --] [AO Function]	SGEMM
      [MKL] [MIC --] [AO SGEMM Workdivision]	0.60 0.20 0.20
      [MKL] [MIC 00] [AO SGEMM CPU Time]	2.546801 seconds
      [MKL] [MIC 00] [AO SGEMM MIC Time]	0.082807 seconds
      [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	31457280 bytes
      [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	5242880 bytes
      [MKL] [MIC 01] [AO SGEMM CPU Time]	2.546801 seconds
      [MKL] [MIC 01] [AO SGEMM MIC Time]	0.094674 seconds
      [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	31457280 bytes
      [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	5242880 bytes
      Enabling Automatic Offload
      Automatic Offload enabled: 2 MIC devices present
      Computing SGEMM with automatic workdivision
      [MKL] [MIC --] [AO Function]	SGEMM
      [MKL] [MIC --] [AO SGEMM Workdivision]	0.60 0.20 0.20
      [MKL] [MIC 00] [AO SGEMM CPU Time]	0.077709 seconds
      [MKL] [MIC 00] [AO SGEMM MIC Time]	0.023630 seconds
      [MKL] [MIC 00] [AO SGEMM CPU->MIC Data]	31457280 bytes
      [MKL] [MIC 00] [AO SGEMM MIC->CPU Data]	5242880 bytes
      [MKL] [MIC 01] [AO SGEMM CPU Time]	0.077709 seconds
      [MKL] [MIC 01] [AO SGEMM MIC Time]	0.024081 seconds
      [MKL] [MIC 01] [AO SGEMM CPU->MIC Data]	31457280 bytes
      [MKL] [MIC 01] [AO SGEMM MIC->CPU Data]	5242880 bytes
      Setting workdivision for device MIC: 0 to 1.0
      Resulting workdivision configuration:
      workdivision[HOST: 0] = -1.0
      workdivision[MIC: 0] =  1.0
      workdivision[MIC: 1] = -1.0
To control and fine-tune the AO feature, one can use either environment variables or support functions provided by Intel MKL. A few examples are listed in the table below.

Intel MKL Automatic Offloading environment variables

Environment Variable  Equivalent Support Function  Purpose
MKL_MIC_ENABLE        mkl_mic_enable               Enable and disable Automatic Offload
                      mkl_mic_set_workdivision     Control work division
MKL_MIC_MAX_MEMORY    mkl_mic_set_max_memory       Control the maximum memory used by Automatic Offload

Details and a list of all the automatic offload controls are available in the MKL User Guide document.

Automatic Offloading from Python

Python is supported on Xeon Phi. We provide a Python build compiled with the Intel compiler, with the numpy/scipy packages linked against MKL. Some numpy/scipy functions are computed through MKL, so that work can be automatically offloaded to Xeon Phi. An example of running Python code with automatic offload is provided below.

login1$ module load python/2.7.10-mkl-mic
login1$ export MKL_MIC_ENABLE=1
login1$ export OFFLOAD_REPORT=2
login1$ python

Load the "python/2.7.10-mkl-mic" module to use the right version of python. If MKL_MIC_ENABLE is equal to 1 and the problem size is large enough, the work is automatically offloaded to Xeon Phis.

Compiler Assisted Offloading

Using pragmas/directives and optional C++ language extensions, programmers can designate code sections to run on Intel Xeon Phi devices. All the setup/teardown, data transfer and synchronization are managed automatically by the compiler and runtime. In code example 1 below, one can see how the pragma (C/C++) or the directive (Fortran) "offload" is used to mark a code section for Xeon Phi devices.

Example 1: Offloaded OpenMP code block with automatic data transfer (on one single node)

 /* C/C++: */
 #pragma offload target(mic)
 #pragma omp parallel for reduction(+:pi)
 for (i=0; i<count; i++) {
   float t = (float)((i+0.5)/count);
   pi += 4.0/(1.0+t*t);
 }
 pi /= count;

 ! Fortran:
 !dir$ offload target(mic)
 !$omp parallel do
 do i=1,10
   A(i) = B(i) * C(i)
 end do

In the example above, the "target" clause tells the compiler where to run this section of code. There are also a number of clauses which control data transfer:

Clause            Syntax                  Semantics
Targets           target(name[:num])      Where to offload
Inputs            in (variable-list)      Copy from host memory to target
Outputs           out (variable-list)     Copy from target to host memory
Inputs & Outputs  inout (variable-list)   Copy both ways
Non-copied data   nocopy (variable-list)  Data that is local to target

When compiling codes with offload sections, no additional flags are necessary as the compiler recognizes the pragmas and directives automatically:

    login1$ ifort -openmp -O3 -xhost -opt-report-phase=offload sgemm.f90 -o sgemm.offload
    sgemm.f90(63-63):OFFLOAD:MAIN__:  Offload to target MIC 1
    Data sent from host to target
        mkl_sgemm_hc_$N_V$55, scalar size 4 bytes
        mkl_sgemm_hc_$A_V$62, dope vector, size 96 bytes, () elements
        mkl_sgemm_hc_$B_V$57, dope vector, size 96 bytes, () elements
        mkl_sgemm_hc_$C_V$6d, dope vector, size 96 bytes, () elements
    Data received by host from target
        mkl_sgemm_hc_$C_V$6d, dope vector, size 96 bytes, () elements

The additional output shown above is generated by the "-opt-report-phase=offload" flag, which reports which variables are offloaded. Programmers can also use "-vec-report=[0-7]" to instruct the vectorizer to generate a diagnostic report.

A code with offload sections can be launched on the host in the usual manner:

    login1$ export OMP_NUM_THREADS=20
    login1$ export KMP_AFFINITY=scatter
    login1$ export MIC_ENV_PREFIX=MIC
    login1$ export MIC_OMP_NUM_THREADS=240
    login1$ export MIC_KMP_AFFINITY=scatter
    login1$ ./sgemm.offload

Here the "OMP_NUM_THREADS" and "MIC_OMP_NUM_THREADS" variables are used to specifiy how many threads to spawn on the host and the device, respectively. The prefix "MIC_" is required if one sets MIC environments while being on the host. Note that 240 is the maximum one can use for MIC_OMP_NUM_THREADS because one core (with four threads) should be left for executing offloading processes.

Example 2: Offloaded OpenMP blocks in MPI-OpenMP hybrid codes (on multiple nodes)

OpenMP code is based on a shared-memory parallel scheme and thus can run only within one node. However, codes combining MPI and OpenMP (usually called hybrid codes) can run on multiple nodes and may achieve higher performance. If the OpenMP blocks in a hybrid code are offloaded to the MIC, the code can be accelerated further. A sample hybrid code with an offloaded OpenMP block is shown below.

    /* C: */
    int main(int argc, char *argv[]) {
        int myrank, nprocs, iam, np;
        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
        MPI_Comm_size( MPI_COMM_WORLD, &nprocs );
        #pragma offload target (mic:myrank) in(start_int,end_int)
        #pragma omp parallel private(iam,np)
        {
            iam = omp_get_thread_num();
            np = omp_get_num_threads();
            printf("Thread %5d of %5d in MPI task %5d of %5d\n",iam,np,myrank,nprocs);
        }
        MPI_Finalize();
        return 0;
    }

    ! Fortran:
    program main
        call mpi_init(ierr)
        call mpi_comm_size(mpi_comm_world,nprocs,ierr)
        call mpi_comm_rank(mpi_comm_world,myrank,ierr)
        !dir$ offload begin target (mic:myrank) in(start_int,end_int)
        !$omp parallel private(iam,np)
        iam = omp_get_thread_num()
        np = omp_get_num_threads()
        write(*,*) iam, myrank, np, nprocs
        !$omp end parallel
        !dir$ end offload
        call mpi_finalize(ierr)
    end program

Before compiling and running codes, remember to switch to Intel MPI (impi). MVAPICH2 also works for this case, but its performance is worse than impi's.

login1$ module switch mvapich2/2.0/INTEL-14.0.2 impi/


login1$ module load impi/

The compilation is the same as that for a normal CPU-based hybrid code:

login1$ mpiicc -openmp name.c -o


login1$ mpiifort -openmp name.f90 -o
Note that the flag "-mmic" is NOT used when doing offloading. Similar to example 1, some environment variables need to be set before running jobs. To run the job in an interactive session, execute the following lines:

    login1$ export OFFLOAD_REPORT=2
    login1$ export MIC_ENV_PREFIX=MIC
    login1$ export MIC_OMP_NUM_THREADS=240
    login1$ mpiexec.hydra -n 4 -machinefile nodefile ./

The nodefile is where you should specify the node names. Its contents should look like:


This means that two nodes are requested and two MPI tasks are distributed to each node. The "mic:myrank" clause in the sample code ensures that each MPI task is offloaded to a different MIC across the two nodes, which maximizes the usage of MIC resources.
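For concreteness, a nodefile matching that description might look like the following; the node names are hypothetical, so use the names assigned to your interactive session:

```
smic001
smic001
smic002
smic002
```

Each node name appears twice, so mpiexec.hydra places two of the four MPI tasks on each node.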

Programming details for offloading can be found in the Intel User Guide.

Symmetric Processing

An MPI application can run tasks on both host CPUs and MICs. This is called symmetric computing because both host CPUs and MICs act as though they are separate nodes and may have MPI processes launched "symmetrically" on both systems, unlike offloading which depends upon the host to distribute work to the MIC. Either pure MPI codes or hybrid MPI-OpenMP codes can run symmetrically.

Currently, MIC-related MPI jobs can only be built with the Intel MPI compilers on SuperMIC. Since the default MPI on SuperMIC is MVAPICH2, users need to switch to Intel MPI before compiling and running codes,

login1$ module switch mvapich2/2.0/INTEL-14.0.2 impi/


login1$ module load impi/

and check the version of the MPI implementation:

login1$  which mpiicc

Then create a CPU binary and a MIC binary, with suffixes ".cpu" and ".mic" respectively:

login1$ mpiicc -openmp name.c -o name.cpu
login1$ mpiicc -openmp -mmic name.c -o name.mic
login1$ mpiifort -openmp name.f90 -o name.cpu
login1$ mpiifort -openmp -mmic name.f90 -o name.mic

Note that the "-openmp" flag is optional. It is necessary for a hybrid code, but unnecessary for a pure MPI code.

Run symmetric jobs with mpiexec.hydra

After obtaining an interactive session with "qsub -I", you can launch jobs from the host side of a compute node. Below is a sample script for running a hybrid job on two nodes, with the MPI tasks distributed symmetrically to the host and the two MICs of each node.

export TASKS_PER_HOST=2  # number of MPI tasks per host
export THREADS_HOST=10   # number of OpenMP threads spawned by each task on the host
export TASKS_PER_MIC=3   # number of MPI tasks per MIC
export THREADS_MIC=80    # number of OpenMP threads spawned by each task on the MIC
export CPU_ENV="-env OMP_NUM_THREADS $THREADS_HOST"  # run-time environments for CPU binary
export MIC_ENV="-env OMP_NUM_THREADS $THREADS_MIC -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH"  # run-time environments for MIC binary

mpiexec.hydra \
-n $TASKS_PER_HOST -host smic022 $CPU_ENV ./name.cpu : \
-n $TASKS_PER_MIC -host smic022p-mic0 $MIC_ENV ./name.mic : \
-n $TASKS_PER_MIC -host smic022p-mic1 $MIC_ENV ./name.mic : \
-n $TASKS_PER_HOST -host smic023 $CPU_ENV ./name.cpu : \
-n $TASKS_PER_MIC -host smic023p-mic0 $MIC_ENV ./name.mic : \
-n $TASKS_PER_MIC -host smic023p-mic1 $MIC_ENV ./name.mic

The node names "smicXXX" should be consistent with the interactive session. For a pure MPI job, just set THREADS_HOST=1 and THREADS_MIC=1. The theoretical maximum number of MPI tasks on a MIC is 61, which equals its number of cores. However, the MIC has only 16 GB of memory, much less than the host's 64 GB, so avoid running too many tasks on the MIC in order to balance the memory cost of each task. The maximum sensible value of TASKS_PER_MIC varies from case to case depending on the user's code, but typically it should be no more than 30.
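With the sample settings above, the total rank count follows from nodes × (TASKS_PER_HOST + 2 × TASKS_PER_MIC); a quick shell sanity check, using the values from the script:

```shell
# Sanity-check the rank count implied by the sample mpiexec.hydra line:
# 2 nodes, each contributing one host and two MICs.
NODES=2
TASKS_PER_HOST=2
TASKS_PER_MIC=3
TOTAL=$(( NODES * (TASKS_PER_HOST + 2 * TASKS_PER_MIC) ))
echo "$TOTAL total MPI ranks"    # prints "16 total MPI ranks"
```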

Run symmetric jobs with micrun.sym

As one can see from the previous section, the command for running symmetric jobs on a large number of nodes can become very messy. As an alternative, users are strongly encouraged to use micrun.sym, a bash script for running symmetric jobs on SuperMIC. Its usage is as simple as a single command line:

micrun.sym -c /path/to/name.cpu -m /path/to/name.mic

Users can pass input parameters to their own program via the -inp option,

micrun.sym -c /path/to/name.cpu -m /path/to/name.mic -inp "par1 par2 par3 ..."

Here is a typical PBS batch script for using micrun.sym.

#PBS -q workq
#PBS -A your_allocation
#PBS -l walltime=01:30:00
#PBS -l nodes=8:ppn=20
#PBS -N sym_test
#PBS -o test.out
#PBS -e test.err

module load impi/  # load impi

export TASKS_PER_HOST=20 # number of MPI tasks per host
export THREADS_HOST=1    # number of OpenMP threads spawned by each task on the host
export TASKS_PER_MIC=30  # number of MPI tasks per MIC
export THREADS_MIC=1     # number of OpenMP threads spawned by each task on the MIC

cd $PBS_O_WORKDIR        # go to where your PBS job is submitted if necessary
micrun.sym -c /path/to/name.cpu -m /path/to/name.mic  # run micrun.sym

Both MICs on each node are automatically utilized by micrun.sym. To maximize the usage of MIC resources on the cluster, users are required to run jobs on both MIC cards of each node when requesting a large number of nodes.

GPU Programming

CUDA Programming

NVIDIA's CUDA compiler and libraries are accessed by loading the CUDA module:

login1$ module load cuda

Use the nvcc compiler on the head node to compile code, and run executables on nodes with GPUs (one of the head nodes has a GPU). SuperMIC's K20X GPUs are compute capability 3.5 devices. When compiling your code, make sure to specify this level of capability with:

nvcc -arch=compute_35 -code=sm_35 ...

GPU nodes are accessible through the gpu queue for production work. GPU nodes are currently not available to XSEDE users, though they may be in the future.

OpenACC Programming

OpenACC is the name of an application program interface (API) that uses a collection of compiler directives to accelerate applications that run on multicore and GPU systems. The OpenACC compiler directives specify regions of code that can be offloaded from a CPU to an attached accelerator. A Quick Start Guide and a User's Guide are available.

Currently, only the Portland Group compilers installed on SuperMIC can be used to compile C and Fortran code annotated with OpenACC directives.

To load the PGI compilers:

login1$ module load portland

To compile a C code annotated with OpenACC directives:

login1$ pgcc -acc -ta=nvidia -Minfo=accel code.c -o code.exe
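Fortran code annotated with OpenACC directives is compiled analogously with the PGI Fortran compiler; the source file name below is a placeholder:

```shell
login1$ pgfortran -acc -ta=nvidia -Minfo=accel code.f90 -o code.exe
```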

The Pittsburgh Supercomputing Center (PSC), in cooperation with the National Institute for Computational Sciences (NICS), the Georgia Institute of Technology (Georgia Tech), and the Internet2 community, periodically presents a workshop on OpenACC GPU programming. Please visit the XSEDE Training Course Calendar for upcoming workshops on OpenACC.

Running your applications

SuperMIC uses TORQUE, an open source version of the Portable Batch System (PBS), together with the Moab scheduler to manage user jobs. Whether you run in batch mode or interactively, you will access the compute nodes using the qsub command as described below. Remember that computationally intensive jobs should be run only on the compute nodes, not the login nodes. More details on submitting jobs and PBS commands can be found below.

SuperMIC Queues

Queue Name  Max Walltime  Max Nodes (per job)  Description
workq       72 hours      128                  Used for jobs that will use at least one node, i.e., nodes>=1:ppn=20. Currently, this queue has a wallclock limit of 72 hours (3 days). Jobs in workq are not preemptable, which means that running jobs will not be disrupted before completion.
checkpt     72 hours      200                  Used for jobs that will use at least one node. Jobs in the checkpt queue can be preempted if needed.

The available queues and actual limit settings can be verified by running the command:

login1$ qstat -q -G 

Job Submission

The qsub command is used to send a batch job to PBS. The basic usage is:

login1$ qsub pbs.script

where "pbs.script" is the script users write to specify their needs. The qsub command also accepts command line arguments, which override those specified in the script. For example, the following command:

login1$ qsub myscript -A my_LONI_allocation2

will direct the system to charge SUs (service units) to the allocation "my_LONI_allocation2" instead of any allocation specified in "pbs.script".

To submit an interactive job, use the "-I" flag to the qsub command along with the options for resources required, for example:

login1$ qsub -I -l walltime=hh:mm:ss,nodes=n:ppn=20 -A allocation_name

Note that you must request whole nodes for an interactive job; specifying anything other than "ppn=20" will cause the job submission to fail. If you need to enable X11 forwarding, add the "-X" flag.

Your PBS submission script should be written in one of the Linux shell scripting languages such as bash, tcsh, csh, or sh, i.e., the first line of your submission script should be something like "#!/bin/bash". The next section of the script should contain PBS directives, followed by the actual commands to run your job. The following is a list of useful PBS directives (which can also be used as command line options to qsub) and environment variables that can be used in the submit script:

#PBS -q queue Submit the job to queue. See SuperMIC Queues. Depending on the cluster, additional allowed values are gpu, lasigma, mwfa, and bigmem.
#PBS -A allocationname Charge the job to your allocation named allocationname.
#PBS -l walltime=hh:mm:ss Request resources to run the job for hh hours, mm minutes and ss seconds.
#PBS -l nodes=m:ppn=n Request n processors on each of m nodes.
#PBS -N jobname Provide a name, jobname, to your job to identify it when monitoring the job with the qstat command.
#PBS -o filename.out Write PBS standard output to the file filename.out.
#PBS -e filename.err Write PBS standard error to the file filename.err.
#PBS -j oe Combine PBS standard output and error into the same file. Note that with this directive you need only the #PBS -o (or #PBS -e) directive, not both.
#PBS -m status Send an email when the job reaches status, where status may be: a = when the job aborts, b = when the job begins, e = when the job ends. The arguments can be combined, e.g., abe will send email when the job begins and either aborts or ends.
#PBS -M email_address Send email to this address once status is triggered.
PBS_O_WORKDIR Environment variable: Directory where the qsub command was executed
PBS_NODEFILE Environment variable: Name of the file that contains a list of the HOSTS provided for the job
PBS_JOBID Environment variable: Job ID number given to this job
PBS_QUEUE Environment variable: Queue job is running in
PBS_WALLTIME Environment variable: Walltime in secs requested
PBS_JOBNAME Environment variable: Name of the job. This can be set using the -N option in the PBS script
PBS_ENVIRONMENT Environment variable: Indicates job type, PBS_BATCH or PBS_INTERACTIVE
PBS_O_SHELL Environment variable: value of the $SHELL variable in the environment in which qsub was executed
PBS_O_HOME Environment variable: Home directory of the user running qsub
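Several of these environment variables are commonly combined inside a job script, e.g. deriving the MPI rank count from PBS_NODEFILE. The sketch below fabricates a node list (with the hypothetical host name smic001) so the pattern can be run outside the scheduler; in a real job, PBS provides the file automatically:

```shell
#!/bin/bash
# In a real job, PBS sets PBS_NODEFILE to a file listing one hostname
# per allocated core. Here we fabricate one for illustration
# (hypothetical host "smic001", 20 cores).
PBS_NODEFILE=$(mktemp)
for i in $(seq 20); do echo smic001; done > "$PBS_NODEFILE"

NPROCS=$(grep -c . "$PBS_NODEFILE")           # total MPI ranks available
NNODES=$(sort -u "$PBS_NODEFILE" | grep -c .) # number of distinct nodes

echo "procs=$NPROCS nodes=$NNODES"
# a real job script would then run, e.g.:
#   mpirun -np $NPROCS -machinefile $PBS_NODEFILE /path/to/your/executable
```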

Workq Queue Job Script Template

    $ cat ~/script
    #PBS -q workq
    #PBS -A your_allocation
    #PBS -l nodes=1:ppn=20
    #PBS -l walltime=HH:MM:SS
    #PBS -o desired_output_file_name
    #PBS -j oe
    # MPI jobs would execute:
    #   mpirun -np 20 -machinefile $PBS_NODEFILE /path/to/your/executable
    # OpenMP jobs would execute:
    #   export OMP_NUM_THREADS=20; /path/to/your/executable

Checkpt Queue Job Script Template

    $ cat ~/script
    #PBS -q checkpt
    #PBS -A your_allocation
    #PBS -l nodes=1:ppn=20
    #PBS -l walltime=HH:MM:SS
    #PBS -o desired_output_file_name
    #PBS -j oe
    # MPI jobs would execute:
    #   mpirun -np 20 -machinefile $PBS_NODEFILE /path/to/your/executable
    # OpenMP jobs would execute:
    #   export OMP_NUM_THREADS=20; /path/to/your/executable

Using qstat

Use the "qstat" command to check the status of PBS jobs.

login1$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
729444.qb2          job1.pbs         ebeigi3                0 Q workq          
729516.qb2          MAY2009_d        skayres         533:14:2 R workq          
729538.qb2          wallret_test222  liyuxiu         67:43:38 R workq          
729539.qb2          wallret_test223  liyuxiu         67:43:39 R workq          
729540.qb2          wallret_test228  liyuxiu         66:49:50 R workq          
729541.qb2          wallret_test231  liyuxiu         64:40:21 R workq          
729542.qb2          wallret_test232  liyuxiu         64:40:15 R workq          
729543.qb2          wallret_test233  liyuxiu         63:18:24 R workq          
729567.qb2          CaPtFeAs         cekuma          00:22:01 R workq

Columns 1-6 show the job's ID, name, owner, CPU time consumed, job status (R = running, Q = queued), and the queue each job is in. qstat also accepts command line arguments; for instance, the following usage gives more detailed information about jobs:

login1$ qstat -a

                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
729444.qb2           ebeigi3  workq    job1.pbs      --      2   1    --  06:30 Q   -- 
729516.qb2           skayres  workq    MAY2009_d    2969     8   1    --  72:00 R 66:45
729538.qb2           liyuxiu  workq    wallret_te  26259     1   1    --  70:00 R 67:44
729539.qb2           liyuxiu  workq    wallret_te   5144     1   1    --  70:00 R 67:44
729540.qb2           liyuxiu  workq    wallret_te  12445     1   1    --  70:00 R 66:50
729541.qb2           liyuxiu  workq    wallret_te   2300     1   1    --  70:00 R 64:41
729542.qb2           liyuxiu  workq    wallret_te   1809     1   1    --  70:00 R 64:41
729543.qb2           liyuxiu  workq    wallret_te   9377     1   1    --  70:00 R 63:19
729567.qb2           cekuma   workq    CaPtFeAs    10562     7   1    --  69:50 R 48:18

Other useful options to qstat:

  • -u username: To display only jobs owned by user username.
  • -n: To display list of nodes that jobs are running on.
  • -q: To summarize resources available to all queues.

Cancel a running job

To cancel a PBS job, enter the following command.

login1$ qdel jobid1 jobid2 ...

Query free nodes

Use the "qfree" command to query free nodes and schedule jobs. qfree shows free nodes in each queue.

login1$ qfree
PBS total nodes: 668,  free: 6,  busy: 629,  down: 33,  use: 94%
PBS workq nodes: 529,  free: 3,  busy: 317,  queued: 2
PBS checkpt nodes: 656,  free: 1,  busy: 312,  queued: 64
(Highest priority job 729767 on queue checkpt will start in 2:34:14)

The command output shows a total of 6 free nodes in PBS, available across the two queues, workq and checkpt.

Estimate a job start time

The "showstart" command can be used to obtain an estimate of the start time of your job. The basic usage is:

login1$ showstart job_id

The following shows a simple example:

login1$ showstart 729767
job 729767 requires 32 procs for 2:00:00:00

Estimated Rsv based start in                 2:33:25 on Tue Dec 17 11:52:32
Estimated Rsv based completion in         2:02:33:25 on Thu Dec 19 11:52:32

Best Partition: base

Please note that the start time listed above is only an estimate. There is no guarantee that the job will start at the estimated time.

Display jobs info

The showq command can be used to display job information within the batch system.

login1$ showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME                     

729538              liyuxiu    Running     8     2:11:44  Sat Dec 14 13:31:32
729539              liyuxiu    Running     8     2:11:44  Sat Dec 14 13:31:32
729607               amani1    Running   256     2:32:44  Mon Dec 16 15:52:32
729609               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729610               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729611               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729613               amani1    Running   256     3:05:19  Mon Dec 16 16:25:07
… truncated …
92 active jobs        5032 of 5064 processors in use by local jobs (99.37%)
.                629 of 633 nodes active      (99.37%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME                     

729767             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
729768             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
729769             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
… truncated …
16 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME                     

0 blocked jobs   

Total jobs:  108

To display job information for a particular queue, use the command

login1$ showq -w class=queuename

Display detailed job state

The checkjob command displays detailed information about the job state. This is very useful if your job is remaining in the queued state, and you'd like to see why PBS hasn't executed it:

login1$ checkjob 729787.qb2
job 729787

AName: null
State: Idle 
Creds:  user:apacheco  group:loniadmin  account:loni_loniadmin1  class:workq  qos:userres
WallTime:   00:00:00 of 2:00:00
SubmitTime: Tue Dec 17 09:22:14
(Time Queued  Total: 00:00:14  Eligible: 00:00:06)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 32

Req[0]  TaskCount: 32  Partition: ALL  

Flags:          INTERACTIVE
Attr:           INTERACTIVE,checkpoint
StartPriority:  141944
available for 8 tasks     - qb[002,007,376]
rejected for Class        - (null)
rejected for State        - (null)
NOTE:  job req cannot run in partition base (available procs do not meet requirements : 24 of 32 procs found)
idle procs:  32  feasible procs:  24

Node Rejection Summary: [Class: 1][State: 667]

This job cannot be started since it requires 4 nodes (32 procs) but only 3 nodes are available.

Display node memory and cpu

Use "qshow jobid" to display memory and CPU usage on the nodes a job is running on. The qshow command is useful for finding out how the resources on the nodes allocated to your job are being consumed. For example, if a job is running slowly due to swapping, this command will show how much memory (physical and virtual) is used on all processors allocated to the job.

login1$ qshow 729731
PBS job: 729731, nodes: 4
Hostname  Days Load CPU U# (User:Process:VirtualMemory:Memory:Hours)
qb373       39 8.93 798 21 lsurampu:mdrun_mpi:88M:31M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 
lsurampu:mdrun_mpi:117M:65M:10.6 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:88M:30M:10.9 lsurampu:mdrun_mpi:88M:30M:10.9 
lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:91M:33M:10.9 
lsurampu:pbs_demux:3M:0M lsurampu:729731:52M:1M lsurampu:mpirun:52M:1M 
lsurampu:mpirun_rsh:6M:1M lsurampu:mpispawn:6M:1M
qb368       39 8.99 798 12 lsurampu:mdrun_mpi:89M:40M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:88M:31M:10.9 lsurampu:mdrun_mpi:89M:32M:10.9 
lsurampu:mdrun_mpi:91M:33M:10.9 lsurampu:mdrun_mpi:95M:37M:10.9 
lsurampu:mdrun_mpi:91M:33M:10.9 lsurampu:mdrun_mpi:112M:50M:10.9 
qb364       39 8.85 800 12 lsurampu:mdrun_mpi:91M:42M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:93M:35M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 
lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 
qb362       39 8.89 802 12 lsurampu:mdrun_mpi:90M:41M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
lsurampu:mdrun_mpi:112M:51M:10.9 lsurampu:mdrun_mpi:89M:32M:10.9 
lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 
PBS_job=729731 user=lsurampu allocation=loni_poly_mic_1 queue=checkpt total_load=32 cpu_hours=320 wall_hours=10 unused_nodes=0 total_nodes=4 avg_load=8

More detailed information on the Torque PBS commands and Moab to schedule and monitor jobs can be found at Adaptive Computing's on-line documentation.
