JHU Rockfish User Guide
Last update: June 15, 2021

Introduction

Rockfish is a community-shared cluster at Johns Hopkins University. It follows the "condominium model" with three main integrated units. The first unit is based on a National Science Foundation (NSF) Major Research Instrumentation grant (#1920103) and other major grants such as DURIP/DoD; the second unit contains medium-size condos (Schools' condos); and the last unit is the collection of condos purchased by individual research groups. All three units share a common base infrastructure, and resources are shared by all users. Rockfish provides resources and tools to integrate traditional High Performance Computing (HPC) with Data Intensive Computing and Machine Learning (ML). As a multi-purpose resource for all fields of science, it provides High Performance and Data Intensive Computing services to Johns Hopkins University, Morgan State University, and XSEDE researchers as a level 2 Service Provider.

Rockfish's compute nodes each have two 24-core Intel Xeon Cascade Lake 6248R processors (3.0 GHz base frequency) and a 1 TB NVMe local drive. The regular and GPU nodes have 192 GB of DDR4 memory, whereas the large-memory nodes have 1.5 TB of DDR4 memory. The GPU nodes also have 4 Nvidia A100 GPUs.


Figure 1. Rockfish System
XSEDE hostname: login.rockfish.jhu.edu

Account Administration

A proposal through the XSEDE Resource Allocation System (XRAS) is required for a research or startup allocation. See XSEDE Allocations for more information about the different types of allocations.

Configuring Your Account

Rockfish uses the bash shell by default. Submit an XSEDE support ticket to request a different shell.

Reset Password: Users may reset a password in two ways:

  1. Email the Rockfish support team to request a password reset.
  2. Connect to the Rockfish portal and enter the email address associated with your account. A link to reset the password will be sent to that address.

Modules

The Rockfish cluster uses Lmod (Lua-based environment modules, version 8.3, developed at TACC) to dynamically manage users' shell environments. "module" commands set, modify, or delete environment variables in support of scientific applications, allowing users to select a particular version of an application or a combination of packages.

The "ml available" command will display (i) the applications that have been compiled using GNU compilers, (ii) external applications like matlab, abaqus, which are independent of the compiler used and (iii) a set of core modules. Likewise, if the Intel compilers are loaded "ml avail" will display applications that are compiled using the Intel compilers.

A set of modules is loaded by default at login time, including Slurm, gcc/9.3.0, and openmpi/3.1. We strongly recommend that users utilize this combination of modules whenever possible for best performance. In addition, several scientific applications are built with dependencies on other modules; users will see a message on the screen when this is the case. For more information type:

login1$ ml spider application/version

For example, if you have the gcc/9.3.0 module loaded and try to load intel-mpi, you will get:

Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "intel-mpi"
Try: "module spider intel-mpi" to see how to load the module(s).

The "ml available" command will also display a letter after the module indicating where it is:

L(oaded), D(efault), g(gpu), c(ontainer)
Command                    Alias / Shortcut       Description
module list                ml                     List modules currently loaded
module avail               ml av                  List all scientific applications and available versions
module show modulename     ml show modulename     Show the environment variables and settings in the module file
module load modulename     ml modulename          Load the module
module unload modulename   mu modulename          Unload the application or module
module spider modulename   ml spider modulename   Show available versions of modulename
module save [name]         ml save [name]         Save the currently loaded modules as the default or a named collection
module swap modulename     ml modulename          Automatically swap versions of modules
module help modulename     ml help modulename     Show additional information about the scientific application
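
As an illustration, a typical module workflow might look like the following (the application names here are examples only):

login1$ ml                         # list currently loaded modules
login1$ ml avail                   # list applications available with the loaded compiler
login1$ ml spider gromacs          # check available versions and how to load them
login1$ ml gcc openmpi gromacs     # load an application together with its toolchain
login1$ ml save my-gromacs         # save this set of modules as a named collection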

System Architecture

Rockfish has three types of compute nodes: "regular memory" (standard) compute nodes (192 GB), large-memory nodes (1,524 GB), and GPU nodes with 4 Nvidia A100 GPUs. All compute nodes have access to three GPFS file sets. Compute nodes and storage are connected by Mellanox HDR100 InfiniBand with a 1.5:1 topology. Rockfish is managed using the Bright Computing cluster management software and the Slurm workload manager for job scheduling.

Table 2. Compute Node Specifications

Regular memory (standard) compute nodes
  Model: Lenovo SD530
  Processors: Intel Xeon Gold Cascade Lake 6248R
  Cores per node: 48
  Number of nodes: 368
  Clock rate: 3.0 GHz
  RAM: 192 GB
  Total number of cores: 17,664
  Local storage: 1 TB NVMe

Large memory nodes
  Model: Lenovo SR630
  Processors: Intel Xeon Gold Cascade Lake 6248R
  Cores per node: 48
  Number of nodes: 10
  Clock rate: 3.0 GHz
  RAM: 1,524 GB
  Total number of cores: 480
  Local storage: 1 TB NVMe

GPU nodes
  Model: Lenovo SR670
  Processors: Intel Xeon Gold Cascade Lake 6248R
  Cores per node: 48
  Number of nodes: 10
  Clock rate: 3.0 GHz
  RAM: 192 GB
  Total number of cores: 480
  GPUs per node: 4 Nvidia A100 (40 GB) PCIe
  Total number of GPUs: 40
  Local storage: 1 TB NVMe

Login Nodes

Rockfish's three login nodes (login01-03) are physical nodes with architecture and features similar to the regular memory compute nodes. Please use the gateway (login.rockfish.jhu.edu) to connect to Rockfish.

Data Transfer Nodes (DTNs)

These nodes can be used to transfer data to and from the Rockfish cluster using secure copy (scp), Globus, or other utilities such as FileZilla. The Globus endpoint is "Rockfish User Data". The DTNs are "rfdtn1.rockfish.jhu.edu" and "rfdtn2.rockfish.jhu.edu". All file systems are mounted and available on these nodes.

Table 3. Systems Software Environment

Software function                   Description
Cluster management                  Bright Cluster Manager
File system management              xCAT / Confluent
Operating system                    CentOS 8.2
File systems                        GPFS, ZFS
Scheduler and resource management   Slurm
User environment                    Lua modules (Lmod)
Compilers                           Intel, GNU, PGI
Message passing                     Intel MPI, OpenMPI, MVAPICH

Table 4. Rockfish File Systems

$HOME
  Quota: 50 GB
  File retention: no file deletion policy
  Backup: backed up to an off-site location
  Features: NVMe file system

$SCRATCH4
  Quota: 10 TB (combined with $SCRATCH16)
  File retention: 30 days; files that have not been accessed for 30 days are moved to the /data file system
  Backup: no
  Features: optimized for small files; 4 MB block size

$SCRATCH16
  Quota: shared with $SCRATCH4
  File retention: same as $SCRATCH4
  Backup: no
  Features: optimized for large files; 16 MB block size

data
  Quota: 10 TB
  File retention: no deletion policy, but quota driven
  Backup: optional
  Features: GPFS file set, lower performance

Accessing the System

Rockfish is accessible only to those users and research groups that have been awarded a Rockfish-specific allocation. XSEDE users can connect to Rockfish in different ways:

Secure Shell

localhost$ ssh [-XY] login.rockfish.jhu.edu -l userid

"login" is a gateway server that will authenticate credentials and then connect the user to one of three physical login nodes (identical to regular compute nodes). Hostname: login.rockfish.jhu.edu (gateway)

XSEDE Single Sign-On (SSO) Hub

XSEDE users can also access Rockfish via the XSEDE Single Sign-On Hub. When reporting a problem to the XSEDE Help Desk, please execute the "gsissh -vvv" command and include the verbose output in your problem description.

Citizenship

You share Rockfish with thousands of other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:

  • Don't run jobs on the login nodes. Login nodes are used by hundreds of users to monitor their jobs, submit jobs, edit and manipulate files and in some cases to compile codes. We strongly request that users abstain from running jobs on login nodes. Sometimes users may want to run quick jobs to check that input files are correct or scientific applications are working properly. If this is the case, make sure this activity does not take more than a few minutes or even better request an interactive session (interact) to fully test your codes.

  • Don't stress the file systems. Do not perform activities that may impact the file systems (and the login nodes), for example running rsync or copying many or large files from one file system to another. Please use Globus or the data transfer nodes (rfdtn1, rfdtn2) to copy large amounts of data.

  • When submitting a help-desk ticket, be as informative as possible.

Login Node Activities

  • Request an interactive session (run "interact -usage" for options):

      login1$ interact -X -p analysis -n 1 -c 1 -t 120 
      c001> ml matlab ; matlab
  • Compile codes, for example by running "make". Be careful when running commands with multiple processes: "make -j 4" may be fine, but "make -j 20" may impact other users.

  • Check jobs with the "sqme" command

  • Edit files, scripts, manipulate files

  • Submit jobs

  • Check output files

What is NOT allowed:

  • Running executables, e.g. "./a.out"
  • Running multiple rsync sessions or copying large numbers of files

Managing Files

Transferring your Files

  1. scp: Secure copy can be used when transferring small amounts of data. We strongly encourage users to use the data transfer nodes instead of the gateway.

    scp [-r] file-name userid@rfdtn1.rockfish.jhu.edu:/path/to/file/dir

  2. rsync: An alternative to scp is "rsync". This command is useful when copying files between file systems or in and out of Rockfish. rsync can also be used to keep file systems in sync as new files are created or existing files are modified.

    login1$ rsync -azvh dir1 /new/path/ 

    or

    login1$ rsync -azvh DIR2 userid@server.ip.address:/path/to/new/dir
  3. Globus: We strongly recommend the use of our managed endpoints via Globus. Rockfish's Globus endpoint is "Rockfish User Data".

Sharing Files with Collaborators

Users are strongly encouraged to use Globus features to share files with internal or external collaborators.

Software

Rockfish provides a broad application base managed by Lua modules. Most commonly used packages in bioinformatics, molecular dynamics, quantum chemistry, structural mechanics, and genomics are available ("ml avail"). Rockfish also supports Singularity containers.

Installed Software

Rockfish uses Lua modules (Lmod). Type "ml avail" to list all the scientific applications that are installed and available via modules.

  • "ml avail" : displays a list of installed applications and corresponding versions.
  • "ml spider APP1" : displays all information on package APP1 (if it is installed).
  • "ml help APP1" : displays any additional information on this scientific application.

Building Software

Users may want to install scientific applications, used only by themselves or by their group, in their HOME directories. In that case, users can create a private module:

  1. Create a directory to install the application: "mkdir -p $HOME/code/APP1"
  2. Install the application following the instructions (README or INSTALL files).
  3. Create a directory in your HOME directory for the module file: "mkdir -p $HOME/modulefiles/APP1"
  4. Create a ".lua" file that adds the application path to your $PATH environment variable, along with any other requirements (library or include paths).
  5. Load the module as "ml own; ml APP1"
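
For illustration, a minimal sketch of steps 3-5, assuming a hypothetical application "APP1" (version 1.0) installed under $HOME/code/APP1 and a personal module tree under $HOME/modulefiles; the paths and names are examples only:

login1$ mkdir -p $HOME/modulefiles/APP1
login1$ cat > $HOME/modulefiles/APP1/1.0.lua << 'EOF'
-- Hypothetical private module for APP1 1.0
help([[APP1 1.0, installed in the user's HOME directory]])
local base = pathJoin(os.getenv("HOME"), "code/APP1")
prepend_path("PATH",            pathJoin(base, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base, "lib"))
prepend_path("CPATH",           pathJoin(base, "include"))
EOF
login1$ ml own          # make personal modules visible
login1$ ml APP1/1.0     # load the private module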

Compilers and recommendations

The Rockfish cluster provides three different compilers for the compute nodes: GNU, Intel, and PGI. There are also MPI libraries (OpenMPI, Intel MPI, and MVAPICH2). Most applications have been built with GNU compilers version 9.3.0. Users should evaluate which compiler gives the best performance for their applications.

The Intel compilers and the Intel MPI and MKL libraries can be loaded by executing the following command:

login1$ ml intel intel-mpi intel-mkl

A standard command to compile a Fortran or C code looks like the following (add as many flags as needed):

login1$ ifort (icc) -O3 -xHOST -o code.x code.f90 (or code.c) 

For the GNU compilers you may want to use (-march=cascadelake may be used instead of -march=native to target the compute nodes explicitly):

login1$ g++ -O3 -march=native -mtune=native -o code.x code.cpp
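
For MPI codes, use the compiler wrappers provided by the loaded MPI module. A minimal sketch with the default GNU/OpenMPI toolchain (file names are placeholders):

login1$ ml gcc openmpi
login1$ mpicc  -O3 -march=native -o mpi_code.x mpi_code.c      # C
login1$ mpif90 -O3 -march=native -o mpi_code.x mpi_code.f90    # Fortran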

Running Jobs

Job Accounting

Rockfish allocations are made in core-hours. The recommended method for estimating your resource needs for an allocation request is to perform benchmark runs. The core-hours used by a job are calculated by multiplying the number of processor cores allocated by the wall-clock duration in hours. Rockfish core-hour estimates should assume that all jobs will run in the regular (defq) partition.

For example: if you request one core on one node for an hour your allocation will be charged one core-hour. If you request 24 cores on one node, and the job runs for one hour, your account will be charged 24 core-hours. For parallel jobs, compute nodes are dedicated to the job. If you request 2 compute nodes and the job runs for one hour, your allocation will be charged 96 core-hours.

Job accounting is independent of the number of processes you actually run on the compute nodes. If you request 2 cores for one hour but run only one process, your allocation will still be charged 2 core-hours.

Charge = Number of cores x wall-time.

Accessing the Compute Nodes

  • Batch jobs: Jobs can be submitted to the scheduler by writing a script and submitting it via the "sbatch" command:

    login1$ sbatch script-file-name

    where script-file-name is a file that contains a set of keywords used by the scheduler to set variables and the parameters for the job. It also contains a set of Linux commands to be executed. See Job Scripts below.

  • Interactive sessions: Users may need to connect to a compute node in interactive mode by using an internal script called "interact". Running "interact -usage" will provide examples and a list of parameters. For example:

    login1$ interact -p defq -n 1 -c 1 -t 120

    Will request an interactive session on the defq queue with one core for 2 hours.

    Alternatively users can use the full command:

    login1$ salloc -J interact -N 1 -n 12 --time=120 --mem=48g -p defq srun --pty bash

    This command will request an interactive session with 12 cores for 120 minutes and 48GB memory for the job (4GB per core).

  • ssh from a login node directly to a compute node. Users may ssh to a compute node where their jobs are running to check or monitor the status of their jobs. Such connections should last only a few minutes.
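
As an illustration, a brief check of a running job from a login node might look like this (the node name c001 is an example taken from the squeue output):

    login1$ squeue -u $USER     # find the node(s) where the job is running
    login1$ ssh c001            # connect to one of those nodes
    c001$ top -u $USER          # check CPU and memory usage, then exit
    c001$ exit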

Slurm Job Scheduler

Rockfish uses Slurm (Simple Linux Utility for Resource Management) to manage resource scheduling and job submission. Slurm is an open-source application with active developers and a growing user community, and it has been adopted by many HPC centers and universities. All users must submit jobs to the scheduler for processing; that is, "interactive" use of login nodes for job processing is not allowed. Users who need to interact with their codes while they are running can request an interactive session using the "interact" script, which submits a request to the queuing system that allows interactive access to a compute node.

Slurm uses "partitions" to divide types of jobs (partitions are called queues on other schedulers). Rockfish defines a few partitions that will allow sequential/shared computing and parallel (dedicated or exclusive nodes), GPU jobs and large memory jobs. The default partition is "defq".

Queues on Rockfish

Queue limits are subject to change. Rockfish will use partitions and resources associated with them to create different types of allocations.

Regular memory allocations allow the use of all the regular compute nodes (currently the defq partition). All jobs submitted to the defq partition are charged against this allocation.

Large memory (LM) allocations will allow the use of the large memory nodes. If a user submits a job to this partition then the LM allocation is charged by default.

Likewise, there is a GPU partition that allows the use of the GPU nodes.

Table 5. Rockfish Production Queues

defq
  Max nodes per job: 368 nodes (48 cores per node)
  Max duration: 72 hours
  Max number of cores (running): 4,800
  Max number running + queued: 9,600
  Charge rate (per node-hour): 1 Service Unit (SU)

bigmem
  Max nodes per job: 10 nodes (1,524 GB per node)
  Max duration: 48 hours
  Max number of cores (running): 144
  Max number running + queued: 288
  Charge rate (per node-hour): 1 SU

a100
  Max nodes per job: 10 nodes (192 GB RAM, 4 Nvidia A100 per node)
  Max duration: 48 hours
  Max number of cores (running): 144
  Max number running + queued: 288
  Charge rate (per node-hour): 1 SU

Job Management

Users can monitor their jobs with the "squeue" command. In this example, user test345 is running two jobs: job 31559 is a parallel job using 4 nodes, and job 31560 is a large-memory job running on node bigmem01.

login1$ squeue -l -u $USER
JOBID PARTITION     NAME     USER    STATE  TIME NODES NODELIST(REASON)
31559      defq Parallel  test345  RUNNING  1:55     4 c[399-402]
31560    lrgmem       LM  test345  RUNNING  0:31     1 bigmem01

Users can also invoke a script, "sqme", to monitor jobs:

login1$ sqme

To cancel a job, use the "scancel" command followed by the job ID. For example, "scancel 31560" will cancel the LM job for user test345 in the example above.
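
Completed jobs can be reviewed with Slurm's standard accounting command, assuming accounting records are available on Rockfish (the job ID below is taken from the example above):

login1$ sacct -j 31559 --format=JobID,JobName,Partition,Elapsed,State,ExitCode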

Sample Job Scripts

The following scripts are examples for different workflows. Users can modify them according to the resources needed to run their applications.

MPI Jobs

This job will run on 5 nodes, each with 48 processes/cores, for a total of 240 MPI processes.

#!/bin/bash
#SBATCH --job-name=MPI-job
#SBATCH --time=5:0:0
#SBATCH --partition=defq
#SBATCH -N 5
#SBATCH --ntasks-per-node=48
#SBATCH -A My-Account

module purge        
ml intel intel-mpi #load Intel compiler and Intel MPI libraries

mpirun ./my-mpi-code.x < my-input-file > My-output.log

OpenMP/Threaded Jobs

This script will run a small job that creates 8 threads. It will use the default time of 1:00:00 (one hour).

#!/bin/bash
#SBATCH --job-name=my-openmp-job
#SBATCH -p defq
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=4GB
#SBATCH -A My-Account
#SBATCH --export=ALL

ml purge
ml gcc
export OMP_NUM_THREADS=8
time ./a.out > My-output.log

Hybrid (MPI + OpenMP)

This script will run a hybrid job (GROMACS) on two nodes; each node will have 8 MPI processes, each with 6 threads.

#!/bin/bash
#SBATCH --job-name=Hybrid
#SBATCH --time=4:0:0
#SBATCH --partition=defq
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH -A My-Account
#SBATCH -o Hybrid-%J.log
#SBATCH --export=ALL

ml purge
ml gcc openmpi
ml hwloc boost
ml gromacs/2016-mpi-plumed

export OMP_NUM_THREADS=6   # 6 OpenMP threads per MPI rank (matches --cpus-per-task)

mpirun -np 16 gmx_mpi mdrun -deffnm [options…]

GNU parallel

This sample uses GNU parallel to run 48 serial Gaussian tasks on one node (the "-j" flag controls how many run concurrently). The job directs Gaussian scratch files to the scratch16 file system.

#!/bin/bash -l
#SBATCH --time=2:0:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48
#SBATCH --partition=defq          
#SBATCH --export=ALL

module load gaussian
mkdir -p /scratch16/$USER/$SLURM_JOBID
export GAUSS_SCRDIR=/scratch16/$USER/$SLURM_JOBID

ml parallel

# my-list is a file that contains the names of the 48 input files, one per line.
cat my-list | parallel -j 20 --joblog LOGS "g09 {}"
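
For reference, the input list can be created beforehand. A minimal sketch, assuming the Gaussian input files end in ".com" (the extension is an example, not a site convention):

login1$ ls *.com > my-list     # one input file name per line
login1$ wc -l my-list          # confirm the expected number of inputs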

Parametric / Array / HTC jobs

This script is an example of running a set of 5,000 jobs, with at most 480 running at a time. The input files are in a directory ($workdir). A temporary directory ($tmpdir) is created in "scratch", where each job runs. At the end of each run the temporary directory is deleted.

#!/bin/bash -l
#SBATCH --job-name=small-array
#SBATCH --time=4:0:0
#SBATCH --partition=defq
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=4G
#SBATCH --array=1-5000%480

# load modules and check
ml purge
module load intel
ml

# set variable "file" to read all the files in $workdir 
# (zmatabcde where "abcde" goes from 00001 to 05000) and 
# assign them to the job array
file=$(ls zmat* | sed -n ${SLURM_ARRAY_TASK_ID}p)
echo $file

# get the number for each file (abcde)
newstring="${file:4}"
export basisdir=/scratch16/jcombar1/LC-tests
export workdir=/scratch16/jcombar1/LC-tests
export tmpdir=/scratch16/jcombar1/TMP/$SLURM_JOBID
export PATH=/scratch16/jcombar1/LC/bin:$PATH
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
mkdir -p $tmpdir
cd $tmpdir

# run your job
cp $workdir/$file   ZMAT
cp $basisdir/GENBAS GENBAS
 ./a.out > $workdir/out.$newstring
cd ..
rm -rf $tmpdir
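
Once submitted, individual array tasks can be monitored and cancelled with standard Slurm syntax (the job ID below is illustrative):

login1$ squeue -u $USER          # array tasks appear as JOBID_TASKID
login1$ scancel 31561_100        # cancel a single task of the array
login1$ scancel 31561            # cancel the entire array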

Bigmem (LM) Jobs

This script will run a job that needs large amounts of memory. Users need a special resource allocation (bigmem). It will use the default time 1:00:00 (one hour).

#!/bin/bash
#SBATCH --job-name=my-bigmem-job
#SBATCH -p bigmem
#SBATCH -N 1
#SBATCH --ntasks-per-node=48
#SBATCH -A My-Account_bigmem    ###   this flag is required
#SBATCH --export=ALL

ml purge
ml intel

time ./big-mem.x  > My-output.log

GPU Jobs (a100 partition)

This script will run a job that uses all 4 Nvidia A100 GPUs on a node. Users need a special resource allocation (gpu). It will use the default time of 1:00:00 (one hour).

#!/bin/bash
#SBATCH --job-name=my-gpu-job
#SBATCH -p a100
#SBATCH -N 1
#SBATCH --ntasks-per-node=48
#SBATCH --gres=gpu:4
#SBATCH -A My-Account_gpu    ###   this flag is required
#SBATCH --export=ALL

ml purge
ml intel

time ./gpu-code.x  > My-output.log
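
Jobs that need only one GPU should request a proportional share of the node rather than all four GPUs. A minimal sketch (the core count is an illustrative assumption, not site policy):

#!/bin/bash
#SBATCH --job-name=my-1gpu-job
#SBATCH -p a100
#SBATCH -N 1
#SBATCH --ntasks-per-node=12     # roughly one quarter of the node's 48 cores
#SBATCH --gres=gpu:1             # request a single A100
#SBATCH -A My-Account_gpu

ml purge
ml intel

time ./gpu-code.x > My-output.log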

Help

Please visit the XSEDE help desk for important contact information. When submitting a support ticket, please include:

  • a complete description of the problem, with accompanying screenshots if applicable
  • any paths to job scripts or input/output files
  • the name of the login node, if you are having problems on a login node