XStream User Guide
Last update: August 29, 2017

System Overview

XStream is a GPU cluster hosted at the Stanford Research Computing Center and funded by the National Science Foundation's (NSF) Major Research Instrumentation (MRI) Program. Twenty percent of its computational resources are reserved for XSEDE awards.

XStream is a compute cluster specifically designed by Cray for GPU computing, or more precisely, heterogeneous parallel computing with CPUs and GPUs. It differs from traditional CPU-only HPC systems in that it offers almost a petaflop (PF) of GPU compute power. Each of the 65 nodes has 8 NVIDIA K80 cards, i.e. 16 NVIDIA Kepler GPUs, interconnected through PCI-Express PLX-based switches. Each GPU has 12 GB of GDDR5 memory. Compute nodes also feature 2 Intel Ivy Bridge 10-core CPUs, 256 GB of DRAM and about 450 GB of usable local SSD storage. The system features 1.4 petabytes of Lustre storage (Cray Sonexion 1600) delivering 22 GB/s.

The system also includes two login nodes, each with a 10 GigE connection to Stanford's campus backbone, which has 100 Gbps connectivity to various national research and education networks.

XStream was ranked #87 on the June 2015 Top500 list of the fastest supercomputers (using the LINPACK benchmark). With its extreme, near-petaflop GPU computing density, the system was also #6 on the June 2015 Green500 list and moved up to #5 on the November 2015 list.

System Configuration

All XStream nodes run RHEL 6.9 and are managed with batch services through SLURM. The global $WORK storage area is backed by the Lustre parallel distributed file system with 6 I/O servers. Inter-node communication (MPI/Lustre) goes through an FDR Mellanox InfiniBand network.

2 Login Nodes, each with:

  • Two 2.6GHz 6-Core Ivy Bridge-EP E5-2630 v2 Xeon 64-bit Processors
  • Two NVIDIA Tesla K80 GPU cards (4 Kepler GPUs)
  • 64GB DDR3 1600MT/s DRAM
  • 56Gbps Infiniband FDR network interface
  • 10 Gigabit Ethernet network interface
  • 1 Gigabit Ethernet network interface
  • Red Hat Enterprise Linux Server 6.9

65 Compute Nodes, each with

  • Cray CS-Storm 2626X8N Compute Node
  • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 v2 Xeon 64-bit Processors
  • Eight NVIDIA Tesla K80 24GB GPU cards (16 x GK210 12GB GPU)
  • 256GB DDR3 1600MT/s DRAM
  • Three Intel SSD (MLC) in RAID-0 (striped volume) totalling 480GB
  • 56Gbps Infiniband FDR network interface
  • 1 Gigabit Ethernet network interface
  • Red Hat Enterprise Linux Server 6.9

The hardware architecture of the compute nodes is as follows: each CPU socket (PCI root) is connected to PLX switches that tie four K80 cards (eight GPUs) together.

2626X8N Compute Node Diagram

[Figure: XStream CS-Storm "hydra" K80 block diagram]

There are basically two PCI domains, one for each CPU. You can use the "lstopo" command on a compute node (e.g. xs-0001) for full details of the PCI bus.

If you plan on doing GPU peer-to-peer communication, the "nvidia-smi" command on a compute node will show you the GPUDirect topology matrix for the system:

xs-0001$ nvidia-smi topo -m

PIX indicates the lowest-latency path, SOC the highest.

Please note that you need to have a job actually running on a compute node with all 16 GPUs allocated in order to SSH to it and see the full GPUDirect topology matrix for the system.
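
For instance, you could request an interactive job with all 16 GPUs and inspect the topology from there (a minimal sketch; adjust task and time options to your needs):

login1$ srun -n1 --gres gpu:16 --pty bash
xs-0001$ nvidia-smi topo -m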

GPU and Nvidia driver settings

GPUs on the compute nodes have the following static settings:

  • Auto Boost ON
  • Persistent mode
  • Compute Mode: Exclusive Process

To switch your job's GPUs to the Default Compute Mode (shared), use the following Slurm constraint:

#SBATCH -C gpu_shared

While this is usually not recommended, since it can lead to incorrect process/GPU assignment, it is required for some applications, like LAMMPS or AMBER in P2P mode.

Please note that GPUs on the login nodes are configured in Default Compute Mode, meaning that multiple contexts are allowed per device.

As of July 2017, the version of the Nvidia driver is 375.66. Stanford continually updates the driver as it is made available by Nvidia.

Storage

Login and compute nodes are connected to a private Lustre storage system, a Cray Sonexion appliance, with fast I/O performance. This system is capable of providing more than 22 GB/s of sustained Lustre bandwidth over the Infiniband fabric and about 1.4 PB of usable space.

  • 492 x 4 TB SAS hard drives
  • 48 x Lustre Object Storage Target (OST) - each 32 TB
  • 6 x Embedded Lustre Object Storage Server (OSS) Infiniband FDR 56Gbps

A low-performance 3.3TB NFS-mounted /home disk storage is also available.

Filesystems

Home file system

Each user on XStream has a home directory referenced by $HOME with a quota limit of 5GB (not purged). It is a small, low-performance NFS storage space used to keep scripts, binaries, source files, small log files, etc.

The $HOME filesystem is accessible from any node in the system.

The $HOME directory is not intended to be used for computation. The Lustre parallel file system $WORK is much larger and faster, thus much more suited for computation.

Each project has a shared home directory referenced by $GROUP_HOME. Like $HOME, it is a NFS storage space used to store small files shared by all members of your primary POSIX group (usually your primary project).

Note: in $GROUP_HOME, only the owner of the files can delete them.

Important note on home backups: XStream itself does not include a backup system; however, user and group home directories are backed up every night by Stanford Research Computing. Contact the XSEDE helpdesk to recover any lost files. We also recommend that you periodically back up your files outside of XStream.

Work file system

Work is a Lustre file system mounted on /cstor on any node in the system. This parallel file system has multiple purposes:

  • perform fast large I/Os
  • store large computational data files
  • allow multi-node jobs to write coherent files

Each user has a work directory referenced by $WORK with a quota limit of 1TB which is not purged. Each project has a shared work directory referenced by $GROUP_WORK on the same file system with a group quota limit of 50TB. This space is shared by all members of the project.

Note: in $GROUP_WORK, only the owner of the files can delete them.

User and group quota values are not cumulative, i.e. the first limit reached takes precedence.

Important note on work backups: the $WORK parallel file system is neither replicated nor backed up.

Local scratch

A local SSD-based scratch space is available on each compute node (NOT on login nodes). It is made of 3 x Intel SSD (MLC) aggregated using Linux dm-raid for a total of 480 GB per node (447 GB usable) and intended for high IOPS local workload.

To access this local scratch space, please use the $LSTOR or $TMPDIR environment variables. This space will be purged when the compute node reboots or when this space becomes full.
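
For example, a job could stage its input from $WORK to the local SSD scratch, compute there, and copy results back before the node is released (a sketch; "my_app" and the data files are hypothetical):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres gpu:1
#SBATCH --time=30:00

# Stage input from Lustre to the node-local SSD scratch
cp $WORK/input.dat $TMPDIR/
cd $TMPDIR

# Run against the local copy for high-IOPS I/O
srun $WORK/my_app input.dat

# Copy results back to $WORK: local scratch is purged on reboot or when full
cp $TMPDIR/output.dat $WORK/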

System Access

XStream is accessible to XSEDE users with allocations. To obtain an account or submit a proposal through the XSEDE Allocation Request System (XRAS), please read the XSEDE Allocations Overview.

Methods of Access

XSEDE Single Sign-On Hub

The recommended way to access XStream is through XSEDE Single Sign-On Hub:

ssohub$ gsissh xstream

GSI-OpenSSH (gsissh)

Assuming you have the proper XSEDE CA certificates installed on your local machine, the following commands authenticate using the XSEDE myproxy server, then connect to the gsissh port 2222 on XStream:

localhost$ myproxy-logon -l userid -s myproxy.xsede.org
localhost$ gsissh -p 2222 xstream.stanford.xsede.org

When you log in to xstream.stanford.xsede.org, you will be assigned one of the two login nodes: xstream-ln[01-02].stanford.edu. These nodes are identical in both architecture and software environment. Users should normally log in through xstream.stanford.xsede.org, but may specify one node directly if they see poor performance.

Please do NOT use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system. You may, however, use the few GPUs available on the login nodes to perform simple and short tests. Please note that GPUs on the login nodes are configured in Default Compute Mode, meaning that multiple contexts are allowed per device. They are not suitable for performance evaluation.

User Responsibilities

Please note that XStream is not HIPAA compliant and should not be used to process PHI. See https://privacy.stanford.edu/faqs/hipaa-faqs for more information.

Computing Environment

XStream's default and supported shell is bash. Users may still request a shell change (e.g. ksh, zsh or tcsh) by contacting the XSEDE Help Desk.

Modules

Modules provide a convenient way to dynamically change a user's environment through modulefiles. This includes easily adding directories to, or removing them from, the "$PATH" environment variable.

Lmod is used as a replacement to the original module command. For more information, please take a look at the Lmod user guide.

On XStream, modules follow a hierarchical module naming scheme, so only packages that can be directly loaded are displayed by "module avail".

You can list all available modules using the "spider" sub-command:

login1$ module spider

Using the full name will give you details on how to load the module by listing any required dependencies:

login1$ module spider FFTW/3.3.4

Also, you can use "module list" to see currently loaded packages.

Transferring your files to XStream

XStream supports Globus services such as Globus Connect and the "globus-url-copy" utility to transfer files to XStream or between XSEDE sites. Please note that common command-line utilities such as scp, sftp, and rsync are also available to transfer files between XStream and a remote host.

Globus Connect

Globus Connect (formerly Globus Online) is recommended for transferring data between XSEDE sites. Globus Connect provides fast, secure transport via an easy-to-use web interface using pre-defined and user-created "endpoints". XSEDE users automatically have access to Globus Connect via their XUP username/password. Other users may sign up for a free Globus Connect Personal account.

Linux Command-line Data Transfer Utilities

globus-url-copy

XSEDE users may also use Globus's globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect described above, is an implementation of the GridFTP protocol, providing high speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories.

This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. Use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password to obtain a proxy certificate. The proxy is valid for 12 hours for all logins on the local machine. On XStream, the myproxy-logon command is available on the login nodes.

xstream-ln01$ myproxy-logon -T -l XUP_username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy [options] source_url destination_url

where each XSEDE URL will generally be formatted:

gsiftp://gridftp_server/path/to/file

Users may look up XSEDE GridFTP servers on the Data Transfer & Management page.

The following command copies "directory1" from Stanford's XStream to TACC's Stampede system, renaming it "directory2". Note that when transferring directories, the directory path must end with a slash ('/'):

login1$ globus-url-copy -r -vb \
    gsiftp://xstream.stanford.xsede.org:2811/`pwd`/directory1/ \
    gsiftp://gridftp.stampede.tacc.xsede.org:2811/home/0000/johndoe/directory2/

GSI-OpenSSH

Command-line transfer utilities supporting standard SSH and grid authentication are offered by the Globus GSI-OpenSSH implementation of OpenSSH. The gsissh, gsiscp and gsisftp commands are analogous to the OpenSSH ssh, scp and sftp commands. Grid authentication is provided to XSEDE users by first executing the "myproxy-logon" command (see above).

You must explicitly connect to port 2222 on XStream. The following command copies "file1" on your local machine to Stampede renaming it to "file2".

localhost$ gsiscp -oTcpRcvBufPoll=yes -oNoneEnabled=yes \
    -oNoneSwitch=yes -P2222 file1 xstream.stanford.xsede.org:file2

Please consult Globus' GSI-OpenSSH User's Guide for further information.

Software on XStream

This section lists software and libraries available on XStream as of July 2017.

cuDNN 5.1

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. Load it together with either CUDA 7.5 or CUDA 8.0:

login1$ module load CUDA/7.5.18 cuDNN/5.1-CUDA-7.5.18
login1$ module load CUDA/8.0.44 cuDNN/5.1-CUDA-8.0.44

GROMACS 5.1

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.

login1$ module load intel/2015.5.223 CUDA/7.5.18 GROMACS/5.1-hybrid

LAMMPS (17Nov2016)

LAMMPS is a classical molecular dynamics simulation code designed to run efficiently on parallel computers. Many of its models have versions that provide accelerated performance on GPUs.

login1$ module load foss/2015.05 LAMMPS/17Nov2016-CUDA-8.0.44-K80

OpenMM 6.3.1

OpenMM is a high performance toolkit for molecular simulation. On XStream, it is compiled against the foss toolchain (GCC 4.9.2) and CUDA 7.0.28.

login1$ module load foss/2015.05 OpenMM/6.3.1

PostgreSQL 9.5.2 with PG-Strom

PostgreSQL is an open source relational database management system (DBMS) developed by a worldwide team of volunteers.

PG-Strom is an extension of PostgreSQL designed to off-load several CPU-intensive workloads to GPU devices, utilizing their massively parallel execution capability.

login1$ module load foss/2015.05 CUDA/7.5.18 PostgreSQL/9.5.2-Python-2.7.9
login1$ pg_config --version
PostgreSQL 9.5.2
login1$ initdb -D $WORK/postgres/data

Edit "$WORK/postgres/data/postgresql.conf" and add the following line to load the PG-Strom extension:

shared_preload_libraries = '$libdir/pg_strom'

Start the PostgreSQL server:

login1$ pg_ctl -D $WORK/postgres/data -l logfile start

Connect to the "postgres" database using your login name and create the pg_strom extension:

login1$ psql -U $USER postgres
psql (9.5.2)
Type 'help' for help.

postgres=# CREATE EXTENSION pg_strom;
CREATE EXTENSION

Stop the PostgreSQL server:

login1$ pg_ctl -D $WORK/postgres/data stop

Please always use PostgreSQL within a job, not on the login nodes.
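
For example, you could start the server from within an interactive job (a minimal sketch):

login1$ srun -n1 --gres gpu:1 --pty bash
xs-0001$ module load foss/2015.05 CUDA/7.5.18 PostgreSQL/9.5.2-Python-2.7.9
xs-0001$ pg_ctl -D $WORK/postgres/data -l logfile start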

R 3.2.4

R is a free software environment for statistical computing and graphics.

login1$ ml foss/2015.05 R/3.2.4-libX11-1.6.3

RStudio 0.99.893

RStudio IDE is a powerful and productive user interface for R.

login1$ module load foss/2015.05 git/2.4.1 RStudio/0.99.893

TensorFlow 1.1

TensorFlow is an Open Source Software Library for Machine Intelligence, originally developed by Google, with CUDA support. To load TensorFlow 1.1 with Python 2.7, use:

login1$ ml tensorflow/1.1.0

To load TensorFlow 1.1 with Python 3.6, use:

login1$ ml tensorflow/1.1.0-cp36

Note: TensorFlow is a special module that will load the foss toolchain automatically.
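
To quickly check that the module works, you can print the version from Python (a minimal sanity check; run anything heavier inside a job):

login1$ ml tensorflow/1.1.0
login1$ python -c "import tensorflow as tf; print(tf.__version__)"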

Newer versions of TensorFlow should be run in Singularity containers.

Theano 0.9.0

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano 0.9 has support for libgpuarray.

Usage example without MPI support:

login1$ module load foss/2015.05 Theano/0.9.0-Python-2.7.9-noMPI

Usage example with MPI support:

login1$ module load foss/2015.05 Theano/0.9.0-Python-2.7.9

Torch

Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It has support for cuDNN 4.0 and 5.0 and is compiled against the foss toolchain (GCC based).

login1$ module load torch/20160414-cbb5161

Note: Torch is a special module that will load the foss toolchain automatically.

VMD 1.9.2

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.

VMD on XStream is built against CUDA 7.0.28 and Nvidia OptiX 3.8.0, enabling the TachyonL-OptiX GPU-accelerated ray tracing renderer available in VMD 1.9.2. At least the following features should also be available: ACTC library support, collective variables, Python support, Pthreads, NetCDF, ImageMagick, ffmpeg and NetPBM.

Prerequisite: use ssh X11 forwarding by adding "-X" to your ssh command when connecting to XStream.

For computationally inexpensive tasks, you may launch VMD from a login node:

login1$ module load foss/2015.05 VMD/1.9.2-Python-2.7.9
login1$ vmd

To perform computationally expensive tasks with VMD, please launch VMD using srun with the X11 option as shown here (example with 1 task, 4 CPUs and 4 GPUs):

login1$ module load foss/2015.05 VMD/1.9.2-Python-2.7.9
login1$ srun --x11=first -n1 -c4 --gres gpu:4 vmd

Please refer to the Modules section to learn how to search for software on XStream.

Application Development

Compiler toolchains

Compiler toolchains are basically a set of compilers together with libraries that provide additional support that is commonly required to build software. In the HPC world, this usually consists of a library for MPI (inter-process communication over a network), BLAS/LAPACK (linear algebra routines) and FFT (Fast Fourier Transforms).

Compiler toolchains on XStream:

  • foss/2015.05: the FOSS toolchain is a GNU Compiler Collection (GCC) based compiler toolchain, including OpenMPI for MPI support, OpenBLAS (BLAS and LAPACK support), FFTW3 and ScaLAPACK. This version is based on GCC 4.9.2.
  • intel/2015.5.223: Intel Cluster Toolkit Compiler Edition provides Intel C/C++ and Fortran compilers, Intel MPI and Intel MKL. Based on Intel Parallel Studio XE 2015 update 5.

To load the FOSS toolchain:

login1$ module load foss/2015.05

To load the Intel compiler toolchain:

login1$ module load intel/2015.5.223

Note: loading a toolchain makes additional software available to load through module. Check "module avail" once your preferred compiler toolchain is loaded.

Nvidia CUDA

CUDA is essential software for XStream. CUDA is not part of the above compiler toolchains and should be loaded separately. The following versions of CUDA are available on XStream: 6.5.14, 7.0.28, 7.5.18 (default), 8.0.44 and 8.0.61.

For example, to load CUDA 8.0.61, please use:

login1$ module load CUDA/8.0.61

We recommend using "-arch=sm_37" to select the architecture specification for the Nvidia Tesla K80 GPU.
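
For example, to compile a CUDA source file for the K80 (a sketch; "saxpy.cu" is a placeholder for your own source file):

login1$ module load CUDA/8.0.61
login1$ nvcc -arch=sm_37 -o saxpy saxpy.cu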

Running Jobs on XStream

Job Accounting

Computing services for XSEDE users are allocated and charged in Service Units (SUs). On XStream, the rule is simple: 1 SU = 1 GPU hour (GK210 architecture).

XStream SUs charged (GPU hours) = # GPUs * wallclock time
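
For example, a job running on 4 GPUs for 6 hours of wallclock time is charged 4 * 6 = 24 SUs.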

The following job submission rules apply:

  • A job must request at least one GPU, as CPU-only jobs are not allowed on XStream.
  • At most 1 CPU per GPU (or 12,800MB of node memory per GPU) is allowed, with the following exceptions (see the sketch after this list):

    • Half-node exception: if the number of requested GPUs per node is 8 and the "--gres-flags=enforce-binding" option is specified, up to 10 CPUs are allowed (or 128,000MB of memory).
    • Exclusive-node exception: if the number of requested GPUs per node is 16 and "--gres-flags=enforce-binding" is specified, up to 20 CPUs are allowed (or 256,000MB of memory).
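
For example, a full-node request under the exclusive-node exception might use the following parameters (a sketch; adjust tasks and time to your workload):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --gres gpu:16
#SBATCH --gres-flags=enforce-binding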

XStream is a GPU cluster; before using it, please ensure that your codes make heavy use of GPUs, not CPUs.

Project accounts

Service Units are allocated to XSEDE projects. Each project has a corresponding "account" in SLURM of the form p-grant_number_lowercase. For example, the SLURM account for the XSEDE grant CIE160024 will be p-cie160024. The same naming convention is used for POSIX groups. Users in several projects can select the SLURM account to be charged by using the following job parameter (example):

#SBATCH -A p-cie160024

Queues

A single default partition, 'normal', is configured and represents all compute nodes. XStream uses SLURM QoS (Quality of Service) to enforce resource usage limits. The table below shows the current job limits per SLURM QoS:

SLURM QoS         Max CPUs                           Max GPUs                           Max Jobs   Max Nodes              Job time limits
normal (default)  320/user, 400/group                256/user, 320/group                512/user   16/user, 20/group      Default: 2 hours, Max: 2 days
long              20/user, 80/group, 200 max total   16/user, 16/group, 160 max total   4/user     4/user, 64 max total   Default: 2 hours, Max: 7 days

Job schedulers

XStream runs the Simple Linux Utility for Resource Management (SLURM) batch environment and doesn't provide any wrapper commands for now. Please refer to the official SLURM Documentation for more details.

Job submission

SLURM supports a variety of job submission techniques. By accurately requesting the resources you need, you help the scheduler place your job efficiently.

A job consists of two parts: resource requests and job steps. Resource requests specify the number of CPUs and GPUs, the expected duration, the amount of memory, etc. Job steps describe the tasks that must be done and the software that must be run.

The typical way of creating a job is to write a submission script. A submission script is a shell script, e.g. a Bash script, whose first comments, if they are prefixed with "SBATCH", are understood by SLURM as parameters describing resource requests and other submissions options. You can get the complete list of parameters from the sbatch manpage ("man sbatch").

SLURM will ignore all #SBATCH directives that appear after the first blank line, so always put your SBATCH parameters at the top of your batch script.

The script itself is a job step. Other job steps are created with the "srun" command. For instance, the following script, hypothetically named "submit.sh":

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500
#SBATCH --gres gpu:1
#SBATCH --gres-flags=enforce-binding

srun hostname
srun sleep 60

This script requests one task with one CPU and one GPU for 10 minutes, along with 500 MB of RAM, in the default partition. The "--gres-flags=enforce-binding" option ensures the allocated GPU is locally bound to the allocated CPU, avoiding slow QPI traffic. When started, the job first runs the job step "srun hostname", which launches the hostname command on the node on which the requested CPU was allocated. A second job step then runs the "sleep" command.

Once the submission script is written properly, submit it to SLURM with the "sbatch" command, which, upon success, responds with the jobid assigned to the job.

login1$ sbatch submit.sh
Submitted batch job 4011

The job then enters the queue in the PENDING state. Once resources become available and the job has highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state, otherwise, it is set to the FAILED state.

Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with "cat res.txt".

Note that you can create an interactive job with the "salloc" command or by issuing an "srun" command directly.
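
For example (a minimal sketch requesting one GPU for an interactive shell on a compute node):

login1$ srun -n1 --gres gpu:1 --pty bash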

SLURM and GPU IDs

When requesting GPUs with the option --gres gpu:N of srun or sbatch (not salloc), SLURM will set the $CUDA_VISIBLE_DEVICES environment variable to store the GPU IDs that have been allocated to the job. For instance, with --gres gpu:2, depending on the current state of the node's GPUs, $CUDA_VISIBLE_DEVICES could be set to '0,1', meaning that you will be able to use GPU 0 and GPU 1. Most applications automatically detect $CUDA_VISIBLE_DEVICES and run on the allocated GPUs, but some don't and instead let you set GPU IDs explicitly, which you would then need to do manually.
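
For example, you can print the variable from within a job step (a sketch; the exact IDs will vary with the node's state):

login1$ srun -n1 --gres gpu:2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
0,1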

Cancel a job

Use the "scancel jobid" command with the jobid of the job you want to cancel. To cancel all your jobs, type "scancel -u $USER". You can also cancel all your pending jobs, for instance, with "scancel -t PD".

Job information

The "squeue" command shows the list of jobs which are currently running (they are in the RUNNING state, noted as "R") or waiting for resources (noted as "PD").

login1$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 2252    normal     job2     mike PD       0:00      1 (Dependency)
 2251    normal     job1     mike  R 1-16:18:47      1 xs-0022

The above output shows one running job, whose name is job1 and whose jobid is 2251. The jobid is a unique identifier used by many SLURM commands when an action must be taken on one particular job. For instance, to cancel job job1, you would use "scancel 2251". TIME is how long the job has been running so far. NODES is the number of nodes allocated to the job, while the NODELIST column lists the nodes allocated to running jobs. For pending jobs, that column gives the reason why the job is pending.

As with the "sinfo" command (below), you can choose what you want squeue to output with the "--format" argument.
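
For example, to show the job ID, partition, name, state and elapsed time of your own jobs (a sketch using standard squeue format specifiers):

login1$ squeue -u $USER --format="%.10i %.9P %.20j %.2t %.10M"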

To get full details of all your pending or running jobs:

login1$ scontrol show job

To get full details of one particular job:

login1$ scontrol show job jobid

You can get near-realtime information about your program (memory consumption, etc.) with the sstat command.
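
For example (a sketch; replace jobid with the ID of one of your running jobs, possibly with a step suffix such as jobid.batch):

login1$ sstat -j jobid --format=JobID,MaxRSS,AveCPU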

You can get the state of your finished jobs with the "sacct" command:

login1$ sacct
JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4011               test     normal       srcc          1  COMPLETED      0:0
4011.batch        batch                  srcc          1  COMPLETED      0:0
4011.0         hostname                  srcc          1  COMPLETED      0:0
4011.1            sleep                  srcc          1  COMPLETED      0:0

Use the "sacct" command with its many options to interface to the SLURM accounting database. Here is an example of getting memory information of your recent past jobs:

login1$ sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
JobID    JobName   NTasks  NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ ---------- -------- --------- ---------- ---------- ---------- ----------
4011               test            xs-0024
4011.batch        batch        1   xs-0024      1496K    150360K      1496K    106072K
4011.0         hostname        1   xs-0024          0    292768K          0    292768K
4011.1            sleep        1   xs-0024       624K    292764K       624K    100912K

Resource information with "sinfo"

SLURM offers a few commands with many options you can use to interact with the system. For instance, the sinfo command gives an overview of the resources offered by the cluster, while the "squeue" command shows to which jobs those resources are currently allocated.

By default, sinfo lists the partitions that are available. A partition is a set of compute nodes grouped logically. Typical examples include partitions dedicated to batch processing, debugging, post processing, or visualization.

login1$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 2-00:00:00      2  drain xs-[0054,0057]
normal*      up 2-00:00:00      4    mix xs-[0007,0051-0053]
normal*      up 2-00:00:00     48  alloc xs-[0001-0006,0008-0009,0011-0050]
normal*      up 2-00:00:00     11   idle xs-[0010,0055-0056,0058-0065]

In the above example, we see one partition, normal. It is the default partition, as marked with an asterisk. Here, 48 nodes of the normal partition are fully allocated, 4 are in the mix state (partially allocated), 11 are idle (available) and 2 are drained, which means some maintenance operation is taking place.

The "sinfo" command can also output the information in a node-oriented fashion, with the "-N" argument. Along with the "-l" option, it will display more information about the nodes: number of CPUs, memory, temporary disk (also called local scratch space), features of the nodes (such as processor type for instance) and the reason, if any, for which a node is down.

Node characteristics and Generic Resources (GRES): SLURM associates with each node a set of Features and a set of Generic Resources. Features are immutable characteristics of the node (e.g. CPU model, CPU frequency), while Generic Resources are consumable resources, meaning that as users reserve them, they become unavailable to others (e.g. GPUs).

To list all node characteristics including GRES, you can use the following command:

login1$ scontrol show nodes

However, the output of this command is quite verbose. So you can also use sinfo to list GRES of each node using specific output parameters, for example:

login1$ sinfo -o "%10P %8c %8m %11G %5D %N"

PARTITION  CPUS     MEMORY   GRES        NODES NODELIST

test       20       258374   gpu:k80:16  3     xs-[0007,0051,0058]
normal*    20       258374   gpu:k80:16  62    xs-[0001-0006,0008-0050,0052-0057,0059-0065]

On XStream, all compute nodes are identical, so no Features are set; only GRES matter for job allocation, as GPUs are handled there. GRES appear in the form resource:type:count. On XStream, resource is always gpu, type is k80, and count is the number of logical K80 GPUs per node (16).

Tools

Connecting to the compute nodes using ssh is allowed while at least one of your jobs is running there. You then have access to some system tools to debug your program, like "top", "htop" or "strace". Debuggers from the compiler toolchains are also available; don't forget to load the proper compiler toolchain first.

Policies

XStream is for authorized users only and all users are expected to comply with all Stanford computing, network and research policies. For more info, see http://acomp.stanford.edu/about/policy and http://doresearch.stanford.edu/policies/research-policy-handbook.

Also please note that XStream is not HIPAA compliant and should not be used to process PHI. See https://privacy.stanford.edu/faqs/hipaa-faqs for more information.