University of Delaware DARWIN User Guide
Last update: May 13, 2021

Getting started on DARWIN

DARWIN (Delaware Advanced Research Workforce and Innovation Network) is a big data and high-performance computing system designed to catalyze Delaware research and education, funded by a $1.4 million grant from the National Science Foundation (NSF). This award establishes DARWIN as an XSEDE Level 2 Service Provider in Delaware, contributing 20% of DARWIN's resources to XSEDE (the Extreme Science and Engineering Discovery Environment). DARWIN has 105 compute nodes with a total of 6,672 cores, 22 GPUs, 100 TB of memory, and 1.2 PB of disk storage. See the architecture descriptions for complete details.

Figure 1. DARWIN Configuration

Architecture

The DARWIN cluster is set up to be very similar to the existing Caviness cluster and will be familiar to those currently using Caviness. However, DARWIN is an NSF-funded HPC resource available via a committee-reviewed allocation request process similar to XSEDE allocations.

An HPC system always has one or more public-facing systems known as login nodes. The login nodes are supplemented by many compute nodes which are connected by a private network. One or more head nodes run programs that manage and facilitate the functioning of the cluster. (In some clusters, the head node functionality is present on the login nodes.) Each compute node typically has several multi-core processors that share memory. Finally, all the nodes share one or more file systems over a high-speed network.

Login Nodes

Login (head) nodes are the gateway into the cluster and are shared by all cluster users. Their computing environment is a full standard variant of Linux configured for scientific applications. This includes command documentation (man pages), scripting tools, compiler suites, debugging/profiling tools, and application software. In addition, the login nodes have several tools to help you move files between the HPC file systems and your local machine, other clusters, and web-based services.

Login nodes should be used to set up and submit job workflows and to compile programs. You should generally use compute nodes to run or debug application software or your own executables.

If your work requires highly interactive graphics and animations, these are best done on your local workstation rather than on the cluster. Use the cluster to generate files containing the graphics information, and download them from the HPC system to your local system for visualization.

When you use SSH to connect to darwin.hpc.udel.edu your computer will choose one of the login (head) nodes at random. The default command line prompt clearly indicates to which login node you have connected: for example, [bjones@login00.darwin ~]$ is shown for account bjones when connected to login node login00.darwin.hpc.udel.edu.

Only use SSH to connect to a specific login node if you have existing processes present on it, for example, if you used the screen or tmux utility to preserve your session after logout.
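For instance, to return to a tmux session you left running on login00 (hostname taken from the prompt example above; the tmux session itself is hypothetical), reconnect to that node directly:

$ ssh bjones@login00.darwin.hpc.udel.edu
$ tmux attach     # reattach to the existing session on that login node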

Compute Nodes

There are many compute nodes with different configurations. Each node consists of multi-core processors (CPUs), memory, and local disk space. Nodes can have different OS versions or OS configurations, but this document assumes all the compute nodes have the same OS and almost the same configuration. Some nodes may have more cores, more memory, GPUs, or more disk.

All compute nodes are now available and configured for use. Each compute node has 64 cores, so the compute resources available at initial early access are summarized in the table below.

The standard Linux installation on the compute nodes is configured to support just the running of your jobs, particularly parallel jobs. For example, there are no man pages on the compute nodes. All compute nodes do, however, have full development headers and libraries.

Commercial applications, and normally your programs, will use a layer of abstraction called a programming model. Consult the cluster specific documentation for advanced techniques to take advantage of the low level architecture.

Compute Node        Number of Nodes  Total Cores  Memory per Node          Total Memory  Total GPUs
Standard            48               3,072        512 GiB                  24 TiB        --
Large Memory        32               2,048        1,024 GiB                32 TiB        --
Extra-Large Memory  11               704          2,048 GiB                22 TiB        --
nVidia-T4           9                576          512 GiB                  4.5 TiB       9
nVidia-V100         3                144          768 GiB                  2.25 TiB      12
AMD-MI50            1                64           512 GiB                  0.5 TiB       1
Extended Memory     1                64           1,024 GiB + 2.73 TiB 1)  3.73 TiB      --
Total               105              6,672                                 88.98 TiB     22

1) 1,024 GiB of system memory plus 2.73 TiB of swap on high-speed NVMe storage.

File Systems

Home

Each DARWIN user receives a home directory (/home/<uid>) that will remain the same during and after the early access period. This storage is slower than the Lustre file system and has a quota of 20 GiB. It should be used for personal software installs and shell configuration files.

Lustre High-Performance

Lustre is designed to use parallel I/O techniques to reduce file-access time. The Lustre file systems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. There is approximately 1.1 PiB of Lustre storage available on DARWIN. It uses high-bandwidth interconnects such as Mellanox HDR100. Lustre should be used for storing input files, supporting data files, work files, and output files associated with computational tasks run on the cluster.

  • Each allocation will be assigned a workgroup storage in the Lustre directory (/lustre/workgroup/).

  • Each workgroup storage will have a users directory (/lustre/workgroup/users/uid) for each user of the workgroup to be used as a personal directory for running jobs and storing larger amounts of data.

  • Each workgroup storage will have a software and VALET directory (/lustre/workgroup/sw/ and /lustre/workgroup/sw/valet) to allow users of the workgroup to install software and create VALET package files that need to be shared by others in the workgroup (see the layout sketch after this list).

  • There will be a quota limit set based on the amount of storage approved for your allocation for the workgroup storage.
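As a rough illustration of this layout, using the it_css workgroup and uid 1201 examples that appear elsewhere in this guide, workgroup storage would look something like:

/lustre/it_css/                workgroup storage root
/lustre/it_css/users/1201/     personal directory for running jobs and storing data
/lustre/it_css/sw/             shared software installs for the workgroup
/lustre/it_css/sw/valet/       shared VALET package files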

While all file systems on the DARWIN cluster utilize hardware redundancies to protect data, there is no backup or replication and no recovery available for the home or Lustre file systems.

Local

Each node has an internal, locally connected disk. Its capacity is measured in terabytes. Each compute node on DARWIN has a 1.75 TiB SSD local scratch file system disk. Part of the local disk is used for system tasks such as memory management, which might include cache memory and virtual memory. The remainder of the disk is ideal for applications that need a moderate amount of scratch storage for the duration of a job's run. That portion is referred to as the node scratch file system.

Each node scratch file system disk is only accessible by the node in which it is physically installed. The job scheduling system creates a temporary directory associated with each running job on this file system. When your job terminates, the job scheduler automatically erases that directory and its contents.

Software

A list of installed software that IT builds and maintains for DARWIN users can be found by logging into DARWIN and using the VALET command vpkg_list.

There will not be a full set of software during early access and testing, but we will be continually installing and updating software. Installation priority will go to compilers, system libraries, and highly utilized software packages. Please DO let us know if there are packages that you would like to use on DARWIN, as that will help us prioritize user needs, but understand that we may not be able to install software requests in a timely manner.

Users will be able to compile and install software packages in their home or workgroup directories. There will be very limited support for helping with user-compiled installs or debugging during early access. Please reference basic software building and management to get started with software installations utilizing VALET (versus Modules), as suggested and used by IT RCI staff on our HPC systems.

Please review the following documents if you are planning to compile and install your own software.

  • High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors guide for getting started tuning AMD 2nd Gen EPYC™ Processor based systems for HPC workloads. This is not an all-inclusive guide and some items may have similar, but different, names in specific OEM systems (e.g. OEM-specific BIOS settings). Every HPC workload varies in its performance characteristics. While this guide is a good starting point, you are encouraged to perform your own performance testing for additional tuning. This guide also provides suggestions on which items should be the focus of additional, application-specific tuning (November 2020).
  • HPC Tuning Guide for AMD EPYC™ Processors guide intended for vendors, system integrators, resellers, system managers and developers who are interested in EPYC system configuration details. There is also a discussion on the AMD EPYC software development environment, and we include four appendices on how to install and run the HPL, HPCG, DGEMM, and STREAM benchmarks. The results produced are ‘good' but are not necessarily exhaustively tested across a variety of compilers with their optimization flags (December 2018).
  • AMD EPYC™ 7xx2-series Processors Compiler Options Quick Reference Guide, however we do not have the AOCC compiler (with Flang - Fortran Front-End) installed on DARWIN.

Scheduler

DARWIN will be using the Slurm scheduler like Caviness; Slurm is also the most common scheduler among XSEDE resources. Slurm on DARWIN is configured as fair share, with each user being given equal shares to access the HPC resources currently available on DARWIN.

Queues (Partitions)

Partitions have been created to align with allocation requests moving forward based on the different node types. There is no default partition, and you must specify exactly one partition per job. It is not possible to specify multiple partitions in Slurm to span different node types.

Run Jobs

In order to schedule any job (interactively or batch) on the DARWIN cluster, you must set your workgroup to define your cluster group. Each research group has been assigned a unique workgroup. Each research group should have received this information in a welcome email. For example,

workgroup -g it_css

will enter the workgroup for it_css. You will know if you are in your workgroup based on the change in your bash prompt. See the following example for user bjones

[bjones@login00.darwin ~]$ workgroup -g it_css
[(it_css:bjones)@login00.darwin ~]$ printenv USER HOME WORKDIR WORKGROUP WORKDIR_USER
bjones
/home/1201
/lustre/it_css
it_css
/lustre/it_css/users/1201
[(it_css:bjones)@login00.darwin ~]$

Now we can use salloc or sbatch, as long as a partition is also specified, to submit an interactive or batch job respectively. See the DARWIN Run Jobs, Schedule Jobs and Managing Jobs wiki pages for more help about Slurm, including how to specify resources and check on the status of your jobs.
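As a minimal sketch, assuming your allocation has access to the standard partition (substitute whichever partition was granted to your allocation), an interactive session or a batch submission might look like:

$ workgroup -g it_css
$ salloc --partition=standard --ntasks=1 --time=30:00     # interactive session on a compute node
$ sbatch --partition=standard submit.qs                   # batch job using a job script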

All resulting executables (created via your own compilation) and other applications (commercial or open-source) should only be run on the compute nodes.

It is a good idea to periodically check in /opt/shared/templates/slurm/ for updated or new templates to use as job scripts to run generic or specific applications designed to provide the best performance on DARWIN.
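For example, to browse the templates and copy one into your workgroup user directory as a starting point (the subdirectory and file names below are illustrative; check what is actually present):

$ ls /opt/shared/templates/slurm/
$ cp /opt/shared/templates/slurm/generic/serial.qs $WORKDIR_USERS/project/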

Help

XSEDE allocations

To report a problem or provide feedback, submit a help desk ticket on the XSEDE Portal: complete the form, selecting darwin.udel.xsede.org as the system, and put your problem details in the description field to route your question more quickly to the research support team. Provide enough details (including full paths of batch script files, log files, or important input/output files) that our consultants can begin to work on your problem without having to ask you basic initial questions.

Ask or tell the HPC community

hpc-ask is a Google group established to stimulate interactions within UD's broader HPC community and is based on members helping members. This is a great venue to post a question about HPC, start a discussion, or share an upcoming event with the community. Anyone may request membership. Messages are sent as a daily summary to all group members. This list is archived, public, and searchable by anyone.

Publication and Grant Writing Resources

Please refer to the NSF award information for a proposal or publication to acknowledge use of or describe DARWIN resources.

XSEDE Allocations

For XSEDE allocations on DARWIN, a PI may use the XSEDE user portal to add or remove accounts for an active allocation on DARWIN, as long as the person to be added has an XSEDE user portal account. If they do not, they should visit the XSEDE user portal and click the "Create Account" button. The person will need to share their XSEDE user portal username with the PI to be added. All logins to DARWIN will be done via the XSEDE Single Sign-On Hub.

Accounts

An XSEDE username will be assigned having the form xsedeuuid. The uid is a unique, 4-digit numerical identifier assigned to you. Password-based authentication is not supported for XSEDE users on DARWIN; however, you may set up direct SSH key authentication on DARWIN once you have logged on to DARWIN (via gsissh darwin) by adding the desired public key to your authorized_keys file in the .ssh directory inside your home directory. Once you have an SSH key set up this way, you can access DARWIN directly by running ssh darwin.hpc.udel.edu or by using a PuTTY saved session if it is configured with the private key. See connecting for XSEDE allocation users for more details.

For example,
$ hpc-user-info -a xsedeu1201
full-name = Student Training
last-name = Student Training
home-directory = /home/1201
email-address = traine@udel.edu
clusters = DARWIN
Command                     Function
hpc-user-info -a username   Display info about a user
hpc-user-info -h            Display complete syntax

Groups

The allocation groups of which you are a member determine which computing nodes, job queues, and storage resources you may use. Each group has a unique descriptive group name (gname). There are two categories of group names: class and workgroup.

The class category: All users belong to the group named everyone.

The workgroup category: Each workgroup has a unique group name (e.g., xg-tra180011) assigned for each allocation. The PI and users are members of that allocation group (workgroup). To see the usernames of all members of the workgroup, type the hpc-group-info -a allocation_workgroup command.

Use the command groups to see all of your groups. The example below is for user xsedeu1201:

$ groups
everyone xg-tra180011

For example, the command below will display the complete information about the workgroup xg-tra180011 and its members.

$ hpc-group-info -a xg-tra180011
name = xg-tra180011
gid-number = 1120
description = Student Training
member = xsedeu1201; Student Training; traine@udel.edu

In the output, the description field identifies the PI, and each member line lists that member's account information (username, full name, email address).

Connecting to DARWIN

Secure shell program (SSH)

Use a secure shell program/client (SSH) to connect to the cluster and a secure file transfer program to move files to and from the cluster.

There are many suitable secure clients for Windows, Mac OS X, and UNIX/Linux. We recommend PuTTY and Xming for Windows users. Macintosh and UNIX/Linux users can use their pre-installed SSH and X11 software. (Newer versions of Mac OS X may not have a current version of X11 installed. See the Apple web site for X11 installation instructions.)

IT strongly recommends that you configure your clients as described in the online X-windows (X11) and SSH documents (Windows / Linux/MacOSX).

Your HPC home directory has a .ssh directory. Do not manually erase or modify the files that were initially created by the system. They facilitate communication between the login (head) node and the compute nodes. Only use standard ssh commands to add keys to the files in the .ssh directory.

Please refer to the Windows and Mac/Linux related sections for specific details on using the command line on your local computer.

Logging on to DARWIN

You need a DARWIN account to access the login node and it is very important to review the information about cluster accounts and cluster groups before connecting.

To learn about launching GUI applications on DARWIN, refer to Schedule Jobs page.

XSEDE users with an allocation award on DARWIN should use the XSEDE Single Sign-On (SSO) Hub to access the system initially. Once you have logged in to the XSEDE SSO Hub, you can access DARWIN by running:

$ gsissh darwin

or, if you need to use X-Windows applications requiring X11 forwarding, then use

$ gsissh -Y darwin

Password-based authentication is not supported for XSEDE users; however, you may set up direct SSH key authentication on DARWIN once you are on the system by adding the desired public key to your authorized_keys file in the .ssh directory inside your home directory. Once you have an SSH key set up this way, you can access DARWIN directly by running ssh darwin.hpc.udel.edu. You will need to do this if you need X11 tunneling and/or forwarding, which is required for applications such as Jupyter Notebook.

Adding the desired public key needs to be done slightly differently from most documentation, because password-based authentication is not supported and thus you cannot use an SSH forwarding agent (e.g., ssh-add -L) to populate .ssh/authorized_keys on DARWIN. Instead, display the public key generated on your local computer and copy it to your clipboard. On Windows, use the PuTTY Key Generator: copy the box at the top labeled "Public key for pasting into OpenSSH authorized_keys file:" while generating, or Load an existing private key file to display the public key to copy. On a Mac, look in the .ssh directory and cat id_rsa.pub. Once you are logged on to DARWIN using gsissh from the XSEDE SSO Hub, open an editor such as nano .ssh/authorized_keys, paste the public key you copied from your local computer on a new line at the end of the file, and save it.
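A condensed sketch of that process from a Mac or Linux machine (the key file name assumes the default id_rsa key; adjust as needed):

$ cat ~/.ssh/id_rsa.pub        # on your local computer: display and copy the public key
$ gsissh darwin                # from the XSEDE SSO Hub: log on to DARWIN
$ nano ~/.ssh/authorized_keys  # on DARWIN: paste the key on a new line and save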

Once you are logged into DARWIN, your account is configured as a member of an allocation workgroup which determines access to your HPC resources on DARWIN. Setting your allocation workgroup is required in order to submit jobs to the DARWIN cluster. For example, the traine account is a member of the it_css workgroup. To start a shell in the it_css workgroup, type:

$ workgroup -g it_css

Consult the following pages for detailed instructions on connecting to DARWIN.

File Systems

Home

The 13.5 TiB file system uses 960 GiB enterprise class SSD drives in a triple-parity RAID configuration for high reliability and availability. The file system is accessible to all nodes via IPoIB on the 100 Gbit/s InfiniBand network.

Storage

Each user has 20 GB of disk storage reserved for personal use on the home file system. Users' home directories are in /home (e.g., /home/1005), and the directory name is put in the environment variable $HOME at login.

High-Performance Lustre

Lustre is designed to use parallel I/O techniques to reduce file-access time. The Lustre file systems in use at UD are composed of many physical disks using RAID technologies to give resilience, data integrity, and parallelism at multiple levels. There is approximately 1.1 PiB of Lustre storage available on DARWIN. It uses high-bandwidth interconnects such as Mellanox HDR100. Lustre should be used for storing input files, supporting data files, work files, and output files associated with computational tasks run on the cluster.

Consult All About Lustre for more detailed information.

Workgroup Storage

Allocation workgroup storage is available on a high-performance Lustre-based file system having almost 1.1 PB of usable space. Users should have a basic understanding of the concepts of Lustre to take full advantage of this file system. The default stripe count is set to 1 and the default striping is a single stripe distributed across all available OSTs on Lustre. See Lustre Best Practices from NASA.

Each allocation will have at least 1 TiB of shared (workgroup) storage in the /lustre/ directory identified by the «allocation_workgroup» (e.g., /lustre/it_css), accessible by all users in the allocation workgroup. It is referred to as your workgroup directory ($WORKDIR), if the allocation workgroup has been set.

Each user in the allocation workgroup will have a /lustre/«workgroup»/users/«uid» directory to be used as a personal workgroup storage directory for running jobs, storing larger amounts of data, input files, supporting data files, work files, output files and source code. It can be referred to as $WORKDIR_USERS, if the allocation workgroup has been set.

Each allocation will also have a /lustre/«workgroup»/sw directory to allow users to install software to be shared with the allocation workgroup. It can be referred to as $WORKDIR_SW, if the allocation workgroup has been set. In addition, a /lustre/«workgroup»/sw/valet directory is provided to store VALET package files to be shared with the allocation workgroup.
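As a hedged sketch of a shared install into this directory (the package name, version, and build system are assumptions, not an IT-provided recipe):

$ workgroup -g it_css
$ mkdir -p $WORKDIR_SW/mycode/1.0
$ cd mycode-1.0                                 # hypothetical unpacked source directory
$ ./configure --prefix=$WORKDIR_SW/mycode/1.0
$ make && make install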

Please see workgroup for complete details on environment variables.

Note: A full file system inhibits use for everyone, preventing jobs from running.

Local/Node File System

Temporary Storage

Each compute node has its own 2 TB local hard drive, which is needed for time-critical tasks such as managing virtual memory. The system usage of the local disk is kept as small as possible in order to leave most of the local disk available for your applications running on the node.

Quotas and Usage

To help users maintain awareness of quotas and their usage on the /home file system, the command my_quotas is available to display a list of all quota-controlled file systems on which the user has storage space.

For example, the following shows the amount of storage available and in-use for user traine in workgroup it_css for their home and workgroup directory.

$ my_quotas
Type  Path           In-use / kiB Available / kiB  Pct
----- -------------- ------------ --------------- ----
user  /home/1201          7497728        20971520  36%
group /lustre/it_css          228      1073741824   0%

Home

Each user's home directory has a hard quota limit of 20 GB. To check usage, use

$ df -h $HOME

The example below displays the usage for the home directory (/home/1201) for the account traine as 7.2 GB used out of 20 GB, which matches the example provided by the my_quotas command above.

$ df -h $HOME
Filesystem                 Size  Used Avail Use% Mounted on
nfs0-ib:/beagle/home/1201   20G  7.2G   13G  36% /home/1201

Workgroup

All of Lustre is available for allocation workgroup storage. To check Lustre usage for all users, use df -h /lustre.

The example below shows 25 TB is in use out of 954 TB of usable Lustre storage.

$ df -h /lustre
Filesystem                             Size  Used Avail Use% Mounted on
10.65.2.6@o2ib:10.65.2.7@o2ib:/darwin  978T   25T  954T   3% /lustre 

To see your allocation workgroup usage, please use the my_quotas command. Again, the following example shows the amount of storage available and in-use for user traine in allocation workgroup it_css for their home and allocation workgroup directories.

$ my_quotas
Type  Path           In-use / kiB Available / kiB  Pct
----- -------------- ------------ --------------- ----
user  /home/1201          7497728        20971520  36%
group /lustre/it_css          228      1073741824   0% 

Node

The node temporary storage is mounted on /tmp for all nodes. There is no quota, and if you exceed the physical size of the disk you will get disk failure messages. To check the usage of your disk, use the df -h command on the compute node where your job is running.

We strongly recommend that you refer to the node scratch by using the environment variable $TMPDIR, which is defined by Slurm when using salloc, srun, or sbatch.

For example, the command

$ ssh r1n00 df -h /tmp

shows size, used and available space in M, G or T units.

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T   41M  1.8T   1% /tmp

This node r1n00 has a 2 TB disk, with only 41 MB used, so 1.8 TB is available for your job.

There is a physical disk installed on each node that is used for time-critical tasks, such as swapping memory. Most of the compute nodes are configured with a 2 TB disk; however, the /tmp file system will never have the full capacity of the disk available. Larger memory nodes will need to use more of the disk for swap space.

Recovering Files

While all file systems on the DARWIN cluster utilize hardware redundancies to protect data, there is no backup or replication and no recovery available for the home or Lustre file systems. All backups are the responsibility of the user. DARWIN's systems administrators are not liable for any lost data.

Usage Recommendations

Home directory: Use your home directory to store private files. Application software you use will often store its configuration, history, and cache files in your home directory. Generally, keep this directory free and use it for files needed to configure your environment. For example, add symbolic links in your home directory to point to files in any of the other directories.

Workgroup directory: Use the personal allocation workgroup directory ($WORKDIR_USERS) for running jobs and storing larger amounts of data, input files, supporting data files, work files, output files, and source code, as an extension of your home directory. It is also appropriate to use the software allocation workgroup directory ($WORKDIR_SW) to build applications for everyone in your allocation group, as well as to create a VALET package in $WORKDIR_SW/valet for your fellow researchers to access applications you want to share.

Node scratch directory: Use the node scratch directory for temporary files. The job scheduler software (Slurm) creates a temporary directory in /tmp specifically for each job's temporary files. This is done on each node assigned to the job. When the job is complete, the subdirectory and its contents are deleted. This process automatically frees up the local scratch storage that others may need. Files in node scratch directories are not available to the head node, or other compute nodes.
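A minimal batch script sketch that follows this recommendation is shown below; the partition, file, and program names are illustrative, not IT-provided:

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

cd "$TMPDIR"                                    # job-specific scratch directory created by Slurm
cp "$WORKDIR_USERS"/project/input.dat .         # stage input from workgroup storage
"$WORKDIR_USERS"/project/my_app input.dat > output.dat
cp output.dat "$WORKDIR_USERS"/project/         # copy results back before $TMPDIR is erased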

Transferring Files

Be careful about modifications you make to your startup files (e.g. .bash*). Commands that produce output such as VALET or workgroup commands may cause your file transfer command or application to fail. Log into the cluster with ssh to check what happens during login, and modify your startup files accordingly to remove any commands which are producing output and try again. See computing environment startup and logout scripts for help.

Common Clients

You can move data to and from the cluster using the following supported clients:

sftp    Recommended for interactive, command-line use.
scp     Recommended for batch script use.
rsync   Most appropriate for synchronizing the file directories of two systems when only a small
        fraction of the files have been changed since the last synchronization (see the example
        after this list).
Rclone  A command line program to sync files and directories to and from popular cloud storage services.
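For example, to synchronize a local project directory to your workgroup user directory on DARWIN (reusing the fuelcell example paths used later in this section), an rsync command might look like:

$ rsync -avz fuelcell/ bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects/fuelcell/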

If you prefer a non-command-line interface, then consult this table for GUI clients.

winscp Windows only
fetch Mac OS X only
filezilla Windows, Mac OS X, UNIX, Linux
cyberduck Windows, Mac OS X (command line version for Linux)

For Windows clients: if you edit files on a Windows desktop and then transfer them back to the cluster, you may find that your file becomes "corrupt" during the file transfer process. The symptoms are very subtle because the file appears to be okay, but in fact it contains CRLF line terminators. This causes problems when reading the file on a Linux cluster and generates very strange errors. For example, a file used for submitting a batch job, such as submit.qs, that you have used before and know is correct will no longer work; or an input file used for ABAQUS, like tissue.inp, which has worked many times before, produces an error like Abaqus Error: Command line option "input" must have a value.

Use the file utility to check for CRLF line terminators and dos2unix to fix it, as shown below:

[bjones@login01 ABAQUS]$ file tissue.inp
tissue.inp: ASCII text, with CRLF line terminators
[bjones@login01 ABAQUS]$ dos2unix tissue.inp
dos2unix: converting file tissue.inp to UNIX format ...
[bjones@login01 ABAQUS]$ file tissue.inp
tissue.inp: ASCII text

Copying Files to the Cluster

To copy a file over an SSH connection from a Mac/UNIX/Linux system to any of the cluster's file systems, type the generic command:

scp [options] local_filename HPC_username@HPC_hostname:HPC_filename

Begin the HPC_filename with a "/" to indicate the full path name. Otherwise the name is relative to your home directory on the HPC cluster.

Use scp -r to copy an entire directory, for example:

$ scp -r fuelcell bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects

copies the fuelcell directory in your local current working directory into the /lustre/it_css/users/1201/projects directory on the DARWIN cluster. The /lustre/it_css/users/1201/projects directory on the DARWIN cluster must exist, and bjones must have write access to it.

Copying files from the cluster

To copy a file over an SSH connection to a Mac/UNIX/Linux system from any of the cluster's file systems, type the generic command:

scp [options] HPC_username@HPC_hostname:HPC_filename local_filename

Begin the HPC_filename with a "/" to indicate the full path name. Otherwise, the name is relative to your home directory.

Use scp -r to copy the entire directory. For example:

$ scp -r bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects/fuelcell  .

will copy the directory fuelcell on the DARWIN cluster into a new fuelcell directory in your local system's current working directory. (Note the final period in the command.)

Copying Files Between Clusters

You can use GUI applications to transfer small files to and from your PC as a way to transfer between clusters; however, this is highly inefficient for large files due to multiple transfers and slower disk speeds, and you do not benefit from the arcfour encoding.

The command tools work the same on any Unix cluster. To copy a file over an SSH connection, first log on to the cluster containing the files (cluster 1), and then use the scp command to copy the files to the other cluster (cluster 2). Use the generic commands:

ssh [options] HPC_username1@HPC_hostname1
scp [options] HPC_filename1 HPC_username2@HPC_hostname2:HPC_filename2

Log in to HPC_hostname1, and in the scp command, begin both HPC_filename1 and HPC_filename2 with a "/" to indicate the full path name. The clusters will most likely have different full path names.

Use ssh -A to enable agent forwarding and scp -r to copy the entire directory.1) For example:

$ ssh -A bjones@caviness.hpc.udel.edu
cd archive/it_css/project
scp -r fuelcell bjones@darwin.hpc.udel.edu:/lustre/it_css/users/1201/projects/fuelcell

will copy the directory fuelcell from Caviness to a new fuelcell directory on DARWIN.

1) If you are using PuTTY, skip the ssh step and connect to the cluster you want to copy from.

Application Development

There are three 64-bit compiler suites on DARWIN that provide Fortran, C, and C++ compilers:

PGI Portland Compiler Suite
Intel Parallel Studio XE
GCC GNU Compiler Collection

In addition, multiple versions of OpenJDK are available for compiling Java applications on the login node.

DARWIN is based on AMD EPYC processors; please review the following documents if you are planning to compile and install your own software (a brief compilation sketch follows the list below).

  • High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7002 Series Processors guide for getting started tuning AMD 2nd Gen EPYC™ Processor based systems for HPC workloads. This is not an all-inclusive guide and some items may have similar, but different, names in specific OEM systems (e.g. OEM-specific BIOS settings). Every HPC workload varies in its performance characteristics. While this guide is a good starting point, you are encouraged to perform your own performance testing for additional tuning. This guide also provides suggestions on which items should be the focus of additional, application-specific tuning (November 2020).

  • HPC Tuning Guide for AMD EPYC™ Processors guide intended for vendors, system integrators, resellers, system managers and developers who are interested in EPYC system configuration details. There is also a discussion on the AMD EPYC software development environment, and we include four appendices on how to install and run the HPL, HPCG, DGEMM, and STREAM benchmarks. The results produced are ‘good' but are not necessarily exhaustively tested across a variety of compilers with their optimization flags (December 2018).

  • AMD EPYC™ 7xx2-series Processors Compiler Options Quick Reference Guide, however we do not have the AOCC compiler (with Flang - Fortran Front-End) installed on DARWIN.
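As a brief compilation sketch (the compiler package name and version are assumptions; use vpkg_versions to see what is actually installed), loading GCC with VALET and targeting the 2nd Gen EPYC (Zen 2) architecture might look like:

$ vpkg_versions gcc                        # list available GCC versions (package id assumed)
$ vpkg_require gcc                         # configure the environment for a version of GCC
$ gcc -O2 -march=znver2 -o hello hello.c   # znver2 targets the EPYC 7002 (Zen 2) cores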

Computing Environment

UNIX Shell

The UNIX shell is the interface to the UNIX operating system. The HPC cluster allows use of the enhanced Bourne shell bash, the enhanced C shell tcsh, and the enhanced Korn shell zsh. IT will primarily support bash, the default shell.

For most Linux systems, the sh shell is the bash shell and the csh shell is the tcsh shell. The remainder of this document will use only bash commands.

Environment Variables

Environment variables store dynamic system values that affect the user environment. For example, the PATH environment variable tells the operating system where to look for executables. Many UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and applications with graphical user interfaces, often look at environment variables for information they need to function. The man pages for these programs typically have an ENVIRONMENT VARIABLES section with a list of variable names which tells how the program uses the values.

This is why we encourage users to use VALET to modify their environment rather than explicitly setting environment variables.

In bash, a variable must be exported to be used as an environment variable. By convention, environment variables are all uppercase. You can display a list of currently set environment variables by typing

$ env

The "echo" and "export" commands will display and set environment variables.

Command Results
echo $varName Display specific environment variable
export varName=varValue To set an environment variable to a value

You can display specific environment variables by typing for example:

$ echo $HOME 
$ export FFLAGS='-g -Wall'

The variable FFLAGS will have the value -g -Wall in the shell and will be exported to programs run from this shell.

Spaces are important. Do not put spaces around the equal sign. If the value has spaces, enclose the value in quotes.

If you see instructions that refer to the setenv command, replace it with the export bash command. Make sure you use equal signs, with no spaces. The csh setenv command uses spaces instead of an equals sign.
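For example, a csh-style setenv line found in some software documentation, and its bash equivalent:

setenv FFLAGS '-g -Wall'       (csh/tcsh form shown in some documentation)
export FFLAGS='-g -Wall'       (bash equivalent to use on DARWIN)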

Startup and Logout Scripts

All UNIX systems set up a default environment and provide users with the ability to execute additional UNIX commands to alter the environment. These commands are automatically sourced (executed) by your shell and define the normal and environmental variables, command aliases, and functions you need. Additionally, there is a final system-wide startup file that automatically makes global environment changes that IT sets for all users.

You can modify the default environment by adding lines at the end of the ~/.bash_profile file and the ~/.bashrc file. These modifications affect shells started on the login node and the compute nodes. In general, we recommend that you not modify these files, especially when software documentation refers to changing the PATH environment variable; instead, use VALET to load the software.

  • The ~/.bash_profile file's commands are executed once at login. Add commands to this file to set your login environment and to run startup programs.
  • The ~/.bashrc file's commands are executed by each new shell you start (spawn). Add lines to this file to create aliases and bash functions. Commands such as "xterm" and "workgroup" automatically start a new shell and execute commands in the ~/.bashrc file. The "salloc" command starts a shell on a compute node and will execute the ~/.bashrc file from your home directory, but it does not execute the commands in the ~/.bash_profile file.

You may modify the IT-supplied ~/.bash_udit file to be able to use several IT-supplied aliases (commands) and environment settings related to your workgroup and work directory. Edit .bash_udit and follow the directions in the file to activate these options. This is the ONLY way you should set your default workgroup at login. DO NOT add the workgroup command to your .bashrc or .bash_profile, as this will likely prevent you from logging in and will cause file transfer programs like WinSCP, sftp, or Fetch to break.

Exiting the login session or typing the "logout" command executes your ~/.bash_logout file and terminates your session. Add commands to ~/.bash_logout that you want to execute at logout.

To restore the .bash_profile, .bashrc, .bash_udit and .bash_logout files in your home directory to their original state, type from the login node:

$ cp /opt/shared/templates/homedir/.bash* $HOME

Where to put startup commands: You can put bash commands in either ~/.bashrc or ~/.bash_profile. Again we do not recommend modifying these files unless you really know what you are doing. Here are general suggestions:

  • Even if you have favorite commands from other systems, start by using the supplied files and only modify .bash_udit for customization.

  • Add essential commands that you fully understand, and keep it simple. Quoting rules can be complicated.

  • Do not depend on the order of command execution. Do not assume your environment, set in .bash_profile, will be available when the commands in .bashrc are executed.

  • Do not include commands that spawn new shells, such as workgroup.

  • Be very careful of commands that may produce output. If you must, only execute them after a test to make sure there is a terminal to receive the output (see the sketch after this list). Keep in mind that using any commands that produce output may break other applications, such as file transfer tools (sftp, scp, WinSCP).

  • Do not include VALET commands as they produce output and will be a part of every job submitted which could cause conflicts with other applications you are trying to run in your job script.

  • Keep a session open on the cluster, so when you make a change that prevents you from logging on you can reverse the last change, or copy the original files from /opt/shared/templates/homedir/ to start over.
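As a minimal sketch of guarding output in ~/.bashrc (the alias and message are only examples, not IT-provided defaults):

# Safe additions to ~/.bashrc: simple aliases, plus a guard around anything that prints
alias ll='ls -alF'

if [ -t 1 ]; then
    # only print when a terminal is attached, so sftp/scp/WinSCP are not broken
    echo "Reminder: use VALET (vpkg_require) to load software."
fi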

Using workgroup and Directories

There are some key environment variables that are set for you, and are important for your work on any cluster. They are used to find directories for your projects. These environment variables are set on initial connection to a cluster, and will be changed if you

  • set your workgroup (allocation group) name with the "workgroup" command,
  • change to your project directory with the "cd" command,
  • connect to the compute node resources with the "salloc" (or "sbatch") command, specifying a single partition to which your allocation workgroup has access based on the resources requested for your allocation.

Connecting to Login Node

The system's initialization scripts set the values of some environment variables to help use the file systems.
Variable Value Description
HOSTNAME hostname Host name
USER login_name Login name
HOME /home/uid Your home directory

The initialization scripts also set the standard prompt with your login name and a shortened host name. For example, if your hostname is darwin.hpc.udel.edu and your login_name is bjones, then the standard prompt will be

[bjones@login00.darwin ~]$

Clusters may be configured to have multiple login nodes, with one common name for connecting. For example, on the DARWIN cluster, the hostname may be set to login00 or login01, and the standard prompt and window title bar will indicate which darwin login node you are connected to.

Setting Workgroup

To use the compute node resources for a particular allocation group (workgroup), you need to use the "workgroup" command.

For example,

$ workgroup -g it_css

starts a new shell for the workgroup it_css, and sets the environment variables:

Variable Example Value Description
WORKDIR /lustre/it_css Allocation workgroup directory, this is not writeable
WORKGROUP it_css Current allocation workgroup name
WORKDIR_USER /lustre/it_css/users/uid Allocation workgroup user directory
WORKDIR_SW /lustre/it_css/sw Allocation workgroup software directory

Use specific environment variables such as $WORKDIR_USERS when referring to your allocation workgroup user directory and $WORKDIR_SW when referring to your allocation workgroup software directory. This will improve portability.

It is always important to be aware of your current allocation workgroup name. The standard prompt includes the allocation workgroup name, added to your username and host. You must have an allocation workgroup name in your prompt to use that allocation group's compute node resources to submit jobs using sbatch or salloc. An example prompt after the "workgroup" command,

[(it_css:bjones)@login01.darwin ~]$

Changing Directory

When you first connect to the login node, all your commands are executed from your home directory (~). Most of your work will be done in your allocation workgroup directory. The "workgroup" command has an option to start you in the allocation workgroup work directory. For example,

$ workgroup -cg it_css

will spawn a new shell in the workgroup directory for it_css.

You can always use the cd bash command.

For example,

$ cd users/1201/project/fuelcell
   cd /lustre/it_css/users/1201/project/fuelcell
   cd $WORKDIR/users/1201/project/fuelcell
   cd $WORKDIR_USERS/project/fuelcell

The first is using a path name relative to the current working directory (implied ./). The second, third and fourth use the full path ($WORKDIR and $WORKDIR_USERS always begin with a /). In all cases the directory is changed, and the $PWD environment variable is set:

Variable Example Value Description
PWD /lustre/it_css/users/1201/project/fuelcell Print (current) working directory

It is always important to be aware of your current working directory. The standard prompt ends with the basename of PWD (for example, 1201), and the standard bash PROMPT_COMMAND, which is executed every time you change directories, will put the full path of your current working directory in your window title. For example,

bjones@login00.darwin:/lustre/it_css/users/1201

Connecting to a Compute Node

To run a job on the compute nodes, you must submit your job script using sbatch or start an interactive session using salloc. In both cases, you will be connected, with a clean environment, to one of your allocation's compute nodes based on the partition (queue) specified. Do not rely on the environment you set on the login node. The variables USER, HOME, WORKGROUP, WORKDIR, WORKDIR_USERS and PWD are all set on the compute node to match the ones you had on the login node, but two variables are set to node-specific values:

Variable Example Value Description
HOSTNAME r00n17 compute node name
TMPDIR /tmp temporary disk space

An empty directory is created by the SLURM job scheduler that is associated with your job and defined as TMPDIR. This is a safe place to store temporary files that will not interfere with other jobs and tasks you or other members of your group may be executing. This directory is automatically emptied on normal termination of your job. This way the usage on the node scratch file system will not grow over time.

Before submitting jobs you must first use the "workgroup" command. Type workgroup -h for additional information. Both "sbatch" and "salloc" will start in the same project directory you set on the login node, and both require a single partition to be specified in order to submit a batch job or interactive session.

Using VALET

The UD-developed VALET system facilitates your use of compilers, libraries, programming tools and application software. It provides a uniform mechanism for setting up a package's required UNIX environment. VALET is a recursive acronym for VALET Automates Linux Environment Tasks. It provides functionality similar to the Modules package used at other HPC sites.

VALET commands set the basic environment for software. This may include the PATH, MANPATH, INFOPATH, LDPATH, LIBPATH and LD_LIBRARY_PATH environment variables, compiler flags, software directory locations, and license paths. This reduces the need for you to set them or update them yourself when changes are made to system and application software. For example, you might find several versions for a single package name, such as Mathematica/8 and Mathematica/8.0.4. You can even apply VALET commands to packages that you install or alter its actions by customizing VALET's configuration files. Type man valet for instructions or see the VALET software documentation for complete details.

The table below shows the basic informational commands for VALET. In subsequent sections, VALET commands are illustrated in the contexts of application development (e.g., compiling, using libraries) and running IT-installed applications.

Command Function
vpkg_help VALET help.
vpkg_list List the packages that have VALET configuration files.
vpkg_versions pkgid List versions available for a single package.
vpkg_info pkgid Show information for a single package (or package version).
vpkg_require pkgid Configure environment for one or more VALET packages.
vpkg_devrequire pkgid Configure environment for one or more VALET packages including software development variables such as CPPFLAGS and LDFLAGS.
vpkg_rollback # or all Each time VALET changes the environment, it makes a snapshot of your environment to which it can return. vpkg_rollback attempts to restore the UNIX environment to its previous state. You can specify a number (#) to revert one or more prior changes to the environment, or all to remove all changes.
vpkg_history List the versioned packages that have been added to the environment.
man valet Complete documentation of VALET commands.
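A typical VALET session might look like the sketch below; the package name and version are illustrative (run vpkg_list to see what is actually available):

$ vpkg_list                    # packages with VALET configuration files
$ vpkg_versions mathematica    # versions available for one package
$ vpkg_require mathematica/8   # configure the environment for that version
$ vpkg_history                 # packages added to this session
$ vpkg_rollback                # undo the most recent environment change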