Job management with GRAM5

This document provides a brief introduction to the Grid Resource Allocation and Management (GRAM) component of the Globus Toolkit. For detailed GRAM documentation please consult the GRAM5 User Guide.

GRAM is a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important. GRAM client commands are used to locate, submit, monitor, and cancel jobs on XSEDE resources. GRAM is well suited to experienced users with allocations on multiple machines, because jobs on different resources can be managed from a single command line.

Prerequisites for Running Jobs with GRAM

  • Proxy Certificate: A proxy certificate is required to run commands. When a gatekeeper receives a job request, it must authenticate the requester before it can start a jobmanager on the target resource. If you use XSEDE's "Single Sign-on", a proxy certificate is automatically created when you log in using GSI-OpenSSH. If you log in using a different method, then you will need to follow the Single Sign-on instructions for getting a proxy with myproxy-logon.
  • Binary Code Compatibility: Any code to be run on a remote resource must be compiled for that resource. The target execution environment may differ from the one where the GRAM request is initiated.
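Before submitting, it can help to confirm that a proxy exists and has enough lifetime left for the job. The following is a minimal sketch, assuming the Globus command grid-proxy-info is on your path after loading the globus module. The function names and the one-hour threshold are illustrative choices, not XSEDE requirements.

```shell
# Minimum acceptable proxy lifetime in seconds (illustrative value).
MIN_PROXY_SECONDS=${MIN_PROXY_SECONDS:-3600}

# Succeeds (exit 0) when the remaining lifetime is below the minimum.
# $1 = seconds left (empty or non-numeric is treated as 0), $2 = minimum.
proxy_needs_renewal() {
    left=$1
    case $left in
        ''|*[!0-9-]*) left=0 ;;
    esac
    [ "$left" -lt "$2" ]
}

# Check the current proxy and request a new one if needed.
# $1 = XSEDE Portal username, passed to myproxy-logon -l as in this guide.
ensure_proxy() {
    # grid-proxy-info -timeleft prints the remaining lifetime in seconds.
    left=$(grid-proxy-info -timeleft 2>/dev/null) || left=0
    if proxy_needs_renewal "$left" "$MIN_PROXY_SECONDS"; then
        myproxy-logon -l "$1"
    fi
}

# Example use (not run here): ensure_proxy jsmith
```

Keeping the comparison in its own function makes the threshold logic easy to check without a live proxy.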

Using GRAM on XSEDE resources

The following examples use GRAM commands that describe the job in RSL (Resource Specification Language) rather than through multiple command-line options. A globus-job-submit command returns a URL that serves as your job ID. Use the URL to look up job status, cancel the job, and so on. For more detail, please consult the Globus GRAM User Guide.

A Simple Session

You may execute GRAM commands from your desktop after downloading and installing the Globus Toolkit binaries. An easier alternative is to log on to any XSEDE resource where you have an account and issue commands from that resource's command line.

In the following example, user jsmith logs into TACC's Stampede from her desktop machine, loads the globus module to properly set up the programming environment, generates a proxy, and then executes a simple GRAM command to display the hostname on SDSC's Trestles. Notice that you are submitting a job from Resource A (Stampede) to be run on Resource B (Trestles).

In GRAM, a Gatekeeper Endpoint contains the host, port, service name, and service identity required to contact a particular GRAM service. Table 1 below lists the endpoints of XSEDE resources.
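The contact strings used in this guide follow the pattern host:port/service, for example trestles-login.sdsc.xsede.org:2119/jobmanager-fork. As a small sketch, the hypothetical helper below assembles such a string from its parts. It leaves out the optional service identity, which the examples in this guide do not supply.

```shell
# Build a GRAM contact string of the form host:port/service.
# $1 = gatekeeper host, $2 = port, $3 = service name (e.g. jobmanager-fork).
gram_contact() {
    printf '%s:%s/%s\n' "$1" "$2" "$3"
}

# Example use (not run here):
#   globus-job-run "$(gram_contact trestles-login.sdsc.xsede.org 2119 jobmanager-fork)" /bin/hostname
```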

Important: When authenticating with the myproxy-logon command, use your XSEDE Portal username and your XSEDE Portal password.

localhost$ ssh -l jsmith stampede.tacc.utexas.edu
jsmith@stampede.tacc.utexas.edu's password: 
...
stampede$ module load globus
stampede$ myproxy-logon -l jsmith
Enter MyProxy pass phrase:
A credential has been received for user jsmith in /tmp/x509up_u804387.
stampede$ globus-job-run trestles-login.sdsc.xsede.org:2119/jobmanager-fork /bin/hostname
trestles-login2.sdsc.edu
stampede$

An MPI Session

In the following session, the user is logged on to TACC's Lonestar and launches an MPI job on TACC's Stampede. The user submits the job, monitors job status, gathers results, and cleans up. The executable, hello.out, is located on the remote machine (Stampede) and has been compiled for that machine.

  1. Submit an MPI job, hello.out, with 4 processes
      lonestar$ globus-job-submit login5.stampede.tacc.utexas.edu:2119/jobmanager-slurm -np 4 -x  '&(jobtype=mpi)(project=TG-STA110012S)' hello.out
      https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
  2. Monitor the job status
      lonestar$ globus-job-status https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
      ACTIVE
      lonestar$ globus-job-status https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
      DONE
  3. Gather output
      lonestar$ globus-job-get-output \
          https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
      TACC: Starting up job 778538
      TACC: Setting up parallel environment for MVAPICH2+mpispawn.
      TACC: Starting parallel tasks...
      Hello world from process 0 of 4
      Hello world from process 2 of 4
      Hello world from process 1 of 4
      Hello world from process 3 of 4
      TACC: Shutdown complete. Exiting.
  4. Clean up
      lonestar$ globus-job-clean https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
          WARNING: Cleaning a job means:
          	- Kill the job if it still running, and
          	- Remove the cached output on the remote resource
          Are you sure you want to cleanup the job now (Y/N) ?
          Y
      Cleanup successful.
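The monitoring step above can be automated by polling until the job reaches a final state. The following is a minimal sketch, assuming globus-job-status prints a single state name such as ACTIVE, DONE, or FAILED, as in the session above. The function name, the GRAM_STATUS_CMD and POLL_INTERVAL variables, and the 30-second default are illustrative assumptions.

```shell
# Command used to query job state; override it to test without a grid.
GRAM_STATUS_CMD=${GRAM_STATUS_CMD:-globus-job-status}
# Seconds to wait between status checks (illustrative default).
POLL_INTERVAL=${POLL_INTERVAL:-30}

# Poll a job until it is DONE or FAILED.
# $1 = job URL returned by globus-job-submit.
# Prints the final state; returns 0 for DONE, 1 for FAILED,
# and 2 if the status command itself fails.
wait_for_job() {
    job_url=$1
    while :; do
        state=$($GRAM_STATUS_CMD "$job_url") || return 2
        case $state in
            DONE)   echo "$state"; return 0 ;;
            FAILED) echo "$state"; return 1 ;;
        esac
        sleep "$POLL_INTERVAL"
    done
}

# Example use (not run here), where JOB holds the job URL:
#   wait_for_job "$JOB" && globus-job-get-output "$JOB"
```

Choose a polling interval that does not place unnecessary load on the remote login node.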

Example 1

Starting a Job with GRAM: Output is delivered to a gsiftp Server.

$ globusrun -b -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
'&(executable=/bin/env) (stdout=output) (stderr=error) \
(file_stage_out = \
(output gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/output.lonestar) \
(error gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/error.lonestar)) \
(maxTime=3) \
(project=TG-STA060014N)'

Example 2

MPI job that returns files to a gsiftp server. The command does not wait for the job to start and returns a contact URL instead. At job completion, stdout and stderr are sent to the gsiftp server.

$ globusrun -b -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
'& (executable=/home1/00202/uxkennet/info/qa/cpi) \
(count=4) (jobtype=mpi) (hostcount=2) (stdout=output) (stderr=error) \
(file_stage_out = \
(output gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/output.gsiftp.lonestar) \
(error gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/error.gsiftp.lonestar)) \
(maxTime=3) \
(project=TG-STA060014N)'
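When the same kind of job is submitted repeatedly, the RSL string can be generated from a few parameters instead of being typed by hand. The sketch below uses a hypothetical helper, build_mpi_rsl, which emits the RSL on one line using only attributes that appear in the examples here. It omits file staging for brevity.

```shell
# Print a one-line RSL description for an MPI job.
# $1 = executable path on the remote resource
# $2 = process count, $3 = host count
# $4 = maxTime in minutes, $5 = project (allocation) ID
build_mpi_rsl() {
    printf '&(executable=%s)(count=%s)(jobtype=mpi)(hostcount=%s)(maxTime=%s)(project=%s)\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example use (not run here):
#   globusrun -b -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
#       "$(build_mpi_rsl /home1/00202/uxkennet/info/qa/cpi 4 2 3 TG-STA060014N)"
```

Emitting the RSL on a single line avoids depending on how a particular shell treats line breaks inside quoted strings.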

Example 3

MPI job that returns a contact URL, waits for the job to start, and returns stdout to the user through the globusrun command.

$ globusrun -o -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
'& (executable=/home1/00202/uxkennet/info/qa/cpi) \
(count=4) (hostcount=2) (maxTime=3) (project=TG-STA060014N) (jobtype=mpi)' 

Example 4

Checking the Status of a GRAM Job: Two commands, globusrun and globus-job-status, report job status. Both require the job ID.

$ globusrun -status https://grid.example.org:38824/16001608125017717261/5295612977486019989/
PENDING
$ globus-job-status https://grid.example.org:38824/16001608125017717261/5295612977486019989/

For a complete list of job states, please see the GRAM5 job status descriptions.

Example 5

Cancelling a GRAM Job: To cancel a GRAM job, run one of the following commands. Either one removes the job from the local scheduler:

$ globusrun -k https://grid.example.org:38824/16001608125017717261/5295612977486019989/
$ globus-job-cancel -force https://grid.example.org:38824/16001608125017717261/5295612977486019989/

Example 6

Cleaning up after a GRAM Job: The following command cleans out temporary Globus state files only; it does not delete output or error files.

$ globus-job-clean -force https://grid.example.org:38824/16001608125017717261/5295612977486019989/
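If you record job URLs as you submit, cleanup can be applied to a whole batch. The following is a minimal sketch, assuming a plain-text file with one job URL per line (the file format and the name jobs.txt are hypothetical) and assuming that -force suppresses the confirmation prompt shown in the MPI session above.

```shell
# Command used to clean a job; override it to test without a grid.
GRAM_CLEAN_CMD=${GRAM_CLEAN_CMD:-globus-job-clean}

# Clean every job listed in a file, one job URL per line.
# Blank lines and lines starting with '#' are skipped.
# $1 = path to the list file. Returns non-zero if any clean fails.
clean_jobs_from_file() {
    failures=0
    while IFS= read -r url; do
        case $url in
            ''|'#'*) continue ;;
        esac
        # Redirect stdin so the command cannot consume the list itself.
        if ! $GRAM_CLEAN_CMD -force "$url" </dev/null; then
            echo "clean failed: $url" >&2
            failures=$((failures + 1))
        fi
    done < "$1"
    [ "$failures" -eq 0 ]
}

# Example use (not run here): clean_jobs_from_file jobs.txt
```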

XSEDE Resource GRAM Endpoints

The following table lists currently supported Gatekeeper Host contact strings for XSEDE resources supporting GRAM.

Resource Site Service Name Version Endpoint

Last updated: May 22, 2013