Job management with GRAM5
GRAM is a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM is meant to address jobs where reliable operation, stateful monitoring, credential management, and file staging are important. GRAM client commands are used to locate, submit, monitor, and cancel jobs on XSEDE resources. GRAM is ideal for experienced users with multiple allocations on multiple machines, because jobs on different resources can be managed from a single command line.
- Proxy Certificate: A proxy certificate is required to run GRAM commands. When a gatekeeper receives a job request, it must authenticate the requester before it can start a jobmanager on the target resource. If you use XSEDE's "Single Sign-on", a proxy certificate is automatically created when you log in using GSI-OpenSSH. If you log in using a different method, you will need to follow the Single Sign-on instructions for obtaining a proxy.
- Binary Code Compatibility: Any code to be run on a remote resource must be compiled for that resource. The target execution environment may differ from the one where the GRAM request is initiated.
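The proxy requirement above can be checked before any job is submitted. The following is a minimal sketch, assuming the Globus `grid-proxy-info` command is on your path and that `myproxy-logon` already knows which MyProxy server to contact (as it does after `module load globus` in the examples in this guide). The function name `ensure_proxy` and the variables `PROXY_INFO_CMD` and `PROXY_LOGON_CMD` are hypothetical; they exist only so the logic can be tried without Globus installed.

```shell
# ensure_proxy: hypothetical helper, not part of the Globus Toolkit.
# PROXY_INFO_CMD and PROXY_LOGON_CMD are illustrative overrides for testing.
ensure_proxy() {
    user=$1
    info_cmd=${PROXY_INFO_CMD:-grid-proxy-info}
    logon_cmd=${PROXY_LOGON_CMD:-myproxy-logon}
    # "-exists -valid H:M" succeeds only if a proxy exists with at least
    # that much lifetime remaining (one hour here, an arbitrary choice).
    if $info_cmd -exists -valid 1:00 >/dev/null 2>&1; then
        echo "proxy OK"
    else
        echo "no usable proxy; requesting one for $user"
        $logon_cmd -l "$user"
    fi
}

# Example call (commented out so the sketch has no side effects):
# ensure_proxy jsmith
```

The one-hour threshold is only a placeholder; pick a value longer than the expected run time of the job you are about to submit.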
The following examples use GRAM commands that describe the job in RSL (Resource Specification Language) rather than through multiple command-line options. A globus-job-submit command returns a URL that serves as your job ID. Use the URL to look up job status, cancel the job, and so on. For more detail, please consult Globus' GRAM User Guide.
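An RSL job description of the kind used in this guide is an `&` followed by parenthesized `attribute=value` relations. As a rough illustration, the sketch below assembles such a string from separate arguments. The helper name `build_rsl` is hypothetical, and it does no quoting or validation of the values it is given.

```shell
# build_rsl: hypothetical helper (not a Globus command) that joins
# attribute=value arguments into one RSL conjunction string.
build_rsl() {
    rsl='&'
    for pair in "$@"; do
        rsl="${rsl}(${pair})"
    done
    printf '%s\n' "$rsl"
}

# Rebuild the RSL string used in the MPI submission example in this guide.
mpi_rsl=$(build_rsl jobtype=mpi project=TG-STA110012S)
echo "$mpi_rsl"
```

The resulting string can be passed to globus-job-submit with the -x option, as in the MPI example in this guide.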
You may execute GRAM commands from your desktop after downloading and installing the Globus Toolkit binaries. An easier alternative is to log on to any XSEDE resource where you have an account and issue commands from that resource's command line.
In the following example, user jsmith logs into TACC's Stampede from her desktop machine, loads the globus module to properly set up the programming environment, generates a proxy, and then executes a simple GRAM command to display the hostname on SDSC's Trestles. Notice that you are submitting a job from Resource A (Stampede) to be run on Resource B (Trestles).
In GRAM, a Gatekeeper Endpoint contains the host, port, service name, and service identity required to contact a particular GRAM service. Table 1 below lists the endpoints of XSEDE resources.
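To make the parts of an endpoint concrete, the sketch below splits a contact string of the form `host:port/service` into its components using shell parameter expansion. It uses the Trestles contact string from the example session in this guide. The function `parse_contact` and the `gk_*` variable names are hypothetical, and the sketch assumes the optional service identity is not appended to the string.

```shell
# parse_contact: hypothetical helper that splits "host:port/service".
# Assumes no trailing service identity (subject) is present.
parse_contact() {
    contact=$1
    gk_host=${contact%%:*}     # everything before the first ':'
    rest=${contact#*:}         # everything after the first ':'
    gk_port=${rest%%/*}        # everything before the first '/'
    gk_service=${rest#*/}      # everything after the first '/'
}

parse_contact trestles-login.sdsc.xsede.org:2119/jobmanager-fork
echo "host=$gk_host port=$gk_port service=$gk_service"
```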
Important: When authenticating with the myproxy-logon command, use your XSEDE Portal username and your XSEDE Portal password.
localhost$ ssh -l jsmith stampede.tacc.utexas.edu
jsmith@stampede.tacc.utexas.edu's password:
...
stampede$ module load globus
stampede$ myproxy-logon -l jsmith
Enter MyProxy pass phrase:
A credential has been received for user jsmith in /tmp/x509up_u804387.
stampede$ globus-job-run trestles-login.sdsc.xsede.org:2119/jobmanager-fork /bin/hostname
trestles-login2.sdsc.edu
stampede$
In the following session, the user is logged onto TACC's Lonestar and launches an MPI job on TACC's Stampede. The user submits the job, monitors job status, gathers results, and cleans up. The executable, hello.out, is located on the remote machine (Stampede) and has been compiled for that machine.
- Submit an MPI job, hello.out, with 4 processes
lonestar$ globus-job-submit login5.stampede.tacc.utexas.edu:2119/jobmanager-slurm -np 4 -x '&(jobtype=mpi)(project=TG-STA110012S)' hello.out
https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
- Monitor the job status
lonestar$ globus-job-status https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
ACTIVE
lonestar$ globus-job-status https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
DONE
- Gather output
lonestar$ globus-job-get-output \
    https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
TACC: Starting up job 778538
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4
TACC: Shutdown complete. Exiting.
- Clean up
lonestar$ globus-job-clean https://login5.stampede.tacc.utexas.edu:50384/16289936284489898541/13331958173416058768/
WARNING: Cleaning a job means:
    - Kill the job if it still running, and
    - Remove the cached output on the remote resource
Are you sure you want to cleanup the job now (Y/N) ? Y
Cleanup successful.
Starting a Job with GRAM: Output is delivered to a gsiftp Server.
$ globusrun -b -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
    '&(executable=/bin/env) (stdout=output) (stderr=error)
      (file_stage_out =
        (output gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/output.lonestar)
        (error gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/error.lonestar))
      (maxTime=3) (project=TG-STA060014N)'
MPI job that returns files to a gsiftp server: globusrun does not wait for the job to start and returns a contact URL instead. At job completion, stdout and stderr are sent to the gsiftp server.
$ globusrun -b -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
    '&(executable=/home1/00202/uxkennet/info/qa/cpi)
      (count=4) (jobtype=mpi) (hostcount=2) (stdout=output) (stderr=error)
      (file_stage_out =
        (output gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/output.gsiftp.lonestar)
        (error gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa/error.gsiftp.lonestar))
      (maxTime=3)
      (project=TG-STA060014N)'
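The file_stage_out relation in the two examples above follows the same pattern each time: a local file name paired with a gsiftp destination URL, once for stdout and once for stderr. The sketch below generates that clause from a base URL and a file suffix, which may help avoid typing errors in long RSL strings. The helper `stage_out_clause` is hypothetical, and it assumes the job writes its streams to local files named `output` and `error`, as in the examples.

```shell
# stage_out_clause: hypothetical helper. $1 is a gsiftp base URL,
# $2 is the suffix appended to "output." and "error." on the server.
stage_out_clause() {
    base=${1%/}   # drop one trailing slash, if present
    suffix=$2
    printf '(file_stage_out = (output %s/output.%s) (error %s/error.%s))\n' \
        "$base" "$suffix" "$base" "$suffix"
}

clause=$(stage_out_clause gsiftp://gridftp-qb.loni-lsu.teragrid.org:2811/home/kenneth/info/qa lonestar)
rsl="&(executable=/bin/env) (stdout=output) (stderr=error) $clause (maxTime=3) (project=TG-STA060014N)"
echo "$rsl"
```

The string in `rsl` is equivalent to the RSL of the first example above and could be passed to globusrun as a single quoted argument.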
MPI job that returns a contact URL, waits for the job to start, and returns stdout to the user.
$ globusrun -o -r gridftp1.ls4.tacc.utexas.edu:2120/jobmanager-sge \
    '&(executable=/home1/00202/uxkennet/info/qa/cpi)
      (count=4) (hostcount=2) (maxTime=3) (project=TG-STA060014N) (jobtype=mpi)'
Checking the Status of a GRAM Job: There are two commands, globusrun -status and globus-job-status, that monitor job status. Both require the job ID.
$ globusrun -status https://grid.example.org:38824/16001608125017717261/5295612977486019989/
PENDING
$ globus-job-status https://grid.example.org:38824/16001608125017717261/5295612977486019989/
For a complete list of job status states please view the GRAM 5 job status descriptions.
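Rather than re-running the status command by hand, a small loop can poll until the job finishes. The following is a sketch only: `wait_for_job`, `GRAM_STATUS_CMD`, and `GRAM_POLL_SECONDS` are hypothetical names, the 30-second default interval is arbitrary, and the sketch assumes that globus-job-status prints a single state word such as ACTIVE, DONE, or FAILED, as in the sessions shown in this guide.

```shell
# wait_for_job: hypothetical helper that polls a job until DONE or FAILED.
# GRAM_STATUS_CMD and GRAM_POLL_SECONDS are illustrative variables,
# not Globus settings.
wait_for_job() {
    job_url=$1
    status_cmd=${GRAM_STATUS_CMD:-globus-job-status}
    while :; do
        # Return 2 if the status query itself fails.
        state=$($status_cmd "$job_url") || return 2
        echo "state: $state"
        case $state in
            DONE)   return 0 ;;
            FAILED) return 1 ;;
        esac
        sleep "${GRAM_POLL_SECONDS:-30}"
    done
}

# Example call (commented out; requires a real job ID URL):
# wait_for_job "$job_url" && globus-job-get-output "$job_url"
```

The function returns 0 when the job reaches DONE and 1 when it reaches FAILED, so it can be chained with the output and cleanup commands.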
Cancelling a GRAM Job: To cancel a GRAM job, run one of the following commands. Either one removes the job from the local scheduler:
$ globusrun -k https://grid.example.org:38824/16001608125017717261/5295612977486019989/
$ globus-job-cancel -force https://grid.example.org:38824/16001608125017717261/5295612977486019989/
Cleaning up after a GRAM Job: The following command removes temporary Globus state files only; it does not delete output or error files.
$ globus-job-clean -force https://grid.example.org:38824/16001608125017717261/5295612977486019989/
The following table lists currently supported Gatekeeper Host contact strings for XSEDE resources supporting GRAM.