Transferring data includes moving files from local machines to XSEDE, as well as transfers between XSEDE resources. This section gives a high level overview on the recommended XSEDE data transfer methods.
Depending on your data transfer requirements, there are a variety of methods for transferring files across XSEDE. You may choose between Globus, the Globus Command Line Interface (CLI), globus-url-copy, scp, and sftp. The pros and cons of each method are summarized in Table 1.1 below.
|Usage Mode||Transfer Method||Pros||Cons|
|Graphical User Interface||Globus||easy to use web interface, can use XUP login (SSO), desktop download available||none|
|Command Line Interface||Globus Command Line Interface (CLI)||managed, reliable, and auto-tuned transfers; advanced syntax for scripting; can use XSEDE single sign-on||need to set up SSH key in Globus profile; advanced knowledge required for authentication and scripting|
|Command Line Interface||globus-url-copy||high performance transfer with tuning options; command line interface||advanced knowledge required for authentication, performance tuning, and increased reliability|
|Command Line Interface||scp, sftp||easy command line interface||must use local (resource-specific) username and password|
Globus is a fast, reliable service for high performance file transfer. This hosted service addresses data movement challenges by providing a robust, secure, and highly monitored environment for file transfers, with powerful yet easy-to-use interfaces. Researchers with no IT background can easily move large numbers of files, or files of large size, using the web GUI, while developers who want to automate workflows can use the command line interface.
Globus features include:
- High performance: Move terabytes of data in thousands of files
- Automatic fault recovery
- Works across multiple security domains
- Designed for researchers: easy "fire and forget" file transfers
- No client software installation
- New features automatically available
- Consolidated support and troubleshooting
- Works with existing GridFTP servers
- Ability to move files to any machine (even your laptop) with ease
- Moving data between locations on XSEDE: The Globus File Transfer service can be used to move data between all XSEDE sites, which are accessible as transfer endpoints in the service. Globus simplifies and automates file transfer without the need to install or interact with GridFTP.
- Moving data to XSEDE from a local machine: This might include observational data from sensors, surveys, etc., that will be analyzed on XSEDE computing resources. Globus Connect is a feature of Globus that makes it possible to create a transfer endpoint on any machine (including campus servers and home laptops) with just a few clicks and without the typical difficulties of a GridFTP install.
For more information view the XSEDE Globus User Guide.
Globus provides a command line interface that may be accessed using any standard ssh terminal client. Prior to running shell commands, you must upload your public SSH key to your Globus account. For more information please see the pertinent section in the Globus File Transfer User Guide.
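As a sketch of what this looks like in practice (the username, endpoint names, and paths below are illustrative and must be replaced with your own), a transfer between two XSEDE endpoints can be submitted from any ssh client:

```shell
# Hypothetical example: substitute your Globus username, endpoint
# names, and file paths. Requires your public SSH key to already be
# uploaded to your Globus profile.
ssh joeuser@cli.globusonline.org \
    transfer -- xsede#blacklight/scratcha/joeuser/mylargefile \
                xsede#stampede/scratch/joeuser/mylargefile
```

The hosted CLI manages the transfer for you, so the command returns immediately and Globus retries any failed files automatically.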
globus-url-copy and uberftp are command-line implementations of the GridFTP protocol that underlies all XSEDE transfer mechanisms. Use these commands to transfer large files.
Here's a sample transfer from PSC's Blacklight to TACC's Stampede optimized for large files:
login1$ globus-url-copy -stripe -tcp-bs 8388608 \
    gsiftp://gridftp.psc.xsede.org:2811/scratcha/joeuser/mylargefile \
    gsiftp://gridftp.stampede.tacc.xsede.org:2811/scratch/joeuser/mylargefile
The following table lists the GridFTP endpoint for each XSEDE system.
|Resource||GridFTP Endpoint||Server Type|
For advanced users, speedpage.psc.edu provides information on the transfer speeds you can expect when using globus-url-copy with the optimized parameters above.
You may also use one of these command-line tools to transfer small (< 2 GB) files between XSEDE resources and/or your local machine. From Linux or Mac, you can run these commands directly from the terminal. From Windows, use your ssh client. Both scp and sftp are easy to use and secure, but provide poor performance for large files.
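For example, a small file can be copied from a local machine to an XSEDE resource with scp (the hostname and paths here are illustrative; substitute the login node of your allocated resource and your local username for it):

```shell
# Hypothetical hostname and path; use your resource's login node and
# your local (resource-specific) username and scratch directory.
scp results.tar.gz joeuser@stampede.tacc.utexas.edu:/scratch/joeuser/
```

scp will prompt for your local password on that resource, since these tools do not use XSEDE single sign-on.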
Data protection mechanisms are incorporated into much of the infrastructure and software used in XSEDE, and in most cases users are not required to take any special steps to ensure the integrity of their data. However, there are situations in which a user may wish to check that a transferred file has been copied correctly to the new system, or check that a file has not been changed since it was originally created. In these situations, checksums may be used to generate a cryptographic hash of one or more files. Cryptographic hashes have the property that they always produce the same value when operating on the same input data, so they can be saved and then compared against a recomputed hash to verify that a file is exactly the same as it was when the original checksum was generated. Even a single bit change in a multi-terabyte file will produce a different checksum value, so a successful checksum comparison provides a strong guarantee that data has not been altered in any way.
We recommend that users utilize the "sha256sum" command to create and check cryptographic hashes. This command should be available on most UNIX systems, as well as most XSEDE resources. To generate a checksum for a given file, run sha256sum with the name of the file (or files) you wish to check; the command will report the checksum of each file on a separate line, followed by the filename:
login1$ sha256sum filename1 filename2
9db55391e52a4a84944c6c9817ab8d0445547e8934d88d26032cc4747e196039 filename1
a6483e57971627e4e2403c6d3e38b205c70db2221f0b9fe46781e0af76192ef5 filename2
To save the generated checksums for comparison, redirect the output to a file:
login1$ sha256sum filename1 filename2 > checksums.out
You can then use the contents of this file to verify that the files are exactly the same on any given system, or on the same system at a later date, using the -c flag to sha256sum:
login1$ sha256sum -c checksums.out
filename1: OK
filename2: OK
Wildcards can also be used with the sha256sum command. For example, a user could generate checksums for all the files in a directory using the command:
login1$ sha256sum * > checksums.out
After transferring these files to another XSEDE resource, the user could verify that the data was transferred completely and correctly using the saved output file. If any of the files has been corrupted or incompletely transferred, the check will produce output like the following:
login1$ sha256sum -c checksums.out
filename1: FAILED
filename2: OK
sha256sum: WARNING: 1 of 2 computed checksums did NOT match
In this situation, the user should retransfer the files in question or restore from a backup copy of the data.
In order to verify data integrity at a later date, you must have a record of the original checksum values to compare to the present value. Therefore, generate and save checksums when data is first created or before it is transferred into XSEDE, even if you do not immediately intend to perform verification against those checksums.
Some data transfer mechanisms, including GridFTP, provide options to generate and compare checksums as part of the transfer operation. When using Globus to manage GridFTP transfers, include the "--verify-checksum" option in command-line invocations, or select the "Verify Checksum" option in the web interface. Secure copy (scp) provides some protection for data integrity during the transfer because data is encrypted in transit, but it does not perform end-to-end validation of data integrity; therefore, users should perform additional verification if data integrity is important.
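The end-to-end verification pattern described above can be sketched as follows. The filenames are illustrative, and a local cp stands in for the actual transfer step (scp or globus-url-copy in real use):

```shell
# Sketch: verify data integrity around a transfer. 'cp' below is a
# stand-in for the real transfer command (scp, sftp, globus-url-copy).
mkdir -p src dest
printf 'sample payload\n' > src/mydata          # stand-in for real data
( cd src && sha256sum mydata ) > checksums.out  # record hash before transfer
cp src/mydata dest/                             # the "transfer" step
( cd dest && sha256sum -c ../checksums.out )    # verify after transfer
```

If the copy is intact, the final command reports "mydata: OK"; a corrupted or truncated copy would instead report FAILED, signaling that the file should be retransferred.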
Several factors determine the data transfer rates you can achieve:
- Disk speed
- Connectivity of disk to node
- Node characteristics & load
- Connectivity of node to WAN
- Buffer size (for all networks)
Many of these factors are intrinsic to the system and not within the user's control. However, by following the recommendations above, the choices a user makes can significantly influence transfer speed, in some cases yielding 50 times the speed achieved with default settings and non-recommended commands. Because of the factors listed above, even with the best optimization, don't expect 40 Gb/sec; performance is usually limited by end-node connectivity, not WAN bandwidth.
Last updated: October 7, 2014