XSEDE's Extended Collaborative Support Service (ECSS) program provides expert technical assistance to XSEDE users. XSEDE's ECSS staff members can collaborate with XSEDE project teams in a variety of support roles:
- Algorithmic or solver change; incorporation or implementation of new or advanced numerical methods and/or math libraries
- Optimization (single processor performance, parallel scaling, parallel I/O performance, memory usage, or benchmarking improvement)
- Development of parallel code from serial code/algorithm
- Data visualization or analysis
- Incorporation of new data management or storage
- New grid computing work
- Implementation of new workflows for automation of scientific processes
- Incorporation of new visualization methods
- Innovative scheduling implementation
- Integration of XSEDE resources into a portal or Science Gateway
We find that projects work best when approached collaboratively, so we require involvement from the principal investigator's team. We are not able to support activities such as a complete handoff of a code for independent parallelization by ECSS staff. Also, we will generally not be able to assist with third-party software unless the software is open-source, or you or we have a developer's license or other collaborative relationship with the authors.
Please see the XSEDE website for more on the ECSS program.
XSEDE users may request ECSS services through the allocation process. PIs may request ECSS assistance along with a new Research allocation request or by submitting a stand-alone Supplement request to an active allocation. The request will be reviewed, usually in conjunction with your XSEDE grant proposal. ECSS management will then assess whether there is also a good potential match with the experts available to collaborate with your team. If so, a member of the ECSS team will contact you to develop a specific work plan for the collaboration.
PIs must answer the following five questions, and the answers must be included with the ECSS request. Illustrative examples are provided to clarify the meaning of each question.
1. What do you want to accomplish with the help of expert staff? Have you already done any work on this aspect of your software?
We request ECSS support to improve the parallel efficiency of our major production code, "ABCDcode". This application has routinely utilized up to 250 processes for a cumulative run-time of about 300 hours. The proposed research requires a 6-fold increase in dataset size, which will require the utilization of additional processes. In order to maintain an appropriate time-to-solution, at least 1,000 processes will need to be utilized.
This project will require the identification of barriers to parallelization which decrease the parallel efficiency of ABCDcode. A likely culprit is our use of a master-slave task management pattern which relies on synchronization between a subset of slave processes. We expect that replacing this communication pattern with a more scalable alternative would greatly increase the parallel efficiency. Additionally, the removal of data dependencies between processes should also help remove synchronization between tasks and improve efficiency. It may also be necessary to investigate the choice of specific algorithms within the code, since it is likely that certain implementations may be adding to the inefficient communication patterns and/or unnecessary process synchronizations.
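A request of this kind is easier to assess when the parallel efficiency is quantified from a few timing runs. The following is a minimal sketch in Python, assuming strong scaling (a fixed problem size) and efficiency measured relative to the smallest run. The function names and the timings are hypothetical, chosen only for illustration; they are not measurements of ABCDcode or of any XSEDE system.

```python
def speedup(t_base: float, t_n: float) -> float:
    """Speedup of a run relative to a baseline run of the same problem."""
    return t_base / t_n


def parallel_efficiency(p_base: int, t_base: float, p_n: int, t_n: float) -> float:
    """Relative parallel efficiency: speedup divided by the growth in process count."""
    return speedup(t_base, t_n) / (p_n / p_base)


# Hypothetical wall-clock times in hours for one fixed problem size,
# keyed by the number of processes used (illustrative values only).
timings = {250: 40.0, 500: 24.0, 1000: 16.0}

base_p = min(timings)  # use the smallest run as the baseline
for p in sorted(timings):
    eff = parallel_efficiency(base_p, timings[base_p], p, timings[p])
    print(f"{p:>5} processes: relative efficiency {eff:.2f}")
```

An efficiency that falls steadily as the process count grows, as in these made-up numbers, is the kind of evidence that points to synchronization or communication overhead as a barrier to scaling.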
The assistance requested may pertain to visualization or other aspects of data analysis:
We request ECSS guidance on the visualization of the simulation data produced by ABCDcode. First, we will need advice as to which tools are best suited to produce 3D animations, volume renderings and streamlines from our data. After deploying and testing these, we may then agree with ECSS staff that we also need help producing a set of initial visualizations and developing a visualization pipeline for our data.
For a project intended to serve a wider community:
We request ECSS support to develop our major production code, "ABCDcode", into a community code for Science Discipline Z. As reported in our XRAC proposal, this application has routinely utilized up to 10,000 processes on Machine M, allowing our group to achieve important breakthroughs in our own research. As a result, we have received many requests from our colleagues to make ABCDcode available to them; however, early attempts by two other groups to use the software on Machine M have resulted in crashes for some of their datasets and considerable performance degradation in other cases. In addition, we have encountered difficulties porting the code to Machines L and N.
We have a good idea of the reasons for these issues, since our code still embodies some assumptions specific to our own research problems. We hope that ECSS will help us to efficiently remove these restrictions and to develop and thoroughly test a generalized, robust and maintainable version of ABCDcode.
2. How would the success of this collaboration benefit your project?
The successful improvement of ABCDcode's parallel efficiency will enable the simulation of a dataset 6 times larger than previously attempted. This would allow us, with the SUs requested in this XRAC proposal, to directly compare our results with experiment, rather than rely on interpolation. The improved parallel efficiency will also enable and inform the optimizations needed for future investigations. In addition, the improved efficiency will reduce the time-to-solution of current production simulations. This will greatly increase the group's scientific output.
Or, if you wish to develop a community capability:
The successful development and testing of ABCDcode to become a robust and maintainable community code would greatly expand the ability of the Science Discipline Z community to perform the collaborative research proposed to this XRAC review. It will enable us to provide support to its users with a sustainable effort level, including by people from other groups who are not part of the original development team. It will make it possible for us to consider making this system available via a Science Gateway, for the development of which we may need to request continued support next year.
3. Which member(s) of your team would collaborate with ECSS staff?
It is important to ensure that members of your team work hand in hand with the ECSS staff members, so that you fully understand what has been done and are ready to take over when the staff support period ends.
Two graduate students in our group will be dedicated to this project over the next year. They have experience in both the operation and maintenance of ABCDcode which they obtained over the last year. They will collaborate directly with ECSS staff and be responsible for explaining and implementing changes in source code. The ECSS staff will be responsible for identifying parallelization barriers, suggesting improvements, and answering specific questions about resource utilization.
4. Have you had significant interaction on previous projects related to your current proposal or discussed your extended support needs with any XSEDE staff? If so, please indicate with whom.
This helps us form the ECSS project team and the project plan by enlisting the named staff members in understanding and launching the support project. For example:
The scalability issues with ABCDcode on Machine M were first diagnosed by user consultant F, who, while handling a problem report we had submitted to the XOC, suggested that the likely culprit is our use of a master-slave task management pattern which relies on synchronization between a subset of slave processes. She suggested that fixing the issues will require a sustained collaboration with ECSS staff.
It is expected that the impetus for many ECSS projects will come from the work of the Novel and Innovative Projects (NIP) development area. If this has been the case, it should be mentioned in response to Q4. For example:
We became aware of the potential of XSEDE for our research via our Campus Champion, Dr. C, who arranged for a series of discussions with ECSS NIP staff member Dr. N. This led to our Startup grant which forms the basis for our XRAC proposal, and it was Dr. N who suggested that our serial code should be parallelized and incorporated into an automated work and data flow system via the ECSS project proposed here. We request that Dr. N remain our principal contact as we execute this ECSS project.
5. Have you received TeraGrid/XSEDE advanced support in the past? If so, please indicate the time period, and how the support you received then relates to the support you request now.
ECSS staff expertise is a valuable and limited resource shared by all XSEDE users, so it is important to make sure that as many research teams and communities as possible benefit from it. Therefore, requests for recurrent support must be justified and reviewed carefully, to ensure that it is indeed necessary for the full potential of past progress to be realized, or that this is a highly meritorious new project. For example:
Last year, our ECSS project led to the successful development of ABCDcode into a robust and maintainable community code. As discussed in our new XRAC proposal, we now wish to build a Science Gateway to enable an estimated 2,000 members of the Science Discipline Z community to run this code every year, on XSEDE systems L, M, and N, as well as on various campus-based clusters. As described above, we require the gateway-building expertise of ECSS to efficiently achieve this goal.
Or, to argue that this is a new effort:
Two years ago, our TeraGrid ASTA project boosted the scalability of our "ABCDcode" to routinely utilize up to 10,000 processes on Machine M, allowing our group to achieve important breakthroughs in Science Problem P. Recent work by the PI and a new postdoc, using the XSEDE Startup grant mentioned in the XRAC proposal, has resulted in the creation of a new code, "GHIJcode", which is the first approach that shows any promise of tackling Problem Q, heretofore considered intractable. To be useful in production, GHIJcode would have to scale to 100,000 processes and be made part of a complex distributed task and data flow. We hope that we will be able to achieve this, but only with ECSS help.
Last update: March 3, 2015