[7-24] Fault Tolerance for PetaScale Systems: Current Knowledge, Challenges and Opportunities
Date:2008-07-15
Title:Fault Tolerance for PetaScale Systems: Current Knowledge, Challenges and Opportunities
Speaker:Professor Franck Cappello (INRIA)
Time:14:00-15:00, July 24
Venue:Room 337
Bio.:
Professor Franck Cappello holds a Senior Researcher position at INRIA. He leads the Grand-Large project at INRIA, which focuses on high-performance issues in large-scale distributed systems. He initiated the XtremWeb (Desktop Grid) and MPICH-V (fault-tolerant MPI) projects. He is currently the director of the Grid5000 project, a nationwide computer science platform for research in Grid and P2P. He has authored more than 60 papers in the domains of High Performance Programming, Desktop Grids, Grids and fault-tolerant MPI. He has contributed to more than 40 Program Committees. He is an editorial board member of the international Journal on Grid Computing, the Journal of Grid and Utility Computing and the Journal of Cluster Computing. He is a steering committee member of IEEE HPDC and IEEE/ACM CCGRID. He is the General co-Chair of IEEE APSCC 2008, Workshop co-Chair for IEEE CCGRID’2008, Program co-Chair of IEEE CCGRID’2009 and was the General Chair of IEEE HPDC’2006.
Abstract:
The emergence of PetaScale systems has reinvigorated the community's interest in how to manage failures in such systems and ensure that large applications complete successfully. This talk will present existing results for several key fault tolerance mechanisms in HPC platforms. Most of these mechanisms come from distributed systems theory. Over the last decade, they have received a lot of attention from the community, and there is probably little to gain by trying to optimize them again. We will describe some of the latest findings in this domain. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolution of large-scale systems. There is room, and even a need, for new approaches. Opportunities may come from several directions, such as adding hardware dedicated to fault tolerance. We will sketch some of these opportunities and their associated limitations.