Three Papers Accepted - SC12


Three papers by Akira Nukada, Kento Sato, and Katsuki Fujisawa have been accepted at SC12.

Authors: Akira Nukada, Kento Sato, Satoshi Matsuoka
      Title: Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer

For scalable 3-D FFT computation using multiple GPUs, efficient all-to-all communication between GPUs is the most important factor for good performance. Implementations based on point-to-point MPI library functions and CUDA memory copy APIs typically exhibit very large overheads, especially for small message sizes in all-to-all communication among many nodes. We propose several schemes to minimize these overheads, including the use of lower-level InfiniBand APIs to effectively overlap intra- and inter-node communication with computation, as well as auto-tuning strategies to control scheduling and determine rail assignments. As a result, we achieve very good strong scalability as well as good performance, up to 4.8 TFLOPS using 256 nodes of the TSUBAME 2.0 supercomputer (768 GPUs), several times faster than reported in comparable work.
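To illustrate the communication pattern at the heart of this work, here is a minimal sketch of a pairwise all-to-all exchange schedule in which every rank is busy at every step. This is a generic textbook schedule for illustration only; the paper's actual implementation uses low-level InfiniBand communication, overlap with computation, and auto-tuned rail assignment, none of which this toy model captures.

```python
def pairwise_schedule(nprocs):
    """Return per-step exchange partners using XOR pairing.

    In step s, rank r exchanges blocks with rank r ^ s, so every pair of
    ranks communicates exactly once and no rank is ever idle (assuming
    nprocs is a power of two).
    """
    return [[(r, r ^ s) for r in range(nprocs)] for s in range(1, nprocs)]

def simulate_alltoall(nprocs, send):
    """Simulate an all-to-all: send[r][p] is the block rank r sends to rank p."""
    recv = [[None] * nprocs for _ in range(nprocs)]
    for r in range(nprocs):
        recv[r][r] = send[r][r]  # local block needs no communication
    for step in pairwise_schedule(nprocs):
        for r, partner in step:
            # one direction of the exchange; the reverse pair (partner, r)
            # also appears in this step and covers the other direction
            recv[partner][r] = send[r][partner]
    return recv
```

After the simulated exchange, `recv[p][r]` holds the block that rank `r` sent to rank `p`, which is exactly the transpose-of-blocks semantics that a distributed 3-D FFT relies on between its 1-D transform phases.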


Authors: Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka
      Title: Design and Modeling of a Non-blocking Checkpointing System

As high performance computing (HPC) systems move towards exascale, the resiliency of these systems is becoming increasingly important. Typically, applications periodically save their state in checkpoint files to mitigate losses due to failures. However, checkpointing on large-scale systems can incur unacceptably high overheads when transferring checkpoints to the parallel file system (PFS). As a result, as HPC systems grow larger, checkpointing overheads increasingly stall application progress and reduce system efficiency. Our approach to solving this problem is to combine non-blocking checkpointing with multi-level checkpointing, where checkpoints are first cached on compute node-local storage and then asynchronously drained from the compute nodes to the PFS. In this paper, we present the design of our approach and a model describing its performance. Our experimental results show that the combination of non-blocking and multi-level checkpointing can achieve up to 5.2 times better efficiency on future systems.
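The core idea, caching a checkpoint on fast node-local storage and draining it to the slow PFS in the background while the application keeps computing, can be sketched in a few lines. All names below are illustrative, and the two "storage levels" are plain dictionaries; this is a conceptual model, not the paper's actual system.

```python
import queue
import threading

class NonBlockingCheckpointer:
    """Toy model of non-blocking multi-level checkpointing."""

    def __init__(self, local_store, pfs_store):
        self.local = local_store   # level 1: fast, node-local cache
        self.pfs = pfs_store       # level 2: slow, global file system
        self._pending = queue.Queue()
        self._drainer = threading.Thread(target=self._drain, daemon=True)
        self._drainer.start()

    def checkpoint(self, step, state):
        """Block only for the fast local write; the PFS copy is asynchronous."""
        self.local[step] = dict(state)
        self._pending.put(step)    # hand off to the background drainer

    def _drain(self):
        """Background thread: copy cached checkpoints out to the PFS."""
        while True:
            step = self._pending.get()
            if step is None:       # shutdown sentinel
                break
            self.pfs[step] = self.local[step]
            self._pending.task_done()

    def finish(self):
        """Wait until every cached checkpoint has reached the PFS."""
        self._pending.join()
        self._pending.put(None)
        self._drainer.join()
```

In this model, `checkpoint()` returns as soon as the local write completes, so the application's compute loop overlaps with the slow transfer to the PFS, which is precisely the overhead the non-blocking design hides.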


Authors: Katsuki Fujisawa, Toshio Endo, Hitoshi Sato, Makoto Yamashita, Maho Nakata
      Title: High-Performance General Solver for Extremely Large-scale Semidefinite Programming Problems

Semidefinite Programming (SDP) is one of the most important problems in mathematical optimization. It covers a wide range of applications such as combinatorial optimization, structural optimization, control theory, economics, quantum chemistry, sensor network localization, data mining, etc. Solving extremely large-scale SDP problems has significant importance for the current and future applications of SDPs. In 1995, Fujisawa et al. started the SDPA Project, aimed at solving large-scale SDP problems with numerical stability and accuracy. SDPA is one of the pioneering codes for solving general SDPs. SDPARA is a parallel version of SDPA for multiple processors and distributed memory, which replaces the two major bottlenecks of SDPA (the generation of the Schur complement matrix and its Cholesky factorization) with parallel implementations. It has been successfully applied to combinatorial optimization and truss topology optimization. The new version of SDPARA (7.5.0-G), running on the large-scale supercomputer TSUBAME 2.0 at Tokyo Institute of Technology, has solved the largest SDP problem to date, with over 1.48 million constraints, setting a new world record. Our implementation has also achieved 533 TFlops in double precision for the large-scale Cholesky factorization using 2,720 CPUs and 4,080 GPUs.
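For readers unfamiliar with the bottleneck the abstract names: interior-point SDP solvers build a symmetric positive definite Schur complement matrix at every iteration and must Cholesky-factor it, and it is this dense factorization that SDPARA distributes across CPUs and GPUs. Below is a minimal pure-Python sketch of the serial factorization itself, on a small stand-in matrix rather than real SDP data; the parallel, GPU-accelerated version in the paper is far more involved.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that L * L^T == a.

    `a` must be symmetric positive definite, as the Schur complement
    matrix in an interior-point SDP solver is.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: subtract the squares of the row computed so far
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l
```

This O(n^3) loop nest is trivially stated but dominates the solver's runtime at scale, which is why parallelizing it (together with forming the Schur complement matrix) was the key to the 533 TFlops result.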

Copyright (c) 2010 Tokyo Institute of Technology. Matsuoka Labo. All Rights Reserved.