SciNet supercomputer's GPFS trick: We node what you did, burst buffer

Good news for Canadian HPC models

A Canadian supercomputer centre using a fast access parallel file system has stuffed an Excelero burst buffer between this storage and the compute nodes.

Why, you ask?

We'll explain. The SciNet supercomputer centre at the University of Toronto provides resources for thousands of researchers in biomedical, aerospace, climate sciences, and more. Its supercomputing jobs - large-scale modelling, simulation, analysis and visualization applications - can sometimes run for weeks, and interruptions delay or occasionally destroy an entire job's results, meaning it has to be run again.

Checkpointing, which lets an interrupted job restart quickly from its last saved state, has been used to reduce that risk. But with the disk-based Spectrum Scale (GPFS) storage, checkpoints take longer as individual jobs become larger, making the calculation difficult – or in the worst case, impossible to carry out.
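For readers unfamiliar with the technique, here is a minimal sketch of checkpoint-and-restart in Python. It is illustrative only and is not SciNet's code: the function names, the JSON state format and the `fail_at` switch used to simulate an interruption are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, steps, checkpoint_every, fail_at=None):
    """Toy 'job': sum the step numbers, saving state periodically.

    fail_at is a hypothetical hook that simulates an interruption.
    """
    state = load_checkpoint(path)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "job.ckpt")
        try:
            run(ckpt, steps=10, checkpoint_every=5, fail_at=7)
        except RuntimeError:
            print("job interrupted, restarting from last checkpoint")
        print(run(ckpt, steps=10, checkpoint_every=5))
```

The restarted job only repeats the work done since the last checkpoint rather than starting from scratch. The cost is the time spent writing state to storage, which is why checkpoint write speed matters more as jobs grow.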

The new idea is to use a flash-based burst buffer between the disks and the compute nodes, so checkpointing can be done faster. The way it was done was to fit NVMe flash drives to some of the compute nodes, which already had a low latency fabric interconnect, and virtualize them into a shared flash pool using Excelero's NVMesh software.

There are 80 NVMe flash drives in 10 servers which support the NSD (Network Shared Disk) protocol. Collectively this burst buffer system is said to provide 20 million random read 4K IOPS, 148GB/sec of write burst bandwidth and 230GB/sec of read throughput. Checkpoints can be completed in 15 minutes.
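To put those numbers in perspective, here is a back-of-the-envelope calculation in Python. It assumes the load is spread evenly across all 80 drives, that the quoted peak write bandwidth could be sustained for a full 15 minutes, and that units are decimal (1TB = 1,000GB). None of this is stated by SciNet or Excelero, so treat the checkpoint figure as an upper bound, not a measured size.

```python
# Figures quoted for the whole burst buffer.
DRIVES = 80
SERVERS = 10
WRITE_GB_PER_SEC = 148
READ_GB_PER_SEC = 230
RANDOM_READ_IOPS = 20_000_000
CHECKPOINT_MINUTES = 15

# Assumption: load is spread evenly over every drive.
drives_per_server = DRIVES // SERVERS
write_per_drive = WRITE_GB_PER_SEC / DRIVES
read_per_drive = READ_GB_PER_SEC / DRIVES
iops_per_drive = RANDOM_READ_IOPS / DRIVES

# Assumption: peak write bandwidth is sustained for the whole window,
# so this is the most data a 15-minute checkpoint could possibly hold.
max_checkpoint_tb = WRITE_GB_PER_SEC * CHECKPOINT_MINUTES * 60 / 1000

print(f"drives per server:       {drives_per_server}")
print(f"write per drive (GB/s):  {write_per_drive:.3f}")
print(f"read per drive (GB/s):   {read_per_drive:.3f}")
print(f"read IOPS per drive:     {iops_per_drive:,.0f}")
print(f"max checkpoint (TB):     {max_checkpoint_tb:.1f}")
```

Under those assumptions each server holds eight drives, each drive contributes about 1.85GB/sec of write bandwidth and 250,000 random read IOPS, and a 15-minute checkpoint could move at most roughly 133TB.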

Dr Daniel Gruner, CTO at the SciNet High Performance Computing Consortium, said: "NVMesh is an extremely cost-effective method of achieving unheard-of burst buffer bandwidth."

The NVMesh burst buffer "enables standard servers to go beyond their usual role in acting as block targets – the servers now can also act as file servers."

It would be interesting to compare the performance and cost of this NVMesh configuration with DDN's IME burst buffer.


