SciNet supercomputer's GPFS trick: We node what you did, burst buffer

Good news for Canadian HPC models

A Canadian supercomputer centre using a fast-access parallel file system has stuffed an Excelero burst buffer between this storage and the compute nodes.

Why, you ask?

We'll explain. The SciNet supercomputer centre at the University of Toronto provides resources for thousands of researchers in biomedical, aerospace, climate sciences, and more. Its supercomputing jobs - large-scale modelling, simulation, analysis and visualization applications - can sometimes run for weeks, and interruptions delay or occasionally destroy an entire job's results, meaning it has to be run again.

Checkpointing, with fast restart of interrupted jobs, has been used to reduce that risk. But with the disk-based Spectrum Scale (GPFS) storage, checkpoints take longer as individual jobs become larger, making the calculation difficult – or in the worst case, impossible – to carry out.
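
For readers who have not met the technique, here is a minimal sketch of checkpoint and restart in Python. It is purely illustrative and is not SciNet's code: the function names, the JSON state format and the toy workload are all assumptions made for this example. The job saves its state every few steps, and a restarted job resumes from the last saved state instead of from zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on storage before renaming
    os.replace(tmp, path)     # the rename replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, steps, every=10, fail_at=None):
    """Toy job: sum the step numbers, checkpointing every `every` steps.

    `fail_at` is a hypothetical knob that simulates an interruption.
    """
    state = load_checkpoint(path)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The slower the storage behind `save_checkpoint`, the longer the job stalls at each checkpoint, which is the cost a burst buffer is meant to cut.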

The new idea is to put a flash-based burst buffer between the disks and the compute nodes, so checkpointing can be done faster. SciNet fitted NVMe flash drives to some of the compute nodes, which already had a low latency fabric interconnect, and virtualized them into a shared flash pool using Excelero's NVMesh software.

There are 80 NVMe flash drives in 10 servers, which support the NSD (Network Shared Disk) protocol. Collectively, this burst buffer system is said to provide 20 million random 4K read IOPS, 148GB/sec of write burst bandwidth and 230GB/sec of read throughput. Checkpoints can be completed in 15 minutes.
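
A quick back-of-the-envelope check puts those aggregate numbers in per-drive terms. The sketch below uses only the figures quoted above. It assumes the load is spread evenly across all 80 drives, and that the full write burst rate could be held for the whole 15 minutes. Neither assumption comes from SciNet or Excelero, so treat the last figure as a ceiling, not a measured checkpoint size.

```python
# Figures quoted in the article (aggregate across the burst buffer)
DRIVES = 80
SERVERS = 10
READ_IOPS = 20_000_000   # random 4K read IOPS
WRITE_GB_S = 148         # write burst bandwidth, GB/sec
READ_GB_S = 230          # read throughput, GB/sec
CHECKPOINT_MIN = 15      # quoted checkpoint time, minutes


def per_drive(total, drives=DRIVES):
    """Even split across drives (an assumption, not a measured figure)."""
    return total / drives


drives_per_server = DRIVES // SERVERS
iops_per_drive = per_drive(READ_IOPS)
write_per_drive = per_drive(WRITE_GB_S)
read_per_drive = per_drive(READ_GB_S)

# Upper bound only: assumes the burst write rate is sustained for 15 minutes
max_checkpoint_tb = WRITE_GB_S * CHECKPOINT_MIN * 60 / 1000

print(f"{drives_per_server} drives per server")
print(f"{iops_per_drive:,.0f} read IOPS per drive")
print(f"{write_per_drive:.2f} GB/sec write per drive")
print(f"{read_per_drive:.3f} GB/sec read per drive")
print(f"up to {max_checkpoint_tb:.1f} TB written in {CHECKPOINT_MIN} minutes")
```

On those assumptions each drive would handle about 250,000 read IOPS and just under 2GB/sec of writes, and the buffer could absorb at most roughly 133TB in a 15-minute window.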

Dr Daniel Gruner, CTO at the SciNet High Performance Computing Consortium, said: "NVMesh is an extremely cost-effective method of achieving unheard-of burst buffer bandwidth."

The NVMesh burst buffer "enables standard servers to go beyond their usual role in acting as block targets – the servers now can also act as file servers."

It would be interesting to compare the performance and cost of this NVMesh configuration with DDN's IME burst buffer. ®


Biting the hand that feeds IT © 1998–2017
