Handling Tens of Thousands of Cores with Industrial/Legacy Codes: Approaches, Implementation and Timings

Rainald Löhner

Abstract


Given that heat generation is proportional to the third power of the clock frequency, commodity chips will not be practical beyond 3-6 GHz.
While incremental gains can still be obtained via better compilers, prefetching, and a host of other software-based procedures, the only way to increase CPU performance by orders of magnitude is via massive parallelism. For field solvers this can be achieved via domain decomposition (which typically runs in a distributed-memory MPI setting), loop parallelism (which typically runs in a shared-memory OpenMP (OMP) setting), or specialized hardware (typically a GPU).
Given the (rather recent) plentiful availability of machines with tens of thousands of cores, we have migrated FEFLO, a typical large-scale code developed over decades, to these mixed MPI/OMP/GPU environments.
The talk will show ways in which this porting can be made as simple and transparent as possible, present results from a wide range of algorithms for compressible and incompressible flows, mesh handling techniques, and fluid-structure interaction, and also consider the many pitfalls encountered, and the compromises that sometimes have to be made, in a multiphysics production environment.
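
As a rough illustration of the domain decomposition idea mentioned above, the following minimal sketch splits a 1D array into contiguous subdomains, exchanges one halo value with each neighbor before every sweep, and checks that the result matches a serial run. It is not taken from FEFLO and uses no MPI, OpenMP, or GPU calls; the function names (`jacobi_serial`, `split`, `jacobi_decomposed`), the Jacobi averaging update, and the fixed end values are all assumptions chosen only to keep the example small.

```python
def jacobi_serial(u, sweeps):
    """Reference: repeated neighbor averaging with fixed end values."""
    u = list(u)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
        u = new
    return u


def split(n, parts):
    """Return contiguous [start, end) index ranges covering 0..n."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        end = start + base + (1 if p < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def jacobi_decomposed(u, sweeps, parts):
    """Same update, but each subdomain only sees its own data plus halos.

    Assumes parts <= len(u), so that no subdomain is empty.
    """
    n = len(u)
    ranges = split(n, parts)
    local = [list(u[s:e]) for s, e in ranges]
    for _ in range(sweeps):
        # Halo exchange: in a distributed-memory code this step would be
        # a send/receive between neighboring processes.
        halos = []
        for p in range(parts):
            left = local[p - 1][-1] if p > 0 else None
            right = local[p + 1][0] if p < parts - 1 else None
            halos.append((left, right))
        new_local = []
        for p, (s, _e) in enumerate(ranges):
            left, right = halos[p]
            ext = ([left] if left is not None else []) + local[p] \
                + ([right] if right is not None else [])
            off = 1 if left is not None else 0
            new = local[p][:]
            # This inner loop is the part a shared-memory code would
            # parallelize over threads.
            for j in range(len(local[p])):
                g = s + j  # global index of local entry j
                if g == 0 or g == n - 1:
                    continue  # fixed end values
                k = j + off
                new[j] = 0.5 * (ext[k - 1] + ext[k + 1])
            new_local.append(new)
        local = new_local
    return [x for part in local for x in part]


if __name__ == "__main__":
    u0 = [0.0] * 9 + [1.0]
    print(jacobi_decomposed(u0, 5, 3) == jacobi_serial(u0, 5))
```

Because every subdomain performs the same arithmetic as the serial loop, the two results agree exactly; in a production code the halo exchange and the load balance between subdomains are where most of the porting effort tends to go.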

Asociación Argentina de Mecánica Computacional
Güemes 3450
S3000GLN Santa Fe, Argentina
Phone: 54-342-4511594 / 4511595 Int. 1006
Fax: 54-342-4511169
E-mail: amca(at)santafe-conicet.gov.ar
ISSN 2591-3522