
openPMD plugin: Flush data to disk within a step #4002

Conversation

@franzpoeschel franzpoeschel commented Feb 28, 2022

The upcoming BP5 engine in ADIOS2 has some features for saving memory compared to BP4.

BP5 will not replace BP4, because these memory optimizations come at a runtime cost; instead, users will be able to choose between runtime efficiency and memory efficiency.

One feature that we asked for, and which is now implemented, is the ability to flush data to disk within a single IO step. I am currently working on exposing this functionality in openPMD. Together with that openPMD PR, this PR makes the feature available as a preview in PIConGPU.
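
As a rough illustration of what this looks like on the openPMD-api side, here is a minimal sketch (not the PIConGPU integration itself; the `bp5` engine key and the exact within-step flush semantics depend on the accompanying openPMD-api PR and the ADIOS2 version):

```cpp
#include <openPMD/openPMD.hpp>

#include <vector>

using namespace openPMD;

int main()
{
    // Request the ADIOS2 BP5 engine through the backend configuration
    // (illustrative JSON; the final key names may differ).
    Series series(
        "simData_%T.bp",
        Access::CREATE,
        R"({"adios2": {"engine": {"type": "bp5"}}})");

    std::vector<double> chunk(1000, 1.0);

    // Open an IO step and register a chunk for writing.
    Iteration it = series.writeIterations()[0];
    auto E_x = it.meshes["E"]["x"];
    E_x.resetDataset({Datatype::DOUBLE, {1000}});
    E_x.storeChunk(chunk, {0}, {1000});

    // Flush while the step is still open. With BP4 this only moves data into
    // the engine's buffer; with BP5 the data can already be written to disk.
    series.flush();

    // ... store and flush further chunks within the same step ...

    it.close(); // ends the IO step
}
```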

Pinging @psychocoderHPC because he asked for this feature.

TODO:

First results

I ran 4 tests, each one writing 3 IO steps with a bit more than 15 GB per step:

  1. BP4 engine without InitialBufferSize
  2. BP4 engine with a correctly specified InitialBufferSize (see the configuration sketch after this list)
  3. BP5 engine without this PR
  4. BP5 engine with this PR, aggressively writing to disk as often as possible
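
For orientation, setup 2 corresponds roughly to an engine configuration like the following, passed to the openPMD-api ADIOS2 backend via the plugin's JSON parameter (the key layout follows the openPMD-api conventions; the size value here is illustrative for these ~15 GB steps):

```json
{
  "adios2": {
    "engine": {
      "type": "bp4",
      "parameters": {
        "InitialBufferSize": "16GB"
      }
    }
  }
}
```

Setups 1 and 3 simply omit the `InitialBufferSize` parameter (with `bp5` as the engine type in setup 3); setup 4 additionally uses the within-step flushing introduced by this PR.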

The memory profiles of the four runs are shown in the following screenshot, arranged as sketched below; note the different y-axis scales:

( 1 | 2 )
( 3 | 4 )

Screenshot from 2022-02-28 15-30-54

Further details:
Screenshot from 2022-02-28 15-30-35

Interpretation:

  1. Known pathological behavior when InitialBufferSize is not specified; don't do this
  2. Best speed, but high memory usage and the need to specify InitialBufferSize beforehand
  3. Buffers are allocated as needed; memory usage is equivalent to 2. if InitialBufferSize is specified there as exactly the right amount
  4. Lowest memory usage, but long runtime due to many small write operations. Compared to the current ADIOS2 output, this saves ~15 GB of peak memory usage

As it stands, the runtime of the BP5-based setups is very long in these benchmarks. The parameters of the BP5 engine are not documented yet, so I have not really had a chance to tune this.

@franzpoeschel
Contributor Author

@pnorbert suggested specifying BufferChunkSize as 2 GB; with this setting I got performance close to BP4 for setup (3), and a bit slower than that for setup (4):

Screenshot from 2022-03-01 15-22-38
Screenshot from 2022-03-01 15-22-53
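
In terms of the JSON configuration, that suggestion amounts to something like the following (sketch only; `BufferChunkSize` is a BP5 engine parameter, and the exact value format and maximum accepted depend on the ADIOS2 release):

```json
{
  "adios2": {
    "engine": {
      "type": "bp5",
      "parameters": {
        "BufferChunkSize": "2147381248"
      }
    }
  }
}
```

The value here is just below 2 GB, i.e. BP5 allocates its internal buffer in chunks of roughly that size instead of its much smaller default chunk size.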

@franzpoeschel
Contributor Author

Note that in the above image the first graph reaches a surprisingly large peak memory consumption of 110 GB.
This is virtual memory only. Essentially, specifying BufferChunkSize=2GB makes ADIOS2 over-allocate memory and use only a small part of each chunk.
The BP5 engine uses malloc for allocation, which is what Heaptrack tracks. Only a small percentage of the malloc'd memory is ever backed by physical memory. I confirmed this by monitoring PIConGPU with top while it was running:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                
4025970 franzpo+  20   0  135.1g  46.1g 136268 R  98.7  36.8   1:31.97 picongpu 

Here, the virtual memory usage is even higher than what Heaptrack reports, but the physical memory (RES) peaks at 55 GB.

That being said, I don't know whether Slurm or other batch systems are aware of this distinction, i.e. whether they monitor the memory usage of jobs by physical or by virtual memory.

@franzpoeschel
Contributor Author

The high virtual memory usage is now fixed in ADIOS2, see the top row in the screenshot:
Screenshot from 2022-03-03 15-13-33

Also, Norbert told me that there are needless copies in the setup that I use, so I activated the Span-based API for BP5 in openPMD; the result is the bottom row, which is actually faster than BP4.

I assume that this is because BP4 initializes 20 GB of memory with zeroes, so the advantage probably will not translate to runs at scale (initialization happens only once, the difference is exaggerated by running under Heaptrack, and IO efficiency will dominate over serialization efficiency at scale).
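
For reference, the Span-based storeChunk API looks roughly like this (a minimal sketch of the openPMD-api usage, not the PIConGPU code; the engine selection string is again illustrative):

```cpp
#include <openPMD/openPMD.hpp>

#include <cstddef>

using namespace openPMD;

int main()
{
    Series series(
        "simData_%T.bp",
        Access::CREATE,
        R"({"adios2": {"engine": {"type": "bp5"}}})");

    Iteration it = series.writeIterations()[0];
    auto E_x = it.meshes["E"]["x"];
    E_x.resetDataset({Datatype::DOUBLE, {1000}});

    // Instead of handing over a user-owned buffer (which the engine then
    // copies), ask the backend for a buffer and fill it in place.
    auto view = E_x.storeChunk<double>({0}, {1000});
    auto span = view.currentBuffer();
    for (std::size_t i = 0; i < span.size(); ++i)
    {
        span.data()[i] = static_cast<double>(i);
    }

    series.flush();
    it.close();
}
```

This avoids one host-side serialization copy per dataset, which is presumably what makes the bottom row faster here.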

@franzpoeschel
Contributor Author

franzpoeschel commented Mar 3, 2022

In combination with the mapped-memory data preparation strategy: who needs host memory?
Screenshot from 2022-03-03 15-57-28

Given that we will probably add a third data preparation strategy for Frontier, ending up with memory profiles like this one might not be out of the question.

@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 1ee5d61 to 4ae3880 Compare May 6, 2022 10:51
@franzpoeschel
Contributor Author

This PR now contains a working suggestion for how to handle different flush targets via JSON configuration.
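
As a sketch of what such a configuration can look like (illustrative only; a key along the lines of adios2.engine.preferred_flush_target is what openPMD-api exposes for BP5 in recent versions, but the exact keys and values handled by this PR may differ):

```json
{
  "adios2": {
    "engine": {
      "type": "bp5",
      "preferred_flush_target": "disk"
    }
  }
}
```

With a flush target of "disk", every flush inside a step moves the accumulated data out of the ADIOS2 buffers, which is what produces the low-memory profile of setup 4 above; a target of "buffer" keeps the BP4-like behavior of aggregating everything until the step is closed.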

@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from be2aac6 to 1ba0120 Compare June 2, 2022 14:43
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 1ba0120 to 94be5df Compare July 26, 2022 15:25
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 62fc8a4 to 9547d7b Compare August 24, 2022 09:40
@franzpoeschel franzpoeschel marked this pull request as ready for review August 24, 2022 09:49
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 9547d7b to db10489 Compare September 15, 2022 09:01
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from db10489 to 20e22cb Compare October 18, 2022 14:04
@franzpoeschel franzpoeschel changed the title [WIP] openPMD plugin: Flush data to disk within a step openPMD plugin: Flush data to disk within a step Oct 19, 2022
@psychocoderHPC psychocoderHPC added the refactoring and component: plugin labels Nov 4, 2022
@psychocoderHPC
Member

Thanks for working on this feature. This change is required to reduce the memory footprint of IO on ORNL Crusher/Frontier and other systems that have little host memory compared to their GPU memory.

Sorry, I was not aware that you had pushed new changes to this PR. Please ping me next time.
I will review this PR next week.

@psychocoderHPC psychocoderHPC merged commit e80c45e into ComputationalRadiationPhysics:dev Nov 8, 2022