
openPMD plugin: Flush data to disk within a step #4002

Conversation

@franzpoeschel franzpoeschel commented Feb 28, 2022

The upcoming BP5 engine in ADIOS2 has some features for saving memory compared to BP4.

BP5 will not replace BP4, because these memory optimizations come at a runtime cost; instead, users will be able to choose between runtime efficiency and memory efficiency.

One feature that we asked for, and which is now implemented, is the ability to flush data to disk within a single IO step. I am currently working on exposing this functionality in openPMD. Together with that openPMD PR, this PR makes the feature available as a preview in PIConGPU.
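
As a rough illustration of what this looks like on the openPMD-api side, here is a minimal sketch (not the PIConGPU integration itself; the `bp5` engine key and the exact within-step flush semantics depend on the accompanying openPMD-api PR and the ADIOS2 version):

```cpp
#include <openPMD/openPMD.hpp>

#include <vector>

using namespace openPMD;

int main()
{
    // Request the ADIOS2 BP5 engine through the backend configuration
    // (illustrative JSON; the final key names may differ).
    Series series(
        "simData_%T.bp",
        Access::CREATE,
        R"({"adios2": {"engine": {"type": "bp5"}}})");

    std::vector<double> chunk(1000, 1.0);

    // Open an IO step and register a chunk for writing.
    Iteration it = series.writeIterations()[0];
    auto E_x = it.meshes["E"]["x"];
    E_x.resetDataset({Datatype::DOUBLE, {1000}});
    E_x.storeChunk(chunk, {0}, {1000});

    // Flush while the step is still open. With BP4 this only moves data into
    // the engine's buffer; with BP5 the data can already be written to disk.
    series.flush();

    // ... store and flush further chunks within the same step ...

    it.close(); // ends the IO step
}
```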

Pinging @psychocoderHPC because he asked for this feature.

TODO:

First results

I ran 4 tests, each one writing 3 IO steps with a bit more than 15 GB per step:

  1. BP4 engine without InitialBufferSize
  2. BP4 engine with a correctly specified InitialBufferSize (see the configuration sketch after this list)
  3. BP5 engine without this PR
  4. BP5 engine with this PR, aggressively writing to disk as often as possible
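
For orientation, setup 2 corresponds roughly to an engine configuration like the following, passed to the openPMD-api ADIOS2 backend via the plugin's JSON parameter (the key layout follows the openPMD-api conventions; the size value here is illustrative for these ~15 GB steps):

```json
{
  "adios2": {
    "engine": {
      "type": "bp4",
      "parameters": {
        "InitialBufferSize": "16GB"
      }
    }
  }
}
```

Setups 1 and 3 simply omit the `InitialBufferSize` parameter (with `bp5` as the engine type in setup 3); setup 4 additionally uses the within-step flushing introduced by this PR.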

The memory profiles of the four runs are shown in the following screenshot, arranged as sketched below; note the different y-axis scales:

( 1 | 2 )
( 3 | 4 )

Screenshot from 2022-02-28 15-30-54

Further details:
Screenshot from 2022-02-28 15-30-35

Interpretation:

  1. Known pathological behavior when InitialBufferSize is not specified; don't do this
  2. Best speed, but high memory usage and the need to specify InitialBufferSize beforehand
  3. Buffers are allocated as needed; memory usage is equivalent to 2. if InitialBufferSize is specified there as exactly the right amount
  4. Lowest memory usage, but long runtime due to many small write operations. Compared to the current ADIOS2 output, this saves ~15 GB of peak memory usage

As it stands, the runtime of the BP5-based setups is very long in these benchmarks. The parameters of the BP5 engine are not documented yet, so I have not really had a chance to tune this.

@franzpoeschel
Contributor Author

@pnorbert suggested specifying BufferChunkSize as 2 GB; with this setting I got performance close to BP4 for setup (3), and a bit slower than that for setup (4):

Screenshot from 2022-03-01 15-22-38
Screenshot from 2022-03-01 15-22-53
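
In terms of the JSON configuration, that suggestion amounts to something like the following (sketch only; `BufferChunkSize` is a BP5 engine parameter, and the exact value format and maximum accepted depend on the ADIOS2 release):

```json
{
  "adios2": {
    "engine": {
      "type": "bp5",
      "parameters": {
        "BufferChunkSize": "2147381248"
      }
    }
  }
}
```

The value here is just below 2 GB, i.e. BP5 allocates its internal buffer in chunks of roughly that size instead of its much smaller default chunk size.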

@franzpoeschel
Contributor Author

Note that in the above image the first graph reaches a surprisingly large peak memory consumption of 110 GB.
This is virtual memory only. Essentially, specifying BufferChunkSize=2GB makes ADIOS2 over-allocate memory and use only a small part of each chunk.
The BP5 engine uses malloc for allocation, which is what Heaptrack tracks. Only a small percentage of the malloc'd memory is ever backed by physical memory. I confirmed this by monitoring PIConGPU with top while it was running:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                
4025970 franzpo+  20   0  135.1g  46.1g 136268 R  98.7  36.8   1:31.97 picongpu 

Here, the virtual memory usage is even higher than what Heaptrack reports, but the physical memory (RES) peaks at 55 GB.

That being said, I don't know whether Slurm or other batch systems are aware of this distinction, i.e. whether they monitor the memory usage of jobs by physical or by virtual memory.

@franzpoeschel
Contributor Author

The high virtual memory usage is now fixed in ADIOS2, see the top row in the screenshot:
Screenshot from 2022-03-03 15-13-33

Also, Norbert told me that there are needless copies in the setup that I use, so I activated the Span-based API for BP5 in openPMD; the result is the bottom row, which is actually faster than BP4.

I assume that this is because BP4 initializes 20 GB of memory with zeroes, so the advantage probably will not translate to runs at scale (initialization happens only once, the difference is exaggerated by running under Heaptrack, and IO efficiency will dominate over serialization efficiency at scale).
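
For reference, the Span-based storeChunk API looks roughly like this (a minimal sketch of the openPMD-api usage, not the PIConGPU code; the engine selection string is again illustrative):

```cpp
#include <openPMD/openPMD.hpp>

#include <cstddef>

using namespace openPMD;

int main()
{
    Series series(
        "simData_%T.bp",
        Access::CREATE,
        R"({"adios2": {"engine": {"type": "bp5"}}})");

    Iteration it = series.writeIterations()[0];
    auto E_x = it.meshes["E"]["x"];
    E_x.resetDataset({Datatype::DOUBLE, {1000}});

    // Instead of handing over a user-owned buffer (which the engine then
    // copies), ask the backend for a buffer and fill it in place.
    auto view = E_x.storeChunk<double>({0}, {1000});
    auto span = view.currentBuffer();
    for (std::size_t i = 0; i < span.size(); ++i)
    {
        span.data()[i] = static_cast<double>(i);
    }

    series.flush();
    it.close();
}
```

This avoids one host-side serialization copy per dataset, which is presumably what makes the bottom row faster here.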

@franzpoeschel
Contributor Author

franzpoeschel commented Mar 3, 2022

In combination with the mapped-memory data preparation strategy: who needs host memory?
Screenshot from 2022-03-03 15-57-28

Given that we will probably add a third data preparation strategy for Frontier, ending up with memory profiles like this one might not be out of the question.

@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 1ee5d61 to 4ae3880 Compare May 6, 2022 10:51
@franzpoeschel
Contributor Author

This PR now contains a working suggestion for how to handle different flush targets via JSON configuration.
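
As a sketch of what such a configuration can look like (illustrative only; a key along the lines of adios2.engine.preferred_flush_target is what openPMD-api exposes for BP5 in recent versions, but the exact keys and values handled by this PR may differ):

```json
{
  "adios2": {
    "engine": {
      "type": "bp5",
      "preferred_flush_target": "disk"
    }
  }
}
```

With a flush target of "disk", every flush inside a step moves the accumulated data out of the ADIOS2 buffers, which is what produces the low-memory profile of setup 4 above; a target of "buffer" keeps the BP4-like behavior of aggregating everything until the step is closed.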

@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from be2aac6 to 1ba0120 Compare June 2, 2022 14:43
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 1ba0120 to 94be5df Compare July 26, 2022 15:25
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 62fc8a4 to 9547d7b Compare August 24, 2022 09:40
@franzpoeschel franzpoeschel marked this pull request as ready for review August 24, 2022 09:49
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from 9547d7b to db10489 Compare September 15, 2022 09:01
@franzpoeschel franzpoeschel force-pushed the topic-flush-inside-step branch from db10489 to 20e22cb Compare October 18, 2022 14:04
@franzpoeschel franzpoeschel changed the title [WIP] openPMD plugin: Flush data to disk within a step openPMD plugin: Flush data to disk within a step Oct 19, 2022
@psychocoderHPC psychocoderHPC added the refactoring and component: plugin labels Nov 4, 2022
@psychocoderHPC
Member

Thanks for working on this feature. This change is required to reduce the memory footprint of IO on ORNL Crusher/Frontier and other systems that have little host memory compared to their GPU memory.

Sorry, I was not aware that you had pushed new changes to this PR. Please ping me next time.
I will review this PR next week.

@psychocoderHPC psychocoderHPC merged commit e80c45e into ComputationalRadiationPhysics:dev Nov 8, 2022