Make Engine::Flush collective and use it to allow dumping data to disk during a step #2649
Just wanted to give a textual thumbs-up to this feature request. We'll keep it in mind with the new serializer design...
So, @franzpoeschel we have this basic functionality in the BP5 writer engine now. That is, in the middle of a timestep, BP5 has the ability to move to disk all the data that has already been subject to a Put(). This can happen multiple times in a single step, emptying internal data buffers and also capturing all the data in deferred Put()s. But BP5 works differently than BP3/4, so we've been discussing how best to expose the functionality. There are really three choices: 1) make it triggered by some form of the existing Flush() functionality, 2) make PerformPuts() do it (which the name kind of implies), or 3) create a new interface like FlushData() that does it. I lean towards 2, for reasons I'll describe, but I think it's not a slam dunk, so we thought we'd ask for your input.

First, the problem with 1, using the existing Flush(), is that the existing Flush() is really more about pushing whole timesteps to disk when they have been queued due to timestep aggregation. That is, it is best used between timesteps, not in the middle of one, so we'd really be changing how it is used as well as what it does.

I think PerformPuts() is a better bet, though. Currently in BP3/4, what it does is copy data from deferred Put()s into internal ADIOS buffers. This isn't terrible, because BP3/4 eventually copy all data into internal buffers before moving it to disk. However, that isn't true of BP5. There we are trying to use vector I/O to do zero-copy on a deferred Put() when moving its data to disk. So keeping the old semantics of PerformPuts() (copying deferred data internally so you can reuse those buffers) is kind of silly. If you wanted to be able to reuse those buffers, you could have just done a Sync Put(). So I tend to think it makes more sense for PerformPuts() (in the BP5 writer only) to actually perform the puts, that is, move data to disk. This involves making it MPI-collective, which it currently is not, but the name really makes sense for what it would actually be doing, at least to me.

Of course, the third alternative is to just create some new method in the interface and leave both of those older interfaces alone. FlushData() is one candidate name that's reasonably descriptive, but I'm open to others as well. Anyhow, your thoughts?
Hello @eisenhauer, I wasn't really aware of these semantics for `Flush()` (is this documented somewhere?). So far, our workflow in the ADIOS2 backend of openPMD is to use … I have some questions:
I think my preference really depends on the answer to question 2. If the answer is "always collective", I have a feeling that this would break compatibility with code using `PerformPuts()`.
RE: Documentation on Engine::Flush(): What, a comment in the header file isn't enough for you? Or the IO::FlushAll() and ADIOS::FlushAll()? Unfortunately they aren't documented per se, and I only came to know about conflicts of usage when I ran into usages of Flush in the tests... WRT your questions:
I think there are two downsides to making this functionality available via PerformPuts(). One is that some code that we know of always does PerformPuts() right before EndStep(). That is completely innocuous for BP3/4, and in BP5 has the side effect of slightly increasing the metadata size (unless I figure out a way to optimize out that extra data). The second is that yes, should anyone be doing PerformPuts in such a way that making it collective would break them, then they'd have to modify that before using BP5. Whether or not anyone might be doing that out there is unknown, and really we can't raise an exception or anything. They'd find out when they had some of their ranks hanging in PerformPuts() waiting for other calls. But maybe that's clear enough... If not, then we'd have to introduce some new call and only suitably modified code would benefit from this.
So, in conclusion, putting that functionality in `PerformPuts()` …
Conclusion from the VC: The chance of breaking compatibility is too large to put this functionality behind `PerformPuts()` as-is; the current meaning of `PerformPuts()` should be kept as the default. Mockup suggestion:

```cpp
enum class PutMode{ NonCollective, Collective };

void PerformPuts( PutMode mode = PutMode::NonCollective );
```
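For illustration, a caller-side sketch of the two modes in that mockup. This is purely hypothetical: the signature above was only a proposal and is not part of the released ADIOS2 API.

```cpp
// Hypothetical usage of the proposed signature above -- not real ADIOS2 API.
engine.PerformPuts();                            // default: keeps today's non-collective behavior
engine.PerformPuts(adios2::PutMode::Collective); // opt-in: all ranks call it, buffers are drained to disk
```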
Thanks for the summary! :) Just naming-wise, I am not sure we should call this "collective"/"non-collective". This could be done either as in your example, but describing the functionality instead:

```cpp
enum class BufferDraining{ Lazy, Immediately };

void PerformPuts( BufferDraining mode = BufferDraining::Lazy );
```

or as separate API calls:

```cpp
PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
// ...
DrainBuffers(); // Collective
```

Personally, I slightly favor a separate API call, since that one does not change the MPI behavior of `PerformPuts()`. Either way, the same thing can be expressed, though.
To add my $.02: I think it's important to know which API calls are collective vs. those that are not. Ideally, this should not depend on the underlying engine, since one of the strengths of ADIOS2 is the choice of engines that can be easily switched at run-time; but if that switching then leads to surprising hangs, it's not really worth that much. On the naming, I'd be kinda in favor of following the HDF5 terminology, i.e., independent vs. collective. It is fundamentally fine to me to change the collective-ness of a given API call based upon an argument that declares whether it's called collectively or not. What I think is not great is to mingle collectiveness with differing semantics. If PerformPuts is to ensure that the pointers/buffers previously passed to Put() can be reused after its return, that's one semantic, and I think it should keep this meaning. That's different from the semantics of ensuring that data actually makes it to disk, so I think a different API call is in order for that. This latter call (e.g., `FlushData()`) …
If you see it justified to give this its own call, I'm also open to that; it's your API after all ;) For us, the important part is to not break compatibility in the way that not distinguishing collective and non-collective PerformPuts() calls would.
FYI, one of the reasons I'd just keep …
Thanks for the details. Yes, our problem can be summarized as: we have workflows where we cannot call …
@ax3l sounds like your workflows gotta …
First results of this in PIConGPU: in this test run, this allowed saving ~15 GB of peak main memory (i.e. the entire InitialBufferSize) at the cost of increased runtime. BP5 still seems noticeably slower than BP4, even without this feature, but I have not really tuned things for BP5 yet. Thanks a lot for implementing this!
@franzpoeschel Just wanted to make sure you saw ADIOS2 Issue #3060 and PR #3078. Ended up not sticking with the Flush() name because of the conflict with prior functionality. So now it's PerformDataWrite(), and it is indeed a no-op on any engine except the BP5 file writer.
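A minimal sketch of the resulting usage, assuming ADIOS2 ≥ 2.8 with the BP5 writer; the variable layout, piece count, and sizes here are made up:

```cpp
#include <adios2.h>
#include <mpi.h>
#include <cstddef>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("output");
    io.SetEngine("BP5"); // PerformDataWrite() is a no-op on other engines

    const std::size_t N = 1 << 20; // elements per piece and rank (made up)
    const std::size_t pieces = 4;  // pieces written within one step (made up)
    auto var = io.DefineVariable<double>("data", {nranks * pieces * N}, {0}, {N});

    adios2::Engine engine = io.Open("output.bp", adios2::Mode::Write);
    std::vector<double> piece(N);

    engine.BeginStep();
    for (std::size_t p = 0; p < pieces; ++p)
    {
        // ... fill `piece` with the next chunk of this step's data ...
        var.SetSelection({{(rank * pieces + p) * N}, {N}});
        engine.Put(var, piece.data(), adios2::Mode::Sync);

        // Collective: every rank must call it. Drains everything Put() so far
        // to disk and empties the internal buffers, in the middle of the step.
        engine.PerformDataWrite();
    }
    engine.EndStep();

    engine.Close();
    MPI_Finalize();
    return 0;
}
```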
Huh? I used Flush() for implementing this and it worked? But I can change it to PerformDataWrite() to be on the safe side, thanks for the note!
The change to PerformDataWrite() just got merged into master days ago (in anticipation of today's pre-release feature freeze).
Ah, got it |
@franzpoeschel If you use Sync Puts: the BP5 chunk size is 16 MB by default, so increasing it with the parameter BufferChunkSize would decrease the number of writes necessary per process. Go up to 2 GB if possible.
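A sketch of where that parameter would be set. Plain byte counts are safe for the value; whether size suffixes like "2GB" are also accepted is an assumption here:

```cpp
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("output");
    io.SetEngine("BP5");

    // Larger buffer chunks mean fewer, larger writes per process with Sync Puts.
    io.SetParameter("BufferChunkSize", "2147483646"); // just under 2 GB

    adios2::Engine engine = io.Open("output.bp", adios2::Mode::Write);
    // ... BeginStep / Put / PerformDataWrite / EndStep as usual ...
    engine.Close();
    MPI_Finalize();
    return 0;
}
```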
Thank you for this hint @pnorbert, this made a noticeable difference: ComputationalRadiationPhysics/picongpu#4002 (comment)
I don't really understand those memory graphs; please explain them at our next meeting. I especially worry about setup 3, where the chunk list grows to 68 GB, while it was kept in check in all the other runs. I don't understand what is going on there.
Sure, we can talk about it in more detail. Being able to go from (3) to (4) was exactly the point of this issue.
Ah, I see now that it grows extremely high. I assume that this might be the fault of setting the BufferChunkSize to 2 GB? I believe that Heaptrack tracks memory by hijacking malloc, so this shows the virtual memory usage. I should try to find out something about the real memory usage, which should be lower since most of the allocated memory is not actually used.
I confirmed the above now, see here.
Thank you for confirming this. How many processes did this run use?
This was a single process for now; I have not tested this for scalability yet.
It should not do that, unless every Put passes data with a size between 1/2 and 1× the chunk size. It should pack multiple Puts into one chunk. When it cannot pack any more Puts into the current chunk, it downsizes (reallocs) that chunk to its final size and then allocates a new chunk. Flush empties the buffer, and the next Put starts again from scratch (allocating a new chunk).
Do you use deferred Put(), and then call PerformPuts()?
Yes. So I guess that this is exactly the reason.
PR #3084 should fix this issue. Can you please test it when it is merged?
Actually, in BP5, PerformPuts() is kinda counterproductive. BP5 is the first engine that can take advantage of deferred Puts: it can write the deferred user pointers directly to disk or to the aggregator, and save memory. PerformPuts() forces a copy into the internal buffer. Sync Puts perform this copy immediately, to give the pointer back to the user for reuse. Since BP4 was always calling PerformPuts() in EndStep(), using deferred vs. sync Puts did not really change anything in memory consumption or performance. Calling PerformPuts() in BP5 equalizes the two cases as well, but then what's the point of using deferred Puts? The downside of using deferred Puts can be too many small writes to disk from a process, instead of a single giant buffer write, which is not good. I think the preferred order of Put calls is this:
Merged... |
Two big improvement steps! Thank you!
Using Span should eliminate the need to worry about MinDeferredSize. The only thing to make sure of is that BufferChunkSize is big; the new 128 MB default is still tiny for your use case.
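A sketch of the Span-based pattern mentioned here, assuming a 1-D double variable with made-up sizes: the data is written directly into the engine's internal buffer, so there is no user-side staging buffer and no extra copy.

```cpp
#include <adios2.h>
#include <mpi.h>
#include <cstddef>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("output");
    io.SetEngine("BP5");

    const std::size_t N = 1024; // elements per rank (made up)
    auto var = io.DefineVariable<double>("data", {nranks * N}, {rank * N}, {N});

    adios2::Engine engine = io.Open("span.bp", adios2::Mode::Write);

    engine.BeginStep();
    // Reserve space for `var` inside the engine's buffer and get a view on it.
    adios2::Variable<double>::Span span = engine.Put(var);
    for (std::size_t i = 0; i < span.size(); ++i)
    {
        span[i] = static_cast<double>(rank * N + i); // fill the buffer in place
    }
    engine.EndStep(); // data is serialized straight from the span's memory

    engine.Close();
    MPI_Finalize();
    return 0;
}
```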
BTW, BP5 has BufferVType="malloc", with in principle the same memory management as in BP4, but using malloc/realloc instead of vector.resize. That would be the closest comparison to see how much initializing memory with zeros slows down BP4.
Yes, true. Also, for users of openPMD outside of PIConGPU, we don't use PerformPuts in Streaming mode (except if users explicitly call …). So I would use …
Yes, same parameter and logic. I am not aware whether malloc has 31-bit limitations or whether std::vector has a better allocation routine by default. This logic (and BP3/4) assumes you can allocate a gigantic contiguous block of memory. Chunking is an alternative for when that's not the case (it was not added to BP for performance).
I remembered something wrong: malloc limitations are system-dependent, but on my system it can easily go above 2 GB.
Related issues: #1891 and #2629.
Suggestion: The call `Engine::Flush()` should be made MPI-collective, and its effect should be to drain data currently held in ADIOS buffers to disk. In streaming engines, this call would either have no effect or throw an exception.

Why is this feature important?
Multiple reasons:
- This is to some extent what the documentation for `Engine::Flush` promises to do anyway (from here). As described in this post, the current behavior of `Engine::Flush` is somewhat defective: its current effect is that all data written up to `Engine::Flush` in one step is thrown away, and only the data between `Engine::Flush` and `Engine::EndStep` will appear in the written file. I don't know if this is somehow intended, but I can't think of a use for it.
- More fine-grained control over memory usage by ADIOS: Currently, the only ways to make ADIOS drain its buffers to disk are by calling `Engine::EndStep` or `Engine::Close`. This makes it hard to control memory usage within one ADIOS step, e.g. when feeding data to ADIOS bit by bit via `Engine::PerformPuts`. Without specifying the correct `InitialBufferSize`, this leads to repeated resizing of ADIOS buffers, with negative effects on memory usage, as outlined in "Only 50% of the main memory can be used for ADIOS2" #2629. Also, the amount of data written in one step depends on user input and varies over steps, so the size of a step should be considered unbounded. By allowing intermediate `Engine::Flush`es, we could put a bound on the amount of data held at any point in ADIOS (see the sketch below); the `InitialBufferSize` would become easier to specify and would not require predicting the largest step to come in a simulation.
- Since `Engine::Flush` doesn't require the same level of synchrony, this would become a lot easier.
doesn't require the same level of synchrony, this would become a lot easier.What is the potential impact of this feature in the community?
Simulations with more diverse write patterns would be able to use ADIOS with more controlled memory usage.
ADIOS optimizes towards writers with synchronous steps where one step fits fully into memory (per process) and data can be presented in a clear 2-phase workflow of (1) declare variables and enqueue Put() operations and (2) perform the enqueued operations.
This feature request would make it easier to use ADIOS in applications that do not fulfill these assumptions.
Is your feature request related to a problem? Please describe.
Already outlined above.
Describe the solution you'd like and potential required effort
Solution already described above. For the required effort, I am not sufficiently aware of the BP format internals. I imagine the challenging points would be (1) draining the buffer several times within one step and (2) appending to a written step in the BP file.
Also, this is slightly API-breaking, since it changes the behavior of `Engine::Flush`.

Describe alternatives you've considered and potential required effort
This feature request achieves two goals. As an alternative, `InitialBufferSize` would have to be specified correctly. This needs to be done either by the user (bad UX, bad performance by default) or programmatically (requires duplication of logic in PIConGPU, and is only feasible if writing a new ADIOS file per step; otherwise the size of the largest step to come needs to be computed).

Additionally, according to @pnorbert, the new serializer coming in late 2021 will address the inefficient buffer resizing of the current serializer(s), so that will be a more permanent solution. This feature request would not cease to be useful at that point, since point (1) will remain relevant.