Make the `AnalysisBase` compatible with live simulation streams #4828

ljwoods2 · 2024-12-11T21:01:14Z

Is your feature request related to a problem?

Related to #4827: for MDAnalysis to function with live simulation streams, the AnalysisBase must be modified to account for a few quirks of live-streamed trajectories:

Their length/end frame isn't known (so len(u.trajectory) and u.trajectory.n_frames will fail)
They can't be sliced in any way except u.trajectory[:] and u.trajectory[::n] (where n is positive) so they are inherently incompatible with parallelization schemes

Describe the solution you'd like

Like with parallelization, analysis classes that inherit from AnalysisBase should "opt-in" to stream usage through a class property similar to the parallelizable property (like streamable or similar). To opt-in, inherited classes must also avoid attempting to access the trajectory's length or slice it. The rms.RMSF class is already such a class.

The AnalysisBase must also have a _streamed_run method that never attempts to slice a trajectory or access its length. The imdclient MDAKit ships such a method already (and monkey patches it in): https://github.com/Becksteinlab/imdclient/blob/main/imdclient/streamanalysis.py

Describe alternatives you've considered

Don't allow live simulation streaming, or don't allow builtin analysis classes to be used with these streams

The text was updated successfully, but these errors were encountered:

orbeckst · 2024-12-18T18:43:39Z

Can this issue be addressed independently from #4827 ?

ljwoods2 · 2024-12-18T18:52:53Z

Can this issue be addressed independently from #4827 ?

Fixing this relies on being able to check if a reader inherits from StreamReaderBase (and failing if a stream reader tries to use an analysis class that isn't streamable) which would be a part of #4827- so good point, they are not entirely independent. The rest of the changes would be independent.

yuxuanzhuang · 2024-12-20T03:27:49Z

Generally, I advocate for including essential (and lightweight) components from IMDClient into MDAnalysis. Here are a few thoughts and comments:

Initialization of Results:
I’m uncertain how you plan to handle the initialization of results for (almost all) analyses (e.g., example in RMS.py), considering you don’t know the n_frames beforehand.
Streamable Analyses:
Some analyses may or may not be streamable under certain conditions. For example, in the case of RMSD, I assume that if the reference is an IMDReader and ref_frame is not zero, it would prevent the analysis from being streamable.
IMDReader Compatibility:
Instead of modifying AnalysisBase to be streamable, would it be feasible to make IMDReader “AnalysisBase-compatible”? That is if necessary changes for IMDReader itself can be made to make iteration over IMDReader behaves similarly to other Readers?
Slicing and Stopping Iteration in IMDReader:
Expanding on that, I wonder if disallowing the stop parameter for slicing IMDReader is strictly necessary. I may be missing some details, but it seems theoretically possible to stop IMDReader iteration after processing n frames. If IMDReader could support slicing (start=0, step, stop), it might naturally enable setting n_frames based on the slice.

ljwoods2 · 2024-12-20T20:29:08Z

@yuxuanzhuang Thank you for the in-depth look!

Overall, I agree with you that we could adapt the reader to be more compatible.

As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.

Maybe we allow slicing, but raise a RuntimeError if the start frame isn't reached or the end frame isn't reached? What do we pass as the end frame if no stop is defined? In this case, we still would need the AnalysisBase modifications to handle an unknown-length stream- though this could be avoided if we require the stop kwarg for stream readers. In this case, should we still raise an error if that frame isn't reached? Or should we give people their intermediate results?

That might be the best path forward- requiring people to pass in an end frame, defaulting to 0 for the start frame, and failing out if the end frame isn't reached, perhaps with some kind of option to save intermediate results to csv or something. That would also allow us to avoid adding complexity with parsing logic for all the different analysis class' frame-related kwargs like ref_frame, and we could initialize results arrays with the required end frame arg as the length.

Edit: Say something like ref_frame is a frame in the middle of the trajectory- what do we do with the frames up until the ref_frame which isn't yet available (can't just jump directly to it)?

yuxuanzhuang · 2024-12-31T04:39:28Z

As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.

Many analyses also rely on array assignment within single_frame, so the list-appending approach wouldn’t work in these cases or unless you want to refactor the whole AnalysisBase. For example, see this usage in MSD analysis.

I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during _stream_run at the end.

orbeckst · 2024-12-31T08:19:27Z

Good idea.

I hope that the discussion on this issue leads to a feasible and general solution, similar to how the parallelization was refined through discussions. It‘s important for the future of MDAnalysis.

hmacdope · 2025-01-05T20:55:19Z

I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during _stream_run at the end.

This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that size == blocksize or size == mod(full_len, blocksize) for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.

IAlibay · 2025-01-06T13:12:22Z

I haven't had time to think this through fully yet, but one initial comment / thought based on @yuxuanzhuang's comment "Instead of modifying AnalysisBase to be streamable, would it be feasible to make IMDReader “AnalysisBase-compatible”?"

I'm not sure if it's in the same spirit, but I am concerned that we are following the path of "let's keep on adding things to AnalysisBase because we can".

It sounds to me like enabling analyses on streams is going to need a lot of analysis-specific work. Is this a fair assessment? If so, is it maybe worth consider if we need either a mixin or a separate subclass to make analyses stream compatible? (rather than adding more to the one base class)

yuxuanzhuang · 2025-01-07T20:16:41Z

This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that size == blocksize or size == mod(full_len, blocksize) for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.

Yes, exactly. With such, I think the existing _get_aggregator for parallel analysis should just be the right tool for concatenation so no extra function will be needed for that.

orbeckst added enhancement Component-Analysis streaming labels Dec 11, 2024

orbeckst added the decision needed requires input from developers before moving further label Dec 20, 2024

hmacdope mentioned this issue Jan 5, 2025

Adopt a reader for live-streamed Interactive MD (IMD) version 3 trajectories #4827

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the `AnalysisBase` compatible with live simulation streams #4828

Make the `AnalysisBase` compatible with live simulation streams #4828

ljwoods2 commented Dec 11, 2024 •

edited

Loading

orbeckst commented Dec 18, 2024

ljwoods2 commented Dec 18, 2024

yuxuanzhuang commented Dec 20, 2024

ljwoods2 commented Dec 20, 2024 •

edited

Loading

yuxuanzhuang commented Dec 31, 2024 •

edited

Loading

orbeckst commented Dec 31, 2024

hmacdope commented Jan 5, 2025

IAlibay commented Jan 6, 2025

yuxuanzhuang commented Jan 7, 2025

Make the AnalysisBase compatible with live simulation streams #4828

Make the AnalysisBase compatible with live simulation streams #4828

Comments

ljwoods2 commented Dec 11, 2024 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

orbeckst commented Dec 18, 2024

ljwoods2 commented Dec 18, 2024

yuxuanzhuang commented Dec 20, 2024

ljwoods2 commented Dec 20, 2024 • edited Loading

yuxuanzhuang commented Dec 31, 2024 • edited Loading

orbeckst commented Dec 31, 2024

hmacdope commented Jan 5, 2025

IAlibay commented Jan 6, 2025

yuxuanzhuang commented Jan 7, 2025

Make the `AnalysisBase` compatible with live simulation streams #4828

Make the `AnalysisBase` compatible with live simulation streams #4828

ljwoods2 commented Dec 11, 2024 •

edited

Loading

ljwoods2 commented Dec 20, 2024 •

edited

Loading

yuxuanzhuang commented Dec 31, 2024 •

edited

Loading