-
Notifications
You must be signed in to change notification settings - Fork 677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add h5md format #2787
add h5md format #2787
Conversation
Working on this as an undergraduate REU student over the summer with @orbeckst lab. I'll be adding to this PR continually as I make progress. |
Welcome to MDAnalysis @edisj ! Just going to drop a link to the arxiv paper for H5MD (please do correct me if this isn't the right format specification) so as to help future reviewers: https://arxiv.org/pdf/1308.6382.pdf |
Codecov Report
@@ Coverage Diff @@
## develop #2787 +/- ##
===========================================
+ Coverage 92.80% 92.83% +0.02%
===========================================
Files 185 186 +1
Lines 24205 24399 +194
Branches 3133 3174 +41
===========================================
+ Hits 22463 22650 +187
- Misses 1696 1703 +7
Partials 46 46
Continue to review full report at Codecov.
|
This is the correct format, thank you! |
Here's a link to the github page where I've started working on example h5md files: h5md_examples. Also, some relevant links I've been using thoroughly: If anyone has some other resources please share :) |
Hi everyone, Here's a quick update on what I've been up to. Any help/criticism would be greatly appreciated! ProgressI've pushed my first iteration of an example h5md file and called it H5MD.h5. The code used to create the file can be found in my github repository in write_h5md.py. This file created an mda.Universe object from the TPR_xvf and TRR_xvf test files and created an h5md file. To write the code I followed this example in the pyh5md repository. The pyh5md.element function creates the h5 datasets and stores them in the 'atoms' h5 group. For now, I've only added positions, velocities, forces, masses, n_atoms, and box dimensions, where the positions, velocities, forces, and dimensions were added as time elements and the masses and n_atoms were added as fixed elements. The file read_and_test_file.py simply opens the H5MD.h5 file and reads data out of it. I wrote some simple tests that compare the values of the data extracted from the .h5 file and data extracted with MDAnalysis to see if they're the same. Current IssuesWhen appending the box dimensions from the attribute mda.Universe(TPR_xvf, TRR_xvf).trajectory.ts.triclinic_dimensions into the h5 group atoms.box, it stores the data from the first timestep 3 times, instead of each timestep once. The append method comes from h5md_module.py found here. PlansCurrently, I'm trying to sort out how the box dimensions are stored as a time element. Once I'm convinced that all the data I've stored is correct and can be extracted from the h5md file, I will begin writing an H5MDReader and H5MDWriter class in the mdanalysis/package/MDAnalysis/coordinates library. |
I've just pushed the corrected h5md file named 'H5MD_xvf.h5' into MDAnalysisTests/data. This is the file I will be using to test the H5MDReader class. Is there any way to delete or disregard my push for the 'H5MD.h5' file? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove one of the data files (git rm H5MD.h5
) and rename the other one to match the filename stored in datafiles (git mv H5MD_xcf.h5 cobrotoxin.h5
)
Just use git for the files and push – it will update your branch accordingly. |
Also |
How does getting the mda.Universe class to recognize a new file type work? In mdanalysis/package/MDAnalysis/core/init.py, it gives
which makes me think _READERS is automatically filled from the modules in coordinates. Is there something more to it? |
Yes, just derive from mdanalysis/package/MDAnalysis/coordinates/base.py Line 1361 in 343e10a
mdanalysis/package/MDAnalysis/coordinates/base.py Line 1384 in 343e10a
|
@edisj can you push your first draft for a reader – the one's you showed me the other day? It makes more sense to have them in a PR because then they can be easily reviewed and commented on. |
# set particle masses | ||
frame_masses = self._file['particles']['atoms']['masses'][()] | ||
frame_masses = pyh5md.element(self._trajectory, 'masses').value[()] | ||
n_atoms_now = frame_masses.shape | ||
if n_atoms_now != self.n_atoms : | ||
raise ValueError("Frame %d has %d atoms but the initial frame has %d" | ||
" atoms. MDAnalysis in unable to deal with variable" | ||
" topology!"%(frame, n_atoms_now, self.n_atoms)) | ||
else : | ||
self.ts.masses = frame_masses | ||
ts.masses = frame_masses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timesteps do not have a masses
attribute. MDAnalysis has no concept of particles changing mass over time. Leave this out for the moment.
We can later write a "Parser", something that reads the topology (atom names, masses, bonds (if available).
In your documentation, just state that "masses" are currently not read from a H5MD file – documenting missing features is a totally valid approach to not doing too much work in the beginning. It also helps you to get a working prototype out quickly. It also gives you well-defined tasks for improvement cycles later.
Hi @orbeckst , please check my latest commit. After playing around with mpi4py, it looks like
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have a solution for the comm/driver problem. I am not totally happy that reopen() behaves differently from the rest but I don't have a better idea.
See comments for detailed remarks.
The `H5MD`_ file format is based upon `HDF5`_, which makes use of parallel | ||
file system features through the MPI-IO interface of the HDF5 library. | ||
The reader currently uses the `H5PY`_ library to access data from an H5MD file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an Examples section (might need to call it "Using H5MD" or similar because "Example" might be mangled by the numpy doc sphinx parser) that shows
- opening a h5md file
- opening a h5md in parallel (add the mpi4py commands for communicator etc)
- using an existing
h5py.File
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding point 3, I tried writing a test to test if the reader could take an open h5py.File
object, but the reader is failing because it's trying to open the file with h5py, which means its skipping the if isinstance(self.filename, h5py.File):
line. I'm not sure why it would skip this line, because even when I test the open file with isintance
it returns True
.
edit: nevermind, it looks like I have to use NamedStream
to pass a file stream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should: the following should work
u = mda.Universe(h5py.File("sim.h5md"))
If it doesn't then ask on the developer list, please. This is part of the "converter" infrastructure https://docs.mdanalysis.org/2.0.0-dev0/documentation_pages/converters.html and should work. If not, then it might require a closer look or an issue to be raised. Ultimately, the format_hint
should take care of everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I try to pass the argument directly like that, it doesn't work.
Following an example in https://www.mdanalysis.org/docs/documentation_pages/lib/util.html#streams, I did
from MDAnalysis.lib.util import NamedStream
stream = h5py.File("trajectory.h5md", 'r')
u = mda.Universe("topology.tpr", NamedStream(stream, stream.filename))
and I was able to load the universe and access all of the timestep attributes. Is this a valid way of opening a stream?
edit: If I recall correctly from my troubleshooting, passing the argument directly was passing a string to if isinstance(self.filename, h5py.File):
instead of an h5py.File
. I'll ask on the developer list about this
@orbeckst I've pushed some changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzle why supplying a h5md file by itself is not working. NamedStream should not be necessary. But if this is a broader problem then it might need fixing globally. Can you please look into the code path that uses _format_hint()
in order to make the decision to call the H5MDReader?
ts._unitcell[:] = particle_group['box/edges/value'][frame, :] | ||
else: | ||
# sets ts.dimensions = [0, 0, 0, 0, 0, 0] | ||
ts._unitcell[:] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ts._unitcell[:] = particle_group['box/edges/value'][frame, :] | ||
else: | ||
# sets ts.dimensions = [0, 0, 0, 0, 0, 0] | ||
ts._unitcell[:] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, what did u.trajectory.ts.dimensions
produce when you fed it an array of nan
? The only way that the assert can succeed is if you get back an array of 6 zeros.
from MDAnalysis.lib.util import NamedStream | ||
u = mda.Universe(TPR_xvf, NamedStream(f, f.filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not need NamedStream
. There's something wrong with how _format_hint
is used (or not used), I think. At least my impression was,
u = mda.Universe(TPR_xvf, f)
should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding a test!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through now to see what _format_hint
is doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found why it's not auto detecting the file format -
in _get_readers.py
:
mdanalysis/package/MDAnalysis/core/_get_readers.py
Lines 91 to 93 in 2f16a16
else: # hits else if for loop completes | |
# else let the guessing begin! | |
format = util.guess_format(filename) |
which gets passed to guess_format()
in MDAnalysis.lib.util:
mdanalysis/package/MDAnalysis/lib/util.py
Lines 992 to 1000 in 2f16a16
if isstream(filename): | |
# perhaps StringIO or open stream | |
try: | |
format = format_from_filename_extension(filename.name) | |
except AttributeError: | |
# format is None so we need to complain: | |
errmsg = (f"guess_format requires an explicit format specifier " | |
f"for stream {filename}") | |
raise ValueError(errmsg) from None |
The problem is that an h5py.File
uses the .name
attribute as the group name, not filename:
so I don't think it can't guess the file format from the extension.
EDIT: AND an h5py.File
is returning False
with isstream()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the _get_reader_for
documentation, it mentions
mdanalysis/package/MDAnalysis/core/_get_readers.py
Lines 57 to 68 in 2f16a16
Notes | |
----- | |
There are a number of special cases that can be handled: | |
- If `filename` is a numpy array, | |
:class:`~MDAnalysis.coordinates.memory.MemoryReader` is returned. | |
- If `filename` is an MMTF object, | |
:class:`~MDAnalysis.coordinates.MMTF.MMTFReader` is returned. | |
- If `filename` is a ParmEd Structure, | |
:class:`~MDAnalysis.coordinates.ParmEd.ParmEdReader` is returned. | |
- If `filename` is an iterable of filenames, | |
:class:`~MDAnalysis.coordinates.chain.ChainReader` is returned. |
but I can't find the code giving how it handles these cases. As I understand, _format_hint
should be passed to get_reader_for
but I don't see what it does with it in the function
…one if no 'edges'
@orbeckst pushed the changes to |
Will review asap with a view towards merging! |
We have not really tested that part, I think there are "pragma: no-cov" on these lines. Keep the issue open for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work @edisj ! My main issue is the current implementation of close()
and then a bunch of minor corrections as per comments.
@MDAnalysis/coredevs this should be my last round of reviews. If you want to comment on anything then please do so soon. If all goes well I'd like to merge this PR tomorrow.
|
||
|
||
@pytest.mark.skipif(not HAS_H5PY, reason="h5py not installed") | ||
def test_has_position(h5md_file, ref, tmpdir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test does not assert anything – add some sort of assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 ideas, which is better?
assert getattr(u.trajectory.ts, 'positions', None) is None
with pytest raises (NoDataError):
u.trajectory.ts.positions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, come to think of it, the test should probably be called test_no_positions
...
@orbeckst thanks for the comments. Just pushed a commit that hopefully satisfies all of the remaining concerns |
- more explanatory text (capabilities and limitations of reader) - streamline introductory text - add paragraph for users to contact us about compiling parallel h5py & friends - formatting: - use normal Python code blocks - use normal /emphasized text for packages - markup ``True`` and ``None`` - sphinx role markup for methods, exceptions, and attributes
@edisj I added some final doc fixes and everything looks great. I approved the PR and I am just waiting for green CI or any last minute comments from anyone else. |
It is done! Congratulations, @edisj ! |
Wooo! Thanks for all the help @orbeckst |
* Fixes MDAnalysis#762 * add H5MD coordinate reader (supports parallel MPI in principle but is not well tested at the moment, see MDAnalysis#2865) * added test h5md datafiles: real example (derived from cobrotoxin.trr) and synthetic example for MultiframeReaderTest reader tests * add tests (MultiframeReaderTest and custom) * add documentation (example and parallel MPI) * added h5py into conda dependencies and pyh5md into pip dependencies * update CHANGELOG * update AUTHORS
Fixes #762
Changes made in this Pull Request:
add h5md writer(this will be handled in separate pull request)PR Checklist