Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add h5md format #2787

Merged
merged 52 commits into from
Aug 7, 2020
Merged

add h5md format #2787

merged 52 commits into from
Aug 7, 2020

Conversation

edisj
Copy link
Contributor

@edisj edisj commented Jun 23, 2020

Fixes #762

Changes made in this Pull Request:

  • add h5md reader
  • add h5md writer (this will be handled in separate pull request)
  • add h5md test file

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Sorry, something went wrong.

Verified

This commit was signed with the committer’s verified signature.
pendo324 Justin
@edisj edisj marked this pull request as draft June 23, 2020 18:27
@edisj
Copy link
Contributor Author

edisj commented Jun 23, 2020

Working on this as an undergraduate REU student over the summer with @orbeckst lab. I'll be adding to this PR continually as I make progress.

@IAlibay
Copy link
Member

IAlibay commented Jun 23, 2020

Welcome to MDAnalysis @edisj !

Just going to drop a link to the arxiv paper for H5MD (please do correct me if this isn't the right format specification) so as to help future reviewers: https://arxiv.org/pdf/1308.6382.pdf

@codecov
Copy link

codecov bot commented Jun 23, 2020

Codecov Report

Merging #2787 into develop will increase coverage by 0.02%.
The diff coverage is 93.95%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2787      +/-   ##
===========================================
+ Coverage    92.80%   92.83%   +0.02%     
===========================================
  Files          185      186       +1     
  Lines        24205    24399     +194     
  Branches      3133     3174      +41     
===========================================
+ Hits         22463    22650     +187     
- Misses        1696     1703       +7     
  Partials        46       46              
Impacted Files Coverage Δ
package/MDAnalysis/coordinates/H5MD.py 93.92% <93.92%> (ø)
package/MDAnalysis/coordinates/__init__.py 100.00% <100.00%> (ø)
coordinates/__init__.py 100.00% <0.00%> (ø)
coordinates/base.py 95.06% <0.00%> (+0.24%) ⬆️
package/MDAnalysis/coordinates/base.py 95.06% <0.00%> (+0.24%) ⬆️
package/MDAnalysis/topology/DMSParser.py 90.54% <0.00%> (+1.65%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2550d06...53f28a8. Read the comment docs.

@edisj
Copy link
Contributor Author

edisj commented Jun 23, 2020

Welcome to MDAnalysis @edisj !

Just going to drop a link to the arxiv paper for H5MD (please do correct me if this isn't the right format specification) so as to help future reviewers: https://arxiv.org/pdf/1308.6382.pdf

This is the correct format, thank you!

@edisj
Copy link
Contributor Author

edisj commented Jun 23, 2020

Here's a link to the github page where I've started working on example h5md files: h5md_examples.

Also, some relevant links I've been using thoroughly:
pyh5md repository - this is the Python library I'm using to read/write the *.h5 files
H5MD documentation - for a rough understanding of how groups and datasets work in the h5md format
H5PY documentation - more thorough documentation involving how groups and datasets are created in *.h5 files. Many classes in h5md are inherited from h5py.

If anyone has some other resources please share :)

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
…f and TPR_xvf
@edisj
Copy link
Contributor Author

edisj commented Jun 25, 2020

Hi everyone,

Here's a quick update on what I've been up to. Any help/criticism would be greatly appreciated!

Progress

I've pushed my first iteration of an example h5md file and called it H5MD.h5. The code used to create the file can be found in my github repository in write_h5md.py. This file created an mda.Universe object from the TPR_xvf and TRR_xvf test files and created an h5md file. To write the code I followed this example in the pyh5md repository. The pyh5md.element function creates the h5 datasets and stores them in the 'atoms' h5 group. For now, I've only added positions, velocities, forces, masses, n_atoms, and box dimensions, where the positions, velocities, forces, and dimensions were added as time elements and the masses and n_atoms were added as fixed elements.

The file read_and_test_file.py simply opens the H5MD.h5 file and reads data out of it. I wrote some simple tests that compare the values of the data extracted from the .h5 file and data extracted with MDAnalysis to see if they're the same.

Current Issues

When appending the box dimensions from the attribute mda.Universe(TPR_xvf, TRR_xvf).trajectory.ts.triclinic_dimensions into the h5 group atoms.box, it stores the data from the first timestep 3 times, instead of each timestep once. The append method comes from h5md_module.py found here.
EDIT: I've sorted out this issue. I'm moving onto creating the coordinate reader now

Plans

Currently, I'm trying to sort out how the box dimensions are stored as a time element. Once I'm convinced that all the data I've stored is correct and can be extracted from the h5md file, I will begin writing an H5MDReader and H5MDWriter class in the mdanalysis/package/MDAnalysis/coordinates library.

@edisj edisj closed this Jun 25, 2020
@edisj edisj reopened this Jun 25, 2020
@edisj
Copy link
Contributor Author

edisj commented Jun 25, 2020

I've just pushed the corrected h5md file named 'H5MD_xvf.h5' into MDAnalysisTests/data. This is the file I will be using to test the H5MDReader class. Is there any way to delete or disregard my push for the 'H5MD.h5' file?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove one of the data files (git rm H5MD.h5) and rename the other one to match the filename stored in datafiles (git mv H5MD_xcf.h5 cobrotoxin.h5)

testsuite/MDAnalysisTests/datafiles.py Outdated Show resolved Hide resolved
@orbeckst
Copy link
Member

Just use git for the files and push – it will update your branch accordingly.

@orbeckst
Copy link
Member

Also git rm testsuite/MDAnalysisTests/data/H5MD_xvf.h5 – it's still included.

@edisj
Copy link
Contributor Author

edisj commented Jun 30, 2020

How does getting the mda.Universe class to recognize a new file type work? In mdanalysis/package/MDAnalysis/core/init.py, it gives

# Registry of Readers, Parsers and Writers known to MDAnalysis
# Metaclass magic fills these as classes are declared.
_READERS = {}

which makes me think _READERS is automatically filled from the modules in coordinates. Is there something more to it?

@orbeckst
Copy link
Member

Yes, just derive from base.ReaderBase, and then that's handled by the Readermeta

class _Readermeta(type):
meta-class, see
class ProtoReader(IOBase, metaclass=_Readermeta):

@orbeckst
Copy link
Member

orbeckst commented Jul 2, 2020

@edisj can you push your first draft for a reader – the one's you showed me the other day? It makes more sense to have them in a PR because then they can be easily reviewed and commented on.

Comment on lines 129 to 137
# set particle masses
frame_masses = self._file['particles']['atoms']['masses'][()]
frame_masses = pyh5md.element(self._trajectory, 'masses').value[()]
n_atoms_now = frame_masses.shape
if n_atoms_now != self.n_atoms :
raise ValueError("Frame %d has %d atoms but the initial frame has %d"
" atoms. MDAnalysis in unable to deal with variable"
" topology!"%(frame, n_atoms_now, self.n_atoms))
else :
self.ts.masses = frame_masses
ts.masses = frame_masses
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timesteps do not have a masses attribute. MDAnalysis has no concept of particles changing mass over time. Leave this out for the moment.

We can later write a "Parser", something that reads the topology (atom names, masses, bonds (if available).

In your documentation, just state that "masses" are currently not read from a H5MD file – documenting missing features is a totally valid approach to not doing too much work in the beginning. It also helps you to get a working prototype out quickly. It also gives you well-defined tasks for improvement cycles later.

@edisj
Copy link
Contributor Author

edisj commented Aug 1, 2020

Hi @orbeckst , please check my latest commit.

After playing around with mpi4py, it looks like driver='mpio' must be passed with a comm argument
image
AND comm can only be passed if driver='mpio. I changed self.open_trajectory() to hopefully satisfy these conditions. I also left room to pass other drivers.

  • For self.open_trajectory():

    • If a h5py.File object is passed, self._driver can easily be found with self._file.driver
    • I've been hunting for a similar way to pull out which communicator was passed, but can't figure it out yet. I found this issue but I'm not sure if I know exactly what to do. @pdebuyl is this how I would go about finding the MPI.Comm object that opened the file? And would I be able to repass the object as an argument to close and reopen the file?
      def open_trajectory(self):
      """opens the trajectory file using h5py library"""
      self._frame = -1
      if isinstance(self.filename, h5py.File):
      self._file = self.filename
      self._driver = self._file.driver
      else:
      if self._comm is not None:
      # can only pass comm argument to h5py.File if driver='mpio'
      assert self._driver == 'mpio'
      self._file = h5py.File(self.filename, 'r',
      driver=self._driver,
      comm=self._comm)
      elif self._driver is not None:
      self._file = h5py.File(self.filename, 'r', driver=self._driver)
      else:
      self._file = h5py.File(self.filename, 'r')
  • For self._reopen() and self.close():

    • just went with if statements to check if the driver is 'mpio'. I don't know how else to handle self.close() if the driver is 'mpio' other than not doing anything
      def close(self):
      """close reader"""
      if self._driver != 'mpio':
      self._file.close()
      def _reopen(self):
      """reopen trajectory
      Note
      ----
      This method does close the and reopen the file like most other
      coordinate readers, as this would cause problems with reopening
      the file with the ``driver`` and ``comm`` arguments for parallel
      reads. Instead, it rewinds the trajectory back to the first timstep
      if 'mpio' is passed for the driver.
      """
      if self._driver != 'mpio':
      self.close()
      self.open_trajectory()
      else:
      self._frame = -2
      self._read_next_timestep()
  • TODO

    • I'll add some bits in the documentation about how to setup parallel hdf5/h5py
    • For failing to import mpi4py (it's not a dependency, right?) , should I do something similar to how the reader handles h5py?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have a solution for the comm/driver problem. I am not totally happy that reopen() behaves differently from the rest but I don't have a better idea.

See comments for detailed remarks.

package/MDAnalysis/coordinates/H5MD.py Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
The `H5MD`_ file format is based upon `HDF5`_, which makes use of parallel
file system features through the MPI-IO interface of the HDF5 library.
The reader currently uses the `H5PY`_ library to access data from an H5MD file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an Examples section (might need to call it "Using H5MD" or similar because "Example" might be mangled by the numpy doc sphinx parser) that shows

  1. opening a h5md file
  2. opening a h5md in parallel (add the mpi4py commands for communicator etc)
  3. using an existing h5py.File

Copy link
Contributor Author

@edisj edisj Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding point 3, I tried writing a test to test if the reader could take an open h5py.File object, but the reader is failing because it's trying to open the file with h5py, which means its skipping the if isinstance(self.filename, h5py.File): line. I'm not sure why it would skip this line, because even when I test the open file with isintance it returns True.

edit: nevermind, it looks like I have to use NamedStream to pass a file stream

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should: the following should work

u = mda.Universe(h5py.File("sim.h5md"))

If it doesn't then ask on the developer list, please. This is part of the "converter" infrastructure https://docs.mdanalysis.org/2.0.0-dev0/documentation_pages/converters.html and should work. If not, then it might require a closer look or an issue to be raised. Ultimately, the format_hint should take care of everything.

Copy link
Contributor Author

@edisj edisj Aug 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I try to pass the argument directly like that, it doesn't work.

Following an example in https://www.mdanalysis.org/docs/documentation_pages/lib/util.html#streams, I did

from MDAnalysis.lib.util import NamedStream
stream = h5py.File("trajectory.h5md", 'r')
u = mda.Universe("topology.tpr", NamedStream(stream, stream.filename))

and I was able to load the universe and access all of the timestep attributes. Is this a valid way of opening a stream?

edit: If I recall correctly from my troubleshooting, passing the argument directly was passing a string to if isinstance(self.filename, h5py.File): instead of an h5py.File. I'll ask on the developer list about this

@edisj
Copy link
Contributor Author

edisj commented Aug 3, 2020

@orbeckst I've pushed some changes

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzle why supplying a h5md file by itself is not working. NamedStream should not be necessary. But if this is a broader problem then it might need fixing globally. Can you please look into the code path that uses _format_hint() in order to make the decision to call the H5MDReader?

ts._unitcell[:] = particle_group['box/edges/value'][frame, :]
else:
# sets ts.dimensions = [0, 0, 0, 0, 0, 0]
ts._unitcell[:] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead use

Suggested change
ts._unitcell[:] = None
ts._unitcell = None

and set the whole attribute to None? This is closer to what we decided in #1738 and #2698 .

package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
ts._unitcell[:] = particle_group['box/edges/value'][frame, :]
else:
# sets ts.dimensions = [0, 0, 0, 0, 0, 0]
ts._unitcell[:] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, what did u.trajectory.ts.dimensions produce when you fed it an array of nan? The only way that the assert can succeed is if you get back an array of 6 zeros.

package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
Comment on lines 428 to 429
from MDAnalysis.lib.util import NamedStream
u = mda.Universe(TPR_xvf, NamedStream(f, f.filename))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not need NamedStream. There's something wrong with how _format_hint is used (or not used), I think. At least my impression was,

u = mda.Universe(TPR_xvf, f)

should work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding a test!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through now to see what _format_hint is doing

Copy link
Contributor Author

@edisj edisj Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found why it's not auto detecting the file format -

in _get_readers.py:

else: # hits else if for loop completes
# else let the guessing begin!
format = util.guess_format(filename)

which gets passed to guess_format() in MDAnalysis.lib.util:

if isstream(filename):
# perhaps StringIO or open stream
try:
format = format_from_filename_extension(filename.name)
except AttributeError:
# format is None so we need to complain:
errmsg = (f"guess_format requires an explicit format specifier "
f"for stream {filename}")
raise ValueError(errmsg) from None

The problem is that an h5py.File uses the .name attribute as the group name, not filename:
image
so I don't think it can't guess the file format from the extension.

EDIT: AND an h5py.File is returning False with isstream()

Copy link
Contributor Author

@edisj edisj Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the _get_reader_for documentation, it mentions

Notes
-----
There are a number of special cases that can be handled:
- If `filename` is a numpy array,
:class:`~MDAnalysis.coordinates.memory.MemoryReader` is returned.
- If `filename` is an MMTF object,
:class:`~MDAnalysis.coordinates.MMTF.MMTFReader` is returned.
- If `filename` is a ParmEd Structure,
:class:`~MDAnalysis.coordinates.ParmEd.ParmEdReader` is returned.
- If `filename` is an iterable of filenames,
:class:`~MDAnalysis.coordinates.chain.ChainReader` is returned.

but I can't find the code giving how it handles these cases. As I understand, _format_hint should be passed to get_reader_for but I don't see what it does with it in the function

@edisj
Copy link
Contributor Author

edisj commented Aug 4, 2020

@orbeckst pushed the changes to ts.dimensions, and also changed the .h5md examples because 'unit' should have been an attribute of the 'position/value' dataset, not the 'position' group. Still looking for a solution for _format_hint

@orbeckst
Copy link
Member

orbeckst commented Aug 5, 2020

Will review asap with a view towards merging!

@edisj
Copy link
Contributor Author

edisj commented Aug 5, 2020

@orbeckst should I add that it fixes #2865 as well?

Actually, I'm not sure if that issue is solved yet

@orbeckst
Copy link
Member

orbeckst commented Aug 5, 2020

We have not really tested that part, I think there are "pragma: no-cov" on these lines. Keep the issue open for now.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @edisj ! My main issue is the current implementation of close() and then a bunch of minor corrections as per comments.

@MDAnalysis/coredevs this should be my last round of reviews. If you want to comment on anything then please do so soon. If all goes well I'd like to merge this PR tomorrow.

package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved
package/MDAnalysis/coordinates/H5MD.py Outdated Show resolved Hide resolved


@pytest.mark.skipif(not HAS_H5PY, reason="h5py not installed")
def test_has_position(h5md_file, ref, tmpdir):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not assert anything – add some sort of assertion.

Copy link
Contributor Author

@edisj edisj Aug 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 ideas, which is better?

assert getattr(u.trajectory.ts, 'positions', None) is None
with pytest raises (NoDataError):
    u.trajectory.ts.positions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, come to think of it, the test should probably be called test_no_positions...

testsuite/MDAnalysisTests/coordinates/test_h5md.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/coordinates/test_h5md.py Outdated Show resolved Hide resolved
testsuite/MDAnalysisTests/coordinates/test_h5md.py Outdated Show resolved Hide resolved
@edisj
Copy link
Contributor Author

edisj commented Aug 6, 2020

@orbeckst thanks for the comments.

Just pushed a commit that hopefully satisfies all of the remaining concerns

@orbeckst orbeckst added the NSF REU NSF Research Experience for Undergraduates project label Aug 7, 2020
- more explanatory text (capabilities and limitations of reader)
- streamline introductory text
- add paragraph for users to contact us about compiling parallel h5py & friends
- formatting: 
   - use normal Python code blocks
   - use normal /emphasized text for packages 
   - markup ``True`` and ``None``
   - sphinx role markup for methods, exceptions, and attributes
@orbeckst
Copy link
Member

orbeckst commented Aug 7, 2020

@edisj I added some final doc fixes and everything looks great. I approved the PR and I am just waiting for green CI or any last minute comments from anyone else.

@orbeckst orbeckst merged commit 27278bb into MDAnalysis:develop Aug 7, 2020
@orbeckst
Copy link
Member

orbeckst commented Aug 7, 2020

It is done!

Congratulations, @edisj !

@edisj
Copy link
Contributor Author

edisj commented Aug 7, 2020

Wooo! Thanks for all the help @orbeckst

@edisj edisj deleted the h5md-format branch August 8, 2020 00:16
@edisj edisj mentioned this pull request Mar 24, 2021
4 tasks
PicoCentauri pushed a commit to PicoCentauri/mdanalysis that referenced this pull request Mar 30, 2021
* Fixes MDAnalysis#762
* add H5MD coordinate reader (supports parallel MPI in principle but is not well tested at the moment, see MDAnalysis#2865)
* added test h5md datafiles:  real example (derived from cobrotoxin.trr) and synthetic example for MultiframeReaderTest reader tests
* add tests (MultiframeReaderTest and custom)
* add documentation (example and parallel MPI)
* added h5py into conda dependencies and pyh5md into pip dependencies
* update CHANGELOG
* update AUTHORS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component-Readers Format-H5MD hdf5-based H5MD trajectory format new-feature NSF REU NSF Research Experience for Undergraduates project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for HALMD files (H5MD)
8 participants