Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added reading of record types #1762

Merged
merged 4 commits into from
Jan 29, 2018
Merged

added reading of record types #1762

merged 4 commits into from
Jan 29, 2018

Conversation

richardjgowers
Copy link
Member

Starts #1753

Changes made in this Pull Request:

  • added record_type attribute

Is this the best name for the ATOM/HETATM thing in a PDB file?

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@richardjgowers richardjgowers added this to the 0.17.x milestone Jan 25, 2018

@staticmethod
def _gen_initial_values(na, nr, ns):
return np.array(['ATOM'] * na, dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems (memory)-wasteful to store a string with each atom. For 1M atoms, that can be an additional len("HETATM") * 1e6 / (1024*1024) = 5.7 MiB. Could we instead use a small integer, such as 1 = ATOM, 2 = HETATM, 0 = no idea, and then get the string back on the fly (e.g., as a dict look-up)?

Or is this more complicated than the memory consumption?

I'm just conscientious of the fact that we will be using the additional memory for any PDB file (especially simulation systems), where no-one cares about the atom records per se.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question about memory consumption

@richardjgowers
Copy link
Member Author

richardjgowers commented Jan 26, 2018 via email

@jbarnoud
Copy link
Contributor

Encoding would save some memory for sure, and it will not affect performances in a perceivable way (or at least I do not expect it to). But it will create a new level of indirection for the user where what he sees is not what is stored; because it is different from how the other topology attribute are saved, I am worried the surprise the encoding could cause could be worst than the memory consumption.

@orbeckst
Copy link
Member

I think it depends on what you call the attribute and how you use it. Ultimately, only the PDBWriter needs to know what to do with it. (Of course, one needs to document the meaning of the codes but that's not worse than storing the PSF atomtype which can be either a number or a string, depending on if this is XPLOR PSF or CHARMM PSF.)

@jbarnoud
Copy link
Contributor

It all depend on if we want it to be exposed to the users. The way the PR does it, it is exposed, and I think it is a good idea. Indeed, by exposing it we make it possible for a user to change the record type of some residues to deal with some software that handle the two types of record separately (many software just ignore HETATM records). We also make it possible to add the information to a universe that does not have it to later write a PDB as wanted; finally it allows to select atoms based on the record type to, for instance, ignore the HETATM like many software.

If it is accessible to the user, it is much more natural to do something like selection.record_types = 'ATOM' than selection.record_types = 0; or to do selection[selection.record_types == 'ATOM'].

@mnmelo
Copy link
Member

mnmelo commented Jan 29, 2018

I think with some work involving @property getters and setters we could have both Richard's and Jonathan's approaches' benefits (at perhaps some added code complexity).

@richardjgowers
Copy link
Member Author

Yeah so @mnmelo is right, we can have our cake and eat it. With every TopologyAttr there's a get_atoms and set_atoms method where we can do work on the values if we want. I've updated this PR to do some encoding, this is invisible to the user and other components of MDA. Only if we go sniffing around inside RecordTypes.values will we see that it's actually a boolean array there :)

This isn't currently 100% working now as ResidueGroup.record_types will expose the boolean array, I'm looking at the cleanest way to fix this...

@richardjgowers
Copy link
Member Author

So this should be good to go if @jbarnoud happy with the encoding. I've left out writing record types and selecting based on them deliberately because they'd make good GSOC starters, I'll make them into issues once this is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants