added reading of record types #1762

richardjgowers · 2018-01-25T16:22:57Z

Starts #1753

Changes made in this Pull Request:

added record_type attribute

Is this the best name for the ATOM/HETATM thing in a PDB file?

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

orbeckst · 2018-01-26T20:55:10Z

package/MDAnalysis/core/topologyattrs.py

+
+    @staticmethod
+    def _gen_initial_values(na, nr, ns):
+        return np.array(['ATOM'] * na, dtype=object)


It seems (memory)-wasteful to store a string with each atom. For 1M atoms, that can be an additional len("HETATM") * 1e6 / (1024*1024) = 5.7 MiB. Could we instead use a small integer, such as 1 = ATOM, 2 = HETATM, 0 = no idea, and then get the string back on the fly (e.g., as a dict look-up)?

Or is this more complicated than the memory consumption?

I'm just conscious of the fact that we will be using the additional memory for any PDB file (especially simulation systems), where no one cares about the atom records per se.
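
A rough sketch of the integer-code idea suggested above, assuming NumPy. The code table and the names `encode` and `decode` are illustrative only and are not part of MDAnalysis. It compares a fixed-width six-byte string array with one `uint8` code per atom and decodes the codes back to strings on the fly.

```python
import numpy as np

# Hypothetical code table following the suggestion above:
# 0 = unknown, 1 = ATOM, 2 = HETATM.
CODE_TO_RECORD = np.array(['', 'ATOM', 'HETATM'], dtype=object)
RECORD_TO_CODE = {'': 0, 'ATOM': 1, 'HETATM': 2}


def encode(records):
    """Map record strings to one-byte integer codes."""
    return np.array([RECORD_TO_CODE[r] for r in records], dtype=np.uint8)


def decode(codes):
    """Map integer codes back to record strings via array look-up."""
    return CODE_TO_RECORD[codes]


n_atoms = 1000
records = ['ATOM'] * (n_atoms - 10) + ['HETATM'] * 10

codes = encode(records)
fixed_width = np.array(records, dtype='S6')  # six bytes per atom

print(codes.nbytes)        # one byte per atom
print(fixed_width.nbytes)  # six bytes per atom
```

The look-up table keeps the strings in one place, so only the small integer array grows with the number of atoms.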

orbeckst

question about memory consumption

richardjgowers · 2018-01-26T21:20:32Z

Yeah, we could do some encoding, actually, seeing as there are only two options. That's a good point.


jbarnoud · 2018-01-27T10:15:29Z

Encoding would save some memory for sure, and it will not affect performance in a perceivable way (or at least I do not expect it to). But it will create a new level of indirection for the user, where what they see is not what is stored. Because this differs from how the other topology attributes are saved, I am worried that the surprise caused by the encoding could be worse than the memory consumption.

orbeckst · 2018-01-29T04:14:06Z

I think it depends on what you call the attribute and how you use it. Ultimately, only the PDBWriter needs to know what to do with it. (Of course, one needs to document the meaning of the codes, but that's no worse than storing the PSF atomtype, which can be either a number or a string, depending on whether this is an XPLOR PSF or a CHARMM PSF.)

jbarnoud · 2018-01-29T13:30:38Z

It all depends on whether we want it to be exposed to users. The way the PR does it, it is exposed, and I think that is a good idea. By exposing it, we make it possible for a user to change the record type of some residues to deal with software that handles the two record types separately (many programs simply ignore HETATM records). We also make it possible to add the information to a universe that does not have it, so that a PDB can later be written as wanted. Finally, it allows selecting atoms based on the record type to, for instance, ignore the HETATM records as many programs do.

If it is accessible to the user, it is much more natural to do something like selection.record_types = 'ATOM' than selection.record_types = 0; or to do selection[selection.record_types == 'ATOM'].
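
A small illustration of why string values read naturally for masking and relabelling. This uses plain NumPy arrays as stand-ins for the attribute; the array names and atom names are made up, and this is not the MDAnalysis API.

```python
import numpy as np

# Stand-in arrays; in the PR these values would come from the topology.
record_types = np.array(['ATOM', 'ATOM', 'HETATM', 'ATOM', 'HETATM'],
                        dtype=object)
names = np.array(['N', 'CA', 'ZN', 'C', 'O'], dtype=object)

# Boolean mask built by comparing against the record string.
atom_mask = record_types == 'ATOM'
atom_names = names[atom_mask]

# Relabel one entry so that downstream tools ignoring HETATM keep it.
record_types[names == 'ZN'] = 'ATOM'

print(list(atom_names))
print(list(record_types))
```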

mnmelo · 2018-01-29T13:42:55Z

I think with some work involving @property getters and setters we could have both Richard's and Jonathan's approaches' benefits (at perhaps some added code complexity).

richardjgowers · 2018-01-29T14:10:25Z

Yeah so @mnmelo is right, we can have our cake and eat it. With every TopologyAttr there's a get_atoms and set_atoms method where we can do work on the values if we want. I've updated this PR to do some encoding, this is invisible to the user and other components of MDA. Only if we go sniffing around inside RecordTypes.values will we see that it's actually a boolean array there :)

This isn't 100% working yet, as ResidueGroup.record_types will expose the boolean array. I'm looking at the cleanest way to fix this...
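
A minimal sketch of the encoding approach described above, assuming a boolean array where `True` means ATOM and `False` means HETATM. The class `EncodedRecordTypes` is a hypothetical stand-in, not the code merged in this PR; only the method names `get_atoms` and `set_atoms` come from the discussion, and their signatures here are simplified to take plain indices.

```python
import numpy as np


class EncodedRecordTypes:
    """Hypothetical stand-in: stores booleans, exposes strings."""

    def __init__(self, records):
        # Assumed encoding: True = ATOM, False = HETATM.
        self.values = np.array([r == 'ATOM' for r in records], dtype=bool)

    def get_atoms(self, indices):
        """Decode the stored booleans into record strings."""
        decoded = np.where(self.values[indices], 'ATOM', 'HETATM')
        return decoded.astype(object)

    def set_atoms(self, indices, records):
        """Validate record strings and store them as booleans."""
        records = np.asarray(records, dtype=object)
        unknown = set(np.atleast_1d(records).tolist()) - {'ATOM', 'HETATM'}
        if unknown:
            raise ValueError("unknown record types: {}".format(unknown))
        self.values[indices] = records == 'ATOM'


attr = EncodedRecordTypes(['ATOM', 'HETATM', 'ATOM'])
first = list(attr.get_atoms([0, 1, 2]))

attr.set_atoms([0, 2], ['HETATM', 'HETATM'])
second = list(attr.get_atoms(slice(None)))

print(attr.values.dtype, first, second)
```

Callers only ever see strings; the boolean array is visible only by inspecting `values` directly.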

richardjgowers · 2018-01-29T14:44:37Z

So this should be good to go if @jbarnoud is happy with the encoding. I've deliberately left out writing record types and selecting based on them because they'd make good GSoC starters; I'll make them into issues once this is merged.

richardjgowers added 2 commits January 25, 2018 16:18

added reading of record types

911ea22

added test for default record_types

93aefc7

richardjgowers added this to the 0.17.x milestone Jan 25, 2018

orbeckst reviewed Jan 26, 2018

jbarnoud approved these changes Jan 27, 2018

encoding of RecordTypes

5265b62

fixed ResidueGroup and SegmentGroup RecordType access

4de452f

jbarnoud self-assigned this Jan 29, 2018

jbarnoud added Format-PDB Format-PQR Component-Topology labels Jan 29, 2018

jbarnoud merged commit d9ef4ec into develop Jan 29, 2018

xiki-tempula mentioned this pull request Jan 30, 2018

Allow PDBWriter (and similar) to write record_types #1753

Closed

kain88-de deleted the issue-1753-add_recordtype branch May 21, 2018 08:38
