Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple RDKitConverter #2775

Merged
merged 107 commits into from
Aug 21, 2020
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
90fccce
simple converter
Jun 19, 2020
ba0f89e
add PDB residue info and guessers
Jun 22, 2020
b54ec60
fix 2 letter atom names and fix bond order
Jun 24, 2020
ac38ceb
test mol2 topology and pdb info
Jun 24, 2020
de5560b
autopep8
Jun 25, 2020
b00dee3
fix code review
Jun 25, 2020
495217d
store other attributes in each atom
Jun 25, 2020
eb5d365
more tests on topology + pep8
Jun 25, 2020
1774963
fix other props type
Jun 26, 2020
81eea81
bond order and charges
Jun 26, 2020
6d6a65b
fix indentation and bonds with atoms outside of atomgroup
Jun 26, 2020
cf0706c
fix test
Jun 26, 2020
d41fa5b
fix test minimal build
Jun 29, 2020
fe27f0b
add icodes to MonomerInfo
Jun 29, 2020
3590113
better attributes conversion
Jun 29, 2020
2d79014
test mda to rdkit MonomerInfo conversion
Jun 29, 2020
df660b7
Merge branch 'rdkit-converter' into rdkit-sanitize
Jun 29, 2020
68dc7ca
test mda to rdkit MonomerInfo conversion
Jun 29, 2020
f57e9e6
docs
Jun 29, 2020
99da261
fix minimal deps test
Jun 29, 2020
18a05ab
fix for minimal deps tests 🙏
Jun 30, 2020
1c14478
use util functions
Jun 30, 2020
5493fd1
Merge branch 'develop' into rdkit-converter
Jun 30, 2020
f5c798a
move min deps test up
Jun 30, 2020
558f7cd
capitalize elements by default
Jun 30, 2020
984f507
store index in MDA object
Jul 1, 2020
e582e63
fetch singular from _TOPOLOGY_ATTRS
Jul 1, 2020
931a81b
updated changelog
Jul 2, 2020
c432ea9
use ag intersection for bonds instead of try except
Jul 3, 2020
b0c39eb
more tests + pep8
Jul 3, 2020
7b002f1
Merge branch 'develop' into rdkit-converter
Jul 3, 2020
dc0c276
fix test for numpy 1.13.3
Jul 6, 2020
6b8cd9d
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Jul 6, 2020
0fd38b0
docs + pep8
Jul 7, 2020
25ee227
fix too long atom names in PDB output
Jul 7, 2020
cfce278
table of supported attributes for the converter
Jul 7, 2020
3717f57
changelog convention
Jul 7, 2020
499010a
catch no bonds
Jul 7, 2020
9604b6a
use match in pytest.raises and pytest.warns
Jul 7, 2020
30871ab
remove bondtype attribute
Jul 8, 2020
9c0a42e
fix bond attributes
Jul 8, 2020
66d644d
Merge branch 'rdkit-sanitize' into rdkit-converter
Jul 8, 2020
1bbfb09
simplified the code for infering
Jul 8, 2020
4a79cfd
include negative charges
Jul 9, 2020
bd88908
skip failing test for now
Jul 10, 2020
d9e088e
new test
Jul 10, 2020
f6aa736
update doc
Jul 10, 2020
b943af1
standardize groups
Jul 13, 2020
0105419
fix tests
Jul 13, 2020
7168e2d
add previously failing tests
Jul 13, 2020
de324fd
Merge branch 'develop' into rdkit-converter
Jul 13, 2020
d1f06f3
simpler pattern standardization
Jul 14, 2020
5a661f8
more tests on inferring and patterns
Jul 14, 2020
6e9c221
test atom properties and reaction mapping
Jul 14, 2020
b02fef6
fix docs
Jul 14, 2020
67dc9c2
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Jul 14, 2020
c8acda0
fix minimal build tests
Jul 15, 2020
9defa40
pep8
Jul 15, 2020
39ae42b
bfactors as TempFactor
Jul 15, 2020
ff45d3b
Update CHANGELOG
Jul 15, 2020
7d7f3f2
pep8
Jul 16, 2020
9d895bc
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Jul 16, 2020
f3f6f01
pep8
Jul 16, 2020
5d342db
no hydrogen warning
Jul 16, 2020
eff29a2
quickfix for polycyclic conjugated systems
Jul 17, 2020
5a37d84
Merge branch 'develop' into rdkit-converter
Jul 17, 2020
ca07796
save any atom property instead of just the ones tagged with "_MDAnaly…
Jul 20, 2020
28d04ed
general solution
Jul 21, 2020
ae73aff
back to old method for conjugated systems
Jul 21, 2020
8eaebbd
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Jul 21, 2020
425369e
Merge branch 'develop' into rdkit-converter
Jul 22, 2020
d6cd44d
added "RDKit" in changelog
Jul 22, 2020
1398f8c
fix when first atom is charged
Jul 22, 2020
e71a686
working general solution to conjugated systems
Jul 22, 2020
925bd7f
pep8
Jul 22, 2020
5a24ff6
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Jul 22, 2020
e5282f3
for loop instead of while
Jul 22, 2020
b7ac69e
bugfix for conjugated system with N at one edge
Jul 23, 2020
aba4b1b
fix typos
Jul 27, 2020
49f3bff
fix test for min deps
Jul 27, 2020
70cce99
use indirect parametrization
Jul 27, 2020
e82cf0a
add coordinates + tests + fix index property
Jul 29, 2020
3427c3c
cache the molecule and only update conformers when iterating trajectory
Jul 29, 2020
5218daa
update docs
Jul 30, 2020
3313b56
add kwargs to the cache key
Jul 30, 2020
a72a51b
use fstrings
Jul 31, 2020
76102fb
Merge branch 'develop' into rdkit-converter
orbeckst Aug 7, 2020
b9a25b2
Merge branch 'develop' into rdkit-converter
Aug 10, 2020
cec1f68
cache kwarg + fix cache tests
Aug 10, 2020
75f6262
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Aug 10, 2020
6c4c123
fix code review + improve docs
Aug 13, 2020
cdb60b2
improve doc
Aug 13, 2020
ebd0416
add test nan in coordinates
Aug 13, 2020
a1aab5b
warn nan coordinates
Aug 13, 2020
dcb8f33
end of code review
Aug 13, 2020
47b2755
corrections to smarts patterns for reactions
Aug 14, 2020
89efcd3
sort atoms by NUE + more standardization tests
Aug 14, 2020
a595e69
pdb atom names formating
Aug 14, 2020
f394e72
test RDKit->MDA->RDKit->MDA equal
Aug 14, 2020
e525e41
tackle conjugated systems ending with O-
Aug 14, 2020
9e22681
docs + pep8
Aug 14, 2020
eaf6887
tests
Aug 14, 2020
ded60c0
add max_iter kwarg
Aug 14, 2020
c18c7cd
Merge branch 'develop' into rdkit-converter
Aug 14, 2020
4ed0483
reorder docstring standardize_patterns
Aug 14, 2020
d288e87
Merge branch 'rdkit-converter' of https://github.com/cbouy/mdanalysis…
Aug 14, 2020
4e49616
fix doc build
Aug 14, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
199 changes: 195 additions & 4 deletions package/MDAnalysis/coordinates/RDKit.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,8 @@
<Universe with 42 atoms>
>>> u.trajectory
<RDKitReader with 10 frames of 42 atoms>
>>> u.atoms.convert_to("RDKIT")
<rdkit.Chem.rdchem.RWMol object at 0x7fcebb958148>


Classes
Expand All @@ -55,15 +57,50 @@
"""

import warnings
import re

import numpy as np

from ..exceptions import NoDataError
from ..topology.guessers import guess_atom_element
from . import memory
from . import base

try:
from rdkit import Chem
except ImportError:
pass
else:
RDBONDTYPE = {
'AROMATIC': Chem.BondType.AROMATIC,
'SINGLE': Chem.BondType.SINGLE,
'DOUBLE': Chem.BondType.DOUBLE,
'TRIPLE': Chem.BondType.TRIPLE,
}
RDBONDORDER = {
1: Chem.BondType.SINGLE,
1.5: Chem.BondType.AROMATIC,
"ar": Chem.BondType.AROMATIC,
2: Chem.BondType.DOUBLE,
3: Chem.BondType.TRIPLE,
}
# add string version of the key for each bond
RDBONDORDER.update({str(key): value for key, value in RDBONDORDER.items()})
RDATTRIBUTES = {
"altLoc": "AltLoc",
"chainID": "ChainId",
"name": "Name",
"occupancy": "Occupancy",
"resname": "ResidueName",
"resid": "ResidueNumber",
"segindex": "SegmentNumber",
"tempfactor": "TempFactor",
}

cbouy marked this conversation as resolved.
Show resolved Hide resolved

class RDKitReader(memory.MemoryReader):
"""Coordinate reader for RDKit.

.. versionadded:: 2.0.0
"""
format = 'RDKIT'
Expand Down Expand Up @@ -95,10 +132,164 @@ def __init__(self, filename, **kwargs):
"""
n_atoms = filename.GetNumAtoms()
coordinates = np.array([
conf.GetPositions() for conf in filename.GetConformers()],
conf.GetPositions() for conf in filename.GetConformers()],
dtype=np.float32)
if coordinates.size == 0:
warnings.warn("No coordinates found in the RDKit molecule")
coordinates = np.empty((1,n_atoms,3), dtype=np.float32)
coordinates = np.empty((1, n_atoms, 3), dtype=np.float32)
coordinates[:] = np.nan
super(RDKitReader, self).__init__(coordinates, order='fac', **kwargs)
super(RDKitReader, self).__init__(coordinates, order='fac', **kwargs)


class RDKitConverter(base.ConverterBase):
"""Convert MDAnalysis AtomGroup or Universe to `RDKit <https://www.rdkit.org/docs/source/rdkit.Chem.rdchem.html#rdkit.Chem.rdchem.Mol>`_ :class:`rdkit.Chem.rdchem.Mol`.

Example
-------

.. code-block:: python

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PDB_full
u = mda.Universe(PDB_full)
mol = u.select_atoms('resname DMS').convert_to('RDKIT')


.. versionadded:: 2.0.0
"""

lib = 'RDKIT'
units = {'time': None, 'length': 'Angstrom'}

def convert(self, obj):
"""Write selection at current trajectory frame to :class:`~rdkit.Chem.rdchem.Mol`.

Parameters
-----------
obj : AtomGroup or Universe
"""
try:
from rdkit import Chem
except ImportError:
raise ImportError("RDKit is required for the RDKitConverter but "
"it's not installed. Try installing it with \n"
"conda install -c conda-forge rdkit")
try:
# make sure to use atoms (Issue 46)
ag = obj.atoms
except AttributeError:
raise TypeError("No `atoms` attribute in object of type {}, "
"please use a valid AtomGroup or Universe".format(
type(obj))) from None

try:
elements = ag.elements
cbouy marked this conversation as resolved.
Show resolved Hide resolved
except NoDataError:
raise AttributeError(
"The `elements` attribute is required for the RDKitConverter "
"but is not present in this AtomGroup. Please refer to the "
"documentation to guess elements from other attributes. "
"If `types` are present in the AtomGroup, a good starting "
"point would be:\n"
cbouy marked this conversation as resolved.
Show resolved Hide resolved
">>> from MDAnalysis.topology.guessers import "
"guess_atom_element\n"
">>> elements = np.array(["
"guess_atom_element(x).capitalize() for x in u.atoms.types"
"], dtype=object)\n"
">>> u.add_TopologyAttr('elements', elements)") from None

other_attrs = {}
for attr in ["bfactors", "charges", "icodes", "segids", "types"]:
if hasattr(ag, attr):
other_attrs[attr] = getattr(ag, attr)

mol = Chem.RWMol()
atom_mapper = {}

for i, (atom, element) in enumerate(zip(ag, elements)):
# create atom
rdatom = Chem.Atom(element)
# disable adding H to the molecule
IAlibay marked this conversation as resolved.
Show resolved Hide resolved
cbouy marked this conversation as resolved.
Show resolved Hide resolved
rdatom.SetNoImplicit(True)
# add PDB-like properties
mi = Chem.AtomPDBResidueInfo()
for attr, rdattr in RDATTRIBUTES.items():
_add_mda_attr_to_rdkit(atom, attr, rdattr, mi)
rdatom.SetMonomerInfo(mi)
# other properties
for attr in other_attrs.keys():
value = other_attrs[attr][i]
if isinstance(value, np.generic):
# convert numpy types to python standard types
value = str(value)
rdatom.SetProp("_MDAnalysis_%s" % attr[:-1], value)
# add atom
index = mol.AddAtom(rdatom)
# map index in universe to index in mol
cbouy marked this conversation as resolved.
Show resolved Hide resolved
atom_mapper[atom.ix] = index

try:
bonds = ag.bonds
if (len(bonds) == 0) and (ag.n_atoms > 1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could simplify this with a call to ag.bonds.values() that should raise an IndexError if you have no bonds in your system.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up not using bonds.values() as it needs coordinates to work, but people might just want to convert their topology to rdkit without needing the trajectory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable.

I'm trying to work out if this does the intended behaviour though. My understanding is that calling AtomGroup.bonds should raise a NoDataError if no bond information is present. If bond information is present but the number of bonds is 0, that should be a valid input, at least according to whatever the user provided right?

Is there a particular case where the bonds attribute get populated but we still require guessing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only case I could find where the bonds attribute was present but empty is a PDB file without a CONECT record, I don't know if that's intended or not (the PDB_helix test file for example).
The corresponding AtomGroup probably still requires guessing bonds though, as I don't really see a scenario where people need to convert their simulation of monoatomic ions or noble gases to RDKit.
For now I can just try if ag.bonds is present if you prefer, PDB files are a mess anyway !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably not normal, we should have a consistent API, if bonds aren't present then the bonds attribute doesn't get assigned :/ Can you raise this as an issue too? It might be that there's a rationale for it, but I can't seem to understand what it could be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raised #2832

# force guessing bonds
raise NoDataError
except NoDataError:
cbouy marked this conversation as resolved.
Show resolved Hide resolved
warnings.warn(
"No `bonds` attribute in this AtomGroup. Guessing bonds based"
"on atoms coordinates")
ag.guess_bonds()
IAlibay marked this conversation as resolved.
Show resolved Hide resolved
bonds = ag.bonds

for bond in bonds:
bond_indices = [atom_mapper[i] for i in bond.indices]
try:
bond_type = bond.type.upper()
except AttributeError:
# bond type can be a tuple for PDB files
bond_type = None
bond_type = RDBONDTYPE.get(bond_type, RDBONDORDER.get(
bond.order, Chem.BondType.SINGLE))
mol.AddBond(*bond_indices, bond_type)

# sanitization
Chem.SanitizeMol(mol)

return mol


def _add_mda_attr_to_rdkit(atom, attr, rdattr, mi):
"""Converts an MDAnalysis atom attribute into the RDKit equivalent and
stores it into an RDKit AtomPDBResidueInfo object.

Parameters
----------

atom : MDAnalysis.core.groups.Atom
The atom to get the attributes from
attr : str
Name of the atom attribute in MDAnalysis in the singular form
rdattr : str
Name of the equivalent attribute in RDKit, as found in the `Set` and
`Get` methods of the `AtomPDBResidueInfo`
mi : rdkit.Chem.rdchem.AtomPDBResidueInfo
MonomerInfo object containing all the relevant atom attributes
"""
try: # get value in MDA atom
value = getattr(atom, attr)
except AttributeError:
pass
else:
if isinstance(value, np.generic):
# convert numpy types to python standard types
value = value.item()
if attr == "name":
# RDKit needs the name to be properly formated for a
# PDB file (1 letter elements start at col 14)
name = re.findall('(\D+|\d+)', value)
if len(name) == 2:
symbol, number = name
else:
symbol, number = name[0], ""
value = "{:>2}".format(symbol) + "{:<2}".format(number)
# set attribute value in RDKit MonomerInfo
getattr(mi, "Set%s" % rdattr)(value)
119 changes: 111 additions & 8 deletions testsuite/MDAnalysisTests/coordinates/test_rdkit.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,24 +24,34 @@

import pytest
import MDAnalysis as mda
from MDAnalysis.topology.guessers import guess_atom_element
import numpy as np
from numpy.testing import (assert_equal,
assert_almost_equal)

from MDAnalysisTests.datafiles import mol2_molecule
from MDAnalysisTests.datafiles import mol2_molecule, PDB_full

try:
from rdkit import Chem
from rdkit.Chem import AllChem
except ImportError:
rdkit_installed = False
else:
rdkit_installed = True

Chem = pytest.importorskip("rdkit.Chem")
AllChem = pytest.importorskip("rdkit.Chem.AllChem")

def mol2_mol():
return Chem.MolFromMol2File(mol2_molecule, removeHs=False)


def smiles_mol():
mol = Chem.MolFromSmiles("CCO")
mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
mol = Chem.AddHs(mol)
cids = AllChem.EmbedMultipleConfs(mol, numConfs=3)
return mol


@pytest.mark.skipif(rdkit_installed is False, reason="requires RDKit")
class TestRDKitReader(object):
@pytest.mark.parametrize("rdmol, n_frames", [
(mol2_mol(), 1),
Expand All @@ -51,7 +61,7 @@ def test_coordinates(self, rdmol, n_frames):
universe = mda.Universe(rdmol)
assert universe.trajectory.n_frames == n_frames
expected = np.array([
conf.GetPositions() for conf in rdmol.GetConformers()],
conf.GetPositions() for conf in rdmol.GetConformers()],
dtype=np.float32)
assert_equal(expected, universe.trajectory.coordinate_array)

Expand All @@ -65,14 +75,107 @@ def test_no_coordinates(self):
assert len(w) == 1
assert "No coordinates found" in str(
w[-1].message)
expected = np.empty((1,u.atoms.n_atoms,3), dtype=np.float32)
expected = np.empty((1, u.atoms.n_atoms, 3), dtype=np.float32)
expected[:] = np.nan
assert_equal(u.trajectory.coordinate_array, expected)

def test_compare_mol2reader(self):
universe = mda.Universe(mol2_mol())
mol2 = mda.Universe(mol2_molecule)
assert universe.trajectory.n_frames == mol2.trajectory.n_frames
assert_equal(universe.trajectory.ts.positions,
assert_equal(universe.trajectory.ts.positions,
mol2.trajectory.ts.positions)



@pytest.mark.skipif(rdkit_installed is False, reason="requires RDKit")
class TestRDKitConverter(object):
@pytest.fixture
def pdb(self):
return mda.Universe(PDB_full)

@pytest.fixture
def mol2(self):
u = mda.Universe(mol2_molecule)
# add elements
elements = np.array([
guess_atom_element(x).capitalize() for x in u.atoms.types
], dtype=object)
u.add_TopologyAttr('elements', elements)
return u

@pytest.mark.parametrize("sel_str", [
"resid 1",
"resname LYS and name NZ",
"resid 34 and altloc B",
])
def test_monomer_info(self, pdb, sel_str):
rdmol = Chem.MolFromPDBFile(PDB_full)
sel = pdb.select_atoms(sel_str)
umol = sel.convert_to("RDKIT")
atom = umol.GetAtomWithIdx(0)
mi = atom.GetMonomerInfo()

for mda_attr, rd_attr in mda.coordinates.RDKit.RDATTRIBUTES.items():
if mda_attr == "occupancy":
mda_attr = "occupancie"
elif mda_attr == "segindex":
mda_attr = "segindice"
rd_value = getattr(mi, "Get%s" % rd_attr)()
mda_value = getattr(sel, "%ss" % mda_attr)[0]
if mda_attr == "name":
rd_value = rd_value.strip()
assert rd_value == mda_value

def test_identical_topology_mol2(self, mol2):
"""Check stereochemistry on atoms and bonds (but not yet)"""
rdmol = mol2_mol()
umol = mol2.atoms.convert_to("RDKIT")
assert rdmol.HasSubstructMatch(umol, useChirality=False)
assert umol.HasSubstructMatch(rdmol, useChirality=False)

def test_identical_topology(self):
rdmol = smiles_mol()
u = mda.Universe(rdmol)
umol = u.atoms.convert_to("RDKIT")
assert rdmol.HasSubstructMatch(umol) and umol.HasSubstructMatch(rdmol)

def test_raise_requires_elements(self):
IAlibay marked this conversation as resolved.
Show resolved Hide resolved
u = mda.Universe(mol2_molecule)
with pytest.raises(AttributeError) as e:
cbouy marked this conversation as resolved.
Show resolved Hide resolved
u.atoms.convert_to("RDKIT")
assert "`elements` attribute is required for the RDKitConverter" in str(
e.value)

def test_warn_guess_bonds(self, pdb):
pdb.delete_bonds(pdb.bonds)
ag = pdb.select_atoms("resnum 101 and segid A")
pdb.delete_bonds(ag.bonds)
with warnings.catch_warnings(record=True) as w:
cbouy marked this conversation as resolved.
Show resolved Hide resolved
# Cause all warnings to always be triggered.
warnings.simplefilter("always")
# trigger warning
ag.convert_to("RDKIT")
assert len(w) == 1
assert "No `bonds` attribute in this AtomGroup" in str(
w[-1].message)

@pytest.mark.parametrize("idx", [0, 10, 42])
def test_other_attributes(self, mol2, idx):
mol = mol2.atoms.convert_to("RDKIT")
rdprops = mol.GetAtomWithIdx(idx).GetPropsAsDict()
for prop in ["charge", "segid", "type"]:
rdprop = rdprops["_MDAnalysis_%s" % prop]
mdaprop = getattr(mol2.atoms[idx], prop)
assert rdprop == mdaprop


class TestRequiresRDKit(object):
def test_converter_requires_rdkit(self):
if rdkit_installed:
pass
else:
u = mda.Universe(mol2_molecule)
with pytest.raises(ImportError) as e:
u.atoms.convert_to("RDKIT")
assert "RDKit is required for the RDKitConverter" in str(
e.value)