Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An "accurate and reliable" RDKitConverter #3044

Merged
merged 55 commits into from
Apr 2, 2022
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
f464d3c
fix carbonyl
Oct 5, 2020
d129537
fix charged mols
Oct 12, 2020
24ddb3e
remove false premature return
Oct 14, 2020
4b6afc1
minor comments
Oct 14, 2020
c9513c7
assign most likely charges to monatomic ions
Oct 14, 2020
338dad3
fix when 2 ending patterns are matched
Oct 14, 2020
029ab1f
skip when backtrack prevents changes
Oct 15, 2020
046383e
improvements for Nterm, sulfones, and nitrogen next to a charged atom
Oct 27, 2020
bd12bc1
better handling of ring conjugated systems and conjugated triple bonds
Oct 28, 2020
bd14836
revert rdkitcache branch changes
Oct 28, 2020
0e7d4e4
monatomic default charges handling
Nov 13, 2020
d9b4d95
fix multiple types of end pattern + test
Nov 17, 2020
b36eab6
fix some sulfur and nitrogen compounds
Nov 18, 2020
555eab3
add amino and nucleic acids test
Nov 18, 2020
3e4d066
add test for ions
Nov 18, 2020
ec54125
final fixes
Nov 19, 2020
3167e43
fix docs and unused import
Nov 19, 2020
7b6823b
pep8
Nov 19, 2020
f695f91
Merge branch 'develop' into fix-converter
cbouy May 11, 2021
3f4645f
pep8
cbouy May 11, 2021
20a64ae
comments
cbouy May 11, 2021
5c03ee8
fix rdkit pdb atom names formatting
cbouy May 11, 2021
44c523e
skip sanitization if fails
cbouy May 11, 2021
0ce3271
reorder rdkit atoms to match mdanalysis
cbouy May 11, 2021
7477e7f
change setting and transferring atom properties
cbouy May 12, 2021
df4feb4
fix tests + pep8
cbouy May 12, 2021
359fdf4
fix tests
cbouy May 12, 2021
e358504
add more comments and UGM presentation
cbouy May 13, 2021
8fbc852
fix links
cbouy May 13, 2021
2c3e2fa
add test for new atom reordering
cbouy May 24, 2021
9ba698d
remove unnecessary sorting when using rdkit-based guessers or selection
cbouy May 24, 2021
896e7dc
Merge branch 'MDAnalysis:develop' into fix-converter
cbouy Jan 18, 2022
923dfb1
Merge branch 'MDAnalysis:develop' into fix-converter
cbouy Feb 16, 2022
3f6ff64
add missing test
cbouy Feb 16, 2022
5872559
add bond stereo test
cbouy Feb 17, 2022
7108da8
Merge branch 'MDAnalysis:develop' into fix-converter
cbouy Mar 7, 2022
428d341
more docs and comments
cbouy Mar 7, 2022
50de418
use PDBWriter.deduce_pdb_atom_name instead of custom code
cbouy Mar 7, 2022
eec6b1d
remove unnecessary bond stereo assignment
cbouy Mar 7, 2022
393d141
remove unused bond test
cbouy Mar 11, 2022
a2886e5
add _MDAnalysis_name atom property
cbouy Mar 16, 2022
ccd4bca
Merge branch 'MDAnalysis:develop' into fix-converter
cbouy Mar 16, 2022
055692b
add bond stereo test
cbouy Mar 16, 2022
3c9d874
mention failing molecules in docs
cbouy Mar 16, 2022
1a094fa
update changelog
cbouy Mar 16, 2022
9b3b7b5
improve readability
cbouy Mar 16, 2022
b18150f
improve tests readability
cbouy Mar 16, 2022
e01fc07
fix resonance struct test
cbouy Mar 17, 2022
4e4a40b
add xfail test
cbouy Mar 17, 2022
1a08adb
Merge branch 'develop' into fix-converter
orbeckst Mar 23, 2022
05c5c1b
Merge branch 'develop' into fix-converter
cbouy Mar 30, 2022
5a48bd6
lint
cbouy Mar 30, 2022
430f3e6
docs update from code review
cbouy Mar 30, 2022
d324223
Merge branch 'fix-converter' of github.com:cbouy/mdanalysis into fix-…
cbouy Mar 30, 2022
f019d28
fix docs
cbouy Mar 30, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 144 additions & 86 deletions package/MDAnalysis/coordinates/RDKit.py
Original file line number Diff line number Diff line change
Expand Up @@ -273,8 +273,8 @@ def convert(self, obj, cache=True, NoImplicit=True, max_iter=200):

Parameters
-----------
obj : :class:`~MDAnalysis.core.groups.AtomGroup` or :class:`~MDAnalysis.core.universe.Universe`

obj : :class:`~MDAnalysis.core.groups.AtomGroup` or
:class:`~MDAnalysis.core.universe.Universe`
cache : bool
Use a cached copy of the molecule's topology when available. To be
used, the cached molecule and the new one have to be made from the
Expand Down Expand Up @@ -340,7 +340,7 @@ def convert(self, obj, cache=True, NoImplicit=True, max_iter=200):
return mol

def atomgroup_to_mol(self, ag, NoImplicit=True, max_iter=200):
"""Converts an AtomGroup to an RDKit molecule.
"""Converts an AtomGroup to an RDKit molecule

Parameters
-----------
Expand Down Expand Up @@ -512,9 +512,25 @@ def _infer_bo_and_charges(mol):
R-C(-O)-O the first oxygen read will receive a double bond and the other
one will be charged. It will also affect more complex conjugated systems.
"""
atoms = sorted([a for a in mol.GetAtoms() if a.GetAtomicNum() > 1],
cbouy marked this conversation as resolved.
Show resolved Hide resolved
reverse=True,
key=lambda a: _get_nb_unpaired_electrons(a)[0])

MONATOMIC_CATION_CHARGES = {
3: 1, 11: 1, 19: 1, 37: 1, 47: 1, 55: 1,
12: 2, 20: 2, 29: 2, 30: 2, 38: 2, 56: 2,
26: 2, # Fe could also be +3
13: 3,
}

for atom in sorted(mol.GetAtoms(), reverse=True,
key=lambda a: _get_nb_unpaired_electrons(a)[0]):
for atom in atoms:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add tests for case where an atom degree is 0?

# monatomic ions
if atom.GetDegree() == 0:
atom.SetFormalCharge(MONATOMIC_CATION_CHARGES.get(
atom.GetAtomicNum(),
-_get_nb_unpaired_electrons(atom)[0]))
mol.UpdatePropertyCache(strict=False)
continue
# get NUE for each possible valence
nue = _get_nb_unpaired_electrons(atom)
# if there's only one possible valence state and the corresponding
Expand All @@ -530,7 +546,7 @@ def _infer_bo_and_charges(mol):
neighbors = sorted(atom.GetNeighbors(), reverse=True,
key=lambda a: _get_nb_unpaired_electrons(a)[0])
# check if one of the neighbors has a common NUE
for i, na in enumerate(neighbors, start=1):
for na in neighbors:
# get NUE for the neighbor
na_nue = _get_nb_unpaired_electrons(na)
# smallest common NUE
Expand All @@ -546,15 +562,16 @@ def _infer_bo_and_charges(mol):
order = common_nue + 1
bond.SetBondType(RDBONDORDER[order])
mol.UpdatePropertyCache(strict=False)
if i < len(neighbors):
# recalculate nue for atom
nue = _get_nb_unpaired_electrons(atom)

# if the atom still has unpaired electrons
nue = _get_nb_unpaired_electrons(atom)[0]
if nue > 0:
# transform it to a negative charge
atom.SetFormalCharge(-nue)
# go to next atom if one of the valences is complete
nue = _get_nb_unpaired_electrons(atom)
if any([n == 0 for n in nue]):
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test for this case where common_nue != 0?


# if atom valence is still not filled
nue = _get_nb_unpaired_electrons(atom)
if not any([n == 0 for n in nue]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

# transform nue to charge
atom.SetFormalCharge(-nue[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

atom.SetNumRadicalElectrons(0)
mol.UpdatePropertyCache(strict=False)

Expand Down Expand Up @@ -605,52 +622,53 @@ def _standardize_patterns(mol, max_iter=200):
+---------------+------------------------------------------------------------------------------+
| Name | Reaction |
+===============+==============================================================================+
| conjugated | ``[*-;!O:1]-[*:2]=[*:3]-[*-:4]>>[*+0:1]=[*:2]-[*:3]=[*+0:4]`` |
| conjugated | ``[*-:1]-[*:2]=[*:3]-[*-:4]>>[*+0:1]=[*:2]-[*:3]=[*+0:4]`` |
+---------------+------------------------------------------------------------------------------+
| conjugated N+ | ``[N;X3;v3:1]-[*:2]=[*:3]-[*-:4]>>[N+:1]=[*:2]-[*:3]=[*+0:4]`` |
+---------------+------------------------------------------------------------------------------+
| conjugated-N+ | ``[N;X3;v3:1]-[*:2]=[*:3]-[*-:4]>>[N+:1]=[*:2]-[*:3]=[*+0:4]`` |
| conjugated O- | ``[O:1]=[#6+0,#7+:2]-[*:3]=[*:4]-[*-:5]>>[O-:1]-[*:2]=[*:3]-[*:4]=[*+0:5]`` |
+---------------+------------------------------------------------------------------------------+
| conjugated-O- | ``[O:1]=[#6:2]-[*:3]=[*:4]-[*-:5]>>[O-:1]-[*:2]=[*:3]-[*:4]=[*+0:5]`` |
| conjug. S=O | ``[O-:1]-[S;D4;v4:2]-[*:3]=[*:4]-[*-:5]>>[O+0:1]=[*:2]=[*:3]-[*:4]=[*+0:5]`` |
+---------------+------------------------------------------------------------------------------+
| Cterm | ``[C-;X2:1]=[O:2]>>[C+0:1]=[O:2]`` |
| Cterm | ``[C-;X2;H0:1]=[O:2]>>[C+0:1]=[O:2]`` |
+---------------+------------------------------------------------------------------------------+
| Nterm | ``[N-;X2;H1:1]>>[N+0:1]`` |
| Nterm | ``[N-;X2;H1;$(N-[*^3]):1]>>[N+0:1]`` |
+---------------+------------------------------------------------------------------------------+
| keto-enolate | ``[#6-:1]-[#6:2]=[O:3]>>[#6+0:1]=[#6:2]-[O-:3]`` |
+---------------+------------------------------------------------------------------------------+
| arginine | ``[N;H1:1]-[C-;X3;H0:2](-[N;H2:3])-[N;H2:4]>>[N:1]-[C+0:2](-[N:3])=[N+:4]`` |
| arginine | ``[C-;v3:1]-[#7+0;v3;H2:2]>>[#6+0:1]=[#7+:2]`` |
+---------------+------------------------------------------------------------------------------+
| histidine | ``[#6+0;H0:1]=[#6+0:2]-[#7;X3:3]-[#6-;X3:4]>>[#6:1]=[#6:2]-[#7+:3]=[#6+0:4]``|
+---------------+------------------------------------------------------------------------------+
| sulfone | ``[S;X4;v4:1](-[O-;X1:2])-[O-;X1:3]>>[S:1](=[O+0:2])=[O+0:3]`` |
| sulfone | ``[S;D4;!v6:1]-[*-:2]>>[S;v6:1]=[*+0:2]`` |
+---------------+------------------------------------------------------------------------------+
| nitro | ``[N;X3;v3:1](-[O-;X1:2])-[O-;X1:3]>>[N+:1](-[O-:2])=[O+0:3]`` |
| charged N | ``[#7+0;X3:1]-[*-:2]>>[#7+:1]=[*+0:2]`` |
+---------------+------------------------------------------------------------------------------+

"""

# standardize conjugated systems
_rebuild_conjugated_bonds(mol, max_iter)

# list of sanitized reactions
reactions = []
for name, rxn in [
cbouy marked this conversation as resolved.
Show resolved Hide resolved
("Cterm", "[C-;X2;H0:1]=[O:2]>>[C+0:1]=[O:2]"),
("Nterm", "[N-;X2;H1;$(N-[*^3]):1]>>[N+0:1]"),
("keto-enolate", "[#6-:1]-[#6:2]=[O:3]>>[#6+0:1]=[#6:2]-[O-:3]"),
("ARG", "[C-;v3:1]-[#7+0;v3;H2:2]>>[#6+0:1]=[#7+:2]"),
("HIP", "[#6+0;H0:1]=[#6+0:2]-[#7;X3:3]-[#6-;X3:4]"
">>[#6:1]=[#6:2]-[#7+:3]=[#6+0:4]"),
("sulfone", "[S;D4;!v6:1]-[*-:2]>>[S;v6:1]=[*+0:2]"),
("charged-N", "[#7+0;X3:1]-[*-:2]>>[#7+:1]=[*+0:2]"),
]:
reaction = AllChem.ReactionFromSmarts(rxn)
reactions.append(reaction)

fragments = []
for reactant in Chem.GetMolFrags(mol, asMols=True):

for name, reaction in [
("Cterm", "[C-;X2:1]=[O:2]>>[C+0:1]=[O:2]"),
("Nterm", "[N-;X2;H1:1]>>[N+0:1]"),
("keto-enolate", "[#6-:1]-[#6:2]=[O:3]>>[#6+0:1]=[#6:2]-[O-:3]"),
("ARG", "[N;H1:1]-[C-;X3;H0:2](-[N;H2:3])-[N;H2:4]"
">>[N:1]-[C+0:2](-[N:3])=[N+:4]"),
("HIP", "[#6+0;H0:1]=[#6+0:2]-[#7;X3:3]-[#6-;X3:4]"
">>[#6:1]=[#6:2]-[#7+:3]=[#6+0:4]"),
("sulfone", "[S;X4;v4:1](-[O-;X1:2])-[O-;X1:3]"
">>[S:1](=[O+0:2])=[O+0:3]"),
("nitro", "[N;X3;v3:1](-[O-;X1:2])-[O-;X1:3]"
">>[N+:1](-[O-:2])=[O+0:3]"),
]:
reactant.UpdatePropertyCache(strict=False)
Chem.Kekulize(reactant)
for reaction in reactions:
reactant = _run_reaction(reaction, reactant)

fragments.append(reactant)

# recombine fragments
Expand All @@ -670,8 +688,8 @@ def _run_reaction(reaction, reactant):

Parameters
----------
reaction : str
SMARTS reaction
reaction : rdkit.Chem.rdChemReactions.ChemicalReaction
Reaction from SMARTS
reactant : rdkit.Chem.rdchem.Mol
The molecule to transform

Expand All @@ -680,27 +698,22 @@ def _run_reaction(reaction, reactant):
product : rdkit.Chem.rdchem.Mol
The final product of the reaction
"""
# count how many times the reaction should be run
pattern = Chem.MolFromSmarts(reaction.split(">>")[0])
n_matches = len(reactant.GetSubstructMatches(pattern))

# run the reaction for each matched pattern
rxn = AllChem.ReactionFromSmarts(reaction)
for n in range(n_matches):
products = rxn.RunReactants((reactant,))
for n in range(reactant.GetNumAtoms()):
reactant.UpdatePropertyCache(strict=False)
Chem.Kekulize(reactant)
products = reaction.RunReactants((reactant,))
# only keep the first product
if products:
product = products[0][0]
cbouy marked this conversation as resolved.
Show resolved Hide resolved
# map back atom properties from the reactant to the product
_reassign_props_after_reaction(reactant, product)
# apply the next reaction to the product
product.UpdatePropertyCache(strict=False)
reactant = product
else:
# exit the n_matches loop if there's no product. Example
# where this is needed: SO^{4}_{2-} will match the sulfone
# pattern 6 times but the reaction is only needed once
# exit the loop if there's no product
break
reactant.UpdatePropertyCache(strict=False)
Chem.Kekulize(reactant)
return reactant


Expand All @@ -723,6 +736,7 @@ def _rebuild_conjugated_bonds(mol, max_iter=200):
Since ``anion-*=*`` is the same as ``*=*-anion`` in terms of SMARTS, we can
control that we don't transform the same triplet of atoms back and forth by
adding their indices to a list.
This function also handles conjugated systems with triple bonds.
The molecule needs to be kekulized first to also cover systems
with aromatic rings.

Expand All @@ -735,61 +749,83 @@ def _rebuild_conjugated_bonds(mol, max_iter=200):
"""
mol.UpdatePropertyCache(strict=False)
Chem.Kekulize(mol)
pattern = Chem.MolFromSmarts("[*-;!O]-[*+0]=[*+0]")
pattern = Chem.MolFromSmarts("[*-{1-2}]-,=[*+0]=,#[*+0]")
base_end_pattern = Chem.MolFromSmarts(
"[*-{1-2}]-,=[*+0]=,#[*+0]-,=[*-{1-2}]")
odd_end_pattern = Chem.MolFromSmarts(
"[*-]-[*+0]=[*+0]-[*-,$([#7;X3;v3]),$([#6+0,#7+1]=O),"
"$([S;D4;v4]-[O-])]")
# number of unique matches with the pattern
n_matches = len(set([match[0]
for match in mol.GetSubstructMatches(pattern)]))
if n_matches == 0:
# nothing to standardize
return
# check if there's an even number of anion-*=* patterns
elif n_matches % 2 == 0:
end_pattern = Chem.MolFromSmarts("[*-;!O]-[*+0]=[*+0]-[*-]")
else:
elif n_matches == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

# as a last resort, the only way to standardize is to find a nitrogen
# that can accept a double bond and a positive charge
# or a carbonyl that will become an enolate
end_pattern = Chem.MolFromSmarts(
"[*-;!O]-[*+0]=[*+0]-[$([#7;X3;v3]),$([#6+0]=O)]")
end_pattern = odd_end_pattern
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this please?

# give priority to base end pattern and then deal with the odd one if
# necessary
end_pattern = base_end_pattern
backtrack = []
backtrack_cycles = 0
for _ in range(max_iter):
# simplest case where n=1
# check for ending pattern
end_match = mol.GetSubstructMatch(end_pattern)
if end_match:
# index of each atom
anion1, a1, a2, anion2 = end_match
term_atom = mol.GetAtomWithIdx(anion2)
# [*-]-*=*-C=O
if term_atom.GetAtomicNum() == 6 and term_atom.GetFormalCharge() == 0:
# [*-]-*=*-[C,N+]=O --> *=*-*=[C,N+]-[O-]
if (
term_atom.GetAtomicNum() == 6
and term_atom.GetFormalCharge() == 0
) or (
term_atom.GetAtomicNum() == 7
and term_atom.GetFormalCharge() == 1
):
for neighbor in term_atom.GetNeighbors():
bond = mol.GetBondBetweenAtoms(anion2, neighbor.GetIdx())
if neighbor.GetAtomicNum() == 8 and bond.GetBondTypeAsDouble() == 2:
if (neighbor.GetAtomicNum() == 8 and
bond.GetBondTypeAsDouble() == 2):
bond.SetBondType(Chem.BondType.SINGLE)
cbouy marked this conversation as resolved.
Show resolved Hide resolved
neighbor.SetFormalCharge(-1)
break
# [*-]-*=*-[Sv4]-[O-] --> *=*-*=[Sv6]=O
elif (term_atom.GetAtomicNum() == 16 and
term_atom.GetFormalCharge() == 0):
for neighbor in term_atom.GetNeighbors():
bond = mol.GetBondBetweenAtoms(anion2, neighbor.GetIdx())
if (neighbor.GetAtomicNum() == 8 and
neighbor.GetFormalCharge() == -1 and
bond.GetBondTypeAsDouble() == 1):
bond.SetBondType(Chem.BondType.DOUBLE)
neighbor.SetFormalCharge(0)
break
else:
# [*-]-*=*-N
if term_atom.GetAtomicNum() == 7 and term_atom.GetFormalCharge() == 0:
end_charge = 1
# [*-]-*=*-[*-]
else:
end_charge = 0
mol.GetAtomWithIdx(anion2).SetFormalCharge(end_charge)
term_atom.SetFormalCharge(term_atom.GetFormalCharge() + 1)
# common part of the conjugated systems: [*-]-*=*
mol.GetAtomWithIdx(anion1).SetFormalCharge(0)
mol.GetBondBetweenAtoms(anion1, a1).SetBondType(
Chem.BondType.DOUBLE)
mol.GetBondBetweenAtoms(a1, a2).SetBondType(Chem.BondType.SINGLE)
mol.GetBondBetweenAtoms(a2, anion2).SetBondType(
Chem.BondType.DOUBLE)
a = mol.GetAtomWithIdx(anion1)
a.SetFormalCharge(a.GetFormalCharge() + 1)
b = mol.GetBondBetweenAtoms(anion1, a1)
b.SetBondType(RDBONDORDER[b.GetBondTypeAsDouble() + 1])
b = mol.GetBondBetweenAtoms(a1, a2)
b.SetBondType(RDBONDORDER[b.GetBondTypeAsDouble() - 1])
b = mol.GetBondBetweenAtoms(a2, anion2)
b.SetBondType(RDBONDORDER[b.GetBondTypeAsDouble() + 1])
mol.UpdatePropertyCache(strict=False)
continue

# shorten the anion-anion pattern from n to n-1
matches = mol.GetSubstructMatches(pattern)
if matches:
# check if we haven't already transformed this triplet
for match in matches:
# sort the indices for the comparison
g = tuple(sorted(match))
g = set(match)
if g in backtrack:
# already transformed
continue
Expand All @@ -799,19 +835,41 @@ def _rebuild_conjugated_bonds(mol, max_iter=200):
backtrack.append(g)
break
else:
anion, a1, a2 = matches[0]
# already performed all changes
if backtrack_cycles == 1:
# might be stuck because there were 2 "odd" end patterns
# misqualified as a single base one
end_pattern = odd_end_pattern
elif backtrack_cycles > 1:
# likely stuck changing the same bonds over and over so
# let's stop here
mol.UpdatePropertyCache(strict=False)
return
match = matches[0]
anion, a1, a2 = match
backtrack = [set(match)]
backtrack_cycles += 1

# charges
mol.GetAtomWithIdx(anion).SetFormalCharge(0)
mol.GetAtomWithIdx(a2).SetFormalCharge(-1)
a = mol.GetAtomWithIdx(anion)
a.SetFormalCharge(a.GetFormalCharge() + 1)
a = mol.GetAtomWithIdx(a2)
a.SetFormalCharge(a.GetFormalCharge() - 1)
# bonds
mol.GetBondBetweenAtoms(anion, a1).SetBondType(
Chem.BondType.DOUBLE)
mol.GetBondBetweenAtoms(a1, a2).SetBondType(Chem.BondType.SINGLE)
b = mol.GetBondBetweenAtoms(anion, a1)
b.SetBondType(RDBONDORDER[b.GetBondTypeAsDouble() + 1])
b = mol.GetBondBetweenAtoms(a1, a2)
b.SetBondType(RDBONDORDER[b.GetBondTypeAsDouble() - 1])
mol.UpdatePropertyCache(strict=False)
# update number of matches for the end pattern
n_matches = len(set([match[0] for match in matches]))
if n_matches == 1:
end_pattern = odd_end_pattern
# start new iteration
continue

# no more changes to apply
mol.UpdatePropertyCache(strict=False)
return

# reached max_iter
Expand Down
Loading