-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
An "accurate and reliable" RDKitConverter #3044
Conversation
Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-03-30 21:49:43 UTC |
Codecov Report
@@ Coverage Diff @@
## develop #3044 +/- ##
===========================================
+ Coverage 94.20% 94.21% +0.01%
===========================================
Files 190 190
Lines 24668 24725 +57
Branches 3313 3321 +8
===========================================
+ Hits 23238 23295 +57
+ Misses 1384 1382 -2
- Partials 46 48 +2
Continue to review full report at Codecov.
|
Hey @cbouy I think we're more interested in correctness rather than speed. Speed is something we can tweak on later once we've established what the code should do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbouy, this looks awesome. Could you please add some tests for the lines that codecov points out, and address the comment by the PEP8 bot?
|
||
for atom in sorted(mol.GetAtoms(), reverse=True, | ||
key=lambda a: _get_nb_unpaired_electrons(a)[0]): | ||
for atom in atoms: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add tests for case where an atom degree is 0?
nue = _get_nb_unpaired_electrons(atom) | ||
if any([n == 0 for n in nue]): | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test for this case where common_nue != 0?
|
||
# if atom valence is still not filled | ||
nue = _get_nb_unpaired_electrons(atom) | ||
if not any([n == 0 for n in nue]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this please?
nue = _get_nb_unpaired_electrons(atom) | ||
if not any([n == 0 for n in nue]): | ||
# transform nue to charge | ||
atom.SetFormalCharge(-nue[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this please?
elif n_matches % 2 == 0: | ||
end_pattern = Chem.MolFromSmarts("[*-;!O]-[*+0]=[*+0]-[*-]") | ||
else: | ||
elif n_matches == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this please?
end_pattern = Chem.MolFromSmarts( | ||
"[*-;!O]-[*+0]=[*+0]-[$([#7;X3;v3]),$([#6+0]=O)]") | ||
end_pattern = odd_end_pattern | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this please?
Okay I should be done with this one now! I realized that there was a better way to test if the mol matches one of the resonance structure, so I implemented this change. I added the XFAIL tests as @orbeckst suggested Thanks again for reviewing this!! |
I'm also preparing the benchmark on ChEMBL for the converter, should I create a new |
Yes, create the repo under the MDAnalysis org. (I think you have permission to do so but if not say something.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very impressive. I could only comment on documentation aspects and one question about cations.
obj : :class:`~MDAnalysis.core.groups.AtomGroup` or | ||
:class:`~MDAnalysis.core.universe.Universe` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has to be on one line to format correctly, see https://mdanalysis--3044.org.readthedocs.build/en/3044/documentation_pages/converters/RDKit.html#MDAnalysis.converters.RDKit.RDKitConverter.convert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might work if you just say
obj : `AtomGroup` or `Universe`
Try it and see if it links correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The back-tic notation did not automagically generate links so your previous solution was better. I'd revert to it.
Aside from @orbeckst's comments the code looks great -- thank you for putting up with the super slow reviews! Agreed with Oliver, the guessing code is very impressive and it'll be a fantastic solution for so many people trying to create molecular graphs out of lossy input like xyz files. |
@cbouy I don't know how busy you are at the moment but if you could have a look at my minor comments then we could merge this big PR really soon™. |
@orbeckst I'm in the middle of moving out but I should be able to find some time after Monday! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for answering my questions — looks good.
(There's one change that I would revert so that links are generated in the docs but that's not blocking.)
Great contribution!!
@lilyminium or @IAlibay can you please be in charge of eventually merging? Thanks! |
Reverted the docstring thingy (and fixed a typo)! Thanks a ton @orbeckst @lilyminium and @IAlibay for your help!! |
@lilyminium @IAlibay is something holding up the merge? I just don't know, otherwise I'd press the shiny green button myself. (Sorry to be onerous, I just want to see this code in before anything else happened.) |
Sorry just really busy atm, I'll have a quick glance through over the weekend and then merge. |
No -- just had another quick glance and everything still LGTM! @IAlibay did you still want to have a look? Otherwise I can merge. |
Go for it! |
Hooray!!!! |
Hi @IAlibay @orbeckst @richardjgowers
I made a few changes to the RDKitConverter so that it more accurately infers bond orders and charges in small molecules.
I tested this new version on the ChEMBL27 dataset and it managed to retrieve the original molecule or a resonance structure in 99.1% of the 1.7M compounds (the GSoC version was around 90%).
I can share the notebook that I used to run this ChEMBL benchmark, either in one of you other repos or I create a new one, as you prefer.
Changes made in this Pull Request:
NoImplicit=False
names
property is now available throughatom.GetProp("_MDAnalysis_name")
, whileatom.GetMonomerInfo().GetName()
returns the PDB-formatted version (with spaces).All of these changes do not take "efficiency" in consideration, so it probably slows down the code a bit, but I think they are all necessary. If you identify things that could be optimized please let me know.
Their are additional changes that I could make that on one hand would slow down the code even more but on the other hand make it even more reliable. See #3339 for examples of failing molecules.
I could fix them by adding more SMIRKS/reaction SMARTS, but considering the fact that these patterns are rare and would slow down the conversion I don't think they are worth adding to the code.
PR Checklist