-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RDKit Descriptors and fingerprints wrapper #2912
base: develop
Are you sure you want to change the base?
Conversation
Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-04-28 18:40:44 UTC |
Codecov Report
@@ Coverage Diff @@
## develop #2912 +/- ##
===========================================
+ Coverage 92.83% 92.85% +0.02%
===========================================
Files 170 171 +1
Lines 22809 22882 +73
Branches 3242 3260 +18
===========================================
+ Hits 21174 21247 +73
Misses 1587 1587
Partials 48 48
Continue to review full report at Codecov.
|
I changed the results attribute to store an array of dict for each frame, and I added a static method to list available descriptors |
Regarding the output given to users, I'm still scratching my head over 2 things: For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption. For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change |
Maybe for simplicity's sake, we could just return fingerprints and descriptors in the exact same way RDKIT does (i.e. just return the object) for each frame (as a list)? |
I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it. For descriptors, yeah it's basically just the value for each frame so I guess I'll simply make a 2D array out of that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just read through some of the docstrings for copy editing.
I'm not sure, although a general purpose would be more "user friendly", at the end of the day I feel like users should be sufficiently versed in RDKit if they are looking to be using these options. To me, offering an object conversion layer not only leads to more work, but eventually more confusion for users. I'll ping @fiona-naughton and @richardjgowers though, who might have different views here.
Yep, that makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good, got a few comments & suggestions.
The main point of discussion I think may need to be had, is if user friendliness is worth the extra complexity.
|
||
Notes | ||
----- | ||
To generate a Morgan fingerprint, don't forget to specify the radius:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe here we should link to the API docs for each of the fingerprints supported? We also should have a usage example for each case too (it adds to the docs but then users are clear on what should work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just the one like to a page that lists all the individual docs for the fingerprints. Re: usage, if the usage is all similar/the same then we can just say that in the docs.
|
||
get_fingerprint(ag, 'Morgan', radius=2) | ||
|
||
``dtype="array"`` is not recommended for non-hashed fingerprints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, I personally think we should try to avoid these cases completely, but I'll let one of the other @MDAnalysis/gsoc-mentors make a judgement call here too.
|
||
Example | ||
------- | ||
Here's an example with a custom function for a trajectory with 3 frames:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this example, however in the final version we will need something a little bit more descriptive, probably detailing what's in _RDKIT_DESCRIPTORS
, any edge cases (are there any cases of descriptors that can be given optional parameters?), etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said in list_available
docstring, the list of descriptors is by no means curated, some of the things here aren't even descriptors but helper functions that are inside the module.
Some of these functions take parameters like includeHs
or onlyHeavy
, but most of them don't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to the other @MDAnalysis/gsoc-mentors here to pitch in on what they think might be a good compromise here.
self.results[self._frame_index][i] = func(mol) | ||
|
||
@staticmethod | ||
def list_available(flat=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs documenting in the class docstring, so that users are aware of it's existence (including an example use).
I do wonder if it might be simpler (at the risk of making things harder for users), to not offer this list but rather just have the option of passing an RDKit function to this analysis method. Then you can just have this blank-ish AnalysisBase method that essentially is a pure trajectory analysis wrapper around existing RDKit function (if this makes any sense)? Again one of those things worth discussing further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it's definitely simpler for us but harder for users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Porque no los dos, perhaps it could be a separate more flexible class and this can be the easy-peasy one.
@cbouy if you can update this against develop I'll put this next up on my review list. |
Essential - probably not. I added in all the RDKits because it would be really great to get them out there. I'll let @cbouy speak on how realistic this is (for this and the other opened RDKit PRs). If it helps, I'd be happy with a relatively rapid 2.1.0 release that adds new RDKit features. |
I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0 |
@cbouy for the PRs that are opened, which ones (if any) are critical for 2.0? |
Part of the fixes for #2468
Depends on #2775
Changes made in this Pull Request:
RDKitDescriptors
class, as a subclass of AnalysisBaseQuick example
I'm not sure if a dict for the results is ideal, as you can see passing several lambda functions will be problematic so maybe a list of
(function_name, value)
tuples is better. Or we just don't care about lambdas ?PR Checklist
Ping @fiona-naughton @IAlibay @richardjgowers