Feature request: Retrieve mapping between SMILES and SELFIES tokens #48
Comments
Hi jannisborn, This is in principle possible, actually in a quite straight-forward way I believe. However so far it doesn't exist yet. Let me talk to the team and see whether we can easily add it for the next version of SELFIES. best regards, |
Many thanks @MarioKrenn6240, that's great news! Please let me know if I can be of any help. |
Hi @MarioKrenn6240, I was wondering whether there is any way forward to improve the package here. Thanks! |
Hi @jannisborn, Thank you for your feedback! Your suggested feature is not currently supported by this package, but in principle, it should be possible to obtain such mappings. Unfortunately, with the current algorithm's structure and design, it would be hard to extend the package to include an efficient implementation of the feature. The SELFIES team is currently discussing how to move forward with the library (such that new features, including yours, can be added more naturally). In the meantime, we would be happy to provide our support and help if you decide to look into this extension yourself! Best regards, Alston |
Just want to ping on this issue to see if there is an update with SELFIES 2.0. I'm interested in this too and am happy to contribute for this feature. |
hey @whitead , we didn't include this feature yet, but made a few useful precursors. If you have time, please go for it! |
@MarioKrenn6240 trying to get started on this. Do you have any hints? Would it require the encoder/decoder to be stable in preserving order or would it make more sense to use the graph object to track this? |
Hi @whitead, Currently, I think an important first step would be to decide what form of mapping is needed. For example, an atom-to-atom mapping would be straightforward, as every atom in a decoded SMILES can be traced back to exactly one atom symbol in the SELFIES (and vice versa). A more general token-to-token mapping would be more complex, as it is a many-to-many correspondence. For example, a single
|
The new tool LeRuLi has an explainer for SELFIES, meaning it can map from selfies tokens to specific parts of the graph. I think this is what you might want. Talk to the authors, Dominik and Guido are very helpful. |
We implemented something like this based on the amazing SELFIES 2.0 code for atom mappings for leruli. You can try it interactively by searching for any molecule and hitting the "explain SELFIES" button on the result page. We'd be happy to contribute that code, but the points @alstonlo made are very valid. So far, what our code can do is identify which atom and bond is created by which SELFIES token. For SMILES, that would at least allow a one-to-one mapping of heavy atoms between SELFIES and SMILES tokens, except for the bond orders. How about this code structure: selfies.decoder() gets an optional argument, "atom_mapping", default False, that, if set, returns not only the existing SMILES, but also a dictionary with keys being the SELFIES token index and values being the SMILES atom index. That would allow tracing in both directions. If that is in line with your thoughts on the API of the package, I'm happy to prepare a pull request. |
I've started a work in progress in #75 via thoughts from @alstonlo. Happy to hear feedback from @ferchault about his experience. The basic idea is to store a map of which symbols led to which atoms/bonds in the graph. Then when the graph is converted to another format, we can use that to construct a mapping between tokens in input/output. |
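The idea of recording which input symbol produced which atom can be illustrated with a toy sketch. This is not the code from #75; the names `Atom`, `TrackedGraph`, `read_chain`, and `write_chain` are hypothetical, and the reader only handles unbranched chains of single atom symbols (no bonds, branches, or rings).

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    source: tuple  # (index, token) of the input symbol that created this atom


@dataclass
class TrackedGraph:
    atoms: list = field(default_factory=list)

    def add_atom(self, element, source_index, source_token):
        # Store the provenance together with the atom itself.
        self.atoms.append(Atom(element, (source_index, source_token)))


def read_chain(selfies_tokens):
    """Toy reader: every '[X]' token adds exactly one atom to the graph."""
    graph = TrackedGraph()
    for i, tok in enumerate(selfies_tokens):
        graph.add_atom(tok.strip("[]"), i, tok)
    return graph


def write_chain(graph):
    """Toy writer: emit one output token per atom plus its recorded origin."""
    out, attribution = [], []
    for j, atom in enumerate(graph.atoms):
        out.append(atom.element)
        attribution.append((j, atom.element, [atom.source]))
    return "".join(out), attribution


smiles, attribution = write_chain(read_chain(["[C]", "[C]", "[O]"]))
print(smiles)
for entry in attribution:
    print(entry)
```

Because the provenance lives on the graph rather than in the string, the writer does not need the output order to match the input order, which is the point raised earlier about order preservation.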
I've completed a PR in #75 with a short description here |
Sorry that it took so long to get this into the main repo, but it is in now finally. Thanks a lot @whitead , and welcome to the developer team :) |
@jannisborn Do you think your issue/request is solved by Andrew's addition? If yes, you can close the issue. Thanks |
Thanks for the great progress on this!
See example:
from selfies import encoder
smiles = 'C=COc1[nH]c(N=Cc2ccco2)c(C#N)c1'
selfie, attribution = encoder(smiles, attribute=True)
print(selfie)
for a in attribution:
    print(a)
Result:
[C][=C][O][C][NH1][C][Branch1][#Branch2][N][=C][C][=C][C][=C][O][Ring1][Branch1][=C][Branch1][Ring1][C][#N][C][=Ring1][=C]
('[C]', [(1, 'C')])
('[=C]', [(2, 'C')])
('[O]', [(3, 'O')])
('[C]', [(4, 'c')])
('[NH1]', [(6, '[nH]')])
('[C]', [(7, 'c')])
('[N]', [(9, 'N')])
('[=C]', [(10, 'C')])
('[C]', [(11, 'c')])
('[=C]', [(13, 'c')])
('[C]', [(14, 'c')])
('[=C]', [(15, 'c')])
('[O]', [(16, 'o')])
('[Ring1]', None)
('[Branch1]', None)
('[Branch1]', [(9, 'N')])
('[#Branch2]', [(9, 'N')])
('[=C]', [(19, 'c')])
('[C]', [(21, 'C')])
('[#N]', [(22, 'N')])
('[Branch1]', [(21, 'C')])
('[Ring1]', [(21, 'C')])
('[C]', [(24, 'c')])
('[=Ring1]', None)
('[=C]', None)
If we start counting from 1, in the second position, the attribution tuple should be |
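To inspect output like this programmatically, here is a minimal sketch. It assumes each attribution entry is a plain (selfies_token, sources) tuple exactly as printed above, where sources is either None or a list of (index, smiles_token) pairs; the helper name `invert_attribution` is hypothetical and not part of selfies.

```python
from collections import defaultdict

# A few entries copied from the printed output above.
attribution = [
    ("[C]", [(1, "C")]),
    ("[=C]", [(2, "C")]),
    ("[O]", [(3, "O")]),
    ("[C]", [(4, "c")]),
    ("[NH1]", [(6, "[nH]")]),
    ("[Ring1]", None),
]


def invert_attribution(attribution):
    """Group output positions by the source index they are attributed to."""
    by_source = defaultdict(list)
    unattributed = []
    for position, (token, sources) in enumerate(attribution):
        if not sources:
            # Entries printed with None carry no source information.
            unattributed.append((position, token))
            continue
        for source_index, _source_token in sources:
            by_source[source_index].append((position, token))
    return dict(by_source), unattributed


by_source, unattributed = invert_attribution(attribution)
print(by_source)
print(unattributed)
```

Listing the unattributed entries separately makes it easier to spot tokens that were expected to carry a source but came back as None.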
Is it possible to get a map which SMILES tokens were used to generate which SELFIES tokens (or v.v.)?
I am looking for a feature like this:
In this simple example, [0,1,2] would imply that the first SMILES token (C) is mapped to the first SELFIES token ([C]) and so on.
Motivation:
I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds. Say we have a molecule, want to retrieve its scaffold, and decorate it with a generative model. With SMILES it's easy (see example below), but with SELFIES it's not possible (as far as I understand).
My questions:
Discussion:
Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we have split_selfies already, but regarding SMILES, I think that the tokenizer from the Found in Translation paper could be a good choice since it's widely used. (I'm using that tokenizer in the example below.)
==== EXAMPLE ===
This is just the appendix to the post. It's an example of how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As it appears to me, this is currently not possible with SELFIES.
First, some boring setup:
Example molecule (left) and RDKit-extracted scaffold (right):
Output will be:
Trying to achieve the same with SELFIES does not seem to work. This is because selfies.encoder does not fully preserve the order of the tokens passed. It preserves it to a large extent (which is great), but around ring symbols it usually breaks. I feel like I would need to reverse-engineer the context-free grammar to solve this.
Here would be the tokens in SMILES and SELFIES respectively:
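As a rough sketch of how such token lists could be produced: the SMILES pattern below is a regex in the spirit of the tokenizer mentioned in the discussion (an approximation, not a verbatim copy), and `split_selfies_simple` is a hypothetical bracket-based stand-in for split_selfies. The strings are taken from the attribution example in the comments, not from the molecule of the original post.

```python
import re

# Assumed SMILES token pattern: bracket atoms, two-letter halogens,
# organic-subset atoms, bonds, branches, and ring-closure digits.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens using the regex above."""
    return SMILES_PATTERN.findall(smiles)


def split_selfies_simple(selfies):
    """Stand-in for split_selfies: every '[...]' group is one token."""
    return re.findall(r"\[[^\]]*\]", selfies)


smiles = "C=COc1[nH]c(N=Cc2ccco2)c(C#N)c1"
selfies = (
    "[C][=C][O][C][NH1][C][Branch1][#Branch2][N][=C][C][=C][C][=C][O]"
    "[Ring1][Branch1][=C][Branch1][Ring1][C][#N][C][=Ring1][=C]"
)

smiles_tokens = tokenize_smiles(smiles)
selfies_tokens = split_selfies_simple(selfies)
print(smiles_tokens)
print(selfies_tokens)
```

The two lists differ in length, since SMILES spends tokens on ring digits and parentheses while SELFIES uses [Ring] and [Branch] symbols, so a positional comparison alone cannot recover the mapping.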