
Feature request: Retrieve mapping between SMILES and SELFIES tokens #48

Closed

jannisborn opened this issue May 6, 2021 · 16 comments · Fixed by #78
Labels
enhancement New feature or request

Comments

@jannisborn
Contributor

Is it possible to get a map of which SMILES tokens were used to generate which SELFIES tokens (or vice versa)?

I am looking for a feature like this:

>>> smiles = 'CCO'
>>> encoder(smiles, get_mapping=True)
('[C][C][O]', [0, 1, 2])

In this simple example, [0, 1, 2] would mean that the first SMILES token (C) maps to the first SELFIES token ([C]), and so on.
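For illustration, inverting such a mapping to go from SMILES tokens back to SELFIES tokens would be trivial (a sketch; get_mapping is the proposed keyword, not an existing selfies argument):

```python
# Hypothetical output of the proposed encoder(smiles, get_mapping=True):
# entry i of the mapping is the SMILES token index behind SELFIES token i.
selfies_tokens = ['[C]', '[C]', '[O]']
mapping = [0, 1, 2]

# Inverting gives SMILES token -> SELFIES token(s) (many-to-one in general)
inverse = {}
for selfies_idx, smiles_idx in enumerate(mapping):
    inverse.setdefault(smiles_idx, []).append(selfies_idx)

print(inverse)  # {0: [0], 1: [1], 2: [2]}
```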

Motivation:
I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds: say we have a molecule, want to retrieve its scaffold, and then decorate it with a generative model. With SMILES this is easy (see example below), but with SELFIES it's not possible (as far as I understand).

My questions:

  1. Is it, in principle, possible to obtain such a mapping?
  2. If yes, is there already a way to obtain it with the current package?
  3. If no, is this a feature that seems worth implementing?

Discussion:
Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we already have split_selfies; for SMILES, I think the tokenizer from the "Found in Translation" paper could be a good choice since it's widely used. (I'm using that tokenizer in the example below.)

==== EXAMPLE ====
This is just the appendix to the post. It's an example of how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As far as I can tell, this is currently not possible with SELFIES.

First, some boring setup:

from rdkit import Chem
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from selfies import encoder, decoder, split_selfies
import re

# Setup tokenizer (regex from the "Found in Translation" paper)
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]
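As a quick sanity check (my own, not part of the original snippet), the tokenizer keeps bracket atoms and ring digits as separate tokens; repeating the setup so the snippet is self-contained:

```python
import re

# Same regex-based SMILES tokenizer as above
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]

print(smiles_tokenizer('c1[nH]c'))  # ['c', '1', '[nH]', 'c']
print(smiles_tokenizer('C#N'))      # ['C', '#', 'N']
```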

Example molecule (left) and RDKit-extracted scaffold (right):

smiles = 'CCOc1[nH]c(N=Cc2ccco2)c(C#N)c1C#N'
mol = Chem.MolFromSmiles(smiles)
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
scaffold = GetScaffoldForMol(mol)
# List of ints pointing to scaffold atoms as they occur in SMILES
scaffold_atoms = mol.GetSubstructMatches(scaffold)[0]
smiles_tokens = smiles_tokenizer(smiles)


atom_id = -1
for token in smiles_tokens:
    if token not in NON_ATOM_CHARS:
        # Found atom
        atom_id += 1
        if atom_id in scaffold_atoms:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    else:
        # Non-Atom-Chars
        if (atom_id in scaffold_atoms and atom_id+1 in scaffold_atoms) or atom_id==scaffold_atoms[-1]:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    

Output will be:

C --> not on scaffold
C --> not on scaffold
O --> not on scaffold
c --> on scaffold
1 --> on scaffold
[nH] --> on scaffold
c --> on scaffold
( --> on scaffold
N --> on scaffold
= --> on scaffold
C --> on scaffold
c --> on scaffold
2 --> on scaffold
c --> on scaffold
c --> on scaffold
c --> on scaffold
o --> on scaffold
2 --> on scaffold
) --> on scaffold
c --> on scaffold
( --> not on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
) --> not on scaffold
c --> on scaffold
1 --> on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold

Trying to achieve the same with SELFIES does not seem to work, because selfies.encoder does not fully preserve the order of the tokens passed. It preserves it to a large extent (which is great), but around ring symbols it usually breaks. I feel like I would need to reverse-engineer the context-free grammar to solve this.
Here would be the tokens in SMILES and SELFIES respectively:

C [C]
C [C]
O [O]
c [C]
1 [NHexpl]
[nH] [C]
c [Branch1_1]
( [Branch2_3]
N [N]
= [=C]
C [C]
c [=C]
2 [C]
c [=C]
c [O]
c [Ring1]
o [Branch1_1]
2 [=C]
) [Branch1_1]
c [Ring1]
( [C]
C [#N]
# [C]
N [Expl=Ring1]
) [=C]
c [C]
1 [#N]
C
#
N
@MarioKrenn6240 MarioKrenn6240 added the enhancement New feature or request label May 10, 2021
@MarioKrenn6240
Collaborator

Hi jannisborn,
Thanks for your question.

This is in principle possible, actually in quite a straightforward way, I believe. However, it doesn't exist yet.

Let me talk to the team and see whether we can easily add it for the next version of SELFIES.

best regards,
Mario

@jannisborn
Contributor Author

Many thanks @MarioKrenn6240, that's great news! Please let me know if I can be of any help.

@jannisborn
Contributor Author

Hi @MarioKrenn6240, I was wondering whether there is any way forward to improve the package here. Thanks!

@alstonlo
Collaborator

Hi @jannisborn,

Thank you for your feedback! Your suggested feature is not currently supported by this package, but in principle, it should be possible to obtain such mappings. Unfortunately, with the current algorithm's structure and design, it would be hard to extend the package to include an efficient implementation of this feature.

The SELFIES team is currently discussing how to move forward with the library (such that new features, including yours, can be added more naturally). In the meantime, we would be happy to provide our support and help if you decide to look into this extension yourself!

Best regards,

Alston

@whitead
Contributor

whitead commented Oct 25, 2021

Just want to ping on this issue to see if there is an update with SELFIES 2.0. I'm interested in this too and am happy to contribute for this feature.

@MarioKrenn6240
Collaborator

hey @whitead, we didn't include this feature yet, but we made a few useful precursors. If you have time, please go for it!

@whitead
Contributor

whitead commented Nov 6, 2021

@MarioKrenn6240 trying to get started on this. Do you have any hints? Would it require the encoder/decoder to be stable in preserving order or would it make more sense to use the graph object to track this?

@MarioKrenn6240 MarioKrenn6240 self-assigned this Nov 7, 2021
@MarioKrenn6240
Collaborator

hey @whitead, @alstonlo will come back with some suggestions very soon.

@MarioKrenn6240 MarioKrenn6240 removed their assignment Nov 9, 2021
@alstonlo
Collaborator

Hi @whitead,

Currently, selfies uses a graph-based intermediate data structure, i.e. it translates SMILES <-> GRAPH <-> SELFIES. I think it might make the most sense to use that graph structure, augmented with some "tag" attributes to keep track of which token(s) every node and edge was derived from.

I think an important first step would be to decide what form of mapping is needed. For example, an atom-to-atom mapping would be straightforward, as every atom in a decoded SMILES can be traced back to exactly one atom symbol in the SELFIES (and vice versa). A more general token-to-token mapping would be more complex, as it is a many-to-many correspondence. For example, a single [Branch] symbol would be derived from both the opening and closing SMILES parentheses. There may also be some minor ambiguity in forming the mapping that would need to be resolved. For example,

  • What should the index symbols after a branch or ring map to?
  • The SELFIES [O][#C] is translated to O=C. In this case, the triple bond is converted to a double bond due to the valence rules of [O]. Should the = then map to both the [O] and [#C] symbols?
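One way to carry such provenance on the intermediate graph can be sketched in plain Python (names like Attribution and token_index are illustrative, not the selfies internals):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical "tag": which input token(s) a graph element was derived from
@dataclass
class Attribution:
    token_index: int  # position in the input token stream
    token: str        # the token itself, e.g. '[C]' or 'c'

@dataclass
class AtomNode:
    element: str
    attributions: List[Attribution] = field(default_factory=list)

# A node may carry several attributions (many-to-many correspondence),
# e.g. an atom symbol plus the branch symbol that introduced it.
node = AtomNode('C', [Attribution(3, '[C]'), Attribution(7, '[Branch1]')])
print([a.token for a in node.attributions])  # ['[C]', '[Branch1]']
```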

@MarioKrenn6240
Collaborator

The new tool LeRuLi has an explainer for SELFIES, meaning it can map SELFIES tokens to specific parts of the molecular graph. I think this is what you might want. Talk to the authors; Dominik and Guido are very helpful.

@ferchault

We implemented something like this, based on the amazing SELFIES 2.0 code, to provide atom mappings for leruli:


You can try it interactively by searching any molecule and hitting the "explain SELFIES" button on the result page


We'd be happy to contribute that code, but the points @alstonlo made are very valid. So far, our code can identify which atom and bond are created by which SELFIES token. For SMILES, that would at least allow a one-to-one mapping of heavy atoms between SELFIES and SMILES tokens, except for the bond orders.

How about this code structure: selfies.decoder() gets an optional argument atom_mapping (default False) that, if set, returns not only the existing SMILES but also a dictionary whose keys are SELFIES token indices and whose values are SMILES atom indices. That would allow tracing both directions. If that is in line with your thoughts on the API of the package, I'm happy to prepare a pull request.
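The proposed return shape can be sketched without the library (atom_mapping is the suggested keyword from this comment, not an existing selfies argument):

```python
# Hypothetical result of the proposed decoder(selfies, atom_mapping=True):
# the decoded SMILES plus a dict from SELFIES token index to SMILES atom index.
smiles = 'CCO'
mapping = {0: 0, 1: 1, 2: 2}  # SELFIES token index -> SMILES atom index

# Tracing the other direction is just the inverted dictionary
inverse = {atom_idx: tok_idx for tok_idx, atom_idx in mapping.items()}
print(inverse)  # {0: 0, 1: 1, 2: 2}
```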

@whitead
Contributor

whitead commented Nov 24, 2021

I've started a work in progress in #75 based on thoughts from @alstonlo. Happy to hear feedback from @ferchault about his experience. The basic idea is to store a map of which symbols led to which atoms/bonds in the graph. Then, when the graph is converted to another format, we can use that map to construct a mapping between tokens in the input and output.

@whitead
Contributor

whitead commented Nov 28, 2021

I've completed a PR in #75 with a short description here

@MarioKrenn6240
Collaborator

Sorry that it took so long to get this into the main repo, but it is finally in now. Thanks a lot @whitead, and welcome to the developer team :)

@MarioKrenn6240
Collaborator

@jannisborn Do you think your issue/request is solved by Andrew's addition? If yes, you can close the issue. Thanks

@jannisborn
Contributor Author

jannisborn commented Feb 19, 2022

Thanks for the great progress on this!
I had a look at the code, and while it works great overall, I found two small issues. I'm preparing a PR to fix them now.

  1. The attribution in the encoder starts counting from 1, whereas the attribution in the decoder starts at 0. Not sure whether this is intentional, but it is definitely inconsistent.
  2. The encoder attribution skips over bond tokens, which leads to wrong indices.

See example:

smiles = 'C=COc1[nH]c(N=Cc2ccco2)c(C#N)c1'
selfie, attribution = encoder(smiles, attribute=True)
print(selfie)
for a in attribution:
    print(a)

Result:

[C][=C][O][C][NH1][C][Branch1][#Branch2][N][=C][C][=C][C][=C][O][Ring1][Branch1][=C][Branch1][Ring1][C][#N][C][=Ring1][=C]
('[C]', [(1, 'C')])
('[=C]', [(2, 'C')])
('[O]', [(3, 'O')])
('[C]', [(4, 'c')])
('[NH1]', [(6, '[nH]')])
('[C]', [(7, 'c')])
('[N]', [(9, 'N')])
('[=C]', [(10, 'C')])
('[C]', [(11, 'c')])
('[=C]', [(13, 'c')])
('[C]', [(14, 'c')])
('[=C]', [(15, 'c')])
('[O]', [(16, 'o')])
('[Ring1]', None)
('[Branch1]', None)
('[Branch1]', [(9, 'N')])
('[#Branch2]', [(9, 'N')])
('[=C]', [(19, 'c')])
('[C]', [(21, 'C')])
('[#N]', [(22, 'N')])
('[Branch1]', [(21, 'C')])
('[Ring1]', [(21, 'C')])
('[C]', [(24, 'c')])
('[=Ring1]', None)
('[=C]', None)

If we start counting from 1, the attribution tuple in the second position should be (3, 'C') rather than (2, 'C'). But the bond is skipped over, and all subsequent indices are incorrectly shifted by 1. The same applies to other bond types, and it differs from the setup of the decoder attribution. Also, in contrast to bonds, ring symbols are not ignored when the index is built. This seems inconsistent to me. The reason is that your SMILES tokenizer removes all bonds (basically everything in SMILES_BOND_ORDERS).
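The shift can be reproduced without selfies at all (my own illustration): if a tokenizer drops bond symbols before indexing, every token after the first bond moves by one.

```python
import re

# Minimal SMILES tokenizer (same regex family as in the original post)
regexp = re.compile(r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|[0-9])')
tokenize = lambda smi: [t for t in regexp.split(smi) if t]

smiles = 'C=CO'
full = tokenize(smiles)                         # ['C', '=', 'C', 'O']
no_bonds = [t for t in full if t not in '=#-']  # ['C', 'C', 'O']

# 1-based index of 'O' differs depending on whether bonds were kept
print(full.index('O') + 1)      # 4
print(no_bonds.index('O') + 1)  # 3  <- shifted, as in the encoder attribution
```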
