
Feature request: Retrieve mapping between SMILES and SELFIES tokens #48

Closed

jannisborn opened this issue May 6, 2021 · 16 comments · Fixed by #78
Labels
enhancement New feature or request

Comments

@jannisborn
Contributor

Is it possible to get a map of which SMILES tokens were used to generate which SELFIES tokens (or vice versa)?

I am looking for a feature like this:

>>> smiles = 'CCO'
>>> encoder(smiles, get_mapping=True)
('[C][C][O]', [0, 1, 2])

In this simple example, [0, 1, 2] would mean that the first SMILES token (C) maps to the first SELFIES token ([C]), and so on.
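For illustration, inverting such a mapping to go from SMILES tokens back to SELFIES tokens would be trivial (a sketch; get_mapping is the proposed keyword, not an existing selfies argument):

```python
# Hypothetical output of the proposed encoder(smiles, get_mapping=True):
# entry i of the mapping is the SMILES token index behind SELFIES token i.
selfies_tokens = ['[C]', '[C]', '[O]']
mapping = [0, 1, 2]

# Inverting gives SMILES token -> SELFIES token(s) (many-to-one in general)
inverse = {}
for selfies_idx, smiles_idx in enumerate(mapping):
    inverse.setdefault(smiles_idx, []).append(selfies_idx)

print(inverse)  # {0: [0], 1: [1], 2: [2]}
```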

Motivation:
I think this feature could be very useful to close the gap between RDKit and SELFIES. One example is scaffolds: say we have a molecule, want to retrieve its scaffold, and then decorate it with a generative model. With SMILES this is easy (see example below), but with SELFIES it's not possible (as far as I understand).

My questions:

  1. Is it, in principle, possible to obtain such a mapping?
  2. If yes, is there already a way to obtain it with the current package?
  3. If no, is this a feature that seems worth implementing?

Discussion:
Such a mapping would imply a standardized way of splitting the strings into tokens. Fortunately, we already have split_selfies; for SMILES, I think the tokenizer from the "Found in Translation" paper could be a good choice since it's widely used. (I'm using that tokenizer in the example below.)

==== EXAMPLE ====
This is just the appendix to the post. It's an example of how to retrieve which SMILES tokens constitute the scaffold of a given molecule. As far as I can tell, this is currently not possible with SELFIES.

First, some boring setup:

from rdkit import Chem
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from selfies import encoder, decoder, split_selfies
import re

# Setup tokenizer (regex from the "Found in Translation" paper)
NON_ATOM_CHARS = set(list(map(str, range(1, 10))) + ['/', '\\', '(', ')', '#', '=', '.', ':', '-'])
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]
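As a quick sanity check (my own, not part of the original snippet), the tokenizer keeps bracket atoms and ring digits as separate tokens; repeating the setup so the snippet is self-contained:

```python
import re

# Same regex-based SMILES tokenizer as above
regexp = re.compile(
    r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|'
    r'-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])'
)
smiles_tokenizer = lambda smi: [token for token in regexp.split(smi) if token]

print(smiles_tokenizer('c1[nH]c'))  # ['c', '1', '[nH]', 'c']
print(smiles_tokenizer('C#N'))      # ['C', '#', 'N']
```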

Example molecule (left) and RDKit-extracted scaffold (right):

smiles = 'CCOc1[nH]c(N=Cc2ccco2)c(C#N)c1C#N'
mol = Chem.MolFromSmiles(smiles)
atom_symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
scaffold = GetScaffoldForMol(mol)
# List of ints pointing to scaffold atoms as they occur in SMILES
scaffold_atoms = mol.GetSubstructMatches(scaffold)[0]
smiles_tokens = smiles_tokenizer(smiles)


atom_id = -1
for token in smiles_tokens:
    if token not in NON_ATOM_CHARS:
        # Found atom
        atom_id += 1
        if atom_id in scaffold_atoms:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    else:
        # Non-Atom-Chars
        if (atom_id in scaffold_atoms and atom_id+1 in scaffold_atoms) or atom_id==scaffold_atoms[-1]:
            print(token, '--> on scaffold')
        else:
            print(token, '--> not on scaffold')
    

Output will be:

C --> not on scaffold
C --> not on scaffold
O --> not on scaffold
c --> on scaffold
1 --> on scaffold
[nH] --> on scaffold
c --> on scaffold
( --> on scaffold
N --> on scaffold
= --> on scaffold
C --> on scaffold
c --> on scaffold
2 --> on scaffold
c --> on scaffold
c --> on scaffold
c --> on scaffold
o --> on scaffold
2 --> on scaffold
) --> on scaffold
c --> on scaffold
( --> not on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold
) --> not on scaffold
c --> on scaffold
1 --> on scaffold
C --> not on scaffold
# --> not on scaffold
N --> not on scaffold

Trying to achieve the same with SELFIES does not seem to work, because selfies.encoder does not fully preserve the order of the tokens passed. It preserves it to a large extent (which is great), but around ring symbols it usually breaks. I feel like I would need to reverse-engineer the context-free grammar to solve this.
Here would be the tokens in SMILES and SELFIES respectively:

C [C]
C [C]
O [O]
c [C]
1 [NHexpl]
[nH] [C]
c [Branch1_1]
( [Branch2_3]
N [N]
= [=C]
C [C]
c [=C]
2 [C]
c [=C]
c [O]
c [Ring1]
o [Branch1_1]
2 [=C]
) [Branch1_1]
c [Ring1]
( [C]
C [#N]
# [C]
N [Expl=Ring1]
) [=C]
c [C]
1 [#N]
C
#
N
@MarioKrenn6240 MarioKrenn6240 added the enhancement New feature or request label May 10, 2021
@MarioKrenn6240
Collaborator

Hi jannisborn,
Thanks for your question.

This is in principle possible, actually in quite a straightforward way, I believe. However, it doesn't exist yet.

Let me talk to the team and see whether we can easily add it for the next version of SELFIES.

best regards,
Mario

@jannisborn
Contributor Author

Many thanks @MarioKrenn6240, that's great news! Please let me know if I can be of any help.

@jannisborn
Contributor Author

Hi @MarioKrenn6240, I was wondering whether there is any way forward to improve the package here. Thanks!

@alstonlo
Collaborator

Hi @jannisborn,

Thank you for your feedback! Your suggested feature is not currently supported by this package, but in principle, it should be possible to obtain such mappings. Unfortunately, with the current algorithm's structure and design, it would be hard to extend the package to include an efficient implementation of this feature.

The SELFIES team is currently discussing how to move forward with the library (such that new features, including yours, can be added more naturally). In the meantime, we would be happy to provide our support and help if you decide to look into this extension yourself!

Best regards,

Alston

@whitead
Contributor

whitead commented Oct 25, 2021

Just want to ping on this issue to see if there is an update with SELFIES 2.0. I'm interested in this too and am happy to contribute for this feature.

@MarioKrenn6240
Collaborator

hey @whitead, we didn't include this feature yet, but we made a few useful precursors. If you have time, please go for it!

@whitead
Contributor

whitead commented Nov 6, 2021

@MarioKrenn6240 trying to get started on this. Do you have any hints? Would it require the encoder/decoder to be stable in preserving order or would it make more sense to use the graph object to track this?

@MarioKrenn6240 MarioKrenn6240 self-assigned this Nov 7, 2021
@MarioKrenn6240
Collaborator

hey @whitead, @alstonlo will come back with some suggestions very soon.

@MarioKrenn6240 MarioKrenn6240 removed their assignment Nov 9, 2021
@alstonlo
Collaborator

Hi @whitead,

Currently, selfies uses a graph-based intermediate data structure, i.e. it translates SMILES <-> GRAPH <-> SELFIES. I think it might make the most sense to use that graph structure, augmented with some "tag" attributes to keep track of which token(s) every node and edge was derived from.

I think an important first step would be to decide what form of mapping is needed. For example, an atom-to-atom mapping would be straightforward, as every atom in a decoded SMILES can be traced back to exactly one atom symbol in the SELFIES (and vice versa). A more general token-to-token mapping would be more complex, as it is a many-to-many correspondence. For example, a single [Branch] symbol would be derived from both the opening and closing SMILES parentheses. There may also be some minor ambiguity in forming the mapping that would need to be resolved. For example,

  • What should the index symbols after a branch or ring map to?
  • The SELFIES [O][#C] is translated to O=C. In this case, the triple bond is converted to a double bond due to the valence rules of [O]. Should the = then map to both the [O] and [#C] symbols?
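One way to carry such provenance on the intermediate graph can be sketched in plain Python (names like Attribution and token_index are illustrative, not the selfies internals):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical "tag": which input token(s) a graph element was derived from
@dataclass
class Attribution:
    token_index: int  # position in the input token stream
    token: str        # the token itself, e.g. '[C]' or 'c'

@dataclass
class AtomNode:
    element: str
    attributions: List[Attribution] = field(default_factory=list)

# A node may carry several attributions (many-to-many correspondence),
# e.g. an atom symbol plus the branch symbol that introduced it.
node = AtomNode('C', [Attribution(3, '[C]'), Attribution(7, '[Branch1]')])
print([a.token for a in node.attributions])  # ['[C]', '[Branch1]']
```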

@MarioKrenn6240
Collaborator

The new tool LeRuLi has an explainer for SELFIES, meaning it can map SELFIES tokens to specific parts of the molecular graph. I think this is what you might want. Talk to the authors; Dominik and Guido are very helpful.

@ferchault

We implemented something like this, based on the amazing SELFIES 2.0 code, to provide atom mappings for leruli:


You can try it interactively by searching any molecule and hitting the "explain SELFIES" button on the result page


We'd be happy to contribute that code, but the points @alstonlo made are very valid. So far, our code can identify which atom and bond are created by which SELFIES token. For SMILES, that would at least allow a one-to-one mapping of heavy atoms between SELFIES and SMILES tokens, except for the bond orders.

How about this code structure: selfies.decoder() gets an optional argument atom_mapping (default False) that, if set, returns not only the existing SMILES but also a dictionary whose keys are SELFIES token indices and whose values are SMILES atom indices. That would allow tracing both directions. If that is in line with your thoughts on the API of the package, I'm happy to prepare a pull request.
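The proposed return shape can be sketched without the library (atom_mapping is the suggested keyword from this comment, not an existing selfies argument):

```python
# Hypothetical result of the proposed decoder(selfies, atom_mapping=True):
# the decoded SMILES plus a dict from SELFIES token index to SMILES atom index.
smiles = 'CCO'
mapping = {0: 0, 1: 1, 2: 2}  # SELFIES token index -> SMILES atom index

# Tracing the other direction is just the inverted dictionary
inverse = {atom_idx: tok_idx for tok_idx, atom_idx in mapping.items()}
print(inverse)  # {0: 0, 1: 1, 2: 2}
```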

@whitead
Contributor

whitead commented Nov 24, 2021

I've started a work in progress in #75 based on thoughts from @alstonlo. Happy to hear feedback from @ferchault about his experience. The basic idea is to store a map of which symbols led to which atoms/bonds in the graph. Then, when the graph is converted to another format, we can use that map to construct a mapping between tokens in the input and output.

@whitead
Contributor

whitead commented Nov 28, 2021

I've completed a PR in #75 with a short description here

@MarioKrenn6240
Collaborator

Sorry that it took so long to get this into the main repo, but it is finally in now. Thanks a lot @whitead, and welcome to the developer team :)

@MarioKrenn6240
Collaborator

@jannisborn Do you think your issue/request is solved by Andrew's addition? If yes, you can close the issue. Thanks

@jannisborn
Contributor Author

jannisborn commented Feb 19, 2022

Thanks for the great progress on this!
I had a look at the code, and while it works great overall, I found two small issues. I'm preparing a PR to fix them now.

  1. The attribution in the encoder starts counting from 1, whereas the attribution in the decoder starts at 0. Not sure whether this is intentional, but it is definitely inconsistent.
  2. The encoder attribution skips over bond tokens, which leads to wrong indices.

See example:

smiles = 'C=COc1[nH]c(N=Cc2ccco2)c(C#N)c1'
selfie, attribution = encoder(smiles, attribute=True)
print(selfie)
for a in attribution:
    print(a)

Result:

[C][=C][O][C][NH1][C][Branch1][#Branch2][N][=C][C][=C][C][=C][O][Ring1][Branch1][=C][Branch1][Ring1][C][#N][C][=Ring1][=C]
('[C]', [(1, 'C')])
('[=C]', [(2, 'C')])
('[O]', [(3, 'O')])
('[C]', [(4, 'c')])
('[NH1]', [(6, '[nH]')])
('[C]', [(7, 'c')])
('[N]', [(9, 'N')])
('[=C]', [(10, 'C')])
('[C]', [(11, 'c')])
('[=C]', [(13, 'c')])
('[C]', [(14, 'c')])
('[=C]', [(15, 'c')])
('[O]', [(16, 'o')])
('[Ring1]', None)
('[Branch1]', None)
('[Branch1]', [(9, 'N')])
('[#Branch2]', [(9, 'N')])
('[=C]', [(19, 'c')])
('[C]', [(21, 'C')])
('[#N]', [(22, 'N')])
('[Branch1]', [(21, 'C')])
('[Ring1]', [(21, 'C')])
('[C]', [(24, 'c')])
('[=Ring1]', None)
('[=C]', None)

If we start counting from 1, the attribution tuple in the second position should be (3, 'C') rather than (2, 'C'). But the bond is skipped over, and all subsequent indices are incorrectly shifted by 1. The same applies to other bond types, and it differs from the setup of the decoder attribution. Also, in contrast to bonds, ring symbols are not ignored when the index is built. This seems inconsistent to me. The reason is that your SMILES tokenizer removes all bonds (basically everything in SMILES_BOND_ORDERS).
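The shift can be reproduced without selfies at all (my own illustration): if a tokenizer drops bond symbols before indexing, every token after the first bond moves by one.

```python
import re

# Minimal SMILES tokenizer (same regex family as in the original post)
regexp = re.compile(r'(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|[0-9])')
tokenize = lambda smi: [t for t in regexp.split(smi) if t]

smiles = 'C=CO'
full = tokenize(smiles)                         # ['C', '=', 'C', 'O']
no_bonds = [t for t in full if t not in '=#-']  # ['C', 'C', 'O']

# 1-based index of 'O' differs depending on whether bonds were kept
print(full.index('O') + 1)      # 4
print(no_bonds.index('O') + 1)  # 3  <- shifted, as in the encoder attribution
```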
