Preserving the dot symbol (".") in the alphabet #104

Closed
vandrw opened this issue Aug 10, 2023 · 1 comment
Labels
question Further information is requested

Comments

@vandrw
Contributor

vandrw commented Aug 10, 2023

Hi! The current version of the library provides the get_alphabet_from_selfies function, which builds a set of symbols from an iterable of SELFIES strings. However, if the returned alphabet is used to build an encoding (as shown below), the call raises an error because the symbol "." is not in the alphabet.

sf.selfies_to_encoding(
        selfies_string,
        vocab_stoi=symbol_to_index_dict,
        pad_to_len=16,
        enc_type="label",
)

After having a closer look at the functions involved, there are two possible solutions:

  1. Preserve the dot symbol in the alphabet by removing the following line:
    alphabet.discard(".")
  2. Remove the dot symbol from the char_list created here (e.g., char_list.remove(".")):
    char_list = split_selfies(selfies)

In the context of training a machine learning model, would this symbol provide vital information? If so, wouldn't it be better to preserve it in the alphabet?
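
A third option that leaves the library untouched is to add "." to the vocabulary on the caller's side before encoding. The sketch below illustrates the idea with a small stand-in tokenizer written in plain Python; split_symbols, build_vocab, and to_labels are hypothetical helpers, not selfies functions, and the "[nop]" padding symbol is assumed here.

```python
import re

# Stand-in tokenizer (an assumption, not the library's implementation):
# a symbol is either a bracketed token such as "[C]" or the bare "." separator.
_SYMBOL = re.compile(r"\[[^\]]*\]|\.")


def split_symbols(selfies):
    """Return the symbols of a SELFIES-like string, keeping "." as a symbol."""
    return _SYMBOL.findall(selfies)


def build_vocab(dataset, keep_dot=True, pad="[nop]"):
    """Map every symbol in the dataset to an index; optionally drop "."."""
    symbols = {s for item in dataset for s in split_symbols(item)}
    if not keep_dot:
        symbols.discard(".")  # mirrors the behaviour described in the issue
    symbols.add(pad)
    return {s: i for i, s in enumerate(sorted(symbols))}


def to_labels(selfies, vocab, pad_to_len, pad="[nop]"):
    """Label-encode a string, padding on the right up to pad_to_len symbols."""
    tokens = split_symbols(selfies)
    tokens += [pad] * (pad_to_len - len(tokens))
    return [vocab[t] for t in tokens]  # raises KeyError for unknown symbols


dataset = ["[C][O].[N]", "[C][C]"]
vocab = build_vocab(dataset, keep_dot=True)
labels = to_labels("[C][O].[N]", vocab, pad_to_len=6)
print(labels)  # [1, 3, 0, 2, 4, 4]
```

With keep_dot=False, the same to_labels call fails with a KeyError on ".", which is the kind of failure described above.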

@MarioKrenn6240
Collaborator

Hi @vandrw, thanks for your interest. The dot represents two unconnected molecules, or unphysical bonds that cannot be represented natively in SMILES (e.g., ferrocenes). For generative models, we haven't encountered a use case for it yet. We only introduced it so that the library can read even some ill-formed SMILES. Hope this helps!
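
Since the dot only separates unconnected parts, one way to keep it out of the alphabet is to split each string at "." and treat the pieces as separate inputs. This is a minimal sketch of that idea; split_fragments is a hypothetical helper, not part of selfies, and it assumes "." never appears inside a bracketed symbol.

```python
def split_fragments(selfies):
    """Split a dot-separated SELFIES string into its unconnected fragments."""
    # Empty pieces (from a leading, trailing, or doubled dot) are dropped.
    return [fragment for fragment in selfies.split(".") if fragment]


fragments = split_fragments("[C][O].[N]")
print(fragments)  # ['[C][O]', '[N]']
```

Whether splitting is appropriate depends on the task: it discards the information that the fragments belonged to the same input string.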

MarioKrenn6240 added the question label Aug 17, 2023
vandrw pushed a commit to vandrw/selfies that referenced this issue Nov 15, 2023
MarioKrenn6240 added a commit that referenced this issue Nov 23, 2023
fix #104 Add check for dot symbol and warn user