Scalars support #132

Merged: 6 commits merged into main, Jul 26, 2024

Conversation

YoelShoshan (Collaborator):

Mainly related to adding support for scalars input/output to our mammal architecture.

from fuse.utils import NDict


class InjectorTokenizer(ModularTokenizer):
Collaborator (Author):

Originally I wanted to use this as a drop-in replacement for ModularTokenizer, but to avoid code duplication I ended up only storing static methods here.

@YoelShoshan requested a review from mosheraboh on July 24, 2024 08:05
@mosheraboh (Collaborator) left a comment:

Looks good.
See a few comments inline.

@@ -1025,6 +1031,7 @@ def encode_list(
on_unknown: (Optional[str], optional): What happens if unknown tokens (i.e. ones mapped to <UNK>) are encountered: 'raise' or 'warn'
verbose (Optional[int], optional): verbosity level. 0: no notification, 1: warning notification, 2: warning with partial data, 3: warning
with full data. Defaults to 1.
also_return_split: defaults to False. If set to True, the return value will also contain a list with one Encoding element per meta-tokenizer instruction.
Collaborator:

Should it be set to True if we want scalar support, or is it just for debug?
If it is used for scalars, can we simply infer it from typed_input_list?

Collaborator (Author):

You don't call this directly; injector_tokenizer_op does it automatically for you.
It's not just for debug.
We can't infer it from typed_input_list because we don't know how many tokens each tokenizer part will produce (it's not always 1:1 - there are things like SMILES, and things like cropping/padding).

If we only get the final merged encoding we can't tell:

  1. which tokens we should replace with <MASKED_SCALARS>
  2. where the scalar tokens are

The only way to do that externally is by effectively redoing the entire logic of the modular tokenizer, including the actual tokenization, padding, and cropping, which is both code duplication and would also be slower.
That's why I preferred to allow returning this "internal split" variable, which is already calculated.

If this isn't completely clear yet, let's talk.
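
To illustrate (a standalone, hypothetical sketch - not the library's API; the part names and token lists below are invented): given the per-tokenizer-instruction split, the absolute positions of scalar and masked-scalar tokens in the final merged sequence fall out of a simple running offset, something that cannot be recovered from the merged sequence alone because each part contributes a variable number of tokens.

# Standalone sketch: locate scalar / masked-scalar token positions from a per-part split.
# The "parts" below are invented for illustration only.
parts = [
    ("AA", ["<MOLECULAR_WEIGHT_IN_SOME_UNIT>"]),   # 1 token
    ("SCALARS_LITERALS", ["<SCALAR>"]),            # a scalar input slot
    ("AA", ["<BINDING_AFFINITY_NANOMOLAR>"]),
    ("SCALARS_LITERALS", ["<MASKED_SCALAR>"]),     # a masked scalar (to be predicted)
    ("AA", ["I", "S", "G", "G", "D"]),             # variable length - not 1:1 with the raw input
]

scalar_positions, masked_scalar_positions = [], []
offset = 0
for tokenizer_type, tokens in parts:
    if tokenizer_type.startswith("SCALARS"):
        for i, tok in enumerate(tokens):
            if tok == "<MASKED_SCALAR>":
                masked_scalar_positions.append(offset + i)
            else:
                scalar_positions.append(offset + i)
    offset += len(tokens)

print(scalar_positions)         # [1]
print(masked_scalar_positions)  # [3]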




Raises:
Collaborator:

Delete if you don't intend to document the exceptions.

if (
tokenizer_type == "SCALARS_LITERALS"
): # note: masking is only supported in literals (not in "from dict")
values = subseq.split(",")
Collaborator:

So we should write ","?

Collaborator (Author):

Yes, the scalars tokenizer requires that you split them with ','.
If you have an alternative you prefer, do suggest it.
I will add a description of the expected format to the injector files' docstrings.
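
For illustration, a minimal sketch of parsing that ','-separated format (this is not the repo's code; the helper name is hypothetical):

# Hypothetical parser for the SCALARS_LITERALS format: comma-separated floats and/or <MASK> tokens.
def parse_scalars_literals(subseq: str):
    values, masked_indices = [], []
    for i, item in enumerate(subseq.split(",")):
        if item == "<MASK>":
            values.append(float("nan"))  # placeholder for a masked (to-be-predicted) scalar
            masked_indices.append(i)
        else:
            values.append(float(item))
    return values, masked_indices

print(parse_scalars_literals("2.19,<MASK>,3.19,<MASK>"))
# ([2.19, nan, 3.19, nan], [1, 3])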

Collaborator (Author):

Added docstrings with a format description for both injector_tokenizer.py and injector_tokenizer_op.

Also renamed InjectorTokenizer to InjectorTokenizerHelpers and stopped inheriting from ModularTokenizer, since inheriting was misleading - it's just 2 static helper methods.

Collaborator (Author):

This is part of the docstrings I've added:

applies an injector tokenizer

    an injector tokenizer builds on top of the modular tokenizer.
    its purpose is to build inputs_emb for the model (instead of input_ids)
        this allows supporting more advanced inputs beyond token ids, like:
        * scalar inputs
        * embedding vectors within a single input

    supported syntax/format:

    text following <@TOKENIZER-TYPE=SCALARS_LITERALS> supports the following format:
    ','-separated float values and/or <MASK> tokens -
        for example: "2.7,3.99,-12.9" or "<MASK><MASK>" or "2.19,<MASK>,3.19,<MASK>"

    text following <@TOKENIZER-TYPE=SCALARS_FROM_DICT> is expected to be a key into the sample NDict
        for example: "blah.boo.banana" or "data.input.encoder_input"
        note: in SCALARS_FROM_DICT you can't describe masked scalars (outputs), you can only describe inputs

    example usage:

    encoder_input:
    <@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS><MASK><@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
    labels:
    <@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS>12.4<@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
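
As a side note on the SCALARS_FROM_DICT case, a minimal sketch of how such a key is assumed to resolve against a sample NDict (the key and tensor below are made up; the actual lookup happens inside the injector op):

import torch
from fuse.utils import NDict

sample = NDict({"data": {"input": {"scalars": torch.tensor([0.3, 1.7, -2.1])}}})
key = "data.input.scalars"  # what would follow <@TOKENIZER-TYPE=SCALARS_FROM_DICT>
scalars = sample[key]       # NDict supports '.'-separated nested keys
print(scalars)              # tensor([ 0.3000,  1.7000, -2.1000])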

else:
raise Exception(f"tokenizer_type={tokenizer_type} is not supported")

# elif tokenizer_type == "SCALARS_MASKED":
Collaborator:

remove

elif tokenizer_type.startswith("VECTORS_"):
raise Exception("VECTOR_* are not supported yet")
else:
with_placeholders.append("<@TOKENIZER-TYPE=" + tokenizer_type + ">")
Collaborator:

You might mistakenly drop the max length per element here.

Collaborator (Author):

I don't think so:

sequence = "<@TOKENIZER-TYPE=AA><BLAH><BLAH2>QKPGQAPRLLIYG<@TOKENIZER-TYPE=AA@MAX-LEN=122><BLAH3>SGSDFSDFSFD"
hints_and_subseq = re.split("<@TOKENIZER-TYPE=([^>]*)>", sequence)[1:]

In [6]: hints_and_subseq
Out[6]: ['AA', '<BLAH><BLAH2>QKPGQAPRLLIYG', 'AA@MAX-LEN=122', '<BLAH3>SGSDFSDFSFD']

Tell me if you still think I'm missing something here.

curr_indices.append(i + prev_index_end + 1)
curr_data.append(float(val))
else:
scalars_masked_indices.append(i + prev_index_end + 1)
Collaborator:

So this is a running index over the scalars - an index that aligns them to the encoder_input.

Collaborator (Author):

Yes, this collects all of the indices (at the level of final tokens) of masked scalars across the entire sequence.
It's expected to be empty for labels, and possibly non-empty for encoder_input.
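
A hedged sketch of how the collected scalar values, their positions, and the masked-scalar positions could then be used to build inputs_embeds instead of plain input_ids (this is not the repo's implementation; all names, dimensions, and values below are assumptions):

import torch
import torch.nn as nn

d_model = 8
token_embedding = nn.Embedding(100, d_model)            # regular token embedding table
scalar_proj = nn.Linear(1, d_model)                     # projects a scalar value into model space
masked_scalar_emb = nn.Parameter(torch.zeros(d_model))  # learned embedding for masked scalar slots

input_ids = torch.tensor([5, 7, 9, 11, 13])             # final merged token ids (toy values)
scalar_values = torch.tensor([0.3])                     # values collected from SCALARS_* parts
scalar_indices = torch.tensor([1])                      # positions of scalar input tokens
masked_scalar_indices = torch.tensor([3])               # positions of masked scalars (encoder_input only)

inputs_embeds = token_embedding(input_ids)                                # start from token embeddings
inputs_embeds[scalar_indices] = scalar_proj(scalar_values.unsqueeze(-1))  # inject scalar inputs
inputs_embeds[masked_scalar_indices] = masked_scalar_emb                  # mark slots to predict
print(inputs_embeds.shape)  # torch.Size([5, 8])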

@@ -241,22 +243,30 @@ def __call__(
)

if isinstance(data, str):
encoded, overflow_info = self._tokenizer.encode(
_ans = self._tokenizer.encode(
Collaborator:

why "_" prefix?

Collaborator (Author):

To hint that it's "private"/"local" and should not be used as the answer returned to the outside.
It's just a convention - possibly a convention I only have with myself :D

InjectorTokenizer,
)

# from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer
Collaborator:

remove commented-out imports.

**kwargs,
)

self._input_dim = input_dim
Collaborator:

Are you using it somewhere? The functions below are static.

Collaborator (Author):

Removed it from here and also from multiple related places.

@YoelShoshan merged commit ccf7505 into main on Jul 26, 2024
5 checks passed