Scalars support #132

Merged: 6 commits merged into main, Jul 26, 2024

Conversation

YoelShoshan (Collaborator):

Mainly related to adding support for scalars input/output to our mammal architecture.

from fuse.utils import NDict


class InjectorTokenizer(ModularTokenizer):
Collaborator (Author):

Originally I wanted to use this as a drop-in replacement for ModularTokenizer, but to avoid code duplication I ended up only storing static methods here.

@YoelShoshan requested a review from mosheraboh on July 24, 2024 08:05
@mosheraboh (Collaborator) left a comment:

Looks good.
See a few comments inline.

@@ -1025,6 +1031,7 @@ def encode_list(
on_unknown: (Optional[str], optional): What happens if unknown tokens (i.e. ones mapped to <UNK>) are encountered: 'raise' or 'warn'
verbose (Optional[int], optional): verbosity level. 0: no notification, 1: warning notification, 2: warning with partial data, 3: warning
with full data. Defaults to 1.
also_return_split: defaults to False. If set to True, the return value will also contain a list with one Encoding element per meta-tokenizer instruction.
Collaborator:

Should it be set to True if we want scalar support, or is it just for debug?
If it is used for scalars, can we simply infer it from typed_input_list?

Collaborator (Author):

You don't call this directly; injector_tokenizer_op does it automatically for you.
It's not just for debug.
We can't infer it from typed_input_list because we don't know how many tokens each tokenizer part will produce (it's not always 1:1 - there are things like SMILES, and things like cropping/padding).

If we only get the final merged encoding we can't tell:

  1. which tokens we should replace with <MASKED_SCALARS>
  2. where the scalar tokens are

The only way to do that externally is by effectively redoing the entire logic of the modular tokenizer, including the actual tokenization, padding, and cropping, which is both code duplication and would also be slower.
That's why I preferred to allow returning this "internal split" variable, which is already calculated.

If this isn't completely clear yet, let's talk.
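
To illustrate (a standalone, hypothetical sketch - not the library's API; the part names and token lists below are invented): given the per-tokenizer-instruction split, the absolute positions of scalar and masked-scalar tokens in the final merged sequence fall out of a simple running offset, something that cannot be recovered from the merged sequence alone because each part contributes a variable number of tokens.

# Standalone sketch: locate scalar / masked-scalar token positions from a per-part split.
# The "parts" below are invented for illustration only.
parts = [
    ("AA", ["<MOLECULAR_WEIGHT_IN_SOME_UNIT>"]),   # 1 token
    ("SCALARS_LITERALS", ["<SCALAR>"]),            # a scalar input slot
    ("AA", ["<BINDING_AFFINITY_NANOMOLAR>"]),
    ("SCALARS_LITERALS", ["<MASKED_SCALAR>"]),     # a masked scalar (to be predicted)
    ("AA", ["I", "S", "G", "G", "D"]),             # variable length - not 1:1 with the raw input
]

scalar_positions, masked_scalar_positions = [], []
offset = 0
for tokenizer_type, tokens in parts:
    if tokenizer_type.startswith("SCALARS"):
        for i, tok in enumerate(tokens):
            if tok == "<MASKED_SCALAR>":
                masked_scalar_positions.append(offset + i)
            else:
                scalar_positions.append(offset + i)
    offset += len(tokens)

print(scalar_positions)         # [1]
print(masked_scalar_positions)  # [3]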




Raises:
Collaborator:

Delete if you don't intend to document the exceptions.

if (
tokenizer_type == "SCALARS_LITERALS"
): # note: masking is only supported in literals (not in "from dict")
values = subseq.split(",")
Collaborator:

So we should write ","?

Collaborator (Author):

Yes, the scalars tokenizer requires that you split them with ','.
If you have an alternative you prefer, do suggest it.
I will add a description of the expected format to the injector files' docstrings.
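
For illustration, a minimal sketch of parsing that ','-separated format (this is not the repo's code; the helper name is hypothetical):

# Hypothetical parser for the SCALARS_LITERALS format: comma-separated floats and/or <MASK> tokens.
def parse_scalars_literals(subseq: str):
    values, masked_indices = [], []
    for i, item in enumerate(subseq.split(",")):
        if item == "<MASK>":
            values.append(float("nan"))  # placeholder for a masked (to-be-predicted) scalar
            masked_indices.append(i)
        else:
            values.append(float(item))
    return values, masked_indices

print(parse_scalars_literals("2.19,<MASK>,3.19,<MASK>"))
# ([2.19, nan, 3.19, nan], [1, 3])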

Collaborator (Author):

Added docstrings with a format description for both injector_tokenizer.py and injector_tokenizer_op.

Also renamed InjectorTokenizer to InjectorTokenizerHelpers and stopped inheriting from ModularTokenizer, since inheriting was misleading - it's just 2 static helper methods.

Collaborator (Author):

This is part of the docstrings I've added:

applies an injector tokenizer

    an injector tokenizer builds on top of the modular tokenizer.
    its purpose is to build inputs_emb for the model (instead of input_ids)
        this allows supporting more advanced inputs beyond token ids, like:
        * scalar inputs
        * embedding vectors within a single input

    supported syntax/format:

    text following <@TOKENIZER-TYPE=SCALARS_LITERALS> supports the following format:
    ','-separated float values and/or <MASK> tokens -
        for example: "2.7,3.99,-12.9" or "<MASK><MASK>" or "2.19,<MASK>,3.19,<MASK>"

    text following <@TOKENIZER-TYPE=SCALARS_FROM_DICT> is expected to be a key into the sample NDict
        for example: "blah.boo.banana" or "data.input.encoder_input"
        note: in SCALARS_FROM_DICT you can't describe masked scalars (outputs), you can only describe inputs

    example usage:

    encoder_input:
    <@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS><MASK><@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
    labels:
    <@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS>12.4<@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
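
As a side note on the SCALARS_FROM_DICT case, a minimal sketch of how such a key is assumed to resolve against a sample NDict (the key and tensor below are made up; the actual lookup happens inside the injector op):

import torch
from fuse.utils import NDict

sample = NDict({"data": {"input": {"scalars": torch.tensor([0.3, 1.7, -2.1])}}})
key = "data.input.scalars"  # what would follow <@TOKENIZER-TYPE=SCALARS_FROM_DICT>
scalars = sample[key]       # NDict supports '.'-separated nested keys
print(scalars)              # tensor([ 0.3000,  1.7000, -2.1000])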

else:
raise Exception(f"tokenizer_type={tokenizer_type} is not supported")

# elif tokenizer_type == "SCALARS_MASKED":
Collaborator:

remove

elif tokenizer_type.startswith("VECTORS_"):
raise Exception("VECTOR_* are not supported yet")
else:
with_placeholders.append("<@TOKENIZER-TYPE=" + tokenizer_type + ">")
Collaborator:

You might mistakenly drop the max length per element here.

Collaborator (Author):

I don't think so:

sequence = "<@TOKENIZER-TYPE=AA><BLAH><BLAH2>QKPGQAPRLLIYG<@TOKENIZER-TYPE=AA@MAX-LEN=122><BLAH3>SGSDFSDFSFD"
hints_and_subseq = re.split("<@TOKENIZER-TYPE=([^>]*)>", sequence)[1:]

In [6]: hints_and_subseq
Out[6]: ['AA', '<BLAH><BLAH2>QKPGQAPRLLIYG', 'AA@MAX-LEN=122', '<BLAH3>SGSDFSDFSFD']

Tell me if you still think I'm missing something here.

curr_indices.append(i + prev_index_end + 1)
curr_data.append(float(val))
else:
scalars_masked_indices.append(i + prev_index_end + 1)
Collaborator:

So this is a running index over the scalars - an index that aligns them to the encoder_input.

Collaborator (Author):

Yes, this collects all of the indices (at the level of final tokens) of masked scalars across the entire sequence.
It's expected to be empty for labels, and possibly non-empty for encoder_input.
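
A hedged sketch of how the collected scalar values, their positions, and the masked-scalar positions could then be used to build inputs_embeds instead of plain input_ids (this is not the repo's implementation; all names, dimensions, and values below are assumptions):

import torch
import torch.nn as nn

d_model = 8
token_embedding = nn.Embedding(100, d_model)            # regular token embedding table
scalar_proj = nn.Linear(1, d_model)                     # projects a scalar value into model space
masked_scalar_emb = nn.Parameter(torch.zeros(d_model))  # learned embedding for masked scalar slots

input_ids = torch.tensor([5, 7, 9, 11, 13])             # final merged token ids (toy values)
scalar_values = torch.tensor([0.3])                     # values collected from SCALARS_* parts
scalar_indices = torch.tensor([1])                      # positions of scalar input tokens
masked_scalar_indices = torch.tensor([3])               # positions of masked scalars (encoder_input only)

inputs_embeds = token_embedding(input_ids)                                # start from token embeddings
inputs_embeds[scalar_indices] = scalar_proj(scalar_values.unsqueeze(-1))  # inject scalar inputs
inputs_embeds[masked_scalar_indices] = masked_scalar_emb                  # mark slots to predict
print(inputs_embeds.shape)  # torch.Size([5, 8])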

@@ -241,22 +243,30 @@ def __call__(
)

if isinstance(data, str):
encoded, overflow_info = self._tokenizer.encode(
_ans = self._tokenizer.encode(
Collaborator:

why "_" prefix?

Collaborator (Author):

To hint that it's "private"/"local" and should not be used as the answer returned to the outside.
It's just a convention - possibly a convention I only have with myself :D

InjectorTokenizer,
)

# from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer
Collaborator:

remove commented-out imports.

**kwargs,
)

self._input_dim = input_dim
Collaborator:

Are you using it somewhere? The functions below are static.

Collaborator (Author):

Removed it from here and also from multiple related places.

@YoelShoshan merged commit ccf7505 into main on Jul 26, 2024
5 checks passed