Scalars support #132
Conversation
from fuse.utils import NDict


class InjectorTokenizer(ModularTokenizer):
originally I wanted to use this as a drop-in replacement for ModularTokenizer, but in order to avoid code duplication I ended up only storing static methods here.
Looks good.
See a few comments inline.
@@ -1025,6 +1031,7 @@ def encode_list(
on_unknown: (Optional[str], optional): What happens if unknown tokens (i.e. ones mapped to <UNK>) are encountered: 'raise' or 'warn'
verbose (Optional[int], optional): verbosity level. 0: no notification, 1: warning notification, 2: warning with partial data, 3: warning
    with full data. Defaults to 1.
also_return_split: defaults to False. If set to True, the return value will also contain a list with one Encoding element per meta-tokenizer instruction.
Should it be set to True if we want scalar support, or is it just for debugging?
If it is used for scalars, can we simply infer it from typed_input_list?
You don't call this directly; injector_tokenizer_op does it automatically for you.
It's not just for debugging.
We can't infer it from typed_input_list because we don't know how many tokens each tokenizer part will produce (it's not always 1:1 - there are things like SMILES, and things like cropping/padding).
If we only get the final merged encoding we can't tell:
- which tokens we should replace with <MASKED_SCALARS>
- where the scalar tokens are
The only way to do that externally is to effectively redo the entire modular tokenizer logic, including the actual tokenization, padding and cropping, which is both code duplication and slower.
That's why I preferred to allow returning this already-calculated "internal split" variable.
If this isn't completely clear yet, let's talk.
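To make the motivation concrete, here is a rough sketch of how the per-part split could be consumed downstream; the helper name, argument names and the assumption that each part exposes an .ids list are mine, not the actual implementation:

# Hypothetical sketch only - not the actual fuse-drug API.
from typing import List, Sequence

def locate_scalar_token_positions(
    per_part_encodings: Sequence,             # one Encoding-like object per meta-tokenizer instruction
    per_part_tokenizer_types: Sequence[str],  # e.g. ["AA", "SCALARS_LITERALS", "AA"]
) -> List[int]:
    """Collect the absolute token positions (in the merged sequence) that belong to scalar parts."""
    scalar_positions: List[int] = []
    offset = 0
    for part_encoding, tokenizer_type in zip(per_part_encodings, per_part_tokenizer_types):
        num_tokens = len(part_encoding.ids)  # parts are not 1:1 with input characters (SMILES, cropping, padding, ...)
        if tokenizer_type.startswith("SCALARS_"):
            scalar_positions.extend(range(offset, offset + num_tokens))
        offset += num_tokens
    return scalar_positions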
Raises:
Delete this if you don't intend to document the exceptions.
if (
    tokenizer_type == "SCALARS_LITERALS"
):  # note: masking is only supported in literals (not in "from dict")
    values = subseq.split(",")
So we should write "," and not "?
Yes, the scalars tokenizer requires that you split them with ','.
If you have an alternative you prefer, do suggest it.
I will add a description of the expected format to the injector files' docstrings.
Added docstrings with a format description for both injector_tokenizer.py and injector_tokenizer_op.
Also renamed InjectorTokenizer to InjectorTokenizerHelpers and stopped inheriting from ModularTokenizer there, since that was misleading - it's just 2 static helper methods.
This is part of the docstrings I've added:

applies an injector tokenizer

injector tokenizer builds on top of modular tokenizer.
its purpose is to build inputs_emb for the model (instead of input_ids)
this allows supporting more advanced inputs beyond token ids, like:
* scalar inputs
* embedding vectors within a single input

supported syntax/format:
text following <@TOKENIZER-TYPE=SCALARS_LITERALS> supports the following format:
',' separated float values and/or <MASK> tokens -
for example: "2.7,3.99,-12.9" or "<MASK><MASK>" or "2.19,<MASK>,3.19,<MASK>"
text following <@TOKENIZER-TYPE=SCALARS_FROM_DICT> is expected to be a key into the sample NDict -
for example: "blah.boo.banana" or "data.input.encoder_input"
note: with SCALARS_FROM_DICT you can't describe masked scalars (outputs), you can only describe inputs
example usage:
encoder_input:
<@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS><MASK><@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
labels:
<@TOKENIZER-TYPE=AA><MOLECULAR_WEIGHT_IN_SOME_UNIT><@TOKENIZER-TYPE=SCALARS_LITERALS>0.3<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_NANOMOLAR><@TOKENIZER-TYPE=SCALARS_LITERALS>12.4<@TOKENIZER-TYPE=AA><SEQUENCE_NATURAL_START>ISGGDAIYSSTGRCSLGFNVRSGSTYYFLTAGICTDGATTWWANSARTTVLGTTSGSSFPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMAVTRRGSTTGTISGSVTALNATVNYGGGDVVYGMIRTNVCAEPGDSGGPLYSGTRAIGLTSGGSGNCSSGGTTFFQPVTEALVAYGVSVY<SEQUENCE_NATURAL_END>
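As a side note, a minimal sketch of how the SCALARS_LITERALS literal format described above could be parsed; parse_scalars_literals is a hypothetical helper, not part of the PR:

from typing import List, Optional

def parse_scalars_literals(subseq: str) -> List[Optional[float]]:
    """Parse ','-separated float values and/or <MASK> tokens; masked positions become None."""
    values: List[Optional[float]] = []
    for val in subseq.split(","):
        if val == "<MASK>":
            values.append(None)  # masked scalar - only supported in SCALARS_LITERALS, not in SCALARS_FROM_DICT
        else:
            values.append(float(val))
    return values

assert parse_scalars_literals("2.19,<MASK>,3.19,<MASK>") == [2.19, None, 3.19, None]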
else:
    raise Exception(f"tokenizer_type={tokenizer_type} is not supported")

# elif tokenizer_type == "SCALARS_MASKED":
remove
elif tokenizer_type.startswith("VECTORS_"):
    raise Exception("VECTOR_* are not supported yet")
else:
    with_placeholders.append("<@TOKENIZER-TYPE=" + tokenizer_type + ">")
You might mistakenly drop the max length per element here.
I don't think so:
sequence = "<@TOKENIZER-TYPE=AA><BLAH><BLAH2>QKPGQAPRLLIYG<@TOKENIZER-TYPE=AA@MAX-LEN=122><BLAH3>SGSDFSDFSFD"
hints_and_subseq = re.split("<@TOKENIZER-TYPE=([^>]*)>", sequence)[1:]
In [6]: hints_and_subseq
Out[6]: ['AA', '<BLAH><BLAH2>QKPGQAPRLLIYG', 'AA@MAX-LEN=122', '<BLAH3>SGSDFSDFSFD']
Tell me if you still think I'm missing something here.
    curr_indices.append(i + prev_index_end + 1)
    curr_data.append(float(val))
else:
    scalars_masked_indices.append(i + prev_index_end + 1)
So this is a running index of the scalars, and an index that aligns them to the encoder_input?
Yes, this collects all of the indices (at the level of final tokens) of masked scalars, across the entire sequence.
It's expected to be empty for labels, and possibly non-empty for encoder_input, yeah.
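To make the alignment concrete, here is a hedged illustration (tensor names and shapes are assumptions, not the actual code) of how the collected absolute indices could be used to place scalar values and masked-scalar markers at the right positions of the final, already padded/cropped token sequence:

import torch

seq_len = 16
scalar_indices = torch.tensor([3, 7])        # final-token positions of literal scalars
scalar_values = torch.tensor([0.3, 12.4])    # the parsed float values
masked_scalar_indices = torch.tensor([5])    # positions of <MASK> scalars (encoder_input only)

scalars = torch.zeros(seq_len)
scalars_valid_mask = torch.zeros(seq_len, dtype=torch.bool)
scalars[scalar_indices] = scalar_values
scalars_valid_mask[scalar_indices] = True

scalars_masked_mask = torch.zeros(seq_len, dtype=torch.bool)
scalars_masked_mask[masked_scalar_indices] = True  # expected to be all-False for labels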
@@ -241,22 +243,30 @@ def __call__(
)

if isinstance(data, str):
    encoded, overflow_info = self._tokenizer.encode(
    _ans = self._tokenizer.encode(
why "_" prefix?
To hint that it's "private" or "local", and shouldn't be used as the answer returned to the outside.
Just a convention - it's possible that it's a convention I only have with myself :D
    InjectorTokenizer,
)

# from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer
remove commented-out imports.
    **kwargs,
)

self._input_dim = input_dim
are you using it somewhere? the functions below are static.
Removed it from here and also from multiple related places,
mainly related to adding support for scalar input/output to our mammal architecture.