TCR chain type special tokens #127

Merged 96 commits on Jun 2, 2024
Commits
a5e842d
from_file for modular tokenizer
May 18, 2023
5a93fc8
from_file for modular tokenizer
May 18, 2023
9ad130b
add EOS token to modulartokenizer
May 24, 2023
9aa0107
modular tokenizer ops for t5 PPI
May 25, 2023
8e48881
modular tokenizer ops adjustments
May 28, 2023
ecc12a6
modular tokenizer additions: ops
May 29, 2023
dd896cb
adding special tokens to existing ModularTokenizer
May 30, 2023
08ec610
adding special tokens to existing ModularTokenizer
May 30, 2023
6233f40
modular tokenizer ops bug fix
May 31, 2023
5390a71
mod tokenizer additions
May 31, 2023
d6ee4ea
pr comments
Jun 1, 2023
371baa0
modulartokenizer fixes, special tokens added
Jun 4, 2023
df9508b
modulartokenizer fixes, special tokens added
Jun 4, 2023
5c51567
modulartokenizer fixes, special tokens added
Jun 4, 2023
d33d9cc
formatting fixes
Jun 4, 2023
33e4cad
PR fixes
Jun 5, 2023
d2d50b2
readme update
Jun 5, 2023
b627873
relative links test
Jun 6, 2023
72e731a
relative links test
Jun 6, 2023
34913e6
Merge branch 'main' into PPI_t5
floccinauc Jun 6, 2023
c012e23
code cleanup
Jun 6, 2023
2de1cec
Merge branch 'PPI_t5' of https://github.com/BiomedSciAI/fuse-drug int…
Jun 6, 2023
b3e977b
cleanup
Jun 6, 2023
650ff6e
cleanup
Jun 6, 2023
e4343f5
Merge branch 'main' into PPI_t5
floccinauc Jun 7, 2023
4d66f49
PR fix
Jun 7, 2023
cc10b08
PR fix
Jun 7, 2023
3819aaa
added special token
Jun 7, 2023
6587606
remove unused FastModularTokenizerOp operator
Jun 8, 2023
0d3b88b
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 8, 2023
fc77d1b
adding dti_binding_dataset_combined
Jun 14, 2023
3c95264
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 14, 2023
1d180c5
adding dti_binding_dataset_combined
Jun 14, 2023
3aab635
Merge branch 'main' into PPI_t5
floccinauc Jun 18, 2023
639a400
small fixes
Jun 18, 2023
f03c4b2
Merge branch 'PPI_t5' of https://github.com/BiomedSciAI/fuse-drug int…
Jun 18, 2023
7a00d65
small fixes
Jun 18, 2023
39ae971
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 18, 2023
0dfee10
black fixes
Jun 18, 2023
58446a9
black fixes
Jun 18, 2023
95e007c
setting and exposing consistent max_len logic
Jun 21, 2023
363b6be
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 21, 2023
f1f3817
setting and exposing consistent max_len logic
Jun 21, 2023
b9ab404
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 21, 2023
a100c49
PR fixes
Jun 21, 2023
f5c9908
merging with string encode
Jun 25, 2023
1e31234
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jun 29, 2023
e8d201c
small tokenizer ops fix
Jul 16, 2023
708e2e4
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jul 16, 2023
b929e9d
formatting
Jul 16, 2023
e147566
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Jul 19, 2023
6adb330
import fixes
Jul 20, 2023
86ed14e
some special token annotations
Aug 3, 2023
9e3625d
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Aug 3, 2023
88ea30b
some special token annotations
Aug 3, 2023
cce85da
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Aug 8, 2023
86a430a
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Aug 16, 2023
c118680
readme correction
Aug 20, 2023
e4323e4
readme correction
Aug 20, 2023
d3a4d9d
adding special tokens and text table loader
Aug 21, 2023
3f0adb8
adding special tokens and text table loader
Aug 21, 2023
6d69fe6
adding special tokens and text table loader
Aug 21, 2023
960ba28
small changes
Sep 3, 2023
f9baa56
small changes
Sep 3, 2023
df3abab
small changes
Sep 3, 2023
43b639b
Adding sequence-related tokens (start,end,type,noop,backspace)
Sep 12, 2023
03cfd35
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Sep 12, 2023
61637b5
adding a new token TARGETED_ANTIBODY_DESIGN_ENCODER_ONLY_MODE
Sep 21, 2023
db7fc3f
adding a new token TARGETED_ANTIBODY_DESIGN_ENCODER_ONLY_MODE
Sep 21, 2023
6ca5554
tokenizer improvements
Jan 11, 2024
626102d
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Feb 6, 2024
32303fa
add a warning/exception on encountering unknown tokens and a method t…
Feb 21, 2024
9398cb7
add a warning/exception on encountering unknown tokens and a method t…
Feb 21, 2024
348df7f
PR fix
Feb 21, 2024
6f9ad97
update
Feb 29, 2024
0e6fa55
update
Feb 29, 2024
4624474
update
Feb 29, 2024
0f1fa67
detail level in unk token warnings during ModularTokenizer encode/enc…
Feb 29, 2024
2da0cdb
detail level in unk token warnings during ModularTokenizer encode/enc…
Feb 29, 2024
e89f473
detail level in unk token warnings during ModularTokenizer encode/enc…
Feb 29, 2024
52742da
detail level in unk token warnings during ModularTokenizer encode/enc…
Feb 29, 2024
2f733db
add verbose 0 version to ModularTokenizer.encode
Mar 3, 2024
40b58bd
add verbose 0 version to ModularTokenizer.encode
Mar 3, 2024
22e5f64
main merge
Mar 7, 2024
6b1e229
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 19, 2024
746f750
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Mar 20, 2024
a4f309e
save path compatibility fix for modular tokenizer
Mar 20, 2024
4dc0e0f
save path compatibility fix for modular tokenizer
Mar 20, 2024
aedce8b
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
Apr 11, 2024
d3348e4
adding special tokens
Apr 11, 2024
407a9d1
add mutated token
Apr 14, 2024
4c87f49
travis fixes
Apr 14, 2024
b98584a
Merge branch 'main' of https://github.com/BiomedSciAI/fuse-drug into …
May 26, 2024
2a14ae4
TCR chain tokens
Jun 2, 2024
14e1b7b
TCR chain type tokens: alpha, gamma, delta and their CDR3 regions
Jun 2, 2024
b86dc56
TCR chain type tokens: alpha, gamma, delta and their CDR3 regions
Jun 2, 2024
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3017,6 +3062,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "#": 527,
 "%": 528,
 "(": 529,
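The two hunks above register each new token twice: once as an entry in the list that precedes `"normalizer"`, and once in the token-to-id map. A minimal sketch of a check that the two sections stay in sync follows. It assumes the Hugging Face `tokenizers` JSON layout, in which the list is `added_tokens` and the map is `model.vocab`; neither key name is visible in the diff, and `find_mismatches` is a hypothetical helper, not part of fuse-drug.

```python
from __future__ import annotations

# Trimmed stand-in for one tokenizer JSON file touched by this PR.
# ASSUMPTION: the keys "added_tokens" and "model" -> "vocab" follow the
# Hugging Face tokenizers JSON layout; the diff does not show them.
tokenizer = {
    "added_tokens": [
        {"id": 300, "content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>", "special": True},
        {"id": 301, "content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>", "special": True},
        {"id": 302, "content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>", "special": True},
        {"id": 303, "content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>", "special": True},
        {"id": 304, "content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>", "special": True},
    ],
    "normalizer": None,
    "model": {
        "vocab": {
            "<MUTATED>": 299,
            "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
            "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
            "<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
            "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
            "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
            "#": 527,
        }
    },
}


def find_mismatches(tok: dict) -> list[str]:
    """Return a list of problems; an empty list means both sections agree."""
    vocab = tok["model"]["vocab"]
    problems: list[str] = []
    owner_of_id: dict[int, str] = {}
    for entry in tok["added_tokens"]:
        token, token_id = entry["content"], entry["id"]
        # Two added tokens must never share an id.
        if token_id in owner_of_id:
            problems.append(f"id {token_id} is used by {owner_of_id[token_id]} and {token}")
        owner_of_id[token_id] = token
        # Every added token must appear in the vocab map with the same id.
        if token not in vocab:
            problems.append(f"{token} is missing from the vocab map")
        elif vocab[token] != token_id:
            problems.append(
                f"{token} has id {token_id} in added_tokens but {vocab[token]} in vocab"
            )
    return problems


print(find_mismatches(tokenizer))  # prints []
```

To run the same check against a real file, the `tokenizer` dict could be replaced with the result of `json.load` on that file, provided the layout assumption holds.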
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3023,6 +3068,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "[CL:0000499]": 3522,
 "[CL:2000060]": 3523,
 "[CL:0000235]": 3524,
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3023,6 +3068,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "[100130093]": 5000,
 "[100133445]": 5001,
 "[100286793]": 5002,
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3023,6 +3068,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "A": 501,
 "B": 502,
 "C": 503,
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3017,6 +3062,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "#": 527,
 "%": 528,
 "(": 529,
@@ -2702,6 +2702,51 @@
 "rstrip": false,
 "normalized": false,
 "special": true
+},
+{
+"id": 300,
+"content": "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 301,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 302,
+"content": "<MOLECULAR_ENTITY_TCR_DELTA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 303,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
+},
+{
+"id": 304,
+"content": "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>",
+"single_word": false,
+"lstrip": false,
+"rstrip": false,
+"normalized": false,
+"special": true
 }
 ],
 "normalizer": null,
@@ -3023,6 +3068,11 @@
 "<GENERAL_CHAIN>": 297,
 "<SUBMOLECULAR_ENTITY>": 298,
 "<MUTATED>": 299,
+"<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
+"<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
+"<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
+"<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
+"<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
 "[CL:0000499]": 3522,
 "[CL:2000060]": 3523,
 "[CL:0000235]": 3524,
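Every file diff shown in this PR assigns the same ids (300 to 304) to the same five tokens. A rough sketch of how that agreement could be checked across files follows. The file names are placeholders because the page does not show them, each dict stands in for one file's token-to-id map, and `conflicting_ids` is a hypothetical helper rather than anything in fuse-drug.

```python
from __future__ import annotations

# The five tokens added by this PR, with the ids shown in the diffs.
NEW_TOKENS = {
    "<MOLECULAR_ENTITY_TCR_ALPHA_CDR3>": 300,
    "<MOLECULAR_ENTITY_TCR_DELTA_CDR3>": 301,
    "<MOLECULAR_ENTITY_TCR_DELTA_VAR>": 302,
    "<MOLECULAR_ENTITY_TCR_GAMMA_CDR3>": 303,
    "<MOLECULAR_ENTITY_TCR_GAMMA_VAR>": 304,
}

# ASSUMPTION: placeholder file names; the real paths are not visible on the page.
# The extra entries are the context lines that follow the new tokens in the diffs.
vocabs = {
    "tokenizer_a.json": {**NEW_TOKENS, "#": 527, "%": 528, "(": 529},
    "tokenizer_b.json": {**NEW_TOKENS, "[CL:0000499]": 3522, "[CL:2000060]": 3523},
    "tokenizer_c.json": {**NEW_TOKENS, "A": 501, "B": 502, "C": 503},
}


def conflicting_ids(vocabs: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Map each token that has more than one id to {file name: id}.

    Tokens that occur in a single file cannot conflict and are never reported.
    """
    ids_by_token: dict[str, dict[str, int]] = {}
    for file_name, vocab in vocabs.items():
        for token, token_id in vocab.items():
            ids_by_token.setdefault(token, {})[file_name] = token_id
    return {
        token: per_file
        for token, per_file in ids_by_token.items()
        if len(set(per_file.values())) > 1
    }


print(conflicting_ids(vocabs))  # prints {}
```

The check only covers tokens that appear in more than one file, so entries specific to one file (such as the `[CL:...]` tokens) are ignored by design.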