Allow loss masking for defined spans of characters #113
Open
sohamparikh wants to merge 26 commits into main from soham/loss-masking-spans
+796 −149
Commits (26)
All commits by sohamparikh:

- 9367fcd convert character spans to token spans
- 515dcb5 handle null spans
- 3457ba2 handle spans in data iterator, fix test
- c7373b9 bump dataset version
- 0699e0f create a document class
- 419acd7 make loss masking work for prepare and training
- acad1e4 merge main
- daa2ad7 bos and eos options for tokenizer
- bb175bf loss masking for triton cross entropy
- 0e7ad8b fix random data tests
- 989a8f8 revert precommit versions
- 9633f88 fix memmap dataset test
- 4f955ff fix remaining dataset tests
- 70e40e8 Merge branch 'main' into soham/loss-masking-spans
- 1ac5052 compose tests
- aebb5a0 handle special tokens from config
- d8e3ae1 fix fim to handle bos and eos
- a887dd6 address review comments
- 40a80f6 fix memmap tests
- e908303 fix fim tests
- 20ffae8 special tokens mode -> sequence delimiters
- 753e731 GPTDataBatch -> GPTBatch
- cce0701 GPTMemmapDocument, GPTMemmapSample -> GPTSample
- 0583dec make loss masking opt-in in cross-entropy
- 7c40bf2 make spans opt-in during prepare
- 1998b9f make spans opt-in for train
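The first commit converts character-level spans to token-level spans. The PR's actual conversion code is not shown here, so the following is only a minimal sketch of the idea: given the (start, end) character offsets of each token (as produced by tokenizers that report an offset mapping), a character span maps to the range of tokens it overlaps. The function name and span conventions (character spans are half-open, token spans are inclusive) are assumptions for illustration.

```python
import bisect

def char_spans_to_token_spans(
    char_spans: list[tuple[int, int]],
    token_offsets: list[tuple[int, int]],
) -> list[tuple[int, int]]:
    """Map half-open character spans to inclusive token index spans.

    `token_offsets[i]` is the (start, end) character offset of token i.
    A token belongs to a span if the two ranges overlap at all.
    This is an illustrative sketch, not the PR's implementation.
    """
    starts = [start for start, _ in token_offsets]
    ends = [end for _, end in token_offsets]
    token_spans = []
    for char_begin, char_end in char_spans:
        # First token whose end offset lies past the span start.
        first = bisect.bisect_right(ends, char_begin)
        # Last token whose start offset lies before the span end.
        last = bisect.bisect_left(starts, char_end) - 1
        if first <= last:
            token_spans.append((first, last))
    return token_spans
```

With two tokens covering characters 0-5 and 6-11, the character span (3, 9) overlaps both tokens and maps to the token span (0, 1).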
```diff
@@ -36,6 +36,9 @@ def get_document_sizes(self) -> np.ndarray:
         # TODO: This can be really big.
         return self._dataset.get_document_sizes()[self._begin : self._end]

+    def get_span_sizes(self) -> np.ndarray:
+        return self._dataset.get_span_sizes()[self._begin : self._end]
+

 class GPTConcatenatedDataset[IndexedDatasetType: GPTIndexedDataset](
     ConcatenatedDataset[IndexedDatasetType], GPTIndexedDataset
@@ -45,3 +48,6 @@ class GPTConcatenatedDataset[IndexedDatasetType: GPTIndexedDataset](
     def get_document_sizes(self) -> np.ndarray:
         # TODO: This can be really big.
         return np.concatenate([dataset.get_document_sizes() for dataset in self._datasets])
+
+    def get_span_sizes(self) -> np.ndarray:
+        return np.concatenate([dataset.get_span_sizes() for dataset in self._datasets])
```

Review comment on `get_span_sizes`:

> Is this only used in the tests? If so, I'm not sure it's worth making a public method at this stage.
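Downstream of these span accessors, the point of the PR is to exclude the masked token spans from the training loss. The PR does this inside its (Triton) cross-entropy kernel, which is not shown here; the following is only a minimal NumPy sketch of the reduction, assuming per-token losses are already computed and spans are inclusive (begin, end) token ranges.

```python
import numpy as np

def masked_loss_mean(
    per_token_loss: np.ndarray,
    loss_masking_spans: list[tuple[int, int]],
) -> float:
    """Average per-token losses, ignoring tokens inside masked spans.

    `loss_masking_spans` holds inclusive (begin, end) token spans whose
    losses must not contribute to the objective. Illustrative sketch only;
    the PR implements the equivalent masking in its cross-entropy kernel.
    """
    keep = np.ones(per_token_loss.shape, dtype=bool)
    for begin, end in loss_masking_spans:
        keep[begin : end + 1] = False
    if not keep.any():
        # Every token is masked; contribute nothing to the loss.
        return 0.0
    return float(per_token_loss[keep].mean())
```

For losses [1, 2, 3, 4] with the span (1, 2) masked, only tokens 0 and 3 remain and the mean is 2.5.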
Review comment (on the FIM code path):

> Since this code changes the order of tokens in the sequence, we would need to change the masks accordingly to allow FIM with loss masking. At this point, I think we should not, and should instead fail if FIM is used together with loss masking.

Reply:

> Throwing an error in this function now.
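The reply above says the incompatible combination now raises an error. The PR's exact guard is not shown in this scrape, so this is only a hedged sketch of such a check; the function name and flag names are assumptions.

```python
def check_fim_loss_masking_compatibility(
    use_fim: bool, use_loss_masking_spans: bool
) -> None:
    """Reject FIM combined with loss masking spans.

    FIM reorders tokens within the sequence, which would invalidate
    token spans derived from the original character offsets, so the
    combination is disallowed outright. Hypothetical guard, sketched
    from the review discussion.
    """
    if use_fim and use_loss_masking_spans:
        raise ValueError(
            "Loss masking spans are not supported together with FIM."
        )
```

Failing fast here is cheaper than silently training on losses computed against misaligned spans.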