Trainer clean up #52
Conversation
More parts, but each part is smaller and does less.
It will dump correctly on ctrl-c anyway
Shuffles the next dataset in advance
Move the trainer's test.py to trainer_test.py
Because it has been superseded by shuffle.py
I added more comments explaining my shenanigans. Also I fixed a known compatibility issue with Python 3.8.
It is very much helping me with experimenting.
Looks good to me. Most points are minor.
One major point:
What is the purpose of API v2? Technically we haven't released this, so we don't need to specify it. We should indeed have an API string in the yml. We should discuss the v2 API offline.
@dataclass(frozen=True)
class Modifier:
Move the modifier bits close to this bit (or just above it).
The curriculum v2 and embedded deduplication are all related to me trying to remove the need for a step between having the output from cleaning and categorising your datasets, and having the input data for your training. My idea was that the amount of effort necessary for adding another dataset should be pretty minimal: with the above implementation you can just add some more filenames to your yaml and it will just work.

The current implementation is not perfect, e.g. if a sentence pair occurs in both clean and dirty and both datasets are used in a stage, those pairs could be seen more often. But at least they're guaranteed to occur only once in each dataset.

I'm also open to adding something more advanced, like the score-based deduplication discussed in #41, to the end of the cleaning pipeline. The nasty bit about that is that you have to rescore & regenerate your training datasets; by integrating it into the cleaning pipeline we could keep the scores cached, though. But we do have to offer something to bridge the gap from the cleaned datasets to the trainer input. We shouldn't expect our end-users to dive into bash and concat + dedupe files themselves.
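To make that per-dataset guarantee concrete, here is a minimal sketch of deduplicating while reading a single dataset. The function name and the hashing choice are mine for illustration; this is not the trainer's actual code.

```python
import gzip
import hashlib
from typing import Iterator


def read_unique(path: str) -> Iterator[str]:
    """Yield each line (sentence pair) of a gzipped TSV file at most once.

    Dedupes within a single dataset only, so a pair present in both a
    clean and a dirty dataset can still show up once per dataset.
    """
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Store a fixed-size hash rather than the full line to keep memory in check.
            key = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
            if key not in seen:
                seen.add(key)
                yield line
```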
Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread, so that the reading thread can continue reading (and deduplicating) while shuffling + writing happens in the background. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we prefer pigz over gzip, as even for reading it seems to be quite a bit faster (which I didn't expect, since reading is inherently single-threaded; maybe it's just better implemented in pigz?).
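Roughly, the split could look like the sketch below. It is a simplified illustration, not the actual shuffle.py code; the function name, chunk size and queue bound are assumptions.

```python
import random
import threading
from queue import Queue
from typing import IO, Iterable


def shuffle_to(lines: Iterable[str], out: IO[str], chunk_lines: int = 1_000_000) -> None:
    """Read in the calling thread; shuffle + write completed chunks in the background."""
    queue = Queue(maxsize=2)  # bound memory: at most two chunks in flight

    def shuffle_and_write() -> None:
        while True:
            chunk = queue.get()
            if chunk is None:         # sentinel: the reader is done
                return
            random.shuffle(chunk)     # shuffling + writing happen off the reading thread
            out.writelines(chunk)

    worker = threading.Thread(target=shuffle_and_write)
    worker.start()

    chunk = []
    for line in lines:                # reading (and deduplicating) continues here
        chunk.append(line)
        if len(chunk) >= chunk_lines:
            queue.put(chunk)          # hand the chunk off and keep reading immediately
            chunk = []
    if chunk:
        queue.put(chunk)
    queue.put(None)
    worker.join()
```

Reading through pigz would then just mean feeding `lines` from a `subprocess.Popen(["pigz", "-cd", path], stdout=subprocess.PIPE)` pipe instead of going through Python's gzip module.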
I reworked
(I disabled the type language server, can you tell?)
Also added more comments to make more sense of the DatasetReader bits.
Using a dict for the dataset paths still gives you the freedom of changing the path without having to update the entire configuration file, but it also encourages people to use a single file for the dataset (and deduplicate it etc.)
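For illustration only, the parsed config could end up looking roughly like this; the keys and paths are made up for the example, not the project's actual schema.

```python
# Datasets are keyed by name; each name points at a single (deduplicated) file.
config = {
    "datasets": {
        "clean": "data/clean.tsv.gz",
        "dirty": "data/dirty.tsv.gz",
    },
    "stages": [
        {"clean": 1.0},                # stages refer to datasets by name,
        {"clean": 0.7, "dirty": 0.3},  # so changing a path touches one line
    ],
}

# Resolving a stage to file paths is then a simple lookup.
for stage in config["stages"]:
    paths = {name: config["datasets"][name] for name in stage}
```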
Restructured the trainer codebase a bit to be less stateful, more functional, better typed, and even added some unit tests. Also implemented async shuffling.
Don't try to read the diff. Just go to trainer.py, scroll all the way to the __name__ == "__main__" section, and start from there. It will make much more sense from there.

Todo:
A shuffle.sh that doesn't shuffle in memory. For speed that should be C++, but for portability it would be nice to have it in Python inside this project as well.
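As a rough sketch of the kind of out-of-memory shuffle meant here, a scatter-into-buckets approach could work; this is entirely hypothetical and not what shuffle.sh currently does.

```python
import random
import tempfile
from typing import Iterable, Iterator


def external_shuffle(lines: Iterable[str], buckets: int = 64) -> Iterator[str]:
    """Scatter lines into temporary files at random, then shuffle each
    bucket on its own; only one bucket is held in memory at a time."""
    files = [tempfile.TemporaryFile(mode="w+", encoding="utf-8") for _ in range(buckets)]
    for line in lines:
        files[random.randrange(buckets)].write(line)
    for fh in files:
        fh.seek(0)
        bucket = fh.readlines()
        random.shuffle(bucket)
        yield from bucket
        fh.close()
```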