
Trainer clean up #52

Merged 43 commits from trainer-clean-up into trainer on Dec 12, 2022

Conversation

@jelmervdl (Collaborator) commented Nov 29, 2022

Restructured the trainer codebase a bit to be less stateful, more functional, better typed, and even added some unit tests. Also implemented async shuffling.

Don't try to read the diff. Just go to trainer.py, scroll all the way to the `__name__ == "__main__"` section, and start reading from there. It will make much more sense.

Todo:

  • maybe implement an alternative to shuffle.sh that doesn't shuffle in memory (see the sketch below). For speed that should be C++, but for portability it would be nice to have it in Python inside this project as well.
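
For reference, one possible shape for such a Python alternative is a two-pass disk-backed shuffle: scatter lines randomly into temporary shards, then shuffle each shard in memory. This is only a sketch under assumed requirements, not code from this PR, and `disk_shuffle` is a made-up name:

```python
import random
import sys
import tempfile

def disk_shuffle(lines, n_shards=32, seed=None):
    """Shuffle lines without holding the whole input in memory.

    Pass 1 scatters each line uniformly at random into one of n_shards
    temporary files; pass 2 shuffles each shard in memory (roughly
    1/n_shards of the data at a time) and emits it. Scatter-then-shuffle
    produces an unbiased shuffle of the whole input.
    """
    rand = random.Random(seed)
    shards = [tempfile.TemporaryFile(mode='w+') for _ in range(n_shards)]
    for line in lines:
        rand.choice(shards).write(line)  # assumes lines end in '\n'
    for shard in shards:
        shard.seek(0)
        chunk = shard.readlines()
        rand.shuffle(chunk)
        shard.close()
        yield from chunk

if __name__ == '__main__':
    sys.stdout.writelines(disk_shuffle(sys.stdin))
```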

@jelmervdl requested a review from XapaJIaMnu on November 29, 2022
@jelmervdl marked this pull request as ready for review on November 30, 2022
@XapaJIaMnu (Collaborator) left a comment


Looks good to me. Most points are minor.

One major point: what is the purpose of API v2? Technically we haven't released this yet, so we don't need to specify it. We should indeed have an API string in the yml, but let's discuss the v2 API offline.

(Four inline review threads on trainer/shuffle.py, since resolved.)


Inline comment on:

```python
@dataclass(frozen=True)
class Modifier:
```

Move the modifier bits close to this bit (or just above it).

(Five inline review threads on trainer/trainer.py, since resolved.)
@jelmervdl (Collaborator, Author) commented:

The curriculum v2 and embedded deduplication both come from me trying to remove the step between having the output of cleaning and categorising your datasets, and having the input data for your training.

My idea was that the amount of effort necessary for adding another dataset should be pretty minimal. With the above implementation you can just add some more filenames to your yaml and it will just work. The current implementation is not perfect, e.g. if a sentence pair occurs in both clean and dirty and both datasets are used in a stage, those pairs could be seen more often. But at least they're guaranteed to occur only once in each dataset (see the sketch below).
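
For illustration, that per-dataset guarantee can be as small as a hash-set filter applied while reading each dataset. This is a hypothetical sketch, not the PR's actual code:

```python
import hashlib

def unique_lines(lines):
    """Yield each line at most once.

    Intended to be called once per dataset: pairs are then deduplicated
    within a dataset but can still occur in several datasets. Storing
    8-byte digests instead of the lines themselves bounds memory by the
    number of unique pairs, not their total length.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(line.encode('utf-8'), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```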

I'm also open to adding something more advanced, like the score-based deduplication discussed in #41, to the end of the cleaning pipeline. The nasty bit about that is that you have to rescore & regenerate your training datasets. By integrating it into the cleaning pipeline we could keep the scores cached, though.

But we do have to offer something to bridge the gap between the cleaned datasets and the trainer input. We shouldn't expect our end-users to dive into bash and concat + dedupe files themselves.

Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread, so that reading (and deduplicating) can continue while shuffling and writing happen in the background. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we now prefer pigz over gzip, as even for reading it seems to be quite a bit faster (which I didn't expect, since reading is inherently single-threaded; maybe it's just better implemented in pigz?). A sketch of the thread split is below.
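
A minimal sketch of that reader/writer split, assuming chunks of lines handed over through a bounded queue (names and chunk size are illustrative, not the PR's exact code):

```python
import queue
import random
import threading

def shuffle_in_background(lines, write, chunk_size=1_000_000):
    """Read on the calling thread while a background thread shuffles and
    writes each chunk, so reading (and deduplicating) never stalls on IO."""
    chunks = queue.Queue(maxsize=2)  # bounded: reader stays at most one chunk ahead

    def shuffle_and_write():
        while True:
            chunk = chunks.get()
            if chunk is None:  # sentinel: input exhausted
                return
            random.shuffle(chunk)
            write(chunk)

    writer = threading.Thread(target=shuffle_and_write)
    writer.start()

    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            chunks.put(chunk)  # blocks if the writer is falling behind
            chunk = []
    if chunk:
        chunks.put(chunk)
    chunks.put(None)
    writer.join()
```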
@jelmervdl (Collaborator, Author) commented:

I reworked shuffle.py a bit. On sigyn I tested it on a bit of data and it shuffled it in 7 minutes (it was 17 with the previous implementation). For comparison, doing it all in memory and without Python, e.g. `pigz -cd *.gz | shuf | pigz -c > shuffled.gz`, takes 5 minutes. But over lunch I decided that trying to optimise this further in Python, with its Global Interpreter Lock (as opposed to, say, C++), is silly.
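
As an aside, the "prefer pigz for reading" idea from the previous comment could look something like this, falling back to the stdlib gzip module when pigz isn't installed (a hypothetical helper, not the PR's actual code):

```python
import gzip
import shutil
import subprocess

def open_gzipped(path):
    """Open a .gz file for reading, preferring pigz when available so that
    decompression runs in a separate process, overlapping with our own work."""
    if shutil.which('pigz'):
        child = subprocess.Popen(['pigz', '-cd', path], stdout=subprocess.PIPE)
        return child.stdout  # a real implementation would also reap the child
    return gzip.open(path, 'rb')
```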

On a resolved thread in trainer/trainer.py:

Using a dict for the dataset paths still gives you the freedom to change the path without having to update the entire configuration file, but it also encourages people to use a single file per dataset (and deduplicate it, etc.). See the illustrative snippet below.
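
Hypothetically, such a dict-style dataset section could look like the following; this is purely illustrative and the real schema may differ:

```yaml
# Illustrative only; keys are the names stages refer to,
# each value is a single (deduplicated) file path.
datasets:
  clean: data/clean.tsv.gz
  medium: data/medium.tsv.gz
  dirty: data/dirty.tsv.gz
```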
@XapaJIaMnu merged commit 9533d52 into trainer on Dec 12, 2022
@jelmervdl deleted the trainer-clean-up branch on January 5, 2023