Trainer clean up #52
Conversation
More parts, but each part is smaller and does less.
It will dump correctly on ctrl-c anyway
Shuffles the next dataset in advance
Move the trainer's test.py to trainer_test.py
Because it has been superseded by shuffle.py
I added more comments explaining my shenanigans. Also I fixed a known compatibility issue with Python 3.8.
It is very much helping me with experimenting.
Looks good to me. Most points are minor.
One major point:
What is the purpose of API v2? Technically we haven't released this, so we don't need to specify it. We should indeed have an API string in the yml. We should discuss the v2 API offline.
@dataclass(frozen=True)
class Modifier:
Move the modifier bits close to this bit (or just above it).
The curriculum v2 and embedded deduplication are all related to me trying to remove the need for a step between having the output from cleaning and categorising your datasets, and having the input data for your training. My idea was that the amount of effort necessary for adding another dataset should be pretty minimal: with the above implementation you can just add some more filenames to your yaml and it will just work.

The current implementation is not perfect, e.g. if a sentence pair occurs in both clean and dirty and both datasets are used in a stage, those pairs could be seen more often. But at least they're guaranteed to occur only once in each dataset.

I'm also open to adding something more advanced, like the score-based deduplication discussed in #41, to the end of the cleaning pipeline. The nasty bit about that is that you have to rescore & regenerate your training datasets; by integrating it into the cleaning pipeline we could keep the scores cached, though. But we do have to offer something to bridge the gap from the cleaned datasets to the trainer input. We shouldn't expect our end-users to dive into bash and concat + dedupe files themselves.
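To make that per-dataset guarantee concrete, here is a minimal sketch of deduplicating while reading a single dataset. The function name and the hashing choice are mine for illustration; this is not the trainer's actual code.

```python
import gzip
import hashlib
from typing import Iterator


def read_unique(path: str) -> Iterator[str]:
    """Yield each line (sentence pair) of a gzipped TSV file at most once.

    Dedupes within a single dataset only, so a pair present in both a
    clean and a dirty dataset can still show up once per dataset.
    """
    seen = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Store a fixed-size hash rather than the full line to keep memory in check.
            key = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
            if key not in seen:
                seen.add(key)
                yield line
```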
Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread, so that the reading thread can continue reading (and deduplicating) while shuffling + writing happens in the background. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we prefer pigz over gzip, as even for reading it seems to be quite a bit faster (which I didn't expect, since reading is inherently single-threaded; maybe it's just better implemented in pigz?).
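Roughly, the split could look like the sketch below. It is a simplified illustration, not the actual shuffle.py code; the function name, chunk size and queue bound are assumptions.

```python
import random
import threading
from queue import Queue
from typing import IO, Iterable


def shuffle_to(lines: Iterable[str], out: IO[str], chunk_lines: int = 1_000_000) -> None:
    """Read in the calling thread; shuffle + write completed chunks in the background."""
    queue = Queue(maxsize=2)  # bound memory: at most two chunks in flight

    def shuffle_and_write() -> None:
        while True:
            chunk = queue.get()
            if chunk is None:         # sentinel: the reader is done
                return
            random.shuffle(chunk)     # shuffling + writing happen off the reading thread
            out.writelines(chunk)

    worker = threading.Thread(target=shuffle_and_write)
    worker.start()

    chunk = []
    for line in lines:                # reading (and deduplicating) continues here
        chunk.append(line)
        if len(chunk) >= chunk_lines:
            queue.put(chunk)          # hand the chunk off and keep reading immediately
            chunk = []
    if chunk:
        queue.put(chunk)
    queue.put(None)
    worker.join()
```

Reading through pigz would then just mean feeding `lines` from a `subprocess.Popen(["pigz", "-cd", path], stdout=subprocess.PIPE)` pipe instead of going through Python's gzip module.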
I reworked
(I disabled the type language server, can you tell?)
Also added more comments to make more sense of the DatasetReader bits.
Using a dict for the dataset paths still gives you the freedom of changing the path without having to update the entire configuration file, but it also encourages people to use a single file for the dataset (and deduplicate it etc.)
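For illustration only, the parsed config could end up looking roughly like this; the keys and paths are made up for the example, not the project's actual schema.

```python
# Datasets are keyed by name; each name points at a single (deduplicated) file.
config = {
    "datasets": {
        "clean": "data/clean.tsv.gz",
        "dirty": "data/dirty.tsv.gz",
    },
    "stages": [
        {"clean": 1.0},                # stages refer to datasets by name,
        {"clean": 0.7, "dirty": 0.3},  # so changing a path touches one line
    ],
}

# Resolving a stage to file paths is then a simple lookup.
for stage in config["stages"]:
    paths = {name: config["datasets"][name] for name in stage}
```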
Restructured the trainer codebase a bit to be less stateful, more functional, better typed, and even added some unit tests. Also implemented async shuffling.
Don't try to read the diff. Just go to trainer.py, scroll all the way to the __name__ == "__main__" section, and start from there. It will make much more sense from there.

Todo:
A shuffle.sh that doesn't shuffle in memory. For speed that should be C++, but for portability it would be nice to have it in Python inside this project as well.
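As a rough sketch of the kind of out-of-memory shuffle meant here, a scatter-into-buckets approach could work; this is entirely hypothetical and not what shuffle.sh currently does.

```python
import random
import tempfile
from typing import Iterable, Iterator


def external_shuffle(lines: Iterable[str], buckets: int = 64) -> Iterator[str]:
    """Scatter lines into temporary files at random, then shuffle each
    bucket on its own; only one bucket is held in memory at a time."""
    files = [tempfile.TemporaryFile(mode="w+", encoding="utf-8") for _ in range(buckets)]
    for line in lines:
        files[random.randrange(buckets)].write(line)
    for fh in files:
        fh.seek(0)
        bucket = fh.readlines()
        random.shuffle(bucket)
        yield from bucket
        fh.close()
```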