WiP trainer implementation #30

Merged
XapaJIaMnu merged 82 commits into main from trainer on Dec 12, 2022

Conversation

XapaJIaMnu (Collaborator)

Working on the trainer, together with an example config and test files.

jelmervdl linked an issue Oct 4, 2022 that may be closed by this pull request
XapaJIaMnu marked this pull request as ready for review November 8, 2022 10:11
XapaJIaMnu requested a review from jelmervdl November 8, 2022 10:12
XapaJIaMnu (Collaborator, Author)

To run:

./trainer.py -c train_config.yml

The state tracker is a bit spaghetti-ish

ZJaume (Collaborator) commented Nov 9, 2022

Been doing a couple more runs. Right now the child marian process seems to be poorly controlled: if I press Ctrl+C, the process stays there.

self.trainer.stdin.writelines(batch)
self.state_tracker.update_seed()
# Termination condition, check if we finished training.
stop_training = self.dataset_objects[stage.until_dataset].epoch >= stage.until_epoch
A collaborator commented:

Maybe we need a self.trainer.wait() somewhere here? The communicate function takes complete control of the subprocess, but it doesn't seem to fit the purpose of this function.
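
A minimal sketch of the distinction being discussed, not code from this PR: it assumes a marian-like child that consumes stdin, and the cat placeholder and stand-in batches are hypothetical.

import subprocess

# Hypothetical stand-in for the marian child; the real trainer launches
# the marian command from the config.
trainer = subprocess.Popen(["cat"], stdin=subprocess.PIPE, encoding="utf-8")

# communicate() is one-shot: it writes everything, closes stdin, and blocks
# until the child exits, so it can't interleave with a training loop:
#   trainer.communicate(input=all_batches)   # doesn't fit here

# Streaming instead: write batches as they become available...
for batch in (["line 1\n"], ["line 2\n"]):   # stand-in batches
    trainer.stdin.writelines(batch)

# ...then signal EOF and reap the child so it doesn't linger.
trainer.stdin.close()
trainer.wait()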

ZJaume (Collaborator) commented Nov 9, 2022

I think the wait() should keep trainer.py from exiting before marian, but it won't kill marian on Ctrl+C. Maybe try to catch the Ctrl+C exception and then wait()?

ZJaume (Collaborator) commented Nov 9, 2022

I think it would be:

try:
    # train loop
except KeyboardInterrupt:
    trainer.wait()
    raise

A collaborator commented:

Add a trainer.stdin.close() before that trainer.wait() and that should stop the trainer cleanly. Alternatively, you can use trainer.kill() to terminate it immediately.

What I often do is:

try:
    # train loop
finally:
    trainer.stdin.close()
    trainer.wait()

That way, you stop the trainer cleanly in both the interrupted and the normal case. Although in this use case, trainer.kill() in an except might be helpful so you don't have to wait too long.
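
Putting the thread's suggestions together, a hedged sketch rather than this PR's final code (the batches() generator and the cat placeholder are hypothetical): close stdin and wait in a finally for the clean path, and kill on Ctrl+C so you don't sit through a slow shutdown.

import subprocess

def batches():
    # Hypothetical batch source standing in for the dataset readers.
    yield ["line 1\n"]
    yield ["line 2\n"]

trainer = subprocess.Popen(["cat"], stdin=subprocess.PIPE, encoding="utf-8")

try:
    for batch in batches():
        trainer.stdin.writelines(batch)
except KeyboardInterrupt:
    trainer.kill()             # don't wait out a slow shutdown on Ctrl+C
    raise
finally:
    try:
        trainer.stdin.close()  # EOF lets a still-running child exit cleanly
    except BrokenPipeError:
        pass                   # child already gone, e.g. after kill()
    trainer.wait()             # reap the child in every case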

XapaJIaMnu (Collaborator, Author)

Should be fixed

jelmervdl and others added 25 commits December 5, 2022 15:22
Very much helping me with experimenting
Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread, so that the reading thread can continue reading (and deduplicating) while shuffling and writing happen in the background; a sketch of this split follows the commit list. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we prefer pigz over gzip, as it seems to be quite a bit faster even for reading (which I didn't expect, since reading is inherently single-threaded; maybe it's just better implemented in pigz?).
(I disabled the type language server, can you tell?)
Also more comments to make more sense of the DatasetReader bits
Using a dict for the dataset paths still gives you the freedom to change a path without having to update the entire configuration file, but it also encourages people to use a single file for the dataset (and deduplicate it, etc.).
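
A minimal sketch of the reader/writer split described in the shuffler commit above, under stated assumptions: the names (read_and_dedupe, shuffle_and_write), the chunk size, and the plain-file output are illustrative rather than this PR's code, and the real shuffler would additionally route compressed IO through pigz or gzip.

import random
import threading
from queue import Queue

def read_and_dedupe(lines, chunk_size, out):
    # Reader thread's half: deduplicate and hand off in-memory chunks,
    # never writing unshuffled data to disk.
    seen, chunk = set(), []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        chunk.append(line)
        if len(chunk) == chunk_size:
            out.put(chunk)
            chunk = []
    if chunk:
        out.put(chunk)
    out.put(None)  # sentinel: no more chunks coming

def shuffle_and_write(path, inbox):
    # Writer thread's half: shuffle each chunk and write it out while
    # the reader keeps reading in parallel.
    with open(path, "w", encoding="utf-8") as fh:
        while (chunk := inbox.get()) is not None:
            random.shuffle(chunk)
            fh.writelines(chunk)

q = Queue(maxsize=4)  # bounded, so reading can't run far ahead of writing
data = [f"sentence {i}\n" for i in range(10)] * 2  # duplicates to weed out

writer = threading.Thread(target=shuffle_and_write, args=("shuffled.txt", q))
writer.start()
read_and_dedupe(data, chunk_size=5, out=q)  # main thread plays the reader
writer.join()
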
XapaJIaMnu (Collaborator, Author)

Closing, superseded by #52.

XapaJIaMnu closed this Dec 12, 2022
XapaJIaMnu reopened this Dec 12, 2022
XapaJIaMnu merged commit 6475028 into main Dec 12, 2022
jelmervdl deleted the trainer branch January 5, 2023 10:09
Development

Successfully merging this pull request may close these issues:

Curriculum training & on-the-fly data augmentation