WiP trainer implementation #30

Merged
XapaJIaMnu merged 82 commits into main from trainer on Dec 12, 2022

Conversation

XapaJIaMnu (Collaborator)

Working on the trainer, together with an example config and test files.

jelmervdl linked an issue Oct 4, 2022 that may be closed by this pull request
XapaJIaMnu marked this pull request as ready for review November 8, 2022 10:11
XapaJIaMnu requested a review from jelmervdl November 8, 2022 10:12
XapaJIaMnu (Collaborator, Author)

To run:

./trainer.py -c train_config.yml

The state tracker is a bit spaghetti-ish

ZJaume (Collaborator) commented Nov 9, 2022

Been doing a couple more runs. Right now the child marian process seems to be poorly controlled: if I press Ctrl+C, the process stays there.

self.trainer.stdin.writelines(batch)
self.state_tracker.update_seed()
# Termination condition, check if we finished training.
stop_training = self.dataset_objects[stage.until_dataset].epoch >= stage.until_epoch
A collaborator commented:

Maybe we need a self.trainer.wait() somewhere here? The communicate function takes complete control of the subprocess, but it doesn't seem to fit the purpose of this function.
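
A minimal sketch of the distinction being discussed, not code from this PR: it assumes a marian-like child that consumes stdin, and the cat placeholder and stand-in batches are hypothetical.

import subprocess

# Hypothetical stand-in for the marian child; the real trainer launches
# the marian command from the config.
trainer = subprocess.Popen(["cat"], stdin=subprocess.PIPE, encoding="utf-8")

# communicate() is one-shot: it writes everything, closes stdin, and blocks
# until the child exits, so it can't interleave with a training loop:
#   trainer.communicate(input=all_batches)   # doesn't fit here

# Streaming instead: write batches as they become available...
for batch in (["line 1\n"], ["line 2\n"]):   # stand-in batches
    trainer.stdin.writelines(batch)

# ...then signal EOF and reap the child so it doesn't linger.
trainer.stdin.close()
trainer.wait()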

ZJaume (Collaborator) commented Nov 9, 2022

I think the wait() should keep trainer.py from exiting before marian, but it won't kill marian on Ctrl+C. Maybe try to catch the Ctrl+C exception and then wait()?

ZJaume (Collaborator) commented Nov 9, 2022

I think it would be:

try:
    # train loop
except KeyboardInterrupt:
    trainer.wait()
    raise

A collaborator commented:

Add a trainer.stdin.close() before that trainer.wait() and that should stop the trainer cleanly. Alternatively, you can use trainer.kill() to terminate it immediately.

What I often do is:

try:
    # train loop
finally:
    trainer.stdin.close()
    trainer.wait()

That way, you stop the trainer cleanly in both the interrupted and the normal case. Although in this use case, trainer.kill() in an except might be helpful so you don't have to wait too long.
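
Putting the thread's suggestions together, a hedged sketch rather than this PR's final code (the batches() generator and the cat placeholder are hypothetical): close stdin and wait in a finally for the clean path, and kill on Ctrl+C so you don't sit through a slow shutdown.

import subprocess

def batches():
    # Hypothetical batch source standing in for the dataset readers.
    yield ["line 1\n"]
    yield ["line 2\n"]

trainer = subprocess.Popen(["cat"], stdin=subprocess.PIPE, encoding="utf-8")

try:
    for batch in batches():
        trainer.stdin.writelines(batch)
except KeyboardInterrupt:
    trainer.kill()             # don't wait out a slow shutdown on Ctrl+C
    raise
finally:
    try:
        trainer.stdin.close()  # EOF lets a still-running child exit cleanly
    except BrokenPipeError:
        pass                   # child already gone, e.g. after kill()
    trainer.wait()             # reap the child in every case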

XapaJIaMnu (Collaborator, Author)

Should be fixed

jelmervdl and others added 25 commits December 5, 2022 15:22
Very much helping me with experimenting
Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread, so that the reading thread can continue reading (and deduplicating) while shuffling and writing happen in the background; a sketch of this split follows the commit list. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we prefer pigz over gzip, as it seems to be quite a bit faster even for reading (which I didn't expect, since reading is inherently single-threaded; maybe it's just better implemented in pigz?).
(I disabled the type language server, can you tell?)
Also more comments to make more sense of the DatasetReader bits
Using a dict for the dataset paths still gives you the freedom to change a path without having to update the entire configuration file, but it also encourages people to use a single file for the dataset (and deduplicate it, etc.).
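
A minimal sketch of the reader/writer split described in the shuffler commit above, under stated assumptions: the names (read_and_dedupe, shuffle_and_write), the chunk size, and the plain-file output are illustrative rather than this PR's code, and the real shuffler would additionally route compressed IO through pigz or gzip.

import random
import threading
from queue import Queue

def read_and_dedupe(lines, chunk_size, out):
    # Reader thread's half: deduplicate and hand off in-memory chunks,
    # never writing unshuffled data to disk.
    seen, chunk = set(), []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        chunk.append(line)
        if len(chunk) == chunk_size:
            out.put(chunk)
            chunk = []
    if chunk:
        out.put(chunk)
    out.put(None)  # sentinel: no more chunks coming

def shuffle_and_write(path, inbox):
    # Writer thread's half: shuffle each chunk and write it out while
    # the reader keeps reading in parallel.
    with open(path, "w", encoding="utf-8") as fh:
        while (chunk := inbox.get()) is not None:
            random.shuffle(chunk)
            fh.writelines(chunk)

q = Queue(maxsize=4)  # bounded, so reading can't run far ahead of writing
data = [f"sentence {i}\n" for i in range(10)] * 2  # duplicates to weed out

writer = threading.Thread(target=shuffle_and_write, args=("shuffled.txt", q))
writer.start()
read_and_dedupe(data, chunk_size=5, out=q)  # main thread plays the reader
writer.join()
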
XapaJIaMnu (Collaborator, Author)

Closing, superseded by #52.

XapaJIaMnu closed this Dec 12, 2022
XapaJIaMnu reopened this Dec 12, 2022
XapaJIaMnu merged commit 6475028 into main Dec 12, 2022
jelmervdl deleted the trainer branch January 5, 2023 10:09
Development

Successfully merging this pull request may close these issues:

Curriculum training & on-the-fly data augmentation