WiP trainer implementation #30
Conversation
To run do:
The state tracker is a bit spaghetti-ish
Been doing a couple more runs. Right now the child marian process seems to be poorly controlled. If I press CTRL+C the process stays there.
trainer/trainer.py
Outdated
self.trainer.stdin.writelines(batch)
self.state_tracker.update_seed()
# Termination condition, check if we finished training.
stop_training = self.dataset_objects[stage.until_dataset].epoch >= stage.until_epoch
Maybe we need a self.trainer.wait() somewhere here? The communicate function takes complete control of the subprocess, but it doesn't seem to fit the purpose of this function.
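For illustration, a minimal sketch of that difference with a stand-in child process (not this PR's actual code):

import subprocess

# Stand-in child so the example runs without marian.
proc = subprocess.Popen(["cat", "-n"], stdin=subprocess.PIPE, encoding="utf-8")

# communicate() hands over all input at once, closes stdin, and blocks until
# the child exits, so it cannot interleave writes with the training loop:
#     out, err = proc.communicate(input="line 1\nline 2\n")

# Streaming instead: write whenever a batch is ready, close stdin to signal
# EOF, then wait() so the child is reaped.
for batch in (["line 1\n"], ["line 2\n"]):
    proc.stdin.writelines(batch)
proc.stdin.close()
proc.wait()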
I think the wait() should prevent trainer.py from ending before marian, but it won't kill marian on CTRL+C. Maybe try to catch the CTRL+C exception and then wait()?
I think it would be:

try:
    # train loop
except KeyboardInterrupt:
    trainer.wait()
    raise
Add a trainer.stdin.close() before that trainer.wait() and that should stop the trainer cleanly. Alternatively you can use trainer.kill() to terminate it immediately.
What I often do is:

try:
    # train loop
finally:
    trainer.stdin.close()
    trainer.wait()

That way, in both the interrupted and the normal case you end by stopping the trainer cleanly. Although in this use case I'm not sure whether a trainer.kill() in an except might be helpful, so you don't have to wait too long.
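Putting the thread's suggestions together, a self-contained sketch could look like this (the marian command line and the batch source are placeholders, not this PR's actual code):

import subprocess

# Placeholder invocation; the real command is built in trainer.py.
trainer = subprocess.Popen(["marian", "--config", "train.yml"],
                           stdin=subprocess.PIPE, encoding="utf-8")

batches = [["source sentence\ttarget sentence\n"]]  # placeholder batch source

try:
    for batch in batches:
        trainer.stdin.writelines(batch)
except KeyboardInterrupt:
    # On CTRL+C, don't wait for marian to drain its input; stop it right away.
    trainer.kill()
    raise
finally:
    # Normal case: closing stdin signals EOF so marian can finish and exit,
    # and wait() reaps the child so it doesn't outlive trainer.py.
    try:
        trainer.stdin.close()
    except BrokenPipeError:
        pass
    trainer.wait()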
Should be fixed
Very much helping me with experimenting
Nick mentioned that we're essentially IO-bound, so let's make use of that. This splits the shuffler into a reading thread and a shuffling+writing thread (sketched below), so that the reading thread can continue reading (and deduplicating) while shuffling and writing happen in the background. It also no longer writes the unshuffled chunks to disk, which saves a lot of IO. Finally, we prefer pigz over gzip, as it seems to be quite a bit faster even for reading (which I didn't expect, since reading is inherently single-threaded. Maybe it's just better implemented in pigz?)
(I disabled the type language server, can you tell?)
Also added more comments to make more sense of the DatasetReader bits.
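As a rough sketch of the reader/writer split described above (the names, chunk handling, and pigz invocation are assumptions, not this PR's implementation):

import random
import subprocess
import threading
from queue import Queue

def shuffle_chunks(input_paths, chunk_size, output_template):
    # The calling thread reads and deduplicates; a worker thread shuffles
    # and writes finished chunks in the background.
    chunks = Queue(maxsize=2)  # small buffer keeps memory bounded

    def shuffle_and_write():
        index = 0
        while True:
            chunk = chunks.get()
            if chunk is None:          # sentinel: reading is done
                return
            random.shuffle(chunk)
            # Compress through pigz; the unshuffled chunk never touches disk.
            with open(output_template.format(index), "wb") as fout:
                pigz = subprocess.Popen(["pigz", "-c"], stdin=subprocess.PIPE,
                                        stdout=fout, encoding="utf-8")
                pigz.stdin.writelines(chunk)
                pigz.stdin.close()
                pigz.wait()
            index += 1

    writer = threading.Thread(target=shuffle_and_write)
    writer.start()

    seen, chunk = set(), []
    for path in input_paths:
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                if line in seen:       # deduplicate while reading
                    continue
                seen.add(line)
                chunk.append(line)
                if len(chunk) >= chunk_size:
                    chunks.put(chunk)  # hand off, keep reading right away
                    chunk = []
    if chunk:
        chunks.put(chunk)
    chunks.put(None)
    writer.join()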
Using a dict for the dataset paths still gives you the freedom of changing the path without having to update the entire configuration file, but it also encourages people to use a single file for the dataset (and deduplicate it etc.)
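For instance, a hypothetical config layout (the schema below is an assumption; only until_dataset and until_epoch come from the code in this PR) could map each dataset name to a single path, so the stages only refer to names:

config = {
    # Hypothetical layout: each dataset name maps to exactly one file, so
    # moving a file means changing a single line.
    "datasets": {
        "clean": "data/clean.dedup.tsv.gz",
        "backtranslated": "data/backtranslated.dedup.tsv.gz",
    },
    # Stages refer to datasets by name; until_dataset / until_epoch mirror
    # the stage.until_dataset / stage.until_epoch fields used in trainer.py.
    "stages": [
        {"until_dataset": "clean", "until_epoch": 2},
        {"until_dataset": "backtranslated", "until_epoch": 1},
    ],
}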
Trainer clean up
Close, superseded by #52
Working on the trainer, together with an example config and test files.