Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix filepath and save state_dict for model for checkpoint finalize #46

Merged
merged 2 commits into from
May 9, 2019
Merged

Fix filepath and save state_dict for model for checkpoint finalize #46

merged 2 commits into from
May 9, 2019

Conversation

vedanuj
Copy link
Contributor

@vedanuj vedanuj commented May 6, 2019

  • Fixing bug in pth_filepath and params_filepath that was adding save_dir twice in the path.
  • Fixing bug where the self.trainer.model when tried to save was throwing error. We need to save only the model state_dict. Fixes issue Do I need to download the tremendous weights again? #44 error.

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label May 6, 2019
@apsdehal
Copy link
Contributor

apsdehal commented May 6, 2019

I actually wanted to save the whole model itself for demo purposes. I guess it is fine with weights also. But, we probably should look into why this issue is exactly happening.

@@ -36,15 +36,14 @@ def __init__(self, trainer):
self.config["log_foldername"] = self.ckpt_foldername
self.ckpt_foldername = os.path.join(self.save_dir, self.ckpt_foldername)
self.pth_filepath = os.path.join(
self.save_dir, self.ckpt_foldername,
self.ckpt_foldername,
self.ckpt_prefix + self.model_name + "_final.pth"
)
self.params_filepath = os.path.join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this line as it doesn't server any purpose now.

@vedanuj
Copy link
Contributor Author

vedanuj commented May 8, 2019

Upon investigating more I found why we are not being able to save the model as a whole. So the problem is with the Logger class. Python logging module use reentrant locks which cannot be pickled. https://bugs.python.org/issue30520. There was a bug in Python <=3.5 whereby _thread.RLock objects did not raise an error when pickled (they cannot be pickled). https://bugs.python.org/issue29168

We add Logger instance to PythiaLoss, Metrics and the BaseModel itself. Removing the global logger from these classes gets rid of the issue but not sure if we want to do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants