Error at validation step when fine-tuning a model #18
Comments
I wonder if the problem has something to do with how many processes multiprocessing is spawning. I can get a process name when I investigate the pickling error with:
Update: no, it actually looks as if the problem is occurring the first time the validation step runs. I've copy-pasted the same directory for both training and validation (mostly because Ryan told me to do that when he showed me how to run the fine-tuning earlier).

```
--> 201 validate(eval_loader, model, criterion, epoch, config)
```

Variables: `config` and `model` (details collapsed in the original comment).
I've left the validation directory GUI field empty, and it now seems to be running without hitting the error.
Hi Genevieve,

I think the error message is exactly right:

```
AttributeError: Can't pickle local object 'FactorPad.__init__.<locals>.pad_func'
```

The FactorPad transform, defined here, uses a closure that doesn't play nicely with pickling. I'm not sure how I've never run into this issue on any of the Mac, Linux, and Windows systems I've tested the plugin on. I've confirmed that refactoring the transform enables pickling. My guess is that everything will work if you replace the FactorPad function in source with:

```python
class FactorPad(A.Lambda):
    def __init__(self, factor=128):
        # Set the attribute before handing the bound methods to A.Lambda
        self.factor = factor
        super().__init__(image=self.pad_func, mask=self.pad_func)

    def pad_func(self, x, **kwargs):
        return factor_pad(x, factor=self.factor)
```

I'll release this fix in the next version of empanada. In the meantime, another valid workaround is to use a custom config to disable multiprocessing. E.g., download this finetuning config file, set workers to 0, point the "Custom config" field in the plugin to this modified config file, and run.

P.S. It looks like the model was running on the GPU, so I'm not sure why you didn't see utilization increase. "Using CPU for training" gets printed to the terminal if a GPU isn't being used. I suppose an equivalent "Using GPU for training" statement would be nice.
Ok, I'll try that workaround.
👍 Yes, that'd be good to have.
Upgrading to empanada-dl v1.6.0 and empanada-napari v0.2.1 should fix this issue.
That's great, thank you!
Summary
I'm trying to fine-tune a model in empanada-napari and am encountering a pickling error at epoch 19.
EDIT: It turns out this happens only when the validation step is run; leaving the validation directory empty in the GUI avoids the error. I had copied the same training directory path into the validation path, since I think that's what Ryan told me to do when he showed me how to fine-tune a model. The fine-tuning docs page also says this should work.
What I expected to happen
I expected the fine-tuning training to complete successfully after 100 iterations (the number specified in the GUI; I'm using the default values).
What happened instead
The fine-tuning training appears to start successfully, but at epoch 19 I get the error shown below under "To reproduce".
I don't know why the error always happens at epoch 19.
System details
Operating system: Windows 10
CUDA version: 11.6
conda environment python version: 3.9.10
empanada-napari version: 0.2.0
empanada-dl version: 0.1.4
I think empanada-napari should be using the GPU for fine-tuning the model, but I can't see any confirmation of that. There's also no increase in GPU memory use when I start the fine-tuning, so maybe it's not actually using the GPU at all?
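One generic way to check whether PyTorch (which empanada relies on for training) can actually see the GPU is to run a quick check from the same conda environment. This is a standard PyTorch sanity check, not an empanada-specific command:

```python
import torch

# Reports whether a CUDA device is visible to this Python environment.
# If this prints False, training will silently fall back to the CPU.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Name of the first visible GPU and current allocated memory on it.
    print("Device:", torch.cuda.get_device_name(0))
    print("Memory allocated (bytes):", torch.cuda.memory_allocated(0))
```

If this prints `CUDA available: False` inside the environment where napari runs, the plugin cannot be using the GPU regardless of what the system has installed.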
To reproduce

```
AttributeError: Can't pickle local object 'FactorPad.__init__.<locals>.pad_func'
```

Full error message: (collapsed in the original issue)