Tutorial DataModule leads to Validation data leakage in Multiprocessing #18638
IMO, if users think this is not clear, the example should be updated to set the generator on the random_split call. However, I would like to defend the example in that it is a code snippet that tries to highlight one specific point: it doesn't try to give you a full end-to-end training example, but rather just gets this message across so you can build your own.
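For context, the kind of snippet under discussion looks roughly like this (a minimal sketch loosely modeled on the MNIST-style docs example; paths, batch sizes, and the class name are placeholders, not the exact tutorial code):

```python
import lightning.pytorch as pl
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class TutorialStyleDataModule(pl.LightningDataModule):
    def prepare_data(self):
        # download only; Lightning calls this once per node, not once per rank
        MNIST("data", train=True, download=True)

    def setup(self, stage=None):
        full = MNIST("data", train=True, transform=transforms.ToTensor())
        # Unseeded split: under DDP, setup() runs once per process, and every
        # process starts from a different random state, so each rank can get a
        # different train/val partition -- this is the leakage being discussed.
        self.train_set, self.val_set = random_split(full, [55_000, 5_000])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=32)
```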
@awaelchli I don't think it's a good idea to encourage & even demonstrate code smells which can easily lead to silent & quite serious bugs later on. Moreover, maybe we shouldn't be conveying that you can just copy and paste when the reality is different for multi-GPU training? What if someone started their project based off a docs example, then scaled to multiple GPUs & went to publish their results without reading the finer points of setup() vs prepare_data()? I think it's quite possible they'd find a new backend. Also, there are many other places in the docs which encourage splits in setup() (& random is of course the most common type). I think all of these places should heavily emphasize this GOTCHA & the recommended best practice (e.g. deterministic splits).
Has my last reply left the wrong impression @profPlum? I'm sorry if it has frustrated you, but I have never said no to any of your comments and I have not objected to anything you said. I simply stated my opinion in addition to this ticket, and I made this clear in my message. You are proposing that we don't recommend doing the split in setup(), but you haven't said where it should be done otherwise. Could you clarify, please? Would my suggestion of setting the random seed on the random_split function not work, and why not?
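Concretely, the change being suggested is just a keyword argument on random_split (a sketch; full_dataset and the split sizes are placeholders):

```python
import torch
from torch.utils.data import random_split

# full_dataset is a placeholder for whatever Dataset is being split.
# A fixed generator makes the split deterministic and identical on every rank.
train_set, val_set = random_split(
    full_dataset, [55_000, 5_000], generator=torch.Generator().manual_seed(42)
)
```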
I'm sorry too; perhaps I conveyed the wrong tone. And you're right, I'm not 100% sure of a perfect solution either... I'm just worried that your response was too narrow, focusing on changing one line of code, whereas the problem is more widespread; it would be better to change every similar part of the documentation & clearly emphasize the GOTCHA + best practice. Honestly, it still scares me that even after a docs change it will be so easy to make such a serious mistake which could easily go undetected...
@awaelchli But regarding doing it in setup() vs somewhere else: I was thinking it would be nice if it were easy to do shuffling/splitting in prepare_data(). But I'm not sure how to do that in a way which is safe when training multiple models at once on a shared file system (e.g. slurm/cloud), without directly coupling the code to slurm or something else.
Sure, we can make that clearer in the docs. I'm struggling to understand, though, how and why Lightning should enforce such behavior. It doesn't fit well with the core philosophy.
I really don't think Lightning should change the default behavior of torch functions that the user calls, nor should it silently set the seed or random state if the user doesn't explicitly ask for it. Lightning's philosophy is to organize PyTorch code and provide good default configuration, but it does not in any way intend to override the behavior of torch on a functional level.
@awaelchli Like I said, I don't know of a perfect solution & I'm not recommending you change the behavior of PyTorch functions. But the reason I think it would be desirable to find a solution to this problem is that (as you say) pytorch-lightning is about making best practices the default (e.g. proper splitting) and letting users avoid engineering issues like this one. Out of curiosity, how is this problem dealt with in raw PyTorch in a multiprocessing setting? There must be some kind of best practice associated with random_split, right? P.S. Yes, I was considering saving the splits to disk in prepare_data & loading them in setup.
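A rough sketch of that save-in-prepare_data() / load-in-setup() idea (the class name, file path, and split size here are hypothetical, not an established Lightning pattern):

```python
import lightning.pytorch as pl
import torch
from torch.utils.data import Subset


class SplitOnDiskDataModule(pl.LightningDataModule):
    """Persist the split once, then reload the same indices on every rank."""

    def __init__(self, dataset, split_path="split_indices.pt", val_size=5_000):
        super().__init__()
        self.dataset = dataset
        self.split_path = split_path
        self.val_size = val_size

    def prepare_data(self):
        # called on a single process (per node by default): compute the split
        # once and persist the indices to the shared file system
        perm = torch.randperm(len(self.dataset))
        torch.save(
            {"train": perm[: -self.val_size], "val": perm[-self.val_size:]},
            self.split_path,
        )

    def setup(self, stage=None):
        # called on every process: all ranks load the identical indices
        idx = torch.load(self.split_path)
        self.train_set = Subset(self.dataset, idx["train"].tolist())
        self.val_set = Subset(self.dataset, idx["val"].tolist())
```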
Got it yes, so we're aligned generally.
In PyTorch, the best practice is to always set the seed for the random_split function, regardless of multiprocessing actually. You'd normally want to fix the splitting completely so that different experiments are fully comparable (e.g. when comparing validation metrics). Multiprocessing doesn't really play much into this, except that when new processes are created they all start with a different random state. That's why best practice in PyTorch is to set the global random seed. So in summary, best practice is to set (1) the global random seed and (2) the generator on random_split.
This amounts to the same practice in Lightning, except that for (1) we have the convenient seed_everything() function.
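Translated into a minimal Lightning snippet, that summary might look like this (a sketch; the seed value, dataset, and split sizes are placeholders):

```python
import torch
from torch.utils.data import random_split
from lightning.pytorch import seed_everything

# (1) fix the global random state; every process should call this with the
#     same value so all ranks start from the same state
seed_everything(42, workers=True)

# (2) additionally pin the split itself so it is identical across ranks and
#     across experiments (full_dataset is a placeholder for your Dataset)
train_set, val_set = random_split(
    full_dataset, [55_000, 5_000], generator=torch.Generator().manual_seed(42)
)
```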
Oh interesting, yeah, I was getting that impression, since Keras also does deterministic splitting. I guess then encouraging seeding is probably our best bet, for now.
@awaelchli Hi again, it just occurred to me that this problem should go away with the seed_everything CLI arg. My understanding then would be that (given the current defaults) the problem I mentioned shouldn't be a problem. But it is, so why is that? tl;dr: why doesn't the CLI's default seeding prevent this?
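For reference, pinning the seed explicitly through LightningCLI might look like this (a sketch; MyModel and MyDataModule are hypothetical stand-ins for your own classes, and a fixed integer is passed instead of relying on the default behaviour):

```python
from lightning.pytorch.cli import LightningCLI

# MyModel / MyDataModule are hypothetical placeholders for your own
# LightningModule and LightningDataModule classes.
def main():
    # An explicit seed guarantees every rank seeds the global random state
    # with the same value, instead of each process picking its own.
    LightningCLI(MyModel, MyDataModule, seed_everything_default=42)


if __name__ == "__main__":
    main()
```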
@profPlum Also, could you explain how you are launching the script?
@profPlum Let me know what you think about my last reply and the PR I sent. |
Perfect! I was in fact using slurm/srun. |
📚 Documentation
The docs recommend doing splits in the setup() method, and an example is given that does random splits in setup().
Perhaps random splits in setup() shouldn't be recommended.
I've found that when I do random_split inside setup() with multiprocessing, each GPU worker ends up with a different validation dataset (given by the random split).
So when random splits are done in setup() without a seed, the model is trained on some (or all) of the validation data.
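To make the mechanism concrete, here is a tiny illustration (two DDP ranks are simulated by reseeding within a single process; the data and split sizes are toy placeholders):

```python
import torch
from torch.utils.data import random_split

data = list(range(10))

torch.manual_seed(0)                      # pretend this is rank 0's random state
train_0, val_0 = random_split(data, [8, 2])

torch.manual_seed(1)                      # pretend this is rank 1's random state
train_1, val_1 = random_split(data, [8, 2])

print(sorted(val_0.indices), sorted(val_1.indices))  # usually different index sets
# Items in rank 0's validation set can appear in rank 1's training set,
# which is exactly the leakage described above.
print(set(val_0.indices) & set(train_1.indices))
```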
cc @Borda @carmocca @awaelchli