
Slow validation stage when running two or more instances of the same model on different datasets #150

Open
CKK-coder opened this issue Jun 20, 2024 · 1 comment

Comments

@CKK-coder

When I use different GPU servers to train AdaFace on different datasets, the training stage runs at normal speed. But when the two tasks reach the validation stage at the same time, CPU utilization drops very low and the "validation dataloader" step takes a very long time. Specifically, with only one task running, the "validation dataloader" step takes about 10 minutes or less after one training epoch. With two tasks running, it takes more than several hours. What is the cause of this issue, and how can I solve it? Looking forward to your reply!

@afm215

afm215 commented Jan 31, 2025

Hello, I also had the same issue, which I temporarily mitigated by creating copies of the validation set for each simultaneous training run. I think it has something to do with the use of numpy memmap (maybe we should change the mode to "r" in the read_memmap util function), but I have not taken a closer look yet. Have you?
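For reference, a minimal sketch of the change suggested above, i.e. opening the validation memmap read-only. This is not the repository's actual `read_memmap` implementation; the file name, dtype, and shape below are placeholders for illustration:

```python
import numpy as np

# Placeholder stand-in for the on-disk validation data.
data = np.arange(12, dtype=np.float32)
data.tofile("val_placeholder.dat")

# mode="r" maps the file strictly read-only, so concurrent training jobs
# can share the same page-cache pages instead of each process holding its
# own writable mapping (mode "r+" allows writes, which can hurt sharing).
mm = np.memmap("val_placeholder.dat", dtype=np.float32, mode="r", shape=(12,))
print(mm.sum())  # reads go through the shared mapping
```

Whether this fully explains the multi-job slowdown would still need profiling (e.g. checking whether the jobs are stuck in disk I/O during the validation dataloader step).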
