
Slow validation stage when running two or more instances of the same model on different datasets #150

Open
CKK-coder opened this issue Jun 20, 2024 · 1 comment

Comments

@CKK-coder

When I use different GPU servers to train AdaFace on different datasets, the training stage runs at normal speed. But when the two tasks reach the validation stage at the same time, CPU utilization drops very low and the "validation dataloader" step takes a very long time. Specifically, with only one task running, the "validation dataloader" step takes about 10 minutes or less after one training epoch. With two tasks running, it takes more than several hours. What is the cause of this issue, and how can I solve it? Looking forward to your reply!

@afm215

afm215 commented Jan 31, 2025

Hello, I also had the same issue, which I temporarily mitigated by creating copies of the validation set for each simultaneous training run. I think it has something to do with the use of numpy memmap (maybe we should change the mode to "r" in the read_memmap util function), but I have not taken a closer look yet. Have you?
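For reference, a minimal sketch of the change suggested above, i.e. opening the validation memmap read-only. This is not the repository's actual `read_memmap` implementation; the file name, dtype, and shape below are placeholders for illustration:

```python
import numpy as np

# Placeholder stand-in for the on-disk validation data.
data = np.arange(12, dtype=np.float32)
data.tofile("val_placeholder.dat")

# mode="r" maps the file strictly read-only, so concurrent training jobs
# can share the same page-cache pages instead of each process holding its
# own writable mapping (mode "r+" allows writes, which can hurt sharing).
mm = np.memmap("val_placeholder.dat", dtype=np.float32, mode="r", shape=(12,))
print(mm.sum())  # reads go through the shared mapping
```

Whether this fully explains the multi-job slowdown would still need profiling (e.g. checking whether the jobs are stuck in disk I/O during the validation dataloader step).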
