question about parallelism for embedding #2119
Comments
If I understand correctly, you want to replicate row-wise shards of an embedding in a data-parallel fashion? AFAIU, this seems like a niche case, and I'm not sure it brings gains over the currently supported sharding schemes. RW/CW sharding is usually efficient for multi-node training.
Thanks for your reply; I see what you mean. But I think that in massive training, with hundreds of GPUs, RW/CW probably makes the embedding table shard on each GPU too small. In this case, could DP+RW/CW be a better approach? Or should I just use TW+RW/CW?
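To make the shard-size concern above concrete, here is a minimal arithmetic sketch. It assumes rows are split evenly across shards; the helper names (`rows_per_shard`, `compare_rw_vs_dp_rw`) and the example sizes are hypothetical and are not part of the TorchRec API.

```python
import math


def rows_per_shard(num_rows: int, num_shards: int) -> int:
    """Rows held by the largest shard when rows are split as evenly as possible."""
    return math.ceil(num_rows / num_shards)


def compare_rw_vs_dp_rw(num_rows: int, world_size: int, shard_group_size: int) -> dict:
    """Compare pure row-wise sharding with DP + row-wise sharding.

    Pure RW: the table is split across all `world_size` GPUs.
    DP + RW (assumed layout): the table is split across `shard_group_size`
    GPUs, and that sharded copy is replicated across the remaining groups.
    """
    if world_size % shard_group_size != 0:
        raise ValueError("world_size must be a multiple of shard_group_size")
    return {
        "rw_rows_per_gpu": rows_per_shard(num_rows, world_size),
        "dp_rw_rows_per_gpu": rows_per_shard(num_rows, shard_group_size),
        "dp_replicas": world_size // shard_group_size,
    }


if __name__ == "__main__":
    # Hypothetical example: a 1M-row table on 256 GPUs, sharded within groups of 8.
    print(compare_rw_vs_dp_rw(1_000_000, world_size=256, shard_group_size=8))
```

With these made-up numbers, pure RW leaves each GPU with a few thousand rows, while DP+RW keeps shards much larger at the cost of holding 32 replicas that must be kept in sync.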
Sorry for the late reply here. We recently added a GRID_SHARD type for massive training, which shards a table both row-wise and column-wise. Depending on how big the embedding tables are, this can be more efficient for massive training. For your case I think DP+(RW/CW) seems best. I'm sure by now you've come up with something, but we have something coming for this officially in the next month, akin to multi-level parallelism.
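As a rough illustration of what sharding a table along both rows and columns means, the sketch below cuts a table into a grid of blocks. This is a generic 2D block partition, not TorchRec's actual GRID_SHARD placement logic; `split_sizes` and `grid_shard_shapes` are hypothetical helpers.

```python
from typing import List, Tuple


def split_sizes(n: int, parts: int) -> List[int]:
    """Split n items into `parts` chunks; the first chunks absorb any remainder."""
    base, rem = divmod(n, parts)
    return [base + (1 if i < rem else 0) for i in range(parts)]


def grid_shard_shapes(
    num_rows: int, emb_dim: int, row_parts: int, col_parts: int
) -> List[Tuple[int, int, int, int]]:
    """Return (row_offset, col_offset, n_rows, n_cols) for every block in the grid."""
    shards = []
    row_offset = 0
    for n_rows in split_sizes(num_rows, row_parts):
        col_offset = 0
        for n_cols in split_sizes(emb_dim, col_parts):
            shards.append((row_offset, col_offset, n_rows, n_cols))
            col_offset += n_cols
        row_offset += n_rows
    return shards


if __name__ == "__main__":
    # Hypothetical 10 x 8 table cut into a 2 x 2 grid of blocks.
    for shard in grid_shard_shapes(10, 8, row_parts=2, col_parts=2):
        print(shard)
```

Each block would live on one device, so the number of devices a single table can use is `row_parts * col_parts` rather than being limited by one dimension alone.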
That sounds great! By the way, I found that the GRID_SHARD type is only supported on EmbeddingBagCollection, not on EmbeddingCollection. In my case I mainly use EmbeddingCollection, so I'd really like to know whether your new multi-level parallelism will cover both EmbeddingCollection and EmbeddingBagCollection.
Yes, the multi-level parallelism will support both EmbeddingCollection and EmbeddingBagCollection. It is applied at the model level, meaning EC/EBC are supported, as well as all the sharding types for the embedding tables.
It seems to me that it's a good method for massive training, and I believe it's more efficient. However, your method applies allreduce to the embedding weights rather than to the gradients, and I think the latter is more prevalent. Furthermore, if a more complex optimizer such as Adam is used for the embeddings, the two methods may yield different results. I'm not sure whether this discrepancy would affect the model's performance. Is there any theory that supports this method?
Yeah, great catch. We've gone the embedding-weights way instead of gradients due to an FBGEMM implementation detail. FBGEMM fuses the optimizer update into the backward pass, so if we wanted to sync gradients instead, we would lose quite a bit of performance and incur a much larger memory overhead. This means it's not truly "equivalent" training to a non-2D scheme. Some tuning of the optimizer is required, which I'm hoping to share more about once ready.
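The difference between syncing gradients and syncing weights can be seen in a toy, single-parameter sketch. This is not FBGEMM or TorchRec code; it assumes two replicas, one optimizer step from a fresh state, and hypothetical helper names. With plain SGD the two orders agree for this single step because the update is linear in the gradient; with Adam they do not, because the update is normalized per replica.

```python
import math
from typing import Callable, List


def sgd_step(w: float, g: float, lr: float = 0.1) -> float:
    """One plain SGD update."""
    return w - lr * g


def adam_first_step(w: float, g: float, lr: float = 0.1,
                    b1: float = 0.9, b2: float = 0.999, eps: float = 1e-8) -> float:
    """First Adam update (t = 1) starting from zero moment estimates."""
    m = (1 - b1) * g
    v = (1 - b2) * g * g
    m_hat = m / (1 - b1)  # bias correction at t = 1
    v_hat = v / (1 - b2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def sync_gradients_then_step(w: float, grads: List[float],
                             step: Callable[[float, float], float]) -> float:
    """Conventional data parallelism: average gradients, then update once."""
    return step(w, sum(grads) / len(grads))


def step_then_sync_weights(w: float, grads: List[float],
                           step: Callable[[float, float], float]) -> float:
    """Weight sync: each replica updates locally, then weights are averaged."""
    return sum(step(w, g) for g in grads) / len(grads)


if __name__ == "__main__":
    w0, grads = 1.0, [1.0, -0.5]  # made-up per-replica gradients
    for name, step in [("sgd", sgd_step), ("adam", adam_first_step)]:
        print(name,
              sync_gradients_then_step(w0, grads, step),
              step_then_sync_weights(w0, grads, step))
```

In this example the two replicas have gradients of opposite sign, so Adam's roughly sign-like first step cancels under weight averaging but not under gradient averaging, which is one way to see why the optimizer may need retuning.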
It seems TorchRec does not support combining data parallelism with row-wise parallelism for embeddings. Is there a plan to support it? Or is row-wise parallelism efficient enough for multi-node training?