[Perf] Skip creating attention mask in llama dataloader #40

Open · wants to merge 1 commit into base: rocm_dev
Conversation

billishyahao

This patch skips creating the attention mask in the llama dataloader by adding the flag --no-create-attention-mask-in-dataloader. With this patch, we see the following benefits (a rough sketch of the idea is shown after the list):

  1. It brings a 4%~6% performance gain.
  2. It also addresses an observed dataloader crash when dealing with long sequences, e.g. [BUG] Long context training using context-parallel hangs/crashes NVIDIA/Megatron-LM#1025.
  3. Newer Megatron model examples also adopt this flag, e.g. https://github.com/NVIDIA/Megatron-LM/blob/40db706d37a25787b0fb6b7b561327e5d2b4b2e4/examples/mamba/train.sh#L102
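
Roughly, the flag avoids materializing the full [seq_len, seq_len] causal mask for every batch in the dataloader and instead relies on the attention backend (e.g. fused or flash-attention kernels) to apply causal masking implicitly. Below is a minimal sketch of the idea, assuming a simplified batch-building helper; `build_llama_batch` and its arguments are illustrative and not the actual Megatron-LM code:

```python
# Illustrative sketch only -- not the actual Megatron-LM dataloader code.
# build_llama_batch and its arguments are made up to show the idea behind
# --no-create-attention-mask-in-dataloader.
import torch


def build_llama_batch(tokens: torch.Tensor,
                      eod_token: int,
                      create_attention_mask: bool = True):
    """Build per-batch tensors for causal LM training.

    tokens: [batch, seq_len] int64 token ids.
    """
    batch_size, seq_len = tokens.shape

    # Labels are the inputs shifted left by one position.
    labels = torch.roll(tokens, shifts=-1, dims=1)

    # Do not compute loss on end-of-document tokens.
    loss_mask = torch.ones_like(tokens, dtype=torch.float)
    loss_mask[tokens == eod_token] = 0.0

    position_ids = torch.arange(seq_len, device=tokens.device).expand(batch_size, -1)

    attention_mask = None
    if create_attention_mask:
        # Materializing this [1, 1, seq_len, seq_len] lower-triangular mask
        # costs O(seq_len^2) memory and time per batch; with
        # --no-create-attention-mask-in-dataloader it is skipped and the
        # attention kernel applies causal masking itself.
        attention_mask = torch.tril(
            torch.ones((1, 1, seq_len, seq_len),
                       device=tokens.device, dtype=torch.bool))

    return tokens, labels, loss_mask, attention_mask, position_ids
```

For long-sequence training the quadratic mask is the dominant per-batch allocation in the dataloader, which is consistent with both the reported speedup and the long-context crash being avoided when it is skipped.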

@wenchenvincent wenchenvincent requested a review from lizamd January 22, 2025 20:32
@wenchenvincent
Collaborator

@lizamd Are you aware of this change? Do we use this setting for testing?

@wenchenvincent
Collaborator

@billishyahao Could you give some more details on the behavior of --no-create-attention-mask-in-dataloader? For example, if the attention mask is not created in the dataloader, where is it created?

@lizamd

lizamd commented Jan 24, 2025

@billishyahao Could you provide more data on the 4-5% perf gain and address @wenchenvincent's question? We can have a call too.
