Update the attention_mask reformat for MHA #802
Conversation
@@ -1689,61 +1694,73 @@ def make_attention_mask_reformatting_for_mha(self):
# Make nodes for the attention mask subgraphs that reformat the
# 2D attention mask (B, S) to 4D causal attention mask (B, N, S, T)
I do not understand this. I think MHA supports a 2D mask with shape (B, T). Shall we use that directly instead of converting to 4D in the ONNX graph? (This may require MHA to support a causal mask in the CUDA EP.)
It would be better to use a 1D mask (total KV lengths, assuming right padding) to be consistent with GQA.
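To illustrate the mask shapes being discussed, here is a minimal numpy sketch of both conversions: expanding a 2D padding mask (B, S) into a 4D additive causal mask (B, N, S, T), and collapsing it into 1D total KV lengths as suggested for GQA. This is an illustrative sketch, not the actual ONNX subgraph; it assumes the prompt phase (S == T), right padding, and a 1 = valid / 0 = padding convention.

```python
import numpy as np

def expand_2d_to_4d_causal(mask_2d, num_heads):
    """Expand a 2D padding mask (B, S) into a 4D additive causal mask
    (B, N, S, T), assuming S == T and 1 = valid token, 0 = padding."""
    B, S = mask_2d.shape
    # Lower-triangular (S, T) matrix: position i may attend to j only if j <= i.
    causal = np.tril(np.ones((S, S), dtype=bool))
    # Key positions that are real tokens, broadcast to (B, 1, 1, T).
    keys_valid = mask_2d.astype(bool)[:, None, None, :]
    # A position is attendable only if it is both causal-visible and not padding.
    allowed = causal[None, None, :, :] & keys_valid  # (B, 1, S, T)
    # Additive mask: 0 where allowed, large negative where masked out.
    additive = np.where(allowed, 0.0, np.finfo(np.float32).min).astype(np.float32)
    return np.broadcast_to(additive, (B, num_heads, S, S))

def total_kv_lengths(mask_2d):
    """Collapse a 2D padding mask to 1D total KV lengths per batch entry,
    assuming right padding (the 1D form suggested for GQA)."""
    return mask_2d.sum(axis=1).astype(np.int64)

mask = np.array([[1, 1, 1, 0]], dtype=np.int64)  # one padded position
m4d = expand_2d_to_4d_causal(mask, num_heads=2)  # shape (1, 2, 4, 4)
lens = total_kv_lengths(mask)                    # [3]
```

The 1D form is clearly cheaper to build and pass around, which is the motivation behind the reviewer's suggestion.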
Does this change get validated by tests?
@apsonawane is this pull request still relevant? Could you please address the comments and update the PR when possible?
Yes, I will be updating the PR.
Phi-3.5 ONNX models have been released here. The issue is not seen in these models, so I would recommend using Phi-3.5 instead of Phi-3. Closing this PR as it is no longer required.
Looks like this has been solved with the latest ONNX release, but fine-tuning these ONNX models by converting them to torch is really tricky. Has the fix been made to the non-ONNX models as well? Any workaround for that?
@myadav2, the overall process for fine-tuning is: fine-tune with PyTorch -> use Model Builder or the PyTorch ONNX exporter to convert the model to ONNX -> serve with the ORT GenAI API.
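As a rough sketch of that flow: the ONNX Runtime GenAI package ships a Model Builder CLI that can convert a fine-tuned Hugging Face checkpoint to ONNX. The paths, precision, and execution provider below are placeholders, not values from this thread.

```shell
# Assumes the onnxruntime-genai package, which provides the Model Builder CLI.
pip install onnxruntime-genai

# Convert a fine-tuned PyTorch checkpoint (placeholder path) to ONNX:
python -m onnxruntime_genai.models.builder \
  -i ./my-finetuned-phi3 \
  -o ./my-finetuned-phi3-onnx \
  -p fp16 \
  -e cuda
```

The resulting output folder can then be loaded directly by the ORT GenAI API for serving.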
Yes, but the issue of gibberish output after fine-tuning on long-context text is present for the base PyTorch models. I don't think we have a fix for that, as far as I know.
@myadav2, that's annoying. Could you please describe your issue in detail and share some examples? I can bring the issues to the Phi-3 model training team and see if they can help.
The Phi-3.5 PyTorch models should also have this fix. If you still observe this behavior, you can open an issue in this repo so it can be tracked.
For Phi-3 models the attention_mask reformat was incorrect. This PR updates the pattern.
It also helps with the issue open here: #552