
🚨All attention refactor🚨 #35235

Merged
merged 99 commits into from Dec 18, 2024
Conversation

@ArthurZucker
Collaborator

@ArthurZucker commented Dec 12, 2024

What does this PR do?

Todo in this PR:

  • Cohere
  • Chameleon
  • DBRX
  • Gemma
  • Gemma2
  • GLM (modular, so nothing to do I think)
  • GPT-NeoX and GPT2
  • Granite
  • Jamba
  • JetMoe
  • Mimi
  • Mistral
  • Mixtral
  • Mllama
  • Moshi
  • Nemotron
  • OPT
  • Phi
  • Phi3
  • PhiMoe
  • Qwen2
  • Qwen2MoE
  • Qwen2VL
  • StableLM
  • StarCoder2 -> modular, normally OK
  • Idefics1,2,3
  • Olmo
  • Olmo2
  • Siglip
  • Whisper

@ArthurZucker force-pushed the all-attention-refactor branch from 0dc9253 to d1aa9ce on December 12, 2024 13:49
class GradientCheckpointLayer(torch.nn.Module):

@ArthurZucker
Collaborator Author

This should help with kwargs as well
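
For readers skimming the diff, the idea is roughly the following; this is an illustrative sketch rather than the actual class from this PR. The wrapper runs a layer under activation checkpointing during training while still forwarding keyword arguments, which non-reentrant checkpointing supports:

```python
import torch
from torch.utils.checkpoint import checkpoint


class GradientCheckpointLayer(torch.nn.Module):
    """Illustrative wrapper (not the PR's implementation): run the wrapped
    layer under activation checkpointing, forwarding positional and kwargs."""

    def __init__(self, layer: torch.nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, *args, **kwargs):
        if self.training and torch.is_grad_enabled():
            # use_reentrant=False is what allows kwargs to be passed through.
            return checkpoint(self.layer, *args, use_reentrant=False, **kwargs)
        return self.layer(*args, **kwargs)
```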

@Cyrilvallez force-pushed the all-attention-refactor branch from 8b56823 to ecd814b on December 16, 2024 11:28
@foreverpiano

Running code that calls

query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

now fails with

raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'MistralAttention' object has no attribute 'num_heads'

How can I fix this?

@ArthurZucker
Collaborator Author

Hey! You should try to use the latest release of transformers! `query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)` is what's used now.
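
For anyone hitting the same error: the refactored attention modules no longer store `num_heads` on the module; the head count is inferred from the projection output and `head_dim`, and can otherwise be read from the model config (`config.num_attention_heads`). A minimal sketch of the new reshape pattern (shapes only, not the actual modeling code):

```python
import torch


def reshape_for_attention(proj_out: torch.Tensor, head_dim: int) -> torch.Tensor:
    """(batch, seq, hidden) -> (batch, num_heads, seq, head_dim),
    inferring the head count with -1 instead of a stored self.num_heads."""
    hidden_shape = (*proj_out.shape[:-1], -1, head_dim)
    return proj_out.view(hidden_shape).transpose(1, 2)


# Example: hidden size 64 with head_dim 16 -> 4 heads are inferred.
x = torch.randn(2, 5, 64)
print(reshape_for_attention(x, head_dim=16).shape)  # torch.Size([2, 4, 5, 16])
```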

@ArthurZucker
Collaborator Author

Is this by any chance related to AWQ or another package?

@foreverpiano

foreverpiano commented Jan 13, 2025

Is there any doc about how to migrate from the previous version to this version, e.g. the variable definitions and alias changes?

@foreverpiano

foreverpiano commented Jan 13, 2025

Have you tested the performance on several benchmarks? I know that the LongBench score on transformers v4.47 vs v4.36 varies a lot on Llama-3. Is it stable in this version?
I suggest adding some simple and small dataset tests.

@Cyrilvallez
Member

Hey! Everything stays the same in terms of user experience and benchmark scores. If you used to hack into the different Layer classes, however, it may have changed a bit. In that case you can simply go and check out the modeling code (as you presumably did when you hacked into it in the first place!)
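
As a concrete illustration of the "nothing changes for users" point, loading a checkpoint still looks exactly the same, including the choice of attention backend (the model name below is just a placeholder):

```python
from transformers import AutoModelForCausalLM

# Unchanged user-facing API: the attention backend is still picked at load time
# ("eager", "sdpa", or "flash_attention_2" if flash-attn is installed).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    attn_implementation="sdpa",
    torch_dtype="auto",
)
```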

loadams added a commit to deepspeedai/DeepSpeed that referenced this pull request Jan 13, 2025
Breaking change in transformers is
huggingface/transformers#35235. Need to make
changes to unpin nv-a6000 workflow.
@poedator
Contributor

poedator commented Jan 14, 2025

My friends use GPT2Model in production and want to compile it with StaticCache. With the maintainers' blessing, I would try to create a PR adding DynamicCache / StaticCache support to GPT2Model.
I am quite familiar with the Cache class; I have already written some of the code and made DynamicCache work.

Please let me know if there are any hidden obstacles in a Cache implementation for GPT2. Which tests should I run or add?
@ArthurZucker

@Rocketknight1
Member

cc @gante to that question!

@gante
Member

gante commented Jan 15, 2025

I've chatted with @poedator offline -- I couldn't think of any obstacle in particular, and suggested a) ensuring we leave a deprecation warning regarding the old cache format, and b) using `RUN_SLOW=1 py.test tests/models/gpt2/test_modeling_gpt2.py` as a correctness check (gpt2 is fairly well tested, especially wrt text generation)
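
For reference, a minimal sketch of how the new Cache API is used with models that already support it, which is the behaviour the proposed GPT2 PR would add (the checkpoint name is a placeholder; the slow-test command from the comment above is reproduced in the trailing comment):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

name = "mistralai/Mistral-7B-v0.1"  # placeholder checkpoint that already accepts Cache objects
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Hello", return_tensors="pt")
# Pass an explicit Cache object instead of the legacy tuple-of-tuples format.
out = model.generate(**inputs, past_key_values=DynamicCache(), max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Correctness check suggested above:
#   RUN_SLOW=1 py.test tests/models/gpt2/test_modeling_gpt2.py
```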

@poedator
Contributor

It looks like test_flash_attn_2_from_config is broken: it expects the attention layer to have "FlashAttention" in its class name,

if "FlashAttention" in module.__class__.__name__: ...

but after this refactoring the attention classes are named differently.

Please fix or suspend the test.
@ArthurZucker

@ArthurZucker
Collaborator Author

indeed gimme a min!
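
For context, a rough sketch of the kind of check that still works after the refactor (not necessarily the actual test fix): all backends now share the same attention class per model, so the selected implementation is recorded on the config rather than in the class name.

```python
def uses_flash_attention_2(model) -> bool:
    """Illustrative: class-name matching like
    `if "FlashAttention" in module.__class__.__name__` no longer works;
    the chosen backend is stored on the model config instead."""
    return getattr(model.config, "_attn_implementation", None) == "flash_attention_2"
```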

youkaichao pushed a commit to vllm-project/vllm that referenced this pull request Feb 3, 2025
# Adds support for `transformers` as a backend

Following huggingface/transformers#35235, a bunch of models should already be supported, and we are ramping up support for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub that implements attention the correct way can be natively supported!!
- tensor parallel support

---------

Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
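
A hedged sketch of what using that Transformers backend looks like on the vLLM side (the `model_impl` flag name and the checkpoint are assumptions based on vLLM's documentation, not something specified in this PR):

```python
from vllm import LLM, SamplingParams

# Assumed flags: model_impl="transformers" forces the Transformers backend,
# tensor_parallel_size shards across GPUs, and trust_remote_code=True lets hub
# models that implement attention "the correct way" run without a native port.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder checkpoint
    model_impl="transformers",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```
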
ydshieh added a commit that referenced this pull request Feb 4, 2025
* update

* update

* update

* dev-ci

* more changes

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <[email protected]>
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request Feb 7, 2025
Breaking change in transformers is
huggingface/transformers#35235. Need to make
changes to unpin nv-a6000 workflow.

Signed-off-by: siqi <[email protected]>
fxmarty-amd pushed a commit to fxmarty-amd/vllm that referenced this pull request Feb 7, 2025
# Adds support for `transformers` as a backend

Following huggingface/transformers#35235, a bunch of models should already be supported, and we are ramping up support for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub that implements attention the correct way can be natively supported!!
- tensor parallel support

---------

Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: Felix Marty <[email protected]>
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request Feb 9, 2025
Breaking change in transformers is
huggingface/transformers#35235. Need to make
changes to unpin nv-a6000 workflow.