
Add the Bamba Model #34982

Merged: 44 commits into huggingface:main, Dec 18, 2024

Conversation

@fabianlim (Contributor) commented on Nov 28, 2024:

What does this PR do?

This PR adds the BambaModel, a hybrid mamba2 architecture with SwiGLU. The checkpoints are jointly trained by IBM, Princeton, and UIUC.

The implementation is based on ai21labs/Jamba-v0.1 and the mamba2 implementation ported to HF for the Codestral model.

cc: @ani300, @raghukiran1224
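
A minimal usage sketch of what loading the model could look like once the checkpoints are released; the checkpoint id below is a placeholder for illustration only, not a published artifact:

```python
# Hypothetical usage sketch; "ibm-ai-platform/Bamba-9B" is a placeholder
# checkpoint id, not a confirmed release name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-ai-platform/Bamba-9B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Bamba is a hybrid mamba2 model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```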

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@fabianlim fabianlim marked this pull request as draft November 28, 2024 00:35
@fabianlim fabianlim changed the title initial commit for PR Add the Bamba Model Nov 28, 2024
Co-authored-by: Gabe Goodhart <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@Rocketknight1 (Member) commented:

Hi @fabianlim, do you have a paper reference for this model or any details on the trained checkpoints?

@fabianlim (Contributor, Author) replied:

@Rocketknight1 thanks for reaching out. Yes, my colleagues are preparing a paper and a GitHub repo with the (training) code. The checkpoints will be at 1.8T, 2T, and 2.2T tokens, plus an SFT model. We will update the PR accordingly.

cc: @raghukiran1224

@raghukiran1224 replied:

The data used is all open, and we plan to share any and all details the community wants! Open source is the name of the game 😄

@Rocketknight1 (Member) replied:

Cool! @molbap will be the point of contact at Hugging Face for this PR, so feel free to ping me or him if you have any questions as you're working on it.

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim mentioned this pull request Dec 5, 2024
@molbap added the "State space models" and "New model" labels on Dec 9, 2024
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@fabianlim fabianlim marked this pull request as ready for review December 16, 2024 15:33
@molbap (Contributor) left a comment:

All tests + integration pass and modular looks better, thanks for working on it and congrats on the model! pinging @ArthurZucker so core review starts ;)


```md
## Overview

TODO
```

Comment (Contributor): This should be filled in as well before merging.

Reply (Contributor, Author): updated in 44788dc

On the docs example ending in:

```
print(i)
```

<!-- update this -->

Suggested change (Contributor): remove the `<!-- update this -->` placeholder.

Reply (Contributor, Author): updated in 44788dc

divya-kumari32 and others added 9 commits December 16, 2024 22:51
Added overview, update Model inference card and added config
Minor fixes
Added overview and other additional details for Bamba
Signed-off-by: Antoni Viros i Martin <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
@pcuenca pcuenca requested a review from ArthurZucker December 18, 2024 12:06
@ArthurZucker (Collaborator) left a comment:

Great work, a few nits to address and we can merge!

Comment on lines 1 to 57

```python
# Copyright 2024 IBM and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {
    "configuration_bamba": ["BambaConfig"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_bamba"] = [
        "BambaForCausalLM",
        "BambaModel",
        "BambaPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_bamba import BambaConfig

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_bamba import (
            BambaForCausalLM,
            BambaModel,
            BambaPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
```
Comment (Collaborator): Suggested change (replacing the `__init__` above with):
```python
# Copyright 2024 IBM and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure

if TYPE_CHECKING:
    from .configuration_bamba import *
    from .modeling_bamba import *
    from .processing_bamba import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```

Comment (Collaborator): And you need to define `__all__` like:

```python
__all__ = [
    "GemmaModel",
    "GemmaForCausalLM",
    "GemmaForSequenceClassification",
    "GemmaForTokenClassification",
    "GemmaPreTrainedModel",
]
```

at the end of the modular file! See modular gemma2.

Reply (Contributor): addressed in latest commit

Comment on lines 169 to 175

```python
assert mamba_intermediate % mamba_n_heads == 0, "mamba_n_heads must divide mamba_expand * hidden_size"

# for the mamba_v2, must satisfy the following
if mamba_d_head == "auto":
    mamba_d_head = mamba_intermediate // mamba_n_heads
assert mamba_d_head * mamba_n_heads == mamba_intermediate
```
Comment (Collaborator): let's remove the asserts and raise errors instead, please.

Reply (Contributor): done in latest commit
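
For illustration, the requested assert-to-exception change could look roughly like the sketch below; variable names follow the hunk above, and this is not the merged code:

```python
# Sketch only: raise explicit errors instead of using asserts.
if mamba_intermediate % mamba_n_heads != 0:
    raise ValueError("mamba_n_heads must divide mamba_expand * hidden_size")

# for mamba_v2, the head dimension must tile the intermediate size exactly
if mamba_d_head == "auto":
    mamba_d_head = mamba_intermediate // mamba_n_heads
if mamba_d_head * mamba_n_heads != mamba_intermediate:
    raise ValueError("mamba_d_head * mamba_n_heads must equal mamba_expand * hidden_size")
```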

Comment on lines +34 to +82

```python
def convert_state_dict_from_mamba_ssm(original_sd: Dict) -> Dict[str, torch.Tensor]:
    state_dict = {}

    for orig_k, param in original_sd.items():
        k = orig_k.replace("backbone", "model")

        # for embeddings
        k = k.replace("embedding", "embed_tokens")

        # for mixer
        k = k.replace("mixer", "mamba")

        # for final layernorm
        k = k.replace("norm_f", "final_layernorm")

        # for block layernorm
        k = re.sub(r"(\d+)\.norm\.", r"\1.input_layernorm.", k)
        k = re.sub(r"(\d+)\.norm2\.", r"\1.pre_ff_layernorm.", k)

        # for mlp
        k = k.replace("mlp.fc2", "feed_forward.down_proj")

        if "mlp.fc1" in k:
            param, param2 = torch.chunk(param, 2, dim=0)
            k2 = k.replace("mlp.fc1", "feed_forward.gate_proj")
            state_dict[k2] = param2
            k = k.replace("mlp.fc1", "feed_forward.up_proj")

        if ("in_proj" in k and orig_k.replace("in_proj", "conv1d") in original_sd) or (
            "out_proj" in k and orig_k.replace("out_proj", "conv1d") in original_sd
        ):
            # then this must be a mamba
            pass
        else:
            # for attn
            # - because mixer was replaced to mamba above
            k = k.replace("mamba.out_proj", "self_attn.o_proj")
            if "mamba.in_proj" in k:
                m, n = param.shape
                d = (m - n) // 2
                param, param2, param3 = torch.split(param, [n, d, d], dim=0)
                k2 = k.replace("mamba.in_proj", "self_attn.k_proj")
                state_dict[k2] = param2
                k2 = k.replace("mamba.in_proj", "self_attn.v_proj")
                state_dict[k2] = param3
                k = k.replace("mamba.in_proj", "self_attn.q_proj")

        state_dict[k] = param
```
Comment (Collaborator): we like to have a more explicit dict like this one, but it's not blocking merge!

@ani300 (Contributor) replied on Dec 18, 2024: leaving for a future/follow-up PR
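
For the follow-up, an explicit mapping table in the style the reviewer prefers could look roughly like the sketch below; the table name and helper are hypothetical, and the patterns simply mirror the replacements in the function above (the fused `mlp.fc1` and attention `in_proj` tensors would still need the splitting logic):

```python
import re

# Hypothetical sketch of an explicit key-mapping table for the converter.
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"backbone": "model",
    r"embedding": "embed_tokens",
    r"mixer": "mamba",
    r"norm_f": "final_layernorm",
    r"(\d+)\.norm\.": r"\1.input_layernorm.",
    r"(\d+)\.norm2\.": r"\1.pre_ff_layernorm.",
    r"mlp\.fc2": "feed_forward.down_proj",
}


def rename_key(key: str) -> str:
    # Apply each rename in order; order matters because "mixer" -> "mamba"
    # happens before any attention-specific handling.
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
        key = re.sub(pattern, replacement, key)
    return key
```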



```python
# Adapted from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
# - handles the case if the rotary embedding is smaller than head_dim
```
Comment (Collaborator): really not sure this is worth changing, as it just adds an assert.

Comment (Collaborator): Otherwise we should use the notation like

```python
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
```

Reply (Contributor): changed to adapt from GLM
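
For context on the "rotary smaller than head_dim" case: only the leading `rotary_dim` channels are rotated and the rest pass through unchanged. A rough, hypothetical sketch of that idea (not the GLM code that was actually adopted):

```python
import torch


def rotate_half(x):
    # Standard Llama-style half rotation.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_partial_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # Assumes cos/sin cover only the first rotary_dim channels of head_dim.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    rotary_dim = cos.shape[-1]

    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]
    k_rot, k_pass = k[..., :rotary_dim], k[..., rotary_dim:]

    q_rot = (q_rot * cos) + (rotate_half(q_rot) * sin)
    k_rot = (k_rot * cos) + (rotate_half(k_rot) * sin)
    return torch.cat((q_rot, q_pass), dim=-1), torch.cat((k_rot, k_pass), dim=-1)
```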

Comment on lines 249 to 254

```python
"""
Compute ∆, A, B, C, and D the state space parameters and compute the `contextualized_states`.
A, D are input independent (see Mamba paper [1] Section 3.5.2 "Interpretation of A" for why A isn't selective)
∆, B, C are input-dependent (this is a key difference between Mamba and the linear time invariant S4,
and is why Mamba is called **selective** state spaces)
"""
```
Comment (Collaborator): As a reviewer, but also for any dev who is going to read this code, we need to know what the differences are with Mamba2Mixer. Could you add comments here explaining why we have to redefine everything?

Reply (Contributor): added a list to the comments
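
For reviewers less familiar with the docstring's notation, the standard Mamba selective-scan update it refers to is, in summary (not quoted from this PR):

```latex
\begin{aligned}
\bar{A}_t &= \exp(\Delta_t A), \qquad \bar{B}_t = \Delta_t B_t \\
h_t &= \bar{A}_t \, h_{t-1} + \bar{B}_t \, x_t \\
y_t &= C_t \, h_t + D \, x_t
\end{aligned}
```

Here A and D are input-independent, while ∆_t, B_t, C_t are computed from the input x_t, which is what makes the state space "selective".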

Comment on lines 723 to 740

```python
class BambaDecoderLayer(LlamaDecoderLayer):
    def __init__(self, config: BambaConfig, layer_idx: int, layer_type: str = "mamba"):
        super().__init__()

        del self.self_attn

        del self.mlp
        del self.post_attention_layernorm
        self.feed_forward = BambaMLP(config)
        self.pre_ff_layernorm = BambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.layer_type = layer_type
        if layer_type == "mamba":
            self.mamba = BambaMixer(config=config, layer_idx=layer_idx)
        elif layer_type == "attention":
            self.self_attn = BAMBA_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        else:
            raise ValueError("Invalid layer_type")
```
Comment (Collaborator): This is a lot more similar to:

```python
class JambaAttentionDecoderLayer(nn.Module):
    def __init__(self, config: JambaConfig, layer_idx: int):
        super().__init__()
        num_experts = config.layers_num_experts[layer_idx]
        self.self_attn = JAMBA_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        ffn_layer_class = JambaSparseMoeBlock if num_experts > 1 else JambaMLP
        self.feed_forward = ffn_layer_class(config)
        self.input_layernorm = JambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.pre_ff_layernorm = JambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
```

which should be a better base 😉

Reply (Contributor): done in latest commit
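
For reference, a rough sketch of the Jamba-style structure applied to Bamba; this is hypothetical (the merged class may differ), and `BambaRMSNorm`, `BambaMLP`, `BambaMixer` and `BAMBA_ATTENTION_CLASSES` are the names from the hunk above, defined elsewhere in the module:

```python
import torch.nn as nn


class BambaDecoderLayer(nn.Module):
    # Hypothetical sketch following the JambaAttentionDecoderLayer pattern:
    # build the layer explicitly instead of inheriting LlamaDecoderLayer and
    # deleting its submodules.
    def __init__(self, config, layer_idx: int, layer_type: str = "mamba"):
        super().__init__()
        self.input_layernorm = BambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.pre_ff_layernorm = BambaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.feed_forward = BambaMLP(config)

        self.layer_type = layer_type
        if layer_type == "mamba":
            self.mamba = BambaMixer(config=config, layer_idx=layer_idx)
        elif layer_type == "attention":
            self.self_attn = BAMBA_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        else:
            raise ValueError(f"Invalid layer_type: {layer_type}")
```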



```python
# Adapted from transformers.models.jamba.modeling_jamba.JambaForCausalLM
class BambaForCausalLM(BambaPreTrainedModel, GenerationMixin):
```
Comment (Collaborator): forward and __init__ should be the same as LlamaForCausalLM, which you should be able to inherit from, and just change `prepare_inputs_for_generation`, as it is the only difference, no?

Reply (Contributor): done in latest commit
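
In modular-transformers terms, the suggestion amounts to something like the sketch below; it is hypothetical and simplified, since the real override also has to thread the hybrid Mamba/attention cache through generation:

```python
# Hypothetical sketch: inherit __init__/forward from LlamaForCausalLM and only
# override prepare_inputs_for_generation.
from transformers.models.llama.modeling_llama import LlamaForCausalLM


class BambaForCausalLM(LlamaForCausalLM):
    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        # Simplified: the real method would pass the hybrid Mamba/attention
        # cache object through instead of a plain key/value cache.
        return super().prepare_inputs_for_generation(
            input_ids, past_key_values=past_key_values, **kwargs
        )
```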

@molbap molbap merged commit 9613933 into huggingface:main Dec 18, 2024
23 checks passed
@ArthurZucker (Collaborator) commented:

Kudos! 🚀

@garrett361 garrett361 mentioned this pull request Jan 24, 2025
Labels: New model, run-slow, State space models

7 participants