Add HF and DeepSeek-R1-Distill-Llama-70B support #17420

Closed
wants to merge 166 commits into from

@yieldthought yieldthought commented Jan 31, 2025

Problem description

The existing codebase loads the Meta checkpoint format, but many derivative models are only available on HuggingFace.

What's changed

Add support for loading HuggingFace model formats. This paves the way for full Qwen support (pending a YaRN RoPE implementation) and adds DeepSeek-R1-Distill-Llama-70B support.
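As a rough illustration of what loading a HuggingFace checkpoint into a Meta-format codebase involves, the sketch below renames HuggingFace Llama parameter keys to their Meta-style equivalents. The helper names (`HF_TO_META`, `hf_key_to_meta`, `convert_state_dict_keys`) are hypothetical and are not taken from this PR; the actual conversion code may be organised differently and may need further steps (for example, HuggingFace Llama checkpoints store the Q and K projections in a permuted layout, which this sketch does not undo).

```python
import re

# Hypothetical rename table: HuggingFace Llama key patterns -> Meta-style keys.
# Illustrative only; the mapping used in the PR may differ.
HF_TO_META = [
    (r"^model\.embed_tokens\.", "tok_embeddings."),
    (r"^model\.norm\.", "norm."),
    (r"^lm_head\.", "output."),
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.", r"layers.\1.attention.wq."),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.", r"layers.\1.attention.wk."),
    (r"^model\.layers\.(\d+)\.self_attn\.v_proj\.", r"layers.\1.attention.wv."),
    (r"^model\.layers\.(\d+)\.self_attn\.o_proj\.", r"layers.\1.attention.wo."),
    (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"layers.\1.feed_forward.w1."),
    (r"^model\.layers\.(\d+)\.mlp\.down_proj\.", r"layers.\1.feed_forward.w2."),
    (r"^model\.layers\.(\d+)\.mlp\.up_proj\.", r"layers.\1.feed_forward.w3."),
    (r"^model\.layers\.(\d+)\.input_layernorm\.", r"layers.\1.attention_norm."),
    (r"^model\.layers\.(\d+)\.post_attention_layernorm\.", r"layers.\1.ffn_norm."),
]


def hf_key_to_meta(key: str) -> str:
    """Rename one HuggingFace key; keys with no matching pattern pass through."""
    for pattern, replacement in HF_TO_META:
        new_key, count = re.subn(pattern, replacement, key)
        if count:
            return new_key
    return key


def convert_state_dict_keys(state_dict: dict) -> dict:
    """Return a new dict with renamed keys; values (tensors) are left untouched."""
    return {hf_key_to_meta(k): v for k, v in state_dict.items()}
```

Keeping the rename table as data makes it straightforward to extend for other HuggingFace model families, where the key names differ.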

Checklist

yieldthought and others added 30 commits January 28, 2025 10:21
### What's changed
Removed all CI tests running TG llama on the old codebase since it's
outdated and we are only developing on the new codebase now.

Running TG pipelines:
https://github.com/tenstorrent/tt-metal/actions/runs/12933778923

### Checklist
- [ ] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
### Ticket
Link to Github Issue #16144 

### Problem description
- Binary_ng ops support only bfloat16 datatype
- binary bitwise ops, rsub, pow, add(int32) are not present in binary_ng

### What's changed
- Added float32 support for binary_ng ops
- Added bitwise ops
- Added add(int32), rsub and pow to binary_ng
- Fixed bias_gelu logic

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12796247845
https://github.com/tenstorrent/tt-metal/actions/runs/12834197476
https://github.com/tenstorrent/tt-metal/actions/runs/12922250157
https://github.com/tenstorrent/tt-metal/actions/runs/12948620892
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12805291674
https://github.com/tenstorrent/tt-metal/actions/runs/12834199165
https://github.com/tenstorrent/tt-metal/actions/runs/12914904913
https://github.com/tenstorrent/tt-metal/actions/runs/12948614665
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests https://github.com/tenstorrent/tt-metal/actions/runs/12842128253
https://github.com/tenstorrent/tt-metal/actions/runs/12916189194
- [x] New/Existing tests provide coverage for changes

---------

Co-authored-by: Patrick Roberts <[email protected]>
…6627)

### Ticket
- #16626

### Problem description
In the current use case of Matmul1D with gather_in0 in the Llama models,
the activations and weights need to be padded. This results in
significant overhead.

### What's changed
- Added support to skip part of in0_block_w that is padding information
- Pad the Kt and Nt in the host code for gather_in0
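The idea of padding a tile count on the host and then skipping the padded part of each block can be sketched in plain Python. This is a simplified model, not the kernel code: it assumes `Kt` is padded up to a multiple of the number of cores and split evenly, and the function names are hypothetical.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


def valid_tiles_per_core(kt: int, num_cores: int) -> list[int]:
    """For each core, count how many tiles in its block are real data.

    Assumption: Kt is padded so every core gets an equal block width, and the
    padding sits at the end, so only trailing cores hold padded tiles.
    """
    kt_padded = pad_to_multiple(kt, num_cores)
    block_w = kt_padded // num_cores
    return [max(0, min(block_w, kt - core * block_w)) for core in range(num_cores)]
```

For example, with `kt=10` and `num_cores=4` each core's block is 3 tiles wide, but the last core holds only 1 real tile, so a kernel that knows this count can skip the 2 padded tiles instead of computing on them.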

### Checklist
- [x] Post commit CI passes
(https://github.com/tenstorrent/tt-metal/actions/runs/12893880800)
- [x] New/Existing tests provide coverage for changes
(https://github.com/tenstorrent/tt-metal/actions/runs/12893883783)
…o dispatch s and increase the dispatch s page size to avoid having to split some commands when having multiple sub-devices instead
…torch

- Use ttnn.from_device to convert multi-device tensors first
- This fixes perf regression for falcon demo tests
### Ticket

#17040 

### Problem description
Since the ARCH_NAME dependency was removed, there is no longer a reason
to have multiple images.

### What's changed

- Change the workflows to generate only a single release image.
- Update the documentation
- Change paths in the Docker registry

### Checklist
- [x] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
- Move validation to tensor layout and tensor spec construction
  * Refactor as helper functions in tensor layout and tensor spec
  * Tensor layout checks for shard spec + tile spec
    ** Physical shard shape must be divisible by tile shape if TILE layout
  * Tensor spec checks for shard spec + tensor shape
    ** Core grid is valid for number of shards along rows/cols
- Update validation for sharding (only call if is_sharded and shard_spec is not None)
  * Reword asserts to be more descriptive
  * Remove check on shard shape for row major sharding
  * Switch to query physical shape and physical shard shape
- Add gtests for illegal tensor layout and tensor spec creation
  * TODO (issue #17060): Flip to TT_FATAL
  * Rename sharding_with_alignment to more generic file name
  * Update tests to provide correct shard spec
    ** Add non-zero grid size for sharding
    ** Add TensorMemoryLayout matching intended spec
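The two checks described above (shard shape divisible by tile shape for TILE layout, and a core grid large enough for the number of shards) can be sketched as follows. This is a Python model of the logic only; the real validation lives in the C++ tensor layout and tensor spec code, and the function name, argument names, and the 32x32 default tile shape here are assumptions.

```python
import math


def validate_sharding(physical_shape, physical_shard_shape, grid_shape,
                      tile_layout=True, tile_shape=(32, 32)):
    """Raise ValueError if the shard spec is inconsistent; return the shard counts.

    All shapes are (rows, cols) tuples. Hypothetical helper for illustration.
    """
    shard_h, shard_w = physical_shard_shape

    # Tensor layout check: a TILE-layout shard must contain whole tiles.
    if tile_layout:
        tile_h, tile_w = tile_shape
        if shard_h % tile_h != 0 or shard_w % tile_w != 0:
            raise ValueError(
                f"Physical shard shape {physical_shard_shape} must be divisible "
                f"by tile shape {tile_shape} for TILE layout"
            )

    # Tensor spec check: the grid must be able to hold every shard.
    height, width = physical_shape
    shards_along_rows = math.ceil(height / shard_h)
    shards_along_cols = math.ceil(width / shard_w)
    grid_rows, grid_cols = grid_shape
    if shards_along_rows > grid_rows or shards_along_cols > grid_cols:
        raise ValueError(
            f"Need {shards_along_rows}x{shards_along_cols} shards but core grid "
            f"is only {grid_rows}x{grid_cols}"
        )
    return shards_along_rows, shards_along_cols
```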
#0: Fix incorrect TensorMemoryLayout in test_scaled_dot_product_attention_decode.py
### Ticket
[Link to Github
Issue](#16954)

### Problem description
Most CBs between 1 and 16 are unused. This causes the dispatcher to waste
time initializing many unneeded CBs, so it would be better to pack
them starting at 0.


### Checklist
- [x] Post commit CI passes
- [x] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
…er dealloc issues

  - Add tests for reading and writing shards with Interleaved and Sharded configs
  - Add test for deallocation, verifying addresses
Noncontiguous CB ranges cause performance problems for the dispatcher, because it initializes all CBs up to the max used index. Warn when programs don't do that.
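A minimal sketch of such a check, in Python rather than the actual C++ host code: given the set of CB indices a program uses, it counts how many unused CBs the dispatcher would still initialize and emits a warning if there are any. The function name and warning text are hypothetical.

```python
import warnings


def count_wasted_cb_inits(used_cb_indices) -> int:
    """Return how many unused CBs get initialized, warning if non-zero.

    Models the behaviour described above: every CB from 0 up to the highest
    used index is initialized, whether or not the program uses it.
    """
    used = sorted(set(used_cb_indices))
    if not used:
        return 0
    num_initialized = used[-1] + 1
    wasted = num_initialized - len(used)
    if wasted:
        warnings.warn(
            f"Circular buffer indices {used} are not contiguous from 0; "
            f"{wasted} unused CB(s) will still be initialized"
        )
    return wasted
```

For instance, a program using only CBs 0 and 16 would cause 17 CBs to be initialized, 15 of them unused; repacking them as 0 and 1 removes that overhead.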
### Ticket
[Link to Github
Issue](#16679)

### Problem description
TopK currently supports max sorting, where K max values are returned. We
need to add necessary changes to LLKs to support returning the K min
values.

### What's changed
LLKs were updated to pass down a flag specifying which behavior (largest
or smallest k values) is expected. Ckernel updated to place min values
into register instead of max values when flag is set, returning k min
values as a result.
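The flag-controlled behavior can be illustrated with a small host-side reference in Python. This is only a model of the expected results, not the LLK or ckernel implementation, and the `largest` parameter name is an assumption.

```python
import heapq


def topk(values, k, largest=True):
    """Return (top_values, top_indices) for the k largest or k smallest entries.

    `largest=True` gives the existing max behaviour; `largest=False` selects
    the k minimum values instead.
    """
    pick = heapq.nlargest if largest else heapq.nsmallest
    # Pair each value with its index so the indices can be returned as well.
    pairs = pick(k, enumerate(values), key=lambda pair: pair[1])
    top_values = [value for _, value in pairs]
    top_indices = [index for index, _ in pairs]
    return top_values, top_indices
```

A reference like this can be handy for checking device output in tests, since the same input can be run with both flag settings and compared.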

### Checklist
- [x] [Post commit CI
passes](https://github.com/tenstorrent/tt-metal/actions/runs/12932508914)
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/runs/12932523648)
(if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
…17097)

### Ticket
#17095

### Problem description
We don't even compile our repo automatically in PRs. We require devs to
navigate the maze of GH to find the right button to mash. And we can't
just auto-run APC because that's crazy long (and heavy on the infra).

### What's changed
A new workflow that does a simple build. We'll expand later, but with an
eye on robustness and speed.
### Ticket
[Link to Github
Issue](#16956 (comment))

### Checklist
- [x] Post commit CI passes
- [x] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
  - Add top level EnqueueWriteMeshBuffer and EnqueueReadMeshBuffer APIs
    to distributed.hpp