Add HF and DeepSeek-R1-Distill-Llama-70B support #17420

yieldthought · 2025-01-31T13:20:20Z

Problem description

The existing codebase loads the Meta checkpoint format, but many derivative models are available only on Hugging Face.

What's changed

Add support for loading Hugging Face model formats, paving the way for full Qwen support (pending a YaRN RoPE implementation) and adding DeepSeek-R1-Distill-Llama-70B support.
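As a rough illustration of what loading a Hugging Face checkpoint into a Meta-format codebase can involve, the sketch below renames Llama-style Hugging Face state-dict keys to their Meta-style equivalents. It is not the code from this PR: the helper names `hf_to_meta_key` and `convert_hf_state_dict` are hypothetical, and the sketch only renames keys. A real loader may also need to reorder the Q/K projection weights, because the two formats can lay out rotary-embedding dimensions differently.

```python
import re

# Hugging Face Llama-style key patterns and their Meta-style replacements.
# The table is an illustrative assumption, not taken from this PR.
_HF_TO_META = [
    (r"^model\.embed_tokens\.", "tok_embeddings."),
    (r"^model\.norm\.", "norm."),
    (r"^lm_head\.", "output."),
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.", r"layers.\1.attention.wq."),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.", r"layers.\1.attention.wk."),
    (r"^model\.layers\.(\d+)\.self_attn\.v_proj\.", r"layers.\1.attention.wv."),
    (r"^model\.layers\.(\d+)\.self_attn\.o_proj\.", r"layers.\1.attention.wo."),
    (r"^model\.layers\.(\d+)\.mlp\.gate_proj\.", r"layers.\1.feed_forward.w1."),
    (r"^model\.layers\.(\d+)\.mlp\.down_proj\.", r"layers.\1.feed_forward.w2."),
    (r"^model\.layers\.(\d+)\.mlp\.up_proj\.", r"layers.\1.feed_forward.w3."),
    (r"^model\.layers\.(\d+)\.input_layernorm\.", r"layers.\1.attention_norm."),
    (r"^model\.layers\.(\d+)\.post_attention_layernorm\.", r"layers.\1.ffn_norm."),
]


def hf_to_meta_key(key):
    """Return the Meta-style name for a Hugging Face key (hypothetical helper)."""
    for pattern, replacement in _HF_TO_META:
        new_key, count = re.subn(pattern, replacement, key)
        if count:
            return new_key
    # Keys that match no pattern are passed through unchanged.
    return key


def convert_hf_state_dict(state_dict):
    """Rename every key in a state dict; the values are left untouched."""
    return {hf_to_meta_key(name): tensor for name, tensor in state_dict.items()}
```

For example, `hf_to_meta_key("model.layers.3.self_attn.q_proj.weight")` yields `"layers.3.attention.wq.weight"`.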

Checklist

### What's changed

Removed all CI tests running TG Llama on the old codebase since it's outdated and we are only developing on the new codebase now.

Running TG pipelines: https://github.com/tenstorrent/tt-metal/actions/runs/12933778923

### Checklist

- [ ] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes

…6627)

### Ticket

- #16626

### Problem description

In the current use case of Matmul1D with gather_in0 in the Llama models, the activations and weights need to be padded. This results in significant overhead.

### What's changed

- Added support to skip part of in0_block_w that is padding information
- Pad the Kt and Nt in the host code for gather_in0

### Checklist

- [x] Post commit CI passes (https://github.com/tenstorrent/tt-metal/actions/runs/12893880800)
- [x] New/Existing tests provide coverage for changes (https://github.com/tenstorrent/tt-metal/actions/runs/12893883783)

### Ticket

#17040

### Problem description

Since the ARCH_NAME dependency was removed, there is no longer a reason to have multiple images.

### What's changed

- Change the workflows to generate only a single release image.
- Update the documentation
- Change paths in the Docker registry

### Checklist

- [x] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes

- Move validation to tensor layout and tensor spec construction
  * Refactor as helper functions in tensor layout and tensor spec
  * Tensor layout checks for shard spec + tile spec
    ** Physical shard shape must be divisible by tile shape if TILE layout
  * Tensor spec checks for shard spec + tensor shape
    ** Core grid is valid for number of shards along rows/cols
- Update validation for sharding (only call if is_sharded and shard_spec is not None)
  * Reword asserts to be more descriptive
  * Remove check on shard shape for row major sharding
  * Switch to query physical shape and physical shard shape
- Add gtests for illegal tensor layout and tensor spec creation
  * TODO (issue #17060): Flip to TT_FATAL
  * Rename sharding_with_alignment to more generic file name
  * Update tests to provide correct shard spec
    ** Add non-zero grid size for sharding
    ** Add TensorMemoryLayout matching intended spec

#0: Fix incorrect TensorMemoryLayout in test_scaled_dot_product_attention_decode.py

### Ticket

[Link to Github Issue](#16954)

### Problem description

Most CBs between 1 and 16 are unused. This causes the dispatcher to waste time initializing many unneeded CBs, so it would be better to pack them starting at 0.

### Checklist

- [x] Post commit CI passes
- [x] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes

### Ticket

[Link to Github Issue](#16679)

### Problem description

TopK currently supports max sorting, where K max values are returned. We need to add necessary changes to LLKs to support returning the K min values.

### What's changed

LLKs were updated to pass down a flag specifying which behavior (largest or smallest k values) is expected. Ckernel updated to place min values into register instead of max values when flag is set, returning k min values as a result.

### Checklist

- [x] [Post commit CI passes](https://github.com/tenstorrent/tt-metal/actions/runs/12932508914)
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/runs/12932523648) (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes

…17097)

### Ticket

#17095

### Problem description

We don't even compile our repo automatically in PRs. We require devs to navigate the maze of GH to find the right button to mash. And we can't just auto-run APC because that's crazy long (and heavy on the infra).

### What's changed

A new workflow that does a simple build. We'll expand later, but with an eye on robustness and speed.

### Ticket

[Link to Github Issue](#16956 (comment))

### Checklist

- [x] Post commit CI passes
- [x] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] **(For models and ops writers)** Full [new models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) tests passes
- [ ] New/Existing tests provide coverage for changes

yieldthought and others added 30 commits January 28, 2025 10:21

#0: Add HF model support, test with Qwen 2.5 and QwQ, WIP

b27c756

#0: Cherry-picked qwen25 onto main, runs llama+qwen demo on t3k

d913b82

#0: TG fixes

c985f27

#0: Be principled about using base_model_name and model_name

8093c65

#0: Relax PERF.md formatting requirements

3a325d7

#0: Apply llama hard-coded params in principled way

c877ff9

#0: Apply rope scaling based on original context length from config

bc9202e

#0: Continuation of rope improvement

9509d67

#0: Update 1B ref outputs to be correct size

d422c77

#0: Yarn todo

93a5e26

#0: Fix llama 3.1-70b line

43804d5

#0: Fix incorrect assertion for page size for prefetch relay inline t…

b6d841c

…o dispatch s and increase the dispatch s page size to avoid having to split some commands when having multiple sub-devices instead

#16758: Move mesh_composer call to after ttnn.from_device in ttnn.to_…

9b68a6b

…torch - Use ttnn.from_device to convert multi-device tensors first - This fixes perf regression for falcon demo tests

#0: Add WriteShard and ReadShard MeshBuffer APIs and resolve MeshBuff…

e96ae86

…er dealloc issues - Add tests for reading and writing shards with Interleaved and Sharded configs - Add test for deallocation, verifying addresses

#16979: Log when CB ranges aren't contiguous

bc00adc

Noncontiguous CB ranges cause performance problems for the dispatcher, because it initializes all CBs up to the max used index. Warn when programs don't do that.

#16502: Add Unary with params support to BinaryNg

a026f63

#0: Make sub-device merge core ranges for generating mcast commands

2d72b4d

#16716: Correctly sanitize local L1 for ethernet cores

1157898

#16474: Clean up mention of phys coords in debug tools

5e6ae99

#16539: Watcher noc sanitize virtual coord bugfix

1b2c3ef

#0: Add native 2D sharding and replication functionality to MeshBuffer

79fc75a

- Add top level EnqueueWriteMeshBuffer and EnqueueReadMeshBuffer APIs to distributed.hpp

yieldthought closed this Jan 31, 2025
