forked from EricLBuehler/mistral.rs
Merge EricLBuehler/mistral.rs into spiceai
#16
Merged
Conversation
* Adding streaming function to mistralrs server
* Adding simple_stream example
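For context on what the streaming addition enables, here is a minimal client sketch that consumes the server's OpenAI-compatible streaming endpoint over SSE. This is not the repo's simple_stream example; the port, model id, and raw-chunk printing are illustrative, and it assumes `reqwest` (with the `stream` feature), `tokio`, `futures-util`, and `serde_json` as dependencies.

```rust
// Hypothetical sketch: stream a chat completion from a locally running server.
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let body = serde_json::json!({
        "model": "default",                                  // illustrative model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": true
    });
    let resp = client
        .post("http://localhost:1234/v1/chat/completions")   // port is an assumption
        .json(&body)
        .send()
        .await?;

    // Each SSE chunk arrives as `data: {...}` lines; print the raw deltas here.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```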
* Add a forward_autocast method
* Add a to_gguf_quant method for bnb
* Handle blocksizes
* Maybe cast
* Add QuantMethod::dequantize_w
* Debug
* Fix the bug maybe???
* Clippy
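A hedged sketch of what a `dequantize_w` hook can look like, including the per-block scales hinted at by "Handle blocksizes". The trait shape, the 4-bit packing, and the centered-code formula are assumptions for illustration, not mistral.rs's actual `QuantMethod` trait.

```rust
// Stand-in tensor type; the real code operates on candle tensors.
#[derive(Debug)]
struct Tensor(Vec<f32>);

trait QuantMethod {
    /// Recover full-precision weights from quantized storage, e.g. so they
    /// can be re-quantized into a GGUF-compatible layout.
    fn dequantize_w(&self) -> Tensor;
}

struct BnbLinear {
    packed: Vec<u8>,   // two 4-bit codes per byte
    scales: Vec<f32>,  // one scale per `blocksize` values
    blocksize: usize,
}

impl QuantMethod for BnbLinear {
    fn dequantize_w(&self) -> Tensor {
        let mut out = Vec::with_capacity(self.packed.len() * 2);
        for byte in &self.packed {
            for nib in [byte & 0x0F, byte >> 4] {
                let j = out.len();
                // Centered 4-bit code rescaled by its block's scale.
                out.push((nib as f32 - 8.0) * self.scales[j / self.blocksize]);
            }
        }
        Tensor(out)
    }
}

fn main() {
    let layer = BnbLinear {
        packed: vec![0x80, 0x7F], // codes 0, 8, 15, 7
        scales: vec![0.5],
        blocksize: 4,
    };
    println!("{:?}", layer.dequantize_w()); // [-4.0, 0.0, 3.5, -0.5]
}
```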
* More vllama optimizations
* Oops
* Use addmm metal
* Make some progress
* Conditional
* No crossattn quant
* Fix loading from uqff
* Update docs
* Update deps
* Work on prefix cacher
* It works
* Clippy
* Enable partial matches
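The "partial matches" idea is: reuse the longest cached token prefix of an incoming prompt so only the suffix needs a fresh forward pass. A toy sketch of the lookup; names and the flat `HashMap` store are illustrative, not mistral.rs's prefix cacher.

```rust
use std::collections::HashMap;

struct PrefixCache {
    // Maps a cached token sequence to an opaque KV-cache handle.
    entries: HashMap<Vec<u32>, usize>,
}

impl PrefixCache {
    /// Return the handle covering the longest cached prefix of `tokens`,
    /// plus how many tokens it covers.
    fn longest_prefix(&self, tokens: &[u32]) -> Option<(usize, usize)> {
        self.entries
            .iter()
            .filter(|(k, _)| tokens.starts_with(k))
            .max_by_key(|(k, _)| k.len())
            .map(|(k, &h)| (h, k.len()))
    }
}

fn main() {
    let mut cache = PrefixCache { entries: HashMap::new() };
    cache.entries.insert(vec![1, 2], 4);
    cache.entries.insert(vec![1, 2, 3], 7);
    // Prompt shares the [1, 2, 3] prefix: only [9, 9] is recomputed.
    assert_eq!(cache.longest_prefix(&[1, 2, 3, 9, 9]), Some((7, 3)));
}
```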
* Add --cpu flag to `mistralrs-server`
* Update lib.rs
* Update main.rs
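An illustrative clap definition for a `--cpu` override like the one added here; the server's real CLI surface is much larger, and wiring the flag into device selection is only hinted at by the print.

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Force the model to run on CPU even when an accelerator is available.
    #[arg(long)]
    cpu: bool,
}

fn main() {
    let args = Args::parse();
    // The real server would pick Device::Cpu here instead of printing.
    println!("forcing CPU: {}", args.cpu);
}
```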
* Separate cuda paged attention impl
* Sketch necessary functions
* Implement swap blocks and copy blocks
* Implement reshape_and_cache
* Add the kernels
* Add the normal sdpa kernel as a basis
* Wire things up a bit
* Correct inputs
* Implement the pagedattention kernel
* Kernels compile
* Instantiate!
* Add kernel call code
* Implement the op for v1
* Implement the v2 kernel and op
* Clippy
* Correct memory info for metal
* Fixes
* Fix cuda
* Fix
* Debugging
* 🚀 It works!
* Use faster vector implementation
* Fix bug for kernel cache & Phi3 inference (EricLBuehler#1003)
* Remove the leftover used in nightly debugging
* Fix warning
* Add (deactivated) bfloat support w/ simd
* Tune num_threads
* Update docs
* Disable paged attn by default on metal

Co-authored-by: Guoqing Bao <[email protected]>
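To make the paged-attention commits above easier to follow: paged KV caching stores a sequence's keys and values in fixed-size physical blocks that need not be contiguous, and a per-sequence block table maps logical token positions to physical slots, which is the bookkeeping kernels like `reshape_and_cache` rely on when writing new entries. A toy illustration; the struct and field names are mine, not the kernels' actual interface.

```rust
const BLOCK_SIZE: usize = 16;

struct BlockTable {
    // Physical block index for each logical block of the sequence.
    blocks: Vec<usize>,
}

impl BlockTable {
    /// Physical (block, offset) slot for logical token position `pos`,
    /// i.e. where a reshape_and_cache-style kernel writes a new KV entry.
    fn slot(&self, pos: usize) -> (usize, usize) {
        (self.blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    // A 40-token sequence spread over three non-contiguous physical blocks.
    let table = BlockTable { blocks: vec![5, 2, 9] };
    assert_eq!(table.slot(0), (5, 0));
    assert_eq!(table.slot(17), (2, 1)); // second logical block lives in block 2
    assert_eq!(table.slot(39), (9, 7));
}
```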
* Support for normal cache for mllama, phi3v, qwen2vl
* Clippy
…models (EricLBuehler#1009)
* Support BF16 kvcache & attention for GGUF/GGML quantization
* Fix clippy
* Pass dtype to xlora gguf/ggml model
* Remove the hardcoded fix for the literal chat template (side effect: the model cannot terminate itself when running a GGUF file)
* Pass dtype to Lora GGUF/GGML models
* Move start_offsets_kernel to correct device
* Update starcoder2.rs
* Support device mapping
* format
* remove mut
* Add get_unique_devices method
* Move tensor for device mapping
* Add DeviceMapper
* Fix wrong RotaryEmbedding import
* Remove unnecessary tensor copies
* Add device mapping
* Create tensor copies for each device for pa
* Add device mapper
* Remove unnecessary tensor move
* clippy
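A hedged sketch of the device-mapping idea running through these commits: a mapper assigns each layer to a device, and activations are moved only when the next layer lives elsewhere. `Device` and `Tensor` are stubs and the trait shape is an assumption; the real `DeviceMapper` in mistral.rs differs.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Device { Cpu, Cuda(usize) }

#[derive(Debug)]
struct Tensor { device: Device }

impl Tensor {
    // No-op when already on the target device, avoiding needless copies.
    fn to_device(self, dev: Device) -> Tensor {
        if self.device == dev { self } else { Tensor { device: dev } }
    }
}

trait DeviceMapper {
    fn device_for_layer(&self, layer: usize) -> Device;
}

/// Evenly splits `n_layers` across the provided device list.
struct EvenSplit { devices: Vec<Device>, n_layers: usize }

impl DeviceMapper for EvenSplit {
    fn device_for_layer(&self, layer: usize) -> Device {
        let per = self.n_layers.div_ceil(self.devices.len());
        self.devices[layer / per]
    }
}

fn main() {
    let mapper = EvenSplit { devices: vec![Device::Cuda(0), Device::Cuda(1)], n_layers: 32 };
    let mut x = Tensor { device: Device::Cpu };
    for layer in 0..32 {
        // Hop devices only at layer boundaries where the mapping changes.
        x = x.to_device(mapper.device_for_layer(layer));
    }
    assert_eq!(x.device, Device::Cuda(1));
}
```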
* Fixes for prefix cache + llama vision
* Fix for vllama
* Initial steps toward supporting deepseekv2
* Implement the attention forward
* Add the mlp
* Implement the moe gate and forward
* Fixes
* Forward pass runs
* Clippy
* Update
* It works
* Use faster rope
* Use normal cache
* Add framework for paged attn
* Fixes
* Support isq
* Add moqe support, residual tensors
* Add examples, python API, docs
* Fix tests
* Update deps
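For the "moe gate" commit above, a toy version of the routing step such models use: softmax over expert logits, keep the top-k experts, renormalize their weights. Plain `Vec<f32>` stands in for real tensors, and deepseekv2's actual gate has extra details (grouped routing, shared experts) not shown here.

```rust
/// Returns (expert_index, renormalized_weight) pairs for the top-k experts.
fn moe_gate(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stabilized softmax over expert logits.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();

    // Keep the k most probable experts and renormalize their weights.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let norm: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.into_iter().map(|(i, p)| (i, p / norm)).collect()
}

fn main() {
    // Experts 1 and 3 win; their weights sum to 1 after renormalization.
    println!("{:?}", moe_gate(&[0.1, 2.0, -1.0, 1.5], 2));
}
```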
* Use cudarc fork
* Use float8 mistralrs_cudarc_fork feature
* Fix
…er#1071)
* pass mapper
* Fixing idefics3 and idefics2
* Fixing idefics3
* Improve handling of activations in device map
* Log sub models that are not device mapped
* Reduce defaults
* Register sub models for the rest of the vision models
…er#1077)
* Implement the deepseekv3 model
* Update apis and docs
* handle assistant messages with 'tool_calls' when used in chat_template
* linting
* add better methods for using tools and update examples
* fixes
* Update interactive_mode.rs
* Don't print GGUF model metadata when silent=true
… `Usage`. (EricLBuehler#1078)
* handle assistant messages with 'tool_calls' when used in chat_template
* linting
* add better methods for using tools and update examples
* fixes
* Update interactive_mode.rs
* add Usage to ChatCompletionChunkResponse
* add usage telemetry to streaming messages
* clippy
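A hypothetical shape for attaching token-usage telemetry to streaming chunks, mirroring the OpenAI-style `usage` field; these are not the exact mistral.rs structs. Assumes `serde` (with the `derive` feature) and `serde_json`.

```rust
use serde::Serialize;

#[derive(Serialize)]
struct Usage {
    prompt_tokens: usize,
    completion_tokens: usize,
    total_tokens: usize,
}

#[derive(Serialize)]
struct ChatCompletionChunkResponse {
    id: String,
    // Per OpenAI semantics, usage is typically only populated on the final chunk.
    #[serde(skip_serializing_if = "Option::is_none")]
    usage: Option<Usage>,
}

fn main() {
    let last_chunk = ChatCompletionChunkResponse {
        id: "chatcmpl-1".into(),
        usage: Some(Usage { prompt_tokens: 12, completion_tokens: 40, total_tokens: 52 }),
    };
    println!("{}", serde_json::to_string(&last_chunk).unwrap());
}
```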
* Add siglip, configs
* Fix siglip
* Implement the resampler
* Implement the rest of the vision model
* Add the image processor
* Implement the processor
* Clippy
* A few fixes
* Even more fixes
* It works
* ISQ support
* Fix cuda
* Major refactor of rope
* Fix
* Fix resampler pos embed
* Complete merge
* Small optimization
* Add docs and examples
* Implement residual tensors
* Update docs
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2339           70          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          612          546            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3324            0         2520          804
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4039          626         2520          893
-------------------------------------------------------------------------------
 Rust                  287        87687        78742         1808         7137
 |- Markdown           140         1499           25         1362          112
 (Total)                          89186        78767         3170         7249
===============================================================================
 Total                 436        98311        81822         6843         9646
===============================================================================
ewgenius approved these changes on Jan 23, 2025
sgrebnov approved these changes on Jan 23, 2025
…d.rs' into jeadie/25-01-23/spiceai
sgrebnov approved these changes on Jan 28, 2025
Sevenannn approved these changes on Jan 29, 2025
🤔 Concerns
Possible implementation: move supports_attn_softmax logic to build.rs (EricLBuehler/mistral.rs#1085)