forked from EricLBuehler/mistral.rs
Merge EricLBuehler/mistral.rs into spiceai
#16
Merged
Conversation
* Adding streaming function to mistralrs server
* Adding simple_stream example
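For context on what the streaming addition enables, here is a minimal client sketch that consumes the server's OpenAI-compatible streaming endpoint over SSE. This is not the repo's simple_stream example; the port, model id, and raw-chunk printing are illustrative, and it assumes `reqwest` (with the `stream` feature), `tokio`, `futures-util`, and `serde_json` as dependencies.

```rust
// Hypothetical sketch: stream a chat completion from a locally running server.
use futures_util::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let body = serde_json::json!({
        "model": "default",                                  // illustrative model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": true
    });
    let resp = client
        .post("http://localhost:1234/v1/chat/completions")   // port is an assumption
        .json(&body)
        .send()
        .await?;

    // Each SSE chunk arrives as `data: {...}` lines; print the raw deltas here.
    let mut stream = resp.bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```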
* Add a forward_autocast method
* Add a to_gguf_quant method for bnb
* Handle blocksizes
* Maybe cast
* Add QuantMethod::dequantize_w
* Debug
* Fix the bug maybe???
* Clippy
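A hedged sketch of what a `dequantize_w` hook can look like, including the per-block scales hinted at by "Handle blocksizes". The trait shape, the 4-bit packing, and the centered-code formula are assumptions for illustration, not mistral.rs's actual `QuantMethod` trait.

```rust
// Stand-in tensor type; the real code operates on candle tensors.
#[derive(Debug)]
struct Tensor(Vec<f32>);

trait QuantMethod {
    /// Recover full-precision weights from quantized storage, e.g. so they
    /// can be re-quantized into a GGUF-compatible layout.
    fn dequantize_w(&self) -> Tensor;
}

struct BnbLinear {
    packed: Vec<u8>,   // two 4-bit codes per byte
    scales: Vec<f32>,  // one scale per `blocksize` values
    blocksize: usize,
}

impl QuantMethod for BnbLinear {
    fn dequantize_w(&self) -> Tensor {
        let mut out = Vec::with_capacity(self.packed.len() * 2);
        for byte in &self.packed {
            for nib in [byte & 0x0F, byte >> 4] {
                let j = out.len();
                // Centered 4-bit code rescaled by its block's scale.
                out.push((nib as f32 - 8.0) * self.scales[j / self.blocksize]);
            }
        }
        Tensor(out)
    }
}

fn main() {
    let layer = BnbLinear {
        packed: vec![0x80, 0x7F], // codes 0, 8, 15, 7
        scales: vec![0.5],
        blocksize: 4,
    };
    println!("{:?}", layer.dequantize_w()); // [-4.0, 0.0, 3.5, -0.5]
}
```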
* More vllama optimizations
* Oops
* Use addmm metal
* Make some progress
* Conditional
* No crossattn quant
* Fix loading from uqff
* Update docs
* Update deps
* Work on prefix cacher
* It works
* Clippy
* Enable partial matches
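The "partial matches" idea is: reuse the longest cached token prefix of an incoming prompt so only the suffix needs a fresh forward pass. A toy sketch of the lookup; names and the flat `HashMap` store are illustrative, not mistral.rs's prefix cacher.

```rust
use std::collections::HashMap;

struct PrefixCache {
    // Maps a cached token sequence to an opaque KV-cache handle.
    entries: HashMap<Vec<u32>, usize>,
}

impl PrefixCache {
    /// Return the handle covering the longest cached prefix of `tokens`,
    /// plus how many tokens it covers.
    fn longest_prefix(&self, tokens: &[u32]) -> Option<(usize, usize)> {
        self.entries
            .iter()
            .filter(|(k, _)| tokens.starts_with(k))
            .max_by_key(|(k, _)| k.len())
            .map(|(k, &h)| (h, k.len()))
    }
}

fn main() {
    let mut cache = PrefixCache { entries: HashMap::new() };
    cache.entries.insert(vec![1, 2], 4);
    cache.entries.insert(vec![1, 2, 3], 7);
    // Prompt shares the [1, 2, 3] prefix: only [9, 9] is recomputed.
    assert_eq!(cache.longest_prefix(&[1, 2, 3, 9, 9]), Some((7, 3)));
}
```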
* Add --cpu flag to `mistralrs-server`
* Update lib.rs
* Update main.rs
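An illustrative clap definition for a `--cpu` override like the one added here; the server's real CLI surface is much larger, and wiring the flag into device selection is only hinted at by the print.

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Force the model to run on CPU even when an accelerator is available.
    #[arg(long)]
    cpu: bool,
}

fn main() {
    let args = Args::parse();
    // The real server would pick Device::Cpu here instead of printing.
    println!("forcing CPU: {}", args.cpu);
}
```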
* Separate cuda paged attention impl
* Sketch necessary functions
* Implement swap blocks and copy blocks
* Implement reshape_and_cache
* Add the kernels
* Add the normal sdpa kernel as a basis
* Wire things up a bit
* Correct inputs
* Implement the pagedattention kernel
* Kernels compile
* Instantiate!
* Add kernel call code
* Implement the op for v1
* Implement the v2 kernel and op
* Clippy
* Correct memory info for metal
* Fixes
* Fix cuda
* Fix
* Debugging
* 🚀 It works!
* Use faster vector implementation
* Fix bug for kernel cache & Phi3 inference (EricLBuehler#1003)
* Remove the leftover used in nightly debugging
* Fix warning
* Add (deactivated) bfloat support w/ simd
* Tune num_threads
* Update docs
* Disable paged attn by default on metal

Co-authored-by: Guoqing Bao <[email protected]>
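To make the paged-attention commits above easier to follow: paged KV caching stores a sequence's keys and values in fixed-size physical blocks that need not be contiguous, and a per-sequence block table maps logical token positions to physical slots, which is the bookkeeping kernels like `reshape_and_cache` rely on when writing new entries. A toy illustration; the struct and field names are mine, not the kernels' actual interface.

```rust
const BLOCK_SIZE: usize = 16;

struct BlockTable {
    // Physical block index for each logical block of the sequence.
    blocks: Vec<usize>,
}

impl BlockTable {
    /// Physical (block, offset) slot for logical token position `pos`,
    /// i.e. where a reshape_and_cache-style kernel writes a new KV entry.
    fn slot(&self, pos: usize) -> (usize, usize) {
        (self.blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    // A 40-token sequence spread over three non-contiguous physical blocks.
    let table = BlockTable { blocks: vec![5, 2, 9] };
    assert_eq!(table.slot(0), (5, 0));
    assert_eq!(table.slot(17), (2, 1)); // second logical block lives in block 2
    assert_eq!(table.slot(39), (9, 7));
}
```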
* Support for normal cache for mllama, phi3v, qwen2vl
* Clippy
…models (EricLBuehler#1009)
* Support BF16 kvcache & attention for GGUF/GGML quantization
* Fix clippy
* Pass dtype to xlora gguf/ggml model
* Remove the hardcoded fix for the literal chat template (side effect: the model cannot terminate itself when running a GGUF file)
* Pass dtype to Lora GGUF/GGML models
* Move start_offsets_kernel to correct device
* Update starcoder2.rs
* Support device mapping
* format
* remove mut
* Add get_unique_devices method
* Move tensor for device mapping
* Add DeviceMapper
* Fix wrong RotaryEmbedding import
* Remove unnecessary tensor copies
* Add device mapping
* Create tensor copies for each device for pa
* Add device mapper
* Remove unnecessary tensor move
* clippy
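A hedged sketch of the device-mapping idea running through these commits: a mapper assigns each layer to a device, and activations are moved only when the next layer lives elsewhere. `Device` and `Tensor` are stubs and the trait shape is an assumption; the real `DeviceMapper` in mistral.rs differs.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Device { Cpu, Cuda(usize) }

#[derive(Debug)]
struct Tensor { device: Device }

impl Tensor {
    // No-op when already on the target device, avoiding needless copies.
    fn to_device(self, dev: Device) -> Tensor {
        if self.device == dev { self } else { Tensor { device: dev } }
    }
}

trait DeviceMapper {
    fn device_for_layer(&self, layer: usize) -> Device;
}

/// Evenly splits `n_layers` across the provided device list.
struct EvenSplit { devices: Vec<Device>, n_layers: usize }

impl DeviceMapper for EvenSplit {
    fn device_for_layer(&self, layer: usize) -> Device {
        let per = self.n_layers.div_ceil(self.devices.len());
        self.devices[layer / per]
    }
}

fn main() {
    let mapper = EvenSplit { devices: vec![Device::Cuda(0), Device::Cuda(1)], n_layers: 32 };
    let mut x = Tensor { device: Device::Cpu };
    for layer in 0..32 {
        // Hop devices only at layer boundaries where the mapping changes.
        x = x.to_device(mapper.device_for_layer(layer));
    }
    assert_eq!(x.device, Device::Cuda(1));
}
```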
* Fixes for prefix cache + llama vision
* Fix for vllama
* Initial steps toward supporting deepseekv2
* Implement the attention forward
* Add the mlp
* Implement the moe gate and forward
* Fixes
* Forward pass runs
* Clippy
* Update
* It works
* Use faster rope
* Use normal cache
* Add framework for paged attn
* Fixes
* Support isq
* Add moqe support, residual tensors
* Add examples, python API, docs
* Fix tests
* Update deps
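For the "moe gate" commit above, a toy version of the routing step such models use: softmax over expert logits, keep the top-k experts, renormalize their weights. Plain `Vec<f32>` stands in for real tensors, and deepseekv2's actual gate has extra details (grouped routing, shared experts) not shown here.

```rust
/// Returns (expert_index, renormalized_weight) pairs for the top-k experts.
fn moe_gate(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stabilized softmax over expert logits.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|e| e / sum).enumerate().collect();

    // Keep the k most probable experts and renormalize their weights.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let norm: f32 = probs.iter().map(|(_, p)| p).sum();
    probs.into_iter().map(|(i, p)| (i, p / norm)).collect()
}

fn main() {
    // Experts 1 and 3 win; their weights sum to 1 after renormalization.
    println!("{:?}", moe_gate(&[0.1, 2.0, -1.0, 1.5], 2));
}
```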
* Use cudarc fork
* Use float8 mistralrs_cudarc_fork feature
* Fix
…er#1071)
* pass mapper
* Fixing idefics3 and idefics2
* Fixing idefics3
* Improve handling of activations in device map
* Log sub models that are not device mapped
* Reduce defaults
* Register sub models for the rest of the vision models
…er#1077)
* Implement the deepseekv3 model
* Update apis and docs
* handle assistant messages with 'tool_calls' when used in chat_template
* linting
* add better methods for using tools and update examples
* fixes
* Update interactive_mode.rs
* Don't print GGUF model metadata when silent=true
… `Usage`. (EricLBuehler#1078)
* handle assistant messages with 'tool_calls' when used in chat_template
* linting
* add better methods for using tools and update examples
* fixes
* Update interactive_mode.rs
* add Usage to ChatCompletionChunkResponse
* add usage telemetry to streaming messages
* clippy
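A hypothetical shape for attaching token-usage telemetry to streaming chunks, mirroring the OpenAI-style `usage` field; these are not the exact mistral.rs structs. Assumes `serde` (with the `derive` feature) and `serde_json`.

```rust
use serde::Serialize;

#[derive(Serialize)]
struct Usage {
    prompt_tokens: usize,
    completion_tokens: usize,
    total_tokens: usize,
}

#[derive(Serialize)]
struct ChatCompletionChunkResponse {
    id: String,
    // Per OpenAI semantics, usage is typically only populated on the final chunk.
    #[serde(skip_serializing_if = "Option::is_none")]
    usage: Option<Usage>,
}

fn main() {
    let last_chunk = ChatCompletionChunkResponse {
        id: "chatcmpl-1".into(),
        usage: Some(Usage { prompt_tokens: 12, completion_tokens: 40, total_tokens: 52 }),
    };
    println!("{}", serde_json::to_string(&last_chunk).unwrap());
}
```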
* Add siglip, configs
* Fix siglip
* Implement the resampler
* Implement the rest of the vision model
* Add the image processor
* Implement the processor
* Clippy
* A few fixes
* Even more fixes
* It works
* ISQ support
* Fix cuda
* Major refactor of rope
* Fix
* Fix resampler pos embed
* Complete merge
* Small optimization
* Add docs and examples
* Implement residual tensors
* Update docs
Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           41           22           10            9
 JSON                   12          105          104            0            1
 Python                 63         2706         2339           70          297
 Shell                   1           57           22           18           17
 Plain Text              3         3723            0         2413         1310
 TOML                   18          612          546            2           64
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               43         3324            0         2520          804
 |- BASH                 6          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               7          121          109            0           12
 |- Rust                12          406          344            0           62
 |- TOML                 2           75           63            0           12
 (Total)                           4039          626         2520          893
-------------------------------------------------------------------------------
 Rust                  287        87687        78742         1808         7137
 |- Markdown           140         1499           25         1362          112
 (Total)                          89186        78767         3170         7249
===============================================================================
 Total                 436        98311        81822         6843         9646
===============================================================================
ewgenius approved these changes on Jan 23, 2025
sgrebnov approved these changes on Jan 23, 2025
…d.rs' into jeadie/25-01-23/spiceai
sgrebnov approved these changes on Jan 28, 2025
Sevenannn approved these changes on Jan 29, 2025
🤔 Concerns
Possible implementation: move supports_attn_softmax logic to build.rs (EricLBuehler/mistral.rs#1085)