forked from mlc-ai/mlc-llm
Merge with mlc-ai/main (d3d264d4b05d73e9757375013b842254f052c6ed, April 29th, 2024) #265
Merged
Conversation
This PR introduces logprobs support with OpenAI API compatibility. It enhances the sampler with a function to get the top-probability tokens (supporting at most 5 tokens for now). To make it easy to pass logprob results back from the serving engine to the frontend, we pass logprob results as JSON strings following the OpenAI API spec. Unit tests are added to ensure the correctness of logprobs, and the logprobs support also works with speculative decoding.
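For illustration, a minimal sketch of what such a JSON-serialized logprobs payload could look like. The field names follow the public OpenAI Chat Completions logprobs format; the exact schema used by MLC Serve may differ.

```python
# Illustrative sketch only: an OpenAI-style logprobs payload serialized to a
# JSON string, roughly the shape a serving engine could hand to the frontend.
# Field names follow the public OpenAI spec; the actual schema may differ.
import json

logprob_result = {
    "content": [
        {
            "token": "Hello",
            "logprob": -0.12,
            "top_logprobs": [               # at most 5 alternatives per token
                {"token": "Hello", "logprob": -0.12},
                {"token": "Hi", "logprob": -2.31},
            ],
        }
    ]
}

payload = json.dumps(logprob_result)        # passed around as a JSON string
print(json.loads(payload)["content"][0]["token"])
```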
This PR supports Mixtral in MLC Serve. The main change is introducing the Mistral conversation template to the Python registry so that MLC Serve can use it. Besides that, this PR updates the KV cache capacity analysis to make the usage calculation more accurate, while remaining conservative, since there is a known issue regarding batch-prefill embedding taking that may lead to OOM. We will follow up on the issue with a fix in the future and then enable the estimation to use more GPU vRAM.
Prior to this PR, `u_char` was used, which is not a standard type in C++ and causes a Windows build failure. This PR fixes it by using `unsigned char`.
…#1849) [Fix] Add phi lm head name to is_final_fc
…#1852) Instead of a Python function that returns an updated `IRModule`, the new `optimize_mod_pipeline` function returns a `tvm.ir.transform.Pass`, which can be applied to an `IRModule`.
* Create __init__.py
* Add files via upload
* Update model.py
* Update model_preset.py
* Update conv_templates.cc
* Update internlm_loader.py
* Update internlm_quantization.py
* fix name of notes
* Update model.py
* Migration
* fix pylint issue
* fix pylint issue
* fix pylint error
* Update internlm_loader.py
* Update __init__.py
* Update __init__.py
* Delete python/mlc_chat/model/internlm/__init__.py
* Add files via upload
Prior to this commit, a model name with multiple path components (e.g. `dist/models/group_name/model_name`) would have duplicated path components (e.g. `dist/group_name/artifact_path/group_name/libname.so`). This commit resolves the duplication.
* [KVCache] Add max num threads to KVCache kernels, fix WebGPU
* Read max_num_threads_per_block when available
* Change merge state in place kernel
* Make attention decode aware of max num threads, not just webgpu
* Change util function name

Co-authored-by: Egor Churaev <[email protected]>
…1860) This PR moves the import of transformers into the function body of the tiktoken tokenizer conversion, so we do not have a forced dependency on transformers.
This PR adds RWKV5 support with RNNState, an interface similar to PagedAttention. Co-authored-by: Xiaoyu Zhang <[email protected]>
Following mlc-ai#1854, this PR registers the ChatML conversation template.
Sets the entry functions for a module. This utility is intended for cases where a module contains several externally exposed functions and only one is desired for use (e.g. separating out a `transform_params` function from an `IRModule` that also contains inference functions). This commit only updates the external visibility, after which `relax.transform.DeadCodeElimination()` can be applied.
…i#1856) This allows it to be used as part of an optimization pipeline specified as a `tvm.ir.transform.Sequential`.
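As a hedged sketch of why returning a `Pass` composes well: a pipeline built this way can be chained with other passes through `tvm.ir.transform.Sequential`. The `my_pipeline` helper below is a hypothetical stand-in, not the actual `optimize_mod_pipeline`, and it assumes a TVM build with Relax.

```python
# Hypothetical sketch (assumes a TVM build with Relax): a function that returns
# a Pass composes with other passes via tvm.ir.transform.Sequential.
import tvm
from tvm import relax


def my_pipeline() -> tvm.ir.transform.Pass:   # stand-in for optimize_mod_pipeline
    return tvm.ir.transform.Sequential(
        [
            relax.transform.DeadCodeElimination(),  # any Relax passes go here
        ],
        name="my_pipeline",
    )

# A Pass is itself callable on an IRModule:
#   optimized_mod = my_pipeline()(mod)
```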
mlc-ai#1867) This PR is the 3rd part of the grammar-guided generation. It integrates the grammar framework into the generation process and supports JSON output for now. The API this PR provides is compatible with the OpenAI API.

### APIs

#### Python API

```
@dataclass
class ResponseFormat:
    type: Literal["text", "json_object"] = "text"
    json_schema: Optional[str] = None

@dataclass
class GenerationConfig:
    response_format: ResponseFormat = ResponseFormat(type="text")
```

#### REST API

```
response_format: { "type": "text" }                               # text generation, by default
response_format: { "type": "json_object" }                        # JSON generation
response_format: { "type": "json_object", "json_schema": "..." }  # JSON generation with schema
```

JSON generation with schema is not supported yet, but is planned for the future.

### Performance

#### Without JSON

```
Single token prefill latency: 891.2234 ms/tok
Single token decode latency: 31.3399 ms/tok
Prefill token throughput: 4693.3077 tok/s
Decode token throughput: 226.4406 tok/s
Overall token throughput: 470.3180 tok/s
```

#### With JSON

```
Single token prefill latency: 219.2287 ms/tok
Single token decode latency: 29.1399 ms/tok
Prefill token throughput: 7392.1555 tok/s
Decode token throughput: 179.2296 tok/s
Overall token throughput: 1052.1996 tok/s
```

We observed a slight decrease in performance under JSON mode. This will be further optimized in the future.
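A usage sketch of the REST API above, sending a `response_format` of type `json_object` to an OpenAI-compatible chat completions endpoint. The host, port, and model id are placeholders for a locally running server, not values taken from this PR.

```python
# Usage sketch: requesting JSON output through the OpenAI-compatible REST API.
# The endpoint path follows the OpenAI convention; host, port, and model name
# below are placeholders for a locally running server.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "Llama-2-7b-chat-hf-q4f16_1",          # placeholder model id
        "messages": [{"role": "user", "content": "List three fruits as JSON."}],
        "response_format": {"type": "json_object"},      # constrain output to JSON
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```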
This PR brings the field `n` to the generation config and thereby supports parallel generation. The parallel generation effectively leverages the "fork" functionality of the paged KV cache. This PR supports specifying the number of parallel generations `n` in the standard OpenAI ChatCompletion API. This is the last feature towards OpenAI API feature completeness.
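A hedged client-side sketch of using `n`, assuming the `openai` Python package (v1+) pointed at an OpenAI-compatible endpoint; the base URL, API key, and model id are placeholders.

```python
# Sketch: requesting n parallel completions through the OpenAI-compatible API.
# Assumes the `openai` Python package (v1+) and a locally running server;
# base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="Llama-2-7b-chat-hf-q4f16_1",          # placeholder model id
    messages=[{"role": "user", "content": "Suggest a project name."}],
    n=3,                                          # three parallel generations
)
for choice in response.choices:
    print(choice.index, choice.message.content)
```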
Sometimes the scm checkout can time out; this PR adds a retry for that.
Prior to this PR, the TIR attention kernels did not cast matmul operands to fp32 before multiplying. For models like Phi-2, which may have large Q/K/V data (at the level of a few hundred), the fp16 multiplication exceeds the range of fp16 and sometimes leads to the attention result being NaN. This PR fixes this issue.
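A tiny NumPy illustration of the failure mode: float16 tops out around 65504, so multiplying two values near 300 already overflows unless the operands are cast to float32 first.

```python
# NumPy illustration of the overflow described above: float16's max is ~65504,
# so 300 * 300 = 90000 becomes inf in fp16, but is fine after casting to fp32.
import numpy as np

q = np.float16(300.0)
k = np.float16(300.0)

print(q * k)                          # inf (may also emit an overflow warning)
print(np.float32(q) * np.float32(k))  # 90000.0: fine in float32
```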
…lc-ai#1857) Prior to this commit, the `ReorderTransformFunc` required several components of the `ParamManager` in order to be used. The functionality it provides, reordering dataflow blocks to minimize the live set, is useful outside the context of the `ParamManager`. This commit makes the following changes, allowing it to be used independently of the `ParamManager`:
- Generate the `pidx2binname` dictionary outside of `ReorderTransformFunc`.
- Allow parameters to be separate `func.params`, rather than a single bundled tuple parameter.
This PR migrates Phi-2 to Paged KV cache Attention, as part of the model definition migration according to mlc-ai#1749. Co-authored-by: Shrey Gupta <[email protected]>
…c-ai#1874) The use of `call_inplace_packed` and `call_pure_packed` in the old flow is outdated due to signature changes. This PR fixes the issue.
PR mlc-ai#1852 missed applying the BundleModelParams pass and thus made the compiled models not runnable through ChatModule (mlc-ai#1864). This PR fixes the issue.
As pointed out by mlc-ai#1830, this PR fixes the Android app download link in docs.
Fix website link not accessible
This PR adopts suggestions from the support of OpenAI API parallel generation `n` in mlc-ai#1868. The main update in this PR is to make RequestState a standalone class, which was previously a typedef of `std::vector<RequestStateEntry>`. This PR also fixes a bug in prefill that caused engine failure when `n` is large.
Support Qwen1.0 Paged KV Cache
This PR introduces the Paged Radix Tree data structure, as the foundation and prerequisite of prefix caching.
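For intuition, a simplified, hypothetical sketch of the idea behind prefix caching: a trie over token IDs that finds the longest already-cached prefix of a new request. The actual Paged Radix Tree additionally compresses edges and tracks KV-cache pages; none of that is shown here.

```python
# Simplified, hypothetical sketch of prefix matching for prefix caching.
# The real Paged Radix Tree also manages KV-cache pages; this only shows
# inserting token sequences and finding the longest cached prefix.
from typing import Dict, List


class PrefixTrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrieNode"] = {}


class PrefixTrie:
    def __init__(self) -> None:
        self.root = PrefixTrieNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrieNode())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched


trie = PrefixTrie()
trie.insert([1, 5, 7, 9])
print(trie.longest_prefix([1, 5, 7, 2]))   # 3 tokens reusable from the cache
```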
This PR removes the mandatory model check in the server, since for now we serve at most one engine, which means there is always a unique engine being served. As issue mlc-ai#2155 points out, the model check in the server can be a bad experience when the model string mismatches.
* [Eagle] Attach gpu verifier to model
* WIP
* WIP
* fix
* Enable GPU verifier
* lint
* lint
* [Eagle] Make BatchSelectLastHidden able to run on the controller
…lc-ai#2206) This PR updates the draft verification of the normal mode speculative decoding. Prior to this PR, we did not effectively leverage all the draft tokens, and this PR fixes the issue.
This PR introduces a renormalization interface with regard to top-p values for speculative decoding. This helps simplify the logic of the speculative decoding verification stage, as all probabilities have already been updated with the top-p values, so top-p no longer needs to be taken into consideration. For speculative decoding, we therefore always renormalize the probability distribution before sampling/verifying. For the non-speculative decoding mode, we keep the previous flow, which applies top-p together with sampling. Co-authored-by: Wuwei Lin <[email protected]>
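A CPU-side NumPy sketch of top-p renormalization as described: keep the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, zero the rest, and renormalize. The engine does this on GPU; this only illustrates the math.

```python
# NumPy sketch of top-p renormalization: keep the smallest set of top tokens
# whose cumulative probability reaches top_p, zero out the rest, renormalize.
import numpy as np


def renorm_top_p(probs: np.ndarray, top_p: float) -> np.ndarray:
    order = np.argsort(probs)[::-1]                  # indices, highest prob first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens to keep
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = sorted_probs[:cutoff]
    return kept / kept.sum()


probs = np.array([0.5, 0.3, 0.15, 0.05])
print(renorm_top_p(probs, top_p=0.9))                # [0.526 0.316 0.158 0.   ]
```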
This commit renames the LLMEngine to MLCEngine.
This commit returns a list of integers and adds an assert to check that the CUDA architecture string contains numbers only. Co-authored-by: msyu <[email protected]>
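A small sketch of the described behavior, with a hypothetical helper name: parse CUDA architecture strings into integers while asserting they are numeric.

```python
# Sketch of the behavior described above (helper name is hypothetical):
# parse CUDA architecture strings into integers, asserting digits only.
from typing import List


def parse_cuda_archs(archs: List[str]) -> List[int]:
    for arch in archs:
        assert arch.isdigit(), f"CUDA architecture must be numeric, got {arch!r}"
    return [int(arch) for arch in archs]


print(parse_cuda_archs(["80", "86", "89"]))   # [80, 86, 89]
```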
Take advantage of the OpenCL host pointer, which improves copy performance.
This gives a 2x speedup for the TIR-based paged attention on OpenCL Adreno.
feat: support serving for rwkv
…#2226) This PR removes the imports of functions in `cli.model_metadata` from engine_base.py. The file `cli.model_metadata` is not designed to be imported directly, and when importing functions from the file, it repeatedly reports warnings of
```
RuntimeWarning: 'mlc_llm.cli.model_metadata' found in sys.modules after import of package 'mlc_llm.cli', but prior to execution of 'mlc_llm.cli.model_metadata'; this may result in unpredictable behaviour
```
…onfig values to NOT_GIVEN (mlc-ai#2225)
* Change OpenAI protocol default value to None in JSON FFI engine
* [JSONFFIEngine] Support generation config in JSONFFIEngine. Default config values to NOT_GIVEN
This PR adds an early exit for the GPU sampler, which, prior to this commit, ran GPU kernels even when the batch size is 0. The 0 batch size case can happen when parallel generation of a request and engine preemption coexist. In this case, the GPU sampler should just synchronize and return, without running any GPU kernel.
This PR introduces the compiler pass that rewrites the normal softmax to a two-stage softmax. This is based on our finding that when the vocabulary size is large, the normal softmax cannot achieve high enough parallelism on GPU, so we partition the workload into two stages for better parallelism and performance.
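A NumPy sketch of the two-stage idea: split the vocabulary dimension into chunks, compute each chunk's max and exp-sum independently (stage 1), then combine them into the final normalization (stage 2). The real compiler pass emits GPU kernels; this only illustrates the arithmetic and that it matches the single-pass softmax.

```python
# NumPy sketch of a two-stage softmax over a large vocabulary dimension.
import numpy as np


def two_stage_softmax(logits: np.ndarray, chunk: int = 4096) -> np.ndarray:
    n = logits.shape[-1]
    chunk_max, chunk_sum = [], []
    for start in range(0, n, chunk):                 # stage 1: per-chunk stats
        part = logits[start:start + chunk]
        m = part.max()
        chunk_max.append(m)
        chunk_sum.append(np.exp(part - m).sum())
    global_max = max(chunk_max)
    denom = sum(s * np.exp(m - global_max) for m, s in zip(chunk_max, chunk_sum))
    return np.exp(logits - global_max) / denom       # stage 2: normalize


logits = np.random.randn(32000).astype(np.float32)
ref = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
print(np.allclose(two_stage_softmax(logits), ref))   # True
```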
remove model metadata step (#1)
* remove model metadata step and make minor fixes
This commit introduces the GPU top-p cutoff operator for efficient probability renormalization under top-p.
This PR supports creating EngineConfig from a JSON string, which is useful for JSONFFIEngine and its API bindings. This commit also removes the device from the EngineConfig for better clarity.
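A hedged sketch of the pattern of building a config object from a JSON string; the field names below are placeholders, not the actual EngineConfig schema.

```python
# Hypothetical sketch: construct a config object from a JSON string.
# Field names are placeholders, not the actual EngineConfig fields.
import json
from dataclasses import dataclass, fields


@dataclass
class EngineConfigSketch:
    model: str = ""
    max_num_sequence: int = 4
    speculative_mode: str = "disable"

    @classmethod
    def from_json(cls, json_str: str) -> "EngineConfigSketch":
        data = json.loads(json_str)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


cfg = EngineConfigSketch.from_json('{"model": "llama", "max_num_sequence": 8}')
print(cfg)
```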
This PR migrates JSONFFIEngine to a formal namespace. It also lists TODOs to further simplify the JSONFFIEngine.
improve Install via environment variable
This PR integrates the sampling function in FlashInfer. We integrate the one without top-p for now.
* add model lib delivery
* fix lint