Mixtral-ROCm Fanservice Edition

commit 53b5ae02cb1b533b78302422951bcfdeca6e2738 Author: YellowRoseCx <[email protected]> Date: Tue Dec 12 12:08:29 2023 -0600 mixtral fan service commit 168b1d74e26d0321e2e89358303b6c33e8d7d33e Merge: f13295b de15d4a6 Author: YellowRoseCx <[email protected]> Date: Tue Dec 12 12:00:52 2023 -0600 Merge branch 'kcpp-rocm-mixtral2' into main2 commit de15d4a632939a685ec12fa17355298542facf15 Merge: 74acc54 ea4402b Author: YellowRoseCx <[email protected]> Date: Tue Dec 12 11:45:19 2023 -0600 Merge branch 'mixtral' into kcpp-rocm-mixtral commit ea4402b Author: Georgi Gerganov <[email protected]> Date: Tue Dec 12 17:03:38 2023 +0200 test-backend-ops : add one more sum_rows test commit a51bc0c Author: Georgi Gerganov <[email protected]> Date: Tue Dec 12 15:55:42 2023 +0200 metal : fix binary ops for ne10 % 4 != 0 commit 08eb991 Author: Georgi Gerganov <[email protected]> Date: Tue Dec 12 14:14:15 2023 +0200 metal : add cpy f16 -> f32 kernel commit a742d9f Author: slaren <[email protected]> Date: Tue Dec 12 12:46:33 2023 +0100 gguf-py : bump version commit 6a419f4 Author: Georgi Gerganov <[email protected]> Date: Tue Dec 12 13:04:33 2023 +0200 convert : support safetensors format commit 74acc54 Author: Concedo <[email protected]> Date: Tue Dec 12 10:53:34 2023 +0800 Revert "Hide hipBLAS (ROCm) if CuBLAS exists - vice versa" This reverts commit 4b854d4. 
commit f1cbfab Author: slaren <[email protected]> Date: Mon Dec 11 20:02:55 2023 +0100 convert : fix style commit 7dc75e3 Author: slaren <[email protected]> Date: Mon Dec 11 20:00:28 2023 +0100 convert : use 1e6 rope_freq_base for mixtral commit 296c945 Author: slaren <[email protected]> Date: Mon Dec 11 16:53:25 2023 +0100 cuda : fix mul_mat_id with multi gpu commit 33e50f1 Author: slaren <[email protected]> Date: Mon Dec 11 12:27:48 2023 +0100 test-backend-ops : disable MOE test with thread sanitizer commit ffda94c Author: slaren <[email protected]> Date: Mon Dec 11 12:15:31 2023 +0100 test-backend-ops : simplify and disable slow tests to avoid CI timeout commit 06581f2 Author: Concedo <[email protected]> Date: Mon Dec 11 16:54:42 2023 +0800 perf endpoint lets you monitor if the embedded horde worker has issues commit fce971d Author: Concedo <[email protected]> Date: Mon Dec 11 16:17:10 2023 +0800 do not build the clblast noavx2 binary if not on windows commit 8cbaed1 Author: Georgi Gerganov <[email protected]> Date: Mon Dec 11 08:55:16 2023 +0200 llama : fix hard-coded number of experts commit 4b854d4 Author: YellowRoseCx <[email protected]> Date: Sun Dec 10 22:49:35 2023 -0600 Hide hipBLAS (ROCm) if CuBLAS exists - vice versa commit b002981 Author: slaren <[email protected]> Date: Mon Dec 11 02:43:52 2023 +0100 test-backend-ops : fix dequantize block offset commit f1380d7 Author: slaren <[email protected]> Date: Sun Dec 10 22:58:31 2023 +0100 test-backend-ops : add cpy from f32 -> all types test commit 54d254b Author: slaren <[email protected]> Date: Sun Dec 10 21:52:11 2023 +0100 test-backend-ops : cleanup, add moe test for batches commit e2cf3b7 Author: henk717 <[email protected]> Date: Sun Dec 10 14:30:17 2023 +0100 koboldcpp.sh - The Mamba Multitool (LostRuins#554) * .sh script V1 * koboldcpp.sh polish * koboldcpp.sh dist generator * Include html's in dist * RWKV in Linux Dist * Lower dependency requirements * Eliminate wget dependency * More distinct 
binary name I know its technically amd64, but I don't want to cause confusion among nvidia users. * Use System OpenCL Unsure how this will behave in the pyinstaller build, but pocl ended up CPU only. With a bit of luck the pyinstaller uses the one from the actual system if compiled in a system without opencl, while conda now includes it for that specific system. * Add cblas dependency Missing this causes compile failures on some system's * ICD workaround Ideally we find a better solution, but conda forces ICD and needs this for the successful compile. However, pyinstaller then embeds the ICD causing it to be limited to the system it was compiled for. By temporarily removing the ICD pyinstaller can't find it and everything remains functional. Ideally we do this on a pyinstaller level, but I could not find any good options to do so yet. --------- Co-authored-by: root <root@DESKTOP-DQ1QRAG> commit 54ba263 Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 15:27:41 2023 +0200 test-backend-ops : make experts more evenly probable (test_moe) commit b0b83dd Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 14:30:38 2023 +0200 metal : fix ggml_mul_mat_id for F32 commit 65923a8 Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 14:17:46 2023 +0200 convert : determine n_ctx correctly commit 8614aa7 Author: slaren <[email protected]> Date: Sun Dec 10 13:12:11 2023 +0100 cuda : fix get_rows when ncols is odd commit cefebb3 Author: slaren <[email protected]> Date: Sun Dec 10 13:11:39 2023 +0100 test-backend-ops : add moe test commit e640cbe Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 13:57:54 2023 +0200 llama : add n_expert and n_expert_used to hparams + change quants commit d1259b7 Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 13:00:13 2023 +0200 llama : do not quantize expert gating tensors commit 6cfb31f Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 10:59:13 2023 +0200 metal : add indirect mat-vec 
kernels for all quantization types commit 016f9bb Author: Georgi Gerganov <[email protected]> Date: Sun Dec 10 09:38:21 2023 +0200 metal : fix ggml_get_rows to work with non-cont src1 commit 0710b0f Author: slaren <[email protected]> Date: Sat Dec 9 23:29:47 2023 +0100 llama : offload missing ffn_moe_silu commit 62b95f9 Author: slaren <[email protected]> Date: Sat Dec 9 22:39:34 2023 +0100 cuda : support non-contiguous src1 in get_rows commit 2e4db48 Author: slaren <[email protected]> Date: Sat Dec 9 22:38:22 2023 +0100 ggml : update get_rows f16 and q commit ac3f7d8 Author: slaren <[email protected]> Date: Sat Dec 9 19:19:03 2023 +0100 ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D commit 8c5b66e Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 15:30:34 2023 +0200 metal : reduce the kernel launches for ggml_mul_mat_id commit 7e2006b Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 14:24:58 2023 +0200 metal : add/mul/div use general kernel when src1 not cont commit 06dfde3 Author: slaren <[email protected]> Date: Sat Dec 9 13:21:09 2023 +0100 llama : add basic support for offloading moe with CUDA commit 2cbcba8 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 14:18:42 2023 +0200 metal : add more general support for ggml_get_rows + tests commit 9064b1c Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 14:04:54 2023 +0200 ggml : fix ggml_get_rows to take into account ne02 / ne11 commit ee8fb39 Author: slaren <[email protected]> Date: Sat Dec 9 12:42:25 2023 +0100 ggml : add n_as argument to ggml_mul_mat_id commit 7372b62 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 13:18:58 2023 +0200 ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only) commit 8b185b7 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 13:01:42 2023 +0200 llama : fix expert weighting in the FFN commit 7ea3695 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 12:45:15 2023 
+0200 llama : first working version commit af1a096 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 12:07:39 2023 +0200 llama : fix cur -> cur_expert commit aedfad1 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 11:47:40 2023 +0200 llama : update graph to support MoE commit 861cd67 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 11:19:46 2023 +0200 ggml : sync latest ggml_mul_mat_id commit a3eefe9 Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 11:14:03 2023 +0200 llama : model loading commit d38e41e Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 10:59:37 2023 +0200 convert : fix n_ff typo commit dff8cbe Author: Georgi Gerganov <[email protected]> Date: Sat Dec 9 10:51:58 2023 +0200 convert : support Mixtral as LLAMA arch commit 7a69152 Author: Concedo <[email protected]> Date: Fri Dec 8 21:06:32 2023 +0800 lowvram var defaults commit 7418bca Author: Concedo <[email protected]> Date: Fri Dec 8 19:20:30 2023 +0800 up ver commit c47bc28 Author: Concedo <[email protected]> Date: Fri Dec 8 18:35:45 2023 +0800 slight refactor for noscript ui commit 7469f20 Author: Concedo <[email protected]> Date: Fri Dec 8 18:16:14 2023 +0800 use lowvram flag for offload qkv commit ec21fa7 Merge: 930cdfb fe680e3 Author: Concedo <[email protected]> Date: Fri Dec 8 17:42:26 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # .github/workflows/build.yml # .gitignore # CMakeLists.txt # Makefile # Package.swift # README.md # ggml-cuda.cu # llama.cpp # llama.h # scripts/sync-ggml.sh # tests/CMakeLists.txt commit 930cdfb Author: Concedo <[email protected]> Date: Fri Dec 8 16:53:30 2023 +0800 updated lite, added patch that links to noscript mode commit fe680e3 Author: Georgi Gerganov <[email protected]> Date: Thu Dec 7 22:26:54 2023 +0200 sync : ggml (new ops, tests, backend, etc.) 
(ggerganov#4359) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit 09e35d0. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <[email protected]> commit bcc0eb4 Author: Georgi Gerganov <[email protected]> Date: Thu Dec 7 13:03:17 2023 +0200 llama : per-layer KV cache + quantum K cache (ggerganov#4309) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (ggerganov#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment 
* llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <[email protected]> * readme : add API change notice --------- Co-authored-by: slaren <[email protected]> commit 81bc921 Author: Hongyu Ouyang <[email protected]> Date: Thu Dec 7 02:25:22 2023 -0800 train : fix ggerganov#4227 (double free in examples/train-text-from-scratch/train-text-from-scratch.cpp) (ggerganov#4351) On commit b1108 (44c117f) xaedes added ggml_allocr * alloc = NULL; ... (many lines in between) if (alloc) { ggml_allocr_free(alloc); } Which is correct, but it's easy to lose context after many lines in between. On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly. alloc = ggml_allocr_new(...) ... (short lines of code) ggml_allocr_free(alloc) This happens a few times, but alloc is never set to NULL, and many lines below, we still have if (alloc) { ggml_allocr_free(alloc); } which causes a double-free. commit 05cd6e5 Author: Georgi Gerganov <[email protected]> Date: Wed Dec 6 20:21:59 2023 +0200 server : recognize cache_prompt parameter in OAI API (ggerganov#4347) commit c751152 Author: Concedo <[email protected]> Date: Thu Dec 7 00:52:25 2023 +0800 noscript mode is done commit 12002d8 Author: Concedo <[email protected]> Date: Wed Dec 6 17:51:08 2023 +0800 very basic noscript mode commit caa9249 Author: Georgi Gerganov <[email protected]> Date: Wed Dec 6 10:41:03 2023 +0200 common : fix compile warning commit da5eaef Author: stduhpf <[email protected]> Date: Wed Dec 6 09:08:17 2023 +0100 speculative : support `--color` (ggerganov#4343) * speculative: add some colors * minor : add braces --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 5f6e0c0 Author: Marcus Dunn <[email protected]> Date: Tue Dec 5 10:55:12 2023 -1000 grammar : pre-computed pieces + reserve mem + less string copies (ggerganov#4330) * reserve space for codepoints * improvement for the appended 0 * used precomputed token text for grammar sample * reserve 
canidates_decoded * reserve canidates_grammar * remove candidates_decoded * Revert "remove candidates_decoded" This reverts commit 3773328. * changed decode_utf8 to take src by ref commit 5aa365d Author: Kerfuffle <[email protected]> Date: Tue Dec 5 10:19:18 2023 -0700 llama : allow overriding GGUF metadata when loading model (ggerganov#4092) * feat: Allow overriding GGUF metadata when loading model * Fix the one time GCC is stricter than clang about something * Step1 * Refactor... basically everything! * Nuke obsolete GetArrayLen struct * simplify std::string specialization * Various cleanups Add informational output when overrides are applied Warn user when an override with the wrong type is specified * Fix broken logic for parsing bool KV overrides Fix issue where overrides didn't apply when key missing in GGUF metadata Resolve merge changes * llama : rearrange model params * Update new GET_KEY call Add note that metadata KV overrides aren't reflected in initial metadata KV info dump --------- Co-authored-by: cebtenzzre <[email protected]> Co-authored-by: Georgi Gerganov <[email protected]> commit b6f952f Author: Concedo <[email protected]> Date: Tue Dec 5 21:08:10 2023 +0800 improved exit logic commit 52c8bc3 Author: MaggotHATE <[email protected]> Date: Tue Dec 5 15:05:51 2023 +0500 sampling : custom samplers order (ggerganov#4285) * Samplers sequence order w parameter * Cleaned commented code * Fixed formatting * Rewrote with unordered_map * Revert and rewrite, too many problems and safeguards would be needed * Fixed code style * Code style fixes according to review * More readable samplers input string, fixed help * Style fix in sampler_queue * Formatting fixes * Fixing whitespaces commit e4b76bb Author: kchro3 <[email protected]> Date: Mon Dec 4 23:29:46 2023 -0800 swift : revert compiler checks for swift package (ggerganov#4332) commit 23b5e12 Author: Daniel Bevenius <[email protected]> Date: Mon Dec 4 17:04:21 2023 +0100 simple : update error message for 
KV cache check (ggerganov#4324) This commit updates the error message that is printed when the KV cache is not big enough to hold all the prompt and generated tokens. Specifically it removes the reference to n_parallel and replaces it with n_len. Signed-off-by: Daniel Bevenius <[email protected]> commit d208995 Author: Miwa / Ensan <[email protected]> Date: Tue Dec 5 01:03:49 2023 +0900 swift : fix concatenation method to avoid invalid UTF8 stringfication (ggerganov#4325) commit 5c9f90c Author: Miwa / Ensan <[email protected]> Date: Mon Dec 4 22:43:45 2023 +0900 swift : fix prompt tokenization logic (ggerganov#4321) commit a5a5839 Author: Concedo <[email protected]> Date: Mon Dec 4 21:10:42 2023 +0800 handle accidentally selecting a kcpps file as model instead commit 4fa44e8 Author: Ikko Eltociear Ashimine <[email protected]> Date: Mon Dec 4 16:57:35 2023 +0900 grammar-parser : fix typo (ggerganov#4318) preceeding -> preceding commit 8602f5a Merge: ac36aee fbbc428 Author: Concedo <[email protected]> Date: Sun Dec 3 22:00:14 2023 +0800 Merge branch 'master' into concedo_experimental commit fbbc428 Author: Georgi Gerganov <[email protected]> Date: Sun Dec 3 15:56:35 2023 +0200 ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() (ggerganov#4308) * ggml : fix soft max out-of-bounds access ggml-ci * ggml : reuse ggml_get_n_tasks() in ggml_graph_plan() ggml-ci commit ac36aee Merge: 48544cd 33e171d Author: Concedo <[email protected]> Date: Sun Dec 3 21:56:29 2023 +0800 Merge branch 'master' into concedo_experimental # Conflicts: # CMakeLists.txt # Makefile commit adf3de4 Author: Georgi Gerganov <[email protected]> Date: Sun Dec 3 15:56:22 2023 +0200 ggml : fix soft max out-of-bounds access (ggerganov#4307) ggml-ci commit 48544cd Author: Concedo <[email protected]> Date: Sun Dec 3 21:46:50 2023 +0800 Revert "Revert "ggml : add ggml_soft_max_ext (ggerganov#4256)"" This reverts commit a8e66ef. 
commit 33e171d Author: Ed Lee <[email protected]> Date: Sun Dec 3 01:10:43 2023 -0800 server : fix OpenAI API `stop` field to be optional (ggerganov#4299) (cherry picked from commit Mozilla-Ocho/llamafile@e8c92bc) commit 6949b50 Author: Rickard Edén <[email protected]> Date: Sun Dec 3 10:03:25 2023 +0100 py : add grammar to oai like api (ggerganov#4294) commit d7b800b Author: Georgi Gerganov <[email protected]> Date: Sun Dec 3 10:58:16 2023 +0200 llama : pad KV cache size (ggerganov#4280) * llama : pad KV cache size to 32 * metal : try to improve batched decoding commit 6570a20 Author: Concedo <[email protected]> Date: Sun Dec 3 15:44:53 2023 +0800 token count includes ids commit 5a7d312 Author: Georgi Gerganov <[email protected]> Date: Fri Dec 1 20:39:12 2023 +0200 llama : avoid using "optional" keyword (ggerganov#4283) commit d5a1cbd Author: Georgi Gerganov <[email protected]> Date: Fri Dec 1 20:35:03 2023 +0200 llama : support optional tensors (ggerganov#4283) commit b220222 Author: Miwa / Ensan <[email protected]> Date: Sat Dec 2 03:19:45 2023 +0900 swift : fix token_to_piece implementation (ggerganov#4278) * Fix token_to_piece implementation in Swift * Fix errors commit 511f52c Author: Jared Van Bortel <[email protected]> Date: Fri Dec 1 13:18:35 2023 -0500 build : enable libstdc++ assertions for debug builds (ggerganov#4275) commit 03562f3 Author: CausalLM <[email protected]> Date: Sat Dec 2 02:17:06 2023 +0800 llama : support attention bias on LLaMA architecture (ggerganov#4283) * Support attention_bias on LLaMA architecture QKVO bias, should fix InternLM (ggerganov#3133) and works for LLaMAfied Qwen models (ggerganov#3743 (comment)). * check existence of qkvo bias while loading llama models Tested on LLaMA2, CUDA and CPU. 
* Update llama.cpp commit 37c746d Author: Shijie <[email protected]> Date: Sat Dec 2 02:16:31 2023 +0800 llama : add Qwen support (ggerganov#4281) * enable qwen to llama.cpp * llama : do not GPU split bias tensors --------- Co-authored-by: Georgi Gerganov <[email protected]> commit 880f579 Author: Georgi Gerganov <[email protected]> Date: Fri Dec 1 18:42:11 2023 +0200 llama : fix integer overflow during quantization (ggerganov#4284) happens with multi-threaded quantization of Qwen-72B ggml-ci
YellowRoseCx · Dec 12, 2023 · b1cf642
1 parent f13295b
commit b1cf642
Showing 50 changed files with 9,147 additions and 2,017 deletions.
diff --git a/.gitignore b/.gitignore
@@ -81,6 +81,7 @@ poetry.toml
 
 # Test binaries
 tests/test-grammar-parser
+/tests/test-llama-grammar
 tests/test-double-float
 tests/test-grad0
 tests/test-opt
@@ -92,6 +93,8 @@ tests/test-tokenizer-0-llama
 tests/test-tokenizer-0-falcon
 tests/test-tokenizer-1-llama
 tests/test-tokenizer-1-bpe
+/tests/test-rope
+/tests/test-backend-ops
 
 /koboldcpp_default.so
 /koboldcpp_failsafe.so
@@ -115,5 +118,8 @@ hipblas.dll
 koboldcpp_hipblas.so
 koboldcpp_hipblas.dll
 
+bin/
+conda/
+
 # Jetbrains idea folder
 .idea/
diff --git a/Makefile b/Makefile
@@ -480,8 +480,13 @@ endif
 ifdef CLBLAST_BUILD
 koboldcpp_clblast: ggml_clblast.o ggml_v2_clblast.o ggml_v1.o expose.o common.o gpttype_adapter_clblast.o ggml-opencl.o ggml_v2-opencl.o ggml_v2-opencl-legacy.o ggml-quants.o ggml-alloc.o ggml-backend.o grammar-parser.o $(OBJS)
 	$(CLBLAST_BUILD)
+ifdef NOAVX2_BUILD
+koboldcpp_clblast_noavx2: ggml_clblast_noavx2.o ggml_v2_clblast_noavx2.o ggml_v1_failsafe.o expose.o common.o gpttype_adapter_clblast_noavx2.o ggml-opencl.o ggml_v2-opencl.o ggml_v2-opencl-legacy.o ggml-quants_noavx2.o ggml-alloc.o ggml-backend.o grammar-parser.o $(OBJS)
+	$(CLBLAST_BUILD)
+else
 koboldcpp_clblast_noavx2:
 	$(DONOTHING)
+endif
 else
 koboldcpp_clblast:
 	$(DONOTHING)

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# koboldcpp-ROCM for AMD
+# <center>koboldcpp-ROCM MIXTRAL FanService Edition for AMD</center>
 Quick Linux install:              
 To install, either use the file "[easy_KCPP-ROCm_install.sh](https://github.com/YellowRoseCx/koboldcpp-rocm/blob/main/easy_KCPP-ROCm_install.sh)" or navigate to the folder you want to download to in Terminal then run
 ```        

diff --git a/common/common.cpp b/common/common.cpp
@@ -279,8 +279,18 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
                 break;
             }
             params.yarn_beta_slow = std::stof(argv[i]);
-        } else if (arg == "--memory-f32") {
-            params.memory_f16 = false;
+        } else if (arg == "--samplers") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            sparams.samplers_sequence = parse_samplers_input(argv[i]);
+        } else if (arg == "--sampling-seq") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            sparams.samplers_sequence = argv[i];
         } else if (arg == "--top-p") {
             if (++i >= argc) {
                 invalid_param = true;
@@ -499,6 +509,12 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
             params.infill = true;
         } else if (arg == "-dkvc" || arg == "--dump-kv-cache") {
             params.dump_kv_cache = true;
+        } else if (arg == "-nkvo" || arg == "--no-kv-offload") {
+            params.no_kv_offload = true;
+        } else if (arg == "-ctk" || arg == "--cache-type-k") {
+            params.cache_type_k = argv[++i];
+        } else if (arg == "-ctv" || arg == "--cache-type-v") {
+            params.cache_type_v = argv[++i];
         } else if (arg == "--multiline-input") {
             params.multiline_input = true;
         } else if (arg == "--simple-io") {
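Review note on the hunk above: the `--samplers`, `--sampling-seq`, and `--override-kv` branches all guard with `if (++i >= argc)`, but the `-ctk` and `-ctv` branches read `argv[++i]` directly. If either flag is the last argument, `argv[i]` is the null pointer that terminates `argv`, and assigning it to a `std::string` is undefined behavior. A minimal sketch of a guarded read follows; `next_arg_value` is a hypothetical helper, not something this commit defines.

```cpp
#include <string>

// Hypothetical helper (not part of the commit): advance to the value that
// follows a flag and copy it into `out`. Returns false when the flag was the
// last argument, in which case the caller would set invalid_param and break.
static bool next_arg_value(int argc, char ** argv, int & i, std::string & out) {
    if (++i >= argc) {
        return false; // no value follows the flag
    }
    out = argv[i];
    return true;
}
```

With such a helper, the `-ctk` branch could fall back to the same `invalid_param = true; break;` path the neighbouring options already use.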
@@ -679,6 +695,47 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
                 std::istreambuf_iterator<char>(),
                 std::back_inserter(sparams.grammar)
             );
+        } else if (arg == "--override-kv") {
+            if (++i >= argc) {
+                invalid_param = true;
+                break;
+            }
+            char * sep = strchr(argv[i], '=');
+            if (sep == nullptr || sep - argv[i] >= 128) {
+                fprintf(stderr, "error: Malformed KV override: %s\n", argv[i]);
+                invalid_param = true;
+                break;
+            }
+            struct llama_model_kv_override kvo;
+            std::strncpy(kvo.key, argv[i], sep - argv[i]);
+            kvo.key[sep - argv[i]] = 0;
+            sep++;
+            if (strncmp(sep, "int:", 4) == 0) {
+                sep += 4;
+                kvo.tag = LLAMA_KV_OVERRIDE_INT;
+                kvo.int_value = std::atol(sep);
+            } else if (strncmp(sep, "float:", 6) == 0) {
+                sep += 6;
+                kvo.tag = LLAMA_KV_OVERRIDE_FLOAT;
+                kvo.float_value = std::atof(sep);
+            } else if (strncmp(sep, "bool:", 5) == 0) {
+                sep += 5;
+                kvo.tag = LLAMA_KV_OVERRIDE_BOOL;
+                if (std::strcmp(sep, "true") == 0) {
+                    kvo.bool_value = true;
+                } else if (std::strcmp(sep, "false") == 0) {
+                    kvo.bool_value = false;
+                } else {
+                    fprintf(stderr, "error: Invalid boolean value for KV override: %s\n", argv[i]);
+                    invalid_param = true;
+                    break;
+                }
+            } else {
+                fprintf(stderr, "error: Invalid type for KV override: %s\n", argv[i]);
+                invalid_param = true;
+                break;
+            }
+            params.kv_overrides.push_back(kvo);
 #ifndef LOG_DISABLE_LOGS
         // Parse args for logging parameters
         } else if ( log_param_single_parse( argv[i] ) ) {
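For readers who want to try the `KEY=TYPE:VALUE` format in isolation, the parsing steps in the hunk above can be sketched as a standalone function. This is an illustration only: `kv_override_sketch`, `kv_tag`, and `parse_kv_override` are hypothetical names, and the struct is a stand-in because the layout of the real `llama_model_kv_override` is not shown in this part of the diff.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for llama_model_kv_override.
enum class kv_tag { INT, FLOAT, BOOL };

struct kv_override_sketch {
    std::string key;
    kv_tag      tag         = kv_tag::INT;
    long        int_value   = 0;
    double      float_value = 0.0;
    bool        bool_value  = false;
};

// Parses "KEY=TYPE:VALUE" with TYPE in {int, float, bool}.
// Returns false on malformed input, mirroring the error paths in the diff.
static bool parse_kv_override(const char * arg, kv_override_sketch & out) {
    const char * sep = std::strchr(arg, '=');
    if (sep == nullptr || sep - arg >= 128) {
        return false; // no '=' or key too long
    }
    out.key.assign(arg, static_cast<std::size_t>(sep - arg));
    sep++; // skip '='
    if (std::strncmp(sep, "int:", 4) == 0) {
        out.tag       = kv_tag::INT;
        out.int_value = std::atol(sep + 4);
    } else if (std::strncmp(sep, "float:", 6) == 0) {
        out.tag         = kv_tag::FLOAT;
        out.float_value = std::atof(sep + 6);
    } else if (std::strncmp(sep, "bool:", 5) == 0) {
        sep += 5;
        out.tag = kv_tag::BOOL;
        if (std::strcmp(sep, "true") == 0) {
            out.bool_value = true;
        } else if (std::strcmp(sep, "false") == 0) {
            out.bool_value = false;
        } else {
            return false; // only the literals "true" and "false" are accepted
        }
    } else {
        return false; // unknown type prefix
    }
    return true;
}
```

Note that, as in the diff, `std::atol` and `std::atof` do not report conversion errors, so a value such as `int:abc` would silently parse as 0.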
@@ -722,6 +779,11 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
         }
     }
 
+    if (!params.kv_overrides.empty()) {
+        params.kv_overrides.emplace_back(llama_model_kv_override());
+        params.kv_overrides.back().key[0] = 0;
+    }
+
     return true;
 }
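The hunk above appends an entry with an empty key so that the override list can be handed over as a raw pointer without a separate length. A rough sketch of that sentinel pattern is below. `override_entry`, `terminate_overrides`, and `count_overrides` are hypothetical, the 128-byte key buffer is an assumption inferred from the `>= 128` check in the parser, and how the library actually consumes the array is not shown in this part of the diff.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical entry type; the 128-byte key is an assumption.
struct override_entry {
    char key[128];
    long value;
};

// Append the empty-key terminator, but only when there is at least one entry,
// as the diff does for params.kv_overrides.
static void terminate_overrides(std::vector<override_entry> & v) {
    if (!v.empty()) {
        v.emplace_back(override_entry());
        v.back().key[0] = 0;
    }
}

// A consumer can then walk the raw pointer until it reaches the empty key.
static std::size_t count_overrides(const override_entry * p) {
    if (p == nullptr) {
        return 0; // no overrides were supplied
    }
    std::size_t n = 0;
    while (p[n].key[0] != 0) {
        n++;
    }
    return n;
}
```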
 
@@ -762,6 +824,8 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     printf("  -n N, --n-predict N   number of tokens to predict (default: %d, -1 = infinity, -2 = until context filled)\n", params.n_predict);
     printf("  -c N, --ctx-size N    size of the prompt context (default: %d, 0 = loaded from model)\n", params.n_ctx);
     printf("  -b N, --batch-size N  batch size for prompt processing (default: %d)\n", params.n_batch);
+    printf("  --samplers            samplers that will be used for generation in the order, separated by \';\', for example: \"top_k;tfs;typical;top_p;min_p;temp\"\n");
+    printf("  --sampling-seq        simplified sequence for samplers that will be used (default: %s)\n", sparams.samplers_sequence.c_str());
     printf("  --top-k N             top-k sampling (default: %d, 0 = disabled)\n", sparams.top_k);
     printf("  --top-p N             top-p sampling (default: %.1f, 1.0 = disabled)\n", (double)sparams.top_p);
     printf("  --min-p N             min-p sampling (default: %.1f, 0.0 = disabled)\n", (double)sparams.min_p);
@@ -799,8 +863,6 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     printf("  --yarn-beta-fast N    YaRN: low correction dim or beta (default: %.1f)\n", params.yarn_beta_fast);
     printf("  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)\n");
     printf("  --no-penalize-nl      do not penalize newline token\n");
-    printf("  --memory-f32          use f32 instead of f16 for memory key+value (default: disabled)\n");
-    printf("                        not recommended: doubles context memory required and no measurable increase in quality\n");
     printf("  --temp N              temperature (default: %.1f)\n", (double)sparams.temp);
     printf("  --logits-all          return logits for all tokens in the batch (default: disabled)\n");
     printf("  --hellaswag           compute HellaSwag score over random tasks from datafile supplied with -f\n");
@@ -841,6 +903,12 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     printf("  --verbose-prompt      print prompt before generation\n");
     printf("  -dkvc, --dump-kv-cache\n");
     printf("                        verbose print of the KV cache\n");
+    printf("  -nkvo, --no-kv-offload\n");
+    printf("                        disable KV offload\n");
+    printf("  -ctk TYPE, --cache-type-k TYPE\n");
+    printf("                        KV cache data type for K (default: %s)\n", params.cache_type_k.c_str());
+    printf("  -ctv TYPE, --cache-type-v TYPE\n");
+    printf("                        KV cache data type for V (default: %s)\n", params.cache_type_v.c_str());
     printf("  --simple-io           use basic IO for better compatibility in subprocesses and limited consoles\n");
     printf("  --lora FNAME          apply LoRA adapter (implies --no-mmap)\n");
     printf("  --lora-scaled FNAME S apply LoRA adapter with user defined scaling S (implies --no-mmap)\n");
@@ -851,6 +919,9 @@ void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
     printf("                        draft model for speculative decoding (default: %s)\n", params.model.c_str());
     printf("  -ld LOGDIR, --logdir LOGDIR\n");
     printf("                        path under which to save YAML logs (no logging if unset)\n");
+    printf("  --override-kv KEY=TYPE:VALUE\n");
+    printf("                        advanced option to override model metadata by key. may be specified multiple times.\n");
+    printf("                        types: int, float, bool. example: --override-kv tokenizer.ggml.add_bos_token=bool:false\n");
     printf("\n");
 #ifndef LOG_DISABLE_LOGS
     log_print_usage();
@@ -887,6 +958,48 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
     GGML_UNREACHABLE();
 }
 
+//
+// String parsing
+//
+
+std::string parse_samplers_input(std::string input) {
+    std::string output = "";
+    // since samplers names are written multiple ways
+    // make it ready for both system names and input names
+    std::unordered_map<std::string, char> samplers_symbols {
+        {"top_k",      'k'},
+        {"top-k",      'k'},
+        {"top_p",      'p'},
+        {"top-p",      'p'},
+        {"nucleus",    'p'},
+        {"typical_p",  'y'},
+        {"typical-p",  'y'},
+        {"typical",    'y'},
+        {"min_p",      'm'},
+        {"min-p",      'm'},
+        {"tfs_z",      'f'},
+        {"tfs-z",      'f'},
+        {"tfs",        'f'},
+        {"temp",       't'},
+        {"temperature",'t'}
+    };
+    // expected format example: "temp;top_k;tfs_z;typical_p;top_p;min_p"
+    size_t separator = input.find(';');
+    while (separator != input.npos) {
+        std::string name = input.substr(0,separator);
+        input = input.substr(separator+1);
+        separator = input.find(';');
+
+        if (samplers_symbols.find(name) != samplers_symbols.end()) {
+            output += samplers_symbols[name];
+        }
+    }
+    if (samplers_symbols.find(input) != samplers_symbols.end()) {
+        output += samplers_symbols[input];
+    }
+    return output;
+}
+
 //
 // Model utils
 //
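To make the behaviour of `parse_samplers_input` above concrete: each recognised name in the `;`-separated list becomes one character of the sampler sequence, and unrecognised names are dropped without an error. The sketch below reproduces that mapping with a reduced alias table and uses `std::getline` for splitting instead of the commit's `find`/`substr` loop; `samplers_to_sequence` is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Illustration only: a reduced alias table; the commit accepts more spellings
// (for example "top-k", "nucleus", "temperature").
static std::string samplers_to_sequence(const std::string & input) {
    static const std::unordered_map<std::string, char> symbols = {
        {"top_k", 'k'}, {"top_p", 'p'}, {"typical", 'y'},
        {"min_p", 'm'}, {"tfs",   'f'}, {"temp",    't'},
    };
    std::string output;
    std::istringstream names(input);
    std::string name;
    // Split on ';' and translate each known name to its one-letter symbol.
    while (std::getline(names, name, ';')) {
        auto it = symbols.find(name);
        if (it != symbols.end()) {
            output += it->second; // unknown names are skipped silently
        }
    }
    return output;
}
```

Given the help-text example `top_k;tfs;typical;top_p;min_p;temp`, this yields `kfypmt`, which is the form `--sampling-seq` takes directly.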
@@ -901,10 +1014,39 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
     mparams.tensor_split    = params.tensor_split;
     mparams.use_mmap        = params.use_mmap;
     mparams.use_mlock       = params.use_mlock;
+    if (params.kv_overrides.empty()) {
+        mparams.kv_overrides = NULL;
+    } else {
+        GGML_ASSERT(params.kv_overrides.back().key[0] == 0 && "KV overrides not terminated with empty key");
+        mparams.kv_overrides = params.kv_overrides.data();
+    }
 
     return mparams;
 }
 
+static ggml_type kv_cache_type_from_str(const std::string & s) {
+    if (s == "f16") {
+        return GGML_TYPE_F16;
+    }
+    if (s == "q8_0") {
+        return GGML_TYPE_Q8_0;
+    }
+    if (s == "q4_0") {
+        return GGML_TYPE_Q4_0;
+    }
+    if (s == "q4_1") {
+        return GGML_TYPE_Q4_1;
+    }
+    if (s == "q5_0") {
+        return GGML_TYPE_Q5_0;
+    }
+    if (s == "q5_1") {
+        return GGML_TYPE_Q5_1;
+    }
+
+    throw std::runtime_error("Invalid cache type: " + s);
+}
+
 struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params) {
     auto cparams = llama_context_default_params();
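Review note on `kv_cache_type_from_str` above: it accepts exactly six strings (`f16`, `q8_0`, `q4_0`, `q4_1`, `q5_0`, `q5_1`) and throws `std::runtime_error` for anything else, so a value such as `f32` is rejected even though this commit removes `--memory-f32`. The same lookup can be sketched as a table. `cache_type` is a hypothetical stand-in for the `ggml_type` values so that the snippet compiles without ggml.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the ggml_type values named in the diff.
enum class cache_type { F16, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1 };

// Table-driven variant of the if-chain; same accepted strings, same error text.
static cache_type cache_type_from_str(const std::string & s) {
    static const std::unordered_map<std::string, cache_type> table = {
        {"f16",  cache_type::F16},  {"q8_0", cache_type::Q8_0},
        {"q4_0", cache_type::Q4_0}, {"q4_1", cache_type::Q4_1},
        {"q5_0", cache_type::Q5_0}, {"q5_1", cache_type::Q5_1},
    };
    auto it = table.find(s);
    if (it == table.end()) {
        throw std::runtime_error("Invalid cache type: " + s);
    }
    return it->second;
}
```

Because the exception is raised while the context parameters are being built, a caller that passes user-supplied `-ctk`/`-ctv` values may want to catch `std::runtime_error` around that call and print the message.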
 
@@ -914,7 +1056,6 @@ struct llama_context_params llama_context_params_from_gpt_params(const gpt_param
     cparams.n_threads_batch   = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
     cparams.mul_mat_q         = params.mul_mat_q;
     cparams.seed              = params.seed;
-    cparams.f16_kv            = params.memory_f16;
     cparams.logits_all        = params.logits_all;
     cparams.embedding         = params.embedding;
     cparams.rope_scaling_type = params.rope_scaling_type;
@@ -925,6 +1066,10 @@ struct llama_context_params llama_context_params_from_gpt_params(const gpt_param
     cparams.yarn_beta_fast    = params.yarn_beta_fast;
     cparams.yarn_beta_slow    = params.yarn_beta_slow;
     cparams.yarn_orig_ctx     = params.yarn_orig_ctx;
+    cparams.offload_kqv       = !params.no_kv_offload;
+
+    cparams.type_k = kv_cache_type_from_str(params.cache_type_k);
+    cparams.type_v = kv_cache_type_from_str(params.cache_type_v);
 
     return cparams;
 }
@@ -1337,7 +1482,6 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
     }
     fprintf(stream, "lora_base: %s\n", params.lora_base.c_str());
     fprintf(stream, "main_gpu: %d # default: 0\n", params.main_gpu);
-    fprintf(stream, "memory_f32: %s # default: false\n", !params.memory_f16 ? "true" : "false");
     fprintf(stream, "mirostat: %d # default: 0 (disabled)\n", sparams.mirostat);
     fprintf(stream, "mirostat_ent: %f # default: 5.0\n", sparams.mirostat_tau);
     fprintf(stream, "mirostat_lr: %f # default: 0.1\n", sparams.mirostat_eta);
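
Note on parse_samplers_input (common/common.cpp, above): the function splits a ';'-separated list of sampler names and appends a short symbol for each recognized name, silently skipping unknown ones. The standalone sketch below shows the same idea without mutating the input string. The table `k_sampler_symbols` and its one-character codes are assumptions for illustration, since the real `samplers_symbols` table is defined outside this hunk and may differ.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical name -> symbol table (assumption; the real samplers_symbols
// table is not visible in this hunk).
static const std::unordered_map<std::string, char> k_sampler_symbols = {
    {"top_k", 'k'}, {"tfs_z", 'f'}, {"typical_p", 'y'},
    {"top_p", 'p'}, {"min_p", 'm'}, {"temp", 't'},
};

// Split `input` on ';' and concatenate the symbols of all recognized names,
// preserving their order. Unknown or empty names are ignored.
std::string parse_samplers_sketch(const std::string & input) {
    std::string output;
    size_t start = 0;
    while (start <= input.size()) {
        size_t end = input.find(';', start);
        if (end == std::string::npos) {
            end = input.size(); // last field runs to the end of the string
        }
        const auto it = k_sampler_symbols.find(input.substr(start, end - start));
        if (it != k_sampler_symbols.end()) {
            output += it->second;
        }
        start = end + 1;
    }
    return output;
}
```

Using `find` on a const map avoids the `operator[]` lookup in the patch, which would insert a default entry if it were ever called with a missing key; the patch guards against that with an explicit `find` first, so both forms behave the same for valid input.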

diff --git a/common/common.h b/common/common.h
@@ -94,6 +94,8 @@ struct gpt_params {
     std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
     std::string logdir            = "";  // directory in which to save YAML log files
 
+    std::vector<llama_model_kv_override> kv_overrides;
+
     // TODO: avoid tuple, use struct
     std::vector<std::tuple<std::string, float>> lora_adapter; // lora adapter path with user defined scale
     std::string lora_base  = "";                              // base model path for the lora adapter
@@ -106,7 +108,6 @@ struct gpt_params {
     size_t hellaswag_tasks = 400;   // number of tasks to use when computing the HellaSwag score
 
     bool mul_mat_q         = true;  // if true, use mul_mat_q kernels instead of cuBLAS
-    bool memory_f16        = true;  // use f16 instead of f32 for memory kv
     bool random_prompt     = false; // do not randomize prompt if none provided
     bool use_color         = false; // use color to distinguish generations and inputs
     bool interactive       = false; // interactive mode
@@ -131,10 +132,14 @@ struct gpt_params {
     bool verbose_prompt    = false; // print prompt tokens before generation
     bool infill            = false; // use infill mode
     bool dump_kv_cache     = false; // dump the KV cache contents for debugging purposes
+    bool no_kv_offload     = false; // disable KV offloading
+
+    std::string cache_type_k = "f16"; // KV cache data type for the K
+    std::string cache_type_v = "f16"; // KV cache data type for the V
 
     // multimodal models (see examples/llava)
     std::string mmproj = ""; // path to multimodal projector
-    std::string image = ""; // path to an image file
+    std::string image  = ""; // path to an image file
 };
 
 bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params);
@@ -149,6 +154,12 @@ std::string gpt_random_prompt(std::mt19937 & rng);
 
 void process_escapes(std::string& input);
 
+//
+// String parsing
+//
+
+std::string parse_samplers_input(std::string input);
+
 //
 // Model utils
 //
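
Note on cache_type_k / cache_type_v (common/common.h, above): both fields are plain strings defaulting to "f16", and common.cpp converts them with kv_cache_type_from_str, which throws on any unrecognized value. A table-driven version of that lookup might look like the sketch below. The enum `cache_type` is a stand-in for `ggml_type` so the example compiles without ggml.h, and the function name is hypothetical.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in for ggml_type (assumption: lets the sketch build without ggml.h).
enum class cache_type { f16, q8_0, q4_0, q4_1, q5_0, q5_1 };

// Map a cache type string to its enum value. Accepts the same six names as
// the patch and throws std::runtime_error for anything else.
cache_type cache_type_from_str_sketch(const std::string & s) {
    static const std::unordered_map<std::string, cache_type> table = {
        {"f16",  cache_type::f16},
        {"q8_0", cache_type::q8_0},
        {"q4_0", cache_type::q4_0},
        {"q4_1", cache_type::q4_1},
        {"q5_0", cache_type::q5_0},
        {"q5_1", cache_type::q5_1},
    };
    const auto it = table.find(s);
    if (it == table.end()) {
        throw std::runtime_error("Invalid cache type: " + s);
    }
    return it->second;
}
```

Because the conversion happens when the context parameters are built, a misspelled cache type surfaces as an exception at that point rather than at argument parsing time; callers that want a friendlier message would need to catch it there.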

diff --git a/common/grammar-parser.cpp b/common/grammar-parser.cpp
@@ -190,7 +190,7 @@ namespace grammar_parser {
                 pos = parse_space(pos + 1, is_nested);
             } else if (*pos == '*' || *pos == '+' || *pos == '?') { // repetition operator
                 if (last_sym_start == out_elements.size()) {
-                    throw std::runtime_error(std::string("expecting preceeding item to */+/? at ") + pos);
+                    throw std::runtime_error(std::string("expecting preceding item to */+/? at ") + pos);
                 }
 
                 // apply transformation to previous symbol (last_sym_start to end) according to