feat: add potential to run Jina Embeddings architecture #6826
Conversation
Hey @ggerganov, I would like to get some comments, especially on the
Force-pushed from e946cb0 to d7d6a4e
The way it is implemented now, ALiBi is not applied because the architecture is excluded from setting `need_kq_pos`, so `ggml_soft_max_ext` never receives a `KQ_pos` tensor:

```diff
diff --git a/llama.cpp b/llama.cpp
index 309f4eec..1230a4bc 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -4135,7 +4135,7 @@ static void llm_load_hparams(
model.ftype = ml.ftype;
- if (hparams.f_max_alibi_bias > 0.0f && model.arch != LLM_ARCH_JINA_BERT) {
+ if (hparams.f_max_alibi_bias > 0.0f) {
hparams.need_kq_pos = true;
}
@@ -7984,11 +7984,8 @@ struct llm_build_context {
struct ggml_tensor * cur;
struct ggml_tensor * inpL;
- struct ggml_tensor * inp_pos = nullptr;
+ struct ggml_tensor * inp_pos = build_inp_pos();
- if (model.arch != LLM_ARCH_JINA_BERT) {
- inp_pos = build_inp_pos();
- }
struct ggml_tensor * inp_mean = build_inp_mean();
struct ggml_tensor * inp_cls = build_inp_cls();
@@ -8010,6 +8007,9 @@ struct llm_build_context {
// KQ_mask (mask for 1 head, it will be broadcasted to all heads)
struct ggml_tensor * KQ_mask = build_inp_KQ_mask(false);
+ // positions of the tokens in the KV cache
+ struct ggml_tensor * KQ_pos = build_inp_KQ_pos();
+
// iterate layers
for (int il = 0; il < n_layer; ++il) {
struct ggml_tensor * cur = inpL;
@@ -8065,7 +8065,7 @@ struct llm_build_context {
struct ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
cb(kq, "kq", il);
- kq = ggml_soft_max_ext(ctx0, kq, KQ_mask, nullptr, 1.0f/sqrtf(float(n_embd_head)), hparams.f_max_alibi_bias);
+ kq = ggml_soft_max_ext(ctx0, kq, KQ_mask, KQ_pos, 1.0f/sqrtf(float(n_embd_head)), hparams.f_max_alibi_bias);
cb(kq, "kq_soft_max_ext", il);
struct ggml_tensor * v = ggml_cont(ctx0, ggml_transpose(ctx0, ggml_reshape_2d(ctx0, Vcur, n_embd_gqa, n_tokens)));
@@ -11131,7 +11131,7 @@ static int llama_decode_internal(
}
// non-causal masks do not use the KV cache
- if (hparams.causal_attn) {
+ if (hparams.causal_attn || model.arch == LLM_ARCH_JINA_BERT) {
llama_kv_cache_update(&lctx);
// if we have enough unused cells before the current head ->
```
But this still does not work because the …
Let's revisit this PR after merging #5021 - I think the fix should be relatively simple, but it will be easier to resolve conflicts after we merge #5021.
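For context, passing `KQ_pos` instead of `nullptr` is what lets the fused soft-max add an ALiBi term to every attention score before normalizing. A rough, illustrative C++ sketch of one row of that computation (simplified names, not the actual ggml kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Conceptual stand-in for one row of the fused
// ggml_soft_max_ext(kq, mask, pos, scale, max_bias):
//   out[j] = softmax_j( kq[j]*scale + mask[j] + slope*pos[j] )
// When pos is null (or max_bias is 0) the ALiBi term is skipped entirely,
// which is why the bias had no effect before the patch above.
static void soft_max_row(std::vector<float> & row, const float * mask,
                         const int32_t * pos, float scale, float slope) {
    float max_val = -INFINITY;
    for (size_t j = 0; j < row.size(); ++j) {
        row[j] = row[j]*scale
               + (mask ? mask[j]      : 0.0f)   // padding / causal mask
               + (pos  ? slope*pos[j] : 0.0f);  // ALiBi bias with this head's slope
        max_val = std::max(max_val, row[j]);
    }
    float sum = 0.0f;
    for (float & v : row) { v = std::exp(v - max_val); sum += v; }
    for (float & v : row) { v /= sum; }
}
```

In ggml itself the per-head slope is not passed in explicitly; it is derived inside the fused kernel from `hparams.f_max_alibi_bias` and the head index.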
Hey @ggerganov, I have a couple of comments on the suggestions you made:
Force-pushed from da96368 to d9b8dd6
Hey @ggerganov, is there anything missing?
Hey @ggerganov, I fixed the last conflicts.
Yup, thanks. I'll be looking a bit more today - I think the ALiBi stuff still needs some changes/improvements. Hope to be ready soon.
Hello @ggerganov,

Thanks for having this awesome project. I have been trying to add support for Jina Embeddings (https://huggingface.co/jinaai/jina-embeddings-v2-base-en) in `llama.cpp`.

This PR aims to make it possible to run the Jina Embeddings architecture in `llama.cpp`. For this, the changes made are:
- Add `JinaBertModel` into `convert-hf-to-gguf.py` to be able to extract the tensors into GGUF.
- Make it possible for `ollama` to load the model with proper vocab settings (add EOS and BOS tokens).
- Add the `LLM_ARCH_JINA_BERT` architecture and adapt the tensors used by the implementation.
- Update the `build_bert` model to adapt to some small changes needed by the model (like not having positional embeddings).
- Update the ALiBi computation of the `softmax` so that the slope is multiplied by the distance to the diagonal of the specific attention head (see the sketch below).
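To make the last point concrete, here is a minimal, hypothetical C++ sketch (not code from this PR) of the usual power-of-two ALiBi slope schedule and the symmetric "distance to the diagonal" bias that a bidirectional encoder like Jina BERT needs:

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Per-head ALiBi slopes: with n_head heads, head h (0-based) gets
//   slope[h] = 2^(-8*(h+1)/n_head)
// (the common power-of-two schedule; llama.cpp derives it from f_max_alibi_bias).
static std::vector<float> alibi_slopes(int n_head) {
    std::vector<float> slopes(n_head);
    for (int h = 0; h < n_head; ++h) {
        slopes[h] = std::pow(2.0f, -8.0f*(h + 1)/n_head);
    }
    return slopes;
}

// Bidirectional (encoder-style) bias for query position i and key position j:
// the head's slope times the distance to the diagonal, |i - j|, subtracted
// from the raw attention score before the softmax.
static float alibi_bias(float slope, int i, int j) {
    return -slope * std::abs(i - j);
}
```

The schedule above assumes `n_head` is a power of two; the ALiBi paper uses an interpolated variant otherwise.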