-
Use
-
@swordow positional embedding shift?
-
I found the problem. "Hello test" is split into two tokens, which is consistent with the tokenizer code in https://github.com/ggml-org/llama.cpp/blob/master/src/llama-vocab.cpp.
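You can check the split directly: llama.cpp also ships a llama-tokenize tool that prints each token id and its text piece. A minimal sketch, assuming the same model file as in the tests below and a build that includes the tool:
llama-tokenize.exe -m gte-qwen2-7b-instruct-f16.gguf -p "Hello test" --no-bos
The output shows exactly where "Hello test" is cut in two.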
-
I tried the following 3 test cases, but the results are confusing.
Test 1: create embeddings for "Hello" (don't add special/BOS tokens):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "Hello" --verbose-prompt -ngl 0 --batch-size 4096
and got the output:
embedding 0: [-0.010509 -0.007925 -0.006991 ... -0.010548 -0.014585 0.018345 ]
Test 2: create embeddings for "test" (don't add special/BOS tokens):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "test" --verbose-prompt -ngl 0 --batch-size 4096
and got the output:
embedding 0: [-0.000707 0.007401 0.001886 ... -0.014110 -0.003793 0.016024 ]
Test 3: create embeddings for "Hello test" (don't add special/BOS tokens):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "Hello test" --verbose-prompt -ngl 0 --batch-size 4096
and got two output embeddings, one per token (values omitted here).
Test 3's embedding 0 is consistent with Test 1's embedding 0, but Test 3's embedding 1 is not consistent with Test 2's embedding 0.
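One likely explanation, as an assumption rather than something confirmed in this thread: in "Hello test" the second token is " test" with a leading space, which in a BPE vocabulary is a different token than bare "test", and its hidden state also depends on its position and on the preceding "Hello". You can compare the token ids with the tokenizer tool mentioned above (a sketch, assuming your build's llama-tokenize supports the --ids and --no-bos options):
llama-tokenize.exe -m gte-qwen2-7b-instruct-f16.gguf -p "test" --no-bos --ids
llama-tokenize.exe -m gte-qwen2-7b-instruct-f16.gguf -p "Hello test" --no-bos --ids
If the second id of "Hello test" differs from the single id of "test", the two runs were never encoding the same token, so their per-token embeddings have no reason to match.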