This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Fix embeddings extraction #273

Merged
merged 8 commits into from
May 24, 2023

Conversation

skirodev
Contributor

@skirodev skirodev commented May 24, 2023

This pull request fixes several errors in the "extract embeddings" section of the original code in the repository. The results of running the corrected code to obtain embeddings are shown below.

Cosine similarities computed with the fixed code, compared against llama.cpp:
Query text: "My favourite animal is the dog"

| Text | Similarity (llama.cpp) | Similarity (fixed code) |
| --- | --- | --- |
| "My favourite animal is the dog" | 1 | 0.99999999 |
| "My favourite animal is the cat" | 0.928901 | 0.92852515 |
| "I have just adopted a cute dog" | 0.70601 | 0.596631 |
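For reference, the comparison in the table above is plain cosine similarity between embedding vectors. A minimal sketch in Rust (this is not the PR's exact code; the function name and shapes are illustrative):

```rust
/// Cosine similarity between two embedding vectors of equal length.
/// Returns dot(a, b) / (|a| * |b|).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // A vector compared with itself scores ~1.0 (up to float rounding),
    // which is why the fixed code reports 0.99999999 for the query text.
    let a = [1.6758084f32, -0.10050424, -0.3598353];
    println!("{}", cosine_similarity(&a, &a));
}
```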

The first 10 elements of each text embedding (every embedding has length 4096):

  1. Query text: "My favourite animal is the dog"
     Fixed code:
     [1.6758084, -0.10050424, -0.3598353, -4.8037696, 1.4627854, -2.3149424, 2.0896077, -1.4845643, 0.5678074, -0.18290481]
     llama.cpp:
     [1.71472, -0.0879841, -0.369562, -4.77127, 1.44898, -2.30881, 2.11664, -1.49164, 0.558957, -0.186297]
  2. Text: "I have just adopted a cute dog"
     Fixed code:
     [-0.5668864, -1.1441603, -2.3260381, -6.598249, 1.3243433, -1.7672672, 0.9366179, 0.3414038, 1.0347223, 1.2812055]
     llama.cpp:
     [1.19416, -1.06577, -1.59795, -4.10442, 1.06202, -2.01701, 0.804208, 0.264081, 2.19793, 1.59068]
  3. Text: "My favourite animal is the cat"
     Fixed code:
     [2.5065503, 0.066849045, -0.081686154, -4.7165327, 1.6428406, -2.7072144, 2.914644, -1.7767899, 0.41250193, 0.5863753]
     llama.cpp:
     [2.49985, 0.0733547, -0.093324, -4.74914, 1.63704, -2.69348, 2.94915, -1.73889, 0.421222, 0.616791]

@philpax philpax merged commit 73e5bb3 into rustformers:main May 24, 2023
@philpax
Collaborator

philpax commented May 24, 2023

Thank you for this! It's great to have this sorted out. The discrepancies between our numbers and llama.cpp are likely due to #210, which we should be able to resolve.

I revised your example so that it is possible to use your own model, query, and comparands, and simplified all of the examples to reduce the code to the bare essentials, but the spirit is more or less the same.

Again, thank you for addressing this 🙏
