feat: embedding distance calculator #141

Closed
1 task done
marieaurore123 opened this issue Dec 4, 2024 · 3 comments
@marieaurore123
Contributor

marieaurore123 commented Dec 4, 2024

  • I have looked for existing issues (including closed) about this

Feature Request

Add a feature to Rig that helps users measure the semantic similarity (or dissimilarity) between a prediction and a reference string.

Motivation

This feature is closely related to LLMs and embeddings, and could be useful both for Rig itself (improving the in-memory vector store) and for other projects that depend on Rig.

Proposal

pub enum DistanceMetric {
    Cosine,
    L2,
    Dot,
    Manhattan,
}

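impl rig::embeddings::embedding::Embedding {
    // Each method measures the distance between `self` and `other`.
    pub fn cosine_dist(&self, other: &Embedding) -> f64 { todo!() }
    pub fn dot_dist(&self, other: &Embedding) -> f64 { todo!() }
    pub fn l2_dist(&self, other: &Embedding) -> f64 { todo!() }
    pub fn manhattan_dist(&self, other: &Embedding) -> f64 { todo!() }
}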
  • Update the in-memory vector store to use this feature
  • Put the implementation behind a feature flag
  • Maybe: instead of adding semanticsimilarity_rs as a dependency, copy only the necessary code

Dependencies:
semanticsimilarity_rs
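
For illustration, here is a minimal standard-library sketch of the four metrics listed in the proposal. It is not the semanticsimilarity_rs implementation: the free functions over `&[f64]`, the `distance` method on `DistanceMetric`, and the choice to define cosine distance as `1 - cosine similarity` are all assumptions made for this example. The `Dot` variant returns the raw dot product, which is a similarity (larger means closer) rather than a true distance.

```rust
/// The metrics listed in the proposal.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DistanceMetric {
    Cosine,
    L2,
    Dot,
    Manhattan,
}

/// Dot product of two equal-length vectors.
pub fn dot_product(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine distance, assumed here to be 1 - cosine similarity.
/// Yields NaN if either vector has zero norm.
pub fn cosine_dist(a: &[f64], b: &[f64]) -> f64 {
    let norms = dot_product(a, a).sqrt() * dot_product(b, b).sqrt();
    1.0 - dot_product(a, b) / norms
}

/// Euclidean (L2) distance.
pub fn l2_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Manhattan (L1) distance.
pub fn manhattan_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

impl DistanceMetric {
    /// Hypothetical dispatcher: applies the selected metric to two vectors.
    /// Panics if the vectors differ in length.
    pub fn distance(self, a: &[f64], b: &[f64]) -> f64 {
        assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
        match self {
            DistanceMetric::Cosine => cosine_dist(a, b),
            DistanceMetric::L2 => l2_dist(a, b),
            DistanceMetric::Dot => dot_product(a, b),
            DistanceMetric::Manhattan => manhattan_dist(a, b),
        }
    }
}
```

Dispatching through the enum would let a vector store take the metric as a runtime parameter, while the free functions stay usable on their own.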

Alternatives

marieaurore123 self-assigned this Dec 4, 2024
@cvauclair
Contributor

This makes a lot of sense! I think it would be better to make it part of rig-core directly, since it's closely related to the existing embedding API. However, depending on how "heavy" the dependency is (it does depend on Rayon), this feature could sit behind a feature flag.

@cvauclair
Contributor

Alternatively, for both flexibility and completeness, it could be interesting to rewrite the distance algorithms from semanticsimilarity_rs without Rayon, and add a rayon feature flag that swaps those implementations for the ones from semanticsimilarity_rs (or for our own Rayon-based versions).

This could be done with something like:

struct Embedding {...}

#[cfg(feature = "rayon")]
impl Embedding {
  pub fn cosine_distance(&self, other: &Embedding) -> f64 {
    // implementation using rayon or semanticsimilarity_rs
  }
}

#[cfg(not(feature = "rayon"))]
impl Embedding {
  pub fn cosine_distance(&self, other: &Embedding) -> f64 {
    // implementation with stdlib (without rayon or semanticsimilarity_rs)
  }
}
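
A compilable variation on the outline above, as a rough sketch: it gates only the inner dot product on the feature, so the public `cosine_distance` method is written once. The `Embedding` struct here is a stand-in with an assumed `vec: Vec<f64>` field, the private `dot` helper is hypothetical, and the `rayon` feature is assumed to be declared in `Cargo.toml` with Rayon as an optional dependency. Without the feature enabled, only the standard-library branch is compiled.

```rust
/// Stand-in for rig's embedding type; the `vec` field is an assumption.
pub struct Embedding {
    pub vec: Vec<f64>,
}

#[cfg(feature = "rayon")]
impl Embedding {
    /// Parallel dot product, compiled only when the `rayon` feature is on.
    fn dot(&self, other: &Embedding) -> f64 {
        use rayon::prelude::*;
        self.vec
            .par_iter()
            .zip(other.vec.par_iter())
            .map(|(x, y)| x * y)
            .sum()
    }
}

#[cfg(not(feature = "rayon"))]
impl Embedding {
    /// Sequential dot product using only the standard library.
    fn dot(&self, other: &Embedding) -> f64 {
        self.vec.iter().zip(&other.vec).map(|(x, y)| x * y).sum()
    }
}

impl Embedding {
    /// Cosine distance, assumed here to be 1 - cosine similarity.
    /// Shared by both builds; only `dot` differs between them.
    pub fn cosine_distance(&self, other: &Embedding) -> f64 {
        let norms = self.dot(self).sqrt() * other.dot(other).sqrt();
        1.0 - self.dot(other) / norms
    }
}
```

Gating the kernel rather than the whole method keeps the two builds from drifting apart, at the cost of a less direct mapping onto semanticsimilarity_rs if its functions were used wholesale.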

mateobelanger added this to the v0.6 milestone Dec 5, 2024
@cvauclair
Contributor

Closed by #142

3 participants