
Add Demo-2 #2

Merged · merged 30 commits from demo-2 into main on Mar 5, 2024
Conversation

abhi3700
Owner

Description

This PR adds code + data for the following logic:

Consider a sample dataset of 1,000 reviews.

  1. For each sample in the dataset (the 1,000 reviews):
    1. Create an embedding of the sample.
    2. Save the embedding to disk, linked with the source sample.
  2. For each embedding in the embeddings:
    1. For each nbits in [8, 16, 32, 64, 128]:
      1. Apply LSH to the embedding.
      2. Add the hash to a hash table, along with the index of the embedding.
      3. Save the hashes to disk (categorized by parameter).

Here is the output file (in CSV format) with the following columns:

Text, Embedding, Hash 8-bit, Hash 16-bit, Hash 32-bit, Hash 64-bit, Hash 128-bit
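A minimal sketch of the hashing step, assuming random-hyperplane LSH; the `hash_vector` name comes from this PR, but its signature, the embedding dimension, and the rest of the scaffolding are assumptions:

```python
import numpy as np

def hash_vector(embedding: np.ndarray, hyperplanes: np.ndarray) -> str:
    """Project the embedding onto nbits random hyperplanes and keep only the
    sign of each projection, yielding an nbits-long binary string."""
    bits = (hyperplanes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

rng = np.random.default_rng(0)
dim = 1536                                   # assumed embedding dimension
embeddings = rng.normal(size=(1000, dim))    # stand-in for the saved embeddings

buckets: dict[int, dict[str, list[int]]] = {}
for nbits in [8, 16, 32, 64, 128]:
    hyperplanes = rng.normal(size=(nbits, dim))
    table: dict[str, list[int]] = {}
    for idx, emb in enumerate(embeddings):
        h = hash_vector(emb, hyperplanes)
        table.setdefault(h, []).append(idx)  # bucket keyed by the hash string
    buckets[nbits] = table
```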

Remove unused arg from func: 'hash_vector'
Actually there are multiple buckets, each with its respective key hash. So, changed the name.
- Replace <br /> with whitespace
- Add embedding corresponding to each review
@abhi3700 abhi3700 self-assigned this Feb 26, 2024
@abhi3700 abhi3700 added the enhancement New feature or request label Feb 26, 2024
@abhi3700
Owner Author

One observation when repeating data pre-processing: for most of the 1,000 text reviews, the embedding values change only after the 3rd-4th decimal place. 63d48d9

I don't think it would affect the bucketing.

@abhi3700
Owner Author

abhi3700 commented Feb 27, 2024

This is the error when calling the function pl.read_csv() directly…

polars.exceptions.ComputeError: could not parse `00011011110011110011001001110010` as dtype `i64` at column 'Hash 32-bit' (column number 5)

The current offset in the file is 34598 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `00011011110011110011001001110010` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Although this can be fixed by reading the hash columns as str instead of the default i64 type. Not sure if there is any solution in a vector DB or not 🤔 .
Now, it’s fixed in this commit.
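For reference, a sketch of the str-dtype fix using the `dtypes` argument the error message suggests; the column names come from the CSV described above, while the file name is hypothetical:

```python
import polars as pl

# Read the binary-hash columns as strings instead of letting polars infer i64.
hash_cols = ["Hash 8-bit", "Hash 16-bit", "Hash 32-bit", "Hash 64-bit", "Hash 128-bit"]

df = pl.read_csv(
    "preprocessed_data.csv",                   # hypothetical file name
    dtypes={col: pl.Utf8 for col in hash_cols},
)
```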

- fmt done
- `main.py` is removed now, as pre-processing & detection are run in separate scripts.
- The 1st review is checked as the query text. Observation 🔍: except for nbits=8, it falls into the expected bucket.
Refrain from calling 'client.embeddings.create' multiple times in a loop, as the 'create' fn supports multiple texts and returns embeddings for all of them as a matrix/n-dim array (see the sketch below).
- Execution time is reduced from 15 mins to 30 s for all hyperplanes.
- The openai embedding 'create' fn supports a list of texts, so the loop is removed from the main() fn.
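A sketch of the batched call, assuming the openai >= 1.x Python client; the model name and sample texts are assumptions since the PR does not state them:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reviews = ["first review text", "second review text"]  # stand-in for the 1000 reviews

# One request for the whole batch instead of one request per review.
response = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model name
    input=reviews,
)
embeddings = [item.embedding for item in response.data]  # List[List[float]]
```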
- For nbits=8, some buckets change, with fewer/more texts falling inside.
- For the other nbits values (16, 32, 64, 128), some buckets change their hash with the same texts inside.
 This is the observation seen when re-running. The changes can be seen in this commit.
- For nbits=8, some buckets change, with fewer/more texts falling inside. Also, their hashes change as well.
- For the other nbits values (16, 32, 64, 128), some buckets change their hash with the same texts inside.
 This is the observation seen when re-running again. The changes can be seen in this commit. It seems like nbits=8 is not reliable.
Modified the script to generate preprocessed_data with an embedding (i.e. 'List[float]') for each review, along with bucket generation for all hyperplanes.
Fix static typing. Added a pre-check before detecting the query text.
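A rough sketch of the detection flow with that pre-check, under the same random-hyperplane assumptions as the earlier sketch; the `detect` helper and its signature are hypothetical:

```python
import numpy as np

def hash_vector(embedding: np.ndarray, hyperplanes: np.ndarray) -> str:
    """Sign of the projection onto each hyperplane, as a binary string."""
    bits = (hyperplanes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

def detect(query_embedding: np.ndarray, hyperplanes: np.ndarray,
           table: dict[str, list[int]]) -> list[int]:
    """Hash the query and return the indices stored in its bucket.
    The pre-check covers the case where the query lands in an unseen bucket."""
    key = hash_vector(query_embedding, hyperplanes)
    if key not in table:
        return []
    return table[key]
```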
@abhi3700
Owner Author

abhi3700 commented Mar 1, 2024

Observation (for 20 paragraphs): the matrix does have the relatively lowest Hamming distances along the diagonal, as expected. You can find those in matrix_8.csv, matrix_16.csv, etc.
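A small sketch of how such a matrix can be built from the binary-string hashes; the paragraph hashes shown are made-up placeholders:

```python
import numpy as np

def hamming_distance(h1: str, h2: str) -> int:
    """Count differing bits between two equal-length binary-string hashes."""
    return sum(c1 != c2 for c1, c2 in zip(h1, h2))

# placeholder hashes for original vs. slightly modified paragraphs
original_hashes = ["00101101", "11010010"]
modified_hashes = ["00101100", "11010110"]

matrix = np.array(
    [[hamming_distance(o, m) for m in modified_hashes] for o in original_hashes]
)
# matrix[i][i] (original i vs. its modified version) should be the smallest in row i.
```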

@abhi3700
Owner Author

abhi3700 commented Mar 4, 2024

Added interactive heatmap plots (in HTML format; hover over them for values) to visualize the generated matrices for each nbits = 8, 16, 32, 64, 128, computed over 20 paragraphs between the original & slightly modified texts. Download the HTML plots from the commit.
👀 With an increasing no. of hyperplanes, the diagonal blue line becomes more crystal clear.
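One way to produce such an interactive HTML heatmap, assuming plotly is the plotting package (the PR does not name it) and that matrix_32.csv holds one distance matrix as plain numeric columns:

```python
import plotly.express as px
import polars as pl

# Load one Hamming-distance matrix and render it as an interactive heatmap;
# hovering over a cell shows its distance value.
matrix = pl.read_csv("matrix_32.csv").to_numpy()

fig = px.imshow(
    matrix,
    labels=dict(color="Hamming distance"),
    color_continuous_scale="Blues_r",  # low distance -> darker blue diagonal
)
fig.write_html("heatmap_32.html")
```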

…s.csv files

- Add 3 more Python packages for this to the requirements.txt and pyproject.toml files
@abhi3700 abhi3700 requested a review from jfrank-summit March 5, 2024 11:20
@abhi3700 abhi3700 merged commit cbbc89a into main Mar 5, 2024
1 check passed
@abhi3700 abhi3700 added the bug Something isn't working label Mar 5, 2024
@abhi3700 abhi3700 deleted the demo-2 branch March 6, 2024 12:15