
Add Demo-2 #2

Merged · merged 30 commits from demo-2 into main on Mar 5, 2024
Conversation

abhi3700
Owner

Description

This PR adds code + data for the following logic:

Consider a sample dataset of 1,000 reviews.

  1. For each sample in the dataset (the 1,000 reviews):
    1. Create an embedding of the sample.
    2. Save the embedding to disk, linked with the source sample.
  2. For each embedding in the embeddings:
    1. For each nbits in [8, 16, 32, 64, 128]:
      1. Apply LSH to the embedding.
      2. Add the hash to a hash table, along with the index of the embedding.
      3. Save the hashes to disk (categorized by parameter).

Here is the output file (in CSV format) with the following columns:

Text, Embedding, Hash 8-bit, Hash 16-bit, Hash 32-bit, Hash 64-bit, Hash 128-bit
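A minimal sketch of the hashing step, assuming random-hyperplane LSH; the `hash_vector` name comes from this PR, but its signature, the embedding dimension, and the rest of the scaffolding are assumptions:

```python
import numpy as np

def hash_vector(embedding: np.ndarray, hyperplanes: np.ndarray) -> str:
    """Project the embedding onto nbits random hyperplanes and keep only the
    sign of each projection, yielding an nbits-long binary string."""
    bits = (hyperplanes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

rng = np.random.default_rng(0)
dim = 1536                                   # assumed embedding dimension
embeddings = rng.normal(size=(1000, dim))    # stand-in for the saved embeddings

buckets: dict[int, dict[str, list[int]]] = {}
for nbits in [8, 16, 32, 64, 128]:
    hyperplanes = rng.normal(size=(nbits, dim))
    table: dict[str, list[int]] = {}
    for idx, emb in enumerate(embeddings):
        h = hash_vector(emb, hyperplanes)
        table.setdefault(h, []).append(idx)  # bucket keyed by the hash string
    buckets[nbits] = table
```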

Remove unused arg from func: 'hash_vector'
Actually there are multiple buckets, each with its respective key hash. So, changed the name.
- Replace <br /> with whitespace
- Add embedding corresponding to each review
@abhi3700 abhi3700 self-assigned this Feb 26, 2024
@abhi3700 abhi3700 added the enhancement New feature or request label Feb 26, 2024
@abhi3700
Owner Author

One observation when repeating data pre-processing: for most of the 1,000 text reviews, the embedding values change only after the 3rd-4th decimal place. 63d48d9

I don't think it would affect the bucketing.

@abhi3700
Owner Author

abhi3700 commented Feb 27, 2024

This is the error when calling the function pl.read_csv() directly…

polars.exceptions.ComputeError: could not parse `00011011110011110011001001110010` as dtype `i64` at column 'Hash 32-bit' (column number 5)

The current offset in the file is 34598 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `00011011110011110011001001110010` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Although this can be fixed by reading the hash columns as str instead of the default i64 type. Not sure if there is any solution in a vector DB or not 🤔 .
Now, it’s fixed in this commit.
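For reference, a sketch of the str-dtype fix using the `dtypes` argument the error message suggests; the column names come from the CSV described above, while the file name is hypothetical:

```python
import polars as pl

# Read the binary-hash columns as strings instead of letting polars infer i64.
hash_cols = ["Hash 8-bit", "Hash 16-bit", "Hash 32-bit", "Hash 64-bit", "Hash 128-bit"]

df = pl.read_csv(
    "preprocessed_data.csv",                   # hypothetical file name
    dtypes={col: pl.Utf8 for col in hash_cols},
)
```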

- fmt done
- `main.py` is removed now, as pre-processing & detection are run in separate scripts.
- The 1st review is checked as the query text. Observation 🔍: except for nbits=8, it falls into the expected bucket.
Refrain from calling 'client.embeddings.create' multiple times in a loop, as the 'create' fn supports multiple texts and returns embeddings for all of them as a matrix/n-dim array (see the sketch below).
- Execution time is reduced from 15 mins to 30 s for all hyperplanes.
- The openai embedding 'create' fn supports a list of texts, so the loop is removed from the main() fn.
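A sketch of the batched call, assuming the openai >= 1.x Python client; the model name and sample texts are assumptions since the PR does not state them:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

reviews = ["first review text", "second review text"]  # stand-in for the 1000 reviews

# One request for the whole batch instead of one request per review.
response = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model name
    input=reviews,
)
embeddings = [item.embedding for item in response.data]  # List[List[float]]
```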
- For nbits=8, some buckets change, with fewer/more texts falling inside.
- For the other nbits values (16, 32, 64, 128), some buckets change their hash with the same texts inside.
 This is the observation seen when re-running. The changes can be seen in this commit.
- For nbits=8, some buckets change, with fewer/more texts falling inside. Also, their hashes change as well.
- For the other nbits values (16, 32, 64, 128), some buckets change their hash with the same texts inside.
 This is the observation seen when re-running again. The changes can be seen in this commit. It seems like nbits=8 is not reliable.
Modified the script to generate preprocessed_data with an embedding (i.e. 'List[float]') for each review, along with bucket generation for all hyperplanes.
Fix static typing. Added a pre-check before detecting the query text.
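A rough sketch of the detection flow with that pre-check, under the same random-hyperplane assumptions as the earlier sketch; the `detect` helper and its signature are hypothetical:

```python
import numpy as np

def hash_vector(embedding: np.ndarray, hyperplanes: np.ndarray) -> str:
    """Sign of the projection onto each hyperplane, as a binary string."""
    bits = (hyperplanes @ embedding) > 0
    return "".join("1" if b else "0" for b in bits)

def detect(query_embedding: np.ndarray, hyperplanes: np.ndarray,
           table: dict[str, list[int]]) -> list[int]:
    """Hash the query and return the indices stored in its bucket.
    The pre-check covers the case where the query lands in an unseen bucket."""
    key = hash_vector(query_embedding, hyperplanes)
    if key not in table:
        return []
    return table[key]
```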
@abhi3700
Owner Author

abhi3700 commented Mar 1, 2024

Observation (for 20 paragraphs): the matrix does have the relatively lowest Hamming distances along the diagonal, as expected. You can find those in matrix_8.csv, matrix_16.csv, etc.
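A small sketch of how such a matrix can be built from the binary-string hashes; the paragraph hashes shown are made-up placeholders:

```python
import numpy as np

def hamming_distance(h1: str, h2: str) -> int:
    """Count differing bits between two equal-length binary-string hashes."""
    return sum(c1 != c2 for c1, c2 in zip(h1, h2))

# placeholder hashes for original vs. slightly modified paragraphs
original_hashes = ["00101101", "11010010"]
modified_hashes = ["00101100", "11010110"]

matrix = np.array(
    [[hamming_distance(o, m) for m in modified_hashes] for o in original_hashes]
)
# matrix[i][i] (original i vs. its modified version) should be the smallest in row i.
```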

@abhi3700
Owner Author

abhi3700 commented Mar 4, 2024

Added interactive heatmap plots (in HTML format; hover over them for values) to visualize the generated matrices for each nbits = 8, 16, 32, 64, 128, computed over 20 paragraphs between the original & slightly modified texts. Download the HTML plots from the commit.
👀 With an increasing no. of hyperplanes, the diagonal blue line becomes more crystal clear.
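One way to produce such an interactive HTML heatmap, assuming plotly is the plotting package (the PR does not name it) and that matrix_32.csv holds one distance matrix as plain numeric columns:

```python
import plotly.express as px
import polars as pl

# Load one Hamming-distance matrix and render it as an interactive heatmap;
# hovering over a cell shows its distance value.
matrix = pl.read_csv("matrix_32.csv").to_numpy()

fig = px.imshow(
    matrix,
    labels=dict(color="Hamming distance"),
    color_continuous_scale="Blues_r",  # low distance -> darker blue diagonal
)
fig.write_html("heatmap_32.html")
```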

…s.csv files

- Add 3 more Python packages for this to the requirements.txt and pyproject.toml files
@abhi3700 abhi3700 requested a review from jfrank-summit March 5, 2024 11:20
@abhi3700 abhi3700 merged commit cbbc89a into main Mar 5, 2024
1 check passed
@abhi3700 abhi3700 added the bug Something isn't working label Mar 5, 2024
@abhi3700 abhi3700 deleted the demo-2 branch March 6, 2024 12:15