Add Demo-2 #2
Conversation
Actually, there are multiple buckets, each with its own key hash, so the name was changed.
- Replace `<br />` with whitespace
- Add the embedding corresponding to each review
One observation when repeating data pre-processing: for most of the 1000 text reviews, the embedding values change after the 3rd–4th decimal place (63d48d9). I don't think this would affect bucketing.
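A quick sanity check supports this: with sign-based hyperplane hashing, a perturbation in the 3rd–4th decimal place only flips a bit when the embedding lies almost exactly on a hyperplane. A minimal sketch with hypothetical planes and values (not the project's actual data):

```python
import numpy as np

# hypothetical hyperplane normals and embedding values, for illustration only
planes = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
v = np.array([0.1234, -0.5678, 0.9012])
v_rerun = v + 1e-4  # values differ after the 3rd-4th decimal place

bits = (planes @ v > 0).astype(int)
bits_rerun = (planes @ v_rerun > 0).astype(int)
print(np.array_equal(bits, bits_rerun))  # True: same bucket despite the drift
```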
This is the error when calling `pl.read_csv()` directly:

```
polars.exceptions.ComputeError: could not parse `00011011110011110011001001110010` as dtype `i64` at column 'Hash 32-bit' (column number 5)

The current offset in the file is 34598 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying the correct dtype with the `dtypes` argument,
- setting `ignore_errors` to `True`,
- adding `00011011110011110011001001110010` to the `null_values` list.

Original error: remaining bytes non-empty
```

Although this can be fixed by reading as …
- fmt done
- `main.py` is removed, since pre-processing & detection now run in separate scripts
- The 1st review is checked as the query text. Observation 🔍: except for nbits=8, it falls into the expected bucket.
Refrain from calling `client.embeddings.create` multiple times in a loop: the `create` function accepts multiple texts and returns embeddings for all of them in a single response.
- Execution time is reduced from 15 min to 30 s for all hyperplanes.
- The OpenAI embedding `create` function accepts a list of texts, so the loop was removed from `main()`.
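The batched call can be sketched as follows; `embed_batch` is a hypothetical helper and the model name is an assumption, not necessarily what the project uses:

```python
def embed_batch(texts, model="text-embedding-3-small"):
    """Embed all texts in one API call instead of one call per text."""
    # assumes the `openai` package is installed and OPENAI_API_KEY is set
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(model=model, input=texts)
    # resp.data preserves the input order; one embedding per text
    return [d.embedding for d in resp.data]
```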
- For nbits=8, some buckets change, with fewer/more texts falling inside, and their hashes change as well.
- For the other nbits (16, 32, 64, 128), some buckets change their hash while keeping the same texts inside.

This is the observation seen when re-running; the changes can be seen in this commit. It seems nbits=8 is not reliable.
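One plausible cause for hashes changing between runs is that the random hyperplanes are regenerated each time; if so, fixing the RNG seed makes the buckets reproducible. A sketch assuming NumPy-generated hyperplanes (the project's actual generation code may differ):

```python
import numpy as np

def make_hyperplanes(nbits, dim, seed=42):
    # same seed -> identical hyperplanes -> identical bucket hashes across runs
    rng = np.random.default_rng(seed)
    return rng.standard_normal((nbits, dim))

a = make_hyperplanes(8, 1536)
b = make_hyperplanes(8, 1536)
print(np.array_equal(a, b))  # True: reproducible across runs
```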
Modified the script to generate preprocessed data with an embedding (i.e. `List[float]`) for each review, along with bucket generation for all hyperplanes.
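The bucketing step can be sketched as a sign hash per hyperplane set; the helper names, planes, and embeddings below are hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np
from collections import defaultdict

def hash_embedding(embedding, planes):
    # one bit per hyperplane: the sign of the dot product
    bits = (planes @ embedding > 0).astype(int)
    return "".join(map(str, bits))

planes = np.array([[1.0, -1.0],
                   [0.5, 0.5]])  # 2 hyperplanes -> a 2-bit hash (illustrative)
embeddings = [np.array([0.9, 0.1]), np.array([-0.2, 0.6])]

# group review indices by their hash: each distinct hash is one bucket
buckets = defaultdict(list)
for i, emb in enumerate(embeddings):
    buckets[hash_embedding(emb, planes)].append(i)
print(dict(buckets))  # {'11': [0], '01': [1]}
```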
Fixed static typing. Added a pre-check before detecting the query text.
Observation (for 20 paragraphs): as expected, the matrix has its relatively lowest Hamming distances along the diagonal. You can find those in …
Added interactive heatmap plots (in HTML format; hover over a cell to see its value) visualizing the generated matrices for each nbits = 8, 16, 32, 64, 128 over 20 paragraphs, between original & slightly modified texts. Download the HTML plots from the commit.
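The diagonal check and the heatmaps can be sketched as follows; the function names are hypothetical, and the heatmap part assumes plotly is installed:

```python
import numpy as np

def hamming_matrix(hashes_a, hashes_b):
    # pairwise Hamming distances between original and modified bit-string hashes
    m = np.zeros((len(hashes_a), len(hashes_b)), dtype=int)
    for i, ha in enumerate(hashes_a):
        for j, hb in enumerate(hashes_b):
            m[i, j] = sum(c1 != c2 for c1, c2 in zip(ha, hb))
    return m

def save_heatmap(m, path):
    # hover over a cell in the saved HTML to see the exact distance
    import plotly.graph_objects as go
    go.Figure(go.Heatmap(z=m)).write_html(path)

# toy 4-bit hashes: row i vs column i pairs each original with its modified copy
m = hamming_matrix(["0101", "1111"], ["0100", "1011"])
print(m.tolist())  # [[1, 3], [3, 1]] -> smallest distances on the diagonal
```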
…s.csv files
- Added 3 more Python packages for this to the `requirements.txt` and `pyproject.toml` files
Description
This PR adds code + data for this logic:
Consider a sample dataset of 1000 reviews.
Here is the output file (in CSV format) with the following columns: