Commit 9e9dd0a (1 parent: c084850)
Showing 38 changed files with 3,854 additions and 0 deletions.
@@ -0,0 +1,13 @@
ONNX_EMBEDDING_MODEL="Xenova/all-MiniLM-L6-v2"
ONNX_EMBEDDING_MODEL_PRECISION=q8
# | Model                                        | Precision      | Size                   |
# | -------------------------------------------- | -------------- | ---------------------- |
# | Xenova/all-MiniLM-L6-v2                      | fp32, fp16, q8 | 90 MB, 45 MB, 23 MB    |
# | Xenova/all-MiniLM-L12-v2                     | fp32, fp16, q8 | 133 MB, 67 MB, 34 MB   |
# | Xenova/paraphrase-multilingual-MiniLM-L12-v2 | fp32, fp16, q8 | 470 MB, 235 MB, 118 MB |
# | Xenova/all-distilroberta-v1                  | fp32, fp16, q8 | 326 MB, 163 MB, 82 MB  |
# | BAAI/bge-small-en-v1.5                       | fp32           | 133 MB                 |

ALLOW_REMOTE_MODELS=true
LOCAL_MODEL_PATH=models/
CACHE_DIR=models/
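How these settings are consumed is defined by the project's code, which is not part of this excerpt. As a hedged illustration only: the `Xenova/...` model names suggest Transformers.js, whose global `env` object exposes matching options, so the last three variables could be wired up roughly like this.

```js
// Sketch only -- assumes the project uses @xenova/transformers; the repository's
// actual embedding module is not shown in this commit excerpt.
import 'dotenv/config';                     // loads .env into process.env
import { env } from '@xenova/transformers';

env.allowRemoteModels = process.env.ALLOW_REMOTE_MODELS === 'true';
env.localModelPath = process.env.LOCAL_MODEL_PATH; // e.g. "models/"
env.cacheDir = process.env.CACHE_DIR;              // e.g. "models/"
```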
@@ -0,0 +1,2 @@
node_modules/
models/
@@ -0,0 +1,99 @@
# 🏷️ Fast Topic Analysis

A tool for analyzing text against predefined topics using embeddings and cosine similarity.



## Overview

This project consists of two main components:
1. A generator (`generate.js`) that creates topic embeddings from training data
2. A test runner (`run-test.js`) that analyzes text against these topic embeddings

## Setup

Install dependencies:

```bash
npm install
```

## Usage

### Generating Topic Embeddings

```bash
node generate.js
```

This will:
- Clean the `data/topic_embeddings` directory
- Process training data from `data/training_data.jsonl`
- Generate embeddings for each topic defined in `labels-config.js`
- Save embeddings as JSON files in `data/topic_embeddings/` (a rough sketch of this flow follows below)
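`generate.js` itself is not reproduced in this excerpt, so the following is only a sketch of one common way to build topic embeddings: embed every training sentence for a label with Transformers.js (assumed here from the `Xenova/...` model names) and average the vectors into a single topic vector. The output path and file naming are illustrative.

```js
// Sketch only -- not the repository's actual generate.js.
// Group training texts by label, embed each text, average per topic, save as JSON.
import fs from 'node:fs';
import 'dotenv/config';
import { pipeline } from '@xenova/transformers'; // assumed embedding library

const extractor = await pipeline('feature-extraction', process.env.ONNX_EMBEDDING_MODEL);

// Group training texts by label.
const byLabel = {};
for (const line of fs.readFileSync('data/training_data.jsonl', 'utf8').split('\n')) {
  if (!line.trim()) continue;
  const { text, label } = JSON.parse(line);
  (byLabel[label] ??= []).push(text);
}

// Average the sentence embeddings of each topic and write one JSON file per topic.
fs.mkdirSync('data/topic_embeddings', { recursive: true });
for (const [label, texts] of Object.entries(byLabel)) {
  const vectors = [];
  for (const text of texts) {
    const output = await extractor(text, { pooling: 'mean', normalize: true });
    vectors.push(Array.from(output.data));
  }
  const topicVector = vectors[0].map(
    (_, i) => vectors.reduce((sum, v) => sum + v[i], 0) / vectors.length
  );
  fs.writeFileSync(`data/topic_embeddings/${label}.json`, JSON.stringify(topicVector));
}
```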
### Running Analysis

```bash
node run-test.js
```

The test runner provides an interactive interface to:
1. Choose logging verbosity
2. Optionally show matched sentences if verbose logging is disabled
3. Select a test message file to analyze

Configuration preferences (last used file, verbosity, etc.) are automatically saved in `run-test-config.json`.
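Persisting those preferences needs nothing elaborate: read the JSON file if it exists, fall back to defaults otherwise, and write it back after the prompts. The field names in this sketch are illustrative, not taken from the actual `run-test-config.json`.

```js
// Sketch of preference persistence; the real run-test-config.json may use different fields.
import fs from 'node:fs';

const CONFIG_PATH = 'run-test-config.json';

function loadConfig() {
  try {
    return JSON.parse(fs.readFileSync(CONFIG_PATH, 'utf8'));
  } catch {
    // First run or unreadable file: start from defaults.
    return { verbose: false, showMatches: true, lastFile: null };
  }
}

function saveConfig(config) {
  fs.writeFileSync(CONFIG_PATH, JSON.stringify(config, null, 2));
}
```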
#### 🚨 First Run Model Download

The first time a model is used (e.g. by `generate.js` or `run-test.js`), it is downloaded and cached to the directory specified in `.env`. All subsequent runs are fast because the model is loaded from the cache.
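Assuming Transformers.js again, this behaviour comes for free: the first `pipeline()` call for a model fetches the ONNX weights into the configured cache directory, and later calls reuse the files on disk. Note that precision selection differs between versions: the boolean `quantized` option below is the `@xenova/transformers` v2 form, while newer `@huggingface/transformers` releases use a `dtype` option (e.g. `'q8'`) instead.

```js
// Sketch: the first call downloads the model into the cache directory configured via
// .env (e.g. models/); subsequent calls load it from disk.
import 'dotenv/config';
import { pipeline } from '@xenova/transformers';

const extractor = await pipeline(
  'feature-extraction',
  process.env.ONNX_EMBEDDING_MODEL, // e.g. "Xenova/all-MiniLM-L6-v2"
  { quantized: process.env.ONNX_EMBEDDING_MODEL_PRECISION === 'q8' }
);
```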
### Output

The analysis will show:
- Similarity scores between the test text and each topic (computed as sketched below)
- Execution time
- Total comparisons made
- Number of matches found
- Model information
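The scores are cosine similarities between the embedding of the text under test and each stored topic embedding; with L2-normalised embeddings the formula reduces to a dot product. The real calculation lives in `modules/similarity.js`, which is not shown here; a plain version looks like this:

```js
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (||a|| * ||b||). With normalised vectors this is just the dot product.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```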
## File Structure

```
├── data/
│   ├── training_data.jsonl   # Training data
│   └── topic_embeddings/     # Generated embeddings
├── test-messages/            # Test files
├── modules/
│   ├── embedding.js          # Embedding functions
│   └── similarity.js         # Similarity calculation
├── generate.js               # Embedding generator
├── run-test.js               # Test runner
└── labels-config.js          # Topic definitions
```

## Customizing

- Change `ONNX_EMBEDDING_MODEL` and `ONNX_EMBEDDING_MODEL_PRECISION` in `.env` to use a different embedding model and precision.
- Change the per-topic thresholds defined in `labels-config.js` to adjust the similarity score that triggers a match (see the sketch after this list).
- Add more test messages to the `test-messages` directory to test against.
- Add more training data to `data/training_data.jsonl` to improve the topic embeddings.
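`labels-config.js` is not included in this excerpt, so its exact export shape is an assumption; a per-topic threshold configuration typically looks something like the following, where `threshold` is the cosine-similarity score above which a match is reported.

```js
// Hypothetical labels-config.js -- the real file may use different keys or structure.
// Labels correspond to the `label` values in data/training_data.jsonl.
export default [
  { label: 'frogs', threshold: 0.55 },
  { label: 'ducks', threshold: 0.55 },
];
```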
### Training Data

The training data is a JSONL file. Each line is a JSON object with the following fields:
- `text`: The text to be analyzed
- `label`: The label of the topic

```jsonl
{"text": "amphibians, croaks, wetlands, camouflage, metamorphosis", "label": "frogs"}
{"text": "jumping, ponds, tadpoles, moist skin, diverse habitats", "label": "frogs"}
{"text": "waterfowl, quacking, ponds, waddling, migration", "label": "ducks"}
{"text": "feathers, webbed feet, lakes, nesting, foraging", "label": "ducks"}
{"text": "dabbling, flocks, wetlands, bills, swimming", "label": "ducks"}
```

The training data is used to generate the topic embeddings; the more training data you have, the better the topic embeddings will be.
The labels used when generating the topic embeddings are defined in `labels-config.js`.
@@ -0,0 +1,30 @@
Here are the prompts I used to generate the example training data that ships with this project.


-------------------------------------
-- Training Data Generation Prompt --
-------------------------------------
Generate a ton of training data for me. It should be in `jsonl` format, like this example:

```jsonl
{"label": "Disney", "text": "The Lion King's soundtrack is simply unforgettable."}
{"label": "Disney", "text": "I just watched the new live-action remake of Mulan."}
{"label": "Disney", "text": "Disneyland's churros are the best theme park snack!"}
{"label": "Llamas", "text": "Llamas are so photogenic with their quirky expressions."}
{"label": "Llamas", "text": "There's a llama-themed cafe that opened nearby."}
{"label": "Cookies", "text": "Chocolate-dipped cookies make for a fancy dessert."}
{"label": "Cookies", "text": "I bought a cookie jar shaped like a cat!"}
```

We will be covering a few different topics ("Disney", "Llamas", "Cookies").
This means I want lots and lots (as much as you can generate) of entries in jsonl format. The label for each entry should be "Disney", "Llamas", or "Cookies" (go through each one and generate related text). Generate as much data as you can.




------------------------------------
-- Test Message Generation Prompt --
------------------------------------
Generate a bunch of emails (use fake names, etc.). I just need the email body (not the to, from, subject, etc.).

Each email should include text related to one or more of these topics. Keep it as random as possible. The length of each message should range from average to fairly long (based on your knowledge of typical emails).