The following are the various components of this project:
- `modified_llama`: Llama2 modified to allow extraction of the context vectors.
- `generate_context_vectors.py`: Use `modified_llama` to extract the context vectors from articles and store them using the `cv_storage` library (see below). Check the arguments to the main function for the available options, such as input files and output folders.
- `wikipedia_parser`: Read files generated by https://github.com/mlabs-haskell/wikipedia_parser/
- `indexed_binary_db`: A binary database that consists of an index file and a data file. The index file stores the span (start, end) of each entry in the data file, along with some metadata. The index is meant to be small, so that it can be loaded into memory quickly and searched by metadata to find an entry's span in the data file. This span is then used to load the actual entry from the data file.
- `cv_storage`: Efficiently store context vectors, queryable by article and section names. Uses `indexed_binary_db` under the hood.
- `cv_library`: Generate lower-fidelity versions of a context vector for fast searching. Analogous to mipmaps in 3D rendering (https://en.wikipedia.org/wiki/Mipmap).
- `cv_hier_storage`: Use the lower-fidelity versions of the context vectors to quickly compare an input context vector against the stored ones and find the closest match. Uses `indexed_binary_db` under the hood.
- `query_generator`: Generate LLM prompts from an article to help with context vector generation.
- `generate_hier_cv_db.py`: Read an instance of `cv_storage` and use `cv_library` to generate an instance of `cv_hier_storage`.
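
The index-file/data-file split described for `indexed_binary_db` can be illustrated with a minimal sketch. This is not the module's actual API or on-disk format: the function names (`write_db`, `load_index`, `read_entry`), the JSON index, and the end-exclusive byte spans are all assumptions made for illustration.

```python
import json
import os
import tempfile


def write_db(index_path, data_path, entries):
    """Append each payload to the data file and record its span in the index.

    `entries` is an iterable of (metadata_dict, payload_bytes) pairs.
    Hypothetical layout: the index is a JSON list of
    {"start": int, "end": int, "meta": dict}, with `end` exclusive.
    """
    index = []
    with open(data_path, "wb") as data_file:
        for metadata, payload in entries:
            start = data_file.tell()
            data_file.write(payload)
            index.append({"start": start, "end": data_file.tell(), "meta": metadata})
    with open(index_path, "w", encoding="utf-8") as index_file:
        json.dump(index, index_file)


def load_index(index_path):
    """Load the whole (small) index into memory."""
    with open(index_path, "r", encoding="utf-8") as index_file:
        return json.load(index_file)


def read_entry(data_path, index, **wanted):
    """Return the payload of the first entry whose metadata matches `wanted`.

    Only the matching span is read from the data file; returns None if
    no entry matches.
    """
    for record in index:
        if all(record["meta"].get(key) == value for key, value in wanted.items()):
            with open(data_path, "rb") as data_file:
                data_file.seek(record["start"])
                return data_file.read(record["end"] - record["start"])
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        index_path = os.path.join(folder, "example.index")
        data_path = os.path.join(folder, "example.data")
        write_db(index_path, data_path, [
            ({"article": "Alpha", "section": "Intro"}, b"first payload"),
            ({"article": "Beta", "section": "History"}, b"second payload"),
        ])
        index = load_index(index_path)
        print(read_entry(data_path, index, article="Beta", section="History"))
```

Because the lookup touches only the in-memory index, the data file is opened just to read the bytes of the one matching span, however large the file is.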
- Run `just tests` to run the module-level tests.
- Run `just cv_hier_db_test_e2e` to test the `cv_hier_storage` module by generating an instance and running some sanity checks on it.
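
As a rough illustration of the mipmap-like idea behind `cv_library` and `cv_hier_storage`, the sketch below builds progressively shorter versions of a vector and uses the coarsest one to shortlist candidates before comparing at full resolution. Everything here is an assumption for illustration: the pairwise-averaging downsampling, the Euclidean distance, the function names, and the treatment of a context vector as a flat list of floats may all differ from what the project does.

```python
import math


def downsample(vector):
    """Halve the length by averaging adjacent pairs (an odd tail element is kept)."""
    halved = [(vector[i] + vector[i + 1]) / 2 for i in range(0, len(vector) - 1, 2)]
    if len(vector) % 2:
        halved.append(vector[-1])
    return halved


def build_levels(vector, min_length=2):
    """Return [full, half, quarter, ...], stopping once a level has at most `min_length` elements."""
    min_length = max(min_length, 1)  # guard against a loop that can never shrink further
    levels = [list(vector)]
    while len(levels[-1]) > min_length:
        levels.append(downsample(levels[-1]))
    return levels


def closest(query, candidates, keep=2):
    """Coarse-to-fine search over a non-empty dict of name -> vector.

    All vectors are assumed to have the same length as `query`.
    The coarsest level picks the `keep` nearest candidates; only those
    are compared against the query at full resolution.
    """
    query_levels = build_levels(query)
    candidate_levels = {name: build_levels(vec) for name, vec in candidates.items()}
    shortlist = sorted(
        candidate_levels,
        key=lambda name: math.dist(query_levels[-1], candidate_levels[name][-1]),
    )[:keep]
    return min(
        shortlist,
        key=lambda name: math.dist(query_levels[0], candidate_levels[name][0]),
    )


if __name__ == "__main__":
    stored = {
        "a": [0.0, 0.0, 0.0, 0.0],
        "b": [1.0, 1.0, 1.0, 1.0],
        "c": [5.0, 5.0, 5.0, 5.0],
    }
    print(closest([0.9, 1.1, 1.0, 1.0], stored))
```

The coarse pass is an approximation: a candidate that is nearest at full resolution can be dropped from the shortlist if its averaged version looks farther away, so `keep` trades speed against accuracy.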