
Vector Store initial implementation #830

Merged
merged 27 commits into AntonOsika:main on Nov 3, 2023

Conversation

TheoMcCabe
Collaborator

@TheoMcCabe TheoMcCabe commented Oct 30, 2023

This is a first pass at using a vector store to automatically retrieve files from a code repository based on their relevance to the prompt.

  • Uses llama index for the vector store abstraction
  • Implements a code splitter (with most of the code adapted from llama index), which depends on the tree-sitter library
  • Creates a new step set called VECTOR_IMPROVE, which runs this improve flow via the -vi argument
  • Finds the 2 most relevant snippets, then feeds in the entire file for each of those snippets
  • Also provides a list of all code files in the repository to give wider context
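For illustration, the "find the 2 most relevant snippets, then feed in their files" step boils down to something like the sketch below. This is a hypothetical pure-Python stand-in (the names `top_k_files`, `cosine`, and the toy embeddings are all made up for this example); the PR itself delegates embedding and retrieval to llama index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_files(query_vec, snippets, k=2):
    """Rank snippet embeddings against the query and return the
    source files of the k best snippets (deduplicated, order kept),
    mirroring the PR's 'retrieve 2 snippets, load their whole files' rule."""
    ranked = sorted(snippets,
                    key=lambda s: cosine(query_vec, s["embedding"]),
                    reverse=True)
    files = []
    for s in ranked[:k]:
        if s["file"] not in files:
            files.append(s["file"])
    return files

# Toy data: two snippets from snake.py, one from ui.py.
snippets = [
    {"file": "snake.py", "embedding": [0.9, 0.1, 0.0]},
    {"file": "ui.py",    "embedding": [0.1, 0.9, 0.0]},
    {"file": "snake.py", "embedding": [0.8, 0.2, 0.1]},
]
print(top_k_files([1.0, 0.0, 0.0], snippets, k=2))
```

Note that because both top snippets can live in the same file, the number of files actually fed to the LLM can be fewer than k.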

There's a lot more work to do in this area, but I think the scope of this work is enough to merge on its own, pending review etc. It works for me for at least small use cases like snake, so it seems to be a good place to hand over to others to have a play around with?

Some ideas of what to do next:

  • The whole vector store piece needs lots of refining and is largely untested on large repositories. An obvious limitation is that it's currently hard-coded to retrieve 2 snippets and return the files those snippets are in, so a maximum of 2 files will be fed to the LLM. Possibly we would want to load as many files as possible within the token limit?
  • Another improvement we might want to look into is ranking snippets by connectedness; connectedness of functions is currently calculated in Aider.
  • An improved high-level context. Aider uses tree-sitter to summarise each file into methods and classes, analyses the connectedness, and then returns a summarised tree representation of the most important parts of the code.
  • Refine the retrieval algorithms to work better for code, and possibly include additional metadata in the query.
  • Expand the vector store to include non-code files.
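The first idea above ("load as many files as possible within the token limit") could look roughly like this greedy sketch. Everything here is hypothetical (the function name, the files, and the crude 4-characters-per-token heuristic); a real implementation would count tokens with the model's actual tokenizer (e.g. tiktoken).

```python
def fill_token_budget(ranked_files, token_budget,
                      count_tokens=lambda text: len(text) // 4):
    """Take files in relevance order until the token budget runs out.

    ranked_files: list of (path, contents), most relevant first.
    count_tokens: stand-in heuristic (~4 chars per token); swap in a
    real tokenizer for production use.
    """
    chosen, used = [], 0
    for path, contents in ranked_files:
        cost = count_tokens(contents)
        if used + cost > token_budget:
            continue  # this file doesn't fit; a smaller one later might
        chosen.append(path)
        used += cost
    return chosen

# Toy input: a.py costs ~100 tokens, b.py ~1000, c.py ~50.
ranked = [("a.py", "x" * 400), ("b.py", "y" * 4000), ("c.py", "z" * 200)]
print(fill_token_budget(ranked, token_budget=200))
```

This replaces the hard-coded "2 snippets" cap with a budget, so small repositories could be included almost whole while large ones degrade gracefully.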

@sweep-ai
Contributor

sweep-ai bot commented Oct 30, 2023

Apply Sweep Rules to your PR?

  • Apply: Ensure all new functions and classes have very clear, concise and up-to-date docstrings. Take gpt_engineer/ai.py as a good example.

@TheoMcCabe TheoMcCabe marked this pull request as ready for review October 30, 2023 10:33
@TheoMcCabe TheoMcCabe requested a review from UmerHA as a code owner October 30, 2023 10:33
@pbharrin
Contributor

Amazing job. I will take a look

Collaborator

@UmerHA UmerHA left a comment


Hey, great work.

Let me summarize to make sure I understand:

  1. The -vi option searches for code files whose vector embeddings fit the user prompt, and adds those to the LLM prompt
  2. The vector DB is built by (a) parsing each code file into its AST, (b) traversing the AST, and (c) splitting each node into chunks (if a node exceeds the max length, it is split into multiple chunks)
  3. To find code chunks we use llama index
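Step 2 of this summary can be sketched with Python's built-in ast module (the PR actually uses tree-sitter, which works across languages; `chunk_source` and the `max_len` threshold here are illustrative only):

```python
import ast

def chunk_source(source, max_len=200):
    """Split a Python file into chunks along top-level AST node
    boundaries; any node longer than max_len falls back to
    fixed-size slices, as in step (c) of the summary above."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno/end_lineno are 1-based line numbers of the node's span
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if len(text) <= max_len:
            chunks.append(text)
        else:
            # Node too big: split it into multiple fixed-size chunks.
            chunks.extend(text[i : i + max_len]
                          for i in range(0, len(text), max_len))
    return chunks

src = "def f():\n    return 1\n\ndef g():\n    return 2\n"
print(chunk_source(src))
```

Chunking on AST boundaries keeps functions and classes intact, which makes the resulting embeddings far more meaningful than blind fixed-width splitting.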

Looks good to me!

What I'd change, though, is the directory-to-XML mapping, which seems overly complex to me. I think we can make it a lot simpler; see https://gist.github.com/UmerHA/a0845f17325f07c554a6dada9fc0cab2.

gpt_engineer/data/file_repository.py (2 review threads, outdated, resolved)
.gitignore (1 review thread, resolved)
@TheoMcCabe
Collaborator Author


Sounds about right to me, yeah.

Thanks for the review, @UmerHA!

@TheoMcCabe
Collaborator Author

TheoMcCabe commented Oct 31, 2023

I've updated it now to exclude some contentious code that wasn't required for this work. I've also added more tests.

@TheoMcCabe
Collaborator Author

TheoMcCabe commented Oct 31, 2023


Ah @UmerHA, I just saw the gist link at the end of your review. Thanks a lot for the contribution! I didn't include it because I didn't see it, but I'm happy to use this instead of my list of paths if you think it's an improvement?

Basically, I'm really not sure how we should approach the 'big context' as well as the 'small context', and this is just a start. It possibly adds no value today, but hopefully we can improve it iteratively.

Aider's approach sends a map of method signatures, ordered so that only the most important methods are sent, which sounds super powerful.
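A signature map like the one described could be sketched with the stdlib ast module (`signature_map` is a made-up name; Aider's real repo map uses tree-sitter and ranks symbols by how heavily they are referenced, which this sketch does not attempt):

```python
import ast

def signature_map(source):
    """Collect top-level function signatures from a Python file,
    giving the LLM a compact overview without full file contents."""
    tree = ast.parse(source)
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs

src = "def move(snake, direction):\n    pass\n\ndef draw(board):\n    pass\n"
print(signature_map(src))
```

Even this crude version shows why the idea is attractive: a whole repository's API surface fits in a few hundred tokens, leaving the rest of the budget for the actually-relevant files.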

It seems to me that without method signatures, or the ability to delve deeper into files, providing the wider context of files is somewhat useless. Maybe it should be removed entirely; it's pretty much a placeholder right now.

Anyway, the XML is gone for now, but there's lots of room to improve in the future, so please do contribute.

@pbharrin
Contributor

Very good job with this. It looks good to me. I think it is a good starting point that we can build on.

@TheoMcCabe
Collaborator Author

I'm happy to merge, as it's had a few reviews and it shouldn't have any impact on current functionality, but I'll wait for @ATheorell and @captivus, who have said they want to take a look when they get time.

@ATheorell ATheorell merged commit 92e4f0e into AntonOsika:main Nov 3, 2023