
Vector Store initial implementation #830

Merged
merged 27 commits into AntonOsika:main on Nov 3, 2023

Conversation

TheoMcCabe
Collaborator

@TheoMcCabe TheoMcCabe commented Oct 30, 2023

This is a first pass at using a vector store to automatically retrieve files from a code repository based on their relevance to the prompt.

  • Uses llama index for the vector store abstraction
  • Implements a code splitter (with most of the code adapted from llama index), which depends on the tree-sitter library
  • Creates a new step set called VECTOR_IMPROVE, which runs this improve flow via the -vi argument
  • Finds the 2 most relevant snippets, then feeds in the entire file for each of those snippets
  • Also provides a list of all code files in the repository to give wider context
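For illustration, the "find the 2 most relevant snippets, then feed in their files" step boils down to something like the sketch below. This is a hypothetical pure-Python stand-in (the names `top_k_files`, `cosine`, and the toy embeddings are all made up for this example); the PR itself delegates embedding and retrieval to llama index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_files(query_vec, snippets, k=2):
    """Rank snippet embeddings against the query and return the
    source files of the k best snippets (deduplicated, order kept),
    mirroring the PR's 'retrieve 2 snippets, load their whole files' rule."""
    ranked = sorted(snippets,
                    key=lambda s: cosine(query_vec, s["embedding"]),
                    reverse=True)
    files = []
    for s in ranked[:k]:
        if s["file"] not in files:
            files.append(s["file"])
    return files

# Toy data: two snippets from snake.py, one from ui.py.
snippets = [
    {"file": "snake.py", "embedding": [0.9, 0.1, 0.0]},
    {"file": "ui.py",    "embedding": [0.1, 0.9, 0.0]},
    {"file": "snake.py", "embedding": [0.8, 0.2, 0.1]},
]
print(top_k_files([1.0, 0.0, 0.0], snippets, k=2))
```

Note that because both top snippets can live in the same file, the number of files actually fed to the LLM can be fewer than k.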

There's a lot more work to do in this area, but I think the scope of this work is enough to merge on its own, pending review etc. It works for me for at least small use cases like snake, so it seems to be a good place to hand over to others to have a play around with?

Some ideas of what to do next:

  • The whole vector store piece needs lots of refining and is largely untested on large repositories. An obvious limitation is that it's currently hard-coded to retrieve 2 snippets and return the files those snippets are in, so a maximum of 2 files will be fed to the LLM. Possibly we would want to load as many files as possible within the token limit?
  • Another improvement we might want to look into is ranking snippets by connectedness; connectedness of functions is currently calculated in Aider.
  • An improved high-level context. Aider uses tree-sitter to summarise each file into methods and classes, analyses the connectedness, and then returns a summarised tree representation of the most important parts of the code.
  • Refine the retrieval algorithms to work better for code, and possibly include additional metadata in the query.
  • Expand the vector store to include non-code files.
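The first idea above ("load as many files as possible within the token limit") could look roughly like this greedy sketch. Everything here is hypothetical (the function name, the files, and the crude 4-characters-per-token heuristic); a real implementation would count tokens with the model's actual tokenizer (e.g. tiktoken).

```python
def fill_token_budget(ranked_files, token_budget,
                      count_tokens=lambda text: len(text) // 4):
    """Take files in relevance order until the token budget runs out.

    ranked_files: list of (path, contents), most relevant first.
    count_tokens: stand-in heuristic (~4 chars per token); swap in a
    real tokenizer for production use.
    """
    chosen, used = [], 0
    for path, contents in ranked_files:
        cost = count_tokens(contents)
        if used + cost > token_budget:
            continue  # this file doesn't fit; a smaller one later might
        chosen.append(path)
        used += cost
    return chosen

# Toy input: a.py costs ~100 tokens, b.py ~1000, c.py ~50.
ranked = [("a.py", "x" * 400), ("b.py", "y" * 4000), ("c.py", "z" * 200)]
print(fill_token_budget(ranked, token_budget=200))
```

This replaces the hard-coded "2 snippets" cap with a budget, so small repositories could be included almost whole while large ones degrade gracefully.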

@sweep-ai
Contributor

sweep-ai bot commented Oct 30, 2023

Apply Sweep Rules to your PR?

  • Apply: Ensure all new functions and classes have very clear, concise and up-to-date docstrings. Take gpt_engineer/ai.py as a good example.

@TheoMcCabe TheoMcCabe marked this pull request as ready for review October 30, 2023 10:33
@TheoMcCabe TheoMcCabe requested a review from UmerHA as a code owner October 30, 2023 10:33
@pbharrin
Contributor

Amazing job. I will take a look

Collaborator

@UmerHA UmerHA left a comment


Hey, great work.

Let me summarize to make sure I understand:

  1. The -vi option searches for code files whose vector embeddings fit the user prompt, and adds those to the LLM prompt
  2. The vector DB is built by (a) parsing each code file into its AST, (b) traversing the AST, and (c) splitting each node into chunks (if a node exceeds the max length, it is split into multiple chunks)
  3. To find code chunks we use llama index
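Step 2 of this summary can be sketched with Python's built-in ast module (the PR actually uses tree-sitter, which works across languages; `chunk_source` and the `max_len` threshold here are illustrative only):

```python
import ast

def chunk_source(source, max_len=200):
    """Split a Python file into chunks along top-level AST node
    boundaries; any node longer than max_len falls back to
    fixed-size slices, as in step (c) of the summary above."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno/end_lineno are 1-based line numbers of the node's span
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        if len(text) <= max_len:
            chunks.append(text)
        else:
            # Node too big: split it into multiple fixed-size chunks.
            chunks.extend(text[i : i + max_len]
                          for i in range(0, len(text), max_len))
    return chunks

src = "def f():\n    return 1\n\ndef g():\n    return 2\n"
print(chunk_source(src))
```

Chunking on AST boundaries keeps functions and classes intact, which makes the resulting embeddings far more meaningful than blind fixed-width splitting.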

Looks good to me!

What I'd change, though, is the directory-to-XML mapping, which seems overly complex to me. I think we can make it a lot simpler; see https://gist.github.com/UmerHA/a0845f17325f07c554a6dada9fc0cab2.

gpt_engineer/data/file_repository.py (2 review threads, outdated, resolved)
.gitignore (1 review thread, resolved)
@TheoMcCabe
Collaborator Author


Sounds about right to me, yeah.

Thanks for the review, @UmerHA!

@TheoMcCabe
Collaborator Author

TheoMcCabe commented Oct 31, 2023

I've updated it now to exclude some contentious code that wasn't required for this work. I've also added more tests.

@TheoMcCabe
Collaborator Author

TheoMcCabe commented Oct 31, 2023


Ah @UmerHA, I just saw the gist link at the end of your review. Thanks a lot for the contribution! I didn't include it because I didn't see it, but I'm happy to use this instead of my list of paths if you think it's an improvement?

Basically, I'm really not sure how we should approach the 'big context' as well as the 'small context', and this is just a start. It possibly adds no value today, but hopefully we can improve it iteratively.

Aider's approach sends a map of method signatures, ordered so that only the most important methods are sent, which sounds super powerful.
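A signature map like the one described could be sketched with the stdlib ast module (`signature_map` is a made-up name; Aider's real repo map uses tree-sitter and ranks symbols by how heavily they are referenced, which this sketch does not attempt):

```python
import ast

def signature_map(source):
    """Collect top-level function signatures from a Python file,
    giving the LLM a compact overview without full file contents."""
    tree = ast.parse(source)
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs

src = "def move(snake, direction):\n    pass\n\ndef draw(board):\n    pass\n"
print(signature_map(src))
```

Even this crude version shows why the idea is attractive: a whole repository's API surface fits in a few hundred tokens, leaving the rest of the budget for the actually-relevant files.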

It seems to me that without method signatures, or the ability to delve deeper into files, providing the wider context of files is somewhat useless. Maybe it should be removed entirely; it's pretty much a placeholder right now.

Anyway, the XML is gone for now, but there's lots of room to improve in the future, so please do contribute.

@pbharrin
Contributor

Very good job with this. It looks good to me. I think it is a good starting point that we can build on.

@TheoMcCabe
Collaborator Author

I'm happy to merge, as it's had a few reviews and it shouldn't have any impact on current functionality, but I'll wait for @ATheorell and @captivus, who have said they want to take a look when they get time.

@ATheorell ATheorell merged commit 92e4f0e into AntonOsika:main Nov 3, 2023