Skip to content

frieda-huang/Ponder

Repository files navigation

Ponder

A tool for developers to understand codebases efficiently.

Features

Code Parsing

  • Extracts file structures, classes, functions, and dependencies.

Code Q&A

Ask questions like:

  • What does this function do?
  • Which modules does this class depend on?
  • Where is this function used?

The system provides:

  • Contextual answers from the codebase.
  • Links to relevant file locations or dependency visualizations.
  • Step-by-step explanations of complex interactions.

Examples

Query

How does the retrieve_relevant_pages function work, and how does it leverage ColPali’s indexing strategies within the retrieval architecture?

Expected Answer The retrieve_relevant_pages function takes a query and retrieves the top relevant PDF pages by:

  1. Embedding the query using ColPali.
  2. Performing an approximate nearest neighbor search based on the specified indexing strategy (HNSW or IVFFlat) to identify candidate pages.
  3. Re-ranking the candidates using ColPali’s late interaction scoring, which fine-tunes relevance by comparing embeddings at a more granular level.

Design Patterns

  1. Strategy Pattern ...
  2. Dependency Injection ...

Tech stack

Category Tools/Technologies
Frontend • Next.js
• Typescript
• Shadcn
LLM Orchestration LiteLLM
LLM Model • Claude 3.5 Sonnet
DeepSeek-Coder-V2
Model Training and Fine-tuning LLM Foundry
Performance Optimization flash-attention
Triton
Synthetic Data Generation GPT-o1
Evaluation CodeXGLUE
Code Parsing tree-sitter

Experimentation Plan

  1. Fine-tune DeepSeek-Coder-V2-Instruct
  2. Patch the model architecture (namely DeepSeek-Coder-V2-Instruct) to use the Flash Attention v2 Triton kernel
  3. Use MosaicML with FSDP

Resources

For implementation, we follow:

Notes

LLM-based Localization is a very interesting problem

  • Finding the correct files given user intent
  • I will focus on an unsolved issue: when to perform RAG in agent
  • Use GraphRAG when user intent requires multiple pieces of information, as it can handle multi-hop queries efficiently
  • Use PageRank to prioritize which nodes (i.e., files or functions) to explore
  • A high PageRank score indicates that an existing knowledge graph may suffice to generate accurate responses without external retrieval. This is because entities with high PageRank scores are typically well-connected and central to the structure of the knowledge graph.

Implemenation

Objective: Fine-tune DeepSeek-Coder-V2-Instruct to retrieve the most relevant code snippets for a user query

  • Build a codebase graph to identify relationships between files, functions, classes, and dependencies
  • Generate synthetic user queries related to code functionality
  • Use PageRank scores to rank the most relevant code snippets for each query
  • Extract additional context for each snippet
  • Use PageRank scores during training to help determine code importance and during inference to help decide retrieval strategy (direct vs GraphRAG)
    • Nodes are source files and the edges are the references between methods/classes in the source files

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published