Ponder

Ponder is a tool that helps developers understand codebases efficiently.

Features

Code Parsing

Extracts file structures, classes, functions, and dependencies.
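As a rough illustration of this kind of extraction, the sketch below collects class names, function names, and imported modules from a piece of Python source. It uses Python's standard-library ast module as a stand-in so the example stays self-contained. Ponder itself lists tree-sitter for parsing, so this is not the project's implementation, and summarize_source is a hypothetical name.

```python
import ast


def summarize_source(source: str) -> dict:
    """Collect class names, function names, and imported modules.

    Assumption: this uses the standard-library `ast` module instead of
    tree-sitter, purely to keep the sketch dependency-free.
    """
    tree = ast.parse(source)
    classes, functions, imports = set(), set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            classes.add(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Methods are FunctionDef nodes too, so they are included here.
            functions.add(node.name)
        elif isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module is not None:
            # Relative imports such as "from . import x" have module=None.
            imports.add(node.module)
    return {
        "classes": sorted(classes),
        "functions": sorted(functions),
        "imports": sorted(imports),
    }


SAMPLE = '''
import os
from collections import deque

class Indexer:
    def build(self):
        return deque()

def main():
    return os.getcwd()
'''

summary = summarize_source(SAMPLE)
```

Here summary holds three sorted lists that a later step could turn into graph nodes and dependency edges.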

Code Q&A

Ask questions like:

What does this function do?
Which modules does this class depend on?
Where is this function used?

The system provides:

Contextual answers from the codebase.
Links to relevant file locations or dependency visualizations.
Step-by-step explanations of complex interactions.

Examples

Query

How does the retrieve_relevant_pages function work, and how does it leverage ColPali’s indexing strategies within the retrieval architecture?

Expected Answer

The retrieve_relevant_pages function takes a query and retrieves the most relevant PDF pages by:

Embedding the query using ColPali.
Performing an approximate nearest neighbor search based on the specified indexing strategy (HNSW or IVFFlat) to identify candidate pages.
Re-ranking the candidates using ColPali’s late interaction scoring, which fine-tunes relevance by comparing embeddings at a more granular level.
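The re-ranking step can be sketched with a toy version of late interaction (MaxSim) scoring: for each query token embedding, take the best-matching page patch embedding, then sum those maxima. This is a minimal pure-Python sketch under stated assumptions, not the code of retrieve_relevant_pages. The names maxsim_score and rerank are hypothetical, and the tiny two-dimensional vectors stand in for real ColPali embeddings.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late interaction score: sum over query tokens of the best patch match."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rerank(query_tokens: List[Vector], candidates: Dict[str, List[Vector]]) -> List[str]:
    """Order candidate page ids by descending late interaction score."""
    return sorted(
        candidates,
        key=lambda page_id: maxsim_score(query_tokens, candidates[page_id]),
        reverse=True,
    )


# Hypothetical embeddings: two query tokens and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "page_b": [[0.5, 0.5]],
    "page_a": [[1.0, 0.0], [0.0, 1.0]],
}
ranking = rerank(query, candidates)
```

In this toy data, page_a matches each query token exactly, so it outranks page_b even though page_b is listed first.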

Design Patterns

Strategy Pattern ...
Dependency Injection ...

Tech stack

Category	Tools/Technologies
Frontend	• Next.js • TypeScript • shadcn/ui
LLM Orchestration	LiteLLM
LLM Model	• Claude 3.5 Sonnet • DeepSeek-Coder-V2
Model Training and Fine-tuning	LLM Foundry
Performance Optimization	• flash-attention • Triton
Synthetic Data Generation	OpenAI o1
Evaluation	CodeXGLUE
Code Parsing	tree-sitter

Experimentation Plan

Fine-tune DeepSeek-Coder-V2-Instruct
Patch the DeepSeek-Coder-V2-Instruct model architecture to use the FlashAttention-2 Triton kernel
Train with MosaicML's LLM Foundry using FSDP (Fully Sharded Data Parallel)

Resources

For implementation, we follow:

Notes

LLM-based Localization is a very interesting problem

Finding the correct files given user intent
I will focus on an unsolved issue: when an agent should perform RAG
Use GraphRAG when user intent requires multiple pieces of information, as it can handle multi-hop queries efficiently
Use PageRank to prioritize which nodes (i.e., files or functions) to explore
A high PageRank score indicates that an existing knowledge graph may suffice to generate accurate responses without external retrieval. This is because entities with high PageRank scores are typically well-connected and central to the structure of the knowledge graph.
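To make the PageRank idea concrete, here is a minimal power-iteration sketch over a small file graph. It assumes the graph is a dict mapping each file to the files it references. The function name, the damping factor of 0.85, and the fixed iteration count are illustrative choices, not values taken from the project.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Compute PageRank scores by power iteration.

    `graph` maps each node to the nodes it links to. Nodes without
    outgoing links spread their rank uniformly over all nodes.
    """
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing edges) is redistributed evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source, targets in graph.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical file graph: two files both reference utils.py.
file_graph = {"a.py": ["utils.py"], "b.py": ["utils.py"], "utils.py": []}
scores = pagerank(file_graph)
most_central = max(scores, key=scores.get)
```

In this example utils.py receives the highest score because both other files point to it, which is the sense in which well-connected nodes are explored first.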

Implementation

Objective: Fine-tune DeepSeek-Coder-V2-Instruct to retrieve the most relevant code snippets for a user query

Build a codebase graph to identify relationships between files, functions, classes, and dependencies
Generate synthetic user queries related to code functionality
Use PageRank scores to rank the most relevant code snippets for each query
Extract additional context for each snippet
Use PageRank scores during training to help determine code importance and during inference to help decide retrieval strategy (direct vs GraphRAG)
In the codebase graph, nodes are source files and edges are the references between methods and classes in those files
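The graph construction and the inference-time choice of retrieval strategy could look roughly like the sketch below. It assumes the symbol definitions and references per file are already extracted. The names build_file_graph and choose_strategy are hypothetical, as is the threshold. Mapping a high score to the direct strategy is one reading of the notes above and is an assumption, not a confirmed design.

```python
from typing import Dict, List, Set


def build_file_graph(definitions: Dict[str, Set[str]],
                     references: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Build a file-level graph: an edge A -> B means A references a symbol defined in B.

    definitions: file path -> symbols (functions/classes) defined in that file
    references:  file path -> symbols used in that file
    """
    owner = {symbol: path for path, symbols in definitions.items() for symbol in symbols}
    graph: Dict[str, Set[str]] = {path: set() for path in definitions}
    for path, symbols in references.items():
        for symbol in symbols:
            target = owner.get(symbol)
            # Skip unknown symbols and references within the same file.
            if target is not None and target != path:
                graph.setdefault(path, set()).add(target)
    return {path: sorted(targets) for path, targets in graph.items()}


def choose_strategy(score: float, threshold: float) -> str:
    """Assumption: a high PageRank score means direct answering may suffice."""
    return "direct" if score >= threshold else "graphrag"


# Hypothetical extracted symbols for three files.
definitions = {
    "retrieval.py": {"retrieve_relevant_pages"},
    "index.py": {"build_index"},
    "app.py": {"main"},
}
references = {
    "app.py": {"retrieve_relevant_pages"},
    "retrieval.py": {"build_index", "retrieve_relevant_pages"},
}
graph = build_file_graph(definitions, references)
```

The resulting dict can be passed to any PageRank routine, and the per-file scores can then be compared against a threshold with choose_strategy.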


frieda-huang/Ponder
