pyDB, a simple project with a simple goal: learn how databases work.
At the core of data engineering lie databases and distributed systems, two immensely complex topics. As a foray into them, I'm iteratively building my own simple database, taking courses, and adding complexity as I go.
The first version will help me learn the basics, with no tutorials, all written in Python. It'll include:
- A command line interface
  - REPL with pretty printing
  - Command history
- A tokenizer with `create table` and `select` commands
- A CSV serializer/deserializer
- Unit and integration tests
- Error handling
- Logging
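
To make the tokenizer and CSV items above a little more concrete, here is a minimal sketch of what those two pieces could look like. It is not pyDB's actual code: the names `TOKEN_RE`, `KEYWORDS`, `tokenize`, `serialize`, and `deserialize` are hypothetical, and the token grammar (words, integers, a handful of punctuation marks) is an assumption chosen to cover simple `create table` and `select` statements.

```python
import csv
import io
import re

# Hypothetical token grammar: whitespace, words, integers, a few punctuation marks.
TOKEN_RE = re.compile(
    r"(?P<space>\s+)|(?P<word>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<punct>[(),*;])"
)

# Assumed keyword set, just enough for `create table` and `select`.
KEYWORDS = {"create", "table", "select", "from"}


def tokenize(sql):
    """Split a statement into (kind, text) pairs, raising on unknown characters."""
    tokens = []
    pos = 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at position {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        pos = match.end()
        if kind == "space":
            continue  # whitespace only separates tokens
        if kind == "word":
            if text.lower() in KEYWORDS:
                kind, text = "keyword", text.lower()  # keywords are case-insensitive
            else:
                kind = "identifier"
        tokens.append((kind, text))
    return tokens


def serialize(columns, rows):
    """Write a header row plus data rows to a CSV string."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def deserialize(text):
    """Read a CSV string back into (columns, rows); every value comes back as a str."""
    reader = csv.reader(io.StringIO(text, newline=""))
    columns = next(reader)
    return columns, list(reader)


if __name__ == "__main__":
    print(tokenize("SELECT * FROM users"))
    print(serialize(["id", "name"], [[1, "Ada"], [2, "Grace, Jr."]]))
```

One thing the sketch makes visible: CSV carries no type information, so integers written out come back as strings. A real table format would need to store the schema somewhere, for example from the parsed `create table` statement.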
The next step will be to take a CMU course on databases. The optimizations are TBD based on what I learn. They may include:
- Storage updates for
  - compression
  - data versioning
  - distributed storage
- Query optimizers
- More advanced commands like
  - Aggregate functions
  - Where clauses
Python is not the optimal language for a database, so pyDB will be rewritten in either Rust or C++. That's it!
Edit: I ended up completing all the assignments for the CMU course mentioned above. They included building a copy-on-write trie, a buffer pool manager, a B+ tree, a query optimizer, and concurrency controls. I did not make the repo public, to respect the wishes of the course instructors.
By "conquer the world" I mean help others. Obviously. I hope to contribute to some popular open-source databases, ideally ones that solve a common problem like distributed compute or in-memory processing.