[docs] Add documentation #570 #613

Merged
merged 23 commits into from Jul 2, 2023
Empty file removed: .db_file.agdb
92 changes: 54 additions & 38 deletions README.md
@@ -2,63 +2,79 @@

[![Crates.io](https://img.shields.io/crates/v/agdb)](https://crates.io/crates/agdb) [![release](https://github.com/agnesoft/agdb/actions/workflows/release.yaml/badge.svg)](https://github.com/agnesoft/agdb/actions/workflows/release.yaml) [![coverage](https://github.com/agnesoft/agdb/actions/workflows/coverage.yaml/badge.svg)](https://github.com/agnesoft/agdb/actions/workflows/coverage.yaml) [![codecov](https://codecov.io/gh/agnesoft/agdb/branch/main/graph/badge.svg?token=Z6YO8C3XGU)](https://codecov.io/gh/agnesoft/agdb)

The Agnesoft Graph Database (aka _agdb_) is a persistent, memory mapped graph database using object 'no-text' queries. It can be used as the main persistent storage, a data analytics platform, as well as a fast in-memory cache. Its typed, schema-less data store allows for flexible and seamless data updates with no downtime or costly migrations. All queries are constructed via a builder pattern (or directly as objects) with no special language or text parsing.

# Key Features

- Data plotted on a graph
- Typed key-value properties of graph elements (nodes & edges)
- Persistent file based storage
- ACID compliant
- Typed schema-less key-value data store
- Object queries with builder pattern (no text, no query language)
- Memory mapped for fast querying
- _No dependencies_

# Quickstart

Add `agdb` as a dependency to your project:

```
cargo add agdb
```

Basic usage demonstrating creating a database, inserting graph elements with data, and querying them back with select and search. The function using this code must handle the `agdb::DbError` and [`agdb::QueryError`](docs/queries.md#queryerror) error types for the `?` operator to work:

```Rust
use agdb::{Comparison, Db, QueryBuilder}; // `Comparison` is used in the search example below

let mut db = Db::new("user_db.agdb")?;

db.exec_mut(&QueryBuilder::insert().nodes().aliases("users").query())?;
let users = db.exec_mut(
    &QueryBuilder::insert()
        .nodes()
        .values(vec![
            vec![("username", "Alice").into(), ("joined", 2023).into()],
            vec![("username", "Bob").into(), ("joined", 2015).into()],
            vec![("username", "John").into()],
        ])
        .query(),
)?;
db.exec_mut(&QueryBuilder::insert().edges().from("users").to(users.ids()).query())?;
```

This code creates a database called `user_db.agdb` with a simple graph of 4 nodes. The first node is aliased `users`; the 3 user nodes for Alice, Bob and John are then connected with edges to the `users` node. Arbitrary `username` and (for two of the users) `joined` properties are attached to the user nodes.

You can select the graph elements (both nodes & edges) with their ids to get them back with their associated data (key-value properties):

```Rust
let user_elements = db.exec(&QueryBuilder::select().ids(users.ids()).query())?;
println!("{:?}", user_elements);
// QueryResult {
// result: 3,
// elements: [
// DbElement { id: DbId(2), values: [DbKeyValue { key: String("username"), value: String("Alice") }, DbKeyValue { key: String("joined"), value: Int(2023) }] },
// DbElement { id: DbId(3), values: [DbKeyValue { key: String("username"), value: String("Bob") }, DbKeyValue { key: String("joined"), value: Int(2015) }] },
// DbElement { id: DbId(4), values: [DbKeyValue { key: String("username"), value: String("John") }] }
// ] }
```
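
If you need to process the returned elements individually you can iterate over them directly (this mirrors the printed output above):

```Rust
for element in user_elements.elements {
    // Each element carries its database id and its key-value properties.
    println!("{:?}: {:?}", element.id, element.values);
}
// DbId(2): [DbKeyValue { key: String("username"), value: String("Alice") }, DbKeyValue { key: String("joined"), value: Int(2023) }]
```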

You can also search through the graph to get back only the elements you want:

```Rust
let user = db.exec(
    &QueryBuilder::select()
        .search(
            QueryBuilder::search()
                .from("users")
                .where_()
                .key("username")
                .value(Comparison::Equal("John".into()))
                .query(),
        )
        .query(),
)?;
println!("{:?}", user);
// QueryResult {
// result: 1,
// elements: [
// DbElement { id: DbId(4), values: [DbKeyValue { key: String("username"), value: String("John") }] }
// ] }
```

For a comprehensive overview of all queries see the [queries](docs/queries.md) reference, or continue with the more in-depth [guide](docs/guide.md).

# Reference

- [Concepts](docs/concepts.md)
- [Queries](docs/queries.md)
- [Guide](docs/guide.md)
- [But why?](docs/but_why.md)
73 changes: 73 additions & 0 deletions docs/but_why.md
@@ -0,0 +1,73 @@
The following items provide explanations for some of the design choices of `agdb`. All of them are based on research and extensive testing of various approaches and options. For example, unlike most graph implementations out there, `agdb` uses pure contiguous vectors instead of linked lists. Curious to learn why? Read on!

- [Why graph?](#why-graph)
- [Why not use an existing graph database?](#why-not-use-an-existing-graph-database)
- [Why object queries?](#why-object-queries)
- [Why single file?](#why-single-file)
- [What about sharding, replication and performance at scale?](#what-about-sharding-replication-and-performance-at-scale)

# Why graph?

The database landscape has been dominated by relational database systems (tables) and text based queries since the 1970s. However, the issues with relational database systems are numerous; they even gave rise to an entire profession - the database engineer. This is because, contrary to their name, they are very awkward at representing actual relations between data, which real world applications always demand. They typically use foreign keys and/or proxy tables to represent them. Additionally, tables naturally enforce a fixed, immutable schema upon the data they store. To change the schema one needs to create a new database with the changed schema and copy the data over (this is called database migration). Such an operation is very costly, and most database systems fare poorly when foreign keys are involved (requiring them to be disabled for the migration to happen). As it turns out, nowadays no database schema is truly immutable. New and changed requirements arrive so often that database schemas usually need updating (migrating) nearly every time the systems using them are updated.

There is no solution to this "schema" issue because it is an inherent feature of representing data in tabular form. It can only be mitigated to some degree, and your mileage will vary greatly when using the mitigation techniques, many of which are considered anti-patterns. Things like indexes, indirection (storing data of varied length), storing blob data, or storing data with an internal format unknown to the database itself (e.g. JSON) are all ways to prevent the need for database migration at the cost of efficiency. While there are good reasons for representing data in tabular form (lookup speed and space efficiency), the costs very often far exceed the benefits. Plus, as it turns out, it is not even that efficient!

Tables are represented as fixed size records (rows) stored one after another (this is what makes the schema immutable). This representation is most efficient when we read entire rows at a time (all columns), which is very rarely the case. Most often we want only some of the columns, which means we discard part (or most) of each row when reading it. This is the same problem the CPU itself has when using memory. It reads memory in cache lines. If we happen to use only part of a line, the rest is wasted and another line needs to be fetched for the next item(s) (this is called a `cache miss`). This is why contiguous collections (like a `vector`) are almost always the most efficient: they minimize cache misses. Chandler Carruth has given numerous talks at CppCon on this subject, demonstrating that cache misses are by far the biggest performance factor in software (over 50 % and up to 80 %), dwarfing everything else.
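
To see the scale of that effect yourself, here is a small self-contained Rust sketch (not part of `agdb`; the names and sizes are arbitrary) that sums the same numbers stored in a contiguous `Vec` and in a pointer-chasing `LinkedList`; on typical hardware the `Vec` pass is several times faster purely thanks to cache line locality:

```Rust
use std::collections::LinkedList;
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;

    // Contiguous storage: neighboring elements share cache lines and the
    // hardware prefetcher can stay ahead of the iteration.
    let vec: Vec<u64> = (0..N).collect();

    // Node based storage: every element is a separate heap allocation,
    // so almost every step is a potential cache miss.
    let list: LinkedList<u64> = (0..N).collect();

    let start = Instant::now();
    let vec_sum: u64 = vec.iter().sum();
    println!("vec:  sum = {vec_sum}, took {:?}", start.elapsed());

    let start = Instant::now();
    let list_sum: u64 = list.iter().sum();
    println!("list: sum = {list_sum}, took {:?}", start.elapsed());
}
```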

Besides trying to optimize tables, the most prominent "solution" is the NoSQL family of databases. They typically store data differently, often in a "schema-less" fashion, to cater to the above use cases - easing database migrations (or eliminating them) and providing more efficient data lookup. They typically choose some combination of key-value, document, or graph representation to scaffold the data instead of tables. They often trade away ACID properties, use append-only (never delete) "big tables", and other techniques.

Of the NoSQL databases, the graph databases stand out in particular because by definition they actually store the relations between the data. How the values are then "attached" to the graph varies, but the graph itself serves both as an "index" and as a "map" that can be efficiently searched and reasoned about. A sparse graph (not all nodes are connected to all others) is then actually the most flexible and accurate way to store and represent sparse data (and as mentioned, nearly all real world data is sparse).

There are two key properties of representing data as a graph that directly relate to the aforementioned issues of schema and data searching. Firstly, the graph itself is the schema, and it can change freely as needed at any time without restrictions, eliminating the schema issue entirely. You do not need to be clairvoyant and agonize over the right database schema. You can do what works now and change your mind later without issues. Secondly, the graph accurately represents any actual relations between the data, allowing the most efficient native traversal and lookup (vaguely resembling traditional indexing), which keeps lookup efficient regardless of the data set size. Where table performance deteriorates as the table grows, the graph stays efficient as long as you traverse only a subset of the nodes via their relations, even if the graph itself contains billions of nodes.
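
As a concrete illustration, here is a minimal sketch continuing the Quickstart example from the README (the `email` value is made up): a new node can bring along a property that no existing element has, and nothing needs to migrate:

```Rust
// Insert a new user node that introduces a brand new "email" property.
// Existing nodes are untouched - there is no schema to update and no
// migration to run.
let eve = db.exec_mut(&QueryBuilder::insert()
    .nodes()
    .values(vec![vec![("username", "Eve").into(), ("email", "eve@example.com").into()]])
    .query())?;

// Connect the new node to the rest of the graph like any other user.
db.exec_mut(&QueryBuilder::insert().edges().from("users").to(eve.ids()).query())?;
```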

That is, in a nutshell, why a graph database is the best choice for most problem domains and data sets out there, and why `agdb` is a graph database.

**Costs**

Everything has a cost and graph databases are no exception. Some operations and some data representations may be costlier in them than in table based databases. For example, if you had a truly immutable schema that never updates, a table based database might be a better fit, as the tabular representation is more storage efficient. Or if you always read whole tables or whole rows, then once again a table based database might be more performant. Typically, though, these are uncommon edge cases unlikely to be found in real world applications. The data is almost always sparse and diverse in nature, the schema is never truly stable, etc. On the other hand, most use cases benefit greatly from a graph based representation, and thus such a database is well worth it despite some (often more theoretical) costs.

# Why not use an existing graph database?

The following is the list of requirements for an ideal graph database:

- Free license
- Faster than table based databases in most common use cases
- No new language for querying
- No text based queries
- Rust and/or C++ driver
- Resource efficient (storage & memory)

Surprisingly, there is no existing database that fits the bill. Even the most popular graph databases such as `Neo4J` or `OrientDB` fall short on several of these requirements. They have their own text based languages (e.g. Cypher for Neo4J). They lack drivers for C++/Rust. They are not particularly resource efficient (being mostly written in Java). Even the recent addition built in Rust - `SurrealDb` - uses text based SQL queries. Quite incomprehensibly, its driver support for Rust itself is not very mature so far and was added only later, despite the system being written in Rust. Something similar is oddly common in the database world: e.g. `RethinkDb`, itself a document database written mostly in C++, has no C++ support but does officially support, for example, Ruby. On top of these issues, they often do not leverage the graph structure very well (except for Neo4J, which does a great job at this), still leaning heavily towards tables.

# Why object queries?

The most ubiquitous database query language is SQL, a text based language created in the 1970s. Its biggest advantage is that, being text based, it can be used from any language to communicate with the database. However, just like the relational (table based) databases from the same era, it has some major flaws:

- It needs to be parsed and interpreted by the database at runtime, leading to common syntax errors that are hard or impossible to check statically.
- Being a separate programming language from the client coding language, it increases the cognitive load on the programmer.
- It opens the database up to SQL injection attacks, where the attacker forces the interpreter to treat user input (e.g. table or column names) as SQL code, issuing malicious commands such as stealing or damaging the data.
- Being a "Turing-complete" and complex language in itself, it can lead (and often does lead) to incredibly complex and unmaintainable queries.

The last point is particularly troublesome because it partially stems from the `schema` issue discussed above. One common way to avoid changing the schema is to transform the data via queries. This is not only less efficient than representing the data in the correct form directly, but it also increases the complexity of queries significantly.

The usual mitigations include heavily sanitizing user inputs in an attempt to prevent SQL injection attacks, wrapping the construction of SQL in a builder pattern to prevent syntax errors, and easing the cognitive load by letting programmers create their queries in their main coding language. The complexity is often reduced by the use of stored SQL procedures (pre-created queries). However, all of these options can only mitigate the issues SQL has.

Using native objects to represent the queries eliminates all of these SQL issues, sacrificing only portability between languages. That loss can relatively easily be made up for via the already very mature (de)serialization of native objects available in most languages. Using a builder pattern to construct these objects further improves their correctness and readability. Native objects carry no additional cognitive load on the programmer and can be used just like any other code.
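
To make the contrast concrete, here is a short sketch; the SQL string is illustrative only (agdb has no SQL), while the object query reuses the builder API from the README Quickstart:

```Rust
use agdb::{Comparison, QueryBuilder};

// A classic injection payload: inside a text query this input becomes code.
let user_input = "' OR '1'='1";

// Text based query: the input is spliced into the program itself and
// silently changes what the query means (illustrative string only).
let sql = format!("SELECT * FROM users WHERE username = '{}'", user_input);
println!("{sql}");

// Object query: the input remains a typed value inside the query object.
// It cannot be reinterpreted as query syntax, and the query shape is
// verified at compile time. The result can be passed to `db.exec(...)`.
let _query = QueryBuilder::search()
    .from("users")
    .where_()
    .key("username")
    .value(Comparison::Equal(user_input.into()))
    .query();
```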

# Why single file?

All operating systems have a fairly low limit on the number of open file descriptors per program and in total, making this system resource one of the scarcest. Furthermore, operating over multiple files does not seem to bring any substantial benefit to the database while complicating its implementation significantly. A graph database typically needs access to the full graph at all times, unlike, say, key-value stores or document databases. Splitting the data into multiple files would therefore actually be detrimental. Lastly, the overall storage taken by multiple files would not change, as the amount of data would be the same.

Conversely, using just a single file (plus a second, temporary write ahead log file) makes everything simpler and easier. You can, for example, easily transfer the data to a different machine - it is just one file. The database can also operate on the file directly (with memory mapping turned off) to save RAM at the cost of performance. And the program does not need to juggle multiple files, consuming valuable system resources.

The one file is the database and the data.
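
Since the database is a single file, routine operations like backups reduce to plain file manipulation; a minimal sketch assuming the `user_db.agdb` file from the Quickstart and that the database is closed while copying:

```Rust
use std::fs;

fn main() -> std::io::Result<()> {
    // The file is the database: copying it is a complete backup,
    // and moving it to another machine moves all the data.
    fs::copy("user_db.agdb", "user_db.backup.agdb")?;
    Ok(())
}
```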

# What about sharding, replication and performance at scale?

Most databases tackle the issue of (poor) performance at scale by scaling out via replication and sharding strategies. While these techniques are definitely useful, and they are planned for `agdb`, they should be avoided as much as possible. The increase in complexity when using replication or sharding is dramatic and is only worth it when there is no other choice.

The `agdb` is designed so that it performs well regardless of its size. Most read operations are O(1) and there is no limit on their concurrency. Most write operations are amortized O(1). The O(n) complexities are limited to individual node traversals, e.g. reading 1000 connected nodes takes 1000 O(1) operations = O(n), the same as reading 1000 rows in a table. However, if you structure your data well (meaning you do not blindly connect everything to everything), you can store as large a data set as your hardware can fit: if you query only a subset of the graph (a subgraph), your query's performance depends on that subgraph, not on all the data stored in the database.

The point here is that you will need to scale out only when your database starts exceeding the limits of a single machine. Data replication/backup will be a relatively easy feature to add. Sharding will be only slightly harder; the database has been written in a way that should make it relatively easy to add as well. The obvious downside is the huge performance dip of such a setup. To alleviate it, local caches could be used, but as mentioned, this only further adds to complexity.

So while the "at scale" features are definitely coming, you should avoid using them as much as possible.