feat: add support for dumping the database to stream #4165

Closed
link2xt wants to merge 1 commit into master from link2xt/sql-dump

Conversation

link2xt (Collaborator) commented on Mar 18, 2023

Serialization is ready: `dump <filename>` in the REPL produces a dump in bencode format.
Deserialization is ready too: `read <filename>` in the REPL reads the dump.

The serializer holds a read transaction, so it should be possible to run serialization twice and get the same result, e.g. once for hashing and once for actual streaming, without having to store the dump anywhere.
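
For example, the hashing pass does not need to buffer anything. A minimal sketch, assuming tokio's AsyncWrite and the sha2 crate (neither is confirmed by this PR):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use sha2::{Digest, Sha256};
use tokio::io::AsyncWrite;

/// Write sink that hashes bytes as they are written, so a serialization
/// pass can produce a digest without storing the dump anywhere.
struct HashSink(Sha256);

impl AsyncWrite for HashSink {
    fn poll_write(
        self: Pin<&mut Self>,
        _cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // Feed every written chunk into the hasher and report it as consumed.
        self.get_mut().0.update(buf);
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_flush(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        Poll::Ready(Ok(()))
    }
}
```

Serializing once into `HashSink(Sha256::new())` and then again into the real stream would yield a digest and a dump of the same snapshot, since the read transaction is held across both passes.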

The blobs are currently not dumped into the same stream. Dumping the blobs into the stream would have the advantage that the dump would not contain $BLOBDIR filenames at all, so there would be no assumptions about whether the blobs are stored inside the database or in the filesystem. We can rethink this if we decide to store the blobs in the database.

Note that in bencode, dictionary keys are stored in lexicographical order. This makes it very easy to parse structures stored as dictionaries, because you either know the next key if the key is required, or only need one token of lookahead (a peek() operation on the token stream) if the key is optional.
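
A minimal sketch of the lookahead pattern (the `Decoder` and `Record` here are hypothetical, not this PR's actual API):

```rust
// For a record encoded as `d2:idi5e4:note3:hie`, keys arrive sorted
// ("id" before "note"), so an optional key needs only a single peek().
async fn read_record(d: &mut Decoder) -> Result<Record> {
    d.expect_key("id").await?; // required key: it must come first
    let id = d.read_integer().await?;

    // Optional key: one token of lookahead decides whether it is present.
    let note = if d.peek_key().await?.as_deref() == Some("note") {
        d.expect_key("note").await?;
        Some(d.read_string().await?)
    } else {
        None
    };

    Ok(Record { id, note })
}
```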

link2xt force-pushed the link2xt/sql-dump branch 2 times, most recently from 4d387d8 to f4908bd on March 19, 2023
async fn write_bytes(w: &mut (impl AsyncWrite + Unpin), b: &[u8]) -> Result<()> {
    let bytes_len = format!("{}:", b.len());
A reviewer (Member) commented:

These format calls are pretty expensive; I would definitely try to avoid them.
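
One allocation-free alternative is to format the length into a stack buffer. A sketch assuming the itoa crate (not necessarily a dependency of this PR):

```rust
use anyhow::Result;
use tokio::io::{AsyncWrite, AsyncWriteExt};

async fn write_bytes(w: &mut (impl AsyncWrite + Unpin), b: &[u8]) -> Result<()> {
    // itoa formats the integer on the stack; no String allocation per call.
    let mut len_buf = itoa::Buffer::new();
    w.write_all(len_buf.format(b.len()).as_bytes()).await?;
    w.write_all(b":").await?;
    w.write_all(b).await?;
    Ok(())
}
```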

link2xt (Collaborator, Author) commented on Mar 20, 2023

My 208M database serializes to a 149M bencode dump, so even without compression, replacing the database with a dump is an improvement. The additional ~59M is likely all the indices, which can be recreated from the data.

link2xt force-pushed the link2xt/sql-dump branch 6 times, most recently from c45cb35 to d53118c on March 28, 2023
link2xt force-pushed the link2xt/sql-dump branch 2 times, most recently from 3116182 to a8e1d14 on April 23, 2023
link2xt force-pushed the link2xt/sql-dump branch 3 times, most recently from 9a42192 to 82460d8 on May 8, 2023
link2xt force-pushed the link2xt/sql-dump branch from 82460d8 to 08a67d9 on May 17, 2023
link2xt changed the title from "Add support for dumping the database to stream" to "feat: add support for dumping the database to stream" on May 17, 2023
Comment on lines +348 to +353
pub(crate) async fn get_connection(&self) -> Result<PooledConnection> {
    let lock = self.pool.read().await;
    let pool = lock.as_ref().context("no SQL connection")?;
    pool.get().await
}

Hocuri (Collaborator) commented on Jun 20, 2023

I'm a bit worried that people will use this function for writing without calling write_lock(). For now, I couldn't come up with a better way to make this safer than a doc comment - maybe a doc comment is fine / the best we can do.
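
For example (the wording is only a suggestion):

```rust
/// Returns a connection from the connection pool.
///
/// Note: do not use the returned connection for writing without
/// acquiring `write_lock()` first, otherwise concurrent writers
/// are not serialized.
pub(crate) async fn get_connection(&self) -> Result<PooledConnection> {
    let lock = self.pool.read().await;
    let pool = lock.as_ref().context("no SQL connection")?;
    pool.get().await
}
```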

Comment on lines +1322 to +1332
pub async fn deserialize(&self, r: impl AsyncRead + Unpin) -> Result<()> {
    let mut conn = self.get_connection().await?;

    // Start a write transaction to take a database snapshot.
    let transaction = conn.transaction()?;

    let decoder = Decoder::new(r);
    decoder.deserialize(transaction).await?;

    Ok(())
}
A collaborator commented:

This probably needs to call self.write_lock().
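
Roughly like this (assuming write_lock() returns a guard to keep alive for the duration; I did not check its exact signature):

```rust
pub async fn deserialize(&self, r: impl AsyncRead + Unpin) -> Result<()> {
    // Hold the write lock so no other writer runs while we restore.
    let _lock = self.write_lock().await;
    let mut conn = self.get_connection().await?;

    // Run the restore inside a single write transaction.
    let transaction = conn.transaction()?;

    let decoder = Decoder::new(r);
    decoder.deserialize(transaction).await?;

    Ok(())
}
```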

Comment on lines +1225 to +1228
async fn deserialize(mut self, mut tx: Transaction<'_>) -> Result<()> {
    self.expect_dictionary().await?;

    self.expect_key("_config").await?;
A collaborator commented:

Since you're holding the Transaction over an await point, the future returned by the function will not be Send.

JsonRPC requires all futures to be Send, so this:

diff --git a/deltachat-jsonrpc/src/api/mod.rs b/deltachat-jsonrpc/src/api/mod.rs
index b71718299..88f19e504 100644
--- a/deltachat-jsonrpc/src/api/mod.rs
+++ b/deltachat-jsonrpc/src/api/mod.rs
@@ -146,6 +146,16 @@ async fn inner_get_backup_qr(&self, account_id: u32) -> Result<Qr> {
 
 #[rpc(all_positional, ts_outdir = "typescript/generated")]
 impl CommandApi {
+    async fn t(&self) {
+        let mut w = Vec::new();
+        self.get_context(0)
+            .await
+            .unwrap()
+            .sql
+            .serialize(&mut w)
+            .await;
+    }
+
     /// Test function.
     async fn sleep(&self, delay: f64) {
         tokio::time::sleep(std::time::Duration::from_secs_f64(delay)).await
diff --git a/src/context.rs b/src/context.rs
index 627e63c78..b2bf48129 100644
--- a/src/context.rs
+++ b/src/context.rs
@@ -188,7 +188,7 @@ fn deref(&self) -> &Self::Target {
 pub struct InnerContext {
     /// Blob directory path
     pub(crate) blobdir: PathBuf,
-    pub(crate) sql: Sql,
+    pub sql: Sql,
     pub(crate) smeared_timestamp: SmearedTimestamp,
     /// The global "ongoing" process state.
     ///

will give you a very long and unreadable error.

https://docs.rs/tokio/latest/tokio/task/struct.LocalSet.html#use-with-run_until looks like it might make it possible to run a non-Send future from JsonRPC, but I didn't research further for now. Maybe you know more about this anyway.
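
The problem reproduces standalone; Rc stands in for the !Send Transaction here (illustration only):

```rust
use std::rc::Rc;

async fn holds_non_send() {
    let rc = Rc::new(42); // Rc is !Send, like rusqlite's Transaction
    tokio::task::yield_now().await; // the !Send value lives across an await
    println!("{rc}");
}

#[tokio::main]
async fn main() {
    // Does not compile: the returned future is not `Send` because
    // `Rc<i32>` is held across the `.await`.
    // tokio::spawn(holds_non_send());

    // A LocalSet can run !Send futures on the current thread instead:
    let local = tokio::task::LocalSet::new();
    local.run_until(holds_non_send()).await;
}
```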

Comment on lines +11 to +15
/// If this version is below the actual database version,
/// serialization code is outdated.
/// If this version is above the actual database version,
/// migrations have to be run first to update the database.
const SERIALIZE_DBVERSION: &str = "99";
A collaborator commented:

So we'll have to update serialization and deserialization every time we change the database schema, and (de)serialization only works within the same core version. We should make sure that everyone is OK-ish with this trade-off before merging.
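
If we accept the trade-off, deserialization should at least fail loudly on a mismatch. A hypothetical guard (names assumed, not this PR's actual code):

```rust
use anyhow::{ensure, Result};

const SERIALIZE_DBVERSION: &str = "99";

/// Rejects dumps written by a core with a different schema version.
fn check_dump_version(dump_version: &str) -> Result<()> {
    ensure!(
        dump_version == SERIALIZE_DBVERSION,
        "dump has dbversion {dump_version}, but this core only supports {SERIALIZE_DBVERSION}"
    );
    Ok(())
}
```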

link2xt deleted the branch master on October 25, 2023
link2xt closed this on October 25, 2023
link2xt reopened this on October 25, 2023
link2xt deleted the branch master on December 17, 2023