db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly #226

Open
nurturenature opened this issue Jan 9, 2025 · 4 comments

@nurturenature

Hi,

When calling db.disconnect()/db.connect(), several issues have been observed.

This issue is being opened against the Dart client, as that's where the behavior is being observed, even though it likely involves other libraries too.

Using debugging code like:

log.fine('/powersync/$action: request');
...
switch (action) {
  case 'connect':
    await db.connect(connector: connector);
    // capture the db state immediately after the await returns
    response = {
      'db.closed': db.closed,
      'db.connected': db.connected,
      'db.currentStatus': '${db.currentStatus}'
    };
    break;
  case 'disconnect':
    await db.disconnect();
    // same capture immediately after disconnect() completes
    response = {
      'db.closed': db.closed,
      'db.connected': db.connected,
      'db.currentStatus': '${db.currentStatus}'
    };
    break;
}
...
log.fine('/powersync/$action: response: $response');

The expectation is that on return from await:

  • when connecting, both db.connected and db.syncStatus.connected will be true
  • db.syncStatus.uploading/downloading will be false if db.connected and db.syncStatus.connected are false (see the sketch below)
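
A minimal sketch of that check, as made after each awaited call (the helper name is made up; db.connected and db.currentStatus are the Dart client members used above):

// hypothetical helper: assert db state right after await db.connect()/disconnect()
Future<void> expectStateAfterAwait(PowerSyncDatabase db,
    {required bool connected}) async {
  assert(db.connected == connected);
  assert(db.currentStatus.connected == connected);
  if (!connected) {
    // a disconnected db shouldn't claim to be uploading or downloading
    assert(!db.currentStatus.uploading && !db.currentStatus.downloading);
  }
}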

It's also possible for db.disconnect()/connect() to lead to replication pausing for 30s+:

  • even when there's no database activity
  • and all local writes have already been uploaded via the connector's uploadData()

And of most concern, db.disconnect()/connect() can lead to:

  • dropping local writes, possibly related to sqlite-core's powersync-ja/powersync-sqlite-core#54

The first two issues can hopefully be debugged with existing or straightforward tests.
If the last issue isn't just another manifestation of sqlite-core reverting local changes, I can pursue a simpler test case than a full Jepsen test.

Thanks!

Tests disconnect() a random majority of nodes every ~5s for ~5s, and then connect() all nodes.
Some annotated sample log snippets follow; search for '#' to find the annotation comments.

SyncStatus not always tracking/reflecting the actual state:

[2025-01-09 00:06:53.097713] [powersync_endpoint] [FINE] /powersync/connect: request

# suspect? connected is false even after awaiting
[2025-01-09 00:06:53.097766] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.097825] [powersync_endpoint] [FINE] /powersync/connect: response: {db.closed: false, db.connected: false, db.currentStatus: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>}
[2025-01-09 00:06:53.099358] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

# suspect? connected is false, yet uploading is true
[2025-01-09 00:06:53.099477] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.102568] [powersync_endpoint] [FINER] uploadData: (2) patch result: {id: d5dbe3f3-5ca9-4327-958d-7a3daa154482, k: 7, v:  1}
[2025-01-09 00:06:53.111988] [powersync_endpoint] [FINER] uploadData: (3) patch result: {id: f5ce635c-7231-475e-bc28-4ca6517283e3, k: 9, v:  1 8}
[2025-01-09 00:06:53.112972] [powersync_endpoint] [FINER] uploadData: (3) patch result: {id: 4a59b53a-badd-4333-8fa9-b4c8095c7eba, k: 8, v:  1 7}
[2025-01-09 00:06:53.117300] [powersync_endpoint] [FINER] uploadData: (4) patch result: {id: f5ce635c-7231-475e-bc28-4ca6517283e3, k: 9, v:  1 8 10}
[2025-01-09 00:06:53.118872] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.118922] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

# connected finally becomes true, but only after uploading is complete, i.e. after clearing the upload queue that accumulated during the previous disconnect
[2025-01-09 00:06:53.133736] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

replication paused

# at the end of a test run, a majority of nodes look like

# suspect? ensure connected, but note the disconnect bounce
[2025-01-09 00:08:30.584932] [powersync_endpoint] [FINE] /powersync/connect: request
[2025-01-09 00:08:30.584981] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.585034] [powersync_endpoint] [FINE] /powersync/connect: response: {db.closed: false, db.connected: false, db.currentStatus: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>}
[2025-01-09 00:08:30.586498] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.586722] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# suspect? we're uploading even though connected was/is/remains false
[2025-01-09 00:08:30.589472] [powersync_endpoint] [FINER] uploadData: (119) patch result: {id: 1f32b853-b006-4b78-8ad2-16e9c238a5a8, k: 11, v:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 23 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 99 100 101 102 103 104 105 106 107 108 110 111 113 114 115}
[2025-01-09 00:08:30.592236] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.592272] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# let's explicitly wait for upload queue count == 0
[2025-01-09 00:08:30.600952] [powersync_endpoint] [FINE] /powersync/upload-queue-wait: request
[2025-01-09 00:08:30.601770] [powersync_endpoint] [FINE] /powersync/upload-queue-wait: response: {db.upload-queue-wait: queue-empty}
[2025-01-09 00:08:30.602277] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.602372] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.604702] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.604726] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.606861] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# let's explicitly wait for db.syncStatus.downloading to become false
# suspect? we wait for 30 seconds before logging an error and continuing
# suspect? even though all test activity has stopped
# suspect? it will be 30s+ before we get an UpdateNotification<{lww}> and db.syncStatus.downloading becomes false
[2025-01-09 00:08:30.607448] [powersync_endpoint] [FINE] /powersync/downloading-wait: request
[2025-01-09 00:09:00.638623] [powersync_endpoint] [FINE] /powersync/downloading-wait: response: {ERROR: Tried 299 times every 0:00:00.100000, db.currentStatus: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>}

[2025-01-09 00:09:00.638968] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.638987] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639005] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639018] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639263] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639948] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641163] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641186] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641251] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641265] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641764] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.642903] [powersync_endpoint] [FINEST] updates: UpdateNotification<{lww}>
[2025-01-09 00:09:00.643012] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:09:00.642986, hasSynced: true, error: null>

# final reads occur on all nodes and are consistent
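
For context, the upload-queue-wait and downloading-wait endpoints above are essentially poll loops like the following (a sketch of this test harness's behavior, not PowerSync API; it assumes the Dart client's getUploadQueueStats() and currentStatus):

// sketch: poll every 100ms until the upload queue is empty
Future<String> uploadQueueWait(PowerSyncDatabase db) async {
  while ((await db.getUploadQueueStats()).count > 0) {
    await Future.delayed(const Duration(milliseconds: 100));
  }
  return 'queue-empty';
}

// sketch: poll every 100ms until downloading is false, give up after ~30s
Future<String> downloadingWait(PowerSyncDatabase db) async {
  for (var tries = 0; tries < 300; tries++) {
    if (!db.currentStatus.downloading) return 'not-downloading';
    await Future.delayed(const Duration(milliseconds: 100));
  }
  return 'ERROR: still downloading after ~30s: ${db.currentStatus}';
}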

client not reading its own previous write

read your writes anomaly

The transaction graph shows a cycle:

  • p-> is process order

    • the top transaction happened first
      • appended '40' to key 17
      • read key 17 and it shows the append of '40'
    • the bottom transaction happened later in the same process
      • did not observe the append of '40' to key 17
  • rw-> is logical read/write order

    • it shows the bottom transaction would have logically happened first

So the graph contains a cycle, which shows a read-your-own-writes anomaly.
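
Expressed as client calls, the cycle boils down to something like this (a sketch only; the lww table and the space-separated append encoding are the test's own, and writeTransaction/get are assumed from the Dart client):

// first transaction: append '40' to key 17 and read it back in the same txn
await db.writeTransaction((tx) async {
  await tx.execute("UPDATE lww SET v = v || ' 40' WHERE k = 17");
  final row = await tx.get('SELECT v FROM lww WHERE k = 17');
  // row['v'] now ends in '... 40'
});

// later transaction in the same process: read key 17 again
final later = await db.get('SELECT v FROM lww WHERE k = 17');
// suspect: later['v'] no longer contains '40', i.e. the client
// did not read its own previously acknowledged write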

@rkistner
Contributor

rkistner commented Jan 9, 2025

Thanks for the detailed write-up! Some initial comments:

And of most concern, db.disconnect()/connect() can lead to:
dropping local writes, possibly related to sqlite-core's
powersync-ja/powersync-sqlite-core#54

It sounds like exactly that issue. It was fixed a couple of days ago in v1.11.0 - could you please re-test with that version?

client not reading its own previous write

Is this the same as / related to the above issue, or is it a completely separate one?

We'll have to investigate the sync status issues properly - it does seem like there are a couple of issues there.

@nurturenature
Author

nurturenature commented Jan 9, 2025

Is this the same as / related to the above issue, or is it a completely separate one?

Yes, the summary's "And of most concern, db.disconnect()/connect() can lead to dropping local writes, possibly related to sqlite-core's #54" and the example for "client not reading its own previous write" are the same single issue.

It sounds like exactly that issue. It was fixed a couple of days ago in v1.11.0 - could you please re-test with that version?

With 1.10.0

  • issue is easy to reproduce
  • at 2 tps/client for 30s, it can happen multiple times

With 1.11.0

  • harder to reproduce
  • need at least 6 tps/client for 100s to start seeing it

And confirming that even though the client does not read its own write, the write is replicated to PostgreSQL:

# here's the Ok'd write in the client logs
[2025-01-09 00:08:01.753380] [powersync_endpoint] [FINE] /sql-txn: request: {value: [{f: append, k: 15, v: 200}, {f: append, k: 17, v: 40}, {f: r, k: 17, v: null}, {f: append, k: 15, v: 201}]}
[2025-01-09 00:08:01.754705] [powersync_endpoint] [FINEST] updates: UpdateNotification<{lww}>
[2025-01-09 00:08:01.754912] [powersync_endpoint] [FINE] /sql-txn: response: {value: [{f: append, k: 15, v: 200}, {f: append, k: 17, v: 40}, {f: r, k: 17, v: [10 28 29 31 32 34 40]}, {f: append, k: 15, v: 201}], type: ok}

# and we can see the client backend connector uploading it to PostgreSQL 
[2025-01-09 00:08:01.755388] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:01.748321, hasSynced: true, error: null>
[2025-01-09 00:08:01.758089] [powersync_endpoint] [FINER] uploadData: (244) patch result: {id: d51e1538-6db2-47aa-ad2a-ae4fed5aaf8e, k: 17, v:  10 28 29 31 32 34 40}

# and the PostgreSQL logs show the write transaction
2025-01-09 00:08:01.756 UTC [73] LOG:  statement: BEGIN ISOLATION LEVEL REPEATABLE READ;
...
2025-01-09 00:08:01.757 UTC [73] LOG:  execute s/409/p/409: UPDATE lww SET v = ' 10 28 29 31 32 34 40' WHERE id = 'd51e1538-6db2-47aa-ad2a-ae4fed5aaf8e' RETURNING *
...
2025-01-09 00:08:01.758 UTC [73] LOG:  statement: COMMIT;

@rkistner
Contributor

Ok, sounds like it confirms the fix in v1.11.0 does have an impact, but there could still be a further issue.

What exactly does the test do to reproduce the issue? One client or multiple? If I understand correctly, you repeatedly disconnect and reconnect the client while writing data. Are you introducing any other faults (for example on Postgres or the service), or only that?

Is it feasible for me to run that test myself to reproduce? I could also configure my own test setup for that, but I'd need to get all the right conditions (like mentioned above).

Let me know if a chat or call would be better to talk through this.

@nurturenature
Author

What exactly does the test do to reproduce the issue?

  • usually 5 client nodes
  • random transactions against random client nodes
    • may or may not also include PostgreSQL itself
    • mix of 1-4 random reads/writes per transaction
    • against random keys
  • for ~5s, disconnect a random majority of nodes
  • for ~5s, ensure all nodes are connected (see the sketch after this list)
  • configurable tps rate and test duration
  • no environmental faults
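
The disconnect/connect schedule above is roughly the following loop (a sketch; clients and connector are hypothetical stand-ins for the Jepsen harness's node handles):

import 'dart:math';

// sketch: alternate ~5s with a random majority disconnected
// and ~5s with every node connected
Future<void> faultSchedule(List<PowerSyncDatabase> clients,
    PowerSyncBackendConnector connector) async {
  final rng = Random();
  while (true) {
    final majority =
        (clients.toList()..shuffle(rng)).take(clients.length ~/ 2 + 1);
    await Future.wait(majority.map((db) => db.disconnect()));
    await Future.delayed(const Duration(seconds: 5));

    await Future.wait(clients.map((db) => db.connect(connector: connector)));
    await Future.delayed(const Duration(seconds: 5));
  }
}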

Is it feasible for me to run that test myself to reproduce?

The full Jepsen test environment is a heavy lift with a steep learning curve:
it's written in Clojure, with idioms and usage from the distributed-systems research community.

I chatted with @kobiebotha to see if it would be appropriate to write a similar but simpler more specific purpose fuzzer in Dart for this use case, e.g. fuzzing disconnecting/connecting. Maybe it could even evolve into a full PowerSync API fuzzer easy enough to use for CI/CD testing.

Give me a moment to architect and frame out an initial implementation. It will be my first use of Isolates, Stream creation, etc., and I can check back for feedback before a full implementation; a very rough skeleton is sketched after the list below.

I expect that such a fuzzer would expose:

  • common db.prop, db.syncStatus.prop value mismatches
  • over time, occasional slowdown of replication, to the point where it looks like clients are diverging
  • occasionally, clients not reading their preceding writes
  • and maybe something novel, one never knows :)
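
To make that concrete, the rough shape I have in mind (purely a sketch; every name here is hypothetical):

import 'dart:isolate';

// one isolate per simulated client: open its own PowerSyncDatabase, run
// random 1-4 op read/append transactions, report outcomes over a SendPort
Future<void> clientWorker(SendPort results) async {
  // open db, loop: build a random txn, execute it, results.send(outcome)
}

Future<void> main() async {
  final results = ReceivePort();
  for (var i = 0; i < 5; i++) {
    await Isolate.spawn(clientWorker, results.sendPort);
  }
  // main isolate: drive the ~5s disconnect / ~5s connect schedule, and check
  // each client's stream of outcomes for read-your-writes violations, stuck
  // SyncStatus values, and stalled replication
}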
