db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly #226

Open
nurturenature opened this issue Jan 9, 2025 · 4 comments

@nurturenature

Hi,

When calling db.disconnect()/db.connect(), several issues have been observed.

This issue is being opened against the Dart client, as that's where the behavior is being observed, even though it likely involves other libraries too.

Using debugging code like:

log.fine('/powersync/$action: request');
...
switch (action) {
  case 'connect':
    await db.connect(connector: connector);
    // capture the db state immediately after the await returns
    response = {
      'db.closed': db.closed,
      'db.connected': db.connected,
      'db.currentStatus': '${db.currentStatus}'
    };
    break;
  case 'disconnect':
    await db.disconnect();
    // same capture immediately after disconnect() completes
    response = {
      'db.closed': db.closed,
      'db.connected': db.connected,
      'db.currentStatus': '${db.currentStatus}'
    };
    break;
}
...
log.fine('/powersync/$action: response: $response');

The expectation is that on return from await:

  • when connecting, both db.connected and db.syncStatus.connected will be true
  • db.syncStatus.uploading/downloading will be false if db.connected and db.syncStatus.connected are false (see the sketch below)
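
A minimal sketch of that check, as made after each awaited call (the helper name is made up; db.connected and db.currentStatus are the Dart client members used above):

// hypothetical helper: assert db state right after await db.connect()/disconnect()
Future<void> expectStateAfterAwait(PowerSyncDatabase db,
    {required bool connected}) async {
  assert(db.connected == connected);
  assert(db.currentStatus.connected == connected);
  if (!connected) {
    // a disconnected db shouldn't claim to be uploading or downloading
    assert(!db.currentStatus.uploading && !db.currentStatus.downloading);
  }
}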

It's also possible for db.disconnect()/connect() to lead to replication pausing for 30s+:

  • even when there's no database activity
  • and all local writes have already been uploaded via the connector's uploadData()

And of most concern, db.disconnect()/connect() can lead to:

  • dropping local writes, possibly related to sqlite-core's powersync-ja/powersync-sqlite-core#54

The first two issues can hopefully be debugged with existing or straightforward tests.
If the last issue isn't just another manifestation of sqlite-core reverting local changes, I can pursue a simpler test case than a full Jepsen test.

Thanks!

Tests disconnect() a random majority of nodes every ~5s for ~5s, and then connect() all nodes.
Some annotated sample log snippets follow; search for '#' to find the annotation comments.

SyncStatus not always tracking/reflecting the actual state:

[2025-01-09 00:06:53.097713] [powersync_endpoint] [FINE] /powersync/connect: request

# suspect? connected is false even after awaiting
[2025-01-09 00:06:53.097766] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.097825] [powersync_endpoint] [FINE] /powersync/connect: response: {db.closed: false, db.connected: false, db.currentStatus: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>}
[2025-01-09 00:06:53.099358] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

# suspect? connected is false, yet uploading is true
[2025-01-09 00:06:53.099477] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.102568] [powersync_endpoint] [FINER] uploadData: (2) patch result: {id: d5dbe3f3-5ca9-4327-958d-7a3daa154482, k: 7, v:  1}
[2025-01-09 00:06:53.111988] [powersync_endpoint] [FINER] uploadData: (3) patch result: {id: f5ce635c-7231-475e-bc28-4ca6517283e3, k: 9, v:  1 8}
[2025-01-09 00:06:53.112972] [powersync_endpoint] [FINER] uploadData: (3) patch result: {id: 4a59b53a-badd-4333-8fa9-b4c8095c7eba, k: 8, v:  1 7}
[2025-01-09 00:06:53.117300] [powersync_endpoint] [FINER] uploadData: (4) patch result: {id: f5ce635c-7231-475e-bc28-4ca6517283e3, k: 9, v:  1 8 10}
[2025-01-09 00:06:53.118872] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>
[2025-01-09 00:06:53.118922] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

# connected finally becomes true, but only after uploading is complete, i.e. after clearing the upload queue that accumulated during the previous disconnect
[2025-01-09 00:06:53.133736] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:06:50.197767, hasSynced: true, error: null>

replication paused

# at the end of a test run, a majority of nodes look like

# suspect? ensure connected, but note the disconnect bounce
[2025-01-09 00:08:30.584932] [powersync_endpoint] [FINE] /powersync/connect: request
[2025-01-09 00:08:30.584981] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.585034] [powersync_endpoint] [FINE] /powersync/connect: response: {db.closed: false, db.connected: false, db.currentStatus: SyncStatus<connected: false connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>}
[2025-01-09 00:08:30.586498] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.586722] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# suspect? we're uploading even though connected was/is/remains false
[2025-01-09 00:08:30.589472] [powersync_endpoint] [FINER] uploadData: (119) patch result: {id: 1f32b853-b006-4b78-8ad2-16e9c238a5a8, k: 11, v:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 23 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 99 100 101 102 103 104 105 106 107 108 110 111 113 114 115}
[2025-01-09 00:08:30.592236] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.592272] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: false connecting: true downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# let's explicitly wait for upload queue count == 0
[2025-01-09 00:08:30.600952] [powersync_endpoint] [FINE] /powersync/upload-queue-wait: request
[2025-01-09 00:08:30.601770] [powersync_endpoint] [FINE] /powersync/upload-queue-wait: response: {db.upload-queue-wait: queue-empty}
[2025-01-09 00:08:30.602277] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.602372] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.604702] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.604726] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:08:30.606861] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>

# let's explicitly wait for db.syncStatus.downloading to become false
# suspect? we wait for 30 seconds before logging an error and continuing
# suspect? even though all test activity has stopped
# suspect? it will be 30s+ before we get an UpdateNotification<{lww}> and db.syncStatus.downloading becomes false
[2025-01-09 00:08:30.607448] [powersync_endpoint] [FINE] /powersync/downloading-wait: request
[2025-01-09 00:09:00.638623] [powersync_endpoint] [FINE] /powersync/downloading-wait: response: {ERROR: Tried 299 times every 0:00:00.100000, db.currentStatus: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>}

[2025-01-09 00:09:00.638968] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.638987] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639005] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639018] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639263] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.639948] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641163] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641186] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641251] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641265] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.641764] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-01-09 00:08:30.100527, hasSynced: true, error: null>
[2025-01-09 00:09:00.642903] [powersync_endpoint] [FINEST] updates: UpdateNotification<{lww}>
[2025-01-09 00:09:00.643012] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: false lastSyncedAt: 2025-01-09 00:09:00.642986, hasSynced: true, error: null>

# final reads occur on all nodes and are consistent
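
For context, the upload-queue-wait and downloading-wait endpoints above are essentially poll loops like the following (a sketch of this test harness's behavior, not PowerSync API; it assumes the Dart client's getUploadQueueStats() and currentStatus):

// sketch: poll every 100ms until the upload queue is empty
Future<String> uploadQueueWait(PowerSyncDatabase db) async {
  while ((await db.getUploadQueueStats()).count > 0) {
    await Future.delayed(const Duration(milliseconds: 100));
  }
  return 'queue-empty';
}

// sketch: poll every 100ms until downloading is false, give up after ~30s
Future<String> downloadingWait(PowerSyncDatabase db) async {
  for (var tries = 0; tries < 300; tries++) {
    if (!db.currentStatus.downloading) return 'not-downloading';
    await Future.delayed(const Duration(milliseconds: 100));
  }
  return 'ERROR: still downloading after ~30s: ${db.currentStatus}';
}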

client not reading its own previous write

read your writes anomaly

The transaction graph shows a cycle:

  • p-> is process order

    • the top transaction happened first
      • appended '40' to key 17
      • read key 17 and it shows the append of '40'
    • the bottom transaction happened later in the same process
      • did not observe the append of '40' to key 17
  • rw-> is logical read/write order

    • it shows the bottom transaction would have logically happened first

So the graph contains a cycle, which shows a read-your-own-writes anomaly.
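
Expressed as client calls, the cycle boils down to something like this (a sketch only; the lww table and the space-separated append encoding are the test's own, and writeTransaction/get are assumed from the Dart client):

// first transaction: append '40' to key 17 and read it back in the same txn
await db.writeTransaction((tx) async {
  await tx.execute("UPDATE lww SET v = v || ' 40' WHERE k = 17");
  final row = await tx.get('SELECT v FROM lww WHERE k = 17');
  // row['v'] now ends in '... 40'
});

// later transaction in the same process: read key 17 again
final later = await db.get('SELECT v FROM lww WHERE k = 17');
// suspect: later['v'] no longer contains '40', i.e. the client
// did not read its own previously acknowledged write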

@rkistner
Contributor

rkistner commented Jan 9, 2025

Thanks for the detailed write-up! Some initial comments:

And of most concern, db.disconnect()/connect() can lead to:
dropping local writes, possibly related to sqlite-core's
powersync-ja/powersync-sqlite-core#54

It sounds like exactly that issue. It was fixed a couple of days ago in v1.11.0 - could you please re-test with that version?

client not reading its own previous write

Is this the same as / related to the above issue, or is it a completely separate one?

We'll have to investigate the sync status issues properly - it does seem like there are a couple of issues there.

@nurturenature
Author

nurturenature commented Jan 9, 2025

Is this the same as / related to the above issue, or is it a completely separate one?

Yes, the summary's "And of most concern, db.disconnect()/connect() can lead to dropping local writes, possibly related to sqlite-core's #54" and the example for "client not reading its own previous write" are the same single issue.

It sounds like exactly that issue. It was fixed a couple of days ago in v1.11.0 - could you please re-test with that version?

With 1.10.0

  • issue is easy to reproduce
  • at 2 tps/client for 30s, it can happen multiple times

With 1.11.0

  • harder to reproduce
  • need at least 6 tps/client for 100s to start seeing it

And confirming that even though the client does not read its own write, the write is replicated to PostgreSQL:

# here's the Ok'd write in the client logs
[2025-01-09 00:08:01.753380] [powersync_endpoint] [FINE] /sql-txn: request: {value: [{f: append, k: 15, v: 200}, {f: append, k: 17, v: 40}, {f: r, k: 17, v: null}, {f: append, k: 15, v: 201}]}
[2025-01-09 00:08:01.754705] [powersync_endpoint] [FINEST] updates: UpdateNotification<{lww}>
[2025-01-09 00:08:01.754912] [powersync_endpoint] [FINE] /sql-txn: response: {value: [{f: append, k: 15, v: 200}, {f: append, k: 17, v: 40}, {f: r, k: 17, v: [10 28 29 31 32 34 40]}, {f: append, k: 15, v: 201}], type: ok}

# and we can see the client backend connector uploading it to PostgreSQL 
[2025-01-09 00:08:01.755388] [powersync_endpoint] [FINEST] statusStream: SyncStatus<connected: true connecting: false downloading: false uploading: true lastSyncedAt: 2025-01-09 00:08:01.748321, hasSynced: true, error: null>
[2025-01-09 00:08:01.758089] [powersync_endpoint] [FINER] uploadData: (244) patch result: {id: d51e1538-6db2-47aa-ad2a-ae4fed5aaf8e, k: 17, v:  10 28 29 31 32 34 40}

# and the PostgreSQL logs show the write transaction
2025-01-09 00:08:01.756 UTC [73] LOG:  statement: BEGIN ISOLATION LEVEL REPEATABLE READ;
...
2025-01-09 00:08:01.757 UTC [73] LOG:  execute s/409/p/409: UPDATE lww SET v = ' 10 28 29 31 32 34 40' WHERE id = 'd51e1538-6db2-47aa-ad2a-ae4fed5aaf8e' RETURNING *
...
2025-01-09 00:08:01.758 UTC [73] LOG:  statement: COMMIT;

@rkistner
Contributor

Ok, sounds like it confirms the fix in v1.11.0 does have an impact, but there could still be a further issue.

What exactly does the test do to reproduce the issue? One client or multiple? If I understand correctly, you repeatedly disconnect and reconnect the client while writing data. Are you introducing any other faults (for example on Postgres or the service), or only that?

Is it feasible for me to run that test myself to reproduce? I could also configure my own test setup for that, but I'd need to get all the right conditions (like mentioned above).

Let me know if a chat or call would be better to talk through this.

@nurturenature
Author

What exactly does the test do to reproduce the issue?

  • usually 5 client nodes
  • random transactions against random client nodes
    • may or may not also include PostgreSQL itself
    • mix of 1-4 random reads/writes per transaction
    • against random keys
  • for ~5s, disconnect a random majority of nodes
  • for ~5s, ensure all nodes are connected (see the sketch after this list)
  • configurable tps rate and test duration
  • no environmental faults
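
The disconnect/connect schedule above is roughly the following loop (a sketch; clients and connector are hypothetical stand-ins for the Jepsen harness's node handles):

import 'dart:math';

// sketch: alternate ~5s with a random majority disconnected
// and ~5s with every node connected
Future<void> faultSchedule(List<PowerSyncDatabase> clients,
    PowerSyncBackendConnector connector) async {
  final rng = Random();
  while (true) {
    final majority =
        (clients.toList()..shuffle(rng)).take(clients.length ~/ 2 + 1);
    await Future.wait(majority.map((db) => db.disconnect()));
    await Future.delayed(const Duration(seconds: 5));

    await Future.wait(clients.map((db) => db.connect(connector: connector)));
    await Future.delayed(const Duration(seconds: 5));
  }
}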

Is it feasible for me to run that test myself to reproduce?

The full Jepsen test environment is a heavy lift with a steep learning curve:
it's written in Clojure, with idioms and usage from the distributed-systems research community.

I chatted with @kobiebotha to see if it would be appropriate to write a similar but simpler more specific purpose fuzzer in Dart for this use case, e.g. fuzzing disconnecting/connecting. Maybe it could even evolve into a full PowerSync API fuzzer easy enough to use for CI/CD testing.

Give me a moment to architect and frame out an initial implementation. It will be my first use of Isolates, Stream creation, etc., and I can check back for feedback before a full implementation; a very rough skeleton is sketched after the list below.

I expect that such a fuzzer would expose:

  • common db.prop, db.syncStatus.prop value mismatches
  • over time, occasional slowdown of replication, to the point where it looks like clients are diverging
  • occasionally, clients not reading their preceding writes
  • and maybe something novel, one never knows :)
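
To make that concrete, the rough shape I have in mind (purely a sketch; every name here is hypothetical):

import 'dart:isolate';

// one isolate per simulated client: open its own PowerSyncDatabase, run
// random 1-4 op read/append transactions, report outcomes over a SendPort
Future<void> clientWorker(SendPort results) async {
  // open db, loop: build a random txn, execute it, results.send(outcome)
}

Future<void> main() async {
  final results = ReceivePort();
  for (var i = 0; i < 5; i++) {
    await Isolate.spawn(clientWorker, results.sendPort);
  }
  // main isolate: drive the ~5s disconnect / ~5s connect schedule, and check
  // each client's stream of outcomes for read-your-writes violations, stuck
  // SyncStatus values, and stalled replication
}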
