[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

S0yKaf · 2024-09-27T03:29:33Z

Gotosocial Version: 0.17.0-rc1+git-de72855
env: Docker

relevant config:

  GTS_DB_TYPE: sqlite
  GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL"
  GTS_DB_SQLITE_JOURNAL_MODE: "WAL"
  GTS_DB_ADDRESS: /gotosocial/db/sqlite.db

error executing command: error creating dbservice: sqlite ping: sqlite3: 
invalid _pragma: sqlite3: disk I/O error                                                                                                                                                              

timestamp="27/09/2024 03:03:10.408" func=oauth.newTokenStore.func1 
level=ERROR msg="error while sweeping oauth entries: sqlite3: invalid _pragma: sqlite3: disk I/O error"

Upgrading from 0.16.0 to 0.17.0-rc1 gotosocial is unable to connect to the database giving the errors above.
After some investigation, it seems that the sqlite default is to use Synchronous 2 (FULL) . Gotosocial is able to start properly once the database is manually set to PRAGMA Synchronous = 1;

Using GTS_DB_SQLITE_SYNCHRONOUS: "FULL" does not work and will throw invalid _pragma: sqlite3: disk I/O error

However the database seem to reset itself to synchronous FULL after some time, causing the error to appear again.

I'm not really sure what's causing this, I know the sqlite3 driver was updated for this release. Seems to be related to #3067 ?

Everything was running fine in 0.16.0 without even needing to set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"

The text was updated successfully, but these errors were encountered:

tsmethurst · 2024-09-27T08:55:47Z

Interesting, thanks for investigating. We switched default SQLite driver from ModernC SQLite to WASM SQLite for this release, so it's possibly a kink related to that, though we haven't seen it ourselves yet. What operating system / architecture is your server running on?

tsmethurst · 2024-09-27T08:59:03Z

Just for clarity, does it work if you set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"? If so then we probably just need to change our config a bit or update our pragma string.

S0yKaf · 2024-09-27T15:13:06Z

@tsmethurst
I'm on Debian 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux Running through Kubernetes with K3S.
Sqlite database is stored on an NFS mount.

Just for clarity, does it work if you set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"

If is set GTS_DB_SQLITE_SYNCHRONOUS: "FULL" it does not work AT ALL. If I do NORMAL + WAL It'll work after I go manually run PRAGMA Synchronous = 1; on the database.

I did some more digging, I'm seeing some permission issues when sqlite is committing the shm and wal to the database. I think when I run PRAGMA Synchronous = 1; it fixes the perms on the shm and wal file so gotosocial can write to them normally. and It breaks again when sqlite commits the log files again.

Even weirder, It seems that if I keep a manual sqlite connection open in my terminal, everything works normally and no more issues.

I'll probably patch out the WASM driver and see if the issue is still happening, but I'm almost certain the problem is from there.

EDIT:
I started GTS with trace logging, I didn't get any more information about the sqlite error.

tsmethurst · 2024-09-27T15:45:54Z

Alright thanks for the extra info. You can do a build of GtS with the modernc driver instead of the wasm drive by calling GO_BUILDTAGS=moderncsqlite3 ./scripts/build.sh

S0yKaf · 2024-09-27T18:31:52Z

So I'm really starting to think the WASM driver doesn't play as nicely with NFS as the modernc driver.
https://github.com/ncruces/go-sqlite3/tree/main/vfs#file-locking

I don't fully understand everything written in there but I know that Sqlite on NFS is notoriously NOT recommended specifically because NFS has such a hard time with locking.

Otherwise, file locking is not supported, and you must use nolock=1 (or immutable=1) to open database files. To use the database/sql driver with nolock=1 you must disable connection pooling by calling db.SetMaxOpenConns(1).

I think this PR #2025 enabled pooling. Maybe we need to add an ENV var to turn on nolock ?

If I may ask, what was the reason to switch to the WASM driver in the first place?

NyaaaWhatsUpDoc · 2024-09-27T21:58:57Z

i think nolock is probably something we'd consider as another SQLite configuration parameter so we can set it in the dsn appropriately, and then limit the connection pool.

WASM was chosen over the modernc library as the WASM toolchain is much more battle tested as a toolchain than modernc's transpile toolchain, the codebase is much nicer to work with, and the modernc implementation still suffers with a concurrency issue on connection interrupt which after many many hours i could not hope to fix without learning their specific C->Go transpiler :p

NyaaaWhatsUpDoc · 2024-09-27T22:06:35Z

if it's still going to be a build target some use i'll update my branch with our concurrency bug workaround we were using: gitlab.com/NyaaaWhatsUpDoc/sqlite

S0yKaf · 2024-09-27T22:28:11Z

I can confirm that the issue doesn't happen when using a build with: GO_BUILDTAGS=moderncsqlite3

S0yKaf · 2024-09-27T22:48:04Z

Do you think it would be a possibility to ship a WASM docker image and a modernc image as a temporary solution?
Something like:
superseriousbusiness/gotosocial:latest
superseriousbusiness/gotosocial:latest-modernc

NyaaaWhatsUpDoc · 2024-09-27T22:56:51Z

just bumped the modernc version in #3367

daenney · 2024-09-28T00:03:41Z

The SQLite FAQ on WAL-mode states:

All processes using a database must be on the same host computer; WAL does not work over a network filesystem

https://www.sqlite.org/wal.html

Irrespective of the driver, using SQLite in WAL-mode, which is what we do for GtS, is generally considered unsafe. It's possibly fine with a single process like we do, but nobody is really clear about that.

S0yKaf · 2024-09-28T00:48:45Z

I feel like using WAL in this case it absolutely fine because of this statement:

This is because WAL requires all processes to share a small amount of memory and processes on separate host machines obviously cannot share memory with each other.

I think the real issue with running sqlite on NFS is due to poor/unreliable locking implementations. I 100% agree that running it on NFS is definitely not ideal, but at the moment it's my only option besides using postgres.

@daenney I'm open to using something other than WAL mode, it was my understanding that it was the best safe/performant journal_mode but reading up on it more, using Synchronous FULL with some other journal_mode is probably very safe too? Do you think using TRUNCATE is a viable alternative?

S0yKaf · 2024-09-28T01:06:22Z

So,

  GTS_DB_SQLITE_SYNCHRONOUS: "FULL"
  GTS_DB_SQLITE_JOURNAL_MODE: "TRUNCATE"
  GTS_DB_MAX_OPEN_CONNS_MULTIPLIER: "4"

Seems to be working with no errors using the WASM driver. You will notice having the open_conns greater than 1 also works. I'll keep monitoring the logs to see if anything pops up.

daenney · 2024-09-28T08:34:55Z

As far as the options you've selected goes, FULL with TRUNCATE is fine. I'm not sure if DELETE would be faster with NFS, and there is also PERSIST in case DELETE or TRUNCATE are meaningfully slower on NFS than overriding some bits in the file. It's probably not something to worry about until it becomes measurable.

It's a bit tricky because in theory WAL over network, as long as every process is the same host, is fine. But it's not something that's either recommended or really supported by the authors. SQLite is meant to be run colocated with your application on the same host, so anything over the network can introduce issues it's not designed to cope with. If you're going to introduce the network anyway, a client-server database that's designed to operate in this manner is probably the best bet.

Litestream provides an alternative that doesn't break the operating model SQLite assumes. The DB and app are in the same place but the DB is continuously replicated to object storage instead. You can restore from it on startup.

It's also really easy to break this invariant. For example, if you're doing backups (VACUUM INTO or the .backup command on the CLI) of the DB while GtS is running, those backups need to be initiated from the same machine. If containers are involved, it needs to happen from within the same cgroups/namespaces. Simply running a SQLite CLI command from the wrong place can break things.

NyaaaWhatsUpDoc · 2024-09-28T11:38:57Z

Indeed, us officially supporting a usecase of SQLite that even the authors don't officially support seems like either responsibility-wise, or maintenance-wise, this can only backfire for us.

tsmethurst · 2024-09-28T11:46:57Z

We should add a warning to our docs about using SQLite in such a way, probably, and mention the possibility of configuring it as mentioned here: #3360 (comment). Thanks for the detailed testing and reports @S0yKaf, that's very helpful!

S0yKaf · 2024-09-28T14:57:55Z

@daenney Thank you for the detailed explanation!
I appreciate the time you've all spent on this despite the fact that it was entirely user error.

tsmethurst · 2024-09-28T15:20:19Z

Sometimes the real bugfix is the friends you make along the way 🤠

ncruces · 2024-09-28T17:57:24Z

I think my driver on NFS is probably be even less safe that the original.

Tbh, I don't know if Linux OFD locks even work over NFS (assuming this is Linux).

If fact, by design, it's much easier for a networked file system to implement the POSIX locking semantics.

tsmethurst · 2024-09-28T20:46:10Z

I'll close this now since we've figured out what the issue was and updated our docs :)

ncruces · 2024-09-28T23:05:12Z

I opened an issue in my repo to look into this. One thing you might be able to get away with is use sqlite3_flock.

PS: about nolock, a couple of things: (1) it's really dangerous, please don't use it unless you know for sure what you're doing; (2) it just doesn't work at all with WAL.

S0yKaf changed the title ~~[bug/v0.17.0-rc1] sqlite3 synchronous reseting to FULL ?~~ [bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error Sep 27, 2024

tsmethurst added this to the v0.17.0 milestone Sep 27, 2024

tsmethurst added bug Something isn't working config Something needs to be made configurable, or there's a config issue labels Sep 27, 2024

daenney mentioned this issue Sep 28, 2024

[docs] Don't run SQLite on networked storage #3369

Merged

10 tasks

tsmethurst closed this as completed Sep 28, 2024

ncruces mentioned this issue Sep 28, 2024

Investigate usage over NFS ncruces/go-sqlite3#159

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

S0yKaf commented Sep 27, 2024 •

edited

Loading

tsmethurst commented Sep 27, 2024 •

edited

Loading

tsmethurst commented Sep 27, 2024 •

edited

Loading

S0yKaf commented Sep 27, 2024 •

edited

Loading

tsmethurst commented Sep 27, 2024

S0yKaf commented Sep 27, 2024 •

edited

Loading

NyaaaWhatsUpDoc commented Sep 27, 2024 •

edited

Loading

NyaaaWhatsUpDoc commented Sep 27, 2024

S0yKaf commented Sep 27, 2024

S0yKaf commented Sep 27, 2024

NyaaaWhatsUpDoc commented Sep 27, 2024

daenney commented Sep 28, 2024 •

edited

Loading

S0yKaf commented Sep 28, 2024 •

edited

Loading

S0yKaf commented Sep 28, 2024

daenney commented Sep 28, 2024

NyaaaWhatsUpDoc commented Sep 28, 2024

tsmethurst commented Sep 28, 2024 •

edited

Loading

S0yKaf commented Sep 28, 2024

tsmethurst commented Sep 28, 2024

ncruces commented Sep 28, 2024 •

edited

Loading

tsmethurst commented Sep 28, 2024

ncruces commented Sep 28, 2024

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

Comments

S0yKaf commented Sep 27, 2024 • edited Loading

tsmethurst commented Sep 27, 2024 • edited Loading

tsmethurst commented Sep 27, 2024 • edited Loading

S0yKaf commented Sep 27, 2024 • edited Loading

tsmethurst commented Sep 27, 2024

S0yKaf commented Sep 27, 2024 • edited Loading

NyaaaWhatsUpDoc commented Sep 27, 2024 • edited Loading

NyaaaWhatsUpDoc commented Sep 27, 2024

S0yKaf commented Sep 27, 2024

S0yKaf commented Sep 27, 2024

NyaaaWhatsUpDoc commented Sep 27, 2024

daenney commented Sep 28, 2024 • edited Loading

S0yKaf commented Sep 28, 2024 • edited Loading

S0yKaf commented Sep 28, 2024

daenney commented Sep 28, 2024

NyaaaWhatsUpDoc commented Sep 28, 2024

tsmethurst commented Sep 28, 2024 • edited Loading

S0yKaf commented Sep 28, 2024

tsmethurst commented Sep 28, 2024

ncruces commented Sep 28, 2024 • edited Loading

tsmethurst commented Sep 28, 2024

ncruces commented Sep 28, 2024

S0yKaf commented Sep 27, 2024 •

edited

Loading

tsmethurst commented Sep 27, 2024 •

edited

Loading

tsmethurst commented Sep 27, 2024 •

edited

Loading

S0yKaf commented Sep 27, 2024 •

edited

Loading

S0yKaf commented Sep 27, 2024 •

edited

Loading

NyaaaWhatsUpDoc commented Sep 27, 2024 •

edited

Loading

daenney commented Sep 28, 2024 •

edited

Loading

S0yKaf commented Sep 28, 2024 •

edited

Loading

tsmethurst commented Sep 28, 2024 •

edited

Loading

ncruces commented Sep 28, 2024 •

edited

Loading