Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

Closed
S0yKaf opened this issue Sep 27, 2024 · 21 comments
Closed

[bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error #3360

S0yKaf opened this issue Sep 27, 2024 · 21 comments
Labels
bug Something isn't working config Something needs to be made configurable, or there's a config issue
Milestone

Comments

@S0yKaf
Copy link
Contributor

S0yKaf commented Sep 27, 2024

Gotosocial Version: 0.17.0-rc1+git-de72855
env: Docker

relevant config:

  GTS_DB_TYPE: sqlite
  GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL"
  GTS_DB_SQLITE_JOURNAL_MODE: "WAL"
  GTS_DB_ADDRESS: /gotosocial/db/sqlite.db
error executing command: error creating dbservice: sqlite ping: sqlite3: 
invalid _pragma: sqlite3: disk I/O error                                                                                                                                                              

timestamp="27/09/2024 03:03:10.408" func=oauth.newTokenStore.func1 
level=ERROR msg="error while sweeping oauth entries: sqlite3: invalid _pragma: sqlite3: disk I/O error"

Upgrading from 0.16.0 to 0.17.0-rc1 gotosocial is unable to connect to the database giving the errors above.
After some investigation, it seems that the sqlite default is to use Synchronous 2 (FULL) . Gotosocial is able to start properly once the database is manually set to PRAGMA Synchronous = 1;

Using GTS_DB_SQLITE_SYNCHRONOUS: "FULL" does not work and will throw invalid _pragma: sqlite3: disk I/O error

However the database seem to reset itself to synchronous FULL after some time, causing the error to appear again.

I'm not really sure what's causing this, I know the sqlite3 driver was updated for this release. Seems to be related to #3067 ?

Everything was running fine in 0.16.0 without even needing to set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"

@S0yKaf S0yKaf changed the title [bug/v0.17.0-rc1] sqlite3 synchronous reseting to FULL ? [bug/v0.17.0-rc1] invalid _pragma: sqlite3: disk I/O error Sep 27, 2024
@tsmethurst tsmethurst added this to the v0.17.0 milestone Sep 27, 2024
@tsmethurst
Copy link
Contributor

tsmethurst commented Sep 27, 2024

Interesting, thanks for investigating. We switched default SQLite driver from ModernC SQLite to WASM SQLite for this release, so it's possibly a kink related to that, though we haven't seen it ourselves yet. What operating system / architecture is your server running on?

@tsmethurst
Copy link
Contributor

tsmethurst commented Sep 27, 2024

Just for clarity, does it work if you set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"? If so then we probably just need to change our config a bit or update our pragma string.

@tsmethurst tsmethurst added bug Something isn't working config Something needs to be made configurable, or there's a config issue labels Sep 27, 2024
@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 27, 2024

@tsmethurst
I'm on Debian 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux Running through Kubernetes with K3S.
Sqlite database is stored on an NFS mount.

Just for clarity, does it work if you set GTS_DB_SQLITE_SYNCHRONOUS: "NORMAL" and
GTS_DB_SQLITE_JOURNAL_MODE: "WAL"

If is set GTS_DB_SQLITE_SYNCHRONOUS: "FULL" it does not work AT ALL. If I do NORMAL + WAL It'll work after I go manually run PRAGMA Synchronous = 1; on the database.


I did some more digging, I'm seeing some permission issues when sqlite is committing the shm and wal to the database. I think when I run PRAGMA Synchronous = 1; it fixes the perms on the shm and wal file so gotosocial can write to them normally. and It breaks again when sqlite commits the log files again.

Even weirder, It seems that if I keep a manual sqlite connection open in my terminal, everything works normally and no more issues.

I'll probably patch out the WASM driver and see if the issue is still happening, but I'm almost certain the problem is from there.

EDIT:
I started GTS with trace logging, I didn't get any more information about the sqlite error.

@tsmethurst
Copy link
Contributor

Alright thanks for the extra info. You can do a build of GtS with the modernc driver instead of the wasm drive by calling GO_BUILDTAGS=moderncsqlite3 ./scripts/build.sh

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 27, 2024

So I'm really starting to think the WASM driver doesn't play as nicely with NFS as the modernc driver.
https://github.com/ncruces/go-sqlite3/tree/main/vfs#file-locking

I don't fully understand everything written in there but I know that Sqlite on NFS is notoriously NOT recommended specifically because NFS has such a hard time with locking.

Otherwise, file locking is not supported, and you must use nolock=1 (or immutable=1) to open database files. To use the database/sql driver with nolock=1 you must disable connection pooling by calling db.SetMaxOpenConns(1).

I think this PR #2025 enabled pooling. Maybe we need to add an ENV var to turn on nolock ?

If I may ask, what was the reason to switch to the WASM driver in the first place?

@NyaaaWhatsUpDoc
Copy link
Member

NyaaaWhatsUpDoc commented Sep 27, 2024

i think nolock is probably something we'd consider as another SQLite configuration parameter so we can set it in the dsn appropriately, and then limit the connection pool.

WASM was chosen over the modernc library as the WASM toolchain is much more battle tested as a toolchain than modernc's transpile toolchain, the codebase is much nicer to work with, and the modernc implementation still suffers with a concurrency issue on connection interrupt which after many many hours i could not hope to fix without learning their specific C->Go transpiler :p

@NyaaaWhatsUpDoc
Copy link
Member

if it's still going to be a build target some use i'll update my branch with our concurrency bug workaround we were using: gitlab.com/NyaaaWhatsUpDoc/sqlite

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 27, 2024

I can confirm that the issue doesn't happen when using a build with: GO_BUILDTAGS=moderncsqlite3

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 27, 2024

Do you think it would be a possibility to ship a WASM docker image and a modernc image as a temporary solution?
Something like:
superseriousbusiness/gotosocial:latest
superseriousbusiness/gotosocial:latest-modernc

@NyaaaWhatsUpDoc
Copy link
Member

just bumped the modernc version in #3367

@daenney
Copy link
Member

daenney commented Sep 28, 2024

The SQLite FAQ on WAL-mode states:

All processes using a database must be on the same host computer; WAL does not work over a network filesystem

https://www.sqlite.org/wal.html

Irrespective of the driver, using SQLite in WAL-mode, which is what we do for GtS, is generally considered unsafe. It's possibly fine with a single process like we do, but nobody is really clear about that.

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 28, 2024

I feel like using WAL in this case it absolutely fine because of this statement:

This is because WAL requires all processes to share a small amount of memory and processes on separate host machines obviously cannot share memory with each other.

I think the real issue with running sqlite on NFS is due to poor/unreliable locking implementations. I 100% agree that running it on NFS is definitely not ideal, but at the moment it's my only option besides using postgres.

@daenney I'm open to using something other than WAL mode, it was my understanding that it was the best safe/performant journal_mode but reading up on it more, using Synchronous FULL with some other journal_mode is probably very safe too? Do you think using TRUNCATE is a viable alternative?

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 28, 2024

So,

  GTS_DB_SQLITE_SYNCHRONOUS: "FULL"
  GTS_DB_SQLITE_JOURNAL_MODE: "TRUNCATE"
  GTS_DB_MAX_OPEN_CONNS_MULTIPLIER: "4"

Seems to be working with no errors using the WASM driver. You will notice having the open_conns greater than 1 also works. I'll keep monitoring the logs to see if anything pops up.

@daenney
Copy link
Member

daenney commented Sep 28, 2024

As far as the options you've selected goes, FULL with TRUNCATE is fine. I'm not sure if DELETE would be faster with NFS, and there is also PERSIST in case DELETE or TRUNCATE are meaningfully slower on NFS than overriding some bits in the file. It's probably not something to worry about until it becomes measurable.


It's a bit tricky because in theory WAL over network, as long as every process is the same host, is fine. But it's not something that's either recommended or really supported by the authors. SQLite is meant to be run colocated with your application on the same host, so anything over the network can introduce issues it's not designed to cope with. If you're going to introduce the network anyway, a client-server database that's designed to operate in this manner is probably the best bet.

Litestream provides an alternative that doesn't break the operating model SQLite assumes. The DB and app are in the same place but the DB is continuously replicated to object storage instead. You can restore from it on startup.

It's also really easy to break this invariant. For example, if you're doing backups (VACUUM INTO or the .backup command on the CLI) of the DB while GtS is running, those backups need to be initiated from the same machine. If containers are involved, it needs to happen from within the same cgroups/namespaces. Simply running a SQLite CLI command from the wrong place can break things.

@NyaaaWhatsUpDoc
Copy link
Member

Indeed, us officially supporting a usecase of SQLite that even the authors don't officially support seems like either responsibility-wise, or maintenance-wise, this can only backfire for us.

@tsmethurst
Copy link
Contributor

tsmethurst commented Sep 28, 2024

We should add a warning to our docs about using SQLite in such a way, probably, and mention the possibility of configuring it as mentioned here: #3360 (comment). Thanks for the detailed testing and reports @S0yKaf, that's very helpful!

@S0yKaf
Copy link
Contributor Author

S0yKaf commented Sep 28, 2024

@daenney Thank you for the detailed explanation!
I appreciate the time you've all spent on this despite the fact that it was entirely user error.

@tsmethurst
Copy link
Contributor

Sometimes the real bugfix is the friends you make along the way 🤠

@ncruces
Copy link

ncruces commented Sep 28, 2024

I think my driver on NFS is probably be even less safe that the original.

Tbh, I don't know if Linux OFD locks even work over NFS (assuming this is Linux).

If fact, by design, it's much easier for a networked file system to implement the POSIX locking semantics.

@tsmethurst
Copy link
Contributor

I'll close this now since we've figured out what the issue was and updated our docs :)

@ncruces
Copy link

ncruces commented Sep 28, 2024

I opened an issue in my repo to look into this. One thing you might be able to get away with is use sqlite3_flock.

PS: about nolock, a couple of things: (1) it's really dangerous, please don't use it unless you know for sure what you're doing; (2) it just doesn't work at all with WAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working config Something needs to be made configurable, or there's a config issue
Projects
None yet
Development

No branches or pull requests

5 participants