Persistent cache to use SQLite #3055
See the discussion in: #2653. The pickle cache is just the first iteration of a "cache type". There are other formats, loading methods, and optimizations that are likely improvements.
I'm not sure if a SQLite format is the exact approach we'll take, so is it alright if we close this out? Or we can rename this to a more general "expand cache types" issue. It's a good idea though.
I think this is orthogonal to cache types, at least if we are talking about persistent cache formats like pickle, npy, etc. In this issue, I suggest that instead of having a file per cell, we have a single SQLite db with a single table, e.g. of approximately the following schema:

```sql
create table cell_results (
    file_path text,
    cell_hash text,
    format text,
    "values" blob  -- quoted: VALUES is a reserved word in SQLite
);
```

Thus, this database can handle all formats mixed with each other: each row simply carries a different format column value. Also, if the user changes the setting to a new format, marimo can still load the old format (as it knows it from the table) if needed, before saving in the new format next time. For in-memory, remote, and mixed cache configurations, this is of no concern to this db -- it's simply oblivious of those. It only knows about the values that are going to end up saved locally on disk.
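A minimal sketch of how such a single-table cache could serve mixed formats, assuming a hypothetical `SERIALIZERS` registry mapping format names to dump/load callables (the primary key is my addition, not part of the proposal above):

```python
import pickle
import sqlite3

# Hypothetical registry: format name -> (dumps, loads).
# npy, arrow, etc. could be registered the same way.
SERIALIZERS = {"pickle": (pickle.dumps, pickle.loads)}

conn = sqlite3.connect("persistent_cache.sqlite")
conn.execute(
    'create table if not exists cell_results ('
    ' file_path text, cell_hash text, format text, "values" blob,'
    " primary key (file_path, cell_hash))"
)


def save(file_path: str, cell_hash: str, obj: object, fmt: str = "pickle") -> None:
    dumps, _ = SERIALIZERS[fmt]
    conn.execute(
        "insert or replace into cell_results values (?, ?, ?, ?)",
        (file_path, cell_hash, fmt, dumps(obj)),
    )
    conn.commit()


def load(file_path: str, cell_hash: str) -> object | None:
    row = conn.execute(
        'select format, "values" from cell_results'
        " where file_path = ? and cell_hash = ?",
        (file_path, cell_hash),
    ).fetchone()
    if row is None:
        return None
    fmt, blob = row
    _, loads = SERIALIZERS[fmt]  # dispatch on the stored format, old or new
    return loads(blob)
```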
So, also on reflection, I think DuckDB is superfluous -- it should be sufficient to store Apache Arrow blobs in SQLite.
I think this would still be a type of cache type. Files are easy for invalidation, and can be used across the wire, so we would still want to keep them, esp. for backwards compat. I'd also be wary of prematurely adding optimization here when there are a few more outstanding bottlenecks that I think will have more impact. For instance, in my experience, rehashing large input variables takes far more time than any IO operations.
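For what it's worth, a crude way to sanity-check the "rehashing dominates IO" claim on a given machine; the size and hash function here are arbitrary stand-ins, not a rigorous benchmark:

```python
import hashlib
import os
import tempfile
import time

data = os.urandom(256 * 1024 * 1024)  # stand-in for a large input variable

t0 = time.perf_counter()
hashlib.sha256(data).hexdigest()
t_hash = time.perf_counter() - t0

t0 = time.perf_counter()
with tempfile.NamedTemporaryFile() as f:
    f.write(data)
    f.flush()
t_io = time.perf_counter() - t0

print(f"hash: {t_hash:.2f}s  write: {t_io:.2f}s")
```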
Can you elaborate on what you mean by "files are easy for invalidation"? How I see it:
Well, here you are actually inviting subtle problems and transient bugs with silent corruption of cache contents. This is exactly the reason why ad hoc file storage is generally not the best idea. Even the current …

And for remote, instead of network FS + file IO, it would be better to use rqlite. See the rqlite FAQ:
Keeping …
I don't think there's a disagreement here; these are all good points. But like you said, I don't think incorporating this removes PickleLoader.
@dmadisetti unless you want to own this because it interferes/intertwines with some other development you are already doing, I can draft a PR on the weekend. You can give extra requirements/ideas for how you think it should be done.
still in progress
Re spec: I think it's as simple as inheriting from Loader, and overloading write, load, etc. Feel free to push up what you have.
I don't think you should make this the default yet - there are a few hash-breaking changes yet to come. I think rolling this out with those changes would be ideal. There's an "experimental" mechanism that this should live under before becoming the default. Also, I might do a bit of benchmarking. While I think this is a nice solution, I'm not sold that the gains from this will be noticeable, and it may come at a marked memory penalty.
The current progress is here: https://gist.github.com/leventov/00d94e82028fd3c6884950e454ce1a4c. Unfortunately, this export format doesn't have anchor links, so you should search the page for e.g. "create table", or scroll through it.

Thread-local SQLite connection management

This gist currently uses …

cell_id (and not-yet-existent app_id) considerations

In the discussions throughout the gist above and the table schema, you will find … This would also require changing the Loader API (pass in …).

Yet another optimisation that stable cell_ids enable, for the cell-coinciding-with-cached-module use case, is that storing the unchanged cell results again (even though inputs, and hence the module hash, have changed) is unnecessary.

Also, a notion of app_id would be useful, for example for clearing all cache for an app (which would result in …).

In general, documenting the semantics of cell_id as some kind of "developer docs" would be helpful. And app_id, if that is added. Stability across marimo app restarts implies that different marimo app instances also share the cell_id. In turn, this implies an internal database for them, such as another SQLite. Stability across code movements and changes raises further questions: What if the notebook file is renamed? What if a cell is renamed? What if cell code is moved to another file, but explicitly connected to the …?

Migration path

I currently think of the following: …
Module hash pending changes

I actually think this breaking change is a good case for a table-driven persistent cache, not a reason to delay it. We can store …

Cleaning up non-cell-coinciding module hashes (but still within a cell)

Possible design: …
This leads to some write amplification for such cells. But note that for the vast majority of cells (those where the cached module coincides with the cell itself), this … But for those (rare) cells that are not persisted themselves but have persisted modules within, this write amplification seems like a fine tradeoff (which earns an automatically-clean cache). The alternative is to let the user periodically clean the cache for a cell or app altogether, but this is a mental burden and I think should be avoided. Note that this still hinges on the stability of cell_ids.
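A hypothetical sketch of how such cleanup could look, assuming a cache_entries table keyed by (cell_id, module_hash); the table and column names are illustrative. When a cell finishes running, any module hashes of that cell that were not touched in the latest run are outdated and can be dropped:

```python
import sqlite3


def cleanup_cell(conn: sqlite3.Connection, cell_id: str, live_hashes: set[str]) -> None:
    """Drop cached modules of this cell whose hashes were absent from the latest run."""
    if not live_hashes:
        conn.execute("delete from cache_entries where cell_id = ?", (cell_id,))
    else:
        placeholders = ",".join("?" for _ in live_hashes)
        conn.execute(
            f"delete from cache_entries where cell_id = ? "
            f"and module_hash not in ({placeholders})",
            (cell_id, *live_hashes),
        )
    conn.commit()
```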
Performance and footprint

As evident from the above, a table-driven cache enables many paths to clean up the cache automatically and semi-automatically, at the same time not requiring the user to purge ALL cache (including for still-used cells), which is bad developer experience IMO. Just from this, if these auto-cleanup routines are enabled, the table-driven cache should in practice result in a much smaller …

The extra benefits are saner migration paths (enabled by things like …).

But even if the above note about automatic clean-ups were not true and we would actually never delete anything from the cache_entries table (as currently files in cache dirs are never deleted), why do you think the table "may come at a marked memory penalty"? My intuition is the opposite: SQLite's footprint/overhead per row is smaller than the operating system's overhead per file (separate file descriptor, probably mandatory alignment to 4KB page size per file internally, etc.).

If your "may come at a marked memory penalty" refers to the double-storage of content_blobs in the WAL and the "main database", then SQLite automatically recycles the WAL approximately every 4MB of WAL. This should not be a concern. Indexes are in memory (see gist), however they are expected to be dwarfed by the content_blobs sizes anyway. When they are not, it means that the entire cached content is itself minuscule in size, and hence the cache size is not an issue.

Moreover, the entire question of whether file descriptors or rows in SQLite have the larger footprint is moot (and hence the few paragraphs above are just of technical curiosity, and to make sure I understand you). This is because we overall don't expect to have millions of active modules-to-be-cached. The real enemy will always be piling up outdated module_hashes and versions. That, again, the table-driven cache puts under control; a directory + file-per-module cache does not.
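For reference, a sketch of how the WAL behaviour described above can be pinned down explicitly with standard SQLite pragmas (the db path is illustrative; the 4 MB limit mirrors the recycling behaviour mentioned):

```python
import sqlite3

conn = sqlite3.connect("persistent_cache.sqlite")
conn.execute("pragma journal_mode=wal")             # concurrent readers, single writer
conn.execute("pragma journal_size_limit=4194304")   # truncate WAL to <= 4 MB at checkpoints
conn.execute("pragma wal_autocheckpoint=1000")      # checkpoint every ~1000 pages (the default)
```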
Ok, I think these …
Using the internal state manager will do default cleanup, and be tied to the calling cell. But I think that might have been a bit of a brain fart for me, since the connection should be on a per-notebook, and not per-cell, basis (but it would be the correct solution if it were cell-based).
I think this makes it a deal breaker for the default implementation - and with a file reference it seems like many of the benefits of a persistent connection are lost at this point? marimo "just working" is a good design philosophy and shouldn't need configuration.

Memory might be less of an issue than I thought if the entries are so limited - my use case for persistent cache has been 1 GB+ blobs, which would make every call with the sqlite cache just unneeded overhead/complexity.
Module support is already baked in with … In the execution "default" branch (probably stale by now), module objects are replaced with a stub, with additional metadata. Metadata is currently baked into the pickled object, so the sqlite doesn't really add much there either, but it will require schema management as opposed to just being pinned to a marimo version.
I agree with this, but I think this is a UI, not implementation, issue. I think a user prompt for "clean cache" in the file hamburger might be the best option here (maybe on a per-cell basis if "cached execution"). In which case it doesn't matter if sqlite (filter on …).

I don't think I mentioned this before, but one of the strongest points in advocating for file-based cache artifacts is that they can be distributed by any static file server.

Reading over your implementation, I'm back to not being totally convinced by the added complexity, but happy to eat my words with the benchmarking. That said, here's how I see the implementation:

```python
from abc import abstractmethod
from pathlib import Path
from typing import Literal

# Cache, CacheType, Loader are marimo's internal cache types


class DbLoader(Loader):
    """Inherit if it can be placed in sqlite"""

    @abstractmethod
    def load_blob(self, blob: bytes) -> Cache:
        """Convert blob to a cache object

        Args:
            blob: The raw bytes to convert
        Returns:
            cache: An object representing the cache data
        """

    @abstractmethod
    def serialize_cache(self, cache: Cache) -> bytes:
        """Convert cache object to a blob

        Args:
            cache: An object representing the cache data
        Returns:
            blob: The raw bytes to save
        """


# TODO: Expand from PickleLoader to a DbLoader
DB_LOADERS: dict[str, type[DbLoader]] = {...}


def get_connection():
    ...


class SqliteLoader(Loader):
    """General loader for serializable objects."""

    def __init__(self, name: str, save_path: str) -> None:
        super().__init__(name)
        self.name = name
        self.save_path = Path(save_path)
        self.save_path.mkdir(parents=True, exist_ok=True)
        self.conn = get_connection()  # TODO

    def build_path(self, hashed_context: str, cache_type: CacheType) -> Path:
        return self.save_path / "cache.db"

    def cache_hit(self, hashed_context: str, cache_type: CacheType) -> bool:
        # TODO: SELECT EXISTS(SELECT 1 FROM cache WHERE hash = ?)
        return self.conn.query(...)

    def load_cache(self, hashed_context: str, cache_type: CacheType) -> Cache:
        # TODO: SELECT blob, cache_type, loader FROM cache WHERE hash = ?
        result = self.conn.query(...)
        assert result.loader in DB_LOADERS, "No good"
        loader = DB_LOADERS[result.loader](self.name, self.save_path)
        if not result.blob:
            # Blob lives outside the db; fall back to the loader's own file IO.
            return loader.load_cache(hashed_context, cache_type)
        return loader.load_blob(result.blob)

    def save_cache(self, cache: Cache, loader_type: Literal["pickle"] = "pickle") -> None:
        # pickle just for now
        assert loader_type in DB_LOADERS, "No good"
        loader = DB_LOADERS[loader_type](self.name, self.save_path)
        # Any added metadata starts to bifurcate this from the more
        # straightforward case, but last_access allows for LRU vs date heuristics
        # cache_type       | hash | loader      | blob                          | last_access
        # -----------------------------------------------------------------------------------
        # cache.cache_type | hash | loader_type | loader.serialize_cache(cache) | ...
        self.conn.insert(...)  # TODO
```
The connection should be one per thread - and only because a SQLite connection is not a thread-safe construct. If the marimo runtime's concurrency architecture is thread-per-notebook (== App, right?), those would be the same thing, but I was not sure about this because I didn't dig into marimo's threading/concurrency architecture.

"But the correct solution if it was cell based" -- why is this? I don't understand.
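A minimal thread-local take on the `get_connection()` placeholder from the sketch above (the db path is illustrative):

```python
import sqlite3
import threading

_local = threading.local()


def get_connection(db_path: str = "cache.db") -> sqlite3.Connection:
    # sqlite3 connections must not be shared across threads by default,
    # so keep exactly one per thread.
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn
    return conn
```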
I somewhat struggle to parse your reply here. I will assume that by "persistent connection" you meant "using SQLite for cache content blob management" [rather than the native file system, as is done now]. In summary, the benefits of using SQLite for cache content blob management: …
To me, all these points scream about the need to do this management in a dedicated small table in SQLite. That "{prefix}{module_hash}{cell_id}_{app_id}.pickle" (plus a content checksum in the leading/trailing file bytes) is basically turning the file system's directory object into such a table, but, obviously, a very poor man's one. I don't see how it's simpler or more elegant -- to me, it seems like very much the opposite.
The fact that really large content blobs (larger than 100mb) are still stored outside of SQLite doesn't interact with any of the benefits above. Storing large blobs in files adds the overhead of an extra pair of file IO calls (writes on the write path, reads on the read path) and of reading/writing the file descriptor, compared to the situation when such a blob would have been stored right in the table, but these overheads are going to be completely dwarfed by the time it takes to actually pickle/unpickle and read/write these large files.

In fact, storing smaller blobs inside the table is just an optimisation for smaller blobs. The additional benefit of this optimisation is that it would make the cache directory more "observable" when the user decides to eyeball it: here are my large files (which I may decide to delete selectively to unclutter my disk), and "cache.db" with the rest. Complete purge of the cache is just removing everything from the dir, as before.
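A sketch of this split, under the assumption of a cache_entries table with module_hash as its primary key and the ~100mb threshold named above (all names illustrative):

```python
import sqlite3
from pathlib import Path

BLOB_THRESHOLD = 100 * 1024 * 1024  # the ~100mb cut-off discussed above


def save_blob(conn: sqlite3.Connection, cache_dir: Path, module_hash: str, blob: bytes) -> None:
    if len(blob) <= BLOB_THRESHOLD:
        # Small blobs live inline in the table.
        conn.execute(
            "insert or replace into cache_entries (module_hash, content_blob, external_path) "
            "values (?, ?, null)",
            (module_hash, blob),
        )
    else:
        # Large blobs go to a sidecar file the user can eyeball and delete
        # selectively; the table stores only the path.
        path = cache_dir / f"{module_hash}.bin"
        path.write_bytes(blob)
        conn.execute(
            "insert or replace into cache_entries (module_hash, content_blob, external_path) "
            "values (?, null, ?)",
            (module_hash, str(path)),
        )
    conn.commit()
```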
I don't see how my proposal requires configuration in any real practice. There are probably no people out there going after files smaller than 100mb on their disk to free space. I'm working on a 9-year-old laptop with 256GB which is chronically starved of disk space, and I do this regularly (i.e., I go through various cache dirs left by various apps, often outdated, such as previous years' IntelliJ IDEs which leave entirely separate unused directories of caches behind, using Disk Inventory X or similar apps), but even I don't remember ever being desperate enough to sort through files smaller than 100mb.
My view is that an always-on persistent cache for all cells (as per #3054) recreates the Jupyter experience and expectations. From a computational perspective, when doing Bayesian inference with e.g. PyMC, computations may be very heavy, especially on CPU, but results are small. The same for most types of ML training, I guess. Another important case is writing LLM-calling workflows in the cells - LLM calls to providers sometimes take a long time, and they also cost money. If I have a long, multi-cell workflow that takes dozens of cents to complete, and my laptop loses power, I really don't want to re-pay those dozens of cents to recreate the working state of the program.

Personal anecdote: I'm currently writing these workflows, and the "cost anxiety" makes me store everything that LLMs and embedding models have ever returned to me on the disk. Results are too small to bother with the space or disk I/O, but the price of obtaining them is non-negligible. But doing this, as you can imagine, adds a lot of accidental complexity to my code: instead of worrying about the logic, I worry about serde and caching.
The file-modified tag will break in many ways across the plethora of setups that users may invent to store and share caches, including Git, rsync, Dropbox, network FS, and others.
Dangerous in what sense? Fundamentally, everything that we are talking about here is still a cache, not ironclad storage. Losing it in 1% of situations should be fine. Cf. my anecdotal examples above. If I don't have any persistence of cells at all, I start to worry about computation time and costs in case of laptop power outages/crashes (which happen with my old laptop regularly), or if I share the notebook with a colleague, and hence I add accidentally-complex persistence myself. But if I knew I had a relatively reliable, always-on cache, I would essentially be back to "Jupyter guarantees" (which are also not ironclad in practice - even in more than 1% of cases, cells need to be re-computed for various reasons), and I would stop worrying about persistence. A 1% chance of incurring an extra $1 cost or spending 5 minutes waiting for some computations is not worth worrying about at all. In lieu of persistence, even though rationally I should probably also not worry about these (the cost of adding caching complexity myself is higher than the moderate risk of incurring an extra $1 of costs), somehow it's hard to give up these worries.
Not sure what difference it makes wrt. SQLite-managed cache content blobs vs FS-managed ones -- "cache.db" is synced and shared on that "static file server" just as well as the adjacent blob files, if any.
I realised that SQLite will fail badly when people check the cache into Git (they will!) and have merge conflicts. Hence I re-assembled the proposal with a new solution here: #3176
Description
SQLite is meant to replace `fopen()`! Why would `PickleLoader` still use file-per-cell then?

Additionally, this shall reduce file I/O calls and some unnecessary overhead, which should become more relevant when cell results are stored and invalidated all the time, as per #3054.
Suggested solution

Use one SQLite db for persistent cache per directory, e.g., at `<work-dir>/.marimo/persistent_cache.sqlite` by default, or the location configured in `<work-dir>/.marimo/config` (or the global `~/.marimo/config`, à la Git).

Alternative

Since the results of cells are often dataframes, it may be better to use DuckDB by default, or at least provide an option to choose. DuckDB will still handle non-dataframe results fine, but SQLite might quickly become inefficient at storing dataframe results. There may be a custom Pickler that handles dataframe-shaped fields within the result object(s) specially, at arbitrary depth.
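For illustration, such a custom Pickler could build on pickle's persistent-ID hooks, hiving dataframe-shaped fields off to Parquet at arbitrary depth in the object graph. This is only a sketch: pandas/pyarrow and the sidecar-file layout are my assumptions, not part of the proposal.

```python
import pickle
import uuid
from pathlib import Path

import pandas as pd  # to_parquet/read_parquet require pyarrow (or fastparquet)


class DataFrameAwarePickler(pickle.Pickler):
    def __init__(self, file, sidecar_dir: Path):
        super().__init__(file)
        self.sidecar_dir = sidecar_dir

    def persistent_id(self, obj):
        # Intercept dataframes anywhere in the object graph.
        if isinstance(obj, pd.DataFrame):
            ref = f"{uuid.uuid4().hex}.parquet"
            obj.to_parquet(self.sidecar_dir / ref)
            return ("dataframe", ref)
        return None  # everything else pickles normally


class DataFrameAwareUnpickler(pickle.Unpickler):
    def __init__(self, file, sidecar_dir: Path):
        super().__init__(file)
        self.sidecar_dir = sidecar_dir

    def persistent_load(self, pid):
        kind, ref = pid
        if kind == "dataframe":
            return pd.read_parquet(self.sidecar_dir / ref)
        raise pickle.UnpicklingError(f"unknown persistent id: {pid!r}")
```

Usage would mirror plain pickle: `DataFrameAwarePickler(buffer, sidecar_dir).dump(result)` on save, `DataFrameAwareUnpickler(buffer, sidecar_dir).load()` on load.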
Additional context
No response