
Persistent cache to use SQLite #3055

Closed
leventov opened this issue Dec 4, 2024 · 17 comments
Labels: enhancement (New feature or request)

@leventov commented Dec 4, 2024

Description

SQLite is meant to replace fopen()! Why would PickleLoader still use file-per-cell then?

Additionally, this would reduce file I/O calls and some unnecessary overhead, which becomes more relevant when cell results are stored and invalidated all the time, as per #3054.

Suggested solution

Use one SQLite db for the persistent cache per directory, e.g., at <work-dir>/.marimo/persistent_cache.sqlite by default, or at the location configured in <work-dir>/.marimo/config (or the global ~/.marimo/config, a la Git).

Alternative

Since the results of cells are often dataframes, it may be better to use DuckDB by default, or at least provide an option to choose. DuckDB will still handle non-dataframe results fine, but SQLite might quickly become inefficient at storing dataframe results. There could be a custom Pickler that handles dataframe-shaped fields within result object(s) specially, at arbitrary depth.

Additional context

No response

leventov added the enhancement (New feature or request) label on Dec 4, 2024
@dmadisetti (Collaborator)

See the discussion in: #2653

Pickle cache is just the first iteration of a "cache type". There are other formats, loading methods, and optimizations that are likely improvements.

@dmadisetti (Collaborator)

I'm not sure if a SQLite format is the exact approach we'll take, so is it alright if we close this out? Or we can rename this to a more general "expand cache types" issue. It's a good idea though

@leventov (Author) commented Dec 4, 2024

I think this is orthogonal to cache types, at least if we are talking about persistent cache formats like pickle, npy, etc.

In this issue, I suggest instead of having a file per cell, have a single SQLite db with a single table e.g. of approximately the following scheme:

create table cell_results (
  file_path text,
  cell_hash text,
  format text,
  value blob  -- renamed from "values", which is a SQL keyword
);

Thus, this database can handle all formats mixed with each other: rows just carry different format column values. Also, if the user changes the setting to a new format, marimo can still read the old format (as it knows it from the table) if needed, before saving in the new format next time.
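For illustration, a minimal sketch of the lookup this enables (the function and its behavior are hypothetical, not existing marimo API):

import pickle
import sqlite3

def load_cell_result(conn: sqlite3.Connection, cell_hash: str):
    # The format column tells us how to decode the blob, so entries written
    # under an older default format stay readable.
    row = conn.execute(
        "select format, value from cell_results where cell_hash = ?",
        (cell_hash,),
    ).fetchone()
    if row is None:
        return None
    fmt, blob = row
    if fmt == "pickle":
        return pickle.loads(blob)
    raise ValueError(f"unknown cache format: {fmt!r}")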

For in-memory, remote, and mixed cache configurations, this is of no concern to this db -- it's simply oblivious of those. It only knows about the values that are going to end up saved locally on disk.

@leventov (Author) commented Dec 4, 2024

So, also on reflection, I think DuckDB is superfluous -- it should be sufficient to store Apache Arrow blobs in SQLite.

@dmadisetti (Collaborator)

I think this would still be a cache type. Files are easy for invalidation, and can be used across the wire, so we would still want to keep them, esp. for backwards compat.

I'd also be wary of prematurely adding optimization here when there are a few more outstanding bottlenecks that I think will have more impact. For instance, in my experience, rehashing large input variables takes far more time than any IO operations.

@leventov (Author) commented Dec 5, 2024

Can you elaborate on what you mean by "files are easy for invalidation"?

How I see it:

  • For the user to clear all cache for all notebooks in a directory: removing the __marimo__/cache dir vs. removing a single cache.sqlite file. Equally easy.
  • For the marimo environment or command line to clear the cache for a specific cell or for all cells in a notebook: on the code level, we are talking about the difference between path = build_path(hash); os.remove(path) and conn.execute("delete from cache where hash = ?;", (hash,)) (assuming autocommit was set up for the conn in the class constructor). Equally easy.
  • For the user to clear the cache for a specific cell, bypassing the marimo command itself: equally hard/impossible, as they don't know what to do with these hashes and how to find the "right" one.
  • For the user to clear the cache for a specific notebook, bypassing the marimo command itself: possible with SQLite by virtue of the file_path column; the user can run sqlite3 -line ./__marimo__/cache.sqlite "delete from cache where file_path like '%<my_notebook_file_name>.py';". Note that this is robust to notebook file renames, as only the hash column is the primary key in the table, and file_path is more of a "best effort" field, updated transparently whenever marimo updates the cache for the cell. With file-per-cell cache, however, this is hard to achieve: if the notebook name or path is included in the file name, it cripples/complicates the core "load cache by hash" function.

[files] can be used across the wire

Well, here you are actually inviting subtle problems and transient bugs with silent corruption of cache contents. This is exactly the reason why ad hoc file storage is generally not the best idea. Even the current PickleLoader.save_cache() in local mode is not atomic, as per https://stackoverflow.com/questions/50461868/trying-to-use-atomic-write-on-pickle-object. Let alone in network FS situations, which are a tar pit of possible bugs. But instead of reinventing the wheel, why not just use SQLite?
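To make that concrete, here is roughly what an atomic local pickle write has to do by hand (a sketch along the lines of the SO answer; with SQLite, a single INSERT inside a transaction gets the same guarantee for free):

import os
import pickle
import tempfile

def atomic_pickle_dump(obj, path: str) -> None:
    # Write to a temp file in the same directory, flush and fsync it, then
    # os.replace() onto the target, which is atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise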

And for remote, instead of network FS + file IO, it would be better to use rqlite. See rqlite FAQ:

Can I run a single node?
Sure. Many people do so, as they like accessing a SQLite database over HTTP. Of course, you won’t have any redundancy or fault tolerance if you only run a single node. If the single node fails you will need to restart it.

so we would still want to keep them, esp. for backwards compat

Keeping PickleLoader in the marimo codebase is easy and, I guess, not a high cost. It just shouldn't be the default back-end for persistent_cache when people use it going forward.

@dmadisetti (Collaborator)

I don't think there's a disagreement here, these are all good points. But like you said, I don't think incorporating this removes PickleLoader.
I think you have convinced me that it should replace the default though! SQLite being built in is a plus

@leventov (Author) commented Dec 6, 2024

@dmadisetti unless you want to own this because it interferes/intertwines with some other development you are already doing, I can draft a PR on the weekend. You can give extra requirements/ideas on how you think it should be done.

@leventov (Author) commented Dec 9, 2024

still in progress

@dmadisetti (Collaborator)

Re spec:

I think it's as simple as inheriting from Loader and overloading write, load, etc.
For a persistent connection, I'd use the State Manager, similar to how @cache does.

Feel free to push up what you have

@dmadisetti (Collaborator)

I don't think you should make this the default yet; there are a few hash-breaking changes yet to come. I think rolling this out with those changes would be ideal. There's an "experimental" mechanism that this should live under before becoming the default.
I'm happy to put this in if it's not immediately clear

Also, I might do a bit of benchmarking. While I think this is a nice solution, I'm not sold that the gains from this will be noticeable, and it may come at a marked memory penalty.
It would be nice to have this option regardless. I think the benchmark should really dictate default status

@leventov (Author)

The current progress is here: https://gist.github.com/leventov/00d94e82028fd3c6884950e454ce1a4c. Unfortunately, this export format doesn't have anchor links, so you should search the page for e.g. "create table", or scroll through it.

Thread local SQLite connection management

This gist currently uses threading.local() and the standard @contextmanager for a thread-local SQLite connection. Apparently marimo._runtime.context would be the direct equivalent of this. The marimo._runtime.state.state and marimo._runtime.state.State documentation appears cryptic to me, as does its usage in @cache. I don't understand what it provides to me over and in addition to get_context(). Must I use state for some of marimo's internal runtime bookkeeping or management reasons, or not necessarily? Also, the state doc appears to be targeted at the user-level API, not the internal API.
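For reference, the pattern in the gist is roughly the following (a sketch; names are illustrative):

import sqlite3
import threading
from contextlib import contextmanager

_local = threading.local()

@contextmanager
def cache_connection(db_path: str):
    # sqlite3 connections must not be shared across threads by default,
    # so keep one per thread; isolation_level=None gives autocommit.
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = _local.conn = sqlite3.connect(db_path, isolation_level=None)
    yield conn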

cell_id (and non-existent yet app_id) considerations

In the discussions throughout the gist above and in the table schema, you will find the history_index idea, which is an optimisation for modules that coincide with entire code cells (which, I assume, will become very widespread if persisting all cells becomes a local-dev default: #3054). The idea is that we keep a few "historic" versions of cell contents so that undoing the latest changes avoids re-computation. This hinges on the stability of cell_id (e.g., across marimo runtime restarts), which I didn't dig into yet. Are cell_ids stable?
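A minimal sketch of the trimming this implies (table and column names follow this proposal; the function itself is hypothetical):

import sqlite3

def trim_history(conn: sqlite3.Connection, cell_id: str, keep: int = 3) -> None:
    # Keep only the `keep` most recent module versions for this cell;
    # history_index 0 is assumed to be the newest entry.
    conn.execute(
        "delete from cache_entries where cell_id = ? and history_index >= ?",
        (cell_id, keep),
    )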

This would also require changing the Loader API (pass cell_id and cell_coincides_with_module=True|False to both cache_hit and load_cache).

Yet another optimisation that stable cell_ids enable, for the cell-coinciding-with-cached-module use case, is that storing the unchanged cell results again (even though the inputs, and hence the module hash, have changed) is unnecessary.

Notions of app_id would also be useful, for example for clearing all cache for an app (which would amount to delete from cache_entries where app_id = ?). But it would also require app_id stability.

In general, documenting the semantics of cell_id as some kind of "developer docs" would be helpful. And app_id, if that gets added. Stability across marimo app restarts implies that different marimo app instances also share the cell_id; in turn, this implies an internal database for them, such as another SQLite db. Stability across code movements and changes is also unclear: what if the notebook file is renamed? What if the cell is renamed? What if cell code is moved to another file, but explicitly connected to the app from that file? It would be nice if all of this were true; it would unlock interesting stuff when correlating cell_ids across git commits or branches. A topic for another issue.

Migration path

I currently think of the following:

  1. The cache location by default is __marimo__/cache/cache.db (not as in the gist above, where it's one level higher, i.e., __marimo__/cache.db; I will change that).
  2. SQLite is not suitable for storing extremely large blobs (nor is this nice from an observability perspective, such as when the user wants to clean up the largest space-consuming clutter but doesn't want to or doesn't know how to open sqlite, yet may use programs that scout the file system for the largest files). The current default threshold is 100MB, above which the "old-style" file-per-module cache will still be used.
  3. The metadata for these large file-per-module pickles is still stored in the sqlite table, along with a content_hash to ensure non-corrupted results (see details in the gist; they are already largely there).
  4. Upgrade path: the cache db stores an extra flag in a separate single-row table, a la create table legacy_cache_uplift (done INTEGER) (this table is not created yet in the gist). Upon first usage, if not done yet, we scan the directory and "uplift" the files, creating entries for them in the table (sketched below). From then on, the only hit/discovery path goes through the table (and if the table points to a file, the file is read then). To consider: when this uplift is done, we can remove all small files in an atomic way, too. This will not lead to read amplification because we need to scan through all files anyway during uplift to calculate their content_hash. However, some care is needed to handle this atomically.
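A sketch of that uplift step, under the assumptions above (the file layout, table, and column names are from this proposal, not existing marimo code):

import hashlib
import sqlite3
from pathlib import Path

def uplift_legacy_cache(conn: sqlite3.Connection, cache_dir: Path) -> None:
    # One-time migration: register pre-existing file-per-module pickles in
    # the table so that all future lookups go through it. Assumes the single
    # flag row in legacy_cache_uplift was created together with the schema.
    row = conn.execute("select done from legacy_cache_uplift").fetchone()
    if row and row[0]:
        return
    with conn:  # one transaction: a crash mid-scan leaves the flag unset
        for f in cache_dir.glob("*.pickle"):
            conn.execute(
                "insert into cache_entries (module_hash, file_path, content_hash)"
                " values (?, ?, ?)",
                (f.stem, str(f), hashlib.sha256(f.read_bytes()).hexdigest()),
            )
        conn.execute("update legacy_cache_uplift set done = 1")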

Module hash pending changes

I actually think this breaking change is a good case for a table-driven persistent cache, not a reason to delay it. We can store a module_hash_version column and proactively delete all rows with older versions than what the current instance of the marimo app creates. This also aligns well with the overall design of this cache: we should clean up outdated versions of module contents (through the history mechanism described above), deleted cells (via cell_ids), deleted apps (via app_ids), entries with the file_path column set but the file actually missing or corrupted, etc.
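In code, that proactive cleanup would be a one-liner at startup (a sketch; the version constant is hypothetical):

import sqlite3

CURRENT_MODULE_HASH_VERSION = 2  # hypothetical constant, bumped on hash-breaking changes

def purge_stale_hash_versions(conn: sqlite3.Connection) -> None:
    # Rows written under an older hashing scheme can never hit again,
    # so deleting them proactively keeps the cache from piling up.
    conn.execute(
        "delete from cache_entries where module_hash_version < ?",
        (CURRENT_MODULE_HASH_VERSION,),
    )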

Cleaning up non-cell-coinciding module hashes (but still within a cell)

Possible design:

  1. Register any cache hits for module hashes in a separate column accessed_at.
  2. Periodically check if there are cells (via cell_id) for which there are recent hits for some modules within them, but some other modules haven't been accessed in a long time. I don't see how this may happen unless the sub-cell module with @persistent_cache or a persistent_cache context has been deleted from the cell code. Then these rows can be deleted, too.

This leads to some write amplification for such cells. But note that for the vast majority of cells, those where the cached module coincides with the cell itself, this accessed_at writing can be avoided (instead, the history_index mechanism is used, as described above, which leads to writes only upon actual cache writes, not hits).

But for those (rare) cells that are not persisted themselves but have persisted modules within, this write amplification seems like a fine tradeoff (which earns an automatically clean cache).

The alternative is to let the user periodically clean the cache for a cell or app altogether, but this is a mental burden and I think should be avoided.

Note that this still hinges on the stability of cell_ids.

Performance and footprint

Also, I might do a bit of benchmarking. While I think this is a nice solution, I'm not sold that the gains from this will be noticeable, and it may come at a marked memory penalty.
It would be nice to have this option regardless. I think the benchmark should really dictate default status

As evident from the above, a table-driven cache enables many paths to clean up the cache automatically and semi-automatically, while not requiring the user to purge ALL cache (including for still-used cells), which is bad developer experience IMO. Just from this, if these auto-cleanup routines are enabled, the table-driven cache should in practice result in much smaller __marimo__/cache dirs for users.

The extra benefits are saner migration paths (enabled by things like module_hash_version and the metadata column, see more details in the gist) and manageability.

But even if the above note about automatic clean-ups were not true and we actually never deleted anything from the cache_entries table (as currently files in cache dirs are never deleted), why do you think the table "may come at a marked memory penalty"? My intuition is the opposite: SQLite's footprint/overhead per row is smaller than the operating system's overhead per file (a separate file descriptor, probably mandatory alignment to the 4KB page size per file internally, etc.).

If your "may come at a marked memory penalty" refers to double-storage of content_blobs in WAL and "main database", then SQLite automatically recycles the WAL approximately every 4MB of WAL. This should not be a concern.

Indexes are in memory (see gist); however, they are expected to be dwarfed by the content blob sizes anyway. When they are not, it means the entire cached content is itself minuscule in size, and hence the cache size is not an issue.

Moreover, the entire question of whether file descriptors or rows in sqlite have the larger footprint is moot (and hence the few paragraphs above are just of technical curiosity, and to make sure I understand you). This is because we overall don't expect to have millions of active modules-to-be-cached. The real enemy will always be piling up outdated module_hashes and versions. That, again, a table-driven cache puts under control; a directory with file-per-module cache does not.

@leventov (Author)

Cleaning up non-cell-coinciding module hashes (but still within a cell)

Ok, I think these accessed_at values don't have to be stored in the table. The marimo app runtime can manage them and compare with the created_at of the module_hashes for a given cell_id. So, no excessive write amplification on each cache access. But cell_id stability is still vital.

@dmadisetti (Collaborator)

Thread local SQLite connection management

Using the internal state manager will do default cleanup and be tied to the calling cell. But I think that might have been a bit of a brain fart on my part, since the connection should be on a per-notebook, not a per-cell basis (it would be the correct solution if it were cell-based).

Current default threshold is 100MB

I think this makes it a deal-breaker for the default implementation, and with a file reference it seems like many of the benefits of a persistent connection are lost at this point? marimo "just working" is a good design philosophy and shouldn't need configuration

Memory might be less of an issue than I thought if the entries are so limited. My use case for persistent cache has been 1gb+ blobs, which would make every call with the sqlite cache just unneeded overhead/complexity

Module hash pending changes

Module support is already baked in with __version__. The version influences the hash, so it doesn't need to be explicitly accounted for. This might not be super well documented; I'll look at the docs again.

In the execution "default" branch (probably stale by now), module objects are replaced with a stub with additional metadata. The __init__.py modification date could be used as a history proxy.

Metadata is currently baked into the pickled object, so the sqlite doesn't really add much there either, but it will require schema management as opposed to just being pinned to a marimo version.

The real enemy will always be piling up outdated module_hashes and versions.

I agree with this, but I think this is a UI, not an implementation issue. I think a user prompt for "clean cache" in the file hamburger might be the best option here (maybe on a per-cell basis if "cached execution"). In which case it doesn't matter whether it's sqlite (filter on the accessed_at entry) or the file modified tag if using files. Implicit cache purging might be a little dangerous, but settings for this could be fine


I don't think I mentioned this before, but one of the strongest points in advocating for file-based cache artifacts is that they can be distributed by any static file server. Reading over your implementation, I'm back to not being totally convinced by the added complexity, but happy to eat my words with the benchmarking.

That said, here's how I see the implementation:

class DbLoader(Loader):
    """Inherit if it can be placed in sqlite"""
    @abstractmethod
    def load_blob(self, blob: bytes) -> Cache:
        """Convert blob to a cache object

        Args:
            blob: The raw bytes to convert

        Returns:
            cache: An object representing the cache data
        """

    @abstractmethod
    def serialize_cache(self, cache: Cache) -> bytes:
        """Convert cache object to a blob

        Args:
            cache: An object representing the cache data

        Returns:
            blob: The raw bytes to save
        """

# TODO Expand from PickleLoader to a DbLoader

DB_LOADERS: dict[str, type[DbLoader]] = {...}


def get_connection():
  ...

class SqliteLoader(Loader):
    """General loader for serializable objects."""

    def __init__(self, name: str, save_path: str) -> None:
        super().__init__(name)
        self.name = name
        self.save_path = Path(save_path)
        self.save_path.mkdir(parents=True, exist_ok=True)
        self.conn = get_connection() # TODO

    def build_path(self, hashed_context: str, cache_type: CacheType) -> Path:
        return self.save_path / "cache.db"

    def cache_hit(self, hashed_context: str, cache_type: CacheType) -> bool:
        return self.conn.query(...) # TODO select exists(select 1 from cache where hash = {}, hashed_context)

    def load_cache(self, hashed_context: str, cache_type: CacheType) -> Cache:
        result = self.conn.query(...) # TODO select blob, cache_type, loader from cache where hash = {}
        assert result.loader in DB_LOADERS, "No good"
        loader = DB_LOADERS[result.loader](self.name, self.save_path)
        if not result.blob:
            return loader.load_cache(hashed_context, cache_type)
        return loader.load_blob(result.blob)

    def save_cache(self, cache: Cache, loader_type: Literal["pickle"] = "pickle") -> None:
        # pickle just for now
        assert loader_type in DB_LOADERS, "No good"
        loader = DB_LOADERS[loader_type](self.name, self.save_path)
        # Any added metadata starts to bifurcate this from the more straight
        # forward case, but last_access allows for LRU vs date heuristics
        # cache_type        | hash | loader      | blob | last_access
        # -----------------------------------------------------------
        # cache.cache_type  | hash | loader_type | loader.serialize_cache(cache)
        self.conn.insert(...)

@leventov (Author)

Using the internal state manager will do default cleanup and be tied to the calling cell. But I think that might have been a bit of a brain fart on my part, since the connection should be on a per-notebook, not a per-cell basis (it would be the correct solution if it were cell-based).

The connection should be one per thread, and only because a SQLite connection is not a thread-safe construct. If the marimo runtime's concurrency architecture is thread-per-notebook (== App, right?), those would be the same thing, but I was not sure about this because I haven't dug into marimo's threading/concurrency architecture.

"But the correct solution if it was cell based" -- why is this? I don't understand.

Current default threshold is 100MB

I think this makes it a deal-breaker for the default implementation, and with a file reference it seems like many of the benefits of a persistent connection are lost at this point? marimo "just working" is a good design philosophy and shouldn't need configuration

Memory might be less of an issue than I thought if the entries are so limited. My use case for persistent cache has been 1gb+ blobs, which would make every call with the sqlite cache just unneeded overhead/complexity

I somewhat struggle to parse your reply here.

I will assume that by "persistent connection" you meant "using SQLite for cache content blob management" [rather than the native file system, as it is done now].

In summary, the benefits of using SQLite for cache content blob management:

  1. Faster per-cell/app content blob queries, with the ability to automatically clean up histories: with SQLite, this is a simple query against an in-memory index. In the current management scheme, although it's not implemented yet, the most portable way to do this would be to store blobs in files with names of the following format: "{prefix}{module_hash}{cell_id}_{app_id}.pickle", but this makes cache_hit() slower because it now requires listing all file names in the cache directory and finding a match.
  2. For the cached-module-coinciding-with-the-cell case (which is the most important, in my view), history_index permits fast management of short histories for the cell, so that unfortunate code changes, when reverted, don't require re-computation. With file-system-based content blob management, I don't really see how to do this without major hacks.
  3. Faster content blob checksum storage: with SQLite, that's ~zero extra overhead, as it's just a single (small) extra column in the table (see the sketch after this list). With native file systems, you would either need to reserve a few bytes for checksum storage at the beginning of the file (like, you first write four zero bytes, then write the entire pickle with streaming checksum calculation, and at the end you overwrite the first bytes), or append the checksum as trailing bytes after the pickle, or store checksums separately from the blob files themselves (in SQLite, hehe). Leading/trailing-bytes checksum storage means extra page reads/writes on either the cache read or cache write path.
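The SQLite side of point 3 looks roughly like this (a sketch; column names follow the proposal above):

import hashlib
import sqlite3

def save_with_checksum(conn: sqlite3.Connection, module_hash: str, blob: bytes) -> None:
    # The blob and its checksum land in the same row in one statement, so
    # they cannot go out of sync; no leading/trailing file bytes needed.
    conn.execute(
        "insert into cache_entries (module_hash, content_blob, content_hash)"
        " values (?, ?, ?)",
        (module_hash, blob, hashlib.sha256(blob).hexdigest()),
    )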

To me, all these points scream about the need to do this management in a dedicated small table in SQLite. That "{prefix}{module_hash}{cell_id}_{app_id}.pickle" scheme (plus a content checksum in the leading/trailing file bytes) basically turns the file system's directory object into such a table, but, obviously, a very poor man's one. I don't see how it's simpler or more elegant -- to me, it seems very much the opposite.

with a file reference seems like many of the benefits of a persistent connection are lost at this point?

The fact that really large content blobs (larger than 100mb) are still stored outside of SQLite doesn't interact with any of the benefits above. Storing large blobs in files adds the overhead of an extra pair of file IO calls (writes on the write path, reads on the read path) and of reading/writing the file descriptor, compared to the situation where such a blob would have been stored right in the table, but these overheads are going to be completely dwarfed by the time it takes to actually pickle/unpickle and read/write these large files. In fact, storing smaller blobs inside the table is just an optimisation for smaller blobs.

The additional benefit of this optimisation is that it would make the cache directory more "observable" when the user decides to eyeball it: here are my large files (which I may decide to delete selectively to unclutter my disk), and "cache.db" with the rest. A complete purge of the cache is just removing everything from the dir, as before.

marimo "just working" is a good design philosophy and shouldn't need configuration

I don't see how my proposal requires configuration in any real practice. There are probably no people out there going after files smaller than 100mb on their disk to free up space. I'm working on a 9-year-old laptop with 256Gb which is chronically starved of disk space, and I do this regularly (i.e., I go through various cache dirs left by various apps, often outdated, such as previous years' IntelliJ IDEs, which leave entirely separate unused directories of caches behind, using Disk Inventory X or similar apps), but even I don't remember ever being desperate enough to sort through files smaller than 100mb.

Memory might be less of an issue than I thought if the entries are so limited- my usecase of persistent cache has been for 1gb+ blobs- which would make every call with the sqlite cache just unneeded overhead/ complexity

My view is that always-on persistent cache for all cells (as per #3054) recreates the Jupyter experience and expectations.

From a computational perspective, when doing Bayesian inference with e.g. PyMC, computations may be very heavy, especially on CPU, but the results are small. The same goes for most types of ML training, I guess.

Another important case is writing LLM-calling workflows in the cells: LLM calls to providers sometimes take a long time, and they also cost money. If I have a long, multi-cell workflow that takes dozens of cents to complete, and my laptop loses power, I really don't want to re-pay those dozens of cents to recreate the working state of the program.

Personal anecdote: I'm currently writing these workflows, and the "cost anxiety" makes me store everything that LLMs and embedding models have ever returned to me on disk. The results are too small to bother about the space or disk I/O, but the price of obtaining them is non-negligible. Doing this, as you can imagine, adds a lot of accidental complexity to my code: instead of worrying about the logic, I worry about serde and caching.

@leventov (Author)

I agree with this, but I think this is a UI, not an implementation issue. I think a user prompt for "clean cache" in the file hamburger might be the best option here (maybe on a per-cell basis if "cached execution"). In which case it doesn't matter whether it's sqlite (filter on the accessed_at entry) or the file modified tag if using files.

The file modified tag will break in many ways across the plethora of setups that users may invent to store and share caches, including Git, rsync, Dropbox, network FS, and others.

Implicit cache purging might be a little dangerous, but settings for this could be fine

Dangerous in what sense? Fundamentally, everything we are talking about here is still a cache, not ironclad storage. Losing it in 1% of situations should be fine. Cf. my anecdotal examples above. If I don't have any persistence of cells at all, I start to worry about computation time and costs in case of laptop power outages/crashes (which happen with my old laptop regularly), or when I share the notebook with a colleague, and hence I add accidentally-complex persistence myself. But if I knew I had a relatively reliable, always-on cache, I would essentially be back to "Jupyter guarantees" (which are also not ironclad in practice: even in more than 1% of cases, cells need to be re-computed for various reasons), and I would stop worrying about persistence. A 1% chance of incurring an extra $1 of cost or spending 5 minutes waiting for some computations is not worth worrying about at all. In lieu of persistence, even though rationally I should probably also not worry about these (the cost of adding caching complexity myself is higher than the moderate risk of incurring an extra $1 of costs), somehow it's hard to give up these worries.

I don't think I mentioned this before, but one of the strongest points in advocating for file based cache artifacts is because they can be distributed by any static file server.

Not sure what difference it makes wrt. SQLite-managed vs. FS-managed cache content blobs -- "cache.db" is synced and shared on that "static file server" just as well as the adjacent blob files, if any.

@leventov (Author)

I realised that SQLite will fail badly when people check in the cache into Git (they will!) and have merge conflicts. Hence I re-assembled the proposal with a new solution here: #3176
