Published persistent provenance #44
Conversation
@Ex-32 We will have to implement this in Rust at transcribe-time eventually. This is a mock-up of what the Python-side data structures will be. Please especially review the rationale (markdown).
```python
@dataclasses.dataclass(frozen=True)
class InodeMetadataVersion:
    inode_version: InodeVersion
    stat_results: bytes
```
Probably not a big deal, but we should have a dataclass to represent a stat in `probe_py.generated`.
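For concreteness, a minimal sketch of what such a dataclass might look like; the field set mirrors `os.stat_result` and is an assumption, not the final design:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class StatResult:
    # Hypothetical subset of os.stat_result; the real probe_py.generated
    # dataclass would be generated to match what the Rust side records.
    st_mode: int      # file type and permission bits
    st_ino: int       # inode number
    st_dev: int       # device ID
    st_uid: int       # owner user ID
    st_gid: int       # owner group ID
    st_size: int      # size in bytes
    st_mtime_ns: int  # modification time in nanoseconds
```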
- `inode_version_writes` maps an inode-and-version (filename) to the process ID (in the file contents) of the process that created it.
- `processes` maps a process ID (filename) to a Process object (in the file contents).
Note: multi-process sqlite is either not performant (globally locks the entire database) or not easy (lock each row somehow), but the filesystem will work fine for our case, since we just need a key-value store.
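A minimal sketch of that file-based key-value layout; the store location and the use of pickle for serialization are illustrative assumptions, not the actual PROBE scheme:

```python
import pathlib
import pickle

STORE = pathlib.Path("/tmp/probe-store")  # hypothetical location


def record(inode_version: str, pid: int, process: object) -> None:
    # Key is the filename; value is the file contents.
    (STORE / "inode_version_writes").mkdir(parents=True, exist_ok=True)
    (STORE / "processes").mkdir(parents=True, exist_ok=True)
    (STORE / "inode_version_writes" / inode_version).write_text(str(pid))
    (STORE / "processes" / str(pid)).write_bytes(pickle.dumps(process))


def writer_of(inode_version: str) -> object:
    # Follow the filename -> file-contents indirection described above.
    pid = (STORE / "inode_version_writes" / inode_version).read_text()
    return pickle.loads((STORE / "processes" / pid).read_bytes())
```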
sqlite has built-in serializable ACID transactions; as long as there are no network filesystems involved, we can just open the database from multiple processes and have it just work™. sqlite's query optimizer is good enough that we could probably just `SELECT * FROM Processes WHERE inode_version=...`
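A minimal sketch of that multi-process usage, assuming a `Processes` table with an `inode_version` column (the schema is hypothetical); each process opens its own connection and relies on sqlite's locking:

```python
import sqlite3


def processes_writing(db_path: str, inode_version: str) -> list[tuple]:
    # timeout= makes the connection wait out other processes' write locks
    # instead of raising "database is locked" immediately.
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # one serializable ACID transaction
            return conn.execute(
                "SELECT * FROM Processes WHERE inode_version = ?",
                (inode_version,),
            ).fetchall()
    finally:
        conn.close()
```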
That being said, we probably don't want to just dump `probe_log` blobs into sqlite, since it has a hardcoded limit of 1 GB for strings and blobs (ask me how I know 🙃).
Whether to use sqlite or raw files is an important design choice. We wouldn't be storing the entire `probe_log`; we would be disassembling `probe_log` into a set of rows: one for each process, one for each inode at each version (versioned at open/close-time).
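A minimal sketch of what those rows might look like; table and column names are assumptions for illustration:

```python
import sqlite3

# Hypothetical schema: one row per process, one row per inode at each version,
# plus the read edges needed to traverse the graph.
SCHEMA = """
CREATE TABLE IF NOT EXISTS processes (
    pid     INTEGER PRIMARY KEY,
    process BLOB NOT NULL  -- one serialized Process, well under the 1 GB limit
);
CREATE TABLE IF NOT EXISTS inode_version_writes (
    inode_version TEXT PRIMARY KEY,  -- versioned at open/close-time
    writer_pid    INTEGER NOT NULL REFERENCES processes(pid)
);
CREATE TABLE IF NOT EXISTS reads (
    pid           INTEGER NOT NULL REFERENCES processes(pid),
    inode_version TEXT NOT NULL
);
"""


def open_store(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```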
As far as I understand, serializing the transactions is a bottleneck. In principle, the transactions should be independent, so we don't need to serialize them at all. They update (theoretically) different files in the current design, because each inode/version is written by exactly one process. However, there may be future tables that do require locking (e.g., maintaining a table of which processes use each inode/version would be racy).
There is also the issue of transferring: when we run SCP or Rsync, we will need to transfer specific key-value pairs over to the remote. With files, this is easy, although it results in a lot of files, which I think is ok. With sqlite, it would mean selecting a number of rows from a number of tables. I'm not sure which is better.
Another point is that traversing the graph requires "pointer-chasing" through multiple rows (each row tells you where to find the next one). This kind of access pattern results in O(N) file reads, which may be a lot, or O(N) select-queries, which is still a lot of queries, but a new query is much cheaper than opening and closing a new file descriptor.
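To make that access pattern concrete, a sketch of the per-edge traversal, using the hypothetical schema above; each loop iteration is one query (or, in the file-based design, one open/read/close):

```python
import sqlite3


def ancestors(conn: sqlite3.Connection, inode_version: str) -> set[str]:
    seen: set[str] = set()
    frontier = [inode_version]
    while frontier:  # O(N) queries: one per node visited
        iv = frontier.pop()
        if iv in seen:
            continue
        seen.add(iv)
        # Each row tells us where to find the next ones: the writing
        # process, then everything that process read.
        frontier.extend(
            parent
            for (parent,) in conn.execute(
                "SELECT r.inode_version FROM inode_version_writes w "
                "JOIN reads r ON r.pid = w.writer_pid "
                "WHERE w.inode_version = ?",
                (iv,),
            )
        )
    return seen
```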
(No response required, since Ex-32 is already busy; just thinking out loud)
I need to expand on the transactions we will be doing, and at what frequency. Basically, the persistent store should be a bipartite graph between processes and inode-versions.
- At transcription-time (which probably happens no more than once per minute): insert nodes and edges.
- When the user asks for it (less than once per minute): traverse the graph "up" from a specific inode/version: "what inputs were used to make this output?" (aka pull-based updating). This query is used in applications like "Make-without-Makefile" (see the sketch after this list).
- When the user asks for it (less than once per minute): traverse the graph "down" from a specific inode/version: "what outputs were dependent on this input?" (aka push-based updating). This query is used much less than the pull-based version, but could still be useful if the user does a system upgrade and doesn't know which of their projects need to be recompiled/recomputed.
- When the user does an SCP or Rsync (less than once per minute): traverse the graph "up", so we can transfer the "relevant" bits of provenance to the remote, so a user at the destination machine (which could be local or remote) can query the provenance of the files we are sending.
- When the user issues "garbage collect" (less than once per minute): iterate over every inode/version; mark the ones that still exist as "not deletable". Mark any ancestor of "not deletable" as "not deletable" (search "up" in the provenance graph from each still-existing inode/version).
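For the pull-based "up" query, sqlite can also do the whole traversal in a single recursive query, avoiding a Python round-trip per edge; this sketch reuses the same hypothetical schema as above:

```python
import sqlite3

UP_QUERY = """
WITH RECURSIVE up(inode_version) AS (
    VALUES (?)
    UNION  -- UNION (not UNION ALL) deduplicates, so shared ancestors terminate
    SELECT r.inode_version
    FROM up
    JOIN inode_version_writes w ON w.inode_version = up.inode_version
    JOIN reads r ON r.pid = w.writer_pid
)
SELECT inode_version FROM up;
"""


def inputs_of(conn: sqlite3.Connection, target: str) -> list[str]:
    # "What inputs were used to make this output?"
    return [iv for (iv,) in conn.execute(UP_QUERY, (target,))]
```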
All in all, the operations are slow enough that the parallelism between readers and writers, which is potentially available in the filesystem but not in sqlite (which uses a readers-writer lock), is not that important. On the other hand, traversing up or down the graph is quite important. For a file-based solution, each edge-traversal requires the open/close of a file; sqlite loads the graph into memory (if it fits), so each edge-traversal is a pointer dereference. If the graph does not fit in memory, sqlite will load blocks of the table (presumably LRU-cached), which is still much more efficient than open/close.

While a graph database (e.g., Neo4J) would be even better, Neo4J is not "embeddable" in a Python application (only in Java). The most popular deployments of Neo4J are as a daemon process, which would be annoying. With sqlite, PROBE can be "daemonless".
Therefore, I think persistent provenance should be held in sqlite in the future. @Shofiya2003, the logic regarding the SCP/Rsync wrapper won't change: still call `get_prov_upstream`; the output will be a set of objects, which you can pass to a new function, `transfer(objects, host)`, which will figure out how to transfer the objects to the remote host, whether they are filenames (the current scheme) or objects/sqlite-rows (the future scheme).
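A sketch of what that `transfer(objects, host)` seam could look like; the dispatch on object kind and the rsync invocation are assumptions about the design, not existing PROBE code:

```python
import pathlib
import subprocess


def transfer(objects: list[object], host: str) -> None:
    # Current scheme: objects are filenames; copy them as files.
    paths = [str(o) for o in objects if isinstance(o, (str, pathlib.Path))]
    if paths:
        subprocess.run(["rsync", "-a", *paths, f"{host}:.probe/"], check=True)
    # Future scheme: objects are sqlite rows; serialize them and INSERT
    # into the remote database (left unimplemented in this sketch).
    rows = [o for o in objects if not isinstance(o, (str, pathlib.Path))]
    if rows:
        raise NotImplementedError("sqlite-row transfer not designed yet")
```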
> At transcription-time (which probably happens no more than once per minute): insert nodes and edges.
While I agree with your overall rationale, some tasks, like compiling C/C++ code, where each file spawns several processes (compiler, assembler, etc.), could potentially produce higher transcription loads (in the range of once per second).
Transcription should take place when the PROBEd process-tree completes, not when a single PROBEd process completes. I'm assuming users will do `probe make` instead of `make CC="probe gcc"` (but maybe they won't). Ideally, we need to explain that one should try to PROBE the greatest unborn ancestor process (when it gets born) 😆
Closed in favor of #75.