Allow users to collaborate while using experiment tracking #1218
Comments
This functionality would also help some internal teams that would like to use Kedro experiment tracking; they cannot adopt it because they cannot configure the storage location.
Very relevant, @limdauto pointed this out recently: https://fly.io/blog/all-in-on-sqlite-litestream/
Technical Design Discussion on 11/01/2023

Options 1, 2, and 3 were evaluated for their advantages and feasibility, and Option 2 was selected as the most feasible; next steps for Option 2 were agreed.
Another possibility could be to draw inspiration from what Prefect is doing with their Orion UI. In my opinion, it's the best of both worlds: you can set it up with a local SQLite db and use it locally (just like Kedro-Viz currently works), but you also have the option to set it up with a PostgreSQL backend db and run it as a remote web server. In both cases, the server tracks metadata about your runs (e.g. the DAG, how long each step runs, inputs/outputs generated). All of this metadata is already available in Kedro at runtime, so it should be easy enough to expose it through an API call in a hook!
FWIW, what @MatthiasRoels mentioned was the original idea, and was also why we bothered with SQLAlchemy in the first place: it should be backend*-agnostic. Unless anything has changed recently, this is literally the only place we need to change to enable a different db than SQLite: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/database.py#L15, and make that configurable by the end user. I was planning to fork viz to do a PoC for a problem I'm facing at work that will require a different backend than SQLite. I could report some learnings back if I get to it. *: SQLAlchemy-compatible backend, not S3
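The one-line change described above could be made user-configurable with an environment override. A minimal sketch, assuming a hypothetical `KEDRO_VIZ_DATABASE_URL` variable (not an actual kedro-viz setting):

```python
import os


def database_url(default_sqlite_path: str) -> str:
    """Resolve the connection string to pass to SQLAlchemy's create_engine.

    KEDRO_VIZ_DATABASE_URL is a hypothetical override, not an actual
    kedro-viz setting; when it is unset, behaviour stays local-SQLite.
    """
    return os.environ.get(
        "KEDRO_VIZ_DATABASE_URL",
        f"sqlite:///{default_sqlite_path}",
    )
```

Because SQLAlchemy accepts any supported database URL (e.g. `postgresql://...`), a change along these lines would make the store backend-agnostic, as the comment suggests.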
This PR addresses Issue 3 (#1218) from the user research for experiment tracking. The update enables users to store their session store and tracking data in the cloud. Here's an overview of the collaborative experiment tracking implementation:

In settings.py, the user specifies the S3 bucket location, which triggers the upload of their session_store.db (a SQLite database) to the cloud using fsspec. The upload is executed via the SQLiteStore._upload() function during a Kedro run, when a user creates a new experiment.

When a user launches Kedro-Viz, the session_store.db files from all other users are downloaded to that user's local machine through the SQLiteStore._download() function. The downloaded databases are then merged into the user's local session_store.db through the SQLiteStore._merge() function. As a result, the local session_store.db contains not only the user's own experiments but also those run by other team members.

Every user collaborating on an experiment tracking project therefore maintains a copy of everyone's experiments, both locally and in the cloud. This synchronization is achieved through the SQLiteStore._sync() function, which downloads, merges, and re-uploads the session_store.db.
Closing this ticket, as we have many others in the works for this feature. Follow along here.
Description
This is the third highest priority issue resulting from the experiment tracking adoption user research. Users want to be able to write their experiments to storage that is not on their local computer and share their experiments with other team members.
This is important as it enables a team of users to collaborate and see each other's results as they iterate on a pipeline, compared to the experience of being limited to one user's local machine. Hence, this is a deciding factor for the adoption of Kedro Experiment Tracking.
This pain point also came up in the experiment tracking user testing sessions.
Context
What is the problem?
Users can only perform a model run on a local machine, making it difficult to collaborate on a project with the rest of their team because experiment results are on multiple computers.
Additionally, users have raised other related concerns.
Who are the users of this functionality?
Users are primarily data scientists, and data engineers are secondary users.
Why do our users currently have this problem?
We designed it this way to launch a simpler version of experiment tracking in Kedro, even though its predecessor (PerformanceAI) had this functionality.
Currently, users can only store their runs on a local machine via SQLite:
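For reference, the local-only setup is configured in the project's settings.py roughly as follows (per the kedro-viz documentation at the time of writing; the exact path is project-specific):

```python
from pathlib import Path

from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

# Store the session database locally; the path points at a
# directory inside the Kedro project (here, its data folder).
SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
```

Because the store location is a local filesystem path, the resulting session_store.db only ever exists on one user's machine, which is exactly the limitation this issue describes.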
What is the impact of solving this problem?
It would be possible to view all user experiments across a team in one place and also solve an outstanding adoption issue for Kedro Experiment Tracking:
How could we implement this functionality?
Option 1
Open the browser and see it: run a shared server that your Kedro-Viz instance, and other users' instances, can connect to, and share runs from that server.
Option 2
Create a mechanism where only the data is shared: Kedro-Viz still runs locally but has access to a shared data service, for example a database stored in S3, which provides the new data.
Option 3
Connect to other solutions (MLflow and Weights & Biases): these tools provide this functionality natively, so we would be relying on their implementations.
What important considerations do we have?
All of the options above would require us to redesign the backend data model. How do we contain everything currently shown on Kedro-Viz in a single database, versus always deriving the data from code in the Kedro framework?
Currently, we read the data directly from the Kedro project. We need to solve the data model first (the SQLite store) before considering any of these options.
What other related issues can I read?
This is related to other open issues: #1217, #1039, and #1116