blog/use-case: Unit tests for data using DVC #2512
Comments
I think this is related to what @casperdcl is working on.
🤔 This diagram from https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning#mlops_level_2_cicd_pipeline_automation describes a high-level mature ML workflow.

IMO #2404 addresses the final step and highest level of maturity for the model experimentation/development/test phase, as shown in boxes 2 and 3 in the diagram. What @iesahin wrote seems more focused on the production phase (data in the wild), where new, unknown data is fed to a deployed model. I'd probably see it as the "data validation" step in box 4 in the diagram, and maybe also related to box 6 (at least the mention of data drift).

Ultimately, I guess it depends on what @iesahin intended. Either way, I agree that using DVC to validate data could be valuable.
Thanks for the diagram @dberenbaum. I read the paper as part of a Coursera MLOps course.

The idea is continuously improving the models by training them with new data, but this new data may have different properties than the original data used to train the model and may cause a decrease in performance. For example, if we have a face detection model trained before the pandemic and now want to improve it, newer data will contain a larger proportion of masked faces. Some of the assumptions the original model makes about the original dataset may not hold. If we write unit tests for these assumptions, improving the model becomes safer, with less technical debt. This can be used in CML as well; a data validation step is almost always required in online improvements.

I think #2404 is more of a higher-level description. What I have in mind is a more specific, concrete example: we have two different sources of data and a model trained with the first, but we want to improve the model with the second without a complete retraining. This is about improving pipeline reliability by using statistical unit tests. If something weird is supplied to the pipeline, it should notify us before trying to retrain the model. A similar document for CML may also be nice, but here I would like to keep the scope limited to manual retraining.
Hmm. To me "unit testing" is part of "CI", and I would never consider this part of production. It's a development (or pre-prod) phase. Data here is not that different from code, to my mind: either we use it in the existing model (production) or as an input to researchers (implementing new models, etc). CI/CD can work the same way; it's just that in the data lifecycle even the model development phase can be considered "production".

The scenario (not even the right term, more like "creating a sense of what is possible") I would try to describe as part of #1942: I have a data registry (in our case a repo) and there are datasets that we keep updating there. We should use PRs; on a PR we can run simple safety checks, legal checks, schema checks, etc. using CML (CI for data). When it's done we merge (the CD-for-data part) and people can use the updated dataset.
Yes, it means you're talking about a small tutorial that would go alongside that Use Case. Or a blog post indeed. Or a how-to.
I don't assume a predefined data registry. What I propose is something like: if you write unit tests at first, you can be safer when new data comes along. This safety may mean different things in different cases, but it's similar to software development, where unit tests are used to check for regressions (in performance) and updated requirements.

I think a tabular, structured dataset may be better suited to show this. Suppose we built a logistic regression model on some data with 10% missing values in a column and used some method to fill them. Now a new dataset arrives with 80% of the values in that column missing. We need to be aware of this before supplying the data to the model for training. "Writing unit tests for your assumptions (in DVC) saves a lot of headaches later" is the story I would like to tell. We can automate it and tell this in CML as well, but for this first instance, I would focus on the problem manually.
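For example, a minimal sketch of such a "data unit test" (the file name, column name, and threshold are all made up for illustration) could just assert the missing-value assumption before any retraining happens:

```python
import sys
import pandas as pd

# Assumption baked into the original model: at most ~10% of `income` is missing.
MAX_MISSING_RATIO = 0.10

def check_missing_ratio(path: str, column: str, max_ratio: float) -> bool:
    df = pd.read_csv(path)
    ratio = df[column].isna().mean()  # fraction of missing values in the column
    print(f"{column}: {ratio:.1%} missing (allowed: {max_ratio:.0%})")
    return ratio <= max_ratio

if __name__ == "__main__":
    # Fail loudly before the new data ever reaches model training.
    if not check_missing_ratio("data/new.csv", "income", MAX_MISSING_RATIO):
        sys.exit("Data assumption violated: too many missing values")
```

Run manually (or later as a pipeline/CI step), this would catch the jump from 10% to 80% missing values before it silently degrades the model.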
@shcheklein I think applying traditional software terms like "unit testing" and "production" to these ML scenarios is what's confusing me. Here's what I had in mind related to the diagram above:
Yep, I guess that's why I was thinking more about the dev phase, rather than monitoring.
Yep. It's stretching it a bit, though. I would still call this the production phase, and it indeed requires monitoring (like any production engineering system; running some data tests here is the data engineering realm for me).
So, if we're talking more about data engineering here (automated pipelines, prod monitoring, etc.), how do you see DVC helping?
These are the key words I see above. From my (non-ML) PoV it seems like a very narrow thing (maybe a best practice?). I also wonder what DVC's role is here, other than codifying a pipeline stage. The broader topic of data quality may be interesting though, e.g. as mentioned above: releasing data like code via PRs with tests, QA against staging/prod models, etc. It may relate to the existing Data Registry and Model Registry (idea, see #2490 (comment)) cases.
I think DVC's role here (at least in the "unit testing a data registry" scenario) is that it kinda implies Git flow: data "is" in Git, a PR should be made to update it, CI can be run with access to that data, it can show you pass/fail, etc.
In automated pipelines, those may be DVC pipelines that include data validation stages. Failures may trigger alerts and stop automatic deployment of updated models. Using DVC not only makes it easy to automate this pipeline, but also to check out the data and outputs from failed pipeline runs for further analysis.

In prod monitoring, it's less clear whether DVC has value. I think it potentially has some if the production data and model are tracked by DVC, since it takes care of provenance, so monitoring results can be tied back directly to specific versions of data and models.
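To make the "validation stage" idea concrete, here's a rough sketch (column names and dtypes are hypothetical): a schema check that exits non-zero on failure. Declared as an early stage in `dvc.yaml` with the dataset as a dependency, its failure would stop `dvc repro` before the training and deployment stages run:

```python
import sys
import pandas as pd

# Schema the downstream training stage expects; purely illustrative.
EXPECTED_COLUMNS = {"age": "int64", "income": "float64", "label": "int64"}

def validate_schema(path):
    df = pd.read_csv(path)
    errors = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors

if __name__ == "__main__":
    problems = validate_schema("data/incoming.csv")
    if problems:
        # A non-zero exit code makes this DVC stage fail, so stages that
        # depend on it (training, deployment) never run.
        sys.exit("\n".join(problems))
```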
Makes sense @dberenbaum. It's a bit of an indirect value (relative to this specific topic): it's not that DVC helps you with these "data tests" themselves, it's more that DVC could be used instead of Airflow (or alongside it) and cover some of those scenarios that a data engineering pipeline would cover. In this case it would be good to start with a "DVC for ETL"/"DVC for production pipelines"/"need a better name" case? Mention benefits like:
And then write short tutorials/examples on how this can be applied, including the possibility of doing unit tests on data (e.g. using libraries like Great Expectations).
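For instance, a tiny sketch with Great Expectations (using its classic pandas-flavored API; the library's interface has changed over time, so treat this as illustrative rather than a reference, and the file/column names are made up):

```python
import great_expectations as ge

# Wraps the CSV in a pandas-like dataset that gains `expect_*` methods.
df = ge.read_csv("data/incoming.csv")

# At least 90% of `income` values must be non-null, mirroring the
# "at most ~10% missing" assumption discussed above.
result = df.expect_column_values_to_not_be_null("income", mostly=0.9)

if not result.success:
    raise SystemExit("Expectation failed: too many nulls in `income`")
```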
Overlaps a bit with "DVC in Production". UPDATE: Mentioned in #2544 in case you want to close this for consolidation purposes.
In Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX), the authors state:
DVC can be used to write unit tests on data retrieved from the wild. It can check basic statistics and distributions, and clean up and sanitize the data before making it available to the model. Models usually have (implicit) assumptions about the distribution of the data, and when these change (data drift), the model doesn't perform well.
A blog post or use case (UC) document about this may be useful.
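As a sketch of the kind of distribution check described above (the paths, column, and threshold are placeholders), a two-sample Kolmogorov-Smirnov test could compare a feature's distribution in the original training data against newly collected data and flag drift before retraining:

```python
import sys
import pandas as pd
from scipy.stats import ks_2samp

# Reference data the model was trained on vs. freshly collected data.
reference = pd.read_csv("data/train.csv")["income"].dropna()
incoming = pd.read_csv("data/new.csv")["income"].dropna()

statistic, p_value = ks_2samp(reference, incoming)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")

# A low p-value suggests the two samples come from different distributions.
if p_value < 0.01:
    sys.exit("Possible data drift detected; review before retraining")
```

Tracked as a DVC stage with both datasets as dependencies, such a check would only rerun when one of them changes, and failed runs would be easy to reproduce and inspect.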