api: high level dvc.api import/export meets data catalog #2719
Good one, @Suor. Thanks! Just to start the discussion, the first question I would clarify is why DVC-files are not enough and why we need to specify (effectively) types and a list of exports in a new file. And a related one: if the DVC-file is enough, can we make the regular existing dvc.api aware of types and not introduce new APIs?
@shcheklein in the general case an exported object might depend on several outs from different .dvc files. Even if we limit ourselves to a single .dvc, I don't like complicating stage files. We also start something representing a catalog with that yaml file. Existing api calls return strings or file handles; if they suddenly started to return something else, that would be crazy.
(DISCLAIMER: everything below is not an attempt to say that @Suor's approach is wrong and that we should go with DVC-files alone; please read it as an attempt to challenge some assumptions, get a few more examples, and better understand the decisions we make)
Would love to have real-life examples before we make this decision. I would not overgeneralize until it's necessary. @dmpetrov you should have some from the model zoo experiment, and maybe from some other experiments? Would be great to see more use cases here, more examples of what we are trying to solve, etc.
Not sure I get that point? Do you mean something that improves discoverability? Gives a way to a higher-level […]?
That's true. I didn't mean reusing the same API calls literally. Some modification (or maybe even another call) might be required. Probably this is secondary to the first question: what is the best way to export artifacts.
I will let someone else do the input )
Looks like a good API design) Some questions: […]
Cool stuff, @Suor 👍
With the example yaml file:

```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```

summoning `face_rec_model` would roughly do:

```python
import sys

sys.path.insert(0, tmp_dir)  # tmp_dir is the temporary clone of the repo
# This is extracted from "call" and "deps"
import export
artifact = export.make_face_rec("data/face_rec_model.dat", "data/extra.dat")
sys.path.pop(0)
return artifact
```
Wouldn't it be "read dvc-artifacts.yml" and then checkout to the revision? Checking out before reading the file seems backwards.
If you want to access […]. The overlaps between […].
Yes, this is the issue. But if you don't have […]. It is possible to separate the catalog from the data repos:

```yaml
# dvc-artifacts.yml in https://github.com/company/dvc-catalog
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    repo: https://github.com/company/face-rec
    rev: 1.1  # Or should we allow specifying this in a .summon() call
    call: export.make_face_rec  # refers to export.py in `dvc-catalog` repo
    deps: "data/face_rec_model.dat"  # refers to a file managed by `face-rec` repo
  - name: cats_dogs_model
    description: Tensorflow cats vs dogs model
    repo: https://github.com/company/cats-dogs  # other repo
    rev: v2
    call: ...
    deps: ...
```

This way you can reference artifacts from external sources and wrap them into summonable form. This could be a future extension. It does, however, imply at least two source repos (catalog and data) and might be overkill for many use cases. Lots of questions also arise here, like distinguishing source and catalog revs, code refs in […], etc. This extension possibility is one more argument for separating artifact declaration files from stage files.
It is. I haven't come up with anything better for now.
@Suor thank you for initiating the discussion! This is a really great direction for DVC. In this issue, only the data reuse/getting part is discussed, which is the most important part from the implementation point of view. I think it makes sense to discuss all the data catalog scenarios here to make better requirements for this feature. But first, what makes a set of files a data catalog: […]
Data catalog scenarios
Example
A special use case: a model zoo. Ideally, we should be able to create a repository with many models like https://github.com/cadene/pretrained-models.pytorch, with all the information available through the DVC API. So, instead of reading a text README we can use a programmatic API:
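(A hedged sketch of what that might look like; `dvc.api.summon()` is the API proposed in this issue, and the repo URL and artifact name are made up for illustration.)

```python
import dvc.api

# Summon a ready-to-use model object by name instead of following README instructions.
model = dvc.api.summon(
    "resnet18",                                   # artifact name from the zoo's dvc-artifacts.yml
    repo="https://github.com/example/model-zoo",  # hypothetical model zoo repo
)
```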
Analogs
https://intake.readthedocs.io/en/latest/

Interoperability and supported types
@Suor proposes implementing everything through Python, which limits interoperability. I'd suggest creating a more universal approach with different types/plugins: there are dozens of data frame types, and each can be used from any language. See https://intake.readthedocs.io/en/latest/plugin-directory.html. Python can be one of the supported types/plugins, and all the ML models (like PyTorch) will be used through this type. So, we won't block future interoperability for common types. Initially, only two types are needed: CSV and Python (which covers the rest of the types initially).

Open questions
[…]
One of the technical challenges that I see today is importing/getting/summoning multi-file resources. For example, to import a PyTorch model you need to get: […]
@Suor What would be the best way to overcome this issue? Do we need a specific API call? Should we clone the repo and keep it until the importing and instantiation happen? You should have more context on this. Could you please specify this and even open an issue?
@dmpetrov depending on several artifacts is demonstrated in the very first example; they need to be specified, though. This is the only way for dvc to know what it should fetch and checkout. I don't know any reliable way to do that implicitly; there are a few ways to do it unreliably though, e.g. monkeypatching […]. Depending on any files managed by git is not an issue with the current implementation, we do a clone anyway. I guess we will need to stick to that because people will tend to forget some deps.
This would look cool in a Jupyter notebook.
I guess we can implement this on top of dvc pipelines somehow. We can also show/visualize the whole pipeline: let people see the data and the code used, and all the data transformations. This looks like an advanced thing to me.
I don't think this is possible without breaking a run or an import stage which has that model as an out. It is only possible for add stages, which means this model's pipeline is already broken and we don't know how it was built, etc.
We discussed this, and implementing it in whichever way limits interoperability or extensibility. I see how a more declarative way would facilitate eyeballing and maybe some alternative implementations for a very limited number of data types like csv. I'm still not sure this is a good idea to have from the start, e.g. we'd have two ways of doing things instead of one.
Different scenarios of data catalog usage. We can think about the scenarios as requirements for this issue.

Models zoo
A DVC repo with pretrained pytorch models (resnet18, resnet50, resnet101): https://github.com/dmpetrov/dvc-pretrained-resnet

Usage:
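(A sketch of the intended usage, mirroring the summon call proposed later in this thread.)

```python
import dvc.api

# Summon one of the pretrained models by its artifact name.
model = dvc.api.summon(
    "resnet50",
    repo="https://github.com/dmpetrov/dvc-pretrained-resnet",
)
```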
Note, […] Example of using model zoos: https://github.com/dmpetrov/dvc_for_resnet-50

CSV file
Repo: https://github.com/iterative/df_sea_ice

Usage:
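(A sketch; the call shape follows the API being discussed in this thread, and the artifact name `sea_ice` is an assumption.)

```python
import dvc.api

# Summon the CSV artifact as a ready-to-use pandas DataFrame.
df = dvc.api.summon("sea_ice", repo="https://github.com/iterative/df_sea_ice")
```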
Note, the instantiation logic should be implemented as a library (dvc_cat?). The same way, we can provide a library for […].

Complicated CSV file
Repo: https://github.com/iterative/df_sea_ice_no_header

Usage:
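(A sketch under the assumption that reader options for the headerless file can be passed as parameters; the params mechanism is discussed later in this thread.)

```python
import dvc.api

# A headerless CSV needs extra reader options; here they are passed at summon time.
df = dvc.api.summon(
    "sea_ice",
    repo="https://github.com/iterative/df_sea_ice_no_header",
    params={"header": None},  # hypothetical override of the artifact's default params
)
```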
Compound CSV file
To bring in only a slice of data. A repo example is not ready.

Usage: […]
The pattern can be more complicated: […]

@iterative/engineering please share your opinions about the scenarios. Is something missing?
We are on the same page in the big-picture part. Some thoughts, mostly small or even cosmetic, go below. We need to answer a couple of questions as I see it: […]
My answers: No and postpone. I would also like to keep […]:

```python
model = dvc.api.summon(
    'resnet50',  # artifact name
    repo='https://github.com/dmpetrov/dvc-pretrained-resnet',  # pass by name
    rev='v1.2'   # optional
)
```

CSV files
We still need to pass the artifact name:

```python
# I prefer `dvcx` for dvc export/extension, also short
df = dvcx.summon('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')
```

This, however, presupposes that we have some default way to summon csv files. We might want to avoid this and make it:

```python
# Read csv into dataframe
df = dvcx.to_df('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')
# Read csv into Numpy array
arr = dvcx.to_ndarray('sea_ice', repo='https://github.com/iterative/df_sea_ice_no_header')
```

Note that I didn't call it […].

Model Zoo vs Model Source
These are two scenarios which slightly contradict each other. In a "model source" scenario we have a model together with the data and code used to build it, and its metrics. All of these are stored in git, and we could potentially show them to the user, which is cool. However, if we […]. Then again, we may turn a model zoo into a model source by importing models from source repos, if those models are built within dvc repos of course.

On patterns: let's postpone that discussion, I see the value.
Quick question, why isn't it just […]?
I also think DVC shines best in the model source scenario, but building model registries/catalogs without code or data sounds like a useful thing to do, and it can be achieved with […].

UPDATE: Oh, I see this is already mentioned:
So it seems we agree the Model Zoo concept is most appropriate, as it can indirectly include model sources with DVC.
@Suor thank you for the great feedback and the details that you have noticed! I agree with most of the items you brought up. It looks like the major unanswered question is model source vs. model zoos. Metrics files and the number of artifact files are related to this, I believe. First, it looks like the scenario with a model source that was generated by data, code and a corresponding metrics file is a bit restrictive. In reality, metrics files can be created separately (custom or taken from previous runs) and DVC should not have control over this. From this point of view, the difference between a metrics file (csv/json) and an artifact file that we are inventing now is not as dramatic as it might look. Second, in an ideal scenario, an artifact yaml file should be generated from a dvc pipeline the same way as a metrics file. We will probably need a library to create metrics files and the artifact yaml. Even without artifacts it would be nice to have. So, why don't we even merge these two concepts? We can say that any output yml file which is marked as a metrics file is a dvc-artifact.
Disadvantage: you need to have a corresponding DVC-file even if you don't use data files at all. Create that file […]. What do you folks think about treating yaml files the same way as metric ones? PS: Also, we discussed how to get artifacts but have not discussed how artifacts can be imported. We might need a new command like […].
@jorgeorpinel For CSV we need code that can deserialize it. The code will depend on libraries like […]. In the […]
The artifact file is completely human-written; it has many things like a name, a description and a ref to human-written code. None of this can be generated. A metric, in contrast, is an output which will be rewritten by a command. So I don't see how these may be combined: […]
You first need to say what you mean by importing artifacts. For me artifacts are strictly in-memory objects. If you mean importing all their deps plus the instantiating code, then why do we need this?
So for a model zoo I see two options: […]

These are not mutually exclusive, but I would prefer starting with one.
A high-level consideration: do not mix human-written and machine-generated content in a single place. This inhibits its human writability and readability. P.S. I know we already do that in stage files, but it's hard to do anything about that now, and we have had issues with it.
@Suor your points about […]. However, the point about output (generating the artifact file) is great: it would create duplication (retrain a model if the model description changes, or modify it in two places and commit). It could be better to keep artifact files and metrics files separate. I don't think any change coordination is needed at this point; the user should decide. Also, it looks like the simplest decision from the implementation standpoint. PS: Let's discuss importing artifacts later. I think it is a useful feature. The only problem: it won't work for python imports, only for "catalog native" types such as csv. This discussion can be postponed till the first implementation.
Should we add […]?
I think a nice addition would be to pass parameters through summon:

```python
dvc.api.summon("say_hello", params={"name": "world"}, repo="https://github.com/blahblah")
```

This brings more flexibility: […]
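For illustration only, a hedged sketch of how such params might map onto the artifact file; the field names and defaults here are assumptions, not a settled format:

```yaml
# dvc-artifacts.yml (hypothetical)
artifacts:
  - name: say_hello
    call: greet.say_hello   # hypothetical glue function in greet.py
    params:
      name: stranger        # default; overridden by summon(..., params={"name": "world"})
    deps: []
```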
@MrOutis yes, parameter overwriting should be supported in the function. Sorry, it was not specified directly.
@MrOutis […]
@MrOutis for the items inside the yml you can use the term […]. These items have types. So, […]
I like the proposed terminology, @dmpetrov, will do the change :)
So […]?
Yes, def. docs! Please see iterative/dvc.org/pull/908 (although I'll probably open a separate issue for a more detailed […]).
Docs have a question mark because I see keeping everything in the README as an advantage for smaller projects, both for users and devs. Creating docs makes it slow, which is not good for a new, evolving project. It takes weeks to merge something into docs now.
@Suor README + good docstrings + examples should be enough, I agree. We'll move stuff that needs to be on dvc.org (for SEO, for awareness of summon's existence, etc.) after that.
Yep, we can try and address it together at the next retro.
Yeah, "docs" doesn't have to mean the dvc.org website. 2 of the 3 issues linked are about docstrings. READMEs are great too (and may need more love; do we ever review them?). But all these kinds of docs probably have different audiences and goals. Good topic for the retro indeed.
I checked the last 10 normal PRs in dvc.org. Average merge time is around 24 hours.
@Suor What is the status of this ticket? What are our next steps?
Closing the remaining points in favor of separate issues in dvcx: https://github.com/iterative/dvcx/issues/5 and https://github.com/iterative/dvcx/issues/6.
- what do you do for a living?
- oh, different things, import, export, family business
(any Hollywood smuggler and now a dvc dev)
This feature was first proposed by @dmpetrov and has been discussed several times here and there. Recent conversations/debates with @dmpetrov and @shcheklein resulted in a more or less defined approach to it, which I will lay out here.
The need: we want things to be more easily reused across repos, so we want some higher-level API than just `get()/read()/open()`, presumably some call which will return a ready-to-use dataframe, an instantiated model, etc.

The design
So here goes the design so far. The exporting side will need to list available things and write some instantiating glue code. Like:
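For instance (a sketch mirroring the example in the comments above; the yaml layout and file names are illustrative rather than final):

```yaml
# dvc-artifacts.yml
artifacts:
  - name: face_rec_model
    description: Tensorflow face recognition model
    call: export.make_face_rec
    deps: ["data/face_rec_model.dat", "data/extra.dat"]
```

```python
# export.py -- the instantiating glue code referenced by "call";
# the actual loading logic is up to the publisher.
def make_face_rec(model_path, extra_path):
    """Build and return a ready-to-use model object from the checked-out deps."""
    ...  # e.g. load weights from model_path / extra_path with the ML library of choice
```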
Then the importing side might simply do:
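(The artifact name and repo URL below are illustrative.)

```python
import dvc.api

model = dvc.api.summon("face_rec_model", repo="https://github.com/example/face-rec")
```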
Summon is a good (to my liking) and short name that refers to summoning a genie or a tesla :)
This will provide a flexible base to export/summon anything. Some questions arise about requirements, repetitive glue code, Python/R interoperability, and easing things for a publisher. I will go through these.
Requirements
Instantiating a model or a dataframe will require some libraries to be installed, like pandas or PyTorch, or even particular versions of them. In a basic scenario we might ignore that: in most cases a user will know what this is about and will have them installed already. The thing is, we can't provide it seamlessly, since installing python libs inside a `dvc.api.summon()` call would be surprising and might break some things. So deps management will be separate anyway. The possible way: […]

Glue code
There are many common scenarios, like exporting a csv file as a dataframe, which will produce repetitive glue code. This could be handled by providing common export functions in our `dvcx` (for dvc export) library. In this particular scenario one might even use `pandas.read_csv`
directly. There should be a way to parametrize those:Note that deps are passed as positional arguments here.
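A hedged sketch of such a parametrized entry; the field names and the `dvcx.to_df` helper are assumptions based on the discussion above:

```yaml
# dvc-artifacts.yml (illustrative)
artifacts:
  - name: sea_ice
    call: dvcx.to_df            # shared helper, e.g. a thin wrapper over pandas.read_csv
    params:
      header: null              # forwarded as keyword arguments
    deps: ["data/sea_ice.csv"]  # forwarded as positional arguments
```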
Python/R interoperability
This is a big concern for @dmpetrov. This will mostly work for simpler cases like dataframes; there is probably no practical way to make PyTorch models usable in R, though there might be for some simpler models like linear regression. Anyway, to solve those I suggest providing a `dvcx` library for R, sharing function and param names, so that the same yaml file will work for R too. This is, to my mind, a 2nd or even 3rd layer on top of the basic summon functionality, so it might be added later.

Making it easier for a publisher
Raised by @shcheklein. All these additional yaml/python files might stop someone from publishing things. I don't see this as an issue, since one can always fall back to plain `.read()/.open()/.get()` […].
So there is the "plain text alternative": one might just write somewhere in a readme "I have this and this" and provide snippets like:
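(The file path and repo URL are illustrative; `dvc.api.open()` is the existing low-level API.)

```python
import pandas as pd
import dvc.api

with dvc.api.open("data/sea_ice.csv", repo="https://github.com/example/repo") as f:
    df = pd.read_csv(f)
```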
This is easy to use, does not require any new concepts, and takes no additional effort from our side.
What's next?
In my opinion, the basic summon functionality could be implemented in the near future, as it has a high value/effort ratio and a more or less clear design.
Still, all of this was only discussed in my 1-1s with @dmpetrov and @shcheklein, so some input from the rest of the team would be highly desirable. So @efiop @MrOutis @pared @jorgeorpinel please join.