output/remote: get rid of NamedCache #6008
Conversation
dvc/output/base.py (Outdated)

    return ret

    def get_used_external(self, **kwargs) -> Dict["RepoPair", str]:
It seems like we probably want to split the idea of "used objects" and "used external deps" all the way up our call stack, but for now it's only done in output (with the stage/repo-level `used_cache` functions returning tuples of `(used_objects, used_external_deps)`).
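A minimal sketch of the tuple-returning shape described above. The `get_used_cache` function and the dict-based inputs here are hypothetical stand-ins, not the real DVC `Stage`/`Output` classes:

```python
from typing import Dict, Set, Tuple


def get_used_cache(outputs) -> Tuple[Set[str], Dict[tuple, Set[str]]]:
    """Collect (used_objects, used_external_deps) in one pass."""
    used_objects: Set[str] = set()
    used_external: Dict[tuple, Set[str]] = {}
    for out in outputs:
        if out.get("external"):
            # external deps are keyed by a (repo_url, rev) pair
            key = (out["repo_url"], out["rev"])
            used_external.setdefault(key, set()).add(out["path"])
        else:
            used_objects.add(out["md5"])
    return used_objects, used_external


objs, external = get_used_cache(
    [
        {"md5": "abc123", "external": False},
        {
            "repo_url": "https://example.com/repo.git",
            "rev": "main",
            "path": "data.csv",
            "external": True,
        },
    ]
)
```

Returning both collections from one call keeps a single traversal of the stages while letting callers ignore whichever half they don't need.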
    # Validate our index by verifying all indexed .dir hashes
    # still exist on the remote
    dir_md5s = set(dir_objs.keys())
All of these `dir_md5`/`file_md5` sets/dicts should be dropped in favor of just using the tree objects, but I don't want to touch this functionality yet in this PR.
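To illustrate the direction suggested above, here is a toy `Tree` (a stand-in, not the real `dvc.objects` type) showing how the md5 sets could be derived on the fly from tree objects instead of being maintained as separate sets/dicts:

```python
from typing import Dict


class Tree:
    def __init__(self, hash_value: str, entries: Dict[str, str]):
        self.hash_value = hash_value  # the .dir hash for this tree
        self.entries = entries        # relative path -> file md5

    def file_md5s(self):
        return set(self.entries.values())


dir_objs = {
    "11111111.dir": Tree("11111111.dir", {"a": "aaaa", "b": "bbbb"}),
}

# Equivalent of `dir_md5s = set(dir_objs.keys())`, but derived from objects:
dir_md5s = {tree.hash_value for tree in dir_objs.values()}
file_md5s = set().union(*(t.file_md5s() for t in dir_objs.values()))
```

Deriving the sets at the point of use means there is no duplicate bookkeeping to keep in sync with the trees themselves.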
Force-pushed from 4c0ea6c to 29699e0
    def filter(
        self, odb: "ObjectDB", prefix: Tuple[str], copy: bool = False
    ) -> Optional[HashFile]:
        """Return filter object(s) for this tree.

        If copy is True, returned object will be a Tree containing
        filtered entries, but with hash_info copied from the original tree.

        If copy is False, returned object will be a raw HashFile or Tree with
        newly computed hash_info for the filtered object.
        """
The current behavior for `filter` is to return a new tree containing the filtered set of trie entries and a newly computed digest for the entire tree. These filtered trees are used when we do granular stage/output operations (and are related to #5988).

For collecting used dir cache, what we really want is a filtered version of `self` which contains the original `self.hash_info`, but with the filtered trie entries.

The use of `copy` is kind of ugly right now, but I'm also not sure that we really need two different filter methods.
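The two modes can be contrasted with a toy illustration (a deliberately simplified stand-in, not the real `dvc.objects.tree.Tree`): `copy=True` keeps the original digest, `copy=False` recomputes one over the filtered entries.

```python
import hashlib


class Tree:
    def __init__(self, entries, hash_info=None):
        self.entries = dict(entries)  # key path tuple -> file hash
        self.hash_info = hash_info or self._digest()

    def _digest(self):
        # deterministic digest over the sorted trie entries
        payload = repr(sorted(self.entries.items())).encode()
        return hashlib.md5(payload).hexdigest()

    def filter(self, prefix, copy=False):
        filtered = {
            key: h
            for key, h in self.entries.items()
            if key[: len(prefix)] == prefix
        }
        if copy:
            # keep the original tree's digest
            return Tree(filtered, hash_info=self.hash_info)
        # recompute the digest for the filtered subset
        return Tree(filtered)


tree = Tree({("dir", "a"): "h1", ("dir", "b"): "h2", ("other",): "h3"})
same = tree.filter(("dir",), copy=True)
fresh = tree.filter(("dir",), copy=False)
```

Here `same.hash_info` equals the original tree's digest even though it holds only the filtered entries, while `fresh.hash_info` is a new digest over the subset.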
`copy` does not make it clear what it copies, though.
Still need to look into how this affects UI for push/pull/status.
@efiop so what's in this PR can be reviewed, but I don't think we should merge it yet.

As discussed, we show a count of objects which need to be pushed/pulled, along with the total data size for those objects. I'm going to open a follow-up PR (rebased on top of this one) that will move cloud status handling, so this NamedCache PR shouldn't be considered mergeable until both PRs are ready.
For the record: had a discussion and decided to try not to change the UI aspect in this PR, possibly by adding a filename to the obj for now.
As discussed, collected used objects now carry a name. Edit: we have to keep some suffix for now, since the current `status -c` output relies on it.
    def _add_suffix(objs, suffix):
        from dvc.objects.tree import Tree

        for obj in objs:
            obj.name += suffix
            if isinstance(obj, Tree):
                for _, entry_obj in obj:
                    entry_obj.name += suffix
This workaround is needed to preserve current `status -c` and `repo.status()` behavior.

Assuming that we want to preserve the existing `status -c` behavior going forward, it seems like what we will need is for `repo.status()` to handle returning the mapping of objects or object IDs to the set of revision+path pairs that point to each object, similar to what NamedCache did before. But it should be handled at the repo level, rather than passing string suffixes all the way down to the output `collect_used_cache` methods like we were doing before.
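A hypothetical sketch of that repo-level mapping: object IDs mapped to the (revision, path) pairs that reference them, instead of encoding the revision as a string suffix on each object name. The `build_status_index` helper and its input shape are assumptions for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_status_index(
    used: Iterable[Tuple[str, str, str]]
) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each object ID to the set of (revision, path) pairs using it."""
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for obj_id, rev, path in used:
        index[obj_id].add((rev, path))
    return index


index = build_status_index(
    [
        ("abc123", "main", "data.csv"),
        ("abc123", "v1.0", "data.csv"),
        ("def456", "main", "model.pkl"),
    ]
)
```

With this shape, one object shared by several revisions appears once in the index with multiple (revision, path) entries, which is what the suffix hack approximates today.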
    objs, external = stage.get_used_cache(*args, **kwargs)
    used_objs.update(objs)
    for repo_pair, paths in external:
        used_external[repo_pair].update(paths)
There should not be any external cache here, I think:

Lines 37 to 43 in c7a3a00

    for dep in stage.deps:
        if not (dep.scheme == "local" and dep.def_path and dep.get_hash()):
            return False
    for out in stage.outs:
        if out.scheme != "local" or not out.def_path or out.persist:
            return False
    for obj in used:
        used_hashes.add(obj.hash_info.value)
        if isinstance(obj, Tree):
            used_hashes.update(
                entry_obj.hash_info.value for _, entry_obj in obj
            )
How about we put this in `objects.iterobjs()`/`objects.iterhashes()`?
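A sketch of what the suggested helper could look like; `iterhashes` is the hypothetical name from the comment above, and the `HashFile`/`Tree` classes below are minimal stand-ins rather than the real `dvc.objects` types:

```python
class HashInfo:
    def __init__(self, value):
        self.value = value


class HashFile:
    def __init__(self, value):
        self.hash_info = HashInfo(value)


class Tree(HashFile):
    def __init__(self, value, entries):
        super().__init__(value)
        self.entries = entries  # list of (key, HashFile) pairs

    def __iter__(self):
        return iter(self.entries)


def iterhashes(objs):
    """Yield every hash reachable from objs, descending into Trees."""
    for obj in objs:
        yield obj.hash_info.value
        if isinstance(obj, Tree):
            for _, entry_obj in obj:
                yield entry_obj.hash_info.value


used = [Tree("d.dir", [(("a",), HashFile("h1"))]), HashFile("h2")]
used_hashes = set(iterhashes(used))
```

Factoring the traversal into a generator keeps the Tree-vs-HashFile branching in one place instead of at each call site.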
@skshetry That one will no longer be necessary after the next PR, as push/pull will no longer work with raw md5 strings, but rather through odb.add, which goes through objects one-by-one.
    from voluptuous import Required

    from dvc.path_info import PathInfo

    from .base import Dependency


    class RepoPair(NamedTuple):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`RepoPair` might be confused with a pair of Repos, whereas this is more like freezing a repo with a revision. We use `*Info` for these cases, so I'd suggest `RepoInfo` here.
We were already using `RepoDependency.repo_pair`, so it seemed to me that we should just have `RepoPair` as the named type. Either way, this is likely going away entirely in the future, since we will move to handling external deps as a special object type (so RepoDependency will have an overridden `get_obj()` method, and then will be collected the same way as other objects).
@pmrowla Let's address the review in the followup, please. Merging so you won't have to rebase it for …
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
prerequisite for #829

NamedCache usage is replaced with HashFile/Tree objects.

status -c only displays a single revision suffix for path names instead of every possible one (see #6008 (comment))

TODO in followup PRs: