
output/remote: get rid of NamedCache #6008

Merged — 30 commits, Jun 2, 2021
Conversation

@pmrowla (Contributor) commented May 14, 2021

prerequisite for #829

  • Updates our used cache collection and data cloud methods to just use collections of HashFile/Tree objects.
  • status -c only displays a single revision suffix for path names instead of every possible one (see the discussion in this PR)
  • Does not modify any other cloud status/push/pull behavior
    • eventually we can just use the tree objects directly instead of the multiple dicts/sets of hashes for directory+file statuses that are currently used, but that will come in a later PR

TODO in followup PRs:

  • refactor external deps to behave like objects (instead of the current RepoPair behavior)
  • migrate cloud status into ODB
  • unify remote upload/download and local cache save into single odb.save

@pmrowla pmrowla self-assigned this May 14, 2021

return ret
def get_used_external(self, **kwargs) -> Dict["RepoPair", str]:
pmrowla (Contributor, Author):

seems like we probably want to split the idea of "used objects" and "used external deps" all the way up our call stack, but for now it's only done in output (with the stage/repo-level used_cache functions returning tuples of (used_objects, used_external_deps))
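A rough sketch of this split (all names and types here are simplified stand-ins, not DVC's actual API): instead of one NamedCache holding both kinds of usage, the collection function returns used objects and used external deps side by side.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical simplified stand-ins for DVC's HashFile and RepoPair types.
HashFile = str               # an object identified by its hash
RepoPair = Tuple[str, str]   # (repo URL, revision)

def get_used_cache(
    outputs: List[dict], deps: List[dict]
) -> Tuple[Set[HashFile], Dict[RepoPair, Set[str]]]:
    """Collect used objects and used external deps separately,
    instead of mixing both into a single NamedCache."""
    used_objs: Set[HashFile] = set()
    used_external: Dict[RepoPair, Set[str]] = {}
    for out in outputs:
        used_objs.add(out["hash"])
    for dep in deps:
        pair = (dep["repo"], dep["rev"])
        used_external.setdefault(pair, set()).add(dep["path"])
    return used_objs, used_external
```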

# Validate our index by verifying all indexed .dir hashes
# still exist on the remote
dir_md5s = set(dir_objs.keys())
pmrowla (Contributor, Author):

all of these dir_md5/file_md5 sets/dicts should be dropped in favor of just using the tree objects, but I don't want to touch this functionality yet in this PR
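The index-validation step quoted above can be sketched roughly like this (an illustrative simplification, not DVC's actual implementation):

```python
def validate_index(dir_objs: dict, remote_hashes: set) -> dict:
    """Validate the index by verifying that all indexed .dir hashes
    still exist on the remote; drop entries whose hash is gone.

    dir_objs: mapping of .dir md5 -> tree object
    remote_hashes: set of md5s actually present on the remote
    """
    dir_md5s = set(dir_objs)
    missing = dir_md5s - set(remote_hashes)
    return {md5: obj for md5, obj in dir_objs.items() if md5 not in missing}
```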

@pmrowla pmrowla force-pushed the drop-named-cache branch 2 times, most recently from 4c0ea6c to 29699e0 on May 25, 2021 07:36
Comment on lines +128 to +138
def filter(
self, odb: "ObjectDB", prefix: Tuple[str], copy: bool = False
) -> Optional[HashFile]:
"""Return filter object(s) for this tree.

If copy is True, returned object will be a Tree containing
filtered entries, but with hash_info copied from the original tree.

If copy is False, returned object will be a raw HashFile or Tree with
newly computed hash_info for the filtered object.
"""
pmrowla (Contributor, Author):

The current behavior for filter is to return a new tree containing the filtered set of trie entries, and a newly computed digest for the entire tree. These filtered trees are used when we do granular stage/output operations (and are related to #5988)

For collecting used dir cache, what we really want is a filtered version of self which contains the original self.hash_info, but with the filtered trie entries.

The use of copy is kind of ugly right now, but I'm also not sure that we really need 2 different filter methods?
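The two behaviors described above can be sketched as follows (a simplified toy model; the real `Tree.filter` in DVC takes an ODB and returns HashFile/Tree objects, and the digest here is a stand-in):

```python
from typing import Optional, Tuple

class Tree:
    """Toy tree: entries maps path tuples to objects."""

    def __init__(self, entries: dict, hash_info=None):
        self.entries = entries
        self.hash_info = hash_info

    def digest(self):
        # stand-in for recomputing the tree hash from its entries
        self.hash_info = hash(frozenset(self.entries))

    def filter(self, prefix: Tuple[str, ...], copy: bool = False) -> Optional["Tree"]:
        filtered = {
            key: obj for key, obj in self.entries.items()
            if key[: len(prefix)] == prefix
        }
        if not filtered:
            return None
        tree = Tree(filtered)
        if copy:
            # keep the original tree's digest (used-cache collection case)
            tree.hash_info = self.hash_info
        else:
            # newly computed digest for the filtered object (granular ops case)
            tree.digest()
        return tree
```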

Member:

copy does not make it clear what it copies though.

@pmrowla commented May 25, 2021

Still need to look into how this affects UI for push/pull/status

@pmrowla commented May 27, 2021

@efiop what's in this PR can be reviewed, but I don't think we should merge it yet: as it stands, it makes status -c basically unusable, since we no longer track the NamedCache hash -> pathname mappings

ex:

$ dvc status -A -c
        deleted:            258622b1688250cb619f3c9ccaefb7eb
        deleted:            c157a79031e1c40f85931829bc5fc552
        deleted:            d3b07384d113edec49eaa6238ad5ff00

@pmrowla commented May 27, 2021

As discussed, what status -c for objects should display is something along the lines of

dvc status -A -c
        new: 2 objects - 12345 byte(s)
        deleted:  1 object - 12345 byte(s)
        missing: ...

Where we show a count of objects which need to be pushed/pulled, along with the total data size for those objects.

In the actual returned data (and --show-json output) for cloud status, we would have a dict where the status is mapped to a list of object ids (with sizes).
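A hypothetical shape for that returned data (field names here are assumptions for illustration, not the final schema):

```python
# Proposed cloud-status result: status name -> list of object ids with sizes.
status = {
    "new": [
        {"oid": "258622b1688250cb619f3c9ccaefb7eb", "size": 6144},
        {"oid": "c157a79031e1c40f85931829bc5fc552", "size": 6201},
    ],
    "deleted": [
        {"oid": "d3b07384d113edec49eaa6238ad5ff00", "size": 12345},
    ],
}

def summarize(status: dict) -> list:
    # Render "new: 2 object(s) - 12345 byte(s)" style summary lines
    return [
        f"{state}: {len(objs)} object(s) - {sum(o['size'] for o in objs)} byte(s)"
        for state, objs in status.items()
    ]
```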


I'm going to open a follow up PR (rebased on top of this one) that will move cloud status into the ODB, and will handle the status -c UI changes.

This NamedCache PR shouldn't be considered mergeable until both PRs are ready, so that status -c stays usable in master (or at least it shouldn't be merged while there's a pending release)

@pmrowla pmrowla marked this pull request as ready for review May 27, 2021 07:26
@pmrowla pmrowla marked this pull request as draft May 27, 2021 07:30
@pmrowla pmrowla force-pushed the drop-named-cache branch from 1543070 to 82a4b85 on May 31, 2021 06:07
@efiop (Contributor) commented May 31, 2021

For the record: had a discussion and decided to try to not change the UI aspect in this PR, possibly by adding filename to the obj for now.

@pmrowla pmrowla force-pushed the drop-named-cache branch from 82a4b85 to 14dcca4 on June 1, 2021 08:34
@pmrowla pmrowla marked this pull request as ready for review June 1, 2021 08:36
@pmrowla pmrowla changed the title [WIP] get rid of NamedCache output/remote: get rid of NamedCache Jun 1, 2021
@pmrowla commented Jun 1, 2021

As discussed, collected used objects now have a HashFile.name field which will be populated with the output's relpath. The old brancher suffixes have still been removed from the name field.

edit: we have to keep some suffix since repo.status returns a dict with the name field used as keys. So without the suffix, we end up dropping any paths which overlap across multiple revisions.

As it stands, the output for status -c will now include the relpath and suffix for the most recent commit/branch containing an object (instead of previous behavior where we listed the name+commit for every commit that contained an object):

$ dvc status -A -c
        deleted:            foo (workspace)
        deleted:            dir/baz (workspace)
        deleted:            foo (143bf86dbff4cc1872803ff58dc95ce1acf502db)

Comment on lines +377 to +384
def _add_suffix(objs, suffix):
from dvc.objects.tree import Tree

for obj in objs:
obj.name += suffix
if isinstance(obj, Tree):
for _, entry_obj in obj:
entry_obj.name += suffix
pmrowla (Contributor, Author):

This workaround is needed to preserve current status -c and repo.status() behavior.

Assuming that we want to preserve the existing status -c behavior going forward, it seems that what we will need is for repo.status() to handle returning the mapping of objects or object IDs to the set of revision+path pairs that point to each object, similar to what NamedCache did before.

But it should be handled at the repo level rather than passing string suffixes all the way down to the output collect_used_cache methods like we were doing before.
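A sketch of that repo-level mapping (names and shapes here are hypothetical, for illustration only):

```python
from collections import defaultdict

def collect_names(revisions):
    """Map each object id to every (revision, path) pair that points to it,
    at the repo level, instead of threading string suffixes down to the
    output-level used-cache collection.

    revisions: iterable of (rev, outputs) where outputs is {path: object_id}.
    """
    names = defaultdict(set)
    for rev, outputs in revisions:
        for path, oid in outputs.items():
            names[oid].add((rev, path))
    return names
```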

@pmrowla pmrowla added refactoring Factoring and re-factoring skip-changelog Skips changelog labels Jun 1, 2021
@pmrowla pmrowla requested a review from efiop June 1, 2021 09:43
@pmrowla pmrowla force-pushed the drop-named-cache branch from b172dc1 to 31ed794 on June 1, 2021 10:23
dvc/objects/file.py — outdated review thread (resolved)
dvc/output.py — outdated review thread (resolved)
Comment on lines +262 to +265
objs, external = stage.get_used_cache(*args, **kwargs)
used_objs.update(objs)
for repo_pair, paths in external:
used_external[repo_pair].update(paths)
Member:

There should not be any external cache here I think.

dvc/stage/cache.py, lines 37 to 43 (at c7a3a00):

for dep in stage.deps:
if not (dep.scheme == "local" and dep.def_path and dep.get_hash()):
return False
for out in stage.outs:
if out.scheme != "local" or not out.def_path or out.persist:
return False

Comment on lines +321 to +326
for obj in used:
used_hashes.add(obj.hash_info.value)
if isinstance(obj, Tree):
used_hashes.update(
entry_obj.hash_info.value for _, entry_obj in obj
)
Member:

How about we put this in objects.iterobjs()/objects.iterhashes()?
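A sketch of the suggested helper (the name `iterhashes` and the toy classes here are assumptions; DVC's actual HashFile/Tree types differ):

```python
class HashInfo:
    def __init__(self, value):
        self.value = value

class HashFile:
    def __init__(self, value):
        self.hash_info = HashInfo(value)

class Tree(HashFile):
    def __init__(self, value, entries):
        super().__init__(value)
        self._entries = entries   # list of (key, HashFile) pairs

    def __iter__(self):
        return iter(self._entries)

def iterhashes(objs):
    """Yield the hash of each object, descending into Tree entries."""
    for obj in objs:
        yield obj.hash_info.value
        if isinstance(obj, Tree):
            for _key, entry in obj:
                yield entry.hash_info.value
```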

Contributor:

@skshetry That one will no longer be necessary after the next PR, as push/pull will no longer work with raw md5 strings, but rather through odb.add, which goes through objects one-by-one.


from voluptuous import Required

from dvc.path_info import PathInfo

from .base import Dependency


class RepoPair(NamedTuple):
Member:

RepoPair might be confused with pair of Repo, whereas this is more like freezing a repo with a revision. We use *Info for these cases, so I'd suggest RepoInfo here.

pmrowla (Contributor, Author) commented Jun 3, 2021:

We were already using RepoDependency.repo_pair, so it seemed to me that we should just have RepoPair as the named type. Either way, this is likely going away entirely in the future, since we will move to handling external deps as a special object type (RepoDependency will have an overridden get_obj() method and will then be collected the same way as other objects)

dvc/remote/base.py — outdated review thread (resolved)
@efiop efiop merged commit eb46c03 into iterative:master Jun 2, 2021
@efiop commented Jun 2, 2021

@pmrowla Let's address the review in the followup, please. Merging so you won't have to rebase it for path_info removal. Thank you! πŸ™

@pmrowla pmrowla deleted the drop-named-cache branch June 3, 2021 00:22