
output/remote: get rid of NamedCache #6008

Merged — 30 commits, Jun 2, 2021
Conversation

@pmrowla (Contributor) commented May 14, 2021

prerequisite for #829

  • Updates our used cache collection and data cloud methods to just use collections of HashFile/Tree objects.
  • status -c only displays a single revision suffix for path names instead of every possible one (see the discussion in this PR)
  • Does not modify any other cloud status/push/pull behavior
    • eventually we can just use the tree objects directly instead of the multiple dicts/sets of hashes for directory+file statuses that are currently used, but that will come in a later PR

TODO in followup PRs:

  • refactor external deps to behave like objects (instead of the current RepoPair behavior)
  • migrate cloud status into ODB
  • unify remote upload/download and local cache save into single odb.save

@pmrowla pmrowla self-assigned this May 14, 2021

return ret
def get_used_external(self, **kwargs) -> Dict["RepoPair", str]:
pmrowla (Contributor, Author):

seems like we probably want to split the idea of "used objects" and "used external deps" all the way up our call stack, but for now it's only done in output (with the stage/repo-level used_cache functions returning tuples of (used_objects, used_external_deps))
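A rough sketch of this split (all names and types here are simplified stand-ins, not DVC's actual API): instead of one NamedCache holding both kinds of usage, the collection function returns used objects and used external deps side by side.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical simplified stand-ins for DVC's HashFile and RepoPair types.
HashFile = str               # an object identified by its hash
RepoPair = Tuple[str, str]   # (repo URL, revision)

def get_used_cache(
    outputs: List[dict], deps: List[dict]
) -> Tuple[Set[HashFile], Dict[RepoPair, Set[str]]]:
    """Collect used objects and used external deps separately,
    instead of mixing both into a single NamedCache."""
    used_objs: Set[HashFile] = set()
    used_external: Dict[RepoPair, Set[str]] = {}
    for out in outputs:
        used_objs.add(out["hash"])
    for dep in deps:
        pair = (dep["repo"], dep["rev"])
        used_external.setdefault(pair, set()).add(dep["path"])
    return used_objs, used_external
```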

# Validate our index by verifying all indexed .dir hashes
# still exist on the remote
dir_md5s = set(dir_objs.keys())
pmrowla (Contributor, Author):

all of these dir_md5/file_md5 sets/dicts should be dropped in favor of just using the tree objects, but I don't want to touch this functionality yet in this PR
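The index-validation step quoted above can be sketched roughly like this (an illustrative simplification, not DVC's actual implementation):

```python
def validate_index(dir_objs: dict, remote_hashes: set) -> dict:
    """Validate the index by verifying that all indexed .dir hashes
    still exist on the remote; drop entries whose hash is gone.

    dir_objs: mapping of .dir md5 -> tree object
    remote_hashes: set of md5s actually present on the remote
    """
    dir_md5s = set(dir_objs)
    missing = dir_md5s - set(remote_hashes)
    return {md5: obj for md5, obj in dir_objs.items() if md5 not in missing}
```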

@pmrowla pmrowla force-pushed the drop-named-cache branch 2 times, most recently from 4c0ea6c to 29699e0 on May 25, 2021 07:36
Comment on lines +128 to +138
def filter(
self, odb: "ObjectDB", prefix: Tuple[str], copy: bool = False
) -> Optional[HashFile]:
"""Return filter object(s) for this tree.

If copy is True, returned object will be a Tree containing
filtered entries, but with hash_info copied from the original tree.

If copy is False, returned object will be a raw HashFile or Tree with
newly computed hash_info for the filtered object.
"""
pmrowla (Contributor, Author):

The current behavior for filter is to return a new tree containing the filtered set of trie entries, and a newly computed digest for the entire tree. These filtered trees are used when we do granular stage/output operations (and are related to #5988)

For collecting used dir cache, what we really want is a filtered version of self which contains the original self.hash_info, but with the filtered trie entries.

The use of copy is kind of ugly right now, but I'm also not sure that we really need 2 different filter methods?
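The two behaviors described above can be sketched as follows (a simplified toy model; the real `Tree.filter` in DVC takes an ODB and returns HashFile/Tree objects, and the digest here is a stand-in):

```python
from typing import Optional, Tuple

class Tree:
    """Toy tree: entries maps path tuples to objects."""

    def __init__(self, entries: dict, hash_info=None):
        self.entries = entries
        self.hash_info = hash_info

    def digest(self):
        # stand-in for recomputing the tree hash from its entries
        self.hash_info = hash(frozenset(self.entries))

    def filter(self, prefix: Tuple[str, ...], copy: bool = False) -> Optional["Tree"]:
        filtered = {
            key: obj for key, obj in self.entries.items()
            if key[: len(prefix)] == prefix
        }
        if not filtered:
            return None
        tree = Tree(filtered)
        if copy:
            # keep the original tree's digest (used-cache collection case)
            tree.hash_info = self.hash_info
        else:
            # newly computed digest for the filtered object (granular ops case)
            tree.digest()
        return tree
```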

Member:

copy does not make it clear what it copies though.

@pmrowla commented May 25, 2021

Still need to look into how this affects UI for push/pull/status

@pmrowla commented May 27, 2021

@efiop what's in this PR can be reviewed, but I don't think we should merge it yet: as it stands, it makes status -c basically unusable, since we no longer track the NamedCache hash -> pathname mappings

ex:

$ dvc status -A -c
        deleted:            258622b1688250cb619f3c9ccaefb7eb
        deleted:            c157a79031e1c40f85931829bc5fc552
        deleted:            d3b07384d113edec49eaa6238ad5ff00

@pmrowla commented May 27, 2021

As discussed, what status -c for objects should display is something along the lines of

dvc status -A -c
        new: 2 objects - 12345 byte(s)
        deleted:  1 object - 12345 byte(s)
        missing: ...

Where we show a count of objects which need to be pushed/pulled, along with the total data size for those objects.

In the actual returned data (and --show-json output) for cloud status, we would have a dict where the status is mapped to a list of object ids (with sizes).
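A hypothetical shape for that returned data (field names here are assumptions for illustration, not the final schema):

```python
# Proposed cloud-status result: status name -> list of object ids with sizes.
status = {
    "new": [
        {"oid": "258622b1688250cb619f3c9ccaefb7eb", "size": 6144},
        {"oid": "c157a79031e1c40f85931829bc5fc552", "size": 6201},
    ],
    "deleted": [
        {"oid": "d3b07384d113edec49eaa6238ad5ff00", "size": 12345},
    ],
}

def summarize(status: dict) -> list:
    # Render "new: 2 object(s) - 12345 byte(s)" style summary lines
    return [
        f"{state}: {len(objs)} object(s) - {sum(o['size'] for o in objs)} byte(s)"
        for state, objs in status.items()
    ]
```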


I'm going to open a follow up PR (rebased on top of this one) that will move cloud status into the ODB, and will handle the status -c UI changes.

This NamedCache PR shouldn't be considered mergeable until both PRs are ready, so that status -c stays usable in master (or at least it shouldn't be merged while there's a pending release)

@pmrowla pmrowla marked this pull request as ready for review May 27, 2021 07:26
@pmrowla pmrowla marked this pull request as draft May 27, 2021 07:30
@pmrowla pmrowla force-pushed the drop-named-cache branch from 1543070 to 82a4b85 on May 31, 2021 06:07
@efiop (Contributor) commented May 31, 2021

For the record: had a discussion and decided to try to not change the UI aspect in this PR, possibly by adding filename to the obj for now.

@pmrowla pmrowla force-pushed the drop-named-cache branch from 82a4b85 to 14dcca4 on June 1, 2021 08:34
@pmrowla pmrowla marked this pull request as ready for review June 1, 2021 08:36
@pmrowla pmrowla changed the title [WIP] get rid of NamedCache output/remote: get rid of NamedCache Jun 1, 2021
@pmrowla commented Jun 1, 2021

As discussed, collected used objects now have a HashFile.name field which will be populated with the output's relpath. The old brancher suffixes have still been removed from the name field.

edit: we have to keep some suffix since repo.status returns a dict with the name field used as keys. So without the suffix, we end up dropping any paths which overlap across multiple revisions.

As it stands, the output for status -c will now include the relpath and suffix for the most recent commit/branch containing an object (instead of previous behavior where we listed the name+commit for every commit that contained an object):

$ dvc status -A -c
        deleted:            foo (workspace)
        deleted:            dir/baz (workspace)
        deleted:            foo (143bf86dbff4cc1872803ff58dc95ce1acf502db)

Comment on lines +377 to +384
def _add_suffix(objs, suffix):
from dvc.objects.tree import Tree

for obj in objs:
obj.name += suffix
if isinstance(obj, Tree):
for _, entry_obj in obj:
entry_obj.name += suffix
pmrowla (Contributor, Author):

This workaround is needed to preserve current status -c and repo.status() behavior.

Assuming that we want to preserve the existing status -c behavior going forward, it seems that what we will need is for repo.status() to handle returning the mapping of objects or object IDs to the set of revision+path pairs that point to each object, similar to what NamedCache did before.

But it should be handled at the repo level rather than passing string suffixes all the way down to the output collect_used_cache methods like we were doing before.
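A sketch of that repo-level mapping (names and shapes here are hypothetical, for illustration only):

```python
from collections import defaultdict

def collect_names(revisions):
    """Map each object id to every (revision, path) pair that points to it,
    at the repo level, instead of threading string suffixes down to the
    output-level used-cache collection.

    revisions: iterable of (rev, outputs) where outputs is {path: object_id}.
    """
    names = defaultdict(set)
    for rev, outputs in revisions:
        for path, oid in outputs.items():
            names[oid].add((rev, path))
    return names
```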

@pmrowla pmrowla added refactoring Factoring and re-factoring skip-changelog Skips changelog labels Jun 1, 2021
@pmrowla pmrowla requested a review from efiop June 1, 2021 09:43
@pmrowla pmrowla force-pushed the drop-named-cache branch from b172dc1 to 31ed794 on June 1, 2021 10:23
dvc/objects/file.py — outdated review thread (resolved)
dvc/output.py — outdated review thread (resolved)
Comment on lines +262 to +265
objs, external = stage.get_used_cache(*args, **kwargs)
used_objs.update(objs)
for repo_pair, paths in external:
used_external[repo_pair].update(paths)
Member:

There should not be any external cache here I think.

dvc/stage/cache.py, lines 37 to 43 (at c7a3a00):

for dep in stage.deps:
if not (dep.scheme == "local" and dep.def_path and dep.get_hash()):
return False
for out in stage.outs:
if out.scheme != "local" or not out.def_path or out.persist:
return False

Comment on lines +321 to +326
for obj in used:
used_hashes.add(obj.hash_info.value)
if isinstance(obj, Tree):
used_hashes.update(
entry_obj.hash_info.value for _, entry_obj in obj
)
Member:

How about we put this in objects.iterobjs()/objects.iterhashes()?
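A sketch of the suggested helper (the name `iterhashes` and the toy classes here are assumptions; DVC's actual HashFile/Tree types differ):

```python
class HashInfo:
    def __init__(self, value):
        self.value = value

class HashFile:
    def __init__(self, value):
        self.hash_info = HashInfo(value)

class Tree(HashFile):
    def __init__(self, value, entries):
        super().__init__(value)
        self._entries = entries   # list of (key, HashFile) pairs

    def __iter__(self):
        return iter(self._entries)

def iterhashes(objs):
    """Yield the hash of each object, descending into Tree entries."""
    for obj in objs:
        yield obj.hash_info.value
        if isinstance(obj, Tree):
            for _key, entry in obj:
                yield entry.hash_info.value
```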

Contributor:

@skshetry That one will no longer be necessary after the next PR, as push/pull will no longer work with raw md5 strings, but rather through odb.add, which goes through objects one-by-one.


from voluptuous import Required

from dvc.path_info import PathInfo

from .base import Dependency


class RepoPair(NamedTuple):
Member:

RepoPair might be confused with pair of Repo, whereas this is more like freezing a repo with a revision. We use *Info for these cases, so I'd suggest RepoInfo here.

pmrowla (Contributor, Author) commented Jun 3, 2021:

We were already using RepoDependency.repo_pair, so it seemed to me that we should just have RepoPair as the named type. Either way, this is likely going away entirely in the future, since we will move to handling external deps as a special object type (RepoDependency will have an overridden get_obj() method and will then be collected the same way as other objects)

dvc/remote/base.py — outdated review thread (resolved)
@efiop efiop merged commit eb46c03 into iterative:master Jun 2, 2021
@efiop commented Jun 2, 2021

@pmrowla Let's address the review in the followup, please. Merging so you won't have to rebase it for path_info removal. Thank you! πŸ™

@pmrowla pmrowla deleted the drop-named-cache branch June 3, 2021 00:22