pull: clones repositories for imported files #9738

Open
peper0 opened this issue Jul 14, 2023 · 6 comments
Labels
A: data-sync Related to dvc get/fetch/import/pull/push p2-medium Medium priority, should be done, but less important

Comments

@peper0
Contributor

peper0 commented Jul 14, 2023

Description

dvc pull clones the repositories from which files were imported, even though the imported files are cached (they have cache: true implicitly or explicitly).

Reproduce

  1. dvc init
  2. dvc import any file from a different git repository
  3. dvc push
  4. clear the local cache
  5. dvc pull

At step 5, the source repository is cloned.
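
The reproduction steps could be scripted roughly as follows. This is only a sketch: the source repository URL, the imported path, and the remote location are placeholders, and the `git init` and `dvc remote add` steps are assumptions added here because `dvc init` needs a git repository and `dvc push` needs a configured remote. The cache path `.dvc/cache` is assumed to be the default location.

```python
import subprocess


def repro_commands(source_repo, path, remote_url):
    """Return the reproduction commands in order (all arguments are placeholders)."""
    return [
        ["git", "init"],                                        # assumption: dvc init needs a git repo
        ["dvc", "init"],                                        # step 1
        ["dvc", "remote", "add", "-d", "storage", remote_url],  # assumption: push needs a remote
        ["dvc", "import", source_repo, path],                   # step 2
        ["dvc", "push"],                                        # step 3
        ["rm", "-rf", ".dvc/cache"],                            # step 4: clear the local cache
        ["dvc", "pull"],                                        # step 5: the source repo gets cloned here
    ]


def run_all(commands, cwd):
    """Run each command in the given directory, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_all(repro_commands(url, path, remote), cwd=some_empty_dir)` would execute the steps; nothing runs until `run_all` is called.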

Expected

I expect dvc push to push the data to the remote and dvc pull to pull it from the remote without accessing the git repository it was imported from (unless dvc update is called), since the data is cached by default.

This is a big problem, since the git repo may not be accessible when dvc pull is called (e.g. when it is called by a CI server). Moreover, it takes a lot of time if data is imported from several repositories and some of them are large.

In my understanding, outputs are synced with the source repository only in dvc update and dvc import, not in dvc pull or dvc repro. Therefore, I don't see why the repo would need to be accessible when calling dvc pull.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.58.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.4.0-150-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.23.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/tlakota/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9d372b24e0a6ee54ffae81f6983b321a
@dberenbaum
Collaborator

Thanks for the issue @peper0. The current behavior of dvc import is to always download from source and never push imported data to the remote. So the dvc push in your example should not have any impact. The idea is that in most cases users would rather access the source git repo than have an entire extra copy of the dvc-tracked data in remote storage. There's some related discussion about being able to push imports in #4527.

@dberenbaum dberenbaum added A: data-sync Related to dvc get/fetch/import/pull/push awaiting response we are waiting for your reply, please respond! :) labels Jul 17, 2023
@peper0
Contributor Author

peper0 commented Jul 25, 2023

@dberenbaum What about the push option of the outputs? Shouldn't it decide whether to push the file to the remote?

@dberenbaum
Collaborator

dberenbaum commented Jul 25, 2023

Yes, ideally import would set push: false and you could change it to push: true to get the behavior you want. Unfortunately, I don't think it's that simple today because import predates the introduction of the push option.

cc @efiop

@dberenbaum dberenbaum removed the awaiting response we are waiting for your reply, please respond! :) label Jul 25, 2023
@dberenbaum
Collaborator

If you are open to a hacky workaround for now, you could make a dvc stage that does dvc get, which would track it as a normal output that gets pushed.
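
A minimal sketch of that workaround, assuming a stage name such as `fetch_data` and placeholder values for the repository URL, revision, and paths (none of these come from this thread). The helper `get_stage_command` is hypothetical; it only builds a `dvc stage add` invocation whose command is a `dvc get` call and leaves running it to the caller.

```python
import shlex


def get_stage_command(name, repo_url, rev, src_path, out_path):
    """Build a `dvc stage add` call wrapping `dvc get` (all values are placeholders)."""
    # The command the stage will run: download src_path at the given revision.
    get_cmd = ["dvc", "get", "--rev", rev, "-o", out_path, repo_url, src_path]
    # Declare out_path as a regular stage output so that it is cached and pushed.
    return ["dvc", "stage", "add", "-n", name, "-o", out_path, shlex.join(get_cmd)]
```

After the stage is added, `dvc repro` would run the download and `dvc push` would upload the result like any other stage output.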

@dberenbaum dberenbaum added the p2-medium Medium priority, should be done, but less important label Jul 25, 2023
@peper0
Contributor Author

peper0 commented Jul 26, 2023

@dberenbaum Yes, that's the direction I'm going to migrate in. But it has considerable drawbacks, such as no support for dvc update.

@dberenbaum
Collaborator

@peper0 If your stage cmd looks like dvc get --rev some_rev repo_url path, then you can update the --rev field to get update-like functionality, which AFAIK is more or less what import does. I don't plan to close this issue, since it's a legitimate request to have all this included in import. Hopefully that at least makes it usable for now, since I don't think it's something we can fix quickly or prioritize right now.
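
To illustrate updating the `--rev` field, here is a hedged sketch of a helper that rewrites the revision in such a cmd string. The name `set_rev` and the example values are hypothetical, and the sketch assumes the revision is passed as a separate `--rev <value>` pair, as in the command above.

```python
import shlex


def set_rev(cmd, new_rev):
    """Return cmd with the value following --rev replaced by new_rev."""
    parts = shlex.split(cmd)
    if "--rev" not in parts:
        raise ValueError("command has no --rev option")
    idx = parts.index("--rev")
    if idx + 1 >= len(parts):
        raise ValueError("--rev has no value")
    parts[idx + 1] = new_rev
    return shlex.join(parts)
```

For example, `set_rev("dvc get --rev some_rev repo_url path", "v2.0")` yields the same command pinned to `v2.0`. Once the cmd in dvc.yaml is changed this way, `dvc repro` should rerun the stage because its command differs from the recorded one.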
