-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make show
and other commands work with remote repos
#25
Comments
Asked a question in scmrepo about this - maybe we can use it. |
show
and audit
work on remote reposshow
and other commands work with remote repos
Hi, I'd like to contribute to this one. Any suggestion on where to start from? |
Thanks for asking! There are few parts to this:
AFAIK, the only way to support this in full is to clone the repo locally to temp dir first. In this case, the ideal way would be to contribute to scmrepo in the issue above, I suppose, considering DVC already have something like this implemented. There are also an option to support 1a partially. The thing is, you can read Git tags from a remote repo with This partial option is good for |
Found a similar explanation I've posted in another issue earlier #220 (comment) |
Thanks @aguschin ! Reading through what you sent me, I found this especially illuminating: https://stackoverflow.com/a/1179728/19782654 I think going for a git clone in a tmp folder is probably the most general (it is a necessary step for write functionalities too) and robust solution. Do you agree? Shall we go for that straight away? |
Related issue on |
Sure, we can start with this, and keep I think @pmrowla summarized it well in this comment. GitPython is a library that call Initially I used GitPython in GTO because Maybe @pmrowla could give us more guidance on this (in the |
Two questions:
|
|
Ok, I thought of tmpfiles, it’s just that from direct experience I know
that each call will be slow and therefore it will be a little cumbersome.
But you are right, let’s start like this, optimization comes later 😄
Thanks for your input!
…On Fri 9. Sep 2022 at 06:49 Alexander Guschin ***@***.***> wrote:
1. Let's start it simple with cloning to temp dir. For handling the
temp dirs (creating and deleting them) you can use
https://docs.python.org/3/library/tempfile.html
This is a simple approach that will do for the first time (even DVC
uses it, @pmrowla <https://github.com/pmrowla> please correct me if
I'm wrong). It maybe won't allow us to cache things in between CLI calls
(like if you're going to call GTO multiple times on the same repo), but
once we implement the simple approach, we may consider the approaches to
optimize this. On the bright side, this doesn't require us to move .gto
to .gto/config, making it easier for now in terms of compatibility.
2. Yeah, please start with whichever you like more. gto show sounds
like the most useful to me as well. In any case, once you implement this
for a single command, repeating it for others should be a trivial task.
—
Reply to this email directly, view it on GitHub
<#25 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB7PWYGUEEHGO4XYRS3LZJLV5K6XHANCNFSM5O6Y5KIQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Temp directories (without any persistent caching between CLI calls) is what DVC does: https://github.com/iterative/dvc/blob/main/dvc/external_repo.py |
I have been working a bit on this (francesco086#1). I have one design decision to make: who is responsible to clean up the git cloned directory? What should be the mechanism? |
One option that may work (looks like a hacky one to me) is to use def return_repo(path):
if remote_url(path):
with TemporaryDirectory() as tmpdir:
clone_repo(path, tmpdir)
yield tmpdir
return path
class GitRegistry(BaseModel):
...
@classmethod
def from_repo(cls, repo=Union[str, Repo], config: RegistryConfig = None):
if isinstance(repo, str):
try:
for repo in return_repo(repo): pass
repo = git.Repo(repo, search_parent_directories=True)
except (InvalidGitRepositoryError, NoSuchPathError) as e:
raise NoRepo(repo) from e
if config is None:
config = read_registry_config(
os.path.join(repo.working_dir, CONFIG_FILE_NAME)
) As per Stack Overflow answer, it should get garbage collected after process exit. Which is fine for us, since I'll search for more beautiful solution, but this may suffice :) |
Thanks for the suggestion, it indeed looks very hacky... reading more in detail from the stackoverflow discussion you mentioned it sounds to me quite a bad idea to follow this approach. The directory could be deleted at any time (it's undeterministic). I don't even want to think about what kind of headaches one could get later on... I strongly advise to go another way. From my side I see 2 options:
Of course I am open to discuss this :) |
Sure, we can try it if it looks easier. Two questions:
Re refactoring and the 2nd option, I think we could do something like this (thanks @mike0sv for the discussion and the code snippet) class GitRegistry:
...
@classmethod
@contextmanager
def from_path(cls, path):
repo = GitRegistry(path)
yield repo
if repo.needs_cleanup():
repo.cleanup()
@needs_clone
def ls(self):
...
def needs_cleanup(self):
return self.is_remote and self._local_path is not None
def needs_clone(f):
def inner(self, *args, **kwargs):
if self._local_path is None:
self._local_path = self._clone_repo()
return f(self, *args, **kwargs)
return inner Then all API calls would have to look like: with GitRegistry.from_repo(path) as repo:
gto.api.register(repo, ...) In will require changing dozen of lines in I'll think about this tomorrow morning, but for now the 1st option with cache looks as it have a big pro (cache) and simpler (although it gives you less control over your cache). Btw, this also reminded me that @iterative/studio-backend team wanted to have a way to cache UPD: the |
Option 1. means we are building a stateful solution. Option 2. is stateless.
I leave the decision to you @aguschin, this is pretty much a fundamental design choice for gto :) Then we can see the details |
I can easily imagine someone who is trying to run many commands on a remote repo, so I definitely would prefer an approach with cache. I would assume, even if you want to register/assign, you would first run
I would assume we need to do that each time The other problem here could be if anyone wants to run some commands in parallel on the same remote repo. E.g. I run two |
For the sake of simplicity of the first implementation we could even use an external context manager and a decorator to clone the repo # gto/api.py
def clone_repo_if_needed(f):
def inner(*args, **kwargs):
repo = args[0]
if is_remote(repo):
with TemporaryDirectory() as tmpdir:
clone_repo(tmpdir)
return f(tmpdir, *args[1:], **kwargs)
return f(repo, *args[1:], **kwargs)
return inner
@clone_repo_if_needed
def method(repo, ...):
... That won't require any changes in GitRegistry, creates a cloned repo only until the API function is called... WDYT? If that works, we can implement it first, and then improve with cache on top of it. UPD: updated the snippet |
I'd say stateless approach (option 2) is preferable if we can optimize the performance.
There is a way to only pull certain files (or maybe even none of them), see this |
Besides that, now next target in the radar are I would again take the approach of first creating the option |
Mmm... on a second thought, one may want to choose between |
Better now? I mentioned this issue in the PR and it appeared just above ^
Yes! But please consider that
Interesting idea! Indeed, one may want to commit the change, but wait some time before pushing it. May be handy to implement both :) |
Having |
Thanks for the help and feedback @aguschin To push one first need to commit, and this the part that will require more work I am afraid. So I would first target adding the I think it is tricky because one wants to commit only the changed file (currently |
Got it. I think The question I have: what if there were some uncommitted changes in $ gto annotate mymodel --type model
$ gto annotate mymodel --path X --auto-commit now you probably want to get both |
I cannot think of any better solution... and great that you thought of that, I didn't 😅 |
Here the next PR that adds |
This is the last PR necessary to enable all gto commands to work on remote repos: #292 |
Hi @aguschin , since I have a little bit of spare time, I would like to collect here my thoughts about how I would like to refactor the module Take the decorator @clone(repo_arg="repo")
def register(repo: Union[str, Repo], ...) is more understandable than @clone
def register(repo: Union[str, Repo], ...) This is even more true for other decorators, for example @commit
def annotate(...) will very likely be confused: commit? always? what's happening? @commit(controller_arg="commit")
def annotate(..., commit: bool) then it is probably way easier to guess what's happening. This is the readability / naming point of view. Now I would like to take the TDD point of view. Tests currently are a mess. First I test all functionalities of the Finally, one last point of view. Currently the code is rigid. If I want to rename the argument "repo" in the api functions, I need to change the Because of all this, I would advise to refactor the This is one of the suggestions I have, others after we can agree on this one :) |
Interesting! Didn't think of that. Those changes should make things more decoupled and more easily tested, on that I agree. At the same time, renaming wrappers + adding args should keep things understandable even with shorter names. Curious to see how that will changes tests btw :) |
I will start with a little PR to rename Then I will refactor the |
As promised :) After this one, in principle one could also start extending the cli help and gto documentation to notify that remote repos are supported, because the user interface is as desired. |
Hi @aguschin a short update: I have started refactoring, but I encountered a major difficulty. I cannot figure out a reasonable way to assert that a function is decorated. For example, assert that In case you have any idea about it please let me know. |
As @mike0sv suggested, maybe you can get the function code with And if decorator could set some attribute of the function, you can check that attribute 🤔 |
Mmm... this is what I meant when I said "a reasonable way"... this was one of the option I could find by googling, but I find parsing the code a little extreme. Well, it seems I have no other choice, so I will give it a try. It's the first time I am so heavily using decorators. Not the best experience ever so far... I start to wonder if it was the right choice. |
Now it's funny we have a label named |
As per #309, I consider this issue to be closed. Thank you so much @francesco086, this was a really big effort from your side - I hope you are satisfied with the result! 🙌🏻 |
Sorry for the late answer 😅 I somehow missed the notification... Thank you @aguschin , most credit goes to you, for the many not-easy design decisions and in the end for all the re-factoring! |
Right now they work on local repos only, that's a limitation of
gitpython
.The text was updated successfully, but these errors were encountered: