
Make show and other commands work with remote repos #25

Closed
aguschin opened this issue Feb 21, 2022 · 63 comments
Labels
remote-repository Working with repositories not cloned locally

Comments

@aguschin
Contributor

Right now they work on local repos only; that's a limitation of GitPython.

@aguschin
Contributor Author

aguschin commented Mar 9, 2022

Asked a question in scmrepo about this - maybe we can use it.

@aguschin aguschin added this to the First GTO release milestone Mar 9, 2022
@aguschin aguschin removed this from the First GTO release milestone Apr 4, 2022
@aguschin aguschin added the remote-repository Working with repositories not cloned locally label Jul 28, 2022
@aguschin aguschin changed the title Make show and audit work on remote repos Make show and other commands work with remote repos Jul 28, 2022
@francesco086
Contributor

Hi, I'd like to contribute to this one. Any suggestion on where to start from?

@aguschin
Contributor Author

aguschin commented Sep 7, 2022

Thanks for asking!

There are a few parts to this:

  1. read-only operations (show, describe, check-ref, history, stages). Specifically:
    1a. read Git tags (check-ref; history and show without commits)
    1b. read commits (describe)
    1c. read both (show --all-commits, history, stages)
  2. write operations (assign, register, deprecate, annotate, remove). Specifically:
    2a. create/remove git tag(s) and push (assign, register, deprecate)
    2b. create commit and push (annotate, remove)

AFAIK, the only way to support this in full is to clone the repo locally to a temp dir first. In this case, the ideal way would be to contribute to scmrepo in the issue above, I suppose, considering DVC already has something like this implemented.

There is also an option to support 1a partially. The thing is, you can read Git tags from a remote repo with the git CLI like this: git ls-remote --tags https://github.com/iterative/example-gto. I didn't check if this is supported in GitPython. The justification for this approach is this PR, i.e. all artifacts that don't have Git tags are considered unregistered and are optional to show in gto show.
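
For illustration, a minimal sketch of this partial option, assuming GitPython (whose command wrapper proxies arbitrary git subcommands, so ls_remote maps to git ls-remote); ls_remote_tags is a hypothetical helper, not an existing GTO function:

from git import Git

def ls_remote_tags(url: str) -> dict:
    """Return {tag_name: sha} for a remote repo, without cloning it."""
    out = Git().ls_remote("--tags", url)
    tags = {}
    for line in out.splitlines():
        sha, _, ref = line.partition("\t")
        name = ref[len("refs/tags/"):]
        # Skip "peeled" entries like v1.0.0^{} that point at the tagged commit.
        if not name.endswith("^{}"):
            tags[name] = sha
    return tags

print(ls_remote_tags("https://github.com/iterative/example-gto"))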

This partial option is good for the gto show command without the --all-commits and --all-branches flags, but won't enable gto describe, gto history, etc. Because it should be much quicker than fully cloning the repo, eventually we'll need to support both, I suppose.

@aguschin
Contributor Author

aguschin commented Sep 7, 2022

Found a similar explanation I posted earlier in another issue: #220 (comment)

@francesco086
Contributor

francesco086 commented Sep 7, 2022

Thanks @aguschin !
I would surely start from the read-only operations.

Reading through what you sent me, I found this especially illuminating: https://stackoverflow.com/a/1179728/19782654
Git is distributed; however, oftentimes (always?) there is a reference remote repo. One could choose to rely on those and use the GitLab/GitHub APIs.
Otherwise there is no way around it: one needs to do a git clone.
For tags only, one could also rely on git ls-remote, as you suggested.

I think going for a git clone in a tmp folder is probably the most general (it is a necessary step for the write functionality too) and most robust solution. Do you agree? Shall we go for that straight away?
I am new to scmrepo; should the git clone functionality be part of that repo, or of GitPython?

@francesco086
Contributor

Related issue on scmrepo: iterative/scmrepo#41

@aguschin
Contributor Author

aguschin commented Sep 7, 2022

I think going for a git clone in a tmp folder is probably the most general (it is a necessary step for the write functionality too) and most robust solution. Do you agree? Shall we go for that straight away?

Sure, we can start with this, and keep git ls-remote as a future optimization.

I think @pmrowla summarized it well in this comment. GitPython is a library that calls the git CLI and provides some API over it (the others are Dulwich and pygit2). scmrepo in turn wraps GitPython, Dulwich, and pygit2, providing a single interface to them, IIUC.

Initially I used GitPython in GTO because scmrepo didn't have the needed functionality at that time, IIRC (i.e. iterating over tags and reading their info). Now we need to make a temporary repo clone, and though we could implement this in GTO again, DVC already has this implemented, so we could take it and adapt it. A better approach would be to move that functionality from DVC to scmrepo, and then reuse it in both GTO and DVC.

Maybe @pmrowla could give us more guidance on this (in the scmrepo issue)?

@francesco086
Contributor

Two questions:

  1. Where should we store the cloned Git repo?
    I initially thought in .gto/cache, but .gto is used for the configuration.
    Would it make sense to move the current gto config inside .gto/config?
    Then we could have a cache folder in .gto/.

  2. I would start from the gto show command, if that's ok, as I think it is the most useful. But perhaps there are reasons to start from another one?

@aguschin
Contributor Author

aguschin commented Sep 9, 2022

  1. Let's keep it simple and clone to a temp dir. For handling the temp dirs (creating and deleting them) you can use tempfile.TemporaryDirectory() from https://docs.python.org/3/library/tempfile.html#examples (a minimal sketch follows below).
    This is a simple approach that will do for now (even DVC uses it, @pmrowla please correct me if I'm wrong). It may not let us cache things between CLI calls (like if you're going to call GTO multiple times on the same repo), but once we implement the simple approach, we can consider ways to optimize this. On the bright side, it doesn't require us to move .gto to .gto/config, making it easier for now in terms of compatibility.
  2. Yeah, please start with whichever you like more. gto show sounds like the most useful to me as well. In any case, once you implement this for a single command, repeating it for others should be a trivial task.
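
A minimal sketch of that approach, assuming GitPython; show_remote is a hypothetical helper, not part of GTO:

import tempfile

from git import Repo

def show_remote(url: str) -> list:
    """Clone `url` into a throwaway directory and list its tags."""
    with tempfile.TemporaryDirectory() as tmpdir:
        repo = Repo.clone_from(url, tmpdir)
        # Everything under tmpdir is deleted when the block exits,
        # so nothing persists between CLI calls.
        return [tag.name for tag in repo.tags]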

@francesco086
Contributor

francesco086 commented Sep 9, 2022 via email

@pmrowla
Contributor

pmrowla commented Sep 9, 2022

Temp directories (without any persistent caching between CLI calls) is what DVC does: https://github.com/iterative/dvc/blob/main/dvc/external_repo.py

@francesco086
Contributor

I have been working a bit on this (francesco086#1).

I have one design decision to make: who is responsible for cleaning up the cloned directory? What should the mechanism be?
Currently I am using a TemporaryDirectory, and in the tests I am calling TemporaryDirectory.cleanup().
How should the cleanup be called after the repo has been "initialized", as in e.g. https://github.com/iterative/gto/blob/main/gto/registry.py#L62? Any suggestions?

@aguschin
Contributor Author

aguschin commented Sep 14, 2022

One option that may work (looks like a hacky one to me) is to use a with ... yield construction. Something like this (raw draft):

def return_repo(path):
    # For remote URLs, the TemporaryDirectory stays alive for as long
    # as this generator object is referenced.
    if remote_url(path):
        with TemporaryDirectory() as tmpdir:
            clone_repo(path, tmpdir)
            yield tmpdir
    else:
        yield path

class GitRegistry(BaseModel):
    ...
    @classmethod
    def from_repo(cls, repo: Union[str, Repo], config: RegistryConfig = None):
        if isinstance(repo, str):
            try:
                # Keep a reference to the generator so the temp dir
                # isn't cleaned up until it's garbage collected.
                cls._repo_gen = return_repo(repo)
                repo = git.Repo(next(cls._repo_gen), search_parent_directories=True)
            except (InvalidGitRepositoryError, NoSuchPathError) as e:
                raise NoRepo(repo) from e
        if config is None:
            config = read_registry_config(
                os.path.join(repo.working_dir, CONFIG_FILE_NAME)
            )

As per the Stack Overflow answer, it should get garbage collected on process exit, which is fine for us, since GitRegistry can be used for the whole lifetime of the process.

I'll search for a more elegant solution, but this may suffice :)

@francesco086
Contributor

francesco086 commented Sep 14, 2022

Thanks for the suggestion, it indeed looks very hacky... reading the Stack Overflow discussion you mentioned in more detail, following this approach sounds like quite a bad idea to me. The directory could be deleted at any time (it's non-deterministic). I don't even want to think about what kind of headaches one could get later on... I strongly advise going another way.

From my side I see 2 options:

  1. go back to the idea of a cache folder (for now we could simply name it ".gto_remote_repos", to avoid clashes with the .gto config file)
  2. refactor quite a lot in gto and enclose all the command execution within a with statement, to ensure that the tmp directory will be deleted at the end

Of course I am open to discuss this :)

@aguschin
Contributor Author

aguschin commented Sep 14, 2022

go back to the idea of a cache folder (for now we could simply name it ".gto_remote_repos", to avoid clashes with the .gto config file)

Sure, we can try it if it looks easier. Two questions:

  1. Could we get it updated each time, and remove everything that doesn't exist in the remote? (I assume yes, but we need to check that if, say, someone deleted a few tags/commits in the remote, we do the same on the local copy once you instantiate GitRegistry, etc.)
  2. I assume there should be a user-friendly way to clean up the cache, e.g. $ gto gc (or maybe there is something better).

Re refactoring and the 2nd option, I think we could do something like this (thanks @mike0sv for the discussion and the code snippet):

from contextlib import contextmanager
from functools import wraps

def needs_clone(f):
    # Clone lazily: only the methods that actually need a local copy pay for it.
    @wraps(f)
    def inner(self, *args, **kwargs):
        if self._local_path is None:
            self._local_path = self._clone_repo()
        return f(self, *args, **kwargs)
    return inner

class GitRegistry:
    ...
    @classmethod
    @contextmanager
    def from_path(cls, path):
        repo = cls(path)
        try:
            yield repo
        finally:
            # Clean up the clone even if the caller raised.
            if repo.needs_cleanup():
                repo.cleanup()

    @needs_clone
    def ls(self):
        ...

    def needs_cleanup(self):
        return self.is_remote and self._local_path is not None

Then all API calls would have to look like:

with GitRegistry.from_path(path) as repo:
    gto.api.register(repo, ...)

It will require changing a dozen lines in gto/api.py and a few lines in tests.

I'll think about this tomorrow morning, but for now the 1st option looks like it has a big pro (caching) and is simpler (although it gives you less control over your cache).

Btw, this also reminded me that the @iterative/studio-backend team wanted a way to cache self.get_state() calls in the GitRegistry functions to parse Git repos faster, but that's a different story IMO. (We could get that done by having some "lazy" flag that would make GitRegistry not update the state and reuse the existing one unless asked directly.)

UPD: the needs_clone decorator above clones the repo only if it's needed for a certain method to run. In some cases we may just run git ls-remote --tags, get the tags, and not clone the repo at all.

@francesco086
Contributor

Option 1. means we are building a stateful solution. Option 2. is stateless.

  1. can lead to higher performance but can get complex to handle (e.g., as you mentioned, how do you clean the cache? But also, do we want to do a git pull if the repo is already in the cache, and if so, how and when?)

  2. is simpler from a user perspective, but performance-wise it could get annoying (a git clone can be substantially slower than a git pull). But perhaps with some tricks we could get good performance, e.g. with shallow clones (see the sketch below).
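
A hedged sketch of the shallow-clone trick, assuming GitPython (keyword arguments are forwarded to git clone as CLI flags); the path and URL are examples only:

from git import Repo

repo = Repo.clone_from(
    "https://github.com/iterative/example-gto",  # example repo from above
    "/tmp/example-gto-shallow",
    depth=1,  # truncate history to the latest commit: much less data to download
)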

I leave the decision to you @aguschin, this is pretty much a fundamental design choice for gto :) Then we can see the details

@aguschin
Contributor Author

aguschin commented Sep 15, 2022

I can easily imagine someone trying to run many commands on a remote repo, so I would definitely prefer an approach with a cache. I would assume that even if you want to register/assign, you would first run gto show to recall the exact artifact name. Then you could run gto show modelname to find out the specific version you need to assign a stage to, etc. Cloning the repo each time can make things 2-3x slower in this case.

do we want to do a git pull if the repo is already in the cache, how and when

I would assume we need to do that each time GitRegistry is initialized. Or even each time we re-calculate self.get_state().

The other problem here could be if anyone wants to run some commands in parallel on the same remote repo. E.g. I run two annotate commands, one for the main branch, the other for the dev branch; they may clash because you create a commit in both cases and (I assume) git will need to check out a branch to create a commit. For now we can do some kind of locking mechanism, so one process will wait until the other is completed (a minimal sketch follows below). Here I think we need to check which commands run in parallel are prone to this (I assume this doesn't affect reading operations like show, history, describe; register and assign should also be unaffected). And since we are considering moving artifacts.yaml to dvc.yaml, annotate may not be a problem in the future at all.
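
Such a lock could look like this sketch, assuming the third-party filelock package; repo_lock and the lock path are hypothetical:

import hashlib

from filelock import FileLock

def repo_lock(url: str) -> FileLock:
    # Key the lock file on the remote URL, so two processes working on
    # the same remote serialize their write operations.
    digest = hashlib.sha256(url.encode()).hexdigest()[:16]
    return FileLock(f"/tmp/gto-{digest}.lock")

with repo_lock("https://github.com/iterative/example-gto"):
    ...  # create the commit/tag and push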

@aguschin
Contributor Author

aguschin commented Sep 15, 2022

For the sake of simplicity of the first implementation, we could even use an external context manager and a decorator to clone the repo:

# gto/api.py
from functools import wraps
from tempfile import TemporaryDirectory

def clone_repo_if_needed(f):
    @wraps(f)
    def inner(*args, **kwargs):
        repo = args[0]
        if is_remote(repo):
            with TemporaryDirectory() as tmpdir:
                clone_repo(repo, tmpdir)
                # The clone exists only for the duration of this call.
                return f(tmpdir, *args[1:], **kwargs)
        return f(repo, *args[1:], **kwargs)
    return inner

@clone_repo_if_needed
def method(repo, *args, **kwargs):
    ...

That won't require any changes in GitRegistry, and the cloned repo exists only while the API function runs... WDYT? If that works, we can implement it first, and then improve it with a cache on top.

UPD: updated the snippet

@aguschin aguschin moved this to In Progress in MLEM + GTO Sep 15, 2022
@mike0sv
Contributor

mike0sv commented Sep 15, 2022

I'd say the stateless approach (option 2) is preferable if we can optimize the performance.
Let's see what we actually need from the temporary cloned repo.

  • List of tags
  • artifacts.yaml for every commit?
  • what else?

There is a way to only pull certain files (or maybe even none of them), see this
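
For reference, a hedged sketch of such a partial ("blobless") clone, assuming GitPython's multi_options and a Git server with partial-clone support; refs and commit/tree metadata are fetched, but file contents are only downloaded on demand:

from git import Repo

repo = Repo.clone_from(
    "https://github.com/iterative/example-gto",
    "/tmp/example-gto-partial",
    multi_options=["--filter=blob:none", "--no-checkout"],
)
# Tags and commits are available immediately:
print([tag.name for tag in repo.tags])
# Reading a single file triggers an on-demand blob fetch:
print(repo.git.show("HEAD:artifacts.yaml"))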

@francesco086
Contributor

Besides that, the next targets on the radar are annotate and remove. In this case it will probably mean updating the yaml file, git commit, and git push, right?

I would again take the approach of first creating the --auto-push option, is that ok?

@francesco086
Contributor

I would again take the approach of first creating the --auto-push option, is that ok?

Mmm... on second thought, one may want to choose between --auto-commit and --auto-push...?

@aguschin
Contributor Author

Any idea

Better now? I mentioned this issue in the PR and it appeared just above ^

Besides that, the next targets on the radar are annotate and remove. In this case it will probably mean updating the yaml file, git commit, and git push, right?

Yes! But please consider that the artifacts.yaml functionality may move to DVC (though this won't be decided for at least a couple of months). I would like this to be implemented anyway, since there is a chance it will stay in GTO, but I didn't want you to start without being aware of this.

Mmm... on second thought, one may want to choose between --auto-commit and --auto-push...?

Interesting idea! Indeed, one may want to commit the change, but wait some time before pushing it. It may be handy to implement both :)

@aguschin
Contributor Author

Having --auto-push only is totally OK for me too. We could add --auto-commit once anyone requests it.

@francesco086
Contributor

Thanks for the help and feedback @aguschin

To push, one first needs to commit, and this is the part that will require more work, I'm afraid. So I would first target adding the --auto-commit option for the annotate and remove commands.

I think it is tricky because one wants to commit only the changed file (currently artifacts.yaml) and otherwise leave the git status unchanged. In case other files have been staged, one should first unstage them, add artifacts.yaml, commit, and then re-stage the files that were unstaged.
Or do you have a suggestion on how to achieve this in an easier way?

@aguschin
Contributor Author

aguschin commented Oct 13, 2022

@francesco086

Got it. I think git stash + git stash pop should work. That's what we do in dvc exp, if I'm not mistaken (@skshetry, could you please clarify?).


The question I have: what if there were some uncommitted changes in artifacts.yaml already? Let's suppose you run

$ gto annotate mymodel --type model
$ gto annotate mymodel --path X --auto-commit

now you probably want to get both path and type in artifacts.yaml. But with git stash ... git stash pop you'll lose it. 🤔 Should we just error out on the 2nd command then? ("You have uncommitted changes in artifacts.yaml, please commit them before running --auto-commit"?)
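
A minimal sketch of that guard, assuming GitPython (Repo.is_dirty accepts a path to narrow the check); assert_artifacts_clean is a hypothetical helper:

from git import Repo

def assert_artifacts_clean(repo: Repo) -> None:
    # Refuse to --auto-commit on top of uncommitted artifacts.yaml edits,
    # since the stash/pop dance would silently drop or conflict with them.
    if repo.is_dirty(path="artifacts.yaml"):
        raise RuntimeError(
            "You have uncommitted changes in artifacts.yaml, "
            "please commit them before running --auto-commit"
        )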

@francesco086
Contributor

Should we just error out on the 2nd command then? ("You have uncommitted changes in artifacts.yaml, please commit them before running --auto-commit"?)

I cannot think of any better solution... and great that you thought of that, I didn't 😅

@francesco086
Contributor

francesco086 commented Oct 18, 2022

Here is the next PR, which adds auto-commit support to annotate and remove: #290
I must say that I struggled quite a lot with this one.

@francesco086
Contributor

This is the last PR necessary to enable all gto commands to work on remote repos: #292

@francesco086
Contributor

Hi @aguschin, since I have a little bit of spare time, I would like to collect here my thoughts on how I would like to refactor the git_utils module.
Working TDD, the tests have been pretty much screaming "please change the design" at me for quite some time. But, for the sake of delivering functionality quickly, I ignored them. The naming difficulties were a warning sign too.

Take the decorator clone_on_remote_repo. Why didn't I want to rename it simply clone?
I find this decorator, like all the others, a little tricky, because they silently take the arguments of the decorated function.
In theory, if one wants to properly abstract this decorator, one should pass it "repo", the name of the argument to look for.
Then one could change its signature to clone(repo_arg: str). I find this clearer than clone(): we get a hint that this decorator will look at the repo argument and probably clone it if needed. Then

@clone(repo_arg="repo")
def register(repo: Union[str, Repo], ...)

is more understandable than

@clone
def register(repo: Union[str, Repo], ...)

This is even more true for the other decorators, for example commit_produced_changes_on_auto_commit. If one renames it to simply commit and reads

@commit
def annotate(...)

one will very likely be confused: commit? always? what's happening?
But if you read

@commit(controller_arg="commit")
def annotate(..., commit: bool)

then it is probably way easier to guess what's happening.
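
For concreteness, a sketch of how such a parameterized decorator could be plumbed; is_remote stands in for the real check in git_utils, and clone_to_tmp is a hypothetical helper:

from functools import wraps

def is_remote(repo: str) -> bool:
    # Stand-in for the real check in git_utils.
    return repo.startswith(("http://", "https://", "git@"))

def clone(repo_arg: str):
    """Clone the argument named `repo_arg` to a temp dir if it is a remote URL."""
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            repo = kwargs.get(repo_arg)
            if isinstance(repo, str) and is_remote(repo):
                kwargs[repo_arg] = clone_to_tmp(repo)  # hypothetical helper
            return f(*args, **kwargs)
        return wrapper
    return decorator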

This is the readability / naming point of view. Now I would like to take the TDD point of view.

The tests are currently a mess. First I test all the functionality of the git_utils module. But then I cannot simply "trust" that the decorators will work on the api. Why? Because currently the consumed argument names are hard-coded. De facto, the decorators have been written to extend only the api, and there is tight coupling. So there is no other way to check that everything will work than inspecting what happens if you pass e.g. commit=True (first stash, then commit, then un-stash) or commit=False.
If the decorator took the argument name (e.g. commit), then I could simply check that the decorator was applied with the expected argument (controller_arg="commit"), which is part of the api function signature.
Currently I cannot simplify the tests in this way, because if I, for example, renamed the commit argument of the api function to self_commit, the tests would still pass, whereas the code would not work.

Finally, one last point of view: the code is currently rigid. If I want to rename the "repo" argument in the api functions, I need to change the git_utils module (same for commit and push). This is a bad sign. Ideally, I should be able to change the api module without touching any other.

Because of all this, I would advise refactoring the git_utils module to make it more of an independent helper, as outlined above.

This is one of the suggestions I have, others after we can agree on this one :)

@aguschin
Contributor Author

Interesting! I didn't think of that. Those changes should make things more decoupled and more easily tested; on that I agree. At the same time, renaming the wrappers and adding args should keep things understandable even with shorter names. Curious to see how that will change the tests, btw :)

@francesco086
Contributor

I will start with a little PR to rename auto_* to simply * (e.g. auto_push -> push).

Then I will refactor the git_utils module following #25 (comment)

@francesco086
Contributor

As promised :)

After this one, in principle one could also start extending the CLI help and the gto documentation to mention that remote repos are supported, because the user interface is now as desired.

@francesco086
Contributor

Hi @aguschin, a short update: I have started refactoring, but I encountered a major difficulty. I cannot figure out a reasonable way to assert that a function is decorated; for example, to assert that gto.api.show is decorated with @clone(repo_arg="repo"). I find it ridiculous, but the problem seems to stem from the fact that the decorator is applied at import time. However, even leveraging that, I could not find a way through.

In case you have any idea about it please let me know.

@aguschin
Contributor Author

As @mike0sv suggested, maybe you can get the function's source with inspect.getsource and check whether the correct @decorator is at the beginning.

And if the decorator could set some attribute on the function, you could check that attribute 🤔
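
For instance, a sketch of the attribute idea: the decorator records itself on the wrapped function, so a test can assert the wiring without parsing source code (the attribute name is an arbitrary choice):

from functools import wraps

def clone(repo_arg: str):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            return f(*args, **kwargs)  # cloning logic elided
        # Record the decoration so tests can assert it.
        wrapper._decorated_with = ("clone", {"repo_arg": repo_arg})
        return wrapper
    return decorator

# In the test suite:
# assert gto.api.show._decorated_with == ("clone", {"repo_arg": "repo"})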

@francesco086
Contributor

francesco086 commented Oct 26, 2022

Mmm... this is what I meant when I said "a reasonable way"... this was one of the options I could find by googling, but I find parsing the code a little extreme. Well, it seems I have no other choice, so I will give it a try.

It's the first time I've used decorators so heavily. Not the best experience so far... I'm starting to wonder if it was the right choice.

@aguschin
Contributor Author

aguschin commented Nov 8, 2022

Now it's funny we have a label named good first issue here. Removing it 😂

@aguschin
Contributor Author

aguschin commented Dec 6, 2022

As per #309, I consider this issue to be closed. show and other commands are fully supported, and optimization (e.g. caching) is an improvement for the future.

Thank you so much @francesco086, this was a really big effort from your side - I hope you are satisfied with the result! 🙌🏻

@aguschin aguschin closed this as completed Dec 6, 2022
Repository owner moved this from In Progress to Done in MLEM + GTO Dec 6, 2022
@francesco086
Contributor

Thank you so much @francesco086, this was a really big effort from your side - I hope you are satisfied with the result! 🙌🏻

Sorry for the late answer 😅 I somehow missed the notification...

Thank you @aguschin, most of the credit goes to you, for the many not-easy design decisions and, in the end, for all the refactoring!
I am indeed very happy with the final result; moreover, it was a great opportunity to learn! Looking forward to the next contribution 😃
