Reposcanner Design Changes #5

Closed
rmmilewi opened this issue Jun 1, 2020 · 5 comments

rmmilewi commented Jun 1, 2020

CC'ing @frobnitzem since we discussed some of this.

Unless there are any serious objections, I have a couple of changes I'll be making to Reposcanner to facilitate the kinds of analyses we plan to do.

  1. Adding support for loading repository lists via YAML files. Reposcanner will support passing a single repository or a set of repositories. Inputs will now flow through stateful analysis data objects that can hold onto credentials, lists of repositories, the set of routines to be performed, and so on. This will allow us to handle any number of repositories in a uniform way (see the sketch after this list).

  2. Making routines reusable. Right now we pass repository information to a routine via its constructor, so we'd have to create a new routine object for every repository we wanted to analyze. I'll be reworking these so that they're reusable interactors that are passed the details they need when execute() is called (also covered in the sketch below).

  3. Removing the render step from the routine workflow. Especially if we plan on rendering graphs of many different repositories and combining different data sources, there's not much benefit to generating graphs for each and every step of the process. Rendering can be moved out to the end of Reposcanner's execution.

  4. Creating a one-step solution for provenance and data curation. If we generate data for multiple repositories, we need to generate a "receipt" that covers the time of execution, the version of Reposcanner used, the repositories analyzed, the routines involved, and the files generated. This is easy to do with the data objects that I intend to add.

  5. Tests! Everything needs to be tested.
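
To make items 1 and 2 concrete, here's a rough sketch of how they could fit together. Everything below is illustrative rather than a final API: the YAML schema, `AnalysisTaskData`, `loadTaskFromYAML`, and `ContributorListRoutine` are all placeholder names.

```yaml
# Hypothetical input file: a single repository or a list of them.
repositories:
  - https://github.com/owner/project-a
  - https://github.com/owner/project-b
```

```python
# Illustrative sketch only; every name here is a placeholder.
import yaml

class AnalysisTaskData:
    """Stateful data object: holds credentials, repositories, and the
    routines to run, and accumulates results across the whole execution."""
    def __init__(self, credentials, repositories, routines):
        self.credentials = credentials
        self.repositories = repositories
        self.routines = routines
        self.results = {}

def loadTaskFromYAML(path, credentials, routines):
    """Accept either a single repository or a set of repositories."""
    with open(path) as f:
        config = yaml.safe_load(f)
    repositories = config["repositories"]
    if isinstance(repositories, str):
        repositories = [repositories]
    return AnalysisTaskData(credentials, repositories, routines)

class ContributorListRoutine:
    """Reusable: repository details arrive via execute() rather than the
    constructor, so one instance can serve any number of repositories."""
    def execute(self, repository, credentials):
        return {"repository": repository, "contributors": []}  # stubbed result

# Uniform handling of any number of repositories:
# task = loadTaskFromYAML("repos.yml", credentials, [ContributorListRoutine()])
# for repo in task.repositories:
#     for routine in task.routines:
#         task.results.setdefault(repo, []).append(
#             routine.execute(repo, task.credentials))
```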

rmmilewi self-assigned this on Jun 1, 2020
@frobnitzem

Looks good to me. My central design idea is to make an "execute" function that takes a repo name and a YAML "view" and renders it into a repo-specific directory.

There's a "hidden" input in there, which is what information the view has available to render with. That's the data model... My example had to do an extra "controller" call to create a contributor list, and implicitly took the GitHub repo object as its "model". I think the render templates should be paired 1:1 with controller code that calls the render template.

On the implementation of "stateful analysis data objects" I think simpler is better. A repo name and a controller function would be a minimal state as far as I can tell. The controller code can be explicitly annotated with the "data model" and "view" it uses for provenance, but I hope those are fairly statically associated with their controller.
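
As a sketch of what I mean (fetch_repo is a stand-in for however we obtain the model, and this assumes a Jinja2-style template object as the view and a PyGithub repository object as the model):

```python
import os

def fetch_repo(repo_name):
    """Stand-in: however we obtain the model object for a repo."""
    raise NotImplementedError

def contributor_controller(repo):
    """Controller paired 1:1 with its render template; it builds the data
    model (here, a contributor list) from a PyGithub repository object."""
    return {"name": repo.full_name,
            "contributors": [c.login for c in repo.get_contributors()]}

def execute(repo_name, view, controller, output_root="data"):
    """Render one repo's view into a repo-specific directory."""
    repo = fetch_repo(repo_name)
    model = controller(repo)         # the "hidden" input, made explicit
    rendered = view.render(**model)  # the view sees only the model
    outdir = os.path.join(output_root, repo_name.replace("/", "_"))
    os.makedirs(outdir, exist_ok=True)
    with open(os.path.join(outdir, "report.md"), "w") as f:
        f.write(rendered)
```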

rmmilewi commented Jun 1, 2020

> Looks good to me. My central design idea is to make an "execute" function that takes a repo name and a YAML "view" and renders it into a repo-specific directory.

Cool! Yeah, I just want to make sure I'm building up this codebase with that use case in mind.

> On the implementation of "stateful analysis data objects" I think simpler is better. A repo name and a controller function would be a minimal state as far as I can tell. The controller code can be explicitly annotated with the "data model" and "view" it uses for provenance, but I hope those are fairly statically associated with their controller.

Normally, what I'd do would be a "clean architecture" style, where I have a request object that gets passed in and a response object that gets returned. This would be used to create a strict wall of separation between the user interface and the analysis code.

However, I'm thinking that I want to retain data from one analysis to the next, which would mean having stateful containers for all the results. If you pass in multiple repos and have Reposcanner perform a set of routines on each, I imagine that you'd want to have all that data bundled together so your visualization routine can just pull out the pieces of data it needs across all the executions. But that all depends on how you want to work with the data downstream. I'm fine with whatever works for you.
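
To make that concrete, here's the rough shape I'm picturing (all of these class names are placeholders, not a final design):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AnalysisRequest:
    """Crosses the wall of separation from the UI to the analysis code."""
    repository: str
    routine_name: str

@dataclass
class AnalysisResponse:
    success: bool
    data: dict

@dataclass
class ResultsContainer:
    """Retains data from one analysis to the next, so a downstream
    visualization routine can pull what it needs across all executions."""
    responses: list = field(default_factory=list)

    def add(self, request, response):
        self.responses.append((request, response))

    def resultsForRoutine(self, routine_name):
        return [response.data for request, response in self.responses
                if request.routine_name == routine_name and response.success]
```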

@frobnitzem

I don't know what the user interface looks like, but I do want to keep views as a per-repo render template and a 'data directory' as a location to store all the rendered info. Then an entirely separate code could analyze the data directory.

rmmilewi commented Jun 8, 2020

@frobnitzem For the record, I am putting serious effort today towards making all these changes, and I'll let you know how things go.

rmmilewi commented Jun 9, 2020

@frobnitzem, I made a couple of updates yesterday that I'll mention here. All of these changes are guarded by unit tests now, which will help with stability.

  • RepositoryName has been replaced by RepositoryLocation, which can handle URLs and automatically attempts to guess what kind of version control platform is being used and whether it's hosted on an official corporate platform or hosted privately (i.e. GitHub repositories are always on GitHub's servers, but GitLab and Bitbucket can run on private servers). This makes it easier for us to establish the right type of API session for the right platform (see the sketch after this list).

  • GitHubCredentials is now VersionControlPlatformCredentials. I didn't change the functionality here — it was always broadly defined enough to support any modern version control platform.

  • I created the VCSAPISessionCreator class and a family of subclasses around it. Right now OnlineRepositoryAnalysisRoutine is hard-coded to work exclusively with GitHub repositories. I'll be making a change today where an online routine HAS-A session creator that hides all that knowledge, and different creators can manage different types of connections (e.g. GitHub vs. GitLab vs. Bitbucket); see the sketch below.
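
To give the flavor of the RepositoryLocation and session creator pieces, here's a simplified sketch (not the exact code; the enum values and URL heuristics are illustrative, and the token attribute on the credentials object is assumed):

```python
from enum import Enum, auto
from urllib.parse import urlparse

class VersionControlPlatform(Enum):
    GITHUB = auto()
    GITLAB = auto()
    BITBUCKET = auto()
    UNKNOWN = auto()

class RepositoryLocation:
    """Guesses the platform from a URL: GitHub always lives on github.com,
    while GitLab and Bitbucket may be privately hosted."""
    def __init__(self, url):
        self.url = url
        host = urlparse(url).netloc.lower()
        if host.endswith("github.com"):
            self.platform = VersionControlPlatform.GITHUB
        elif "gitlab" in host:
            self.platform = VersionControlPlatform.GITLAB
        elif "bitbucket" in host:
            self.platform = VersionControlPlatform.BITBUCKET
        else:
            self.platform = VersionControlPlatform.UNKNOWN

class VCSAPISessionCreator:
    """An online routine HAS-A session creator; subclasses hide all the
    platform-specific connection details."""
    def connect(self, location, credentials):
        raise NotImplementedError

class GitHubAPISessionCreator(VCSAPISessionCreator):
    def connect(self, location, credentials):
        import github  # PyGithub; assumes credentials carries a token
        return github.Github(credentials.token)
```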

Now that I've laid the groundwork I needed, I can complete step 2 ("Making routines reusable"). I'll swap out the GitHub-specific code in OnlineRepositoryAnalysisRoutine with instances of VCSAPISessionCreator. For OfflineRepositoryAnalysisRoutine, I'll either do something analogous (i.e. wrappers for pygit2, gitlab-clone, etc.) or use a simple subprocess.call() to git; we'll see. Then I'll make it so routines can be reused from one repository to the next, rather than having to construct a new one every time.
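
If I go the subprocess route, the offline path could be as simple as the following (names and paths illustrative only):

```python
import os
import subprocess

def cloneRepositoryForOfflineAnalysis(location, workspace_dir):
    """Clone a repository via the git CLI; returns the working-copy path."""
    name = os.path.basename(location.url)
    if name.endswith(".git"):
        name = name[:-len(".git")]
    destination = os.path.join(workspace_dir, name)
    subprocess.call(["git", "clone", location.url, destination])
    return destination
```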
