
Reposcanner Design Changes, Take Two! #10

Closed

rmmilewi opened this issue Jan 12, 2021 · 8 comments

Comments

@rmmilewi
Member

rmmilewi commented Jan 12, 2021

I'm planning some upcoming changes to Reposcanner to align the tool with Elaine's proposed "lab notebook" workflow, add features requested in #8, and make progress on project documentation as defined in #9. @frobnitzem, once you've had a chance to weigh in on these changes, I can start work on the following:

  1. We will create a one-step solution for provenance and data curation. This was the one work item left unfinished in Reposcanner Design Changes #5. Every run of Reposcanner will generate a log file containing pretty-printed information that can be copied into @elaineraybourn's lab notebook format, so we can keep track of the data we're generating with this tool. This includes timestamps, the version of Reposcanner used, which routines were included in the analysis, a table with the success/failure outcomes of the different routines on the set of repositories, and the files generated. There will be a module that encapsulates all of this logging functionality and talks to the manager to request the information it needs. I also plan on standardizing the outputs of the routines so there's a unique signature at the top of each output indicating the run that the data came from (a rough sketch follows after this list).

  2. Right now I have a hardcoded function that loads the routines used by Reposcanner; I have to extend that function every time we add a new routine, and there's no way to control which routines get loaded. I'm going to move this out to a config file that tells Reposcanner which routines to load.

  3. (@frobnitzem, here's where I especially need your input.) Any scripts that we run to perform analyses on the data or cross-reference the ECP database need to be version controlled and referenced in our lab notebook entries. The most straightforward solution I can think of is to make the downstream analyses first-class entities in Reposcanner. Just like data collection routines, analysis scripts that operate on the data generated by Reposcanner and/or the ECP database would be encapsulated in a standardized form and tracked by the Reposcanner manager in the same way (see the second sketch after this list). This way there are no loose ends in the chain of evidence used in our paper(s).

  4. While I'm doing all this, I'll update the project documentation so that it's current and advances the progress-tracking card on documentation defined in PTC: Documentation #9.
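
For item 1, here's roughly the kind of run signature I have in mind. This is just an illustrative sketch; the function name, version string, and routine name are placeholders, not the final API:

import datetime
import uuid

def build_run_signature(reposcanner_version, routine_names):
    # Describe a single Reposcanner run. The logging module would pretty-print
    # this at the top of every output file and in the lab-notebook log.
    return {
        "run_id": "run_" + uuid.uuid1().hex,
        "timestamp": datetime.datetime.now().isoformat(),
        "reposcanner_version": reposcanner_version,
        "routines": list(routine_names),
    }

signature = build_run_signature("0.1.0", ["ExampleRoutine"])
for key, value in signature.items():
    print(f"{key}: {value}")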
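
And for item 3, a sketch of what making analyses first-class might look like. Again, the class and method names here are hypothetical, not an existing Reposcanner interface:

from abc import ABC, abstractmethod

class AnalysisRoutine(ABC):
    # Hypothetical base class: a downstream analysis script subclasses this so the
    # manager can schedule, run, and log it just like a data-collection routine.

    @abstractmethod
    def can_handle(self, request):
        """Return True if this analysis knows how to process the given request."""

    @abstractmethod
    def run(self, request):
        """Perform the analysis and return a description of the files it produced."""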

@frobnitzem
Member

I have some thoughts here:

  1. Sounds good.
  2. I like to make configuration files in Python - that way they're fully scriptable. For example, you could have queries defined by a "queries.py" file and then load it with:
import importlib.util

def import_py(module_name, file_path):
    # Load a module directly from a file path, bypassing the normal import search.
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    # Optionally register it so later "import <module_name>" statements find it:
    # import sys; sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

queries = import_py("queries", "queries.py")

or something.

  3. To do this properly would require merging my database wrangling code out of reposcanner-data and into reposcanner. Then reposcanner-data would be left with only the yaml files defining project members and project repositories. Then, of course, the "version" stamp could be as simple as a reposcanner commit hash (a rough sketch of grabbing that hash follows below). I'd imagine my scripts sitting in an independent analysis subdirectory of reposcanner. I think you're the best one to do this merge, though. It should be simple enough to understand how my reposcanner-data project works if you clone it and run make.

  4. Those issues you reference were really just created for the sake of demonstrating how I think PTCs should be done (lightweight). I don't think we're at the stage where we need comprehensive documentation. A list of commands to run to get the analysis going should be all that's needed at this stage.
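
On the commit-hash point, something like this would be enough. A minimal sketch, assuming reposcanner is run from a git clone; the helper name is made up:

import subprocess

def get_reposcanner_commit(repo_path="."):
    # Ask git for the commit hash of the reposcanner checkout so the provenance
    # log can record exactly which version of the tool produced a given dataset.
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(get_reposcanner_commit())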

@elaineraybourn
Member

I updated the location of the lab notebook; it is now here: Reposcanner Lab Notebook/Reposcanner Lab Notebook template.md

@rmmilewi
Member Author

rmmilewi commented Jan 13, 2021

@frobnitzem

I like to make configuration files in Python - that way they're fully scriptable. For example, you could have queries defined by a "queries.py" file and then load it with...

Ooh, clever idea, I love it!

To do this properly would require merging my database wrangling code out of reposcanner-data and into reposcanner. Then reposcanner-data would be left with only the yaml files defining project members and project repositories.

Hmm... I was hoping to avoid that (if possible) to keep the database decoupled from Reposcanner to the greatest extent possible. The idea being that you run the database scripts and generate the files separately, and then an analysis script can grab those files if and when it needs them. But maybe there's a benefit to having the ECP database scripts integrated into Reposcanner in some way, like being able to regenerate the database files on a run to ensure the data is fresh/current. What do you think? I'm open to handling this in whatever way you think is best.

Those issues you reference were really just created for the sake of demonstrating how I think PTCs should be done (lightweight). I don't think we're at the stage where we need comprehensive documentation. A list of commands to run to get the analysis going should be all that's needed at this stage.

Fair. That being said, the list of commands is already out of date, and I want to make sure we keep it current; it's very easy to overlook documentation.

I updated the location of the lab notebook; it is now here: Reposcanner Lab Notebook/Reposcanner Lab Notebook template.md.

@elaineraybourn Cool! And I see you fleshed out an example using work we did for that paper, excellent!

@rmmilewi
Member Author

@frobnitzem FYI, I created a branch called PhaseTwoDesignUpdates to capture the work proposed in this issue. I'll let you know once it's ready for review. 😄

@rmmilewi
Member Author

@frobnitzem Just so you know, I've made good progress on the "one-step solution for provenance and data curation". Reposcanner now has a notebook feature that leverages the prov library (trungdong/prov, a W3C-compliant provenance data model) on the backend to non-intrusively log everything that Reposcanner does, the data that it uses and generates, etc. This allows us to know exactly where any data we use in a paper comes from, and based on the provenance information, I can generate content to help fill out the Reposcanner Lab Notebook Template that @elaineraybourn put together -- no ambiguity, no confusion, no second-guessing.

[Screenshot: provenance record for run_bc63955e713d11ebac874c3275929e39]
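
For reference, here's roughly the shape of the prov calls involved. The namespace URI, agent, and output file name below are placeholders for illustration, not Reposcanner's actual identifiers:

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("rs", "https://example.org/reposcanner/")  # placeholder namespace

# Model a run as an activity carried out by the tool (an agent)
# that generates data (an entity).
run = doc.activity("rs:run_bc63955e713d11ebac874c3275929e39")
tool = doc.agent("rs:reposcanner")
dataset = doc.entity("rs:contributor_data.csv")  # made-up output file name

doc.wasAssociatedWith(run, tool)
doc.wasGeneratedBy(dataset, run)

# Pretty-print the record in PROV-N; this is the sort of content that gets
# copied into a lab notebook entry.
print(doc.get_provn())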

As for the remaining feature requests, I'm hoping to have everything ready for review either by the end of this week or early next week.

@rmmilewi
Member Author

rmmilewi commented Feb 21, 2021

@elaineraybourn @frobnitzem The phase two design update is looking good! I'll be providing a demonstration at our next meeting to showcase the improvements, and we can talk about collecting data for upcoming deadlines. 😄

[Screenshot: phase_two_corrected]

@elaineraybourn
Member

Awesome work, @rmmilewi! Great to hear your updates today.

@rmmilewi
Member Author

This has been done for a while; going back and closing this issue.
