
Reposcanner Design Changes, Take Two! #10

Closed

rmmilewi opened this issue Jan 12, 2021 · 8 comments

Comments

@rmmilewi
Member

rmmilewi commented Jan 12, 2021

I'm planning some upcoming changes to Reposcanner to align the tool with Elaine's proposed "lab notebook" workflow, add features requested in #8, and make progress on project documentation as defined in #9. @frobnitzem, once you've had a chance to weigh in on these changes, I can start work on the following:

  1. We will create a one-step solution for provenance and data curation. This was the one work item left unfinished in Reposcanner Design Changes #5. Every run of Reposcanner will generate a log file containing pretty-printed information that can be copied into @elaineraybourn's lab notebook format, so we can keep track of the data we're generating with this tool. This includes timestamps, the version of Reposcanner used, which routines were included in the analysis, a table with the success/failure outcomes of the different routines on the set of repositories, and the files generated. There will be a module that encapsulates all of this logging functionality and talks to the manager to request the information it needs. I also plan on standardizing the outputs of the routines so there's a unique signature at the top of each output indicating the run that the data came from (a rough sketch follows after this list).

  2. Right now I have a hardcoded function that loads the routines used by Reposcanner; I have to extend that function every time we add a new routine, and there's no way to control which routines get loaded. I'm going to move this out to a config file that tells Reposcanner which routines to load.

  3. (@frobnitzem, here's where I especially need your input.) Any scripts that we run to perform analyses on the data or cross-reference the ECP database need to be version controlled and referenced in our lab notebook entries. The most straightforward solution I can think of is to make the downstream analyses first-class entities in Reposcanner. Just like data collection routines, analysis scripts that operate on the data generated by Reposcanner and/or the ECP database would be encapsulated in a standardized form and tracked by the Reposcanner manager in the same way (see the second sketch after this list). This way there are no loose ends in the chain of evidence used in our paper(s).

  4. While I'm doing all this, I'll update the project documentation so that it's current and advances the progress-tracking card on documentation defined in PTC: Documentation #9.
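
For item 1, here's roughly the kind of run signature I have in mind. This is just an illustrative sketch; the function name, version string, and routine name are placeholders, not the final API:

import datetime
import uuid

def build_run_signature(reposcanner_version, routine_names):
    # Describe a single Reposcanner run. The logging module would pretty-print
    # this at the top of every output file and in the lab-notebook log.
    return {
        "run_id": "run_" + uuid.uuid1().hex,
        "timestamp": datetime.datetime.now().isoformat(),
        "reposcanner_version": reposcanner_version,
        "routines": list(routine_names),
    }

signature = build_run_signature("0.1.0", ["ExampleRoutine"])
for key, value in signature.items():
    print(f"{key}: {value}")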
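
And for item 3, a sketch of what making analyses first-class might look like. Again, the class and method names here are hypothetical, not an existing Reposcanner interface:

from abc import ABC, abstractmethod

class AnalysisRoutine(ABC):
    # Hypothetical base class: a downstream analysis script subclasses this so the
    # manager can schedule, run, and log it just like a data-collection routine.

    @abstractmethod
    def can_handle(self, request):
        """Return True if this analysis knows how to process the given request."""

    @abstractmethod
    def run(self, request):
        """Perform the analysis and return a description of the files it produced."""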

@frobnitzem
Member

I have some thoughts here:

  1. Sounds good.
  2. I like to make configuration files in Python - that way they're fully scriptable. For example, you could have queries defined by a "queries.py" file and then load it with:
import importlib.util

def import_py(module_name, file_path):
    # Load a module directly from a file path, bypassing the normal import search.
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    # Optionally register it so later "import <module_name>" statements find it:
    # import sys; sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

queries = import_py("queries", "queries.py")

or something.

  3. To do this properly would require merging my database wrangling code out of reposcanner-data and into reposcanner. Then reposcanner-data would be left with only the yaml files defining project members and project repositories. Then, of course, the "version" stamp could be as simple as a reposcanner commit hash (a rough sketch of grabbing that hash follows below). I'd imagine my scripts sitting in an independent analysis subdirectory of reposcanner. I think you're the best one to do this merge, though. It should be simple enough to understand how my reposcanner-data project works if you clone it and run make.

  4. Those issues you reference were really just created for the sake of demonstrating how I think PTCs should be done (lightweight). I don't think we're at the stage where we need comprehensive documentation. A list of commands to run to get the analysis going should be all that's needed at this stage.
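
On the commit-hash point, something like this would be enough. A minimal sketch, assuming reposcanner is run from a git clone; the helper name is made up:

import subprocess

def get_reposcanner_commit(repo_path="."):
    # Ask git for the commit hash of the reposcanner checkout so the provenance
    # log can record exactly which version of the tool produced a given dataset.
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(get_reposcanner_commit())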

@elaineraybourn
Member

I updated the location of the lab notebook; it is now here: Reposcanner Lab Notebook/Reposcanner Lab Notebook template.md

@rmmilewi
Member Author

rmmilewi commented Jan 13, 2021

@frobnitzem

I like to make configuration files in Python - that way they're fully scriptable. For example, you could have queries defined by a "queries.py" file and then load it with...

Ooh, clever idea, I love it!

To do this properly would require merging my database wrangling code out of reposcanner-data and into reposcanner. Then reposcanner-data would be left with only the yaml files defining project members and project repositories.

Hmm... I was hoping to avoid that (if possible) to keep the database decoupled from Reposcanner to the greatest extent possible. The idea being that you run the database scripts and generate the files separately, and then an analysis script can grab those files if and when it needs them. But maybe there's a benefit to having the ECP database scripts integrated into Reposcanner in some way, like being able to regenerate the database files on a run to ensure the data is fresh/current. What do you think? I'm open to handling this in whatever way you think is best.

Those issues you reference were really just created for the sake of demonstrating how I think PTCs should be done (lightweight). I don't think we're at the stage where we need comprehensive documentation. A list of commands to run to get the analysis going should be all that's needed at this stage.

Fair. That being said, the list of commands is already out of date, and I want to make sure we keep it current; it's very easy to overlook documentation.

I updated the location of the lab notebook; it is now here: Reposcanner Lab Notebook/Reposcanner Lab Notebook template.md.

@elaineraybourn Cool! And I see you fleshed out an example using work we did for that paper, excellent!

@rmmilewi
Member Author

@frobnitzem FYI, I created a branch called PhaseTwoDesignUpdates to capture the work proposed in this issue. I'll let you know once it's ready for review. 😄

@rmmilewi
Member Author

@frobnitzem Just so you know, I've made good progress on the "one-step solution for provenance and data curation". Reposcanner now has a notebook feature that leverages the prov library (trungdong/prov, a W3C-compliant provenance data model) on the backend to non-intrusively log everything that Reposcanner does, the data that it uses and generates, etc. This allows us to know exactly where any data we use in a paper comes from, and based on the provenance information, I can generate content to help fill out the Reposcanner Lab Notebook Template that @elaineraybourn put together -- no ambiguity, no confusion, no second-guessing.

[Screenshot: provenance record for run_bc63955e713d11ebac874c3275929e39]
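
For reference, here's roughly the shape of the prov calls involved. The namespace URI, agent, and output file name below are placeholders for illustration, not Reposcanner's actual identifiers:

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("rs", "https://example.org/reposcanner/")  # placeholder namespace

# Model a run as an activity carried out by the tool (an agent)
# that generates data (an entity).
run = doc.activity("rs:run_bc63955e713d11ebac874c3275929e39")
tool = doc.agent("rs:reposcanner")
dataset = doc.entity("rs:contributor_data.csv")  # made-up output file name

doc.wasAssociatedWith(run, tool)
doc.wasGeneratedBy(dataset, run)

# Pretty-print the record in PROV-N; this is the sort of content that gets
# copied into a lab notebook entry.
print(doc.get_provn())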

As for the remaining feature requests, I'm hoping to have everything ready for review either by the end of this week or early next week.

@rmmilewi
Member Author

rmmilewi commented Feb 21, 2021

@elaineraybourn @frobnitzem The phase two design update is looking good! I'll be providing a demonstration at our next meeting to showcase the improvements, and we can talk about collecting data for upcoming deadlines. 😄

[Screenshot: phase_two_corrected]

@elaineraybourn
Member

Awesome work, @rmmilewi! Great to hear your updates today.

@rmmilewi
Member Author

This has been done for a while; going back and closing this issue.
