Reposcanner Design Changes, Take Two! #10
Comments
I have some thoughts here:
or something.
I updated the location of the lab notebook; it is now here: Reposcanner Lab Notebook/Reposcanner Lab Notebook template.md
Ooh, clever idea, I love it!
Hmm... I was hoping to avoid that (if possible) to keep the database decoupled from Reposcanner to the greatest extent possible. Like you run the database scripts and generate the files separately, then an analysis script can just grab those if and when they need those files. But maybe there's a benefit to having the ECP database scripts integrated into Reposcanner in some way, like if you wanted to regenerate the database files on a run to ensure the data is fresh/current. What do you think? I'm open to handling this in whatever way you think is best.
Fair. That being said, the list of commands is already out of date, and I want to make sure we keep it current, because it's very easy to overlook documentation.
@elaineraybourn Cool! And I see you fleshed out an example using work we did for that paper, excellent!
@frobnitzem FYI, I created a branch called PhaseTwoDesignUpdates to capture the work proposed in this issue. I'll let you know once it's ready for review. 😄
@frobnitzem Just so you know, I've made good progress on the "one-step solution for provenance and data curation". Reposcanner now has a notebook feature that leverages the prov library (trungdong/prov, a W3C-compliant provenance data model) on the backend to non-intrusively log everything that Reposcanner does, the data that it uses and generates, etc. This allows us to know exactly where any data we use in a paper comes from, and based on the provenance information, I can generate content to help fill out the Reposcanner Lab Notebook Template that @elaineraybourn put together -- no ambiguity, no confusion, no second-guessing. As for the remaining feature requests, I'm hoping to have everything ready for review either by the end of this week or early next week.
@elaineraybourn @frobnitzem The phase two design update is looking good! I'll be providing a demonstration at our next meeting to showcase the improvements, and we can talk about collecting data for upcoming deadlines. 😄
Awesome work, @rmmilewi! Great to hear your updates today.
This has been done for a while; going back and closing this issue.
I'm planning some upcoming changes to Reposcanner to align the tool with Elaine's proposed "lab notebook" workflow, add features requested in #8, and make progress on project documentation as defined in #9. @frobnitzem, once you've had a chance to weigh in on these changes, I can start work on the following:
We will create a one-step solution for provenance and data curation. This was the one work item left unfinished in Reposcanner Design Changes #5. Every run of Reposcanner will generate a log file that contains pretty-printed information that can be copied into @elaineraybourn's lab notebook format so we can keep track of the data we're generating with this tool. This includes timestamps, the version of Reposcanner used, which routines were included in the analysis, a table with the success/failure outcomes of the different routines on the set of repositories, and the files generated. There will be a module that encapsulates all this logging functionality and talks to the manager to request the information it needs. I also plan on standardizing the outputs of the routines so there's a unique signature at the top of the output indicating the run that the data came from.
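To make the log format concrete, here is a rough sketch of what such a logging module could produce. Everything in it is an assumption for illustration: `RunLog`, its fields, the `# reposcanner-run:` signature line, and the example routine and file names are hypothetical, not existing Reposcanner code.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RunLog:
    """Hypothetical record of a single Reposcanner run."""
    version: str
    routines: list
    # A fresh identifier and UTC timestamp are assigned per run.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
    outcomes: dict = field(default_factory=dict)  # (repo, routine) -> outcome
    files: list = field(default_factory=list)

    def record(self, repo, routine, ok, output_file=None):
        """Store the outcome of one routine on one repository."""
        self.outcomes[(repo, routine)] = "success" if ok else "failure"
        if output_file is not None:
            self.files.append(output_file)

    def signature(self):
        """Line to place at the top of every file the run generates."""
        return f"# reposcanner-run: {self.run_id} version: {self.version}"

    def render(self):
        """Pretty-print the run so it can be pasted into a notebook entry."""
        lines = [
            f"Run ID: {self.run_id}",
            f"Started (UTC): {self.started}",
            f"Reposcanner version: {self.version}",
            "Routines: " + ", ".join(self.routines),
            "",
            "| Repository | Routine | Outcome |",
            "| --- | --- | --- |",
        ]
        for (repo, routine), outcome in sorted(self.outcomes.items()):
            lines.append(f"| {repo} | {routine} | {outcome} |")
        lines.append("")
        lines.append("Files generated:")
        lines.extend(f"- {path}" for path in self.files)
        return "\n".join(lines)


# Example values only; the routine and file names are made up.
log = RunLog(version="0.0.0-example", routines=["ExampleRoutine"])
log.record("example/repo", "ExampleRoutine", ok=True,
           output_file="example_output.csv")
print(log.signature())
print(log.render())
```

The same `run_id` appears in both the log and the signature line, which is what would let us trace any output file back to the run that produced it.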
Right now I have a hardcoded function that loads the routines used by Reposcanner. I have to extend that function every time we add a new routine, and there's no way to control which routines are loaded. I'm going to move this out to a config file that tells Reposcanner which routines to load.
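One possible shape for this, as a minimal sketch: the config lists dotted class paths, and a loader imports them dynamically. The JSON layout, the `routines` key, and `load_routines` are assumptions for illustration, and the entries below are standard-library stand-ins so the example runs without Reposcanner installed.

```python
import importlib
import json


def load_routines(config_text):
    """Return the classes named in a JSON config of the (assumed) form
    {"routines": ["package.module.ClassName", ...]}."""
    config = json.loads(config_text)
    loaded = []
    for dotted_path in config.get("routines", []):
        # Split "package.module.ClassName" into module path and class name.
        module_name, _, class_name = dotted_path.rpartition(".")
        module = importlib.import_module(module_name)
        loaded.append(getattr(module, class_name))
    return loaded


# Stand-in entries from the standard library; a real config would list
# Reposcanner routine classes instead.
example_config = '{"routines": ["collections.Counter", "fractions.Fraction"]}'
routines = load_routines(example_config)
print([cls.__name__ for cls in routines])
```

With this approach, adding a routine means adding one line to the config rather than editing the loader, and leaving a line out is how you skip a routine for a given run.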
(@frobnitzem, here's where I especially need your input) Any scripts that we run to perform analyses on the data or cross-reference the ECP database need to be version controlled and referenced in our lab notebook entries. The most straightforward solution I can think of is to make the downstream analyses first-class entities in Reposcanner. Just like data collection routines, analysis scripts that operate on the data generated by Reposcanner and/or the ECP database will be encapsulated in a standardized form and tracked by the Reposcanner manager in the same way. This way there are no loose ends when it comes to the chain of evidence used in our paper(s).
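To illustrate the idea, here is a minimal sketch in which routines and analyses share one interface and the manager tracks both the same way. The class names (`Task`, `CountCommits`, `AverageCommits`, `Manager`) and the toy data are hypothetical and do not reflect the current Reposcanner code.

```python
from abc import ABC, abstractmethod


class Task(ABC):
    """Common interface for data collection routines and analyses."""

    @abstractmethod
    def run(self, inputs):
        """Consume inputs and return a result."""


class CountCommits(Task):
    """Stand-in data collection routine: commits per repository."""

    def run(self, inputs):
        return {repo: len(commits) for repo, commits in inputs.items()}


class AverageCommits(Task):
    """Stand-in downstream analysis operating on routine output."""

    def run(self, inputs):
        return sum(inputs.values()) / len(inputs)


class Manager:
    """Runs any Task and keeps a record of what ran and how it ended."""

    def __init__(self):
        self.history = []

    def execute(self, task, inputs):
        name = type(task).__name__
        try:
            result = task.run(inputs)
        except Exception as error:
            self.history.append((name, f"failure: {error}"))
            return None
        self.history.append((name, "success"))
        return result


manager = Manager()
counts = manager.execute(CountCommits(), {"repoA": ["c1", "c2"], "repoB": ["c3"]})
average = manager.execute(AverageCommits(), counts)
print(counts, average, manager.history)
```

Because the analysis goes through the same `execute` path as the routine, it ends up in the same history, so the chain from raw data to reported number is recorded in one place.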
While I'm doing all this, I'll update the project documentation so that it's current, which also makes progress on the documentation progress tracking card, PTC: Documentation #9.