Welcome to the yw-idcc-17 web site. This demo consists of examples of YW provenance queries highlighted in the IDCC'17 presentation, paper, and demo.
The purpose of this demo is to show how YesWorkflow (YW) queries can use the prospective provenance created by YW, retrospective provenance, and hybrid provenance together to answer queries that cannot be answered by prospective or retrospective provenance alone.
The prospective provenance in this demo is created by YW which models conventional scripts and programs as scientific workflows. YW can provide a number of the benefits of using a scientific workflow management system without having to rewrite scripts and other scientific software. A YW user simply adds special YW comments to existing scripts. These comments declare how data is used and results produced, step by step, by the script. Then, YW interprets these comments and produces graphical output that reveals the stages of computation and the flow of data in the script.
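As a rough illustration of what such comments look like, here is a minimal sketch of a Python script marked up with YW annotations. The `@begin`, `@desc`, `@in`, `@param`, `@out`, `@uri`, and `@end` keywords are YW annotation tags; the script itself, the function name `scale_readings`, and the file names are hypothetical and are not part of this demo.

```python
# @begin scale_readings @desc Scale raw sensor readings by a calibration factor.
# @in raw_readings @uri file:input/readings.csv
# @param factor
# @out scaled_readings @uri file:output/scaled.csv

def scale_readings(raw_readings, factor):
    """Multiply every reading by the calibration factor."""
    return [value * factor for value in raw_readings]

# @end scale_readings

if __name__ == "__main__":
    # Hypothetical in-memory data standing in for the annotated input file.
    print(scale_readings([1.0, 2.5, 4.0], 2))
```

The annotations are ordinary comments, so the script runs exactly as it did before they were added.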
There are various approaches to capturing retrospective provenance. Retrospective provenance observables, e.g., from DataONE RunManagers (file-level), ReproZip (OS-level), or noWorkflow (Python code-level), yield only isolated fragments of the overall data lineage and processing history. In this demo, two types of retrospective provenance observables are used: yw-recon and DataONE RunManager. yw-recon searches the file system for files that match the URI templates declared for `@IN` and `@OUT` ports in the script. DataONE RunManager, on the other hand, records a list of input and output files for a script run.
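To make the idea of matching files against URI templates concrete, the sketch below turns a template with `{variable}` placeholders into a regular expression and applies it to a list of paths. This is only an illustration of the general technique, not the yw-recon implementation; the function names, the template, and the paths are all hypothetical, and the placeholder names are assumed to be valid Python identifiers.

```python
import re

def template_to_regex(template):
    """Turn a URI template such as 'run/{sample}/image_{frame}.raw' into a regex."""
    pattern = ""
    # Split the template into literal text and {variable} placeholders.
    for part in re.split(r"(\{[^{}]+\})", template):
        if part.startswith("{") and part.endswith("}"):
            # A placeholder matches any run of characters without a slash.
            pattern += "(?P<%s>[^/]+)" % part[1:-1]
        else:
            pattern += re.escape(part)
    return re.compile("^" + pattern + "$")

def match_files(template, paths):
    """Return (path, variable bindings) for every path that fits the template."""
    regex = template_to_regex(template)
    matches = []
    for path in paths:
        found = regex.match(path)
        if found:
            matches.append((path, found.groupdict()))
    return matches
```

The sketch takes a list of paths instead of walking the file system so that it stays self-contained; a real reconstruction step would collect the paths from the run directory first.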
The following tools are used in our demo project:
Our system demonstration illustrates the variety of provenance information that we are able to capture, query, and visualize using a combination of tools for exposing both prospective and retrospective provenance. We show how prospective provenance can be declared using YesWorkflow (YW) annotations that reveal the fine-grained (variable-level) dataflow graph implicit in scripts, and how this prospective provenance can be integrated with the coarse-grained (file-level) retrospective provenance recorded by the DataONE Run Managers for MATLAB and R, the fine-grained retrospective provenance captured by noWorkflow, and user-exported log files at any level of granularity. We demonstrate the usefulness of integrating prospective and retrospective provenance in this way with four kinds of queries:
- Prospective provenance queries in the context of a single script: expose and test data dependencies at the workflow level.
- Retrospective provenance queries in the context of a single run of a single script: capture the actual input and output files of a script run and other runtime observables.
- Hybrid provenance queries in the context of a single script and a single run: blend retrospective and prospective provenance, yielding new knowledge artefacts.
- Provenance queries in the context of multiple scripts and multiple runs: query and visualize data dependencies across multiple script runs.
Our demonstration queries and provenance reports thus yield a more complete and comprehensible picture of data provenance from multiple script runs.
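The hybrid case can be sketched in a few lines: prospective provenance says which data items an output depends on, retrospective provenance says which files were observed for each data item, and the hybrid query joins the two. The demo itself expresses its queries as Prolog rules run in XSB; the Python below is only an illustration, and the dependency graph, data names, and function names are hypothetical.

```python
# Prospective provenance: the data items each item is derived from (hypothetical).
DEPENDS_ON = {
    "corrected_image": ["raw_image", "calibration_image"],
    "raw_image": ["sample_spreadsheet"],
}

# Retrospective provenance: files observed for each data item in one run (hypothetical).
OBSERVED_FILES = {
    "raw_image": ["run/raw/image_001.raw"],
    "calibration_image": ["calibration.img"],
    "sample_spreadsheet": ["cassette_q55_spreadsheet.csv"],
}

def upstream_data(name, depends_on):
    """Collect every data item that `name` transitively depends on."""
    seen = []
    stack = list(depends_on.get(name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(depends_on.get(current, []))
    return seen

def upstream_files(name, depends_on, observed_files):
    """Hybrid query: the observed files that contributed to `name` in this run."""
    files = []
    for data in upstream_data(name, depends_on):
        files.extend(observed_files.get(data, []))
    return sorted(files)
```

Neither table can answer the question alone: the dependency graph names no files, and the file list records no dependencies.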
Please read the Query README in the demo repo.
- YesWorkflow Graph for C3C4 Example
- Hybrid Graph for C3C4 Example
- YesWorkflow Graph for LIGO Example
- Hybrid Graph for LIGO Example
- noWorkflow Filtered Graph for LIGO Example
- YesWorkflow Graph for Kurator Example
- Hybrid Graph for Kurator Example
- YesWorkflow Graph for Twitter Example
- Hybrid Graph for Twitter Example
- Multiple_runs_Multiple_scripts_Graph for OHIBC Example
Directory | Description |
---|---|
examples/ | Contains examples demonstrating the queries in the queries folder |
queries/ | Stores the scripts for the nine demo queries. |
rules/ | Contains a set of Prolog rules: rules for generating prospective YesWorkflow views (yw_rules.P and yw_views.P), rules for reconstructing retrospective provenance (recon_rules.P), graph rendering rules (gv_rules.P), and graph population rules (yw_graph_rules.P). |
OHIBC_Howe_Sound_project/ | An R workflow project, OHIBC_HOWE_Sound, that is a real-life use case consisting of multiple R scripts. |
docker/ | Contains a Docker image that helps users reproduce the demonstrated provenance queries. |
yw_jar/ | Contains two versions of the YesWorkflow Java library. |
poster_template/ | Contains the poster and other publications. |
SQLiteToYaml/ | Contains a Java program that converts a SQLite database into a YAML file that YesWorkflow can query. |
The example subfolders also have a typical folder structure:
yw-idcc-17/examples/<my_example>/
All `<my_example>` folders have the following subfolders and scripts:
Directory | Description |
---|---|
script/ | the example script or scripts that make up <my_example> |
facts/ | the YW facts for <my_example>, generated by running YW on the example script(s) |
views/ | materialized views for <my_example> |
recon/ | reconstructed provenance used for <my_example> |
results/ | all artifacts generated by make.sh |
supplementary/ | a folder with supplementary files and information about the example |
clean.sh | removes generated demo artifacts for <my_example> |
make.sh | creates demo artifacts for <my_example> |
Please note: after running `clean.sh` and `make.sh`, you can use `git status` to see which demo artifacts have just been created.
simulate_data_collection/
├── clean.sh
├── facts
│ ├── yw_extract_facts.P
│ └── yw_model_facts.P
├── make.sh
├── results
├── script
│ ├── calibration.img
│ ├── cassette_q55_spreadsheet.csv
│ └── simulate_data_collection.py
└── views
    └── yw_views.P
The following free software is required to run this demo.
- Java: please install Java SE Development Kit 8 from the JDK downloads page at http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html and accept all default installation settings. Confirm that Java is available by typing the command below. If it is not, locate the directory containing the JDK executables (for example, `C:\Program Files\Java\jdk1.8.0_121\bin`) and add that directory to your Windows `Path` variable.

      my_home$ java -version
      java version "1.8.0_91"
      Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
      my_home$
- XSB: a Logic Programming and Deductive Database system for Unix and Windows ([XSB homepage](http://xsb.sourceforge.net)). The download and installation page for XSB is [here](http://xsb.sourceforge.net/downloads/downloads.html); alternatively, navigate to https://sourceforge.net/projects/xsb/files/xsb/. Version 3.7 is the newest version at the time of writing.
  - Install XSB on Mac/Linux: download the XSB tar package (XSB 3.6 (Linux/Mac/*nixes)) from the download page listed above. Then unpack the tarball in some directory. This should create a subdirectory called `XSB`, which contains the XSB sources. In the terminal, type:

        my_home$ tar xvf XSB.tar
        my_home$ cd XSB/build
        my_home$ ./configure
        my_home$ ./makexsb
        my_home$ /Users/my_home/XSB/bin/xsb

    Next, you might add the directory containing the XSB executable (`/Users/my_home/XSB/bin/xsb`) to the `PATH` variable. For example, in a `~/.bashrc` file, add this line:
```sh
export PATH="/Users/my_home/XSB/bin:$PATH"
```
Then, in a terminal, type these commands:
```sh
my_home$ source ~/.bashrc
my_home$ which xsb
/Users/my_home/XSB/bin/xsb
```
  - Install XSB on Windows: download the XSB executable `xsb-3.6.0.exe` for the Windows platform. Run the downloaded installer and accept all default settings. Windows users need a few extra steps. First, determine which of these directories contains the XSB executable that works on your computer:

        C:\Program Files (x86)\XSB\config\x64-pc-windows\bin
        C:\Program Files (x86)\XSB\config\x86-pc-windows\bin

    Then add that directory to your Windows `Path` variable via `Control Panel -> System and Security -> System -> Advanced System Settings -> Environment Variables -> Path`. Type `xsb` in a command console to confirm that XSB runs from the command prompt:

        C:\Users\my_home> xsb
        [xsb_configuration loaded]
        [sysinitrc loaded]
        [xsbbrat loaded]
        XSB Version 3.6. (Gazpatcho) of April 22, 2015
        [x64-pc-windows; mode: optimal; engine: slg-wam; scheduling: local]
        [Build date: 2015-04-22]
        | ?- halt.
        End XSB (cputime 0.05 secs, elapsetime 4.22 secs)
- Graphviz: graph visualization software for Unix and Windows. It is available from the Graphviz homepage, which also hosts the download and installation page.
  - For Mac/Linux, click "Agree" to accept the agreement. You are then directed to a download page, where you should choose the install package for your platform. For example, on Mac we use `graphviz-2.38.0.pkg`. Once the package is downloaded to your computer, right-click `graphviz-2.38.0.pkg` and choose "Open" when a window asks whether you want to open it. Then follow the installation procedure and accept all default settings. When the installation is complete, you might check the `dot` command in a terminal by typing:

        my_home$ which dot
        /usr/local/bin/dot
  - For Windows, download the `graphviz-2.38.msi` installer package and start the installer. You might accept all default settings. Confirm that the `dot` command is available by typing it in a command console. If you see the error below, first determine the directory containing the `dot.exe` binary (for example, `C:\Program Files (x86)\Graphviz2.38\bin`) and then add that directory to your Windows `Path` variable.

        C:\Users\my_home> dot
        'dot' is not recognized as an internal or external command,
        operable program or batch file.
- Git:
  - Installing Git for Mac
    - The easiest way is to use the graphical Git installer, which you can download from the SourceForge page.
    - If you have MacPorts installed, install Git via

          $ sudo port install git
    - If you have Homebrew installed, install Git via

          $ brew install git
  - Installing Git for Linux: if you want to install Git on Linux via a binary installer, you can generally do so through the basic package-management tool that comes with your distribution. If you’re on Fedora, you can use `yum`:

        $ yum install git

    Or, if you’re on a Debian-based distribution like Ubuntu, try `apt-get`:

        $ apt-get install git
  - Installing Git for Windows: download Git for Windows from https://git-for-windows.github.io/. Run the downloaded `Git-2.11.1-64-bit.exe`, accept the default configuration, and finish the installation. Check the `git` command in the command shell by typing `git --version`. Next, you might add the path to the bash executable included with Git for Windows (`C:\Program Files\Git\bin`) to your Windows `Path` variable so that bash scripts can run from the command prompt directly.

        C:\Users\my_home> git --version
        git version 2.11.1.windows.1
- SQLite: a high-reliability, embedded, zero-configuration, public-domain SQL database engine. It is available at the SQLite homepage.
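Before moving on, it can help to confirm that every prerequisite is reachable from the command line. The following is a minimal sketch that looks each command up on the `PATH`; the command names in `REQUIRED_COMMANDS` (in particular `sqlite3`) are assumptions about a typical installation and may differ on your system.

```python
import shutil

# Command names assumed for each prerequisite; adjust if your install differs.
REQUIRED_COMMANDS = ["java", "xsb", "dot", "git", "sqlite3"]

def find_missing(commands):
    """Return the commands that cannot be found on the PATH."""
    return [name for name in commands if shutil.which(name) is None]

if __name__ == "__main__":
    missing = find_missing(REQUIRED_COMMANDS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All prerequisite commands were found.")
```

A command reported as missing usually means that its directory still needs to be added to the `PATH` (or Windows `Path`) variable as described above.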
Clone the `yw-idcc-17` git repo to your local machine by running the following command in the terminal (Mac/Linux) or the command shell (Windows):
git clone https://github.com/yesworkflow-org/yw-idcc-17.git
Run the demo from the command shell. Windows users can either run it from the Git shell, which provides the `bash` command, or run it from the command prompt directly. The bash scripts have been tested on Mac and Windows.
- Go to the `examples/` folder. Two types of examples are demonstrated: single scripts implemented in various programming languages, and an R workflow project. We have provided the following examples:
  - Type I: single scripts in various programming languages: a MATLAB example (`C3C4/`) and four Python examples (`LIGO/`, `Twitter/`, `simulate_data_collection/`, and `kurator-SPNHC16-YW-xsb/`).
  - Type II: a real-life R workflow project, `OHIBC_HOWE_Sound_project/`.
- Go to one of the above examples. First, run the cleaning script by calling `bash clean.sh` or `./clean.sh`.
- Run the demo example by calling `bash make.sh` or `./make.sh`. Windows users, please refer to the example below. Note that in some cases you may need to add `C:\Program Files\Git` to the `Path` variable and use the `git-bash` or `git-cmd` command instead of the `bash` command. This way, the scripts work both with `bash` in the Git shell and with `git-bash` or `git-cmd` in the command shell.
- For Mac/Linux platform,
my_home$ ls
OHIBC_Howe_Sound_project docker queries
README.md examples rules
SQLiteToYaml poster_template yw_jar
my_home$ cd examples/C3C4/
my_home$ ls
clean.sh facts make.sh recon results script supplementary views
my_home$ bash clean.sh
my_home$ bash make.sh
- For Windows platform,
C:\Users\my_home\Desktop\yw-idcc-17>cd examples\C3C4
C:\Users\my_home\Desktop\yw-idcc-17\examples\C3C4>dir
Volume in drive C is Windows8_OS
Volume Serial Number is 6473-FB35
Directory of C:\Users\my_home\Desktop\yw-idcc-17\examples\C3C4
02/20/2017 10:39 AM <DIR> .
02/20/2017 10:39 AM <DIR> ..
02/18/2017 12:47 PM 132 clean.sh
02/18/2017 02:14 PM <DIR> facts
02/18/2017 12:47 PM 8,546 make.sh
02/18/2017 12:47 PM <DIR> recon
02/18/2017 02:14 PM <DIR> results
02/18/2017 12:47 PM <DIR> script
02/18/2017 12:47 PM <DIR> supplementary
02/18/2017 02:14 PM <DIR> views
2 File(s) 8,678 bytes
8 Dir(s) 77,619,445,760 bytes free
C:\Users\my_home\Desktop\yw-idcc-17\examples\C3C4>bash make.sh
- Go to the `results/` folder and check the generated provenance query results. Mac users can open the PDF files with the `open` command, while Windows users can use the `start` command.
- Copy your example folder into the `examples/` folder.
- Reorganize the directory layout of your example to match that of `C3C4`, `LIGO`, and `simulate_data_collection`. Create a `recon/` folder that contains your `reconfacts.P`.
- Copy the two script files `clean.sh` and `make.sh` from `simulate_data_collection` (one of the three existing examples named above) to your own example folder.
- Open `make.sh` and customize the script name, output file name, and parameter data object name for your example.
- Run `bash make.sh`.
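A quick way to confirm that a new example follows the expected layout is to compare it against the entries listed in the folder-structure table earlier in this README. The sketch below does that; the function name `missing_entries` is hypothetical, and it assumes that all five subfolders should exist, although some of them (such as `results/`) may only be populated once `make.sh` has run.

```python
import os

# Entries taken from the example folder-structure table in this README.
EXPECTED_DIRS = ["script", "facts", "views", "recon", "results"]
EXPECTED_FILES = ["clean.sh", "make.sh"]

def missing_entries(example_dir):
    """List the expected subfolders and scripts that example_dir lacks."""
    missing = []
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(example_dir, name)):
            missing.append(name + "/")
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(example_dir, name)):
            missing.append(name)
    return missing
```

Calling `missing_entries("examples/<my_example>")` from the repo root returns an empty list when the layout is complete.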
We have created a Docker image (`yesworkflow/provenance-demo`) to help readers explore the demonstrated YesWorkflow provenance queries. The `yesworkflow/provenance-demo` image comes with XSB, Graphviz, YesWorkflow, noWorkflow, and the DataONE demo queries installed. Using this image, users can boot up a Docker container and run the demo provenance queries within seconds, without having to install packages manually.
Here are instructions for each OS:
As part of this installation process, you’ll need to use a shell prompt. A special version of the shell comes pre-configured for Docker commands; use it whenever you run a Docker command. Here is how to open it:
- Mac OS – launch the `Docker Quickstart Terminal` application from Launchpad.
- Linux – launch any bash shell prompt; `docker` will already be available.
- Windows – click the `Docker Quickstart Terminal` icon on your desktop.
Users can download the image from Docker Hub, which is similar to GitHub, with the following command. The command syntax is `docker pull IMAGE_NAME`. The name of our current provenance query image is `yesworkflow/provenance-demo`, so users can type the following command into a shell prompt:
docker pull yesworkflow/provenance-demo
This downloads the image from Docker Hub, the registry for Docker images.
Once the image is downloaded, users can run it with the `docker run` command. Executing `docker run` creates a Docker container that is isolated from the user's local computer. Here are some configuration options for `docker run`:
- `-i`: interactive session
- `-t`: allocate a TTY
- `-v H:C`: mount the host path `H` on your computer at the path `C` inside the Docker container.
The full command to run the provenance query looks like:
docker run -it -v $HOME:$HOME yesworkflow/provenance-demo
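To show how the `-v H:C` option maps onto this command, here is a minimal sketch that assembles the same argument list from its parts. The helper name `docker_run_command` is hypothetical; the sketch only builds and prints the command and does not start a container.

```python
import os

def docker_run_command(image, host_path, container_path):
    """Build the argument list for an interactive `docker run` with one mount."""
    # -it combines -i (interactive) and -t (TTY); -v takes HOST:CONTAINER.
    return ["docker", "run", "-it", "-v", host_path + ":" + container_path, image]

if __name__ == "__main__":
    home = os.path.expanduser("~")
    # Mount the home directory at the same path inside the container.
    print(" ".join(docker_run_command("yesworkflow/provenance-demo", home, home)))
```

Mounting the home directory at the same path inside the container, as the command above does with `$HOME:$HOME`, keeps file paths identical on both sides.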
Then, users can go to ... to check the query results.
- Q Zhang, Y Cao, Q Wang, D Vu, P Thavasimani, T McPhillips, P Missier, B Ludäscher. Revealing the Detailed Lineage of Script Outputs Using Hybrid Provenance. IDCC 2017 (Practice Paper track).
- Y Cao, P Slaughter, C Jones, MB Jones, Q Wang, D Vu, P Thavasimani, Q Zhang, T McPhillips, P Missier, L Walker, D Vieglais, B Ludäscher. Demonstrating Hybrid Provenance Queries from Script Runs. IDCC 2017 (Demo).
- BS Halpern, C Longo, D Hardy, KL McLeod, JF Samhouri, SK Katona, et al. (2012) An index to assess the health and benefits of the global ocean. Nature. 2012;488: 615–620. doi:10.1038/nature11397.
- Y Wei, S Liu, D Huntzinger, A Michalak, N Viovy, W Post, C Schwalm, K Schaefer, A Jacobson, C Lu, H Tian, D Ricciuto, R Cook, J Mao, X Shi. (2014) NACP MsTMIP: Global and North American Driver Data for Multi-Model Intercomparison. http://dx.doi.org/10.3334/ORNLDAAC/1220
- LIGO Open Science Center: Signal Processing with GW150914 Open Data. https://losc.ligo.org/events/GW150914/