
[WIP] workflow toml specification and initial implementation #1

Closed
wants to merge 1 commit from jc/workflow into master

Conversation


@johnnychen94 johnnychen94 commented Aug 5, 2021

Key features I have in mind:

  • language-agnostic: it supports tasks written in any language (Python, Julia, shell, and others)
  • implementation-agnostic: all information is stored in an exchange format. Here I use TOML because it's a Julia standard library.
  • verbosity and reproducibility: it contains all the information needed to reproduce the results
  • flexibility: duplication is allowed, since flexibility is more important than avoiding it
  • sandboxed & atomic: if a task fails, it cleans up all of its intermediate temporary results

My design:

  • A workflow consists of multiple ordered stages. Stages are often dependent, e.g., one stage consumes the outputs of earlier stages.
  • A stage consists of multiple tasks. Tasks are independent and can run concurrently.
  • The workflow runs in a nested mapreduce fashion:
    1. The workflow handler dispatches work to stage handlers sequentially; each stage handler dispatches work to its task runners concurrently (or sequentially).
    2. Task runners run and output their results; the stage handler collects them and notifies the task runners to delete their data. The collected results of the stage handlers are kept as the workflow output.
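The review comments below show a `[stages.run]` table with metrics, out, and driver fields; a fuller spec in that style might look roughly like the following. The `[[stages.run.tasks]]` schema here is my illustrative guess, not the actual format from the PR:

```toml
[stages.run]
metrics = ["time"]
out = ["results.csv"]
driver = "csv"

# hypothetical task entries: tasks within a stage are independent,
# so the stage handler may run them concurrently
[[stages.run.tasks]]
name = "dilate"
tags = ["juliaimages", "morphology"]
script = "julia benchmark/dilate.jl"

[[stages.run.tasks]]
name = "dilate"
tags = ["skimage", "morphology"]
script = "python benchmark/dilate.py"
```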

The idea mainly comes from two designs: dvc yaml and datasets toml.

Example: benchmark framework

The initial purpose of this is to support a generic, wide-ranging benchmark framework. There are some key challenges to achieving this:

  • we want to benchmark multiple frameworks, e.g., Images.jl, OpenCV, scikit-image, and others
  • function f in framework A might not have a corresponding function in other frameworks
  • what people want to benchmark is usually application-oriented, so the benchmark targets and scripts can change frequently
  • how people want to view the benchmark results is highly opinionated

A natural idea is to separate the data-producing stage from the data-analysis stage. We can produce as many benchmark results as we want, as long as we tag them properly. Then, in the data-visualization stage, we use filters to collect the results we're interested in.

BenchmarkTools and PkgBenchmark use a tree design (nested groups) to organize benchmark tasks. I reckon it a bad design because 1) it introduces an overly compact form that makes it very hard to adjust benchmark targets, and 2) it makes future result filtering much harder. Thus I chose a "name" + "tags" design for the visualization and analysis stage:

  • "name" is used to check whether a function f has multiple implementations.
  • "tags" are used to filter the entire benchmark dataset down to a small subset of data we might be interested in.
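As a sketch of the filtering step, assuming results are stored as records carrying "name" and "tags" fields (the data layout and numbers here are illustrative, loosely echoing the CSV output shown in the comments below; this is not the PR's actual storage format):

```python
# illustrative records with "name" + "tags" metadata
results = [
    {"name": "dilate", "tags": ["juliaimages", "morphology"], "time": 0.623},
    {"name": "erode",  "tags": ["juliaimages", "morphology"], "time": 0.628},
    {"name": "dilate", "tags": ["skimage", "morphology"],     "time": 1.577},
]

def filter_by_tags(results, wanted):
    # keep only entries whose tag set contains every wanted tag
    return [r for r in results if set(wanted) <= set(r["tags"])]

filter_by_tags(results, ["morphology"])  # all three entries
filter_by_tags(results, ["skimage"])     # only the scikit-image entry
```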

A unique ID is generated by joining "name" and "tags" (e.g., join([name, sort(tags)...])).
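A minimal sketch of the ID scheme, assuming "_" as the separator (the separator is inferred from ids like dilate_juliaimages_morphology in the CSV output below; the PR snippet leaves it unspecified):

```python
def unique_id(name, tags):
    # sort the tags so the id is stable regardless of tag order;
    # "_" separator is an assumption, not from the PR
    return "_".join([name, *sorted(tags)])

unique_id("dilate", ["morphology", "juliaimages"])  # "dilate_juliaimages_morphology"
```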

cc: @ashwani-rathee

@codecov

codecov bot commented Aug 5, 2021

Codecov Report

Merging #1 (962c9f7) into master (9571943) will not change coverage.
The diff coverage is 0.00%.


@@          Coverage Diff           @@
##           master      #1   +/-   ##
======================================
  Coverage        0   0.00%           
======================================
  Files           0       4    +4     
  Lines           0      71   +71     
======================================
- Misses          0      71   +71     
Impacted Files Coverage Δ
src/Workflows.jl 0.00% <0.00%> (ø)
src/drivers.jl 0.00% <0.00%> (ø)
src/parsing.jl 0.00% <0.00%> (ø)
src/report.jl 0.00% <0.00%> (ø)


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +7 to +10
[stages.run]
metrics = ["time"]
out = ["results.csv"]
driver = "csv"
Owner Author

Currently, it outputs

id,time
dilate_juliaimages_morphology,0.623235
erode_juliaimages_morphology,0.627855
dilate_skimage_morphology,1.5773831250000003

and if I change to metrics = ["time", "memory"], then it becomes

id,time,memory
dilate_juliaimages_morphology,0.636088,1280096
erode_juliaimages_morphology,0.641488,1280096
dilate_morphology_skimage,1.8287903700000008,

cc: @ashwani-rathee

Owner Author

I should make metrics an optional field that defaults to all results, or just remove the field entirely.


import json
import timeit

# `dilation`, `img`, and `square` are assumed to be defined earlier in the
# script, e.g. imported from skimage.morphology
count, time = timeit.Timer('dilation(img, square(3))', globals=globals()).autorange()

# export: print the result as JSON on stdout so the stage handler can collect it
print(json.dumps({"time": 1e3*time/count})) # ms
Owner Author

The shell runner passes data back to the stage handler via stdout=IOBuffer().
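On the handler side, capturing a runner's stdout and decoding the JSON could look roughly like this. This is a sketch of the idea, not the package's actual implementation; the inline -c script stands in for a real task script:

```python
import json
import subprocess
import sys

# run a task script in a subprocess and capture whatever it prints to stdout
proc = subprocess.run(
    [sys.executable, "-c", 'import json; print(json.dumps({"time": 0.5}))'],
    capture_output=True, text=True, check=True,
)

# the runner's stdout carries the JSON-encoded result
result = json.loads(proc.stdout)
```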

Comment on lines +9 to +12
Dict(
"memory" => rst.memory, # byte
"time" => median(rst.times)/1e6, # ms
) |> JSON3.write
Owner Author

The Julia runner could pass a Dict directly, but I want to keep this language-agnostic, so for consistency I let it pass a JSON string as well.

@johnnychen94
Owner Author

Closed in favor of #8.

@johnnychen94 johnnychen94 deleted the jc/workflow branch February 13, 2022 12:51