RFC: benchmarking infrastructure. #3

Merged: 2 commits into bytecodealliance:main on Mar 8, 2021

Conversation

cfallin (Member) commented Nov 17, 2020

This RFC proposes establishing a benchmarking infrastructure for Cranelift/wasmtime.

rendered

fitzgen (Member) commented Nov 18, 2020

Sounds good to me.

FWIW, I have a complementary RFC incoming that focuses on selecting a corpus of benchmark programs, how to avoid measurement bias, and how to do a sound statistical analysis of results.

I don't think we need to block on separating the benchmark server from the web server, but it should be high priority in my opinion.

Regarding security, we should have an allow list of users who can trigger benchmark runs for PRs. These users should be able to comment @bytecodealliance-bot bench (or whatever) on other people's PRs to trigger a run, if needed. That is, the allow list should be checked against the person commenting to start a benchmark run, not the author of a PR.


If any of the above becomes impractical or difficult, an intermediate
design-point could involve a web interface that allows approved users
(authenticated in some way) to request a run on a given git commit hash; this
Member commented:

+1 to this feature; it's especially powerful for finding which commit introduced a regression via bisection.

that the integration should not present unforeseen problems.

# Open questions
[open-questions]: #open-questions
Member commented:

From my experience maintaining arewefastyet, another pain point has been availability of the service. Machines would sometimes stop running the benchmarks for Some Reason (e.g. timeout, not enough disk space on one machine, network failure, ...). Then a person needs to look into it. This can lower availability and, in particular, increase latency (if only some people in a given time zone have access to some machines, then the system may be unavailable for half a day at a time). I think it would be worth thinking about this ahead of time, and/or stating what the availability expectations are.

Member commented:

I agree that maintaining continuous benchmarking, while great, does cost resources and will have hiccups. Those are fine, I think, but it's something we'll want to go into with eyes open.

cfallin (Member Author) commented:

Yes, I think the most important thing here is that it won't necessarily gate PRs: if there's a performance-sensitive thing going in and the benchmarking infrastructure is down, maybe we wait until we have numbers, but otherwise it seems like a thing we can easily enough consider optional when unavailable.

alexcrichton (Member) left a comment:

👍 Thanks for writing this up!

One small concern I have is about where all of this is going to live in terms of benchmarking code. We don't want to get into a situation where we need to compare two commits but APIs changed between them so we can't compare the two easily. The only answer I can see to this, though, is to store all of the benchmarks in the wasmtime repository itself so we can update APIs and examples as the API changes, but I'm not sure how feasible this will be over time if we want larger benchmarks. In any case it might be worth thinking a bit about the stable interface between the benchmark runner and wasmtime? (It may also be too early to think about this.)

this host for running benchmarks as well, if all stakeholders agree.

We will need an x86-64 host as well; the best option is likely to rent a
low-cost dedicated host. Details of hosting choice and funding are TBD.
Member commented:

FWIW this is what rust-lang/rust does and I think it's worked out relatively well for them in terms of reliability and cost.

order to serve this UI. Alternately, if hosting performance becomes an issue,
we could design a static/pregenerated-page workflow wherein the runners upload
data to a git repository and static data is served from GitHub Pages (or
another static host).
Member commented:

For a dynamic website one of my concerns would be managing deployment and such, but one possibility is that I believe Heroku has a free tier which should be more than sufficient for this purpose?

cfallin (Member Author) commented:

That's a good question; @jlb6740 or @abrown might have more thoughts on what is possible here with the current Sightglass UI?

Member commented:

I think we can get pretty far with a static github pages frontend and have JS to switch between static graph images that were pre-rendered by the benchmark analysis.

A commenter replied:

@cfallin @alexcrichton .. Not sure I get the question correctly (specifically "static data"?). There is a viewer (nuxt.js) that loads a json file that serves as the database. Separately there is another html file that does a simple plot of the latest data (it can be extended to allow selecting any history of data, or any specific data). What was attempted before was collecting data on one machine and then scp'ing this updated json database/history file to GitHub Pages, which served the same viewer. I don't remember the issues with that but I can try to duplicate that effort again and go from there.


alexcrichton (Member) commented:

Oh there's one other thing I thought of just now, which is that we should decide up-front what platforms we would like to benchmark. You mentioned that we happen to have an aarch64 machine, and we can probably get an x86_64 one as well, but there's also a question of whether to include macOS and Windows with what will presumably be Linux benchmarking. The downside to this is that if we have more than one target to benchmark that's multiple physical machines we're managing, and it also makes collection a bit harder since we need to coordinate across systems. That being said though I think we would get a lot of value from benchmarking three targets:

  • x86_64-pc-windows-msvc
  • x86_64-unknown-linux-gnu
  • aarch64-unknown-linux-gnu

Money permitting I think having at least aarch64/x86_64 will be very beneficial to the two main backends, and then once we have two the infrastructure to add a third ideally won't be the worst, but we'd have to do Windows implementation work.

cfallin (Member Author) commented Nov 18, 2020

The platform question is an important one; thanks for bringing it up! Diversity of evaluation platforms is certainly important; I had had CPU architecture diversity front-of-mind, but I imagine there will be plenty of unique OS-specific performance concerns in runtime work (WASI bindings, async runtime, exception support, etc).

Perhaps we can "lazily instantiate" platforms as needed -- and keep portability in mind as much as possible as we develop the benchmark runner. Out of curiosity, how difficult was Windows-specific perf-testing infra to build in your past experience (Rust) -- is it a MinGW/Cygwin "mostly still Linux" setup or is there a native Windows runner daemon of some sort?

fitzgen (Member) commented Nov 18, 2020

@abrown and I have talked a bit about the interface between Wasmtime and the runner. The runner would open a dylib that has an expected interface (this is similar to what sightglass currently does, but we would want a slightly different interface). This interface will consist of three functions, one for each thing that we'll measure: compile, instantiate, and execute. (And also functions to drop and deallocate the instance and module.) Implementing these functions should be very straightforward and only minimally affected by Wasmtime API changes.

When comparing multiple commits, we would build one of these dylibs for each commit.
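
To make the shape of this concrete, here is a minimal sketch of what such a dylib might export, assuming a C ABI (so the child runner can load it with dlopen/libloading) and illustrative symbol names; the engine-specific bodies are placeholders marking where the Wasmtime calls would go, not a settled design:

/// Opaque state standing in for a compiled module; a real implementation would
/// hold a `wasmtime::Module` (and engine) built from this commit's Wasmtime.
pub struct BenchModule {
    wasm: Vec<u8>,
}

/// Opaque state standing in for an instantiated module.
pub struct BenchInstance {
    module: *mut BenchModule,
}

/// Compile the given Wasm bytes; the child runner times this call.
#[no_mangle]
pub unsafe extern "C" fn bench_compile(wasm: *const u8, wasm_len: usize) -> *mut BenchModule {
    let bytes = std::slice::from_raw_parts(wasm, wasm_len).to_vec();
    // A real implementation would call e.g. `wasmtime::Module::new` here.
    Box::into_raw(Box::new(BenchModule { wasm: bytes }))
}

/// Instantiate a compiled module; the child runner times this call.
#[no_mangle]
pub unsafe extern "C" fn bench_instantiate(module: *mut BenchModule) -> *mut BenchInstance {
    // A real implementation would link WASI plus the "bench" imports and instantiate.
    Box::into_raw(Box::new(BenchInstance { module }))
}

/// Run the instance to completion; the measured region is bounded by the guest's
/// `"bench" "start"` / `"bench" "end"` calls.
#[no_mangle]
pub unsafe extern "C" fn bench_execute(instance: *mut BenchInstance) {
    // A real implementation would invoke the instance's entry point (e.g. `_start`).
    let _wasm_size = (*(*instance).module).wasm.len();
}

/// Drop and deallocate the instance.
#[no_mangle]
pub unsafe extern "C" fn bench_instance_drop(instance: *mut BenchInstance) {
    drop(Box::from_raw(instance));
}

/// Drop and deallocate the module.
#[no_mangle]
pub unsafe extern "C" fn bench_module_drop(module: *mut BenchModule) {
    drop(Box::from_raw(module));
}

Since the surface is just a handful of C symbols, each commit under test can be built into its own shared object without the runner itself depending on Wasmtime's Rust API.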

Additionally, there is the question of the interface between the Wasm file and Wasmtime. This should basically just be WASI. But, to avoid measuring I/O used to load (for example) an OpenCV model when all we really care about is the compute portion of the program, we should define "bench" "start" and "bench" "end" functions that the Wasm module can import. During the execute phase, we will only measure time/counters between these start and end calls (they should only be called once, not repeatedly in a loop). By making these functions available as imports, rather than requiring the Wasm to export setup and teardown functions, it is easier to port existing programs (e.g. no need to change a program from having a main to being a dylib with multiple exports; you just add a couple of imports and call them at the right times).
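
As a rough sketch of the guest side under these assumptions -- a Rust program compiled to wasm32-wasi, with a placeholder input file and compute function -- the bench hooks would bracket only the compute phase:

// Import the proposed "bench" "start" / "bench" "end" host functions.
#[link(wasm_import_module = "bench")]
extern "C" {
    #[link_name = "start"]
    fn bench_start();
    #[link_name = "end"]
    fn bench_end();
}

fn main() {
    // Setup I/O happens outside the measured region.
    let input = std::fs::read("input.bin").expect("failed to read input");

    unsafe { bench_start() };
    let result = compute(&input); // only this compute phase is measured
    unsafe { bench_end() };

    println!("checksum: {}", result);
}

fn compute(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}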

Finally, it should be noted that we don't want to take every sample from the same process, because of the risk of introducing measurement bias, so we will need to split the runner so that it has a parent process that occasionally scraps and respawns the process that loads the dylib.

If I were to put this into pseudocode (and ignoring that we should randomize the order of benchmarks executed, as detailed in #4), then the runner would look like this:

struct Config {
    // Which commits we are measuring.
    commits_to_measure: Vec<Commit>,

    // The benchmark programs to run.
    benchmarks: Vec<Wasm>,

    // The number of processes to spread samples across.
    num_process_executions: usize,

    // The number of samples to take from each process. So total number of
    // samples for each benchmark program is
    // `num_process_executions * num_child_iterations`.
    num_child_iterations: usize,
}

fn parent_runner(config: Config) {
    for commit in &config.commits_to_measure {
        let dylib = build_dylib(commit);
        for wasm in &config.benchmarks {
            for _ in 0..config.num_process_executions {
                // Respawn a fresh child for each batch of samples to avoid
                // measurement bias from a single long-lived process.
                let child = spawn_child_runner(&dylib, wasm, config.num_child_iterations);
                child.wait();
            }
        }
    }
}

and for good measure, here is a diagrammy thing that details the interfaces between each component:

+---------------+
| Parent Runner |
+---------------+
        |
        |
        | (re)spawns child, tells it how many samples to take
        |
        |
        V
+--------------+
| Child Runner |
+--------------+
        |
        |
        | loads dylib, communicates via compile/instantiate/execute interface
        |
        |
        V
+--------------------------+
| Dylib of Wasmtime@Commit |
+--------------------------+
        |
        |
        | WASI and `"bench" "start"`/`"bench" "end"`
        |
        |
        V
+------------------------+
| Wasm Benchmark Program |
+------------------------+

abrown (Contributor) commented Nov 18, 2020

The only answer I can see to this though is to store all of the benchmarks in the wasmtime repository itself so we can update APIs and examples as the API changes

@alexcrichton, I think what @fitzgen and I discussed would mean that the benchmarks can live anywhere. My preference would be that all benchmarks we provide are buildable with a Dockerfile so that the artifacts are reproducible. I am currently working on a tool that does this, extracting a benchmark.wasm from the Docker image and verifying that it imports "bench" "start" and "bench" "end".

This interface will consist of three functions, one for each thing that we'll measure: compile, instantiate, and execute

@fitzgen I have more thoughts on this since we last talked: we also discussed using the Wasm C API as the way to control Wasmtime, and I am leaning toward using that because it eliminates a step. If we use compile/instantiate/execute, we have to rewrap Wasmtime as this dylib, but if we use the C API as the interface, we could either use the output of cargo build -p wasmtime-c-api or even download the dylib from a GitHub release--seems more flexible. If I'm locally working on an optimization, recompiling my tree to match the compile/instantiate/execute interface is one additional barrier to running the benchmarks (though hardly a large one).

I also have a selfish reason for using the C API: I can then reuse this tool internally to do performance analysis on other Wasm engines. As @fitzgen has pointed out previously, we probably don't want to trigger an engine-vs-engine benchmark war but, to me, that is solvable by not publicizing head-to-head results, not by restricting the tool. (I'm working on such a tool at the moment so I'm interested in all of your thoughts!).

alexcrichton (Member) commented:

@fitzgen and @abrown what y'all are saying makes sense, and what I'm thinking is that this dylib basically must live in the wasmtime repository for us to maintain it because the implementation of the dylib may change over time as we tweak the wasmtime crate's APIs and internals.

I think we need to be careful though because what we choose as an API between the benchmark runner and wasmtime itself dictates what we can and cannot benchmark. For example I don't think using the existing C API is good for benchmarking calling into wasm. In Rust this would idiomatically use the get family of functions, which are much more optimized than what the C API is doing, and we're unlikely to really try to speed up the C API. Similarly, defining host functions and stretching various pieces of the embedding API are likely going to be difficult through the current C API.

If we only want to benchmark the compilation and execution of primarily the wasm itself then it's probably not so bad though. Instantiation is a little iffy since we would ideally account for the whole cost of instantiation with the cost of creating Func objects too.


@cfallin I agree, yeah, that as long as we have 2 platforms out of the gate to benchmark on it should be easy enough to add more platforms in the future. Unfortunately AFAIK rust-lang/rust doesn't do any benchmarking on Windows, and I myself don't know how to do anything off the top of my head other than measuring wall-time. I don't think that running a server or implementing a daemon will be too hard though; as long as we stick to the Rust ecosystem, there shouldn't be any portability concerns for the stuff we're doing. The only thing I worry about is how to measure the same metrics across platforms. On Linux we can use perf and its counters, but while I'm sure there's an equivalent for Windows, AFAIK no one's done it in Rust yet.

fitzgen (Member) commented Nov 18, 2020

what I'm thinking is that this dylib basically must live in the wasmtime repository for us to maintain it because the implementation of the dylib may change over time as we tweak the wasmtime crate's APIs and internals.

This makes sense to me.

I think we need to be careful though because what we choose as an API between the benchmark runner and wasmtime itself dictates what we can and cannot benchmark. For example I don't think using the existing C API is good for benchmarking calling into wasm. In Rust this would idiomatically use the get family of functions, which are much more optimized than what the C API is doing, and we're unlikely to really try to speed up the C API. Similarly, defining host functions and stretching various pieces of the embedding API are likely going to be difficult through the current C API.

Agreed that designing the interface requires some care.

(Note: I am personally less interested in micro benchmarks. They have their place, but I don't think we need to continuously measure them over time as much as we need to do that for overall compilation/instantiation/execution which have more action at a distance.)

I do think, however, we can largely enable this sort of thing not at the child runner <--> dylib interface level but at the Wasm interface level. For example, if we want to micro-benchmark Wasm->host calls, we can provide a "bench" "host_func" function for the wasm program to import and call a bunch of times during its execution. If we wanted to micro-benchmark host->wasm calls, we could make sure that a Wasm program exports a "wasm_func" function and that the host provides a "bench" "call_wasm_a_bunch_of_times" function for importing. The Wasm's main execution would just call the host function, which would call the other exported wasm function a bunch of times. Both of these fit into the proposed dylib interface, and just require minimal changes at the Wasm interface level. This also avoids including the overhead of cross-dylib calls when measuring these micro operations.
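
For the Wasm->host case, a sketch along the same lines as the earlier guest example (the "host_func" import and iteration count are hypothetical):

#[link(wasm_import_module = "bench")]
extern "C" {
    #[link_name = "start"]
    fn bench_start();
    #[link_name = "end"]
    fn bench_end();
    #[link_name = "host_func"]
    fn host_func();
}

fn main() {
    unsafe {
        bench_start();
        // The measured region is dominated by Wasm->host call overhead.
        for _ in 0..1_000_000 {
            host_func();
        }
        bench_end();
    }
}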

@abrown rather than swapping engines at the C API level, you can always reimplement the dylib for different engines and, since the proposed dylib interface is so simple, it should be straightforward. I'm thinking less than a hundred lines of code really. You could also make an implementation of the dylib interface that uses the Wasm C API, and then you would only have to write the dylib implementation once, rather than for each engine, and could swap out engines at the C API level again.

bjorn3 (Contributor) commented Nov 18, 2020

When using wasmtime as a dylib, you should probably set LD_BIND_NOW, I think. Otherwise the first run would lazily resolve a lot of symbols, but the rest of the runs wouldn't.
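
For example, the parent runner could set this when spawning each child; a sketch using std::process::Command, where the child-runner binary name and flags are placeholders:

use std::process::{Command, ExitStatus};

fn spawn_child(dylib_path: &str, wasm_path: &str, iterations: usize) -> std::io::Result<ExitStatus> {
    Command::new("child-runner")
        .arg("--dylib").arg(dylib_path)
        .arg("--wasm").arg(wasm_path)
        .arg("--iterations").arg(iterations.to_string())
        // Eagerly bind all dynamic symbols so lazy resolution is not charged
        // to the first sample taken in the child.
        .env("LD_BIND_NOW", "1")
        .status()
}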

alexcrichton (Member) commented:

I think that's a good point about microbenchmarks not being too useful in a regression-testing framework! With that in mind I think you're right that the precise API we have is less important than I might have thought, and what you're saying so far makes sense and seems plenty flexible.

abrown (Contributor) commented Nov 18, 2020

@alexcrichton, yeah, I had only really considered whole Wasm programs and I agree with @fitzgen's comments above; I will add that if we do want to do micros for some reason, though, I have split the recording part of sightglass out separately so we should be able to measure custom blocks of Rust code accessing Wasmtime directly.

@fitzgen, I think that if Wasmtime provides an easy way to build the benchmarking dylib, e.g., cargo build -p wasmtime-benchmark-api, then there is less reason for using the C API. I guess I could maintain a separate codebase for wrappers for other engines (or, as you mention, a generic C API wrapper).

cfallin (Member Author) commented Jan 30, 2021

I've gone ahead and updated this RFC based on the (detailed and very useful!) discussion above; sorry for the delay and thanks everyone! I'm hoping to push this forward in the process some more, as we have benchmarking hosts now and @abrown, @jlb6740 and @fitzgen have been working on benchmarking.

cfallin (Member Author) commented Feb 25, 2021

Motion to finalize with a disposition to merge

Given that discussion here has quiesced and the benchmarking work is well underway (and the sibling RFC #4 that describes the benchmark suite infra has merged), I'm proposing that we merge this RFC.

I believe that all feedback has been merged into the updated RFC.

Stakeholders sign-off

Tagging all employees of BA-affiliated companies who have committed to the Wasmtime or Lucet repos in the last three months [1], plus anyone who has given feedback on this PR as a stakeholder.

  • Fastly
  • Intel
  • IBM
  • Mozilla


[1]: Cheatsheet for my own later reference: git log --since=$(date +%Y-%m-%d --date='3 months ago') | grep ^Author: | sort | uniq

cfallin (Member Author) commented Feb 25, 2021

Entering Final Call Period

https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close

Once any stakeholder from a different group has signed off, the RFC will move into a 10 calendar day final comment period (FCP), long enough to ensure that other stakeholders have at least a full business week to respond.

The FCP will end on Mon Mar 8.

cfallin (Member Author) commented Feb 25, 2021

(Sorry, copy/paste error with FCP end date; fixed to Mar 8.)

cfallin (Member Author) commented Mar 8, 2021

The Final Call Period has elapsed with no other comments or feedback, so I will now merge this RFC. Thanks all!

cfallin merged commit 9e0c587 into bytecodealliance:main on Mar 8, 2021