
Discussion: Performance improvements and aggregation #401

Closed
bdice opened this issue Dec 17, 2020 · 3 comments
Labels: aggregation, refactor (Code refactoring)
Milestone: v0.12.0

bdice commented Dec 17, 2020

This issue exists to document a series of performance improvements I have worked on recently. We created the next branch to be a testing area for the new aggregation features, where we can focus on eliminating any bugs and improving performance before merging into master and releasing 0.12.0.

In PR #399, I have created a "benchmark" project with 3 operations and 2 labels. This project does not utilize groups or aggregation, so that we can use it as a fair performance comparison with older versions of signac.

Comparison of runtimes on the test project (2 labels, 3 operations with pre/postconditions):

| flow | signac | Runtime for 1000 jobs (seconds) |
| --- | --- | --- |
| 0.8.0 | 1.2.0 | 52.5 |
| 0.9.0 | 1.2.0 | 50.5 |
| 0.10.0 | 1.2.0 | 65.0 |
| 0.11.0 | 1.2.0 | 75.6 |
| 0.11.0 | 1.5.0 | 71.7 |
| 0.11.0 | master (7db7d57) | 42.1 |
| master (8a33a19) | master (7db7d57) | 43.1 |
| next (2328ba1) | master (7db7d57) | 64.2 |
| next (2328ba1) | feature/optimize-dfs-convert (43605a9) | 61.4 |

We're getting the performance back down to a tolerable level (within the bounds of past releases). It was something like 100-110 seconds before I made the recent round of PRs to signac and signac-flow. Measuring the scaling for larger workspaces will also be important before we make any final conclusions.

It looks like the performance improvements in signac have been extremely helpful, and using signac master with the last flow release 0.11.0 or the master branch of flow gives the overall best performance thus far (and it would be even better with the _dfs_convert optimization in this PR). The overall trend is that newer signac versions are faster, while newer flow versions are slower (as we would expect, since signac has received more optimizations over time and flow has received more complex features over time).

The primary difference between flow's master and next branches is the introduction of aggregation. Breaking down the profile results for those two, the difference in performance appears to be solely affecting status check performance (print_status). Running and submitting jobs appear to have nearly identical performance to past versions, and with similar profiles before/after aggregation.

I studied _fetch_status for a while to identify the cause of the slowdown with aggregation. I believe the difference is due to the swapped order of iteration. With the new aggregation logic, the status fetching essentially does "for every group, get the status of each aggregate" by calling _get_group_status which returns a dictionary structured like {"operation_name": ..., "job_status_details": {aggregate_id: JobStatus for aggregate_id in operation_aggregate_store}} and then it combines the results from every group. The output is a list of dictionaries structured like [{"job_id": job_id, "operations": {operation_name: JobStatus}, "labels": [...]}, ...] (I'm leaving out some details like error handling).
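As a rough sketch, the stitching step described above can be modeled like this (all names below are hypothetical stand-ins for signac-flow internals, not its actual API):

```python
def combine_group_statuses(group_results, labels_by_job):
    """Merge per-group status dicts (shaped like the _get_group_status
    output described above) into a per-job list of status dicts."""
    by_job = {}
    for result in group_results:
        op = result["operation_name"]
        for job_id, status in result["job_status_details"].items():
            # Create the per-job entry on first sight, attaching its labels.
            entry = by_job.setdefault(
                job_id,
                {
                    "job_id": job_id,
                    "operations": {},
                    "labels": labels_by_job.get(job_id, []),
                },
            )
            entry["operations"][op] = status
    return list(by_job.values())
```

The cost is not the merge itself but that each group re-iterates its aggregate store to produce `job_status_details` in the first place.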

The previous code could loop over the whole project one time, fetching the status of every operation for each job. Iterating over the project one time is faster than iterating over the aggregate store for each group and then stitching the results from each group together. My profiling tool doesn't quite show the right info here, but this is my best guess. If my assessment is accurate, here are some potential solutions:

  • Change the iteration order: We could implement a more optimal status check that would iterate over each aggregate store one time, fetching the status for all groups that use that aggregate store. Labels would be gathered at the same time as the default aggregate store, saving another pass through the project. For projects that don't use aggregation, this would mean that there is only one loop through the project to get all status info, and it would resemble the pre-aggregation code.
  • Reduce the impact of looping over jobs: We enable lazy statepoint loading in signac, possibly alongside the new buffering/synced collections. With lazy statepoint loading, Project.open_job would become much cheaper and the order of iteration (groups, aggregate store) or (aggregate store, groups) wouldn't matter nearly as much. It would also provide performance benefits across the board.
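To make the first option concrete, here is a toy comparison of the two iteration orders. Groups are modeled as `(name, store_key)` tuples and aggregate stores as plain lists; none of these names come from signac-flow:

```python
from collections import defaultdict

def status_group_major(groups, stores):
    """Current order: re-iterate the aggregate store once per group."""
    statuses = defaultdict(dict)
    for name, store_key in groups:
        for aggregate_id in stores[store_key]:  # one full pass per group
            statuses[aggregate_id][name] = "checked"
    return statuses

def status_store_major(groups, stores):
    """Proposed order: one pass per store, covering every group using it."""
    statuses = defaultdict(dict)
    groups_by_store = defaultdict(list)
    for name, store_key in groups:
        groups_by_store[store_key].append(name)
    for store_key, names in groups_by_store.items():
        for aggregate_id in stores[store_key]:  # single pass per store
            for name in names:
                statuses[aggregate_id][name] = "checked"
    return statuses
```

Both produce identical results, but the store-major order touches each aggregate store only once, which matters when visiting an aggregate (e.g. opening a job) is the expensive step.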

From discussions in dev meetings and on Slack, there seems to be a consensus that we should perform a total refactor of status-related code, which will be helped by some of the pickle-related changes #369 and #383 (the code can be parallelized more easily if we utilize cloudpickle).

@bdice bdice added refactor Code refactoring aggregation labels Dec 17, 2020
@bdice bdice added this to the v0.12.0 milestone Dec 17, 2020

bdice commented Dec 31, 2020

I did some additional benchmarking. I skipped doing the benchmarks for older package versions because I previously had to add some shims to account for backwards-incompatible changes in flow and didn't want to do that work again. There are still significant internal refactors that need to be done in the _fetch_status code to improve performance, described above.

| flow | signac | Time (seconds) |
| --- | --- | --- |
| v0.11.0 | v1.2.0 | 81.7 |
| v0.11.0 | v1.5.0 | 77.0 |
| master (6d04207) | v1.5.0 | 77.4 |
| master (6d04207) | previous version of master (7db7d57) | 46.6 |
| master (6d04207) | master (fd14f48) | 57.6 |
| master (6d04207) | feature/lazy-job-init (ee6a0aa) | 46.6 |
| next (0713893) | v1.5.0 | 107 |
| next (0713893) | previous version of master (7db7d57) | 71.1 |
| next (0713893) | master (fd14f48) | 89.7 |
| next (0713893) | feature/lazy-job-init (ee6a0aa) | 69.6 |


bdice commented Jan 6, 2021

My initial assumption that print_status should be able to run faster (i.e. do less work) than run or submit is not true. It is reasonable for print_status to be slower than run and submit because it has to evaluate more conditions and also job labels -- explained below.

When we print the status, we have to evaluate all pre/postconditions to determine both "eligible" (all preconditions met and at least one postcondition unmet) and "complete" (all postconditions met). With run/submit, we can exit early as soon as a precondition is unmet, but status checks must do more work. I found and implemented a partial optimization that will help in certain cases: if completed=True, the eligibility check is already skipped. However, if completed=False, the eligibility check can ignore postconditions (it has already been established that completed=False, so checking postconditions is unnecessary). This is implemented in e0abac5.
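A minimal sketch of this short-circuit logic, assuming conditions are simple callables (this is an illustration, not the actual implementation in e0abac5):

```python
def evaluate_status(preconditions, postconditions, job):
    """Determine completed/eligible for one operation on one job."""
    completed = all(post(job) for post in postconditions)
    if completed:
        # Optimization 1: a completed operation is never eligible,
        # so the eligibility check is skipped entirely.
        eligible = False
    else:
        # Optimization 2: some postcondition is already known to be
        # unmet, so eligibility only needs to test the preconditions.
        eligible = all(pre(job) for pre in preconditions)
    return {"completed": completed, "eligible": eligible}
```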

Another place where print_status must do more work than run or submit is labels: labels aren't necessary for run or submit. For the test project I describe below, computing labels takes up about 1/3 of the runtime (which is reasonable because they're approximately as complex as condition functions, and there are around 6 condition functions to evaluate).

I am trying a new approach to the status check. The previous logical flow of _fetch_status was:

  1. Fetch all scheduler jobs; iterate over all aggregates/groups and update the status document if matches are found in the scheduler jobs.
  2. Iterate over all aggregates/groups AGAIN and evaluate conditions to determine eligible and complete.

In the new callback approach (#417), I combine these two iterations into one: a callback is provided to _fetch_scheduler_status and it will handle evaluating conditions in the same loop, immediately after determining the scheduler status. (For now, I've dropped the thread/process-parallel portion of the code, which would not play well with this callback approach in its current form. That can be improved later.)
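The callback pattern can be sketched roughly as follows; the function names here are illustrative and not the actual #417 code:

```python
def fetch_scheduler_status(aggregates, scheduler_info, callback):
    """Single pass: look up scheduler status and evaluate conditions
    for each aggregate in the same loop, via the callback."""
    results = []
    for aggregate_id in aggregates:
        scheduler_status = scheduler_info.get(aggregate_id, "unknown")
        # Conditions are evaluated immediately, in the same iteration,
        # instead of in a second pass over all aggregates/groups.
        results.append(callback(aggregate_id, scheduler_status))
    return results

def evaluate_conditions(aggregate_id, scheduler_status):
    # Stand-in for the real condition/label evaluation.
    return {"id": aggregate_id, "status": scheduler_status}
```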

Current benchmark: the print_status time for my test project (1000 jobs, 3 operations, 6 conditions, 2 labels) is down from ~50 seconds to ~26 seconds with signac master. With the feature/lazy-job-init branch of signac, it gets down to ~17 seconds. Note that this is only the print_status time; it doesn't include run/submit like the other benchmarks shown above.

Further areas for (performance) improvement:

  • Need to benchmark some specific use cases for the aggregate/group iteration logic and ensure that we have a fast path for common combinations like _AggregatesCursor objects (with filters as mentioned by @atravitz), selecting only single-operation groups, etc. We should also add tests for project aggregation, testing the same set of selections with the private method and ensuring that the right aggregates/groups are selected.
  • We could cache the results of pre/post conditions for a given aggregate during status-fetching / running / submission. We would need to assume that they have no side-effects, or only enable caching for the included condition functions that we know lack side-effects. This has been suggested in the past (see Proposal: Automatic caching of condition functions #125). This might provide a 25% speedup in my sample workflow, depending on the number/cost of functions that are cached.
  • Optimize job labels: only evaluate labels for the relevant set of jobs (see design questions posed in Slack).
  • Requiring a newer signac version for signac-flow (1.3.0 or higher) would allow us to eliminate overhead from the deprecation package. It is costing us around 5% of runtime just to produce deprecation messages for functions like job.get_id(), which is mildly annoying. I'm not sure how we plan to move forward with signac version support for signac-flow - but let's table that discussion and not dig into it in this issue.
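The condition-caching idea in the second bullet above could look something like this memoization wrapper (purely hypothetical, and it assumes conditions are pure functions of the aggregate with no side effects):

```python
class CachedCondition:
    """Wrap a condition function and memoize its result per aggregate id,
    so repeated eligibility/completion checks don't re-run the condition."""

    def __init__(self, func):
        self.func = func
        self.calls = 0  # counts actual evaluations, for illustration
        self._cache = {}

    def __call__(self, aggregate_id):
        if aggregate_id not in self._cache:
            self.calls += 1
            self._cache[aggregate_id] = self.func(aggregate_id)
        return self._cache[aggregate_id]
```

The cache would be scoped to a single status/run/submit pass, since condition results can change as operations complete.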

@bdice bdice mentioned this issue Jan 12, 2021

bdice commented Jan 20, 2021

This has been resolved by #417.

@bdice bdice closed this as completed Jan 20, 2021