-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logging lots of out-of-order data using the same timepoint on a timeline leads to extreme slowdown in datastore #4415
Comments
The first thing that immediately jumps out to me is: why are the sorting routines even running at all on the ingestion path?! There is only one very specific case where we ever try to sort during ingestion: the bucket is full and has to be split. rerun/crates/re_arrow_store/src/store_write.rs Lines 546 to 577 in 52ce18d
But then why is the bucket This code looks correct:
So the only explanation I can think of is that the user is somehow repeatedly ingesting data at a single timestamp and doing so with out-of-order In any case I can probably make a patch to move that sort after the early exit: the bucket's time-range bounds are up-to-date even when the bucket is unsorted. |
+1 -- even the whole rerun versioning string ideally |
#4417 only pushed the problem to later: The user was using EDIT: I have been shared a reproduce on a private repository. It logs everything on the main thread, yet the EDIT2: The cause was this: #4658 |
We do log a warning:
I think the only solution is to actually split these buckets so they don't grow. |
### What Rerun is designed to be able to handle out-of-order ingestion, but with a performance hit. Usually that performance hit is very small, but when logging to the exact same timestamp many times, we hit a corner-case of the data store where we get a huge bucket (see #4415). If the data arrives out-of-order, this means an expensive resort of the huge bucket on each ingestion. Usually such out-of-order log rows should only happen in multi-threaded applications, but due to a bug it happened almost always. This PR fixes this bug, and adds regression test for it. This PR thus alleviates the problem in #4415, but does not fix it for actual out-of-order multi-threaded applications. ~I introduced the bug in #4502 after the 0.11 release.~ **EDIT:** no, it was pre-existing! <img src="https://github.com/rerun-io/rerun/assets/1148717/fde1f973-deec-46be-b6cc-b671968bf633" width="250"> ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested the web demo (if applicable): * Using newly built examples: [app.rerun.io](https://app.rerun.io/pr/4658/index.html) * Using examples from latest `main` build: [app.rerun.io](https://app.rerun.io/pr/4658/index.html?manifest_url=https://app.rerun.io/version/main/examples_manifest.json) * Using full set of examples from `nightly` build: [app.rerun.io](https://app.rerun.io/pr/4658/index.html?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4658) - [Docs preview](https://rerun.io/preview/b58459b740b693d86be9e9b3e846d37bf565f060/docs) <!--DOCS-PREVIEW--> - [Examples preview](https://rerun.io/preview/b58459b740b693d86be9e9b3e846d37bf565f060/examples) <!--EXAMPLES-PREVIEW--> - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)
### What In order to investigate #4415 I wanted to reproduce it. This PR adds a benchmark that inserts many rows using the same timepoint, either in-order or shuffled. I also fixed a "bug" in the existing insertion benchmarks, which accidentally were _also_ measuring the time to call `DataTable::to_rows` (and collect the result), which was about 10% of the measured time. --- 5950X pinned to core #7 / Arch: ``` datastore/num_rows=1000/num_instances=1/shuffled=false/insert_same_time_point/insert 1.00 1641.2±28.89µs 595.0 KElem/sec datastore/num_rows=1000/num_instances=1/shuffled=true/insert_same_time_point/insert 1.00 3.0±0.04ms 328.4 KElem/sec datastore/num_rows=10000/num_instances=1/shuffled=false/insert_same_time_point/insert 1.00 27.1±1.59ms 361.0 KElem/sec datastore/num_rows=10000/num_instances=1/shuffled=true/insert_same_time_point/insert 1.00 162.5±1.23ms 60.1 KElem/sec ``` ### Checklist * [x] I have read and agree to [Contributor Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and the [Code of Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md) * [x] I've included a screenshot or gif (if applicable) * [x] I have tested the web demo (if applicable): * Using newly built examples: [app.rerun.io](https://app.rerun.io/pr/4676/index.html) * Using examples from latest `main` build: [app.rerun.io](https://app.rerun.io/pr/4676/index.html?manifest_url=https://app.rerun.io/version/main/examples_manifest.json) * Using full set of examples from `nightly` build: [app.rerun.io](https://app.rerun.io/pr/4676/index.html?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json) * [x] The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG - [PR Build Summary](https://build.rerun.io/pr/4676) - [Docs preview](https://rerun.io/preview/75ba0c0685dbea827b473270b1b350e348e72974/docs) <!--DOCS-PREVIEW--> - [Examples preview](https://rerun.io/preview/75ba0c0685dbea827b473270b1b350e348e72974/examples) <!--EXAMPLES-PREVIEW--> - [Recent benchmark results](https://build.rerun.io/graphs/crates.html) - [Wasm size tracking](https://build.rerun.io/graphs/sizes.html) --------- Co-authored-by: Clement Rey <[email protected]>
Summary: When this issue was open, we hit this extreme slow-down every time we logged a lot of data using the same time point. Since then two PRs have landed:
The first reduces the sorting to only be done at query time. The second PR ensures we only hit the corner case if data is logged out-of-order. In other words, the risk for this bug to produce extreme slowdowns is a lot smaller now. |
For future adventurers: The problem with splitting the buckets further even when they share the same timepoint, is the definition of
Since |
A user shared a puffin profile of very slow data ingestion
slow_ingestion.puffin.zip
What I see is this: the data ingestion (which is currently happening in the main rendering loop, see #4298) gets a message containing ~3800 rows of data. Processing that takes ~2 seconds, because each insertion takes a whopping ~9ms. The bulk of the insertion cost is re-sorting, which makes me suspect that the reason for this new slowdown is out-of-order insertions.
There are several layers of bad here:
Workarounds
Run the logging with
RERUN_FLUSH_NUM_ROWS=1
: this will create smaller chunks of work for the viewer, which should get back the viewer frame-rate, but still with slow ingestionStop logging data out-of-order. This could happen if you are using custom timelines that are not monotonically increasing. Rerun should support out-of-order logging, but it seems like we're hitting a bad corner case here.
Other notes
It would be great if we could tell from a puffin profile whether or not the code that was running was a debug or release build.
If this was from a debug build, then this whole issue is a mistake.
The text was updated successfully, but these errors were encountered: