-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compact: Improved memory usage while downsampling #529
compact: Improved memory usage while downsampling #529
Conversation
9aa8061
to
e8c13e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, really good work! Thanks! Some comments, mostly naming.
Let's also add CHANGELOG entry.
It looks nice, the main worry is that we had this done by TSDB compact, so maintained outside of thanos. Clearly the tsdb implementation was not streamed, so we had to write our modified version. But now we have:
- tsdb compactor with random version (as we support multiple versions of Prometheus) compacting our non-compacted TSDB block
- tsdb compactor with version we pin writing blocks for compacted blocks by thanos compactor
- custom streamed TSBD block write for downsampling.
It can be quite worrying, right?
I think ideally we could proposed improved streamed block to TSDB in future and also use this for thanos compaction itself.
Wonder if we are not missing writer
tests then... plus it's not that tied to downsampling, so we could put it in "github.com/improbable-eng/thanos/pkg/block" maybe? (not sure - we currently use it for downsampling only)
Is it rdy for review? |
yep But I left stream_block_writer in downsample package. Don't see if this is valuable, as our main goal is to move it to tsdb |
I guess this work should be proceed for the compaction as well. I've got errors on the compaction stage:
$ kubectl exec -it thanos-compactor-0 thanos bucket ls -o json --objstore.config-file=/etc/thanos/bucket.yaml | jq -s 'def tt(t): t/1000 | todate; map(select(.minTime >= 1536213600000 and .maxTime <= 1536364800000 )) | map({ulid: .ulid, minTime: tt(.minTime), maxTime: tt(.maxTime), stats: .stats, thanos: {source: .thanos.source}})' [
{
"ulid": "01CPQ1R4M7CFCBJHE9904K00CR",
"minTime": "2018-09-06T06:00:00Z",
"maxTime": "2018-09-06T08:00:00Z",
"stats": {
"numSamples": 715119814,
"numSeries": 6173318,
"numChunks": 6192734
},
"thanos": {
"source": "sidecar"
}
},
{
"ulid": "01CQVTTB08YYA1SW1K03CKHMZE",
"minTime": "2018-09-06T08:00:00Z",
"maxTime": "2018-09-06T16:00:00Z",
"stats": {
"numSamples": 2862862948,
"numSeries": 6727848,
"numChunks": 24799606
},
"thanos": {
"source": "compactor"
}
},
{
"ulid": "01CQVVZ7PE4YKJB3WJ1HZER5DC",
"minTime": "2018-09-06T16:00:00Z",
"maxTime": "2018-09-07T00:00:00Z",
"stats": {
"numSamples": 2851257287,
"numSeries": 6781947,
"numChunks": 24807578
},
"thanos": {
"source": "compactor"
}
},
{
"ulid": "01CQVWAFTMZFQM3D147VK258Q0",
"minTime": "2018-09-07T00:00:00Z",
"maxTime": "2018-09-07T08:00:00Z",
"stats": {
"numSamples": 2847793580,
"numSeries": 6554082,
"numChunks": 24585629
},
"thanos": {
"source": "compactor"
}
},
{
"ulid": "01CQVXX40S8J32WQTVPNBAM4MN",
"minTime": "2018-09-07T08:00:00Z",
"maxTime": "2018-09-07T16:00:00Z",
"stats": {
"numSamples": 2852939529,
"numSeries": 6601240,
"numChunks": 24645491
},
"thanos": {
"source": "compactor"
}
},
{
"ulid": "01CQVYJWKTN17QY1853SGKA35G",
"minTime": "2018-09-07T16:00:00Z",
"maxTime": "2018-09-08T00:00:00Z",
"stats": {
"numSamples": 2854566446,
"numSeries": 6711599,
"numChunks": 24739338
},
"thanos": {
"source": "compactor"
}
}
] 5 blocks 4 from compactor and 1 from sidecar $ kubectl exec -it thanos-compactor-0 thanos bucket ls -o json --objstore.config-file=/etc/thanos/bucket.yaml | jq -s 'map(select(.minTime >= 1536213600000 and .maxTime <= 1536364800000 )) | (map(.stats.numChunks) | add)'
$ 129770376 as you can see, number of chunks is |
As it turned out, the only consumption is a postings, and is not as bad as I thought The wrong assumption I had, that the k8s pod memory usage is accounting the emptyDir volume as a memory resource (which is odd, as there wasn't an enabled option https://kubernetes.io/docs/concepts/storage/volumes/#emptydir |
@bwplotka @xjewer Hello guys i tried this code build on existing data, and looks like after downsampling i loose some metrics, dont know why because metadata shows the same amount of metrics in 5m downsample block, bot some metrics are not exist in 5m block, i can give you the block of data that i use to replicate this issue and query that i use. Original code is working well. I can give any advise you need in this, because it will be very helpful if huge block will not consume all memory on 256GB server |
would be nice to have a block and more details. |
block is almost 20gb size, where i can put it so you can download?
I will delete this bucket in one week |
thanks, I downloaded this block and will see, what the issue is |
Nice finding with |
3546c3d
to
8b1bafa
Compare
thanks @dmitriy-lukyanchikov, found and fixed bug. need to add more tests for downsampling and for stream writer in particular
One of the reason - k8s would clean that garbage after pod's died or exited |
We've been running this in production for over a month now, and it's hugely cut down our thanos-compact memory usage (from around 120GB to less than 20GB 🤯) |
actually, I'm still seeing leaks on downsampling state. It affects us with the ~ 0.5Tb data blocks (253e9 data samples), but that's probably a rare case for having such huge blocks, so we can merge it for now 🤔 |
yep, I'm gonna add that commit in here and handle @domgreen remarks |
8b1bafa
to
8f06dfc
Compare
} | ||
}() | ||
|
||
sw.chunkWriter, err = chunks.NewWriter(filepath.Join(blockDir, block.ChunksDirname)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, was thinking about this as well.
so if we pass it as arguments, it will make testing easier, however by doing it like this we make streamwriter responsible for all writers making out interface better and with deep functionality, so I think it's fine for now. 👍
22aa6f2
to
5e99ac3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really really good.
Some small suggestions only and LGTM! Thanks @xjewer really really good work on this 👍 Happy to finally get it in. ;p Sorry for delay in review.
Add instant writer implementation to shrink memory consumption during the downsampling stage. Encoded chunks are written to chunks blob files right away after series was handled. Flush method closes chunk writer and sync all symbols, series, labels, posting and meta data to files. It still works in one thread, hence operates only on one core. Estimated memory consumption is unlikely more than 1Gb, but depends on data set, labels size and series' density: chunk data size (512MB) + encoded buffers + index data Fixes thanos-io#297
Add comments and close resources properly.
Use proper posting index to fetch series data with label set and chunks
the downsampling process. One of the trade-offs is to preserve symbols from raw blocks, as we have to write them before preserving the series. Stream writer allows downsample a huge data blocks with no needs to keep all series in RAM, the only need it preserve label values and postings references.
fb50a78
to
55d755a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially just constructor and LGTM.
55d755a
to
caf4d96
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect. LGTM! @domgreen ?
Got offline 👍 from @domgreen |
Reduce of use public Flush method to finalize index and meta files. In case of error, a caller has to remove block directory with a preserved garbage inside. Rid of use tmp directories and renaming, syncing the final block on disk before upload.
caf4d96
to
12ffbe3
Compare
🎉 |
I'm using commit c5acb9c and still noticing OOM. The block is 6.5GB. Compactor would not start until the memory was set to 15GB. It fails at the downsampling step. "version": 1, Downsampled block: "version": 1, Is this expected? Over the period, block size may increase. Should the memory be increased ti support bigger blocks? |
Changes
Add instant writer implementation to shrink memory consumption during the downsampling stage.
Encoded chunks are written to chunks blob files right away after series was handled.
Flush method closes chunk writer and sync all symbols, series, labels, posting and meta data to files.
It still works in one thread, hence operates only on one core.
Estimated memory consumption is unlikely more than 1Gb, but depends on data set, labels size and series' density: chunk data size (512MB) + encoded buffers + index data
Fixes #297
Verification
Downsampling on the same origin block with previous version and the current one shows the same result.
Memory flame graph:
cc @bwplotka