Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compact: Offline deduplication #1014

Closed
smalldirector opened this issue Apr 7, 2019 · 43 comments · Fixed by #4239
Closed

Compact: Offline deduplication #1014

smalldirector opened this issue Apr 7, 2019 · 43 comments · Fixed by #4239

Comments

@smalldirector
Copy link

Currently Thanos provides dedup function in Query component, it can merge the data on the fly from Prometheus HA pairs or remote object storage.
However, the Query dedup function is not query efficiently, because the replica's blocks in tsdb/object storage are not really merged so that the Query API has to load duplicated data in each request.

With current Prometheus TSDB design, it seems difficult to implement dedup function to merge blocks in each TSDB node on Thanos side.
However, it should be easily to implement it for the object storage as it is one centralized storage, and also we already have one Compactor component running on it.

Offering dedup function on object storage side will definitely help fast metrics query latency and reduce the object storage cost.

Do we have any plan to support such requirements in Thanos? If had, I would like to know your ideas about this feature.

Thanks.

@bwplotka bwplotka changed the title Compact: Built-in dedup function for Object Storage Compact: Offline deduplication Apr 7, 2019
@bwplotka
Copy link
Member

bwplotka commented Apr 7, 2019

Hi 👋 thanks for raising this. I rename title to make it more clear, let me know if this makes sense.

There was always idea like this to allow compactor to deduplicate blocks offline and reduce the storage and bit the query performance. I think I am happy to do so however, we need ourselves question, is the curret deduplication algorithm a correct one for everybody?

We have seen some reports claiming that our "penalty-based" deduplication algorithm is not working for all edge cases e.g: #981 It's fair as our alghorithm is very basic. The problem with this feature and not perfect algorithm is that we will deduplicate data irreversibily (unless we back up those blocks somehow).

Anyway, I think we are happy to add such feature to compactor at some point, given we can solve the gaps, for current algorithm (and add more tests/ test cases maybe).

@smalldirector
Copy link
Author

Thanks for the reply @bwplotka. I'm glad to hear that you also want to support this feature in the future.
I'm working in eBay's monitoring infrastructure team, and we are leveraging Thanos to build eBay's high availability monitoring platform. Thanks for providing such great framework.

For the offline dedup function, we have already started to build it inside the Thanos compactor component. If needed, we are more than happy to contribute it back to Thanos community. Please feel free to let me know your thoughts as well.

@bwplotka
Copy link
Member

bwplotka commented Apr 7, 2019

Of course! PRs would be welcome, especially if you have something proven from your prod (:

@bwplotka
Copy link
Member

bwplotka commented Apr 7, 2019

Are you on our slack BTW? (:

@smalldirector
Copy link
Author

smalldirector commented Apr 7, 2019

Which slack channel are your pointing here? I am only be able to see the announcements channel. I'm not able to join the thanos-dev channel which mentioned here: https://github.com/improbable-eng/thanos/blob/master/CONTRIBUTING.md. BTW, my display name is smalldirector in announcements channel.

@bwplotka
Copy link
Member

bwplotka commented May 7, 2019

I can see you in all #thanos #thanos-dev channels (:

@stale
Copy link

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@Reamer
Copy link

Reamer commented Jan 13, 2020

Remove "stale", that's an interesting feature

@stale
Copy link

stale bot commented Feb 27, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 27, 2020
@bwplotka
Copy link
Member

bwplotka commented Feb 27, 2020 via email

@stale
Copy link

stale bot commented Mar 28, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 28, 2020
@ThoreKr
Copy link

ThoreKr commented Mar 28, 2020

Can someone add the label for stalebot to ignore this issue?

@bwplotka
Copy link
Member

bwplotka commented Jul 20, 2020

👋

We are adding new improvements to runtime deduplication so we can safely use it: #2890 This time, improving penalty algorithm. There might be one tricky part about counter reset. In Prometheus there is not metric type information yet, so we are discussing online how we can add that thanks to recent metadata changes to block (cc @pracucci @brian-brazil). If not we will need to guess it from data which needs more testing.

In the mean time we could potentially allow deduplication with backup, so you still back up blocks in some remote location (another cold storage bucket), so we can revert things if needed. Without back up I would not be confident to allow offline dedup for our Thanos users right now, before we handle those missing bits (:

@brian-brazil
Copy link
Contributor

There's some initial discussions about type metadata for remote write via the WAL, getting them into blocks is a completely different kettle of fish.

@bwplotka
Copy link
Member

Also Chunk iterator work is almost done which will be needed for dedup during compaction.

Yea agree, super unlikely for now.

@stale
Copy link

stale bot commented Aug 19, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 19, 2020
@Reamer
Copy link

Reamer commented Aug 19, 2020

Still important

@stale stale bot removed the stale label Aug 19, 2020
@eightnoteight
Copy link

Hi @bwplotka , interested in this issue, is there anyway I can contribute on this ?

@Antiarchitect
Copy link
Contributor

Also interested very much. Please update the status.

@stale
Copy link

stale bot commented Nov 27, 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 27, 2020
@bwplotka bwplotka removed the stale label Nov 30, 2020
@Reamer
Copy link

Reamer commented Nov 30, 2020

Still interested.

@bwplotka
Copy link
Member

👋🏽

Recently got many questions on DM, please ask them here around offline dedup (: Some update:

We added more docs hopefully cleaning WHY and HOW: https://thanos.io/tip/components/compact.md/#vertical-compactions

Answering one DM:

I saw on the compactor docs that the compactor supports Vertical Compactions via a hidden flag. Our use-case for wanting to implement that is covered under the use-cases listed right within that doc, specifically as the realistic duplication under the Offline deduplication of series bullet point. I’m a bit confused though. Is this something we can use now using the flags:
--compact.enable-vertical-compaction
--deduplication.replica-label="prometheus_replica"
(The prometheus_replica external label is what we use to distinguish between our each Prometheus within a Prometheus HA pair)
It looks like according to the docs that should be okay, however, I also found this issue you’ve been pretty active on:

It's NOT ok, unfortunately. Vertical compaction is implemented for one-to-one duplications, which comes from https://thanos.io/tip/components/receive.md/ deduplication OR when you have for some reason duplicated block with exactly same data.

If you use this against Prometheus replicas it will most likely totally mess your querying experience as it just concatenates samples together. So scrape interval in best case is 2x of original, in the worst case it's totally unstable.

The missing part is adding the deduplication algorithm we have online on the query to the compaction stage so we can leverage that. This algorithm also works on 99% of cases which is fine for the query part, when you can just switch deduplication off. If this 1% happens on offline dedup, then you cannot revert this, that's a problem.

We are exploring different deduplication algorithms that will make this much more reliable. We also need something ideally for Query pushdown: https://docs.google.com/document/d/1ajMPwVJYnedvQ1uJNW2GDBY6Q91rKAw3kgA5fykIRrg/edit# so ... help wanted (:

We can try to enable 99% realistic dedup if we want, help wanted for this 🤗

@heyitsmdr
Copy link

Thank you @bwplotka for answering that question and confirming my suspicions. I'm going to look at the doc and related work and see if this is something I can help out with :) We would greatly benefit from a feature like this to dedupe the duplicated data coming from each Prometheus in the HA pair.

@stale
Copy link

stale bot commented Feb 14, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Feb 14, 2021
@Reamer
Copy link

Reamer commented Feb 15, 2021

This issue is still needed.
@bwplotka it would be nice if Thanos would implement the 99% solution.

@stale stale bot removed the stale label Feb 15, 2021
@stale
Copy link

stale bot commented Apr 18, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 18, 2021
@dpwolfe
Copy link

dpwolfe commented May 1, 2021

Please keep this open.

@stale stale bot removed the stale label May 1, 2021
@2nick
Copy link
Contributor

2nick commented May 11, 2021

I took a look at the querier's deduplication and compaction process and have some questions. :)

For example, we have "main" and "replica" instances of Prometheus.

They could start asynchronously and as a result, it's possible to have next blocks in s3:
main: block1[1am - 3am], block2[3am - 5am]
replica: block3[2am - 5am]

And of course, such overlapping blocks could be much more.

So the first question is - how should work planner and compactor?

The only option I see is that planner makes 2 groups like:

  1. block1+block3
  2. block2+block3

To take from block3 only parts which complement "main" blocks.

But such behavior sound to be a bit sophisticated - maybe there are better options/thoughts about this thing? :)

@yeya24
Copy link
Contributor

yeya24 commented May 16, 2021

@2nick I am looking at the same thing. It is a little bit sophisticated and I think the grouping approach in #1276 is similar to what you mentioned.

IMO the grouping part of offline deduplication can reuse the existing tsdb planner. We don't need to group into 2 groups in this case. We can do it in 2 iterations:

  1. block1 + block3 -> new block [1am - 5am]
  2. new block + block2 -> result [1am - 5am]

WDYT? But I agree this is still a not very good approach. It would cost a lot if the overlap is small, but we still need to iterate the whole block to compact.

@2nick
Copy link
Contributor

2nick commented May 17, 2021

After some thinking I've decided that it's "good enough" approach as it allows to move forward. :)

Really great that you are in it! Looking forward to offline dedup! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.