Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shipper: Be strict about upload order unless it's specified so & cut v0.13.0-rc.2 #2765

Merged
merged 2 commits into from
Jun 15, 2020

Conversation

bwplotka
Copy link
Member

This is actually a quite real case for potential overlaps in Thanos system, so fixing before 0.13.

Fixes #2753

Thanks @gburek-fastly for all pointers, it helped us to narrow this down 💪

Signed-off-by: Bartlomiej Plotka [email protected]

Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just wondering though .. what's the use case of not uploading oldest to newest? do we really need a flag?

@bwplotka
Copy link
Member Author

I mentioned this in flag help. Why I created an option to still enable it:

  • In case if you don't have the persistent disk you kind of wants to upload as soon as possible. Also this allows you to have block with malformed metadata and it will be just ignored if you set this flag to true (well ignored + error log, metric). Also with upload-compacted you might want best effort upload as soon as possible.
  • For all those releases we were allowing out of order uploads, so someone might have rely on it (?)

Those arguments are not strong I agree, I am happy to reconsider this... Maybe moving this flag to Hidden for now?

Also technically we should do rc.3 with this not full release. Thoughts?

Copy link
Member

@squat squat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good! Just some language nits

CHANGELOG.md Outdated
@@ -26,6 +26,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- [#2416](https://github.com/thanos-io/thanos/pull/2416) Bucket: Fixed issue #2416 bug in `inspect --sort-by` doesn't work correctly in all cases.
- [#2719](https://github.com/thanos-io/thanos/pull/2719) Query: `irate` and `resets` use now counter downsampling aggregations.
- [#2705](https://github.com/thanos-io/thanos/pull/2705) minio-go: Added support for `af-south-1` and `eu-south-1` regions.
- [#2753](https://github.com/thanos-io/thanos/issues/2753) Sidecar,Receive,Rule: Fixed cause for compactor overlapping blocks in upload error cases.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit on comment formatting: we should have spaces between component names

@@ -89,6 +89,12 @@ func registerReceive(m map[string]setupFunc, app *kingpin.Application) {

walCompression := cmd.Flag("tsdb.wal-compression", "Compress the tsdb WAL.").Default("true").Bool()

allowOutOfOrderUpload := cmd.Flag("shipper.allow-out-of-order-uploads",
"If true shipper will skip failed block uploads in given iteration and retry later. This means that some newer blocks might uploaded sooner than older."+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep these as full sentences to make it more readable.

Suggested change
"If true shipper will skip failed block uploads in given iteration and retry later. This means that some newer blocks might uploaded sooner than older."+
"If true, shipper will skip failed block uploads in the given iteration and retry later. This means that some newer blocks might be uploaded sooner than older blocks."+

@@ -89,6 +89,12 @@ func registerReceive(m map[string]setupFunc, app *kingpin.Application) {

walCompression := cmd.Flag("tsdb.wal-compression", "Compress the tsdb WAL.").Default("true").Bool()

allowOutOfOrderUpload := cmd.Flag("shipper.allow-out-of-order-uploads",
"If true shipper will skip failed block uploads in given iteration and retry later. This means that some newer blocks might uploaded sooner than older."+
"This will trigger compaction without those blocks, as a resulted create 'valid overlap situation'. Set it to true if you have vertical compaction enabled and wish to upload blocks as soon as possible without caring"+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"This will trigger compaction without those blocks, as a resulted create 'valid overlap situation'. Set it to true if you have vertical compaction enabled and wish to upload blocks as soon as possible without caring"+
"This will trigger compaction without those blocks and as a result will create a 'valid overlap situation'. Set it to true if you have vertical compaction enabled and wish to upload blocks as soon as possible without caring"+

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for clarification, what is a 'valid overlap situation'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to just an overlap situation

@@ -29,13 +28,15 @@ Checking producers log for such ULID, and checking meta.json (e.g if sample stat

### Reasons

- You are running Thanos (sidecar, ruler or receive) older than 0.13.0. During transient upload errors there was possibility to have overlaps caused by compactor not being aware of all blocks See: [this](https://github.com/thanos-io/thanos/issues/2753)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- You are running Thanos (sidecar, ruler or receive) older than 0.13.0. During transient upload errors there was possibility to have overlaps caused by compactor not being aware of all blocks See: [this](https://github.com/thanos-io/thanos/issues/2753)
- You are running Thanos (sidecar, ruler or receive) older than 0.13.0. During transient upload errors there is a possibility to have overlaps caused by the compactor not being aware of all blocks See: [this](https://github.com/thanos-io/thanos/issues/2753)

}
}

if err := s.upload(ctx, m); err != nil {
level.Error(s.logger).Log("msg", "shipping failed", "block", m.ULID, "err", err)
if !s.allowOutOfOrderUploads {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

var metas []*metadata.Meta
// blockMetasFromOldest returns the block meta of each block found in dir
// sorted by minTime asc.
func (s *Shipper) blockMetasFromOldest() (metas []*metadata.Meta, _ error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I think this is a good simplification

@brancz
Copy link
Member

brancz commented Jun 15, 2020

Also technically we should do rc.3 with this not full release. Thoughts?

Yes I think rc.3 is the right thing to do here.

I think a hidden flag for now sounds good.

@bwplotka bwplotka changed the title shipper: Be strict about upload order unless it's specified so. shipper: Be strict about upload order unless it's specified so & cut v0.13.0-rc.2 Jun 15, 2020
@bwplotka bwplotka merged commit 3bf1397 into release-0.13 Jun 15, 2020
@bwplotka bwplotka deleted the fix2753 branch June 15, 2020 14:53
paulfantom added a commit to paulfantom/thanos that referenced this pull request Jul 8, 2020
openshift/master

* upstream/release-0.13:
  Cut release v0.13.0
  shipper: Be strict about upload order unless it's specified so & cut v0.13.0-rc.2 (thanos-io#2765)
  Cut 0.13.0 release. (thanos-io#2762)
  Cut release 0.13.0-rc.1 (thanos-io#2720)
  Store: `irate` and `resets` use now counter downsampling aggregations. (thanos-io#2719)
  deps: Updated minio-go dependency to v6.0.56 to add two region endpoints (thanos-io#2705) (thanos-io#2718)
  store/proxy: Deduplicate chunks on StoreAPI level. Recommend chunk sorting for StoreAPI + Optimized iter chunk dedup. (thanos-io#2710) (thanos-io#2711)
  Allow using multiple memcached clients at the same time. (thanos-io#2648) (thanos-io#2698)
  Updated Prometheus as little as possible to include Isolation fix. (thanos-io#2697)
  Release fix attempt2.
  Fixed test job. (thanos-io#2650)
  Fixed promu build to build in compatible directory that crossbuild understands.
  Cut v0.13.0-rc.0 (thanos-io#2628)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants