Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor replicate execution method to not iterate the origin bucket #2906

Merged
merged 3 commits into from
Oct 30, 2020

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Jul 17, 2020

Signed-off-by: Ben Ye [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

In the previous implementation, we are iterating the origin bucket, and each time, we use metaFetcher to fetch all meta files. But this seems not necessary, we can just fetch meta files once at the beginning.

Verification

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments but overall LGTM!

pkg/replicate/scheme.go Show resolved Hide resolved
metas, partials, err := rs.fetcher.Fetch(ctx)
if err != nil && metas == nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should log the error if metas != nil && err != nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good to log the error here, but kind of duplicate.

I just check the code, only here returns metas and error at the same time. In this case, partials != nil, so we will log the partial meta errors later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm the if err != nil && metas == nil { feels like code smell to me though. It just looks wrong from outside which can lead to bugs in future:

  • Inside implementation can change, and metas can be produced without partial for some reason - then we will miss an error.
  • Even if we say it makes sense, maybe another person will come here and refactor it anyway as it looks like error can be missed.

On other hand, shouldn't replicate FAIL on any error spotted? Even if we failed to read one block? Sometimes we heavily rely on replication status so this has to be heavily tested and controlled, ideally with failed metric being somewhere (:

@yeya24
Copy link
Contributor Author

yeya24 commented Jul 19, 2020

CI failure is unrelated. Please take a look at this pr again

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from eb9fe91 to 2d6d058 Compare July 29, 2020 15:59
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT the CI failure seems related to the changes

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from af3a2a8 to 7eafd1c Compare July 29, 2020 20:08
@yeya24
Copy link
Contributor Author

yeya24 commented Jul 29, 2020

AFAICT the CI failure seems related to the changes

Thanks, the previous problem was caused by the make examples-in-container target, and that should be fixed it in the latest commit 😄 .

pkg/replicate/scheme.go Outdated Show resolved Hide resolved
pkg/replicate/scheme.go Show resolved Hide resolved
@yeya24 yeya24 force-pushed the rework-replicate-exec branch from 7eafd1c to e5f0737 Compare August 8, 2020 18:59
Copy link
Contributor

@OGKevin OGKevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yeya24
Copy link
Contributor Author

yeya24 commented Sep 2, 2020

Can I get a review for this? I think it is close.

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from 4b15ae5 to f89065a Compare September 9, 2020 00:51
@yeya24 yeya24 requested review from GiedriusS and kakkoyun September 9, 2020 00:52
@yeya24 yeya24 force-pushed the rework-replicate-exec branch from f89065a to 8d2e154 Compare September 9, 2020 01:08
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from 8d2e154 to a75bef5 Compare September 11, 2020 13:03
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, some comments though, before merging (:

pkg/replicate/scheme.go Show resolved Hide resolved
metas, partials, err := rs.fetcher.Fetch(ctx)
if err != nil && metas == nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm the if err != nil && metas == nil { feels like code smell to me though. It just looks wrong from outside which can lead to bugs in future:

  • Inside implementation can change, and metas can be produced without partial for some reason - then we will miss an error.
  • Even if we say it makes sense, maybe another person will come here and refactor it anyway as it looks like error can be missed.

On other hand, shouldn't replicate FAIL on any error spotted? Even if we failed to read one block? Sometimes we heavily rely on replication status so this has to be heavily tested and controlled, ideally with failed metric being somewhere (:

return nil
}
for id, partialError := range partials {
level.Info(rs.logger).Log("msg", "failed to fetch block meta. Skipping.", "block_uuid", id.String(), "err", partialError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this failure? This is actually valid state of blocks being potentially uploaded, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like level.Info(rs.logger).Log("msg", "block meta not uploaded yet. Skipping.", "block_uuid", id.String()) r

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartialError has 2 cases, one is meta.json not found, another one is meta.json corrupted. What about the corrupted case, is it an error or not?

For different obj store implementation, is it possible that one uploaded file is corrupted? If it is this case, how can we differentiate the case that one file is totally corrupted or still being uploaded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupted case, is it an error or not?

For current flow it's not error. We consider corrupted meta same as no meta.json as the reason is usually same - partial upload (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated.

@kakkoyun
Copy link
Member

@yeya24 This looks fine to me. You need to rebase I believe.
Ping @bwplotka are you satisfied with the state of the PR now?

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from ddc233c to c8140be Compare October 19, 2020 13:29
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to comment, that still needs to be addressed, otherwise LGTM!

Thanks!

return nil
}
for id, partialError := range partials {
level.Info(rs.logger).Log("msg", "failed to fetch block meta. Skipping.", "block_uuid", id.String(), "err", partialError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupted case, is it an error or not?

For current flow it's not error. We consider corrupted meta same as no meta.json as the reason is usually same - partial upload (:

@yeya24 yeya24 force-pushed the rework-replicate-exec branch from c8140be to f1079be Compare October 30, 2020 19:57
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Changelog entry could be better, but we can fix it later.

@@ -28,6 +28,12 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#3331](https://github.com/thanos-io/thanos/pull/3331) Disable Azure blob exception logging
- [#3341](https://github.com/thanos-io/thanos/pull/3341) Disable Azure blob syslog exception logging

### Changed

- [#2906](https://github.com/thanos-io/thanos/pull/2906) Tools: Refactor Bucket replicate execution. Removed all `thanos_replicate_origin_.*` metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not need to mention refactor, just metric changes (:

@bwplotka bwplotka merged commit 5abda52 into thanos-io:master Oct 30, 2020
@OGKevin
Copy link
Contributor

OGKevin commented Oct 30, 2020

Whoop 🥳 🎊 I'll work on #2979.

@OGKevin
Copy link
Contributor

OGKevin commented Oct 30, 2020

Nice work @yeya24

@bwplotka
Copy link
Member

@bwplotka
Copy link
Member

💪

@yeya24 yeya24 deleted the rework-replicate-exec branch October 30, 2020 22:20
Oghenebrume50 pushed a commit to Oghenebrume50/thanos that referenced this pull request Dec 7, 2020
…hanos-io#2906)

* Refactor replicate execution

Signed-off-by: Ben Ye <[email protected]>

* return error when fetcher gets an error

Signed-off-by: Ben Ye <[email protected]>

* address feedback

Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Oghenebrume50 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants