-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compactor: adjust interval for period <1-hour #9474
Conversation
/cc @yudai @fanminshi |
compactor/periodic.go
Outdated
} else { | ||
plog.Noticef("Failed auto-compaction at revision %d (%v)", rev, err) | ||
plog.Noticef("Retry after %v", checkCompactInterval) | ||
plog.Noticef("Retry after %v", interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes the behavior of retry when error. Previously, we retry quickly at 1/10 the of the retention period and before that we retry every 5min. With this change, with retention period > 1Hr, we will retry again in 1Hr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yudai We are still discussing better ways to implement retry logic on compact failures. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retrying in 1 hour sounds a bit longer to me. 5 minutes or retension/10
would be more reasonable.
By the way, If I understand correctly, with this code, we don't slide t.revs
in the error case, and t.revs
gets a new value in the beginning of the next loop iteration (line 65) anyway, meaning the len(t.revs)
increases when an error happens. Eventually, the retention time unexpectedly gets another 1 hour for each error. I think we should slide t.revs
in the error case as well, if we keep this retry logic. Another idea is make line 82 like rev := t.revs[len(t.revs) - interval]
.
The main point is that t.revs = append(t.revs, t.rg.Rev())
at line 65 is expected to be executed every interval. To keep this promise, we cannot add another timed task for retry in the loop.
We might use another gorotuine for retry to keep the main loop running, but I feel like it's an error prone way.
Maybe a better idea is that using a divided interval
(by 10 or something, e.g. 6 minutes interval for 1hour retention period). And at line 82, we can do rev := t.revs[len(t.revs) - interval*10]
? (we also need to modify line 89). To avoid running compaction every retention/10
, we would need a counter as well (we reset it when compaction succeeds).
In this case, we can retry in 1/10 retention time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure if my idea is feasible, I'll try implement it and see how it looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yudai Sure. We will try implementing simplest retry logic first (for 3.3 backport). And iterate from there to improve. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea would look like this? However, I realized that if the retention time is small, the subInterval
(interval/10
) will be very very small. So this might not be a good idea to go.
https://gist.github.com/yudai/d976432a71e140598c08c91ae8569435
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using two goroutine looks simpler in a sense and maybe most efficient. The retry interval is interval/10
.
https://gist.github.com/yudai/da0cf189c8c445973615671b33765c7d
@gyuho these two implementations are what I came up with. I personally like the way with two goroutines, but it's a big modification, so not sure if it's acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry interval is interval/10
I think this is a good idea since it won't be breaking change for 3.3.
And for two goroutines, let's do it once we backport this fix?
You can create a separate PR once we address this PR.
I will rework on this PR.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
compactor/compactor.go
Outdated
@@ -29,8 +29,6 @@ var ( | |||
) | |||
|
|||
const ( | |||
checkCompactionInterval = 5 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move this to the revision compactor since it is the only one who is using it?
compactor/revision.go
Outdated
@@ -17,6 +17,7 @@ package compactor | |||
import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to refactor revision compactor code? it is work as expected? we can refactor this in the future pr.
let's make this pr at a minimum?
Codecov Report
@@ Coverage Diff @@
## master #9474 +/- ##
=========================================
Coverage ? 72.59%
=========================================
Files ? 364
Lines ? 30885
Branches ? 0
=========================================
Hits ? 22420
Misses ? 6842
Partials ? 1623
Continue to review full report at Codecov.
|
compactor/periodic.go
Outdated
checkCompactInterval := t.period / time.Duration(periodDivisor) | ||
interval := t.getInterval() | ||
|
||
last := t.clock.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming to something like initialWait
? I think last
is confusing.
086f74c
to
9adb346
Compare
What's interesting here is;
I guess this jump maybe confusing and we will get a bug report about this jump in the future. |
@yudai Yeah, this would break old behavior either way. |
compactor/periodic.go
Outdated
func (t *Periodic) getInterval() time.Duration { | ||
itv := t.period | ||
if itv > time.Hour { | ||
itv /= 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this changed? I thought we are going to do x is >1-hour
-> compact every hr
after initial wait?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to keep the old behavior. But at this point, we break old behavior anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you set 60 minutes, compaction happens every 6 minutes
if you set 59 minutes, compaction happens every 59 minutes
The above behavior can be avoided if we do x is >1-hour -> compact every hr after initial wait?
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you pointed out #9474 (comment), that would break old behavior. If we don't backport, I am fine with this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code introduced a concept of checkInterval. This interval just to check if I should do compaction or not? it can be quick we had it set as 5 mins . So If a compact error happens, we will just wait another 5 mins to check again. We might want to keep that check interval. The interval in this pr is the compact interval which is different from the check interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This interval just to check if I should do compaction or not? it can be quick we had it set as 5 mins . So If a compact error happens, we will just wait another 5 mins to check again.
5-min interval is for revision compactor. So, v3.3.2 still retries after 1/10 of given compaction period. https://github.com/coreos/etcd/blob/v3.3.2/compactor/periodic.go#L64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed in person. @fanminshi meant 5-min in v3.2 :)
We can define new rules like below if we don't like the jump, but looks a bit complicated.
(you can put a threshold number to Y, like 6 minutes) |
@fanminshi @yudai Yeah, I think it's better to have minimum interval. While keeping the old behavior of 1/10 of given period, we just introduce hard-coded limit of 10-minute(???). Then, 1-hour compaction runs every 10-minute. 20-minute compaction runs every 10-minute, rather than 2-minute. The main motivation is prevent aggressive auto-compaction as in #9443
What do you think? |
@gyuho A simple lower-bound limit makes an issue when you set an under-boundary value to the period. For example, when the limit is 10-minute, if you put 5m to the period, we probably wan to run compaction every 5-minute, not every 10-minute.
comes from this point. Let the threshold 6 minutes,
There is no jumping, but a bit tricky. |
I think the idea "when x > 1hr, compact every hr after x, and When x < 1h, compact every x" is simpler reason about. if x = 1hr, compacts every 1hr. What's the issue with the above behavior? EDIT: add failure path If a compaction failed: if x = 1hr, compacts every 1hr. if compaction has failed, retry again in 6 mins. |
@fanminshi I think "when x > 1hr, compact every hr after x, and When x < 1h, compact every x" is pretty fair and easy to understand. If we can choose it, I think it's the best way to go. The only point that we need to consider is that this behavior is not compatible with what we have with v3.3 (right?). So if we backport this PR to v3.3, it's a breaking change for v3.3. (If we deem that the current v3.3 behavior is a bug, we can backport this PR). I think @gyuho is trying to keep the backward compatiblity with v3.3 in this PR. |
Given that we are not going to break v3.3 behavior in this PR, 'X/10' interval is the baseline. In that sense, the only thing what we can do here is that removing too aggressive auto compaction done by |
It breaks old behavior that sets interval as 1/10 of given period. So, let's do this for v3.4 (current master branch), given that this PR only wants to address #9443. @yudai's logic in #9474 (comment) sounds reasonable, while keeping the old behavior at its best. |
|
Yeah, I will do as we first planned. Might be too confusing for users. |
Just pushed with our first plan. Will document this separately. |
compactor/periodic.go
Outdated
plog.Noticef("Finished auto-compaction at revision %d", rev) | ||
} else { | ||
// retry fast | ||
interval = t.getInterval() / 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in retry iterations, we don't want to update t.revs
.
Maybe something like below is required at line 70.
if interval == t.getInterval() {
t.revs = append(t.revs, t.rg.Rev())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yudai Good catch. Just fixed. PTAL. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I have another corner case.
Now t.revs
is not updated while retry is ongoing. But if compaction keeps failing for a while, it will be outdated. So ideally, we need to update t.revs
and shift the window once every 10 times when retry is ongoing to keep t.revs[0]
fresh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing. I think it's just about definition of "interval". If retry fails 9 times, next compaction with a new rev happens at (9/10 + 1) hour
. So the effective interval and the DB overhead can vary between 1 hour through 1.9 hours. But if we define the interval as "the duration from the last SUCCESSFUL compaction", it's just fine.
I guess without using two goroutines, these are hard to implement simpler. You may want to put these corner cases aside in this PR.
ad7f504
to
f0bec73
Compare
@yudai Good point. I would see |
} | ||
|
||
func TestPeriodicPause(t *testing.T) { | ||
func TestPeriodicPauseHourly(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows what would happen when compactor fails for 30-hours.
compactor/periodic.go
Outdated
p := t.paused | ||
t.mu.RUnlock() | ||
if p { | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this break
won't stop the loop, because it's captured by the select
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can move the pause check block to the top of the loop and maybe merge another one at line 90 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right. good catch. fixed.
compactor/periodic.go
Outdated
// divide by 10 to retry faster | ||
// e.g. given period 2-hour, retry in 12-min rather than 1-hour (compaction period) | ||
func (t *Periodic) getRetryInterval() time.Duration { | ||
return t.period / retryDivisor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this should be interval/retryDivisor
for cases t.period > 1h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. thanks @yudai !
compactor/periodic.go
Outdated
if itv > time.Hour { | ||
itv /= retryDivisor | ||
} | ||
return itv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When t.period
is 10minutes for example, retryInterval will be 10 minutes. This is fine.
But retryDivisor
is still 10 in this case, so Run()
tries 10 times with 10 minute wait, meaning 100 minutes in total.
We may need to return retry divisor as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Just fixed. Think we've covered all corner cases now? :)
I will take another look with @fanminshi tomorrow.
Thanks @yudai
Signed-off-by: Gyuho Lee <[email protected]>
I think this is impossible ore pretty hard to write if we don't have two goroutins.
|
@yudai Yeah, I have another related corner case. I will rework on that. |
If you don't mind, I can also work on this PR with two gorouines. Please just ping me. |
@yudai Sure, that would be better! |
If given compaction period x is <1-hour, compact every x duration, with x retention window (e.g. --auto-compaction-mode 'periodic' --auto-compaction-retention='10m', then compact every 10-minute).
If given compaction period x is >1-hour, compact every 1-hour, with x retention window (e.g. --auto-compaction-mode 'periodic' --auto-compaction-retention='72h', then compact every 1-hour). Retry in x/10 duration on compaction failure.