keep retrying the proof until we run out of sectors to skip #4633

Merged
merged 2 commits into from
Oct 30, 2020
9 changes: 9 additions & 0 deletions storage/wdpost_run.go
@@ -612,6 +612,15 @@ func (s *WindowPoStScheduler) runPost(ctx context.Context, di dline.Info, ts *ty

log.Warnw("generate window post skipped sectors", "sectors", ps, "error", err, "try", retries)

// Explicitly make sure we haven't aborted this PoSt
// (GenerateWindowPoSt may or may not check this).
// Otherwise, we could try to continue proving a
// deadline after the deadline has ended.
if ctx.Err() != nil {
	log.Warnw("aborting PoSt due to context cancellation", "error", ctx.Err(), "deadline", di.Index)
	return nil, ctx.Err()
}
Member Author

This will explicitly check the context. We should be cancelling it here:

// Replace the aborted postWindow with a new one so that we can
// submit again at any time without the state getting clobbered
// when the abort completes
abort := pw.abort
if abort != nil {
	pw = &postWindow{
		di:          pw.di,
		ts:          advance,
		submitState: SubmitStateStart,
	}
	s.postWindows[pw.di.Open] = pw
	// Abort the current submit
	abort()
}

Member Author

I could also stop retrying once we get, e.g., two-thirds of the way through the proof time, but I'm not sure that really makes sense. I guess sectors assigned to a single partition are somewhat correlated in time, so their failures may be correlated? But I don't want to:

  1. Spend a lot of time trying to prove one partition.
  2. Give up on that partition because we're running out of time.
  3. Spend a little time trying to prove all other partitions in the deadline and fail because we have a lot of faulty sectors.

When we could have eventually submitted a valid proof for the first partition, if we had simply stuck with it.


skipCount += uint64(len(ps))
for _, sector := range ps {
postSkipped.Set(uint64(sector.Number))