Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wdPoST worker just marks sectors as bad if this happens: remote/vanilla/single: context deadline exceeded #8957

Closed
8 of 18 tasks
benjaminh83 opened this issue Jul 1, 2022 · 14 comments
Labels
area/proving Area: Proving kind/bug Kind: Bug P2 P2: Should be resolved

Comments

@benjaminh83
Copy link

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • This is not a question or a support request. If you have any lotus related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file a improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, or the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

$ lotus-miner version
Daemon:  1.16.0-rc1+calibnet+git.5a67bb0f6+api1.5.0
Local: lotus-miner version 1.16.0-rc1+calibnet+git.5a67bb0f6

Describe the Bug

PoST workers are generating wdPoSTs, but is only including a subset of sectors and marking the rest as bad.
This happens because PoST workers are asking other workers with the LongTermStorage path to do vanilla-proofs, and all proofs that does not finalize before a certain deadline/threshold, is handled as bad (corrupted/missing data).

This is a bug because the wdPoST is delivered on chain fairly fast. This proof was delivered after 4m30s into the deadline:
https://calibration.filscan.io/tipset/message-detail?cid=bafy2bzacectwfvssmw2owbcar5olmfeyw7bqldd7ebwc52sdkks4zaw2urqc2

So basically leaving 25 good minutes (that could have been used to wait for the storage to deliver the last proofs). So instead the combination of PoST workers and other workers introduces this much shorter threshold for getting data processed for windowPoSTing.

I would argue this is an undesirable side effect of running proving on a lotus-worker based setup. This would not happen if the lotus-miner was proving, as it allows for the vanilla-proofs to be slow (as they are not processed over network, but local to the process).

Errors does not show on the PoST worker logs when doing wdPoST, so instead I ran a lotus-miner proving check --only-bad --slow 0
And it revealed these errors in the log for all the sectors marked as bad. Data is physically there, but this happened because the storage is rather slow, and not super responsive (so its a coincidence we experience what happens in this scenario). WindowPoST on the lotus-miner can take up to 15m into the deadline, so the storage is for sure not fast.

This bug is relevant to fix, as usage of PoST workers otherwise changes the requirement for wdPoSTs, so you actually don't have a full deadline to proof your partition, but only the time until the PoST workers gets tired of waiting on the remote worker, and marks the sector as BAD.

Could be related to #8362, but we do not see any breakdown of the scheduling or the wdPoST worker crashes. In this case it basically just rules out good (but slow) vanilla-proofs, because they are not presented to the wdPoST worker within a set threshold.

Logging Information

0         0          1638    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1628    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          827     bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          893     bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1619    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1590    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          123     bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          117     bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1615    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1504    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          1593    bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)
0         0          152     bad (generating vanilla proof: do request: Post "http://10.33.18.1:50000/remote/vanilla/single": context deadline exceeded)

Repo Steps

Setup:
1x Lotus-miner: No LongTermStorage path
2x wdPoST worker: No LongTermStorage path
2x wnPoST worker: No LongTermStorage path
8x lotus-workers with no tasks: Holds the LongTermStorage path
14x lotus-workers with AP-C2: No LongTermStorage path

  1. Fill up deadlines with full partitions
  2. Have the LongTermStorage in some cluster or NFS shared configuration where providing IO is no ideal (so not just a local direct attached disk).
  3. Do lotus-miner proving check --only-bad --slow on deadlines with full partitions, and see that not all sectors come out as good - part of them are marked BAD.
  4. See error '/remote/vanilla/single": context deadline exceeded' for sectors if they are not replied in time, and marked BAD.
@rjan90 rjan90 added area/proving Area: Proving P2 P2: Should be resolved and removed need/triage labels Jul 4, 2022
@rjan90
Copy link
Contributor

rjan90 commented Jul 4, 2022

Thanks for the issue ticket @benjaminh83! When reading through the ticket it seems like this might be fixed by #8959, which allows you to disable the PoSt pre-checks. It has just landed in master, so would need to test and confirm that this fixes the issue.

# Disable WindowPoSt provable sector readability checks.
  # 
  # In normal operation, when preparing to compute WindowPoSt, lotus-miner will perform a round of reading challenges
  # from all sectors to confirm that those sectors can be proven. Challenges read in this process are discarded, as
  # we're only interested in checkdng that sector data can be read.
  # 
  # When using builtin proof computation (no PoSt workers, and DisableBuiltinWindowPoSt is set to false), this process
  # can save a lot of time and compute resources in the case that some sectors are not readable - this is caused by
  # the builtin logic not skipping snark computation when some sectors need to be skipped.
  # 
  # When using PoSt workers, this process is mostly redundant, with PoSt workers challenges will be read once, and
  # if challenges for some sectors aren't readable, those sectors will just get skipped.
  # 
  # Disabling sector pre-checks will slightly reduce IO load when proving sectors, possibly resulting in shorter
  # time to produce window PoSt. In setups with good IO capabilities the effect of this option on proving time should
  # be negligible.
  # 
  # NOTE: It likely is a bad idea to disable sector pre-checks in setups with no PoSt workers.
  # 
  # NOTE: Even when this option is enabled, recovering sectors will be checked before recovery declaration message is
  # sent to the chain
  # 
  # After changing this option, confirm that the new value works in your setup by invoking
  # 'lotus-miner proving compute window-post 0'
  #
  # type: bool
  # env var: LOTUS_PROVING_DISABLEWDPOSTPRECHECKS
  DisableWDPoStPreChecks = false

@benjaminh83
Copy link
Author

Very cool @rjan90
But isn't it still a problem that PoST workers run queries over the network, and latency can get them to context deadline exceeded?
We are testing on larger scale here, so I would love to try this out and see if we gain any benefits. This is already merged in latest master right? So we can try out DisableWDPoStPreChecks = false like now?

I would just really want this PoST worker setup to be resilient. Like what happens if we have 10x the power and maybe a cluster of 20 PoST workers - many workers are not worth much, if they individually gives up on proving because the SP is getting big and complex.

@rjan90
Copy link
Contributor

rjan90 commented Jul 5, 2022

But isn't it still a problem that PoST workers run queries over the network, and latency can get them to context deadline exceeded?

I´m not sure, but maybe @magik6k can answer if that is potential issue?

We are testing on larger scale here, so I would love to try this out and see if we gain any benefits. This is already merged in latest master right? So we can try out DisableWDPoStPreChecks = false like now?

Yes, these changes has landed in master now so it should be fairly easy to try out :)

@benjaminh83
Copy link
Author

Okay, great success!
Now I was able to proof all deadlines. Changes to the environment is:

1x Lotus-miner: Holds the LongTermStorage path
2x wdPoST worker: No LongTermStorage path
2x wnPoST worker: No LongTermStorage path
14x lotus-workers with AP-C2: No LongTermStorage path

So all single vanilla proofs are done on the lotus-miner, but both wd and wnPoST has been outsourced. And I have ran this successfully (also on the full deadline that would not prove in the old setup):

  ParallelCheckLimit = 128
  DisableBuiltinWindowPoSt = true
  DisableBuiltinWinningPoSt = true
  DisableWDPoStPreChecks = true

Proving times (the single vanilla proof / data read) are much better so no timeout occured.

@rjan90
Copy link
Contributor

rjan90 commented Jul 5, 2022

Amazing, glad it worked! I will close this issue since it is resolved!

And again, thank you so much for the detailed write up, it will be of great help to document and provide an example of this setup on a larger scale in the docs 😄

@rjan90 rjan90 closed this as completed Jul 5, 2022
@rjan90 rjan90 reopened this Jul 20, 2022
@benjaminh83
Copy link
Author

I was able to reproduce this on this simple setup that we are running now:

1x Lotus-miner: No LongTermStorage path
1x wdPoST worker: No LongTermStorage path
1x wnPoST worker: No LongTermStorage path
34x lotus-workers with AP-C2: No LongTermStorage path
2x lotus-worker --no-default: Holding LongTermStorage path (Running on a ubuntu server with ZFS)

It have worked fine for days, until I have now reached a full 2349 sector deadline. It seems to be unable to get all the single_vanilla_proofs generated before a cut off. So It's possible to generate vanilla proofs for around 2000 sectors, but going above that, it gets cut off, and unproven sectors are maked as BAD.

Logs from doing a lotus-miner proving check --only-bad --slow 0 is provided below. It's completed in <3m! So something is telling me that the deadline threshold for generating single_vanilla_proofs is set quite low! It might not be intentional - it might be a general network threshold for when a query is believed to be dead.

Solutions:

  • Obviously it helped to set DisableWDPoStPreChecks = true, and therefore make the process faster. But that feels a bit like cheating, as we are not really addressing the fact that lotus currently just cuts of generating single_vanilla_proofs, if they take longer than only these 3m. So its just a matter of time before we see it again if we keep on going bigger.

  • So we could look at how lowering the ParallelCheckLimit = 128 setting in config.toml. I'm not sure if this will bring lower load of the network queries, and therefore leave less queries unanswered within the threshold?

  • optimally we should just be able to control this parameter for when "context deadline exceeded" is triggered. Basically it makes no sense that windowPoST is cutting off generating single_vanilla_proofs in like 3m, when we have 30m to prove a deadline, as the GPU part of windowPoST takes like 2m19s on our system.

LOGS:

$ time lotus-miner proving check --only-bad --slow 0
deadline  partition  sector  status
0         0          75      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          63      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          66      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          42      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          23      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          118     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          58      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          71      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          111     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          110     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          96      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          126     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          0       bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          56      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          54      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          120     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          60      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          20      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          16      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          21      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          24      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          131     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          130     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          12      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          86      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          31      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          89      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          98      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          49      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          90      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          95      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          55      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          1       bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          52      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          132     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          116     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          114     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          40      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          35      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          47      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          102     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          26      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          112     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          39      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          6       bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          62      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          105     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          74      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          18      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          69      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          93      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          68      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          64      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          61      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          33      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          50      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          38      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          88      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          72      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          41      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          87      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          37      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          14      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          8       bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          13      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          70      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          19      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          67      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          36      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          27      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          92      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          30      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          99      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          17      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          9       bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          108     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          29      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          43      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          48      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          28      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          46      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          109     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          59      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          91      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          57      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          103     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          15      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          107     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          65      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          32      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          101     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          97      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          94      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          117     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          45      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          115     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          100     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          113     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          34      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          22      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          124     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          119     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          129     bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)
0         0          51      bad (generating vanilla proof: do request: Post "http://10.33.16.85:50000/remote/vanilla/single": context deadline exceeded)

real    2m54.768s
user    0m2.103s
sys     0m0.445s

@benjaminh83
Copy link
Author

Okay, this issues has been fixed again, but this time using other means. So by now, we have set DisableWDPoStPreChecks = true as last time, but we intentionally waited with that and tried something else.

So we could consistently regenerate about 100 BAD sectors when running the lotus-miner proving check --only-bad --slow 0 against this calib-miner. Sealing at full pace or idle made no difference - basically it just can provide all 2349 single vanilla proofs in time without causing some timeouts.

We then tried reducing the number of parallel reads out of a hypothesis about too many read queries over the network might end up with a part being left unanswered for too long time, leading to a timeout. So WindowPoST worker was then run with:

--windowpost=true
--post-parallel-reads=32

And this has done the trick. Read time for a full 2349 sector partition was around 2m, so really no slow down from reducing the parallelism. This could indicate that the number of parallel reads set at 128 might be a bit overkill and actually potentially cause issue with network query congestion which leads to unintended timeouts.

@rjan90
Copy link
Contributor

rjan90 commented Jul 22, 2022

Closing this issue since it has been resolved, and also added a feature enhanement issues for discussing if the default post-parallel-reads value should be lowered.

@rjan90 rjan90 closed this as completed Jul 22, 2022
@donkabat
Copy link

@rjan90
I set post-parallel-reads 25 but context deadline exceeded still appears from time to time. Do you have another idea to fix it?

@rjan90
Copy link
Contributor

rjan90 commented Sep 22, 2022

@donkabat
Copy link

yes, DisableWDPoStPreChecks = true

@rjan90
Copy link
Contributor

rjan90 commented Sep 22, 2022

image (5)
IIRC and the above image you provided earlier are still correct, you are running your long term storage on servers that are also doing other work (sealing/lotus-miner/etc)?

I think you are at a point in terms of sealing throughput/load on these servers that makes your windowPoSt have to compete with other tasks/network resources, etc.

Example:
PC1 server transfers sealed sectors to the PC2 server, while the windowPoSt is trying to read 32 sectors worth of PoSt challenges, in such a scenario its very easy to see that you potentially will get short network timeouts and thus you get the context deadline exceeded

Solution would be to either rewire what the different workers are doing to minimize network traffic (avoid PC1->PC2 transfer at least). Or the more long-term sustainable way is probably getting dedicated storage servers

@donkabat
Copy link

donkabat commented Sep 22, 2022

@rjan90
Maybe yes, but after I stopped snapping yesterday, today I've got much more errors than normally:

lotus-miner proving check --slow --only-bad 10                                                                                                                                                                                                     [42/42]

deadline  partition  sector  status
10        0          30078   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30339   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30069   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          31658   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          31442   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30070   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30335   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30074   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30072   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30338   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30075   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          31748   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          30065   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          31309   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)
10        0          31747   bad (generating vanilla proof: do request: Post "http://192.168.88.20:1208/remote/vanilla/single": context deadline exceeded)

Above results I got when miner had 0 jobs.

PC1 workers use NFS storages from PC2 servers and dedicated ethernet port + VLAN. So data flow between PC1 => PC2 cant have impact on storage.

@donkabat
Copy link

donkabat commented Sep 22, 2022

There is also one change, "FIL-0" server is used by daemon, miner, win/wd post workers, boostd
and neither none of PC2 or C2 workes have long-storage task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/proving Area: Proving kind/bug Kind: Bug P2 P2: Should be resolved
Projects
None yet
Development

No branches or pull requests

3 participants