wdPoST worker just marks sectors as bad if this happens: remote/vanilla/single: context deadline exceeded #8957
Comments
Thanks for the issue ticket @benjaminh83! Reading through the ticket, it seems like this might be fixed by #8959, which allows you to disable the PoSt pre-checks. It has just landed in master, so you would need to test and confirm that it fixes the issue.
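For reference, a minimal sketch of what the config change might look like once that fix is available, assuming the option is exposed as DisableWDPoStPreChecks under the [Proving] section of config.toml (the option name here is an assumption; check the documented default config for your lotus version):
[Proving]
  # Assumed option name: skips the sector pre-checks before windowPoSt,
  # so slow-but-good sectors are not filtered out ahead of proving.
  DisableWDPoStPreChecks = true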
Very cool @rjan90! I would just really want this PoST worker setup to be resilient. What happens if we have 10x the power and maybe a cluster of 20 PoST workers? Many workers are not worth much if they individually give up on proving as the SP gets big and complex.
I'm not sure, but maybe @magik6k can answer whether that is a potential issue?
Yes, these changes have landed in master now, so it should be fairly easy to try out :)
Okay, great success!
So all single vanilla proofs are done on the lotus-miner, but both wdPoST and wnPoST have been outsourced. And I have run this successfully (also on the full deadline that would not prove in the old setup):
Proving times (the single vanilla proof / data read) are much better, so no timeout occurred.
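For context, a rough sketch of how the dedicated proving workers in a setup like this are typically started, assuming the --windowpost and --winningpost flags of lotus-worker run (the flag names are assumptions here; verify with lotus-worker run --help on your version):
# On the wdPoSt machines (assumed flag: enables windowPoSt tasks on this worker)
lotus-worker run --windowpost
# On the wnPoSt machines (assumed flag: enables winningPoSt tasks on this worker)
lotus-worker run --winningpost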
Amazing, glad it worked! I will close this issue since it is resolved! And again, thank you so much for the detailed write-up; it will be of great help for documenting and providing an example of this setup on a larger scale in the docs 😄
I was able to reproduce this on the simple setup we are running now:
It has worked fine for days, until I have now reached a full 2349-sector deadline. It seems to be unable to get all the single_vanilla_proofs generated before a cutoff: it is possible to generate vanilla proofs for around 2000 sectors, but going above that, the run gets cut off and the unproven sectors are marked as BAD.
LOGS:
Okay, this issue has been fixed again, but this time by other means. We could consistently reproduce about 100 BAD sectors when running the check. We then tried reducing the number of parallel reads, based on the hypothesis that too many read queries over the network might leave some of them unanswered for too long, leading to a timeout. The WindowPoST worker was then run with a reduced number of parallel reads:
And this has done the trick. Read time for a full 2349-sector partition was around 2m, so really no slowdown from reducing the parallelism. This could indicate that the default of 128 parallel reads might be a bit of overkill and can actually cause network query congestion, which leads to unintended timeouts.
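A minimal sketch of what such a reduced-parallelism launch could look like, assuming the wdPoSt worker exposes a parallel-read flag such as --post-parallel-reads (both the flag name and the value below are assumptions, not the exact command used here):
# Assumed flag and value: lower the concurrent sector reads from the default of 128
lotus-worker run --windowpost --post-parallel-reads=32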
Closing this issue since it has been resolved. I have also added a feature enhancement issue for discussing whether the default number of parallel reads should be changed.
@rjan90
And you have
Yes,
@rjan90
The results above were obtained when the miner had 0 jobs. The PC1 workers use NFS storage from the PC2 servers over a dedicated ethernet port + VLAN, so the data flow between PC1 => PC2 can't impact the storage.
There is also one change: the "FIL-0" server is used by the daemon, miner, win/wd PoSt workers, and boostd.
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
PoST workers are generating wdPoSTs, but are only including a subset of sectors and marking the rest as bad.
This happens because the PoST workers ask other workers holding the LongTermStorage path to do the vanilla proofs, and all proofs that do not finish before a certain deadline/threshold are treated as bad (corrupted/missing data).
This is a bug because the wdPoST is delivered on chain fairly fast. This proof was delivered 4m30s into the deadline:
https://calibration.filscan.io/tipset/message-detail?cid=bafy2bzacectwfvssmw2owbcar5olmfeyw7bqldd7ebwc52sdkks4zaw2urqc2
So that basically leaves 25 good minutes that could have been used to wait for the storage to deliver the last proofs. Instead, the combination of PoST workers and other workers introduces a much shorter threshold for getting data processed for windowPoSTing.
I would argue this is an undesirable side effect of running proving on a lotus-worker based setup. It would not happen if the lotus-miner itself were proving, as that allows the vanilla proofs to be slow (they are not fetched over the network, but produced locally in the process).
Errors do not show in the PoST worker logs when doing wdPoST, so instead I ran:
lotus-miner proving check --only-bad --slow 0
It revealed these errors in the log for all the sectors marked as bad. The data is physically there, but this happened because the storage is rather slow and not super responsive (so it is somewhat a coincidence that we hit this scenario). WindowPoST on the lotus-miner can take up to 15m into the deadline, so the storage is for sure not fast.
This bug is relevant to fix, as using PoST workers otherwise changes the requirements for wdPoSTs: you don't actually have the full deadline to prove your partition, only the time until the PoST worker gets tired of waiting on the remote worker and marks the sector as BAD.
This could be related to #8362, but we do not see any breakdown of the scheduling, nor does the wdPoST worker crash. In this case it basically just rules out good (but slow) vanilla proofs, because they are not presented to the wdPoST worker within a set threshold.
Logging Information
Repo Steps
Setup:
1x Lotus-miner: No LongTermStorage path
2x wdPoST worker: No LongTermStorage path
2x wnPoST worker: No LongTermStorage path
8x lotus-workers with no tasks: Holds the LongTermStorage path
14x lotus-workers with AP-C2: No LongTermStorage path
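A rough sketch of how the storage-only workers in the setup above might be brought up; the flags and path below are assumptions by analogy with the lotus-miner storage commands, so verify against lotus-worker --help for your version:
# Assumed flags: disable all sealing task types so the worker only serves sector data
lotus-worker run --addpiece=false --precommit1=false --precommit2=false --commit=false --unseal=false
# Assumed command and placeholder path: attach the long-term storage so PoST workers can read from it
lotus-worker storage attach --store /path/to/long-term-storage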
lotus-miner proving check --only-bad --slow
on deadlines with full partitions, and see that not all sectors come out as good; part of them are marked BAD.