-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Avoid long state queries when serving GetNodeData requests #11444
Conversation
Finish GetDataNode requests early if queries take too long
* master: chore: remove unused dependencies (#11432)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we need this, also extracted the actual fix to #11447 for easier backporting
Co-Authored-By: Andronik Ordian <[email protected]>
Can you elaborate on your doubts? The real fix here is your |
I mean if the request takes a long time, it's definitely a bug we shouldn't silence it, but having a warning instead of taking a freezing a client is definitely an improvement, so I changed my mind :) |
After ~4h of running this (including @ordian's |
Before (2.7.1), you saw queries in the magnitude of seconds in the worst case, right? |
…thereum into dp/debug/sync-stalls * 'dp/debug/sync-stalls' of github.com:paritytech/parity-ethereum: Update ethcore/sync/src/chain/supplier.rs
Oh no, much much worse. The worst I saw was, I think, 221 seconds, but I suspect there were even worse lockups than that before I added logging. |
…pstream * master: Avoid long state queries when serving GetNodeData requests (#11444) Cargo.lock: cargo update -p kvdb-rocksdb (#11447) rlp_derive: cleanup (#11446) chore: remove unused dependencies (#11432) update kvdb-rocksdb to 0.4 (#11442) Rough architecutre diagram. (#11396) ethjson: impl Copy for hash type wrapper (#11423) Remove dead bootnodes, add new geth bootnodes (#11441) goerli: replace foundation bootnode (#11433) Update publish-docker.sh (#11428) Revert "[Trace] Distinguish between `create` and `create2` (#11311)" (#11427)
* Remove dead bootnodes, add new geth bootnodes * More granular locking when fetching state Finish GetDataNode requests early if queries take too long * typo * Use latest kvdb-rocksdb * Cleanup * Update ethcore/sync/src/chain/supplier.rs Co-Authored-By: Andronik Ordian <[email protected]> * Address review grumbles * Fix compilation * Address review grumbles Co-authored-by: Andronik Ordian <[email protected]>
* update classic testnet bootnodes (#11398) * update classic testnet bootnodes * Update kotti.json * Update kotti.json * Update kotti.json * Update mordor.json * verification: fix race same block + misc (#11400) * ethcore: fix race in verification * verification: fix some nits * verification: refactor err type `Kind::create` * fix: tests * address grumbles * address grumbles: don't panic * fix: export hardcoded sync format (#11416) * fix: export hardcoded sync format * address grumbles * make tests compile with rustc_hex 2.0 * fix grumbles: impl LowerHex for encoded Header * goerli: replace foundation bootnode (#11433) * Remove dead bootnodes, add new geth bootnodes (#11441) * update kvdb-rocksdb to 0.4 (#11442) * Avoid long state queries when serving GetNodeData requests (#11444) * Remove dead bootnodes, add new geth bootnodes * More granular locking when fetching state Finish GetDataNode requests early if queries take too long * typo * Use latest kvdb-rocksdb * Cleanup * Update ethcore/sync/src/chain/supplier.rs Co-Authored-By: Andronik Ordian <[email protected]> * Address review grumbles * Fix compilation * Address review grumbles Co-authored-by: Andronik Ordian <[email protected]> * rlp_derive: cleanup (#11446) * rlp_derive: update syn & co * rlp_derive: remove dummy_const * rlp_derive: remove unused attirubutes * rlp-derive: change authors * Cargo.lock: cargo update -p kvdb-rocksdb (#11447) * Cargo.lock: new lockfile format Manual backport of #11448 * gcc to clang (#11453) * gcc to clang test ``` export CC="sccache "$CC export CXX="sccache "$CXX ``` darwin build ``` CC=clang CXX=clang ``` * darwin - > default clang * Bump version and CHANGELOG.md * chore: remove unused dependencies (#11432) * fix: compiler warnings * chore: remove unused dependencies * Update CHANGELOG.md * update Cargo.lock * update CHANGELOG.md * backwards compatible call_type creation_method (#11450) * rlp_derive: update syn & co * rlp_derive: remove dummy_const * rlp_derive: remove unused attirubutes * rlp-derive: change authors * rlp_derive: add rlp(default) attribute * Revert "Revert "[Trace] Distinguish between `create` and `create2` (#11311)" (#11427)" This reverts commit 5d4993b. * trace: backwards compatible call_type and creation_method * trace: add rlp backward compatibility tests * cleanup * i know, i hate backwards compatibility too * address review grumbles * update CHANGELOG.md * just to make sure we don't screw up this again (#11455) * Update CHANGELOG.md Co-authored-by: Talha Cross <[email protected]> Co-authored-by: Niklas Adolfsson <[email protected]> Co-authored-by: Martin Holst Swende <[email protected]> Co-authored-by: David <[email protected]> Co-authored-by: Andronik Ordian <[email protected]> Co-authored-by: Denis S. Soldatov aka General-Beck <[email protected]>
As networking and db access happen on the same thread, queries for state data can take a long while. To mitigate this, this PR makes the read lock more granular and sets a timeout of 2 seconds to acquire the lock. It also aborts GetNodeData requests early if a single query takes more than 50ms or the whole batch has taken more than 5 seconds to complete.
GetNodeData requests are mainly coming from FastSync nodes (only geth clients?) and when the requested hash is not present in the DB they can take a very long time (up to 220 seconds) and cause stalls and peering issues. It is unclear exactly why we are slow in these cases and this PR does not address the root cause; in my tests it does seem to improve responsiveness significantly though. It's worth noting that the protocol specs mandate nodes to implement
GetNodeData
but implementors are free to return incomplete responses.It is unclear to me why geth nodes are asking for inexistent data but it could simply be reorgs?