This issue was moved to a discussion.

Sync procedure is fundamentally broken #5270

Closed
antst opened this issue May 17, 2021 · 78 comments
Comments

@antst

antst commented May 17, 2021

You can see that there are a number of sync bugs and discussions, and they are appearing at an increasing rate.
I looked into it over the last 2 days and the outcome is simple: once the number of transactions increased, sync speed started lagging behind the growth rate of the blockchain.

In my test, which I started from scratch and which has been running for the last 36 hours, I can't catch up to the current block, and this is happening on hardware you can't blame: Threadripper, a very fast SSD for the DB, a real 1GbE connection.
This means that once a typical node gets out of sync for whatever reason, it is difficult for it to catch up. This is probably the reason for the massive surge of out-of-sync issues.

@antst antst added the bug ("Something isn't working") label May 17, 2021
@markusmazurczak

What out-of-sync timeframe are you talking about? I am able to catch up easily after being out of sync for 6 hours, with an old i5 processor and a low-budget SSD.

@antst
Author

antst commented May 17, 2021

To me it looks like it depends on your "luck" and on the peers you are connected to before you reach the peer limit.

@antst
Author

antst commented May 17, 2021

This is an example from a farmer node (no harvesters):

Mon 17 May 2021 09:38:18 AM UTC Block Height: 275398
Mon 17 May 2021 09:42:49 AM UTC Block Height: 275405

About 39 seconds per block (7 blocks in 271 seconds).
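For anyone who wants to measure their own node the same way, here is a minimal sketch that turns two such samples into an average rate. The function names are made up for illustration; only the two sample lines come from the comment. For these two samples the interval is 271 seconds over 7 blocks.

```python
from datetime import datetime

# The two samples quoted in the comment above.
SAMPLES = [
    "Mon 17 May 2021 09:38:18 AM UTC Block Height: 275398",
    "Mon 17 May 2021 09:42:49 AM UTC Block Height: 275405",
]


def parse_sample(line):
    """Split one sample into (timestamp, block height)."""
    stamp, height = line.split(" Block Height: ")
    # "UTC" is matched literally; both samples use the same zone.
    when = datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p UTC")
    return when, int(height)


def seconds_per_block(first, last):
    """Average wall-clock seconds per block between two samples."""
    t0, h0 = parse_sample(first)
    t1, h1 = parse_sample(last)
    return (t1 - t0).total_seconds() / (h1 - h0)


if __name__ == "__main__":
    print(f"{seconds_per_block(*SAMPLES):.1f} s per block")
```

Two samples only a few minutes apart give a noisy number; a longer window would be more representative.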

@markusmazurczak

Yes, that sounds legit. Take a look at #3298. I am currently trying to hard-ban peers that I can't connect to, so that the garbage collector can remove them from ChiaServer's internal peer list.

@antst
Author

antst commented May 17, 2021

Also, chia_full_node is CPU-bound: it shows 100% CPU usage during this terrible sync.

@antst
Author

antst commented May 17, 2021

Another thing I see in the logs: I am feeding known blocks to other nodes at the speed of light, and this is probably also one of the reasons for the CPU bottleneck.

@leefarg

leefarg commented May 17, 2021

I echo this same problem: I got out of sync a few days ago and have been unable to regain sync. After restarting multiple times and completely deleting the DB, I am only a little over half synced after 24 hours.

@antst
Author

antst commented May 17, 2021

One more thing: it looks like it sometimes just gets stuck on a particular block, I presume waiting for something from a particular peer and not getting it.

@tojefest

> Yes, that sounds legit. Take a look at #3298. I am currently trying to hard-ban peers that I can't connect to, so that the garbage collector can remove them from ChiaServer's internal peer list.

At the moment this solution works... I think this is only a peer problem. If I'm lucky, sync works great; with a bad peer, I'm not synced for 6 hours. Synchronization is totally broken. This isn't a problem of the DB, CPU, SSD, or network. Only the app has a big problem. Manually restarting the wallet can help, but it affects farming...

@antst
Author

antst commented May 17, 2021

Can you point me to the procedure?

@tojefest

I'm working on it...
At the moment I use a simple PowerShell script.

This is a sample :)
while ($true) {
    # Restart the wallet, then re-add the peers that synced well.
    .\chia.exe start wallet -r
    .\chia.exe show --add-connection lucky_peer_ip:8444
    .\chia.exe show --add-connection lucky_peer_ip:8444
    .\chia.exe show --add-connection node-eu.chia.net:8444
    # ping serves as a delay of roughly 16 minutes between rounds.
    ping -n 960 gatewayip
}

The app looks like it uses only the first added peer, but that is a bad solution for farming.

@antst
Author

antst commented May 17, 2021

> I'm working on it...
> At the moment I use a simple PowerShell script.
>
> This is a sample :)
> while ($true) {
>     .\chia.exe start wallet -r
>     .\chia.exe show --add-connection lucky_peer_ip:8444
>     .\chia.exe show --add-connection lucky_peer_ip:8444
>     .\chia.exe show --add-connection node-eu.chia.net:8444
>     ping -n 960 gatewayip
> }
>
> The app looks like it uses only the first added peer, but that is a bad solution for farming.

What sync speed do you get in this case, in seconds per block?

@tojefest

40 s per block, as long as I'm lucky.
Just like others, I get:
ValueError: Error short batch syncing, invalid/no response for 293426-293458
full_node full_node_server : WARNING Banning 31.30.70.129 for 10 seconds+
full_node asyncio : ERROR Task exception was never retrieved

Only a wallet restart helps for me.
A full resync with a new DB takes 30 h to 42 h.

@antst
Author

antst commented May 17, 2021

> 40 s per block, as long as I'm lucky

That is still slower than the chain grows.
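To make the "slower than the chain grows" point concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function name is made up, and the chain rate is an assumption (the mainnet target of 4608 blocks per day, which is 18.75 s per block). The gap closes only if the node processes blocks faster than new ones arrive.

```python
import math

# Assumption: mainnet targets 4608 blocks per day, i.e. 18.75 s per block.
CHAIN_S_PER_BLOCK = 86400 / 4608


def catch_up_seconds(backlog_blocks, sync_s_per_block,
                     chain_s_per_block=CHAIN_S_PER_BLOCK):
    """Seconds needed to reach the tip, or infinity if the gap never closes."""
    # Blocks gained on the chain per second of wall-clock time.
    closing_rate = 1.0 / sync_s_per_block - 1.0 / chain_s_per_block
    if closing_rate <= 0:
        return math.inf
    return backlog_blocks / closing_rate


if __name__ == "__main__":
    for rate in (40.0, 2.0, 1.107):
        hours = catch_up_seconds(10_000, rate) / 3600
        print(f"{rate:>6} s/block -> {hours:.2f} h to clear 10000 blocks")
```

Under that assumption, any sync speed slower than 18.75 s per block never reaches the tip, no matter how small the backlog is.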

@antst
Author

antst commented May 17, 2021

The protocol and the software are terminally ill.
If Chia remains popular for another week, that will be the end of it :)
It looks like there are architectural mistakes.

@antst
Author

antst commented May 17, 2021

I managed to squeeze it down to 2 seconds per block (even less), but I will not even publish this, as it will kill everything once it gets out into the wild.

@tojefest

Yes,
but as I said:
that is the only solution that works for me to stay synced. Some peers look like zombies; there are architectural mistakes in the P2P layer and a stupid app for full node connections.

@antst
Author

antst commented May 17, 2021

> Yes,
> but as I said:
> that is the only solution that works for me to stay synced. Some peers look like zombies; there are architectural mistakes in the P2P layer and a stupid app for full node connections.

There is more to it.
In fact, bad peers alone will not kill the average sync speed, from what I see. It is other stuff combined with bad peers that kills it.

@markusmazurczak

> The protocol and the software are terminally ill.
> If Chia remains popular for another week, that will be the end of it :)
> It looks like there are architectural mistakes.

Would you share your thoughts about the architectural mistakes?

@antst
Author

antst commented May 17, 2021

It looks like you either pull or give: once you share with peers who have less than you, your own sync suffers a lot. There is an architectural bottleneck somewhere in the software. Most likely, this is related to the CPU bottleneck I see.
If you make sure you don't share, then bad peers suddenly are not an issue. Their only effect is irregular growth of the height in your DB, but that doesn't affect sync speed. After every delay of 1 minute (or so), you get a large surge of delayed blocks which evens it out.

But if people stop sharing, then Chia is as good as dead.

I am approaching a sync speed of 1 s/block with a quite specific setup, which is decent. But then again, if/once the number of transactions grows 20-fold, nobody will be able to keep up, from the looks of it :)

@antst
Author

antst commented May 17, 2021

Also, from what I see, most of the people suffering from the sync issue (that I've seen so far) are from Europe, where people started to get on board later. So there is a huge pool of people who want to get synchronized, and possibly they are introduced to the "closest" European peers as the shortest route, which kills it for Europe ;)

@tojefest

libtorrent works better in the EU :)

@Noodleyman

I have kept a really close eye on it today: in the past 12 hours I've only been able to farm successfully for 1 hour. It shows "Not Synced", even though the connections to peers are established. Then it will randomly sync, farm for a short while, and die again. Trying all the tricks listed (manually connecting, clearing the historic connection data, etc.) isn't working that well anymore.

I'm getting a little frustrated with it, and it makes me wonder whether there is any point in continuing to plot at the moment. As more users join the network, it will creak further until it dies, unless something is urgently done to fix the issue. It's pointless plotting if you can't farm your plots reliably...

@antst
Author

antst commented May 17, 2021

> I have kept a really close eye on it today: in the past 12 hours I've only been able to farm successfully for 1 hour. It shows "Not Synced", even though the connections to peers are established. Then it will randomly sync, farm for a short while, and die again. Trying all the tricks listed (manually connecting, clearing the historic connection data, etc.) isn't working that well anymore.
>
> I'm getting a little frustrated with it, and it makes me wonder whether there is any point in continuing to plot at the moment. As more users join the network, it will creak further until it dies, unless something is urgently done to fix the issue. It's pointless plotting if you can't farm your plots reliably...

This is what has been happening for 2 weeks already, since the network got bigger.

And yes, the bigger it is, the lower the chance it will work.
What kills it is exactly the number of peers trying to synchronize. If the number of non-synchronized users grows above some limit, they will kill the rest of the network with their sync attempts.

Unsynced peers pulling blocks from you kill your full_node, and it can't keep up with the network anymore. Then it gets some air to breathe and sometimes manages to get in sync, but then it is killed again, as it becomes attractive to pull fresh blocks from.
Effectively, in its current form the network pulls everyone to the "average" height, not to the maximal one.
This is a misdesign of the software. In fact, total separation of inbound and outbound with meaningful synchronization would solve it.
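One way to picture the separation described above is a toy scheduler with two lanes, where serving other peers can never take more than a fixed share of the work slots while the node's own sync still has work waiting. This is purely an illustrative sketch, not how chia_full_node is structured; the class name and the `outbound_every` knob are hypothetical.

```python
from collections import deque


class TwoLaneQueue:
    """Toy work queue: one lane for own sync, one lane for serving peers."""

    def __init__(self, outbound_every=4):
        # Hypothetical knob: while inbound work is waiting, at most one
        # slot in every `outbound_every` goes to serving other peers.
        self.inbound = deque()
        self.outbound = deque()
        self.outbound_every = outbound_every
        self._slot = 0

    def next_job(self):
        """Return the next job to run, or None if both lanes are empty."""
        self._slot += 1
        serve_turn = self._slot % self.outbound_every == 0
        if self.outbound and (serve_turn or not self.inbound):
            return self.outbound.popleft()
        if self.inbound:
            return self.inbound.popleft()
        return None


if __name__ == "__main__":
    q = TwoLaneQueue(outbound_every=4)
    q.inbound.extend(f"sync-{i}" for i in range(1, 7))
    q.outbound.extend(f"serve-{i}" for i in range(1, 4))
    while (job := q.next_job()) is not None:
        print(job)
```

The point of the design is that a flood of outbound requests cannot starve the node's own sync, while an idle node still serves peers at full speed.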

@grasuu

grasuu commented May 17, 2021

Same here...
In the last 24 hours my full node was not able to catch up. It does not stop completely but runs (almost exactly) 90 minutes behind.
Like Noodleyman, I tried "all the tricks listed", without success. And my setup has worked since the release of 1.1.5.
Yep, located in Europe, too.

@antst
Author

antst commented May 17, 2021

We need a team to write a new implementation, as I doubt we can get much further with the current one. The performance target was completely missed.

@antst
Author

antst commented May 17, 2021

With my test setup I have managed to get the sync speed down to 1.107 sec per block so far (at the end of the chain, with the fattest blocks), despite doing nothing specific about "bad peers".
So, bad peers aren't the real problem.

@Noodleyman

I've tested connecting directly to a friend who is also running Chia. It connected, with no bandwidth or latency issues between my machine and his, and he's up to date. It just can't seem to process the data fast enough to get in sync.

@antst
Author

antst commented May 17, 2021

> I've tested connecting directly to a friend who is also running Chia. It connected, with no bandwidth or latency issues between my machine and his, and he's up to date. It just can't seem to process the data fast enough to get in sync.

Yep. CPU bottleneck.

@antst
Author

antst commented May 17, 2021

Stuck again at "85 blocks behind" :)
Then 77 blocks behind.

And now my log is full of
"full_node chia.full_node.full_node: WARNING Invalid response for slot None"

and the farming status switched from "Sync" to "Not synced or not connected to peers".

@antst
Author

antst commented May 17, 2021

-1: always
log_level: WARNING

It is not about what you run, it is about what some peers run. There are a number of guides and add-ons for Chia which recommend INFO.

@tojefest

ok, now i understand

@antst
Author

antst commented May 17, 2021

So, end of story. At the "-77 blocks" mark I have 108 active peers. All of them have newer blocks, but ALL of them gave me
full_node chia.full_node.full_node: WARNING Invalid response for slot None

@antst
Author

antst commented May 17, 2021

literally, ALL of them refused me

@tojefest

2021-05-17T20:03:21.395 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:21.398 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:21.400 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.390 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.582 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.727 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.729 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.731 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.981 full_node chia.full_node.full_node: WARNING Invalid response for slot None
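To see how these warnings cluster over time, a small script along these lines could bucket them per second. It is a sketch that assumes every relevant line starts with an ISO timestamp as in the excerpt above; the function name is made up, and the sample holds only the first four lines of that excerpt.

```python
from collections import Counter
from datetime import datetime

NEEDLE = "Invalid response for slot None"

# First four lines of the excerpt above, used as sample input.
SAMPLE_LOG = """\
2021-05-17T20:03:21.395 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:21.398 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:21.400 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T20:03:31.390 full_node chia.full_node.full_node: WARNING Invalid response for slot None
"""


def warnings_per_second(lines):
    """Count matching warnings, bucketed by whole second."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # Line does not start with a timestamp; skip it.
        counts[when.replace(microsecond=0)] += 1
    return counts


if __name__ == "__main__":
    for second, n in sorted(warnings_per_second(SAMPLE_LOG.splitlines()).items()):
        print(second.isoformat(), n)
```

To run it against a real log, pass the open file object instead of the sample lines.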

@tojefest

and many of these:
full_node full_node_server : WARNING Banning xx.81.175.114 for 600 seconds

@antst
Author

antst commented May 17, 2021

> and many of these:
> full_node full_node_server : WARNING Banning xx.81.175.114 for 600 seconds

How do you see those?

@antst
Author

antst commented May 17, 2021

Still managed to get to "-65".
But now every few blocks there is a delay of about 5-10 min. At this rate, even with my modified setup, which pushes hard to get it all (it accepts only peers which have something to offer and does not waste resources on sharing), it will take hours for the last bit.

@tojefest

> > and many of these:
> > full_node full_node_server : WARNING Banning xx.81.175.114 for 600 seconds
>
> How do you see those?

From the debug.log file.

@antst
Author

antst commented May 17, 2021

> Banning

Ah, yes.

Fun thing: I don't see a single peer banned twice. So it is not "a couple of evil peers".
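The "no peer banned twice" observation is easy to check against a log. Here is a minimal sketch that assumes ban lines look like the ones quoted in this thread ("Banning <ip> for <n> seconds"); the function names are made up.

```python
import re
from collections import Counter

# Matches the ban lines quoted in this thread.
BAN_RE = re.compile(r"Banning (\S+) for (\d+) seconds")

# The two ban lines that appear in the comments above.
SAMPLE_LINES = [
    "full_node full_node_server : WARNING Banning 31.30.70.129 for 10 seconds+",
    "full_node full_node_server : WARNING Banning xx.81.175.114 for 600 seconds",
]


def ban_counts(lines):
    """Count how many times each address was banned."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def banned_more_than_once(lines):
    """Addresses that show up in more than one ban line, sorted."""
    return sorted(ip for ip, n in ban_counts(lines).items() if n > 1)


if __name__ == "__main__":
    print(ban_counts(SAMPLE_LINES))
    print(banned_more_than_once(SAMPLE_LINES))
```

An empty list from `banned_more_than_once` over a long log would support the point that the failures are spread across many peers rather than caused by a few bad ones.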

@antst
Author

antst commented May 17, 2021

-55

Now the delay is happening every second block.
I presume that a few blocks further it will happen on every block and the time will grow, which means "current" will be reachable only in the theoretical limit of infinite time )))

@Noodleyman

> So, end of story. At the "-77 blocks" mark I have 108 active peers. All of them have newer blocks, but ALL of them gave me
> full_node chia.full_node.full_node: WARNING Invalid response for slot None

And from my log also:
2021-05-17T19:12:36.042 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T19:12:36.059 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T19:12:36.184 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T19:12:36.185 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T19:12:36.358 full_node chia.full_node.full_node: WARNING Invalid response for slot None
2021-05-17T19:12:36.553 full_node chia.full_node.full_node: WARNING Invalid response for slot None

So, there is nothing we can really do from a client perspective at the moment. I've ruled out a CPU issue on my rig. We're at the mercy of the peer we connect to, and of whether that peer wants to give away some data.

I'm not sure how this would improve at the moment. If those peers who are fully in sync are being bottlenecked, then it is just going to get worse as more machines come online.

@antst
Author

antst commented May 17, 2021

Nope, it is a CPU issue, but at this point on their side :)

  1. The CPU limit makes it slow to get into "proximity to current" on the syncing client's side.
  2. If the local limit is removed, you hit the CPU limit of the peers who are "on the edge", and we can indeed do nothing about that.

@antst
Author

antst commented May 17, 2021

I am at "-39" now.

@antst
Author

antst commented May 17, 2021

What is happening is that I need to get lucky for one of the peers to respond. So it keeps going through the list of peers (10 seconds for each, or whatever it is) until it gets one which responds. The closer to the "edge", the more difficult it is to find one which has the resources to respond.
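A rough way to put numbers on this: if peers are tried one after another, each try costs one full timeout when the peer does not answer, and each peer answers independently with some probability, then the time lost before the first answer follows a geometric distribution. The sketch below only encodes that simplified model; the function name and the probabilities are illustrative, not measured.

```python
def expected_wait_seconds(timeout_s, p_respond):
    """Mean time lost to timeouts before the first peer answers."""
    if not 0 < p_respond <= 1:
        raise ValueError("p_respond must be in (0, 1]")
    # Failures before the first success follow a geometric distribution
    # with mean (1 - p) / p; each failure costs one full timeout.
    return timeout_s * (1 - p_respond) / p_respond


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.02):
        wait = expected_wait_seconds(10, p)
        print(f"p={p}: about {wait:.0f} s lost per block")
```

In this model the loss per block grows roughly as 1/p, which fits the observation that progress slows sharply near the "edge", where fewer peers have spare capacity.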

@antst
Author

antst commented May 17, 2021

"-6" blocks mark! :)

@antst
Author

antst commented May 17, 2021

So, all we can do is:

  1. no INFO or more detailed log levels
  2. reduce the number of peers which are feeding from you (really bad for the network)
  3. set a ridiculously high limit of peers
  4. configure it to not wait longer than 1-2 sec for a peer.

@antst
Author

antst commented May 17, 2021

At the -5 blocks mark there are only 15 peers which are ahead of me. And no new ones are coming.

@Noodleyman

which property is for the max wait time?

@antst
Author

antst commented May 17, 2021

So, I reached the current block.
And... not farming.
There are NO nodes at the edge anymore :) So I cannot get jobs.

@antst
Author

antst commented May 17, 2021

Funny, now all nodes in my peer list show
-SB Height: 0

@Noodleyman

so, that's a big F..

Something is VERY wrong.....

@antst
Author

antst commented May 17, 2021

                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 18.218.161.62                          50823/8444  e5c1074e... May 17 18:51:17      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 5.104.216.20                           11959/8444  b79ddedc... May 17 18:52:19      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 223.91.32.244                          65193/8444  acd9e699... May 17 18:52:14      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 178.26.141.228                         59302/8444  70f2d770... May 17 18:52:00      0.3|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 31.141.33.7                            55543/8444  0e9e4a4a... May 17 18:52:21      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 73.135.178.190                          5043/8444  ed9268b2... May 17 18:52:24      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 187.9.112.98                           65431/8444  95ece483... May 17 18:51:44      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 92.247.223.234                         43557/8444  3b564cc6... May 17 18:52:16      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 106.6.168.60                           14001/8444  6b5953dd... May 17 18:51:50      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 36.61.180.96                           16636/8444  bdbf96e7... May 17 18:52:21      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 61.152.143.65                          32448/8444  29c84b94... May 17 18:52:07      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 1.20.10.13                             20106/8444  de4f4a64... May 17 18:52:08      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 145.255.172.185                         4089/8444  754c6003... May 17 18:52:12      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 113.14.253.233                         60841/8444  aa7e7f0b... May 17 18:52:21      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 72.216.167.73                          55884/8444  56cd8b19... May 17 18:52:21      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 184.90.225.49                          50220/8444  cb36b735... May 17 18:52:22      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 125.46.169.70                           8444/8444  ef010f26... May 17 18:52:24      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 109.49.244.87                           8444/8444  f1c67678... May 17 18:52:23      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...

This is the best I could get :)
At the same time, I am at the edge. But it is no fun )

@tojefest

> which property is for the max wait time?

@antst
Author

antst commented May 17, 2021

I have no clue. I couldn't find it.

@sargonas sargonas changed the title from "[BUG] Sync procedure is fundamentally broken" to "Sync procedure is fundamentally broken" May 17, 2021
@sargonas sargonas removed the bug ("Something isn't working") label May 17, 2021
@Chia-Network Chia-Network locked and limited conversation to collaborators May 17, 2021

8 participants