Describe the bug
Based on testing of the seed node being deployed in #396, I've discovered a bug involving seed nodes.
Exactly 1 hr after the seed node starts, it errors out with `ErrorNodeGraphEmptyDatabase` and the agent crashes. It does not shut down gracefully; it just crashes.
ECS currently auto-restarts the service by starting a new service task and then draining the old container.
The seed nodes are being started with `PK_SEED_NODES=''`, which makes the list of seed nodes empty. This is intentional, because the first seed node doesn't have any other seed nodes to contact.
It appears there's a bug in the way the node graph is operating: an empty node graph should not cause the entire agent to shut down, let alone crash. I believe this is an uncaught exception. At the moment, uncaught exceptions lead to a crash rather than a graceful shutdown.
I think we should also enable a graceful shutdown, as far as possible, when an uncaught exception occurs.
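To make the graceful-shutdown suggestion concrete, here is a minimal sketch of a last-resort handler for Node.js. All names in it (`handleFatalError`, `installFatalHandlers`, `StopFn`, `ExitFn`) are hypothetical and not taken from the Polykey codebase; it assumes the agent exposes some async stop routine and that errors may carry a numeric `exitCode`, as the logged exception does.

```typescript
// Hypothetical sketch: none of these names come from the Polykey codebase.
type StopFn = () => Promise<void>;
type ExitFn = (code: number) => void;

/**
 * Attempts a bounded graceful stop after a fatal error, then exits.
 * Uses the error's numeric `exitCode` when present, otherwise 1.
 */
async function handleFatalError(
  error: unknown,
  stop: StopFn,
  exit: ExitFn,
  timeoutMs: number = 5000,
): Promise<number> {
  const candidate = (error as { exitCode?: unknown } | null)?.exitCode;
  const code = typeof candidate === 'number' ? candidate : 1;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, timeoutMs);
  });
  try {
    // Give the stop routine a limited amount of time to finish
    await Promise.race([stop(), timeout]);
  } catch (stopError) {
    // A failed stop must not mask the original error
    console.error('Graceful stop failed:', stopError);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
  exit(code);
  return code;
}

// Wiring: route uncaught exceptions and unhandled rejections through the handler
function installFatalHandlers(stop: StopFn): void {
  const onFatal = (error: unknown): void => {
    console.error('Uncaught error:', error);
    void handleFatalError(error, stop, (code) => process.exit(code));
  };
  process.on('uncaughtException', onFatal);
  process.on('unhandledRejection', onFatal);
}
```

The timeout is a design choice rather than a requirement: after an uncaught exception the process state may be inconsistent, so the stop routine is only given a limited window before the process exits anyway.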
Using `aws logs tail /ecs/polykey-testnet --follow --since 2d` we can see these logs:
2022-07-04T05:07:10.202000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"pid":1,"nodeId":"v7chij8ilv66tdhs7200nro614om557ltrn5j9llsg7il8s6rle50","clientHost":"0.0.0.0","clientPort":1315,"agentHost":"127.0.0.1","agentPort":36233,"proxyHost":"0.0.0.0","proxyPort":1314,"forwardHost":"127.0.0.1","forwardPort":39983,"recoveryCode":"voice cube genuine wide leg negative shallow auto fatigue gentle engage burst clarify virtual smoke category noodle two frog rack lyrics trap idea couch"}
2022-07-04T06:07:09.162000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"type":"ErrorNodeGraphEmptyDatabase","data":{"message":"","timestamp":"2022-07-04T06:07:09.160Z","data":{},"stack":"ErrorNodeGraphEmptyDatabase\n at constructor_.getClosestGlobalNodes (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:344:19)\n at async constructor_.findNode (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:309:18)\n at async constructor_.refreshBucket (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:424:9)\n at async constructor_.startRefreshBucketQueue (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:531:17)","description":"NodeGraph database was empty","exitCode":64}}
2022-07-04T06:08:06.683000+00:00 ecs/polykey-testnet/2b9d1c083fe9436392dc198f886f3cb1 INFO:PolykeyAgent:Creating PolykeyAgent
As you can see, there are 3 messages. The first occurs at T05:07:10 and the second at T06:07:09; the second message is the structured exception being logged. Both relate to the initial task `c350e6a6ff25421db7f806629e639a47`.
Afterwards, the third message shows a new service task being launched: `2b9d1c083fe9436392dc198f886f3cb1`.
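As a quick check of the timings, the intervals between the three log lines can be computed directly. This sketch assumes the first message marks the agent start and the third marks the replacement task coming up; the timestamps are copied from the logs above and trimmed to milliseconds.

```typescript
// Timestamps copied from the three log lines, trimmed to millisecond precision
const started = Date.parse('2022-07-04T05:07:10.202Z');
const crashed = Date.parse('2022-07-04T06:07:09.162Z');
const restarted = Date.parse('2022-07-04T06:08:06.683Z');

// Time the first task ran before the exception was logged
const uptimeSeconds = (crashed - started) / 1000;
// Gap between the exception and the first log line of the new task
const gapSeconds = (restarted - crashed) / 1000;

console.log(uptimeSeconds); // 3598.96
console.log(gapSeconds); // 57.521
```

So the exception was logged about one second short of a full hour after the first message, and the new task logged its first line a little under a minute later.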
To Reproduce
Start the agent: `pk agent start`
Wait 1 hr
Expected behavior
It should continue running without any problems
Maybe report the error with an error log, but not as an exception
I've fixed `getClosestGlobalNodes` to just return `undefined` in the case of an empty node graph. This should prevent `findNode` and `refreshBucket` from throwing this error. I've also added a test to confirm this behaviour for `refreshBucket`.
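The shape of the change can be illustrated with a heavily simplified sketch. These are stand-ins, not the real Polykey implementations: the `NodeGraph` here is assumed to be a plain `Map`, the signatures are invented for illustration, and the lookup is a direct `get` rather than an actual closest-node search. Only the control flow matters: an empty graph yields `undefined`, and the refresh step logs the miss instead of throwing.

```typescript
// Hypothetical, simplified stand-ins; not the real Polykey classes or signatures.
type NodeId = string;
type NodeAddress = { host: string; port: number };
type NodeGraph = Map<NodeId, NodeAddress>;

// Stand-in for the search step: undefined means "nothing found",
// which now includes the empty-graph case instead of throwing.
async function getClosestGlobalNodes(
  graph: NodeGraph,
  targetNodeId: NodeId,
): Promise<NodeAddress | undefined> {
  if (graph.size === 0) {
    return undefined;
  }
  return graph.get(targetNodeId);
}

// Stand-in for the lookup entry point: simply propagates undefined
async function findNode(
  graph: NodeGraph,
  targetNodeId: NodeId,
): Promise<NodeAddress | undefined> {
  return await getClosestGlobalNodes(graph, targetNodeId);
}

// Stand-in for the periodic refresh: a miss is logged, not raised
async function refreshBucket(
  graph: NodeGraph,
  targetNodeId: NodeId,
  log: (message: string) => void = console.warn,
): Promise<boolean> {
  const address = await findNode(graph, targetNodeId);
  if (address === undefined) {
    log(`refreshBucket: no nodes found for ${targetNodeId}, skipping`);
    return false;
  }
  return true;
}
```

With this shape, a lone seed node started with no other seed nodes keeps running through each refresh cycle, and the empty result shows up in the logs rather than as an exception.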
…deGraphEmptyDatabase` with empty network
`getClosestGlobalNodes` was throwing `ErrorNodeGraphEmptyDatabase` when it failed to get new nodes during the search process. Now it just returns undefined as expected.
Related #398
Notify maintainers
@tegefaulkes