Seed Nodes Restart after 1 hr due to `ErrorNodeGraphEmptyDatabase` #398

CMCDragonkai · 2022-07-04T07:55:32Z

Describe the bug

Based on testing of the seed node being deployed in #396, I've discovered a bug involving seed nodes.

After the seed node starts, it runs for exactly 1 hr and then errors out with ErrorNodeGraphEmptyDatabase. The agent does not shut down gracefully; it crashes outright.

The ECS currently auto-restarts by starting a new service task and then drains the old container.

The seed nodes are being started with PK_SEED_NODES='', which gives them an empty seed node list. This is expected, because the first seed node doesn't have any other seed nodes to contact.

It appears there's a bug in the way the node graph operates: an empty node graph should not cause the entire agent to shut down, let alone crash. I believe this is an uncaught exception. Atm, uncaught exceptions lead to a crash rather than a graceful shutdown.

I think we should also enable a graceful shutdown if an uncaught exception occurs, as much as possible.
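A minimal sketch of what a graceful shutdown on an uncaught exception could look like. The `Stoppable` interface, `makeFatalHandler`, and the exit code of 1 are assumptions made for illustration only; they are not the actual Polykey agent API.

```typescript
// Hypothetical minimal interface; the real agent class has a richer API.
interface Stoppable {
  stop(): Promise<void>;
}

// Builds a handler that tries to stop the agent once, then exits.
// `exit` and `log` are injected so the handler can be exercised in tests.
function makeFatalHandler(
  agent: Stoppable,
  exit: (code: number) => void,
  log: (message: string) => void = console.error,
): (err: unknown) => Promise<void> {
  let shuttingDown = false;
  return async (err: unknown): Promise<void> => {
    // Ignore further errors raised while a shutdown is already in progress
    if (shuttingDown) return;
    shuttingDown = true;
    log(`Fatal error, attempting graceful shutdown: ${String(err)}`);
    try {
      await agent.stop();
    } catch (stopErr) {
      log(`Graceful shutdown failed: ${String(stopErr)}`);
    } finally {
      // Exit non-zero either way so the supervisor (e.g. ECS) restarts the task
      exit(1);
    }
  };
}

// Assumed wiring in the agent entry point:
// const handler = makeFatalHandler(agent, (code) => process.exit(code));
// process.on('uncaughtException', handler);
// process.on('unhandledRejection', handler);
```

The handler still exits with a non-zero code, so the restart behaviour stays the same; the difference is that the agent gets a chance to stop its services cleanly first.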

Using `aws logs tail /ecs/polykey-testnet --follow --since 2d` we can see these logs:

2022-07-04T05:07:10.202000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"pid":1,"nodeId":"v7chij8ilv66tdhs7200nro614om557ltrn5j9llsg7il8s6rle50","clientHost":"0.0.0.0","clientPort":1315,"agentHost":"127.0.0.1","agentPort":36233,"proxyHost":"0.0.0.0","proxyPort":1314,"forwardHost":"127.0.0.1","forwardPort":39983,"recoveryCode":"voice cube genuine wide leg negative shallow auto fatigue gentle engage burst clarify virtual smoke category noodle two frog rack lyrics trap idea couch"}
2022-07-04T06:07:09.162000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"type":"ErrorNodeGraphEmptyDatabase","data":{"message":"","timestamp":"2022-07-04T06:07:09.160Z","data":{},"stack":"ErrorNodeGraphEmptyDatabase\n    at constructor_.getClosestGlobalNodes (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:344:19)\n    at async constructor_.findNode (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:309:18)\n    at async constructor_.refreshBucket (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:424:9)\n    at async constructor_.startRefreshBucketQueue (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:531:17)","description":"NodeGraph database was empty","exitCode":64}}
2022-07-04T06:08:06.683000+00:00 ecs/polykey-testnet/2b9d1c083fe9436392dc198f886f3cb1 INFO:PolykeyAgent:Creating PolykeyAgent

As you can see, there are 3 messages. The first occurs at T05:07:10 and the second at T06:07:09, about 1 hr later. The second message is the structured exception being logged out. Both belong to the initial task c350e6a6ff25421db7f806629e639a47.

Afterwards, the third message shows a new service task being launched 2b9d1c083fe9436392dc198f886f3cb1.

To Reproduce

Start the agent: `pk agent start`
Wait 1 hr

Expected behavior

It should continue running without any problems
Maybe report the error with an error log, but not as an exception
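One way to get the expected behaviour is to contain failures inside the periodic task and log them instead of letting them propagate. The sketch below is only illustrative: `runContained` and `Logger` are hypothetical names, not part of the Polykey codebase.

```typescript
type Logger = (message: string) => void;

// Runs one iteration of a background task. A failure is logged and reported
// through the return value instead of being rethrown, so the caller's loop
// keeps running and can simply retry on the next interval.
async function runContained(
  name: string,
  task: () => Promise<void>,
  logError: Logger,
): Promise<boolean> {
  try {
    await task();
    return true;
  } catch (err) {
    const detail =
      err instanceof Error ? `${err.name}: ${err.message}` : String(err);
    logError(`${name} failed, will retry on next interval: ${detail}`);
    return false;
  }
}
```

A bucket refresh loop could call something like `runContained('refreshBucket', () => refresh(bucketIndex), logger)`, where `refresh` and `logger` are again placeholders.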

Additional context

Testnet Deployment via CI/CD #396 - discovered while deploying the testnet seed nodes

Notify maintainers

@tegefaulkes

tegefaulkes · 2022-07-08T03:15:28Z

I've fixed getClosestGlobalNodes to just return undefined in the case of an empty node graph. This should prevent findNode and refreshBucket from throwing this error. I've also added a test to confirm this behaviour for refreshBucket.
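The general idea, as a simplified sketch: a lookup over an empty graph has nobody to ask, so it yields undefined rather than throwing. Everything here is an assumption for illustration (numeric node ids, an in-memory `Map`, a synchronous `askPeer` callback, the helper names); the actual change is in #402.

```typescript
// Stand-in types; real node ids are not plain numbers.
type NodeId = number;
interface NodeAddress {
  host: string;
  port: number;
}

// Returns up to `limit` known nodes, ordered by XOR distance to the target.
function getClosestNodes(
  graph: Map<NodeId, NodeAddress>,
  target: NodeId,
  limit: number,
): Array<[NodeId, NodeAddress]> {
  return Array.from(graph.entries())
    .sort(([a], [b]) => (a ^ target) - (b ^ target))
    .slice(0, limit);
}

// Looks up a node locally, then asks the closest known peers.
// With an empty graph the loop body never runs and the result is undefined.
function findNode(
  graph: Map<NodeId, NodeAddress>,
  target: NodeId,
  askPeer: (peer: NodeId, target: NodeId) => NodeAddress | undefined,
): NodeAddress | undefined {
  const known = graph.get(target);
  if (known !== undefined) return known;
  for (const [peerId] of getClosestNodes(graph, target, 3)) {
    const found = askPeer(peerId, target);
    if (found !== undefined) return found;
  }
  return undefined; // not found, including the empty-graph case
}
```

Callers such as a bucket refresh then treat undefined as "nothing found this round" and carry on.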

…deGraphEmptyDatabase` with empty network `getClosestGlobalNodes` was throwing `ErrorNodeGraphEmptyDatabase` when it failed to get new nodes during the search process. Now it just returns undefined as expected. Related #398

CMCDragonkai added development Standard development bug Something isn't working and removed development Standard development labels Jul 4, 2022

CMCDragonkai assigned tegefaulkes Jul 4, 2022

CMCDragonkai mentioned this issue Jul 5, 2022

Testnet Deployment via CI/CD #396

Merged

20 tasks

tegefaulkes mentioned this issue Jul 8, 2022

getClosestGlobalNodes, findNode error fix #402

Merged

6 tasks

CMCDragonkai mentioned this issue Jul 12, 2022

ci: merge staging to master #411

Merged

CMCDragonkai assigned emmacasolin Jul 12, 2022

tegefaulkes closed this as completed in #402 Jul 13, 2022

CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 10, 2023
