Seed Nodes Restart after 1 hr due to ErrorNodeGraphEmptyDatabase #398

Closed
CMCDragonkai opened this issue Jul 4, 2022 · 1 comment · Fixed by #402
Labels
bug Something isn't working r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

CMCDragonkai commented Jul 4, 2022

Describe the bug

Based on testing of the seed node being deployed in #396, I've discovered a bug involving seed nodes.

Almost exactly 1 hr after the seed node starts, it errors out with ErrorNodeGraphEmptyDatabase and the agent crashes. It does not shut down gracefully; it just crashes outright.

ECS currently auto-restarts the agent by starting a new service task and then draining the old container.

The seed nodes are being started with `PK_SEED_NODES=''`, which makes the list of seed nodes empty. This is expected, because the first seed node doesn't have any other seed nodes to contact.

It appears there's a bug in the way the node graph operates: an empty node graph should not cause the entire agent to shut down, let alone crash. I believe this is an uncaught exception. At the moment, uncaught exceptions lead to a crash and not a graceful shutdown.

I think we should also enable a graceful shutdown if an uncaught exception occurs, as much as possible.
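
As a rough illustration of the graceful-shutdown idea, here is a minimal sketch. The `Stoppable` interface and `makeFatalHandler` helper are hypothetical names invented for this example and are not Polykey's actual API; the only assumption is that the agent exposes some asynchronous `stop()` method.

```typescript
// Hypothetical stand-in for the agent; only an async stop() is assumed.
interface Stoppable {
  stop(): Promise<void>;
}

// Builds a handler that attempts one graceful stop and then exits.
// `exit` is injected so the sketch can be exercised without killing the process.
function makeFatalHandler(
  agent: Stoppable,
  exit: (code: number) => void,
): (err: unknown) => Promise<void> {
  let handling = false;
  return async (err: unknown): Promise<void> => {
    if (handling) return; // ignore errors raised while already shutting down
    handling = true;
    console.error('Uncaught exception, attempting graceful shutdown:', err);
    try {
      await agent.stop();
    } catch (stopErr) {
      console.error('Graceful shutdown failed:', stopErr);
    } finally {
      exit(1);
    }
  };
}

// Possible wiring in a Node.js process (commented out to avoid side effects):
// const handler = makeFatalHandler(agent, (code) => process.exit(code));
// process.on('uncaughtException', handler);
// process.on('unhandledRejection', handler);
```

The `handling` flag keeps a second error during shutdown from triggering a nested stop attempt.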

Using `aws logs tail /ecs/polykey-testnet --follow --since 2d` we can see these logs:

2022-07-04T05:07:10.202000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"pid":1,"nodeId":"v7chij8ilv66tdhs7200nro614om557ltrn5j9llsg7il8s6rle50","clientHost":"0.0.0.0","clientPort":1315,"agentHost":"127.0.0.1","agentPort":36233,"proxyHost":"0.0.0.0","proxyPort":1314,"forwardHost":"127.0.0.1","forwardPort":39983,"recoveryCode":"voice cube genuine wide leg negative shallow auto fatigue gentle engage burst clarify virtual smoke category noodle two frog rack lyrics trap idea couch"}
2022-07-04T06:07:09.162000+00:00 ecs/polykey-testnet/c350e6a6ff25421db7f806629e639a47 {"type":"ErrorNodeGraphEmptyDatabase","data":{"message":"","timestamp":"2022-07-04T06:07:09.160Z","data":{},"stack":"ErrorNodeGraphEmptyDatabase\n    at constructor_.getClosestGlobalNodes (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:344:19)\n    at async constructor_.findNode (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeConnectionManager.js:309:18)\n    at async constructor_.refreshBucket (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:424:9)\n    at async constructor_.startRefreshBucketQueue (/lib/node_modules/@matrixai/polykey/dist/nodes/NodeManager.js:531:17)","description":"NodeGraph database was empty","exitCode":64}}
2022-07-04T06:08:06.683000+00:00 ecs/polykey-testnet/2b9d1c083fe9436392dc198f886f3cb1 INFO:PolykeyAgent:Creating PolykeyAgent

As you can see, there are 3 messages. The first occurs at T05:07:10 and the second at T06:07:09. The second message is the structured exception being logged. Both belong to the initial task c350e6a6ff25421db7f806629e639a47.

The third message shows a new service task being launched: 2b9d1c083fe9436392dc198f886f3cb1.
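
To make the timing concrete, the gap between the first two log lines can be computed directly. This is only arithmetic on the timestamps shown in the logs, trimmed to millisecond precision. The gap comes out just under one hour, which is consistent with a periodic task firing on a roughly one-hour timer; that interval is an inference and is not stated in the logs.

```typescript
// Timestamps copied from the first two log lines, trimmed to milliseconds.
const started = Date.parse('2022-07-04T05:07:10.202Z');
const crashed = Date.parse('2022-07-04T06:07:09.162Z');

const elapsedMs = crashed - started;
const minutes = Math.floor(elapsedMs / 60000);
const seconds = (elapsedMs % 60000) / 1000;

console.log(`${minutes} min ${seconds} s`); // prints "59 min 58.96 s"
```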

To Reproduce

  1. Start the agent: `pk agent start`
  2. Wait 1 hr

Expected behavior

  1. It should continue running without any problems
  2. Maybe report the error with an error log, but not as an exception
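
One way to read the expected behaviour above is that a failure in a periodic background task is logged rather than propagated. A minimal sketch under that assumption follows; `runTaskSafely` and the `Logger` type are hypothetical names for illustration, not part of Polykey.

```typescript
// Hypothetical logger shape; any logger with an error() method would do.
type Logger = { error: (message: string) => void };

// Runs one iteration of a background task. A failure is logged and reported
// through the return value instead of escaping as an exception.
async function runTaskSafely(
  task: () => Promise<void>,
  logger: Logger,
): Promise<boolean> {
  try {
    await task();
    return true;
  } catch (e) {
    const name = e instanceof Error ? e.name : 'UnknownError';
    logger.error(`Background task failed with ${name}; continuing`);
    return false;
  }
}
```

A caller that loops over tasks can use the boolean result to decide whether to retry, without the process ever seeing an uncaught exception.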

Additional context

Notify maintainers

@tegefaulkes

CMCDragonkai added the development and bug labels, then removed the development label, Jul 4, 2022
@tegefaulkes

I've fixed getClosestGlobalNodes to just return undefined in the case of an empty node graph. This should prevent findNode and refreshBucket from throwing this error. I've also added a test to confirm this behaviour for refreshBucket.
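
The shape of that change can be sketched roughly as follows. This is a heavily simplified, hypothetical stand-in and not the code in NodeConnectionManager or NodeManager: the real methods are asynchronous and search the network, while the `Map`-based graph, the direct lookup, and the boolean return of `refreshBucket` here are assumptions made purely for illustration.

```typescript
type NodeId = string;
type NodeAddress = { host: string; port: number };

// Simplified lookup: a real implementation would ask the closest known nodes.
// Here an empty graph yields undefined instead of throwing.
function getClosestGlobalNodes(
  graph: Map<NodeId, NodeAddress>,
  target: NodeId,
): NodeAddress | undefined {
  if (graph.size === 0) {
    // Before the fix, this case raised ErrorNodeGraphEmptyDatabase.
    return undefined;
  }
  return graph.get(target);
}

// The caller treats "nothing found" as a normal outcome and carries on.
function refreshBucket(
  graph: Map<NodeId, NodeAddress>,
  target: NodeId,
): boolean {
  const address = getClosestGlobalNodes(graph, target);
  return address !== undefined;
}
```

With an empty graph, `refreshBucket` simply reports that no node was found, so a lone seed node keeps running.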

tegefaulkes added a commit that referenced this issue Jul 8, 2022
…deGraphEmptyDatabase` with empty network

`getClosestGlobalNodes` was throwing `ErrorNodeGraphEmptyDatabase` when it failed to get new nodes during the search process. Now it just returns undefined as expected.

Related #398
tegefaulkes added commits with the same message that referenced this issue Jul 12, 2022 and Jul 13, 2022
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 10, 2023