Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bot randomly exiting process or going offline/unresponsive #8486

Closed
NeonWizard opened this issue Aug 13, 2022 · 85 comments · Fixed by #8927 or #8989
Closed

Bot randomly exiting process or going offline/unresponsive #8486

NeonWizard opened this issue Aug 13, 2022 · 85 comments · Fixed by #8927 or #8989

Comments

@NeonWizard
Copy link

Which package is this bug report for?

discord.js

Issue description

After around 5-6 hours of running, my discord bot decides to randomly exit with an exit code of 0. There is nothing in my code that should cause that, and adding a debug event listener shows the bot failing around a websocket reacquisition near the 6 hour mark.

I have tried running my bot with typescript, I've tried compiling the code to javascript and running that, I've tried packaging it as a docker container and running that locally, I've tried packing it as a docker container and running that on my production server. Every environment and method results in the same issue.

image

Bot code

https://github.com/NeonWizard/sockbot-discord

Logs

[WS => Shard 0] Heartbeat acknowledged, latency of 86ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 89ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 85ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 83ms.
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 84ms.
[WS => Shard 0] [RECONNECT] Discord asked us to reconnect
[WS => Shard 0] [DESTROY]
    Close Code    : 4000
    Reset         : false
    Emit DESTROYED: true
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN      
[WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
[WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 5000ms
[WS => Shard 0] [WebSocket] Clearing the close timeout.
[WS => Shard 0] [WebSocket] Close Emitted: false
[WS => Shard 0] [WebSocket] did not close properly, assuming a zombie connection.
Emitting close and reconnecting again.
[WS => Shard 0] [CLOSE]
    Event Code: 1011
    Clean     : false
    Reason    : INTERNAL_ERROR
[WS => Shard 0] Session id is present, attempting an immediate reconnect...
[WS => Shard 0] [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none
[WS => Shard 0] Setting a HELLO timeout for 20s.
[WS => Shard 0] [CONNECTED] Took 146ms
[WS => Shard 0] Clearing the HELLO timeout.
[WS => Shard 0] Setting a heartbeat interval for 41250ms.
[WS => Shard 0] [RESUME] Session REDACTED, sequence 51
[WS => Shard 0] [RESUMED] Session REDACTED | Replayed 1 events.       
[WS => Shard 0] [ResumeHeartbeat] Sending a heartbeat.
[WS => Shard 0] Heartbeat acknowledged, latency of 88ms.
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [CLOSE]
    Event Code: 4000
    Clean     : true
    Reason    :
[WS => Shard 0] Session id is present, attempting an immediate reconnect...
[WS => Shard 0] An open connection was found, attempting an immediate identify.
[WS => Shard 0] [RESUME] Session REDACTED, sequence 52
root@ruby:/home/spooky#

Code sample

https://github.com/NeonWizard/sockbot-discord

Package version

[email protected]

Node.js version

v16.16.0, [email protected]

Operating system

Windows (WSL), Linux (Ubuntu), Docker (node:lts-alpine)

Priority this issue should have

High (immediate attention needed)

Which partials do you have configured?

No Partials

Which gateway intents are you subscribing to?

Guilds, GuildMessages, GuildMessageReactions, MessageContent

I have tested this issue on a development release

No response

@zLupa
Copy link
Contributor

zLupa commented Aug 25, 2022

Having the same error, some logs:
obfuscated

@adam-a
Copy link

adam-a commented Aug 27, 2022

Having the same issue here :/

@zLupa
Copy link
Contributor

zLupa commented Aug 28, 2022

Having the same issue here :/

I still haven't found the issue, it's driving me crazy.
I'll try to roll back to some commits to see if it fixes it.

@budsta95
Copy link

budsta95 commented Sep 1, 2022

Are you self hosting or is your bot running in the cloud somewhere? Asking because I have 3 bots all running the same source code, 2 run on a raspberry pi and 1 runs in repl.it. Only the one on repl is the one that gets this "Discord asked us to reconnect" debug event.

@NeonWizard
Copy link
Author

budsta95

As specified in my ticket, I've ran in multiple environments and with multiple methods, with the same result.

  • Locally (WSL)
    • Running typescript directly with tsc
    • Compiling typescript and running the JS directly
    • Compiling typescript, bundling into docker container running node:lts-alpine, executing container
  • Production server (Ubuntu 20.04)
    • Running same docker container from local

@zLupa
Copy link
Contributor

zLupa commented Sep 3, 2022

Running on Oracle Cloud with Docker, but the error also occurs on the host machine.
I'll try to test another cloud to see if it occurs.

@AsylumOfDeads
Copy link

AsylumOfDeads commented Sep 4, 2022

Same error

:[WS => Shard 0] [RECONNECT] Discord asked us to reconnect
[WS => Shard 0] [DESTROY]
    Close Code    : 4000
    Reset         : false
    Emit DESTROYED: true
1[WS => Shard 0] Clearing the heartbeat interval.
Y[WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
F[WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
_[WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 5000ms
8[WS => Shard 0] [WebSocket] Clearing the close timeout.
[WS => Shard 0] [CLOSE]
    Event Code: 4000
    Clean     : true
    Reason    : 
L[WS => Shard 0] Session id is present, attempting an immediate reconnect...
[WS => Shard 0] [CONNECT]
+    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none
1[WS => Shard 0] Setting a HELLO timeout for 20s.
&[WS => Shard 0] [CONNECTED] Took 50ms
,[WS => Shard 0] Clearing the HELLO timeout.
:[WS => Shard 0] Setting a heartbeat interval for 41250ms.
Q[WS => Shard 0] [RESUME] Session 5853808d345fb8a36a01cc64490a424c, sequence 1723
4[WS => Shard 0] [INVALID SESSION] Resumable: false.
@[WS => Shard 0] Failed to connect to the gateway, requeueing...
@[WS => Manager] Shard Queue Size: 1; continuing in 5 seconds...
P[WS => Shard 0] An open connection was found, attempting an immediate identify.
9[WS => Shard 0] [IDENTIFY] Shard 0/1 with intents: 33283
1[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [CLOSE]
    Event Code: 4004
    Clean     : true
'    Reason    : Authentication failed.
%[WS => Shard 0] AuthenticationFailed
�[WS => Manager] Couldn't reconnect or fetch information about the gateway. Error [TokenInvalid]: An invalid token was provided.
C[WS => Manager] Possible network error occurred. Retrying in 5s...

@NeonWizard
Copy link
Author

NeonWizard commented Sep 6, 2022

@AsylumOfDeads

Same error

Slightly different error. You're exiting with an error and "Authentication failed". If you check my logs there is no raised error or authentication failure. It just exits silently. I believe these are two distinct bugs.

@chimpdev
Copy link

chimpdev commented Sep 6, 2022

same

@zLupa
Copy link
Contributor

zLupa commented Sep 11, 2022

Running on Oracle Cloud with Docker, but the error also occurs on the host machine. I'll try to test another cloud to see if it occurs.

I have been running my bot for the past 4 days without silent disconnects, it would be 7 days, but I restarted it and didn't disconnect yet, but it still has some "INTERNAL_ERRORS" in the logs. I'll keep monitoring.

And btw, I moved my instance to another region, currently in UK.

@NeonWizard
Copy link
Author

NeonWizard commented Sep 13, 2022

As of last night, this problem has changed - now the bot doesn't actually exit when the issue occurs. The code keeps silently running but the discord account goes offline and unresponsive. The log has slightly changed, towards the end the final event is now a 1006 instead of a 4000, which is why I believe discord.js isn't exiting anymore.

debug: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat. {"timestamp":"2022-09-13 02:59:10"}
debug: [WS => Shard 0] Heartbeat acknowledged, latency of 86ms. {"timestamp":"2022-09-13 02:59:10"}
debug: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat. {"timestamp":"2022-09-13 02:59:51"}
debug: [WS => Shard 0] Heartbeat acknowledged, latency of 88ms. {"timestamp":"2022-09-13 02:59:51"}
debug: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat. {"timestamp":"2022-09-13 03:00:32"}
debug: [WS => Shard 0] Heartbeat acknowledged, latency of 92ms. {"timestamp":"2022-09-13 03:00:33"}
debug: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat. {"timestamp":"2022-09-13 03:01:14"}
debug: [WS => Shard 0] Heartbeat acknowledged, latency of 99ms. {"timestamp":"2022-09-13 03:01:14"}
debug: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat. {"timestamp":"2022-09-13 03:01:55"}
debug: [WS => Shard 0] [HeartbeatTimer] Didn't receive a heartbeat ack last time, assuming zombie connection. Destroying and reconnecting.
    Status          : 0
    Sequence        : 847
    Connection State: OPEN {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] [DESTROY]
    Close Code    : 4009
    Reset         : true
    Emit DESTROYED: true {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] Clearing the heartbeat interval. {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 5000ms {"timestamp":"2022-09-13 03:02:36"}
debug: [WS => Shard 0] [WebSocket] Clearing the close timeout. {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] [WebSocket] Close Emitted: false {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] [WebSocket] did not close properly, assuming a zombie connection.
Emitting close and reconnecting again. {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] [CLOSE]
    Event Code: 1011
    Clean     : false
    Reason    : INTERNAL_ERROR {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] Setting a HELLO timeout for 20s. {"timestamp":"2022-09-13 03:02:41"}
debug: [WS => Shard 0] [CONNECTED] Took 15721ms {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] Clearing the HELLO timeout. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] Setting a heartbeat interval for 41250ms. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] [IDENTIFY] Shard 0/1 with intents: 34305 {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] [READY] Session REDACTED. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] [ReadyHeartbeat] Sending a heartbeat. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] Shard received all its guilds. Marking as fully ready. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] Heartbeat acknowledged, latency of 116ms. {"timestamp":"2022-09-13 03:02:57"}
debug: [WS => Shard 0] Clearing the heartbeat interval. {"timestamp":"2022-09-13 03:03:06"}
debug: [WS => Shard 0] [CLOSE]
    Event Code: 1006
    Clean     : false
    Reason    :  {"timestamp":"2022-09-13 03:03:06"}
debug: [WS => Shard 0] Session id is present, attempting an immediate reconnect... {"timestamp":"2022-09-13 03:03:06"}
debug: [WS => Shard 0] An open connection was found, attempting an immediate identify. {"timestamp":"2022-09-13 03:03:06"}
debug: [WS => Shard 0] [RESUME] Session REDACTED, sequence 3 {"timestamp":"2022-09-13 03:03:06"}

@hiboudev
Copy link

hiboudev commented Sep 13, 2022

I have a similar issue since I upgraded to v14. After 3 to 5 days of uptime the bot goes offline with this logs :

sept. 12 13:16:29 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
sept. 12 13:16:29 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Heartbeat acknowledged, latency of 100ms.
sept. 12 13:17:10 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [HeartbeatTimer] Didn't receive a heartbeat ack last time, assuming zombie connection. Destroying and reconne
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Status          : 0
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Sequence        : 8308
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Connection State: OPEN
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [DESTROY]
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Close Code    : 4009
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Reset         : true
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:     Emit DESTROYED: true
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Clearing the heartbeat interval.
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
sept. 12 13:17:51 pokpok.fr nodejs-example[3341]:         Timeout: 5000ms
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] Clearing the close timeout.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] Close Emitted: false
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [WebSocket] did not close properly, assuming a zombie connection.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: Emitting close and reconnecting again.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [CLOSE]
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Event Code: 1011
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Clean     : false
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Reason    : INTERNAL_ERROR
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [CONNECT]
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Gateway    : wss://gateway.discord.gg/
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Version    : 10
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Encoding   : json
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]:     Compression: none
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Setting a HELLO timeout for 20s.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [CONNECTED] Took 163ms
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Clearing the HELLO timeout.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Setting a heartbeat interval for 41250ms.
sept. 12 13:17:56 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [IDENTIFY] Shard 0/1 with intents: 1
sept. 12 13:18:01 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [READY] Session xxx.
sept. 12 13:18:01 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [ReadyHeartbeat] Sending a heartbeat.
sept. 12 13:18:03 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Heartbeat acknowledged, latency of 2218ms.
sept. 12 13:18:04 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Shard received all its guilds. Marking as fully ready.
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Clearing the heartbeat interval.
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [CLOSE]
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]:     Event Code: 1006
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]:     Clean     : false
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]:     Reason    :
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]: [WS => Shard 0] Session id is present, attempting an immediate reconnect...
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]: [WS => Shard 0] An open connection was found, attempting an immediate identify.
sept. 12 13:18:05 pokpok.fr nodejs-example[3341]: [WS => Shard 0] [RESUME] Session xxx, sequence 649

After this, the bot is offline and there's no more log until I restart it manually.

I don't know if it's related but today I also found in the logs that Discord asked for a reconnection several times :

sept. 13 13:29:53 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
sept. 13 13:29:53 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Heartbeat acknowledged, latency of 101ms.
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [RECONNECT] Discord asked us to reconnect
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [DESTROY]
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]:     Close Code    : 4000
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]:     Reset         : false
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]:     Emit DESTROYED: true
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Clearing the heartbeat interval.
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
sept. 13 13:30:00 pokpok.fr nodejs-example[28681]:         Timeout: 5000ms
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [WebSocket] Clearing the close timeout.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [CLOSE]
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Event Code: 4000
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Clean     : true
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Reason    :
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Session id is present, attempting an immediate reconnect...
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [CONNECT]
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Gateway    : wss://gateway.discord.gg/
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Version    : 10
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Encoding   : json
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]:     Compression: none
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Setting a HELLO timeout for 20s.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [CONNECTED] Took 476ms
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Clearing the HELLO timeout.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Setting a heartbeat interval for 41250ms.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [RESUME] Session xxx, sequence 1888
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [RESUMED] Session xxx | Replayed 1 events.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] [ResumeHeartbeat] Sending a heartbeat.
sept. 13 13:30:01 pokpok.fr nodejs-example[28681]: [WS => Shard 0] Heartbeat acknowledged, latency of 128ms.

My bot is hosted on a VPS from OVH.

I'm thinking of downgrading to v13 to check if this is better.

@cobaltt7
Copy link
Contributor

cobaltt7 commented Sep 23, 2022

Having the same problem. My bot is hosted on https://railway.app and it went unresponsive after about 5 days of running. I upgraded to v14 at the same time I switched from Heroku to Railway so I'm not sure which is the cause here. My logs do log the "Discord asked us to reconnect" event every few hours, the last time it just didn't reconnect.

@NeonWizard
Copy link
Author

So in the meantime, what can treat this issue? How can you automatically diagnose when the bot has internally disconnected from the websocket and automatically restart the process? This issue is severe and has been drastically affecting my end users.

@NeonWizard NeonWizard changed the title Bot cleanly exits after around 5-6 hours Bot randomly exiting process or going offline/unresponsive Sep 25, 2022
@KingKophuhn
Copy link

I have the same error, but only on my home server. Everything works as it should on my pc. But I can't leave my pc on 24/7. I have the problem since I updated to discord js version 14. I've also tried just running the login script but even then I got the same error.

@andrevtlopes
Copy link

andrevtlopes commented Sep 27, 2022

I'm experiencing the same error and I'm failing to recover it when it happens, my solution was to catch this and ask the hole server to restart, I'm running it on a VPS in Brazil, my local development server is not being affected.

EDIT: I decided to go back to v13 and now it's working as expected again.

@zLupa
Copy link
Contributor

zLupa commented Oct 1, 2022

Having this issue on another bot, nothing yet? And also, this error is not occurring on v13

@djipgen
Copy link

djipgen commented Oct 1, 2022

@andrevtlopes @zLupa Which exact version of v13 are you using? I have tried it now with v13.10.3 and has still the same problem.

@legendhimself
Copy link
Contributor

try rolling back to 13.9.1 and see if it's better.

@zLupa
Copy link
Contributor

zLupa commented Oct 1, 2022

@andrevtlopes @zLupa Which exact version of v13 are you using? I have tried it now with v13.10.3 and has still the same problem.

13.8.1
There's probably some commit in v13 and v14 that's messing with websocket?

@Dossar
Copy link

Dossar commented Oct 7, 2022

My Findings and Pinging Test Bot

@NeonWizard So to test the theory of Discord 14.x potentially having issues with stability (bots remaining online), I went and created a ping bot that uses discord 14.5.0 which pinged a discord channel every 15 minutes it was online. I also logged to the discord channel whenever the bot logged in -- and it had a setInterval to try automatically destroying the client and instantiating a new one.

Here's a log of today's results, starting from 12:25 AM. You'll notice that no more than 3 pings are sent when the bot remained online (meaning it disconnected around an hour in, since 3 pings at 15 minutes each is 45 minutes). So it looks like there might be some timer that's disconnecting the bot after an hour. Redacted the bot name and tag.

<PingingBot>
BOT
 — Today at 12:25 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 12:40 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 12:55 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:10 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:29 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 1:43 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:58 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 2:13 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:49 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 7:04 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:19 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:34 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:34 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 8:49 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 10:14 AM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 10:29 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 10:44 AM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:29 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 1:44 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 1:59 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 2:14 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 3:39 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 3:54 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:09 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:24 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 4:44 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 4:59 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:14 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:29 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 5:49 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 6:04 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:19 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 6:34 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 7:59 PM
<PingingBot>#0000 successfully logged in!

<PingingBot>
BOT
 — Today at 8:14 PM
<PingingBot>#0000 successfully logged in!
NEW

<PingingBot>
BOT
 — Today at 8:29 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:44 PM
15 minute ping interval successfully sent.

<PingingBot>
BOT
 — Today at 8:59 PM
15 minute ping interval successfully sent.

I stopped gathering data at 10:11 PM. So this bot was offline for an hour after that last ping.

Upon looking into the client instance, I saw something with a timeout of one hour -- going by the observation that my bot did not give more than 3 pings.

  sweepers: Sweepers {
    options: { threads: [Object] },
    intervals: {
      applicationCommands: null,
      bans: null,
      emojis: null,
      invites: null,
      guildMembers: null,
      messages: null,
      presences: null,
      reactions: null,
      stageInstances: null,
      stickers: null,
      threadMembers: null,
      threads: Timeout {
        _idleTimeout: 3600000,
        _idlePrev: [TimersList],
        _idleNext: [Timeout],
        _idleStart: 3686,
        _onTimeout: [Function (anonymous)],
        _timerArgs: undefined,
        _repeat: 3600000,
        _destroyed: false,
        [Symbol(refed)]: false,
        [Symbol(kHasPrimitive)]: false,
        [Symbol(asyncId)]: 8,
        [Symbol(triggerId)]: 1
      },
      users: null,
      voiceStates: null
    }
  },

Under rest.requestManager there also looked to be handlerTimer which had an idleTimeout value of one hour.

      handlerTimer: Timeout {
        _idleTimeout: 3600000,
        _idlePrev: [Timeout],
        _idleNext: [TimersList],
        _idleStart: 3393,
        _onTimeout: [Function (anonymous)],
        _timerArgs: undefined,
        _repeat: 3600000,
        _destroyed: false,
        [Symbol(refed)]: false,
        [Symbol(kHasPrimitive)]: false,
        [Symbol(asyncId)]: 7,
        [Symbol(triggerId)]: 1
      },

When my pinging bot disconnected, on replit this is what I saw in the console:

Preparing to connect to the gateway...
> DEBUG: Guilds we are connected to listed below
(end guild output)

> DEBUG: options => shardCount 1
> DEBUG: presence => status online
client.user is null (bot is not logged in).

This means that the client status was online, just that the user is null (i.e. bot is not logged in).

So it's not an issue with the client instance from what I see, it's that the user (bot) is getting disconnected. And the idle timeout value of one hour looks suspicious to me considering the three 15 minute pings I was seeing at most at any given time the bot was online.

Not only that, but even when my code was checking to log back in every 2 minutes, the bot could take over an hour to go online again.

What a successful login looked like

When I was actually able to log in the bot, there were several web socket manager debug logs. I removed the other statements to focus on just the WS output.

[WS => Manager] Fetched Gateway Information
    URL: wss://gateway.discord.gg
    Recommended Shards: 1

[WS => Manager] Session Limit Information
    Total: 1000
    Remaining: 997

[WS => Manager] Spawning shards: 0

[WS => Shard 0] [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 10
    Encoding   : json
    Compression: none

[WS => Shard 0] Setting a HELLO timeout for 20s.

[WS => Shard 0] [CONNECTED] Took 123ms

[WS => Shard 0] Clearing the HELLO timeout.

[WS => Shard 0] Setting a heartbeat interval for 41250ms.

[WS => Shard 0] [IDENTIFY] Shard 0/1 with intents: 33283

[WS => Shard 0] [READY] Session <hash>.

[WS => Shard 0] [ReadyHeartbeat] Sending a heartbeat.

[WS => Shard 0] Shard received all its guilds. Marking as fully ready.

===
Logged in as <PingingBot>#0000!
===

[WS => Shard 0] Heartbeat acknowledged, latency of 34ms.

[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.

[WS => Shard 0] Heartbeat acknowledged, latency of 25ms.

Note: WS is WebSocketManager
https://discord.js.org/#/docs/discord.js/main/class/WebSocketManager
https://discord.js.org/#/docs/discord.js/main/class/WebSocketShard

  ws: WebSocketManager {
    _events: [Object: null prototype] {},
    _eventsCount: 0,
    _maxListeners: undefined,
    gateway: 'wss://gateway.discord.gg/',
    totalShards: 1,
    shards: Collection(1) [Map] { 0 => [WebSocketShard] },
    status: 3,
    destroyed: false,
    reconnecting: false,
    [Symbol(kCapture)]: false
  },

Findings in Thread

Some other findings in the this thread:

  • @zLupa mentioned discordjs 13.8.1 is not showing these issues.
  • The person in the OP was using [email protected]
  • @djipgen reported v13.10.3 has the same problem.

The common data point seems to be that somewhere between v13.8.1 and v13.10.3, something introduced this issue of bots disconnecting from the client. And then that ended up getting into v14.x

You can see the diff between v13.8.1 and v13.10.3 here:
13.8.1...13.10.3

@legendhimself
Copy link
Contributor

legendhimself commented Oct 7, 2022

@Dossar That's a nice test. The Internet is wack there is no way that any connection is gonna be active 24/7. I mean there is a reason why people use TCP over UDP even tho UDP is way faster than TCP. TCP helps them get back the lost packets. I said this to show you that the internet is not 100% reliable but we found ways to make it reliable. In this case, your bot was actually connected to the Discord API and it gets disconnected and gets reconnected back and this is normal.
The disconnect -> reconnect -> resume happens in a few minutes or instantly sometimes and sometimes it's not resumable so it won't receive the missed events.

I have a bot with 60k+ servers and we have a hell lot of interactions we have around 2-5k active users playing the bot at any given time. It's a game bot so this kind of activity is expected.
From what I've seen staying below v14 is working for us even tho there weren't any or no major changes in Ws codes.
We get tons of disconnects and reconnects similar to the user who created the issue.

Below are the logs of a successful reconnect

[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RECONNECT] Discord asked us to reconnect
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [DESTROY]
    Close Code    : 4000
    Reset         : false
    Emit DESTROYED: true
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Clearing the heartbeat interval.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Close: Tried closing. | WS State: CLOSING
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
        Timeout: 2500ms
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [WebSocket] Clearing the close timeout.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CLOSE]
    Event Code: 4000
    Clean     : true
    Reason    : 
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Session id is present, attempting an immediate reconnect...
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CONNECT]
    Gateway    : wss://gateway.discord.gg/
    Version    : 9
    Encoding   : json
    Compression: none
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Setting a HELLO timeout for 20s.
Message from shard | Job: getFanArray
Message from shard | Job: remaining
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [CONNECTED] Took 158ms
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Clearing the HELLO timeout.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Setting a heartbeat interval for 41250ms.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RESUME] Session c3bc0d408809a14ee812583979cfb462, sequence 17901
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [RESUMED] Session c3bc0d408809a14ee812583979cfb462 | Replayed 1 events.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | [ResumeHeartbeat] Sending a heartbeat.
[ DJS ]: | Fri, 07 Oct 2022 21:38:09 | debug | [WS => Shard 26] | Heartbeat acknowledged, latency of 105ms.

But there are times at which just after the above log ended the WS again sends a CLOSE event similar to the user who created this issue and it again goes through the reconnect process and reconnects. This usually takes a few minutes.
From what I've known by talking to some of the Discord devs, Discord actually uses some external package for all the WebSocket-related stuff and they don't have it natively. So even the Discord devs have no idea of the weird disconnects and reconnect. Before this DJS library didn't handle the 4009 WS close code sent by Discord which caused people to have zombie shards [process is alive but not connected to the gateway until the process restarted]. Discord still sends 4009 close code but now the DJS handles the reconnect.

But if the reconnect is taking several minutes and causing huge downtimes then it might be a bug inside of the DJS. We don't have any such issues with our bot and we are staying below v14 [13.9.1] until the discord/ws package is fully ready and launched. We won't have the package in production until a few weeks of testing.

To sum it up, I think DJS can only work towards better handling and reconnecting to the Discord API. We cannot control the disconnects which are caused by many reasons [Internet, Discord, etc].

@NeonWizard
Copy link
Author

@legendhimslef

I think this ticket is starting to deviate from my issue. In my instance (the OP), it's not reconnecting at all every now and then. Most of the time DJS handles the websocket reacquisition fine, but sometimes it doesn't and my application becomes a zombie. The bot doesn't come back online until the entire process is restarted.

@legendhimself
Copy link
Contributor

legendhimself commented Dec 15, 2022

rather than submitting a PR here to fix the issue not just for them

@kyranet I just wanted the fixes to be tested before I submitted the pr. That's the only reason I asked them to use my pr fork to "test"

@DraftProducts
Copy link
Contributor

DraftProducts commented Dec 16, 2022

On my side, I have good results with the combination of :

I think @legendhimself's Pull Request could now be submitted.
In my opinion, we could also update the documentation for <ClientOptions>.closeTimeout, to better understand how proportional this value is to the ratio of the number of shards and the connection, which requires a different configuration based on that ratio.

The current default value is more than reasonable for small bots, but becomes unsustainable for slightly larger bots (or ratio too low).

It seems that the amount of data received also affects this ratio as if the Discord API puts the data to be sent to you in a global queue, and they are limited by how much they can send you in X amount of time which would make us receive the heartbeats later and therefore need more time before defining a shard connection as zombie.

I had observed that I was having a lot more trouble receiving GUILD_MEMBERS_CHUNK after a global fetch of members with presence intent enabled. This could potentially be related to this.

@Jiralite
Copy link
Member

Got a report stating the linked pull request didn't resolve the issue. Feel free to open a pull request @legendhimself and we'll see how things go from there.

@l1v0n1
Copy link

l1v0n1 commented Feb 12, 2023

doesn't solved for me

@legendhimself
Copy link
Contributor

legendhimself commented Feb 12, 2023

doesn't solved for me

There was discord related issue today which caused many shards to go into a zombie connection which automatically got reconnected for us, There was no process exits though.

What about you,? Please provide logs so we can try to patch it for time being if there is any.
By the way even if we find some issue in this pr, all we can do is wait for the #9099 pr to complete

@l1v0n1
Copy link

l1v0n1 commented Feb 12, 2023

i would be happy to provide logs but nothing happens, day after day, shard sends his heartbeat for the last time and so on a new

@legendhimself
Copy link
Contributor

logs but nothing happens, day after day, shard sends his heartbeat for the

okay, I couldn't understand what you are trying to say but logs can give a rough idea of whats happening.

@l1v0n1
Copy link

l1v0n1 commented Feb 12, 2023

logs but nothing happens, day after day, shard sends his heartbeat for the

okay, I couldn't understand what you are trying to say but logs can give a rough idea of whats happening.

https://pastebin.com/R5BU0Nwq
https://pastebin.com/wMvSVGdQ

@legendhimself
Copy link
Contributor

interesting.

@ZackMercury
Copy link

ZackMercury commented Feb 12, 2023

I started getting exits with status 0 today which are classified by systemctl as service success, after 3-5 days of running the bot with no visible disconnections.
I configured the process to restart on failure, but idk what to do if the process either goes zombie mode like described above or if it exits with status 0.

This is discouraging.

@NeonWizard
Copy link
Author

I started getting exits with status 0 today which are classified by systemctl as service success, after 3-5 days of running the bot with no visible disconnections. I configured the process to restart on failure, but idk what to do if the process either goes zombie mode like described above or if it exits with status 0.

This is discouraging.

I was having both of those issues (exiting with status code 0 and going zombie mode) when I originally created this ticket. I think this should be reopened and revisited.

@ZackMercury
Copy link

ZackMercury commented Feb 13, 2023

Sorry for misunderstanding, I didn't have the zombie mode process issue yet (at least not that I know of, the bot isn't actively used), I was just refering the above.

@Vittiy
Copy link

Vittiy commented Jun 16, 2023

Got same error there, shards exit silently after a certain time and the application stop responding to interaction (djs v14)

@legendhimself
Copy link
Contributor

Got same error there, shards exit silently after a certain time and the application stop responding to interaction (djs v14)

This is a really old issue and was fixed for v13. You might need to create a new issue similar to this for v14.

@NeonWizard
Copy link
Author

Got same error there, shards exit silently after a certain time and the application stop responding to interaction (djs v14)

This is a really old issue and was fixed for v13. You might need to create a new issue similar to this for v14.

This issue was for discord v14. I'm still not sure why it's closed.

@legendhimself
Copy link
Contributor

legendhimself commented Jun 17, 2023

We fixed it for v14 and backported it for v13. The current v14 uses the new @discordjs/ws package. So if this issue is still there for the latest v14, please create a new issue.

@didinele
Copy link
Member

This issue was for discord v14. I'm still not sure why it's closed.

You submitted the bug report for v14, please open a separate issue for v13 if the issue persists there. Also, as pointed out, the entire gateway implementation has been switched out, so there's pretty much no way this is relevant anymore even for v14.

@discordjs discordjs locked as resolved and limited conversation to collaborators Jun 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet