RTM connection fails in large org but works in small org #134

geekdave · 2017-02-10T16:30:41Z

Hello,

I'm attempting to use a bot to connect Slack with Pagerduty: https://github.com/stripe-contrib/pagerbot

I set up the bot with a Slack bot access token.
I verified that the token works by using "curl" from the command line and was able to see the message getting posted in Slack.
I tested the bot on a large org (5000+ people) and it consistently fails to connect.
I tested the bot on a small org (15 people) (using a different token of course), and it worked fine.

I've worked with the developer of the bot @macobo, and he suggested I file an issue here.

I've created a gist with the error logs from heroku. The first log shows the errors when running this in the large org. The second log shows the successful connection with a smaller org:

I added some comments at the bottom noting some differences, notably:

On large org I see this in the response, but not in the small org:

transfer-encoding: "chunked"

Could there be an issue with the large number of members in the org or the number of channels causing something to stall? Is there a huge payload coming back from the RTM API due to my large org?

There is a delay of 100+ seconds in the large org log, and the error:

Socket URL has expired

Thanks for looking!

The text was updated successfully, but these errors were encountered:

geekdave · 2017-02-10T16:35:11Z

UPDATE: I pinged Slack's API help team about this and got the following response:

My guess is the bot is timing out on the initial call to rtm.start, which is the method that's called to initialize an RTM connection — for large teams this can take a long time. Thinking aloud, it could also be a memory allocation issue (depending on your server config).

Any chance you can increase the timeout for the initial API call ?

We're working very (very) hard on improvements in this area, as you can imagine this is a performance problem across-the-board for very large teams.

/cc @macobo

dblock · 2017-02-10T16:37:35Z

Indeed, there's a very large payload coming back from the initial rtm.start with all the org information. Tweak the call to rtm.start with simple_latest: true and no_unreads, that reduces the size by a bit.

Then disable any client-side parsing of that data with Slack::RealTime::Client.new(store_class: nil), see if that helps at all.

dblock · 2017-02-10T16:40:45Z

There's a feature request on supplying a custom Faraday connection in #72. There's an unfinished pull request for adding a timeout option in #74. In the meantime you can probably test other connection options with some editing of https://github.com/slack-ruby/slack-ruby-client/blob/master/lib/slack/web/faraday/connection.rb or monkey-patching.

This solves issues raised in slack-ruby/slack-ruby-client#134

macobo · 2017-02-11T00:11:15Z

Thanks @dblock for the suggestions - we tested with these settings, but ran into a similar issue (didn't try store_class though). Will try events API instead.

Related to this issue: I think this should make client.start! throw in a way that the program exits or attempt to reconnect. Currently the program will just hang indefinitely and the root cause is not really visible if a logger is not specified.

dblock · 2017-02-11T16:57:37Z

Agreed @macobo, please contribute.

tmsteen · 2017-02-17T21:39:00Z

Did anyone develop a short term solution to this? Seeing the same issue and the D, [2017-02-17T13:38:12.283081 #5079] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#parse: �B{"type":"error","error":{"msg":"Socket URL has expired","code":1}} debug error.

It did randomly work for me one time with the suggestions from @dblock with regard to reducing the size of the data coming back, about the same as the commit that @macobo referenced above. But will not work again. As far as I can tell, nothing changed between runs.

UPDATE: May have been a corp network issue. Works when on regular network.

dblock · 2017-02-17T21:45:09Z

@tmsteen @macobo has either of you attempted to increase the timeout? maybe finish #74?

tmsteen · 2017-02-17T22:29:15Z

Not familiar with how to do that, sorry. Certainly willing to assist if you can start me in the right direction. But I am starting to think that after setting simple_latest, no_unreads, and store_class that the issue may only be partially related to those.

Right now, my connection works on a seemingly random interval. From the debug log I get this in all cases D, [2017-02-17T16:11:10.540318 #6791] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#write: GET /websocket/ZKsL1B2ulhjnvmg2pJDOFZLMmEKTxNnzcXGLJJHGXZpNbWjd0NQI9IUwb4Z4s-QynUGzlDvIVuogXhsoAcls4rEa4LI0ny9qvtn2CnwlukVyZdC-RPYEFvSAfi-dSywgMmQ9mCQDxzypcQd9o7Iztg3yxVdz2ciRk= HTTP/1.1

You probably understand the inner workings better than I do, but it seems that after the initial call to the Slack API, it returns a unique web socket to use for the rest of the communication. I am curious if the "randomness" of when my connection works and when it doesn't is caused by that web socket timing out or only being valid within a certain window?

More testing needed, but it seems that it will work once, then fail for a while, and then work again a few minutes later. When it doesn't work, the error is still the same as noted above.

macobo · 2017-02-17T22:38:12Z

Sadly I wasn't able to work on this as I don't have access to a large slack organization.

@tmsteen How big is the organization you're using this gem for?

tmsteen · 2017-02-18T02:10:05Z

5000+ but my intended audience within the org is much smaller.

asheynkman · 2017-02-22T20:10:36Z

We also well facing the same issue setting {simple_latest: true, no_unreads:true} or store_class = nil makes no difference, is there any work around on this issue ?

macobo · 2017-02-22T20:32:34Z

I think there's multiple factors that make debugging this hard:

The developers/others interested don't have access to an account large enough to replicate the problem.
Part of this is on the side of slack - that is, the initial payload will be parsed from a string to something else so a huge payload might be causing issues.

One way to solve these issues is for someone with an admin in a larger account to either:

Debug this themselves (i.e. profile, dump the payload, etc)
Give temporary access to someone who is interested in debugging.

Number two is rather undesirable though due to the security implications for the account given access to.

asheynkman · 2017-02-22T20:36:13Z

One data point I have notice , if using just simple client from the sample with large organization connects and established connection. The issue happens while using Slack::RealTime::Client

require 'http'
require 'json'
require 'faye/websocket'
require 'eventmachine'

rc = JSON.parse HTTP.post("https://slack.com/api/rtm.start", params: {
token: ENV['SLACK_API_TOKEN']
})

url=rc['url']

EM.run {
ws = Faye::WebSocket::Client.new(url)

ws.on :open do |event|
p [:open]
end

ws.on :message do |event|
p [:message, JSON.parse(event.data)]
end

ws.on :close do |event|
p [:close, event.code, event.reason]
ws = nil
end
}

dblock · 2017-02-22T20:57:28Z

Anyone who has this issue in their team, please try and hard-code a bigger timeout just like in #74. Does it solve this? If it does, we need to finish #74 with something cleaner and allow anyone to set it, and possibly increase the default.

asheynkman · 2017-02-22T21:47:14Z

is timeout expose anywhere in version 0.10.1?

dblock · 2017-02-22T22:01:23Z

No @asheynkman.

asheynkman · 2017-02-23T00:31:23Z

@dblock we have increased time out and its all work great. Let us know when you planing to push into release.

dblock · 2017-02-23T02:19:25Z

Can you help finish #74 @asheynkman? Can you copy-paste some code increasing the timeout at least?

asheynkman · 2017-02-23T16:48:52Z

@dblock need to add spec / README and CHANGELOG entries and will push it.

tmsteen · 2017-03-08T17:12:08Z

Seems #74 was closed. @asheynkman did you just implement those same changes or did you increase timeout another way?

My ability to connect to the RTM is still hit or miss.

dblock · 2017-03-08T17:56:44Z

#74 was closed by its author, it needs to be "finished", please

dblock · 2017-03-08T18:11:28Z

I've PRed #135 for timeout options. Can you please try @tmsteen?

dblock · 2017-03-08T18:18:13Z

@asheynkman what timeout and open_timeout values did you use? I want to recommend something in the README.

asheynkman · 2017-03-08T19:17:35Z

@dblock we used 500000 for both

tmsteen · 2017-03-08T19:19:23Z

I just tested #135 with a timeout and opentimeout set at 1000 just to be sure. seems to be working consistently now. Tested with both the Web and RTM clients.

Edit: just saw the comment above.

asheynkman · 2017-03-08T19:20:02Z

1000 should be more then enough , the units in seconds , so should be recommended value

dblock · 2017-03-08T19:27:57Z

A 1000 seconds timeout seems like a generally bad idea, other than an "arbitrarily high number" :)

What does everyone think here about changing the default. What would you do?

dblock · 2017-03-08T19:28:39Z

cc: @technicalpickles

asheynkman · 2017-03-08T19:29:19Z

i just tested with 180 sec , with account mix of channels and users around 12K , took like 2 min to download

dblock · 2017-03-08T19:30:01Z

It's crazy that rtm.start takes 2 minutes, fundamentally bad design for this api :(

asheynkman · 2017-03-08T19:32:32Z

i don't think there are any other options unless slack redesigns their api's

dblock · 2017-03-08T19:32:55Z

I'm just venting @asheynkman :)

asheynkman · 2017-03-08T19:33:48Z

lol

dblock · 2017-03-08T19:36:31Z

I've merged #135. To close this issue we need to increase the default timeout value for rtm_start only (not for all web API calls).

dblock · 2017-03-08T20:33:31Z

Alright, check out #136, comment on!

dblock · 2017-04-29T20:54:29Z

Slack added rtm.connect, which no longer sends any more information than needed to the client on start unlike rtm.start. I've implemented support for it in slack-ruby-client in #145, you should try slack-ruby-client#HEAD, it has dramatically improved the memory footprint and connection speed of my game bot that currently has 385 active teams and should remove the need for a long timeout.

dblock added question bug? labels Feb 10, 2017

macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017

Make initial payload to slack rtm client smaller.

76b7b64

This solves issues raised in slack-ruby/slack-ruby-client#134

macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017

Make initial payload to slack rtm client smaller.

0da5447

This solves issues raised in slack-ruby/slack-ruby-client#134

macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017

Make initial payload to slack rtm client smaller.

0567ef3

This solves issues raised in slack-ruby/slack-ruby-client#134

dblock mentioned this issue Feb 22, 2017

Question about simple_latest slack-ruby/slack-ruby-bot#117

Closed

geekdave mentioned this issue Mar 6, 2017

Slack Bot won't come online stripe-contrib/pagerbot#53

Open

dblock mentioned this issue Mar 8, 2017

Added timeout and open_timeout options to Web API. #135

Merged

dblock mentioned this issue Mar 8, 2017

Increase the default timeout for rtm.start to 180 seconds. #136

Merged

dblock closed this as completed in #136 Mar 8, 2017

kbakhit mentioned this issue Sep 11, 2017

Bot takes 3 minutes to connect and fails with timeout Itiviti/simple-slack-api#185

Open

sankalp-khare mentioned this issue Aug 19, 2020

Hubot cannot connect to Slack and return me "Socket URL has expired" slackapi/hubot-slack#542

Closed

9 tasks

RTM connection fails in large org but works in small org #134

RTM connection fails in large org but works in small org #134

Comments

geekdave commented Feb 10, 2017 • edited Loading

geekdave commented Feb 10, 2017 • edited Loading

dblock commented Feb 10, 2017

dblock commented Feb 10, 2017 • edited Loading

macobo commented Feb 11, 2017

dblock commented Feb 11, 2017

tmsteen commented Feb 17, 2017 • edited Loading

dblock commented Feb 17, 2017

tmsteen commented Feb 17, 2017 • edited Loading

macobo commented Feb 17, 2017

tmsteen commented Feb 18, 2017

asheynkman commented Feb 22, 2017

macobo commented Feb 22, 2017

asheynkman commented Feb 22, 2017

dblock commented Feb 22, 2017

asheynkman commented Feb 22, 2017

dblock commented Feb 22, 2017

asheynkman commented Feb 23, 2017

dblock commented Feb 23, 2017

asheynkman commented Feb 23, 2017

tmsteen commented Mar 8, 2017

dblock commented Mar 8, 2017

dblock commented Mar 8, 2017

dblock commented Mar 8, 2017

asheynkman commented Mar 8, 2017

tmsteen commented Mar 8, 2017 • edited Loading

asheynkman commented Mar 8, 2017

dblock commented Mar 8, 2017

dblock commented Mar 8, 2017

asheynkman commented Mar 8, 2017

dblock commented Mar 8, 2017

asheynkman commented Mar 8, 2017

dblock commented Mar 8, 2017

asheynkman commented Mar 8, 2017

dblock commented Mar 8, 2017

dblock commented Mar 8, 2017

dblock commented Apr 29, 2017

geekdave commented Feb 10, 2017 •

edited

Loading

geekdave commented Feb 10, 2017 •

edited

Loading

dblock commented Feb 10, 2017 •

edited

Loading

tmsteen commented Feb 17, 2017 •

edited

Loading

tmsteen commented Feb 17, 2017 •

edited

Loading

tmsteen commented Mar 8, 2017 •

edited

Loading