Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTM connection fails in large org but works in small org #134

Closed
geekdave opened this issue Feb 10, 2017 · 36 comments
Closed

RTM connection fails in large org but works in small org #134

geekdave opened this issue Feb 10, 2017 · 36 comments

Comments

@geekdave
Copy link

geekdave commented Feb 10, 2017

Hello,

I'm attempting to use a bot to connect Slack with Pagerduty: https://github.com/stripe-contrib/pagerbot

  • I set up the bot with a Slack bot access token.
  • I verified that the token works by using "curl" from the command line and was able to see the message getting posted in Slack.
  • I tested the bot on a large org (5000+ people) and it consistently fails to connect.
  • I tested the bot on a small org (15 people) (using a different token of course), and it worked fine.

I've worked with the developer of the bot @macobo, and he suggested I file an issue here.

I've created a gist with the error logs from heroku. The first log shows the errors when running this in the large org. The second log shows the successful connection with a smaller org:

I added some comments at the bottom noting some differences, notably:

  1. On large org I see this in the response, but not in the small org:
transfer-encoding: "chunked" 

Could there be an issue with the large number of members in the org or the number of channels causing something to stall? Is there a huge payload coming back from the RTM API due to my large org?

  1. There is a delay of 100+ seconds in the large org log, and the error:
Socket URL has expired

Thanks for looking!

@geekdave
Copy link
Author

geekdave commented Feb 10, 2017

UPDATE: I pinged Slack's API help team about this and got the following response:

My guess is the bot is timing out on the initial call to rtm.start, which is the method that's called to initialize an RTM connection — for large teams this can take a long time. Thinking aloud, it could also be a memory allocation issue (depending on your server config).

Any chance you can increase the timeout for the initial API call ?

We're working very (very) hard on improvements in this area, as you can imagine this is a performance problem across-the-board for very large teams.

/cc @macobo

@dblock
Copy link
Collaborator

dblock commented Feb 10, 2017

Indeed, there's a very large payload coming back from the initial rtm.start with all the org information. Tweak the call to rtm.start with simple_latest: true and no_unreads, that reduces the size by a bit.

Then disable any client-side parsing of that data with Slack::RealTime::Client.new(store_class: nil), see if that helps at all.

@dblock
Copy link
Collaborator

dblock commented Feb 10, 2017

There's a feature request on supplying a custom Faraday connection in #72. There's an unfinished pull request for adding a timeout option in #74. In the meantime you can probably test other connection options with some editing of https://github.com/slack-ruby/slack-ruby-client/blob/master/lib/slack/web/faraday/connection.rb or monkey-patching.

macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017
macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017
macobo pushed a commit to stripe-contrib/pagerbot that referenced this issue Feb 10, 2017
@macobo
Copy link

macobo commented Feb 11, 2017

Thanks @dblock for the suggestions - we tested with these settings, but ran into a similar issue (didn't try store_class though). Will try events API instead.

Related to this issue: I think this should make client.start! throw in a way that the program exits or attempt to reconnect. Currently the program will just hang indefinitely and the root cause is not really visible if a logger is not specified.

@dblock
Copy link
Collaborator

dblock commented Feb 11, 2017

Agreed @macobo, please contribute.

@tmsteen
Copy link

tmsteen commented Feb 17, 2017

Did anyone develop a short term solution to this? Seeing the same issue and the D, [2017-02-17T13:38:12.283081 #5079] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#parse: �B{"type":"error","error":{"msg":"Socket URL has expired","code":1}} debug error.

It did randomly work for me one time with the suggestions from @dblock with regard to reducing the size of the data coming back, about the same as the commit that @macobo referenced above. But will not work again. As far as I can tell, nothing changed between runs.

UPDATE: May have been a corp network issue. Works when on regular network.

@dblock
Copy link
Collaborator

dblock commented Feb 17, 2017

@tmsteen @macobo has either of you attempted to increase the timeout? maybe finish #74?

@tmsteen
Copy link

tmsteen commented Feb 17, 2017

Not familiar with how to do that, sorry. Certainly willing to assist if you can start me in the right direction. But I am starting to think that after setting simple_latest, no_unreads, and store_class that the issue may only be partially related to those.

Right now, my connection works on a seemingly random interval. From the debug log I get this in all cases D, [2017-02-17T16:11:10.540318 #6791] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#write: GET /websocket/ZKsL1B2ulhjnvmg2pJDOFZLMmEKTxNnzcXGLJJHGXZpNbWjd0NQI9IUwb4Z4s-QynUGzlDvIVuogXhsoAcls4rEa4LI0ny9qvtn2CnwlukVyZdC-RPYEFvSAfi-dSywgMmQ9mCQDxzypcQd9o7Iztg3yxVdz2ciRk= HTTP/1.1

You probably understand the inner workings better than I do, but it seems that after the initial call to the Slack API, it returns a unique web socket to use for the rest of the communication. I am curious if the "randomness" of when my connection works and when it doesn't is caused by that web socket timing out or only being valid within a certain window?

More testing needed, but it seems that it will work once, then fail for a while, and then work again a few minutes later. When it doesn't work, the error is still the same as noted above.

@macobo
Copy link

macobo commented Feb 17, 2017

Sadly I wasn't able to work on this as I don't have access to a large slack organization.

@tmsteen How big is the organization you're using this gem for?

@tmsteen
Copy link

tmsteen commented Feb 18, 2017

5000+ but my intended audience within the org is much smaller.

@asheynkman
Copy link

We also well facing the same issue setting {simple_latest: true, no_unreads:true} or store_class = nil makes no difference, is there any work around on this issue ?

@macobo
Copy link

macobo commented Feb 22, 2017

I think there's multiple factors that make debugging this hard:

  • The developers/others interested don't have access to an account large enough to replicate the problem.
  • Part of this is on the side of slack - that is, the initial payload will be parsed from a string to something else so a huge payload might be causing issues.

One way to solve these issues is for someone with an admin in a larger account to either:

  • Debug this themselves (i.e. profile, dump the payload, etc)
  • Give temporary access to someone who is interested in debugging.

Number two is rather undesirable though due to the security implications for the account given access to.

@asheynkman
Copy link

One data point I have notice , if using just simple client from the sample with large organization connects and established connection. The issue happens while using Slack::RealTime::Client

require 'http'
require 'json'
require 'faye/websocket'
require 'eventmachine'

rc = JSON.parse HTTP.post("https://slack.com/api/rtm.start", params: {
token: ENV['SLACK_API_TOKEN']
})

url=rc['url']

EM.run {
ws = Faye::WebSocket::Client.new(url)

ws.on :open do |event|
p [:open]
end

ws.on :message do |event|
p [:message, JSON.parse(event.data)]
end

ws.on :close do |event|
p [:close, event.code, event.reason]
ws = nil
end
}

@dblock
Copy link
Collaborator

dblock commented Feb 22, 2017

Anyone who has this issue in their team, please try and hard-code a bigger timeout just like in #74. Does it solve this? If it does, we need to finish #74 with something cleaner and allow anyone to set it, and possibly increase the default.

@asheynkman
Copy link

is timeout expose anywhere in version 0.10.1?

@dblock
Copy link
Collaborator

dblock commented Feb 22, 2017

No @asheynkman.

@asheynkman
Copy link

@dblock we have increased time out and its all work great. Let us know when you planing to push into release.

@dblock
Copy link
Collaborator

dblock commented Feb 23, 2017

Can you help finish #74 @asheynkman? Can you copy-paste some code increasing the timeout at least?

@asheynkman
Copy link

@dblock need to add spec / README and CHANGELOG entries and will push it.

@tmsteen
Copy link

tmsteen commented Mar 8, 2017

Seems #74 was closed. @asheynkman did you just implement those same changes or did you increase timeout another way?

My ability to connect to the RTM is still hit or miss.

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

#74 was closed by its author, it needs to be "finished", please

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

I've PRed #135 for timeout options. Can you please try @tmsteen?

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

@asheynkman what timeout and open_timeout values did you use? I want to recommend something in the README.

@asheynkman
Copy link

@dblock we used 500000 for both

@tmsteen
Copy link

tmsteen commented Mar 8, 2017

I just tested #135 with a timeout and opentimeout set at 1000 just to be sure. seems to be working consistently now. Tested with both the Web and RTM clients.

Edit: just saw the comment above.

@asheynkman
Copy link

1000 should be more then enough , the units in seconds , so should be recommended value

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

A 1000 seconds timeout seems like a generally bad idea, other than an "arbitrarily high number" :)

What does everyone think here about changing the default. What would you do?

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

cc: @technicalpickles

@asheynkman
Copy link

i just tested with 180 sec , with account mix of channels and users around 12K , took like 2 min to download

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

It's crazy that rtm.start takes 2 minutes, fundamentally bad design for this api :(

@asheynkman
Copy link

i don't think there are any other options unless slack redesigns their api's

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

I'm just venting @asheynkman :)

@asheynkman
Copy link

lol

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

I've merged #135. To close this issue we need to increase the default timeout value for rtm_start only (not for all web API calls).

@dblock
Copy link
Collaborator

dblock commented Mar 8, 2017

Alright, check out #136, comment on!

@dblock
Copy link
Collaborator

dblock commented Apr 29, 2017

Slack added rtm.connect, which no longer sends any more information than needed to the client on start unlike rtm.start. I've implemented support for it in slack-ruby-client in #145, you should try slack-ruby-client#HEAD, it has dramatically improved the memory footprint and connection speed of my game bot that currently has 385 active teams and should remove the need for a long timeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants