-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "Merge pull request #285 from DataDog/prognant/fix-RMI-socket-… #288
Conversation
@prognant Are there any plans to make a new fix that addresses the bugs? |
Hi @prognant - thanks for the quick reply. Unlike the original implementation c2e7793, PR #289 does not allow the underlying TCP socket connect timeout to be configurable. PR #289 makes the timeout of your poll of the ArrayBlockingQueue configurable, but that won't address the socket connection hangs that we saw in #279 . If you look at the details of #279 you'll see that the collection times we saw were well beyond the hardcoded 20 second timeout, so I don't think making that same timeout configurable addresses our issue. A new timeout is needed. |
Hello @sirianni this piece of code jmxfetch/src/main/java/org/datadog/jmxfetch/Connection.java Lines 90 to 134 in c8ea93b
Thanks ! |
The stack trace where the thread is hung does not include the code you pasted above ( Once the initial JMX Connection is made, the RMI client stub will transparently re-create the Socket when any RMI method call is made:
|
@sirianni I see, the TCPChannel has some reuse logic and it may create additional connection. The idea of #289 was to fail fast for unreachable host. In your case it does not happen at the first connection. The socket factory seems to be the right solution, the non-SSL implementation is rather simple however the SSL one may not be trivial. I will add this task to our backlog for assessment. |
Thanks @prognant !
Yes exactly. We run a dynamic kubernetes environment. This issue presents itself for us when we are upgrading services. During this time, the pods are rolling-restarted which causes a large gap in metrics flowing to Datadog due to the long RMI connection TCP timeouts. |
…connection-timeout"
This reverts commit c2e7793, reversing
changes made to c25dcc3.
Key point:
It could lead java.rmi.server.RMISocketFactory.setSocketFactory to be called more than once
It did not account for SSL connection.