Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Merge pull request #285 from DataDog/prognant/fix-RMI-socket-… #288

Merged
merged 1 commit into from
Apr 1, 2020

Conversation

prognant
Copy link
Contributor

@prognant prognant commented Apr 1, 2020

…connection-timeout"

This reverts commit c2e7793, reversing
changes made to c25dcc3.

Key point:
It could lead java.rmi.server.RMISocketFactory.setSocketFactory to be called more than once
It did not account for SSL connection.

…connection-timeout"

This reverts commit c2e7793, reversing
changes made to c25dcc3.
@prognant prognant merged commit 82f361d into master Apr 1, 2020
@prognant prognant deleted the prognant/revert-285 branch April 2, 2020 11:40
@sirianni
Copy link

It could lead java.rmi.server.RMISocketFactory.setSocketFactory to be called more than once
It did not account for SSL connection.

@prognant Are there any plans to make a new fix that addresses the bugs?

@prognant
Copy link
Contributor Author

Hello @sirianni, the PR #289 brings a working solution. It has been included in jmxfetch 0.36.1 release that will ship with the next release of the DataDog Agent.

@sirianni
Copy link

sirianni commented Apr 23, 2020

Hi @prognant - thanks for the quick reply.

Unlike the original implementation c2e7793, PR #289 does not allow the underlying TCP socket connect timeout to be configurable.

PR #289 makes the timeout of your poll of the ArrayBlockingQueue configurable, but that won't address the socket connection hangs that we saw in #279 . If you look at the details of #279 you'll see that the collection times we saw were well beyond the hardcoded 20 second timeout, so I don't think making that same timeout configurable addresses our issue. A new timeout is needed.

@prognant
Copy link
Contributor Author

prognant commented Apr 24, 2020

Hello @sirianni this piece of code

JMXConnector connectWithTimeout(final JMXServiceURL url, final Map<String, Object> env)
throws IOException {
final BlockingQueue<Object> mailbox = new ArrayBlockingQueue<Object>(1);
ExecutorService executor = Executors.newSingleThreadExecutor(daemonThreadFactory);
executor.submit(
new Runnable() {
public void run() {
try {
JMXConnector connector = JMXConnectorFactory.connect(url, env);
if (!mailbox.offer(connector)) {
connector.close();
}
} catch (Throwable t) {
mailbox.offer(t);
}
}
});
Object result;
try {
result = mailbox.poll(jmxTimeout, TimeUnit.SECONDS);
if (result == null) {
if (!mailbox.offer("")) {
result = mailbox.take();
}
}
} catch (InterruptedException e) {
throw initCause(new InterruptedIOException(e.getMessage()), e);
} finally {
executor.shutdown();
}
if (result == null) {
log.warn("Connection timed out: " + url);
throw new SocketTimeoutException("Connection timed out: " + url);
}
if (result instanceof JMXConnector) {
return (JMXConnector) result;
}
try {
throw (Throwable) result;
} catch (Throwable e) {
throw new IOException(e.toString(), e);
}
}
is supposed to honour the timeout (that is now configurable) while connecting. I did some tests and the connection timeout was enforced. If you do have a scenario where it does not work as expected, would you mind sharing it ?
Thanks !

@sirianni
Copy link

The stack trace where the thread is hung does not include the code you pasted above (Connection.connectWithTimeout() or JMXConnectorFactory.connect())

Once the initial JMX Connection is made, the RMI client stub will transparently re-create the Socket when any RMI method call is made:

"pool-9-thread-1" #31 prio=5 os_prio=0 cpu=2222.15ms elapsed=1233.58s tid=0x00007fb988350000 nid=0x293 runnable  [0x00007fb97090f000]
    java.lang.Thread.State: RUNNABLE
 	at java.net.PlainSocketImpl.socketConnect([email protected]/Native Method)
 	at java.net.AbstractPlainSocketImpl.doConnect([email protected]/AbstractPlainSocketImpl.java:399)
 	- locked <0x000000066bd3fe00> (a java.net.SocksSocketImpl)
 	at java.net.AbstractPlainSocketImpl.connectToAddress([email protected]/AbstractPlainSocketImpl.java:242)
 	at java.net.AbstractPlainSocketImpl.connect([email protected]/AbstractPlainSocketImpl.java:224)
 	at java.net.SocksSocketImpl.connect([email protected]/SocksSocketImpl.java:403)
 	at java.net.Socket.connect([email protected]/Socket.java:609)
 	at java.net.Socket.connect([email protected]/Socket.java:558)
 	at java.net.Socket.<init>([email protected]/Socket.java:454)
 	at java.net.Socket.<init>([email protected]/Socket.java:231)
 	at sun.rmi.transport.tcp.TCPDirectSocketFactory.createSocket([email protected]/TCPDirectSocketFactory.java:40)
 	at sun.rmi.transport.tcp.TCPEndpoint.newSocket([email protected]/TCPEndpoint.java:613)
 	at sun.rmi.transport.tcp.TCPChannel.createConnection([email protected]/TCPChannel.java:209)
 	at sun.rmi.transport.tcp.TCPChannel.newConnection([email protected]/TCPChannel.java:196)
 	at sun.rmi.server.UnicastRef.invoke([email protected]/UnicastRef.java:129)
 	at jdk.jmx.remote.internal.rmi.PRef.invoke(jdk.remoteref/Unknown Source)
 	at javax.management.remote.rmi.RMIConnectionImpl_Stub.getAttribute([email protected]/Unknown Source)
 	at javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getAttribute([email protected]/RMIConnector.java:904)
 	at org.datadog.jmxfetch.Connection.getAttribute(Connection.java:72)
 	at org.datadog.jmxfetch.JmxAttribute.getJmxValue(JmxAttribute.java:266)
 	at org.datadog.jmxfetch.JmxSimpleAttribute.getValue(JmxSimpleAttribute.java:120)
 	at org.datadog.jmxfetch.JmxSimpleAttribute.getMetrics(JmxSimpleAttribute.java:45)
 	at org.datadog.jmxfetch.Instance.getMetrics(Instance.java:423)
 	at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:26)
 	at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:8)

@prognant
Copy link
Contributor Author

@sirianni I see, the TCPChannel has some reuse logic and it may create additional connection. The idea of #289 was to fail fast for unreachable host. In your case it does not happen at the first connection. The socket factory seems to be the right solution, the non-SSL implementation is rather simple however the SSL one may not be trivial. I will add this task to our backlog for assessment.

@sirianni
Copy link

Thanks @prognant !

In your case it does not happen at the first connection.

Yes exactly. We run a dynamic kubernetes environment. This issue presents itself for us when we are upgrading services. During this time, the pods are rolling-restarted which causes a large gap in metrics flowing to Datadog due to the long RMI connection TCP timeouts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants