
Mixed-version serialization bug in cluster:monitor/xpack/usage responses #61279

Closed
DaveCTurner opened this issue Aug 18, 2020 · 4 comments

Labels: :Analytics/Aggregations, >bug, Team:Analytics, won't fix

@DaveCTurner (Contributor):

Elasticsearch version (bin/elasticsearch --version): mixed 7.8.0/7.3.2 cluster (also seen with other 7.8.x releases)

Plugins installed: Probably not relevant, but whatever Elastic Cloud installs on AWS.

JVM version (java -version): Probably not relevant, but whatever Elastic Cloud uses.

OS version (uname -a if on a Unix-like system): Probably not relevant, but whatever Elastic Cloud uses.

Description of the problem including expected versus actual behavior:

A very small number of mixed-version clusters are reporting serialisation exceptions for cluster:monitor/xpack/usage transport responses:

[instance-0000000004] [76936572][cluster:monitor/xpack/usage] received request
[instance-0000000005] [76936572][cluster:monitor/xpack/usage] sent to [{instance-0000000004}{...}] (timeout: [null])
[instance-0000000004] [76936572][cluster:monitor/xpack/usage] sent response
[instance-0000000005] exception caught on transport layer [Netty4TcpChannel{localAddress=/..., remoteAddress=...}], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [76936572], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportService$6@87c14b], error [false]; resetting
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:124) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:78) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:692) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) [elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:73) [transport-netty4-client-7.8.0.jar:7.8.0]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final]
	at java.lang.Thread.run(Thread.java:832) [?:?]
[instance-0000000005] [76936572][cluster:monitor/xpack/usage] received response from [{instance-0000000004}{...}]

instance-0000000005 is a 7.8.0 node and instance-0000000004 is running 7.3.2.

There's not a lot of useful information in the logs here, but the only obvious change in this area that I can see between 7.7 and 7.8 is #54911, so I suggest investigating that further.
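
For context on the error itself, here is a rough sketch of the check that fires. This is my own illustration of the pattern, not Elasticsearch's actual InboundHandler code: the transport layer knows how many bytes the response occupies on the wire, and after the handler deserializes the payload it verifies that every byte was consumed. If the writer and reader disagree about the wire format (for instance, the writer emitted a field the reader does not know to read), bytes are left over and the connection is reset:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-in for the transport-layer check. The real logic lives in
// org.elasticsearch.transport.InboundHandler; only the shape is reproduced here.
public class MessageFullyReadSketch {

    static void handleResponse(byte[] payload, long requestId) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(payload);
        DataInputStream stream = new DataInputStream(in);

        // Suppose the writer (a newer node) serialized two longs, but this
        // reader (an older node) only knows about the first of them.
        long knownField = stream.readLong();

        // The handler has finished reading, yet bytes remain in the buffer:
        // this is exactly the condition reported in the log above.
        if (in.available() > 0) {
            throw new IllegalStateException("Message not fully read (response) for requestId ["
                + requestId + "]; resetting");
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] twoLongs = new byte[16];      // writer wrote two 8-byte longs
        handleResponse(twoLongs, 76936572L); // throws: 8 bytes left unread
    }
}

The converse mismatch (the reader expecting more bytes than the writer sent) surfaces as an EOF-style exception instead, so which node logs the error depends on which side has the newer serialization code.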

Steps to reproduce:

It seems very rare; I can only find ~2 clusters exhibiting this in all of us-east-1.

Provide logs (if relevant):

See above.

@elasticmachine (Collaborator):

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Analytics label on Aug 18, 2020.
@imotov (Contributor) commented Sep 4, 2020:

I spent some time looking into this and trying to reproduce it. I am pretty sure it was caused by this VLong/ZLong issue. We couldn't really figure out a clean way to deal with it back then. Fixing it would require quite a bit of version archeology to figure out when we introduced the bug, fixed the bug, and switched to the new version of the serialization. Considering the low impact and the complexity of the change, I would propose closing this as won't-fix unless it starts popping up more. @polyfractal, what do you think?
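
For anyone finding this later, a rough illustration of the VLong/ZLong distinction being referred to. This is my own sketch, not Elasticsearch's StreamOutput code: a VLong is a plain unsigned varint, while a ZLong zigzag-encodes the value first so that small negative numbers stay small. The two schemes produce different byte sequences for the same long, with different lengths for negative values, so a writer and reader that disagree on which scheme is in play decode a garbled value or consume the wrong number of bytes:

import java.io.ByteArrayOutputStream;

// Sketch of the two encodings (not Elasticsearch's actual implementation).
public class VarintMismatchSketch {

    // Plain varint ("VLong"-style): 7 bits per byte, high bit = continuation.
    // A negative long, viewed as unsigned, needs the maximum of 10 bytes.
    static byte[] writeVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Zigzag varint ("ZLong"-style): interleave negative and positive values
    // first, so -1 maps to 1 and encodes in a single byte.
    static byte[] writeZLong(long value) {
        return writeVLong((value << 1) ^ (value >> 63));
    }

    public static void main(String[] args) {
        System.out.println(writeVLong(-1L).length); // 10 bytes
        System.out.println(writeZLong(-1L).length); // 1 byte
        // If one node writes with one scheme and the peer reads with the other
        // (e.g. because version-guarded serialization changes don't line up in
        // a mixed 7.3.2/7.8.0 cluster), the decoded value is garbage and the
        // byte accounting for the rest of the response no longer adds up.
    }
}

Either way, a fix would need matching version guards on both ends, which is the "version archeology" mentioned above.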

@polyfractal (Contributor):

Hmm, yeah, I'm inclined to mark this as won't-fix. It seems rare, it would be a lot of work to fix, and it should be a transient issue anyway, since we expect clusters to upgrade past it shortly. We can revisit if many people start running into this, but I'm inclined to close for now.

@imotov (Contributor) commented Sep 8, 2020:

I am going to close it for now; we can revisit if it becomes an issue.

@imotov closed this as completed on Sep 8, 2020.