Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Unexpected failure while sending request, -84 is not a valid id #4494

Closed
Jakob3xD opened this issue Jun 24, 2024 · 22 comments · Fixed by #4741
Closed

[BUG] Unexpected failure while sending request, -84 is not a valid id #4494

Jakob3xD opened this issue Jun 24, 2024 · 22 comments · Fixed by #4741
Labels
bug Something isn't working triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@Jakob3xD
Copy link

Jakob3xD commented Jun 24, 2024

Describe the bug

With the rolling upgrade from 2.13.0 to 2.14 one of my Opensearch clusters starts to fail and throwing exceptions.
After restarting the third of six hardware nodes and doing the upgrade of 2.13 to 2.14 some indices went from yellow to red caused by shard failures with the following exception:

Jun 24 14:55:42 some-opensearch-n7[20523]: [2024-06-24T14:55:42,402][WARN ][o.o.c.a.s.ShardStateAction] [some-opensearch-n7] unexpected failure while sending request [internal:cluster/shard/failure] to [{some-opensearch-n10}{emE3d7CIR8qOR0fTQ9NE-w}{wZPLpHEfQTqwbBOJg2Sgcw}{some-opensearch-n10}{172.27.9.10:9300}{dimr}{node=some-opensearch-w4, data=hot, shard_indexing_pressure_enabled=true}] for shard entry [shard id [[.ds-some-index-000005][1]], allocation id [wPuHfO6rT02UsPrtPe0X7A], primary term [1249], message [mark copy as stale], markAsStale [true]]
Jun 24 14:55:42 some-opensearch-n7[20523]: org.opensearch.transport.RemoteTransportException: [some-opensearch-n10][172.27.9.10:9300][internal:cluster/shard/failure]
Jun 24 14:55:42 some-opensearch-n7[20523]: Caused by: org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AbstractAuditLog.getUser(AbstractAuditLog.java:784) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AbstractAuditLog.logMissingPrivileges(AbstractAuditLog.java:218) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AuditLogImpl.logMissingPrivileges(AuditLogImpl.java:164) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.AuditLogSslExceptionHandler.logError(AuditLogSslExceptionHandler.java:72) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:170) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:828) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) [netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) [netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) [netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) [netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at java.lang.Thread.run(Thread.java:1583) [?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]: Caused by: java.lang.IllegalArgumentException: -84 is not a valid id
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.support.Base64CustomHelper$CustomSerializationFormat.fromId(Base64CustomHelper.java:63) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:117) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AbstractAuditLog.getUser(AbstractAuditLog.java:784) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AbstractAuditLog.logMissingPrivileges(AbstractAuditLog.java:218) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.impl.AuditLogImpl.logMissingPrivileges(AuditLogImpl.java:164) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.auditlog.AuditLogSslExceptionHandler.logError(AuditLogSslExceptionHandler.java:72) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:170) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:828) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) ~[opensearch-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) ~[transport-netty4-client-2.14.0.jar:2.14.0]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) ~[netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) ~[netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) ~[netty-handler-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
Jun 24 14:55:42 some-opensearch-n7[20523]:         at java.lang.Thread.run(Thread.java:1583) ~[?:?]

Related component

Other

To Reproduce

Not know.
The only big difference to my other clusters is, that the failing cluster is doing cross cluster search and therefore has remote targets configured.

Expected behavior

The upgrade from 2.13 to 2.14 should happen without such exceptions.

Additional Details

Plugins

opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication 
opensearch-custom-codecs
opensearch-flow-framework  
opensearch-geospatial      
opensearch-index-management        
opensearch-job-scheduler        
opensearch-knn                  
opensearch-ml                 
opensearch-neural-search        
opensearch-notifications         
opensearch-notifications-core    
opensearch-observability         
opensearch-reports-scheduler     
opensearch-security           
opensearch-security-analytics      
opensearch-skills                  
opensearch-sql                
prometheus-exporter         
repository-s3        

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version 24.04

Additional context
Similar exception already existed in the past when updating from 2.10 to 2.11.
opensearch-project/OpenSearch#11491

@Jakob3xD Jakob3xD added bug Something isn't working untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jun 24, 2024
@dbwiddis
Copy link
Member

Previous issue linked to #3771 which was supposed to have been fixed, but there's at least one other report of this error in 2.14 here: #3771 (comment) CC: @cwperks

@dbwiddis
Copy link
Member

dbwiddis commented Jun 24, 2024

Possibly is related to opensearch-project/performance-analyzer#606

Still-open PR to fix: opensearch-project/performance-analyzer#609

@reshippie
Copy link
Contributor

I'm seeing the same issue upgrading from 2.12 to 2.14

@cwperks
Copy link
Member

cwperks commented Jun 24, 2024

@Jakob3xD It would be helpful for debugging to understand the setup that produced this error. What plugins or features are enabled?

When this error was seen on upgrade from < 2.11 to 2.11, it was because PerformanceAnalyzer had wrapped an instance of a TransportChannel and was not delegating getVersion() implementation to the wrapped channel. I opened a PR in PA to fix the wrapping issue, but also opened a different PR in the security repo that fixed the issue in the security repo and did not require the PA PR to be merged (though it should merge since its the strategic fix)

Backwards compatibility for serialization requires the receiving or transmitting node of a transport request to know the correct version of the target node. The target node being the node that the transport request is being sent to (from the transmitter end) or the node that sent the transport request (on the receiver end)

It would be helpful to know the concrete className for connection here:

final var serializationFormat = SerializationFormat.determineFormat(connection.getVersion());

It would also be helpful the know the concrete className for channel here:

SerializationFormat.determineFormat(channel.getVersion()) == SerializationFormat.JDK

@briend
Copy link

briend commented Jun 25, 2024

We're also seeing this upgrading from 2.12.0 to 2.14.0. We have the PA plugin installed but it is disabled via env variable DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI=true. Note, @Jakob3xD didn't seem to have the opensearch-performance-analyzer plugin but still had this error.

opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
repository-s3

PA is disabled at startup:

Enabling OpenSearch Security Plugin
Disabling execution of install_demo_configuration.sh for OpenSearch Security Plugin
Disabling execution of /usr/share/opensearch/bin/opensearch-performance-analyzer/performance-analyzer-agent-cli for OpenSearch Performance Analyzer Plugin

error does reference performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived:

{"type": "console", "timestamp": "2024-06-24T23:42:24,719Z", "level": "WARN", "component": "o.o.i.e.Engine", "cluster.name": "some-cluster", "node.name": "node-1", "message": " [some-index-2024-06-24][1] failed engine [primary shard [[some-index-2024-06-24][1], node[asdf], [P], s[STARTED], a[id=fdsa]] was demoted while failing replica shard]", "cluster.uuid": "ffff-A", "node.id": "asdf" , 
"stacktrace": ["org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id",
"at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]",
"at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AbstractAuditLog.getUser(AbstractAuditLog.java:775) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AbstractAuditLog.logMissingPrivileges(AbstractAuditLog.java:209) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AuditLogImpl.logMissingPrivileges(AuditLogImpl.java:164) ~[?:?]",
"at org.opensearch.security.auditlog.AuditLogSslExceptionHandler.logError(AuditLogSslExceptionHandler.java:72) ~[?:?]",
"at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:170) ~[?:?]",
"at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]",
"at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?]",
"at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?]",
"at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.14.0.jar:2.14.0]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.109.Final.jar:4.1.109.Final]",
"at java.lang.Thread.run(Thread.java:1583) [?:?]",
"Caused by: java.lang.IllegalArgumentException: -84 is not a valid id",
"at org.opensearch.security.support.Base64CustomHelper$CustomSerializationFormat.fromId(Base64CustomHelper.java:63) ~[?:?]",
"at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:117) ~[?:?]",
"... 50 more"] }

@Jakob3xD
Copy link
Author

@Jakob3xD It would be helpful for debugging to understand the setup that produced this error. What plugins or features are enabled?

The plugins are listed in the issue. PA is completely removed during the image build process via /usr/share/opensearch/bin/opensearch-plugin remove opensearch-performance-analyzer. The only difference to my other clusters is the cluster configuration of remote clusters and the configuration of an awareness attribute.

To give more details on how i encountered the bug. I upgraded all my remote clusters from 2.13 to 2.14 without any issues and afterwards I wanted to upgrade the main clusters, where all the remote targets are configured. First two node reboots worked without any issue but after the third node thinks started to fail with the already named exception.
I fixed this issue like the last time where the upgrade to 2.11 failed, by doing a full cluster upgrade and not the rolling upgrade. After doing the full cluster upgrade by shutting the cluster down and starting it with 2.14 the exceptions are gone.

It would be helpful to know the concrete className for connection here:

final var serializationFormat = SerializationFormat.determineFormat(connection.getVersion());

It would also be helpful the know the concrete className for channel here:

SerializationFormat.determineFormat(channel.getVersion()) == SerializationFormat.JDK

Not sure how I would get those? I am not familiar with java.

@cwperks
Copy link
Member

cwperks commented Jun 25, 2024

Based on the stack trace it appears the issue is this:

2.12 node ------- transport request ------> 2.14 node

When the 2.12 node is serializing its opting to use JDK serialization, but the 2.14 node is trying to deserialize as custom.

The 2.12 node should be opting to use custom serialization since the target node is >= 2.11: https://github.com/opensearch-project/security/blob/2.12/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java#L153

@cwperks
Copy link
Member

cwperks commented Jun 25, 2024

@briend Do you also have cross cluster configured?

@Jakob3xD Is the error seen on the replica cluster?

@briend
Copy link

briend commented Jun 25, 2024

@briend Do you also have cross cluster configured?

No, we don't have cross cluster search configured on the cluster that had the issue. I was also able to recover by doing a full upgrade instead of rolling (shut all nodes down and start up with 2.14.0).

@peternied peternied transferred this issue from opensearch-project/OpenSearch Jun 26, 2024
@peternied
Copy link
Member

[Triage - attendees 1 2 3]
Transferring to security repo

@Jakob3xD
Copy link
Author

@Jakob3xD Is the error seen on the replica cluster?

Nope, only happened on the main cluster.
And they are not replica clusters. They are just configures as remote cluster to be able to query all logs via one opensearch.

@briend
Copy link

briend commented Jun 27, 2024

I could not reproduce the error with the security plugin disabled, FWIW

@stephen-crawford
Copy link
Contributor

[Triage] Hi @Jakob3xD, thanks for filing this issue. This looks like a good bug to get fixed. Going to mark as triaged.

@stephen-crawford stephen-crawford added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Jul 1, 2024
@shikharj05
Copy link
Contributor

shikharj05 commented Jul 4, 2024

Also check - #4521 (comment)
#4521 (comment)

@Dhruvan1217
Copy link

When this error was seen on upgrade from < 2.11 to 2.11, it was because PerformanceAnalyzer had wrapped an instance of a TransportChannel and was not delegating getVersion() implementation to the wrapped channel.

Just to add, we didn't had PA enabled but still saw the issue happening while upgrading from < 2.10 to 2.11.1 and now same while upgrading to 2.15. For ref: #4085

@shikharj05
Copy link
Contributor

Reposting with better summary-
In 2.11, custom serialization was introduced with #2802

In 2.14, custom serialization was disabled and the plugin moved back to JDK serialization with #4264

Hence, I think for a cluster upgrading from 2.11/2.12/2.13 to any 2.14+ version, the cluster can run into this issue due to different serialization methods. Similarly, a cluster upgrading from < 2.11 to 2.11-2.13 versions can run into the same issue.

In 2.11/2.12/2.13 - serialization method is decided using version check

In 2.14+ - serialization is decided using version check in SerializationFormat

To fix the issue, one of the possible options is to pick the serialization method using the source node version of the transport request. The fix might be similar to #3826

@Dhruvan1217
Copy link

Hey Everyone, Is anyone working on fixing this? It would be great if someone can help pick this up as it's hampering upgrades to the newer version after 2.10.
Or maybe we introduce variable to pick the serialisation technique like suggested here

@cwperks
Copy link
Member

cwperks commented Sep 13, 2024

@Dhruvan1217 One thing to consider is jumping to 2.14 and above which has Custom serialization disabled by default. I have not been able to reproduce this issue. Can you share a little more information about the requests that receive this error?

One thing I have been looking into is whether its possible that there are multiple transport hops happening while serving a request.

i.e.

Client ----> 2.14 Node (Coordinator node) -----> 2.14 node -----> 2.12 node.

The header that is causing problems that @Jakob3xD pointed out is first populated on the 2.14 node that receives the transport request from another 2.14 node and uses JDK serialization. If there's another transport hop from the 2.14 node -> 2.12 node then it re-uses the header which is JDK serialized. When the 2.12 node receives the request it assumes its custom serialized because the logic on the 2.12 branch looks to see if the transmitter is >= 2.11.

This is all hypothetical though, I haven't been able to replicate.

An example of an API that uses multiple transport hops is indexing a document.

When making a request to insert a document the coordinator node forwards the request to the nodes with the primaries which then forward the request to the nodes containing replicas.

I looked into whether indexing a document would encounter this error, but OpenSearch ensures that replicas are always on the same node version or greater than the primaries. Its not possible to have a primary on a newer node version and a replica on an older node version.

@cwperks
Copy link
Member

cwperks commented Sep 13, 2024

Update: I was able to reproduce it with the scenario from above. It happens when replica shards are on an older node than the primary.

@cwperks
Copy link
Member

cwperks commented Sep 13, 2024

I think a fix would involve moving this block directly to after the call to stashContext around here. Testing will be a challenge because my understanding is that replica shard on an older node than a primary shard happens transiently during rolling upgrades.

@cwperks
Copy link
Member

cwperks commented Sep 13, 2024

Not sure if I can reproduce it again, but these were the steps I followed:

Start with 3 2.12 nodes

curl -XPUT https://admin:admin@localhost:9200/movies -k -H "Content-Type: application/json" --data '{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 2
    }
  }
}'
curl -XPOST "https://admin:admin@localhost:9201/movies/_bulk?pretty" -k -H 'Content-Type: application/json' -d '
{ "index": { "_index": "movies", "_id": "1" } }
{ "title": "The Matrix", "year": 1999, "genre": "Sci-Fi", "director": "Lana Wachowski, Lilly Wachowski" }
{ "index": { "_index": "movies", "_id": "2" } }
{ "title": "Inception", "year": 2010, "genre": "Action", "director": "Christopher Nolan" }
{ "index": { "_index": "movies", "_id": "3" } }
{ "title": "Interstellar", "year": 2014, "genre": "Adventure", "director": "Christopher Nolan" }
'
curl -XPUT "https://admin:admin@localhost:9200/_cluster/settings" -k -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": "0"
  }
}'

Reboot the node with a primary shard as 2.14

curl -XPOST "https://admin:admin@localhost:9200/_cluster/reroute?pretty" -k -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "move": {
        "index": "movies",
        "shard": 0,
        "from_node": "<older_name>",
        "to_node": "<newer_node>"
      }
    }
  ]
}'
curl -XPUT "https://admin:admin@localhost:9200/_cluster/settings" -k -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": "2"
  }
}'

Check shard allocation

curl -XGET https://admin:admin@localhost:9200/_cat/shards\?v -k

Send a request to a node that does not have a shard for the movies index allocated to it

@Dhruvan1217
Copy link

@cwperks This situation makes sense to me.

OpenSearch ensures that replicas are always on the same node version or greater than the primaries.

I think during the upgrade, there are chances of Primary being on an updated node and replica on an old node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.
Projects
None yet
9 participants