[CI] Index synced flush rest test sees 409 Conflict #29162

Closed · DaveCTurner opened this issue Mar 20, 2018 · 13 comments

Labels
blocker · :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard.) · >test (Issues or PRs that are addressing/adding tests) · >test-failure (Triaged test failures from CI) · v5.6.9 · v6.2.4

@DaveCTurner (Contributor) commented:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+intake/162/consoleText

1> [2018-03-19T19:10:22,030][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test {p0=indices.flush/10_basic/Index synced flush rest test}]: before test
  1> [2018-03-19T19:10:22,156][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "_shards" : {
  1>         "total" : 5,
  1>         "successful" : 3,
  1>         "failed" : 2
  1>       },
  1>       "testing" : {
  1>         "total" : 5,
  1>         "successful" : 3,
  1>         "failed" : 2,
  1>         "failures" : [
  1>           {
  1>             "shard" : 2,
  1>             "reason" : "[node-1][127.0.0.1:38733][internal:indices/flush/synced/sync] disconnected",
  1>             "routing" : {
  1>               "state" : "STARTED",
  1>               "primary" : true,
  1>               "node" : "CDF_2XQuR3apYhenQdnl7w",
  1>               "relocating_node" : null,
  1>               "shard" : 2,
  1>               "index" : "testing",
  1>               "allocation_id" : {
  1>                 "id" : "lWfMa2hRRBuwosyqnQ6TmQ"
  1>               }
  1>             }
  1>           },
  1>           {
  1>             "shard" : 1,
  1>             "reason" : "[node-0][127.0.0.1:46759][internal:indices/flush/synced/sync] disconnected",
  1>             "routing" : {
  1>               "state" : "STARTED",
  1>               "primary" : true,
  1>               "node" : "2pbkQL89Q2Wf0hNye6uxyQ",
  1>               "relocating_node" : null,
  1>               "shard" : 1,
  1>               "index" : "testing",
  1>               "allocation_id" : {
  1>                 "id" : "jQYe2kcRR_2uduMr8FpK9A"
  1>               }
  1>             }
  1>           }
  1>         ]
  1>       }
  1>     }
  1>   }
  1> }]
  1> [2018-03-19T19:10:22,241][INFO ][o.e.b.MixedClusterClientYamlTestSuiteIT] [test {p0=indices.flush/10_basic/Index synced flush rest test}]: after test
FAILURE 0.22s | MixedClusterClientYamlTestSuiteIT.test {p0=indices.flush/10_basic/Index synced flush rest test} <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: Failure at [indices.flush/10_basic:14]: expected [2xx] status code but api [indices.flush_synced] returned [409 Conflict] [{"_shards":{"total":5,"successful":3,"failed":2},"testing":{"total":5,"successful":3,"failed":2,"failures":[{"shard":2,"reason":"[node-1][127.0.0.1:38733][internal:indices/flush/synced/sync] disconnected","routing":{"state":"STARTED","primary":true,"node":"CDF_2XQuR3apYhenQdnl7w","relocating_node":null,"shard":2,"index":"testing","allocation_id":{"id":"lWfMa2hRRBuwosyqnQ6TmQ"}}},{"shard":1,"reason":"[node-0][127.0.0.1:46759][internal:indices/flush/synced/sync] disconnected","routing":{"state":"STARTED","primary":true,"node":"2pbkQL89Q2Wf0hNye6uxyQ","relocating_node":null,"shard":1,"index":"testing","allocation_id":{"id":"jQYe2kcRR_2uduMr8FpK9A"}}}]}}]
   > 	at __randomizedtesting.SeedInfo.seed([8FC9D8CFC41D4733:79DE7156AE12ACB]:0)
   > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:343)
   > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:320)
   > 	at java.lang.Thread.run(Thread.java:748)
   > Caused by: java.lang.AssertionError: expected [2xx] status code but api [indices.flush_synced] returned [409 Conflict] [{"_shards":{"total":5,"successful":3,"failed":2},"testing":{"total":5,"successful":3,"failed":2,"failures":[{"shard":2,"reason":"[node-1][127.0.0.1:38733][internal:indices/flush/synced/sync] disconnected","routing":{"state":"STARTED","primary":true,"node":"CDF_2XQuR3apYhenQdnl7w","relocating_node":null,"shard":2,"index":"testing","allocation_id":{"id":"lWfMa2hRRBuwosyqnQ6TmQ"}}},{"shard":1,"reason":"[node-0][127.0.0.1:46759][internal:indices/flush/synced/sync] disconnected","routing":{"state":"STARTED","primary":true,"node":"2pbkQL89Q2Wf0hNye6uxyQ","relocating_node":null,"shard":1,"index":"testing","allocation_id":{"id":"jQYe2kcRR_2uduMr8FpK9A"}}}]}}]
   > 	at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:238)
   > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:336)
   > 	... 37 more
REPRODUCE WITH: ./gradlew :qa:mixed-cluster:v5.6.9-SNAPSHOT#mixedClusterTestRunner -Dtests.seed=8FC9D8CFC41D4733 -Dtests.class=org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT -Dtests.method="test {p0=indices.flush/10_basic/Index synced flush rest test}" -Dtests.security.manager=true -Dtests.locale=en-PH -Dtests.timezone=Asia/Srednekolymsk

The only change on 6.2 was a doc fix (#29116), but #29103 (on the 5.6 branch) looks like the more likely culprit. @dnhatn could you take a look?

DaveCTurner added the >test-failure, :Distributed Indexing/Engine and v6.2.4 labels Mar 20, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@DaveCTurner (Contributor, Author) commented:

It looks like a serialisation failure, and #29103 did change the wire format. I'm not yet clear how this results in the problem seen here, but I'm backing it out to allow for further progress.

[2018-03-20T12:59:50,311][WARN ][o.e.t.n.Netty4Transport  ] [node-2] exception caught on transport layer [NettyTcpChannel{localAddress=/127.0.0.1:56789, remoteAddress=127.0.0.1/127.0.0.1:56716}], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [30], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.indices.flush.SyncedFlushService$6@79d738f8], error [false]; resetting
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1407) ~[elasticsearch-6.2.3-SNAPSHOT.jar:6.2.3-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) ~[transport-netty4-6.2.3-SNAPSHOT.jar:6.2.3-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
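
This class of failure can be illustrated with a small, self-contained sketch (hypothetical classes and field name, not the actual SyncedFlushService code; Elasticsearch itself uses StreamInput/StreamOutput): when a writer appends a new trailing field that the peer's older reader does not consume, the transport layer finds leftover bytes after the handler returns, throws "Message not fully read", and closes the connection.

    import java.io.*;

    public class WireMismatchDemo {
        // New-format writer (hypothetical): appends an extra field after the
        // original payload, as a wire-format change like #29103 might.
        static byte[] writeNewFormat() throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeInt(42);                  // original payload
                out.writeUTF("existing-sync-id");  // newly added field
            }
            return bytes.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] message = writeNewFormat();
            // Old-format reader: only knows about the original payload, so it
            // stops short and leaves the new field's bytes unconsumed.
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(message));
            int payload = in.readInt();
            System.out.println("payload=" + payload
                + ", leftover bytes=" + in.available()); // > 0: "not fully read"
        }
    }

Elasticsearch normally guards against this by gating such reads and writes on the stream's version on both sides; the version range used for that gate is exactly what droberts195 questions below.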

@dnhatn (Member) commented Mar 20, 2018:

@DaveCTurner I am looking at this.

@jasontedor (Member) commented:

Please silence this test; it is blocking other efforts.

@dnhatn (Member) commented Mar 20, 2018:

I've backported #29103 to 6.2 and corrected the BWC versions in 5.6.9.

dnhatn closed this as completed Mar 20, 2018
@droberts195 (Contributor) commented:

This is still failing when testing for BWC against 6.0.0 in the 6.2 branch: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.2+bwc-tests/178/consoleText

I think there is a problem in this method, added in f10a3ed: it assumes that all variants of 6.0 support the new functionality, whereas in reality only 6.0 HEAD does. Released versions can never be changed.

        boolean includeExistingSyncId(Version version) {
            return version.onOrAfter(Version.V_5_6_9);
        }
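
A corrected gate would have to target only versions that actually ship the change. Given the eventual resolution below (include #29103 in 6.3.0+ only), a minimal sketch of a safer check might look like this — the exact version constant is an assumption based on that outcome, not the committed fix:

        boolean includeExistingSyncId(Version version) {
            // assumption: gate on the first version line that actually ships #29103
            return version.onOrAfter(Version.V_6_3_0);
        }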

I will mute the tests.

@droberts195 (Contributor) commented:

I think this is a blocker for the release of 6.2.4. If the 6.2 branch were released in its current state, the shipped product would not work in a mixed-version cluster with released builds of 6.0 and 6.1.

droberts195 added a commit that referenced this issue Mar 21, 2018
@DaveCTurner (Contributor, Author) commented:

I think 5.6.9 is similarly blocked, although I don't think we have tests for that. @bleskes @dnhatn do you think we should carry on trying to fix the BWC implications of #29103, or would it be better to back it out and regroup first?

@dnhatn (Member) commented Mar 21, 2018:

@DaveCTurner and @droberts195 I am looking at this.

@droberts195 (Contributor) commented:

I muted the failing tests on the 6.2 branch in fc91822.

But if you check out the commit before the muting, it can be reproduced like this:

git checkout c06b7ab39eb5c6ed7464db9d3e196a43dbf5fd46
./gradlew :qa:mixed-cluster:v6.0.0#mixedClusterTestRunner -Dtests.seed=61C04E681897CE9 -Dtests.security.manager=true -Dtests.locale=ar-MA -Dtests.timezone=America/St_Vincent

@dnhatn (Member) commented Mar 21, 2018:

I talked to @bleskes, and we agreed to include #29103 in 6.3.0+ but not in 5.6.9. I am making the change soon. We will then re-evaluate the urgency and importance of the issue and decide which versions the change should be included in.

dnhatn pushed a commit that referenced this issue Mar 21, 2018
dnhatn pushed a commit that referenced this issue Mar 21, 2018
@dnhatn (Member) commented Mar 22, 2018:

I've backed out #29103 from 5.6.9 and corrected the BWC versions in the 6.2, 6.x and master branches. All intake and BWC builds are good now.

Thanks @bleskes, @DaveCTurner and @droberts195 for reporting and helping on this.

dnhatn closed this as completed Mar 22, 2018
clintongormley added the :Delivery/Build (Build or test infrastructure) label Apr 18, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

clintongormley added the >test (Issues or PRs that are addressing/adding tests) label and removed the :Delivery/Build (Build or test infrastructure) label Apr 18, 2018