[CI] Packaging tests fail with connection failure on opensuse and sles #30295
Pinging @elastic/es-core-infra |
@andyb-elastic saw a similar failure some time back, and that failure included the Elasticsearch logs, which said this:
Not sure why the Elasticsearch log wasn't included in the build failure above, but it'd be mighty useful if it could appear in future. If it is the same situation, then this is failing to put a bunch of index templates within the 30-second timeout. It's not a network issue, since this is a one-node cluster. It'd be useful to see
(edited - I originally mentioned mappings, but this is not that) |
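For context, the failing step boils down to a handful of index template PUTs; a minimal sketch (the template name and body here are made up, the real templates come from x-pack), plus a way to check whether the master has a backlog of cluster state tasks:

```sh
# Hypothetical template, not one of the real x-pack templates: time how long
# the master takes to ack the create-index-template task.
time curl -s -XPUT 'localhost:9200/_template/demo-template?master_timeout=30s' \
  -H 'Content-Type: application/json' -d '{
    "index_patterns": ["demo-*"],
    "settings": { "number_of_shards": 1 }
  }'

# If cluster state updates are slow, the pending tasks queue shows the backlog.
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'
```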
And another: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2660/console @DaveCTurner, do you have a recommendation for any logging I can add to get more information about why the node is failing to process these cluster state events? |
The logs seem somewhat garbled - messages seem to be duplicated in a funny order. Nonetheless, this bit seems unusually slow:
On my laptop it takes <1s to make all the templates:
Perhaps this will shed a little more light on where the time is going:
|
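The settings suggested above were dropped from this copy of the thread; as an assumption about the kind of thing meant, trace logging on the cluster service reports how long each cluster state update task takes, and it can be flipped on dynamically:

```sh
# Assumption: this is the sort of logger that exposes per-task timings for
# cluster state updates, not necessarily the exact setting suggested above.
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '{
    "transient": { "logger.org.elasticsearch.cluster.service": "TRACE" }
  }'
```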
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2717/console is another instance of this. Again the outermost error is:
and the error in the Elasticsearch log is:
|
Similar story: adding templates is slow:
@andyb-elastic it doesn't look like we're getting extra detail yet. Could you check these settings are being applied in the right places?
|
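For a packaged install, the "right place" would be the static config, so the logger is active from startup rather than only once the node answers HTTP; a sketch, assuming the default RPM/DEB paths:

```sh
# Sketch only: enable the cluster service logger statically and restart.
echo 'logger.org.elasticsearch.cluster.service: TRACE' | \
  sudo tee -a /etc/elasticsearch/elasticsearch.yml
sudo systemctl restart elasticsearch
```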
They're not applied because I didn't follow up on this, my bad. I'll add them now |
Setting that logging up the right way ended up being a little involved; I'll push it next week in the interest of not breaking the build on a Friday afternoon |
This looks very similar to me: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/890/console @andyb-elastic I'm not sure which additional detail you are looking for; maybe this one shows more? |
I turned on the logging David suggested when the tests run on opensuse (ba8bb1d). @cbuescher it's likely the same; it's hard to tell since it looks like that one didn't dump the server logs. The detail we're looking for is more information about why the server sometimes takes a long time to start on opensuse |
@jasontedor can you assign someone to address these test failures? |
I think this is caused by internal infrastructure issues. I will follow up internally. |
Here's one with the logging turned up. I'm not sure it really revealed anything, or at least I don't see any additional detail on why the template creation is slow. |
I turned up |
Packaging tests are occasionally failing (elastic#30295) because of very slow index template creation. It looks like the slow part is updating the on-disk cluster state, and this change will help to confirm this.
The log is still suggestive of extremely slow IO when writing updated cluster states to disk. More detail from
|
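If the suspicion is slow on-disk cluster state writes, a crude way to check raw synchronous write latency on the box is something like the following (paths and sizes are assumptions, not part of the test suite):

```sh
# Write 1000 x 4k blocks with a sync per write; on a healthy disk this should
# finish in a few seconds at most.
sudo dd if=/dev/zero of=/var/lib/elasticsearch/ddtest bs=4k count=1000 oflag=dsync
sudo rm -f /var/lib/elasticsearch/ddtest
```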
Packaging tests are occasionally failing (#30295) because of very slow index template creation. It looks like the slow part is updating the on-disk cluster state, and this change will help to confirm this.
Again! https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/977/console:
|
Another instance, this time in Full log: bats-packaging-log.txt |
@andyb-elastic do you think we should disable these tests while we investigate? I'm not familiar with how to best isolate + mute the bats tests. |
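For what it's worth, bats itself supports skipping at the test level, so an alternative to pulling the boxes out of the matrix would be a guard like this (the test name and OS check here are hypothetical):

```sh
#!/usr/bin/env bats

@test "install elasticsearch package" {
  # Hypothetical guard: mute this test on the boxes under investigation.
  if grep -qiE 'opensuse|sles' /etc/os-release; then
    skip "muted while investigating slow-IO failures (#30295)"
  fi
  run systemctl --version     # placeholder for the real test steps
  [ "$status" -eq 0 ]
}
```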
A similar failure today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+packaging-tests/1024/console
|
This build failed differently than the others reported here, but looking in the logs a lot of things time out, so I assume it is caused by the same issue (super slow IO): https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2860/console
|
Been at this for a while - I'm not sure swapping the disk to the other controller will work from the Vagrantfile. The commands to do it are basically as sketched below; you need to know the disk path or UUID. I'm going to take another shot at doing this in the packer build
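Roughly, with VBoxManage (VM name, controller names and medium path below are assumptions):

```sh
# Detach the disk from the SATA controller and reattach it to the IDE
# controller. The medium path/UUID can be found via `VBoxManage list hdds`.
VM="opensuse-42"
DISK="/path/to/box-disk001.vmdk"

VBoxManage storageattach "$VM" --storagectl "SATA Controller" \
  --port 0 --device 0 --medium none
VBoxManage storageattach "$VM" --storagectl "IDE Controller" \
  --port 0 --device 0 --type hdd --medium "$DISK"
```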
The hope is that this will resolve the problems with very slow io we're seeing on this box in elastic#30295
The hope is that this will resolve the problems with very slow io we're seeing on this box in #30295
I merged #32053 to turn on host io caching for the opensuse box and re-enabled the tests for the suse boxes on master (14d7e2c). We'll see how they do on the opensuse box - I'm not sure where to start on the sles box, as it has its disk on the ide controller. It's possible that it's unrelated to what we're seeing here |
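Outside the Vagrantfile, the host I/O caching switch is a one-liner against the controller; roughly (VM and controller names are assumptions, not necessarily what #32053 did verbatim):

```sh
# Enable VirtualBox host I/O caching on the controller the disk hangs off.
VBoxManage storagectl "opensuse-42" --name "SATA Controller" --hostiocache on
```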
Infra merged https://github.com/elastic/infra/pull/5975 and a new version of the box with the disk attached to the ide controller has landed in vagrant cloud. I reverted enabling host io caching for the sata controller (16fe220) and turned tests back on for the suse boxes (418540e) |
This one timed out when waiting for the server to start - while we don't have the es logs to confirm it's this same slow io problem, it's not encouraging: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2936/console |
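For reference, "waiting for the server to start" in these tests amounts to polling the HTTP port and cluster health with a timeout, something along these lines (an approximation, not the exact harness code):

```sh
# Poll until the node answers and reports at least yellow health, or give up.
curl -s --max-time 120 \
  'localhost:9200/_cluster/health?wait_for_status=yellow&timeout=120s&pretty'
```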
Have logs from a bats run now; we're definitely still seeing the issue. I disabled the tests for these boxes again (0b601c6). https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2954/console |
I'm re-enabling tests on these boxes on master to confirm we're still seeing this issue in CI, while I dig into it a little more. It did not occur in three PR CI runs in #38864, and I was still not able to reproduce it locally. To re-disable tests on these boxes, you can either revert 94a4c32 or cherry-pick 0b601c6 and push it to the branch again; they should have the same effect |
Failed again in the periodic job https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/454/console build-master-packaging-454.txt.zip Disabled again (b88760b). Same symptom: 4 seconds to publish the cluster state
|
I did some testing to look at the disk usage of the opensuse image. While I'm not sure it's conclusive, it does seem consistent with our theory here that the disk is causing an issue.

This was running the geonames rally track on a single node and monitoring with iostat at a 1-second interval, with both the node and rally running inside a vm image. The node was started the way we start it in the packaging tests, i.e. rally did not manage it. The conditions I tested were the centos-7 image vs the opensuse-42 image, with or without added stress on the host machine (where stress was an elasticsearch …). (These plots are truncated to an x range where most of the activity is; the time range after this doesn't have much going on.)

So the opensuse runs had a lot more merged read and write requests queued to the disk per second than the centos runs. Strangely, though, the average queue size looks similar for both. Looking at average service time, the opensuse runs seem spikier, but again not super qualitatively different.

The merged r/w requests queued per second bit seems the most interesting, but I'm not very familiar with how that queueing works, so I'll look into it more.

Data these plots are from: opensuse-nostress-iostat-docs-tagged.json.txt |
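For anyone reading the numbers: the columns referred to above come from extended iostat output, roughly invoked like this (the exact flags used for the run aren't shown here):

```sh
# rrqm/s and wrqm/s are the merged read/write requests queued per second,
# avgqu-sz is the average queue length, and svctm/await are the service and
# wait times mentioned above.
iostat -x 1
```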
Both opensuse and sles were re-enabled with a recent refactoring of the packaging tests, and we haven't (yet) seen this issue crop up. Additionally, we are moving to running each OS within GCP instead of vagrant (at least for CI). I'm going to optimistically close this again.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+packaging-tests/2645/console
REPRODUCE WITH line says:
./gradlew :x-pack:qa:vagrant:vagrantOpensuse42#batsPackagingTest -Dtests.seed=7E7AE0563BF3B4A3
The main problem seems to be
Unfortunately I can't find the server logs so can't see any more detail than this. Here's the relevant bit of the CI log: