-
Notifications
You must be signed in to change notification settings - Fork 42
chore: increase timeout for Ceph #618
chore: increase timeout for Ceph #618
Conversation
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. Expand to view the summary
Test stats 🧪
|
* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service
* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service
* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service
* chore: simplify sleep parameter, using a Duration * chore: wait longer for Ceph service
@jsoriano I'm still seeing some failures in Ceph tests:
Are you familiar with the |
@mdelapenya investigating this I have seen that the metricsets in the Ceph module query two different endpoints, some metricsets work with the "Ceph REST API", and some others with the "Ceph Manager Daemon". I guess that querying the incorrect endpoint will result in 404s. In the reference configuration they are separated and query different endpoints: https://www.elastic.co/guide/en/beats/metricbeat/7.10/metricbeat-module-ceph.html#_example_configuration_8 The docker compose scenario starts two containers, one called
System tests only support the As this seems quite special, and I guess this module is not so heavily used, I'd suggest to skip tests for this module by now and open a follow up issue. Unless you can think in a straightforward way of supporting this here. |
What I'm concerned about is the tests do pass many many times. But how? It seems it's not a timeout issue (we also increased the timeout_factor in CI from 3 to 5 -fleet reasons), so I'd say there is something weird:
version: '2.3'
services:
ceph:
image: docker.elastic.co/integrations-ci/beats-ceph:${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}-2
build:
context: ./_meta
dockerfile: Dockerfile.${CEPH_CODENAME:-nautilus}
args:
CEPH_VERSION: ${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}
ports:
- 5000
- 8003
- 8080
ceph-api:
image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
build:
context: ./_meta
dockerfile: Dockerfile.jewel
args:
CEPH_VERSION: master-6373c6a-jewel-centos-7-x86_64
ports:
- 5000 massaged one: version: "2.3"
services:
ceph:
image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
ports:
- 5000
ceph-api:
image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
ports:
- 5000 For some reason all ports are not extracted back to the new one. I'll debug that process, but we do not do any magic: simply using these structs to load the YAML file: type service interface{}
type compose struct {
Version string `yaml:"version"`
Services map[string]service `yaml:"services"`
}
...
c := compose{}
err = yaml.Unmarshal(bytes, &c)
... |
Wait, is this correct? https://github.com/elastic/beats/blob/master/metricbeat/module/ceph/docker-compose.yml#L5-L16 Is the ceph-api using a hardcoded version for |
@jsoriano I think ceph-api compose file is wrong, probably caused by elastic/beats@273c6a8: I'm going to send a fix there 🙏 review 😄 |
This conversation seems relevant: elastic/beats#16254 (comment) It seems that we decided to delay some things. |
What does this PR do?
We are increasing the timeout when waiting for the Ceph integration to be up-and-running, as the Beats team already do in https://github.com/elastic/beats/blob/ef6274d0d1e36308a333cbed69846a1bd63528ae/metricbeat/module/ceph/mgr_osd_tree/mgr_osd_tree_integration_test.go#L35. Because testcontainers is not supporting (yet) a
waitFor
strategy (it's about to be merged), we have to sleep the thread. And we are increasing the timeout to 2 minutes only for Ceph, the rest of the services stay at 30 seconds.In between, we refactored the Sleep method to simplify its usage, using time.Duration instead of string as input parameter.
Why is it important?
As a result of this PR, we expect to reduce the flakiness detected by the E2E for Cep, causing noise in Beats PRs, which needed to be re-triggered.
Checklist
make notice
in the proper directory)How to test this PR locally
Related issues
Follow-ups
Once testcontainers adds the functionality to wait for compose services, we will be able to consume it in a nicer way, instead of blocking the thread by 2 minutes. See testcontainers/testcontainers-go#261
An example of how it would be: