Skip to content
This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

chore: increase timeout for Ceph #618

Merged
merged 2 commits into from
Jan 19, 2021

Conversation

mdelapenya
Copy link
Contributor

@mdelapenya mdelapenya commented Jan 19, 2021

What does this PR do?

We are increasing the timeout when waiting for the Ceph integration to be up-and-running, as the Beats team already do in https://github.com/elastic/beats/blob/ef6274d0d1e36308a333cbed69846a1bd63528ae/metricbeat/module/ceph/mgr_osd_tree/mgr_osd_tree_integration_test.go#L35. Because testcontainers is not supporting (yet) a waitFor strategy (it's about to be merged), we have to sleep the thread. And we are increasing the timeout to 2 minutes only for Ceph, the rest of the services stay at 30 seconds.

In between, we refactored the Sleep method to simplify its usage, using time.Duration instead of string as input parameter.

Why is it important?

As a result of this PR, we expect to reduce the flakiness detected by the E2E for Cep, causing noise in Beats PRs, which needed to be re-triggered.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have run the Unit tests for the CLI, and they are passing locally
  • I have run the End-2-End tests for the suite I'm working on, and they are passing locally
  • I have noticed new Go dependencies (run make notice in the proper directory)

How to test this PR locally

SUITE="metricbeat" TAGS="integrations && ceph"  TIMEOUT_FACTOR=3 DEVELOPER_MODE=true LOG_LEVEL=TRACE make -C e2e functional-test

Related issues

Follow-ups

Once testcontainers adds the functionality to wait for compose services, we will be able to consume it in a nicer way, instead of blocking the thread by 2 minutes. See testcontainers/testcontainers-go#261

An example of how it would be:

compose.WithExposedService(
    "nginx_1", 80,
    wait.NewHTTPStrategy("/").
        WithPort("80/tcp").
        WithStartupTimeout(10*time.Second))

@mdelapenya mdelapenya self-assigned this Jan 19, 2021
@mdelapenya mdelapenya requested a review from a team January 19, 2021 16:32
@mdelapenya mdelapenya changed the title <!-- Type of change Please label this PR with one of the existing labels, depending on the scope of your change --> chore: increase timeout for Ceph Jan 19, 2021
@mdelapenya mdelapenya requested a review from jsoriano January 19, 2021 16:35
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #618 opened

    • Start Time: 2021-01-19T16:32:31.434+0000
  • Duration: 23 min 29 sec

  • Commit: c6b7652

Test stats 🧪

Test Results
Failed 0
Passed 105
Skipped 9
Total 114

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 105
Skipped 9
Total 114

@mdelapenya mdelapenya marked this pull request as ready for review January 19, 2021 17:00
@mdelapenya mdelapenya merged commit 5ea7866 into elastic:master Jan 19, 2021
mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit to mdelapenya/e2e-testing that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
@mdelapenya mdelapenya deleted the 607-integrations-timeout branch January 19, 2021 17:17
mdelapenya added a commit that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
mdelapenya added a commit that referenced this pull request Jan 19, 2021
* chore: simplify sleep parameter, using a Duration

* chore: wait longer for Ceph service
@mdelapenya
Copy link
Contributor Author

@jsoriano I'm still seeing some failures in Ceph tests:

[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=PuIzIHcBlvID7Xz7NoyS error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=P-IzIHcBlvID7Xz7NoyS error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QOIzIHcBlvID7Xz7NoyS error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QeIzIHcBlvID7Xz7Oox4 error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=QuIzIHcBlvID7Xz7Oox4 error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=Q-IzIHcBlvID7Xz7Oox4 error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=R-IzIHcBlvID7Xz7Q4zq error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SOIzIHcBlvID7Xz7Q4zq error.message="HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SeIzIHcBlvID7Xz7Q4zq error.message="error in fetch: HTTP error 404 in : 404 NOT FOUND"
[2021-01-20T14:29:32.458Z] time="2021-01-20T14:29:31Z" level=error msg="Error Hit found" ID=SuIzIHcBlvID7Xz7Q4zq error.message="HTTP error 404 in : 404 NOT FOUND"

Are you familiar with the error in fetch: HTTP error 404 in : 404 NOT FOUND error? You can find the container logs for the test execution here: https://beats-ci.elastic.co/job/e2e-tests/job/e2e-testing-mbp/job/master/180/artifact/docker_logs_metricbeat_integrations%20&&%20ceph.log

@jsoriano
Copy link
Member

@mdelapenya investigating this I have seen that the metricsets in the Ceph module query two different endpoints, some metricsets work with the "Ceph REST API", and some others with the "Ceph Manager Daemon". I guess that querying the incorrect endpoint will result in 404s.

In the reference configuration they are separated and query different endpoints: https://www.elastic.co/guide/en/beats/metricbeat/7.10/metricbeat-module-ceph.html#_example_configuration_8

The docker compose scenario starts two containers, one called ceph, and another one called ceph-api. Integration tests in Beats explicitly use one or the other service depending on the one they support:

System tests only support the mgr_* metricsets: https://github.com/elastic/beats/blob/c6af0496569b2ea95e7325e2711496b29e831405/metricbeat/module/ceph/test_ceph.py#L33-L40

As this seems quite special, and I guess this module is not so heavily used, I'd suggest to skip tests for this module by now and open a follow up issue. Unless you can think in a straightforward way of supporting this here.

@mdelapenya
Copy link
Contributor Author

What I'm concerned about is the tests do pass many many times. But how? It seems it's not a timeout issue (we also increased the timeout_factor in CI from 3 to 5 -fleet reasons), so I'd say there is something weird:

  • the containers are not properly started. I'm taking a look at the original docker-compose Vs the massaged one:
version: '2.3'

services:
  ceph:
    image: docker.elastic.co/integrations-ci/beats-ceph:${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}-2
    build:
      context: ./_meta
      dockerfile: Dockerfile.${CEPH_CODENAME:-nautilus}
      args:
        CEPH_VERSION: ${CEPH_VERSION:-master-97985eb-nautilus-centos-7-x86_64}
    ports:
      - 5000
      - 8003
      - 8080
  ceph-api:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    build:
      context: ./_meta
      dockerfile: Dockerfile.jewel
      args:
        CEPH_VERSION: master-6373c6a-jewel-centos-7-x86_64
    ports:
      - 5000

massaged one:

version: "2.3"
services:
  ceph:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    ports:
    - 5000
  ceph-api:
    image: docker.elastic.co/integrations-ci/beats-ceph:master-6373c6a-jewel-centos-7-x86_64-1
    ports:
    - 5000

For some reason all ports are not extracted back to the new one. I'll debug that process, but we do not do any magic: simply using these structs to load the YAML file:

type service interface{}
type compose struct {
	Version  string             `yaml:"version"`
	Services map[string]service `yaml:"services"`
}

... 
        c := compose{}
	err = yaml.Unmarshal(bytes, &c)
...

@mdelapenya
Copy link
Contributor Author

Wait, is this correct? https://github.com/elastic/beats/blob/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

Is the ceph-api using a hardcoded version for master-6373c6a-jewel?

@mdelapenya
Copy link
Contributor Author

@jsoriano
Copy link
Member

@jsoriano I think ceph-api compose file is wrong, probably caused by elastic/beats@273c6a8:
See https://github.com/elastic/beats/blame/master/metricbeat/module/ceph/docker-compose.yml#L5-L16

This conversation seems relevant: elastic/beats#16254 (comment) It seems that we decided to delay some things.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate why Ceph tests are flaky, as it seems to fail with quite a frequency
5 participants