Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functional tests for ILM & Data Stream Lifecycle #14100

Open
axw opened this issue Sep 17, 2024 · 11 comments
Open

Add functional tests for ILM & Data Stream Lifecycle #14100

axw opened this issue Sep 17, 2024 · 11 comments
Assignees
Milestone

Comments

@axw
Copy link
Member

axw commented Sep 17, 2024

We should implement functional tests that verify ILM or Data Stream Lefecycle Management is used as expected according to user configuration, and depending on whether a cluster is created new or upgraded. See #13898 (comment)

@axw axw added this to the 8.16 milestone Sep 17, 2024
@endorama
Copy link
Member

endorama commented Oct 29, 2024

I've been thinking how to implement functional tests. I explored 3 possible approaches:

  1. as part of smoke tests
  2. as part of system tests
  3. with a new approach independent of the 2 above

TL;DR: my conclusion is that we may want to proceed with option 3.

Option 1 heavy leverages ESS, creating infrastructure using terraform, on top of a custom built Bash framework. We do not need leveraging ESS for functional testing and the framework looks built with that purpose. Moreover Bash flexibility comes at a great readability costs. Orchestration is also not clear to me, and documentation is lacking.

Option 2 uses Go tests, but is strictly focused on testing specific APM server behaviors. As we now rely on apm-data plugin we may need to test interactions with different Elasticsearch versions and more complex interactions, which do not fit the current model (that starts a single stack against which runs all tests).

A third option would be to build a framework that looks similar to Option 2, with guarantees provided by Go, but with a scope more similar to Option 1. I created a simple stub in https://github.com/endorama/apm-server/blob/af3ca3744e4746d4c6d7f65a162927d6c9e19331/functionaltests/main_test.go#L31 using testcontainers library and starting a specific Elasticsearch version in a container. The idea here would be to build the freedom to leverage single stack components, allowing us to express complex upgrade and assertion scenarios.
I have some concerns on this approach:

  • we would duplicate packages already existing in systemtests, but that looks like are lacking the flexibility we would need to express complex cases
  • we would duplicate some logic of smoke tests, but relying on docker would allow faster testing and using Go easier maintenance in the long term

The third option looks the best to me if I think at future use cases (es adding further cases to this logic #13678 is more difficult in Bash than in Go), and I think a test framework must be ergonomic enough to encourage use.

I see potential for convergence in the long run, but is out of scope and not sure how much weight should have in the decision.

@inge4pres
Copy link
Contributor

depending on whether a cluster is created new or upgraded.

The upgrade part is going to be especially tricky IMO, because IIANM Elasticsearch will never allow an upgrade between a released version (e.g. 8.15.3) and an un-released one (SNAPSHOT).
If this is correct, I have no idea on how we could run the test for an upgrade before a release is created.

@endorama
Copy link
Member

Can we use BCs in cloud first region? I'm not sure is possible to upgrade to those though.

@endorama
Copy link
Member

I'll recap the discussions from today about how to move forward. I discussed this with @axw, @1pkg and @inge4pres.

My current stub uses testcontainers, but I was not aware that we had flakiness issues with it in systemtests, so it does not look a great path forward.
Additionally, Andrew noticed that our customers mostly use ESS, not some Docker/Compose stack, and there are benefits in testing there.

Leveraging current smoke tests does not look the preferred path forward, as they are mostly Bash + CI and this greatly limits both expressiveness sin tests and reproducibility.

The current proposal would be to implement a new testing framework built on these principals:

  • we separate the 3 main layers we need: orchestration, fixtures and assertions
  • orchestration will be Terraform + Bash based; transient modifications for testing will happen outside Terraform, produce some configuration drift but that's not considered relevant (as infrastructure is expected to be cleaned up after tests)
  • fixtures and assertions will be Go bases, and developed as reusable components we can use for both apm-server and serverless.

This approach would go towards the convergence mentioned in my previous comment, and not using testcontainers would help with converging earlier. The layers mentioned about should "swappable", so that we can mix infrastructure/fixtures/assertions as needed based on testing scenarios. This could potentially also extends to running tests with a Docker stack or ECK, but is out of scope at the moment.

We will also have to consider how to run tests in parallel, some tests may taint the Elasticsearch stack used in a way that does not make safe reusing it for other tests, and some may not. Is not clear how to address this in our design at the moment, but for efficiency would be interesting to be able to mark clusters as tainted for further tests reuse.

Regarding which tests cases to run, we have a set already mentioned in #13898 (comment) that we should include.
Additionally we should include testing upgrade path from versions before 8.13.0 to 8.15 and 8.16.

@endorama
Copy link
Member

As per our latest discussion, I created a stub of a first test on the new framework we discussed. You can see it here: https://github.com/endorama/apm-server/blob/3b4ec398e8715b9b61ede38cb84aa5928d241492/testing/functional/test1/main_test.go

The first test I'm aiming for is testing the upgrade path from 8.14.0 to 8.15.1 as defined by:

Upgrade 8.14.x to 8.15.1+ with defaults: ILM should continue to be used for old indices, DLM should be used for new indices

I also added a README to clarify the overall idea.

@simitt
Copy link
Contributor

simitt commented Nov 11, 2024

Moving this to the next iteration, @endorama please update this issue with where you are at with the functional tests and the goals for this iteration.

@endorama
Copy link
Member

Adding this as a possible future path to evaluate: leveraging https://pkg.go.dev/github.com/rogpeppe/go-internal/testscript (example in use: https://github.com/kortschak/dex/blob/master/testdata/worklog_load_postgres.txt https://github.com/kortschak/dex/blob/master/main_test.go#L127). This would allow us to "script" our test leveraging CLI tools and go code to augment functionalities, with a defined test runner and test structure. We don't have to take a decision now, as our plan is to write Go code we can integrate it later on. The advantage I see are:

  • clearly separates test logic from helper logic
  • test case can be easier to follow through
    I'm not sure how easy it would be to reproduce them manually, but would not be much different than tests purely written in Go (as there would be no way to trigger specific functions outside of Go code).

This solution has been suggested by our colleague Dan Kortschak (not citing him to avoid subscribing to the issue).

/cc @axw

@axw
Copy link
Member Author

axw commented Nov 20, 2024

I would be happy to see a spike on testscript-based functional tests, I have also advocated for that in the past, but I don't think anyone ever gave it a shot in apm-server.

@endorama
Copy link
Member

I prepared the first test in this draft PR #14935

This implementation leverages:

  • terraform for infrastructure, run by Go code and using the elastic provider to bootstrap and upgrade EC cluster
  • apm-perf/telemetrygen, a new package extracted in apm-perf to allow easy ingestion of canned data from Go (PR adding it add telemetrygen public package apm-perf#197)
  • plain Go assertion to perform the checks

Iteration speed is slow on this test as a full run could easily take 20 to 30 minutes. Overall I did not find major roadblocks. I have to investigate a timeout in applying the upgrade to 8.16.0 through Terraform, as there is not why to change such timeout at the moment. It happened only once so far, so I've yet to drawn a conclusion on how problematic this may be.

@raultorrecilla
Copy link

moving this to it-105 @endorama please update status if needed

@raultorrecilla
Copy link

in progress, moving it from it105 to it106

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants