Failing test: X-Pack Security API Integration Tests (Session Concurrent Limit).x-pack/test/security_api_integration/tests/session_concurrent_limit/cleanup·ts - security APIs - Session Concurrent Limit Session Concurrent Limit cleanup should properly clean up sessions that exceeded concurrent session limit even for multiple providers #149091

Closed
kibanamachine opened this issue Jan 18, 2023 · 42 comments · Fixed by #148985, #173828, #174748 or #183409
Labels
failed-test A test failure on a tracked branch, potentially flaky-test Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more!

Comments

@kibanamachine
Contributor

kibanamachine commented Jan 18, 2023

A test failed on a tracked branch

Error: expected 6 to equal 4
    at Assertion.assert (expect.js:100:11)
    at Assertion.apply (expect.js:227:8)
    at Assertion.be (expect.js:69:22)
    at Context.<anonymous> (cleanup.ts:214:54)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Object.apply (wrap_function.js:73:16)

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Jan 18, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Jan 18, 2023
@kibanamachine kibanamachine added the Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! label Jan 18, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-security (Team:Security)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 18, 2023
@azasypkin azasypkin self-assigned this Jan 18, 2023
@kibanamachine
Contributor Author

New failure: CI Build - main

@mistic
Member

mistic commented Jan 18, 2023

Skipped.

main: 74d9321

@azasypkin
Member

Duplicate of #149090

@azasypkin azasypkin marked this as a duplicate of #149090 Jan 18, 2023
@azasypkin azasypkin closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 18, 2023
wayneseymour pushed a commit to wayneseymour/kibana that referenced this issue Jan 19, 2023
mikecote added a commit that referenced this issue Jan 26, 2023
Resolves #148914
Resolves #149090
Resolves #149091
Resolves #149092

In this PR, I'm making the following Task Manager bulk APIs retry
whenever conflicts are encountered: `bulkEnable`, `bulkDisable`, and
`bulkUpdateSchedules`.

To accomplish this, the following had to be done:
- Revert the original PR (#147808) because the retries didn't load the updated documents whenever version conflicts were encountered, and the approach had to be redesigned.
- Create a `retryableBulkUpdate` function that can be reused among the bulk APIs (a rough sketch of the pattern follows this list).
- Fix a bug in `task_store.ts` where the `version` field wasn't passed through properly (no type safety for some reason).
- Remove `entity` from being returned on bulk update errors. This helped reuse the same response structure when objects weren't found.
- Create a `bulkGet` API on the task store so we get the latest documents prior to an ES refresh happening.
- Create a single mock task function that mocks Task Manager tasks for unit-test purposes. This was necessary because other places were doing `as unknown as BulkUpdateTaskResult` and escaping type safety.
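
For reference, here is a minimal sketch of the retry-on-conflict pattern described above. It is not the actual Task Manager code: the `TaskDoc` shape and the `bulkGet`/`bulkUpdate` callbacks are illustrative stand-ins for the real task store APIs.

```ts
// Illustrative sketch only: TaskDoc and the bulkGet/bulkUpdate callbacks
// stand in for the real Task Manager store APIs.
interface TaskDoc {
  id: string;
  version?: string;
  enabled: boolean;
}

interface BulkUpdateResult {
  task?: TaskDoc;
  error?: { id: string; statusCode: number; message: string };
}

interface RetryableBulkUpdateOpts {
  ids: string[];
  // Re-reads the latest copies of the documents so each retry uses fresh versions.
  bulkGet: (ids: string[]) => Promise<TaskDoc[]>;
  // Applies the mutation (e.g. enable/disable/reschedule) to a batch of tasks.
  bulkUpdate: (tasks: TaskDoc[]) => Promise<BulkUpdateResult[]>;
  maxAttempts?: number;
}

// Retries bulk updates, but only for documents that failed with a 409 version
// conflict; all other results are returned to the caller unchanged.
export async function retryableBulkUpdate({
  ids,
  bulkGet,
  bulkUpdate,
  maxAttempts = 3,
}: RetryableBulkUpdateOpts): Promise<BulkUpdateResult[]> {
  const resultsById = new Map<string, BulkUpdateResult>();
  let pendingIds = ids;

  for (let attempt = 0; attempt < maxAttempts && pendingIds.length > 0; attempt++) {
    // Fetch the documents again before each attempt so we carry the latest version.
    const tasks = await bulkGet(pendingIds);
    const results = await bulkUpdate(tasks);

    pendingIds = [];
    for (const result of results) {
      const id = result.task?.id ?? result.error!.id;
      const isConflict = result.error?.statusCode === 409;
      if (isConflict && attempt < maxAttempts - 1) {
        pendingIds.push(id); // version conflict: retry with a fresh copy
      } else {
        resultsById.set(id, result);
      }
    }
  }

  // Preserve the order of the requested ids in the combined result.
  return ids.map((id) => resultsById.get(id)!);
}
```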

Flaky test runs:
- [Framework]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1776
- [Kibana Security]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/1786

Co-authored-by: kibanamachine <[email protected]>
kqualters-elastic pushed a commit to kqualters-elastic/kibana that referenced this issue Feb 6, 2023
@kibanamachine kibanamachine reopened this May 12, 2023
@kibanamachine
Contributor Author

New failure: CI Build - 8.8

@kibanamachine
Contributor Author

New failure: CI Build - 8.8

@jeramysoucy
Contributor

Ran another flaky test runner just to be sure, but this looks tied to a series of CI failures on Friday.

CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this issue Feb 15, 2024
…ssion limit for users (elastic#174748)

## Summary

Closes elastic#149091 

This PR addresses the potential issue of a session not being found in
the session index by introducing a timeout before attempting to write
the next one. Passing these [changes through
FTR](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4854)
makes the test pass 100% of the time with 400 test runs.
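
For context, the kind of per-session wait the PR describes might look roughly like the sketch below. This is an illustration under assumptions, not the actual cleanup.ts code: the `waitForSessionCount` helper, its signature, and the minimal `es`/`retry` interfaces are all stand-ins for the FTR services used by the test.

```ts
// Illustrative sketch only: wait until the session index reflects the expected
// number of sessions before the test writes the next one.
import expect from '@kbn/expect';

interface EsCountClient {
  count(params: { index: string }): Promise<{ count: number }>;
}

interface RetryService {
  tryForTime<T>(timeoutMs: number, fn: () => Promise<T>): Promise<T>;
}

async function waitForSessionCount(es: EsCountClient, retry: RetryService, expected: number) {
  await retry.tryForTime(20_000, async () => {
    const { count } = await es.count({ index: '.kibana_security_session_1' });
    // Keep retrying until the previously written session document is searchable.
    expect(count).to.be(expected);
  });
}
```
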
CoenWarmer pushed a commit to CoenWarmer/kibana that referenced this issue Feb 15, 2024
## Summary

This PR is for troubleshooting
elastic#149091

It duplicates the timeout check per session from the `...legacy
sessions` test (see elastic#174748) for
the `...multiple providers` test.

Note: we are not seeing the additional 'Failed to write a new
session' log in any of the recent failures.

Could not reproduce the issue with a flaky test runner:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4949
@kibanamachine
Contributor Author

New failure: CI Build - 8.13

@kibanamachine kibanamachine reopened this Mar 8, 2024
@kibanamachine
Contributor Author

New failure: CI Build - 8.12

@kibanamachine
Contributor Author

New failure: CI Build - main

@kibanamachine
Contributor Author

New failure: CI Build - 8.13

@kibanamachine
Contributor Author

New failure: CI Build - main

@kibanamachine
Contributor Author

New failure: CI Build - main

@kibanamachine
Contributor Author

New failure: CI Build - main

@kibanamachine
Contributor Author

New failure: kibana-on-merge - 8.13

@kibanamachine
Contributor Author

New failure: kibana-on-merge - 8.13

@kibanamachine
Contributor Author

New failure: kibana-elasticsearch-snapshot-verify - 8.13

@kibanamachine
Contributor Author

New failure: kibana-on-merge - main

@legrego
Member

legrego commented May 28, 2024

latest failure:

└-> should properly clean up sessions that exceeded concurrent session limit even for multiple providers
  └-> "before each" hook: global before each for "should properly clean up sessions that exceeded concurrent session limit even for multiple providers"
  └-> "before each" hook for "should properly clean up sessions that exceeded concurrent session limit even for multiple providers"
  └- ✖ fail: security APIs - Session Concurrent Limit Session Concurrent Limit cleanup should properly clean up sessions that exceeded concurrent session limit even for multiple providers
  │      Error: retry.tryForTime reached timeout 20000 ms
  │ Error: expected 5 to equal 6
  │     at Assertion.assert (expect.js:100:11)
  │     at Assertion.apply (expect.js:227:8)
  │     at Assertion.be (expect.js:69:22)
  │     at cleanup.ts:235:56
  │     at processTicksAndRejections (node:internal/process/task_queues:95:5)
  │     at runAttempt (retry_for_success.ts:29:15)
  │     at retryForSuccess (retry_for_success.ts:98:21)
  │     at RetryService.tryForTime (retry.ts:37:12)
  │     at Context.<anonymous> (cleanup.ts:234:7)
  │     at Object.apply (wrap_function.js:73:16)
  │       at onFailure (retry_for_success.ts:17:9)
  │       at retryForSuccess (retry_for_success.ts:84:7)
  │       at RetryService.tryForTime (retry.ts:37:12)
  │       at Context.<anonymous> (cleanup.ts:234:7)
  │       at Object.apply (wrap_function.js:73:16)

@azasypkin azasypkin self-assigned this May 28, 2024
@kibanamachine
Contributor Author

New failure: kibana-on-merge - main

@kibanamachine
Contributor Author

New failure: kibana-on-merge - main

@elena-shostak
Contributor

elena-shostak commented Aug 12, 2024

Came across an interesting thing: the cleanupInterval is set to 5h in the FTR config:

--xpack.security.session.cleanupInterval=5h
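
For reference, such an override typically lives in the suite's FTR config. A minimal sketch, assuming the standard Kibana FTR config provider shape; the base config path below is a placeholder, not the actual file:

```ts
// Illustrative sketch of an FTR config overriding the session cleanup interval;
// the base config path is a placeholder, not the actual file.
import { FtrConfigProviderContext } from '@kbn/test';

export default async function ({ readConfigFile }: FtrConfigProviderContext) {
  const baseConfig = await readConfigFile(require.resolve('../../api_integration/config.ts'));

  return {
    ...baseConfig.getAll(),
    kbnTestServer: {
      ...baseConfig.get('kbnTestServer'),
      serverArgs: [
        ...baseConfig.get('kbnTestServer.serverArgs'),
        // Push the periodic cleanup far out so that only the explicit
        // `POST /session/_run_cleanup` calls made by the tests trigger it.
        '--xpack.security.session.cleanupInterval=5h',
      ],
    },
  };
}
```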

The log shows that we invoke the cleanup task in the very first test and get a 500:

[00:00:03]         │ debg Existing sessions: {"total":{"value":3,"relation":"eq"},"max_score":1,"hits":[{"_index":".kibana_security_session_1","_id":"gyCzdu4JDUUi03Cd2G+daZuJQKTEjvCidLFgsm6sgA4=","_score":1,"_source":{"provider":{"type":"basic","name":"basic1"},"idleTimeoutExpiration":1720528171722,"lifespanExpiration":1723116571722,"createdAt":1720524571722,"usernameHash":"e9ab99ee1daa1aa2b5cac38d446ac31f555b0a0bd0cd7a7335a3a2a065635e64","content":"M1k1w9v840N6u1EZkfwVVjzwLk9udSESi9jwRJ1znBPz0EO4RAliKE53gVyL99Zr1HpUAIIstiRtTbPIF0cVjoHmp/Z+kHeGxNAB2FbFecBjo7zPpxL6VnOr/7E3qNevj/8/sBObC4rW7l3kkDFV7AgSqWY9S64JazbqmpnH8mG18wE36L/hH4P0gTBmYCOTK35u83ajRoGFrr1Mbtkm8GEowxCKuRgkOAie2CO6btZ0KUGVHv6CiF4P9WKZnGIf2fyOS7YG7rFx98oWAs4OZhRm+/1CDU+4Q5x4L8e4Z3/fvSiah+15+1rHUzG3PNE9RQBtaGfFpxr7mo8KNCDTzvg="}},{"_index":".kibana_security_session_1","_id":"ED/Teu0G+MXg5zFFw63f4IGhJYNDQiLTku9RFKU73ZQ=","_score":1,"_source":{"provider":{"type":"basic","name":"basic1"},"idleTimeoutExpiration":1720528172475,"lifespanExpiration":1723116572475,"createdAt":1720524572475,"usernameHash":"e9ab99ee1daa1aa2b5cac38d446ac31f555b0a0bd0cd7a7335a3a2a065635e64","content":"SVDq4yk0ZZV76/00timLNRKDOnVQrZLYlsPxBX+gbHR8F/36smopXI+tS4MXgqQL4fAmVmrScmA69r/O4y5B3bHNPY1yqTYx9Zomel6bfZ3dsjQScHCrwKOLVUob4+Hn92FvT289OUMncks1OwQg5S5bET06Nd4jYhd7C5kKUzsH+vHDJV7mgMmBqRuf8m1rNUUge3XO4ra19Uil8Ou6qDGJXJRPbYOZWHoq2ey+Nr6h5j8zUPfr/wo5PiTSWdNw1PJG/MXOQiBHRi4hbVTw2ZIFIlxi1ARh+/D2aV5KbziAmxeCHu9o/1g6YJrUAxGNI9Qtd9Po5SY53sGmZ4mBF6U="}},{"_index":".kibana_security_session_1","_id":"/zR3wzCyms2bRib56OUNOn6WSHMnKbM2VjNtyBZrpJw=","_score":1,"_source":{"provider":{"type":"basic","name":"basic1"},"idleTimeoutExpiration":1720528173016,"lifespanExpiration":1723116573016,"createdAt":1720524573016,"usernameHash":"e9ab99ee1daa1aa2b5cac38d446ac31f555b0a0bd0cd7a7335a3a2a065635e64","content":"txDKnicnMjQu/ZP42i+W9BT5sM81jTL8VeHXYLTYfyWiQV7fPB98IJKQeWHPfeEjweGwjvHOfU/zooZANLYv81IeY5X4JlOL3O51aPTdf6a5iaSwJTi2DN3ASiOhi7OcLqabMs8csbAbzP6w9aN6O1PmMekPmdiLt/aQlTNlLWJPMuJ4drdTU6FJ+MFc4aOa4X1MfFX/brBv+AQ/fcr5Fq4oB/7EJPSoj3b8hjUeYaZ1DTu6sB381HfpXkg3OfihRdioUnTGjwaSD20HwT9c+roI9GZhmqewijXDoW+xwLyfv5JYtkxzP+zlktdPbGBZI2NUmxChUUsrMNDAbyEQ7aM="}}]}.
[00:00:03]         │ proc [kibana] [2024-07-09T11:29:33.098+00:00][ERROR][http] 500 Server Error {"http":{"response":{"status_code":500},"request":{"method":"post","path":"/session/_run_cleanup"}},"error":{"message":"Failed to run task \"session_cleanup\" as it is currently running"},"service":{"node":{"roles":["background_tasks","ui"]}}}
[00:00:03]         │ debg --- retry.tryForTime error: expected 200 "OK", got 500 "Internal Server Error"

That means the cleanup job itself was already running, which shouldn't be the case, because we set the interval to 5h beforehand, and this is the very first test in our test suite and the first time we invoke session/_run_cleanup.

So if the cleanup job is running on some different interval, it might corrupt sessions from the following tests even before we invoke session/_run_cleanup, which sometimes leads to flaky behaviour. (Two test suites running on the same node and overriding the interval?)

cc @azasypkin Perhaps you can shed some light on this; I don't have that much context around the FTR setup in general.

@kibanamachine
Contributor Author

New failure: kibana-elasticsearch-snapshot-verify - 8.15

@azasypkin
Member

New failure: kibana-elasticsearch-snapshot-verify - 8.15

Same reason as described in #149091 (comment)

@legrego legrego closed this as completed Oct 1, 2024