feat: initial specs for ingest management #126
Then filebeat is started
And metricbeat is started
And endpoint is started
The BDD step is the same, so we could write just one implementation method, with an input parameter (the process to be present in the target)
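A minimal sketch of that idea, using only the Go standard library: a single pattern matches the three "is started" steps and passes the process name to one shared function. The names `stepPattern`, `processStarted`, `runStep` and the `running` map are hypothetical; in godog the pattern would be registered once together with a single step function.

```go
package main

import (
	"fmt"
	"regexp"
)

// stepPattern matches all three "<process> is started" steps with one expression.
var stepPattern = regexp.MustCompile(`^(filebeat|metricbeat|endpoint) is started$`)

// processStarted is a hypothetical check: it returns an error when the named
// process is not in the set of running processes.
func processStarted(running map[string]bool, process string) error {
	if !running[process] {
		return fmt.Errorf("process %q is not running", process)
	}
	return nil
}

// runStep extracts the process name from the step text and calls the shared implementation.
func runStep(running map[string]bool, step string) error {
	m := stepPattern.FindStringSubmatch(step)
	if m == nil {
		return fmt.Errorf("no step matches %q", step)
	}
	return processStarted(running, m[1])
}

func main() {
	running := map[string]bool{"filebeat": true, "metricbeat": true}
	steps := []string{"filebeat is started", "metricbeat is started", "endpoint is started"}
	for _, step := range steps {
		fmt.Printf("%s -> %v\n", step, runStep(running, step))
	}
}
```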
And the "Fleet" Kibana setup has been created
And the agent binary is installed in the target host
When the agent is un-enrolled from Kibana
Then no new data shows up in Elasticsearch locations using the enrollment token
I added "using the enrollment token" to match an existing step below. Is this assumption correct?
I'm not sure I would phrase it as 'using' the enrollment token, but it's not entirely wrong. I'd phrase it as: the host / agent is no longer able to send documents into ES (it will still be attempting to send them, running on the host).
Here I think you should say "using the access token". When an agent enrolls into Fleet, we exchange an enrollment token for an access token (one per agent).
Once you invalidate an enrollment token, agents that are already enrolled should continue to work, but you cannot enroll more agents with that enrollment token.
Thanks for the clarification Nicholas! Please look at L27:33: there is a specific scenario for revoking the enrollment token for an agent. Is that what you mean?
Mmm, reading your comment, I'd rephrase this second scenario (the one revoking the token) to this:
Scenario: Revoking the enrollment token for an agent
Given there is a "Fleet" user in Kibana
And the "Fleet" Kibana setup has been created
And the agent binary is installed in the target host
And the agent is un-enrolled from Kibana
When the enrollment token is revoked
Then no new data shows up in Elasticsearch locations using the enrollment token
And the enrolled agent continues to work
And I'd create another use case:
Scenario: A revoked enrollment token cannot enroll more agents
Given there is an enrollment token
When the enrollment token is revoked
Then it's not possible to use the token to enroll more agents
Does it make sense to you?
BTW, we should clarify what "the enrolled agent continues to work" means: i.e. it sends data to Elasticsearch, there is an endpoint we can query, a process is running in the host, etc.
Combining above two scenarios into one:
Scenario: Revoking the enrollment token for an agent
Given there is an agent enrolled with an enrollment token
When the enrollment token is revoked
Then it's not possible to use the token to enroll more agents
And the enrolled agent continues to work
Thanks so much Nicolas and Manu, I'm learning here too! Knowing what I do now, I'd suggest we really only have one distinct case to test, and I'd phrase it as:
Scenario: Revoking an enrollment token
Given the Fleet user is set up and a valid enrollment token exists
When the enrollment token is revoked
Then an attempt to enroll a new agent fails
The prerequisite for the test changes: the agent is NOT running and is NOT already enrolled.
@mdelapenya what do you think? Honestly, if you can get us the first more straight-forward case I'm happy to work this with the code snippets we have and infrastructure you provide. We need not stress about completing this one case now, the team is fine to take it over.
I like this scenario, because it's very straightforward and simple at the same time. I'd replace what we had. wdyt about rephrasing the "Given..." to "Given an agent is enrolled"? Or do we want to make it clear for this scenario that we need the Fleet user and a valid token?
BTW, in what state would the existing agent be? Will it pause? Will it continue to send data?
Given there is a "Fleet" user in Kibana
And the "Fleet" Kibana setup has been created
When the agent binary is installed in the target host
Then the dashboards for the agent are present in Elasticsearch
I'd like to know the exact data needed here: the ES query
the command to run the agent is:
./elastic-agent run
after this command is executed, we can wait a matter of seconds (5-20 seconds?) and then verify the existence of certain folders / data on the host as evidence of it working.
The logs we can check for are relative to the path where the agent was installed, so it would be, for example with a 7.8 agent:
elastic-agent-7.8.0-darwin-x86_64-BC5/data/logs/default/filebeat
elastic-agent-7.8.0-darwin-x86_64-BC5/data/logs/default/metricbeat
and from here:
elastic-agent-7.8.0-darwin-x86_64-BC5/data/run/default/metricbeat--7.8.0/meta.json
- any non-empty file will suffice for all 3 assertions
And for the Dashboards, let's actually use the API from Kibana, and even the Ingest one to assess this:
/api/ingest_manager/data_streams
- if you call it prior to any Agent being deployed it should return a list of zero data streams as:
{
  "data_streams": []
}
when called after the Agent is running, it will return a list of (currently in 7.8) 20 streams, with a format as:
{
  "data_streams": [
    {},
    {
      "index": "metrics-system.load-default",
      "dataset": "system.load",
      "namespace": "default",
      "type": "metrics",
      "package": "system",
      "package_version": "0.1.0",
      "last_activity": "2020-06-04T18:59:29.693Z",
      "size_in_bytes": 42605308,
      "dashboards": [
        {
          "id": "79ffd6e0-faa0-11e6-947f-177f697178b8-ecs",
          "title": "[Metrics System] Host overview ECS"
        },
        ...
        {
          "id": "5517a150-f9ce-11e6-8115-a7c18106d86a-ecs",
          "title": "[Logs System] SSH login attempts ECS"
        },
        {
          "id": "Filebeat-syslog-dashboard-ecs",
          "title": "[Logs System] Syslog dashboard ECS"
        }
      ]
    },
    ...
    {},
    {}
  ]
}
Let's assert the following:
- the data_streams call returns more than one element in its list.
- the data_streams call returns a list element with an "index" of "metrics-system.process-default"
- the list element with "index": "metrics-system.process-default" has a sibling list called 'dashboards'
- the 'dashboards' list has an element with a title of "[Metrics System] Host overview ECS"
I don't think we should walk the whole list here: I understand there is separate automation to confirm this, and walking it would make the test brittle to changes. How does that sound?
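As a rough sketch, the four assertions listed above could be applied to a raw response body like this. The struct and function names are hypothetical, only the fields needed for the assertions are decoded, and the sample body in `main` is a trimmed, made-up response rather than real API output. A missing 'dashboards' list is reported by the same error as a missing title.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type dashboard struct {
	ID    string `json:"id"`
	Title string `json:"title"`
}

type dataStream struct {
	Index      string      `json:"index"`
	Dashboards []dashboard `json:"dashboards"`
}

type dataStreamsResponse struct {
	DataStreams []dataStream `json:"data_streams"`
}

// checkDataStreams verifies that the response holds more than one data stream,
// that one of them has the given index, and that it lists a dashboard with the given title.
func checkDataStreams(body []byte, index, title string) error {
	var resp dataStreamsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return fmt.Errorf("decoding response: %w", err)
	}
	if len(resp.DataStreams) <= 1 {
		return fmt.Errorf("expected more than one data stream, got %d", len(resp.DataStreams))
	}
	for _, ds := range resp.DataStreams {
		if ds.Index != index {
			continue
		}
		for _, d := range ds.Dashboards {
			if d.Title == title {
				return nil
			}
		}
		return fmt.Errorf("data stream %q has no dashboard titled %q", index, title)
	}
	return fmt.Errorf("no data stream with index %q", index)
}

func main() {
	// Trimmed, made-up sample body for illustration only.
	body := []byte(`{"data_streams": [
		{"index": "metrics-system.load-default", "dashboards": []},
		{"index": "metrics-system.process-default", "dashboards": [
			{"id": "79ffd6e0-faa0-11e6-947f-177f697178b8-ecs", "title": "[Metrics System] Host overview ECS"}
		]}
	]}`)
	err := checkDataStreams(body, "metrics-system.process-default", "[Metrics System] Host overview ECS")
	fmt.Println("check result:", err)
}
```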
And the "Fleet" Kibana setup has been created
And the agent binary is installed in the target host
When the agent is un-enrolled from Kibana
Then no new data shows up in Elasticsearch locations using the enrollment token
What data is not present here? It'd be great to understand more about its nature, to identify when it shows up and when it does not.
updated:
a query you can use is as follows:
query the metrics* index with the equivalent of this KQL:
host.name:"7exl-w10x64l6-d" and @timestamp >= "2020-06-06T01:30:00.948Z"
where the hostname is replaced correctly and the timestamp in question is captured 2 seconds after the unenroll call.
Translated into an ES query (forgive me if this is terrible, it's a hacked version from dev tools and I didn't take the time to rework it much).
The same find/replace of the hostname and timestamp values is needed, of course:
GET _search
{
  "version": true,
  "size": 500,
  "docvalue_fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    },
    {
      "field": "system.process.cpu.start_time",
      "format": "date_time"
    },
    {
      "field": "system.service.state_since",
      "format": "date_time"
    }
  ],
  "_source": {
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    {
                      "match_phrase": {
                        "host.name": "7exl-w10x64l6-d"
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              },
              {
                "bool": {
                  "should": [
                    {
                      "range": {
                        "@timestamp": {
                          "gte": "2020-06-06T01:50:00.948Z",
                          "time_zone": "America/New_York"
                        }
                      }
                    }
                  ],
                  "minimum_should_match": 1
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2020-06-06T01:36:29.564Z",
              "format": "strict_date_optional_time"
            }
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
This query is perfect! :)
And the agent is un-enrolled from Kibana
When the agent is re-enrolled from the host
And the agent runs from the host
Then the agent shows up in Kibana
We will need the exact thing to check here: an API call, an XPath element in the UI...
We can absolutely get you the API calls and expectations. I don't know all of them offhand and am still digging through 7.8 testing, finding odd bugs, but I will work with the team tomorrow to fill in all of these with haste. We don't have the API documented yet either, so we'll get specifics for this and all similar requests in the branch.
The re-enroll call is exactly the same as before, and the asserts are the same, except that we can also check the timestamps on the metricbeat and filebeat files to see that they are newer. Newer than exactly what, I'm not 100% sure, because there is some period where the Agent is in a state of transition. Could we put in a short pause, wait for it to finish unenrolling, then capture that time and use it in the next step?
And the "Fleet" Kibana setup has been created
When the agent binary is installed in the target host
Then the dashboards for the agent are present in Elasticsearch
And the agent shows up in Kibana
Is it possible to get this without checking the UI, maybe an API call? I'd like to avoid any UI/DOM interaction if possible
Yes, it is. I was using very 'loose' language; 'shows up' and 'in Kibana' can be translated to the API as:
Request URL, GET: /api/ingest_manager/fleet/agents?page=1&perPage=20&showInactive=false
With the presumption that there were zero agents when we started, there should be one item in the list[] that is returned. Response snippet we can use to assert:
{
  "list": [
    {
      "id": "0a17686e-40c5-4a81-86ae-fb41ddd7ea96",
      "active": true,
      "config_id": "f1a077d0-a688-11ea-b905-bd56f880a400",
      "type": "PERMANENT",
      "enrolled_at": "2020-06-04T18:10:49.376Z",
      "user_provided_metadata": {},
      "local_metadata": {},
      "access_api_key_id": "m7SHgHIBm78rI0UKTW-D",
      "current_error_events": [],
      "last_checkin": "2020-06-04T18:34:30.949Z",
      "config_revision": 3,
      "status": "online"
    }
  ],
  "success": true,
  "total": 1,
  "page": 1,
  "perPage": 20
}
I suggest we check only that the ID exists and that the current_error_events[] list is empty.
Checking status: 'online' would be good too, but be aware that the status is likely to be 'error' after the agent is enrolled and before it is 'run'.
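A sketch of those checks against the response body, under the stated presumption that there were zero agents before the test. The type and function names are hypothetical, and the status is returned rather than asserted because of the 'error' nuance just mentioned.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type agent struct {
	ID                 string            `json:"id"`
	Status             string            `json:"status"`
	CurrentErrorEvents []json.RawMessage `json:"current_error_events"`
}

type agentsResponse struct {
	List []agent `json:"list"`
}

// checkSingleAgent verifies that exactly one agent is listed, that it has an ID,
// and that it reports no current error events. It returns the agent status so
// the caller can decide whether to assert on it.
func checkSingleAgent(body []byte) (string, error) {
	var resp agentsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decoding response: %w", err)
	}
	if len(resp.List) != 1 {
		return "", fmt.Errorf("expected exactly one agent, got %d", len(resp.List))
	}
	a := resp.List[0]
	if a.ID == "" {
		return "", fmt.Errorf("agent has no id")
	}
	if len(a.CurrentErrorEvents) != 0 {
		return a.Status, fmt.Errorf("agent %s reports %d error events", a.ID, len(a.CurrentErrorEvents))
	}
	return a.Status, nil
}

func main() {
	// Body trimmed from the response snippet in this thread.
	body := []byte(`{"list": [{"id": "0a17686e-40c5-4a81-86ae-fb41ddd7ea96", "current_error_events": [], "status": "online"}], "total": 1}`)
	status, err := checkSingleAgent(body)
	fmt.Println("status:", status, "error:", err)
}
```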
You can call GET /api/ingest_manager/fleet/agents
Putting this here, I think it could help you later when implementing the steps.
This new step will combine the others
Running "godog -t stop-agent" will filter the execution to those scenarios using the "@stop-agent" annotation. See https://github.com/cucumber/godog#tags
gherkin syntax changes and steps rework
Scenario: Un-enrolling an agent
Given an agent is deployed to Fleet
When the agent is un-enrolled
Then the agent is not listed as online in Fleet
Hey @EricDavisX, could we reuse here the step `the agent is listed in Fleet as "online"`?
Therefore we would have:
the agent is listed in Fleet as "online"
the agent is listed in Fleet as "offline"
which would be one single step. wdyt?
Scenario: Re-enrolling an agent
Given an agent is enrolled
And the agent is un-enrolled
And the Agent is stopped on the host
Is this one automatically inferred from un-enrolling the agent, or must it be done as a separate action?
If the latter, I would keep it as is (although from the user's perspective it seems like more, maybe unproductive, work).
When the agent is re-enrolled on the host
And the agent is run on the host
Then the agent is listed in Fleet as online
And new documents are inserted into Elasticsearch
I'd like to abstract this step to a more product-related level, as I see it very technical.
what about something like
index `xxx` is created
And index `xxx` has more than 123 documents
I like it! What if the number of documents is not there after an amount of time (minutes)?
Yes, exactly: this should fail then.
Then the agent is un-enrolled
And the agent is stopped on the host
Scenario: Stopping the agent stops backend processes
Given an agent is deployed to Fleet
When the agent is stopped on the host
Then filebeat is stopped
We would probably need something more like `Then there are '2' metricbeat processes`, as we will need to check both the monitoring and the ingesting beats.
This is so great: moving fast and getting better! I agree with Michal that we could enhance 'stopped on the host' to indicate more accurately that the Agent (with defaults set) will start 2 Metricbeat and 2 Filebeat processes. Sorry I forgot that nuance. Still, for the first version we can leave it as implied, handle it in the implementation assertion, and correct it in a coming PR.
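A possible shape for the "there are '2' metricbeat processes" assertion is to count matching names in a process listing. The sketch assumes the listing holds one command name or path per line, for example the output of `ps -e -o comm=`; `countProcesses` is a hypothetical helper and the listing in `main` is made up.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// countProcesses counts the lines of a process listing whose executable name equals name.
func countProcesses(listing, name string) int {
	count := 0
	for _, line := range strings.Split(listing, "\n") {
		// Reduce full paths to the executable name before comparing.
		if filepath.Base(strings.TrimSpace(line)) == name {
			count++
		}
	}
	return count
}

func main() {
	// Made-up listing: two monitoring and two ingesting beats plus the agent itself.
	listing := "filebeat\nmetricbeat\nfilebeat\nmetricbeat\nelastic-agent\n"
	for _, name := range []string{"filebeat", "metricbeat"} {
		fmt.Printf("%s: %d\n", name, countProcesses(listing, name))
	}
}
```

For the stopped-agent scenario the same helper would be expected to return zero for both beats.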
quick comment on the step: @mdelapenya @michalpristas index from @EricDavisX I don't mind any rework we want to do in elaborating this. I would like to suggest we keep it really stable and simple, however, and I don't know if a given number of documents over a given amount of time would be. The Filebeat / Metricbeat info sent is based on host VM activity, right? I suggest that if we have control over the environment and agents, we should be able to wait seconds (not minutes) and confirm changes in what docs the Agent is sending in. It might be good to take this offline and discuss in a quick call, if we have this and any other 'final' items, before we get further into the implementation.
Cool! Let's discuss the specific implementation details in a follow-up iteration. Then I'd keep that step as
That sounds great to me. Thanks Manu
@EricDavisX @michalpristas I think we are on the right track! Please let me know if the requirements are ready to be merged, so I can continue with the implementation. Thanks!
I love this - it looks ready to merge and we can iterate on it.
Given there is a "Fleet" user in Kibana
And the "Fleet" Kibana setup has been created
And the agent binary is installed in the target host
When the agent is un-enrolled from Kibana
I forgot to mention that we'll have to manually terminate the shell / process running on the host as part of the 'tear down' of this scenario, in order to test the re-enrolling and re-starting of the Agent.
@EricDavisX @michalpristas merged! I'm going to send a PR with the Go code scaffolding, so please feel free to contribute to it in the way you prefer.
What does this PR do?
It adds the initial specs for the Ingest management project.
Why is it important?
We should start a discussion around them to make them perfect and totally understandable by anybody in the team: product owners, developers, testers, consumers, etc.
Related issues