[Ingest Manager] Define Elastic Agent structure on disk for elastic agent support upgrade and rollbacks. #20048

michalpristas · 2020-07-20T11:02:23Z

With upgrades in mind, we need to align the structure of where things go on disk so that upgrade and rollback scenarios are smooth and possible.

Proposed structure

/data
  /v7.9.1-abc123d
     /downloads
     /install
     /run
     elastic-agent
  /v7.9.2-dba324e
     /downloads
     /install
     /run
     elastic-agent
     elastic-agent.yml
     fleet.yml
     action_store.yml
  /logs
elastic-agent
elastic-agent.yml
fleet.yml
action_store.yml

v7.9.1-abc123d is the semver version v7.9.1, and the rest of the string is a hash of the version, which can contain suffixes like SNAPSHOT, BC...
each version contains its own binary and dependent binaries in the downloads/install directories
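A minimal sketch of how such a directory name could be split back into its parts. The regex, the `VersionDir` type and `parse_version_dir` are hypothetical names for illustration, and the exact naming scheme (lowercase hex hash, optional uppercase suffix such as SNAPSHOT in front of it) is an assumption, not something the agent defines.

```python
import re
from typing import NamedTuple, Tuple

# Assumed scheme: v<major>.<minor>.<patch>[-SUFFIX]-<hash>
# e.g. v7.9.1-abc123d or v8.0.0-SNAPSHOT-1a2b3c4d (hypothetical hash).
VERSION_DIR = re.compile(
    r"^v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<suffix>[A-Z][A-Z0-9]*))?"
    r"-(?P<hash>[0-9a-f]+)$"
)


class VersionDir(NamedTuple):
    semver: Tuple[int, int, int]
    suffix: str      # "" when the name carries no suffix
    build_hash: str


def parse_version_dir(name: str) -> VersionDir:
    """Split a versioned directory name into semver, suffix and hash."""
    m = VERSION_DIR.match(name)
    if m is None:
        raise ValueError(f"not a versioned directory name: {name!r}")
    semver = (int(m["major"]), int(m["minor"]), int(m["patch"]))
    return VersionDir(semver, m["suffix"] or "", m["hash"])
```

Anything that does not match, such as `logs` or `agent.lock`, raises `ValueError`, so the same function can be used to tell versioned folders apart from the other entries in `/data`.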

in this example /v7.9.2-dba324e is an older version which contains not only the binary but also a snapshot of the config files and action store

logs were moved from the version level to the root level of the structure. this is so monitoring won't drop any events which might be unprocessed or generated in between upgrade steps.

run is used during runtime to store sockets etc.

elastic-agent at root level is a symlink to the currently active version; any service manager should point to this file as the executable. it gets updated on update/rollback
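One way the symlink update could be done without a window in which the link is missing is to create a temporary link and rename it over the old one. This is only a sketch for POSIX systems; `point_symlink` and the `.tmp` naming are hypothetical, and Windows would need different handling.

```python
import os


def point_symlink(link_path: str, target: str) -> None:
    """Repoint link_path at target by renaming a temporary symlink over it."""
    tmp = link_path + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier attempt.
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # rename() replaces the link itself, it does not follow it.
    os.replace(tmp, link_path)
```

A rollback is then the same call with the previous version's binary as the target.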

elastic-agent.yml, fleet.yml and action_store.yml are config/state files which are used by the active version of the agent. during the upgrade process these are copied to the version folder, overwriting any previously generated config files.
on start, the agent copies these files from its versioned directory, if it contains any, and removes them there to avoid a future overwrite.
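The copy-on-upgrade and restore-on-start steps could look roughly like this. `snapshot_state` and `restore_state` are hypothetical helper names; only the three file names come from the proposal.

```python
import os
import shutil

STATE_FILES = ("elastic-agent.yml", "fleet.yml", "action_store.yml")


def snapshot_state(root: str, version_dir: str) -> None:
    """During upgrade: copy the active files into the version folder,
    overwriting any copies left there from an earlier run."""
    for name in STATE_FILES:
        src = os.path.join(root, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(version_dir, name))


def restore_state(root: str, version_dir: str) -> list:
    """On start: move files found in the version folder back to the root.
    Moving removes them from the version folder, so they are not applied
    a second time on the next start."""
    restored = []
    for name in STATE_FILES:
        src = os.path.join(version_dir, name)
        if os.path.isfile(src):
            os.replace(src, os.path.join(root, name))
            restored.append(name)
    return restored
```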

older versioned folders are removed after a grace period without beats in FAILED state, together with the prev symlink (if used). after this point rollback won't be possible.
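A sketch of the cleanup rule, assuming the agent tracks when the upgrade happened and whether any beat is currently in FAILED state. `remove_old_versions` and its parameters are hypothetical, and "versioned folder" is approximated as any directory under the data folder whose name starts with `v`.

```python
import os
import shutil
import time


def remove_old_versions(data_dir, active, upgraded_at, grace_seconds,
                        any_failed, now=None):
    """Delete non-active versioned folders once the grace period has passed
    without failures. Returns the names of the removed folders."""
    now = time.time() if now is None else now
    if any_failed or now - upgraded_at < grace_seconds:
        return []  # keep old versions around so rollback stays possible
    removed = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if name.startswith("v") and name != active and os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed
```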

cc @ph @blakerouse


elasticmachine · 2020-07-20T11:02:26Z

Pinging @elastic/ingest-management (Team:Ingest Management)

ph · 2020-07-20T13:39:22Z

@michalpristas I think we should still use the full version if possible in the folder name, v7.9.0-hash instead of v7. Is there a strong argument to only have v7? The only reason I can think of is the Windows limit on characters?

Where are we locating the registry of filebeat in the data folder?

blakerouse · 2020-07-20T13:51:29Z

There is also the agent.lock that currently exists inside the data folder. I think we should still keep the top-level data folder then nest the version folder. I think with keeping the top-level data folder we could then place the logs directory still in that folder.

/data
  agent.lock
  /v7-abc123d
     /downloads
     /install
     elastic-agent
  /v7-dba324e
     /downloads
     /install
     elastic-agent
     elastic-agent.yml
     fleet.yml
     action_store.yml
  /logs
elastic-agent
prev
elastic-agent.yml
fleet.yml
action_store.yml

I also agree with @ph that we should use the full version number (v7.0.0).

I question if we should really have the prev symbolic link, what happens if the user does ./prev and runs it? Would we ever really want to allow that? I think prev should be removed.

ph · 2020-07-20T15:28:22Z

good point @blakerouse, I think the prev is only to keep state? Maybe we should keep that information in an existing file like fleet.yml or another one. We could keep a bit more information, e.g. when that release was installed.

michalpristas · 2020-07-22T10:33:16Z

updated the issue description with what we discussed with blake over zoom

blakerouse · 2020-07-22T13:27:39Z

Looks good. 👍

ph · 2020-07-22T13:53:57Z

LGTM

ph · 2020-07-22T17:40:58Z

@michalpristas Can you discuss our plan with @ferullo? They also have persisted information and their own "installation" structure.

michalpristas · 2020-07-23T06:35:53Z

asked Daniel about their internal state and they have it covered using endpoint.exe install --upgrade
we need to keep this in mind and be aware that we're in the upgrade process at the point of installation of endpoint

blakerouse · 2020-07-23T13:16:36Z

Hmm, at the moment we only ever call: endpoint-security.exe install --resources endpoint-security-resources.zip.

@ferullo Would it be bad for Agent to always call install with --upgrade? Or could we just remove the --upgrade flag and allow the installer to figure out if it's an upgrade or not? Feels like it would be best for the installer to just do the correct thing.

Agent does not know specifically if this really is an upgrade or not. Take the case of an Agent just being removed (not un-enrolled, so Endpoint will still be running): when a newly enrolled Agent on the same host runs the installer again, it will not know that the host previously had Endpoint running.

ferullo · 2020-07-23T13:42:32Z

Yes, always passing --upgrade works fine and I agree, it is best for Agent to do that.

We went back and forth on whether to (1) have a different command for upgrade vs install, (2) make it automatic (no --upgrade) or (3) make it required but harmless if not needed. We settled on (3), which is similar to how pip works.

ph · 2020-07-23T13:44:55Z

@ferullo I assume we can also use the same command to rollback to a previous version?

ferullo · 2020-07-23T13:56:54Z

Endpoint will roll back automatically if it is unable to upgrade for some reason. But if Agent needs to roll back and wants to downgrade Endpoint, that should work. However, I can't promise it will because it's impossible to guarantee a previously released Endpoint works with future Endpoints, especially across major version updates.

Is Agent going to upgrade itself first, then Beats and Endpoint? If so, this seems like a non-issue. If not, to downgrade Endpoint I recommend using --upgrade to downgrade, but then if Endpoint doesn't wind up at the right version, as a last-ditch effort run the installed Endpoint's uninstall and install the new Endpoint fresh. That last-ditch effort should never be needed, but since both its component actions (uninstall, fresh install) are tested and independent of other Endpoint versions, it should also be reliable across all Endpoint changes over time.
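The recommended fallback can be sketched as a small control flow. The four callables are hypothetical stand-ins for whatever Agent uses to run `install --upgrade`, query the installed Endpoint version, run the uninstall, and do a fresh install; none of them is a real Agent or Endpoint API.

```python
def converge_endpoint(target, run_upgrade, installed_version,
                      uninstall, fresh_install):
    """Bring Endpoint to `target`, falling back to uninstall + fresh install.

    Returns "upgrade" if `install --upgrade` was enough, "reinstall" if the
    last-ditch path was needed.
    """
    run_upgrade(target)              # stands in for: install --upgrade
    if installed_version() == target:
        return "upgrade"
    # Last-ditch effort: both steps are independent of other Endpoint versions.
    uninstall()
    fresh_install(target)
    if installed_version() != target:
        raise RuntimeError(f"Endpoint did not reach version {target}")
    return "reinstall"
```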

blakerouse · 2020-07-23T14:18:45Z

@ferullo Agent will upgrade itself first, then perform upgrades on the Beats and Endpoint.

Below are a couple of failure cases that we consider on upgrade:

Agent upgrades itself and fails
- Endpoint will not have anything change (i.e. no install/upgrade/uninstall would ever be run)
- Agent rolls back to the previous version
- Everything is back as if the upgrade never occurred

Agent upgrades itself (works), then upgrades Beats and Endpoint (either fails)
- Endpoint would have had install --upgrade run to the next version
- Either Endpoint or Beats reports an ERROR to Agent through gRPC, or Beats keeps crashing
- Agent rolls back to the previous version
- The previous Beats version is restarted
- Endpoint install --upgrade is called again, but with the previous version (not the newest upgraded version)

So the flow for Endpoint breaks down in the worst case to (v1 and v2 just symbolize version jumps):

v1 -> v2 -> v1

Also possible if someone forces a downgrade:

v2 -> v1
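To make the version sequences concrete, here is a toy model of which versions Endpoint sees `install --upgrade` for in each case. `endpoint_versions_during_upgrade` and its boolean inputs are hypothetical; real failure detection (gRPC errors, crash loops) is reduced to two flags.

```python
def endpoint_versions_during_upgrade(current, target, agent_upgrade_ok,
                                     components_ok, install_upgrade):
    """Return the sequence of Endpoint versions during one upgrade attempt.

    install_upgrade(version) stands in for `install --upgrade` with the
    Endpoint package of the given version.
    """
    seen = [current]
    if not agent_upgrade_ok:
        # Agent rolls itself back; Endpoint is never touched.
        return seen
    install_upgrade(target)
    seen.append(target)
    if not components_ok:
        # Agent rolls back and re-runs the installer with the old version.
        install_upgrade(current)
        seen.append(current)
    return seen
```

With a failing component this yields `v1 -> v2 -> v1`; a forced downgrade is simply a successful run with `current="v2"` and `target="v1"`.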

ferullo · 2020-07-23T14:55:51Z

Thanks for the details. I had not realized Agent would downgrade itself if Beats or Endpoint didn't work after the upgrade. That's slick.

I think that flow works well. We have automated tests to make sure an Endpoint can be upgraded, and we'll add one to make sure it can be downgraded.

I still think if Beats or Endpoint fail to downgrade then uninstall and re-install is the best course of action to make sure Agent/Beats/Endpoint all stay in sync for their version number. Though perhaps that would be handled by the normal [endpoint|beats] verify flow that is in place?

ph · 2020-07-23T15:13:42Z

@ferullo can you link the issue in #20205 ?

michalpristas · 2020-07-27T07:16:22Z

I was thinking about snapshots over the weekend and this won't work for them. i realised that snapshot versions do not differentiate between snapshots (always 8.0.0-SNAPSHOT or so)

so the version hash won't work for them.

so i was thinking that we need something which is known at build time to create a package dir and differentiate between versions.
at first i was thinking of reusing the SHA but it is not known at build time and we won't know it for the initial install either.

what i was thinking about is the latest commit hash. we know that at build time, we even inject it into the agent binary.
so with this strategy it would look like this

Package creation

during package creation we would prepare a structure which is ready to unpack, e.g.
/v8.0.0-SNAPSHOT-{commit} /data etc.

so when this gets unpacked it already is differentiated.

Same SNAPSHOT problem

A problem might occur when we receive an action to update from one snapshot to another which is the same. Usually we should not upgrade from a version to the same version, but for snapshots we have to.
What might occur is that we end up updating to the same snapshot and replace our running bits with freshly unpacked ones.

So for SNAPSHOT i was thinking about special handling (not for normal versions) where we unpack to /tmp or take a look inside the archive, and if the v8.0.0-SNAPSHOT-{commit} directory already exists in our version directory structure, we abort.
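The abort check itself is small. A sketch, assuming the directory name is built as `v{version}-{commit}`; both function names are hypothetical.

```python
import os


def snapshot_dir_name(version: str, commit: str) -> str:
    """Build the versioned directory name, e.g. v8.0.0-SNAPSHOT-<commit>."""
    return f"v{version}-{commit}"


def should_abort_upgrade(data_dir: str, version: str, commit: str) -> bool:
    """True when the exact same build is already unpacked under data_dir."""
    return os.path.isdir(os.path.join(data_dir, snapshot_dir_name(version, commit)))
```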

@ph @blakerouse do you see some gotchas there? does it sound ok?

blakerouse · 2020-07-27T12:54:30Z

@michalpristas Good catch on the issue with snapshots. I think it would be good to not use the whole commit hash, as that might cause issues on Windows due to file path length. Maybe just the first 8 or so?

I was thinking of something similar without even adjusting the packaged bits: always extract the new upgrade agent into data/{$ current-agent $}/downloads (don't think we need /tmp). Run elastic-agent version --yaml (good to add a YAML output so it's machine parsable) to get the commit hash. If it's the same do nothing; if it's different then upgrade to that version.
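A sketch of the comparison step. The `--yaml` output does not exist yet, so its shape here (a top-level `commit:` key) is purely an assumption, as are the function names. Only the first 8 characters are compared, matching the shortened hash suggested for directory names.

```python
def commit_from_version_output(text: str) -> str:
    """Pull the commit hash out of an ASSUMED `commit: <hash>` line."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "commit":
            return value.strip().strip("\"'")
    raise ValueError("no commit field in version output")


def needs_upgrade(running_commit: str, candidate_output: str,
                  short_len: int = 8) -> bool:
    """Compare shortened commit hashes; different means upgrade."""
    candidate = commit_from_version_output(candidate_output)
    return candidate[:short_len] != running_commit[:short_len]
```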

michalpristas · 2020-07-27T14:11:44Z

first few characters should work as well.
i think unpacking to data and then moving it up makes sense as well

ph · 2020-07-27T15:30:44Z

Good catch on the snapshots. The elastic-agent version --yaml seems like an OK strategy? I would prefer if we can have only a single path. We might have to solve where to get the matching snapshots for endpoints?

michalpristas · 2020-07-28T09:44:07Z

moved logs under data

ph · 2020-07-28T20:10:43Z

@ruflin Could you take a look?

ruflin · 2020-07-31T06:50:13Z

I like the above proposal especially the part that we can also upgrade between snapshot builds.

The part I didn't get is why each version doesn't have its own log directory. I would expect the log collection pattern to be something like /data/*/logs/*. Like this, any log from any version is picked up. If a version is cleaned up, all its data/logs are cleaned up too (is this expected?). The downside is that it makes it harder for a user to find the "current" logs. But to solve this, we can introduce a symlink to the current logs directory inside data.
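For illustration, the collection pattern maps directly onto a glob. `collect_logs` is a hypothetical helper and `root` is assumed to be the directory that contains `data`.

```python
import glob
import os


def collect_logs(root: str) -> list:
    """Return every file matching <root>/data/*/logs/*, across all versions."""
    return sorted(glob.glob(os.path.join(root, "data", "*", "logs", "*")))
```

Files outside a version's `logs` directory, such as anything under `downloads`, are not matched.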

Does each log event that we currently ship contain the exact version of the Beat + commit hash if it is a snapshot?

ph · 2020-08-19T21:05:17Z

I am not against having a log per version.

> Does each log event that we currently ship contain the exact version of the Beat + commit hash if it is a snapshot?

Good catch, I am pretty sure it doesn't include the hash.

ph · 2020-08-26T12:15:13Z

Decision: We have logs per version; we can provide tooling or a symlink to help the local debug experience.

ph · 2020-09-03T12:41:36Z

Closing this; PR #20400 was merged.

michalpristas added discuss Issue needs further discussion. Team:Ingest Management labels Jul 20, 2020

michalpristas self-assigned this Jul 20, 2020

ph mentioned this issue Jul 23, 2020

[Meta][Ingest Manager] Allow the elastic agent to upgrade itself and his artifacts #20205

Closed

29 tasks

ph changed the title ~~[Ingest Manager] Proposal Upgrade friendly structure~~ [Ingest Manager] Define Elastic Agent structure on disk for elastic agent support upgrade and rollbacks. Jul 23, 2020

michalpristas mentioned this issue Jul 29, 2020

[Ingest Manager] New Agent structure #20307

Closed

12 tasks

ph closed this as completed Sep 3, 2020
