[Ingest Manager] Define Elastic Agent structure on disk for elastic agent support upgrade and rollbacks. #20048

Closed
michalpristas opened this issue Jul 20, 2020 · 26 comments

Comments

@michalpristas
Contributor

michalpristas commented Jul 20, 2020

With upgrades in mind, we need to agree on where things go on disk so that upgrade and rollback scenarios are smooth and possible.

Proposed structure

/data
  /v7.9.1-abc123d
     /downloads
     /install
     /run
     elastic-agent
  /v7.9.2-dba324e
     /downloads
     /install
     /run
     elastic-agent
     elastic-agent.yml
     fleet.yml
     action_store.yml
  /logs
elastic-agent
elastic-agent.yml
fleet.yml
action_store.yml

v7.9.1-abc123d is the semver version v7.9.1 followed by a hash of the version, which can contain suffixes like SNAPSHOT, BC...
each version contains its own binary and dependent binaries in the downloads/install directories
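
A minimal sketch of how such a folder name could be split back into its parts. The exact naming scheme is an assumption here (`v<major>.<minor>.<patch>`, an optional upper-case suffix such as SNAPSHOT, then a lower-case hex hash), and `VersionDir` and `parse_version_dir` are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple, Optional


class VersionDir(NamedTuple):
    version: str            # e.g. "7.9.1"
    suffix: Optional[str]   # e.g. "SNAPSHOT", or None when absent
    build_hash: str         # e.g. "abc123d"


# Assumed scheme: v<major>.<minor>.<patch>[-<SUFFIX>]-<hex hash>
_PATTERN = re.compile(r"^v(\d+\.\d+\.\d+)(?:-([A-Z][A-Z0-9]*))?-([0-9a-f]+)$")


def parse_version_dir(name: str) -> Optional[VersionDir]:
    """Return the parsed parts, or None if the name is not a version folder."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return VersionDir(m.group(1), m.group(2), m.group(3))
```

Names that do not follow the scheme, such as `logs`, simply yield `None`, so a directory listing of /data can be filtered with this function.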

in this example /v7.9.2-dba324e is an older version which contains not only the binary but also a snapshot of the config files and the action store

logs were moved from the version level to the root level of the structure, so that monitoring won't drop any events which might be unprocessed or generated in between upgrade steps.

run is used during runtime to store sockets etc.

elastic-agent at the root level is a symlink to the currently active version; any service manager should point to this file as the executable. the symlink gets updated on upgrade/rollback
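
A minimal sketch of how the symlink switch could be done so that the service manager never sees a missing executable, assuming a POSIX filesystem where a rename replaces the old link in one step. `activate_version` and the `.new` temporary name are hypothetical.

```python
import os


def activate_version(root: str, version_dir: str) -> None:
    """Point <root>/elastic-agent at data/<version_dir>/elastic-agent."""
    # Relative target, so the whole tree can be moved without breaking the link.
    target = os.path.join("data", version_dir, "elastic-agent")
    link = os.path.join(root, "elastic-agent")
    tmp = link + ".new"  # hypothetical temporary name
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Rename over the existing link; on POSIX the old link is replaced atomically.
    os.replace(tmp, link)
```

A rollback is the same call with the previous version folder, which is why the old folder has to stay on disk until rollback is no longer needed.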

elastic-agent.yml, fleet.yml and action_store.yml are config/state files which are used by the active version of the agent. during the upgrade process these are copied to the version folder, overwriting any previously generated config files.
on start, the agent copies these files from its versioned directory, if it contains any, and then removes them to avoid a future overwrite.
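
The two copy steps above could look roughly like this. It is a sketch under the assumption that the files are plain files copied one by one; `snapshot_configs` and `restore_configs` are hypothetical names.

```python
import os
import shutil

CONFIG_FILES = ("elastic-agent.yml", "fleet.yml", "action_store.yml")


def snapshot_configs(root: str, version_dir: str) -> None:
    """During upgrade: copy root-level config/state files into the version folder."""
    dest = os.path.join(root, "data", version_dir)
    for name in CONFIG_FILES:
        src = os.path.join(root, name)
        if os.path.exists(src):
            # Overwrites any previously generated copy in the version folder.
            shutil.copy2(src, os.path.join(dest, name))


def restore_configs(root: str, version_dir: str) -> None:
    """On start: copy files found in the version folder back to the root."""
    src_dir = os.path.join(root, "data", version_dir)
    for name in CONFIG_FILES:
        src = os.path.join(src_dir, name)
        if os.path.exists(src):
            shutil.copy2(src, os.path.join(root, name))
            os.remove(src)  # removed so a later start does not overwrite newer files
```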

older versioned folders are removed after a grace period without beats in FAILED state, together with the prev symlink (if used). after this point rollback won't be possible.
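
One possible reading of the cleanup rule, sketched as a pure function. How the agent tracks health is not specified in this issue, so `healthy_since` (the time since which no beat has been in FAILED state) is hypothetical bookkeeping, as is the function name.

```python
from typing import Iterable, List, Optional


def removable_versions(version_dirs: Iterable[str], active: str,
                       healthy_since: Optional[float], now: float,
                       grace_seconds: float) -> List[str]:
    """Return the old version folders that may be deleted.

    healthy_since is None while something is in FAILED state right now.
    """
    if healthy_since is None or now - healthy_since < grace_seconds:
        return []  # still inside the grace period: keep rollback possible
    return sorted(v for v in version_dirs if v != active)
```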

cc @ph @blakerouse

@michalpristas michalpristas added discuss Issue needs further discussion. Team:Ingest Management labels Jul 20, 2020
@michalpristas michalpristas self-assigned this Jul 20, 2020
@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@ph
Contributor

ph commented Jul 20, 2020

@michalpristas I think we should still use the full version if possible in the folder name, v7.9.0-hash instead of v7. Is there a strong argument for having only v7? The only reason I can think of is the Windows limit on characters.

Where in the data folder are we locating the Filebeat registry?

@blakerouse
Contributor

blakerouse commented Jul 20, 2020

There is also the agent.lock that currently exists inside the data folder. I think we should still keep the top-level data folder then nest the version folder. I think with keeping the top-level data folder we could then place the logs directory still in that folder.

/data
  agent.lock
  /v7-abc123d
     /downloads
     /install
     elastic-agent
  /v7-dba324e
     /downloads
     /install
     elastic-agent
     elastic-agent.yml
     fleet.yml
     action_store.yml
  /logs
elastic-agent
prev
elastic-agent.yml
fleet.yml
action_store.yml

I also agree with @ph that we should use the full version number (v7.0.0).

I question if we should really have the prev symbolic link, what happens if the user does ./prev and runs it? Would we ever really want to allow that? I think prev should be removed.

@ph
Contributor

ph commented Jul 20, 2020

good point @blakerouse, I think the prev is only to keep state? Maybe we should keep that information in an existing file like fleet.yml or another one. We could keep a bit more information, e.g. when that release was installed.

@michalpristas
Contributor Author

updated the issue description with what we talked about with blake over zoom

@blakerouse
Contributor

Looks good. 👍

@ph
Contributor

ph commented Jul 22, 2020

LGTM

@ph
Contributor

ph commented Jul 22, 2020

@michalpristas Can you discuss our plan with @ferullo? They also have persisted information and their own "installation" structure.

@michalpristas
Contributor Author

asked Daniel about their internal state and they have it covered using endpoint.exe install --upgrade
we need to keep this in mind and be aware that we're in an upgrade process at the point of installing endpoint

@blakerouse
Contributor

Hmm, at the moment we always call only: endpoint-security.exe install --resources endpoint-security-resources.zip.

@ferullo Would it be bad for Agent to always call install with --upgrade? Or could we just remove the --upgrade flag and allow the installer to figure out if it's an upgrade or not? I feel like it would be best for the installer to just do the correct thing.

Agent does not know specifically if this really is an upgrade or not. Consider an Agent that is just removed (not un-enrolled), so Endpoint is still running. When a newly enrolled Agent on the same host then runs the installer again, it will not know that Endpoint was previously running there.

@ferullo

ferullo commented Jul 23, 2020

Yes, always passing --upgrade works fine and I agree, it is best for Agent to do that.

We went back and forth on whether to (1) have a different command for upgrade vs install, (2) make it automatic (no --upgrade), or (3) make it required but harmless if not needed. We settled on (3), which is similar to how pip works.

@ph
Contributor

ph commented Jul 23, 2020

@ferullo I assume we can also use the same command to rollback to a previous version?

@ferullo

ferullo commented Jul 23, 2020

Endpoint will roll back automatically if it is unable to upgrade for some reason. But if Agent needs to roll back and wants to downgrade Endpoint, that should work. However, I can't promise it will, because it's impossible to guarantee a previously released Endpoint works with future Endpoints, especially across major version updates.

Is Agent going to upgrade itself first, then Beats and Endpoint? If so, this seems like a non-issue. If not, to downgrade Endpoint I recommend using --upgrade to downgrade, and then, if Endpoint doesn't wind up at the right version, as a last-ditch effort run the installed Endpoint's uninstall and install the new Endpoint fresh. That last-ditch effort should never be needed, but since both of its component actions (uninstall, fresh install) are tested and independent of other Endpoint versions, it should also be reliable across all Endpoint changes over time.
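
The recommended downgrade path could be sketched as follows. The four callables are hypothetical stand-ins for whatever Agent uses to run the Endpoint installer and to query the installed version; nothing here reflects the actual Endpoint CLI beyond the `install --upgrade` and uninstall actions named above.

```python
from typing import Callable


def move_endpoint_to(target_version: str,
                     install_upgrade: Callable[[str], None],
                     installed_version: Callable[[], str],
                     uninstall: Callable[[], None],
                     fresh_install: Callable[[str], None]) -> str:
    """Return which path was taken: "upgrade" or "reinstall"."""
    install_upgrade(target_version)  # e.g. runs `install --upgrade`
    if installed_version() == target_version:
        return "upgrade"
    # Last-ditch effort: uninstall what is there, then install fresh.
    uninstall()
    fresh_install(target_version)
    return "reinstall"
```

Passing the actions in as callables keeps the decision logic testable without a real Endpoint on the host.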

@blakerouse
Contributor

@ferullo Agent will upgrade itself first, then perform upgrades on the Beats and Endpoint.

Below are a couple of failure cases that we consider on upgrade:

  • Agent upgrades itself and fails
    • Endpoint will not have anything changed (i.e. no install/upgrade/uninstall would ever be run)
    • Agent rolls back to the previous version
    • Everything is back as if the upgrade never occurred
  • Agent upgrades itself (works), then upgrades Beats and Endpoint (either fails)
    • Endpoint would have had install --upgrade run to the next version
    • Either Endpoint or Beats reports an ERROR to Agent through gRPC, or Beats keeps crashing
    • Agent rolls back to the previous version
    • The previous Beats version is restarted
    • Endpoint install --upgrade is called again, but with the previous version (not the newest upgraded version)

So the flow for Endpoint breaks down in the worst case to (v1 and v2 are just symbolic version jumps):

  • v1 -> v2 -> v1

If someone forces a downgrade, this is also possible:

  • v2 -> v1
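
The two failure cases above can be traced with a small simulation. This is only a sketch: `agent_ok` and `components_ok` are hypothetical health checks, and the step strings are labels for illustration, not real commands.

```python
from typing import Callable, List


def run_upgrade(old: str, new: str,
                agent_ok: Callable[[str], bool],
                components_ok: Callable[[str], bool]) -> List[str]:
    """Return the ordered steps taken for an upgrade from old to new."""
    steps = [f"agent: upgrade {old} -> {new}"]
    if not agent_ok(new):
        # Case 1: Agent itself fails; Beats and Endpoint are never touched.
        steps.append(f"agent: rollback to {old}")
        return steps
    steps.append(f"beats: start {new}")
    steps.append(f"endpoint: install --upgrade ({new})")
    if not components_ok(new):
        # Case 2: a component reports an error or keeps crashing.
        steps.append(f"agent: rollback to {old}")
        steps.append(f"beats: restart {old}")
        steps.append(f"endpoint: install --upgrade ({old})")
    return steps
```

In the second case Endpoint sees exactly the v1 -> v2 -> v1 sequence described above.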

@ph ph changed the title [Ingest Manager] Proposal Upgrade friendly structure [Ingest Manager] Define Elastic Agent structure on disk for elastic agent support upgrade and rollbacks. Jul 23, 2020
@ferullo

ferullo commented Jul 23, 2020

Thanks for the details. I had not realized Agent would downgrade itself if Beats or Endpoint didn't work after the upgrade. That's slick.

I think that flow works well. We have automated tests to make sure an Endpoint can be upgraded, and we'll add one to make sure it can be downgraded.

I still think if Beats or Endpoint fail to downgrade then uninstall and re-install is the best course of action to make sure Agent/Beats/Endpoint all stay in sync for their version number. Though perhaps that would be handled by the normal [endpoint|beats] verify flow that is in place?

@ph
Contributor

ph commented Jul 23, 2020

@ferullo can you link the issue in #20205 ?

@michalpristas
Contributor Author

I was thinking about snapshots over the weekend and this won't work for them. I realised that snapshot versions do not differentiate between snapshots (always 8.0.0-SNAPSHOT or so),

so the version hash won't work for them.

So I was thinking that we need something which is known at build time to create a package dir and differentiate between versions.
At first I was thinking of reusing the SHA, but it is not known at build time and we won't know it for the initial install either.

What I was thinking about is the latest commit hash. We know it at build time, we even inject it into the agent binary.
So this strategy would look like this:

Package creation

during package creation we would prepare a structure which is ready to unpack, e.g.
/v8.0.0-SNAPSHOT-{commit} /data etc.

so when this gets unpacked it already is differentiated.

Same SNAPSHOT problem

A problem might arise when we receive an action to update from one snapshot to another which is the same. Usually we should not upgrade from a version to the same version, but for snapshots we have to.
What might occur is that we end up updating to the same snapshot and replace our running bits with freshly unpacked ones.

So for SNAPSHOT I was thinking about special handling (not for normal versions) where we unpack to /tmp or take a look inside the archive, and if the v8.0.0-SNAPSHOT-{commit} directory already exists in our version directory structure, we abort.
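
A minimal sketch of the "look inside the archive" variant, assuming the package is a tar archive whose top-level directory is named after the build (as in the package creation example above). Windows packages would need the same logic for zip files. `top_level_dirs` and `is_same_snapshot` are hypothetical names.

```python
import os
import posixpath
import tarfile
from typing import Set


def top_level_dirs(archive_path: str) -> Set[str]:
    """Names of the top-level entries in the archive, without unpacking it."""
    with tarfile.open(archive_path) as tar:
        names = (posixpath.normpath(m.name) for m in tar.getmembers())
        return {n.split("/")[0] for n in names if n not in ("", ".")}


def is_same_snapshot(archive_path: str, versions_dir: str) -> bool:
    """True if the build in the archive is already unpacked in versions_dir."""
    return any(os.path.isdir(os.path.join(versions_dir, d))
               for d in top_level_dirs(archive_path))
```

When `is_same_snapshot` returns True the upgrade would be aborted, so the running bits are never overwritten.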

@ph @blakerouse do you see some gotchas there? does it sound ok?

@blakerouse
Contributor

@michalpristas Good catch on the issue with snapshots. I think it would be good to not use the whole commit hash, as that might cause issues on Windows due to file path length. Maybe just the first 8 or so?

I was thinking of something similar without even adjusting the packaged bits: always extract the new upgrade agent into data/{$ current-agent $}/downloads (I don't think we need /tmp). Run elastic-agent version --yaml (it would be good to add a YAML output so it's machine parsable) to get the commit hash. If it's the same, do nothing; if it's different, then upgrade to that version.
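
A sketch of the comparison step. The `--yaml` output is only proposed here, so its format is an assumption: the code below expects a top-level `commit:` key, which is hypothetical, and parses it by hand to avoid a YAML dependency. Comparing only the first 8 characters follows the short-hash suggestion above.

```python
from typing import Optional

SHORT_LEN = 8  # "the first 8 or so", as suggested above


def commit_from_version_output(text: str) -> Optional[str]:
    """Extract the value of a top-level `commit:` key (hypothetical key name)."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "commit":
            return value.strip().strip("'\"") or None
    return None


def needs_upgrade(current_commit: str, candidate_commit: str) -> bool:
    """Compare only the short form of both commit hashes."""
    return current_commit[:SHORT_LEN] != candidate_commit[:SHORT_LEN]
```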

@michalpristas
Contributor Author

first few characters should work as well.
I think unpacking to data and then moving it up makes sense as well

@ph
Contributor

ph commented Jul 27, 2020

good catch on the snapshots. The elastic-agent version --yaml approach seems like an OK strategy? I would prefer that we have only a single path. We might have to solve where to get the matching snapshots for endpoints?

@michalpristas
Contributor Author

moved logs under data

@ph
Contributor

ph commented Jul 28, 2020

@ruflin Could you take a look?

@ruflin
Contributor

ruflin commented Jul 31, 2020

I like the above proposal especially the part that we can also upgrade between snapshot builds.

The part I didn't get is why each version doesn't have its own log directory. I would expect the log collection pattern to be something like /data/*/logs/*. This way any log from any version is picked up. If a version is cleaned up, all its data/logs are cleaned up too (is this expected?). The downside is that it makes it harder for a user to find the "current" logs. But to solve this, we can introduce a symlink to the current logs directory inside data.
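
A minimal sketch of both ideas, the collection pattern and the convenience symlink, assuming a POSIX filesystem. `collect_logs` and `link_current_logs` are hypothetical names, and placing the link at data/logs is an assumption.

```python
import glob
import os
from typing import List


def collect_logs(root: str) -> List[str]:
    """All log files from every version, matching <root>/data/*/logs/*."""
    return sorted(glob.glob(os.path.join(root, "data", "*", "logs", "*")))


def link_current_logs(root: str, active_version_dir: str) -> None:
    """Keep data/logs as a symlink to the active version's logs directory."""
    link = os.path.join(root, "data", "logs")
    if os.path.lexists(link):
        os.remove(link)  # assumes an existing entry is a symlink, not a real dir
    # Relative target, resolved from inside data/.
    os.symlink(os.path.join(active_version_dir, "logs"), link)
```

The symlink does not produce duplicate matches for the pattern, since data/logs/logs/* does not exist.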

Does each log event that we currently ship contain the exact version of the Beat + Commit Hash if it is a snapshot?

@ph
Contributor

ph commented Aug 19, 2020

I am not against having a log per version.

Does each log event that we currently ship contain the exact version of the Beat + Commit Hash if it is a snapshot?
Good catch, I am pretty sure it doesn't include the hash.

@ph
Contributor

ph commented Aug 26, 2020

Decision: We will have logs per version, and we can provide tooling or a symlink to help the local debug experience.

@ph
Contributor

ph commented Sep 3, 2020

Closing this, PR #20400 was merged.

@ph ph closed this as completed Sep 3, 2020