Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docker runtime for agent #730

Merged
merged 1 commit into from
Jun 14, 2023
Merged

Conversation

chrisdoherty4
Copy link
Member

@chrisdoherty4 chrisdoherty4 commented May 11, 2023

Summary

This PR introduces a Docker runtime that is usable by the agent to run actions. The code is intended to introduce a functional runtime, therefore it doesn't implement the full set of features available on actions such as volumes; the additional features will be implemented separately.

This PR also doesn't address debugability of actions - for example, action logs are discarded - though this may not be required if we continue with syslog and routing container logs to an external service.

Future work

The tests are integration tests that ensure the code is robust enough. Some functionality like ensuring volume configuration, environment variables, commands and args are all configured correctly would be better implemented as unit tests. A subsequent PR will look to introduce an interface so we can easily test these feature as they're added.

@codecov
Copy link

codecov bot commented May 11, 2023

Codecov Report

Merging #730 (d36ffe4) into main (3723588) will increase coverage by 0.99%.
The diff coverage is 66.66%.

❗ Current head d36ffe4 differs from pull request most recent head 3e45b96. Consider uploading reports for the commit 3e45b96 to get more accurate results

@@            Coverage Diff             @@
##             main     #730      +/-   ##
==========================================
+ Coverage   55.32%   56.31%   +0.99%     
==========================================
  Files          25       28       +3     
  Lines        1099     1243     +144     
==========================================
+ Hits          608      700      +92     
- Misses        475      511      +36     
- Partials       16       32      +16     
Impacted Files Coverage Δ
internal/agent/runtime/docker.go 62.63% <62.63%> (ø)
internal/agent/runtime/internal/failure_files.go 74.46% <74.46%> (ø)

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@chrisdoherty4 chrisdoherty4 force-pushed the feature/runtime branch 3 times, most recently from bbeda78 to b3b3d4a Compare May 13, 2023 14:57
@chrisdoherty4 chrisdoherty4 marked this pull request as ready for review May 13, 2023 15:02
@@ -0,0 +1,90 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to rename the folder where this file is created to utils and keep the file name to ptr.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend to avoid top level util packages and sign up to every package having at least some scoped purpose.

internal/agent/runtime/docker.go Show resolved Hide resolved
internal/agent/runtime/docker.go Show resolved Hide resolved
moadqassem
moadqassem previously approved these changes May 15, 2023
Copy link
Member

@moadqassem moadqassem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

type DockerOption func(*Docker)

// WithLogger returns an option to configure the logger on a Docker instance.
func WithLogger(log logr.Logger) DockerOption {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've realized the logger is actually required because we log when a container couldn't be gracefully stopped. I'll tweak this when adding the other features as noted in the OP.

// Always try to remove the container on exit.
defer func() {
// We can't use the context passed to Run() as it may have been cancelled.
err := d.client.ContainerRemove(context.Background(), create.ID, types.ContainerRemoveOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we enable all the ContainerRemoveOptions?

Copy link
Member Author

@chrisdoherty4 chrisdoherty4 May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type ContainerRemoveOptions struct {
	RemoveVolumes bool
	RemoveLinks   bool
	Force         bool
}
  • On Mac with Docker Desktop I've seen force leave objects lying on disk that couldn't be cleaned up by a following docker system prune --volumes (I had to manually remove files). I don't know how widespread that impact is, or how challenging it would be to figure out, so I'm not sure if we'd want that enabled, particularly given this is all in-memory and goes away on restart. Given action names are unique I don't foresee any conflicts with respect to launching more actions.
  • I don't think we want to remove volumes because we allow users to share data between actions via volumes.
  • I don't see any issues with link removal.

With that in mind, what do you think about enabling link removal only?

Copy link
Member Author

@chrisdoherty4 chrisdoherty4 May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On container name uniqueness, this has made me realized the action ID is unique within the context of a workflow, but if a second workflow executes (i.e. the first didn't restart the machine) there could be a collision. I think we could fix that by committing to UUIDs for action IDs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, Chris.

You're right we probably shouldn't remove volumes.

Currently, the main use case for the agent is to run in Hook. The status-quo is to use Force: true, so we know it works in Hook. I'm concerned about resource constrained machines where memory and disk space in Hook are an issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, lets stick with the status quo and force delete. The volumes will likely be in-memory and cleaned on restart anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be addressed.

// Run satisfies agent.ContainerRuntime.
func (d *Docker) Run(ctx context.Context, a workflow.Action) error {
// We need the image to be available before we can create a container.
image, err := d.client.ImagePull(ctx, a.Image, types.ImagePullOptions{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might think about adding retries here. a transient issue to the registry could cause a whole workflow to fail.

Copy link
Member Author

@chrisdoherty4 chrisdoherty4 May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with github.com/avast/retry-go as it seems quite modular and has all the bits you'd expect from a retry library.

fn(o)
}

if o.log.GetSink() == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you put the logr.Discard into o := &Docker{}, i dont think you'll need this check. Same with the client setting below. These will be the defaults and then only overwritten if a consumer passes either one of them in as a DockerOption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@chrisdoherty4 chrisdoherty4 force-pushed the feature/runtime branch 4 times, most recently from 6fbb52e to 3db4cdf Compare May 30, 2023 00:59
// In rare cases this may create orphaned volumes that the Docker CLI won't clean up.
opts := types.ContainerRemoveOptions{
Force: true,
RemoveLinks: true,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing issues in tests and its not clear why. I no longer think I understand what it does so I'm disabling; we can enable it at a later date.

@chrisdoherty4 chrisdoherty4 force-pushed the feature/runtime branch 2 times, most recently from 42a3072 to 1cb2e5d Compare May 30, 2023 02:22
}

// NewDocker creates a new Docker instance.
func NewDocker(opts ...DockerOption) (*Docker, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: variables, types, and New functions go at the top of the file.

Env: toDockerEnv(a.Env),
}

failureFiles, err := internal.NewFailureFiles()
Copy link
Member

@jacobweinstock jacobweinstock Jun 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this after a bit the interaction between a runtime implementation, the agent, the eventing system, and the action, etc being run in terms of failures feels difficult to follow with just the code. I think we should figure out a way to make this very understandable. docs, code comments, etc. I just want to be sure that all the different parts of the code are resilient and easy to understand how they fit into the whole failure notification system.


// Close closes all files tracked by f.
func (f *FailureFiles) Close() error {
os.Remove(f.reason.Name())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the model is going to be to not capture the errors here, maybe we should change the interface to not return an error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the name Close actually doesn't feel accurate here. deleting files is different than closing them. is this satisfying an interface method?

return strings.TrimRight(message.String(), "\n"), nil
}

func (f *FailureFiles) ToError() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we inject a data structure for failure files to runtimes I suspect this method will be unnecessary so defering.

// TODO: Support all the other things on the action such as volumes.
cfg := container.Config{
Image: a.Image,
Env: toDockerEnv(a.Env),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to set the failure file locations to be env vars for all containers. that way actions dont need to "know" where this location is and we can change it if needed. action writers just look up the env var and get the files.

@chrisdoherty4
Copy link
Member Author

In an effort to move this along we've decided to merge this. This comment serves as a summary of outstanding items:

  1. Inject failure files as they should be common across runtimes.
  2. Rename Close() for failure files to Remove() and drop returning an error.
  3. Restructure the sources to follow Code Review comments guideliens for source organization.

The new agent requires a container runtime to execute workflow actions.
This commit introduces a Docker runtime.

Signed-off-by: Chris Doherty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge Signal to Mergify to merge the PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants