
[filebeat][gcs] - Added retry config & refactored the default config options #41699

Closed
wants to merge 3 commits into main from gcs/retry_config

Conversation

@ShourieG ShourieG (Contributor) commented Nov 20, 2024

Type of change

  • Enhancement

Proposed commit message

  • Added retry config.
  • Refactored the default config options.
  • Updated docs accordingly and fixed some existing errors.
  • Added tests.
  • Removed an unused log parameter from the fetchStorageClient method.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
    - [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist


How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@ShourieG ShourieG requested a review from a team as a code owner November 20, 2024 10:02
@botelastic botelastic bot added the needs_team label Nov 20, 2024
mergify bot commented Nov 20, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

mergify bot commented Nov 20, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it, please add the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x label Nov 20, 2024
@ShourieG ShourieG requested a review from efd6 November 20, 2024 10:03
@ShourieG ShourieG added the Team:Security-Service Integrations label Nov 20, 2024
@elasticmachine (Collaborator)

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@ShourieG ShourieG added the enhancement, Filebeat, and input:GCS labels Nov 20, 2024
@botelastic botelastic bot removed the needs_team label Nov 20, 2024
@efd6 efd6 (Contributor) left a comment

There are two distinct changes being made here; the refactor and the addition of a new feature. I think they should be in separate PRs.

@@ -90,6 +92,17 @@ type jsonCredentialsConfig struct {
 	AccountKey string `config:"account_key"`
 }
 
+type retryConfig struct {
+	// MaxAttempts is the maximum number of retry attempts, defaults to 3.
+	MaxAttempts int `config:"max_attempts" validate:"min=0"`
Contributor

Why min=0? Is this attempts or retries? If it's retries, I would rename this MaxRetries and then a zero makes sense, otherwise it's saying don't try the first time.

+	// MaxBackOffDuration is the maximum value of the retry period, defaults to 30 seconds.
+	MaxBackOffDuration time.Duration `config:"max_backoff_duration" validate:"min=2"`
+	// BackOffMultiplier is the factor by which the retry period increases. It should be greater than 1 and defaults to 2.
+	BackOffMultiplier float64 `config:"backoff_multiplier" validate:"min=2"`
Contributor

Does go-ucfg allow float constraints? Is a value greater than one but less than two a valid configuration in principle? If it is, should we not allow that?

Contributor Author (@ShourieG) commented Nov 29, 2024

CEL also uses `Limit *float64` with `config:"limit"` in its resource config, as do a couple of other inputs. There is a float unpacker present (FloatUnpacker).

As for whether we should allow it: if the SDK allows it, why not extend the same feature set to the end user?

Contributor

That's not what I'm asking. I'm asking if min can be a float constraint.
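(For reference, a minimal sketch of one alternative if the `min` tag cannot express a strict float bound: go-ucfg calls a `Validate` method on unpacked structs, so the constraint can be enforced there. This is illustrative only, not the PR's implementation.)

package gcs

import "fmt"

// Illustrative only: enforce backoff_multiplier > 1 after unpacking,
// independent of whether the validate tag supports float bounds.
func (c *retryConfig) Validate() error {
	if c.BackOffMultiplier <= 1 {
		return fmt.Errorf("backoff_multiplier must be greater than 1, got %v", c.BackOffMultiplier)
	}
	return nil
}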

Comment on lines +31 to +40
 	MaxWorkers int `config:"max_workers,omitempty" validate:"max=5000"`
 	// Poll - Defines if polling should be performed on the input bucket source.
-	Poll *bool `config:"poll,omitempty"`
+	Poll bool `config:"poll,omitempty"`
 	// PollInterval - Defines the maximum amount of time to wait before polling for the next batch of objects from the bucket.
-	PollInterval *time.Duration `config:"poll_interval,omitempty"`
+	PollInterval time.Duration `config:"poll_interval,omitempty"`
 	// ParseJSON - Informs the publisher whether to parse & objectify json data or not. By default this is set to
 	// false, since it can get expensive dealing with highly nested json data.
-	ParseJSON *bool `config:"parse_json,omitempty"`
+	ParseJSON bool `config:"parse_json,omitempty"`
 	// BucketTimeOut - Defines the maximum time that the sdk will wait for a bucket api response before timing out.
-	BucketTimeOut *time.Duration `config:"bucket_timeout,omitempty"`
+	BucketTimeOut time.Duration `config:"bucket_timeout,omitempty"`
Contributor

Why were these changed?

Contributor Author

explanation below


This attribute can be used to configure a list of sub-attributes that directly control how the input should behave when a download for a file/object fails or gets interrupted. The list of sub-attributes is as follows:

1. `max_attempts`: This attribute defines the maximum number of retries that should be attempted for a failed download. The default value for this is `3`.
Contributor

Suggested change
1. `max_attempts`: This attribute defines the maximum number of retries that should be attempted for a failed download. The default value for this is `3`.
1. `max_retries`: This attribute defines the maximum number of retries that should be attempted for a failed download. The default value for this is `3`.
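(As an aside, a hedged sketch of how these documented retry attributes could map onto the GCS Go SDK's per-bucket retryer. The Retryer, WithBackoff, and WithMaxAttempts calls are from cloud.google.com/go/storage; the mapping itself is an assumption and may not match how this PR wires the options.)

package gcs

import (
	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

// Illustrative mapping of the retry settings onto the SDK's retryer;
// not necessarily how this PR applies them.
func applyRetry(bkt *storage.BucketHandle, cfg retryConfig) *storage.BucketHandle {
	return bkt.Retryer(
		storage.WithMaxAttempts(cfg.MaxAttempts), // max_attempts / max_retries
		storage.WithBackoff(gax.Backoff{
			Max:        cfg.MaxBackOffDuration, // max_backoff_duration
			Multiplier: cfg.BackOffMultiplier,  // backoff_multiplier
		}),
	)
}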

-The input can be configured to work with and without polling, though currently, if polling is disabled it will only
-perform a one time passthrough, list the file contents and end the process. Polling is generally recommented for most cases
-even though it can get expensive with dealing with a very large number of files.
+The input can be configured to work with and without polling, though if polling is disabled, it will only perform a one time passthrough, list the file contents and end the process.
Contributor

Suggested change
The input can be configured to work with and without polling, though if polling is disabled, it will only perform a one time passthrough, list the file contents and end the process.
The input can be configured to work with and without polling, though if polling is disabled it will only perform a single collection of data, list the file contents and end the process.

The change s/passthrough/…/ significantly changes the semantics, but I don't think the original meaning is correct; passthrough is a noun that doesn't describe an action.

Though I'm unclear why this is being changed since it does not appear to be related to the logic change here.

Comment on lines +67 to +71
 different values, thus offering extensive flexibility and customization. Examples <<bucket-overrides,below>> show this behavior.
 
 On receiving this config the google cloud storage input will connect to the service and retrieve a `Storage Client` using the given `bucket_name` and
 `auth.credentials_file`, then it will spawn two main go-routines, one for each bucket. After this each of these routines (threads) will initialize a scheduler
-which will in turn use the `max_workers` value to initialize an in-memory worker pool (thread pool) with `3` `workers` available. Basically that equates to two instances of a worker pool,
-one per bucket, each having 3 workers. These `workers` will be responsible for performing `jobs` that process a file (in this case read and output the contents of a file).
+which will in turn use the `max_workers` value to initialize an in-memory worker pool (thread pool) with `3` `workers` available. Basically that equates to two instances of a worker pool, one per bucket, each having 3 workers. These `workers` will be responsible for performing `jobs` that process a file (in this case read and output the contents of a file).
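(For readers, a simplified sketch of the worker-pool behaviour the paragraph above describes: one bounded pool per bucket, sized by max_workers, with each job reading and publishing one object. This is illustrative only, not the input's actual scheduler code.)

package gcs

import "sync"

// Illustrative only: process a bucket's objects with at most maxWorkers
// concurrent jobs, where each job reads and publishes a single object.
func processBucket(objects []string, maxWorkers int, job func(name string)) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxWorkers) // bounds concurrency to maxWorkers
	for _, name := range objects {
		wg.Add(1)
		sem <- struct{}{}
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			job(name)
		}(name)
	}
	wg.Wait()
}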
Contributor

Do we need to change these in this PR?

Contributor Author

have replied with an explanation and option below

@@ -213,7 +211,7 @@ This is a specific subfield of a bucket. It specifies the bucket name.

 This attribute defines the maximum amount of time after which a bucket operation will give and stop if no response is recieved (example: reading a file / listing a file).
 It can be defined in the following formats : `{{x}}s`, `{{x}}m`, `{{x}}h`, here `s = seconds`, `m = minutes` and `h = hours`. The value `{{x}}` can be anything we wish.
-If no value is specified for this, by default its initialized to `50 seconds`. This attribute can be specified both at the root level of the configuration as well at the bucket level. The bucket level values will always take priority and override the root level values if both are specified. The value of `bucket_timeout` that should be used depends on the size of the files and the network speed. If the timeout is too low, the input will not be able to read the file completely and `context_deadline_exceeded` errors will be seen in the logs. If the timeout is too high, the input will wait for a long time for the file to be read, which can cause the input to be slow. The ratio between the `bucket_timeout` and `poll_interval` should be considered while setting both the values. A low `poll_interval` and a very high `bucket_timeout` can cause resource utilization issues as schedule ops will be spawned every poll iteration. If previous poll ops are still running, this could result in concurrently running ops and so could cause a bottleneck over time.
+If no value is specified for this, by default its initialized to `120 seconds`. This attribute can be specified both at the root level of the configuration as well at the bucket level. The bucket level values will always take priority and override the root level values if both are specified. The value of `bucket_timeout` that should be used depends on the size of the files and the network speed. If the timeout is too low, the input will not be able to read the file completely and `context_deadline_exceeded` errors will be seen in the logs. If the timeout is too high, the input will wait for a long time for the file to be read, which can cause the input to be slow. The ratio between the `bucket_timeout` and `poll_interval` should be considered while setting both the values. A low `poll_interval` and a very high `bucket_timeout` can cause resource utilization issues as schedule ops will be spawned every poll iteration. If previous poll ops are still running, this could result in concurrently running ops and so could cause a bottleneck over time.
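(A hedged sketch of the behaviour described above: `bucket_timeout` effectively bounds each bucket operation with a context deadline, which is why too small a value surfaces as `context_deadline_exceeded` on large objects. Object and NewReader are standard cloud.google.com/go/storage calls; the helper itself is illustrative, not the input's code.)

package gcs

import (
	"context"
	"io"
	"time"

	"cloud.google.com/go/storage"
)

// Illustrative only: read one object under the configured bucket timeout.
// Reads that exceed the deadline fail with context.DeadlineExceeded.
func readObject(ctx context.Context, bkt *storage.BucketHandle, name string, bucketTimeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, bucketTimeout)
	defer cancel()

	r, err := bkt.Object(name).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}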
Contributor

What is being changed here?

Contributor Author

initialized to 50 seconds -> initialized to 120 seconds

mergify bot commented Nov 25, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b gcs/retry_config upstream/gcs/retry_config
git merge upstream/main
git push upstream gcs/retry_config

@ShourieG ShourieG (Contributor Author) commented Nov 29, 2024

There are two distinct changes being made here; the refactor and the addition of a new feature. I think they should be in separate PRs.

@efd6, so in order to have the retries working effectively, I had to introduce defaultConfig(), which warranted some config changes/cleanup. I don't think go-ucfg allows default values in the config tags; I would need to write a custom Unpack method to achieve that, and it would become unnecessarily complex. Keeping the boolean values as pointers was more of a detriment in this scenario and they were unnecessary, so I decided to clean those up along with the default values, which led to the documentation changes. I agree I should have separated the PRs once I realised the config was going to require a small rework. I can still separate it if you want: I'll turn this into the refactor PR and have a separate PR for the actual change.

@ShourieG ShourieG (Contributor Author)

@efd6, I'm closing this PR. I've created a separate PR only for the cleanup/update to docs here. Once that is merged, I'll have the retry changes up.

@ShourieG ShourieG closed this Nov 29, 2024
@efd6 efd6 (Contributor) commented Dec 1, 2024

I don't think go-ucfg allows default values in the config tags, I would need to write a custom Unpack method to achieve that and it would become unnecessarily complex.

The pattern to achieve this is to prefill the default config. Zero values in the serialised config do not over-write the value in the passed-in default.
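(A minimal sketch of that prefill pattern, assuming the Beats-style conf.C/Unpack flow; the Retry field name is hypothetical, and the default values shown are the ones mentioned in this PR's docs.)

package gcs

import (
	"time"

	conf "github.com/elastic/elastic-agent-libs/config"
)

// Illustrative only: start from a prefilled default config so that options
// absent from the user's configuration keep their default values.
func defaultConfig() config {
	return config{
		BucketTimeOut: 120 * time.Second,
		Retry: retryConfig{ // "Retry" field name is assumed for illustration
			MaxAttempts:        3,
			MaxBackOffDuration: 30 * time.Second,
			BackOffMultiplier:  2,
		},
	}
}

func readConfig(cfg *conf.C) (config, error) {
	c := defaultConfig()
	if err := cfg.Unpack(&c); err != nil { // options not present in cfg retain defaults
		return config{}, err
	}
	return c, nil
}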

Labels: backport-8.x, enhancement, Filebeat, input:GCS, Team:Security-Service Integrations
Development

Successfully merging this pull request may close these issues.

Add configurable retry policy for GCS integration connection failures
3 participants