Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ScrollID used for retrieval of APM Agents configuration #13958

Merged
merged 2 commits into from
Sep 3, 2024

Conversation

up2neck
Copy link
Contributor

@up2neck up2neck commented Aug 30, 2024

  • Fixed broken APM Agents configuration cache

Motivation/summary

APM Agents configuration cache appears to be broken, resulting frequent invalid requests made by APM Server to Elasticsearch cluster.

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

  1. Deploy Elastic Stack: Elasticsearch, Kibana, APM Server.
  2. Configure self-instrumentation for APM Server
  3. Open Kibana/APM
  4. Check APM Server has no failed operations of type "Elasticsearch: POST .apm-agent-configuration/_search" for Elasticsearch dependency

Related issues

closes #13957

@up2neck up2neck requested a review from a team as a code owner August 30, 2024 11:14
Copy link

cla-checker-service bot commented Aug 30, 2024

💚 CLA has been signed

@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch from 08ecb6f to eaafc91 Compare August 30, 2024 11:14
Copy link
Contributor

mergify bot commented Aug 30, 2024

This pull request does not have a backport label. Could you fix it @up2neck? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Aug 30, 2024
@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch from eaafc91 to d9adabc Compare August 30, 2024 11:20
@up2neck up2neck marked this pull request as draft August 30, 2024 11:27
@carsonip carsonip self-requested a review August 30, 2024 11:30
@up2neck up2neck marked this pull request as ready for review August 30, 2024 11:38
Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Nice catch!

Ideally this PR also updates the test to assert scroll id payload in elasticsearch_test.go, but it may not be trivial. I'm happy to accept this PR as is, and I can add a test in a follow up PR.

changelogs/head.asciidoc Outdated Show resolved Hide resolved
@carsonip carsonip added backport-8.15 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Aug 30, 2024
@up2neck
Copy link
Contributor Author

up2neck commented Aug 30, 2024

@carsonip
Alongside this, what do you think about adding check of "scroll_id" is not empty after cache renewal?
It might require of new error type for such case.

@carsonip
Copy link
Member

Alongside this, what do you think about adding check of "scroll_id" is not empty after cache renewal?

It doesn't hurt, but personally I do not think there is a need in handler to validate anything returned by Elasticsearch. We rely on Elasticsearch for a correct scroll ID, as well as correct search results. What's needed here is a better test, which I'm working on.

Either way the error will be surfaced, just more explicitly if an additional check is in place, or a little more obscurely in a form of HTTP 400 like this case.

@up2neck
Copy link
Contributor Author

up2neck commented Aug 30, 2024

Alongside this, what do you think about adding check of "scroll_id" is not empty after cache renewal?

It doesn't hurt, but personally I do not think there is a need in handler to validate anything returned by Elasticsearch. We rely on Elasticsearch for a correct scroll ID, as well as correct search results. What's needed here is a better test, which I'm working on.

Either way the error will be surfaced, just more explicitly if an additional check is in place, or a little more obscurely in a form of HTTP 400 like this case.

But original code contains HTTP validation, but only for the first search request:

if scrollID == "" {
resp, err := esapi.SearchRequest{
Index: []string{ElasticsearchIndexName},
Size: &f.searchSize,
Scroll: f.cacheDuration,
}.Do(ctx, f.client)
if err != nil {
return result, err
}
defer resp.Body.Close()
if resp.StatusCode >= http.StatusBadRequest {
// Elasticsearch returns 401 on unauthorized requests and 403 on insufficient permission
if resp.StatusCode == http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden {
f.invalidESCfg.Store(true)
}
bodyBytes, err := io.ReadAll(resp.Body)
if err == nil {
f.logger.Debugf("refresh cache elasticsearch returned status %d: %s", resp.StatusCode, string(bodyBytes))
}
return result, fmt.Errorf("refresh cache elasticsearch returned status %d", resp.StatusCode)
}
return result, json.NewDecoder(resp.Body).Decode(&result)
}

Probably we can modify it to handle errors in both requests:

func (f *ElasticsearchFetcher) singlePageRefresh(ctx context.Context, scrollID string) (cacheResult, error) {
	var result cacheResult
	var err error
	var resp *esapi.Response

	if scrollID == "" {
		resp, err = esapi.SearchRequest{
			Index:  []string{ElasticsearchIndexName},
			Size:   &f.searchSize,
			Scroll: f.cacheDuration,
		}.Do(ctx, f.client)
	} else {
		resp, err = esapi.ScrollRequest{
			ScrollID: scrollID,
			Scroll:   f.cacheDuration,
		}.Do(ctx, f.client)
	}
	if err != nil {
		return result, err
	}
	defer resp.Body.Close()

	if resp.StatusCode >= http.StatusBadRequest {
		// Elasticsearch returns 401 on unauthorized requests and 403 on insufficient permission
		if resp.StatusCode == http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden {
			f.invalidESCfg.Store(true)
		}
		bodyBytes, err := io.ReadAll(resp.Body)
		if err == nil {
			f.logger.Debugf("refresh cache elasticsearch returned status %d: %s", resp.StatusCode, string(bodyBytes))
		}
		return result, fmt.Errorf("refresh cache elasticsearch returned status %d", resp.StatusCode)
	}
	return result, json.NewDecoder(resp.Body).Decode(&result)
}

@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch from c33e7d0 to 710bdfa Compare August 30, 2024 12:52
@carsonip
Copy link
Member

But original code contains HTTP validation, but only for the first search request

Correct. There is no handling for bad HTTP response codes in scroll requests. In retrospect, there should be. Thanks for adding it to the PR.

However, that increases the scope of this PR a little bit, and it will need to be tested on our end to avoid introducing regression (I don't expect any).

@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch 2 times, most recently from 9ac14c6 to 2031c34 Compare August 30, 2024 20:15
@carsonip carsonip self-requested a review September 2, 2024 08:46
carsonip
carsonip previously approved these changes Sep 2, 2024
Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for the PR. I can confirm this fixes the scrolling bug by my follow-up PR to add tests: #13959 . The refactoring to handle http errors should be fine, as expired scroll requests returns 404.

@carsonip
Copy link
Member

carsonip commented Sep 2, 2024

run docs-build

@carsonip
Copy link
Member

carsonip commented Sep 2, 2024

@up2neck do you mind signing the commits and force push? See https://docs.github.com/authentication/managing-commit-signature-verification/about-commit-signature-verification

@up2neck
Copy link
Contributor Author

up2neck commented Sep 3, 2024

But original code contains HTTP validation, but only for the first search request

Correct. There is no handling for bad HTTP response codes in scroll requests. In retrospect, there should be. Thanks for adding it to the PR.

However, that increases the scope of this PR a little bit, and it will need to be tested on our end to avoid introducing regression (I don't expect any).

Made it

@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch from 9e5c91d to de5453c Compare September 3, 2024 08:00
@carsonip
Copy link
Member

carsonip commented Sep 3, 2024

run docs-build

@up2neck up2neck force-pushed the fix/apm-agent-cache-renewal branch from de5453c to 09d91be Compare September 3, 2024 10:25
@up2neck up2neck requested a review from carsonip September 3, 2024 10:54
@up2neck
Copy link
Contributor Author

up2neck commented Sep 3, 2024

@carsonip unfortunately, Github lost my commit signature after rebase in web UI. Have restored it.

@carsonip
Copy link
Member

carsonip commented Sep 3, 2024

run docs-build

@carsonip
Copy link
Member

carsonip commented Sep 3, 2024

run docs-build

@carsonip carsonip merged commit ecffa8e into elastic:main Sep 3, 2024
11 checks passed
mergify bot pushed a commit that referenced this pull request Sep 3, 2024
APM Agents configuration cache appears to be broken, resulting frequent invalid requests made by APM Server to Elasticsearch cluster. Fix ScrollID used for retrieval of APM Agents configuration.

(cherry picked from commit ecffa8e)
carsonip added a commit that referenced this pull request Sep 3, 2024
mergify bot pushed a commit that referenced this pull request Sep 3, 2024
Follow up on #13958 to add test

(cherry picked from commit 2aec4d1)
mergify bot added a commit that referenced this pull request Sep 9, 2024
APM Agents configuration cache appears to be broken, resulting frequent invalid requests made by APM Server to Elasticsearch cluster. Fix ScrollID used for retrieval of APM Agents configuration.

(cherry picked from commit ecffa8e)

Co-authored-by: up2neck <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
carsonip added a commit that referenced this pull request Sep 9, 2024
Follow up on #13958 to add test

(cherry picked from commit 2aec4d1)
mergify bot added a commit that referenced this pull request Sep 9, 2024
Follow up on #13958 to add test

(cherry picked from commit 2aec4d1)

Co-authored-by: Carson Ip <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.15 Automated backport with mergify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

APM Agents configuration cache refresh appears to be broken
2 participants