Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Filestream store GC, entries are now removed when they TTL expires #38488

Merged
merged 7 commits into from
Apr 9, 2024

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Mar 20, 2024

Proposed commit message

The resources from Filestream registry have a counter to indicate how many 'owners' have got a hold of that resource, this counter was not correctly decremented. Because it never reached zero, no entry was ever removed from the in-memory store and even though the store GC would run periodically, no resource could be removed. That caused the in-memory store to be ever growing and the op: remove never to be seen in the registry log file.

This commit fixes this bug by correctly calling Released in every resource that is retained.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

## Author's Checklist

How to test this PR locally

1. Create the following configuration file

filebeat.inputs:
  - type: filestream
    id: "PR-38488"
    paths:
      - /tmp/logs/*.log
    
    clean_removed: true
    close.on_state_change.inactive: 1s
    prospector.scanner.check_interval: 1s

filebeat:
  registry:
    cleanup_interval: 5s

output:
  file:
    enabled: true
    codec.json:
      pretty: false
    path: ${path.home}
    filename: "output"
    rotate_on_startup: false

queue.mem:
  flush:
    timeout: 1s
    min_events: 32

filebeat.registry.flush: 1s

logging:
  level: debug
  selectors:
    - input
    - input.filestream
  metrics:
    enabled: false

2. Build and start Filebeat

cd filebeat
go build .
./filebeat -e

3. Create a log file and add some data

mkdir -p /tmp/logs
echo "fist file">> /tmp/log.log

4. Wait a few seconds until the file is harvested

You can check the output file or look the logs for the logs stating the harvester has been closed

{"log.level":"debug","@timestamp":"2024-03-22T14:18:35.465+0100","log.logger":"input.filestream","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*logFile).Read","file.name":"filestream/filestream.go","file.line":131},"message":"End of file reached: /tmp/logs/log.log; Backoff now.","service.name":"filebeat","id":"PR-38488","source_file":"filestream::PR-38488::native::142572-34","path":"/tmp/logs/log.log","state-id":"native::142572-34","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-22T14:18:39.465+0100","log.logger":"input.filestream","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*filestream).readFromSource","file.name":"filestream/input.go","file.line":336},"message":"Reader was closed. Closing. Path='/tmp/logs/log.log'","service.name":"filebeat","id":"PR-38488","source_file":"filestream::PR-38488::native::142572-34","path":"/tmp/logs/log.log","state-id":"native::142572-34","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-22T14:18:39.465+0100","log.logger":"input.filestream","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1","file.name":"input-logfile/harvester.go","file.line":247},"message":"Stopped harvester for file","service.name":"filebeat","id":"PR-38488","source_file":"filestream::PR-38488::native::142572-34","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-22T14:18:39.465+0100","log.logger":"input.filestream","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream.(*filestream).Run.func1","file.name":"filestream/input.go","file.line":151},"message":"Closing reader of filestream","service.name":"filebeat","id":"PR-38488","source_file":"filestream::PR-38488::native::142572-34","path":"/tmp/logs/log.log","state-id":"native::142572-34","ecs.version":"1.6.0"}

5. Delete the file

rm /tmp/logs/log.log

6. Wait for the store cleanup logs

{"log.level":"debug","@timestamp":"2024-03-22T14:20:00.464+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.gcStore","file.name":"input-logfile/clean.go","file.line":67},"message":"Start store cleanup","service.name":"filebeat","input_type":"filestream","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-22T14:20:00.464+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.gcStore","file.name":"input-logfile/clean.go","file.line":86},"message":"1 entries removed","service.name":"filebeat","input_type":"filestream","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-22T14:20:00.464+0100","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.gcStore","file.name":"input-logfile/clean.go","file.line":87},"message":"Done store cleanup","service.name":"filebeat","input_type":"filestream","ecs.version":"1.6.0"}

7. Assert that the entry has also been removed from the registy

You should be able to find the following lines in the registry:

{"op":"set","id":4}
{"k":"filestream::PR-38488::native::142572-34","v":{"ttl":0,"updated":[258163407648,1711113513],"cursor":{"offset":11},"meta":{"source":"/tmp/logs/log.log","identifier_name":"native"}}}
{"op":"remove","id":5}
{"k":"filestream::PR-38488::native::142572-34"}
  • The set operation is setting ttl:0. That is how the entry is "removed" from the registry
  • The remove operation removes any reference to that state and also removes all in-memory references to the resource pointing to this file.

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 20, 2024
Copy link
Contributor

mergify bot commented Mar 20, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@belimawr belimawr added the Team:Elastic-Agent Label for the Agent team label Mar 20, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 20, 2024
@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 20, 2024

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 145 min 6 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@belimawr belimawr added bug skip-ci Skip the build in the CI but linting labels Mar 22, 2024
Copy link
Contributor

mergify bot commented Mar 22, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-clean-removed-filestream upstream/fix-clean-removed-filestream
git merge upstream/main
git push upstream fix-clean-removed-filestream

@belimawr belimawr removed the skip-ci Skip the build in the CI but linting label Mar 22, 2024
@belimawr belimawr force-pushed the fix-clean-removed-filestream branch from 0786577 to 11fa75e Compare March 22, 2024 13:28
@belimawr belimawr marked this pull request as ready for review March 22, 2024 13:29
@belimawr belimawr requested a review from a team as a code owner March 22, 2024 13:29
@belimawr belimawr requested review from ycombinator and faec March 22, 2024 13:29
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

💔 Build Failed

Failed CI Steps

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @belimawr

@pierrehilbert pierrehilbert requested a review from rdner March 22, 2024 14:28
filebeat.Stop()

registryLogFile := filepath.Join(tempDir, "data/registry/filebeat/log.json")
readFilestreamRegistryLog(t, registryLogFile, "remove", 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I thought we never used remove but just updated the TTL field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too and that was the bug! 🤯

The resources were never correctly released, so the gcClean would never for any entry.

They way Filestream registry works is:

  • Whenever an entry is "removed", its TTL is set to zero.
  • Periodically the gcStore runs, any resource that is finished is then removed from the in-memory store and have a remove operation added to the log file.

When Find is called, it locks the resource by calling resource.Retain

if resource := s.table[key]; resource != nil && !resource.isDeleted() {
resource.Retain()
return resource
}
, it's the caller's responsibilit to release it by calling Release
func (r *resource) Release() { r.pending.Dec() }

There were a few places never calling Release 🤦‍♂️, hence Finished

func (r *resource) Finished() bool { return r.pending.Load() == 0 }
always returned false and nothing was ever removed from the registry. Which made Filebeat never add the op: remove to the registry log file. 🤯 🤯 🤯 🤯 🤯 🤯 🤯 🤯 🤯

CHANGELOG.next.asciidoc Outdated Show resolved Hide resolved
@belimawr belimawr force-pushed the fix-clean-removed-filestream branch from 026195a to 96a28f2 Compare March 26, 2024 08:09
@belimawr
Copy link
Contributor Author

Changelog updated, rebased onto main to fix merge conflicts, force push.

@belimawr belimawr changed the title Release in-memory resources when clean_removed is set Fix Filestream store GC, entries are now removed when they TTL expires Mar 26, 2024
@belimawr belimawr requested a review from a team as a code owner March 26, 2024 11:35
@botelastic botelastic bot added the Team:Automation Label for the Observability productivity team label Mar 26, 2024
@belimawr belimawr force-pushed the fix-clean-removed-filestream branch from a704351 to 510e462 Compare March 26, 2024 19:11
@belimawr belimawr force-pushed the fix-clean-removed-filestream branch from 510e462 to 31b2b12 Compare March 27, 2024 13:29
belimawr added 6 commits April 8, 2024 10:49
Filestream keeps a `resource` for each file it is harvesting, this
resource gets written to the registry, both in-memory and
on-disk. Even when `clean_removed` was enabled the resource would not
be fully released and never removed from memory. While that did not
affect offset tracking, it would not release memory when it should.
@belimawr belimawr force-pushed the fix-clean-removed-filestream branch from 31b2b12 to 73b6eaa Compare April 8, 2024 08:49
@belimawr belimawr merged commit 8efb366 into elastic:main Apr 9, 2024
186 of 189 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Automation Label for the Observability productivity team Team:Elastic-Agent Label for the Agent team
Projects
None yet
4 participants