
Support summary and detailed file metrics in filestream input #25045

Closed
wants to merge 2 commits

Conversation

kvch
Contributor

@kvch kvch commented Apr 13, 2021

What does this PR do?

This PR adds support for metrics in the filestream input. The summary of reader metrics is enabled by default, so the same numbers are reported as for the log input.

More detailed metrics are disabled by default. Users can enable them by setting detailed_metrics to true. These metrics are the same as what the log input provides, but as we have received numerous complaints about them not being configurable, I have separated them from the other metrics. Now users can configure whether they would like to collect these metrics or not.
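For illustration only, enabling the new option in a filebeat config might look like this. The detailed_metrics name comes from the PR description; the surrounding keys are ordinary filestream input settings, and the exact placement of the option is an assumption:

```yaml
filebeat.inputs:
  - type: filestream
    paths:
      - /var/log/*.log
    # Summary reader metrics are enabled by default; detailed per-file
    # metrics are opt-in (assumed placement of the new option):
    detailed_metrics: true
```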

Why is it important?

It makes the inner parts of filestream observable. Required for feature parity with the log input.

Checklist

  - [x] My code follows the style guidelines of this project
  - [x] I have commented my code, particularly in hard-to-understand areas
  - [ ] I have made corresponding changes to the documentation
  - [x] I have made corresponding changes to the default configuration files
  - [ ] I have added tests that prove my fix is effective or that my feature works
  - [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 13, 2021
@kvch kvch changed the title Init reader metrics in filestream Support summary and detailed file metrics in filestream input Apr 13, 2021
@kvch kvch added the Team:Elastic-Agent Label for the Agent team label Apr 13, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 13, 2021
@elasticmachine
Collaborator

elasticmachine commented Apr 13, 2021

💔 Tests Failed


Build stats

  • Build Cause: Pull request #25045 updated

  • Start Time: 2021-06-03T15:09:17.419+0000

  • Duration: 57 min 45 sec

  • Commit: b202b46

Test stats 🧪

Test Results
Failed 1
Passed 5235
Skipped 644
Total 5880


Test errors 1


Build&Test / filebeat-build / TestFilestreamCloseRemoved – github.com/elastic/beats/v7/filebeat/input/filestream

     === RUN   TestFilestreamCloseRemoved
    coverage: 59.0% of statements
    panic: test timed out after 10m0s
    
    goroutine 32 [running]:
    testing.(*M).startAlarm.func1()
    	/usr/local/go/src/testing/testing.go:1700 +0xe5
    created by time.goFunc
    	/usr/local/go/src/time/sleep.go:180 +0x45
    
    goroutine 1 [chan receive, 9 minutes]:
    testing.(*T).Run(0xc000102780, 0xcf3278, 0x1a, 0xd19300, 0x493d01)
    	/usr/local/go/src/testing/testing.go:1239 +0x2da
    testing.runTests.func1(0xc00025e900)
    	/usr/local/go/src/testing/testing.go:1511 +0x78
    testing.tRunner(0xc00025e900, 0xc0002dfda8)
    	/usr/local/go/src/testing/testing.go:1193 +0xef
    testing.runTests(0xc0005189a8, 0x12f7ea0, 0x31, 0x31, 0xc0265c61c6e68a63, 0x8bb35ebea3, 0x131a420, 0xce9b4f)
    	/usr/local/go/src/testing/testing.go:1509 +0x2fe
    testing.(*M).Run(0xc0000fd500, 0x0)
    	/usr/local/go/src/testing/testing.go:1417 +0x1eb
    main.main()
    	_testmain.go:203 +0x1c5
    
    goroutine 21 [sleep]:
    time.Sleep(0x989680)
    	/usr/local/go/src/runtime/time.go:193 +0xd2
    github.com/elastic/beats/v7/filebeat/input/filestream.(*inputTestingEnvironment).waitUntilEventCount(0xc000512280, 0x1)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/environment_test.go:303 +0x30
    github.com/elastic/beats/v7/filebeat/input/filestream.TestFilestreamCloseRemoved(0xc000102780)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/input_integration_test.go:140 +0x3f0
    testing.tRunner(0xc000102780, 0xd19300)
    	/usr/local/go/src/testing/testing.go:1193 +0xef
    created by testing.(*T).Run
    	/usr/local/go/src/testing/testing.go:1238 +0x2b3
    
    goroutine 26 [semacquire, 9 minutes]:
    sync.runtime_Semacquire(0xc000301fd8)
    	/usr/local/go/src/runtime/sema.go:56 +0x45
    sync.(*WaitGroup).Wait(0xc000301fd0)
    	/usr/local/go/src/sync/waitgroup.go:130 +0x65
    github.com/elastic/go-concert/unison.(*MultiErrGroup).Wait(0xc000301fb0, 0x0, 0x0, 0x0)
    	/go/pkg/mod/github.com/elastic/[email protected]/unison/multierrgroup.go:54 +0x53
    github.com/elastic/beats/v7/filebeat/input/filestream.(*fileProspector).Run(0xc0003d8460, 0xc000072e60, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/prospector.go:232 +0x3a5
    github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*managedInput).Run(0xc0003d84b0, 0xc000072e60, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/internal/input-logfile/input.go:79 +0x3cb
    github.com/elastic/beats/v7/filebeat/input/filestream.(*inputTestingEnvironment).startInput.func1(0xdd9818, 0xc0002b2500, 0xdd71f0, 0xc0003d84b0, 0xc000512280, 0xc000512310)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/environment_test.go:99 +0x14c
    created by github.com/elastic/beats/v7/filebeat/input/filestream.(*inputTestingEnvironment).startInput
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/environment_test.go:95 +0x9a
    
    goroutine 28 [chan receive, 9 minutes]:
    github.com/elastic/beats/v7/filebeat/input/filestream.(*fileWatcher).Event(0xc0002b24c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/fswatch.go:229 +0x85
    github.com/elastic/beats/v7/filebeat/input/filestream.(*fileProspector).Run.func2(0xc0001a6b68, 0xc000042b80)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/prospector.go:126 +0x17f
    github.com/elastic/go-concert/unison.(*MultiErrGroup).Go.func1(0xc000301fb0, 0xc0002b2580)
    	/go/pkg/mod/github.com/elastic/[email protected]/unison/multierrgroup.go:42 +0x64
    created by github.com/elastic/go-concert/unison.(*MultiErrGroup).Go
    	/go/pkg/mod/github.com/elastic/[email protected]/unison/multierrgroup.go:40 +0x66
    
    goroutine 27 [select, 9 minutes]:
    github.com/elastic/go-concert/timed.Periodic(0x7feb4640c030, 0xc0002b2540, 0x4e94914f0000, 0xc00001aec0, 0x0, 0x0)
    	/go/pkg/mod/github.com/elastic/[email protected]/timed/timed.go:76 +0x118
    github.com/elastic/beats/v7/filebeat/input/filestream.(*fileWatcher).Run(0xc0002b24c0, 0x7feb4640c008, 0xc0002b2540)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/fswatch.go:129 +0x10b
    github.com/elastic/beats/v7/filebeat/input/filestream.(*fileProspector).Run.func1(0x0, 0x0)
    	/go/src/github.com/elastic/beats/filebeat/input/filestream/prospector.go:118 +0x9a
    github.com/elastic/go-concert/unison.(*MultiErrGroup).Go.func1(0xc000301fb0, 0xc0005c82e8)
    	/go/pkg/mod/github.com/elastic/[email protected]/unison/multierrgroup.go:42 +0x64
    created by github.com/elastic/go-concert/unison.(*MultiErrGroup).Go
    	/go/pkg/mod/github.com/elastic/[email protected]/unison/multierrgroup.go:40 +0x66
     
    

Steps errors 2


filebeat-build - mage build test
  • Took 20 min 34 sec
  • Description: mage build test
Error signal
  • Took 0 min 0 sec
  • Description: Error 'hudson.AbortException: script returned exit code 1'

Log output


[2021-06-03T16:05:27.362Z] 8.77s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_096_oracle
[2021-06-03T16:05:27.362Z] 8.74s call     x-pack/filebeat/tests/system/test_xpack_modules.py::XPackTest::test_fileset_file_382_gcp
[2021-06-03T16:05:27.362Z] ======================= 411 passed in 1737.98s (0:28:57) =======================
[2021-06-03T16:05:27.362Z] >> python test: Integration Testing Complete
[2021-06-03T16:05:30.995Z] Cleaning up /var/lib/jenkins/workspace/PR-25045-16-059d4f09-a11d-4449-b054-faa443f44470
[2021-06-03T16:05:30.995Z] Client: Docker Engine - Community
[2021-06-03T16:05:30.995Z]  Version:           20.10.3
[2021-06-03T16:05:30.995Z]  API version:       1.41
[2021-06-03T16:05:30.995Z]  Go version:        go1.13.15
[2021-06-03T16:05:30.995Z]  Git commit:        48d30b5
[2021-06-03T16:05:30.995Z]  Built:             Fri Jan 29 14:33:13 2021
[2021-06-03T16:05:30.995Z]  OS/Arch:           linux/amd64
[2021-06-03T16:05:30.995Z]  Context:           default
[2021-06-03T16:05:30.995Z]  Experimental:      true
[2021-06-03T16:05:30.995Z] 
[2021-06-03T16:05:30.995Z] Server: Docker Engine - Community
[2021-06-03T16:05:30.995Z]  Engine:
[2021-06-03T16:05:30.995Z]   Version:          20.10.3
[2021-06-03T16:05:30.995Z]   API version:      1.41 (minimum version 1.12)
[2021-06-03T16:05:30.995Z]   Go version:       go1.13.15
[2021-06-03T16:05:30.995Z]   Git commit:       46229ca
[2021-06-03T16:05:30.995Z]   Built:            Fri Jan 29 14:31:25 2021
[2021-06-03T16:05:30.995Z]   OS/Arch:          linux/amd64
[2021-06-03T16:05:30.995Z]   Experimental:     false
[2021-06-03T16:05:30.996Z]  containerd:
[2021-06-03T16:05:30.996Z]   Version:          1.4.4
[2021-06-03T16:05:30.996Z]   GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
[2021-06-03T16:05:30.996Z]  runc:
[2021-06-03T16:05:30.996Z]   Version:          1.0.0-rc93
[2021-06-03T16:05:30.996Z]   GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
[2021-06-03T16:05:30.996Z]  docker-init:
[2021-06-03T16:05:30.996Z]   Version:          0.19.0
[2021-06-03T16:05:30.996Z]   GitCommit:        de40ad0
[2021-06-03T16:05:30.996Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-06-03T16:05:30.996Z] Unable to find image 'alpine:3.4' locally
[2021-06-03T16:05:31.576Z] 3.4: Pulling from library/alpine
[2021-06-03T16:05:31.838Z] c1e54eec4b57: Pulling fs layer
[2021-06-03T16:05:32.096Z] c1e54eec4b57: Verifying Checksum
[2021-06-03T16:05:32.096Z] c1e54eec4b57: Download complete
[2021-06-03T16:05:32.096Z] c1e54eec4b57: Pull complete
[2021-06-03T16:05:32.355Z] Digest: sha256:b733d4a32c4da6a00a84df2ca32791bb03df95400243648d8c539e7b4cce329c
[2021-06-03T16:05:32.355Z] Status: Downloaded newer image for alpine:3.4
[2021-06-03T16:05:34.257Z] Change permissions with write access of all files inside the specific folder
[2021-06-03T16:05:35.208Z] Running in /var/lib/jenkins/workspace/PR-25045-16-059d4f09-a11d-4449-b054-faa443f44470/src/github.com/elastic/beats/build
[2021-06-03T16:05:35.501Z] + rm -rf ve
[2021-06-03T16:05:35.501Z] + find . -type d -name vendor -exec rm -r {} ;
[2021-06-03T16:05:35.813Z] + python .ci/scripts/pre_archive_test.py
[2021-06-03T16:05:38.343Z] Copy ./x-pack/filebeat/build into build/x-pack/filebeat/build
[2021-06-03T16:05:38.354Z] Running in /var/lib/jenkins/workspace/PR-25045-16-059d4f09-a11d-4449-b054-faa443f44470/src/github.com/elastic/beats/build
[2021-06-03T16:05:38.368Z] Recording test results
[2021-06-03T16:05:39.481Z] [Checks API] No suitable checks publisher found.
[2021-06-03T16:05:39.892Z] + go clean -modcache
[2021-06-03T16:05:42.727Z] Cleaning up /var/lib/jenkins/workspace/PR-25045-16-059d4f09-a11d-4449-b054-faa443f44470
[2021-06-03T16:05:42.727Z] Change ownership of all files inside the specific folder from root/root to current user/group
[2021-06-03T16:05:52.696Z] Change permissions with write access of all files inside the specific folder
[2021-06-03T16:05:52.718Z] Running in /var/lib/jenkins/workspace/PR-25045-16-059d4f09-a11d-4449-b054-faa443f44470
[2021-06-03T16:05:56.948Z] + gsutil --version
[2021-06-03T16:05:58.383Z] Masking supported pattern matches of $FILE_CREDENTIAL
[2021-06-03T16:05:58.692Z] + gcloud auth activate-service-account --key-file ****
[2021-06-03T16:05:59.261Z] Activated service account credentials for: [[email protected]]
[2021-06-03T16:05:59.571Z] + gsutil -m -q cp -a public-read eC1wYWNrL2ZpbGViZWF0LWJ1aWxkYjIwMmI0Njc5NGVhNzA3OTFlOTMxYTg5MzYyYTQ1Nzc1MGNlMTQ5Mw gs://beats-ci-temp/ci/cache/
[2021-06-03T16:06:01.582Z] Stage "Extended" skipped due to earlier failure(s)
[2021-06-03T16:06:01.615Z] Stage "Packaging" skipped due to earlier failure(s)
[2021-06-03T16:06:01.652Z] Stage "Packaging-Pipeline" skipped due to earlier failure(s)
[2021-06-03T16:06:01.705Z] Running in /var/lib/jenkins/workspace/Beats_beats_PR-25045/src/github.com/elastic/beats
[2021-06-03T16:06:02.085Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats_PR-25045
[2021-06-03T16:06:02.128Z] [INFO] getVaultSecret: Getting secrets
[2021-06-03T16:06:02.168Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2021-06-03T16:06:02.916Z] + chmod 755 generate-build-data.sh
[2021-06-03T16:06:02.916Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-25045/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-25045/runs/16 FAILURE 3405235
[2021-06-03T16:06:02.916Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-25045/runs/16/steps/?limit=10000 -o steps-info.json
[2021-06-03T16:06:03.827Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats/PR-25045/runs/16/tests/?status=FAILED -o tests-errors.json

🐛 Flaky test report

❕ There are test failures but not known flaky tests.


Genuine test errors 1

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Build&Test / filebeat-build / TestFilestreamCloseRemoved – github.com/elastic/beats/v7/filebeat/input/filestream

@kvch kvch added needs_backport PR is waiting to be backported to other branches. needs_docs labels Apr 14, 2021
@urso

urso commented Apr 14, 2021

+1 on metrics. I wonder if we could rather have the prospector provide the metrics? Then we can also report files that are not actively collected (e.g. due to harvester_limit).

Related discussion on organizing metrics for inputs: https://github.com/elastic/obs-dc-team/issues/401

@kvch
Contributor Author

kvch commented Apr 14, 2021

I thought the same thing, but I decided against it, because if we add metrics to the Prospector, they have to be added every time we introduce a new one. This way, it is implemented once in the reader, which is something we are not replacing with other types of readers in filestream. I also considered the HarvesterGroup, but that was too generic for these metrics.

Also, why would we want to report files that we are not collecting? Maybe we should add metrics for the file scanner?

@kvch kvch force-pushed the feature-filebeat-filestream-metrics branch from c4480de to 46d729a Compare April 14, 2021 15:55
@kvch kvch requested a review from urso April 14, 2021 15:55
@kvch kvch marked this pull request as ready for review April 15, 2021 08:51
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@kvch
Contributor Author

kvch commented Apr 15, 2021

After sleeping on it, I moved global metrics to HarvesterGroup.

@kvch
Contributor Author

kvch commented Apr 15, 2021

Failing tests are unrelated.

@mergify
Contributor

mergify bot commented Apr 19, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-filestream-metrics upstream/feature-filebeat-filestream-metrics
git merge upstream/master
git push upstream feature-filebeat-filestream-metrics

@urso

urso commented Apr 19, 2021

Also, why would we want to report files that we are not collecting? Maybe we should add metrics for the file scanner?

As user I'm curious to know about the amount of unprocessed data + if it is growing or shrinking.

backlog = sum(f.size for f in watched_files) - sum(f.offset for f in watched_files)

By not taking into account files that we do not harvest, we can't measure the total disk usage of the files we are watching, and we can't tell the exact backlog in case a harvester_limit has been configured. The files we actively read are just a subset of the files we are watching. The measurement should not only contain the files we are actively collecting. As long as a file is not removed from disk (or is still held by FB), it should be accounted for.
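The backlog formula above can be sketched in Go; `watchedFile` and its fields are hypothetical illustrations, not types from the PR:

```go
package main

import "fmt"

// watchedFile is a hypothetical view of one file tracked by the file
// watcher: its current size on disk and the reader's committed offset.
type watchedFile struct {
	size   int64 // bytes currently on disk
	offset int64 // bytes already read and acknowledged
}

// backlogBytes sums the unread bytes across all watched files, including
// files that are not actively harvested (e.g. held back by harvester_limit).
func backlogBytes(files []watchedFile) int64 {
	var total int64
	for _, f := range files {
		if f.size > f.offset { // a truncated file contributes nothing
			total += f.size - f.offset
		}
	}
	return total
}

func main() {
	files := []watchedFile{
		{size: 1000, offset: 400}, // actively harvested
		{size: 500, offset: 0},    // waiting, e.g. because of harvester_limit
	}
	fmt.Println(backlogBytes(files)) // prints 1100
}
```

Counting the waiting file as well is what makes the backlog exact even when harvester_limit keeps some files unread.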

Maybe we should add metrics for the file scanner?

That might be an option. Or the prospector. Essentially the entity that is responsible for filtering files by path.

@kvch
Contributor Author

kvch commented Apr 20, 2021

Also, why would we want to report files that we are not collecting? Maybe we should add metrics for the file scanner?

As user I'm curious to know about the amount of unprocessed data + if it is growing or shrinking.

backlog = sum(f.size for f in watched_files) - sum(f.offset for f in watched_files)

By not taking into account files that we do not harvest, we can't measure the total disk usage of the files we are watching, and we can't tell the exact backlog in case a harvester_limit has been configured. The files we actively read are just a subset of the files we are watching. The measurement should not only contain the files we are actively collecting. As long as a file is not removed from disk (or is still held by FB), it should be accounted for.

So if I understand correctly you want to collect detailed progressMetrics for all files that match the paths configuration, right?

@urso

urso commented Apr 20, 2021

So if I understand correctly you want to collect detailed progressMetrics for all files that match the paths configuration, right?

Yes. All files that match paths and are not filtered out because of ignore_older, exclude_files, include_files, or similar. Well, ignore_older is debatable. If a file is not actively collected, the metrics for the given file should have an indicator why/if it is not collected, so I can properly compute the backlog in bytes at any point in time.

@kvch kvch added the in progress Pull request is currently in progress. label Apr 20, 2021
@kvch kvch force-pushed the feature-filebeat-filestream-metrics branch from 507b602 to de08033 Compare April 21, 2021 12:47
@mergify
Contributor

mergify bot commented Apr 22, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-filestream-metrics upstream/feature-filebeat-filestream-metrics
git merge upstream/master
git push upstream feature-filebeat-filestream-metrics

@kvch kvch force-pushed the feature-filebeat-filestream-metrics branch from 03d1110 to 367b2bc Compare April 22, 2021 16:58
@kvch kvch removed the in progress Pull request is currently in progress. label Apr 23, 2021
filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl (outdated)
filebeat/input/filestream/input.go (outdated)
inp.detailedMetrics[srcID].updateCurrentSize(fi.Size())
return nil
})
}()

Alternatively to actively probing the file with fstat, we could define a function-based metric to save CPU. The function-based metric would query the size only on demand, and only if the last file size update is >10s old.

Contributor Author

What do you mean by "only if the last file size update is >10s"? The file was updated 10 seconds ago?


I mean that instead of pro-actively updating the metrics over and over again, we could make it lazy via the metrics callback. Instead of spinning up a go-routine, we check the state on disk lazily.
The callback would cache the last size (or maybe even get its update from the last file watcher scan), and only call fstat if the cached value has not been updated within the last 10 seconds.

By making it 'cheaper' when the metrics are not required, we don't really need a setting to enable/disable metrics support. It will just always be there.

inp.detailedMetrics[srcID] = newProgressMetrics(inp.ID, path, srcID, monitorCancel)
go func() {
timed.Periodic(monitorCtx, 30*time.Second, func() error {
fi, err := f.Stat()

This go-routine is not cleaned up when the harvester is shut down. That is, we will likely fail here because the file was already closed. Unfortunately, we shouldn't keep the file open, in order not to exhaust file descriptors. Alternatively, we might want the prospector/file watcher to update the size.

This reminds me: if possible, the reader should also update the internal 'size' once it hits EOF. That is, the reader should not read until EOF, but until the internal size. Once the limit is reached, use Stat to update the size and continue reading if the size has increased. This is e.g. helpful when reading from NFS, as the stat circumvents 0-byte reads by forcing extra roundtrips at the protocol level.

Contributor Author

Why isn't it cleaned up? The canceler is from the input.Context, so if the parent is cancelled, it should be cancelled as well. What did I miss?


Oh, I missed that the monitorCtx is derived from the input context. Nevermind.

There is still a potential race on f, as the overall input shutdown does not wait for the monitor to finish before the file is closed.

filebeat/input/filestream/input.go (outdated)
@kvch kvch force-pushed the feature-filebeat-filestream-metrics branch from 7fae87b to 6ebf0ba Compare May 17, 2021 12:44
@kvch
Contributor Author

kvch commented May 17, 2021

I have updated the PR.

"github.com/elastic/beats/v7/libbeat/common"
"github.com/elastic/beats/v7/libbeat/common/match"
"github.com/elastic/beats/v7/libbeat/reader/readfile"
)

// Config stores the options of a file stream.
type config struct {
ID string `config:"id"`

Can we document how ID is used?

Is it used for registry state?

Is it used for monitoring?

Contributor Author

Added documentation.

@@ -43,6 +45,8 @@ const pluginName = "filestream"

type state struct {
Offset int64 `json:"offset" struct:"offset"`

id string

The state is going to be stored in the registry file. Do we really need the ID here? Does the ID change?

go func() {
m := m
timed.Periodic(monitorCtx, 30*time.Second, func() error {
fi, err := f.Stat()

The (*filestream).Run method will close f on shutdown. This go-routine should be guaranteed to be stopped before Run closes the file.

f, err := inp.openFile(log, path, offset)
if err != nil {
return nil, err
}

inp.startMonitoring(log, canceler, srcID, path, f)

maybe we want to move startMonitoring to Run, in order to make it more obvious which tasks an input is running. This would require us to split open into openFile and newReader.

@@ -48,25 +48,36 @@ type Harvester interface {
// Run is the event loop which reads from the source
// and forwards it to the publisher.
Run(input.Context, Source, Cursor, Publisher) error
// Monitor is required for detailed metrics.
Monitor(input.Context, Source)

Instead of having a Monitor per harvester, I wonder if we should have the file watcher do the reporting. The file watcher would report the size update and we would find and update the actual state.

@mergify
Contributor

mergify bot commented May 18, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-filestream-metrics upstream/feature-filebeat-filestream-metrics
git merge upstream/master
git push upstream feature-filebeat-filestream-metrics

@kvch kvch force-pushed the feature-filebeat-filestream-metrics branch from 160affa to b202b46 Compare June 3, 2021 15:08
@kvch kvch mentioned this pull request Jun 4, 2021
18 tasks
@mergify
Contributor

mergify bot commented Jun 24, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-filestream-metrics upstream/feature-filebeat-filestream-metrics
git merge upstream/master
git push upstream feature-filebeat-filestream-metrics

@botelastic

botelastic bot commented Jul 24, 2021

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jul 24, 2021
@botelastic

botelastic bot commented Aug 23, 2021

Hi!
This PR has been stale for a while and we're going to close it as part of our cleanup procedure.
We appreciate your contribution and would like to apologize if we have not been able to review it, due to the current heavy load of the team.
Feel free to re-open this PR if you think it should stay open and is worth rebasing.
Thank you for your contribution!

@botelastic botelastic bot closed this Aug 23, 2021
@kvch
Contributor Author

kvch commented Dec 8, 2021

I am reopening the PR because metrics support still has to be added to filestream.

@kvch kvch reopened this Dec 8, 2021
@botelastic botelastic bot removed the Stalled label Dec 8, 2021
@mergify
Contributor

mergify bot commented Dec 8, 2021

This pull request does not have a backport label. Could you fix it @kvch? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 7.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 8, 2021
@kvch kvch added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 8, 2021
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23
Collaborator

jlind23 commented Jan 6, 2022

@kvch is this PR pending for review?

@mergify mergify bot assigned kvch Feb 8, 2022
@kvch
Contributor Author

kvch commented Apr 4, 2022

Stalled; we will need time to clean this up.

@kvch kvch closed this Apr 4, 2022