Refactor beats lockfile to use timeout, retry #34194

fearful-symmetry · 2023-01-05T23:25:02Z

What does this PR do?

This PR substantially refactors the lockfile to remove the "PID check" system and instead retry the underlying lock operation. This is mainly to deal with an (apparently somewhat common) edge case where a beat shuts down improperly, leaving the old lockfile around, and the container environment restarts with a new PID namespace, allowing a collision between the PID written in the lockfile and another running process.

This change puts the lockfile logic back in the hands of the OS: we try to obtain a lock, and if we can't, we retry a set number of times. In a case where a beat has shut down improperly and the lockfile remains, instead of looking up a PID, we rely on the OS to release the underlying lock for the dead process, which most OSes will generally do after a set amount of time.

I still need to test this by hand on Windows and Darwin.

Why is it important?

This is meant to deal with a few edge cases in how PID handling works.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

elasticmachine · 2023-01-05T23:25:05Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

mergify · 2023-01-05T23:25:36Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2023-01-06T00:45:40Z

💚 Build Succeeded

Build stats

Start Time: 2023-01-30T21:26:44.826+0000
Duration: 65 min 7 sec

Test stats 🧪

Test	Results
Failed	0
Passed	25311
Skipped	1962
Total	27273

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

leehinman · 2023-01-09T17:04:28Z

libbeat/cmd/instance/locks/lock.go

-	if openErr != nil {
-		err = lock.handleFailedCreate()
+	for i := 0; i < lock.retryCount; i++ {
+		gotLock, err := lock.fileLock.TryLock()


I don't think TryLock uses the os.O_EXCL flag. That means the file could exist already, and I think that would lead to a race condition in the Unlock function with Unlock & Remove.

fearful-symmetry · Jan 9, 2023

Could you elaborate? I assume you mean another beat swooping in while one beat is trying to lock or remove the file?

leehinman · Jan 10, 2023

In the unlock code, it first unlocks, then does a remove. In between those lines of code another beat could create a new lock, but the file would be removed. This results in the new beat having an error if it goes to unlock because the lock file doesn't exist.

panic: unable to unlock data path file testing.lock: remove testing.lock: no such file or directory

fearful-symmetry · Jan 10, 2023

hmm, that's an interesting edge case. Gonna see if I can think of a non-awkward way to protect against that.

fearful-symmetry · Jan 10, 2023

Alright, made a change to remove the file first before we remove the lock. Going to see how the Windows CI reacts to that, but I imagine we'll want some manual testing, since I don't understand the Windows lockfile logic too well.

cmacknz · Jan 26, 2023

Can we put some or all of the detail from the PR description directly in the description of the Lock function? For example adding this would help the next developer to understand how this works.

In a case where a beat has shutdown improperly and the lockfile remains, instead of looking up a PID, we rely on the OS to release the underlying lock for the dead process, which most OSes will generally do, after a set amount of time.

It may also be worth noting that putting the PID into the lock file failed. To some degree we have had several iterations on this code because the original code did not explain itself at all, so let's try to avoid creating that problem again.

cmacknz · 2023-01-09T18:02:32Z

Needs a changelog entry 📓

anmironov · 2023-01-11T09:56:19Z

Hi Team, could you please clarify to me, if it will be in new release or backported to 8.6.0? Because, at the moment the issue (Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid....) persists in 8.6.0

cmacknz · 2023-01-11T14:46:47Z

Hi Team, could you please clarify to me, if it will be in new release or backported to 8.6.0?

Yes the plan is to put this in 8.6.1 as well as 8.7.0.

fearful-symmetry · 2023-01-11T19:50:10Z

Alright, tested manually on Linux, seems fine.

leehinman

unfortunately on Windows we get this panic on unlock

panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

mergify · 2023-01-23T16:40:04Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lockfile-with-timeout upstream/lockfile-with-timeout
git merge upstream/main
git push upstream lockfile-with-timeout

fearful-symmetry · 2023-01-23T21:37:24Z

Made a few changes based on a discussion with @leehinman, tested on Linux/darwin

libbeat/cmd/instance/locks/lock.go

leehinman

tried it on Windows, doesn't work. Lock followed by Unlock gives: panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

leehinman · 2023-01-26T15:28:47Z

tried it on Windows, doesn't work. Lock followed by Unlock gives: panic: unable to remove data path file testing.lock: remove testing.lock: The process cannot access the file because it is being used by another process.

Never mind, it worked on Windows. I screwed up the build and got the wrong version of the PR.

leehinman · 2023-01-27T23:07:08Z

libbeat/cmd/instance/locks/lock.go

-			lock.logger.Debugf("%s shut down without removing previous lockfile and is currently in a zombie state, continuing", lock.beatName)
-			return lock.recoverLockfile()
+	// now unlock on windows.
+	if runtime.GOOS == "windows" {


could we split this into 3 files? lock.go, unlock_posix.go, unlock_windows.go. Then we can have the Unlock function in the OS specific file, and use build tags to only compile the "right" one. The big benefit is that the doc strings can be specific to the OS since we are switching behavior based on that and it will be easier to understand later on.

fearful-symmetry · Jan 28, 2023

Ah, good idea.

leehinman · 2023-01-30T19:47:14Z

libbeat/cmd/instance/locks/lock_windows.go

+	}
+
+	// now unlock on windows.
+	if runtime.GOOS == "windows" {


can we get rid of the runtime check? The build tags should mean this is the only "Unlock" implementation under Windows.

fearful-symmetry · Jan 30, 2023

Oh! Forgot to delete that...

leehinman

LGTM

* move lockfile logic to a retries
* clean up
* add changelog, update docs
* change unlock operation, remove file first
* fix tests
* fix lock on windows
* remove debug line
* add docs
* split out files
* remove old OS checks
* fix error
* format

(cherry picked from commit 21b6128)

# Conflicts:
# libbeat/cmd/instance/locks/lock.go

FranAguiar · 2023-02-06T13:25:29Z

Hi!!
When will this fix be released? Thanks

cmacknz · 2023-02-06T17:53:28Z

This will be in 8.6.2 which is coming soon.

…34435)

* Refactor beats lockfile to use timeout, retry (#34194)
* move lockfile logic to a retries
* clean up
* add changelog, update docs
* change unlock operation, remove file first
* fix tests
* fix lock on windows
* remove debug line
* add docs
* split out files
* remove old OS checks
* fix error
* format

(cherry picked from commit 21b6128)

# Conflicts:
# libbeat/cmd/instance/locks/lock.go

* fix cherry pick

---------

Co-authored-by: Alex K <[email protected]>
Co-authored-by: Alex Kristiansen <[email protected]>
Co-authored-by: Craig MacKenzie <[email protected]>

ariahi18 · 2023-02-15T14:31:30Z

Hello!!
any updates on the new release 8.6.2 please?
Thank you,

cmacknz · 2023-02-25T18:04:18Z

~~8.6.2 has been released. This may not have completely fixed the problem in some circumstances unfortunately.~~

cmacknz · 2023-02-27T14:46:43Z

The report of this still happening on 8.6.2 was a false alarm related to misconfiguration.

* move lockfile logic to a retries
* clean up
* add changelog, update docs
* change unlock operation, remove file first
* fix tests
* fix lock on windows
* remove debug line
* add docs
* split out files
* remove old OS checks
* fix error
* format

fearful-symmetry added 2 commits January 5, 2023 15:02

move lockfile logic to a retries

90ee085

Merge remote-tracking branch 'upstream/main' into lockfile-with-timeout

5635bef

fearful-symmetry added bug Team:Elastic-Agent Label for the Agent team labels Jan 5, 2023

fearful-symmetry requested a review from a team as a code owner January 5, 2023 23:25

fearful-symmetry self-assigned this Jan 5, 2023

fearful-symmetry requested review from faec and leehinman and removed request for a team January 5, 2023 23:25

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 5, 2023

clean up

b024cb4

leehinman reviewed Jan 9, 2023

cmacknz added the backport-v8.6.0 Automated backport with mergify label Jan 9, 2023

fearful-symmetry added 2 commits January 9, 2023 10:58

add changelog, update docs

5ae601e

change unlock operation, remove file first

cb2aec6

fix tests

6049384

fearful-symmetry requested a review from leehinman January 11, 2023 19:49

Merge remote-tracking branch 'upstream/main' into lockfile-with-timeout

1d5af74

leehinman requested changes Jan 12, 2023

fix lock on windows

1336649

Merge remote-tracking branch 'upstream/main' into lockfile-with-timeout

86d0c41

fearful-symmetry requested a review from leehinman January 23, 2023 19:37

remove debug line

e9047d1

cmacknz reviewed Jan 26, 2023

cmacknz reviewed Jan 26, 2023

leehinman requested changes Jan 26, 2023

add docs

3618a81

fearful-symmetry requested a review from leehinman January 26, 2023 19:42

leehinman requested changes Jan 27, 2023

split out files

2a49fcf

fearful-symmetry requested a review from leehinman January 28, 2023 01:05

leehinman reviewed Jan 30, 2023

remove old OS checks

903d003

fearful-symmetry requested a review from leehinman January 30, 2023 20:21

fearful-symmetry added 2 commits January 30, 2023 13:12

fix error

ec43744

format

86dc2fe

leehinman approved these changes Jan 30, 2023

fearful-symmetry merged commit 21b6128 into elastic:main Jan 31, 2023

mergify bot mentioned this pull request Jan 31, 2023

[8.6](backport #34194) Refactor beats lockfile to use timeout, retry #34435

Merged
