Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HADOOP-18657. Tune ABFS create() retry logic #5462

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

steveloughran
Copy link
Contributor

Description of PR

Tunes how abfs handles a failure during create which may be due to concurrency or load-related retries happening in the store.

  • better logging
  • happy with the confict being resolved by the file being deleted
  • more diagnostics in failure raised

How was this patch tested?

lease test run already; doing full hadoop-azure test run

For code changes:

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Production code changes; no tests yet.

Something with mockito is going to be needed here

Change-Id: I430a9f0e6796461ccec8c23cd80d024258703048
@steveloughran
Copy link
Contributor Author

fyi @saxenapranav @mehakmeet

as well as improving diagnostics, this patch also changes the recovery code by handling a deletion of the target file between the first failure and the retry.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 38m 13s trunk passed
+1 💚 compile 0m 42s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 39s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 0m 35s trunk passed
+1 💚 mvnsite 0m 43s trunk passed
+1 💚 javadoc 0m 42s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 34s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 16s trunk passed
+1 💚 shadedclient 20m 31s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 34s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 34s the patch passed
+1 💚 compile 0m 30s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 0m 30s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 20s the patch passed
+1 💚 mvnsite 0m 33s the patch passed
+1 💚 javadoc 0m 26s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 24s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 3s the patch passed
+1 💚 shadedclient 20m 12s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 11s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
93m 52s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/1/artifact/out/Dockerfile
GITHUB PR #5462
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 0304206b7a96 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a4122e2
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/1/testReport/
Max. process+thread count 627 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Copy link
Contributor

@saxenapranav saxenapranav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Requesting @snvijaya @anmolanmol1234 @sreeb-msft to please review it as well.

Regards.

LOG.debug("Failed to create file {} with etag {}; status code={}",
relativePath, eTag, sc, ex);
if (sc == HttpURLConnection.HTTP_PRECON_FAILED
|| sc == HttpURLConnection.HTTP_CONFLICT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good that we have taken care of 409 which can come when due to etag!=null -> overwrite argument to client.createPath = false.

would be awesome if we can put it in comments, and also have log according to it.
log1: about some file is there whose eTag is with our process. When we went back to createPath with the same eTag, some other process had replaced that file which would lead to 412, which is present in the added code:

 final ConcurrentWriteOperationDetectedException ex2 =
                new ConcurrentWriteOperationDetectedException(
                    AbfsErrors.ERR_PARALLEL_ACCESS_DETECTED
                        + " Path =\"" + relativePath+ "\""
                        + "; Status code =" + sc
                        + "; etag = \"" + eTag + "\""
                        + "; error =" + ex.getErrorMessage());

suggestion to add log2: where in when we searched for etag, there was no file, now when we will try to createPath with overWrite = false, if it will give 409 in case some other process created a file on same path.

Also, in case of 409, it is similar to the case we started with in this method. Should we get into 409 control as in

for a number of times. Like if we keep threshold as 2. If it happens that it gets 409 at this line, we will try once again to handle 409, post that we fail. @snvijaya @anmolanmol1234 @sreeb-msft, what you feel.

throw new ConcurrentWriteOperationDetectedException(
"Parallel access to the create path detected. Failing request "
+ "to honor single writer semantics");
// this means the other thread deleted it and the conflict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a race condition in the job, and developer should be informed about the same. @snvijaya @anmolanmol1234 @sreeb-msft , what you feel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do; text will indicate this may be due to a lease on the parent dir too.

log error at warn; full stack at DEBUG.

Change-Id: Id05d8d0fa0d5913529cbaa093bf7cf8ed09d5031
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 53s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 50m 19s trunk passed
+1 💚 compile 0m 41s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 0m 38s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 0m 35s trunk passed
+1 💚 mvnsite 0m 44s trunk passed
+1 💚 javadoc 0m 40s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 34s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 16s trunk passed
+1 💚 shadedclient 20m 16s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 33s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 0m 33s the patch passed
+1 💚 compile 0m 29s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 0m 29s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 19s the patch passed
+1 💚 mvnsite 0m 32s the patch passed
+1 💚 javadoc 0m 25s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 23s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 4s the patch passed
+1 💚 shadedclient 20m 21s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 11s hadoop-azure in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
105m 15s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/2/artifact/out/Dockerfile
GITHUB PR #5462
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 24e1da3b49dc 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 1853a46
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/2/testReport/
Max. process+thread count 554 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5462/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran
Copy link
Contributor Author

any updates on this? big issue is whether to retry on 409 or not?

try {
op = client.getPathStatus(relativePath, false, tracingContext);
} catch (AbfsRestOperationException ex) {
LOG.debug("Failed to to getPathStatus {}", relativePath, ex);
if (ex.getStatusCode() == HttpURLConnection.HTTP_NOT_FOUND) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @steveloughran, Given Hadoop is single writer semantic, would it be correct to expect that as part of job parallelization only one worker process should try to create a file ? As this check for FileNotFound is post an attempt to create the file with overwrite=false, which inturn failed with conflict indicating file was just present, concurrent operation on the file is indeed confirmed.

Its quite possible that if we let this create proceed, some other operation such as delete can kick in later on as well. Below code that throws exception at the first indication of parallel activity would be the right thing to do ?

As the workload pattern is not honoring the single writer semantic I feel we should retain the logic to throw ConcurrentWriteOperationDetectedException.

@steveloughran
Copy link
Contributor Author

reviewing this; too many other things have got in my way.

I agree, with create overwrite=false, we must fail with a concurrency error

what we don't want to do is overreact if we are doing overwrite=true and something does happen partway.

I'll look at this in more detail, maybe focus purely on being meaningful on errors, in particular making sure that if the file is deleted before the error is raised, keep raising that concurrency error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants