HADOOP-18657. Tune ABFS create() retry logic #5462
base: trunk
Conversation
Production code changes; no tests yet. Something with Mockito is going to be needed here. Change-Id: I430a9f0e6796461ccec8c23cd80d024258703048
As well as improving diagnostics, this patch changes the recovery code to handle deletion of the target file between the first failure and the retry.
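For readers outside the codebase, a minimal sketch of the conditional-create recovery flow this PR tunes. The StoreClient interface, RestException type, and method signatures are hypothetical stand-ins, not the actual ABFS client API; only the status-code handling follows the pattern discussed in this thread.

```java
import java.net.HttpURLConnection;

// Hypothetical stand-ins for the ABFS client; names and signatures are
// illustrative only, not the Hadoop API.
interface StoreClient {
  String getEtag(String path) throws RestException;   // HEAD on the path
  void createPath(String path, boolean overwrite, String ifMatchEtag)
      throws RestException;
}

class RestException extends Exception {
  final int statusCode;
  RestException(int statusCode, String message) {
    super(message);
    this.statusCode = statusCode;
  }
}

class ConditionalCreateSketch {
  static void create(StoreClient client, String path) throws Exception {
    try {
      client.createPath(path, /* overwrite */ false, /* ifMatchEtag */ null);
      return;
    } catch (RestException e) {
      if (e.statusCode != HttpURLConnection.HTTP_CONFLICT) {
        throw e;  // not an "already exists" conflict; nothing to recover
      }
    }
    // 409: the file exists. Our first attempt may in fact have succeeded and
    // this is a retransmitted request, so fetch the etag and retry as a
    // conditional overwrite; a concurrent writer then surfaces as an error
    // rather than silent data loss.
    final String etag;
    try {
      etag = client.getEtag(path);
    } catch (RestException e) {
      if (e.statusCode == HttpURLConnection.HTTP_NOT_FOUND) {
        // deleted between the first failure and the retry: concurrent activity
        throw new Exception("Parallel access to " + path + " detected");
      }
      throw e;
    }
    try {
      client.createPath(path, /* overwrite */ true, etag);
    } catch (RestException e) {
      if (e.statusCode == HttpURLConnection.HTTP_PRECON_FAILED
          || e.statusCode == HttpURLConnection.HTTP_CONFLICT) {
        // 412: the file was replaced under us; 409: deleted and recreated.
        // Either way another process is writing to the same path.
        throw new Exception("Parallel access to " + path + " detected");
      }
      throw e;
    }
  }
}
```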
💔 -1 overall
This message was automatically generated.
Thanks for the PR. Requesting @snvijaya @anmolanmol1234 @sreeb-msft to please review it as well.
Regards.
LOG.debug("Failed to create file {} with etag {}; status code={}", | ||
relativePath, eTag, sc, ex); | ||
if (sc == HttpURLConnection.HTTP_PRECON_FAILED | ||
|| sc == HttpURLConnection.HTTP_CONFLICT) { |
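(For reference: HTTP_PRECON_FAILED (412) is returned when the If-Match etag condition is not met, i.e. the file was replaced after its etag was read; HTTP_CONFLICT (409) is returned when the path already exists and the create was sent with overwrite=false.)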
Good that we have taken care of the 409, which can come when eTag != null leads to the overwrite argument of client.createPath being false. It would be awesome if we could put this in comments, and also log accordingly.
Log 1: some file exists whose eTag our process holds. When we went back to createPath with the same eTag, some other process had replaced that file, which leads to the 412 handled in the added code:
final ConcurrentWriteOperationDetectedException ex2 =
    new ConcurrentWriteOperationDetectedException(
        AbfsErrors.ERR_PARALLEL_ACCESS_DETECTED
        + " Path =\"" + relativePath + "\""
        + "; Status code =" + sc
        + "; etag = \"" + eTag + "\""
        + "; error =" + ex.getErrorMessage());
Suggestion to add log 2: when we searched for the eTag there was no file, and now, when we try createPath with overwrite = false, it may give a 409 if some other process created a file on the same path.
Also, the 409 case is similar to the case this method started with. Should we go into the 409 handling as in
Line 624 in 7f9ca10
if (e.getStatusCode() == HttpURLConnection.HTTP_CONFLICT) {
throw new ConcurrentWriteOperationDetectedException(
    "Parallel access to the create path detected. Failing request "
    + "to honor single writer semantics");
// this means the other thread deleted it and the conflict
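(Design note: retrying on a second 409 would re-enter the same recovery path and risks two writers repeatedly overwriting each other's creates, whereas failing fast with ConcurrentWriteOperationDetectedException honors the single-writer semantics; whether to retry on 409 is the open question raised later in this thread.)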
There is a race condition in the job, and the developer should be informed about it. @snvijaya @anmolanmol1234 @sreeb-msft, what do you feel?
Will do; the text will indicate this may be due to a lease on the parent dir too.
log error at warn; full stack at DEBUG. Change-Id: Id05d8d0fa0d5913529cbaa093bf7cf8ed09d5031
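A minimal sketch of the warn/debug split described in that commit message, assuming an SLF4J logger (org.slf4j.Logger); the message text and variable names are illustrative, not the exact PR code:

```java
// One-line summary at WARN so operators see it without enabling debug logging.
LOG.warn("Failed to recreate file {} with etag {}: {}",
    relativePath, eTag, ex.toString());
// Passing the exception as the last argument makes SLF4J emit the full stack trace.
LOG.debug("Failed to recreate file {} with etag {}", relativePath, eTag, ex);
```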
💔 -1 overall
This message was automatically generated.
Any updates on this? The big issue is whether to retry on 409 or not.
try {
  op = client.getPathStatus(relativePath, false, tracingContext);
} catch (AbfsRestOperationException ex) {
  LOG.debug("Failed to getPathStatus {}", relativePath, ex);
  if (ex.getStatusCode() == HttpURLConnection.HTTP_NOT_FOUND) {
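(A 404 here means the file whose existence caused the original 409 has since been deleted, which is itself evidence that another process is operating on the same path.)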
Hi @steveloughran, given Hadoop's single-writer semantics, would it be correct to expect that, as part of job parallelization, only one worker process should try to create a file? As this check for FileNotFound comes after an attempt to create the file with overwrite=false, which in turn failed with a conflict indicating the file was just present, a concurrent operation on the file is indeed confirmed.
It's quite possible that if we let this create proceed, some other operation such as delete could kick in later as well. Wouldn't the code below, which throws an exception at the first indication of parallel activity, be the right thing to do?
As the workload pattern is not honoring the single-writer semantics, I feel we should retain the logic to throw ConcurrentWriteOperationDetectedException.
Reviewing this; too many other things have got in my way. I agree: with create overwrite=false we must fail with a concurrency error. What we don't want to do is overreact if we are doing overwrite=true and something does happen partway. I'll look at this in more detail, maybe focus purely on being meaningful on errors, in particular making sure that if the file is deleted before the error is raised, we keep raising that concurrency error.
Description of PR
Tunes how ABFS handles a failure during create which may be due to concurrency, or to load-related retries happening in the store.
How was this patch tested?
Lease tests run already; doing a full hadoop-azure test run.
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?