-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4896] don’t redundantly overwrite executor JAR deps #2848
Conversation
Can one of the admins verify this patch? |
Jenkins, test this please. This LGTM pending tests - I do wonder whether there is a simple way to structure the logic such that we don't need a big comment explaining the safety there, but I couldn't think of one that didn't involve a lot of nesting. |
QA tests have started for PR 2848 at commit
|
QA tests have finished for PR 2848 at commit
|
Test PASSed. |
@preaudc see last commit, I applied this change to the |
Jenkins, test this please. Does that work if I am not an admin? @pwendell agreed, the logic is a little tricky but I couldn't find a simpler way to express it; in the meantime, I factored it out since it was repeated in two |
Jenkins, retest this please. |
QA tests have started for PR 2848 at commit
|
QA tests have finished for PR 2848 at commit
|
Test PASSed. |
Can one of the admins verify this patch? |
bump; what's the process for getting things merged in? any more comments here? |
5f1a6f1
to
80a20bd
Compare
bump. @JoshRosen @pwendell lmk what the process is please? |
At this point its waiting on review from us. No action needed on your side. Sent from my phone
|
cool, thanks @pwendell. I am new here so don't know if there are other things I'm supposed to be doing! |
in: InputStream, | ||
tempFile: File, | ||
destFile: File, | ||
fileOverwrite: Boolean): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style should be
private def moveFile(
url: String,
...
fileOverwrite: Boolean): Unit = {
...
}
retest this please, just because Jenkins hasn't tested this one in a while. |
Test build #23029 has started for PR 2848 at commit
|
Hey @ryan-williams +1 on abstracting the duplicate code. Just one clarification: the only change in functionality here is it saves overwriting an existing jar unnecessarily, is that correct? How does this solve SPARK-3967? |
Thanks for having a look, I'll get to that style nit momentarily. The tie-in to SPARK-3967 is that, on my cluster, when attempting to redundantly copy the JAR deps, I was getting "Permission Denied" errors because not only were the JARs already there, they were read-only and could not be overwritten. I did not get to the bottom of who was putting them there; it seemed like every executor would try to copy its JARs, find them to be already present, try to copy anyway, and fail due to "permission denied." Simply skipping the redundant copy attempt (as this PR does) resulted in my jobs passing, and it seemed like the original authors intended to not attempt to redundantly copy but had a bug, so I settled for this "fix" and didn't get fully to the bottom of this. For example, suppose an executor E hit the code path where a given JAR is present but its contents are not identical to the one E wants. In this case, E would attempt to overwrite the existing JAR and fail due to the same "Permission denied", so there could be some larger way in which it needs to be made possible for executors to overwrite their JAR deps (maybe that is just a cluster config issue I and @preaudc are hitting?), but if so I think that's orthogonal to this change. Let me know if I should just take the "SPARK-3967" label out of this PR, since it might not fully "solve" the issue. |
80a20bd
to
4df5afe
Compare
fixed the style nit, also added a comment nit I found in CoGroupedRDD, lmk if you want that in a separate PR or something |
Test build #23029 has finished for PR 2848 at commit
|
Test PASSed. |
} | ||
// If the destFile exists at this point, it is equal to tempFile and we can skip moving | ||
// it; the code above deletes it or throws an Exception, as appropriate, if it exists and | ||
// is not equal to tempFile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ryan-williams my concern is that if the delete fails (e.g. because of some permissions thing), then we will always have an old version of the file because destFile.exists
will always be true. It might make sense to throw an exception if delete is unsuccessful instead of logging a message up there in L429 that is technically not true (I realize you just moved the code, but it would still be good to fix this).
Hey @ryan-williams thanks for your explanation. Looks like there could be another potential source of the same issue in L397, when we use the executor cache. |
OK, I refactored a little further. Updates:
lmk how it looks! |
d74f785
to
bcf2543
Compare
OK, I added logic for making sure we always delete |
Jenkins, retest this please. |
Test build #24431 has started for PR 2848 at commit
|
Test build #24431 has finished for PR 2848 at commit
|
Test FAILed. |
Hmm, maybe a legitimate test failure?
|
interesting. this may be more appropriate on my gigantic thread about testing/development best practices, but do you have any recommendations for debugging this? AFAICT, this seems to be a test that requires a full assembly-build to run; I've tried running I'm guessing this failure must have something to do with having moved / removed (temp?) files that this test was relying on; my naive approach will be to comment out some of the move/removal lines and see if that fixes the test, but on pain of running a Is there a better way? |
Disclaimer: for iterative debugging, I use The issue here is probably that you're running a full Let's say that I'm starting from a completely cold build (but with Maven dependencies already downloaded): # Since I have zinc installed, I'll use it:
zinc -start
git checkout /a/branch/with/your/pr/code
# Here, the -T C1 says "build in parallel with one thread per core":
time mvn -T C1 clean package -DskipTests This didn't take super long, but it was a few minutes:
Let's run just the test suite that we're interested in (instructions from here:
This took a little while because it had to build a bunch of test sources, but it was only a few seconds before the tests started running (and failing):
Let's try adding a print statement to
And to re-run the test:
So this took a couple of minutes. Let's do the same thing in SBT. First, let's start off from a completely clean-slate:
The timing here could be messed up because I have a multi-core machine:
Next, let's run just the suite we're interested in. Here's a naive way to do this, which involves building every test, so this first run will take longer than subsequent runs:
Whoops, I made a mistake here! My
This was pretty fast:
I could also have run this from the interactive shell to get automatic rebuilding on source changes. There's an even faster way of running just FileServerSuite, though: I can tell SBT to only build / run the [info] Set current project to spark-parent (in build file:/Users/joshrosen/Documents/spark/)
> clean
[success] Total time: 20 s, completed Dec 15, 2014 7:01:20 PM
> project core
[info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
> package
[...]
[info] Compiling 42 Java sources to /Users/joshrosen/Documents/spark/network/common/target/scala-2.10/classes...
[info] Compiling 20 Java sources to /Users/joshrosen/Documents/spark/network/shuffle/target/scala-2.10/classes...
[info] Compiling 397 Scala sources and 33 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/classes...
[...]
[info] Packaging /Users/joshrosen/Documents/spark/core/target/scala-2.10/spark-core_2.10-1.3.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 64 s, completed Dec 15, 2014 7:02:36 PM
> test-only *FileServerSuite
[...]
[info] Compiling 124 Scala sources and 4 Java sources to /Users/joshrosen/Documents/spark/core/target/scala-2.10/test-classes...
[...]
[---- tests run ----]
[---- tests go into infinite loop ----]
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such file or directory
[... infinite repetitions ...] Hmm, so it looks like the tests that rely on
Now, from the SBT shell: > project core
[info] Set current project to spark-core (in build file:/Users/joshrosen/Documents/spark/)
> ~test-only *FileServerSuite
[... tests run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 18 s, completed Dec 15, 2014 7:10:57 PM
1. Waiting for source changes... (press enter to interrupt)
[... add a println to Utils.addFile ...]
[... tests start up almost instantly and run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 19 s, completed Dec 15, 2014 7:11:46 PM So, to summarize: I agree that there are a bunch of pain-points in the current build process. Day-to-day, though, it hasn't affected me that much since I'll usually |
see SPARK-3967 Combine some file-move logic in Utils.fetchFile
bcf2543
to
c14daff
Compare
Thanks a lot for the build-process brain-dump, @JoshRosen; some of that is way better than processes I've wasted a lot of time on. I'm interested in getting all of these tips and tricks pushed to Spark docs, but even after all of the emailing your response includes tricks I've not seen documented anywhere (e.g. In any case, I was able to debug and (I believe) fix the test failure, which was a result of having inadvertently changed a file "copy" to a "move", so it's a good thing the test caught it. This should be ready to re-test. |
Jenkins, retest this please. |
Test build #24518 has started for PR 2848 at commit
|
Test build #24518 has finished for PR 2848 at commit
|
Test PASSed. |
This looks good to me, so I'm ready to merge it. One final little thing, though: since we have two separate PRs that address SPARK-3967, I'd like to create a new JIRA for this fix just in case this doesn't fix the YARN issue in the original ticket. Therefore, I've opened https://issues.apache.org/jira/browse/SPARK-4896. Do you mind editing this PR's title to reference that JIRA instead? Once you do that, I'll merge this into all of the maintenance branches and resolve the original JIRA. Thanks for your patience during this long review and with the build difficulties. I think that the end result here is a significant improvement over the old |
thanks for shepherding this @JoshRosen, I've updated the PR title accordingly, lmk if I should do anything else here! |
Thanks for updating the title. This looks good to me, so I'm going to merge this into |
Author: Ryan Williams <[email protected]> Closes #2848 from ryan-williams/fetch-file and squashes the following commits: c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently 8e39c16 [Ryan Williams] code review feedback 788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps (cherry picked from commit 7981f96) Signed-off-by: Josh Rosen <[email protected]>
Looks like the |
Oh, and also because some of the cache stuff wasn't present. |
Author: Ryan Williams <[email protected]> Closes #2848 from ryan-williams/fetch-file and squashes the following commits: c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently 8e39c16 [Ryan Williams] code review feedback 788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps (cherry picked from commit 7981f96) Signed-off-by: Josh Rosen <[email protected]> Conflicts: core/src/main/scala/org/apache/spark/util/Utils.scala
Alright, finished the 1.1 cherry-pick: 546a239 |
No description provided.