Check whether renames worked in atomic hdfs pipes #2119
Conversation
@daveFNbuck, thanks for your PR! By analyzing the history of the files in this pull request, we identified @Tarrasch, @freider and @lallea to be potential reviewers.
Rubberstamping as green. Didn't look through this in detail, but you seem to have good test coverage.
Force-pushed from 5b42b44 to b032814.
Fixed the merge conflict with the other PR. Thanks for reviewing so quickly. You may want to just check that you think my solution is reasonable. It adds a decent amount of overhead to these operations. I think it's ok because hdfs writes tend to be large and data integrity should always take priority over pipeline speed.
Honestly, I never used luigi pipes to write to files after I quit Spotify. I reviewed a little bit more; looks good. :)
I was having the same problem and am trying out the latest version with this fix. It doesn't seem to be working, but I think that is because it doesn't play well with snakebite, which I'm using as my hdfs client. It throws a FileNotFoundException rather than a HDFSCliError when trying to remove a file that doesn't exist.
Indeed, switching the hdfs client back to the default works. Any suggestions as to the best way to fix?
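One client-agnostic way to sidestep the mismatched exception types (snakebite's FileNotFoundException vs. the CLI client's HDFSCliError) is to only remove the target when it exists. This is a minimal sketch, not luigi's actual code; the `client` object and its `exists`/`remove` methods are illustrative assumptions:

```python
def remove_if_exists(client, path):
    """Remove `path` only when it exists, avoiding client-specific
    'no such file' exceptions. `client` is any object exposing
    exists() and remove(); the names here are illustrative."""
    if client.exists(path):
        client.remove(path)


class FakeClient:
    """Tiny in-memory stand-in used to exercise the helper."""
    def __init__(self, paths):
        self.paths = set(paths)

    def exists(self, path):
        return path in self.paths

    def remove(self, path):
        # Would raise KeyError if the path were missing.
        self.paths.remove(path)


client = FakeClient({'/data/target'})
remove_if_exists(client, '/data/target')   # removed
remove_if_exists(client, '/data/missing')  # no-op, no exception
```

Note the check-then-remove pattern leaves a small race window between `exists` and `remove`, which is one reason the PR also validates the rename result rather than relying on removal alone.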
Description
Using the snakebite client, renaming a file to an existing file fails without raising an exception. Instead, it returns a list of moves and whether each one succeeded. This means that when using the snakebite client, atomic writes to existing files fail silently, whereas the hadoopcli client fails with an error.
There are two ways to fix this. First, we can simply remove the target file if it already exists. Second, we can ensure that we read the snakebite error and throw it if it happens. This PR does both.
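The second fix can be sketched roughly as follows. This assumes snakebite-style rename results, i.e. an iterable of dicts with `'path'` and `'result'` keys (plus an `'error'` message on failure); the function and error names are illustrative, not luigi's exact implementation:

```python
class HDFSCliError(Exception):
    """Stand-in for luigi's HDFS error type (illustrative)."""


def check_rename_results(results):
    """Raise if any individual move in a snakebite-style rename failed.

    `results` is assumed to be an iterable of dicts with at least
    'path' and 'result' keys, mirroring what snakebite's rename yields
    instead of raising an exception itself."""
    for result in results:
        if not result.get('result'):
            raise HDFSCliError(
                'rename of %s failed: %s'
                % (result.get('path'), result.get('error', 'unknown error')))


# Simulated snakebite output: the second move failed silently.
moves = [
    {'path': '/tmp/a', 'result': True},
    {'path': '/tmp/b', 'result': False, 'error': 'destination exists'},
]
try:
    check_rename_results(moves)
    raised = False
except HDFSCliError:
    raised = True
```

With a check like this, a rename that snakebite quietly reports as failed surfaces as the same kind of loud error the hadoopcli client would produce.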
For atomic directory writes, this also adds a check for whether the directory was renamed to the target or moved inside an existing target directory.
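The directory case matters because HDFS `mv` semantics move the source *inside* a destination directory that already exists, rather than replacing it. A hedged sketch of the post-rename check, with an `exists` predicate standing in for the real filesystem client (all names are illustrative):

```python
import posixpath


def rename_landed_at_target(exists, tmp_dir, target_dir):
    """Check whether renaming tmp_dir to target_dir actually produced
    target_dir, or whether tmp_dir was swallowed *inside* an existing
    target_dir (HDFS mv semantics). `exists` is any path -> bool
    predicate; this is a sketch, not luigi's actual API."""
    moved_inside = posixpath.join(target_dir, posixpath.basename(tmp_dir))
    return exists(target_dir) and not exists(moved_inside)


# Simulated filesystem after a rename that was swallowed into an
# already-existing target directory:
paths = {'/data/out', '/data/out/out-tmp-123'}
ok = rename_landed_at_target(paths.__contains__, '/data/out-tmp-123', '/data/out')
# `ok` is False here: the tmp dir ended up nested inside the target.
```

When the target did not pre-exist, the nested path is absent and the check passes, so the extra `exists` calls are the overhead trade-off discussed above.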
Motivation and Context
As described in #2118, I was having an issue where jobs that seemed to be running correctly were not resulting in updated files. This resolves #2118.
Have you tested this? If so, how?
I added minicluster unit tests. I have also tested the part of this that removes the target if it exists for the non-directory pipe. The demonstration code works as expected with that added. I added the other fixes after coming up with unit tests.