-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gcs poc multi #52
base: develop
Are you sure you want to change the base?
Gcs poc multi #52
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest challenge with this approach is to test all possible failure cases. I am wondering if we should look on how spark itself does it and make a copy of all test cases
Path tempPath = new Path(outputPath, FileOutputCommitter.PENDING_DIR_NAME); | ||
FileSystem fs = tempPath.getFileSystem(context.getConfiguration()); | ||
String taskId = context.getTaskAttemptID().getTaskID().toString(); | ||
Path taskPartitionFile = new Path(tempPath, taskId + "_partitions.txt"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's a task retry we should overwrite it
if (!fs.exists(taskPartitionFile)) { | ||
fs.createNewFile(taskPartitionFile); | ||
} | ||
DataOutputStream out = fs.append(taskPartitionFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use try with resources to ensure stream is always closed
DataInputStream in = new DataInputStream(new BufferedInputStream(dis)); | ||
BufferedReader br = new BufferedReader(new java.io.InputStreamReader(in)); | ||
String line; | ||
while ((line = br.readLine()) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should never be more than 1 line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can have multiple lines if that split has records for multiple table names. E.g. We have country as tableName, and split has records for both US and India, it will write path upto country/suffix into this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one task can be responsible for processing multiple partitions, hence there can be multiple lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got you. Well, in this case we need to ensure file is removed on the beginning of attempt (in case of any attempt retries)
WIP
We create a _temp dir on the root path and store all the paths,
At the final driver commit we take the set on all the paths and do one commit on each path !