continue hadoop job for sparse intervals #2098
Conversation
Capture<String> pathCapture = Capture.newInstance(CaptureType.ALL);
Job job = Job.getInstance();

PowerMock.mockStatic(FSSpideringIterator.class);
is it not possible to set up the appropriate files in some temporary space on disk to reproduce the bug, instead of adding PowerMock?
Yes, that should work too - do you prefer that approach?
Yes, this test case does not look like a strong enough reason to add another big testing framework.
Thanks for the feedback @himanshug, I removed PowerMock and created local temp files for the unit test.
👍
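A rough sketch of what that temp-file setup could look like (a minimal illustration, not the actual test from this PR; the class name, file names, and the commented assertion are hypothetical):

```java
import java.io.File;

import com.google.common.io.Files;
import org.junit.Assert;
import org.junit.Test;

public class SparseIntervalLayoutTest
{
  @Test
  public void testSparseHourlyLayout() throws Exception
  {
    // Create only some of the hourly directories under a temp dir so the
    // missing-hour case is reproduced on a real filesystem, with no mocking.
    File baseDir = Files.createTempDir();
    for (String hour : new String[]{"00", "01", "03", "04"}) {
      File dir = new File(baseDir, "dataSource/y=2015/m=11/d=10/H=" + hour);
      Assert.assertTrue(dir.mkdirs());
      Assert.assertTrue(new File(dir, "test").createNewFile());
    }
    // ... point the granularity inputSpec at baseDir and assert that only
    // the hours that exist on disk are added as input paths ...
  }
}
```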
Job job = Job.getInstance();
String formatStr = "file:%s/%s;org.apache.hadoop.mapreduce.lib.input.TextInputFormat";

File baseDir = Files.createTempDir();
can you use the @Rule TemporaryFolder for this?
@himanshug done, thanks!
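For reference, swapping Guava's Files.createTempDir() for JUnit's TemporaryFolder rule looks roughly like this (a minimal sketch under those assumptions; the test class and method names are illustrative):

```java
import java.io.File;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class ExamplePathSpecTest
{
  // JUnit creates the folder before each test and deletes it afterwards,
  // so no manual cleanup is needed.
  @Rule
  public final TemporaryFolder temporaryFolder = new TemporaryFolder();

  @Test
  public void testWithTempFolder() throws Exception
  {
    // Replaces: File baseDir = Files.createTempDir();
    File baseDir = temporaryFolder.newFolder();
    // ... build the sparse directory layout under baseDir as before ...
  }
}
```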
@dclim can you rebase from master to get the fixes for all these transient failures?
continue hadoop job for sparse intervals
I messed up the previous PR so I recreated it. Thanks Gian for the comments - I agree that having FSSpideringIterator return an empty iterator is less brittle than trying to manage an exception.
Hadoop jobs would fail with a FileNotFoundException when a granularity inputSpec was used and there wasn't data for every slice of the interval. For example, if dataGranularity was hour and the interval was 2015-11-10T00:00Z/2015-11-11T00:00Z, the job would fail if the input data folder structure looked like this:
/dataSource/y=2015/m=11/d=10/H=00
/dataSource/y=2015/m=11/d=10/H=01
/dataSource/y=2015/m=11/d=10/H=03
/dataSource/y=2015/m=11/d=10/H=04
/dataSource/y=2015/m=11/d=10/H=08
/dataSource/y=2015/m=11/d=10/H=09
/dataSource/y=2015/m=11/d=10/H=15
/dataSource/y=2015/m=11/d=10/H=23
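(For context, the granularity inputSpec driving such a job would look roughly like the sketch below - the inputPath is hypothetical, and the pathFormat matches the layout above.)

```json
"inputSpec" : {
  "type" : "granularity",
  "dataGranularity" : "hour",
  "inputPath" : "hdfs://namenode/dataSource",
  "filePattern" : ".*",
  "pathFormat" : "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"
}
```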
These gaps in the hourly data caused segment creation to fail for sparse data that may not have an event every hour. We now ignore intervals with no data and allow the job to continue.
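A minimal sketch of the empty-iterator approach, assuming Hadoop's FileSystem API (this illustrates the idea, not the actual FSSpideringIterator implementation; the class and method names are hypothetical):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SparseIntervalFiles
{
  /**
   * Lists the files under dir, returning an empty iterator when the
   * directory does not exist (e.g. an hour with no events) instead of
   * letting the FileNotFoundException fail the whole Hadoop job.
   */
  public static Iterator<FileStatus> listOrEmpty(Configuration conf, Path dir) throws IOException
  {
    FileSystem fs = dir.getFileSystem(conf);
    try {
      return Arrays.asList(fs.listStatus(dir)).iterator();
    }
    catch (FileNotFoundException e) {
      // No data for this granularity bucket; skip it rather than fail.
      return Collections.emptyIterator();
    }
  }
}
```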