continue hadoop job for sparse intervals #2098
Conversation
Capture<String> pathCapture = Capture.newInstance(CaptureType.ALL);
Job job = Job.getInstance();

PowerMock.mockStatic(FSSpideringIterator.class);
is it not possible to set up the appropriate files in some temporary space on disk to reproduce the bug, instead of adding PowerMock?
Yes, that should work too - do you prefer that approach?
Yes, this test case does not look like a strong enough reason to add another big testing framework.
Thanks for the feedback @himanshug, I removed PowerMock and created local temp files for the unit test.
👍
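A rough sketch of what that temp-file setup could look like (a minimal illustration, not the actual test from this PR; the class name, file names, and the commented assertion are hypothetical):

```java
import java.io.File;

import com.google.common.io.Files;
import org.junit.Assert;
import org.junit.Test;

public class SparseIntervalLayoutTest
{
  @Test
  public void testSparseHourlyLayout() throws Exception
  {
    // Create only some of the hourly directories under a temp dir so the
    // missing-hour case is reproduced on a real filesystem, with no mocking.
    File baseDir = Files.createTempDir();
    for (String hour : new String[]{"00", "01", "03", "04"}) {
      File dir = new File(baseDir, "dataSource/y=2015/m=11/d=10/H=" + hour);
      Assert.assertTrue(dir.mkdirs());
      Assert.assertTrue(new File(dir, "test").createNewFile());
    }
    // ... point the granularity inputSpec at baseDir and assert that only
    // the hours that exist on disk are added as input paths ...
  }
}
```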
Job job = Job.getInstance();
String formatStr = "file:%s/%s;org.apache.hadoop.mapreduce.lib.input.TextInputFormat";

File baseDir = Files.createTempDir();
can you use the @Rule TemporaryFolder for this?
@himanshug done, thanks!
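For reference, swapping Guava's Files.createTempDir() for JUnit's TemporaryFolder rule looks roughly like this (a minimal sketch under those assumptions; the test class and method names are illustrative):

```java
import java.io.File;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class ExamplePathSpecTest
{
  // JUnit creates the folder before each test and deletes it afterwards,
  // so no manual cleanup is needed.
  @Rule
  public final TemporaryFolder temporaryFolder = new TemporaryFolder();

  @Test
  public void testWithTempFolder() throws Exception
  {
    // Replaces: File baseDir = Files.createTempDir();
    File baseDir = temporaryFolder.newFolder();
    // ... build the sparse directory layout under baseDir as before ...
  }
}
```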
@dclim can you rebase from master to get the fixes for all these transient failures?
continue hadoop job for sparse intervals
I messed up the previous PR so I recreated it. Thanks Gian for the comments - I agree that having FSSpideringIterator return an empty iterator is less brittle than trying to manage an exception.
Hadoop jobs would fail with a FileNotFoundException when a granularity inputSpec was used and there wasn't data for every slice of the interval. For example, if dataGranularity was hour and the interval was 2015-11-10T00:00Z/2015-11-11T00:00Z, the job would fail if the input data folder structure looked like this:
/dataSource/y=2015/m=11/d=10/H=00
/dataSource/y=2015/m=11/d=10/H=01
/dataSource/y=2015/m=11/d=10/H=03
/dataSource/y=2015/m=11/d=10/H=04
/dataSource/y=2015/m=11/d=10/H=08
/dataSource/y=2015/m=11/d=10/H=09
/dataSource/y=2015/m=11/d=10/H=15
/dataSource/y=2015/m=11/d=10/H=23
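(For context, the granularity inputSpec driving such a job would look roughly like the sketch below - the inputPath is hypothetical, and the pathFormat matches the layout above.)

```json
"inputSpec" : {
  "type" : "granularity",
  "dataGranularity" : "hour",
  "inputPath" : "hdfs://namenode/dataSource",
  "filePattern" : ".*",
  "pathFormat" : "'y'=yyyy/'m'=MM/'d'=dd/'H'=HH"
}
```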
These gaps in the hourly data caused segment creation to fail for sparse data that may not have an event every hour. We now ignore intervals with no data and allow the job to continue.
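A minimal sketch of the empty-iterator approach, assuming Hadoop's FileSystem API (this illustrates the idea, not the actual FSSpideringIterator implementation; the class and method names are hypothetical):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SparseIntervalFiles
{
  /**
   * Lists the files under dir, returning an empty iterator when the
   * directory does not exist (e.g. an hour with no events) instead of
   * letting the FileNotFoundException fail the whole Hadoop job.
   */
  public static Iterator<FileStatus> listOrEmpty(Configuration conf, Path dir) throws IOException
  {
    FileSystem fs = dir.getFileSystem(conf);
    try {
      return Arrays.asList(fs.listStatus(dir)).iterator();
    }
    catch (FileNotFoundException e) {
      // No data for this granularity bucket; skip it rather than fail.
      return Collections.emptyIterator();
    }
  }
}
```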