continue hadoop job for sparse intervals #2098

Merged
merged 1 commit from the sparse-granularity-fix branch into apache:master on Jan 7, 2016
Conversation

dclim (Contributor) commented Dec 16, 2015

I messed up the previous PR so I recreated it. Thanks, Gian, for the comments. I agree that having FSSpideringIterator return an empty iterator is less brittle than trying to manage an exception.
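The design decision above can be sketched in isolation. This is a hypothetical illustration (the class and method names are stand-ins, not Druid's actual FSSpideringIterator API): when a directory has no listing, the iterator yields nothing, so callers never need a try/catch around traversal.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a missing directory produces an empty
// iterator instead of throwing a FileNotFoundException, so the caller's
// loop simply runs zero times for that interval.
public class SpideringSketch
{
  public static Iterator<String> spider(Map<String, List<String>> fs, String dir)
  {
    List<String> entries = fs.get(dir);  // stand-in for a filesystem listing
    if (entries == null) {
      // Missing directory: yield nothing rather than throwing.
      return Collections.emptyIterator();
    }
    return entries.iterator();
  }
}
```

The advantage over exception handling is that every call site behaves uniformly; there is no second code path to keep correct.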

Hadoop jobs would fail with a FileNotFoundException when a granularity inputSpec was used and there wasn't data for each portion of the segment - for example, if dataGranularity was hour and the interval was 2015-11-10T00:00Z/2015-11-11T00:00Z, we would fail if our input data folder structure looked like this:

/dataSource/y=2015/m=11/d=10/H=00
/dataSource/y=2015/m=11/d=10/H=01
/dataSource/y=2015/m=11/d=10/H=03
/dataSource/y=2015/m=11/d=10/H=04
/dataSource/y=2015/m=11/d=10/H=08
/dataSource/y=2015/m=11/d=10/H=09
/dataSource/y=2015/m=11/d=10/H=15
/dataSource/y=2015/m=11/d=10/H=23

This caused segment creation to fail for sparse data that may not have an event every hour. We now ignore intervals with no data and allow the job to continue.
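The fix described above amounts to filtering the candidate hourly paths down to those that exist before handing them to the Hadoop job. A minimal sketch, with illustrative names (the set-membership check stands in for a real FileSystem.exists() call; this is not Druid's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: instead of failing on the first missing hour
// directory, keep only the candidate paths that actually exist and let
// the job run over the (possibly smaller) set.
public class SparseIntervalPaths
{
  public static List<String> existingPaths(List<String> candidatePaths, Set<String> filesystem)
  {
    List<String> present = new ArrayList<>();
    for (String path : candidatePaths) {
      if (filesystem.contains(path)) {  // stand-in for FileSystem.exists(path)
        present.add(path);
      }
    }
    return present;  // may be empty; the job simply gets fewer input paths
  }
}
```

With the directory layout listed above, the 24 hourly candidates for 2015-11-10 would be reduced to the 8 hours that have data, and the missing hours are skipped silently.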

Capture<String> pathCapture = Capture.newInstance(CaptureType.ALL);
Job job = Job.getInstance();

PowerMock.mockStatic(FSSpideringIterator.class);
himanshug (Contributor):
Is it not possible to set up the appropriate files in some temporary space on disk to reproduce the bug, instead of adding PowerMock?

dclim (Contributor, Author):
Yes, that should work too. Do you prefer that approach?

himanshug (Contributor):
Yes, this test case does not look like a strong enough reason to add another big testing framework.

dclim (Contributor, Author) commented Dec 17, 2015

Thanks for the feedback, @himanshug. I removed PowerMock and created local temp files for the unit test.

fjy (Contributor) commented Dec 18, 2015

👍

@fjy fjy closed this Dec 18, 2015
@fjy fjy reopened this Dec 18, 2015
Job job = Job.getInstance();
String formatStr = "file:%s/%s;org.apache.hadoop.mapreduce.lib.input.TextInputFormat";

File baseDir = Files.createTempDir();
himanshug (Contributor):
Can you use the @Rule TemporaryFolder for this?

dclim (Contributor, Author):
@himanshug done, thanks!
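For readers unfamiliar with the suggestion above, the JUnit 4 TemporaryFolder rule creates a fresh directory before each test and deletes it afterwards, replacing a manual Files.createTempDir() plus cleanup. A rough sketch of the pattern (the test class and folder names are illustrative, not the PR's actual test):

```java
import static org.junit.Assert.assertTrue;

import java.io.File;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

// Illustrative use of the @Rule TemporaryFolder pattern: JUnit manages
// the lifecycle of the directory, so the test needs no tearDown logic.
public class SparseDataSketchTest
{
  @Rule
  public final TemporaryFolder testFolder = new TemporaryFolder();

  @Test
  public void testSparseHourlyLayout() throws Exception
  {
    // Create only some hourly directories, mimicking sparse input data.
    File h00 = testFolder.newFolder("y=2015", "m=11", "d=10", "H=00");
    File h03 = testFolder.newFolder("y=2015", "m=11", "d=10", "H=03");
    assertTrue(h00.isDirectory());
    assertTrue(h03.isDirectory());
    // Directories for the missing hours are deliberately absent.
  }
}
```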

fjy (Contributor) commented Jan 4, 2016

@dclim can you rebase from master to get the fixes for all these transient failures?

himanshug added a commit that referenced this pull request Jan 7, 2016
continue hadoop job for sparse intervals
@himanshug himanshug merged commit 5ace91f into apache:master Jan 7, 2016
@dclim dclim deleted the sparse-granularity-fix branch January 7, 2016 18:27
@fjy fjy modified the milestone: 0.9.0 Feb 4, 2016
3 participants