Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DirectorySanner to properly handle first file being empty #931

Merged
merged 3 commits into from
May 21, 2019

Conversation

smgallo
Copy link
Contributor

@smgallo smgallo commented May 20, 2019

Description

The DirectoryScanner data endpoint now properly handles the case where the first file scanned is empty or otherwise contains no valid records. Previously, the first record (which is used to determine field names) was set even if the first file had no data and this caused an error.

Motivation and Context

Bugfix

Tests performed

Added new test, manual testing in docker.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added bug Bugfixes Category:ETL Extract Transform Load labels May 20, 2019
@smgallo smgallo added this to the 8.5.0 milestone May 20, 2019
@smgallo smgallo force-pushed the etl/handle-empty-data-file branch 5 times, most recently from be3d949 to 406e15e Compare May 20, 2019 17:19
@smgallo smgallo force-pushed the etl/handle-empty-data-file branch from 406e15e to 1ea2e62 Compare May 20, 2019 18:46
@@ -98,7 +98,7 @@ protected function _execute()
// If there are no records we can bail out. Otherwise we may not even be able to infer the
// source field names.

if ( 0 == $this->sourceEndpoint->count() ) {
if ( false === $firstRecord || 0 == $this->sourceEndpoint->count() ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be ===

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already ===

$options = sprintf('-c %s -a %s %s', $configFile, $action, $localOptions);

// Add a verbosity flag if the local options do not already contain one
if ( "" == $localOptions || false === strpos($localOptions, '-v') ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be ===

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "" == $localOptions? What does that add?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that it would fail if $localOptions was ever anything other than a string. Don't know if that's a concern or not though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have complete control over the tests and the default is a string if not specified I think we're fine with a ==.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

@smgallo smgallo requested review from plessbd, jtpalmer and ryanrath May 21, 2019 15:24
Copy link
Contributor

@ryanrath ryanrath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only question / comment I have would be whether or not we ever think we'll need to identify a particular record in either of the two new tables ( generic_cloud_raw_event or generic_cloud_raw_instance_type ). If so, it might be useful to have a generated unique id of some type added. Other then that, LGTM.

@smgallo
Copy link
Contributor Author

smgallo commented May 21, 2019

If we do need to identify a specific record, that would be handled by the ingestor. The DirectoryScanner is just an endpoint for reading the files.

@ryanrath
Copy link
Contributor

facepalm I may have missed that those are test artifacts... Sorry 'bout that. All good :)

@smgallo smgallo merged commit a41740a into ubccr:xdmod8.5 May 21, 2019
@smgallo smgallo deleted the etl/handle-empty-data-file branch May 21, 2019 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugfixes Category:ETL Extract Transform Load
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants