Modifications to support new StructuredFile ingestor #14

Merged
merged 10 commits into from
May 25, 2017

Conversation

smgallo
Contributor

@smgallo smgallo commented May 19, 2017

These changes support the new StructuredFile data endpoint. Note that this PR requires XDMoD PR ubccr/xdmod#145.

  • The StructuredFile endpoint is now an iterator; it no longer returns all data at once from the parse() method.
  • The StructuredFile endpoint now supports any external process as a filter, and the filter specification has been updated accordingly. A use case for multiple filters might be to run all grants through sed to collapse the various NIH agencies (e.g. NIH-NLM) into a single "NIH" agency using: sed -r 's/("agency": "NIH)(.*)",/\1",/'
  • Renamed the data endpoint configuration key array_element_schema_path to record_schema_path, since we no longer require the data to be an array.
  • The schemas for person, grant, and organization have been moved into the etl_schema.d directory.
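As a rough illustration of the first two points, here is a minimal Python sketch of an endpoint that pipes its input through a chain of external filter processes and then yields records one at a time. It is not the XDMoD PHP implementation: the names apply_filters and iter_records are hypothetical, and the "one JSON object per line" input format and the sample records are assumptions made only for this example.

```python
import json
import shutil
import subprocess
from typing import Iterator, List, Sequence


def apply_filters(raw: bytes, filters: Sequence[List[str]]) -> bytes:
    """Pipe raw bytes through each external command, in order."""
    for argv in filters:
        # Each filter reads stdin and writes stdout, like a shell pipeline stage.
        raw = subprocess.run(
            argv, input=raw, stdout=subprocess.PIPE, check=True
        ).stdout
    return raw


def iter_records(raw: bytes, filters: Sequence[List[str]] = ()) -> Iterator[dict]:
    """Yield one record at a time instead of returning a full list.

    Assumes one JSON object per line (an assumption for this sketch only).
    """
    for line in apply_filters(raw, filters).splitlines():
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample data; the ids are made up for illustration.
    sample = b'{"agency": "NIH-NLM", "id": 1}\n{"agency": "NSF", "id": 2}\n'
    # The sed expression quoted in the PR description.
    sed = ["sed", "-r", r's/("agency": "NIH)(.*)",/\1",/']
    filters = [sed] if shutil.which("sed") else []
    for record in iter_records(sample, filters):
        print(record)
```

Because iter_records is a generator, a consumer can process each record as it arrives, and adding a second filter is just a matter of appending another command to the list.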

Testing was performed by comparing a dump of the modw_value_analytics tables on va-demo to the newly ingested tables. The only differences are that the existing VA Demo data has no abbreviation for Indiana University, and that the newly ingested data collapses the individual NIH funding agencies into a single NIH agency.

$ ./etl_overseer.php -c ../../../etc/etl/etl.json -v debug -p value_analytics -o "experimental_enable_batch_aggregation=true"
...

$ ./verify_table_data.php -s modw_value_analytics_baseline -d modw_value_analytics_etltest \
  -t people -t people_groups -t people_organizations -t people_identifiers -t organizations \
  -t identity_providers -t groups -t grant_types -t grants_people -t grants -t grants_roles \
  -t funding_agencies -n 1

2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.people, dest=modw_value_analytics_etltest.people
2017-05-19 14:53:33 [notice] Identical
2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.people_groups, dest=modw_value_analytics_etltest.people_groups
2017-05-19 14:53:33 [notice] Identical
2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.people_organizations, dest=modw_value_analytics_etltest.people_organizations
2017-05-19 14:53:33 [notice] Identical
2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.people_identifiers, dest=modw_value_analytics_etltest.people_identifiers
2017-05-19 14:53:33 [notice] Identical
2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.organizations, dest=modw_value_analytics_etltest.organizations
2017-05-19 14:53:33 [warning] Missing 1 rows in modw_value_analytics_etltest.organizations
2017-05-19 14:53:33 [warning] Missing row: Array
(
    [id] => 1
    [name] => Indiana University
    [abbrev] => 
)

2017-05-19 14:53:33 [notice] Compare tables src=modw_value_analytics_baseline.identity_providers, dest=modw_value_analytics_etltest.identity_providers
2017-05-19 14:53:34 [notice] Identical
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.groups, dest=modw_value_analytics_etltest.groups
2017-05-19 14:53:34 [notice] Identical
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.grant_types, dest=modw_value_analytics_etltest.grant_types
2017-05-19 14:53:34 [notice] Identical
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.grants_people, dest=modw_value_analytics_etltest.grants_people
2017-05-19 14:53:34 [notice] Identical
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.grants, dest=modw_value_analytics_etltest.grants
2017-05-19 14:53:34 [notice] Identical
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.grants_roles, dest=modw_value_analytics_etltest.grants_roles
2017-05-19 14:53:34 [error] Table '`modw_value_analytics_baseline`.`grants_roles`' does not exist
2017-05-19 14:53:34 [error] Table '`modw_value_analytics_etltest`.`grants_roles`' does not exist
2017-05-19 14:53:34 [notice] Compare tables src=modw_value_analytics_baseline.funding_agencies, dest=modw_value_analytics_etltest.funding_agencies
2017-05-19 14:53:34 [warning] Missing 1 rows in modw_value_analytics_etltest.funding_agencies
2017-05-19 14:53:34 [warning] Missing row: Array
(
    [id] => 2216
    [name] => NIH-FOGARTY INTL CTR
)
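
The "Missing row" warnings above report rows that exist in the source table but have no exact match in the destination table. A minimal sketch of that kind of comparison, assuming rows are plain dictionaries, might look like the following. This is not how verify_table_data.php is implemented; missing_rows is a hypothetical name, and the "IU" abbreviation is a placeholder value chosen for the example.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def missing_rows(src: Iterable[Row], dest: Iterable[Row]) -> List[Row]:
    """Return rows present in src with no exact match (all columns) in dest."""
    # Sort items by column name so column order does not affect equality.
    dest_keys = {tuple(sorted(row.items())) for row in dest}
    return [row for row in src if tuple(sorted(row.items())) not in dest_keys]


if __name__ == "__main__":
    baseline = [{"id": 1, "name": "Indiana University", "abbrev": ""}]
    # "IU" is a placeholder abbreviation, not taken from the real data.
    etltest = [{"id": 1, "name": "Indiana University", "abbrev": "IU"}]
    for row in missing_rows(baseline, etltest):
        print("Missing row:", row)
```

A single changed column is enough for the whole row to be reported as missing, which is consistent with the Indiana University row differing only in its abbreviation.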

Contributor

@tyearke tyearke left a comment

Looks good! My only request is that the contents of specs/schemas and configuration/etl/etl_schemas.d/value_analytics be deduplicated and references to those directories be cleaned up. This could be accomplished either by deleting specs/schemas and removing the reference to it in build.json or by deleting configuration/etl/etl_schemas.d/value_analytics and updating build.json to use this directory for builds instead of the old etl_specs.d directory.

@smgallo
Contributor Author

smgallo commented May 25, 2017

@tyearke Requested changes completed. I also updated the ETL config file to define filters in an array, bringing it in line with the changes in ubccr/xdmod#145, and re-ran the tests.

Contributor

@tyearke tyearke left a comment

specs/schemas just needs to be removed from include_paths in build.json and it should be good to go.

Contributor

@tyearke tyearke left a comment

Looks good! Also, thanks for fixing the use of MySQL type aliases for some of the columns - I didn't realize using those would cause table alterations on every run.

@smgallo
Contributor Author

smgallo commented May 25, 2017

I have it on my list to add better support for types like boolean and integer, and for other places where MySQL does some internal normalization. That will make it more intuitive for folks to use.
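
To illustrate the kind of normalization being discussed: MySQL stores BOOL and BOOLEAN as tinyint(1) and INTEGER as int, so a naive string comparison between a configured column type and the type MySQL reports will always see a difference. The sketch below shows one way such aliases could be normalized before comparing. It is a hedged illustration only, not the XDMoD implementation; MYSQL_TYPE_ALIASES, normalize_type, and needs_alter are hypothetical names, and the alias table is deliberately incomplete.

```python
import re

# A few aliases that MySQL rewrites to a canonical type name (not exhaustive).
MYSQL_TYPE_ALIASES = {
    "bool": "tinyint(1)",
    "boolean": "tinyint(1)",
    "integer": "int",
    "dec": "decimal",
    "numeric": "decimal",
}


def normalize_type(declared: str) -> str:
    """Lower-case a column type and replace a leading alias with its canonical name."""
    text = declared.strip().lower()
    match = re.match(r"[a-z]+", text)
    if not match:
        return text
    base = match.group(0)
    # Keep whatever follows the type name, e.g. "(11) unsigned".
    return MYSQL_TYPE_ALIASES.get(base, base) + text[match.end():]


def needs_alter(declared: str, reported: str) -> bool:
    """True only if the types still differ after alias normalization."""
    return normalize_type(declared) != normalize_type(reported)


if __name__ == "__main__":
    print(needs_alter("boolean", "tinyint(1)"))
    print(needs_alter("varchar(32)", "varchar(64)"))
```

Comparing normalized forms means a column declared with an alias is left alone, while a genuine change such as a different varchar length is still detected.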

@smgallo smgallo merged commit c01882e into ubccr:master May 25, 2017
@smgallo smgallo deleted the structured-file-mods branch May 25, 2017 15:15
@tyearke tyearke added this to the v7.0.0 milestone Sep 15, 2017