
fix: Compatibility changes to the gremlin integration #260

Merged
merged 3 commits into amundsen-io:master on Feb 26, 2021

Conversation

AndrewCiambrone (Contributor)

Signed-off-by: Andrew Ciambrone [email protected]

Summary of Changes

Made changes to the Neptune proxy so that the s3_bucket can be configured via the configuration settings. Previously you had to modify the code to do that.

Added a Configuration for Neptune.

Made shards optional in Neptune.

Minor changes to the queries to be more compatible with the data builder.

Tests

Made changes to the testing framework to be more compatible.

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

Checklist

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

Signed-off-by: Andrew Ciambrone <[email protected]>


# The databuilder expects this to be False currently. We are defaulting to true because the testing expects this
if bool(distutils.util.strtobool(os.environ.get('IGNORE_NEPTUNE_SHARD', 'False'))):
AndrewCiambrone (Contributor Author) commented Feb 22, 2021

This needs to be at the base layer instead of the config because once get_shard is called the shard_set_explicitly no longer works.

shard_set_explicitly('')
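As an aside on the IGNORE_NEPTUNE_SHARD gate in the diff above: distutils.util.strtobool is deprecated (and distutils itself was removed in Python 3.12), so a self-contained equivalent of that env-var parse looks roughly like this. This is a minimal sketch mirroring strtobool's accepted spellings, not the PR's actual code:

```python
import os

# Truthy/falsy spellings accepted by distutils.util.strtobool,
# reproduced here because distutils is deprecated (gone in 3.12).
_TRUTHY = {'y', 'yes', 't', 'true', 'on', '1'}
_FALSY = {'n', 'no', 'f', 'false', 'off', '0'}

def env_flag(name: str, default: str = 'False') -> bool:
    """Parse an environment variable as a boolean flag."""
    value = os.environ.get(name, default).strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    raise ValueError(f'invalid truth value {value!r}')
```

With the 'False' default, the flag is off unless the variable is explicitly set to a truthy spelling, matching the snippet in the diff.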


class NeptuneConfig(LocalGremlinConfig, Config):
Contributor Author

I can move this to documentation if you all wish.

Contributor

No strong feelings either way on my end, but certainly we could inherit the things unrelated to neptune/gremlin setup at least.

Member

we could leave it here as a reference for people who use neptune.

users_by_type['owner'] = sorted(
_safe_get_list(result, f'all_owners', transform=self._convert_to_user) or [],
key=attrgetter('user_id'))
users_by_type['owner'] = _safe_get_list(result, f'all_owners', transform=self._convert_to_user) or []
Contributor Author

user_id was not always defined. I can switch the sort to another attribute if you all prefer.
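One defensive alternative to dropping the sort entirely is to fall back to another attribute when user_id is absent. A minimal sketch (the User shape and the email fallback key here are illustrative assumptions, not the amundsen common model):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class User:
    # Illustrative stand-in for the owner records in the query result.
    email: str
    user_id: Optional[str] = None

def sort_owners(owners: List[User]) -> List[User]:
    # Sort primarily on user_id, falling back to email when user_id is
    # missing, so records without a user_id don't raise TypeError
    # (None is not orderable against str in Python 3).
    return sorted(owners, key=lambda u: u.user_id or u.email)
```

Whether that fallback is meaningful depends on which attributes the data builder always populates.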

@@ -1087,18 +1087,18 @@ def _get_table_itself(self, *, table_uri: str) -> Mapping[str, Any]:
hasLabel(VertexTypes.Application.value.label).fold()).as_('application')
g = g.coalesce(select('table').outE(EdgeTypes.LastUpdatedAt.value.label).inV().
hasLabel(VertexTypes.Updatedtimestamp.value.label).
values('latest_timestamp').fold()).as_('timestamp')
values('timestamp').fold()).as_('timestamp')

g = g.coalesce(select('table').outE(EdgeTypes.Description.value.label).
inV().has(VertexTypes.Description.value.label, 'source', without('user')).fold()). \
as_('programmatic_descriptions')
inV().hasLabel(VertexTypes.Description.value.label).fold()).as_('description')
Contributor Author

This was filtering on source originally but for some reason it didn't work as expected. Made it work on the descriptions labels instead.

Contributor

Oh, it looks like the databuilder integration is creating things a bit differently? In our Neptune, all Descriptions are Description type nodes and we just distinguish user descriptions from programmatic descriptions by source attribute.

I think it's alright to do it this way if you prefer, but we should definitely add Programmatic_Description to VertexTypes so it's not a magic string.

@@ -1151,7 +1152,7 @@ def _get_table_columns(self, *, table_uri: str) -> List[Column]:
col = Column(name=_safe_get(result, 'column', 'name'),
key=_safe_get(result, 'column', self.key_property_name),
description=_safe_get(result, 'description', 'description'),
col_type=_safe_get(result, 'column', 'col_type'),
col_type=_safe_get(result, 'column', 'type'),
Member

we could actually change databuilder to make it compatible. In fact, we should consider how we could build the model based on the common repo. cc @allisonsuarez

@@ -1449,7 +1451,7 @@ def get_latest_updated_ts(self) -> int:
"""

results = _V(g=self.g, label=VertexTypes.Updatedtimestamp,
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestamp').toList()
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestmap').toList()
Contributor

Omg, can we fix this typo in databuilder instead?

Member

agree, let's fix in databuilder for the typo.

Contributor Author

Haha fixed!


def _convert_to_description(self, result: Mapping[str, Any]) -> ProgrammaticDescription:
return ProgrammaticDescription(text=_safe_get(result, 'description'),
source=_safe_get(result, 'source'))
source=_safe_get(result, 'description_source'))
Member

I think description_source has been defined in databuilder for programmatic description. I would prefer to change it here to match the databuilder implementation.

Contributor

I think source is more concise so I'd prefer that in a model with description already in the name, but at the very least common and databuilder should be in agreement about the field names!

Member

could we have a follow up todo to change the common repo to make it consistent?
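For readers following this field-name discussion: the _safe_get helper used throughout these diffs does tolerant nested lookups on the query result. Here is an illustrative stand-in for it (the real helper in the proxy also unwraps gremlin result structures and supports a transform parameter, both omitted here), using the description_source key from the diff above:

```python
from typing import Any, Mapping, Optional

def safe_get(mapping: Mapping[str, Any], *keys: str) -> Optional[Any]:
    """Walk nested keys, returning None instead of raising KeyError.

    Illustrative stand-in for the proxy's _safe_get helper.
    """
    current: Any = mapping
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current
```

The rename in the diff matters precisely because a missing key here silently yields None rather than failing loudly.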

@@ -1689,7 +1692,7 @@ def _convert_to_source(self, result: Mapping[str, Any]) -> Source:

def _convert_to_statistics(self, result: Mapping[str, Any]) -> Stat:
return Stat(
stat_type=_safe_get(result, 'stat_type'),
stat_type=_safe_get(result, 'stat_name'),
Member

this seems to be an issue for amundsendatabuilder. cc @allisonsuarez

AndrewCiambrone (Contributor Author) commented Feb 23, 2021

Let me know how you want to handle this @allisonsuarez. I don't mind making the changes in both databuilder and metadata.

Member

let's change the databuilder to make it consistent.

@@ -57,7 +57,7 @@ class NeptuneGremlinProxy(AbstractGremlinProxy):
def __init__(self, *, host: str, port: Optional[int] = None, user: str = None,
password: Optional[Union[str, boto3.session.Session]] = None,
driver_remote_connection_options: Mapping[str, Any] = {},
neptune_bulk_loader_s3_bucket_name: Optional[str] = None,
Contributor Author

There was no way to define this through configuration. Decided to reuse the existing client_kwargs field.

Contributor

Oh, it's a little magical, but if you do as follows:

options = current_app.config[config.PROXY_CLIENT_KWARGS] or {}
_proxy_client = NeptuneGremlinProxy(host=host, port=port, user=user, password=password, **options)

Python will pick up any keys from the PROXY_CLIENT_KWARGS dict and use them as keyword arguments. That's how neptune_bulk_loader_s3_bucket_name would be populated here.
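The **options expansion described here can be sketched in isolation like this. FakeProxy is a stand-in for NeptuneGremlinProxy (only the bulk-loader parameter from the diff is modeled; host, port, and bucket values are made up):

```python
from typing import Optional

class FakeProxy:
    # Stand-in for NeptuneGremlinProxy's keyword-only constructor.
    def __init__(self, *, host: str, port: int = 8182,
                 neptune_bulk_loader_s3_bucket_name: Optional[str] = None) -> None:
        self.host = host
        self.port = port
        self.bucket_name = neptune_bulk_loader_s3_bucket_name

# Stands in for current_app.config[config.PROXY_CLIENT_KWARGS] or {}.
options = {'neptune_bulk_loader_s3_bucket_name': 'my-bulk-loader-bucket'}

# ** unpacks the dict's keys as keyword arguments, so any key matching
# a constructor parameter name is passed through without code changes.
proxy = FakeProxy(host='neptune.example.com', **options)
```

The trade-off, as noted below, is that the existing proxies don't use this pattern, so conforming to the current interface is less disruptive.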

Contributor

Oh, I see, the existing proxies don't use this (we've diverged a bit). It's probably least disruptive to just conform to the existing interface then, sure!

@feng-tao (Member)

cc @friendtocephalopods could you take a look? thanks

@friendtocephalopods (Contributor) left a comment

Really nice catches here, Andrew! I think it would be great if we could tweak the databuilder field names to match the amundsen common models!

Ideally, also revert the client_kwargs stuff and just pass in the PROXY_CLIENT_KWARGS?

Signed-off-by: Andrew Ciambrone <[email protected]>
@codecov-io

codecov-io commented Feb 23, 2021

Codecov Report

Merging #260 (e334e66) into master (2752492) will increase coverage by 3.51%.
The diff coverage is 72.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #260      +/-   ##
==========================================
+ Coverage   74.10%   77.61%   +3.51%     
==========================================
  Files          25       27       +2     
  Lines        1255     1358     +103     
  Branches      136      162      +26     
==========================================
+ Hits          930     1054     +124     
+ Misses        297      256      -41     
- Partials       28       48      +20     
Impacted Files Coverage Δ
metadata_service/api/__init__.py 82.60% <ø> (ø)
metadata_service/api/column.py 100.00% <ø> (ø)
metadata_service/api/popular_tables.py 100.00% <ø> (ø)
metadata_service/api/system.py 66.66% <ø> (ø)
metadata_service/api/user.py 100.00% <ø> (ø)
metadata_service/proxy/statsd_utilities.py 81.25% <ø> (ø)
metadata_service/util.py 100.00% <ø> (ø)
metadata_service/proxy/shared.py 28.57% <28.57%> (ø)
metadata_service/api/badge.py 61.29% <61.29%> (ø)
metadata_service/proxy/neo4j_proxy.py 71.65% <61.84%> (-3.35%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1880cec...e334e66. Read the comment docs.

@AndrewCiambrone (Contributor Author)

I think I have resolved all differences pointed out between the databuilder and the common models. @feng-tao let me know if there is anything I missed.

@feng-tao (Member)

will take a look

@feng-tao (Member)

@AndrewCiambrone going to merge this PR now. Do you plan to update the databuilder PR with the name changes discussed here?

@feng-tao feng-tao merged commit a765424 into amundsen-io:master Feb 26, 2021
@AndrewCiambrone (Contributor Author)

@feng-tao Pretty sure all the name changes should be set in amundsen-io/amundsendatabuilder#445

4 participants