-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Compatibility changes to the gremlin integration #260
fix: Compatibility changes to the gremlin integration #260
Conversation
Signed-off-by: Andrew Ciambrone <[email protected]>
|
||
|
||
# The databuilder expects this to be False currently. We are defaulting to true because the testing expects this | ||
if bool(distutils.util.strtobool(os.environ.get('IGNORE_NEPTUNE_SHARD', 'False'))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be at the base layer instead of the config because once get_shard is called the shard_set_explicitly no longer works.
shard_set_explicitly('') | ||
|
||
|
||
class NeptuneConfig(LocalGremlinConfig, Config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move this to documentation if you all wish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong feelings either way on my end, but certainly we could inherit the things unrelated to neptune/gremlin setup at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could leave it here to give people who use neptune for reference.
users_by_type['owner'] = sorted( | ||
_safe_get_list(result, f'all_owners', transform=self._convert_to_user) or [], | ||
key=attrgetter('user_id')) | ||
users_by_type['owner'] = _safe_get_list(result, f'all_owners', transform=self._convert_to_user) or [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User_id was not always defined. I can switch the sort to be on another attribute if you all prefer.
@@ -1087,18 +1087,18 @@ def _get_table_itself(self, *, table_uri: str) -> Mapping[str, Any]: | |||
hasLabel(VertexTypes.Application.value.label).fold()).as_('application') | |||
g = g.coalesce(select('table').outE(EdgeTypes.LastUpdatedAt.value.label).inV(). | |||
hasLabel(VertexTypes.Updatedtimestamp.value.label). | |||
values('latest_timestamp').fold()).as_('timestamp') | |||
values('timestamp').fold()).as_('timestamp') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g = g.coalesce(select('table').outE(EdgeTypes.Description.value.label). | ||
inV().has(VertexTypes.Description.value.label, 'source', without('user')).fold()). \ | ||
as_('programmatic_descriptions') | ||
inV().hasLabel(VertexTypes.Description.value.label).fold()).as_('description') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was filtering on source originally but for some reason it didn't work as expected. Made it work on the descriptions labels instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it looks like the databuilder integration is creating things a bit differently? In our Neptune, all Descriptions are Description
type nodes and we just distinguish user descriptions from programmatic descriptions by source
attribute.
I think it's alright to do it this way if you prefer, but we should definitely add Programmatic_Description
to VertexTypes so it's not a magic string.
@@ -1151,7 +1152,7 @@ def _get_table_columns(self, *, table_uri: str) -> List[Column]: | |||
col = Column(name=_safe_get(result, 'column', 'name'), | |||
key=_safe_get(result, 'column', self.key_property_name), | |||
description=_safe_get(result, 'description', 'description'), | |||
col_type=_safe_get(result, 'column', 'col_type'), | |||
col_type=_safe_get(result, 'column', 'type'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bug in the databuilder interface: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could change it databuilder actually to make it compatible. In fact, we should consider how we could build the model based on common repo. cc @allisonsuarez
@@ -1449,7 +1451,7 @@ def get_latest_updated_ts(self) -> int: | |||
""" | |||
|
|||
results = _V(g=self.g, label=VertexTypes.Updatedtimestamp, | |||
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestamp').toList() | |||
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestmap').toList() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omg, can we fix this typo in databuilder instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, let's fix in databuilder for the typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha fixed!
|
||
def _convert_to_description(self, result: Mapping[str, Any]) -> ProgrammaticDescription: | ||
return ProgrammaticDescription(text=_safe_get(result, 'description'), | ||
source=_safe_get(result, 'source')) | ||
source=_safe_get(result, 'description_source')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I'd claim this is a bug with the databuilder impl again: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L139
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think description_source
has been defined in databuilder for programmatic description. I would prefer to change it here to match the databuilder implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think source
is more concise so I'd prefer that in a model with description already in the name, but at the very least common and databuilder should be in agreement about the field names!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have a follow up todo to change the common repo to make it consistent?
@@ -1689,7 +1692,7 @@ def _convert_to_source(self, result: Mapping[str, Any]) -> Source: | |||
|
|||
def _convert_to_statistics(self, result: Mapping[str, Any]) -> Stat: | |||
return Stat( | |||
stat_type=_safe_get(result, 'stat_type'), | |||
stat_type=_safe_get(result, 'stat_name'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again a mismatch between amundsen common and databuilder: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to be an issue for amundsendatabuilder. cc @allisonsuarez
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know how you want to handle this @allisonsuarez. I don't mind making the changes in both databuilder and metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change the databuilder to make it consistent.
@@ -57,7 +57,7 @@ class NeptuneGremlinProxy(AbstractGremlinProxy): | |||
def __init__(self, *, host: str, port: Optional[int] = None, user: str = None, | |||
password: Optional[Union[str, boto3.session.Session]] = None, | |||
driver_remote_connection_options: Mapping[str, Any] = {}, | |||
neptune_bulk_loader_s3_bucket_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no way to define this thru configuration. Decided to reuse the existing client_kwargs field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's a little magical, but if you do like follows:
options = current_app.config[config.PROXY_CLIENT_KWARGS] or {}
_proxy_client = NeptuneGremlinProxy(host=host, port=port, user=user, password=password, **options)
python should pick up any keys from the PROXY_CLIENT_KWARGS dict and use them as keyword args. So that's how neptune_bulk_loader_s3_bucket_name
would be populated here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you suggesting I make those changes here: https://github.com/amundsen-io/amundsenmetadatalibrary/blob/master/metadata_service/proxy/__init__.py#L41?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, the existing proxies don't use this (we've diverged a bit). It's probably least disruptive to just conform to the existing interface then, sure!
cc @friendtocephalopods could you take a look? thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice catches here Andrew! I think it would be great if we could tweak up the databuillder field names to match the amundsen common models!
Ideally, also revert the client_kwargs
stuff and just pass in the PROXY_CLIENT_KWARGS
?
shard_set_explicitly('') | ||
|
||
|
||
class NeptuneConfig(LocalGremlinConfig, Config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong feelings either way on my end, but certainly we could inherit the things unrelated to neptune/gremlin setup at least.
g = g.coalesce(select('table').outE(EdgeTypes.Description.value.label). | ||
inV().has(VertexTypes.Description.value.label, 'source', without('user')).fold()). \ | ||
as_('programmatic_descriptions') | ||
inV().hasLabel(VertexTypes.Description.value.label).fold()).as_('description') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it looks like the databuilder integration is creating things a bit differently? In our Neptune, all Descriptions are Description
type nodes and we just distinguish user descriptions from programmatic descriptions by source
attribute.
I think it's alright to do it this way if you prefer, but we should definitely add Programmatic_Description
to VertexTypes so it's not a magic string.
@@ -1449,7 +1451,7 @@ def get_latest_updated_ts(self) -> int: | |||
""" | |||
|
|||
results = _V(g=self.g, label=VertexTypes.Updatedtimestamp, | |||
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestamp').toList() | |||
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestmap').toList() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omg, can we fix this typo in databuilder instead?
@@ -1151,7 +1152,7 @@ def _get_table_columns(self, *, table_uri: str) -> List[Column]: | |||
col = Column(name=_safe_get(result, 'column', 'name'), | |||
key=_safe_get(result, 'column', self.key_property_name), | |||
description=_safe_get(result, 'description', 'description'), | |||
col_type=_safe_get(result, 'column', 'col_type'), | |||
col_type=_safe_get(result, 'column', 'type'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bug in the databuilder interface: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L81
|
||
def _convert_to_description(self, result: Mapping[str, Any]) -> ProgrammaticDescription: | ||
return ProgrammaticDescription(text=_safe_get(result, 'description'), | ||
source=_safe_get(result, 'source')) | ||
source=_safe_get(result, 'description_source')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I'd claim this is a bug with the databuilder impl again: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L139
@@ -57,7 +57,7 @@ class NeptuneGremlinProxy(AbstractGremlinProxy): | |||
def __init__(self, *, host: str, port: Optional[int] = None, user: str = None, | |||
password: Optional[Union[str, boto3.session.Session]] = None, | |||
driver_remote_connection_options: Mapping[str, Any] = {}, | |||
neptune_bulk_loader_s3_bucket_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's a little magical, but if you do like follows:
options = current_app.config[config.PROXY_CLIENT_KWARGS] or {}
_proxy_client = NeptuneGremlinProxy(host=host, port=port, user=user, password=password, **options)
python should pick up any keys from the PROXY_CLIENT_KWARGS dict and use them as keyword args. So that's how neptune_bulk_loader_s3_bucket_name
would be populated here.
@@ -1689,7 +1692,7 @@ def _convert_to_source(self, result: Mapping[str, Any]) -> Source: | |||
|
|||
def _convert_to_statistics(self, result: Mapping[str, Any]) -> Stat: | |||
return Stat( | |||
stat_type=_safe_get(result, 'stat_type'), | |||
stat_type=_safe_get(result, 'stat_name'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again a mismatch between amundsen common and databuilder: https://github.com/amundsen-io/amundsencommon/blob/master/amundsen_common/models/table.py#L64
Signed-off-by: Andrew Ciambrone <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #260 +/- ##
==========================================
+ Coverage 74.10% 77.61% +3.51%
==========================================
Files 25 27 +2
Lines 1255 1358 +103
Branches 136 162 +26
==========================================
+ Hits 930 1054 +124
+ Misses 297 256 -41
- Partials 28 48 +20
Continue to review full report at Codecov.
|
Signed-off-by: Andrew Ciambrone <[email protected]>
I think I have resolved all differences pointed out between the databuilder and the common models. @feng-tao let me know if there is anything I missed. |
will take a look |
@AndrewCiambrone going to merge this pr now. Do you need to plan to update the databuilder pr with some name changed discussed here? |
@feng-tao Pretty sure all the name changes should be set in amundsen-io/amundsendatabuilder#445 |
Signed-off-by: Andrew Ciambrone [email protected]
Summary of Changes
Made changes to the Neptune proxy so that you can configure the s3_bucket via the configuration settings. Before you had to made changes to the code to do that.
Added a Configuration for Neptune.
Made shards optional in Neptune.
Minor changes to the queries to be more compatible with the data builder.
Tests
Made changes to the testing framework to be more compatible.
Documentation
What documentation did you add or modify and why? Add any relevant links then remove this line
CheckList
Make sure you have checked all steps below to ensure a timely review.
make test