
fix: Compatibility changes to the gremlin integration #260

Merged
merged 3 commits into amundsen-io:master on Feb 26, 2021

Conversation

AndrewCiambrone (Contributor)

Signed-off-by: Andrew Ciambrone [email protected]

Summary of Changes

Made changes to the Neptune proxy so that the s3_bucket can be configured via the configuration settings. Previously you had to modify the code to do that.

Added a Configuration for Neptune.

Made shards optional in Neptune.

Minor changes to the queries to be more compatible with the data builder.

Tests

Made changes to the testing framework to be more compatible.

Documentation

What documentation did you add or modify and why? Add any relevant links then remove this line

Checklist

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

Signed-off-by: Andrew Ciambrone <[email protected]>


# The databuilder expects this to be False currently. We are defaulting to true because the testing expects this
if bool(distutils.util.strtobool(os.environ.get('IGNORE_NEPTUNE_SHARD', 'False'))):
AndrewCiambrone (Contributor Author) commented Feb 22, 2021

This needs to be at the base layer instead of the config because once get_shard is called the shard_set_explicitly no longer works.

shard_set_explicitly('')
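As an aside on the IGNORE_NEPTUNE_SHARD gate in the diff above: distutils.util.strtobool is deprecated (and distutils itself was removed in Python 3.12), so a self-contained equivalent of that env-var parse looks roughly like this. This is a minimal sketch mirroring strtobool's accepted spellings, not the PR's actual code:

```python
import os

# Truthy/falsy spellings accepted by distutils.util.strtobool,
# reproduced here because distutils is deprecated (gone in 3.12).
_TRUTHY = {'y', 'yes', 't', 'true', 'on', '1'}
_FALSY = {'n', 'no', 'f', 'false', 'off', '0'}

def env_flag(name: str, default: str = 'False') -> bool:
    """Parse an environment variable as a boolean flag."""
    value = os.environ.get(name, default).strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    raise ValueError(f'invalid truth value {value!r}')
```

With the 'False' default, the flag is off unless the variable is explicitly set to a truthy spelling, matching the snippet in the diff.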


class NeptuneConfig(LocalGremlinConfig, Config):
Contributor Author

I can move this to documentation if you all wish.

Contributor

No strong feelings either way on my end, but certainly we could inherit the things unrelated to neptune/gremlin setup at least.

Member

we could leave it here as a reference for people who use neptune.

users_by_type['owner'] = sorted(
_safe_get_list(result, f'all_owners', transform=self._convert_to_user) or [],
key=attrgetter('user_id'))
users_by_type['owner'] = _safe_get_list(result, f'all_owners', transform=self._convert_to_user) or []
Contributor Author

user_id was not always defined. I can switch the sort to another attribute if you all prefer.
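One defensive alternative to dropping the sort entirely is to fall back to another attribute when user_id is absent. A minimal sketch (the User shape and the email fallback key here are illustrative assumptions, not the amundsen common model):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class User:
    # Illustrative stand-in for the owner records in the query result.
    email: str
    user_id: Optional[str] = None

def sort_owners(owners: List[User]) -> List[User]:
    # Sort primarily on user_id, falling back to email when user_id is
    # missing, so records without a user_id don't raise TypeError
    # (None is not orderable against str in Python 3).
    return sorted(owners, key=lambda u: u.user_id or u.email)
```

Whether that fallback is meaningful depends on which attributes the data builder always populates.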

@@ -1087,18 +1087,18 @@ def _get_table_itself(self, *, table_uri: str) -> Mapping[str, Any]:
hasLabel(VertexTypes.Application.value.label).fold()).as_('application')
g = g.coalesce(select('table').outE(EdgeTypes.LastUpdatedAt.value.label).inV().
hasLabel(VertexTypes.Updatedtimestamp.value.label).
values('latest_timestamp').fold()).as_('timestamp')
values('timestamp').fold()).as_('timestamp')

g = g.coalesce(select('table').outE(EdgeTypes.Description.value.label).
inV().has(VertexTypes.Description.value.label, 'source', without('user')).fold()). \
as_('programmatic_descriptions')
inV().hasLabel(VertexTypes.Description.value.label).fold()).as_('description')
Contributor Author

This was filtering on source originally but for some reason it didn't work as expected. Made it work on the descriptions labels instead.

Contributor

Oh, it looks like the databuilder integration is creating things a bit differently? In our Neptune, all Descriptions are Description type nodes and we just distinguish user descriptions from programmatic descriptions by source attribute.

I think it's alright to do it this way if you prefer, but we should definitely add Programmatic_Description to VertexTypes so it's not a magic string.

@@ -1151,7 +1152,7 @@ def _get_table_columns(self, *, table_uri: str) -> List[Column]:
col = Column(name=_safe_get(result, 'column', 'name'),
key=_safe_get(result, 'column', self.key_property_name),
description=_safe_get(result, 'description', 'description'),
col_type=_safe_get(result, 'column', 'col_type'),
col_type=_safe_get(result, 'column', 'type'),
Member

we could actually change databuilder to make it compatible. In fact, we should consider how we could build the model based on the common repo. cc @allisonsuarez

@@ -1449,7 +1451,7 @@ def get_latest_updated_ts(self) -> int:
"""

results = _V(g=self.g, label=VertexTypes.Updatedtimestamp,
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestamp').toList()
key=AMUNDSEN_TIMESTAMP_KEY).values('latest_timestmap').toList()
Contributor

Omg, can we fix this typo in databuilder instead?

Member

agree, let's fix in databuilder for the typo.

Contributor Author

Haha fixed!


def _convert_to_description(self, result: Mapping[str, Any]) -> ProgrammaticDescription:
return ProgrammaticDescription(text=_safe_get(result, 'description'),
source=_safe_get(result, 'source'))
source=_safe_get(result, 'description_source'))
Member

I think description_source has been defined in databuilder for programmatic description. I would prefer to change it here to match the databuilder implementation.

Contributor

I think source is more concise so I'd prefer that in a model with description already in the name, but at the very least common and databuilder should be in agreement about the field names!

Member

could we have a follow up todo to change the common repo to make it consistent?
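For readers following this field-name discussion: the _safe_get helper used throughout these diffs does tolerant nested lookups on the query result. Here is an illustrative stand-in for it (the real helper in the proxy also unwraps gremlin result structures and supports a transform parameter, both omitted here), using the description_source key from the diff above:

```python
from typing import Any, Mapping, Optional

def safe_get(mapping: Mapping[str, Any], *keys: str) -> Optional[Any]:
    """Walk nested keys, returning None instead of raising KeyError.

    Illustrative stand-in for the proxy's _safe_get helper.
    """
    current: Any = mapping
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current
```

The rename in the diff matters precisely because a missing key here silently yields None rather than failing loudly.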

@@ -1689,7 +1692,7 @@ def _convert_to_source(self, result: Mapping[str, Any]) -> Source:

def _convert_to_statistics(self, result: Mapping[str, Any]) -> Stat:
return Stat(
stat_type=_safe_get(result, 'stat_type'),
stat_type=_safe_get(result, 'stat_name'),
Member

this seems to be an issue for amundsendatabuilder. cc @allisonsuarez

AndrewCiambrone (Contributor Author) commented Feb 23, 2021

Let me know how you want to handle this @allisonsuarez. I don't mind making the changes in both databuilder and metadata.

Member

let's change the databuilder to make it consistent.

@@ -57,7 +57,7 @@ class NeptuneGremlinProxy(AbstractGremlinProxy):
def __init__(self, *, host: str, port: Optional[int] = None, user: str = None,
password: Optional[Union[str, boto3.session.Session]] = None,
driver_remote_connection_options: Mapping[str, Any] = {},
neptune_bulk_loader_s3_bucket_name: Optional[str] = None,
Contributor Author

There was no way to define this through configuration. Decided to reuse the existing client_kwargs field.

Contributor

Oh, it's a little magical, but if you do as follows:

options = current_app.config[config.PROXY_CLIENT_KWARGS] or {}
_proxy_client = NeptuneGremlinProxy(host=host, port=port, user=user, password=password, **options)

Python will pick up any keys from the PROXY_CLIENT_KWARGS dict and use them as keyword arguments. That's how neptune_bulk_loader_s3_bucket_name would be populated here.
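The **options expansion described here can be sketched in isolation like this. FakeProxy is a stand-in for NeptuneGremlinProxy (only the bulk-loader parameter from the diff is modeled; host, port, and bucket values are made up):

```python
from typing import Optional

class FakeProxy:
    # Stand-in for NeptuneGremlinProxy's keyword-only constructor.
    def __init__(self, *, host: str, port: int = 8182,
                 neptune_bulk_loader_s3_bucket_name: Optional[str] = None) -> None:
        self.host = host
        self.port = port
        self.bucket_name = neptune_bulk_loader_s3_bucket_name

# Stands in for current_app.config[config.PROXY_CLIENT_KWARGS] or {}.
options = {'neptune_bulk_loader_s3_bucket_name': 'my-bulk-loader-bucket'}

# ** unpacks the dict's keys as keyword arguments, so any key matching
# a constructor parameter name is passed through without code changes.
proxy = FakeProxy(host='neptune.example.com', **options)
```

The trade-off, as noted below, is that the existing proxies don't use this pattern, so conforming to the current interface is less disruptive.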

Contributor

Oh, I see, the existing proxies don't use this (we've diverged a bit). It's probably least disruptive to just conform to the existing interface then, sure!

@feng-tao (Member)

cc @friendtocephalopods could you take a look? thanks

@friendtocephalopods (Contributor) left a comment

Really nice catches here, Andrew! I think it would be great if we could tweak the databuilder field names to match the amundsen common models!

Ideally, also revert the client_kwargs stuff and just pass in the PROXY_CLIENT_KWARGS?

Signed-off-by: Andrew Ciambrone <[email protected]>
@codecov-io

codecov-io commented Feb 23, 2021

Codecov Report

Merging #260 (e334e66) into master (2752492) will increase coverage by 3.51%.
The diff coverage is 72.35%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #260      +/-   ##
==========================================
+ Coverage   74.10%   77.61%   +3.51%     
==========================================
  Files          25       27       +2     
  Lines        1255     1358     +103     
  Branches      136      162      +26     
==========================================
+ Hits          930     1054     +124     
+ Misses        297      256      -41     
- Partials       28       48      +20     
Impacted Files Coverage Δ
metadata_service/api/__init__.py 82.60% <ø> (ø)
metadata_service/api/column.py 100.00% <ø> (ø)
metadata_service/api/popular_tables.py 100.00% <ø> (ø)
metadata_service/api/system.py 66.66% <ø> (ø)
metadata_service/api/user.py 100.00% <ø> (ø)
metadata_service/proxy/statsd_utilities.py 81.25% <ø> (ø)
metadata_service/util.py 100.00% <ø> (ø)
metadata_service/proxy/shared.py 28.57% <28.57%> (ø)
metadata_service/api/badge.py 61.29% <61.29%> (ø)
metadata_service/proxy/neo4j_proxy.py 71.65% <61.84%> (-3.35%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1880cec...e334e66. Read the comment docs.

@AndrewCiambrone (Contributor Author)

I think I have resolved all differences pointed out between the databuilder and the common models. @feng-tao let me know if there is anything I missed.

@feng-tao (Member)

will take a look

@feng-tao (Member)

@AndrewCiambrone going to merge this PR now. Do you plan to update the databuilder PR with the name changes discussed here?

@feng-tao feng-tao merged commit a765424 into amundsen-io:master Feb 26, 2021
@AndrewCiambrone (Contributor Author)

@feng-tao Pretty sure all the name changes should be set in amundsen-io/amundsendatabuilder#445

4 participants