Programmatic and Manual pathways for table and column descriptions #147

samshuster · 2019-10-23T19:24:34Z

Overview

As a data engineer, there are quite a few properties that we can extract programmatically that are currently not supported by amundsen as first class properties in the UI. For some of these properties that are likely widely useful, I can understand creating issues to ingest them so that they appear in the panel on the right. However, some of these properties are very company specific and wouldn't be needed by other companies. Therefore, I think that it would be useful to allow users to ingest structured data without needing to make changes to amundsen infrastructure.

What do we do now? And why it won't work in long run?

Currently, we get around this by programmatically updating the table description with prepared markdown, however in the long run we also want users to be able to edit table and column descriptions through amundsen which will put us in a bind. We no longer would able to update programmatically without a lot of added complexity concerning reconciliation and merging of changes.

Proposal

My proposal is that there are two types of descriptions for tables and columns

One would be programmatic and cannot be modified manually.
The other would be the current description.
By default, the programmatic description would not appear on the page unless it is populated.

Note about column level

This is true on the column level as well, where we may want to include company specific attributes, but I can understand why column level programmatic descriptions maybe too much to ask for in terms of cluttering the UI.

jornh · 2019-10-23T22:42:46Z

On table level

For the User model in databuilder, I noticed it already takes an optional kwargs parameter: https://github.com/lyft/amundsendatabuilder/blame/1.4.10/databuilder/models/user.py#L62 so you can give keynames that can double as field description(s) in the UI and values

I'd imagine something similar could be added to the Table model in https://github.com/lyft/amundsendatabuilder/blob/1.4.10/databuilder/models/table_metadata.py

Note: I haven't tried out the kwargs mechanism though - but the PR amundsen-io/amundsendatabuilder#95 gives some hints in the unit test

On column level

I wonder to what extents Stats already has whats needed? https://github.com/lyft/amundsendatabuilder/blob/1.4.10/databuilder/models/table_stats.py

The metadata model in Architecture.md: https://github.com/lyft/amundsen/blob/master/docs/architecture.md#metadata

If Stats doesn't suit the needs then Column could of course also get a kwargs parameter.

samshuster · 2019-10-24T22:30:44Z

thanks for the detailed thoughts @jornh!

If I understand you correctly, we should be able to use the kwargs of the TableMetadata model (looks like there is already one there with the comment:
:param kwargs: Put additional attributes to the table model if there is any.)
to ingest other programmatic data.

I will take a look at this tomorrow hopefully and see how it works and get back to this issue with my research.
Thanks!

samshuster · 2019-10-27T21:24:35Z

Ok @jornh I looked into it a bit today and I was able to get table kwarg properties into neo4j as extra properties of a given node. However, nothing showed up in the table ui. Looking through the frontend code, unless I am missing something, I don't see how these extra properties get wired up to the frontend code even for User (I know next to nothing about frontend though so me missing something could be quite likely).

If it is true that the frontend currently does not display the extra kwarg properties, would this story then be frontend only changes, plus possibly additions to sample_data_loader to show how it can take any key value pairs?

UPDATE: now that I have taken a look at column_stats, I definitely think this takes care of what we need on column level.

I wonder if having a table_stats analogous to table_column_stats would be a better approach even over the kwargs?

Thanks for your help and sorry for any dumb questions!

javamonkey79 · 2019-12-07T00:01:40Z

@feng-tao @markgrover @jinhyukchang we could really use your insights on this one. idk if it would be best to bring this up at the next community meeting, or, best to solicit this on slack, or, just submit a PR.

Overall, I think this could be adopted without too much disruption. What it would take:

on the backend: a new table field (lets call it "noneditable_description" for simplicities sake)
on the front end: a new div section above (or perhaps below) the existing description section populated with noneditable_description data; I'm not a front end engineer, so, I'll submit that there could be a better way.

To reiterate, why do we want to do this? Consider these images:

First, you will notice, that all of this markdown data is quite useful, but, it was not created by user entry in amundsen. Instead if was machine generated upstream. Some of this information, such as description, documentation, dashboards we would want editable. Other parts, such as "Describe Extended Table Info" should not be editable, since it is metadata (similar to column info).

Please lmk what you think. We'd really like to do this work, and we think others in the community could benefit, but, we don't want to do the work in such a way as to be unsatisfactory (e.g. if you feel like it is unwanted\unneeded).

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

feng-tao · 2019-12-14T06:41:05Z

@javamonkey79 not sure if everyone sees the discussion in the issue, would you mind posting above into the slack channel so that everyone could take a look? Thanks.

javamonkey79 · 2019-12-16T19:16:52Z

hi @feng-tao yes, I'll solicit feedback there

javamonkey79 · 2019-12-16T23:46:31Z

There was some discussion in slack about this today. The general consensus is that, instead of just 2 description sections, we should allow for N description sections and whether or not the section will be editable or not would be set as a property of that section. This solution would be quite versatile, and should fit everyones needs. However, it comes with a bit of cost, namely complexity.

Here is one example, of how this might look (courtesy of Matt Spiel):

And here are some possible concerns Tamika has raised as far as implementation:

How to change the metadata model to support multiple “descriptions”, and each piece would also have a concept title, content, and whether or not it is editable.
How that data would be ingested.
The implications for search and if there is a requirement for these new sections to be searchable as well.
Whether or not frontend apis have to change given the outcomes of the first and third items.

Further discussion should happen here.

javamonkey79 · 2019-12-20T00:34:30Z

@ttannis I've been thinking about this some more, and based on the backing neo4j structure, it would appear to be possible to assign multiple description nodes for each table. The only issue with this approach, is that we would need some ordering id assigned to each description. Further, we would also likely want to set a new flag on each description node "editable" which could then be propagated to the front end. wdyt? @feng-tao is there another more suitable approach?

ttannis · 2019-12-20T19:53:48Z

That sounds good on my end. The descriptions would be rendered according to the ordering id -- with what we currently know as description being the first one.They would be rendered in order on the table detail page, and support markdown if they are editable.

My current thoughts are:

Our APIs between metadata and frontend for getting and editing descriptions will have to change, so I'd be interested to talk about that next if we all support this approach.
Also interested to talk about what this approach would mean for the search service having to consider the content of all descriptions when searching.

javamonkey79 · 2019-12-20T23:22:18Z

@ttannis

Our APIs between metadata and frontend for getting and editing descriptions will have to change, so I'd be interested to talk about that next if we all support this approach.

What needs to be talked about? I think it would need to be updated per the new logic, but, that's about it. Did I miss something?

Also interested to talk about what this approach would mean for the search service having to consider the content of all descriptions when searching.

We'd just need to update the logic that populates elasticsearch to look at all description nodes, instead of the one. Or, is there more to it?

markgrover · 2019-12-21T21:19:49Z

Thanks @javamonkey79 for this, and @ttannis for all your help. Would love to see this problem to be fixed, let me know if there's something that I can help with.

samshuster · 2020-01-24T21:59:34Z

Just writing here that I am going to take a stab at this to at least get some more clarity on what could be done. I imagine that my thoughts below are not going to be the final solution because I am sure I am missing important details.

Given this, here are my thoughts on more specific implementation details:

I am going to start first with the data loader changes to populate more description objects in neo4j. I am going to make it backwards compatible.
EDIT: after trying to use Node = Description for all Descriptions, I felt that this was too disruptive to the exsting cipher queries in the metadata service. Because of this, I have currently opted to only use Description nodes for the original Description and have come up with a new Node Type for this new type of metadata. Currently it is called "Programmatic_Description", but a better name would be great.

At first, I thought it would be simpler to only support just one is_editable text box, but thinking through it more, I am only saying this because I don't know enough about frontend. I could see how it would be great to have different boxes that are editable to help guide users to curate specific things. We could even lock down security on different boxes based on roles. Only users who are owners can edit table_description box, but all users can edit another box for example.

    DESCRIPTION_NODE_LABEL = 'Description'
    DESCRIPTION_KEY_FORMAT = '{description}'
    DESCRIPTION_ORDER = 'description_order'

    def __init__(self,
                 source, # type: str
                 text,  # type: Union[None, str]
                 order, # type: int
                 is_editable # type: boolean
                 ):
        """
        :param source: The unique source of what is populating this description
        :param text: the description text. Markdown supported.
        :param order: the display order. Lower numbers mean it will be displayed first. Ties are handled alphabetically by source.
        """

Two options here:

Is that TableMetadata and ColumnMetadata object can take a list of "programmatic_descriptions"
TableMetadata and ColumnMetadata will now both take optional "description_source" parameter which is either None or a string that is where the description comes from.

I actually think approach 2 is better, because that way we can more easily decouple the different "sources" of programmatic descriptions and re-use the existing description parameter.

Will then construct the neo4j nodes accordingly using the "source" as part of the description node id.

After above is done, I will modify Metadata Service to have a new method to fetch all of the programmatic description nodes and concat them together based on order as a first attempt. It is possible that it would be better to return all separately for more frontend flexibility.
NOTE: one difficulty I found here is that this will require a modification to amundsen_common as the Table object itself will need to be modified.

Then the frontend would need to be modified to make these requests and display them.
NOTE: I am not a frontend person, but I wonder if it would be possible to have source based pluggable css or custom javascript with custom frontend builds?

As a first pass for frontend, I will just concatenate all of the programmatic descriptions and display them in a single box.

@javamonkey79 @ttannis

samshuster · 2020-01-27T15:57:43Z

A couple of other thoughts that came up since working on it on Friday.

I think we should get rid of description_order as part of data loader as that seems like we are coupling presentation logic to the data. Instead, I think the frontend should be configurable to display the different descriptions based on the "description_source" (where is this description metadata coming from? i.e. s3_scraper, extended table details etc.)

samshuster · 2020-01-28T01:47:39Z

Hey @javamonkey79 @ttannis, as promised I made some headway on this and was able to get a very early stages prototype together here:

This contrived example shows 2 different sources of metadata (s3 status and quality report) that now populate dedicated boxes in the left pane.

The data is being populated from rows like this:

database,cluster,schema_name,name,description,tags,description_source,description_editable,description_order
hive,gold,test_schema,test_table1,"**Size**: 50T\n\n**Monthly Cost**: $5000","expensive","s3_crawler",false,1
dynamo,gold,test_schema,test_table2,"**Size**: 1T\n\n**Monthly Cost**: $500",,"s3_crawler",false,1
hive,gold,test_schema,test_table1,"### Quality Report:\n --- \n Ipsus enom. Ipsus enom ipsus lorenum.\n ---\n[![Build Status](https://api.travis-ci.com/lyft/amundsendatabuilder.svg?branch=master)](https://travis-ci.com/lyft/amundsendatabuilder)","low_quality","quality_service",false,1

I have not tested out columns yet.

Notice that tags can also come from these client specific metadata sources.

A big thing that needs to be still decided is a name for this new concept. I think "programmatic descriptions" is not a good name. Other possible names that have been suggested are:

Custom_Metadata
Metadata_Properties
Generic_Metadata

Please loop in anybody else that should be part of this conversation. cc @markgrover

ttannis · 2020-01-28T02:39:49Z

@samshuster do you have any branches that you can share with this, it's alright if things aren't clean or perfect.

samshuster · 2020-01-28T15:10:52Z

sure thing @ttannis I will get up some WIP branches for all of the modules that had to be touched... basically all of them! I imagine that there will be some major rework that is asked by you guys, so no worries there.

NOTE: I didn't bother with search at this point, I figure that can be a separate story down the line to index these new datapoints. But happy to try and do it if you feel differently.

samshuster · 2020-01-28T15:52:37Z

ok @ttannis you can check out the above WIP MRs. Let me know if you have any questions!

jinhyukchang · 2020-01-28T19:19:22Z

It seems it first started with un-editable description, but now it became generic metadata. I understand the value of being flexible, but on the other side, I am afraid we are opening the door such that most of new metadata could just go through this approach, not being captured as a new use case model. (And also, we would need control on which metadata should be indexed in search service or not which adds more complexity).

In other words, if I understand it correctly, most of current metadata (and future metadata) can be represented by this flexible model, and I wonder if that is better path as it would easily make new metadata model hide in each companies private repo in databuilder configuration. I think there's trade-off between being flexible on new metadata introduction vs we may see less contribution on new metadata model. May be I am too worried. WDYT?

cc: @markgrover @feng-tao @ttannis

samshuster · 2020-01-28T19:37:46Z

hey @jinhyukchang I guess this is an aged old question flexibility vs explicit. You raised some great points!

Here are some of my thoughts for why I believe there will always be a use case for having a flexible option like this approach, but I could be wrong.

This is the main one. There are certain attributes that are too company specific to ever make sense to be contributed back to open source. For example, if this issue is closed without being eventually accepted, I would either need to customize our build, or put out separate PRs for new attributes that may never be used by other companies.
The amount of effort to add a new piece of metadata to the data model is currently quite large. If a company is prototyping a new piece of metadata, it might not be worth the effort.
I highly doubt that these flexible options would ever look as nice (currently only markdown! although I suppose that could be changed in the future with some customization on frontend as I mentioned earlier) or be as powerful as an explicit addition to the data model. Given this I think there will still be a drive to add new content explicitly.

jinhyukchang · 2020-01-28T19:41:04Z

Fair points, @samshuster ! Let's not make it too powerful :)

samshuster · 2020-01-30T14:06:04Z

If there are no major concerns with the approach (happy to make changes if there are!) I am going to revisit where I left off on this a few days ago and focus on:

adding tests where appropriate
adding frontend non-editable section. (NOTE: for now I think I will skip on making these editable as that is considerably more work and might not even be needed).
cleaning up and refactoring where applicable

samshuster · 2020-02-03T21:21:21Z

hey @ttannis @jinhyukchang checking in here again with a couple more updates.

I think I am going to skip implementing the editable properties of these descriptions on frontend as it is considerably more work with not much benefit in my opinion. Let me know if I should persevere, but given that we don't want to make these too powerful, it seems not necessary. I did however currently keep the properties in the data_loader, metadata and api but can get rid of these if it is desired.
I think column stats can pretty much serve this same purpose for columns, so I removed this new functionality for columns. It would be for tables only.

With those above 2 changes, I think this is almost ready to go! I should have some complete MRs by tomorrow for you guys to take a look at.
The order of how they should be looked at and eventually merged if accepted should be:

data_builder
amundsen-common
metadata
frontend

Given this, I moved databuilder MR out of WIP but anticipate that there will still be some major changes left: https://github.com/lyft/amundsendatabuilder/pull/187/files

Thanks for your feedback and help!

jinhyukchang · 2020-02-04T00:50:34Z

Thanks @samshuster !
Let us know when it's ready to review.

On the editable properties, I prefer to remove it as it won't be an active code. Let me know what you think.

samshuster · 2020-02-04T00:57:14Z

sounds good @jinhyukchang that makes sense, i will remove the is_editable property from the data model and will tag you tomorrow when they are all ready to be reviewed

samshuster · 2020-02-04T14:18:34Z

ok @jinhyukchang @ttannis https://github.com/lyft/amundsendatabuilder/pull/187/files this change is ready for you guys to review.

NOTE: the other MR's are also ready for you to review too! but the builds won't pass until amundsen-common is pushed

samshuster · 2020-02-12T16:50:38Z

ok @jinhyukchang amundsen-common and metadata are ready for review (sorry to keep tagging you!)

amundsen-io/amundsencommon#18

below will build once above is merged in
amundsen-io/amundsenmetadatalibrary#104

samshuster · 2020-03-02T15:23:18Z

hey @ttannis just checking in to see when we can start revisiting this MR
amundsen-io/amundsenfrontendlibrary#381

ttannis · 2020-03-02T23:00:42Z

hey @ttannis just checking in to see when we can start revisiting this MR
lyft/amundsenfrontendlibrary#381

Last Thursday one of our designers took a look at this concept and is currently working on some suggestions. They will either respond directly on your PR, or I'll relay the information once it's ready.

samshuster · 2020-03-06T13:47:24Z

hey @ttannis @feng-tao would you guys be open to having my plain design get pushed through as a starting point? It would only show up on amundsen if people ingest programmatic data on the backend, so it would be low risk. We could consider it "beta".

That could give you guys more time to rework the design.

Just an idea, understood if that isn't going to work!

samshuster · 2020-03-19T14:34:55Z

i believe this issue can be closed now. Thanks everybody!

jornh · 2020-03-19T15:29:43Z

👏👏👏 thanks for pulling through on this feature Sam.

I’m sure it will benefit a lot of other orgs using Amundsen.

javamonkey79 · 2020-03-19T16:40:19Z

I agree, nice work on a Herculean effort @samshuster 💪 💯 💥

markgrover · 2020-03-19T16:41:36Z

Indeed, awesome stuff! Thanks Sam!

…

On Thu, Mar 19, 2020 at 9:40 AM Shaun Elliott ***@***.***> wrote: I agree, nice work on a Herculean effort @samshuster <https://github.com/samshuster> 💪 💯 💥 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#147 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AANBLC6FIJFZA3B7DHRYZA3RIJDIFANCNFSM4JEIACXQ> .

-- Mark Grover Product Manager <https://www.lyft.com/>

ttannis · 2020-03-20T01:19:08Z

i believe this issue can be closed now. Thanks everybody!

Thanks Sam, I'll be closing this issue after creating and publishing the new release for frontend, and updating the quick starts.

ttannis · 2020-04-03T17:04:55Z

Release Notes: https://github.com/lyft/amundsenfrontendlibrary/releases/tag/v2.1.0

* 🎉 Initial commit. * ♻️ Refactoring code. * 👌 Updating code due to code review changes. * 👌 Updating code due to code review changes. (tests)

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

* 🎉 Initial commit. * ♻️ Refactoring code. * 👌 Updating code due to code review changes. * 👌 Updating code due to code review changes. (tests)

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

* 🎉 Initial commit. * ♻️ Refactoring code. * 👌 Updating code due to code review changes. * 👌 Updating code due to code review changes. (tests)

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

* 🎉 Initial commit. * ♻️ Refactoring code. * 👌 Updating code due to code review changes. * 👌 Updating code due to code review changes. (tests)

feng-tao added the keep fresh Disables stalebot from closing an issue label Oct 24, 2019

jakemongaya pushed a commit to jakemongaya/amundsen that referenced this issue Dec 11, 2019

Link deeper to doc content (amundsen-io#147)

97c9190

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

samshuster mentioned this issue Feb 8, 2020

Introduce "Domains/Subject Areas" as a first class citizen in amundsen #275

Closed

samshuster mentioned this issue Feb 21, 2020

Add Programmatic_Descriptions to Search #297

Closed

ttannis closed this as completed Apr 3, 2020

feng-tao added the status:needs_triage For all issues that need to be processed label Apr 15, 2020

dorianj pushed a commit to dorianj/amundsen that referenced this issue Apr 25, 2021

Link deeper to doc content (amundsen-io#147)

0110dca

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

feng-tao pushed a commit that referenced this issue May 7, 2021

Link deeper to doc content (#147)

9c65760

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

feng-tao pushed a commit that referenced this issue May 7, 2021

feat: Add Frequent Users feature in [AtlasProxy] (#147)

e80fcfd

* 🎉 Initial commit. * ♻️ Refactoring code. * 👌 Updating code due to code review changes. * 👌 Updating code due to code review changes. (tests)

jkulzick mentioned this issue Jul 27, 2021

Feature Proposal: Editable Custom Table Attributes #1372

Closed

hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this issue Jun 30, 2022

Link deeper to doc content (amundsen-io#147)

c2514cb

Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)

Golodhros added type:feature A new feature request status:completed Issue is completed and on master and removed keep fresh Disables stalebot from closing an issue status:needs_triage For all issues that need to be processed labels Dec 15, 2022

Programmatic and Manual pathways for table and column descriptions #147

Programmatic and Manual pathways for table and column descriptions #147

Comments

samshuster commented Oct 23, 2019 • edited Loading

Overview

What do we do now? And why it won't work in long run?

Proposal

Note about column level

jornh commented Oct 23, 2019

On table level

On column level

samshuster commented Oct 24, 2019

samshuster commented Oct 27, 2019 • edited Loading

javamonkey79 commented Dec 7, 2019 • edited Loading

feng-tao commented Dec 14, 2019

javamonkey79 commented Dec 16, 2019

javamonkey79 commented Dec 16, 2019

javamonkey79 commented Dec 20, 2019

ttannis commented Dec 20, 2019

javamonkey79 commented Dec 20, 2019

markgrover commented Dec 21, 2019

samshuster commented Jan 24, 2020 • edited Loading

samshuster commented Jan 27, 2020

samshuster commented Jan 28, 2020

ttannis commented Jan 28, 2020

samshuster commented Jan 28, 2020

samshuster commented Jan 28, 2020

jinhyukchang commented Jan 28, 2020 • edited Loading

samshuster commented Jan 28, 2020

jinhyukchang commented Jan 28, 2020 • edited Loading

samshuster commented Jan 30, 2020

samshuster commented Feb 3, 2020

jinhyukchang commented Feb 4, 2020

samshuster commented Feb 4, 2020

samshuster commented Feb 4, 2020

samshuster commented Feb 12, 2020

samshuster commented Mar 2, 2020

ttannis commented Mar 2, 2020

samshuster commented Mar 6, 2020

samshuster commented Mar 19, 2020

jornh commented Mar 19, 2020

javamonkey79 commented Mar 19, 2020

markgrover commented Mar 19, 2020 via email

ttannis commented Mar 20, 2020

ttannis commented Apr 3, 2020

samshuster commented Oct 23, 2019 •

edited

Loading

samshuster commented Oct 27, 2019 •

edited

Loading

javamonkey79 commented Dec 7, 2019 •

edited

Loading

samshuster commented Jan 24, 2020 •

edited

Loading

jinhyukchang commented Jan 28, 2020 •

edited

Loading

jinhyukchang commented Jan 28, 2020 •

edited

Loading