
# Business Attribute RFC #6

Open · ramvisa wants to merge 2 commits into main
## Conversation

@ramvisa commented Oct 23, 2023

Request for Proposal for RFC
@JerryDataCollins

Hi Ram,
Do you think it would be useful to:

  • Allow business attributes to be grouped together into business entities (feels like an LDM but driven textually rather than via a modelling tool).
  • Allow import of the above from the leading data modelling tools (ER/Studio, erwin; maybe Sparx EA and SAP PowerDesigner).
  • I think auto-assignment of business attributes to schema attributes will be hard. Consider SAP: it has thousands of columns, names are in German and use German abbreviations. If this becomes a manual task, how viable will the solution be, given that business attribute names will be very different from the target schema column names?
  • Are you going to allow a hierarchy of business attributes? For example, Address decomposes into Address Line 1 etc.

Cheers
Jerry

@ramvisa (Author) commented Oct 25, 2023

> Do you think it would be useful to:
> • Allow business attributes to be grouped together into business entities (feels like an LDM but driven textually rather than via a modelling tool).

Yes, that is indeed the idea behind what is loosely titled a “business record”, which is on our drawing board. A business record groups business attributes and enables us to model and abstract a business entity. We think of this as a Logical Business Model, where we create a model of the business. The interesting part of these two constructs is that they enable us to attach business rules. A business rule describes the information in business terms and enables us to inherit or translate a lot of business terms into the technical model.

We actively use the business attributes at Visa to model recurring elements that are then mapped into or used to create new table attributes in various technologies.

> Allow import of the above from the leading data modelling tools (ER/Studio, erwin; maybe Sparx EA and SAP PowerDesigner).

Might be an idea. Not sure this type of metadata is commonly captured in tools like ER/Studio etc. It might be more relevant to focus on importing from BI tools, like Tableau, Looker, etc., as they have business-focused metadata, like reporting on “Sales Orders”, “HR data”, etc. In Visa’s case, those tools are certainly relevant for finding these business attributes. Looking forward to discussing and fleshing this out some more.

> I think auto-assignment of business attributes to schema attributes will be hard. Consider SAP: it has thousands of columns, names are in German and use German abbreviations. If this becomes a manual task, how viable will the solution be, given that business attribute names will be very different from the target schema column names?

The way we propose to implement the business attributes (and this is how we did it at Visa) is for these attributes to have an ID, a rich business name, and a “technical name”. The technical name is an abbreviation generated from the business name; for example, a business name like “Account Holder Name” might yield a technical name such as “acct_hldr_nm”. This will typically cover a high percentage of the mapping.

We have this running at Visa, and we programmatically match across tens of instances with high match rates. The crux of course is to have naming standards.

We would not anticipate doing this for off-the-shelf applications like SAP and ServiceNow models, as these typically come with a curated semantic layer. We would, in this case, love to have a good way to import these and generate business attributes – or somehow have a view from these apps in the catalog to apply to the analytical apps that use this data. But this is not the highest priority for most users; a focus on analytical apps is likely more valuable.

We do have some of these use cases and would love to explore these types of apps further.

> Are you going to allow a hierarchy of business attributes? For example, Address decomposes into Address Line 1 etc.

Generally, this is likely going to take the form of relationships, probably modeled after the already existing model in the DataHub Glossary, which would support a hierarchy. However, in most cases the business attribute is intentionally kept more holistic* and will map to a set of table attributes: Address (the business attribute) is mapped to Address Line 1, Address Line 2, Postal Code, etc. By qualifying all of these table attributes with the business attribute, a data consumer can find all the table attributes that form the address.

  • *Holistic in the sense that a business user will think less in fields (address line 1, 2, etc.) and more in the business construct of needing an address, which is used for, or represents, “a place to bill to”, etc. We’d like to encourage users to model business attributes at this level, while providing the technical ability to build relationships.

### Business User

#### Must Haves
1. Ability to search for fields using business description/tags/glossary attached to business attribute


Is the expectation that this expanded search should be the default universal search experience on the main search bar?
How should ranking work when you have a match on a business description through a business attribute attached to a field, versus a match on a field-level description, or a match on the table description?


Our ranking strategy prioritizes field-level descriptions, followed by descriptions of business attributes. The primary objective is to ensure that the corresponding dataset is displayed as long as there is a match with the business attribute description.


#### Enabling the capability of searching Dataset entities by the descriptions/tags of Business Attributes

We are proposing to introduce a new annotation, `@SearchableRef`, through which we can populate the Elasticsearch indexes with expanded details regarding the referenced entity.
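To make this concrete, here is a minimal PDL sketch of how the proposed annotation might be applied; the record shape, field name, and annotation parameters ("refType", "depth") are illustrative assumptions, not part of the RFC:

```
// Sketch only: shows where a @SearchableRef annotation could sit in a
// Pegasus (PDL) model. Parameter names here are hypothetical.
record BusinessAttributeAssociation {
  // Reference to the business attribute attached to a schema field.
  // @SearchableRef would direct the indexer to also copy selected fields
  // (description, tags, glossary terms) of the referenced entity into
  // the dataset's Elasticsearch document.
  @SearchableRef = {
    "refType": "businessAttribute",
    "depth": 1
  }
  businessAttribute: Urn
}
```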


The write fanout in this case would be the main issue with this approach. Have you investigated whether elastic can offer any kind of join-like capabilities for us to maintain separate documents for the dataset entity index and the business attribute entity index while still being able to drive searches to match on descriptions etc from business attributes to return dataset docs?


Our investigation reveals that, due to its distributed nature, join operations are generally not advised in Elasticsearch and can be costly; Elasticsearch performs optimally with denormalized data. As a result, we suggest incorporating business attribute properties into the datasets index using PlatformEvents. This approach ensures eventual consistency in search results, while changes are instantly reflected on the user interface.


Users will be able to attach `Business Attributes` only to `Schema Fields`. We will make the necessary UI changes to enforce this.

The new `BusinessAttributeInfo` aspect contains the existing `EditableSchemaFieldInfo` record type as well as `customProperties`. We are also introducing a new record type, `PhysicalEditableSchemaFieldInfo`, which includes the existing `EditableSchemaFieldInfo`. With this implementation, the existing `EditableSchemaMetadata` aspect contains a list of `PhysicalEditableSchemaFieldInfo` records instead of `EditableSchemaFieldInfo` records.


Could you describe why you need a separation of PhysicalEditable versus Editable?

@deepgarg-visa commented Nov 22, 2023


We've restructured the record names to enhance their relevance and maintain backward compatibility. This was achieved by extracting the common properties from EditableSchemaFieldInfo into a new base class, which is now included in BusinessAttributeInfo. Please refer to the updated diagrams for more details. Note that incorporating a BusinessAttribute reference into EditableSchemaFieldInfo would create cyclic dependencies, which do not accurately represent our intended data model.
[diagram: business-attribute-model]



Each `PhysicalEditableSchemaFieldInfo` will contain an `EditableSchemaFieldInfo` and the new field "`businessAttribute`", representing the link between a dataset schema field and a business attribute. We are also proposing the introduction of a new field, "`type`", in `EditableSchemaFieldInfo`.
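For illustration, a rough PDL sketch assembled from the description above; exact field names, types, and optionality are assumptions pending the final diagrams:

```
// Sketch only: reconstructed from the prose, not the final model.
record PhysicalEditableSchemaFieldInfo {
  // Existing editable metadata for the schema field.
  editableSchemaFieldInfo: EditableSchemaFieldInfo
  // Proposed link from the dataset schema field to a business attribute.
  businessAttribute: optional Urn
}

record EditableSchemaMetadata {
  // Previously array[EditableSchemaFieldInfo]; each entry is now wrapped.
  editableSchemaFieldInfo: array[PhysicalEditableSchemaFieldInfo]
}

record EditableSchemaFieldInfo {
  // ...existing fields (fieldPath, description, globalTags, ...)...
  // Proposed new field; its exact type is an assumption here.
  type: optional SchemaFieldDataType
}
```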

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just had a new aspect (similar to globalTags and glossaryTerms) called businessAttributes attached to the schemaField entity and that aspect contained a list of business attribute urns that were associated with the schema field entity? Would that simplify the representation?
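For comparison, a sketch of this alternative in PDL; the aspect name follows the comment, while the exact shape is an assumption:

```
// Sketch only: a standalone aspect on the schemaField entity, analogous
// to globalTags / glossaryTerms, holding business attribute URNs.
@Aspect = {
  "name": "businessAttributes"
}
record BusinessAttributes {
  // URNs of the business attributes associated with this schema field.
  businessAttributes: array[Urn]
}
```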


A schema field represents a unique semantic field or Business Attribute. We have taken your feedback from the other comment and simplified the record structures.


## Drawbacks

Cascading changes from a Business Attribute to the Elasticsearch indexes of related Datasets put load on Kafka and Elasticsearch. A change to a business attribute results in Kafka events, which are consumed by a consumer that in turn updates the Elasticsearch indexes of the referencing datasets with the updated business attribute. As per the current implementation, the Kafka consumer does not finalize message processing until it has updated the Elasticsearch index, so if, for example, a business attribute is referenced in on the order of 100K datasets, processing that volume of messages can lead to performance degradation.


Agree that this write amplification might make this approach infeasible to implement. Curious if you've run any benchmarks to see how this scales if you have 1M entities attached to a business attribute, and you have a change in the description of the attribute.


With this approach we will have eventual consistency. When there is a change, for example to a business attribute description, the change must be propagated to the Elasticsearch indexes of all the referencing datasets. During this propagation the search experience might be inconsistent, but eventual consistency will be reached after some time. As of now we have not run any benchmarking; after design approval we will cover testing and benchmarking during implementation.

@shirshanka

Thanks for the responses here! Will mark this RFC as approved pending some of the performance benchmarks.

@deepgarg-visa

> Thanks for the responses here! Will mark this RFC as approved pending some of the performance benchmarks.

Sure @shirshanka, we are working on this and will get back to you with the benchmarking results.

@aabharti-visa commented Feb 20, 2024

@shirshanka we used the approach of generating platform events for each business attribute update (namely tag, glossary, and documentation updates) and then directly emitting MCL events for each of the associated dataset entities.
As part of benchmark testing, we associated one business attribute with one column field in each of 10K Hive datasets. A new custom hook fetched all related entities using the relationships API so that it could fetch and store more than 10K documents (since the scroll feature is currently disabled in DataHub, we could not fetch more than 10K documents with the /relationships/v1 Open API against Elasticsearch, so we used neo4j instead), and then generated MCL events for each of these datasets.

The Kafka events were consumed in batches, and with the configuration below Elasticsearch was able to process the bulk requests and update the indices within minutes. Based on our benchmark tests, we estimate that 100K MCLs would take no more than 10 minutes.

Here are the findings:

  • Total number of related entities for the business attribute = 10K
  • Total message size in topic MetadataChangeLog_Versioned_v1 = 1GB
  • Elasticsearch ran as a Docker image, with ES_BULK_REQUESTS_LIMIT set to 2000
  • Processed 10K MCLs in around 47 seconds

@aabharti-visa commented Feb 20, 2024

Below is a tail from gms.log:

```
2024-02-16 06:37:49,356 [ThreadPoolTaskExecutor-1] INFO c.l.m.s.BusinessAttributeUpdateService - Business Attribute update hook invoked for :urn:li:businessAttribute:c86e85c4-4af0-4c19-94ab-62cd0788a0b0
2024-02-16 06:38:06,901 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1048 Took time ms: -1
2024-02-16 06:38:14,750 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1129 Took time ms: -1
2024-02-16 06:38:26,096 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1906 Took time ms: -1
2024-02-16 06:38:37,339 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1906 Took time ms: -1
2024-02-16 06:38:47,506 [I/O dispatcher 1] INFO c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1733 Took time ms: -1
```
