Business Attribute RFC #6
Request for Proposal for RFC
Hi Ram,
Cheers |
Yes, that is indeed the idea of what is loosely titled a “business record” and is on our drawing board. A business record groups business attributes and enables us to model and abstract a business entity. We think of this as a Logical Business Model, where we create a model of the business. The interesting part of these two constructs is that they enable us to now attach business rules. A business rule will describe the information in business terms, and enables us to inherit or translate a lot of business terms into the technical model. We actively use the business attributes at Visa to model recurring elements that are then mapped into or used to create new table attributes in various technologies.
Might be an idea. Not sure if this type of metadata is commonly captured in tools like ER/Studio etc. It might be more relevant to focus on importing from BI tools, like Tableau, Looker, etc. as they have business focused metadata, like reporting on “Sales Orders”, “HR data”, etc. In Visa’s case, certainly those tools are relevant in finding these business attributes. Looking forward to discussing and fleshing this out some more.
The way we propose to implement business attributes (and this is how we did it at Visa) is for each attribute to have an ID, a rich business name, and a “technical name”. The technical name is an abbreviation generated from the business name. This will typically cover a high percentage of mappings.
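To make the business-name-to-technical-name step concrete, here is a minimal sketch. The abbreviation dictionary and the `technical_name` helper are hypothetical and only illustrate the idea; they are not the rules used at Visa or anything in the RFC.

```python
import re

# Hypothetical abbreviation dictionary; a real deployment would curate its own.
ABBREVIATIONS = {
    "address": "ADDR",
    "number": "NBR",
    "customer": "CUST",
    "account": "ACCT",
}


def technical_name(business_name: str) -> str:
    """Derive an upper-case, underscore-separated name from a business name."""
    words = re.findall(r"[A-Za-z0-9]+", business_name)
    # Abbreviate known words; keep unknown words whole, in upper case.
    parts = [ABBREVIATIONS.get(word.lower(), word.upper()) for word in words]
    return "_".join(parts)
```

Words missing from the dictionary pass through unchanged, which is one reason a generated name would cover only part of the mappings and the rest would need curation.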
We would not anticipate doing this for off-the-shelf applications like SAP and ServiceNow, as their models typically come with a curated semantic level. In this case, we would love to have a good way to import these and generate business attributes, or somehow have a view from these apps in the catalog to apply to analytical apps that use this data. But this is not the highest priority for most users; a focus on analytical apps is likely more valuable. We do have some of these use cases and would love to explore these types of apps further.
Generally, this is likely going to require the concept of relationships, likely modeled after the existing model in the DataHub Glossary. This would support a hierarchy. However, in most cases the business attribute is intentionally kept more holistic and maps to a set of table attributes. For example, Address (the business attribute) is mapped to Address Line 1, Address Line 2, Postal Code, etc. By qualifying all of these table attributes with the business attribute, a data consumer can find all table attributes that form the address.
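The Address example can be sketched as a simple reverse lookup. The dataset and field names below are made up for illustration, and `fields_by_attribute` is a hypothetical helper, not a DataHub API.

```python
from collections import defaultdict

# Hypothetical links: (dataset, field) -> business attribute name.
field_links = {
    ("customers", "address_line_1"): "Address",
    ("customers", "address_line_2"): "Address",
    ("customers", "postal_code"): "Address",
    ("customers", "email"): "Email",
}


def fields_by_attribute(links):
    """Group table attributes under the business attribute that qualifies them."""
    index = defaultdict(list)
    for field, attribute in links.items():
        index[attribute].append(field)
    return dict(index)


# All table attributes that together form the address.
address_fields = fields_by_attribute(field_links)["Address"]
```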
### Business User

#### Must Haves
1. Ability to search for fields using business description/tags/glossary attached to business attribute
Is the expectation that this expanded search should be the default universal search experience on the main search bar?
How should ranking work when you have a match on a business description through a business attribute attached to a field, versus a match on a field-level description, or a match on the table description?
Our ranking strategy prioritizes field-level descriptions, followed by descriptions of business attributes. The primary objective is to ensure that the corresponding dataset is displayed as long as there is a match with the business attribute description.
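As a rough illustration of that ordering, the sketch below scores a dataset higher for a field-level description match than for a business attribute description match, while still returning the dataset when only the business attribute matches. The weights, the `DatasetDoc` shape, and the substring matching are all assumptions; the actual ranking would be expressed in the search query, not in application code like this.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights: a field-level match outranks a business attribute match.
FIELD_DESCRIPTION_WEIGHT = 2.0
BUSINESS_ATTRIBUTE_WEIGHT = 1.0


@dataclass
class DatasetDoc:
    name: str
    field_descriptions: List[str]
    business_attribute_descriptions: List[str]


def score(doc: DatasetDoc, query: str) -> float:
    """Sum the weights of every description source that contains the query."""
    q = query.lower()
    total = 0.0
    if any(q in d.lower() for d in doc.field_descriptions):
        total += FIELD_DESCRIPTION_WEIGHT
    if any(q in d.lower() for d in doc.business_attribute_descriptions):
        total += BUSINESS_ATTRIBUTE_WEIGHT
    return total


def search(docs: List[DatasetDoc], query: str) -> List[str]:
    """Return names of matching datasets, best score first."""
    hits = [(score(doc, query), doc.name) for doc in docs]
    matching = [hit for hit in hits if hit[0] > 0]
    return [name for _, name in sorted(matching, key=lambda hit: -hit[0])]
```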
#### Enabling Capability of searching Dataset entities as per Description/tags of Business Attributes

We are proposing to introduce a new annotation, `@SearchableRef`, through which we can populate the elastic indexes with expanded details of the referenced entity.
The write fanout in this case would be the main issue with this approach. Have you investigated whether elastic can offer any kind of join-like capabilities for us to maintain separate documents for the dataset entity index and the business attribute entity index while still being able to drive searches to match on descriptions etc from business attributes to return dataset docs?
Our investigation reveals that due to its distributed nature, join operations are generally not advised and can be costly in Elasticsearch, which performs optimally with denormalized data. As a result, we suggest incorporating business attribute properties into the datasets index using PlatformEvents. This approach ensures eventual consistency in search results, while changes are instantly reflected on the user interface.
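A minimal sketch of what "incorporating business attribute properties into the datasets index" means at the document level. The field names (`businessAttributeUrns`, `businessAttributeDescriptions`) and the URN format are assumptions for illustration, not the actual index mapping.

```python
def denormalize(dataset_doc: dict, business_attributes: dict) -> dict:
    """Return a copy of the dataset document with the descriptions of its
    referenced business attributes copied in, so no join is needed at query time."""
    enriched = dict(dataset_doc)
    enriched["businessAttributeDescriptions"] = [
        business_attributes[urn]["description"]
        for urn in dataset_doc.get("businessAttributeUrns", [])
        if urn in business_attributes  # skip references that cannot be resolved
    ]
    return enriched
```

The trade-off is the one raised in this thread: every copy has to be rewritten when the business attribute changes, which is where the write fan-out comes from.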
Users can attach `Business Attributes` only to `Schema Fields`. We will make the necessary UI changes to control this feature.

The new `BusinessAttributeInfo` aspect contains the existing `EditableSchemaFieldInfo` record type and also `customProperties`. We are also introducing one new record type, `PhysicalEditableSchemaFieldInfo`, which includes the existing `EditableSchemaFieldInfo`. With this implementation, the existing aspect `EditableSchemaMetadata` now contains a list of `PhysicalEditableSchemaFieldInfo` records instead of `EditableSchemaFieldInfo` records.
Could you describe why you need a separation of `PhysicalEditable` versus `Editable`?
We've restructured the Record names to enhance their relevance and maintain backward compatibility. This was achieved by extracting common properties from EditableSchemaFieldInfo and incorporating them into a new base class, which is now included in BusinessAttributeInfo. Please refer to the updated diagrams for more details. However, incorporating a BusinessAttribute reference into EditableSchemaFieldInfo creates cyclic dependencies, which do not accurately represent our intended data model.
Each `PhysicalEditableSchemaFieldInfo` will contain an `EditableSchemaFieldInfo` and the new field `businessAttribute`, which represents the link between a dataset schema field and a business attribute. We are also proposing a new field, `type`, in `EditableSchemaFieldInfo`.
What if we just had a new aspect (similar to globalTags and glossaryTerms) called `businessAttributes` attached to the `schemaField` entity, and that aspect contained a list of business attribute urns that were associated with the schema field entity? Would that simplify the representation?
A schema field represents a unique semantic field or business attribute. We have taken your feedback from the other comment and simplified the record structures.
## Drawbacks

Cascading changes made in a Business Attribute to the related Dataset elastic indexes puts load on Kafka and Elastic. A change to a business attribute results in Kafka events, which are consumed by the consumer, which in turn updates the elastic indexes of the referenced datasets with the change. In the current implementation, the Kafka consumer does not finalise message processing until it has updated the elastic index, so if, for example, a business attribute is referenced in a large number of datasets (on the order of 100K), processing the resulting volume of messages can lead to performance degradation.
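One common way to bound the cost of this kind of fan-out is to group the per-dataset updates into bulk requests. The sketch below only illustrates the batching arithmetic: `bulk_update` is a hypothetical callable standing in for a bulk index request, and the batch size is an arbitrary assumption, not a DataHub setting.

```python
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def propagate(
    dataset_urns: Iterable[str],
    new_description: str,
    bulk_update: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Send one bulk request per batch of referencing datasets.

    Returns the number of bulk requests issued.
    """
    requests = 0
    for batch in batched(dataset_urns, batch_size):
        bulk_update(
            [{"urn": urn, "businessAttributeDescription": new_description} for urn in batch]
        )
        requests += 1
    return requests
```

With 100K referencing datasets and a batch size of 1000, this issues 100 bulk requests instead of 100K single-document updates; search results stay stale for a dataset until its batch has been applied.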
Agree that this write amplification might make this approach infeasible to implement. Curious if you've run any benchmarks to see how this scales if you have 1M entities attached to a business attribute, and you have a change in the description of the attribute.
With this approach we will have eventual consistency. When there is a change, for example in a business attribute description, it must be propagated to the elastic indexes of all the referenced datasets. During this propagation the search experience might be inconsistent, but eventual consistency will be achieved after some time. As of now we have not run any benchmarks. After design approval we will take care of testing and benchmarking during implementation.
Thanks for the responses here! Will mark this RFC as approved pending some of the performance benchmarks.
@shirshanka we used the approach of generating platform events for each business attribute update (namely tag, The Kafka events were consumed in batches, and with the configuration below elastic was able to process the bulk requests and update the indices within minutes. Based on our benchmark tests, we can confirm that processing 100K MCLs won't take more than 10 minutes. Here are the findings:
Below is tail from gms.log |