Fixing purview test issues and improve performance #350
@YihuiGuo this should also fix your earlier issue.
Could you please add more technical details and your investigation of the bug? I'm not sure what actually happened.
As described in the PR, the implementation of the [...]. More details are available in this issue: wjohnson/pyapacheatlas#206, where I talked with Will offline, and he agreed to add some backoff to the search_entities API.
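For readers unfamiliar with the backoff idea mentioned here, the following is a minimal sketch of retrying a throttled call with exponential backoff and jitter. It is not the pyapacheatlas implementation: `ThrottledError` and `call_with_backoff` are hypothetical names introduced only for illustration, and the delay values are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error standing in for a rate-limit rejection from the service."""


def call_with_backoff(func, max_retries=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter when throttled."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Double the wait on every attempt, cap it, then add up to 10% jitter
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter is injectable so that the retry schedule can be checked without actually waiting.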
Also, please update the PR description to make it a bit more descriptive.
The canonical way to retrieve an entity (or entities) from Purview is to fetch by GUID, and we already store the related GUIDs in our data model.
I agree. When the CLI saves data to Purview, it does not specify any hints to partition data by project, which means that search via startsWith might still experience perf issues as data volume grows. Since the CLI already writes lineage relationships (for example, a project contains feature/anchor/derived entities), using a GUID list to fetch registered features sounds more efficient and scalable.
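The GUID-list approach suggested above could look roughly like the sketch below. It assumes the project record already carries the GUIDs of its features under a hypothetical `feature_guids` key, and that some `fetch_by_guids` callable performs one bulk lookup per batch. Both names, and the batch size, are assumptions made for illustration and are not taken from Feathr or pyapacheatlas.

```python
from typing import Callable, Dict, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_project_features(
    project: Dict,
    fetch_by_guids: Callable[[List[str]], List[Dict]],
    batch_size: int = 100,
) -> List[Dict]:
    """Fetch a project's features by GUID, one bulk call per batch."""
    # Drop duplicate GUIDs while preserving their original order
    guids = list(dict.fromkeys(project["feature_guids"]))
    entities: List[Dict] = []
    for batch in chunked(guids, batch_size):
        entities.extend(fetch_by_guids(batch))
    return entities
```

With this shape, the number of remote calls grows with the size of the project rather than with the total number of entities in the catalog.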
Agreed, but the goal of this PR is not to solve all of those issues. I have a separate PR that addresses them; please take a look: #368
* main:
  Fixing purview test issues and improve performance (#350)
  [feathr] Add product_recommendation advanced sample (#348)
  obejectId query cmd update (#360)
  add license, release, docs, python api ref badges with shields img (#357)
  quick fix the 404 not found in read me link (#355)
  Python SQL Registry (#311)
  enable JWT token param in frontend API calls (#337)
  Optimize environment variable behavior (#333)
  Adding better warning message to let user know that config file is missing and they need to set env parameters. (#347)
  Feature Monitoring (#330)
  Windoze/211 maven submission (#334)
  Windoze/211 maven submission (#334)
  Windoze/211 maven submission (#334)
  Fix Synapse quickstart link (#346)
  Show feature details when click feature in lineage graph (#339)
  Update pull_request_push_test.yml
  Update UI README for how to create overrides for local development (#335)
  Update databricks quick start experience (#217)
Currently there are two issues:
The way we get or list features for a project isn't scalable. We first fetch all the entities and then filter out the ones we need on the client side, which sometimes causes the service to throttle.
We also call Purview repeatedly for entities that we have already fetched.
This PR solves those issues by issuing a server-side filtering query and by optimizing the get_features_from_registry logic to avoid duplicate Purview calls.
The call time of get_features_from_registry() is now reduced from ~20s to around 3s.
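The two changes described in this PR (filtering on the server instead of the client, and not refetching entities that were already retrieved) can be illustrated with the self-contained sketch below. It does not reproduce the PR's code. `FakeRegistry` is an in-memory stand-in for the remote service that counts round trips, `CachingClient` is a hypothetical wrapper, and the `project__name` qualified-name convention is an assumption made for the example.

```python
class FakeRegistry:
    """In-memory stand-in for the remote catalog; counts round trips."""

    def __init__(self, entities):
        self._entities = {e["guid"]: e for e in entities}
        self.calls = 0

    def search(self, name_prefix):
        # The filter is applied "on the server": only matches are returned
        self.calls += 1
        return [
            e for e in self._entities.values()
            if e["qualifiedName"].startswith(name_prefix)
        ]

    def get(self, guid):
        self.calls += 1
        return self._entities[guid]


class CachingClient:
    """Hypothetical client that remembers every entity it has already seen."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}

    def list_project_entities(self, project):
        # One filtered query instead of fetching everything and filtering locally
        matches = self._registry.search(project + "__")
        for entity in matches:
            self._cache[entity["guid"]] = entity
        return matches

    def get(self, guid):
        # Only go to the registry on a cache miss
        if guid not in self._cache:
            self._cache[guid] = self._registry.get(guid)
        return self._cache[guid]
```

In this sketch, listing a project costs one round trip, and any later lookup of an entity from that listing costs none.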