feat(ingest/bigquery): add option to enable/disable legacy sharded table support #6822

Merged
20 changes: 16 additions & 4 deletions docs/how/updating-datahub.md
@@ -5,12 +5,23 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
 ## Next

 ### Breaking Changes
+
+### Potential Downtime
+
+### Deprecations
+
+### Other notable Changes
+
+## 0.9.4
+
+### Breaking Changes
+
 - #6243 apache-ranger authorizer is no longer the core part of DataHub GMS, and it is shifted as plugin. Please refer updated documentation [Configuring Authorization with Apache Ranger](./configuring-authorization-with-apache-ranger.md#configuring-your-datahub-deployment) for configuring `apache-ranger-plugin` in DataHub GMS.
 - #6243 apache-ranger authorizer as plugin is not supported in DataHub Kubernetes deployment.
-- #6243 Authentication and Authorization plugins configuration are removed from [application.yml](../../metadata-service/factories/src/main/resources/application.yml). Refer documentation [Migration Of Plugins From application.yml](../plugins.md#migration-of-plugins-from-applicationyml) for migrating any existing custom plugins.
+- #6243 Authentication and Authorization plugins configuration are removed from [application.yml](../../metadata-service/factories/src/main/resources/application.yml). Refer documentation [Migration Of Plugins From application.yml](../plugins.md#migration-of-plugins-from-applicationyml) for migrating any existing custom plugins.
 - `datahub check graph-consistency` command has been removed. It was a beta API that we had considered but decided there are better solutions for this. So removing this.
 - `graphql_url` option of `powerbi-report-server` source deprecated as the options is not used.
-- #6789 biquery-source: sharded table support changes a bit and it will generate different id as before to make sure it does not clash with non-sharded table names. This means if stateful ingestion is enabled then old sharded tables will be recreated with new id and attached tags/glossary_terms/etc needs to be added again.
+- #6789 BigQuery ingestion: If `enable_legacy_sharded_table_support` is set to False, sharded table names will be suffixed with \_yyyymmdd to make sure they don't clash with non-sharded tables. This means if stateful ingestion is enabled then old sharded tables will be recreated with a new id and attached tags/glossary terms/etc will need to be added again. _This behavior is not enabled by default yet, but will be enabled by default in a future release._

### Potential Downtime

@@ -25,7 +36,7 @@ This file documents any backwards-incompatible changes in DataHub and assists pe

 ### Breaking Changes

-- The beta `datahub check graph-consistency` command has been removed.
+- The beta `datahub check graph-consistency` command has been removed.

 ### Potential Downtime

@@ -56,7 +67,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
 ## 0.9.1

 ### Breaking Changes
-- we have promoted `bigqery-beta` to `bigquery`. If you are using `bigquery-beta` then change your recipes to use the type `bigquery`
+
+- We have promoted `bigquery-beta` to `bigquery`. If you are using `bigquery-beta` then change your recipes to use the type `bigquery`.

 ### Potential Downtime

2 changes: 1 addition & 1 deletion metadata-ingestion/setup.py
@@ -361,7 +361,7 @@ def get_long_description():
     "types-pkg_resources",
     "types-six",
     "types-python-dateutil",
-    "types-requests",
+    "types-requests>=2.28.11.6",
     "types-toml",
     "types-PyMySQL",
     "types-PyYAML",
Original file line number Diff line number Diff line change
@@ -198,6 +198,8 @@ def __init__(self, ctx: PipelineContext, config: BigQueryV2Config):
         BigqueryTableIdentifier._BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX = (
             self.config.sharded_table_pattern
         )
+        if self.config.enable_legacy_sharded_table_support:
+            BigqueryTableIdentifier._BQ_SHARDED_TABLE_SUFFIX = ""

         set_dataset_urn_to_lower(self.config.convert_urns_to_lowercase)

Original file line number Diff line number Diff line change
@@ -8,7 +8,6 @@
 from dateutil import parser

 from datahub.emitter.mce_builder import make_dataset_urn
-from datahub.ingestion.source.bigquery_v2.common import BQ_SHARDED_TABLE_SUFFIX
 from datahub.utilities.parsing_util import (
     get_first_missing_key,
     get_first_missing_key_any,
@@ -81,6 +80,7 @@ class BigqueryTableIdentifier:
     invalid_chars: ClassVar[Set[str]] = {"$", "@"}
     _BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX: ClassVar[str] = "((.+)[_$])?(\\d{8})$"
     _BIGQUERY_WILDCARD_REGEX: ClassVar[str] = "((_(\\d+)?)\\*$)|\\*$"
+    _BQ_SHARDED_TABLE_SUFFIX: str = "_yyyymmdd"

     @staticmethod
     def get_table_and_shard(table_name: str) -> Tuple[str, Optional[str]]:
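
For readers unfamiliar with the default pattern, here is a minimal sketch of how `_BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX` can separate a base name from an eight-digit date shard. The helper name `split_table_and_shard` and its fallback for bare eight-digit names are assumptions made for illustration; this is not the body of `get_table_and_shard` from this PR.

```python
import re
from typing import Optional, Tuple

# Same pattern string as _BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX in the diff.
SHARDED_TABLE_REGEX = r"((.+)[_$])?(\d{8})$"


def split_table_and_shard(table_name: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (base_name, shard) or (table_name, None)."""
    match = re.match(SHARDED_TABLE_REGEX, table_name)
    if match is None:
        return table_name, None
    base, shard = match.group(2), match.group(3)
    # A bare eight-digit name has no prefix group; keep the full name then.
    return (base if base is not None else table_name), shard
```

Under this sketch, `split_table_and_shard("events_20230101")` yields `("events", "20230101")`, while a name without a date suffix comes back unchanged with `None` as the shard.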
@@ -134,7 +134,9 @@ def get_table_name(self) -> str:
             f"{self.project_id}.{self.dataset}.{self.get_table_display_name()}"
         )
         if self.is_sharded_table():
-            table_name = f"{table_name}{BQ_SHARDED_TABLE_SUFFIX}"
+            table_name = (
+                f"{table_name}{BigqueryTableIdentifier._BQ_SHARDED_TABLE_SUFFIX}"
+            )
         return table_name

     def is_sharded_table(self) -> bool:
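
The effect of moving the suffix from a module constant to a class attribute can be sketched with a simplified stand-in. The class `TableId`, its fields, and the attribute name `sharded_suffix` are hypothetical and only mirror the shape of `BigqueryTableIdentifier`; the real class has more logic than this.

```python
import re
from dataclasses import dataclass
from typing import ClassVar, Optional

_SHARD_RE = re.compile(r"((.+)[_$])?(\d{8})$")


@dataclass
class TableId:
    """Simplified, hypothetical stand-in for BigqueryTableIdentifier."""

    project_id: str
    dataset: str
    table: str

    # Class-level default, mirroring _BQ_SHARDED_TABLE_SUFFIX in the diff.
    sharded_suffix: ClassVar[str] = "_yyyymmdd"

    def _base_name(self) -> Optional[str]:
        """Return the name before the date shard, or None if there is no such prefix."""
        match = _SHARD_RE.match(self.table)
        return match.group(2) if match else None

    def get_table_name(self) -> str:
        base = self._base_name()
        if base is None:
            return f"{self.project_id}.{self.dataset}.{self.table}"
        # The suffix is read from the class, so overriding it once changes
        # the result for every instance in the process.
        return f"{self.project_id}.{self.dataset}.{base}{TableId.sharded_suffix}"
```

With the default suffix, `TableId("p", "d", "events_20230101").get_table_name()` gives `p.d.events_yyyymmdd`. After `TableId.sharded_suffix = ""`, which corresponds to what the source does when `enable_legacy_sharded_table_support` is true, the same call gives `p.d.events`, the same name an unsharded `events` table would get. Because the attribute lives on the class, the override is process-wide rather than per source instance.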
Original file line number Diff line number Diff line change
@@ -103,6 +103,11 @@ class BigQueryV2Config(BigQueryConfig, LineageConfig):
         description="Convert urns to lowercase.",
     )

+    enable_legacy_sharded_table_support: bool = Field(
+        default=True,
+        description="Use the legacy sharded table urn suffix added.",
+    )
Review comment (Collaborator): For now, maybe we can start printing deprecation warnings when this is set to true.

Reply (Collaborator): Yeah. We need to get this in asap. I didn't realize this PR was still outstanding.
+
     @root_validator(pre=False)
     def profile_default_settings(cls, values: Dict) -> Dict:
         # Extra default SQLAlchemy option for better connection pooling and threading.
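
The review suggestion to print a deprecation warning while the flag is true could look roughly like the sketch below. It uses a plain dataclass and the standard `warnings` module so that it runs without DataHub or pydantic installed; the class name `ShardingOptions` and the message wording are assumptions, and nothing here is code from this PR.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ShardingOptions:
    """Hypothetical stand-in for the relevant part of BigQueryV2Config."""

    enable_legacy_sharded_table_support: bool = True

    def __post_init__(self) -> None:
        # Warn only when the legacy behavior is active.
        if self.enable_legacy_sharded_table_support:
            warnings.warn(
                "enable_legacy_sharded_table_support=True keeps the legacy "
                "sharded table naming; the changelog states that the new "
                "naming will become the default in a future release.",
                DeprecationWarning,
            )
```

Constructing `ShardingOptions()` emits one `DeprecationWarning`, and `ShardingOptions(enable_legacy_sharded_table_support=False)` emits none. In the real config class the same check would more likely live in a pydantic validator, which is left out here to keep the sketch dependency-free.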
Original file line number Diff line number Diff line change
@@ -11,8 +11,6 @@
 BQ_EXTERNAL_TABLE_URL_TEMPLATE = "https://console.cloud.google.com/bigquery?project={project}&ws=!1m5!1m4!4m3!1s{project}!2s{dataset}!3s{table}"
 BQ_EXTERNAL_DATASET_URL_TEMPLATE = "https://console.cloud.google.com/bigquery?project={project}&ws=!1m4!1m3!3m2!1s{project}!2s{dataset}"

-BQ_SHARDED_TABLE_SUFFIX = "_yyyymmdd"
-

 def _make_gcp_logging_client(
     project_id: Optional[str] = None, extra_client_options: Dict[str, Any] = {}
8 changes: 4 additions & 4 deletions metadata-ingestion/src/datahub/ingestion/source/mode.py
@@ -68,9 +68,9 @@ class ModeConfig(DatasetLineageProviderConfigBase):
     connect_uri: str = Field(
         default="https://app.mode.com", description="Mode host URL."
     )
-    token: Optional[str] = Field(default=None, description="Mode user token.")
-    password: Optional[pydantic.SecretStr] = Field(
-        default=None, description="Mode password for authentication."
+    token: str = Field(description="Mode user token.")
+    password: pydantic.SecretStr = Field(
+        description="Mode password for authentication."
     )
     workspace: Optional[str] = Field(default=None, description="")
     default_schema: str = Field(
@@ -172,7 +172,7 @@ def __init__(self, ctx: PipelineContext, config: ModeConfig):
         self.session = requests.session()
         self.session.auth = HTTPBasicAuth(
             self.config.token,
-            self.config.password.get_secret_value() if self.config.password else None,
+            self.config.password.get_secret_value(),
         )
         self.session.headers.update(
             {