Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cli): Make consistent use of DataHubGraphClientConfig #10466

Merged
merged 7 commits into from
Jul 25, 2024

Conversation

pedro93
Copy link
Collaborator

@pedro93 pedro93 commented May 9, 2024

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes #9705

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Enhanced configuration management for the Datahub CLI, allowing users to set SSL verification preferences via the ~/.datahubenv file.
    • Introduced new utility functions for environment variable handling and client configuration, improving setup and maintenance.
  • Bug Fixes

    • Removed deprecated functions, streamlining the interaction with the DataHub API.
  • Refactor

    • Major refactoring of CLI utilities to improve environment variable management and session handling.
    • Simplified configuration handling by removing unnecessary classes and functions, enhancing clarity and usability.
  • Tests

    • Updated test cases to align with new client-based approach for entity interactions and configuration management.

@@ -541,6 +546,9 @@ def _relationships_endpoint(self):
def _aspect_count_endpoint(self):
return f"{self.config.server}/aspects?action=getCount"

def _session(self) -> Session:
return super()._session
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look right

I don't think it's necessary, unless you want _session to be a method instead of an attr

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able in the DataHubGraph object to pull attributes from the super class, so I built this function to surface that. At least the auto-import in Intellij wasn't finding it.

What would be the right way to surface this super's attribute?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to do graph._session without any changes

if you really want a method, add a @property decorator

return get_url_and_token()[1]


def get_session_and_host():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to reduce the amount of code changes, you could make this call get_default_graph() and then return the tuple

basically keep it as a compatibility shim

Copy link
Collaborator Author

@pedro93 pedro93 May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing that causes a cyclical import dependency. It was my first implementation attempt.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can put the import inside the get_session_and_host method to avoid that issue

@jjoyce0510
Copy link
Collaborator

This is a GREAT contribution for code quality. @hsheth2 what needs to happen to get this in?

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor nits, but otherwise lgtm

apologies for not reviewing this earlier - I thought it was still in progress given the WIP title

logger.debug(
f"Failed to retrieve config from file {config_utils.DATAHUB_CONFIG_PATH}: {e}. This isn't fatal."
)
config_dict = {**previous_config, **config.dict()}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this merge logic is strange and probably hurts correctness

Copy link
Collaborator Author

@pedro93 pedro93 Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function already existed in metadata-ingestion/src/datahub/cli/config_utils.py, I simply moved it to the client. It is used when users call datahub init when users fill out a form to configure connectivity to DataHub. A user will provide partial information like the GMS host & token, that information should merge with other configs like retry_max_times.

What is your concern with correctness here? Can you provide more details?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's existing behavior then we can leave it as-is. however, I do think we should be overwriting here - having the write_gms_config method not fully overwrite is highly unexpected

Copy link
Contributor

coderabbitai bot commented Jul 25, 2024

Walkthrough

The updates to the Datahub CLI enhance configuration management, particularly with the ~/.datahubenv file, ensuring seamless interaction with SSL settings. The code refactoring improves environment variable management, integrates client instantiation, and simplifies function signatures across various modules. Overall, these modifications enhance usability, flexibility, and security when using the CLI with remote hosts.

Changes

Files Change Summary
docs/how/updating-datahub.md Enhanced configuration management for SSL settings in ~/.datahubenv.
metadata-ingestion/src/datahub/cli/*.py Refactored environment variable management, removed deprecated functions, and improved session handling.
metadata-ingestion/src/datahub/cli/config_utils.py Removed classes for simpler configuration handling.
metadata-ingestion/src/datahub/cli/delete_cli.py Introduced client connection for enhanced metadata deletion.
metadata-ingestion/src/datahub/ingestion/*.py Updated client instantiation and configuration handling across various modules.
metadata-ingestion/src/datahub/upgrade/upgrade.py Refactored server configuration management, making token optional.
smoke-test/tests/*.py Adjusted tests to utilize the new client instantiation methods.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Environment
    participant DatahubAPI

    User->>CLI: Execute CLI Command
    CLI->>Environment: Read ~/.datahubenv
    Environment-->>CLI: Get Configurations
    CLI->>DatahubAPI: Call with Configurations
    DatahubAPI-->>CLI: Response
    CLI-->>User: Display Result
Loading

Assessment against linked issues

Objective Addressed Explanation
Cannot use Datahub CLI with remote host (SSL verification) (#9705)

🐇 In the meadow, changes dance,
New settings grant a second chance.
With SSL now in harmony,
The CLI sings, oh, joy and glee!
Hops of joy for code so bright,
Datahub shines, a guiding light! ✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@pedro93 pedro93 changed the title [WIP] feat(cli): Make consistent use of DataHubGraphClientConfig feat(cli): Make consistent use of DataHubGraphClientConfig Jul 25, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/cli/config_utils.py (1)

33-38: Ensure proper handling of None return type from get_client_config.

The function get_client_config might return None if there is an error in loading the YAML configuration. However, the following usages do not handle this case and assume it always returns a dictionary:

  • metadata-ingestion/src/datahub/cli/lite_cli.py
    • Lines around client_config_dict = get_client_config()
    • Lines around cli_config = get_client_config()
  • metadata-ingestion/src/datahub/ingestion/graph/client.py
    • Lines around client_config_dict = config_utils.get_client_config()
    • Lines around previous_config = config_utils.get_client_config()

Please add checks to handle the None return type to avoid potential runtime errors.

Analysis chain

LGTM! But verify the function usage in the codebase.

The code changes are approved.

However, ensure that all function calls to get_client_config are correctly updated to handle the new return type.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `get_client_config` are correctly updated to handle the new return type.

# Test: Search for the function usage. Expect: Only occurrences of the new return type handling.
rg --type python -A 5 $'get_client_config'

Length of output: 72


Script:

#!/bin/bash
# Description: Verify all function calls to `get_client_config` are correctly updated to handle the new return type.

# Test: Search for the function usage. Expect: Only occurrences of the new return type handling.
rg --type py -A 5 $'get_client_config'

Length of output: 3313


Script:

#!/bin/bash
# Description: Verify that the return value of `get_client_config` is handled properly in the identified files.

# Check for handling of None return type in lite_cli.py
rg --type py -A 5 'get_client_config' metadata-ingestion/src/datahub/cli/lite_cli.py

# Check for handling of None return type in client.py
rg --type py -A 5 'get_client_config' metadata-ingestion/src/datahub/ingestion/graph/client.py

Length of output: 1346

metadata-ingestion/src/datahub/ingestion/graph/client.py (1)

1791-1806: Consider adding type hints for variables.

Adding type hints for host, port, token, protocol, and url can improve code readability and maintainability.

def get_details_from_env() -> Tuple[Optional[str], Optional[str]]:
    host: Optional[str] = os.environ.get(ENV_METADATA_HOST)
    port: Optional[str] = os.environ.get(ENV_METADATA_PORT)
    token: Optional[str] = os.environ.get(ENV_METADATA_TOKEN)
    protocol: str = os.environ.get(ENV_METADATA_PROTOCOL, "http")
    url: Optional[str] = os.environ.get(ENV_METADATA_HOST_URL)
    if port is not None:
        url = f"{protocol}://{host}:{port}"
        return url, token
    # The reason for using host as URL is backward compatibility
    # If port is not being used we assume someone is using host env var as URL
    if url is None and host is not None:
        logger.warning(
            f"Do not use {ENV_METADATA_HOST} as URL. Use {ENV_METADATA_HOST_URL} instead"
        )
    return url or host, token
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f4fb89e and 863df15.

Files selected for processing (26)
  • docs/how/updating-datahub.md (1 hunks)
  • metadata-ingestion/src/datahub/cli/cli_utils.py (9 hunks)
  • metadata-ingestion/src/datahub/cli/config_utils.py (2 hunks)
  • metadata-ingestion/src/datahub/cli/delete_cli.py (2 hunks)
  • metadata-ingestion/src/datahub/cli/get_cli.py (2 hunks)
  • metadata-ingestion/src/datahub/cli/ingest_cli.py (4 hunks)
  • metadata-ingestion/src/datahub/cli/lite_cli.py (3 hunks)
  • metadata-ingestion/src/datahub/cli/migrate.py (6 hunks)
  • metadata-ingestion/src/datahub/cli/migration_utils.py (3 hunks)
  • metadata-ingestion/src/datahub/cli/put_cli.py (1 hunks)
  • metadata-ingestion/src/datahub/cli/timeline_cli.py (2 hunks)
  • metadata-ingestion/src/datahub/entrypoints.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/graph/client.py (5 hunks)
  • metadata-ingestion/src/datahub/ingestion/run/pipeline_config.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/sink/datahub_rest.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/metadata/lineage.py (2 hunks)
  • metadata-ingestion/src/datahub/upgrade/upgrade.py (3 hunks)
  • metadata-ingestion/tests/unit/test_cli_utils.py (5 hunks)
  • smoke-test/tests/cli/datahub_cli.py (5 hunks)
  • smoke-test/tests/cli/datahub_graph_test.py (2 hunks)
  • smoke-test/tests/delete/delete_test.py (4 hunks)
  • smoke-test/tests/lineage/test_lineage.py (3 hunks)
  • smoke-test/tests/patch/test_dataset_patches.py (7 hunks)
  • smoke-test/tests/telemetry/telemetry_test.py (3 hunks)
  • smoke-test/tests/timeline/timeline_test.py (1 hunks)
  • smoke-test/tests/utils.py (2 hunks)
Files skipped from review due to trivial changes (1)
  • metadata-ingestion/src/datahub/entrypoints.py
Additional context used
Markdownlint
docs/how/updating-datahub.md

83-83: null
Bare URL used

(MD034, no-bare-urls)

Additional comments not posted (66)
metadata-ingestion/src/datahub/cli/config_utils.py (2)

Line range hint 15-18:
LGTM!

The function correctly persists the configuration dictionary to a YAML file.


21-22: LGTM!

The function correctly checks the environment variable to determine if the configuration should be skipped.

smoke-test/tests/telemetry/telemetry_test.py (2)

13-22: LGTM!

The function correctly initializes a client and enhances the context in which aspects are fetched.


31-40: LGTM!

The function correctly initializes a client and enhances the context in which aspects are fetched.

metadata-ingestion/src/datahub/cli/get_cli.py (1)

48-58: LGTM!

The function correctly initializes a client and enhances the context in which aspects are fetched.

metadata-ingestion/tests/unit/test_cli_utils.py (8)

20-20: LGTM!

The change to use get_details_from_env() directly is correct.


27-27: LGTM!

The change to use get_details_from_env() directly is correct.


39-39: LGTM!

The change to use get_details_from_env() directly is correct.


52-52: LGTM!

The change to use get_details_from_env() directly is correct.


62-62: LGTM!

The change to use get_details_from_env() directly is correct.


Line range hint 29-34: LGTM!

The change to use get_default_graph() and client.get_aspect() is correct.


41-41: LGTM!

The change to use client.get_aspect() is correct.


61-62: LGTM!

The change to use client.get_aspect() is correct.

metadata-ingestion/src/datahub/cli/put_cli.py (1)

49-50: LGTM!

The change to instantiate the client using get_default_graph() is correct.

smoke-test/tests/cli/datahub_cli.py (8)

29-34: LGTM!

The change to use get_default_graph() and client.get_aspect() is correct.


41-41: LGTM!

The change to use client.get_aspect() is correct.


61-62: LGTM!

The change to use client.get_aspect() is correct.


74-76: LGTM!

The change to use get_default_graph() and client.get_aspect() is correct.


81-81: LGTM!

The change to use client.get_aspect() is correct.


90-90: LGTM!

The change to use client.get_aspect() is correct.


106-107: LGTM!

The change to use client.get_aspect() is correct.


109-109: LGTM!

The change to use client.get_aspect() is correct.

smoke-test/tests/delete/delete_test.py (4)

69-73: LGTM! Verify the get_aspects_for_entity function usage.

The repeated call to get_aspects_for_entity with the new parameters is consistent with the previous context.

However, ensure that the get_aspects_for_entity function correctly handles these parameters.


88-99: LGTM! Verify the get_aspects_for_entity function usage.

The final assertions using get_aspects_for_entity with the new parameters are consistent with the previous context.

However, ensure that the get_aspects_for_entity function correctly handles these parameters.


41-43: LGTM! Verify the new client object usage.

The introduction of the new client object and deriving the session and host from it enhances encapsulation and clarity.

However, ensure that the get_datahub_graph function correctly initializes the client object and its attributes.

Verification successful

LGTM! Verify the new client object usage.

The introduction of the new client object and deriving the session and host from it enhances encapsulation and clarity. The get_datahub_graph function correctly initializes the client object and its attributes.

  • smoke-test/tests/delete/delete_test.py: Lines 41-43
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `get_datahub_graph` function initializes the client object correctly.

# Test: Search for the function implementation. Expect: Correct initialization of the client object.
rg --type python -A 10 $'def get_datahub_graph'

Length of output: 79


Script:

#!/bin/bash
# Description: Verify the `get_datahub_graph` function initializes the client object correctly.

# Test: Search for the function implementation. Expect: Correct initialization of the client object.
rg --type py -A 10 $'def get_datahub_graph'

Length of output: 575


Script:

#!/bin/bash
# Description: Verify the `get_default_graph` function initializes the client object correctly.

# Test: Search for the function implementation. Expect: Correct initialization of the client object.
rg --type py -A 10 $'def get_default_graph'

Length of output: 901


Script:

#!/bin/bash
# Description: Verify the `load_client_config` function provides the correct configuration for the `DataHubGraph`.

# Test: Search for the function implementation. Expect: Correct configuration loading.
rg --type py -A 10 $'def load_client_config'

Length of output: 1116


Script:

#!/bin/bash
# Description: Verify the `config_utils.get_client_config` function retrieves the correct configuration.

# Test: Search for the function implementation. Expect: Correct configuration retrieval.
rg --type py -A 10 $'def get_client_config'

Length of output: 692


47-58: LGTM! Verify the get_aspects_for_entity function usage.

The updated calls to get_aspects_for_entity now include the session and gms_host parameters, which improves context.

However, ensure that the get_aspects_for_entity function correctly handles these parameters.

Verification successful

Verified: The get_aspects_for_entity function in metadata-ingestion/src/datahub/cli/cli_utils.py correctly handles the session and gms_host parameters.

  • The function passes these parameters to other functions like get_entity and get_latest_timeseries_aspect_values, ensuring they are used appropriately within the function.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `get_aspects_for_entity` function handles the new parameters correctly.

# Test: Search for the function implementation. Expect: Correct handling of `session` and `gms_host`.
rg --type python -A 10 $'def get_aspects_for_entity'

Length of output: 84


Script:

#!/bin/bash
# Description: Verify the `get_aspects_for_entity` function handles the new parameters correctly.

# Locate Python files and search for the function definition within them.
fd -e py -x rg 'def get_aspects_for_entity' {} +

Length of output: 48689


Script:

#!/bin/bash
# Description: Extract and review the `get_aspects_for_entity` function definitions from the identified files.

# Extract the function definition from the first file.
rg -A 20 'def get_aspects_for_entity' metadata-ingestion/src/datahub/ingestion/graph/client.py

# Extract the function definition from the second file.
rg -A 20 'def get_aspects_for_entity' metadata-ingestion/src/datahub/cli/cli_utils.py

Length of output: 2370

smoke-test/tests/cli/datahub_graph_test.py (2)

1-2: Import Optional from typing.

The import of Optional is necessary for the updated type annotation.


41-43: LGTM! Verify the new client object usage and type annotation.

The renaming of the graph variable to client enhances clarity. The updated type annotation for schema_metadata handles optional return types.

However, ensure that the DataHubGraph initialization and get_aspect_v2 method handle the new client object and optional return type correctly.

Verification successful

Verified: The new client object usage and type annotation are correctly handled.

The DataHubGraph initialization and get_aspect_v2 method correctly handle the new client object and the optional return type.

  • DataHubGraph class implementation: metadata-ingestion/src/datahub/ingestion/graph/client.py
  • get_aspect_v2 method definition and usage: metadata-ingestion/src/datahub/ingestion/graph/client.py, smoke-test/tests/cli/datahub_graph_test.py, and other test files.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `DataHubGraph` initialization and `get_aspect_v2` method handle the new client object and optional return type correctly.

# Test: Search for the class implementation and method usage. Expect: Correct initialization and handling of optional return type.
rg --type python -A 10 $'class DataHubGraph'
rg --type python -A 10 $'def get_aspect_v2'

Length of output: 152


Script:

#!/bin/bash
# Description: Verify the `DataHubGraph` initialization and `get_aspect_v2` method handle the new client object and optional return type correctly.

# Search for the class implementation of DataHubGraph
rg 'class DataHubGraph' -A 10

# Search for the definition of the get_aspect_v2 method
rg 'def get_aspect_v2' -A 10

# Search for the usage of the get_aspect_v2 method to ensure it handles the optional return type correctly
rg 'get_aspect_v2' -A 10

Length of output: 15064

metadata-ingestion/src/datahub/ingestion/run/pipeline_config.py (3)

3-7: Import root_validator and validator from pydantic.

The import of root_validator and validator is necessary for the new validators.


106-121: LGTM! Verify the new root validator implementation.

The new root validator default_sink_is_datahub_rest sets a default sink configuration if none is provided, ensuring environment variables are resolved appropriately.

However, ensure that the load_client_config function and config_loader.resolve_env_variables method handle the configuration correctly.

Verification successful

Verified: The load_client_config function and config_loader.resolve_env_variables method handle the configuration correctly.

  • load_client_config is responsible for loading and parsing the client configuration, and it handles validation errors appropriately.
  • resolve_env_variables recursively resolves environment variables within a configuration dictionary.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `load_client_config` function and `config_loader.resolve_env_variables` method handle the configuration correctly.

# Test: Search for the function and method implementations. Expect: Correct handling of the configuration.
rg --type python -A 10 $'def load_client_config'
rg --type python -A 10 $'def resolve_env_variables'

Length of output: 164


Script:

#!/bin/bash
# Description: Verify the `load_client_config` function and `config_loader.resolve_env_variables` method handle the configuration correctly.

# Test: Search for the function and method implementations. Expect: Correct handling of the configuration.
rg -A 10 $'def load_client_config'
rg -A 10 $'def resolve_env_variables'

Length of output: 2292


123-133: LGTM! Verify the new validator implementation.

The new validator datahub_api_should_use_rest_sink_as_default ensures that a valid API configuration is always provided when using the default sink.

However, ensure that the DatahubClientConfig.parse_obj_allow_extras method handles the configuration correctly.

Verification successful

The DatahubClientConfig.parse_obj_allow_extras method correctly handles the configuration by allowing extra fields temporarily and then parsing the object.

  • The method implementation ensures compatibility with both Pydantic version 2 and earlier versions.
  • The configuration is correctly handled by temporarily modifying the model configuration to allow extra fields.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the `DatahubClientConfig.parse_obj_allow_extras` method handles the configuration correctly.

# Test: Search for the method implementation. Expect: Correct handling of the configuration.
rg --type python -A 10 $'def parse_obj_allow_extras'

Length of output: 84


Script:

#!/bin/bash
# Description: Verify the `DatahubClientConfig.parse_obj_allow_extras` method handles the configuration correctly.

# Test: Search for the method implementation. Expect: Correct handling of the configuration.
rg -A 10 'def parse_obj_allow_extras'

Length of output: 1099


Script:

#!/bin/bash
# Description: Verify the `DatahubClientConfig.parse_obj_allow_extras` method handles the configuration correctly.

# Test: Search for the complete method implementation. Expect: Correct handling of the configuration.
rg -A 20 'def parse_obj_allow_extras'

Length of output: 2120

smoke-test/tests/utils.py (1)

123-123: LGTM! Simplifies the function implementation.

The change to use get_default_graph simplifies the function by reducing dependency on explicit configuration.

However, ensure that get_default_graph is correctly used throughout the codebase.

Verification successful

LGTM! Simplifies the function implementation.

The change to use get_default_graph simplifies the function by reducing dependency on explicit configuration. The usage of get_default_graph is consistent and correct throughout the codebase.

  • smoke-test/tests/utils.py
  • smoke-test/tests/telemetry/telemetry_test.py
  • smoke-test/tests/lineage/test_lineage.py
  • smoke-test/tests/patch/test_dataset_patches.py
  • smoke-test/tests/cli/datahub_cli.py
  • smoke-test/tests/cli/user_groups_cmd/test_group_cmd.py
  • metadata-ingestion/src/datahub/entrypoints.py
  • metadata-ingestion/src/datahub/ingestion/graph/client.py
  • metadata-ingestion/src/datahub/ingestion/source/metadata/lineage.py
  • metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py
  • metadata-ingestion/src/datahub/ingestion/run/pipeline.py
  • metadata-ingestion/src/datahub/integrations/assertion/common.py
  • metadata-ingestion/src/datahub/cli/put_cli.py
  • metadata-ingestion/src/datahub/cli/migrate.py
  • metadata-ingestion/src/datahub/cli/ingest_cli.py
  • metadata-ingestion/src/datahub/cli/migration_utils.py
  • metadata-ingestion/src/datahub/cli/timeline_cli.py
  • metadata-ingestion/src/datahub/api/entities/forms/forms.py
  • metadata-ingestion/src/datahub/api/entities/structuredproperties/structuredproperties.py
  • metadata-ingestion/src/datahub/cli/delete_cli.py
  • metadata-ingestion/src/datahub/cli/state_cli.py
  • metadata-ingestion/src/datahub/cli/get_cli.py
  • metadata-ingestion/src/datahub/cli/specific/datacontract_cli.py
  • metadata-ingestion/src/datahub/cli/specific/structuredproperties_cli.py
  • metadata-ingestion/src/datahub/cli/specific/forms_cli.py
  • metadata-ingestion/src/datahub/cli/specific/assertions_cli.py
  • metadata-ingestion/src/datahub/cli/specific/user_cli.py
  • metadata-ingestion/src/datahub/cli/exists_cli.py
  • metadata-ingestion/src/datahub/cli/specific/group_cli.py
  • metadata-ingestion/src/datahub/cli/specific/dataset_cli.py
  • metadata-ingestion/src/datahub/cli/check_cli.py
  • metadata-ingestion/src/datahub/cli/specific/dataproduct_cli.py
  • metadata-ingestion/examples/perf/lineage_perf_example.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_default_graph` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage of `get_default_graph`.
rg --type python -A 5 $'get_default_graph'

Length of output: 72


Script:

#!/bin/bash
# Description: Verify the usage of `get_default_graph` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage of `get_default_graph`.
rg --glob '*.py' -A 5 'get_default_graph'

Length of output: 52166

smoke-test/tests/timeline/timeline_test.py (1)

182-188: LGTM! Improves session and server configuration management.

The change to use get_datahub_graph improves the function's ability to manage sessions and server configurations dynamically.

However, ensure that get_datahub_graph is correctly used throughout the codebase.

Verification successful

LGTM! Improves session and server configuration management.

The change to use get_datahub_graph improves the function's ability to manage sessions and server configurations dynamically. The usage of get_datahub_graph is consistent and correct throughout the codebase.

  • smoke-test/tests/utils.py
  • smoke-test/tests/timeline/timeline_test.py
  • smoke-test/tests/delete/delete_test.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_datahub_graph` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage of `get_datahub_graph`.
rg --type python -A 5 $'get_datahub_graph'

Length of output: 72


Script:

#!/bin/bash
# Description: Verify the usage of `get_datahub_graph` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage of `get_datahub_graph`.
rg -t py -A 5 'get_datahub_graph'

Length of output: 5723

metadata-ingestion/src/datahub/cli/timeline_cli.py (1)

66-68: LGTM! Improves session and server configuration management.

The change to use get_default_graph improves the function's ability to manage sessions and server configurations dynamically.

However, ensure that get_default_graph is correctly used throughout the codebase.

Verification successful

Verified: Consistent usage of get_default_graph

The function get_default_graph is consistently used across the codebase, confirming its correct integration and usage.

  • Lines in metadata-ingestion/src/datahub/cli/timeline_cli.py are correctly utilizing get_default_graph.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_default_graph` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage of `get_default_graph`.
rg --type python -A 5 $'get_default_graph'

Length of output: 72


Script:

#!/bin/bash
# Description: Verify the usage of `get_default_graph` in the codebase.

# Locate all Python files and search for the function usage.
fd -e py -x rg 'get_default_graph' {}

Length of output: 4470

metadata-ingestion/src/datahub/cli/migration_utils.py (3)

246-246: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured.


276-288: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The return type change to Iterable[RelatedEntity] improves type safety.


292-304: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The return type change to Iterable[RelatedEntity] improves type safety.

metadata-ingestion/src/datahub/ingestion/source/metadata/lineage.py (1)

214-215: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured.

smoke-test/tests/patch/test_dataset_patches.py (5)

75-84: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The additional assertions improve test robustness.


100-101: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The additional assertions improve test robustness.


119-120: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The additional assertions improve test robustness.


158-158: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The additional assertions improve test robustness.


219-219: LGTM!

The change to use get_default_graph ensures that the client instance is correctly configured. The additional assertions improve test robustness.

metadata-ingestion/src/datahub/ingestion/sink/datahub_rest.py (1)

Line range hint 51-51:
Verify the removal of set_env_variables_override_config.

The removal of set_env_variables_override_config suggests a change in how environment variables are managed. Ensure that this functionality is no longer needed or is handled elsewhere.

metadata-ingestion/src/datahub/cli/lite_cli.py (2)

48-50: Verify the new usage of get_client_config.

The change in the function call to get_client_config suggests a shift in how the client configuration is retrieved. Ensure that the new usage is appropriate and does not introduce any issues.


312-314: Verify the new usage of get_client_config.

The change in the function call to get_client_config suggests a shift in how the client configuration is retrieved. Ensure that the new usage is appropriate and does not introduce any issues.

metadata-ingestion/src/datahub/cli/cli_utils.py (5)

97-100: Verify the new parameters in post_rollback_endpoint.

The addition of session and gms_host parameters suggests a shift towards explicit dependency injection. Ensure that the new parameters are used appropriately and do not introduce any issues.


133-136: Verify the new parameters in get_entity.

The addition of session and gms_host parameters suggests a shift towards explicit dependency injection. Ensure that the new parameters are used appropriately and do not introduce any issues.


161-165: Verify the new parameters in post_entity.

The addition of session and gms_host parameters suggests a shift towards explicit dependency injection. Ensure that the new parameters are used appropriately and do not introduce any issues.


207-210: Verify the new parameters in get_latest_timeseries_aspect_values.

The addition of session and gms_host parameters suggests a shift towards explicit dependency injection. Ensure that the new parameters are used appropriately and do not introduce any issues.


230-233: Verify the new parameters in get_aspects_for_entity.

The addition of session and gms_host parameters suggests a shift towards explicit dependency injection. Ensure that the new parameters are used appropriately and do not introduce any issues.

metadata-ingestion/src/datahub/upgrade/upgrade.py (2)

130-132: LGTM! But verify the function usage in the codebase.

The changes to use load_client_config instead of cli_utils.get_url_and_token enhance maintainability and consistency.

However, ensure that all function calls to get_server_version_stats handle the new approach correctly.

Verification successful

LGTM! But verify the function usage in the codebase.

The changes to use load_client_config instead of cli_utils.get_url_and_token enhance maintainability and consistency.

However, ensure that all function calls to get_server_version_stats handle the new approach correctly.

  • The function get_server_version_stats is only used within the same file where it is defined (metadata-ingestion/src/datahub/upgrade/upgrade.py).
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `get_server_version_stats` handle the new approach correctly.

# Test: Search for the function usage. Expect: Only occurances of the new approach.
rg --type python -A 5 $'get_server_version_stats'

Length of output: 79


Script:

#!/bin/bash
# Description: Verify all function calls to `get_server_version_stats` handle the new approach correctly.

# Test: Search for the function usage. Expect: Only occurrences of the new approach.
rg --type py -A 5 'get_server_version_stats'

Length of output: 1052


103-113: LGTM! But verify the function usage in the codebase.

The changes to make the token parameter optional enhance flexibility. The conditional construction of headers is a good practice.

However, ensure that all function calls to get_server_config handle the optional token parameter correctly.

metadata-ingestion/src/datahub/cli/migrate.py (2)

152-162: LGTM!

The changes to use the graph object for retrieving URNs and aspects enhance maintainability and consistency.


406-435: LGTM!

The addition of the batch_get_ids function improves efficiency and enhances code clarity with type hints.

metadata-ingestion/src/datahub/cli/ingest_cli.py (3)

430-432: LGTM!

The changes to use the get_default_graph method for obtaining the client instance enhance encapsulation and streamline the process of interacting with the DataHub API.


481-483: LGTM!

The changes to use the get_default_graph method for obtaining the client instance enhance encapsulation and streamline the process of interacting with the DataHub API.


Line range hint 531-549:
LGTM!

The changes to use the get_default_graph method for obtaining the client instance and the removal of the connectivity check enhance encapsulation and streamline the process of interacting with the DataHub API.

metadata-ingestion/src/datahub/cli/delete_cli.py (2)

126-126: Initialize the client correctly.

The client initialization using get_default_graph() is correct and enhances modularity.


143-145: Ensure the client session and server configuration are correctly used.

The updated function call to cli_utils.post_rollback_endpoint correctly uses the client session and server configuration, enhancing configurability.

docs/how/updating-datahub.md (1)

83-83: Document the new configuration functionality clearly.

The documentation accurately describes the new functionality, allowing settings in ~/.datahubenv to be respected across all CLI calls. Ensure users are aware of the new disable_ssl_verification setting.

Tools
Markdownlint

83-83: null
Bare URL used

(MD034, no-bare-urls)

smoke-test/tests/lineage/test_lineage.py (2)

848-848: Initialize the graph correctly.

The graph initialization using get_default_graph() is correct and simplifies the code.


893-893: Initialize the graph correctly.

The graph initialization using get_default_graph() is correct and simplifies the code.

metadata-ingestion/src/datahub/ingestion/graph/client.py (3)

93-97: LGTM! Environment variable constants are correctly defined.

The new environment variable constants are straightforward and correctly defined.


1838-1845: LGTM! The function is straightforward and includes proper logging.

The ensure_datahub_config function is well-structured and includes proper logging for user feedback.


1847-1864: Review the merge logic for potential correctness issues.

The merge logic for the configuration dictionary might hurt correctness. Ensure that the merging process does not overwrite necessary configurations.

metadata-ingestion/src/datahub/ingestion/graph/client.py Outdated Show resolved Hide resolved
Comment on lines +51 to +54
# TODO: Replace with client.emit, requires figuring out the correct subsclass of _Aspect to create from the data
status = post_entity(
client._session,
client.config.server,
Copy link
Contributor

@coderabbitai coderabbitai bot Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder: Address the TODO comment.

The TODO comment indicates that further improvements are planned.

Do you want me to generate the code to replace post_entity with client.emit or open a GitHub issue to track this task?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, generate the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unable to process this comment: maximum number of iterations reached.

Analysis chain



Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for subclasses of _Aspect in the datahub.metadata.schema_classes module.
ast-grep --lang python --pattern $'class $_(_Aspect):' --json

Length of output: 65


Script:

#!/bin/bash
# Search for classes in the datahub.metadata.schema_classes module to identify potential aspect classes.
ast-grep --lang python --pattern $'class $_:' --json

Length of output: 56


Script:

#!/bin/bash
# Search for all class definitions in the datahub directory to identify potential aspect classes.
ast-grep --lang python --pattern $'class $_:' --json

Length of output: 56


Script:

#!/bin/bash
# Use rg to search for class definitions in the datahub directory to identify potential aspect classes.
rg 'class ' --glob '**/*.py'

Length of output: 189687


Script:

#!/bin/bash
# Examine the datahub.metadata.schema_classes module and its imports to identify the correct aspect classes.
rg 'import' metadata-ingestion/src/datahub/metadata/schema_classes.py
rg 'from' metadata-ingestion/src/datahub/metadata/schema_classes.py

Length of output: 333

@pedro93 pedro93 enabled auto-merge (squash) July 25, 2024 18:47
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (3)
metadata-ingestion/src/datahub/ingestion/graph/client.py (3)

1791-1806: Improve the logger warning message.

The logger warning should be more explicit about the potential issues with using the host variable as a URL directly.

- logger.warning(f"Do not use {ENV_METADATA_HOST} as URL. Use {ENV_METADATA_HOST_URL} instead")
+ logger.warning(f"Using {ENV_METADATA_HOST} as URL is deprecated and may lead to unexpected behavior. Please use {ENV_METADATA_HOST_URL} instead.")

595-597: Remove the commented-out function.

The commented-out function is not needed and should be removed to improve code readability.

- # def _session(self) -> Session:
- #    return super()._session

1809-1833: Ensure config_override is defined.

The config_override dictionary is used without being defined in the function or elsewhere in the main codebase, which might lead to issues if it is not defined. Please define or import config_override to avoid runtime errors.

  • File: metadata-ingestion/src/datahub/ingestion/graph/client.py
  • Lines: 1809-1833
Analysis chain

Ensure config_override is defined.

The config_override dictionary is used without being defined in the function, which might lead to issues if it is not defined elsewhere in the code.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that `config_override` is defined elsewhere in the codebase.

# Test: Search for the definition of `config_override`. Expect: Definition of `config_override`.
rg --type python 'config_override\s*='

Length of output: 71


Script:

#!/bin/bash
# Description: Verify that `config_override` is defined elsewhere in the codebase.

# Test: Search for the definition of `config_override`. Expect: Definition of `config_override`.
rg -t py 'config_override\s*='

Length of output: 249

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 863df15 and dc362f5.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/graph/client.py (5 hunks)
Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/graph/client.py (3)

1836-1843: LGTM!

The code changes are approved.


1784-1786: LGTM!

The code changes are approved.


1778-1782: LGTM!

The code changes are approved.

Comment on lines +1845 to +1862
def write_gms_config(
host: str, token: Optional[str], merge_with_previous: bool = True
) -> None:
config = DatahubConfig(gms=DatahubClientConfig(server=host, token=token))
if merge_with_previous:
try:
previous_config = config_utils.get_client_config()
assert isinstance(previous_config, dict)
except Exception as e:
# ok to fail on this
previous_config = {}
logger.debug(
f"Failed to retrieve config from file {config_utils.DATAHUB_CONFIG_PATH}: {e}. This isn't fatal."
)
config_dict = {**previous_config, **config.dict()}
else:
config_dict = config.dict()
config_utils.persist_datahub_config(config_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review the merge logic.

The merge logic might lead to unexpected behavior if the existing configuration contains conflicting settings. Consider overwriting the existing configuration by default.

- config_dict = {**previous_config, **config.dict()}
+ config_dict = {**config.dict(), **previous_config}
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def write_gms_config(
host: str, token: Optional[str], merge_with_previous: bool = True
) -> None:
config = DatahubConfig(gms=DatahubClientConfig(server=host, token=token))
if merge_with_previous:
try:
previous_config = config_utils.get_client_config()
assert isinstance(previous_config, dict)
except Exception as e:
# ok to fail on this
previous_config = {}
logger.debug(
f"Failed to retrieve config from file {config_utils.DATAHUB_CONFIG_PATH}: {e}. This isn't fatal."
)
config_dict = {**previous_config, **config.dict()}
else:
config_dict = config.dict()
config_utils.persist_datahub_config(config_dict)
def write_gms_config(
host: str, token: Optional[str], merge_with_previous: bool = True
) -> None:
config = DatahubConfig(gms=DatahubClientConfig(server=host, token=token))
if merge_with_previous:
try:
previous_config = config_utils.get_client_config()
assert isinstance(previous_config, dict)
except Exception as e:
# ok to fail on this
previous_config = {}
logger.debug(
f"Failed to retrieve config from file {config_utils.DATAHUB_CONFIG_PATH}: {e}. This isn't fatal."
)
config_dict = {**config.dict(), **previous_config}
else:
config_dict = config.dict()
config_utils.persist_datahub_config(config_dict)

@pedro93 pedro93 merged commit dd732d0 into master Jul 25, 2024
58 of 59 checks passed
@pedro93 pedro93 deleted the feature/ing-588 branch July 25, 2024 19:06
arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <[email protected]>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <[email protected]>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <[email protected]>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <[email protected]>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <[email protected]>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <[email protected]>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <[email protected]>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <[email protected]>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <[email protected]>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <[email protected]>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <[email protected]>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <[email protected]>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <[email protected]>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <[email protected]>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <[email protected]>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <[email protected]>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <[email protected]>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <[email protected]>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <[email protected]>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <[email protected]>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <[email protected]>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <[email protected]>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <[email protected]>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <[email protected]>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <[email protected]>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <[email protected]>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <[email protected]>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <[email protected]>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <[email protected]>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <[email protected]>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: dushayntAW <[email protected]>
Co-authored-by: sagar-salvi-apptware <[email protected]>
Co-authored-by: Aseem Bansal <[email protected]>
Co-authored-by: Kevin Chun <[email protected]>
Co-authored-by: jordanjeremy <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: david-leifker <[email protected]>
Co-authored-by: sid-acryl <[email protected]>
Co-authored-by: Julien Jehannet <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: Felix Lüdin <[email protected]>
Co-authored-by: Pirry <[email protected]>
Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <[email protected]>
Co-authored-by: ksrinath <[email protected]>
Co-authored-by: Mayuri Nehate <[email protected]>
Co-authored-by: Kunal-kankriya <[email protected]>
Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: ipolding-cais <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: Shubham Jagtap <[email protected]>
Co-authored-by: haeniya <[email protected]>
Co-authored-by: Yanik Häni <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: 808OVADOZE <[email protected]>
Co-authored-by: noggi <[email protected]>
Co-authored-by: Nicholas Pena <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: ethan-cartwright <[email protected]>
Co-authored-by: Ethan Cartwright <[email protected]>
Co-authored-by: Nadav Gross <[email protected]>
Co-authored-by: Patrick Franco Braz <[email protected]>
Co-authored-by: pie1nthesky <[email protected]>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]>
Co-authored-by: Ellie O'Neil <[email protected]>
Co-authored-by: Ajoy Majumdar <[email protected]>
Co-authored-by: deepgarg-visa <[email protected]>
Co-authored-by: Tristan Heisler <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Davi Arnaut <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: amit-apptware <[email protected]>
Co-authored-by: Sam Black <[email protected]>
Co-authored-by: Raj Tekal <[email protected]>
Co-authored-by: Steffen Grohsschmiedt <[email protected]>
Co-authored-by: jaegwon.seo <[email protected]>
Co-authored-by: Renan F. Lima <[email protected]>
Co-authored-by: Matt Exchange <[email protected]>
Co-authored-by: Jonny Dixon <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: Pinaki Bhattacharjee <[email protected]>
Co-authored-by: Jeff Merrick <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: AndreasHegerNuritas <[email protected]>
Co-authored-by: jayasimhankv <[email protected]>
Co-authored-by: Jay Kadambi <[email protected]>
Co-authored-by: David Leifker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot use Datahub CLI using a remote host with SSL verification
3 participants