Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest/pulsar): handle Avro schema with missing namespace or name #12058

Merged

Conversation

Alice-608
Copy link
Contributor

@Alice-608 Alice-608 commented Dec 8, 2024

fix: pulsar ingestion fails when Avro schema is missing namespace or name

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Dec 8, 2024
@Alice-608
Copy link
Contributor Author

fix: pulsar ingestion fails when Avro schema is missing namespace or name

Problem Description
During the Pulsar ingestion process in DataHub, if the Avro schema for a topic is missing either the namespace or name field, the ingestion process crashes with the following error:
<class 'TypeError'>: unsupported operand type(s) for +: 'NoneType' and 'str'

The root cause is that the code attempts to concatenate these fields without checking if they exist:
self.schema_name = avro_schema.get("namespace") + "." + avro_schema.get("name")

Fix Description
This fix introduces the following changes:
Validation: Check for the presence of the namespace and name fields before concatenation.
Default Values: If either field is missing, assign default values:
namespace: "default_namespace"
name: "default_name"
Informational Logging: Log messages to inform users about missing fields and the default values used.
This ensures that the ingestion process no longer fails due to missing schema fields and continues processing metadata gracefully.

Modified Files
datahub/metadata-ingestion/src/datahub/ingestion/source/pulsar.py

@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 8, 2024
@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 11, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 14, 2024
Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good - I accepted @sgomezvillamor's suggestion

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 14, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 16, 2024
self.schema_name = "null"
if avro_schema.get("namespace") and avro_schema.get("name"):
self.schema_name = (
avro_schema.get("namespace") + "." + avro_schema.get("name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest build failed because of a format issue, which should be fixed with:

Suggested change
avro_schema.get("namespace") + "." + avro_schema.get("name")
avro_schema.get("namespace") + "." + avro_schema.get("name")

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 17, 2024
@hsheth2 hsheth2 changed the title fix: pulsar ingestion fails when Avro schema is missing namespace or name fix(ingest/pulsar): handle Avro schema with missing namespace or name Dec 17, 2024
@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Dec 17, 2024
@sgomezvillamor sgomezvillamor enabled auto-merge (squash) December 18, 2024 06:48
@sgomezvillamor sgomezvillamor merged commit 5946558 into datahub-project:master Dec 18, 2024
62 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants