Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload git info with metadata file (#1) #37802

Merged
merged 1 commit into from
May 9, 2024

Conversation

bnchrch
Copy link
Contributor

@bnchrch bnchrch commented May 3, 2024

What

Adds git commit info to the metadata file during upload.

image.png

Spun out of #32715 as a stack

Copy link

vercel bot commented May 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview May 8, 2024 9:34pm

Copy link
Contributor Author

bnchrch commented May 3, 2024

@bnchrch bnchrch requested a review from alafanechere May 3, 2024 18:43
@bnchrch bnchrch marked this pull request as ready for review May 3, 2024 18:43
@bnchrch bnchrch requested a review from a team May 3, 2024 18:43
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes because I'd like your opinion on these suggestions:

  • Create a separate step in the publish pipeline which would set the generated fields in metadata.yaml
  • Remove the git info retrieval logic from the metadata service lib and make airbyte-ci pass the git info. It will remove the need of mounting a git repo to the metadata service lib container and keep a single source of truth about the current git info (stored in the PipelineContext object).

Also something which bothers me a bit is that the checked in metadata file will be different from the one uploaded to GCS... We could make the publish pipeline commit and push to master... But what if we did not set these field in metadata.yaml and just generate them at registry generation?


metadata, error = validate_and_load(metadata_file_path, POST_UPLOAD_VALIDATORS, validator_opts)
metadata_file_path = _apply_modifications_to_metadata_file(metadata_file_path, validator_opts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find it more explicit and maintainable to decouple metadata mutation from metadata upload. Wdyt about create a separate step in publish which would perform this apply_modifications_to_metadata_file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I dont think I like the logic living in the pipeline.

Simply because right now we have metadata logic living already in two places

  1. Metadata service (validation and upload)
  2. Orchestrator (Parsing and generation)

Currently they are at least contained in the metadata_service folder

But if we add the git logic of

for a given metadata file get the author of the last commit that modified

Then we end up with 3 locations to look for logic, 1 of which is now outside the area in the codebase


commit_sha: Optional[str] = Field(
None,
description="The git commit sha of the last commit that modified this file. DO NOT DEFINE THIS FIELD IN YOUR SPEC. It will be overwritten by the CI.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by SPEC in this context? Do you mean the metadata.yaml file?
Sounds confusing to me as spec also relates to the connector spec command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good catch. Mistype Ill change to "metadata file"

@@ -82,6 +82,7 @@ def __init__(
title=title,
context=context,
paths_to_mount=[
MountPath(GIT_DIRECTORY_ROOT_PATH),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mounting .git is quite an expensive operation (depending on how many branches the local repo has).
As we have the git info in the context (self.context.git_branch, self.context.git_revision) what do you think about passing these to the step and remove the git logic from the metadata service lib?
It will have two advantages:

  • remove the requirement for the metadata service lib to be executed on top of a git repo
  • keep a single source of truth about the current git info (stored in the context object)

Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I summed up my thoughts here:
#37802 (comment)

But I wanted to ask, what is the time penalty for mounting git in CI?

I ran locally with a fairly limited number of branches (since I reset my laptop) and it ran decently quick.

My thoughts were it wouldn't be the worst penalty since its during publish and not test

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time penalty would rather happen locally, on CI we only checkout the current branch + master, which is not too expensive.

My thoughts were it wouldn't be the worst penalty since its during publish and not test

Agreed.

@bnchrch bnchrch requested a review from alafanechere May 6, 2024 22:06
Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as your point about keeping metadata service / orchestrator as self contained as possible sounds reasonable to me.

I'm still concerned about this:

Also something which bothers me a bit is that the checked in metadata file will be different from the one uploaded to GCS... We could make the publish pipeline commit and push to master... But what if we did not set these field in metadata.yaml and just generate them at registry generation?

Can you share your thoughts on it?

@bnchrch
Copy link
Contributor Author

bnchrch commented May 7, 2024

@alafanechere Thinking on the "at registry generation" time 🤔

The difficult part is the registry currently has no context as to

  1. Where the metadata file came from in the git repo
  2. Or What commit a change has come from
  3. Or the git repo checked out

If we were to change any of these then we would

  1. Still be modifiying the metadata file to include a generated field (commit, file_path, or both)
  2. Potentially running a git checkout for every metadata file we parse, but in a location with no caching.

So Im not sure this approach would gain us anything. It actually seems like we would have even longer release times then the proposed path of retrieving git commit info during upload.

@bnchrch bnchrch force-pushed the bnchrch/registry/add-git-commit-info branch from 7e76b6c to 219e7b5 Compare May 7, 2024 19:21
@bnchrch bnchrch force-pushed the bnchrch/registry/add-git-commit-info branch from 219e7b5 to 5f9a6b9 Compare May 8, 2024 21:33
Copy link
Contributor Author

bnchrch commented May 9, 2024

Merge activity

  • May 9, 10:18 AM EDT: @bnchrch started a stack merge that includes this pull request via Graphite.
  • May 9, 10:19 AM EDT: @bnchrch merged this pull request with Graphite.

@bnchrch bnchrch merged commit c0492b0 into master May 9, 2024
34 checks passed
@bnchrch bnchrch deleted the bnchrch/registry/add-git-commit-info branch May 9, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants