-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial version of BigQuery query execution component. #601
Conversation
Hi @cbreuel. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1 similar comment
Hi @cbreuel. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cbreuel!
|
||
def run_query(query, project_id, dataset_id, table_id, output): | ||
results = _query(query, project_id, dataset_id, table_id, output) | ||
results['output'] = output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems useful if we store the files pattern, not just the output dir into results. As the component provider, I probably know the file pattern beforehand, and that is useful information to component consumers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually assuming the output parameter will be a complete path, including a file name or pattern. Would you prefer to make it just a directory and force the file pattern to be e.g. the table name, then include the pattern in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the output is just whatever is passed in right? Is it a dir or a file prefix? I would prefer the output includes the pattern or file (if single file). For example, if passed in output is "gs://b/dir", the output could be "gs://b/dir/files-*"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ok-to-test
/lgtm
Made some changes based on your comments, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 4 unresolved discussions (waiting on @qimingj, @Ark-kun, @gaoning777, and @hongye-sun)
components/bigquery/query/src/query.py, line 66 at r1 (raw file):
Previously, qimingj wrote…
then "required=True"?
Done.
components/bigquery/query/src/query.py, line 81 at r1 (raw file):
Previously, qimingj wrote…
Here the output is just whatever is passed in right? Is it a dir or a file prefix? I would prefer the output includes the pattern or file (if single file). For example, if passed in output is "gs://b/dir", the output could be "gs://b/dir/files-*"?
The idea is that "gs://b/dir/files-*" would be passed in, not "gs://b/dir", so it would make sense for them to be the same. I could also change the code to expect only the dir portion to be passed in and add the file name pattern in the code, then outputting the combination. Do you think that would work better?
/ok-to-test |
You are right. Based on https://google-cloud.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.extract_table.html output should be the file path. |
|
||
RUN easy_install pip | ||
|
||
RUN pip install google-apitools==0.5.20 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be nice to parameterize this values i.e ARG GOOG_CLOUD_BQ='1.8.1'
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: qimingj The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: qimingj The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is a component that runs a query in BigQuery and stores the results in a new table, and optionally also exports the results to GCS.
This change is