Coalesce sources until compressed serialized bundles under API limit #26267
Conversation
Verified that the unit test fails on current master and succeeds with the fix applied. Current master:
With the fix:
R: @chamikaramj @kennknowles per file history internal tracker: 277995484
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
+ "of %d bytes is still larger than the limit %d. For more information, please "
+ "check the corresponding FAQ entry at "
+ "https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline",
source, bundles.size(), serializedSize, apiByteLimit);
Shall we suggest this workaround with withTemplateCompatibility?
This option is BigQueryIO.read-only, while the bug is generic to the Dataflow worker. With that option the read transform will expand differently and likely avoid the problematic code path here (haven't verified).
SG
The workaround would be to use Runner v2, which converts sources into SDFs that do not run into this source-split limit. We should suggest that here.
Hi @chamikaramj, is this code path only for Dataflow Runner v1? If so, we could also document the Runner v2 recommendation in https://cloud.google.com/dataflow/docs/guides/common-errors#boundedsource-objects-splitintobundles
Yeah, this code path (the source split operation) only gets hit by Runner v1. +1 for documenting it on the website.
String.format(
    "Unable to coalesce the sources into compressed serialized bundles to satisfy the "
        + "allowable limit when splitting %s. With %d bundles, total serialized size "
        + "of %d bytes is still larger than the limit %d. For more information, please "
Can we use this instead: https://cloud.google.com/dataflow/docs/guides/common-errors#boundedsource-objects-splitintobundles? Please update the previous LOG.info with this link as well.
Good catch, this URL is more accurate.
Run Java PreCommit
LGTM
Thanks for the suggestions. The linked cloud documentation is going to be updated. Merging the fix for now.
Fixes #26264

The original limitNumberOfBundles logic has a problem: it always limits the split to 100 bundles when the split produces more than that, and it never checks the size of the resulting response. That could trigger the IllegalArgumentException immediately below.
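The idea of the fix can be illustrated as a loop that keeps merging bundles and re-checking the serialized size until the result fits under the API limit, instead of capping at a fixed count. A minimal, self-contained Java sketch of that idea (the class, the per-bundle framing model, and the pairwise merge strategy are illustrative assumptions, not the actual Beam code):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: coalesce "bundles" (here just strings standing in for
// serialized sources) until their total serialized size fits under a limit.
public class CoalesceSketch {

  // Assumed size model: payload bytes plus 1 byte of framing overhead per bundle.
  public static int serializedSize(List<String> bundles) {
    int size = 0;
    for (String b : bundles) {
      size += b.getBytes(StandardCharsets.UTF_8).length;
    }
    return size + bundles.size();
  }

  public static List<String> coalesceUnderLimit(List<String> bundles, int apiByteLimit) {
    List<String> current = new ArrayList<>(bundles);
    while (serializedSize(current) > apiByteLimit && current.size() > 1) {
      // Merge adjacent pairs, roughly halving the bundle count each pass,
      // then re-check the size rather than assuming a fixed cap is enough.
      List<String> merged = new ArrayList<>();
      for (int i = 0; i < current.size(); i += 2) {
        if (i + 1 < current.size()) {
          merged.add(current.get(i) + current.get(i + 1));
        } else {
          merged.add(current.get(i));
        }
      }
      current = merged;
    }
    if (serializedSize(current) > apiByteLimit) {
      // Mirrors the error path discussed above: even one bundle is too large.
      throw new IllegalArgumentException(
          String.format(
              "Unable to coalesce the sources into compressed serialized bundles: "
                  + "%d bundles, %d bytes > limit %d",
              current.size(), serializedSize(current), apiByteLimit));
    }
    return current;
  }

  public static void main(String[] args) {
    List<String> bundles = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      bundles.add("x");
    }
    // 8 one-byte bundles + 8 bytes framing = 16 bytes; a 12-byte limit
    // forces one merge pass down to 4 bundles (8 payload + 4 framing = 12).
    List<String> out = coalesceUnderLimit(bundles, 12);
    System.out.println(out.size());
  }
}
```

The key difference from the old behavior is that the loop's exit condition is the measured serialized size, so the code either produces a response under the limit or fails with a clear message rather than silently stopping at an arbitrary bundle count.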