Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

Merged
merged 2 commits into from
Nov 3, 2022

Conversation

ahmedabu98
Copy link
Contributor

@ahmedabu98 ahmedabu98 commented Nov 2, 2022

Fixes some tests for #23907

Changes in #23012 included waiting until jobs in a given step are finished before emitting the job IDs to the next step, see code block. This introduced a problem with FILE_LOADS in Flink and Spark runners:
RuntimeError: Pipeline BeamApp-jenkins-1102122925-c4be4910_cbaf26e4-874a-42dc-adf1-f3a63242d10c failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None" coder_id: "ref_Coder_FastPrimitivesCoder_3" is_bounded: BOUNDED windowing_strategy_id: "ref_Windowing_Windowing_1" }] were consumed but never produced
See #23907 for more details.

These changes add a side output to TriggerLoadJobs and yield job IDs to this output in process (as was done previously) to keep the beam.Flatten() that combines job IDs here happy. The main output still waits on jobs to finish before emitting them collectively to ensure atomicity between steps in the workflow.

@ahmedabu98
Copy link
Contributor Author

Run Python Examples_Spark

@ahmedabu98
Copy link
Contributor Author

Run Python Examples_Flink

@ahmedabu98
Copy link
Contributor Author

Run Python 3.8 PostCommit

@ahmedabu98 ahmedabu98 changed the title Fix regression by #23012 seen in Flink and Spark runners Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners Nov 2, 2022
@ahmedabu98 ahmedabu98 marked this pull request as ready for review November 2, 2022 23:17
@ahmedabu98
Copy link
Contributor Author

R: @chamikaramj

Relevant tests are passing now.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2022

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Merging #23954 (9c3ce5d) into master (acaffba) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #23954   +/-   ##
=======================================
  Coverage   73.53%   73.53%           
=======================================
  Files         707      707           
  Lines       95856    95859    +3     
=======================================
+ Hits        70486    70494    +8     
+ Misses      24053    24048    -5     
  Partials     1317     1317           
Flag Coverage Δ
python 83.33% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...s/python/apache_beam/io/gcp/bigquery_file_loads.py 87.41% <100.00%> (+0.08%) ⬆️
sdks/python/apache_beam/io/source_test_utils.py 88.47% <0.00%> (-1.39%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.67% <0.00%> (+0.12%) ⬆️
...dks/python/apache_beam/options/pipeline_options.py 94.90% <0.00%> (+0.44%) ⬆️
...eam/runners/portability/fn_api_runner/execution.py 93.08% <0.00%> (+0.64%) ⬆️
.../python/apache_beam/transforms/periodicsequence.py 100.00% <0.00%> (+1.61%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@chamikaramj
Copy link
Contributor

Thanks. Also, pls make sure that we execute relavent BQ post-commit test suites to make sure that we do not break BQ sink.

@ahmedabu98
Copy link
Contributor Author

Run Python PreCommit

1 similar comment
@chamikaramj
Copy link
Contributor

Run Python PreCommit

@chamikaramj
Copy link
Contributor

Run Python 3.8 PostCommit

@chamikaramj
Copy link
Contributor

Run BigQueryIO Read Performance Test Python

@chamikaramj
Copy link
Contributor

Run BigQueryIO Write Performance Test Python Batch

@ahmedabu98
Copy link
Contributor Author

Run Python 3.7 PostCommit

@chamikaramj chamikaramj merged commit 4f64c7e into apache:master Nov 3, 2022
chamikaramj added a commit to chamikaramj/beam that referenced this pull request Nov 3, 2022
…gerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners
chamikaramj added a commit that referenced this pull request Nov 4, 2022
…a side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants