Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

ahmedabu98 · 2022-11-02T22:55:43Z

Fixes some tests for #23907

Changes in #23012 included waiting until jobs in a given step are finished before emitting the job IDs to the next step, see code block. This introduced a problem with FILE_LOADS in Flink and Spark runners:
RuntimeError: Pipeline BeamApp-jenkins-1102122925-c4be4910_cbaf26e4-874a-42dc-adf1-f3a63242d10c failed in state FAILED: java.lang.IllegalArgumentException: PCollectionNodes [PCollectionNode{id=ref_PCollection_PCollection_52, PCollection=unique_name: "61Write/BigQueryBatchFileLoads/TriggerLoadJobsWithoutTempTables.None" coder_id: "ref_Coder_FastPrimitivesCoder_3" is_bounded: BOUNDED windowing_strategy_id: "ref_Windowing_Windowing_1" }] were consumed but never produced
See #23907 for more details.

These changes add a side output to TriggerLoadJobs and yield job IDs to this output in process (as was done previously) to keep the beam.Flatten() that combines job IDs here happy. The main output still waits on jobs to finish before emitting them collectively to ensure atomicity between steps in the workflow.

ahmedabu98 · 2022-11-02T22:56:06Z

Run Python Examples_Spark

ahmedabu98 · 2022-11-02T22:56:12Z

Run Python Examples_Flink

ahmedabu98 · 2022-11-02T22:57:06Z

Run Python 3.8 PostCommit

ahmedabu98 · 2022-11-02T23:19:13Z

R: @chamikaramj

Relevant tests are passing now.

github-actions · 2022-11-02T23:20:22Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

codecov · 2022-11-02T23:25:37Z

Codecov Report

Merging #23954 (9c3ce5d) into master (acaffba) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #23954   +/-   ##
=======================================
  Coverage   73.53%   73.53%           
=======================================
  Files         707      707           
  Lines       95856    95859    +3     
=======================================
+ Hits        70486    70494    +8     
+ Misses      24053    24048    -5     
  Partials     1317     1317

Flag	Coverage Δ
python	`83.33% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...s/python/apache_beam/io/gcp/bigquery_file_loads.py	`87.41% <100.00%> (+0.08%)`	⬆️
sdks/python/apache_beam/io/source_test_utils.py	`88.47% <0.00%> (-1.39%)`	⬇️
...hon/apache_beam/runners/worker/bundle_processor.py	`93.67% <0.00%> (+0.12%)`	⬆️
...dks/python/apache_beam/options/pipeline_options.py	`94.90% <0.00%> (+0.44%)`	⬆️
...eam/runners/portability/fn_api_runner/execution.py	`93.08% <0.00%> (+0.64%)`	⬆️
.../python/apache_beam/transforms/periodicsequence.py	`100.00% <0.00%> (+1.61%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

sdks/python/apache_beam/io/gcp/bigquery_file_loads.py

chamikaramj · 2022-11-02T23:48:42Z

Thanks. Also, pls make sure that we execute relavent BQ post-commit test suites to make sure that we do not break BQ sink.

ahmedabu98 · 2022-11-03T02:53:24Z

Run Python PreCommit

chamikaramj · 2022-11-03T06:02:06Z

Run Python PreCommit

chamikaramj · 2022-11-03T06:03:46Z

Run Python 3.8 PostCommit

chamikaramj · 2022-11-03T06:04:30Z

Run BigQueryIO Read Performance Test Python

chamikaramj · 2022-11-03T06:04:39Z

Run BigQueryIO Write Performance Test Python Batch

ahmedabu98 · 2022-11-03T11:41:20Z

Run Python 3.7 PostCommit

…gerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners

…a side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners

ahmedabu98 added 2 commits November 2, 2022 22:47

emit load job IDs as soon as they come up

15297da

style fix

9c3ce5d

github-actions bot added gcp io python labels Nov 2, 2022

ahmedabu98 changed the title ~~Fix regression by #23012 seen in Flink and Spark runners~~ Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners Nov 2, 2022

ahmedabu98 marked this pull request as ready for review November 2, 2022 23:17

ahmedabu98 mentioned this pull request Nov 2, 2022

[Bug]: Number of Python examples are failing for Flink and Spark on 2.43.0 release branch #23907

Closed

chamikaramj reviewed Nov 2, 2022

View reviewed changes

sdks/python/apache_beam/io/gcp/bigquery_file_loads.py Show resolved Hide resolved

chamikaramj merged commit 4f64c7e into apache:master Nov 3, 2022

chamikaramj added a commit to chamikaramj/beam that referenced this pull request Nov 3, 2022

Merge pull request apache#23954: Emit job ids via side output in Trig…

6e49e08

…gerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners

chamikaramj mentioned this pull request Nov 3, 2022

Merge pull request #23954: Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23967

Merged

4 tasks

chamikaramj added a commit that referenced this pull request Nov 4, 2022

Merge pull request #23967: Merge pull request #23954: Emit job ids vi…

f4ab317

…a side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

ahmedabu98 commented Nov 2, 2022 •

edited

Loading

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

github-actions bot commented Nov 2, 2022

codecov bot commented Nov 2, 2022 •

edited

Loading

chamikaramj commented Nov 2, 2022

ahmedabu98 commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

ahmedabu98 commented Nov 3, 2022

Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

Emit job ids via side output in TriggerFileLoads process to keep beam.Flatten() happy for Spark and Flink runners #23954

Conversation

ahmedabu98 commented Nov 2, 2022 • edited Loading

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

ahmedabu98 commented Nov 2, 2022

github-actions bot commented Nov 2, 2022

codecov bot commented Nov 2, 2022 • edited Loading

Codecov Report

chamikaramj commented Nov 2, 2022

ahmedabu98 commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

chamikaramj commented Nov 3, 2022

ahmedabu98 commented Nov 3, 2022

ahmedabu98 commented Nov 2, 2022 •

edited

Loading

codecov bot commented Nov 2, 2022 •

edited

Loading