Releases: apache/beam
Beam 2.53.0 release
We are happy to present the new 2.53.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.53.0, check out the detailed release notes.
Highlights
- Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (#27330).
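As a rough aid for checking whether a Python streaming pipeline falls in the range named above, the sketch below compares a plain `major.minor.patch` string against 2.47.0 and 2.53.0. The helper names `parse_version` and `needs_streaming_fix` are hypothetical and not part of Beam, and the code assumes purely numeric version strings without pre-release suffixes.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a string such as '2.47.0' into the tuple (2, 47, 0)."""
    return tuple(int(part) for part in version.split("."))


def needs_streaming_fix(sdk_version: str) -> bool:
    """Return True for versions from 2.47.0 up to, but not including, 2.53.0."""
    lower = parse_version("2.47.0")
    fixed = parse_version("2.53.0")
    return lower <= parse_version(sdk_version) < fixed


if __name__ == "__main__":
    # Hypothetical candidates, used only to exercise the helper.
    for candidate in ("2.46.0", "2.47.0", "2.52.0", "2.53.0"):
        print(candidate, needs_streaming_fix(candidate))
```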
I/Os
- TextIO now supports skipping multiple header lines (Java) (#17990).
- Python GCSIO is now implemented with GCP GCS Client instead of apitools (#25676)
- Adding support for LowCardinality DataType in ClickHouse (Java) (#29533).
- Added support for handling bad records to KafkaIO (Java) (#29546)
- Add support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models (#29564).
- NATS IO connector added (Go) (#29000).
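The TextIO entry above is a Java SDK change. Purely to illustrate the behaviour of skipping several header lines, here is a plain-Python sketch. The function `skip_header_lines` is a hypothetical stand-in and does not reflect the TextIO API.

```python
from itertools import islice
from typing import Iterable, Iterator


def skip_header_lines(lines: Iterable[str], header_lines: int) -> Iterator[str]:
    """Yield every line after the first `header_lines` lines."""
    if header_lines < 0:
        raise ValueError("header_lines must not be negative")
    # islice drops the leading lines lazily, so large inputs are not copied.
    return islice(lines, header_lines, None)


if __name__ == "__main__":
    # Hypothetical file content with a two-line header.
    rows = ["# exported by tool", "id,name", "1,alpha", "2,beta"]
    for row in skip_header_lines(rows, 2):
        print(row)
```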
New Features / Improvements
- The Python SDK now type checks `collections.abc.Collections` types properly. Some type hints that were erroneously allowed by the SDK may now fail. (#29272)
- Running multi-language pipelines locally no longer requires Docker. Instead, the same (generally auto-started) subprocess used to perform the expansion can also be used as the cross-language worker.
- Framework for adding Error Handlers to composite transforms added in Java (#29164).
- Python 3.11 images now include google-cloud-profiler (#29561).
Breaking Changes
- Upgraded to go 1.21.5 to build, fixing CVE-2023-45285 and CVE-2023-39326
Deprecations
- Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (#29451)
Bugfixes
- (Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (#27330).
- (Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (#29600).
List of Contributors
According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Anand Inguva
Arun Pandian
Balázs Németh
Bruno Volpato
Byron Ellis
Calvin Swenson Jr
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
Ferran Fernández Garrido
Georgii Zemlianyi
Israel Herraiz
Jack McCluskey
Jacob Tomlinson
Jan Lukavský
JayajP
Jeffrey Kinard
Johanna Öjeling
Julian Braha
Julien Tournay
Kenneth Knowles
Lawrence Qiu
Mark Zitnik
Mattie Fu
Michel Davit
Mike Williamson
Naireen
Naireen Hussain
Niel Markwick
Pablo Estrada
Radosław Stankiewicz
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Rohde
Sam Whittle
Shunping Huang
Svetak Sundhar
Talat UYARER
Tom Stepp
Tony Tang
Vlado Djerek
Yi Hu
Zechen Jiang
clmccart
damccorm
darshan-sj
gabry.wu
johnjcasey
liferoad
lrakla
martin trieu
tvalentyn
Beam 2.52.0 release
We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.52.0, check out the detailed release notes.
Highlights
- Avro-dependent code, deprecated in Beam 2.46.0, has finally been removed from the Java SDK "core" package. Please use `beam-sdks-java-extensions-avro` instead. This makes it easy to update the Avro version in user code without potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and should handle this (#25252).
- Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process (#28120).
- Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.
New Features / Improvements
- Added the `UseDataStreamForBatch` pipeline option to the Flink runner. When it is set to true, the Flink runner runs batch jobs using the DataStream API. By default the option is set to false, so batch jobs are still executed using the DataSet API.
- `upload_graph` as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10 MB for the Java SDK (PR#28621).
- The state and side input cache is now enabled with a default size of 100 MB. Use `--max_cache_memory_usage_mb=X` to set the cache size for the user state API and side inputs (Python) (#28770).
- Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework, which includes a preliminary set of IOs and turnkey transforms. More information can be found in the YAML root folder and in the README.
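A minimal sketch of passing the `--max_cache_memory_usage_mb` flag mentioned above: the helper below only assembles the argument list as plain strings. The function name `with_cache_size` and the example value 256 are assumptions for illustration, not Beam APIs or recommended settings.

```python
def with_cache_size(pipeline_args: list[str], cache_mb: int) -> list[str]:
    """Return a copy of pipeline_args with the cache-size flag appended."""
    if cache_mb < 0:
        raise ValueError("cache size must not be negative")
    # The flag name comes from the release note; the input list is left unmodified.
    return [*pipeline_args, f"--max_cache_memory_usage_mb={cache_mb}"]


if __name__ == "__main__":
    args = with_cache_size(["--runner=DirectRunner"], 256)
    print(args)
```

The resulting list would typically be handed to the pipeline's option parsing in the same way as any other command-line flags.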
Breaking Changes
- `org.apache.beam.sdk.io.CountingSource.CounterMark` uses a custom `CounterMarkCoder` as its default coder, since all Avro-dependent classes have finally moved to `extensions/avro`. If `AvroCoder` is still required for `CounterMark`, then, as a workaround, a copy of the "old" `CountingSource` class should be placed into the project code and used directly (#25252).
- Renamed `host` to `firestoreHost` in `FirestoreOptions` to avoid potential conflict of command line arguments (Java) (#29201).
Bugfixes
- Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) #28793.
- The `watch_file_pattern` arg of RunInference had no effect prior to 2.52.0. To use the behavior of arg `watch_file_pattern` prior to 2.52.0, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use the `WatchFilePattern` PTransform as a SideInput. (#28948)
- `MLTransform` doesn't output artifacts such as min, max and quantiles. Instead, `MLTransform` will add a feature to output these artifacts in a human-readable format (#29017). For now, to use artifacts such as min and max that were produced by an earlier `MLTransform`, use `read_artifact_location` of `MLTransform`, which reads artifacts that were produced earlier in a different `MLTransform` (#29016).
- Fixed a memory leak, which affected some long-running Python pipelines: #28246.
Security Fixes
- Fixed CVE-2023-39325 (Java/Python/Go) (#29118).
- Mitigated CVE-2023-47248 (Python) #29392.
List of Contributors
According to git shortlog, the following people contributed to the 2.52.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
BjornPrime
Bruno Volpato
Bulat
Chamikara Jayalath
Damon
Danny McCormick
Devansh Modi
Dominik Dębowczyk
Ferran Fernández Garrido
Hai Joey Tran
Israel Herraiz
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
Jiangjie Qin
Jing
Joar Wandborg
Johanna Öjeling
Julien Tournay
Kanishk Karanawat
Kenneth Knowles
Kerry Donny-Clark
Luís Bianchin
Minbo Bae
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Shunping Huang
Steven van Rossum
Svetak Sundhar
Tony Tang
Vitaly Terentyev
Vivek Sumanth
Vlado Djerek
Yi Hu
aku019
brucearctor
caneff
damccorm
ddebowczyk92
dependabot[bot]
dpcollins-google
edman124
gabry.wu
illoise
johnjcasey
jonathan-lemos
kennknowles
liferoad
magicgoody
martin trieu
nancyxu123
pablo rodriguez defino
tvalentyn
Beam 2.51.0 release
We are happy to present the new 2.51.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.51.0, check out the detailed release notes.
New Features / Improvements
- In Python, RunInference now supports loading many models in the same transform using a KeyedModelHandler (#27628).
- In Python, the VertexAIModelHandlerJSON now supports passing in inference_args. These will be passed through to the Vertex endpoint as parameters.
- Added support to run `mypy` on user pipelines (#27906)
Breaking Changes
- Removed fastjson library dependency for Beam SQL. Table property is changed to be based on jackson ObjectNode (Java) (#24154).
- Removed TensorFlow from Beam Python container images PR. If you have been negatively affected by this change, please comment on #20605.
- Removed the parameter `t reflect.Type` from `parquetio.Write`. The element type is derived from the input PCollection (Go) (#28490)
- Refactor BeamSqlSeekableTable.setUp adding a parameter joinSubsetType. #28283
Bugfixes
- Fixed exception chaining issue in GCS connector (Python) (#26769).
- Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) (#21080).
- Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: #28632.
Security Fixes
- Python containers updated, fixing CVE-2021-30474, CVE-2021-30475, CVE-2021-30473, CVE-2020-36133, CVE-2020-36131, CVE-2020-36130, and CVE-2020-36135
- Used go 1.21.1 to build, fixing CVE-2023-39320
Known Issues
- Python pipelines using BigQuery Storage Read API must pin the `fastavro` dependency to 1.8.3 or earlier: #28811
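One way to check the pin described above at runtime is sketched here with the standard library's `importlib.metadata`. The helper names are hypothetical, and the comparison assumes a plain numeric `fastavro` version string. A version with a suffix such as `rc1` would raise `ValueError` in this sketch.

```python
from importlib import metadata
from typing import Optional

# Upper bound taken from the known issue: 1.8.3 or earlier.
MAX_FASTAVRO = (1, 8, 3)


def fastavro_pin_ok(version: str) -> bool:
    """Return True when a version string such as '1.8.3' is within the pin."""
    parts = tuple(int(part) for part in version.split(".")[:3])
    return parts <= MAX_FASTAVRO


def installed_fastavro_ok() -> Optional[bool]:
    """Check the installed fastavro; return None when it is not installed."""
    try:
        return fastavro_pin_ok(metadata.version("fastavro"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("installed fastavro within pin:", installed_fastavro_ok())
```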
List of Contributors
According to git shortlog, the following people contributed to the 2.51.0 release. Thank you to all contributors!
Adam Whitmore
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Arvind Ram
Arwin Tio
BjornPrime
Bruno Volpato
Bulat
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
David Cavazos
Dip Patel
Hai Joey Tran
Hao Xu
Haruka Abe
Jack Dingilian
Jack McCluskey
Jeff Kinard
Jeffrey Kinard
Joey Tran
Johanna Öjeling
Julien Tournay
Kenneth Knowles
Kerry Donny-Clark
Mattie Fu
Melissa Pashniak
Michel Davit
Moritz Mack
Pranav Bhandari
Rebecca Szper
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruwann
Ryan Tam
Sam Rohde
Sereana Seim
Svetak Sundhar
Tim Grein
Udi Meiri
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Xinyu Liu
Yi Hu
Zbynek Konecny
Zechen Jiang
bzablocki
caneff
dependabot[bot]
gDuperran
gabry.wu
johnjcasey
kberezin-nshl
kennknowles
liferoad
lostluck
magicgoody
martin trieu
mosche
olalamichelle
tvalentyn
xqhu
Łukasz Spyra
Beam 2.50.0 release
We are happy to present the new 2.50.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.50.0, check out the detailed release notes.
Highlights
- Spark 3.2.2 is used as default version for Spark runner (#23804).
- The Go SDK has a new default local runner, called Prism (#24789).
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures.
I/Os
- Java KafkaIO now supports picking up topics via topicPattern (#26948)
- Support for read from Cosmos DB Core SQL API (#23604)
- Upgraded to HBase 2.5.5 for HBaseIO. (Java) (#27711)
- Added support for GoogleAdsIO source (Java) (#27681).
New Features / Improvements
- The Go SDK now requires Go 1.20 to build. (#27558)
- The Go SDK has a new default local runner, Prism. (#24789).
- Prism is a portable runner that executes each transform independently, ensuring coders.
- At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
- See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
- Hugging Face Model Handler for RunInference added to Python SDK. (#26632)
- Hugging Face Pipelines support for RunInference added to Python SDK. (#27399)
- Vertex AI Model Handler for RunInference now supports private endpoints (#27696)
- MLTransform transform added with support for common ML pre/postprocessing operations (#26795)
- Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. (#27635)
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures. (#27674). The multi-arch container images include:
- All versions of Go, Python, Java and Typescript SDK containers.
- All versions of Flink job server containers.
- Java and Python expansion service containers.
- Transform service controller container.
- Spark3 job server container.
- Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2).(#21429)
Breaking Changes
- Python SDK: Legacy runner support removed from Dataflow, all pipelines must use runner v2.
- Python SDK: Dataflow Runner will no longer stage Beam SDK from PyPI in the `--staging_location` at pipeline submission. Custom container images that are not based on Beam's default image must include Apache Beam installation. (#26996)
Deprecations
- The Go Direct Runner is now Deprecated. It remains available to reduce migration churn.
- Tests can be set back to the direct runner by overriding TestMain: `func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }`
- It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
- Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like prism.
- Do not rely on closures or using package globals for DoFn configuration. They don't function on portable runners.
Bugfixes
- Fixed DirectRunner bug in Python SDK where GroupByKey gets empty PCollection and fails when pipeline option `direct_num_workers!=1`. (#27373)
- Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security (#27474)
List of Contributors
According to git shortlog, the following people contributed to the 2.50.0 release. Thank you to all contributors!
Abacn
acejune
AdalbertMemSQL
ahmedabu98
Ahmed Abualsaud
al97
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Anton Shalkovich
ArjunGHUB
Bjorn Pedersen
BjornPrime
Brett Morgan
Bruno Volpato
Buqian Zheng
Burke Davison
Byron Ellis
bzablocki
case-k
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Connor Brett
Damon
Damon Douglas
Dan Hansen
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmytro Sadovnychyi
Florent Biville
Gabriel Lacroix
Hai Joey Tran
Hong Liang Teoh
Jack McCluskey
James Fricker
Jeff Kinard
Jeff Zhang
Jing
johnjcasey
jon esperanza
Josef Šimánek
Kenneth Knowles
Laksh
Liam Miller-Cushon
liferoad
magicgoody
Mahmud Ridwan
Manav Garg
Marco Vela
martin trieu
Mattie Fu
Michel Davit
Moritz Mack
mosche
Peter Sobot
Pranav Bhandari
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Saba Sathya
Sam Whittle
Steven Niemitz
Steven van Rossum
Svetak Sundhar
Tony Tang
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Yichi Zhang
Yi Hu
Zechen Jiang
Beam 2.49.0 release
We are happy to present the new 2.49.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.49.0, check out the detailed release notes.
I/Os
- Support for Bigtable Change Streams added in Java `BigtableIO.ReadChangeStream` (#27183).
- Added Bigtable Read and Write cross-language transforms to Python SDK ((#26593), (#27146)).
New Features / Improvements
- Allow prebuilding large images when using `--prebuild_sdk_container_engine=cloud_build`, like images depending on `tensorflow` or `torch` (#27023).
- Disabled `pip` cache when installing packages on the workers. This reduces the size of prebuilt Python container images (#27035).
- Select dedicated avro datum reader and writer (Java) (#18874).
- Timer API for the Go SDK (Go) (#22737).
Deprecations
- Remove Python 3.7 support. (#26447)
Bugfixes
- Fixed KinesisIO `NullPointerException` when a progress check is made before the reader is started (IO) (#23868)
Known Issues
List of Contributors
According to git shortlog, the following people contributed to the 2.49.0 release. Thank you to all contributors!
Abzal Tuganbay
AdalbertMemSQL
Ahmed Abualsaud
Ahmet Altay
Alan Zhang
Alexey Romanenko
Anand Inguva
Andrei Gurau
Arwin Tio
Bartosz Zablocki
Bruno Volpato
Burke Davison
Byron Ellis
Chamikara Jayalath
Charles Rothrock
Chris Gavin
Claire McGinty
Clay Johnson
Damon
Daniel Dopierała
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dip Patel
Dmitry Repin
Gavin McDonald
Jack Dingilian
Jack McCluskey
James Fricker
Jan Lukavský
Jasper Van den Bossche
John Casey
John Gill
Joseph Crowley
Kanishk Karanawat
Katie Liu
Kenneth Knowles
Kyle Galloway
Liam Miller-Cushon
MakarkinSAkvelon
Masato Nakamura
Mattie Fu
Michel Davit
Naireen Hussain
Nathaniel Young
Nelson Osacky
Nick Li
Oleh Borysevych
Pablo Estrada
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
Saadat Su
Sam Rohde
Sam Whittle
Sanil Jain
Shunping Huang
Smeet nagda
Svetak Sundhar
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Vlado Djerek
WuA
XQ Hu
Xianhua Liu
Xinyu Liu
Yi Hu
Zachary Houfek
alexeyinkin
bigduu
bullet03
bzablocki
jonathan-lemos
jubebo
magicgoody
ruslan-ikhsan
sultanalieva-s
vitaly.terentyev
Beam 2.48.0 release
We are happy to present the new 2.48.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.48.0, check out the detailed release notes.
Note: The release tag for Go SDK for this release is sdks/v2.48.2 instead of sdks/v2.48.0 because of incorrect commit attached to the release tag sdks/v2.48.0.
Highlights
- "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid
the misperception of code as "not ready". Any proposed breaking changes will be subject to
case-by-case pro/con decision making (and generally avoided) rather than using the "Experimental"
to allow them.
I/Os
- Added rename for GCS and copy for local filesystem (Go) (#25779).
- Added support for enhanced fan-out in KinesisIO.Read (Java) (#19967).
- This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
- Added textio.ReadWithFilename transform (Go) (#25812).
- Added fileio.MatchContinuously transform (Go) (#26186).
New Features / Improvements
- Allow passing service name for google-cloud-profiler (Python) (#26280).
- Dead letter queue support added to RunInference in Python (#24209).
- Support added for defining pre/postprocessing operations on the RunInference transform (#26308)
- Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms (#26023).
Breaking Changes
- Passing a tag into MultiProcessShared is now required in the Python SDK (#26168).
- CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is shutting down. (Java) (#25959).
- AWS 2 client providers (deprecated in Beam v2.38.0) are finally removed (#26681).
- AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed (#26710).
- AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed (#23315).
Bugfixes
- Fixed Java bootloader failing with Too Long Args due to long classpaths, with a pathing jar. (Java) (#25582).
List of Contributors
According to git shortlog, the following people contributed to the 2.48.0 release. Thank you to all contributors!
Abzal Tuganbay
Ahmed Abualsaud
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
Balázs Németh
Bazyli Polednia
Bruno Volpato
Chamikara Jayalath
Clay Johnson
Damon
Daniel Arn
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmitry Repin
George Novitskiy
Israel Herraiz
Jack Dingilian
Jack McCluskey
Jan Lukavský
Jasper Van den Bossche
Jeff Zhang
Jeremy Edwards
Johanna Öjeling
John Casey
Katie Liu
Kenneth Knowles
Kerry Donny-Clark
Kuba Rauch
Liam Miller-Cushon
MakarkinSAkvelon
Mattie Fu
Michel Davit
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Pranav Bhandari
Pranjal Joshi
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
RuiLong J
RyujiTamaki
Sam Whittle
Sanil Jain
Svetak Sundhar
Timur Sultanov
Tony Tang
Udi Meiri
Valentyn Tymofieiev
Vishal Bhise
Vitaly Terentyev
Xinyu Liu
Yi Hu
bullet03
darshan-sj
kellen
liferoad
mokamoka03210120
psolomin
Beam 2.47.0 release
We are happy to present the new 2.47.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.47.0, check out the detailed release notes.
Highlights
- Apache Beam adds Python 3.11 support (#23848).
I/Os
- BigQuery Storage Write API is now available in Python SDK via cross-language (#21961).
- Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) (#25830).
- Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) (#25779).
- Add integration test for JmsIO + fix issue with multiple connections (Java) (#25887).
New Features / Improvements
- The Flink runner now supports Flink 1.16.x (#25046).
- Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections. (Note that when doing multiple operations, it may be more efficient to explicitly chain the operations like `df | (Transform1 | Transform2 | ...)` to avoid excessive conversions.)
- The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extend support for slowly updating side input patterns. (#23106)
- Python SDK now supports `protobuf <4.23.0` (#24599)
- Several Google client libraries in the Python SDK dependency chain were updated to the latest available major versions. (#24599)
Breaking Changes
- If a main session fails to load, the pipeline will now fail at worker startup. (#25401).
- Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. (#25943).
- The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as `key` and `reverse`. (#25888).
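To illustrate what "keyword-only" means for callers, here is a plain-Python sketch. The function `smallest` is hypothetical and only mirrors the calling convention. It is not the Beam combiner, and the meaning given to `reverse` here (return the largest values instead) is an assumption made for the example.

```python
import heapq
from typing import Callable, Iterable, Optional


def smallest(
    values: Iterable,
    n: int,
    *,  # everything after this marker must be passed by keyword
    key: Optional[Callable] = None,
    reverse: bool = False,
) -> list:
    """Return the n smallest values, or the n largest when reverse is True."""
    if reverse:
        return heapq.nlargest(n, values, key=key)
    return heapq.nsmallest(n, values, key=key)


if __name__ == "__main__":
    words = ["pear", "fig", "banana"]
    print(smallest(words, 2, key=len))
    try:
        smallest(words, 2, len)  # positional use of `key` is rejected
    except TypeError as error:
        print("positional call rejected:", error)
```

Callers that previously passed such options positionally need to switch to the `key=...` and `reverse=...` form.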
Deprecations
- Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version,
in response to the Google Cloud Debugger service turning down.
(Java) (#25959).
Bugfixes
- BigQuery sink in STORAGE_WRITE_API mode in batch pipelines might result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: #26521
List of Contributors
According to git shortlog, the following people contributed to the 2.47.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Amir Fayazi
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Andrey Kot
Bjorn Pedersen
Bruno Volpato
Buqian Zheng
Chamikara Jayalath
ChangyuLi28
Damon
Danny McCormick
Dmitry Repin
George Ma
Jack Dingilian
Jack McCluskey
Jasper Van den Bossche
Jeremy Edwards
Jiangjie (Becket) Qin
Johanna Öjeling
Juta Staes
Kenneth Knowles
Kyle Weaver
Mattie Fu
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Rebecca Szper
Reuven Lax
Reza Rokni
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Saadat Su
Saifuddin53
Sam Rohde
Shubham Krishna
Svetak Sundhar
Theodore Ni
Thomas Gaddy
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Xinyu Liu
Yanan Hao
Yi Hu
Yuvi Panda
andres-vv
bochap
dannikay
darshan-sj
dependabot[bot]
harrisonlimh
hnnsgstfssn
jrmccluskey
liferoad
tvalentyn
xianhualiu
zhangskz
Beam 2.46.0 release
We are happy to present the new 2.46.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.46.0, check out the detailed release notes.
Highlights
- Java SDK containers migrated to Eclipse Temurin as a base. This change migrates away from the deprecated OpenJDK container. Eclipse Temurin is currently based upon Ubuntu 22.04 while the OpenJDK container was based upon Debian 11.
- RunInference PTransform will accept model paths as SideInputs in Python SDK. (#24042)
- RunInference supports ONNX runtime in Python SDK (#22972)
- Tensorflow Model Handler for RunInference in Python SDK (#25366)
- Java SDK modules migrated to use `:sdks:java:extensions:avro` (#24748)
I/Os
- Added in JmsIO a retry policy for failed publications (Java) (#24971).
- Support for `LZMA` compression/decompression of text files added to the Python SDK (#25316)
- Added ReadFrom/WriteTo Csv/Json as top-level transforms to the Python SDK.
New Features / Improvements
- Add UDF metrics support for Samza portable mode.
- Option for SparkRunner to avoid the need of SDF output to fit in memory (#23852). This helps e.g. with ParquetIO reads. Turn the feature on by adding experiment `use_bounded_concurrent_output_for_sdf`.
- Add `WatchFilePattern` transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern. (#24042)
- Add support for loading TorchScript models with `PytorchModelHandler`. The TorchScript model path can be passed to PytorchModelHandler using `torch_script_model_path=<path_to_model>`. (#25321)
- The Go SDK now requires Go 1.19 to build. (#25545)
- The Go SDK now has an initial native Go implementation of a portable Beam Runner called Prism. (#24789)
- For more details and current state see https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism.
Breaking Changes
- The deprecated SparkRunner for Spark 2 (see 2.41.0) was removed (#25263).
- Python's BatchElements performs more aggressive batching in some cases, capping at 10 second rather than 1 second batches by default and excluding fixed cost in this computation to better handle cases where the fixed cost is larger than a single second. To get the old behavior, one can pass `target_batch_duration_secs_including_fixed_cost=1` to BatchElements.
Deprecations
- Avro related classes are deprecated in module `beam-sdks-java-core` and will eventually be removed. Please migrate to the new module `beam-sdks-java-extensions-avro` instead by importing the classes from the `org.apache.beam.sdk.extensions.avro` package. For the sake of migration simplicity, the relative package path and the whole class hierarchy of Avro related classes in the new module are preserved as they were before. For example, import the `org.apache.beam.sdk.extensions.avro.coders.AvroCoder` class instead of `org.apache.beam.sdk.coders.AvroCoder`. (#24749).
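Because the relative package path is preserved, the import change described above is mechanical. The following is a rough sketch of a rewrite helper, not an official migration tool. The function name `migrate_avro_imports` is hypothetical, and the caller must supply the list of relative class paths. Only `coders.AvroCoder` is taken from the note.

```python
import re

OLD_PREFIX = "org.apache.beam.sdk."
NEW_PREFIX = "org.apache.beam.sdk.extensions.avro."


def migrate_avro_imports(java_source: str, class_paths: list[str]) -> str:
    """Point references to the given relative class paths at the Avro extension package."""
    for relative in class_paths:
        old = re.escape(OLD_PREFIX + relative)
        # \b keeps 'coders.AvroCoder' from matching a longer name such as 'coders.AvroCoderX'.
        java_source = re.sub(rf"\b{old}\b", NEW_PREFIX + relative, java_source)
    return java_source


if __name__ == "__main__":
    source = "import org.apache.beam.sdk.coders.AvroCoder;\n"
    print(migrate_avro_imports(source, ["coders.AvroCoder"]), end="")
```

Already migrated imports are left untouched, since the old fully qualified name no longer appears in them.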
List of Contributors
According to git shortlog, the following people contributed to the 2.46.0 release. Thank you to all contributors!
Ahmet Altay
Alan Zhang
Alexey Romanenko
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Brian Hulette
Bruno Volpato
Byron Ellis
Chamikara Jayalath
Damon
Danny McCormick
Darkhan Nausharipov
David Katz
Dmitry Repin
Doug Judd
Egbert van der Wal
Elizaveta Lomteva
Evan Galpin
Herman Mak
Jack McCluskey
Jan Lukavský
Johanna Öjeling
John Casey
Jozef Vilcek
Junhao Liu
Juta Staes
Katie Liu
Kiley Sok
Liam Miller-Cushon
Luke Cwik
Moritz Mack
Ning Kang
Oleh Borysevych
Pablo E
Pablo Estrada
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruslan Altynnikov
Ryan Zhang
Sam Rohde
Sam Whittle
Sam sam
Sergei Lilichenko
Shivam
Shubham Krishna
Theodore Ni
Timur Sultanov
Tony Tang
Vachan
Veronica Wasson
Vincent Devillers
Vitaly Terentyev
William Ross Morrow
Xinyu Liu
Yi Hu
ZhengLin Li
Ziqi Ma
ahmedabu98
alexeyinkin
aliftadvantage
bullet03
dannikay
darshan-sj
dependabot[bot]
johnjcasey
kamrankoupayi
kileys
liferoad
nancyxu123
nickuncaged1201
pablo rodriguez defino
tvalentyn
xqhu
Beam 2.45.0 release
We are happy to present the new 2.45.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.45.0, check out the detailed release notes.
I/Os
- MongoDB IO connector added (Go) (#24575).
New Features / Improvements
- RunInference Wrapper with Sklearn Model Handler support added in Go SDK (#24497).
- Adding override of allowed TLS algorithms (Java), now maintaining the disabled/legacy algorithms present in 2.43.0 (up to 1.8.0_342, 11.0.16, 17.0.2 for respective Java versions). This is accompanied by an explicit re-enabling of TLSv1 and TLSv1.1 for Java 8 and Java 11.
- Add UDF metrics support for Samza portable mode.
Breaking Changes
- Portable Java pipelines, Go pipelines, Python streaming pipelines, and portable Python batch pipelines on Dataflow are required to use Runner V2. The `disable_runner_v2`, `disable_runner_v2_until_2023`, `disable_prime_runner_v2` experiments will raise an error during pipeline construction. You can no longer specify the Dataflow worker jar override. Note that non-portable Java jobs and non-portable Python batch jobs are not impacted. (#24515).
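A small pre-flight check can flag the experiments named above before a job is submitted. This is a sketch in plain Python. The function `find_removed_experiments` is hypothetical, and `some_other_experiment` is only a placeholder for any unrelated experiment name.

```python
# Experiment names listed in the breaking change above.
REMOVED_EXPERIMENTS = frozenset({
    "disable_runner_v2",
    "disable_runner_v2_until_2023",
    "disable_prime_runner_v2",
})


def find_removed_experiments(experiments: list[str]) -> list[str]:
    """Return the requested experiments that now raise an error, in input order."""
    return [name for name in experiments if name in REMOVED_EXPERIMENTS]


if __name__ == "__main__":
    requested = ["some_other_experiment", "disable_runner_v2"]
    offending = find_removed_experiments(requested)
    if offending:
        print("remove these experiments before upgrading:", offending)
```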
Bugfixes
- Avoids Cassandra syntax error when user-defined query has no where clause in it (Java) (#24829).
- Fixed JDBC connection failures (Java) during handshake due to deprecated TLSv1(.1) protocol for the JDK. (#24623)
- Fixed an issue where a Python BigQuery Batch Load write may truncate valid data when the disposition is set to WRITE_TRUNCATE and incoming data is large (Python) (#24623).
- Fixed Kafka watermark issue with sparse data on many partitions (#24205)
List of Contributors
According to git shortlog, the following people contributed to the 2.45.0 release. Thank you to all contributors!
AdalbertMemSQL
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Anand Inguva
Andrea Nardelli
Andrei Gurau
Andrew Pilloud
Benjamin Gonzalez
BjornPrime
Brian Hulette
Bulat
Byron Ellis
Chamikara Jayalath
Charles Rothrock
Damon
Daniela Martín
Danny McCormick
Darkhan Nausharipov
Dejan Spasic
Diego Gomez
Dmitry Repin
Doug Judd
Elias Segundo Antonio
Evan Galpin
Evgeny Antyshev
Fernando Morales
Jack McCluskey
Johanna Öjeling
John Casey
Junhao Liu
Kanishk Karanawat
Kenneth Knowles
Kiley Sok
Liam Miller-Cushon
Lucas Marques
Luke Cwik
MakarkinSAkvelon
Marco Robles
Mark Zitnik
Melanie
Moritz Mack
Ning Kang
Oleh Borysevych
Pablo Estrada
Philippe Moussalli
Piyush Sagar
Rebecca Szper
Reuven Lax
Rick Viscomi
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Whittle
Sergei Lilichenko
Seung Jin An
Shane Hansen
Sho Nakatani
Shunya Ueta
Siddharth Agrawal
Timur Sultanov
Veronica Wasson
Vitaly Terentyev
Xinbin Huang
Xinyu Liu
Xinyue Zhang
Yi Hu
ZhengLin Li
alexeyinkin
andoni-guzman
andthezhang
bullet03
camphillips22
gabihodoroaga
harrisonlimh
pablo rodriguez defino
ruslan-ikhsan
tvalentyn
yyy1000
zhengbuqian
v2.44.0
We are happy to present the new 2.44.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.44.0, check out the detailed release notes.
I/Os
- Support for Bigtable sink (Write and WriteBatch) added (Go) (#23324).
- S3 implementation of the Beam filesystem (Go) (#23991).
- Support for SingleStoreDB source and sink added (Java) (#22617).
- Added support for DefaultAzureCredential authentication in Azure Filesystem (Python) (#24210).
- Added new CdapIO for CDAP Batch and Streaming Source/Sinks (Java) (#24961).
- Added new SparkReceiverIO for Spark Receivers 2.4.* (Java) (#24960).
New Features / Improvements
- Beam now provides a portable "runner" that can render pipeline graphs with graphviz. See `python -m apache_beam.runners.render --help` for more details.
- Local packages can now be used as dependencies in the requirements.txt file, rather than requiring them to be passed separately via the `--extra_package` option (Python) (#23684).
- Pipeline Resource Hints now supported via `--resource_hints` flag (Go) (#23990).
- Make Python SDK containers reusable on portable runners by installing dependencies to temporary venvs (BEAM-12792).
- RunInference model handlers now support the specification of a custom inference function in Python (#22572)
- Support for `map_windows` urn added to Go SDK (#24307).
Breaking Changes
- `ParquetIO.withSplit` was removed since splittable reading has been the default behavior since 2.35.0. The effect of this change is to drop support for non-splittable reading (Java) (#23832).
- `beam-sdks-java-extensions-google-cloud-platform-core` is no longer a dependency of the Java SDK Harness. Some users of a portable runner (such as Dataflow Runner v2) may have an undeclared dependency on this package (for example using GCS with TextIO) and will now need to declare the dependency.
- `beam-sdks-java-core` is no longer a dependency of the Java SDK Harness. Users of a portable runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
- Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates if a Slice type is used as a PCollection element or State API element. (Go) #24339
Bugfixes
- Fixed JmsIO acknowledgment issue (Java) (#20814)
- Fixed an issue where Beam SQL CalciteUtils (Java) and cross-language JdbcIO (Python) did not support JDBC CHAR/VARCHAR, BINARY/VARBINARY logical types (#23747, #23526).
- Ensure iterated and emitted types used with the generic register package are registered with the type and schema registries. (Go) (#23889)
List of Contributors
According to git shortlog, the following people contributed to the 2.44.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alex Merose
Alexey Inkin
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrej Galad
Andrew Pilloud
Ayush Sharma
Benjamin Gonzalez
Bjorn Pedersen
Brian Hulette
Bruno Volpato
Bulat Safiullin
Chamikara Jayalath
Chris Gavin
Damon Douglas
Danielle Syse
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dmitry Repin
Doug Judd
Elias Segundo Antonio
Evan Galpin
Evgeny Antyshev
Heejong Lee
Henrik Heggelund-Berg
Israel Herraiz
Jack McCluskey
Jan Lukavský
Janek Bevendorff
Johanna Öjeling
John J. Casey
Jozef Vilcek
Kanishk Karanawat
Kenneth Knowles
Kiley Sok
Laksh
Liam Miller-Cushon
Luke Cwik
MakarkinSAkvelon
Minbo Bae
Moritz Mack
Nancy Xu
Ning Kang
Nivaldo Tokuda
Oleh Borysevych
Pablo Estrada
Philippe Moussalli
Pranav Bhandari
Rebecca Szper
Reuven Lax
Rick Smit
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ryan Thompson
Sam Whittle
Sanil Jain
Scott Strong
Shubham Krishna
Steven van Rossum
Svetak Sundhar
Thiago Nunes
Tianyang Hu
Trevor Gevers
Valentyn Tymofieiev
Vitaly Terentyev
Vladislav Chunikhin
Xinyu Liu
Yi Hu
Yichi Zhang
AdalbertMemSQL
agvdndor
andremissaglia
arne-alex
bullet03
camphillips22
capthiron
creste
fab-jul
illoise
kn1kn1
nancyxu123
peridotml
shinannegans
smeet07