Apache Beam

Apache Beam 2.74.0

Tue, 02 Jun 2026 14:00:00 -0500

We are happy to present the new 2.74.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.74.0, check out the detailed release notes.

Highlights

Spark 4 runner support for Java SDK (#38255).

I/Os

IcebergIO: support declaring a table’s sort order on dynamic table creation via the new sort_fields config (#38269).
IcebergIO: support writing with hash distribution mode, and with autosharding (#38061).

New Features / Improvements

Introduced a capability that flags aggregations and timers firing during a pipeline drain, allowing users and sinks to recognize and appropriately handle potentially incomplete or partial data (#36884).
Added support for setting disk provisioned IOPS and throughput in Dataflow runner via --diskProvisionedIops and --diskProvisionedThroughputMibps pipeline options (Java/Go/Python) (#38349).
TriggerStateMachineRunner changes from BitSetCoder to SentinelBitSetCoder to encode finished bitset. SentinelBitSetCoder and BitSetCoder are state compatible. Both coders can decode encoded bytes from the other coder (#38139).
(Python) Added a type alias for with_exception_handling to be used in typehints (#38173).
(Java) BatchElements transform for Java SDK (#38369)
Added plugin mechanism to support different Lineage implementations (Java) (#36790).
(Python) Added support for Python user types in Beam SQL. For example, SQL statements like SELECT some_field from PCOLLECTION can now operate on a PCollection of Beam Rows containing a picklable Python user type (#20738).
(Python) Introduced beam.coders.registry.register_row as the preferred API to register a named tuple or dataclass with a Beam Row. At pipeline runtime, the original type associated with the registered row is preserved across the serialization boundary (#38108).

Breaking Changes

(Python) Made Beartype the default fallback type checking tool. This can be disabled with the --disable_beartype pipeline option. (#38275)

Deprecations

Dropped Java 8 support (#31678).
Removed Samza Runner support (#35448).

Bugfixes

Fixed BigQueryEnrichmentHandler batch mode dropping earlier requests when multiple requests share the same enrichment key (Python) (#38035).
Added max_batch_duration_secs passthrough support in Python Enrichment BigQuery and CloudSQL handlers so batching duration can be forwarded to BatchElements (#38243).
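
The duplicate-key bug described above can be pictured with a small standard-library sketch. This is not the BigQueryEnrichmentHandler code: the function names, the request shape, and the key function are all hypothetical and exist only to show why indexing a batch by key alone loses requests, and how grouping by key keeps them.

```python
from collections import defaultdict


def index_requests_lossy(requests, key_fn):
    """Keeps one request per key; a later request overwrites an earlier one."""
    return {key_fn(r): r for r in requests}


def index_requests_grouped(requests, key_fn):
    """Keeps every request, grouped under its enrichment key."""
    grouped = defaultdict(list)
    for r in requests:
        grouped[key_fn(r)].append(r)
    return dict(grouped)


def join_responses(requests, responses_by_key, key_fn):
    """Pairs every request in the batch with the response fetched for its key."""
    grouped = index_requests_grouped(requests, key_fn)
    return [(r, responses_by_key.get(k)) for k, rs in grouped.items() for r in rs]


# Hypothetical batch: two requests share the enrichment key 1.
batch = [
    {"id": 1, "event": "view"},
    {"id": 1, "event": "click"},
    {"id": 2, "event": "view"},
]
by_id = lambda r: r["id"]
```

With this batch, the lossy index holds two entries, so one of the three requests would never receive a response, while join_responses returns one pair per request.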

According to git shortlog, the following people contributed to the 2.74.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Andrew Crites, Andrew Kabas, Arran Cudbard-Bell, Arun Pandian, Asish Kumar, Bentsi Leviav, Blake Jones, Bruno Volpato, Chris Jordan, Danny McCormick, Deji Ibrahim, Derrick Williams, Elia LIU, Ganesh Sivakumar, Jack McCluskey, Kenneth Knowles, Lalit Yadav, M Junaid Shaukat, Matej Aleksandrov, Prabhnoor Singh, Radek Stankiewicz, Radosław Stankiewicz, Reuven Lax, RuiLong J., Sam Whittle, Shunping Huang, Subramanya V, Tarun Annapareddy, Tobias Kaymak, TongruiLi, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, ZIHAN DAI, apanich, bambadiouf1, chenxuesdu, claudevdm, harshadkhetpal, johnjcasey, parveensania, tianz101

Apache Beam 2.73.0

Wed, 29 Apr 2026 09:00:00 -0700

We are happy to present the new 2.73.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.73.0, check out the detailed release notes.

Highlights

I/Os

DebeziumIO (Java): added OffsetRetainer interface and FileSystemOffsetRetainer implementation to persist and restore CDC offsets across pipeline restarts, and exposed withStartOffset / withOffsetRetainer on DebeziumIO.Read and the cross-language ReadBuilder (#28248).
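
The idea of persisting an offset and restoring it after a restart can be sketched in a few lines. The feature above is part of the Java SDK; the Python class below is only a conceptual analogy, and FileOffsetStore, its methods, and the offset shape ({"lsn": 42}) are hypothetical names that do not mirror the OffsetRetainer interface.

```python
import json
import os
import tempfile


class FileOffsetStore:
    """Hypothetical file-backed store: saves the latest offset and reloads it."""

    def __init__(self, path):
        self.path = path

    def save(self, offset):
        # Write to a temporary file, then rename it over the target,
        # so a crash mid-write never leaves a partially written offset file.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(offset, f)
        os.replace(tmp, self.path)

    def load(self):
        # Return None on the first run, when nothing has been persisted yet.
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)


def demo_restart():
    """Simulates two runs that share one offset file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "offset.json")
        first_run = FileOffsetStore(path)
        before = first_run.load()
        first_run.save({"lsn": 42})
        second_run = FileOffsetStore(path)
        return before, second_run.load()
```

In this sketch the first run finds no stored offset and would start from its configured position, while the second run resumes from the offset the first run saved.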

New Features / Improvements

(Python) Added BigQuery CDC streaming source (#37724)
Added ADKAgentModelHandler for running Google Agent Development Kit (ADK) agents (Python) (#37917).
(Python) Added exception chaining to preserve error context in CloudSQLEnrichmentHandler, processes utilities, and core transforms (#37422).
(Python) Added a pipeline option --experiments=pip_no_build_isolation to disable build isolation when installing dependencies in the runtime environment (#37331).
(Go) Added OrderedListState support to the Go SDK stateful DoFn API (#37629).
Added support for large pipeline options via a file (Python) (#37370).
Added support for inferring a schema from a dataclass (Python) (#22085). The default coder for non-frozen dataclasses that are typehinted (or set via with_output_types) changed to RowCoder. To preserve the old behavior (fast primitive coder), explicitly register the type with FastPrimitiveCoder.
Updates minimum Go version to 1.26.1 (#37897).
(Python) Added image embedding support in apache_beam.ml.rag package (#37628).
(Python) Added support for Python version 3.14 (#37247).
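
As a rough picture of what inferring a schema from a dataclass involves, the sketch below reads field names and types using only the standard library. It is an illustration under stated assumptions, not Beam's inference code: Purchase and infer_fields are hypothetical names, and the real SDK maps types to Beam schema field types rather than returning plain Python types.

```python
import dataclasses
import typing


@dataclasses.dataclass
class Purchase:
    """Hypothetical element type used only for this illustration."""
    user_id: int
    item: str
    price: float


def infer_fields(cls):
    """Returns (field name, field type) pairs in declaration order."""
    if not dataclasses.is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    # get_type_hints resolves string annotations into actual types.
    hints = typing.get_type_hints(cls)
    return [(f.name, hints[f.name]) for f in dataclasses.fields(cls)]
```

Calling infer_fields(Purchase) yields the three declared fields with their annotated types, which is the kind of information a row-based coder needs in order to encode each field separately.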

Breaking Changes

The Python SDK container’s boot.go now passes pipeline options through a file instead of the PIPELINE_OPTIONS environment variable. If a user pairs a new Python SDK container with an older SDK version (which does not support the file-based approach), the pipeline options will not be recognized and the pipeline will fail. Users must ensure their SDK and container versions are synchronized (#37370).
Python DoFn.with_exception_handling now respects user DoFn typehints. This can break update compatibility if coders change. It can also break pipeline compilation if existing typehints are incorrect. To update safely, specify the pipeline option --update_compatibility_version=2.72.0. To fix typehints, replace any incorrect typehints that were previously ignored (#37590).
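
The version mismatch described in the boot.go item above can be illustrated with a small sketch. Only the PIPELINE_OPTIONS variable name comes from the release note; the JSON encoding, the file name, and every function here are assumptions made for illustration and do not reproduce the container's actual boot logic.

```python
import json
import os
import tempfile


def write_options_file(options, directory):
    """Writes options as JSON; the file name is made up for this sketch."""
    path = os.path.join(directory, "pipeline_options.json")
    with open(path, "w") as f:
        json.dump(options, f)
    return path


def read_from_env(env):
    """Older-style reader: only looks at the environment variable."""
    return json.loads(env.get("PIPELINE_OPTIONS", "{}"))


def read_from_file(path):
    """Newer-style reader: loads the options from the file."""
    with open(path) as f:
        return json.load(f)


def demo_mismatch():
    """Shows what each reader sees when options are passed only by file."""
    options = {"runner": "DataflowRunner", "streaming": True}
    with tempfile.TemporaryDirectory() as d:
        path = write_options_file(options, d)
        env = {}  # the newer container no longer sets PIPELINE_OPTIONS
        return read_from_env(env), read_from_file(path)
```

In this sketch the environment-only reader sees no options at all, which mirrors why an older SDK paired with a newer container does not recognize the pipeline options.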

Bugfixes

Fixed ProcessManager not reaping child processes, causing zombie process accumulation on long-running Flink deployments (Java) (#37930).

Security Fixes

Fixed CVE-2023-46604 (CVSS 10.0) and CVE-2022-41678 by upgrading ActiveMQ from 5.14.5 to 5.19.2 (Java) (#37943).
Fixed CVE-2024-1597, CVE-2022-31197, and CVE-2022-21724 by upgrading PostgreSQL JDBC Driver from 42.2.16 to 42.6.2 (Java) (#37942).

List of Contributors

According to git shortlog, the following people contributed to the 2.73.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Alex Malao, Alexander Nieuwenhuijse, Andres Tiko, Andrew Crites, Arun Pandian, Bentsi Leviav, Bruno Volpato, Chamikara Jayalath, Chandra Kiran Bolla, Danny McCormick, Deji Ibrahim, Derrick Williams, Elia LIU, Esmelealem, Hannes Gustafsson, Jack McCluskey, Joey Tran, Kenneth Knowles, M Junaid Shaukat, Mansi Singh, Matej Aleksandrov, Mathijs Deelen, Mattie Fu, Praneet Nadella, Radek Stankiewicz, Radosław Stankiewicz, Reuven Lax, RuiLong J., S. Veyrié, Sakthivel Subramanian, Sam Whittle, Shubham Thakur, Shunping Huang, Subramanya V, Tarun Annapareddy, Tobias Kaymak, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, ZIHAN DAI, claudevdm, kishorepola, parveensania

Apache Beam 2.72.0

Mon, 30 Mar 2026 09:00:00 -0700

We are happy to present the new 2.72.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.72.0, check out the detailed release notes.

Highlights

Flink 2.0 support (#36947).

I/Os

Add Datadog IO support (Java) (#37318).
Removed Pubsublite IO support, since the service will be deprecated in March 2026 (#37375).
(Java) Migrated ClickHouse from the legacy JDBC driver (v0.6.3) to ClickHouse Java Client v2 (v0.9.6). See the class documentation for the migration guide (#37610).
(Java) Upgraded GoogleAdsIO to use GoogleAdsIO API v23 (#37620).

New Features / Improvements

(Python) Added exception chaining to preserve error context in CloudSQLEnrichmentHandler, processes utilities, and core transforms (#37422).
(Python) Added a pipeline option --experiments=pip_no_build_isolation to disable build isolation when installing dependencies in the runtime environment (#37331).
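
Exception chaining, mentioned in the first item above, is a standard Python feature: raising with "from" stores the original error on the new exception's __cause__ attribute, so the traceback shows both. The sketch below is a minimal illustration only; EnrichmentError, lookup, and describe_failure are hypothetical names and are not taken from the Beam codebase.

```python
class EnrichmentError(Exception):
    """Hypothetical wrapper error used only for this illustration."""


def lookup(table, key):
    """Looks up a key and wraps a failure without discarding its cause."""
    try:
        return table[key]
    except KeyError as err:
        # "from err" records the original exception on __cause__.
        raise EnrichmentError(f"lookup failed for key {key!r}") from err


def describe_failure(table, key):
    """Returns the class name of the underlying cause, or None on success."""
    try:
        lookup(table, key)
    except EnrichmentError as err:
        return type(err.__cause__).__name__
    return None
```

Without the "from err" clause, Python would still attach the KeyError as implicit context, but the explicit form states that the new error is a direct consequence of the original one.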

Deprecations

(Python) Removed previously deprecated list_prefix method for filesystem interfaces (#37587).

Bugfixes

Fixed (Yaml) issue with validate compatible method (#37588).
Fixed (Yaml) issue with Create transform dealing with different type elements (#37585).

Security Fixes

Fixed CVE-2024-28397 by switching from js2py to pythonmonkey (Yaml) (#37560).

List of Contributors

According to git shortlog, the following people contributed to the 2.72.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Andrew Crites, Arun Pandian, Ben Feinstein, Bentsi Leviav, Celeste Zeng, Danny McCormick, Danny Mccormick, Derrick Williams, Elia LIU, Ganesh, Jack McCluskey, Kenneth Knowles, Labesse Kévin, M Junaid Shaukat, Mansi Singh, Mattie Fu, Nayan Mathur, Pablo Estrada, Pirzada Ahmad Faraz, Radek Stankiewicz, Radosław Stankiewicz, Robert Bradshaw, Rohan Sah, RuiLong J., Sakthivel Subramanian, Sam Whittle, Shaheer Amjad, Shunping Huang, Steven van Rossum, Tarun Annapareddy, Tobias Kaymak, Valentyn Tymofieiev, Vitaly Terentyev, Yi Hu, XQ Hu, ZIHAN DAI, apanich, chenxuesdu, claudevdm, franzonia137

Apache Beam 2.71.0

Tue, 13 Jan 2026 09:00:00 -0700

We are happy to present the new 2.71.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.71.0, check out the detailed release notes.

I/Os

(Java) Elasticsearch 9 Support (#36491).
(Java) Upgraded HCatalogIO to Hive 4.0.1 (#32189).

New Features / Improvements

Support configuring Firestore database on ReadFn transforms (Java) (#36904).
(Python) Inference args are now allowed in most model handlers, except where they are explicitly/intentionally disallowed (#37093).

Bugfixes

Fixed FirestoreV1 Beam connectors allowing inconsistent project/database IDs to be configured between RPC requests and routing headers (Java) (#36895).
Logical type and coder registry are saved for pipelines in the case of default pickler (#36271). This fixes a side effect of switching to cloudpickle as default pickler in Beam 2.65.0 (Python) (#35738).

Known Issues

For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md

List of Contributors

According to git shortlog, the following people contributed to the 2.71.0 release. Thank you to all contributors!

Ahmed Abualsaud, Andrew Crites, apanich, Arun, Arun Pandian, assaf127, Chamikara Jayalath, CherisPatelInfocusp, Cheskel Twersky, Claire McGinty, Claude, Danny Mccormick, Derrick Williams, Egbert van der Wal, Evan Galpin, Ganesh, hekk-kaori-maeda, Jack Dingilian, Jack McCluskey, JayajP, Jiang Zhu, Kenneth Knowles, liferoad, M Junaid Shaukat, Nayan Mathur, Noah Stapp, Paco Avila, Radek Stankiewicz, Radosław Stankiewicz, Robert Stupp, Sam Whittle, Shunping Huang, Steven van Rossum, Suvrat Acharya, Tarun Annapareddy, tvalentyn, Utkarsh Parekh, Vitaly Terentyev, Xiaochu Liu, Yala Huang Feng, Yi Hu, Yu Watanabe, zhan7236

Apache Beam 2.70.0

Tue, 16 Dec 2025 15:00:00 -0500

We are happy to present the new 2.70.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.70.0, check out the detailed release notes.

Highlights

Flink 1.20 support added (#32647).

New Features / Improvements

Python examples added for Milvus search enrichment handler on Beam Website including jupyter notebook example (Python) (#36176).
Milvus sink I/O connector added (Python) (#36702). Now Beam has full support for Milvus integration including Milvus enrichment and sink operations.

Breaking Changes

(Python) Some Python dependencies have been split out into extras. To ensure all previously installed dependencies are installed, when installing Beam you can pip install apache-beam[gcp,interactive,yaml,redis,hadoop,tfrecord], though most users will not need all of these extras (#34554).

Deprecations

(Python) Python 3.9 reached EOL in October 2025 and support for the language version has been removed (#36665).

List of Contributors

According to git shortlog, the following people contributed to the 2.70.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Alex Chermenin, Andrew Crites, Arun Pandian, Celeste Zeng, Chamikara Jayalath, Chenzo, Claire McGinty, Danny McCormick, Derrick Williams, Dustin Rhodes, Enrique Calderon, Ian Liao, Jack McCluskey, Jessica Hsiao, Joey Tran, Karthik Talluri, Kenneth Knowles, Maciej Szwaja, Mehdi.D, Mohamed Awnallah, Praneet Nadella, Radek Stankiewicz, Radosław Stankiewicz, Reuven Lax, RuiLong J., S. Veyrié, Sam Whittle, Shunping Huang, Stephan Hoyer, Steven van Rossum, Tanu Sharma, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, changliiu, claudevdm, fozzie15, kristynsmith, wolfchris-google

Apache Beam 2.69.0

Tue, 28 Oct 2025 15:00:00 -0500

We are happy to present the new 2.69.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.69.0, check out the detailed release notes.

Highlights

(Python) Add YAML Editor and Visualization Panel (#35772).
(Java) Java 25 Support (#35627).

I/Os

Upgraded Iceberg dependency to 1.10.0 (#36123).

New Features / Improvements

Enhance JAXBCoder with XMLInputFactory support (Java) (#36446).
Python examples added for CloudSQL enrichment handler on Beam website (Python) (#35473).
Support for batch mode execution in WriteToPubSub transform added (Python) (#35990).
Added official support for Python 3.13 (#34869).
Added an optional output_schema verification to all YAML transforms (#35952).
Support for encryption when using GroupByKey added, along with --gbek pipeline option to automatically replace all GroupByKey transforms (Java/Python) (#36214).

Breaking Changes

(Python) dill is no longer a required, default dependency for Apache Beam (#21298).
- This change only affects pipelines that explicitly use the pickle_library=dill pipeline option.
- While dill==0.3.1.1 is still pre-installed on the official Beam SDK base images, it is no longer a direct dependency of the apache-beam Python package. This means it can be overridden by other dependencies in your environment.
- If your pipeline uses pickle_library=dill, you must manually ensure dill==0.3.1.1 is installed in both your submission and runtime environments.
  - Submission environment: Install the dill extra in your local environment with pip install apache-beam[gcp,dill].
  - Runtime (worker) environment: Your action depends on how you manage your worker’s environment.
    - If using default containers or custom containers with the official Beam base image, e.g. FROM apache/beam_python3.10_sdk:2.69.0
      - Add dill==0.3.1.1 to your worker’s requirements file (e.g., requirements.txt)
      - Pass this file to your pipeline using the --requirements_file requirements.txt pipeline option (For more details see managing Dataflow dependencies).
    - If using custom containers with a non-Beam base image, e.g. FROM python:3.9-slim
      - Install apache-beam with the dill extra in your docker file e.g. RUN pip install --no-cache-dir apache-beam[gcp,dill]
- If there is a dill version mismatch between submission and runtime environments you might encounter unpickling errors like Can't get attribute '_create_code' on <module 'dill._dill' from....
- If dill is not installed in the runtime environment you will see the error ImportError: Pipeline option pickle_library=dill is set, but dill is not installed...
- Report any issues you encounter when using pickle_library=dill to the GitHub issue (#21298)
(Python) Added a pickle_library=dill_unsafe pipeline option. This allows overriding dill==0.3.1.1 using dill as the pickle_library. Use with extreme caution. Other versions of dill have not been tested with Apache Beam (#21298).
(Python) The deterministic fallback coder for complex types like NamedTuple, Enum, and dataclasses now normalizes filepaths for better determinism guarantees. This affects streaming pipelines updating from 2.68 to 2.69 that utilize this fallback coder. If your pipeline is affected, you may see a warning like: “Using fallback deterministic coder for type X…”. To update safely, specify the pipeline option --update_compatibility_version=2.68.0 (#36345).
(Python) Fixed transform naming conflict when executing DataTransform on a dictionary of PColls (#30445). This may break update compatibility if you don’t provide a --transform_name_mapping.
Removed deprecated Hadoop versions (2.10.2 and 3.2.4) that are no longer supported for Iceberg from IcebergIO (#36282).
(Go) Coder construction on SDK side is more faithful to the specs from runners without stripping length-prefix. This may break streaming pipeline update as the underlying coder could be changed (#36387).
Minimum Go version for Beam Go updated to 1.25.2 (#36461).
(Java) DoFn OutputReceiver now requires implementing a builder method as part of extended metadata support for elements (#34902).
(Java) Removed ProcessContext outputWindowedValue introduced in 2.68 that allowed setting offset and record Id. Use OutputReceiver’s builder to set those fields (#36523).
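
For pipelines that set pickle_library=dill, the version requirement described in the dill item above can be checked before submission. The sketch below is a hypothetical pre-flight helper, not part of Beam: check_dill and installed_dill_version are made-up names, and only the 0.3.1.1 version number comes from the release note. It relies on the standard importlib.metadata module.

```python
from importlib import metadata

REQUIRED_DILL = "0.3.1.1"  # version named in the release note


def check_dill(installed_version):
    """Returns a problem description, or None if the version matches."""
    if installed_version is None:
        return "dill is not installed"
    if installed_version != REQUIRED_DILL:
        return f"dill {installed_version} found, expected {REQUIRED_DILL}"
    return None


def installed_dill_version():
    """Looks up the installed dill version without importing dill."""
    try:
        return metadata.version("dill")
    except metadata.PackageNotFoundError:
        return None
```

Running check_dill(installed_dill_version()) in both the submission and the worker environment is one way to catch a mismatch before it surfaces as an unpickling error.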

Bugfixes

Fixed passing of pipeline options to x-lang transforms when called from the Java SDK (Java) (#36443).
PulsarIO has now changed support status from incomplete to experimental. Both read and writes should now minimally function (un-partitioned topics, without schema support, timestamp ordered messages for read) (Java) (#36141).
Fixed Spanner Change Stream reading stuck issue due to watermark of partition moving backwards (#36470).

List of Contributors

According to git shortlog, the following people contributed to the 2.69.0 release. Thank you to all contributors!

Abdelrahman Ibrahim, Ahmed Abualsaud, Andrew Crites, Arun Pandian, Bryan Dang, Chamikara Jayalath, Charles Nguyen, Chenzo, Clay Johnson, Danny McCormick, David A, Derrick Williams, Enrique Calderon, Hai Joey Tran, Ian Liao, Ian Mburu, Jack McCluskey, Jiang Zhu, Joey Tran, Kenneth Knowles, Kyle Stanley, Maciej Szwaja, Minbo Bae, Mohamed Awnallah, Radek Stankiewicz, Radosław Stankiewicz, Razvan Culea, Reuven Lax, Sagnik Ghosh, Sam Whittle, Shunping Huang, Steven van Rossum, Talat UYARER, Tanu Sharma, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, Yilei, claudevdm, flpablo, fozzie15, johnjcasey, lim1t, parveensania, yashu

Google Summer of Code 2025 - Enhanced Interactive Pipeline Development Environment for JupyterLab

Tue, 14 Oct 2025 00:00:00 +0800

GSoC 2025 Basic Information

Student: Canyu Chen (@Chenzo1001)
Mentors: XQ Hu (@liferoad)
Organization: Apache Beam
Proposal Link: Here

Project Overview

BeamVision significantly enhances the Apache Beam development experience within JupyterLab by providing a unified, visual interface for pipeline inspection and analysis. This project successfully delivered a production-ready JupyterLab extension that replaces fragmented workflows with an integrated workspace, featuring a dynamic side panel for pipeline visualization and a multi-tab interface for comparative workflow analysis.

Core Achievements:

Modernized Extension: Upgraded the JupyterLab Sidepanel to v4.x, ensuring compatibility with the latest ecosystem and releasing the package on both NPM and PyPI.

YAML Visualization Suite: Implemented a powerful visual editor for Beam YAML, combining a code editor, an interactive flow chart (built with @xyflow/react-flow), and a collapsible key-value panel for intuitive pipeline design.

Enhanced Accessibility & Stability: Added pip installation support and fixed critical bugs in Interactive Beam, improving stability and user onboarding.

Community Engagement: Active participation in the Beam community, including contributing to a hackathon project and successfully integrating all work into the Apache Beam codebase via merged Pull Requests.

Development Workflow

As early as the beginning of March, I saw Apache’s project information on the official GSoC website and came across Beam among the projects released by Apache. Since I have some interest in front-end development and wanted to truly integrate into the open-source community for development work, I contacted mentor XQ Hu via email and received positive feedback from him. In April, XQ Hu posted notes for all GSoC students on the Beam Mailing List. It was essential to keep an eye on the Mailing List promptly. Between March and May, besides completing the project proposal and preparation work, I also used my spare time to partially migrate the Beam JupyterLab Extension to version 4.0. This helped me get into the development state more quickly.

I also participated in the Beam Hackathon held in May. There were several topics to choose from, and I opted for the free topic. This allowed me to implement any innovative work on Beam. I combined Beam and GCP to create an Automatic Emotion Analysis Tool for comments. This tool integrates Beam Pipeline, Flink, Docker, and GCP to collect and perform sentiment analysis on real-time comment stream data, storing the results in GCP’s BigQuery. This is a highly meaningful task because sentiment analysis of comments can help businesses better understand users’ opinions about their products, thereby improving the products more effectively. However, the time during the Hackathon was too tight, so I haven’t fully completed this project yet, and it can be further improved later. This Hackathon gave me a deeper understanding of Beam and GCP, and also enhanced my knowledge of the development of the Beam JupyterLab Extension.

In June, I officially started the project development and maintained close communication with my mentor to ensure the project progressed smoothly. XQ Hu and I held a half-hour weekly meeting every Monday on Google Meet, primarily to address issues encountered during the previous week’s development and to discuss the tasks for the upcoming week. XQ Hu is an excellent mentor, and I had no communication barriers with him whatsoever. He is also very understanding; sometimes, when I needed to postpone some development tasks due to personal reasons, he was always supportive and gave me ample freedom. During this month, I improved the plugin to make it fully compatible with JupyterLab 4.0.

In July and August, I made some modifications to the plugin’s source code structure and published it on PyPI to facilitate user installation and promote the plugin. During this period, I also fixed several bugs. Afterwards, I began developing a new feature: the YAML visual editor (design doc HERE). This feature is particularly meaningful because Beam’s Pipeline is described through YAML files, and a visual editor for YAML files can significantly improve developers’ efficiency. In July, I published the proposal for the YAML visual editor and, after gathering feedback from the community for some time, started working on its development. Initially, I planned to use native Cytoscape to build the plugin from scratch, but the workload was too heavy, and there were many mature flow chart plugins in the open-source community that could be referenced. Therefore, I chose XYFlow as the component for flow visualization and integrated it into the plugin. In August, I further optimized the YAML visual editor and fixed some bugs.

In September, I completed the project submission, passed Google’s review, and successfully concluded the project.

Development Conclusion

Overall, collaborating with Apache Beam’s developers was a very enjoyable process. I learned a lot about Beam, and since I am a student engaged in high-performance geographic computing research, Beam may play a significant role in my future studies and work.

I am excited to remain an active member of the Beam community. I hope to continue contributing to its development, applying what I have learned to both my academic pursuits and future collaborative projects. The experience has strengthened my commitment to open-source innovation and has set a strong foundation for ongoing participation in Apache Beam and related technologies.

Special Thanks

I would like to express my sincere gratitude to my mentor XQ Hu for his guidance and support throughout the project. Without his help, I would not have been able to complete this project successfully. His professionalism, patience, and passion have been truly inspiring. As a Google employee, he consistently dedicated time each week to the open-source community and willingly assisted students like me. His selfless dedication to open source is something I deeply admire and strive to emulate. He is also an exceptionally devoted teacher who not only imparted technical knowledge but also taught me how to communicate more effectively, handle interpersonal relationships, and collaborate better in a team setting. He always patiently addressed my questions and provided invaluable advice. I am immensely grateful to him and hope to have the opportunity to work with him again in the future.

I also want to thank the Apache Beam community for their valuable feedback and suggestions, which have greatly contributed to the improvement of the plugin. I feel incredibly fortunate that we, as a society, have open-source communities where individuals contribute their intellect and time to drive collective technological progress and innovation. These communities provide students like me with invaluable opportunities to grow and develop rapidly.

Finally, I would like to thank the Google Summer of Code program for providing me with this opportunity to contribute to open-source projects and gain valuable experience. Without Google Summer of Code, I might never have had the chance to engage with so many open-source projects, take that first step into the open-source community, or experience such substantial personal and professional growth.

Google Summer of Code 2025 - Beam ML Vector DB/Feature Store integrations

Fri, 26 Sep 2025 00:00:00 -0400

What Will I Cover In This Blog Post?

I have three objectives in mind when writing this blog post:

Documenting the work I’ve been doing during this GSoC period in collaboration with the Apache Beam community
A thoughtful and cumulative thank you to my mentor and the Beam Community
Writing to an older version of myself before making my first ever contribution to Beam. This can be helpful for future contributors

What Was This GSoC Project About?

The goal of this project is to enhance Beam’s Python SDK by developing connectors for vector databases like Milvus and feature stores like Tecton. These integrations will improve support for ML use cases such as Retrieval-Augmented Generation (RAG) and feature engineering. By bridging Beam with these systems, this project will attract more users, particularly in the ML community.

Why Was This Project Important?

While Beam’s Python SDK supports some vector databases, feature stores and embedding generators, the current integrations are limited to a few systems as mentioned in the tables down below. Expanding this ecosystem will provide more flexibility and richness for ML workflows particularly in feature engineering and RAG applications, potentially attracting more users, particularly in the ML community.

Vector Database	Feature Store	Embedding Generator
BigQuery	Vertex AI	Vertex AI
AlloyDB	Feast	Hugging Face

Why Did I Choose Beam As Part of GSoC Among 180+ Orgs?

I chose to apply to Beam from among 180+ GSoC organizations because it aligns well with my passion for data processing systems that serve information retrieval systems and my core career values:

Freedom: Working on Beam supports open-source development, liberating developers from vendor lock-in through its unified programming model while enabling services like Project Shield to protect free speech globally
Innovation: Working on Beam allows engagement with cutting-edge data processing techniques and distributed computing paradigms
Accessibility: Working on Beam helps build open-source technology that makes powerful data processing capabilities available to all organizations regardless of size or resources. This accessibility enables projects like Project Shield to provide free protection to media, elections, and human rights websites worldwide

What Did I Work On During the GSoC Program?

During my GSoC program, I focused on developing connectors for vector databases, feature stores, and embedding generators to enhance Beam’s ML capabilities. Here are the artifacts I worked on and what remains to be done:

Type	System	Artifact
Enrichment Handler	Milvus	PR #35216 PR #35577 PR #35467
Sink I/O	Milvus	PR #35708 PR #35944
Enrichment Handler	Tecton	PR #36062
Sink I/O	Tecton	PR #36078
Embedding Gen	OpenAI	PR #36081
Embedding Gen	Anthropic	To Be Added

Here are side-artifacts that are not directly linked to my project:

Type	System	Artifact
AI Code Review	Gemini Code Assist	PR #35532
Enrichment Handler	CloudSQL	PR #34398 PR #35473
Pytest Markers	GitHub CI	PR #35655 PR #35740 PR #35816

For more granular contributions, check out my ongoing Beam contributions.

How Did I Approach This Project?

My approach centered on community-driven design and iterative implementation, originally inspired by my mentor’s work. Here’s how it looked:

Design Document: Created a comprehensive design document outlining the proposed ML connector architecture
Community Feedback: Shared the design with the Beam developer community mailing list for review
Iterative Implementation: Incorporated community feedback and applied learnings in subsequent pull requests
Continuous Improvement: Refined the approach based on real-world usage patterns and maintainer guidance

Here are some samples of those design docs:

Component	Type	Design Document
Milvus	Vector Enrichment Handler	[Proposal][GSoC 2025] Milvus Vector Enrichment Handler for Beam
Milvus	Vector Sink I/O Connector	[Proposal][GSoC 2025] Milvus Vector Sink I/O Connector for Beam
Tecton	Feature Store Enrichment Handler	[Proposal][GSoC 2025] Tecton Feature Store Enrichment Handler for Beam
Tecton	Feature Store Sink I/O Connector	[Proposal][GSoC 2025] Tecton Feature Store Sink I/O Connector for Beam

Where Did Challenges Arise During The Project?

There were 2 places where challenges arose:

Running Docker TestContainers in Beam Self-Hosted CI Environment: The main challenge was that Beam runs in CI on Ubuntu 20.04, which caused compatibility and connectivity issues with Milvus TestContainers due to the Docker-in-Docker environment. This version compatibility problem led to container startup failures and network connectivity issues. After several rounds of trial and error, I eventually tested with Ubuntu latest (which at the time of writing this blog post is Ubuntu 25.04), and no issues arose
Triggering and Modifying the PostCommit Python Workflows: This challenge magnified the above issue since for every experiment update to the given workflow, I had to do a round trip to my mentor to include those changes in the relevant workflow files and evaluate the results. I also wasn’t aware that someone can trigger post-commit Python workflows by updating the trigger files in .github/trigger_files until near the middle of GSoC. I discovered there is actually a workflows README document in .github/workflows/README.md that was not referenced in the CONTRIBUTING.md file at the time of writing this post

How Did This Project Start To Attract Users in the ML Community?

After the Milvus Enrichment Handler PR was opened, and before it was even merged, we started to see community-driven contributions like this one that adds Qdrant, a competitor to Milvus in the vector database space. This demonstrates how the project’s momentum and visibility in the ML community space attracted contributors who wanted to expand the Beam ML ecosystem with additional vector database integrations.

How Did This GSoC Experience Working With Beam Community Shape Me?

If I have to boil it down across three dimensions, they would be:

Mindset: Before, I mostly worked in solitude, opening PRs for new integrations with my fingers crossed, hoping there would be no disagreement on the design. Now I engage the people I am working with through design docs, making sure my work aligns with their vision, which potentially leads to faster PR merges
Skillset: I had last written Python professionally a year before contributing to Beam, so it was a great opportunity to brush up on my Python skills and see how some design patterns are used in practice, like the query builder pattern in CloudSQL Vector Ingestion in the RAG package. I also learned about vector databases, feature stores, and some AI integrations. I also think I got better at root cause analysis and at filtering signal from noise in long log files, such as those from the PostCommit Python workflows
Toolset: Learning about Beam Python SDK, Milvus, Tecton, Google CloudSQL, OpenAI and Anthropic text embedding generators, and lnav for effective log file navigation, including their capabilities and limitations

Tips for Future Contributors

If I have to boil them down to three, they would be:

Observing: Observing how experienced developers in the Beam dev team work: how their PRs look, how they write design docs, what kind of feedback they get on their design docs and PRs, and how you can apply it (if feasible) to avoid getting the same feedback again. What kind of follow-up PRs do they create after their initial ones? How do they document and illustrate their work? What kind of comments do they post when reviewing other people’s related work? Over time, you build your own mental model and knowledge base on how the ideal contribution looks in this area. There is a lot to learn and explore in an exciting, not intimidating way
Orienting: Understanding your place in the ecosystem and aligning your work with the project’s context. This means grasping how your contribution fits into Beam’s architecture and roadmap, identifying your role in addressing current gaps, and mapping stakeholders who will review, use, and maintain your work. Most importantly, align with both your mentor’s vision and the community’s vision to ensure your work serves the broader goals
Acting: Acting on feedback from code reviews, design document discussions, and community input. This means thoughtfully addressing suggested changes in a way that moves the discussion forward, addressing concerns raised by maintainers, and iterating on your work based on community guidance. It also means being responsive to feedback, asking clarifying questions when needed, and demonstrating that you’re incorporating the community’s input into your contributions, provided that it aligns with the project direction

Who Do I Want To Thank for Making This Journey Possible?

If I have to boil them down to three, they would be:

My Mentor, Danny McCormick: I wouldn’t hesitate to say that Danny is the best mentor I have worked with so far, given that I have worked with several mentors. What makes me say that:
- Generosity: Danny is very generous with his time and feedback, and genuinely committed to reviewing my work on a regular basis. We had weekly 30-minute sync calls for almost 21 weeks (5 months) from the official community bonding period onward, where he shared his contextual expertise and addressed any questions I had. He was open to extending the time if needed and flexible about skipping calls when there was no agenda
- Flexibility: When I got accepted to GSoC, after a few days I also got accepted to a part-time internship that I had applied to before GSoC, while also managing my last semester in my Bachelor of Computer Science, which was probably the hardest semester. During our discussion about working capacity, Danny was very flexible regarding that, with more emphasis on making progress, which encouraged me to make even more progress. I have also never felt there are very hard boundaries around my project scope; I felt there was an area to explore that motivated me to think of and add some side-artifacts to Beam, e.g., adding Gemini Code Assist for AI code review
- Proactivity: Danny was very proactive in offering support and help without my having to ask, e.g., creating Beam Infra tickets to add the API keys that unblocked my work
Beam Community: From my first ever contribution to Beam, adding FlattenWith and Tee examples to the playground, I was welcomed with open arms and felt encouraged to make more contributions. I am also grateful for their valuable comments on my design documents on the dev mailing list, as well as on the PRs
Google: I would like to genuinely thank Google for introducing me to open source in GSoC 2023 and giving me a second chance to interact with Apache Beam through GSoC 2025. Without it, I probably wouldn’t be here writing this blog post, nor would I have this fruitful experience

What’s Next?

I am now focusing on helping move the remaining artifacts in this project scope from the in-progress state to the merged state. After this, I would love to keep contributing to Beam, in the Python and Go SDKs, to name a few. I would also love to connect with you all on my LinkedIn and GitHub.

Google Summer of Code 2025 - Beam YAML, Kafka and Iceberg User Accessibility

Tue, 23 Sep 2025 00:00:00 -0400

The relatively new Beam YAML SDK was introduced in the spirit of making data processing easy, but it has gained little adoption for complex ML tasks and hasn’t been widely used with Managed I/O such as Kafka and Iceberg. As part of Google Summer of Code 2025, new illustrative, production-ready pipeline examples of ML use cases with Kafka and Iceberg data sources using the YAML SDK have been developed to address this adoption gap.

Context

The YAML SDK was introduced in Spring 2024 as Beam’s first no-code SDK. It follows a declarative approach of defining a data processing pipeline using a YAML DSL, as opposed to other programming language specific SDKs. At the time, it had few meaningful examples and documentation to go along with it. Key missing examples were ML workflows and integration with the Kafka and Iceberg Managed I/O. Foundational work had already been done to add support for ML capabilities as well as Kafka and Iceberg IO connectors in the YAML SDK, but there were no end-to-end examples demonstrating their usage.

Beam, Kafka, and Iceberg are mainstream big data technologies, but they also have a learning curve. The overall theme of the project is to help democratize data processing for scientists and analysts who traditionally don’t have a strong background in software engineering. They can now refer to these meaningful examples as a starting point, helping them onboard faster and be more productive when authoring ML/data pipelines for their use cases with Beam and its YAML DSL.

Contributions

The data pipelines/workflows developed are production-ready: Kafka and Iceberg data sources are set up on GCP, and the data used are raw public datasets. The pipelines are tested end-to-end on Google Cloud Dataflow and are also unit tested to ensure correct transformation logic.

The delivered pipelines/workflows, each documented in a README.md, address the four main ML use cases below:

Streaming Classification Inference: A streaming ML pipeline that demonstrates Beam YAML capability to perform classification inference on a stream of incoming data from Kafka. The overall workflow also includes DistilBERT model deployment and serving on Google Cloud Vertex AI, which the pipeline accesses for remote inference. The pipeline is applied to a sentiment analysis task on a stream of YouTube comments, preprocessing data and classifying whether a comment is positive or negative. See pipeline and documentation.
Streaming Regression Inference: A streaming ML pipeline that demonstrates Beam YAML capability to perform regression inference on a stream of incoming data from Kafka. The overall workflow also includes custom model training, deployment and serving on Google Cloud Vertex AI, which the pipeline accesses for remote inference. The pipeline is applied to a regression task on a stream of taxi rides, preprocessing data and predicting the fare amount for every ride. See pipeline and documentation.
Batch Anomaly Detection: An ML workflow that demonstrates ML-specific transformations and reading from/writing to Iceberg IO. The workflow contains unsupervised model training and several pipelines that leverage Iceberg for storing results, BigQuery for storing vector embeddings and MLTransform for computing embeddings to demonstrate an end-to-end anomaly detection workflow on a dataset of system logs. See workflow and documentation.
Feature Engineering & Model Evaluation: An ML workflow that demonstrates Beam YAML capability to do feature engineering, which is subsequently used for model evaluation, and its integration with Iceberg IO. The workflow contains model training and several pipelines, showcasing an end-to-end Fraud Detection MLOps solution that generates features and evaluates models to detect credit card transaction fraud. See workflow and documentation.

Challenges

The main challenge of the project was a lack of previous YAML pipeline examples and good documentation to rely on. Unlike the Python or Java SDKs, where there are already many notebooks and end-to-end examples demonstrating various use cases, the examples for the YAML SDK only involved simple transformations such as filter, group by, etc. More complex transforms like MLTransform and ReadFromIceberg had no examples and required configurations that had no clear API reference at the time. As a result, there were a lot of deep dives into the actual implementation of the PTransforms across the YAML, Python and Java SDKs to understand the error messages and how to correctly use the transforms.

Another challenge was writing unit tests for the pipeline to ensure that the pipeline’s logic is correct. It was a learning curve to understand how the existing test suite is set up and how it can be used to write unit tests for the data pipelines. A lot of time was spent on properly writing mocks for the pipeline’s sources and sinks, as well as for the transforms that require external services such as Vertex AI.
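
The idea of mocking an external service in a unit test can be sketched as below. This is a minimal illustration using only Python's standard library, not the test setup used in the Beam repository: the functions preprocess and classify, and the client.predict method, are hypothetical stand-ins for a pipeline's transformation logic and a remote inference call.

```python
from unittest import mock


def preprocess(comment: str) -> str:
    """Normalize a raw comment: trim whitespace and lowercase."""
    return comment.strip().lower()


def classify(comments, client):
    """Preprocess each comment and label it using a remote model client.

    `client.predict` is a hypothetical stand-in for a remote inference call.
    """
    results = []
    for raw in comments:
        text = preprocess(raw)
        if not text:
            continue  # drop empty comments before calling the model
        results.append((text, client.predict(text)))
    return results


# In a test, the remote client is replaced by a mock so no network call is made.
fake_client = mock.Mock()
fake_client.predict.side_effect = (
    lambda text: "positive" if "love" in text else "negative"
)

labeled = classify(["  I LOVE this video ", "", "Too long"], fake_client)
```

Because the mock records its calls, a test can check both the transformation output and that the empty comment never reached the model.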

Conclusion & Personal Thoughts

These production-ready pipelines demonstrate the potential of the Beam YAML SDK for authoring complex ML workflows that interact with Iceberg and Kafka. The examples are a nice addition to Beam, especially with the upcoming Beam 3.0.0 milestones, which focus on low-code/no-code, ML capabilities and Managed I/O.

I had an amazing time working with the big data technologies Beam, Iceberg, and Kafka as well as many Google Cloud services (Dataflow, Vertex AI and Google Kubernetes Engine, to name a few). I’ve always wanted to work more in the ML space, and this experience has been a great growth opportunity for me. Google Summer of Code this year has been selective, and the project’s success would not have been possible without the support of my mentor, Chamikara Jayalath. It’s been a pleasure working closely with him and the broader Beam community to contribute to this open-source project that has a meaningful impact on the data engineering community.

My advice for future Google Summer of Code participants is to first and foremost research and choose a project that aligns closely with your interests. Most importantly, spend a lot of time making yourself visible and writing a good proposal when the program opens for applications. Being visible (e.g. by sharing your proposal, or generally any ideas and questions, on the project’s communication channel early on) makes it more likely for you to be selected. A good proposal will not only make you even more likely to be accepted into the program, but also give you a lot of confidence when contributing to and completing the project.

Apache Beam 2.68.0

Mon, 22 Sep 2025 15:00:00 -0500

We are happy to present the new 2.68.0 release of Beam. This release includes both improvements and new functionality. See the download page for this release.

For more information on changes in 2.68.0, check out the detailed release notes.

Highlights

[Python] Prism runner now enabled by default for most Python pipelines using the direct runner (#34612). This may break some tests, see https://github.com/apache/beam/pull/34612 for details on how to handle issues.

I/Os

Upgraded Iceberg dependency to 1.9.2 (#35981)

New Features / Improvements

BigtableRead Connector for BeamYaml added with new Config Param (#35696)
MongoDB Java driver upgraded from 3.12.11 to 5.5.0 with API refactoring and GridFS implementation updates (Java) (#35946).
Introduced a dedicated module for JUnit-based testing support: sdks/java/testing/junit, which provides TestPipelineExtension for JUnit 5 while maintaining backward compatibility with existing JUnit 4 TestRule-based tests (Java) (#18733, #35688).
- To use JUnit 5 with Beam tests, add a test-scoped dependency on org.apache.beam:beam-sdks-java-testing-junit.
Google CloudSQL enrichment handler added (Python) (#34398). Beam now supports data enrichment capabilities using SQL databases, with built-in support for:
- Managed PostgreSQL, MySQL, and Microsoft SQL Server instances on CloudSQL
- Unmanaged SQL database instances not hosted on CloudSQL (e.g., self-hosted or on-premises databases)
[Python] Added the ReactiveThrottler and ThrottlingSignaler classes to streamline throttling behavior in DoFns and expose throttling mechanisms to users (#35984).
Added a pipeline option to specify the processing timeout for a single element by any PTransform (Java/Python/Go) (#35174).
- When specified, the SDK harness automatically restarts if an element takes too long to process. Beam runner may then retry processing of the same work item.
- Use the --element_processing_timeout_minutes option to reduce the chance of having stalled pipelines due to unexpected cases of slow processing, where slowness might not happen again if processing of the same element is retried.
(Python) Added GCP Spanner Change Stream support (apache_beam.io.gcp.spanner) (#24103).
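
The timeout-then-retry behavior described above for --element_processing_timeout_minutes can be illustrated with a toy model. This sketch is not how the SDK harness is implemented: the helper process_with_timeout is hypothetical, it uses seconds rather than minutes so the demo runs quickly, and because a Python thread cannot be killed, a stalled attempt is simply abandoned instead of restarting a process.

```python
import concurrent.futures
import time


def process_with_timeout(fn, element, timeout_s, max_attempts=3):
    """Run fn(element), abandoning and retrying any attempt that exceeds timeout_s.

    Returns (result, attempt_number). Hypothetical helper for illustration only.
    """
    for attempt in range(1, max_attempts + 1):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn, element)
        try:
            return future.result(timeout=timeout_s), attempt
        except concurrent.futures.TimeoutError:
            pass  # attempt took too long; fall through and retry
        finally:
            pool.shutdown(wait=False)  # do not block on the stalled attempt
    raise TimeoutError(f"element {element!r} timed out {max_attempts} times")


calls = {"count": 0}


def sometimes_slow(x):
    """Stall only on the first call, mimicking a one-off slow element."""
    calls["count"] += 1
    if calls["count"] == 1:
        time.sleep(0.6)  # simulated stall that does not recur on retry
    return x * 2


result, attempts = process_with_timeout(sometimes_slow, 21, timeout_s=0.2)
```

The demo mirrors the case the release note calls out: the slowness does not recur, so the retried attempt completes normally.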

Breaking Changes

Previously deprecated Beam ZetaSQL component has been removed (#34423). ZetaSQL users could migrate to Calcite SQL with BigQuery dialect enabled.
Upgraded Beam vendored Calcite to 1.40.0 for Beam SQL (#35483), which improves support for BigQuery and other SQL dialects. Note: Minor behavior changes are observed such as output significant digits related to casting.
(Python) The deterministic fallback coder for complex types like NamedTuple, Enum, and dataclasses now uses cloudpickle instead of dill. If your pipeline is affected, you may see a warning like: “Using fallback deterministic coder for type X…”. You can revert to the previous behavior by using the pipeline option --update_compatibility_version=2.67.0 (#35725). Report any pickling-related issues to #34903.
(Python) Prism runner now enabled by default for most Python pipelines using the direct runner (#34612). This may break some tests, see https://github.com/apache/beam/pull/34612 for details on how to handle issues.
Dropped Java 8 support for IO expansion-service. Cross-language pipelines using this expansion service will need a Java 11+ runtime (#35981).
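
For readers unfamiliar with why the fallback coder above needs to be deterministic: equal values must always encode to the same bytes, for example when they are used as grouping keys. The sketch below illustrates the property with the standard library only. It does not use cloudpickle or dill, and deterministic_encode is a hypothetical helper, not a Beam coder.

```python
import json
import pickle

a = {"user": "x", "score": 1}
b = {"score": 1, "user": "x"}  # equal to `a`, but built in a different order

# pickle writes dict items in insertion order, so equal dicts can yield
# different bytes: this encoding is not deterministic.
pickle_matches = pickle.dumps(a) == pickle.dumps(b)


def deterministic_encode(value: dict) -> bytes:
    """Encode a dict with sorted keys so equal dicts give identical bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


deterministic_matches = deterministic_encode(a) == deterministic_encode(b)
```

Here pickle_matches is False while deterministic_matches is True, even though a == b holds in both cases.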

Deprecations

Python SDK native SpannerIO (apache_beam/io/gcp/experimental/spannerio) is deprecated. Use cross-language wrapper (apache_beam/io/gcp/spanner) instead (Python) (#35860).
Samza runner is deprecated and scheduled for removal in Beam 3.0 (#35448).
Twister2 runner is deprecated and scheduled for removal in Beam 3.0 (#35905).

Bugfixes

(Python) Fixed the Java YAML provider failing on Windows (#35617).
Fixed BigQueryIO creating temporary datasets in wrong project when temp_dataset is specified with a different project than the pipeline project. For some jobs, temporary datasets will now be created in the correct project (Python) (#35813).
(Go) Fix duplicates due to reads after blind writes to Bag State (#35869).
- Earlier Go SDK versions can avoid the issue by not reading in the same call after a blind write.

List of Contributors

According to git shortlog, the following people contributed to the 2.68.0 release. Thank you to all contributors!

Ahmed Abualsaud, Andrew Crites, Ashok Devireddy, Chamikara Jayalath, Charles Nguyen, Danny McCormick, Davda James, Derrick Williams, Diego Hernandez, Dip Patel, Dustin Rhodes, Enrique Calderon, Hai Joey Tran, Jack McCluskey, Kenneth Knowles, Keshav, Khorbaladze A., LEEKYE, Lanny Boarts, Mattie Fu, Minbo Bae, Mohamed Awnallah, Naireen Hussain, Nathaniel Young, Radosław Stankiewicz, Razvan Culea, Robert Bradshaw, Robert Burke, Sam Whittle, Shehab, Shingo Furuyama, Shunping Huang, Steven van Rossum, Suvrat Acharya, Svetak Sundhar, Tarun Annapareddy, Tom Stepp, Valentyn Tymofieiev, Vitaly Terentyev, XQ Hu, Yi Hu, apanich, arnavarora2004, claudevdm, flpablo, kristynsmith, shreyakhajanchi