Apache Beam

Launching Apache Beam pipelines written in Java.

Last Updated: Feb. 5, 2021

Access Instructions

Install the Apache Beam provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.


jar (str): The reference to a self-executing Apache Beam jar (templated).

runner (str): Runner on which the pipeline will be run. "DirectRunner" is used by default. See:

job_class (str): The name of the Apache Beam pipeline class to be executed; it is often not the main class configured in the pipeline jar file.

default_pipeline_options (dict): Map of default job pipeline_options.

pipeline_options (dict): Map of job-specific pipeline_options. The option names are the dictionary keys; each value can be one of several types:
- If the value is None, the single flag --key (without a value) is added.
- If the value is False, the option is skipped.
- If the value is True, the single flag --key (without a value) is added.
- If the value is a list, one option is added per item. For example, if the value is ['A', 'B'] and the key is key, then --key=A --key=B is added.
- Any other value type is replaced with its Python textual representation.
When defining labels (the labels option), you can also provide a dictionary.

gcp_conn_id (str): The connection ID to use when connecting to Google Cloud Storage, if the jar is on GCS.

delegate_to (str): The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

dataflow_config (Union[dict,]): Dataflow configuration, used when the runner type is set to DataflowRunner.
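The value rules described for pipeline_options above can be sketched in a few lines of plain Python. This is an illustration only, not the provider's own code: the helper name options_to_args and the sample option names are hypothetical, and the special handling of a dictionary passed as labels is not covered.

```python
from typing import Any, Dict, List


def options_to_args(options: Dict[str, Any]) -> List[str]:
    """Render an options dict as command-line flags (hypothetical helper)."""
    args: List[str] = []
    for key, value in options.items():
        if value is None or value is True:
            # None and True both produce a bare flag with no value.
            args.append(f"--{key}")
        elif value is False:
            # False means the option is skipped entirely.
            continue
        elif isinstance(value, list):
            # A list produces one --key=item flag per element.
            args.extend(f"--{key}={item}" for item in value)
        else:
            # Anything else is rendered with its textual representation.
            args.append(f"--{key}={value}")
    return args


# Sample option names below are made up for illustration.
example_args = options_to_args(
    {"streaming": True, "dryRun": False, "verbose": None, "key": ["A", "B"], "numWorkers": 3}
)
print(example_args)
```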


Note that default_pipeline_options and pipeline_options are merged to specify the pipeline execution parameters. default_pipeline_options is expected to hold high-level pipeline_options, for instance project and zone information, that apply to all Apache Beam operators in the DAG.
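As a rough sketch of the merge described above, the two maps can be combined with plain dictionary unpacking. The text does not say which map wins when both define the same key; this example assumes the job-specific pipeline_options take precedence. All values are hypothetical.

```python
# Shared, high-level options (hypothetical values).
default_pipeline_options = {"project": "my-project", "zone": "europe-west1-b"}

# Options specific to one job (hypothetical values).
pipeline_options = {"output": "/tmp/wordcount-output", "zone": "us-central1-a"}

# Assumption: on a key clash, the job-specific value overrides the default.
merged_options = {**default_pipeline_options, **pipeline_options}
print(merged_options)
```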

For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

For more detail on Apache Beam have a look at the reference:

You need to pass the path to your jar file as a file reference with the jar parameter. The jar needs to be a self-executing jar (see documentation here:). Use pipeline_options to pass pipeline options on to your job.
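A minimal sketch of how the jar and pipeline_options parameters might be supplied is shown below. It assumes the operator is BeamRunJavaPipelineOperator from the Apache Beam provider package; the task ID, jar path, class name, and output path are all hypothetical. The import is placed inside the function only so that the arguments can be inspected without Airflow installed; in a real DAG file it would normally sit at the top of the module.

```python
# Keyword arguments for the operator; every value here is a placeholder.
operator_kwargs = {
    "task_id": "run_java_beam_pipeline",
    "jar": "/path/to/pipeline-bundled.jar",  # hypothetical self-executing jar
    "runner": "DirectRunner",
    "job_class": "org.example.WordCount",  # hypothetical pipeline class
    "pipeline_options": {"output": "/tmp/wordcount-output"},
}


def make_task(dag=None):
    """Create the task; requires the Apache Beam provider to be installed."""
    # Assumed import path for the provider's Java pipeline operator.
    from airflow.providers.apache.beam.operators.beam import (
        BeamRunJavaPipelineOperator,
    )

    return BeamRunJavaPipelineOperator(dag=dag, **operator_kwargs)
```

Calling make_task(dag) inside a DAG definition would attach the task to that DAG.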

