BeamRunJavaPipelineOperator

Apache Beam

Launching Apache Beam pipelines written in Java.

View Source

Last Updated: Feb. 5, 2021

Access Instructions

Install the Apache Beam provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.

Parameters

jarstrThe reference to a self executing Apache Beam jar (templated).
runnerstrRunner on which pipeline will be run. By default "DirectRunner" is being used. See: https://beam.apache.org/documentation/runners/capability-matrix/
job_classstrThe name of the Apache Beam pipeline class to be executed, it is often not the main class configured in the pipeline jar file.
default_pipeline_optionsdictMap of default job pipeline_options.
pipeline_optionsdictMap of job specific pipeline_options.The key must be a dictionary. The value can contain different types:If the value is None, the single option - --key (without value) will be added.If the value is False, this option will be skippedIf the value is True, the single option - --key (without value) will be added.If the value is list, the many pipeline_options will be added for each key. If the value is ['A', 'B'] and the key is key then the --key=A --key-B pipeline_options will be leftOther value types will be replaced with the Python textual representation.When defining labels (labels option), you can also provide a dictionary.
gcp_conn_idstrThe connection ID to use connecting to Google Cloud Storage if jar is on GCS
delegate_tostrThe account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
dataflow_configUnion[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]Dataflow configuration, used when runner type is set to DataflowRunner

Documentation

Launching Apache Beam pipelines written in Java.

Note that both default_pipeline_options and pipeline_options will be merged to specify pipeline execution parameter, and default_pipeline_options is expected to save high-level pipeline_options, for instances, project and zone information, which apply to all Apache Beam operators in the DAG.

See also

For more information on how to use this operator, take a look at the guide: Run Java Pipelines in Apache Beam

See also

For more detail on Apache Beam have a look at the reference: https://beam.apache.org/documentation/

You need to pass the path to your jar file as a file reference with the jar parameter, the jar needs to be a self executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use pipeline_options to pass on pipeline_options to your job.

Was this page helpful?