BeamRunPythonPipelineOperator

Apache Beam

Launches Apache Beam pipelines written in Python. Note that default_pipeline_options and pipeline_options are merged to form the pipeline execution parameters; default_pipeline_options is expected to hold high-level options, for instance project and zone information, which apply to all Beam operators in the DAG.
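A minimal sketch of that merge follows. It assumes (as the provider's merge behavior suggests, but verify against your installed version) that pipeline_options takes precedence over default_pipeline_options on duplicate keys; the project, zone, and output values are placeholders:

    # Illustrative only: the actual merge happens inside the operator.
    default_pipeline_options = {"project": "my-gcp-project", "zone": "us-central1-f"}
    pipeline_options = {"zone": "europe-west1-b", "output": "gs://my-bucket/output"}

    # Assumption: pipeline_options wins on duplicate keys.
    effective_options = {**default_pipeline_options, **pipeline_options}
    # {'project': 'my-gcp-project', 'zone': 'europe-west1-b',
    #  'output': 'gs://my-bucket/output'}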


Last Updated: Feb. 5, 2021

Access Instructions

Install the Apache Beam provider package (apache-airflow-providers-apache-beam) into your Airflow environment.

Import the operator into your DAG file and instantiate it with your desired parameters, as in the sketch below.
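For example, a minimal DAG along these lines (the DAG ID, file path, and option values are placeholders, not prescribed names):

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
    from airflow.utils.dates import days_ago

    with DAG(
        dag_id="beam_python_example",
        start_date=days_ago(1),
        schedule_interval=None,
    ) as dag:
        run_beam_pipeline = BeamRunPythonPipelineOperator(
            task_id="run_beam_pipeline",
            py_file="/files/pipelines/wordcount.py",  # placeholder path
            runner="DirectRunner",
            default_pipeline_options={"project": "my-gcp-project"},
            pipeline_options={"output": "/tmp/wordcount_output"},
            py_interpreter="python3",
            py_requirements=["apache-beam"],  # installed into a fresh virtualenv
            py_system_site_packages=False,
        )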

Parameters

py_file (str): Reference to the Apache Beam pipeline file (.py), e.g., /some/local/file/path/to/your/python/pipeline/file. (templated)

runner (str): Runner on which the pipeline will be run. Defaults to "DirectRunner". Other possible options: DataflowRunner, SparkRunner, FlinkRunner. See the BeamRunnerType class in the provider's Beam hook module, and the capability matrix: https://beam.apache.org/documentation/runners/capability-matrix/

py_options (list[str]): Additional Python options, e.g., ["-m", "-v"].

default_pipeline_options (dict): Map of default pipeline options.

pipeline_options (dict): Map of pipeline options. The parameter must be a dictionary; each value may be of a different type, and values are converted to command-line options as follows (see the sketch after this list):
- If the value is None, the single option --key (without a value) will be added.
- If the value is False, the option will be skipped.
- If the value is True, the single option --key (without a value) will be added.
- If the value is a list, one option is added per element; e.g., for the value ['A', 'B'] and the key key, the options --key=A --key=B are produced.
- Other value types are replaced with their Python textual representation.
When defining labels (the labels option), you can also provide a dictionary.

py_interpreter (str): Python version of the Beam pipeline. If None, defaults to python3. To track Python versions supported by Beam and related issues, check: https://issues.apache.org/jira/browse/BEAM-1251

py_requirements (list[str]): Additional Python package(s) to install. If a value is passed to this parameter, a new virtual environment will be created with the additional packages installed. You can also use this to install apache-beam itself if it is not installed on your system or you want to use a different version.

py_system_site_packages (bool): Whether to include system_site_packages in your virtualenv. See the virtualenv documentation for more information. This option is only relevant if the py_requirements parameter is not None.

gcp_conn_id (str): Optional. The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.

delegate_to (str): Optional. The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

dataflow_config (Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]): Dataflow configuration, used when the runner is set to DataflowRunner. (An example appears at the end of this page.)
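The value-type rules for pipeline_options above translate into command-line options roughly as follows. This helper is an illustrative sketch of those documented rules, not the provider's actual implementation (the name options_to_args is hypothetical; the real conversion lives in the provider's Beam hook):

    def options_to_args(options: dict) -> list:
        args = []
        for key, value in options.items():
            if value is False:
                continue  # False: the option is skipped
            if value is None or value is True:
                args.append(f"--{key}")  # None/True: bare option without a value
            elif isinstance(value, list):
                args.extend(f"--{key}={v}" for v in value)  # one option per element
            else:
                args.append(f"--{key}={value}")  # other types: textual representation
        return args

    # Example:
    # options_to_args({"streaming": True, "key": ["A", "B"], "output": "/tmp/out"})
    # -> ["--streaming", "--key=A", "--key=B", "--output=/tmp/out"]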

Documentation


See also

For more information on how to use this operator, take a look at the guide: Run Python Pipelines in Apache Beam

See also

For more detail on Apache Beam, have a look at the reference documentation: https://beam.apache.org/documentation/
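As an illustration of the dataflow_config parameter described above, here is a hedged sketch that runs the pipeline on Dataflow. The job name, project, region, and bucket values are placeholders, and the exact arguments accepted by DataflowConfiguration should be checked against the Google provider version you have installed:

    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    # Sketch only: all identifiers below are placeholders.
    run_on_dataflow = BeamRunPythonPipelineOperator(
        task_id="run_on_dataflow",
        py_file="gs://my-bucket/pipelines/wordcount.py",  # GCS path, fetched via gcp_conn_id
        runner="DataflowRunner",
        pipeline_options={"temp_location": "gs://my-bucket/temp"},
        dataflow_config=DataflowConfiguration(
            job_name="beam-wordcount",
            project_id="my-gcp-project",
            location="us-central1",
        ),
    )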
