DataprocSubmitPySparkJobOperator

Google

Start a PySpark job on a Cloud Dataproc cluster.


Last Updated: Apr. 21, 2021

Access Instructions

Install the Google provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired parameters.
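For example (a minimal sketch; the package name and import path below follow the standard layout of the Google provider package):

```python
# Install the provider into the Airflow environment first, e.g.:
#   pip install apache-airflow-providers-google
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitPySparkJobOperator,
)
```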

Parameters

`main` (str) [Required]: The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. (templated)
`arguments` (list): Arguments for the job. (templated)
`archives` (list): List of archived files that will be unpacked in the working directory. Should be stored in Cloud Storage.
`files` (list): List of files to be copied to the working directory.
`pyfiles` (list): List of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
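Putting these parameters together, the sketch below instantiates the operator inside a minimal DAG. The bucket URIs, cluster name, and region are illustrative assumptions, not values taken from this page:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitPySparkJobOperator,
)

with DAG(
    dag_id="example_dataproc_pyspark",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_pyspark = DataprocSubmitPySparkJobOperator(
        task_id="submit_pyspark_job",
        # HCFS URI of the driver script; must be a .py file (templated).
        main="gs://my-bucket/pyspark/wordcount.py",
        # Arguments passed to the driver (templated).
        arguments=["gs://my-bucket/input/", "gs://my-bucket/output/"],
        # Extra Python files (.py, .egg, or .zip) passed to the PySpark framework.
        pyfiles=["gs://my-bucket/pyspark/helpers.zip"],
        # Files copied to the working directory of the job.
        files=["gs://my-bucket/config/job.conf"],
        # Archives unpacked in the working directory; stored in Cloud Storage.
        archives=["gs://my-bucket/archives/resources.tar.gz"],
        # Cluster and location the job is submitted to.
        cluster_name="my-dataproc-cluster",
        region="us-central1",
    )
```

Note that `cluster_name` and `region` come from the operator's base class rather than the parameter list above, and the sketch assumes the Dataproc cluster already exists.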

Documentation

Start a PySpark job on a Cloud Dataproc cluster.

Example DAGs

Improve this module by creating an example DAG.

  1. Add an `example_dags` directory to the top-level source of the provider package with an empty `__init__.py` file.
  2. Add your DAG to this directory. Be sure to include a well-written and descriptive docstring.
  3. Create a pull request against the source code. Once the package gets released, your DAG will show up on the Registry.
