SparkSubmitHook

Apache Spark Provider

This hook wraps the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or that spark_home is supplied.


Last Updated: Apr. 27, 2021

Access Instructions

Install the Spark provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
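A minimal sketch of these access steps, assuming the `apache-airflow-providers-apache-spark` package is installed and a Spark connection named `spark_default` has been configured in Airflow:

```python
# Install the provider first, e.g.:
#   pip install apache-airflow-providers-apache-spark

from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

# Instantiate the hook with the desired parameters; "spark_default" is the
# conventional connection ID and is assumed to exist in this environment.
hook = SparkSubmitHook(
    conn_id="spark_default",
    name="example-spark-job",
    verbose=True,
)
```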

Parameters

conf (dict): Arbitrary Spark configuration properties.
conn_id (str): The connection ID as configured in Airflow administration. When an invalid conn_id is supplied, it will default to yarn.
files (str): Comma-separated list of additional files to upload to the executors running the job. Files are placed in the working directory of each executor, e.g. serialized objects.
py_files (str): Additional Python files used by the job; can be .zip, .egg or .py.
driver_class_path (str): Additional, driver-specific classpath settings.
jars (str): Additional jars to upload and place on the executor classpath.
java_class (str): The main class of the Java application.
packages (str): Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
exclude_packages (str): Comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in packages.
repositories (str): Comma-separated list of additional remote repositories to search for the Maven coordinates given with packages.
total_executor_cores (int): (Standalone & Mesos only) Total cores for all executors (Default: all available cores on the worker).
executor_cores (int): (Standalone, YARN and Kubernetes only) Number of cores per executor (Default: 2).
executor_memory (str): Memory per executor, e.g. 1000M, 2G (Default: 1G).
driver_memory (str): Memory allocated to the driver, e.g. 1000M, 2G (Default: 1G).
keytab (str): Full path to the file that contains the keytab.
principal (str): The name of the Kerberos principal used for the keytab.
proxy_user (str): User to impersonate when submitting the application.
name (str): Name of the job (Default: airflow-spark).
num_executors (int): Number of executors to launch.
status_poll_interval (int): Seconds to wait between polls of driver status in cluster mode (Default: 1).
application_args (list): Arguments for the application being submitted.
env_vars (dict): Environment variables for spark-submit. Also supported in YARN and Kubernetes mode.
verbose (bool): Whether to pass the verbose flag to the spark-submit process for debugging.
spark_binary (str): The command to use for spark-submit; some distributions may use spark2-submit.
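As an illustration, a sketch that combines several of the parameters above and submits a PySpark script with the hook's submit() method; the connection ID, paths, and configuration values are placeholders:

```python
from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

# All values below are illustrative placeholders.
hook = SparkSubmitHook(
    conn_id="spark_default",          # must point at a configured Spark connection
    conf={"spark.sql.shuffle.partitions": "200"},
    py_files="/opt/jobs/helpers.zip",
    executor_cores=2,
    executor_memory="2G",
    driver_memory="1G",
    num_executors=4,
    name="nightly-aggregation",
    application_args=["--date", "2021-04-27"],
    verbose=True,
)

# submit() builds the spark-submit command from the parameters above,
# runs it as a subprocess, and blocks until the job finishes.
hook.submit(application="/opt/jobs/aggregate.py")
```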


Example DAGs

Improve this module by creating an example DAG (a minimal sketch is shown after the steps below).

  1. Add an `example_dags` directory to the top-level source of the provider package with an empty `__init__.py` file.
  2. Add your DAG to this directory. Be sure to include a well-written and descriptive docstring.
  3. Create a pull request against the source code. Once the package gets released, your DAG will show up on the Registry.
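A minimal sketch of such an example DAG, assuming the placeholder script path and connection ID shown; a real contribution would live in the `example_dags` directory as described above:

```python
"""Example DAG demonstrating SparkSubmitHook invoked from a Python task."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook


def submit_spark_job():
    # conn_id and application path are illustrative placeholders.
    hook = SparkSubmitHook(conn_id="spark_default", name="example-spark-submit")
    hook.submit(application="/opt/jobs/example_app.py")


with DAG(
    dag_id="example_spark_submit_hook",
    start_date=datetime(2021, 4, 27),
    schedule_interval=None,
    catchup=False,
    tags=["example"],
) as dag:
    PythonOperator(task_id="submit_spark_job", python_callable=submit_spark_job)
```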
