SparkSubmitOperator

This operator is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or that spark-home is set in the extra field of the connection.
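
For example, if spark-submit is not on the PATH, one way to point Airflow at a Spark installation is through the connection's extra. A minimal sketch, assuming a hypothetical /opt/spark install and using Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable mechanism (query parameters become extra fields, with the path URL-encoded):

    import os

    # "yarn" is the connection host (the Spark master); /opt/spark is a
    # placeholder path, URL-encoded into the extra key "spark-home".
    os.environ["AIRFLOW_CONN_SPARK_DEFAULT"] = "spark://yarn?spark-home=%2Fopt%2Fspark"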

Last Updated: Apr. 27, 2021

Access Instructions

Install the Spark provider package (apache-airflow-providers-apache-spark) into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired parameters, as in the sketch below.
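
A minimal sketch of both steps, assuming the provider package is installed (e.g. pip install apache-airflow-providers-apache-spark) and a spark_default connection exists; the application path is a placeholder:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="example_spark_submit",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Kick off a spark-submit job for a (placeholder) PySpark application.
        submit_job = SparkSubmitOperator(
            task_id="submit_job",
            application="/path/to/app.py",
            conn_id="spark_default",
        )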

Parameters

application (str): The application to submit as a job, either a .jar or .py file. (templated)
conf (dict): Arbitrary Spark configuration properties. (templated)
conn_id (str): The connection ID as configured in Airflow administration. When an invalid conn_id is supplied, it defaults to yarn.
files (str): Comma-separated list of additional files to upload to the executors running the job, e.g. serialized objects. Files are placed in the working directory of each executor. (templated)
py_files (str): Additional Python files used by the job; can be .zip, .egg, or .py. (templated)
jars (str): Additional jars to upload and place on the executor classpath. (templated)
driver_class_path (str): Additional, driver-specific classpath settings. (templated)
java_class (str): The main class of the Java application.
packages (str): Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. (templated)
exclude_packages (str): Comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in packages. (templated)
repositories (str): Comma-separated list of additional remote repositories to search for the Maven coordinates given with packages.
total_executor_cores (int): (Standalone & Mesos only) Total cores for all executors. (Default: all available cores on the worker)
executor_cores (int): (Standalone & YARN only) Number of cores per executor. (Default: 2)
executor_memory (str): Memory per executor, e.g. 1000M or 2G. (Default: 1G)
driver_memory (str): Memory allocated to the driver, e.g. 1000M or 2G. (Default: 1G)
keytab (str): Full path to the file that contains the keytab. (templated)
principal (str): The name of the Kerberos principal used for the keytab. (templated)
proxy_user (str): User to impersonate when submitting the application. (templated)
name (str): Name of the job. (Default: airflow-spark) (templated)
num_executors (int): Number of executors to launch.
status_poll_interval (int): Seconds to wait between polls of the driver status in cluster mode. (Default: 1)
application_args (list): Arguments for the application being submitted. (templated)
env_vars (dict): Environment variables for spark-submit; also supported in yarn and k8s modes. (templated)
verbose (bool): Whether to pass the verbose flag to the spark-submit process for debugging.
spark_binary (str): The command to use for spark-submit; some distros may use spark2-submit.
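
To illustrate how these parameters fit together, here is a sketch of a fuller instantiation; the jar path, main class, Maven coordinate, and job name are placeholders, not values the operator requires:

    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    submit_etl = SparkSubmitOperator(
        task_id="submit_etl",
        application="/path/to/etl.jar",                     # jar or .py file (templated)
        java_class="com.example.ETLJob",                    # main class of the Java application
        conf={"spark.sql.shuffle.partitions": "200"},       # arbitrary Spark properties
        conn_id="spark_default",
        packages="org.apache.spark:spark-avro_2.12:3.1.2",  # Maven coordinates for extra jars
        executor_cores=4,
        executor_memory="2G",
        driver_memory="1G",
        num_executors=10,
        name="etl-{{ ds }}",                                # templated job name
        application_args=["--date", "{{ ds }}"],            # templated args for the app
        verbose=False,
    )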

Documentation

See also

For more information on how to use this operator, take a look at the guide: SparkSubmitOperator
