DataprocSubmitPigJobOperator

Google

Start a Pig query job on a Cloud Dataproc cluster. The parameters of the operation will be passed to the cluster.


Last Updated: Apr. 21, 2021

Access Instructions

Install the Google provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
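For example, a minimal sketch assuming an Airflow 2.x environment (the package name is the standard Google provider distribution):

# Install the provider in your Airflow environment (shell): pip install apache-airflow-providers-google
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator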

Parameters

query (str): The query or reference to the query file (pg or pig extension). (templated)
query_uri (str): The HCFS URI of the script that contains the Pig queries.
variables (dict): Map of named parameters for the query. (templated)
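As an illustrative sketch of how these parameters fit together (the cluster name, region, and GCS paths below are placeholders, not values from this page), a job that references a script stored in GCS via query_uri might look like:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator

pig_from_gcs = DataprocSubmitPigJobOperator(
    task_id='pig_from_gcs',
    query_uri='gs://example-bucket/scripts/a_pig_script.pig',  # HCFS URI of the script
    variables={'out': 'gs://example-bucket/output/{{ ds }}'},  # resolved on the cluster
    cluster_name='cluster-1',
    region='us-central1',
    dag=dag)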

Documentation

Start a Pig query job on a Cloud Dataproc cluster. The parameters of the operation will be passed to the cluster.

It is good practice to define dataproc_* parameters, such as the cluster name and UDF jars, in the default_args of the DAG:

default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
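These default_args are then picked up by every Dataproc task in the DAG. A minimal sketch of the wiring (the DAG id, start date, and schedule are placeholders):

from datetime import datetime
from airflow import DAG

default_args = {
    'cluster_name': 'cluster-1',   # as defined above; UDF jars omitted here for brevity
}

with DAG('dataproc_pig_example',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         default_args=default_args) as dag:
    # Dataproc tasks defined in this block pick up cluster_name from default_args
    ...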

You can pass a Pig script as a string or as a file reference. Use variables to pass variables for the Pig script to be resolved on the cluster, or use Jinja template parameters to be resolved in the script itself.

Example:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitPigJobOperator

t1 = DataprocSubmitPigJobOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag)

See also

For more detail on job submission, have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
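Under the hood, the operator submits a Job resource to that endpoint with a pigJob section. As a rough sketch of the request shape (field names follow the REST reference linked above; the values are placeholders), the body looks roughly like:

# Illustrative shape of the job body sent to projects.regions.jobs.submit
job = {
    "placement": {"clusterName": "cluster-1"},
    "pigJob": {
        "queryFileUri": "gs://example-bucket/scripts/a_pig_script.pig",
        "scriptVariables": {"out": "gs://example-bucket/output/2021-04-21"},
        "jarFileUris": ["gs://example/udf/jar/datafu/1.2.0/datafu.jar"],
    },
}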

Example DAGs

Improve this module by creating an example DAG.

  1. Add an `example_dags` directory to the top-level source of the provider package with an empty `__init__.py` file.
  2. Add your DAG to this directory. Be sure to include a well-written and descriptive docstring.
  3. Create a pull request against the source code. Once the package gets released, your DAG will show up on the Registry.
