GreatExpectationsBigQueryOperator


An operator that allows you to use Great Expectations to validate data in a BigQuery table, or the result of a SQL query, against a suite of Expectations.


Last Updated: Mar. 26, 2021

Access Instructions

Install the Great Expectations provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
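The two steps above might look like the sketch below. The pip package name and import path are assumptions based on typical provider-package naming; check the provider's own documentation for the exact module path.

```python
# Step 1 (shell): install the provider into your Airflow environment,
# e.g. `pip install airflow-provider-great-expectations` (the package
# name is an assumption; use the name from the provider's docs).

# Step 2: import the operator in your DAG file. The module path below
# is hypothetical and guarded so this sketch runs even where the
# provider is not installed.
try:
    from great_expectations_provider.operators.great_expectations import (
        GreatExpectationsBigQueryOperator,
    )
except ImportError:
    GreatExpectationsBigQueryOperator = None  # provider not installed here
```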

Parameters

gcp_project (str): The GCP project that houses the GCS buckets where the Expectation files are stored and where the validation files and data docs (e.g. HTML docs showing whether the data matches Expectations) will be written.
expectations_suite_name (str): The name of the Expectation Suite containing the Expectations for the data. The suite should be in a JSON file with the same name as the suite (e.g. if the suite named in the Expectation file is 'my_suite', then the Expectations file should be called my_suite.json).
gcs_bucket (str): The Google Cloud Storage bucket where Expectation files are stored and where validation outputs and data docs will be saved (e.g. gs://mybucket/myprefix/myexpectationsfile.json).
gcs_expectations_prefix (str): Google Cloud Storage prefix where the Expectations file can be found (e.g. 'ge/expectations').
gcs_validations_prefix (str): Google Cloud Storage prefix where the validation output files should be saved (e.g. 'ge/validations').
gcs_datadocs_prefix (str): Google Cloud Storage prefix where the validation data docs files should be saved (e.g. 'ge/datadocs').
query (str): A SQL query that defines the set of data to be validated (i.e. compared against Expectations). If the query parameter is provided, the table parameter must not be.
table (str): The name of the BigQuery table (dataset_name.table_name) that defines the set of data to be validated. If the table parameter is provided, the query parameter must not be.
bigquery_conn_id (str): Name of the BigQuery connection (as configured in Airflow) that contains the connection and credentials info needed to connect to BigQuery.
bq_dataset_name (str): The name of the BigQuery dataset where any temp tables needed as part of the GE validation process will be created.
send_alert_email (boolean): Send an alert email if one or more Expectations fail to be met. Defaults to True. This requires configuring an SMTP server in the Airflow config.
datadocs_link_in_email (boolean): Include in the alert email a link to the data doc in GCS that shows the validation results. Defaults to False because extra setup is needed to serve HTML data docs stored in GCS; when set to False, only a GCS path to the results is included in the email. Set up a GAE app to serve the data docs if you want a clickable link to the data doc included in the email. See here for setup instructions: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_docs/how_to_host_and_share_data_docs_on_gcs.html
datadocs_domain (str): The domain from which the data docs are set up to be served (e.g. ge-data-docs-dot-my-gcp-project.ue.r.appspot.com). This only needs to be set if datadocs_link_in_email is set to True.
email_to (str): Email address to receive any alerts when Expectations are not met.
fail_task_on_validation_failure (boolean): Fail the Airflow task if Expectations are not met. Defaults to True.
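A minimal sketch (not the provider's actual code) of how these parameters fit together, including the documented rule that query and table are mutually exclusive. The parameter names mirror the list above; the helper function itself and all example values are hypothetical.

```python
# Hypothetical helper that assembles the operator's keyword arguments
# and enforces the query/table mutual-exclusion rule described above.
def build_operator_kwargs(
    gcp_project,
    expectations_suite_name,
    gcs_bucket,
    gcs_expectations_prefix,
    gcs_validations_prefix,
    gcs_datadocs_prefix,
    bq_dataset_name,
    bigquery_conn_id,
    query=None,
    table=None,
    send_alert_email=True,
    fail_task_on_validation_failure=True,
):
    # The docs state that exactly one of `query` or `table` may be set.
    if (query is None) == (table is None):
        raise ValueError("Provide exactly one of `query` or `table`.")
    return {
        "gcp_project": gcp_project,
        "expectations_suite_name": expectations_suite_name,
        "gcs_bucket": gcs_bucket,
        "gcs_expectations_prefix": gcs_expectations_prefix,
        "gcs_validations_prefix": gcs_validations_prefix,
        "gcs_datadocs_prefix": gcs_datadocs_prefix,
        "bq_dataset_name": bq_dataset_name,
        "bigquery_conn_id": bigquery_conn_id,
        "query": query,
        "table": table,
        "send_alert_email": send_alert_email,
        "fail_task_on_validation_failure": fail_task_on_validation_failure,
    }

# Example values are illustrative only.
kwargs = build_operator_kwargs(
    gcp_project="my-gcp-project",
    expectations_suite_name="my_suite",
    gcs_bucket="mybucket",
    gcs_expectations_prefix="ge/expectations",
    gcs_validations_prefix="ge/validations",
    gcs_datadocs_prefix="ge/datadocs",
    bq_dataset_name="my_dataset",
    bigquery_conn_id="my_bigquery_conn",
    table="my_dataset.my_table",
)
```

In real usage these keyword arguments would be passed directly to the operator's constructor in your DAG file.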

Documentation


The Expectations need to be stored in a JSON file sitting in an accessible GCS bucket. The validation results are output to GCS in both JSON and HTML formats.

Here’s the current list of Expectation types: https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html

Here’s how to create Expectation files: https://docs.greatexpectations.io/en/latest/guides/tutorials/how_to_create_expectations.html
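The suite file referenced above is plain JSON. A minimal sketch of what such a file might contain, assuming the legacy Great Expectations suite format; the exact schema depends on your Great Expectations version, and the column names and expectation types here are illustrative:

```python
import json

# Illustrative content for a suite file named my_suite.json.
# The expectation_type values come from the Expectation glossary
# linked above; the column names are hypothetical.
suite = {
    "expectation_suite_name": "my_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "user_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10000},
        },
    ],
}

# Serialized as it would sit in GCS under your expectations prefix,
# e.g. gs://mybucket/ge/expectations/my_suite.json
suite_json = json.dumps(suite, indent=2)
```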
