S3ToHiveOperator

Hive

Moves data from S3 to Hive. The operator downloads a file from S3 and stores it locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor’s metadata.

Last Updated: Jun. 28, 2021

Access Instructions

Install the Hive provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
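As a minimal sketch of the two steps above: the provider is typically installed with pip install apache-airflow-providers-apache-hive (the operator also relies on the Amazon provider for S3 access), and the operator is imported from airflow.providers.apache.hive.transfers.s3_to_hive. The bucket, key, table, and field names below are placeholders, and the connection IDs are assumed to be the provider defaults. Airflow is imported inside the function only so that the snippet can be loaded without Airflow installed.

```python
from datetime import datetime

# Placeholder arguments; the names follow the Parameters list on this page.
S3_TO_HIVE_KWARGS = {
    "task_id": "s3_to_hive_example",
    "s3_key": "s3://example-bucket/exports/events.csv",  # hypothetical key
    "field_dict": {"event_id": "BIGINT", "event_name": "STRING"},
    "hive_table": "staging.events",  # dot notation: database.table
    "delimiter": ",",
    "headers": True,
    "check_headers": True,
    "create": True,
    "recreate": False,
    "aws_conn_id": "aws_default",            # assumed default connection ID
    "hive_cli_conn_id": "hive_cli_default",  # assumed default connection ID
}


def build_dag():
    """Create a DAG containing a single S3ToHiveOperator task."""
    # Imported lazily so this module can be loaded without Airflow installed.
    from airflow import DAG
    from airflow.providers.apache.hive.transfers.s3_to_hive import S3ToHiveOperator

    with DAG(
        dag_id="example_s3_to_hive",
        start_date=datetime(2021, 6, 28),
        catchup=False,
    ) as dag:
        S3ToHiveOperator(**S3_TO_HIVE_KWARGS)
    return dag
```

In a real DAG file, assign the result at module level (for example, dag = build_dag()) so that the scheduler can discover it.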

Parameters

s3_key (str, required): The key to be retrieved from S3. (templated)
field_dict (dict, required): A dictionary with the field names in the file as keys and their Hive types as values.
hive_table (str, required): Target Hive table; use dot notation to target a specific database. (templated)
delimiter (str): Field delimiter in the file.
create (bool): Whether to create the table if it doesn’t exist.
recreate (bool): Whether to drop and recreate the table at every execution.
partition (dict): Target partition as a dict of partition columns and values. (templated)
headers (bool): Whether the file contains column names on the first line.
check_headers (bool): Whether the column names on the first line should be checked against the keys of field_dict.
wildcard_match (bool): Whether s3_key should be interpreted as a Unix wildcard pattern.
aws_conn_id (str): Source S3 connection.
verify (bool or str): Whether to verify SSL certificates for the S3 connection. By default, SSL certificates are verified. You can provide the following values:
    False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
    path/to/cert/bundle.pem: the filename of the CA cert bundle to use. Specify this if you want to use a different CA cert bundle than the one used by botocore.
hive_cli_conn_id (str): Destination Hive connection.
input_compressed (bool): Whether file decompression is required to process headers.
tblproperties (dict): TBLPROPERTIES of the Hive table being created.
select_expression (str): S3 Select expression.
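To make the interaction of field_dict, create, recreate, partition, delimiter, and tblproperties more concrete, the following sketch builds HiveQL of roughly the shape those parameters imply. It is an illustration only, not the provider's actual code: the function name sketch_hive_statements is hypothetical, and typing partition columns as STRING is an assumption made for this example.

```python
def sketch_hive_statements(hive_table, field_dict, delimiter=",", create=True,
                           recreate=False, partition=None, tblproperties=None):
    """Return HiveQL statements resembling what create/recreate imply."""
    statements = []
    if recreate:
        # recreate drops the table before it is created again.
        statements.append(f"DROP TABLE IF EXISTS {hive_table};")
    if create or recreate:
        # One column per field_dict entry: key is the name, value the Hive type.
        columns = ",\n".join(
            f"    `{name}` {hive_type}" for name, hive_type in field_dict.items()
        )
        ddl = f"CREATE TABLE IF NOT EXISTS {hive_table} (\n{columns}\n)\n"
        if partition:
            # Assumption for this sketch: partition columns are strings.
            pcols = ", ".join(f"`{col}` STRING" for col in partition)
            ddl += f"PARTITIONED BY ({pcols})\n"
        ddl += "ROW FORMAT DELIMITED\n"
        ddl += f"FIELDS TERMINATED BY '{delimiter}'\n"
        ddl += "STORED AS textfile\n"
        if tblproperties:
            props = ", ".join(f"'{k}'='{v}'" for k, v in tblproperties.items())
            ddl += f"TBLPROPERTIES({props})\n"
        statements.append(ddl.rstrip("\n") + ";")
    return statements


example = sketch_hive_statements(
    "staging.events",
    {"event_id": "BIGINT", "event_name": "STRING"},
    recreate=True,
    partition={"ds": "2021-06-28"},
)
for statement in example:
    print(statement)
```

With recreate=True the sketch emits a DROP TABLE statement followed by a CREATE TABLE statement; with both flags False it emits nothing, which corresponds to loading into a table that already exists.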

Documentation

Note that the table generated in Hive uses STORED AS textfile, which isn’t the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
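The staging pattern described above could look roughly like the sketch below. The table names, column names, and the helper add_final_load are hypothetical, the final table is assumed to already exist in a more efficient format such as ORC, and the connection ID is assumed to be the provider default. HiveOperator is imported lazily so the snippet can be loaded without Airflow installed.

```python
# Hypothetical table names; the final table is assumed to exist already.
STAGING_TABLE = "staging.events"
FINAL_TABLE = "analytics.events_orc"

# Copy rows from the textfile staging table into the final table.
LOAD_FINAL_HQL = f"""
INSERT OVERWRITE TABLE {FINAL_TABLE}
SELECT event_id, event_name
FROM {STAGING_TABLE};
"""


def add_final_load(dag, stage_task):
    """Attach a HiveOperator downstream of the staging task and return it."""
    from airflow.providers.apache.hive.operators.hive import HiveOperator

    load_final = HiveOperator(
        task_id="load_final_table",
        hql=LOAD_FINAL_HQL,
        hive_cli_conn_id="hive_cli_default",  # assumed default connection ID
        dag=dag,
    )
    # Run the final load only after the S3-to-Hive staging step succeeds.
    stage_task >> load_final
    return load_final
```

Here stage_task would be the S3ToHiveOperator instance that writes to the staging table.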