HiveToDruidOperator

Provider: Apache Druid

Moves data from Hive to Druid.


Last Updated: May 22, 2021

Access Instructions

Install the Apache Druid provider package (apache-airflow-providers-apache-druid) into your Airflow environment.

Import the operator into your DAG file and instantiate it with your desired parameters.
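For example, after installing the provider package (the operator also relies on the Apache Hive provider for its Hive CLI and metastore hooks), a minimal instantiation might look like the sketch below. The SQL query, table, and datasource names are hypothetical placeholders; the connection IDs shown are the operator's defaults.

```python
# A minimal sketch of instantiating HiveToDruidOperator inside a DAG.
# The SQL, table, and datasource values are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.druid.transfers.hive_to_druid import HiveToDruidOperator

with DAG(
    dag_id="hive_to_druid_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    hive_to_druid = HiveToDruidOperator(
        task_id="load_hive_table_into_druid",
        sql="SELECT ts, country, revenue FROM sales",  # hypothetical Hive query
        druid_datasource="sales",                      # hypothetical datasource
        ts_dim="ts",
        metric_spec=[{"type": "count", "name": "count"}],
        hive_cli_conn_id="hive_cli_default",          # operator default
        druid_ingest_conn_id="druid_ingest_default",  # operator default
        metastore_conn_id="metastore_default",        # operator default
    )
```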

Parameters

- sql (str, required): SQL query to execute against the Hive database. (templated)
- druid_datasource (str, required): the Druid datasource you want to ingest into
- ts_dim (str, required): the timestamp dimension
- metric_spec (list): the metrics you want to define for your data
- hive_cli_conn_id (str): the Hive CLI connection ID
- druid_ingest_conn_id (str): the Druid ingest connection ID
- metastore_conn_id (str): the Hive metastore connection ID
- hadoop_dependency_coordinates (list[str]): list of coordinates to squeeze into the ingest JSON
- intervals (list): list of time intervals that define the segments; passed as-is to the JSON object. (templated)
- num_shards (float): directly specify the number of shards to create
- target_partition_size (int): target number of rows to include in a partition
- query_granularity (str): the minimum granularity at which results can be queried, and the granularity of the data inside the segment. E.g. a value of "minute" means that data is aggregated at minutely granularity; that is, if there are collisions in the tuple (minute(timestamp), dimensions), the values are aggregated together using the aggregators instead of being stored as individual rows. A granularity of "NONE" means millisecond granularity.
- segment_granularity (str): the granularity at which to create time chunks. Multiple segments can be created per time chunk; for example, with "DAY" segmentGranularity, the events of the same day fall into the same time chunk, which can optionally be further partitioned into multiple segments based on other configurations and input size (see the sketch after this list)
- hive_tblproperties (dict): additional tblproperties to set on the Hive staging table
- job_properties (dict): additional properties for the ingestion job
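The interplay between intervals, query_granularity, and segment_granularity is easiest to see in a concrete call. The following hedged sketch, using the same import as the example above, ingests one month of hypothetical data into daily segments pre-aggregated per minute; the query, datasource, interval, and job property values are all placeholders, and the metric_spec dictionaries follow Druid's standard aggregation spec format.

```python
# Hypothetical tuning of the ingestion parameters described above:
# one month of data, one Druid time chunk per day, and rows
# pre-aggregated at minute granularity during indexing.
tuned_load = HiveToDruidOperator(
    task_id="hive_to_druid_tuned",
    sql="SELECT ts, country, revenue FROM sales WHERE ds >= '2021-01-01'",
    druid_datasource="sales",
    ts_dim="ts",
    metric_spec=[
        {"type": "count", "name": "count"},
        {"type": "doubleSum", "name": "revenue", "fieldName": "revenue"},
    ],
    intervals=["2021-01-01/2021-02-01"],  # passed as-is into the ingest spec
    query_granularity="minute",           # collisions per (minute, dims) are aggregated
    segment_granularity="DAY",            # one time chunk per day
    target_partition_size=5000000,        # target rows per partition
    job_properties={"mapreduce.job.queuename": "ingest"},  # hypothetical job property
)
```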

Documentation

Moves data from Hive to Druid. The operator materializes the result of the given SQL into a temporary Hive staging table, reads that table's metadata from the metastore to build a Druid ingestion spec, submits the ingestion task through the Druid ingest connection, and drops the staging table once the load completes.
