DynamoDBToS3Operator

Amazon

Replicates records from a DynamoDB table to S3. It scans a DynamoDB table and writes the received records to a file on the local filesystem. It flushes the file to S3 once the file size exceeds the size limit specified by the user.


Last Updated: Jun. 28, 2021

Access Instructions

Install the Amazon provider package into your Airflow environment.

Import the module into your DAG file and instantiate it with your desired params.
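
For example, a minimal sketch of these two steps might look like the following; the install command and import path are the ones used by recent releases of the Amazon provider and are assumed here rather than taken from this page:

# Install the provider package into the Airflow environment, e.g.:
#   pip install apache-airflow-providers-amazon

# Import the operator from the Amazon provider's transfer modules
from airflow.providers.amazon.aws.transfers.dynamodb_to_s3 import DynamoDBToS3Operator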

Parameters

dynamodb_table_name (str, required): DynamoDB table to replicate data from
s3_bucket_name (str, required): S3 bucket to replicate data to
file_size (int, required): Flush the local file to S3 once its size (in bytes) is >= file_size
dynamodb_scan_kwargs (Optional[Dict[str, Any]]): Keyword arguments passed to the underlying DynamoDB Table.scan call
s3_key_prefix (Optional[str]): Prefix of the S3 object key
process_func (Callable[[Dict[str, Any]], bytes]): How to transform a DynamoDB item into bytes. By default, each item is dumped as JSON
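
For illustration, a custom process_func could write each item as one JSON line. The sketch below is an assumption: the function name and the Decimal handling are hypothetical, not the operator's built-in default.

import json
from typing import Any, Dict


def item_to_json_line(item: Dict[str, Any]) -> bytes:
    # Hypothetical process_func: serialize one DynamoDB item as a JSON line.
    # boto3 returns DynamoDB numbers as decimal.Decimal, which json.dumps
    # cannot encode natively, hence default=str.
    return (json.dumps(item, default=str) + '\n').encode('utf-8')

Pass it to the operator as process_func=item_to_json_line.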

Documentation

Users can also specify filtering criteria using dynamodb_scan_kwargs to replicate only the records that satisfy them.
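
For example, a server-side filter can be passed through dynamodb_scan_kwargs as a boto3 condition. This is a minimal sketch: the 'status' attribute, its value, the bucket name, and the file size are hypothetical.

from boto3.dynamodb.conditions import Attr

replicate_active = DynamoDBToS3Operator(
    task_id='replicate-active-records',
    dynamodb_table_name='hello',
    s3_bucket_name='my-replication-bucket',  # hypothetical bucket
    file_size=10 * 1024 * 1024,              # flush roughly every 10 MB
    # Only replicate items whose hypothetical 'status' attribute equals 'active'
    dynamodb_scan_kwargs={'FilterExpression': Attr('status').eq('active')},
)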

To parallelize the replication, users can create multiple tasks of DynamoDBToS3Operator. For instance, to replicate with a parallelism of 2, create two tasks like the following:

op1 = DynamoDBToS3Operator(
    task_id='replicator-1',
    dynamodb_table_name='hello',
    dynamodb_scan_kwargs={
        'TotalSegments': 2,
        'Segment': 0,
    },
    ...
)
op2 = DynamoDBToS3Operator(
    task_id='replicator-2',
    dynamodb_table_name='hello',
    dynamodb_scan_kwargs={
        'TotalSegments': 2,
        'Segment': 1,
    },
    ...
)
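
Filling in the arguments elided above, a complete version of this pattern might look like the sketch below; the DAG id, bucket name, key prefix, and file size are illustrative assumptions, not values from this page.

from airflow import DAG
from airflow.providers.amazon.aws.transfers.dynamodb_to_s3 import DynamoDBToS3Operator
from airflow.utils.dates import days_ago

with DAG(
    dag_id='dynamodb_to_s3_parallel_replication',
    schedule_interval=None,
    start_date=days_ago(1),
) as dag:
    # One task per scan segment; the two tasks run concurrently and
    # together cover the whole table.
    replicators = [
        DynamoDBToS3Operator(
            task_id=f'replicator-{segment + 1}',
            dynamodb_table_name='hello',
            s3_bucket_name='my-replication-bucket',
            s3_key_prefix='dynamodb/hello/',
            file_size=10 * 1024 * 1024,  # flush to S3 roughly every 10 MB
            dynamodb_scan_kwargs={
                'TotalSegments': 2,
                'Segment': segment,
            },
        )
        for segment in range(2)
    ]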
