Skip to content

mrshu/dbt-athena

 
 

Repository files navigation

pypi Imports: isort Code style: black Stats: pepy

dbt-athena

  • Supports dbt version 1.3.*
  • Supports Seeds
  • Correctly detects views and their columns
  • Supports table materialization
    • Iceberg tables is supported only with Athena Engine v3 and a unique table location (see table location section below)
    • Hive tables is supported by both Athena engines.
  • Supports incremental models
    • On iceberg tables :
      • Support the use of unique_key only with the merge strategy
      • Support the append strategy
    • On Hive tables :
      • Support two incremental update strategies: insert_overwrite and append
      • Does not support the use of unique_key
  • Does not support Python models

Installation

  • pip install dbt-athena-community
  • Or pip install git+https://github.com/dbt-athena/dbt-athena.git

Prerequisites

To start, you will need an S3 bucket, for instance my-bucket and an Athena database:

CREATE DATABASE IF NOT EXISTS analytics_dev
COMMENT 'Analytics models generated by dbt (development)'
LOCATION 's3://my-bucket/'
WITH DBPROPERTIES ('creator'='Foo Bar', 'email'='[email protected]');

Notes:

  • Take note of your AWS region code (e.g. us-west-2 or eu-west-2, etc.).
  • You can also use AWS Glue to create and manage Athena databases.

Credentials

This plugin does not accept any credentials directly. Instead, credentials are determined automatically based on aws cli/boto3 conventions and stored login info. You can configure the AWS profile name to use via aws_profile_name. Checkout DBT profile configuration below for details.

Configuring your profile

A dbt profile can be configured to run against AWS Athena using the following configuration:

Option Description Required? Example
s3_staging_dir S3 location to store Athena query results and metadata Required s3://bucket/dbt/
s3_data_dir Prefix for storing tables, if different from the connection's s3_staging_dir Optional s3://bucket2/dbt/
s3_data_naming How to generate table paths in s3_data_dir Optional schema_table_unique
region_name AWS region of your Athena instance Required eu-west-1
schema Specify the schema (Athena database) to build models into (lowercase only) Required dbt
database Specify the database (Data catalog) to build models into (lowercase only) Required awsdatacatalog
poll_interval Interval in seconds to use for polling the status of query results in Athena Optional 5
aws_profile_name Profile to use from your AWS shared credentials file. Optional my-profile
work_group Identifier of Athena workgroup Optional my-custom-workgroup
num_retries Number of times to retry a failing query Optional 3

Example profiles.yml entry:

athena:
  target: dev
  outputs:
    dev:
      type: athena
      s3_staging_dir: s3://athena-query-results/dbt/
      s3_data_dir: s3://your_s3_bucket/dbt/
      s3_data_naming: schema_table
      region_name: eu-west-1
      schema: dbt
      database: awsdatacatalog
      aws_profile_name: my-profile
      work_group: my-workgroup

Additional information

  • threads is supported
  • database and catalog can be used interchangeably

Models

Table Configuration

  • external_location (default=none)
    • If set, the full S3 path in which the table will be saved. (Does not work with Iceberg table).
  • partitioned_by (default=none)
    • An array list of columns by which the table will be partitioned
    • Limited to creation of 100 partitions (currently)
  • bucketed_by (default=none)
    • An array list of columns to bucket data, ignored if using Iceberg
  • bucket_count (default=none)
    • The number of buckets for bucketing your data, ignored if using Iceberg
  • table_type (default='hive')
    • The type of table
    • Supports hive or iceberg
  • format (default='parquet')
    • The data format for the table
    • Supports ORC, PARQUET, AVRO, JSON, TEXTFILE
  • write_compression (default=none)
    • The compression type to use for any storage format that allows compression to be specified. To see which options are available, check out [CREATE TABLE AS][create-table-as]
  • field_delimiter (default=none)
    • Custom field delimiter, for when format is set to TEXTFILE
  • table_properties: table properties to add to the table, valid for Iceberg only

Table location

The location in which a table is saved is determined by:

  1. If external_location is defined, that value is used.
  2. If s3_data_dir is defined, the path is determined by that and s3_data_naming
  3. If s3_data_dir is not defined data is stored under s3_staging_dir/tables/

Here all the options available for s3_data_naming:

  • uuid: {s3_data_dir}/{uuid4()}/
  • table_table: {s3_data_dir}/{table}/
  • table_unique: {s3_data_dir}/{table}/{uuid4()}/
  • schema_table: {s3_data_dir}/{schema}/{table}/
  • s3_data_naming=schema_table_unique: {s3_data_dir}/{schema}/{table}/{uuid4()}/

It's possible to set the s3_data_naming globally in the target profile, or overwrite the value in the table config, or setting up the value for groups of model in dbt_project.yml

Incremental models

Support for incremental models.

These strategies are supported:

  • insert_overwrite (default): The insert overwrite strategy deletes the overlapping partitions from the destination table, and then inserts the new records from the source. This strategy depends on the partitioned_by keyword! If no partitions are defined, dbt will fall back to the append strategy.
  • append: Insert new records without updating, deleting or overwriting any existing data. There might be duplicate data (e.g. great for log or historical data).
  • merge: Conditionally updates, deletes, or inserts rows into an Iceberg table. Used in combination with unique_key. Only available when using Iceberg.

On schema change

on_schema_change is an option to reflect changes of schema in incremental models. The following options are supported:

  • ignore (default)
  • fail
  • append_new_columns
  • sync_all_columns

In detail, please refer to dbt docs.

Iceberg

The adapter supports table materialization for Iceberg.

To get started just add this as your model:

{{ config(
    materialized='table',
    table_type='iceberg',
    format='parquet',
    partitioned_by=['bucket(user_id, 5)'],
    table_properties={
    	'optimize_rewrite_delete_file_threshold': '2'
    	}
) }}

SELECT
	'A' AS user_id,
	'pi' AS name,
	'active' AS status,
	17.89 AS cost,
	1 AS quantity,
	100000000 AS quantity_big,
	current_date AS my_date

Iceberg supports bucketing as hidden partitions, therefore use the partitioned_by config to add specific bucketing conditions.

Iceberg supports several table formats for data : PARQUET, AVRO and ORC.

It is possible to use iceberg in an incremental fashion, specifically 2 strategies are supported:

  • append: new records are appended to the table, this can lead to duplicates
  • merge: must be used in combination with unique_key and it's only available with Engine version 3. It performs an upsert, new record are added, and record already existing are updated

Unsupported functionality

Due to the nature of AWS Athena, not all core dbt functionality is supported. The following features of dbt are not implemented on Athena:

  • Snapshots

Known issues

  • Incremental Iceberg models - Sync all columns on schema change can't remove columns used as partitioning. The only way, from a dbt perspective, is to do a full-refresh of the incremental model.

  • Quoting is not currently supported

    • If you need to quote your sources, escape the quote characters in your source definitions:
    version: 2
    
    sources:
      - name: my_source
        tables:
          - name: first_table
            identifier: "first table"       # Not like that
          - name: second_table
            identifier: "\"second table\""  # Like this
  • Tables, schemas and database should only be lowercase

Contributing

This connector works with Python from 3.7 to 3.10.

Getting started

In order to start developing on this adapter clone the repo and run this make command (see Makefile) :

make setup

It will :

  1. Install all dependencies.
  2. Install pre-commit hooks.
  3. Generate your .env file

Next, adjust .env file by configuring the environment variables to match your Athena development environment.

Running tests

We have 2 different types of testing:

  • unit testing: you can run this type of tests running make unit_test
  • functional testing: you must have an AWS account with Athena setup in order to launch this type of tests and have a .env file in place with the right values. You can run this type of tests running make functional_test

All type of tests can be run using make:

make test

Pull Request

  • Create a commit with your changes and push them to a fork.
  • Create a pull request on Github.
  • Pull request title and message (and PR title and description) must adhere to conventionalcommits.
  • Pull request body should describe motivation.

Helpful Resources

About

The athena adapter plugin for dbt (https://getdbt.com)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 98.2%
  • Makefile 1.6%
  • Shell 0.2%