-
Notifications
You must be signed in to change notification settings - Fork 25
Using Materializations with Dremio
For an overview of what seeds are, refer to Seeds in the dbt documentation.
- Object storage
- Scratch storage
Configuration | Description |
---|---|
object_storage_source |
The name of the filesystem in which to store seeds. The dbt alias is datalake .The default value is $scratch .
|
object_storage_path |
The path in the filesystem in which to store seeds. The dbt alias is root_path .The default is the root level of the filesystem, represented as the value no_schema . Nested folders in the path are separated with periods.Non-default values correspond to the path in this location in the Datasets page in Dremio: |
By default, seeds are given the same name as the file used to create them. Use the file
configuration to give a seed a non-default name.
A Dremio table (also known as a physical dataset) is a table
materialization that is either created directly from a source or is created from a seed.
- Object storage
- Scratch storage
Configuration | Description |
---|---|
object_storage_source |
The name of the filesystem in which to create tables. The dbt alias is datalake .The default value is $scratch .
|
object_storage_path |
The path in the filesystem in which to create tables. The dbt alias is root_path .The default is the root level of the filesystem, represented as the value no_schema . Nested folders in the path are separated with periods.Non-default values correspond to the path in this location in the Datasets page in Dremio: |
By default, tables are given the same name as the model used to create them. Use the file
configuration to give a table a non-default name. For example, a model named my_basic_table.sql
creates by default a table named my_basic_table
. With the file
configuration, you could create instead a table named my_differently-named-basic-table
.
The format
configuration is an optional configuration that determines the format of a persisted table.
For table
materializations created from sources, you can specify the format
configuration in a project file or a property file. For table
materializations created from seeds, you can specify the format
configuration in a project file, a property file or a config
block. See Model Confgurations in the dbt documentation.
Possible values for both types of table
materializations:
-
iceberg
: Formats persisted tables as Apache Iceberg tables. -
json
: Formats persisted tables as JSON files. -
parquet
: Formats persisted tables as Parquet files. -
text
: Formats persisted tables as CSV files.
Additional possible values for table
materializations created directly from sources:
-
delta
: Formats source tables as Delta Lake tables. -
excel
: Formats source tables as Microsoft Excel files.
If using the excel
format, there are additional settings; however, these do not show in PDS but rather only applied when running the VDS generated from the twin strategy.
Setting | Description |
---|---|
extract_header |
Extract the column names from the first line of the file. |
xls |
If the file is in the xls format instead of the xlsx format. Note: This option only works for xls files. |
has_merged_cells |
Expands cells that have been merged in the Excel sheet. |
If using the text
format, there are additional settings; however, these do not show in PDS but rather only applied when running the VDS generated from the twin strategy.
Settings | Description |
---|---|
field_delimiter |
Field Delimiter Character |
line_delimiter |
Line Delimiter Character |
quote |
Character used to quote. e.g. if the field delimiter character is , and the quote character is ", using "hello, there" would ensure hello and there would not be separated. |
comment |
Character used for comments. |
skip_first_line |
Skip the first line of the file. |
extract_header |
Extract the column names from the first line of the file. |
trim_header |
Trims column names to a currently unknown number of characters. |
auto_generated_column_names |
Create automatic columns if no column names are specified. |
The partitioning
configuration partitions a table
materialization.
Option | Value | Default | Comment |
---|---|---|---|
partition_by |
Names of columns, separated by commas | N/A | Mutually exclusive with distribute_by
|
partition_method |
striped , hash , roundrobin
|
striped |
Valid only when used with partition_by
|
distribute_by |
Names of columns, separated by commas | N/A | Mutually exclusive with partition_by
|
localsort_by |
Names of columns, separated by commas | N/A | Used for sorting within each partition |
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='striped'
) }}
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='roundrobin'
) }}
{{ config(
materialized='table',
distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
localsort_by='cd_marital_status'
) }}
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='hash'
) }}
{{ config(
materialized='table',
distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
localsort_by='cd_marital_status'
) }}
For an overview of what incremental models are, see Incremental Models in the dbt documentation.
- Object storage
- Scratch storage
Configuration | Description |
---|---|
object_storage_source |
The name of the filesystem in which to create tables from incremental models. The dbt alias is datalake .The default value is $scratch .
|
object_storage_path |
The path in the filesystem in which to create tables from incremental models. The dbt alias is root_path .The default is the root level of the filesystem, represented as the value no_schema . Nested folders in the path are separated with periods.Non-default values correspond to the path in this location in the Datasets page in Dremio: |
By default, incremental materializations are given the same name as the model used to create them. Use the file
configuration to give an incremental materialization a non-default name. For example, a model named my_basic_incremental.sql
creates by default an incremental materialization named my_basic_incremental
. With the file
configuration, you could create instead an incremental materialization named my_differently-named-basic-incremental
.
The optional incremental_strategy
config tells the incremental model how to insert data into tables that are already built from the model.
-
append
: Insert new records without updating or overwriting any existing data. This is the option that is set by default. -
merge
: Insert new records and update existing records that match the unique ID provided.
The on_schema_change
configuration specifies how to update a table built from an incremental model when one or more columns are added to or removed from a schema.
The possible values are:
sync_all_columns
append_new_columns
fail
ignore
The format
configuration is an optional configuration that determines the format of an incremental materialization. You can specify the format
configuration in a project file, a property file or a config
block. See Model Confgurations in the dbt documentation.
-
iceberg
: Formats an incremental materialization as an Apache Iceberg table. This is the only option and is set by default.
The partitioning
configuration partitions an incremental materialization.
Option | Value | Default | Comment |
---|---|---|---|
partition_by |
Names of columns, separated by commas | N/A | Mutually exclusive with distribute_by
|
partition_method |
striped , hash , roundrobin
|
striped |
Valid only when used with partition_by
|
distribute_by |
Names of columns, separated by commas | N/A | Mutually exclusive with partition_by
|
localsort_by |
Names of columns, separated by commas | N/A | Used for sorting within each partition |
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='striped'
) }}
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='roundrobin'
) }}
{{ config(
materialized='table',
distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
localsort_by='cd_marital_status'
) }}
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='hash'
) }}
{{ config(
materialized='table',
distribute_by=[ 'cd_credit_rating', 'cd_marital_status' ],
localsort_by='cd_marital_status'
) }}
A Dremio view (also known as a virtual dataset) is a view
materialization that is based on a Dremio table.
In Dremio spaces.
Configuration | Description |
---|---|
dremio_space |
The value of the Dremio space in which to create views. The dbt alias is database .The default value is @<username> , where <username> is the username in Dremio.Non-default values correspond to the name in this location in the Spaces section of the Datasets page in Dremio: |
dremio_space_folder |
The folder in the Dremio space in which to create views. The dbt alias is schema . The default is the root level of the space, and is represented by the value no_schema . Non-default values correspond to the path in this location in the Datasets page in Dremio: Nested folders are separated with periods. Do not include the root no_schema in the paths of views. |
Paths in space are recursive, like filesystem folders : dbt.internal."my folder's name"
. Dots are not allowed in folder names.
The root level of a space is represented as no_schema
. To materialize a model contained in the folder my_dbt_project/models/examples
at the root folder of the space mySpace
, you would set the examples
configuration in my_dbt_project/dbt_project.yml
like this:
models:
my_dbt_project:
examples:
+dremio_space: mySpace
+space_folder: no_schema
Note that you can also use the equivalent dbt terminology i.e. +database: mySpace
, +schema: no_schema
Folders that do not exist are created if the user running a model has the appropriate permission. However, the space must already exist in Dremio before the model is run.
Do not include the root folder no_schema
in the paths of views.
Dremio accepts almost any string character in the objects' names. Therefore, the adapter double-quotes each part of the space.folder.name
path of a view. If you specify nested folders in the path of a view, each folder's name is double-quoted. Example: "space"."folder"."sub-folder"."sub-sub-folder"."identifier"
By default, views are given the same name as the model used to create them. Use the alias
configuration to give a view a non-default name. For example, a model named my_basic_view.sql
creates by default a view named my_basic_view
. With the alias
configuration, you could create instead a view named my_differently-named-basic-view
.
A reflection is a materialization that corresponds to a reflection in Dremio.
You do not specify where to create reflections. Reflections are automatically created in Dremio's reflection store.
To take reflections into account when it comes to metadata, add the following to bottom of your dbt_project.yml file:
vars:
dremio:reflections_metadata_enabled: true
Please keep in mind that this requires an admin user
The table of configuration options below can be set under vars
in your dbt_project.yml
file. They can also be set in the reflection model’s sql file. In these two examples, my_anchor
is a placeholder for the name of the dataset that the reflection is built from.
Example 1
{{ config(materialized='reflection',
name='this is the name',
reflection_type='aggregate',
dimensions=['Datetime0'],
measures=['Num3', 'Int2'],
computations=['COUNT','COUNT'])}}
-- depends_on: {{ ref('my_anchor') }}
Example 2
{{ config(materialized='reflection',
reflection_type='aggregate',
dimensions=['Datetime0'],
date_dimensions=['Datetime0'],
measures=['Num3', 'Int2'],
computations=['MIN,SUM','COUNT'],
arrow_cache=true) }}
-- depends_on: {{ ref('my_anchor') }}
The model definition will not contain a SELECT
statement, but a simple :
-- depends_on: {{ ref('my_anchor') }}
Configuration | Reflection type | Value | Default |
---|---|---|---|
name |
Both | Name of the refletion | Unnamed Reflection |
reflection_type |
Both |
raw , aggregate
|
raw |
display |
raw |
Names of columns to include in a reflection, separated by commas | All columns |
dimensions |
aggregate |
Names of columns to use as dimension columns, separated by commas | All non-decimal, float, and double columns |
date_dimensions |
aggregate |
Names of the columns to use with a DATE granularity |
None |
distribute_by |
aggregate |
Names of the columns to use as Distribution Fields | None |
measures |
aggregate |
Names of columns to use as measure columns, separated by commas | All decimal, float, and double columns |
computations |
aggregate |
The list of computations to use. The supported computations are SUM, MIN, MAX, COUNT, and APPROX_COUNT_DISTINCT. This list maps directly to the list of measures. | SUM, COUNT for each numeric measure, COUNT for others |
The partitioning
configuration partitions a reflection.
Option | Value | Default | Comment |
---|---|---|---|
partition_by |
Names of columns, separated by commas | N/A | |
partition_transform |
The list of transformations to use. The supported transformations are YEAR, MONTH, DAY, HOUR, BUCKET(n) and TRUNCATE(n). This list maps directly to the list of partitions. | Original/Identity | |
partition_method |
striped , consolidated
|
striped |
|
localsort_by |
Names of columns, separated by commas | N/A | Used for sorting within each partition |
{{ config(
materialized='table',
partition_by='cd_credit_rating',
localsort_by='cd_marital_status',
partition_method='striped'
) }}
{{ config(
materialized='reflection',
reflection_type='raw',
partition_by='cd_credit_rating'
) }}
{{ config(
materialized='reflection',
reflection_type='raw',
partition_by='cd_credit_rating',
partition_method='consolidated'
) }}
Dremio tables can be created only in object storage and Dremio views can be created only in Dremio spaces. Therefore, the adapter uses a twin strategy to determine how to handle creating a view with the same name as an existing table and vice versa.
Configuration | Applicable Materializations | Possible Values | Default |
---|---|---|---|
twin_strategy | every materialization but reflection | allow, prevent, clone | clone |
Allows materializations of different types to have the same names. For example, if the materialized view "mySpace"."myFolder"."customers"
exists in a space, it is possible to materialize the table "mySource"."myFolder"."customers"
in object storage. And vice-versa.
Prevents materializations of different types from having the same names. For example, if the materialized view "mySpace"."myFolder"."customers"
exists in a space, then materializaing the table "mySource"."myFolder"."customers"
in object storage drops "mySpace"."myFolder"."customers"
. And vice-versa.
- If you materialize a table that has the same name as an existing view, the definition of the view is changed to a
select *
on the table. For example, if the materialized view"mySpace"."myFolder"."customers"
exists in a space, and you materialize the table"mySource"."myFolder"."customers"
in object storage, the definition of the view changes toselect * from {{ "mySource"."myFolder"."customers" }}
. - If you materialize a view that has the same name as an existing table, the view is defined as a
select *
on the table. For example, if the materialized table"mySource"."myFolder"."customers"
exists in object storage, and you materialize the view"mySpace"."myFolder"."customers"
, the view is defined asselect * from {{ "mySource"."myFolder"."customers" }}
.
{{ config(materialized='table', twin_strategy='prevent') }}
models:
test_proj:
# Config indicated by + and applies to all files under models/example/
example:
+materialized: view
+twin_strategy: prevent
Dremio can send native SQL queries to external sources set up in Dremio as database sources. Such queries are known as "External Queries".
External queries can be set in models with the following materializations:
view
table
incremental
- Ensure that the
external_query
configuration option is set totrue
in the config block in the model. The SQL being ran must not useref()
orsource()
, instead only[schema.]table
paths. - Set which Database Source the external query uses, at least one
-- depends_on : {{ source("<source_name>", "<table_name>") }}
must be added to the model's SQL. - Create a
schema.yml
file that specifies the source and source tables.
{{ config(
external_query="true",
sql_header="/*External Query*/"
) }}
SELECT *
FROM public.countries
-- depends_on : {{ source( "external_source", "external_table") }}
version: 2
sources:
- name: external_source
database: "postgres"
tables:
- name: external_table
If you were to use these two examples, running the model would cause dbt to run this query in Dremio:
create or replace view "@dremio"."select_external_query" as (
select *
from table("postgres".external_query('
SELECT *
FROM public.countries
-- depends_on : "postgres"."external_source"."external_table"'))
)
A same dremio installation could handle several data environments (dev, prod, etc.). Here is how the profile will look with two different object storage paths:
multiple_targets:
outputs:
dev:
dremio_space: '@user'
dremio_space_folder: no_schema
object_storage_path: firstpath
object_storage_source: myobjectstorage
password:
port: 9047
software_host: my-test-host
threads: 1
type: dremio
use_ssl: false
user:
prod:
dremio_space: '@user'
dremio_space_folder: no_schema
object_storage_path: secondpath
object_storage_source: myobjectstorage
password:
port: 9047
software_host: my-test-host
threads: 1
type: dremio
use_ssl: false
user:
target: dev
Pass in --target prod in the dbt run command to choose the prod target, since dev is the default in the profile above:
`dbt run --target prod`