This project is a proof of concept (PoC) showing how to integrate dbt (data build tool) with Apache Spark, a distributed engine for large-scale data processing. With Spark as the execution engine, your dbt models and transformations run on Spark and can handle large datasets.
Follow these instructions to get the project up and running on your local machine.
Before you begin, ensure you have met the following requirements:
- Python (3.8)
- Poetry: This project uses Poetry for dependency management.
- Apache Spark: Make sure Spark is installed and configured on your machine.
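A quick way to confirm the prerequisites above are on your PATH is sketched below. The `check_tool` helper is hypothetical, and the command names (`python3`, `poetry`, `spark-submit`) assume a standard install; adjust them if your setup differs.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Assumed command names; spark-submit ships with a standard Spark install.
for tool in python3 poetry spark-submit; do
  check_tool "$tool" || true
done
```

Any tool reported as missing should be installed before continuing. The check does not verify versions, so confirm separately that Python is 3.8.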
- Clone the repository:
    git clone https://github.com/damavis/damavis-dbt-spark-poc.git
    cd damavis-dbt-spark-poc
- Set up a Python virtual environment and install the project dependencies using Poetry:
    poetry install
To run the PoC, follow these steps:
- Configure your dbt project to connect to Spark as the execution engine. You may need to update your profiles.yml to include Spark-specific settings.
- Create your dbt models and transformations, and use Spark for data processing where needed.
- Run your dbt project as usual:
    dbt run
- Observe the integration of dbt with Spark in your data transformation workflows.
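The configuration and modeling steps above might look roughly like the sketch below. It is not taken from this repository. The scratch directory, the profile name `dbt_spark_poc`, the schema `poc`, and the model `example_numbers` are all hypothetical, and the connection settings assume a Spark Thrift server listening on localhost:10000. Other dbt-spark connection methods (for example `session`) need different settings.

```shell
# Hypothetical scratch directory; the real project layout may differ.
POC_DIR="${POC_DIR:-./dbt_spark_sketch}"
mkdir -p "$POC_DIR/models"

# profiles.yml: the profile name is an assumption and must match the
# `profile:` key in your dbt_project.yml.
cat > "$POC_DIR/profiles.yml" <<'EOF'
dbt_spark_poc:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift     # assumes a running Spark Thrift server
      host: localhost    # assumed host
      port: 10000        # assumed Thrift port
      schema: poc        # assumed target schema
EOF

# A minimal example model, materialized as a Parquet table on Spark.
cat > "$POC_DIR/models/example_numbers.sql" <<'EOF'
{{ config(materialized='table', file_format='parquet') }}

select 1 as id, 'one' as label
union all
select 2 as id, 'two' as label
EOF
```

In a real project the model would live in the project's own models directory. From the dbt project directory, `dbt debug --profiles-dir <dir>` can check the connection before `dbt run --profiles-dir <dir>` builds the models, where `<dir>` is the directory holding profiles.yml.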
Thank you for watching =)