Datapark is a self-hosted data platform for educational purposes. It consists of a collection of containerized services that let you build solutions for data-related problems. To use them, you'll need Docker installed. In the docker-compose file you can find the following services:
- jupyterlab: a JupyterLab server. This is where a developer can use notebooks to explore data and prototype solutions.
- postgresql: a PostgreSQL database. It can be used for storing data, and it is also used by other services, such as MinIO and MLflow, to store their metadata.
- minio: a MinIO object storage service with an S3-compatible API (similar to AWS S3). This is the intended place for storing data.
- mlflow: an MLflow tracking server to support machine learning tasks and applications.
- spark: three Spark containers (one master and two workers) that provide a Spark cluster for distributed computing tasks (see the sketch after this list).
- airflow: three Airflow containers (one for initialization, one for the web UI, and one for the scheduler) that allow scheduling and monitoring of data workflows.
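As a minimal sketch of how a notebook might talk to the Spark cluster, assuming the master is reachable from the JupyterLab container at spark://spark-master:7077 (check docker-compose.yml for the actual service name and port):

```python
from pyspark.sql import SparkSession

# Connect to the cluster; the master URL below is an assumption.
spark = (
    SparkSession.builder
    .appName("datapark-example")           # shows up in the Spark web UI
    .master("spark://spark-master:7077")   # adjust to the compose service name/port
    .getOrCreate()
)

df = spark.range(1000)   # tiny DataFrame just to exercise the workers
print(df.count())        # should print 1000 if the cluster is reachable
spark.stop()
```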
To get started, clone this repository. To run everything (from a Unix/WSL terminal):
docker compose up -d
To shut it down:
docker compose down
To access the different services in the browser:
- jupyterlab: http://localhost:8888
- minio: http://localhost:9001
- mlflow: http://localhost:8080
- airflow: http://localhost:8081
- spark: http://localhost:9090
You can find the usernames and passwords for the different services in the .env file. Make sure you change them before using the platform.
The platform includes example notebooks to help you use the different services, as well as an example of how to build Airflow DAGs that run jobs on Spark.
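For orientation, here is a rough sketch of the kind of calls a notebook can make against MinIO and MLflow. The internal hostnames and ports (minio:9000, mlflow:5000), the bucket name, and the MINIO_ROOT_USER / MINIO_ROOT_PASSWORD variable names are assumptions; use the values defined in docker-compose.yml and your .env file.

```python
import os

import mlflow
from minio import Minio

# MinIO: endpoint and credential variable names are assumptions,
# read the actual values from the .env file.
client = Minio(
    "minio:9000",
    access_key=os.getenv("MINIO_ROOT_USER"),
    secret_key=os.getenv("MINIO_ROOT_PASSWORD"),
    secure=False,  # plain HTTP inside the compose network
)
if not client.bucket_exists("demo"):
    client.make_bucket("demo")
client.fput_object("demo", "data.csv", "data.csv")  # upload a local file

# MLflow: the tracking URI is an assumption; point it at the mlflow
# service as it is named in docker-compose.yml.
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("datapark-demo")
with mlflow.start_run():
    mlflow.log_param("alpha", 0.1)
    mlflow.log_metric("rmse", 0.42)
```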
By default, notebooks are stored in platform/jupyterlab/notebooks/ and DAGs can be found in platform/airflow/dags.
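The shipped DAG example is the reference; as a rough sketch of the general pattern, a minimal DAG that submits a job to Spark can use the SparkSubmitOperator from the apache-spark provider. The application path and the spark_default connection id below are placeholders, and the exact DAG arguments depend on the Airflow version in use.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_job_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # trigger manually from the Airflow web UI
    catchup=False,
) as dag:
    submit = SparkSubmitOperator(
        task_id="submit_spark_job",
        application="/opt/airflow/dags/jobs/example_job.py",  # placeholder script path
        conn_id="spark_default",  # Airflow connection pointing at the Spark master
    )
```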