To go through this lab you will need to install the following tools on your workstation.
Download PostgreSQL from the official website. It includes both the database (which we will use to store and process data) and pgAdmin, an environment to query and manage PostgreSQL databases. During installation, make a note of two things that we will need later to connect to the database:
- Port: By default it should be 5432. If it is any other, make a note of it.
- Superuser's (postgres) password: To keep it simple I set it to "postgres", but if you choose a different one, don't forget it!
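As a rough illustration of where the port and password end up, the sketch below builds a SQLAlchemy-style connection URL from them. The function name `build_postgres_url`, the host `localhost`, the database name `postgres` and the `psycopg2` driver suffix are assumptions made for this example, not settings taken from the lab's own files.

```python
from urllib.parse import quote_plus


def build_postgres_url(user="postgres", password="postgres",
                       host="localhost", port=5432, database="postgres"):
    """Return a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # Escape user and password so characters such as '@' or ':' stay valid in a URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Defaults mirror the installation notes above: port 5432, password "postgres".
url = build_postgres_url()
print(url)  # postgresql+psycopg2://postgres:postgres@localhost:5432/postgres
```

If you picked a different port or password during installation, pass them as arguments, for example `build_postgres_url(password="my-secret", port=5433)`.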
We need the following libraries:
- Jupyter: Needed to work with Jupyter notebooks.
- Luigi: This library allows us to build and schedule data pipelines.
⚠️ To have access to all of Luigi's features you need a UNIX machine (Linux, macOS). In this lab we will only run the pipelines in local-scheduler mode.
- Pandas: One of the best-known Python libraries for data manipulation.
- SQLAlchemy: Library used as an interface to interact with different databases from Python.
- Psycopg: This module serves as a connector to PostgreSQL.
If you have pip installed, you can install these libraries using the requirements file in this repository as follows:
pip install -r requirements.txt
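To check that the installation worked, something like the following can be run. It is a minimal sketch: the import names in `REQUIRED_MODULES` are assumed from the library list above (in particular `psycopg2` for Psycopg) and may differ from what the requirements file actually pins.

```python
from importlib.util import find_spec

# Import names assumed for the libraries listed above; adjust if yours differ.
REQUIRED_MODULES = ["jupyter", "luigi", "pandas", "sqlalchemy", "psycopg2"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```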
In this lab we use some basic SQL (Structured Query Language), the main tool for interacting with the vast majority of databases. Each database system has its own SQL "dialect", but they all share a common core.
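To give a flavour of that common core, here is a small sketch that runs entirely in memory. It uses SQLite through Python's built-in `sqlite3` module instead of PostgreSQL, only so that it works without a database server; the `reports` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database: a stand-in for PostgreSQL, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (country TEXT, indicator TEXT, value REAL)")

# Made-up sample rows. Note that the '?' placeholder is sqlite3's style;
# psycopg uses '%s' instead.
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        ("ES", "covid", 1.5),
        ("ES", "mask", 80.0),
        ("FR", "covid", 2.5),
        ("FR", "mask", 70.0),
    ],
)

# SELECT ... GROUP BY ... ORDER BY belongs to the core shared across dialects.
rows = conn.execute(
    "SELECT indicator, COUNT(*) AS n, AVG(value) AS avg_value "
    "FROM reports GROUP BY indicator ORDER BY indicator"
).fetchall()
conn.close()

print(rows)  # [('covid', 2, 2.0), ('mask', 2, 75.0)]
```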
If you are not yet familiar with SQL, don't worry: it is one of the easiest languages to pick up. You can learn the basics from any of the resources below:
Files you will find in this repository:
- Data Engineering Lab.ipynb: A Jupyter notebook with some examples of data manipulation that serves as an initial setup.
- Several parameters we will use during the lab (database settings, file paths).
- A collection of functions to be reused in the lab, mostly to interact with the database (create tables, load data, run queries, retrieve data...).
- pipelines (folder): You can find here some examples of data pipelines using Luigi.
- : Downloads daily reports from API.
- : Joins 2 reports from different indicators (covid & mask) into a single table.
- Similar to the previous pipeline, but using a table schema that is more scalable.
- Yet another iteration on covid_survey_covid_mask, but now making it fully scalable to handle a dynamic list of rpl_covid_XXX reports as input.
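The schemas used by these pipelines are not shown here, so the following is only a guess at what a "more scalable" layout can look like. One common approach is a long table with one row per observation and an `indicator` column, so that adding a report adds rows rather than columns. The function `stack_reports`, the column names and the sample values are all hypothetical.

```python
def stack_reports(reports):
    """Stack {indicator: rows} into one long list with an `indicator` column."""
    stacked = []
    # Sort by indicator name so the output order is deterministic.
    for indicator, rows in sorted(reports.items()):
        for row in rows:
            stacked.append({"indicator": indicator, **row})
    return stacked


# Made-up sample data: any number of reports can be passed in,
# which is what makes the input list "dynamic".
reports = {
    "mask": [{"date": "2021-01-01", "country": "ES", "value": 80.0}],
    "covid": [{"date": "2021-01-01", "country": "ES", "value": 1.5}],
}

long_table = stack_reports(reports)
for row in long_table:
    print(row)
```

With this layout a third indicator needs no schema change, whereas a wide join of two reports would need a new column for each additional indicator.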