Create a data pipeline that ingests user data via an API, processes and stores it, and then retrieves it in a serialized format.
- Data Source: Random API for fake user data
- Python & Pandas: For programming and data manipulation.
- Redis: Caching recent data for quick access.
- Postgres: Long-term data storage.
- FastAPI: API endpoint for data retrieval.
- Docker: Containerization of the entire pipeline.
- Data Ingestion:
- Python script to fetch random user data from an API.
- Validate the data before processing.
- Pandas for data cleaning and transformation.
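
The fetch and validation steps above might look roughly like the sketch below. It is illustrative only: the endpoint URL (`randomuser.me`), the required field names, and the helper names `fetch_users`, `is_valid`, and `split_valid` are assumptions, since the README only mentions a "Random API for fake user data".

```python
from __future__ import annotations

import json
from urllib.request import urlopen

# Assumed endpoint; the README does not name the API it uses.
API_URL = "https://randomuser.me/api/?results={count}"

# Assumed minimum set of fields a record needs before it is processed.
REQUIRED_FIELDS = ("email", "name", "login")


def fetch_users(count: int = 10) -> list[dict]:
    """Fetch `count` fake users and return the raw records from the response."""
    with urlopen(API_URL.format(count=count), timeout=10) as response:
        payload = json.load(response)
    return payload.get("results", [])


def is_valid(record: dict) -> bool:
    """Accept a record only if required fields are present and the email has an '@'."""
    if not all(record.get(field) for field in REQUIRED_FIELDS):
        return False
    email = record["email"]
    return isinstance(email, str) and "@" in email


def split_valid(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (valid, rejected) so rejects can be logged or inspected."""
    valid = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return valid, rejected
```

The valid records could then be flattened into a DataFrame (for example with `pandas.json_normalize`) for the cleaning and transformation step.
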
- Caching Layer:
- Redis setup for caching recent user data with a TTL.
- Python logic for data retrieval from Redis and Postgres.
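
One common way to combine the two stores is a cache-aside lookup: read from Redis first, fall back to Postgres on a miss, then write the result back with a TTL. The sketch below assumes a cache object with redis-py style `get` and `setex` methods and a caller-supplied `load_from_db` function; the key format, the TTL value, and the function names are hypothetical and not taken from the repo.

```python
from __future__ import annotations

import json
from typing import Callable, Optional

CACHE_TTL_SECONDS = 300  # assumed TTL; the README does not specify one


def get_user(
    user_id: str,
    cache,
    load_from_db: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return a user from the cache, falling back to the database on a miss.

    `cache` is any object exposing `get(key)` and `setex(key, ttl, value)`,
    such as a `redis.Redis` client.
    """
    key = f"user:{user_id}"  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: values are stored as JSON strings (or bytes).
        return json.loads(cached)

    # Cache miss: query long-term storage and repopulate the cache.
    user = load_from_db(user_id)
    if user is not None:
        cache.setex(key, CACHE_TTL_SECONDS, json.dumps(user))
    return user
```

Passing the cache and the database loader in as arguments keeps the lookup logic easy to unit test with in-memory stand-ins.
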
- Data Storage:
- Design and implement a Postgres database schema for the user data.
- Hash PII before writing it to storage.
- Store processed data into Postgres.
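
As a rough illustration of the PII step, the sketch below replaces selected fields with a keyed SHA-256 digest (HMAC) before a record is stored. Which fields count as PII, the use of HMAC rather than a plain hash, and the names `hash_pii` and `redact_record` are all assumptions; the README does not say how the repo hashes its data.

```python
import hashlib
import hmac

# Hypothetical choice of which fields are treated as PII.
PII_FIELDS = ("email", "phone")


def hash_pii(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of a trimmed, lower-cased PII value."""
    normalised = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


def redact_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of `record` with each non-null PII field replaced by its digest."""
    redacted = dict(record)
    for field in PII_FIELDS:
        if redacted.get(field) is not None:
            redacted[field] = hash_pii(str(redacted[field]), secret_key)
    return redacted
```

A keyed digest makes simple dictionary attacks on common values such as email addresses harder than an unkeyed hash would, provided the key is kept out of the database.
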
- Data Retrieval:
- API endpoint (e.g., using FastAPI) for data retrieval.
- Dockerization:
- Dockerfile for the Python application.
- Docker Compose for orchestrating Redis and Postgres services.
- Testing and Deployment:
- Unit tests for pipeline components.
- Data pipeline architecture.
- Skills in Python, Pandas, Redis, Postgres, FastAPI and Docker.
- Front-end dashboard for data display.
- Advanced data processing features.
Clone the repo
git clone https://github.com/mrpbennett/etl-pipeline.git
cd into the cloned repo and run docker compose up
docker compose up
Then open the URL below to access the front end and see where the data is stored
http://127.0.0.1:5173