In this tutorial, we'll guide you through constructing an Open Data Lakehouse, starting with raw source data. You'll gain practical insights into how to efficiently ingest data, transform raw information, create tailored datasets for dashboarding and reporting, and develop a predictive model using historical records.
These labs will showcase CDP's user-friendly features and robust capabilities, allowing organizations to effectively manage, analyze, and extract valuable insights from their data, regardless of its structure or source. Let's embark on this journey into the realm of data-driven success together.
For this tutorial, we will use a raw airlines dataset to:
- Pre-reqs - Set up the CDP user workload password and deploy the Applied Machine Learning Prototype (AMP) for Canceled Flight Prediction
- Ingest - Build an ingestion data pipeline to enable advanced analytics and Machine Learning (ML) use cases
- Analyze - Explore the ingested data and conduct an interactive analysis
- Visualize - Create a visualization dashboard and deploy an ML project
- Predict - Predict the likelihood of a flight being canceled based on historical records
- Do More with Iceberg - Test Iceberg features such as Time Travel and Partition Evolution, and change the ML Project to train the Canceled Flight Prediction model using the Data Lakehouse (Iceberg) data
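
The Time Travel feature mentioned in the last step relies on the table keeping immutable snapshots, so a reader can ask for the data as it was at an earlier point in time. The following is a minimal conceptual sketch in plain Python, not the Iceberg API: the `ToyTable` class, its `commit` and `read` methods, the integer timestamps, and the sample flight rows are all hypothetical names and values chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot-based table (hypothetical, not the Iceberg API)."""
    _times: list = field(default_factory=list)      # commit timestamps, ascending
    _snapshots: list = field(default_factory=list)  # one immutable row tuple per commit

    def commit(self, ts, rows):
        # Each write appends a new immutable snapshot instead of overwriting data.
        if self._times and ts <= self._times[-1]:
            raise ValueError("commit timestamps must increase")
        self._times.append(ts)
        self._snapshots.append(tuple(rows))

    def read(self, as_of=None):
        if not self._snapshots:
            raise LookupError("table has no snapshots yet")
        # No timestamp given: return the current (latest) snapshot.
        if as_of is None:
            return self._snapshots[-1]
        # Time travel: pick the newest snapshot committed at or before as_of.
        i = bisect_right(self._times, as_of)
        if i == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._snapshots[i - 1]


# Made-up sample rows; timestamps are arbitrary integers.
flights = ToyTable()
flights.commit(100, [("AA10", "on_time")])
flights.commit(200, [("AA10", "on_time"), ("UA22", "canceled")])

current_rows = flights.read()          # latest state of the table
rows_at_150 = flights.read(as_of=150)  # state as of time 150, before the second commit
```

Reading with `as_of=150` returns only the rows from the first commit, because the second commit had not happened yet at that time. In the labs themselves, the equivalent operation is run as a query against the Iceberg table rather than through a Python class like this one.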
Learn how to use an open-source pre-trained instruction-following LLM (Large Language Model) to build a ChatBot-like web application. The responses of the LLM are enhanced by giving it context from an internal knowledge base. This context is retrieved by using an open-source Vector Database to do a semantic search.
Data Model: