3.1 Data preparation: ETL and feature engineering

Ingestion

3.1.1 Creating a new project

To create a new project, right-click a folder in the Text editor and choose New Mage project. Then open Settings and click Register project.

Video

Opening a text editor:

  • Go to the command center (at the top of the page)
  • Type "text editor"

3.1.2 Data preparation - Ingestion

The project unit_1_data_preparation now has an empty pipeline, which can be built up from blocks. The first block we'll create is an ingestion block, which uses Python code to download the green taxi parquet files for January through March and concatenate them. Once that is done, generate a series of charts useful for data profiling.

  • Note: If the time chart isn't displayed, insert the following snippet just above the dfs.append(df) line in ingest.py: df['lpep_pickup_datetime_cleaned'] = df['lpep_pickup_datetime'].astype(np.int64) // 10**9
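The ingestion step described above could look roughly like the sketch below. It is not the course's ingest.py: the base URL, the file-name pattern, and the helper names build_urls and ingest are assumptions made for illustration, and the year is left as a parameter because this README does not state it. The only line taken from the text is the lpep_pickup_datetime_cleaned workaround, placed just above dfs.append(df) as the note suggests.

```python
from typing import Callable, Iterable, List

import numpy as np
import pandas as pd

# Assumption: public NYC TLC trip-data host and file naming; not stated in this README.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def build_urls(year: int, months: Iterable[int]) -> List[str]:
    """Build one green-taxi parquet URL per month (hypothetical helper)."""
    return [
        f"{BASE_URL}/green_tripdata_{year}-{month:02d}.parquet"
        for month in months
    ]


def ingest(
    urls: Iterable[str],
    reader: Callable[[str], pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Read every monthly file and concatenate the results into one DataFrame."""
    dfs = []
    for url in urls:
        df = reader(url)
        # Workaround from the note above: add an integer copy of the pickup
        # timestamp so the time chart can be rendered.
        df["lpep_pickup_datetime_cleaned"] = (
            df["lpep_pickup_datetime"].astype(np.int64) // 10**9
        )
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)


# Example call (downloads data; the year is a placeholder, not from the README):
# df = ingest(build_urls(2024, range(1, 4)))
```

Passing the reader in as a parameter is a design choice for this sketch only; it lets the concatenation logic be tried on small in-memory frames without downloading anything.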

Video

Code:

3.1.3 Utility helper functions

The utility functions already exist in the utils folder. They will later be imported into the transformer block.

Video

Code

2. Data Preparation

Videos

  1. Data preparation block
  2. Visualize prepared data

To see the correct histogram, change the last two lines of the default code to:

col = 'trip_distance'
x = df_1[df_1[col] <= 20][col]
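To illustrate why the filter matters, here is a small self-contained sketch. The sample values are made up, and np.histogram stands in for the chart that Mage draws; only the two lines defining col and x come from the text above. A single extreme trip distance would otherwise stretch the x-axis and squash every other bar.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: seven ordinary trips and one extreme outlier.
df_1 = pd.DataFrame(
    {"trip_distance": [0.8, 1.5, 2.3, 4.0, 7.5, 12.0, 19.9, 250.0]}
)

# The two lines from the text: keep only trips of at most 20 miles.
col = 'trip_distance'
x = df_1[df_1[col] <= 20][col]

# Bin the filtered values into four 5-mile buckets, as a histogram would.
counts, edges = np.histogram(x, bins=4, range=(0, 20))
print(dict(zip(edges[:-1].tolist(), counts.tolist())))
```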

Code


3. Build training sets

Videos

  1. Encoding functions
  2. Training set block

Code


4. Data validations using built-in testing framework

Videos

  1. Writing data validations

Code


Code

  1. Complete code solution
  2. Pipeline configuration

Resources

  1. Global Data Products

  2. Data validations using built-in testing framework

  3. Data quality checks with Great Expectations integration

  4. Unit tests

  5. Feature encoding