diff --git a/hugo/content/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab.md b/hugo/content/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab.md new file mode 100644 index 0000000..0d0a71d --- /dev/null +++ b/hugo/content/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab.md @@ -0,0 +1,259 @@ +--- +date: 2024-11-16T17:52:23+01:00 +draft: false +author: Gabriel LOPEZ +title: (DataExpert.io) Bootcamp - Day 1 - Lab +--- + +> **Goal:** Create a *cumulative table design* + +## Problem overview +We have a table containing the stats for NBA players; there's one record for each player's season. + +``` +postgres=# \d player_seasons; +``` + +Table `public.player_seasons` + 
| Column | Type |
| ------------ | ------- |
| player_name | text |
| age | integer |
| height | text |
| weight | integer |
| college | text |
| country | text |
| draft_year | text |
| draft_round | text |
| draft_number | text |
| gp | real |
| pts | real |
| reb | real |
| ast | real |
| netrtg | real |
| oreb_pct | real |
| dreb_pct | real |
| usg_pct | real |
| ts_pct | real |
| ast_pct | real |
| season | integer |

**Indexes:**
`"player_seasons_pkey" PRIMARY KEY, btree (player_name, season)`

This table has a temporal data problem: joining it with another table shuffles the player records (the same player's statistics no longer follow each other), which makes **run-length encoding** compression less efficient.
## Run-Length Encoding

> **Run-Length Encoding** is a simple data compression algorithm that encodes consecutive repeated data elements (runs) as a single value plus a count of its repetitions.

Instead of storing the repeated data multiple times, it stores the data value and the number of times it appears consecutively.

We are going to transform the table to have one row per player, with a column holding an array of that player's seasons. 
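The run-length idea is easy to see in a few lines of code (an illustrative sketch; the `rle_encode` helper is ours, not part of the lab):

```python
from itertools import groupby

def rle_encode(values):
    # Encode consecutive runs as (value, run_length) pairs
    return [(v, len(list(g))) for v, g in groupby(values)]

# When a player's rows sit next to each other, runs are long:
print(rle_encode(["A", "A", "A", "B", "B", "C"]))  # [('A', 3), ('B', 2), ('C', 1)]

# After a shuffling join, every run has length 1 and RLE buys nothing:
print(rle_encode(["A", "B", "A", "C", "B", "A"]))
```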
+ +## Cumulative table design + +The cumulative design serves two distinct but complementary purposes: + +1. **Join/GroupBy Optimization**: + - By storing temporal data (seasons) together in arrays, we optimize for: + - Fewer rows to join + - Less data shuffling during grouping + - Better data locality + +2. **RLE Compression**: When we later explode/unnest the arrays, the data will naturally group temporal values together, making RLE more efficient. + + +## What things are part of a season and what things aren't? +We want to store the temporal component in its own data type. + +We create a `STRUCT` named `season_stats` with Postgres: +```sql +CREATE TYPE season_stats AS ( + season INTEGER, + gp INTEGER, + pts REAL, + reb REAL, + ast REAL +) +``` +We don't keep all the season statistics in this struct, only the ones we'll need. + +## Creating the cumulative table +Then we create the cumulative table schema using our new `STRUCT`: +```sql +CREATE TABLE players ( + player_name TEXT, + height TEXT, + college TEXT, + country TEXT, + draft_year TEXT, + draft_round TEXT, + season_stats season_stats[], + current_season INTEGER, + PRIMARY KEY(player_name, current_season) +) +``` + +We want to figure out what the first year in the table is: +```sql +SELECT MIN(season) FROM player_seasons; +``` + +It is `1996`. + +```sql +WITH yesterday AS ( + SELECT * FROM players + WHERE current_season = 1995 +), +today AS ( + SELECT * FROM player_seasons + WHERE season = 1996 +) + +SELECT * FROM today t FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name +``` + +The query gives us `NULL` values on the `yesterday` (1995) side of the join, as that year doesn't exist in `players` yet. + +Now we want to `COALESCE()` the non-temporal values. 
+ +```sql +WITH yesterday AS ( + SELECT * FROM players + WHERE current_season = 1995 +), +today AS ( + SELECT * FROM player_seasons + WHERE season = 1996 +) +SELECT + COALESCE(t.player_name, y.player_name) AS player_name, + COALESCE(t.height, y.height) AS height, + COALESCE(t.college, y.college) AS college, + COALESCE(t.country, y.country) AS country, + COALESCE(t.draft_year, y.draft_year) AS draft_year, + COALESCE(t.draft_round, y.draft_round) AS draft_round, + COALESCE(t.draft_number, y.draft_number) AS draft_number +FROM today t FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name +``` + + **Purpose of COALESCE here**: + - Handling data continuity between two time periods + - It ensures we keep the non-temporal data when a player exists in either period + +**What the query actually does**: +For each player: +- If the player exists only in 'today' (1996): use today's data +- If the player exists only in 'yesterday' (1995): use yesterday's data +- If the player exists in both: use today's data (COALESCE takes the first non-NULL value) + +```sql +SELECT * FROM player_seasons; + +DROP TABLE IF EXISTS players; + +CREATE TYPE scoring_class AS ENUM('star', 'good', 'average', 'bad'); + +CREATE TABLE players ( + player_name TEXT, + height TEXT, + college TEXT, + country TEXT, + draft_year TEXT, + draft_round TEXT, + draft_number TEXT, + season_stats season_stats[], + scoring_class scoring_class, + years_since_last_season INTEGER, + current_season INTEGER, + PRIMARY KEY(player_name, current_season) +); + +-- This is the SEED query for cumulation because the first year's yesterday is going to be NULL, +-- the FULL OUTER JOIN is just taking everything from today as yesterday doesn't exist. 
+INSERT INTO players +WITH yesterday AS ( + SELECT * FROM players + WHERE current_season = 2000 +), +today AS ( + SELECT * FROM player_seasons + WHERE season = 2001 +) + +SELECT + COALESCE(t.player_name, y.player_name) AS player_name, + COALESCE(t.height, y.height) AS height, + COALESCE(t.college, y.college) AS college, + COALESCE(t.country, y.country) AS country, + COALESCE(t.draft_year, y.draft_year) AS draft_year, + COALESCE(t.draft_round, y.draft_round) AS draft_round, + COALESCE(t.draft_number, y.draft_number) AS draft_number, + -- If yesterday is null we create the initial array + CASE WHEN y.season_stats IS NULL THEN + ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + -- If today is not null we create the new value by concatenating the array of previous values + -- with today's ones. + -- We don't want to keep adding values to the season_stats array if the player is retired + WHEN t.season IS NOT NULL THEN + y.season_stats || ARRAY[ROW( + t.season, + t.gp, + t.pts, + t.reb, + t.ast + )::season_stats] + -- Otherwise we carry the history forward without modifying it. 
+    ELSE y.season_stats + END AS season_stats, + -- Determine the scoring class of the player for the current season + CASE + WHEN t.season IS NOT NULL THEN + CASE WHEN t.pts > 20 THEN 'star' + WHEN t.pts > 15 THEN 'good' + WHEN t.pts > 10 THEN 'average' + ELSE 'bad' + END::scoring_class + ELSE y.scoring_class + END as scoring_class, + CASE WHEN t.season IS NOT NULL THEN 0 + ELSE y.years_since_last_season + 1 + END as years_since_last_season, + COALESCE(t.season, y.current_season + 1) as current_season +FROM today t FULL OUTER JOIN yesterday y + ON t.player_name = y.player_name; + +-- No GROUP BY = very fast; everything happens in the map step, so it can be parallelized +SELECT + player_name, + (season_stats[CARDINALITY(season_stats)]::season_stats).pts / + CASE WHEN (season_stats[1]::season_stats).pts = 0 THEN 1 ELSE ((season_stats[1]::season_stats).pts) END +FROM players +WHERE current_season = 2001 +AND scoring_class = 'star'; + +-- Going back to the original table +WITH unnested AS ( + SELECT player_name, + UNNEST(season_stats) AS season_stats + FROM players + WHERE current_season = 2001 +) + +SELECT player_name, + (season_stats::season_stats).* +FROM unnested; + +-- Here we keep player stats (temporal attributes) sorted through the JOIN +-- We can apply RLE compression efficiently +``` diff --git a/hugo/hugo.toml b/hugo/hugo.toml index 4d40dbd..b0b260e 100644 --- a/hugo/hugo.toml +++ b/hugo/hugo.toml @@ -1,3 +1,5 @@ baseURL="https://glopez.github.io/blog/" title="Gabriel Study Blog" theme="archie" + +enableEmoji = true diff --git a/hugo/public/index.html b/hugo/public/index.html index ea67a77..4e34c9a 100644 --- a/hugo/public/index.html +++ b/hugo/public/index.html @@ -93,6 +93,115 @@

Idempotency

Read more ⟶ +
+

(DataExpert.io) Bootcamp - Day 1 - Lab

+ +
+ +
+

Goal : Create a cumulative table design

+
+

Problem overview

+

We have a table containing the stats for the NBA players, there’s one record for each player’s season.

+
postgres=# \d player_seasons;
+

Table public.player_seasons

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ColumnType
player_nametext
ageinteger
heighttext
weightinteger
collegetext
countrytext
draft_yeartext
draft_roundtext
draft_numbertext
gpreal
ptsreal
rebreal
astreal
netrtgreal
oreb_pctreal
dreb_pctreal
usg_pctreal
ts_pctreal
ast_pctreal
seasoninteger
+

Indexes: +"player_seasons_pkey" PRIMARY KEY, btree (player_name, season)

… + +
+ Read more ⟶ +
+

(DataExpert.io) Bootcamp - Day 1 - Lecture

diff --git a/hugo/public/index.xml b/hugo/public/index.xml index 6cc2f11..9e509e4 100644 --- a/hugo/public/index.xml +++ b/hugo/public/index.xml @@ -22,6 +22,13 @@ http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/ <p>Today&rsquo;s lecture deals with <strong>Slowly Changing Dimensions</strong> and <strong>Idempotency</strong>.</p> <blockquote> <p><strong>Slowly changing dimensions</strong> = An attribute that drifts over time</p> </blockquote> <p><em>Example:</em> Your favorite food</p> <h2 id="idempotency">Idempotency</h2> <p>You need to model slowly dimensions the right way because they impact idempotency.</p> <blockquote> <p><strong>Idempotent</strong> = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.</p> </blockquote> <blockquote> <p><strong>Idempotent pipeline</strong> = The ability for your data pipeline to produce the same results whether it&rsquo;s running in production or in backfill.</p> + + (DataExpert.io) Bootcamp - Day 1 - Lab + http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/ + Sat, 16 Nov 2024 17:52:23 +0100 + http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/ + <blockquote> <p><strong>Goal :</strong> Create a <em>cumulative table design</em></p> </blockquote> <h2 id="problem-overview">Problem overview</h2> <p>We have a table containing the stats for the NBA players, there&rsquo;s one record for each player&rsquo;s season.</p> <pre tabindex="0"><code>postgres=# \d player_seasons; </code></pre><p>Table <code>public.player_seasons</code></p> <table> <thead> <tr> <th>Column</th> <th>Type</th> </tr> </thead> <tbody> <tr> <td>player_name</td> <td>text</td> </tr> <tr> <td>age</td> <td>integer</td> </tr> <tr> <td>height</td> <td>text</td> </tr> <tr> <td>weight</td> <td>integer</td> </tr> <tr> <td>college</td> 
<td>text</td> </tr> <tr> <td>country</td> <td>text</td> </tr> <tr> <td>draft_year</td> <td>text</td> </tr> <tr> <td>draft_round</td> <td>text</td> </tr> <tr> <td>draft_number</td> <td>text</td> </tr> <tr> <td>gp</td> <td>real</td> </tr> <tr> <td>pts</td> <td>real</td> </tr> <tr> <td>reb</td> <td>real</td> </tr> <tr> <td>ast</td> <td>real</td> </tr> <tr> <td>netrtg</td> <td>real</td> </tr> <tr> <td>oreb_pct</td> <td>real</td> </tr> <tr> <td>dreb_pct</td> <td>real</td> </tr> <tr> <td>usg_pct</td> <td>real</td> </tr> <tr> <td>ts_pct</td> <td>real</td> </tr> <tr> <td>ast_pct</td> <td>real</td> </tr> <tr> <td>season</td> <td>integer</td> </tr> </tbody> </table> <p><strong>Indexes:</strong> <code>&quot;player_seasons_pkey&quot; PRIMARY KEY, btree (player_name, season)</code></p> + (DataExpert.io) Bootcamp - Day 1 - Lecture http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/ diff --git a/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/RLE.svg b/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/RLE.svg new file mode 100644 index 0000000..824c092 --- /dev/null +++ b/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/RLE.svg @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + + + + + + + 3x + + + 2x + + + 1x + + diff --git a/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/index.html b/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/index.html new file mode 100644 index 0000000..e018c76 --- /dev/null +++ b/hugo/public/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/index.html @@ -0,0 +1,480 @@ + + + + (DataExpert.io) Bootcamp - Day 1 - Lab - Gabriel Study Blog + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+

(DataExpert.io) Bootcamp - Day 1 - Lab

+
Posted on Nov 16, 2024
+
+ + + + +
+
+

Goal : Create a cumulative table design

+
+

Problem overview

+

We have a table containing the stats for the NBA players, there’s one record for each player’s season.

+
postgres=# \d player_seasons;
+

Table public.player_seasons

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ColumnType
player_nametext
ageinteger
heighttext
weightinteger
collegetext
countrytext
draft_yeartext
draft_roundtext
draft_numbertext
gpreal
ptsreal
rebreal
astreal
netrtgreal
oreb_pctreal
dreb_pctreal
usg_pctreal
ts_pctreal
ast_pctreal
seasoninteger
+

Indexes: +"player_seasons_pkey" PRIMARY KEY, btree (player_name, season)

+

We have a temporal data problem with the table where joining the table with another would cause shuffling of the players records (same player statics won’t be following each other) making run-length encoding compression less efficient

+

Run-Length Encoding

+
+

Run-Length Encoding is a simple data compression algorithm that encodes consecutive repeated data elements (runs) as a single value plus a count of its repetitions.

+
+

Instead of storing the repeated data multiple times, it stores the data value and the number of times it appears consecutively.

+

We are going to transform the table to have one row per player with a column of arrays of the player seasons.

+

Cumulative table design

+

The cumulative design serves two distinct but complementary purposes:

+
    +
  1. +

    Join/GroupBy Optimization:

    +
      +
    • By storing temporal data (seasons) together in arrays, we optimize for: +
        +
      • Fewer rows to join
      • +
      • Less data shuffling during grouping
      • +
      • Better data locality
      • +
      +
    • +
    +
  2. +
  3. +

    RLE Compression: When we later explode/unnest the arrays, the data will naturally group temporal values together, making RLE more efficient.

    +
  4. +
+

What things are part of a season and what things aren’t ?

+

We want to store the temporal component in it’s own data type.

+

We create a STRUCT named season_stats with Postges :

+
CREATE TYPE season_stats AS (  
+    season INTEGER,  
+    gp INTEGER,  
+    pts REAL,  
+    reb REAL,  
+    ast REAL  
+)
+

We don’t take all the season statistics in this struct as we won’t need all of them.

+

Creating the cumulative table

+

Then we create the cumulative table schema using our new STRUCT :

+
CREATE TABLE players (  
+    player_name TEXT,  
+    height TEXT,  
+    college TEXT,  
+    country TEXT,  
+	draft_year TEXT,
+    draft_round TEXT,  
+    season_stats season_stats[],  
+    current_season INTEGER,  
+    PRIMARY KEY(player_name, current_season)  
+)
+

We want to figure out what is the first year in the table is :

+
SELECT MIN(season) FROM player_seasons;
+

It is 1996

+
WITH yesterday AS (  
+    SELECT * FROM players  
+    WHERE current_season = 1995  
+),  
+today AS (  
+    SELECT * FROM player_seasons  
+    WHERE season = 1996  
+)  
+  
+SELECT * FROM today t FULL OUTER JOIN yesterday y  
+    ON t.player_name = y.player_name
+

The request give us <null> values for the left side of the join 1995 (yesterday) as it doesn’t exists.

+

Now we want to COALESCE() the non temporal values.

+
WITH yesterday AS (  
+    SELECT * FROM players  
+    WHERE current_season = 1995  
+),  
+today AS (  
+    SELECT * FROM player_seasons  
+    WHERE season = 1996  
+)  
+SELECT 
+	COALESCE(t.player_name, y.player_name) AS player_name,
+	COALESCE(t.height, y.height) AS height,
+	COALESCE(t.college, y.college) AS college,
+	COALESCE(t.country, y.country) AS country,
+	COALESCE(t.draft_year, y.draft_year) AS draft_year,
+	COALESCE(t.draft_round, y.draft_round) AS draft_round,
+	COALESCE(t.draft_number, y.draft_number) AS draft_number
+FROM today t FULL OUTER JOIN yesterday y  
+    ON t.player_name = y.player_name
+

Purpose of COALESCE here:

+
    +
  • Handling data continuity between two time periods
  • +
  • It ensures we keep the non-temporal data when a player exists in either period
  • +
+

What the query actually does: +For each player:

+
    +
  • If player exists in ’today’ (1996): use today’s data
  • +
  • If player only exists in ‘yesterday’ (1995): use yesterday’s data
  • +
  • If player exists in both: use today’s data (through COALESCE taking first non-NULL)`
  • +
+
SELECT * FROM player_seasons;  
+  
+DROP TABLE IF EXISTS players ;  
+  
+CREATE TYPE scoring_class AS ENUM('star', 'good', 'average', 'bad');  
+  
+CREATE TABLE players (  
+    player_name TEXT,  
+    height TEXT,  
+    college TEXT,  
+    country TEXT,  
+    draft_year TEXT,  
+    draft_round TEXT,  
+    draft_number TEXT,  
+    season_stats season_stats[],  
+    scoring_class scoring_class,  
+    years_since_last_season INTEGER,  
+    current_season INTEGER,  
+    PRIMARY KEY(player_name, current_season)  
+);  
+  
+-- This is the SEED query for cumulation because year 1995 is going to be <null>,  
+-- the FULL OUTER JOIN is just taking everything from today as yesterday doesn't exist.  
+INSERT INTO players  
+WITH yesterday AS (  
+    SELECT * FROM players  
+    WHERE current_season  = 2000  
+),  
+today AS (  
+    SELECT * FROM player_seasons  
+    WHERE season = 2001  
+)  
+  
+SELECT  
+    COALESCE(t.player_name, y.player_name) AS player_name,  
+    COALESCE(t.height, y.height) AS height,  
+    COALESCE(t.college, y.college) AS college,  
+    COALESCE(t.country, y.country) AS country,  
+    COALESCE(t.draft_year, y.draft_year) AS draft_year,  
+    COALESCE(t.draft_round, y.draft_round) AS draft_round,  
+    COALESCE(t.draft_number, y.draft_number) AS draft_number,  
+    -- If yesterday is null we create the initial array  
+    CASE WHEN y.season_stats IS NULL THEN  
+        ARRAY[ROW(  
+            t.season,  
+            t.gp,  
+            t.pts,  
+            t.reb,  
+            t.ast  
+        )::season_stats]  
+    -- If today is not null we create the new value by concatenating the array of previous values  
+    -- with today's ones.    
+    -- We don't want to keep adding values to the season_stats array if the player is retired 
+    WHEN t.season IS NOT NULL THEN  
+        y.season_stats || ARRAY[ROW(  
+            t.season,  
+            t.gp,  
+            t.pts,  
+            t.reb,  
+            t.ast  
+        )::season_stats]  
+    -- Otherwise we carry the history forward without modifying it.  
+    ELSE y.season_stats  
+    END AS season_stats,  
+    -- Determine the scoring class of the player for current season  
+    CASE  
+        WHEN t.season IS NOT NULL THEN  
+        CASE WHEN t.pts > 20 THEN 'star'  
+            WHEN t.pts >  15 THEN 'good'  
+            WHEN t.pts > 10 THEN 'average'  
+            ELSE 'bad'  
+        END::scoring_class  
+        ELSE y.scoring_class  
+    END as scoring_class,  
+    CASE WHEN t.season IS NOT NULL THEN 0  
+        ELSE y.years_since_last_season + 1  
+    END as years_since_last_season,  
+    COALESCE(t.season, y.current_season + 1) as current_season  
+FROM today t FULL OUTER JOIN yesterday y  
+    ON t.player_name = y.player_name;  
+  
+-- No GROUP BY = very fast; everything happens in the map step, so it can be parallelized
+SELECT  
+    player_name,  
+    (season_stats[CARDINALITY(season_stats)]::season_stats).pts /  
+    CASE WHEN (season_stats[1]::season_stats).pts = 0 THEN 1 ELSE ((season_stats[1]::season_stats).pts) END  
+FROM players  
+WHERE current_season = 2001  
+AND scoring_class = 'star';  
+  
+-- Going back to the original table  
+WITH unnested AS (  
+    SELECT player_name,  
+        UNNEST(season_stats) AS season_stats  
+    FROM players  
+    WHERE current_season = 2001  
+)  
+  
+SELECT player_name,  
+       (season_stats::season_stats).*  
+FROM unnested;  
+  
+-- Here we keep player stats (temporal attributes) sorted through the JOIN
+-- We can apply RLE compression efficiently
+
+
+ + +
+
+ +
+ + diff --git a/hugo/public/posts/index.html b/hugo/public/posts/index.html index 703e1bf..164f313 100644 --- a/hugo/public/posts/index.html +++ b/hugo/public/posts/index.html @@ -48,6 +48,8 @@

All articles

(DataExpert.io) Bootcamp - Day 3 - Lecture Nov 19, 2024
  • (DataExpert.io) Bootcamp - Day 2 - Lecture Nov 17, 2024 +
  • + (DataExpert.io) Bootcamp - Day 1 - Lab Nov 16, 2024
  • (DataExpert.io) Bootcamp - Day 1 - Lecture Nov 15, 2024
  • diff --git a/hugo/public/posts/index.xml b/hugo/public/posts/index.xml index e93f36a..3e5048e 100644 --- a/hugo/public/posts/index.xml +++ b/hugo/public/posts/index.xml @@ -13,7 +13,7 @@ http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/ Tue, 19 Nov 2024 17:46:24 +0100 http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/ - <p>How to build a data agnostic graph data model ?</p> <h2 id="index">Index</h2> <ul> <li>Additive VS non-additive dimensions</li> <li>The power of Enums</li> <li>When should you use flexible data types ?</li> <li>Graph data modeling</li> </ul> <h2 id="additive-vs-non-additive-dimensions">Additive vs Non-additive dimensions</h2> <h3 id="what-makes-a-dimension-additive-">What makes a dimension additive ?</h3> <p>Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.</p> <p>If you take all the sub-totals and sum them up you should have the total</p> + <p>Today&rsquo;s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption</p> <h2 id="index">Index</h2> <ul> <li>Additive VS non-additive dimensions</li> <li>The power of Enums</li> <li>When should you use flexible data types ?</li> <li>Graph data modeling</li> </ul> <h2 id="additive-vs-non-additive-dimensions">Additive vs Non-additive dimensions</h2> <h3 id="what-makes-a-dimension-additive-">What makes a dimension additive ?</h3> <p>Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.</p> <p>If you take all the sub-totals and sum them up you should have the total</p>
    (DataExpert.io) Bootcamp - Day 2 - Lecture @@ -22,6 +22,13 @@ http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/ <p>Today&rsquo;s lecture deals with <strong>Slowly Changing Dimensions</strong> and <strong>Idempotency</strong>.</p> <blockquote> <p><strong>Slowly changing dimensions</strong> = An attribute that drifts over time</p> </blockquote> <p><em>Example:</em> Your favorite food</p> <h2 id="idempotency">Idempotency</h2> <p>You need to model slowly dimensions the right way because they impact idempotency.</p> <blockquote> <p><strong>Idempotent</strong> = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.</p> </blockquote> <blockquote> <p><strong>Idempotent pipeline</strong> = The ability for your data pipeline to produce the same results whether it&rsquo;s running in production or in backfill.</p> + + (DataExpert.io) Bootcamp - Day 1 - Lab + http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/ + Sat, 16 Nov 2024 17:52:23 +0100 + http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/ + <blockquote> <p><strong>Goal :</strong> Create a <em>cumulative table design</em></p> </blockquote> <h2 id="problem-overview">Problem overview</h2> <p>We have a table containing the stats for the NBA players, there&rsquo;s one record for each player&rsquo;s season.</p> <pre tabindex="0"><code>postgres=# \d player_seasons; </code></pre><p>Table <code>public.player_seasons</code></p> <table> <thead> <tr> <th>Column</th> <th>Type</th> </tr> </thead> <tbody> <tr> <td>player_name</td> <td>text</td> </tr> <tr> <td>age</td> <td>integer</td> </tr> <tr> <td>height</td> <td>text</td> </tr> <tr> <td>weight</td> <td>integer</td> </tr> <tr> <td>college</td> <td>text</td> </tr> <tr> <td>country</td> <td>text</td> </tr> <tr> <td>draft_year</td> <td>text</td> 
</tr> <tr> <td>draft_round</td> <td>text</td> </tr> <tr> <td>draft_number</td> <td>text</td> </tr> <tr> <td>gp</td> <td>real</td> </tr> <tr> <td>pts</td> <td>real</td> </tr> <tr> <td>reb</td> <td>real</td> </tr> <tr> <td>ast</td> <td>real</td> </tr> <tr> <td>netrtg</td> <td>real</td> </tr> <tr> <td>oreb_pct</td> <td>real</td> </tr> <tr> <td>dreb_pct</td> <td>real</td> </tr> <tr> <td>usg_pct</td> <td>real</td> </tr> <tr> <td>ts_pct</td> <td>real</td> </tr> <tr> <td>ast_pct</td> <td>real</td> </tr> <tr> <td>season</td> <td>integer</td> </tr> </tbody> </table> <p><strong>Indexes:</strong> <code>&quot;player_seasons_pkey&quot; PRIMARY KEY, btree (player_name, season)</code></p> + (DataExpert.io) Bootcamp - Day 1 - Lecture http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/ diff --git a/hugo/public/sitemap.xml b/hugo/public/sitemap.xml index 6df5fb5..7b730a4 100644 --- a/hugo/public/sitemap.xml +++ b/hugo/public/sitemap.xml @@ -13,6 +13,9 @@ http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/ 2024-11-17T19:21:32+01:00 + + http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/ + 2024-11-16T17:52:23+01:00 http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/ 2024-11-15T17:46:24+01:00 diff --git a/hugo/public/svg/RLE.svg b/hugo/public/svg/RLE.svg new file mode 100644 index 0000000..824c092 --- /dev/null +++ b/hugo/public/svg/RLE.svg @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + + + + + + + 3x + + + 2x + + + 1x + + diff --git a/hugo/static/svg/RLE.svg b/hugo/static/svg/RLE.svg new file mode 100644 index 0000000..824c092 --- /dev/null +++ b/hugo/static/svg/RLE.svg @@ -0,0 +1,29 @@ + + + + + + + + + + + + + + + + + + + + 3x + + + 2x + + + 1x + +