✨ feat( DataExpert.io ): Add day 1 lab

glopez-dev · Nov 23, 2024 · aa44233
1 parent 3238610

Showing 11 changed files with 957 additions and 1 deletion.
diff --git a/...ata-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab.md b/...ata-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab.md
@@ -0,0 +1,259 @@
+---
+date: 2024-11-16T17:52:23+01:00
+draft: false 
+author: Gabriel LOPEZ
+title: (DataExpert.io) Bootcamp - Day 1 - Lab
+---
+
+> **Goal :** Create a *cumulative table design*
+
+## Problem overview 
+We have a table containing NBA player stats, with one record per player per season. 
+
+```
+postgres=# \d player_seasons;
+```
+
+Table `public.player_seasons`
+
+| Column       | Type    | 
+| ------------ | ------- | 
+| player_name  | text    | 
+| age          | integer | 
+| height       | text    | 
+| weight       | integer | 
+| college      | text    | 
+| country      | text    |
+| draft_year   | text    |
+| draft_round  | text    |
+| draft_number | text    |
+| gp           | real    |
+| pts          | real    |
+| reb          | real    |
+| ast          | real    | 
+| netrtg       | real    |  
+| oreb_pct     | real    |   
+| dreb_pct     | real    |    
+| usg_pct      | real    |     
+| ts_pct       | real    |      
+| ast_pct      | real    |       
+| season       | integer |        
+
+**Indexes:**
+`"player_seasons_pkey" PRIMARY KEY, btree (player_name, season)`
+
+This table has a temporal data problem: joining it with another table shuffles each player's records, so the same player's statistics no longer sit next to each other, which makes **run-length encoding** compression less efficient.
+
+## Run-Length Encoding
+
+> **Run-Length Encoding** is a simple data compression algorithm that encodes consecutive repeated data elements (runs) as a single value plus a count of its repetitions. 
+ 
+Instead of storing the repeated data multiple times, it stores the data value and the number of times it appears consecutively.
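+
+As a rough illustration (not part of the original lab), the sketch below run-length encodes a small hard-coded sequence in PostgreSQL. The `samples` CTE and its `pos` and `val` columns are made up for this example.
+
+```sql
+-- Hypothetical input: six values in the order given by pos.
+WITH samples(pos, val) AS (
+    VALUES (1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'B'), (6, 'A')
+),
+marked AS (
+    SELECT pos, val,
+           -- Consecutive rows holding the same value share the same difference,
+           -- so (val, run_id) identifies one run.
+           pos - ROW_NUMBER() OVER (PARTITION BY val ORDER BY pos) AS run_id
+    FROM samples
+)
+SELECT val, COUNT(*) AS run_length
+FROM marked
+GROUP BY val, run_id
+ORDER BY MIN(pos);
+-- Returns (A, 3), (B, 2), (A, 1): six stored values become three (value, count) pairs.
+```
+
+The final `A` forms its own run because a `B` run separates it from the first three, which is why keeping equal values next to each other matters for RLE.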
+
+We are going to transform the table so that each player has a single row, with an array column holding that player's seasons. 
+
+## Cumulative table design
+
+The cumulative design serves two distinct but complementary purposes:
+
+1. **Join/GroupBy Optimization**:
+    - By storing temporal data (seasons) together in arrays, we optimize for:
+        - Fewer rows to join
+        - Less data shuffling during grouping
+        - Better data locality
+
+2. **RLE Compression**: When we later explode/unnest the arrays, the data will naturally group temporal values together, making RLE more efficient.
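+
+To make the target shape concrete, here is a minimal sketch that collapses season rows into arrays. It only illustrates the layout: the `mini_seasons` CTE and its values are invented, and it uses a one-off `ARRAY_AGG` rather than the incremental approach this lab builds.
+
+```sql
+-- Hypothetical miniature of player_seasons: one row per player and season.
+WITH mini_seasons(player_name, season, pts) AS (
+    VALUES ('Player A', 1996, 10.5),
+           ('Player A', 1997, 12.0),
+           ('Player B', 1996, 20.1)
+)
+SELECT player_name,
+       -- Both arrays are ordered by season so that positions line up.
+       ARRAY_AGG(season ORDER BY season) AS seasons,
+       ARRAY_AGG(pts ORDER BY season) AS pts_by_season
+FROM mini_seasons
+GROUP BY player_name
+ORDER BY player_name;
+-- Three input rows collapse into two rows, one per player.
+```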
+
+
+## What is part of a season and what isn't?
+We want to store the temporal component in its own data type. 
+
+We create a composite type named `season_stats` (PostgreSQL's counterpart to a `STRUCT`):
+```sql
+CREATE TYPE season_stats AS (  
+    season INTEGER,  
+    gp INTEGER,  
+    pts REAL,  
+    reb REAL,  
+    ast REAL  
+);
+```
+This type doesn't include every season statistic, since we won't need all of them.
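+
+As a quick check of how such a value behaves, the sketch below builds one `season_stats` value and reads two of its fields. It assumes the type above has already been created; the numbers are made up.
+
+```sql
+-- Build a single composite value from hypothetical numbers,
+-- listed in the order the type declares its fields: season, gp, pts, reb, ast.
+WITH example AS (
+    SELECT ROW(1996, 70, 10.5, 4.2, 3.1)::season_stats AS stats
+)
+-- Parentheses are required around the column when accessing a field.
+SELECT (stats).season, (stats).pts
+FROM example;
+```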
+
+## Creating the cumulative table
+Then we create the cumulative table schema using our new type:
+```sql
+CREATE TABLE players (
+    player_name TEXT,
+    height TEXT,
+    college TEXT,
+    country TEXT,
+    draft_year TEXT,
+    draft_round TEXT,
+    -- draft_number is needed by the COALESCE queries further down.
+    draft_number TEXT,
+    season_stats season_stats[],
+    current_season INTEGER,
+    PRIMARY KEY(player_name, current_season)
+);
+```
+
+We want to find the first season in the table:
+```sql
+SELECT MIN(season) FROM player_seasons;
+```
+
+It is `1996`, so the first run uses `1995` as `yesterday` and `1996` as `today`:
+
+```sql
+WITH yesterday AS (
+    -- Nothing has been loaded yet, so this CTE is empty on the first run.
+    SELECT * FROM players
+    WHERE current_season = 1995
+),
+today AS (
+    SELECT * FROM player_seasons
+    WHERE season = 1996
+)
+
+SELECT * FROM today t FULL OUTER JOIN yesterday y
+    ON t.player_name = y.player_name;
+
+The query returns `<null>` for every `yesterday` (`1995`) column, because the `players` table has no rows for that season yet.
+
+Now we want to `COALESCE()` the non-temporal values. 
+
+```sql
+WITH yesterday AS (
+    SELECT * FROM players
+    WHERE current_season = 1995
+),
+today AS (
+    SELECT * FROM player_seasons
+    WHERE season = 1996
+)
+SELECT
+    -- Prefer today's value and fall back to yesterday's when today's is NULL.
+    COALESCE(t.player_name, y.player_name) AS player_name,
+    COALESCE(t.height, y.height) AS height,
+    COALESCE(t.college, y.college) AS college,
+    COALESCE(t.country, y.country) AS country,
+    COALESCE(t.draft_year, y.draft_year) AS draft_year,
+    COALESCE(t.draft_round, y.draft_round) AS draft_round,
+    COALESCE(t.draft_number, y.draft_number) AS draft_number
+FROM today t FULL OUTER JOIN yesterday y
+    ON t.player_name = y.player_name;
+
+**Purpose of `COALESCE` here**:
+- It handles data continuity between two time periods.
+- It keeps the non-temporal data when a player exists in either period.
+
+**What the query actually does**:
+For each player: 
+- If the player exists only in `today` (1996): use today's data.
+- If the player exists only in `yesterday` (1995): use yesterday's data.
+- If the player exists in both: use today's data (`COALESCE` returns its first non-NULL argument).
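+
+The three cases can be reproduced in isolation with a minimal sketch. The `cases` rows and column names are hypothetical stand-ins for one attribute coming from `today` and `yesterday`.
+
+```sql
+-- One row per case: present only today, only yesterday, or in both.
+SELECT case_no,
+       COALESCE(today_value, yesterday_value) AS kept_value
+FROM (VALUES
+    (1, 'today', NULL),
+    (2, NULL, 'yesterday'),
+    (3, 'today', 'yesterday')
+) AS cases(case_no, today_value, yesterday_value)
+ORDER BY case_no;
+-- kept_value is 'today', 'yesterday', 'today' for cases 1, 2 and 3.
+```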
+
+```sql
+SELECT * FROM player_seasons;  
+
+DROP TABLE IF EXISTS players ;  
+
+CREATE TYPE scoring_class AS ENUM('star', 'good', 'average', 'bad');  
+
+CREATE TABLE players (  
+    player_name TEXT,  
+    height TEXT,  
+    college TEXT,  
+    country TEXT,  
+    draft_year TEXT,  
+    draft_round TEXT,  
+    draft_number TEXT,  
+    season_stats season_stats[],  
+    scoring_class scoring_class,  
+    years_since_last_season INTEGER,  
+    current_season INTEGER,  
+    PRIMARY KEY(player_name, current_season)  
+);  
+
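+-- Cumulation query. For the SEED run, use 1995 for yesterday and 1996 for today:  
+-- yesterday is then empty, so the FULL OUTER JOIN simply takes everything from today.  
+-- Re-run it with both years incremented by one to load each following season  
+-- (shown here loading 2001 on top of 2000).  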
+INSERT INTO players  
+WITH yesterday AS (  
+    SELECT * FROM players  
+    WHERE current_season  = 2000  
+),  
+today AS (  
+    SELECT * FROM player_seasons  
+    WHERE season = 2001  
+)  
+
+SELECT  
+    COALESCE(t.player_name, y.player_name) AS player_name,  
+    COALESCE(t.height, y.height) AS height,  
+    COALESCE(t.college, y.college) AS college,  
+    COALESCE(t.country, y.country) AS country,  
+    COALESCE(t.draft_year, y.draft_year) AS draft_year,  
+    COALESCE(t.draft_round, y.draft_round) AS draft_round,  
+    COALESCE(t.draft_number, y.draft_number) AS draft_number,  
+    -- If yesterday is null we create the initial array  
+    CASE WHEN y.season_stats IS NULL THEN  
+        ARRAY[ROW(  
+            t.season,  
+            t.gp,  
+            t.pts,  
+            t.reb,  
+            t.ast  
+        )::season_stats]  
+    -- If the player played today, append today's stats to the array of previous seasons.  
+    -- Retired players have no row in today, so nothing is appended for them.  
+    WHEN t.season IS NOT NULL THEN  
+        y.season_stats || ARRAY[ROW(  
+            t.season,  
+            t.gp,  
+            t.pts,  
+            t.reb,  
+            t.ast  
+        )::season_stats]  
+    -- Otherwise we carry the history forward without modifying it.  
+    ELSE y.season_stats  
+    END AS season_stats,  
+    -- Determine the scoring class of the player for current season  
+    CASE  
+        WHEN t.season IS NOT NULL THEN  
+        CASE WHEN t.pts > 20 THEN 'star'  
+            WHEN t.pts >  15 THEN 'good'  
+            WHEN t.pts > 10 THEN 'average'  
+            ELSE 'bad'  
+        END::scoring_class  
+        ELSE y.scoring_class  
+    END as scoring_class,  
+    CASE WHEN t.season IS NOT NULL THEN 0  
+        ELSE y.years_since_last_season + 1  
+    END as years_since_last_season,  
+    COALESCE(t.season, y.current_season + 1) as current_season  
+FROM today t FULL OUTER JOIN yesterday y  
+    ON t.player_name = y.player_name;  
+
+-- Ratio of each star player's latest season points to their first season points.
+-- No GROUP BY = very fast: everything happens in the map step, so it can be parallelized.
+SELECT  
+    player_name,  
+    (season_stats[CARDINALITY(season_stats)]::season_stats).pts /  
+    CASE WHEN (season_stats[1]::season_stats).pts = 0 THEN 1 ELSE ((season_stats[1]::season_stats).pts) END  
+FROM players  
+WHERE current_season = 2001  
+AND scoring_class = 'star';  
+
+-- Going back to the original table layout (one row per player and season)  
+WITH unnested AS (  
+    SELECT player_name,  
+        UNNEST(season_stats) AS season_stats  
+    FROM players  
+    WHERE current_season = 2001  
+)  
+
+SELECT player_name,  
+       (season_stats::season_stats).*  
+FROM unnested;  
+
+-- Here each player's stats (temporal attributes) stay sorted through the JOIN,
+-- so RLE compression can be applied efficiently.
+```
diff --git a/hugo/hugo.toml b/hugo/hugo.toml
@@ -1,3 +1,5 @@
 baseURL="https://glopez.github.io/blog/"
 title="Gabriel Study Blog"
 theme="archie"
+
+enableEmoji = true
diff --git a/hugo/public/index.html b/hugo/public/index.html
@@ -93,6 +93,115 @@ <h2 id="idempotency">Idempotency</h2>
 					<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">Read more ⟶</a>
 				</section>
 
+				<section class="list-item">
+					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/">(DataExpert.io) Bootcamp - Day 1 - Lab</a></h1>
+					<time>Nov 16, 2024</time>
+					<br><div class="description">
+
+	<blockquote>
+<p><strong>Goal :</strong> Create a <em>cumulative table design</em></p>
+</blockquote>
+<h2 id="problem-overview">Problem overview</h2>
+<p>We have a table containing the stats for the NBA players, there&rsquo;s one record for each player&rsquo;s season.</p>
+<pre tabindex="0"><code>postgres=# \d player_seasons;
+</code></pre><p>Table <code>public.player_seasons</code></p>
+<table>
+  <thead>
+      <tr>
+          <th>Column</th>
+          <th>Type</th>
+      </tr>
+  </thead>
+  <tbody>
+      <tr>
+          <td>player_name</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>age</td>
+          <td>integer</td>
+      </tr>
+      <tr>
+          <td>height</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>weight</td>
+          <td>integer</td>
+      </tr>
+      <tr>
+          <td>college</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>country</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>draft_year</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>draft_round</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>draft_number</td>
+          <td>text</td>
+      </tr>
+      <tr>
+          <td>gp</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>pts</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>reb</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>ast</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>netrtg</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>oreb_pct</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>dreb_pct</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>usg_pct</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>ts_pct</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>ast_pct</td>
+          <td>real</td>
+      </tr>
+      <tr>
+          <td>season</td>
+          <td>integer</td>
+      </tr>
+  </tbody>
+</table>
+<p><strong>Indexes:</strong>
+<code>&quot;player_seasons_pkey&quot; PRIMARY KEY, btree (player_name, season)</code></p>&hellip;
+
+</div>
+					<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/">Read more ⟶</a>
+				</section>
+
 				<section class="list-item">
 					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/">(DataExpert.io) Bootcamp - Day 1 - Lecture</a></h1>
 					<time>Nov 15, 2024</time>

diff --git a/hugo/public/index.xml b/hugo/public/index.xml
@@ -22,6 +22,13 @@
       <guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</guid>
       <description>&lt;p&gt;Today&amp;rsquo;s lecture deals with &lt;strong&gt;Slowly Changing Dimensions&lt;/strong&gt; and &lt;strong&gt;Idempotency&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Slowly changing dimensions&lt;/strong&gt; = An attribute that drifts over time&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Your favorite food&lt;/p&gt;&#xA;&lt;h2 id=&#34;idempotency&#34;&gt;Idempotency&lt;/h2&gt;&#xA;&lt;p&gt;You need to model slowly dimensions the right way because they impact idempotency.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent&lt;/strong&gt; = Denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Idempotent pipeline&lt;/strong&gt; = The ability for your data pipeline to produce the same results whether it&amp;rsquo;s running in production or in backfill.&lt;/p&gt;</description>
     </item>
+    <item>
+      <title>(DataExpert.io) Bootcamp - Day 1 - Lab</title>
+      <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/</link>
+      <pubDate>Sat, 16 Nov 2024 17:52:23 +0100</pubDate>
+      <guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lab/</guid>
+      <description>&lt;blockquote&gt;&#xA;&lt;p&gt;&lt;strong&gt;Goal :&lt;/strong&gt; Create a &lt;em&gt;cumulative table design&lt;/em&gt;&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;h2 id=&#34;problem-overview&#34;&gt;Problem overview&lt;/h2&gt;&#xA;&lt;p&gt;We have a table containing the stats for the NBA players, there&amp;rsquo;s one record for each player&amp;rsquo;s season.&lt;/p&gt;&#xA;&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;postgres=# \d player_seasons;&#xA;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Table &lt;code&gt;public.player_seasons&lt;/code&gt;&lt;/p&gt;&#xA;&lt;table&gt;&#xA;  &lt;thead&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;th&gt;Column&lt;/th&gt;&#xA;          &lt;th&gt;Type&lt;/th&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/thead&gt;&#xA;  &lt;tbody&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;player_name&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;age&lt;/td&gt;&#xA;          &lt;td&gt;integer&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;height&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;weight&lt;/td&gt;&#xA;          &lt;td&gt;integer&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;college&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;country&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;draft_year&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;draft_round&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;draft_number&lt;/td&gt;&#xA;          &lt;td&gt;text&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;gp&lt;/td&gt;&#xA;          
&lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;pts&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;reb&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;ast&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;netrtg&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;oreb_pct&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;dreb_pct&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;usg_pct&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;ts_pct&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;ast_pct&lt;/td&gt;&#xA;          &lt;td&gt;real&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;season&lt;/td&gt;&#xA;          &lt;td&gt;integer&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;p&gt;&lt;strong&gt;Indexes:&lt;/strong&gt;&#xA;&lt;code&gt;&amp;quot;player_seasons_pkey&amp;quot; PRIMARY KEY, btree (player_name, season)&lt;/code&gt;&lt;/p&gt;</description>
+    </item>
     <item>
       <title>(DataExpert.io) Bootcamp - Day 1 - Lecture</title>
       <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day1/lecture/</link>