✨ feat( DataExpert.io ): Day 3 lecture

glopez-dev · Nov 22, 2024 · 3238610
1 parent 66a8949
commit 3238610
Showing 7 changed files with 433 additions and 7 deletions.
diff --git a/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture.md b/...engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture.md
@@ -0,0 +1,155 @@
+---
+date: 2024-11-19T17:46:24+01:00
+draft: false
+author: Gabriel LOPEZ
+title: (DataExpert.io) Bootcamp - Day 3 - Lecture 
+---
+Today's lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption
+
+## Index
+- Additive VS non-additive dimensions
+- The power of Enums
+- When should you use flexible data types ?
+- Graph data modeling
+
+## Additive vs Non-additive dimensions
+### What makes a dimension additive ?
+Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.
+
+If you take all the sub-totals and sum them up you should have the total
+
+*Example :* Summing the number of people at each age gives the total population, because nobody has two ages at the same time.
+
+An additive dimension is one that doesn't "double count".
+
+*Counter-example :* The total number of drivers doesn't equal the sum of the drivers of each car brand, because some people own two cars of different brands.
+
+> You should ask yourself if a user can be two of these at the same time on a given day.
+
+### The essential nature of additivity
+A dimension is additive over a specific time window if and only if an entity can take only one value of that dimension within a window of that grain.
+- In a time window of one second, one driver can't drive two cars
+- In a time window of one day, one driver can drive many cars
+
+### How does additivity help ?
+You don't need to use `COUNT(DISTINCT)` on preaggregated dimensions.
+
+> **Note :** Non-additive dimensions are usually only non-additive with respect to `COUNT` aggregations but not `SUM` aggregations.
+
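A minimal sketch of the driver example in Python. The `trips` rows, driver names, brands and mileages are made-up illustration data, not taken from the lecture. It shows that per-brand distinct-driver counts do not sum to the overall count over a one-day window, while the summed miles do.

```python
from collections import defaultdict

# Hypothetical one-day trip log: (driver, car_brand, miles_driven)
trips = [
    ("alice", "toyota", 10),
    ("alice", "honda", 5),  # alice drives two brands on the same day
    ("bob", "toyota", 7),
]

drivers_by_brand = defaultdict(set)
miles_by_brand = defaultdict(int)
for driver, brand, miles in trips:
    drivers_by_brand[brand].add(driver)
    miles_by_brand[brand] += miles

# COUNT-style metric: the per-brand sub-totals count alice twice
total_drivers = len({driver for driver, _, _ in trips})
sum_of_brand_counts = sum(len(drivers) for drivers in drivers_by_brand.values())

# SUM-style metric: the per-brand sub-totals add up to the total
total_miles = sum(miles for _, _, miles in trips)
sum_of_brand_miles = sum(miles_by_brand.values())

print(total_drivers, sum_of_brand_counts)  # 2 3
print(total_miles, sum_of_brand_miles)     # 22 22
```

With a one-day window, car brand is non-additive for the driver count, so a distinct count over the raw rows is still needed there.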
+
+## The power of Enums
+### When should you use enums ?
+Enums are great for low-to-medium cardinality.
+
+*Example :* countries (about 200 values) might be too many for an enum.
+
+If there are fewer than 50 values, an enum is probably a good fit.
+
+### Why should you use enums ?
+
+#### Built in data quality
+- The Enum guarantees the set of values a field can take.
+- This prevents invalid values at the database level
+- You can't insert 'pending ' (with space) or 'PENDING' (wrong case)
+- More reliable than `CHECK` constraints on `VARCHAR` columns
+
+#### Built in static fields
+- The enum values are stored as internal constants in the database
+- More storage efficient than VARCHAR
+- PostgreSQL stores enum values as integers internally but presents them as strings to users
+```sql
+-- The enum column holds a compact internal value instead of the full string in every row
+SELECT order_status, COUNT(*) FROM orders GROUP BY order_status;
+```
+
+#### Built in documentation
+- The enum definition serves as documentation in the database schema
+- Developers can see all valid values by checking the enum type:
+
+```shell
+# psql meta-command (lists the enum type and its values) :
+\dT+ status
+```
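The same guarantees can be sketched outside the database with Python's standard `enum` module. This is only an analogue of a database enum, and the `OrderStatus` members below are hypothetical values chosen for illustration.

```python
from enum import Enum


class OrderStatus(Enum):
    """Hypothetical order statuses; the class doubles as documentation."""
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


def is_valid(raw: str) -> bool:
    """Return True only if raw is exactly one of the declared values."""
    try:
        OrderStatus(raw)  # lookup by value raises ValueError on a mismatch
        return True
    except ValueError:
        return False


print(is_valid("pending"))   # True
print(is_valid("pending "))  # False (trailing space)
print(is_valid("PENDING"))   # False (wrong case)

# The definition also lists every valid value
valid_values = [status.value for status in OrderStatus]
print(valid_values)  # ['pending', 'shipped', 'delivered']
```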
+
+### Enumerations and subpartitions
+Enumerations make amazing subpartitions because :
+- You have an exhaustive list
+- They chunk up the big data problem into manageable pieces
+
+*Thrift* is a tool to manage schemas in your logging as well as your ETLs.
+
+>  https://en.wikipedia.org/wiki/Apache_Thrift
+
+Zach created a framework named "Little book of pipelines"
+
+>  https://github.com/EcZachly/little-book-of-pipelines
+
+The enum pattern is useful whenever you have tons of sources mapping to a shared schema.
+
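A minimal sketch of enum-driven subpartitions, assuming a hypothetical `Source` enum and made-up `events`. Because the enum is exhaustive, one partition can be created per member up front, and a source that delivered nothing is easy to spot.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical list of upstream sources feeding the shared schema."""
    WEB = "web"
    MOBILE = "mobile"
    API = "api"


# Made-up input rows; each one names the source it came from
events = [
    {"source": "web", "user": "u1"},
    {"source": "api", "user": "u2"},
    {"source": "web", "user": "u3"},
]

# One (possibly empty) partition per enum member, so the list is exhaustive
partitions = {source: [] for source in Source}
for event in events:
    # Source(...) raises ValueError for a source that is not in the enum
    partitions[Source(event["source"])].append(event)

counts = {source.value: len(rows) for source, rows in partitions.items()}
missing = [source.value for source, rows in partitions.items() if not rows]

print(counts)   # {'web': 2, 'mobile': 0, 'api': 1}
print(missing)  # ['mobile']
```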
+## How do you model data from disparate sources into a shared schema ?
+
+You don't want to bring all the columns from 40 tables into one table with 500 columns :
+- Most columns are going to be NULL all the time
+- It's not really a usable table
+
+You want to use a "flexible schema" :
+- Use a lot of map types
+- This overlaps a lot with graph databases
+	- The enumerated list of things is similar to a "Vertex type"
+
+### Flexible schemas
+
+If you need to add more attributes, you just add them to the map.
+
+If new columns appear in a source, you can add them to the map as well.
+
+> **Limit :** According to the lecture, maps in Spark can have up to about 65,000 keys
+
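A minimal sketch of the idea in plain Python, not Spark. The shared columns `id` and `type`, the `to_shared_schema` helper and the sample records are all assumptions made for illustration.

```python
# Columns every source is assumed to share (hypothetical)
SHARED_COLUMNS = ("id", "type")


def to_shared_schema(record: dict) -> dict:
    """Keep the shared columns and move everything else into a string map."""
    row = {column: record[column] for column in SHARED_COLUMNS}
    row["properties"] = {
        key: str(value)
        for key, value in record.items()
        if key not in SHARED_COLUMNS and value is not None  # no NULL entries
    }
    return row


# Two made-up sources with different columns
player = {"id": "p1", "type": "player", "height_cm": 198, "college": None}
team = {"id": "t1", "type": "team", "city": "Paris", "founded": 1970}

rows = [to_shared_schema(player), to_shared_schema(team)]
for row in rows:
    print(row)
```

A new source column simply becomes a new key in `properties`, so the shared schema itself does not change.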
+#### Benefits
+- No `ALTER TABLE` commands
+- You can manage a lot more columns
+- You don't have a ton of `NULL` columns
+- You can use an "other_properties" map
+	- It carries rarely used but still needed columns
+	- It allows you to ignore them during modelling
+
+#### Drawbacks
+- Compression is terrible
+	- Maps have some of the worst compression of all data types
+	- The only thing worse is JSON
+	- The column headers become map keys, so they are part of the data instead of the schema and are duplicated in every row
+
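The key duplication can be illustrated with a rough size comparison. This sketch only measures raw JSON-serialized length as a stand-in, not what a columnar file format would actually achieve after compression, and the rows are made up.

```python
import json

# Made-up rows that all share the same three attributes
rows = [
    {"height_cm": 180 + i % 30, "weight_kg": 70 + i % 40, "age": 20 + i % 15}
    for i in range(1000)
]

# Column layout: each column name appears once, as part of the schema
columnar = {name: [row[name] for row in rows] for name in rows[0]}
columnar_bytes = len(json.dumps(columnar))

# Map layout: every row carries its own copy of the keys
map_bytes = len(json.dumps(rows))

print(columnar_bytes, map_bytes)
print(map_bytes > columnar_bytes)  # True: the keys are repeated 1000 times
```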
+## How is graph modelling different ?
+
+Graph modelling is relationship-focused, not entity-focused.
+
+You can get away with doing a fairly poor job of modelling the entities themselves.
+
+### Zach's super secret sauce for graph data modelling :
+He has always run into the same schema when working with graph databases.
+
+We don't care about columns.
+
+Usually for an **entity** in a graph the model is :
+```sql
+Identifier: STRING
+Type: STRING
+Properties: MAP<STRING, STRING>
+```
+
+The **relationships** are modelled in a bit more depth.
+
+Usually the model looks like :
+```sql
+subject_identifier: STRING
+-- Example: player_name
+subject_type: VERTEX_TYPE 
+-- Example: player
+object_identifier: STRING
+-- Example: team_name
+object_type: VERTEX_TYPE 
+-- Example: team
+edge_type: EDGE_TYPE
+-- Example: "PLAYS WITH" , "PLAYS AGAINST" (an action)
+properties: MAP<STRING, STRING>
+```
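The two schemas above can be sketched as Python dataclasses. The enum members, the sample identifiers and the property values are hypothetical, and typing the vertex `type` as an enum rather than a plain string is an assumption made here for consistency with the edge schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class VertexType(Enum):
    PLAYER = "player"
    TEAM = "team"


class EdgeType(Enum):
    PLAYS_WITH = "plays_with"
    PLAYS_AGAINST = "plays_against"


@dataclass
class Vertex:
    identifier: str
    type: VertexType
    properties: dict = field(default_factory=dict)  # MAP<STRING, STRING>


@dataclass
class Edge:
    subject_identifier: str
    subject_type: VertexType
    object_identifier: str
    object_type: VertexType
    edge_type: EdgeType
    properties: dict = field(default_factory=dict)  # MAP<STRING, STRING>


# Made-up example: a player who has faced a given team
player = Vertex("player_a", VertexType.PLAYER, {"position": "guard"})
team = Vertex("team_x", VertexType.TEAM)
edge = Edge(
    player.identifier, player.type,
    team.identifier, team.type,
    EdgeType.PLAYS_AGAINST,
    {"games": "3"},
)
print(edge)
```

Anything specific to one vertex or edge type goes into `properties`, which is why the entity side needs so few fixed columns.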
diff --git a/hugo/public/index.html b/hugo/public/index.html
@@ -49,6 +49,28 @@
 
 
 
+				<section class="list-item">
+					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/">(DataExpert.io) Bootcamp - Day 3 - Lecture</a></h1>
+					<time>Nov 19, 2024</time>
+					<br><div class="description">
+
+	<p>Today&rsquo;s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption</p>
+<h2 id="index">Index</h2>
+<ul>
+<li>Additive VS non-additive dimensions</li>
+<li>The power of Enums</li>
+<li>When should you use flexible data types ?</li>
+<li>Graph data modeling</li>
+</ul>
+<h2 id="additive-vs-non-additive-dimensions">Additive vs Non-additive dimensions</h2>
+<h3 id="what-makes-a-dimension-additive-">What makes a dimension additive ?</h3>
+<p>Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.</p>
+<p>If you take all the sub-totals and sum them up you should have the total</p>&hellip;
+
+</div>
+					<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/">Read more ⟶</a>
+				</section>
+
 				<section class="list-item">
 					<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">(DataExpert.io) Bootcamp - Day 2 - Lecture</a></h1>
 					<time>Nov 17, 2024</time>

diff --git a/hugo/public/index.xml b/hugo/public/index.xml
@@ -6,8 +6,15 @@
     <description>Recent content on Gabriel Study Blog</description>
     <generator>Hugo</generator>
     <language>en</language>
-    <lastBuildDate>Sun, 17 Nov 2024 19:21:32 +0100</lastBuildDate>
+    <lastBuildDate>Tue, 19 Nov 2024 17:46:24 +0100</lastBuildDate>
     <atom:link href="http://localhost:1313/blog/index.xml" rel="self" type="application/rss+xml" />
+    <item>
+      <title>(DataExpert.io) Bootcamp - Day 3 - Lecture</title>
+      <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/</link>
+      <pubDate>Tue, 19 Nov 2024 17:46:24 +0100</pubDate>
+      <guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/</guid>
+      <description>&lt;p&gt;Today&amp;rsquo;s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption&lt;/p&gt;&#xA;&lt;h2 id=&#34;index&#34;&gt;Index&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Additive VS non-additive dimensions&lt;/li&gt;&#xA;&lt;li&gt;The power of Enums&lt;/li&gt;&#xA;&lt;li&gt;When should you use flexible data types ?&lt;/li&gt;&#xA;&lt;li&gt;Graph data modeling&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;additive-vs-non-additive-dimensions&#34;&gt;Additive vs Non-additive dimensions&lt;/h2&gt;&#xA;&lt;h3 id=&#34;what-makes-a-dimension-additive-&#34;&gt;What makes a dimension additive ?&lt;/h3&gt;&#xA;&lt;p&gt;Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.&lt;/p&gt;&#xA;&lt;p&gt;If you take all the sub-totals and sum them up you should have the total&lt;/p&gt;</description>
+    </item>
     <item>
       <title>(DataExpert.io) Bootcamp - Day 2 - Lecture</title>
       <link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</link>