✨ feat( DataExpert.io ): Day 3 lecture
glopez-dev committed Nov 22, 2024
1 parent 66a8949 commit 3238610
Showing 7 changed files with 433 additions and 7 deletions.
@@ -0,0 +1,155 @@
---
date: 2024-11-19T17:46:24+01:00
draft: false
author: Gabriel LOPEZ
title: (DataExpert.io) Bootcamp - Day 3 - Lecture
---
Today's lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption

## Index
- Additive VS non-additive dimensions
- The power of Enums
- When should you use flexible data types ?
- Graph data modeling

## Additive vs Non-additive dimensions
### What makes a dimension additive ?
Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.

If you take all the sub-totals and sum them up you should have the total

*Example :* The sum of the count of people of each age gives the total population

An additive dimension is one that doesn't "double count"

*Example :* The total number of drivers doesn't equal the sum of the drivers of each car brand, because some people own cars from more than one brand

> You should ask yourself if a user can be two of these at the same time on a given day.
### The essential nature of additivity
A dimension is additive over a specific time window if and only if, at the grain of that window, an entity can only hold one value of that dimension at a time.
- In a time window of one second, one driver can't drive two cars
- In a time window of one day, one driver can drive many cars

### How does additivity help ?
You don't need to use `COUNT(DISTINCT)` on preaggregated dimensions.

> **Note :** Non-additive dimensions are usually only non-additive with respect to `COUNT` aggregations but not `SUM` aggregations.
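
The double-counting problem can be sketched in a few lines of Python. This is only an illustration with made-up data: the `ownership` rows, the driver names, and the brand names are assumptions, not part of the lecture. It compares the sum of the per-brand sub-totals with a distinct count of drivers.

```python
from collections import defaultdict

# Hypothetical (driver, car_brand) rows; "alice" owns cars of two brands.
ownership = [
    ("alice", "honda"),
    ("alice", "toyota"),
    ("bob", "honda"),
    ("carol", "toyota"),
]

# Pre-aggregated sub-totals: number of distinct drivers per brand.
drivers_per_brand = defaultdict(set)
for driver, brand in ownership:
    drivers_per_brand[brand].add(driver)
subtotals = {brand: len(drivers) for brand, drivers in drivers_per_brand.items()}

# Summing the sub-totals counts alice twice: car_brand is non-additive here.
sum_of_subtotals = sum(subtotals.values())

# The true total needs the equivalent of COUNT(DISTINCT driver) on the raw rows.
distinct_drivers = len({driver for driver, _ in ownership})

print(subtotals, sum_of_subtotals, distinct_drivers)
```

If every driver could only appear under one brand, the two numbers would match and the pre-aggregated sub-totals could simply be summed.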

## The power of Enums
### When should you use enums ?
Enums are great for low-to-medium cardinality

*Example :* countries (200) might be too much for an enum.

If there are fewer than 50 values, it might be a good enum

### Why should you use enums ?

#### Built-in data quality
- The Enum guarantees the set of values a field can take.
- This prevents invalid values at the database level
- You can't insert 'pending ' (with space) or 'PENDING' (wrong case)
- More reliable than `CHECK` constraints on `VARCHAR` columns
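
The same guarantee can be illustrated on the application side with Python's standard `enum` module. This is a sketch of the idea, not the PostgreSQL enum itself; the `OrderStatus` type and its values are hypothetical.

```python
from enum import Enum


class OrderStatus(Enum):
    """Hypothetical set of allowed order statuses."""
    PENDING = "pending"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


def is_valid(raw: str) -> bool:
    """Return True only if raw matches one of the enum values exactly."""
    try:
        OrderStatus(raw)  # lookup by value, raises ValueError if unknown
        return True
    except ValueError:
        return False


print(is_valid("pending"))   # exact match
print(is_valid("pending "))  # trailing space is rejected
print(is_valid("PENDING"))   # wrong case is rejected
```

Anything outside the declared set is rejected at the boundary, which is the behaviour the database enum gives you at insert time.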

#### Built-in static fields
- The enum values are stored as internal constants in the database
- Usually more storage efficient than VARCHAR
- PostgreSQL stores each enum value as a fixed 4-byte internal value but presents it as a string to users
```sql
-- The table stores a compact enum value per row instead of the full string
SELECT order_status, COUNT(*) AS order_count
FROM orders
GROUP BY order_status;
```

#### Built-in documentation
- The enum definition serves as documentation in the database schema
- Developers can see all valid values by checking the enum type:

```shell
# psql meta-command (run inside a psql session): describe the enum type and list its values
\dT+ status
```

### Enumerations and subpartitions
Enumerations make amazing subpartitions because :
- You have an exhaustive list
- They chunk up the big data problem into manageable pieces
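
A small Python sketch of why an exhaustive list helps: with one partition per enum member, a source that delivered nothing shows up as an empty partition instead of silently disappearing. The `Source` enum and the `events` rows are hypothetical.

```python
from enum import Enum


class Source(Enum):
    """Hypothetical exhaustive list of upstream sources."""
    WEB = "web"
    MOBILE = "mobile"
    API = "api"


# Hypothetical events, each tagged with the source it came from.
events = [
    {"source": "web", "event_id": 1},
    {"source": "api", "event_id": 2},
    {"source": "web", "event_id": 3},
]

# Start with one (possibly empty) partition per enum member.
partitions = {source: [] for source in Source}
for event in events:
    partitions[Source(event["source"])].append(event)

# Sources from the enum that produced no rows in this run.
missing = [source.value for source, rows in partitions.items() if not rows]

print({source.value: len(rows) for source, rows in partitions.items()}, missing)
```

Each partition can then be processed on its own, and the pipeline knows it is done only when every member of the enum has been handled.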

*Thrift* is a tool to manage schemas in your logging as well as in your ETLs

> https://en.wikipedia.org/wiki/Apache_Thrift

Zach created a framework named "Little book of pipelines"

> https://github.com/EcZachly/little-book-of-pipelines

The enum pattern is useful whenever you have tons of sources mapping to a shared schema.

## How do you model data from disparate sources into a shared schema ?

You don't want to bring all the columns from 40 tables into one table with 500 columns
- Most columns are going to be NULL all the time
- It's not really a usable table

You want to use a "flexible schema"
- Use a lot of map types
- Overlaps a lot with graph databases
- The enumerated list of things is similar to a "Vertex type"

### Flexible schemas

If you need to add more things, you just add them to the map

If new columns appear, you can add them to the map as new keys

> **Limit :** According to the lecture, maps in Spark can have up to 65,000 keys
#### Benefits
- No `ALTER TABLE` commands
- You can manage a lot more columns
- You don't have a ton of `NULL` columns
- You can use an "other_properties" map column
  - It carries rarely used but needed columns
  - It allows you to ignore them during modeling
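
A minimal sketch of the idea in Python, assuming two hypothetical sources that share only `event_id` and `event_time`. The function name `to_shared_schema` and every column name are illustrative assumptions; the point is only that source-specific columns become keys of a string map instead of new table columns.

```python
# Columns every source is assumed to share (hypothetical names).
SHARED_COLUMNS = ("event_id", "event_time")


def to_shared_schema(source: str, row: dict) -> dict:
    """Keep shared columns as real fields, push the rest into a string map."""
    shared = {column: row[column] for column in SHARED_COLUMNS}
    other = {
        key: str(value)
        for key, value in row.items()
        if key not in SHARED_COLUMNS and value is not None
    }
    return {"source": source, **shared, "other_properties": other}


web_row = {"event_id": 1, "event_time": "2024-11-19", "browser": "firefox"}
app_row = {"event_id": 2, "event_time": "2024-11-19", "os": "ios", "app_version": 42}

unified = [to_shared_schema("web", web_row), to_shared_schema("mobile", app_row)]
for record in unified:
    print(record)
```

Both records end up with the same four top-level fields, and a new source-specific column needs no schema change. The price is that the keys are repeated in every row.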

#### Drawbacks
- Compression is terrible
  - Maps compress worse than almost any other data type
  - The only thing worse is JSON
  - The column headers become map keys, so they are part of the data instead of the schema and are duplicated in each row

## How is graph modelling different ?

Graph modelling is relationship-focused, not entity-focused.

Because of that, you can do a very poor job at modeling entities

### Zach's super secret sauce for graph data modelling :
He always ends up with the same schema when working with graph databases

We don't care about columns

Usually for an **entity** in a graph the model is :
```sql
Identifier: STRING
Type: STRING
Properties: MAP<STRING, STRING>
```

The **relationships** are modeled in a little more depth.

Usually the model looks like :
```sql
subject_identifier: STRING
-- Example: player_name
subject_type: VERTEX_TYPE
-- Example: player
object_identifier: STRING
-- Example: team_name
object_type: VERTEX_TYPE
-- Example: team
edge_type: EDGE_TYPE
-- Example: "PLAYS WITH" , "PLAYS AGAINST" (an action)
properties: MAP<STRING, STRING>
```
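
The two shapes above can be sketched as Python dataclasses. This is an illustration only: the `VertexType` and `EdgeType` members, the class names, and the sample player and team are assumptions chosen to mirror the examples in the comments.

```python
from dataclasses import dataclass, field
from enum import Enum


class VertexType(Enum):
    PLAYER = "player"
    TEAM = "team"


class EdgeType(Enum):
    PLAYS_WITH = "plays_with"
    PLAYS_AGAINST = "plays_against"


@dataclass
class Vertex:
    identifier: str
    type: VertexType
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    subject_identifier: str
    subject_type: VertexType
    object_identifier: str
    object_type: VertexType
    edge_type: EdgeType
    properties: dict[str, str] = field(default_factory=dict)


# Hypothetical sample data: one player, one team, one relationship.
player = Vertex("Player A", VertexType.PLAYER, {"position": "guard"})
team = Vertex("Team B", VertexType.TEAM)
edge = Edge(
    subject_identifier=player.identifier,
    subject_type=player.type,
    object_identifier=team.identifier,
    object_type=team.type,
    edge_type=EdgeType.PLAYS_AGAINST,
    properties={"games": "3"},
)

print(edge)
```

The entity carries almost no structure of its own. The enums on the edge are what keep the set of vertex and edge types under control.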
22 changes: 22 additions & 0 deletions hugo/public/index.html
@@ -49,6 +49,28 @@



<section class="list-item">
<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/">(DataExpert.io) Bootcamp - Day 3 - Lecture</a></h1>
<time>Nov 19, 2024</time>
<br><div class="description">

<p>Today&rsquo;s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption</p>
<h2 id="index">Index</h2>
<ul>
<li>Additive VS non-additive dimensions</li>
<li>The power of Enums</li>
<li>When should you use flexible data types ?</li>
<li>Graph data modeling</li>
</ul>
<h2 id="additive-vs-non-additive-dimensions">Additive vs Non-additive dimensions</h2>
<h3 id="what-makes-a-dimension-additive-">What makes a dimension additive ?</h3>
<p>Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.</p>
<p>If you take all the sub-totals and sum them up you should have the total</p>&hellip;

</div>
<a class="readmore" href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/">Read more ⟶</a>
</section>

<section class="list-item">
<h1 class="title"><a href="/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/">(DataExpert.io) Bootcamp - Day 2 - Lecture</a></h1>
<time>Nov 17, 2024</time>
9 changes: 8 additions & 1 deletion hugo/public/index.xml
@@ -6,8 +6,15 @@
<description>Recent content on Gabriel Study Blog</description>
<generator>Hugo</generator>
<language>en</language>
<lastBuildDate>Sun, 17 Nov 2024 19:21:32 +0100</lastBuildDate>
<lastBuildDate>Tue, 19 Nov 2024 17:46:24 +0100</lastBuildDate>
<atom:link href="http://localhost:1313/blog/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>(DataExpert.io) Bootcamp - Day 3 - Lecture</title>
<link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/</link>
<pubDate>Tue, 19 Nov 2024 17:46:24 +0100</pubDate>
<guid>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day3/lecture/</guid>
<description>&lt;p&gt;Today&amp;rsquo;s lecture is about dimensional additivity and how to build a flexible data model ready for graph database consumption&lt;/p&gt;&#xA;&lt;h2 id=&#34;index&#34;&gt;Index&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Additive VS non-additive dimensions&lt;/li&gt;&#xA;&lt;li&gt;The power of Enums&lt;/li&gt;&#xA;&lt;li&gt;When should you use flexible data types ?&lt;/li&gt;&#xA;&lt;li&gt;Graph data modeling&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;additive-vs-non-additive-dimensions&#34;&gt;Additive vs Non-additive dimensions&lt;/h2&gt;&#xA;&lt;h3 id=&#34;what-makes-a-dimension-additive-&#34;&gt;What makes a dimension additive ?&lt;/h3&gt;&#xA;&lt;p&gt;Additivity refers to whether numerical facts (measures) in a fact table can be meaningfully aggregated across different dimensions.&lt;/p&gt;&#xA;&lt;p&gt;If you take all the sub-totals and sum them up you should have the total&lt;/p&gt;</description>
</item>
<item>
<title>(DataExpert.io) Bootcamp - Day 2 - Lecture</title>
<link>http://localhost:1313/blog/posts/data-engineering/bootcamps/data-expert-io/dimensional-data-modelling/day2/lecture/</link>