In this workshop you'll learn how to design, extract, transform and load a datawarehouse using SQL Server Data Tools(SSDT) and SQL Server Integration Services(SSIS). At the end of this workshop you'll be able to run background jobs to update the datawarehouse.
- SQL Server 2017+ Installed;
- SQL Server Management Studio Installed;
- SQL Server Data Tools Installed;
- SQL Server Integration Tservices Installed;
- A running copy of the database AdventureWorks2014.
- If you've followed the installation guide everything should be ready to go.
- Getting started
- Setting up the Connections Managers
- Extract data
- Test your flow
- Loading data from a .csv file
- Create a new database called
AdventureWorksDW
; - Create the following tables using the scripts below
- DimSalesTerritory
- DimDate
CREATE TABLE [DimSalesTerritory] ( [SalesTerritoryKey] INT NOT NULL, [SalesTerritoryRegion] NVARCHAR(50) NOT NULL, [SalesTerritoryCountry] NVARCHAR(50) NOT NULL, [SalesTerritoryGroup] NVARCHAR(50) NOT NULL, CONSTRAINT PK_DimSalesTerritory PRIMARY KEY([SalesTerritoryKey]) )
CREATE TABLE [DimDate] ( [DateKey] INT NOT NULL, [FullDateAlternateKey] DATE NOT NULL, [EnglishDayNameOfWeek] NVARCHAR(50) NOT NULL, [DutchDayNameOfWeek] NVARCHAR(50) NOT NULL, [MonthNumber] TINYINT NOT NULL, [EnglishMonthName] NVARCHAR(50) NOT NULL, [DutchMonthName] NVARCHAR(50) NOT NULL, [CalendarQuarter] TINYINT NOT NULL, [CalendarYear] SMALLINT NOT NULL CONSTRAINT PK_DimDate PRIMARY KEY(DateKey) )
- Prepare the extraction of the data by creating 2 VIEWs in the
AdventureWorks2014
database, we'll be using these views later on so it's a good idea to get to know the inner workings of that the VIEWs represent. Execute the following statements:CREATE VIEW VwSalesTerritory AS SELECT SalesTerritory.TerritoryID AS SalesTerritoryKey ,SalesTerritory.[Name] AS SalesTerritoryRegion ,CountryRegion.[Name] AS SalesTerritoryCountry ,SalesTerritory.[Group] AS SalesTerritoryGroup FROM Sales.SalesTerritory SalesTerritory JOIN Person.CountryRegion CountryRegion on SalesTerritory.CountryRegionCode = CountryRegion.CountryRegionCode;
CREATE VIEW VwDimDate AS SELECT DISTINCT CAST(FORMAT(OrderDate,'yyyyMMdd') AS INT) AS DateKey ,OrderDate AS FullDateAlternateKey ,FORMAT(OrderDate,'dddd','en-UK') AS EnglishDayNameOfWeek ,FORMAT(OrderDate,'dddd','nl-NL') AS DutchDayNameOfWeek ,MONTH(OrderDate) AS MonthNumber ,FORMAT(OrderDate,'MMMM','en-UK') AS EnglishMonthName ,FORMAT(OrderDate,'MMMM','nl-NL') AS DutchMonthName ,DATEPART(q,OrderDate) AS CalendarQuarter ,DATEPART(YYYY,OrderDate) CalendarYear FROM Sales.SalesOrderHeader WHERE OnlineOrderFlag = 1
- Create a new
Integration Services Project
using Visual Studio 2017+ calledAdventureWorks
.If the
Integration Services Project
option is not available, make sure to revisit the steps in the installation guide. - Right click on
package.dtsx
in the solution explorer to rename it toFillDW.dtsx
Deep Dive : The
.dtsx
extension is a shorthand forData Transformation Service
, the former name of SSIS. The Data Transformation Services Package XML (DTSX) file format is an XML-based file format that stores the instructions for the processing of a data flow, including transformations and optional processing steps, from its points of origin to its points of destination, learn more.
Configure two Connection Managers
one which connects to the newly created AdventureWorksDW
and one which connects to the OLTP AdventureWorks2014
. This will allow you to re-use connections in multiple packages and steps. Take the following actions:
-
In the solution explorer, right click on
Connection Managers
, selectNew Connection Manager
. -
Select the
OLEDB
option and clickadd
. -
Click
New...
-
In the Server Name field, type
localhost
and select theAdventureWorks2014
database. -
Test the connection by clicking the
Test Connection
button and make sure you're able to connect to the database. -
Repeat step 3-5 for the
AdventureWorksDW
database. -
By now your solution should look similar to the following screenshot:
We want to extract data from the OLTP database AdventureWorks2014
and load it into the data warehouse AdventureWorksDW
, automagically based on certain criteria.
Before we start filling the dimension tables of the data warehouse, we want to delete all records from the table DimSalesTerritory
before each transfer. Therefore we need a Execute SQL Task
.
- Drag the
Execute SQL Task
from the toolbox onto theControl Flow
tab. - Double click the
Execute SQL Task
in theControl Flow
and edit the fields according to the screenshot below. .
Once we know the table is empty by truncating the DimSalesTerritory table, we can start stuffing it with data coming from the OLTP database AdventureWorks2014
.
- Drag the
Data Flow Task
from the toolbox onto theControl Flow
tab, rename the task toFill Table DimSalesTerritory
. - Connect the preceding task, so that after a successful
DELETE
statement theData Flow Task
is started. You can add as many follow-ups on a failiure or successful execution. . - Double click the
Fill Table DimSalesTerritory
task, theData Flow
tab is now active and selected, notice how the SSIS toolbar changed based on the selected tabControl Flow
vs.Data Flow
. - Drag the data flow task
OLE DB Source
(Other Sources) to the data flow. Configure it as follows by right clicking the task:- Rename the task to
Extract SalesTerritory
. - Make sure the Connection Manager is set to
AdventureWorks2014
. - The VIEW to load the data from is
VwSalesTerritory
.
- Rename the task to
- Drag the data flow task
OLE DB Destination
(Other Destinations) to the data flow. Afterwards connect the source to the destination. Configure it as follows by right clicking the task:- Rename the task to
Fill DimSalesTerritory
. - Make sure the Connection Manager is set to
AdventureWorksDW
. - The TABLE to load the data in is
DimSalesTerritory
.
- Rename the task to
- Check the Mappings in the destination and make sure they match.
IMPORTANT : If the name of the VIEW column(s) don't match the name of the destination table or the datatype (even NVARCHAR vs. VARCHAR) are different , you'll get errors or the columns are ignored. It's always a good idea to check the mappings and if they don't match ALTER the VIEW or TABLE, or use
CAST
orCONVERT
functions in the VIEW. - Run the package by navigating to the
Control Flow
and pressingF5
orStart
. - Double check your results by writing a SQL query against the
DimSalesTerritory
table in the datawarehouse. .
For filling the DimDate
table you'll have to do exactly the same as DimSalesTerritory
.
- Create a new
Data Flow Task
, give an appropriate name. - Add 1 OLE DB source and 1 OLE DB destination for the corresponding view and table.
- Create a connection between the source and the destination.
- Don't forget to delete all the records from the
DimDate
table. If you forgot why you need to do this, re-read the section about it. - Check your results by running the package again and query the filled table
DimDate
. - Control flow at this point:
- Data flow
Fill Table DimDate
at this point:
The product dimension or the DimProduct
table is a bit different then the DimDate
and DimSalesTerritory
dimensions. We'll be keeping track of product history in this dimensions, also known as a slowly changing dimension
.
-
Create the
DimProduct
table in the datawarehouse based on the following relational model, theProductKey
column is automagically generated by the database engine. -
On the
Control Flow
add a newData Flow Task
calledFill Table DimProduct(SCD)
, notice that we do not add aTruncate Table
orDelete
command since we'll be keeping history. -
Add a
OLE DB Source
task, with the nameExtract Product
. Make sure theProduction.Product
table from the OLTP databaseAdventureWorks2014
is selected. -
Add a
Slowly Changing Dimension
task, with the nameFill DimProduct(SCD)
. -
Connect the
Extract Product
task to theFill DimProduct(SCD)
. -
Right click the
Fill DimProduct(SCD)
and selectEdit
, which will start a Slowly Changing Dimension wizard. In the wizard take the following actions based on the provided screenshots, make sure to read the given information to understand what's going on: -
The generated
Data Flow
is the following, make sure to check what each task does by right clicking the task and pressingEdit
, you can give each task a appropriate name: -
By default SSIS set the
EndDate
equal to the new start date. It seems more appropriate to set theEndDate
to yesterday (the day before the SSIS runs if we would schedule the package on a daily basis) and the new start date to today. Double clickDerived Column
and change theExpression
field as follows:DATEADD("day",-1,(DT_DBDATE)(@[System::StartTime]))
-
Run the project and check the results.
DimProduct
is filled but the StartDate is equal to today. From a business point of view this is essentially wrong, the current product data is valid since the start of the company (since we have no history yet). So execute the following statement once:
UPDATE DimProduct SET StartDate = ( SELECT MIN(OrderDate) FROM AdventureWorks2014.Sales.SalesOrderHeader );
-
Check what happens when -for example- the color of a product is updated and you rerun the package, take the following steps:
- On the OLTP database, execute the following command
UPDATE Production.Product SET Color = 'Blue' WHERE ProductID = 776;
- Rerun the package
- On the datawarehouse, execute the following command
SELECT * FROM DimProduct WHERE ProductID = 776;
- What's the result?
- On the OLTP database, execute the following command
-
Create the
FactSales
table in the datawarehouse based on the following relational model, theSalesOrderLineNumber
column is not automagically generated by the database engine. Since the column has no real busniness meaning and is a surrogate key in the OLTP database, we'll re-use it's values in the datawarehouse.Remarks
ProductKey
(and notProductID
, which is kept as a business key) is inserted as foreign key so we can also link to the correctProduct
information when making sales reports.- The first where clause makes sure that we join with the correct
DimProduct
line so we can insert the correctProductKey
due to theslowly changing dimension
. - Due to the second where clause we only add new lines to
FactSales
in consecutive runs of this statement. TheISNULL
function is necessary for the first run (when FactSales is empty) becauseSELECT MAX()
returnsNULL
on an empty table.
-
Distinguish which columns are foreign keys and which are real fact columns based on the name of the columns, we'll be adding them later using the Database Diagram.
-
Now we can start inserting facts based on the OLTP database into the data warehouse, add a new
Execute SQL Command Task
in thecontrol flow
and connect it to the last taskFill Table DimProduct(SCD)
with the nameFill Table FactSales
afterwards press right click the on the new task and pressedit
. In this dialog connect to the data warehouse and copy-paste the following insert statement into theSQLStatement field
:INSERT INTO FactSales ( -- Columns we're inserting into. SalesOrderLineNumber ,ProductKey ,SalesTerritoryKey ,OrderDateKey ,OrderQuantity ,UnitPrice ,ExtendedAmount ) SELECT d.SalesOrderDetailID ,p.ProductKey ,h.TerritoryID ,CAST(FORMAT(h.OrderDate,'yyyyMMdd') AS INT) ,d.OrderQty,d.UnitPrice ,d.OrderQty * d.UnitPrice FROM AdventureWorks2014.Sales.SalesOrderHeader h JOIN AdventureWorks2014.Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID JOIN DimProduct p ON d.ProductID = p.ProductID WHERE /* Slowly Changing Dimension dimproduct */ h.OrderDate >= p.StartDate AND (p.EndDate IS NULL OR h.OrderDate <= p.EndDate) AND /* increment, also make sure it runs from an empty factsales table */ d.SalesOrderDetailID > (SELECT ISNULL(MAX(SalesOrderLineNumber),0) FROM factsales);
-
Draw the
Database Diagram
of the data warehouse and connect the foreign keys as shown in the .gif.
IMPORTANT Do not forget to save the diagram and the changes you made to the foreign keys, else it the contraints won't be added.
- Run the project as a whole and check the resulting tables. We get an error message because we are trying to delete
DimDate
andDimSalesTerritory
since there are foreign key constraints between those tables and theFactSales
table. One way to cope with this is to temporarily disable the constraints at the start of the fill operation and enable them again at the end. So add twoExecute SQL
tasks in the control flow, don't forget to select theConnection Manager
for the datawarehouse.- At the start
ALTER TABLE FactSales NOCHECK CONSTRAINT ALL;
- At the end
ALTER TABLE FactSales WITH CHECK CHECK CONSTRAINT ALL;
CHECK CHECK
is not a typo.
- At the start
As a final test we can add a sales line for an updated product in the operational system and check if the corresponding FactSales
line in the datawarehouse will be linked to the correct ProductKey
in DimProduct
:
- Copy the last line from [Sales].[SalesOrderHeader] (SalesOrderID is an identity column, so we have to specify all fields)
INSERT INTO [Sales].[SalesOrderHeader] ([RevisionNumber],[OrderDate],[DueDate],[ShipDate],[Status],[OnlineOrderFlag],[PurchaseOrderNumber] ,[AccountNumber],[CustomerID],[SalesPersonID],[TerritoryID],[BillToAddressID],[ShipToAddressID] ,[ShipMethodID],[CreditCardID],[CreditCardApprovalCode],[CurrencyRateID],[SubTotal] ,[TaxAmt],[Freight],[Comment],[rowguid],[ModifiedDate]) select s.RevisionNumber,'2020-03-08', '2020-03-08', '2020-03-08',s.Status, s.OnlineOrderFlag,s.PurchaseOrderNumber,s.AccountNumber,s.CustomerID, s.SalesPersonID,s.TerritoryID,s.BillToAddressID,s.ShipToAddressID, s.ShipMethodID,s.CreditCardID,s.CreditCardApprovalCode,s.CurrencyRateID, s.SubTotal,s.TaxAmt,s.Freight,s.Comment,newid(),getdate() from sales.salesorderheader s where salesorderid = (select max(salesorderid) from sales.salesorderheader);
- Check for the key value to use in [Sales].[SalesOrderDetail]:
select max(salesorderid) from sales.salesorderheader; --> 75129
- Add a line to the above created
SalesOrderHeader
:INSERT INTO [Sales].[SalesOrderDetail]([SalesOrderID],[CarrierTrackingNumber],[OrderQty],[ProductID] ,[SpecialOfferID],[UnitPrice],[UnitPriceDiscount],[rowguid],[ModifiedDate]) VALUES (75129,null,4,776,1,10,0,newid(),getdate());
- Now rerun the package and check
FactSales
:select * from dimproduct where productid = 776;
506 is the correct product key.
select top 1 * from factsales order by salesorderlinenumber desc;
506 should be linked in this fact.
- Create a new SSIS package; name it
Transfer Country Data
. - Create the destination table for this example by running the following script on the AdventureWorksDW database:
CREATE TABLE [country]( [id] [int] IDENTITY(1,1) NOT NULL, [name] [varchar](150) NOT NULL, [alpha-2] [varchar](50) NULL, [alpha-3] [varchar](50) NULL, [country-code] [varchar](50) NULL, [iso_3166-2] [varchar](50) NULL, [region] [varchar](50) NULL, [sub-region] [varchar](50) NULL, [region-code] [varchar](50) NULL, [sub-region-code] [varchar](50) NULL, CONSTRAINT [PK_country] PRIMARY KEY ([id] ASC) );
- Drag-and-drop a Data Flow Task from SSIS Toolbox into the package designer. Double-click on the task. You will be redirected to a new tab named Data Flow.
- In the Data Flow tab, drag-and-drop a flat file source from SSIS Toolbox into the Package Designer. Then, double-click on the flat file source.
- In Flat File Source Editor, create a new connection. After this step, the Flat File Connection Manager menu will be opened.
- Browse for the Country.csv file, and leave the locale and code page configurations as is. Verify the Format field to be set as Delimited, as the next screenshot shows, and check the box that says Column names in the first data row:
- Go to the Columns tab and you will see how columns are recognized by the settings that we made in the General tab. You can also change the row and column delimiters in this tab.
- The Advanced tab shows detailed information about each column; you can set the data type, length, and some other properties for each column in that tab. Finally, you can view the data rows as they are processed in the flat file source's connection manager in the Preview tab. Do not change anything in the Advanced tab and click on OK.
- In the Flat File Source Editor, set the Retain null values from the source option as null values in the Data Flow.
- Go to the Columns tab; here, you can check as many columns as you want to be fetched from the source. By default, all columns will be fetched. Leave it as is and close the Flat File Source Editor.
- Now, add an OLE DB destination and connect to the Flat File Source. Configure the Destination for the country table from the AventureWorksDW database. Check the Columns tab to verify all corresponding fields are linked.
- Run the package and check the table country.
You find here the solutions to the DWH EntertainmentAgency Exercises on both OLTP & DWH, including the statement to fill the fact table. You find here the create statements for the fact and dimension tables of the DWH EntertainmentAgency Exercise.
You can get more information about SSIS by reading the official documentation provided by Microsoft.