Build a logical Enterprise Data Warehouse with ADLS and Synapse Serverless SQL pool

jehayes · ‎Sep 12 2022

Build a Logical Enterprise Data Warehouse using Azure Data Lake Storage Gen 2 and Synapse Serverless SQL pools

Overview

A common pattern in Modern Data Warehouse architecture is to land your source data in its existing format into a data lake followed by transforming and loading it into an Enterprise Data Warehouse (EDW) for reporting and analytics. In Microsoft Azure, this translates to loading data into Azure Data Lake Storage Gen2 (ADLS), transforming data with Synapse or Azure Data Factory Pipelines, storing data in an Azure Synapse Dedicated Pool database or other Azure relational data store, and building reports over that data in Power BI, Excel, or other reporting tools. Typically, the EDW data is stored in a Star Schema, the optimal design for many reporting and analytical tools like Power BI. This is a great practice for enterprise reporting requirements but has some pitfalls:

Traditional transformation and load of data to the final EDW data store can be time consuming
Ingress and storage costs increase when transforming and loading from the landing zone to the final data store for reporting
Deep knowledge of the data sources is required in order to provide business insights and optimal data warehouse design

But:

What if your data was transformed and ready for reporting as soon as it is landed in the cloud?
What if you could eliminate that time consuming ETL process?
How about not incurring additional costs for moving, transforming and storing data in another location?
What if you suspect that you will need a traditional EDW but would like to start an Exploratory Data Analysis project in the cloud?
How about doing this all within T-SQL?

Then Serverless SQL Pools in Azure Synapse Analytics may be the answer for you!

In this post, we'll walk through creating a logical data warehouse over your ADLS data using a Serverless SQL database.

Prerequisites

Synapse Analytics Workspace
- Use an existing workspace or create a new Synapse Analytics Workspace
Data in an Azure Data Lake Storage Gen 2
- Use your own data and follow steps below as a guideline
- Download the parquet files used in my demo and upload them to your default Synapse ADLS storage account using Azure Storage Explorer or through the Azure portal. (In real-world applications, this is typically done via a Synapse Pipeline rather than manually.)

Steps to create a logical EDW via Synapse SQL Serverless

Open Synapse Studio
Create a SQL Serverless Database in Azure Synapse Analytics Workspace Data hub

I created two schemas, Bronze and Silver. The views in the Bronze schema represent the ADLS data in its existing format. The views in the Silver schema contain queries over the Bronze views to denormalize the data into Facts and Dimensions. In the Data hub, go to your SQL Serverless database and click on the ellipses to the right of the database name, choose New SQL script, then Empty script

In the empty script, type CREATE SCHEMA Bronze, then click Run:

Repeat step 3 to create a Silver schema.
Create the SQL views in your Bronze schema over your ADLS files to define column names for easy querying in t-SQL.

A quick way to do this is to navigate to your ADLS account in the Data hub and generate a New SQL script over the folder (or file) for each entity

You will be prompted for the format (Text, Parquet, or Delta) and then a script will be created for you. Change the script to eliminate the TOP100 from the SELECT statement and add CREATE VIEW. Click Run to create the view. Do this for all of the ADLS files that will be used in your logical data warehouse.

Here are the views I created in my Bronze Schema:

Create SQL views in the Silver schema to define the transformations for the fact and dimension tables for your star schema.

Below are the views I created in the Silver schema. You can create your own views or download the code I created for the views below

Here is an example of the StockItems dimension table that was denormalized for the star schema:

Create View Silver.StockItems as
SELECT  SI_A.StockItemID,
SI_A.StockItemName,
S.SupplierID,
S.SupplierName,
SC.SupplierCategoryID,
SC.SupplierCategoryName,
C.ColorID,
SI_A.UnitPackageID as [Selling Package],
SP_A.PackageTypeName,
SI_A.OuterPackageID as [Buying Package],
SI_A.Brand,
SI_A.Size,
SI_A.LeadTimeDays,
SI_A.QuantityPerOuter,
SI_A.IsChillerStock,
SI_A.Barcode,
SI_A.TaxRate,
SI_A.UnitPrice,
SI_A.RecommendedRetailPrice,
SI_A.TypicalWeightPerUnit,
SI_A.MarketingComments,
SI_A.InternalComments,
SI_A.Photo, 
SI_A.CustomFields,
SI_A.Tags,
SI_A.SearchDetails,
SI_A.LastEditedBy,
SI_A.ValidFrom,
SI_A.ValidTo
From Bronze.StockItems as SI_A
INNER JOIN Bronze.Suppliers S ON SI_A.SupplierID = S.SupplierID
INNER JOIN Bronze.Colors C on SI_A.ColorID = C.ColorID
INNER JOIN Bronze.PackageTypes SP_A on SP_A.PackageTypeID = SI_A.UnitPackageID
INNER JOIN Bronze.SupplierCategories SC on S.SupplierCategoryID = SC.SupplierCategoryID

Connect to the SQL Serverless Endpoint to query your database in your tool of choice, such as Power BI or SQL Server Management studio. Start by navigating to your Synapse Analytics Workspace in the Azure Portal and copying the Serverless SQL Endpoint:

Connect to the endpoint in SQL Server Management Studio by pasting the Serverless SQL endpoint URL as the Server name:

Here's the results of a query over one of the views:

Connect to the Serverless SQL endpoint in Power BI Desktop:

Load and transform just like any other Power BI data source:

Summary

Keeping data in ADLS Gen2 without transforming and loading to another data platform gives users quick access to the data they need for reporting and analytics without the cost and delays due to heavy transformations and data movement. After creating your logical EDW in the Synapse SQL Serverless database, you can further evaluate if Synapse SQL Serverless meets your goals for query performance. If more performance is needed, you can leverage your Synapse SQL Serverless views to move data to another Azure data platform like Azure Synapse Analytics SQL Dedicated pool for further query optimization and performance features.

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs