Lakehouse End-to-End Architecture: Fabric’s Integration for Streamlined Data Operations

While data and analytics professionals may be familiar with many of the principles behind Fabric, applying those concepts in a new setting can be difficult. The goal of this blog is to provide a fundamental understanding of the Fabric UX, the different workloads and their integration points, and the experiences of Fabric professional and citizen developers, by taking the user step by step through an end-to-end scenario from data collection to data consumption.
The blog is not meant to serve as a comprehensive feature and functionality list, a reference architecture, or a list of recommended best practices.

In the past, businesses constructed state-of-the-art data warehouses to meet their demands for processing and analyzing structured and transactional data, and data lakehouses to handle big data (semi-structured and unstructured data). Running these two systems in parallel produced silos, duplicate data, and a higher total cost of ownership.
Fabric's integration of data stores and standardization on the Delta Lake format let you eliminate redundant data, break down silos, and significantly lower total cost of ownership.

The lakehouse end-to-end architecture

Data Sources: Fabric facilitates fast and simple connections to on-premises data sources, cloud services, and Azure Data Services for data ingestion.
Ingestion: You can rapidly develop insights for your company using the 200+ native connectors included in Fabric pipelines, as well as the drag-and-drop data transformation capabilities of dataflows. With Fabric's new shortcut functionality, you can connect to data that already exists without having to move or duplicate it.

Transform and Store: Fabric standardizes on the Delta Lake format, which means all Fabric engines can read and work on the same dataset stored in OneLake, with no need for data duplication. This storage lets you build lakehouses using a medallion architecture or a data mesh, depending on your organizational needs. For transformation, you can choose a low-code or no-code experience with pipelines/dataflows, or a code-first experience with notebooks/Spark.
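To make the code-first option concrete, here is a minimal sketch of the kind of cell you might run in a Fabric notebook attached to a lakehouse; the folder and table names are placeholders, not part of the walkthrough below.

# Illustrative only: the path and table name are placeholders, and the cell
# assumes a Fabric notebook with a default lakehouse attached (so `spark` is
# already available and "Files/..." resolves to the lakehouse Files area).
df = spark.read.parquet("Files/raw/orders")

# Write the data once as a Delta table under the lakehouse Tables section.
df.write.mode("overwrite").format("delta").saveAsTable("orders")

# The same physical data is now readable by Spark, the SQL endpoint, and
# Power BI Direct Lake; no second copy is made.
spark.table("orders").show(5)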

Consume: Data from the Lakehouse can be consumed by Power BI, the industry-leading business intelligence tool, for reporting and visualization. Each Lakehouse comes with a built-in TDS/SQL endpoint for easily connecting to and querying data in the Lakehouse tables from other reporting tools when needed. When a Lakehouse is created, a secondary item called a Warehouse is automatically generated with the same name as the Lakehouse, and this Warehouse item provides the TDS/SQL endpoint.
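As an illustration of connecting one of those other tools (or any SQL client) to the endpoint, the sketch below uses Python with pyodbc. It is a sketch only: the server name is a placeholder you would copy from the SQL endpoint's connection settings in Fabric, Azure AD interactive sign-in is assumed, and the lakehouse and table names refer to the ones created later in this blog.

import pyodbc

# Sketch only: the server name is a placeholder; copy the real one from the
# SQL endpoint's connection settings. LHDemo and dimension_customer are the
# lakehouse and table created later in this walkthrough.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=LHDemo;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

# Query a Lakehouse table through the TDS/SQL endpoint.
for row in conn.execute("SELECT TOP 5 * FROM dimension_customer"):
    print(row)

conn.close()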

Data Source: The source data is in Parquet format in an unpartitioned structure, stored in a folder for each table. For the purpose of this blog, we will set up a pipeline to copy/ingest the complete historical (one-time) data into the Lakehouse.

Lakehouse: To keep things simple and workable for this implementation, we will build a single Lakehouse, ingest data into the Files area, and then create Delta Lake tables in the Tables section.
At the end of this blog, there is an optional module that discusses designing a Lakehouse with a medallion architecture (Bronze, Silver, and Gold) and offers a suggested methodology.

Transform: We will show code-first users how to use Notebooks/Spark for data preparation and transformation, and low-code/no-code users how to use Pipelines/Dataflows.
Consume: To illustrate data consumption, you will learn how to build reports and dashboards using Power BI's Direct Lake capability, and how to query data directly from the Lakehouse. Additionally, to show how you can make your data available to other reporting tools, you will connect to the Warehouse through the TDS/SQL endpoint and run analytics using SQL-based queries.

Build your first Lakehouse in Fabric

This section walks quickly through the entire process: building a lakehouse, ingesting data into a table, applying transformations as needed, and using the data loaded into the lakehouse Delta table to create a report.

Create a lakehouse

  1. In the Power BI service select Workspaces in the left-hand menu.
  2. Search for your workspace by typing in the search textbox at the top and click on your workspace to open it.
  3. From the workload switcher located at the bottom left of the screen, select Data engineering.

In the Data Engineering section, select Lakehouse to create a lakehouse.

Enter LHDemo in the Name box and click Create. The new lakehouse will be created and automatically opened.

Data Ingestion:

In the lakehouse view, you will see options to load data into the lakehouse. Click on New Dataflow Gen2.

On the new dataflow page, click on SQL Server to connect to the source and import the sales customer table.

Clicking Next on the previous screen opens the Preview file data page. Click Create to proceed and return to the dataflow canvas.

Once you have selected the LHDemo lakehouse we created earlier as the destination, click Next.

Click on Publish.

A spinning circle next to the dataflow's name in the artifact view indicates that publishing is in progress. When it completes, click the ellipsis and select Properties to rename the dataflow. For the purpose of this module, change the name to Load Lakehouse Table and select Save.

Next to the name of the dataflow in the artifact view, there is a Refresh now icon to refresh the dataflow. Click it to kick off execution of the dataflow and move the data from the source to the lakehouse table. While it is in progress, you will see a spinning circle under the Refreshed column in the artifact view.

Once the dataflow refresh is complete, open the lakehouse and you will notice that the dimension_customer Delta table has been created. When you click on it, you can preview its data. Further, you can use the SQL endpoint of the lakehouse to query the data with SQL statements in warehouse mode. Click on SQL endpoint under Lake mode at the top right of the screen.

In the warehouse mode, you can click on dimension_customer table to preview its data and/or click on New SQL query to write your SQL statements.

Here is a sample query that aggregates the row count based on the BuyingGroup column of the dimension_customer table, along with its output. SQL query files are saved automatically for future reference, and you can rename or delete these files as needed.

To run the script, click on the Run icon at the top of the script file.

SELECT BuyingGroup, COUNT(*) AS Total
FROM dimension_customer
GROUP BY BuyingGroup
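Because Spark and the SQL endpoint read the same Delta table, the same aggregation can also be run from a notebook. A minimal sketch, assuming the notebook is attached to the LHDemo lakehouse:

# Same aggregation from a Fabric notebook attached to the LHDemo lakehouse;
# Spark reads the same Delta table that the SQL endpoint exposes.
df = spark.sql("""
    SELECT BuyingGroup, COUNT(*) AS Total
    FROM dimension_customer
    GROUP BY BuyingGroup
""")
df.show()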

Building a Report:

In the artifact view of the workspace, click on the LHDemo default semantic model, which is created automatically with the same name as the lakehouse when you create a lakehouse.

On the semantic model screen, you can view all the tables. You have options to create a report from scratch, create a paginated report, or let Power BI automatically create a report based on your data. For the purpose of this module, click Auto-create a report under Explore this data. In the next module, we will create a report from scratch.

Since the table is a dimension and there are no measures in it, Power BI smartly creates a measure for the row count, aggregates it across different columns, and creates different charts, as shown below.

You can save this report for later by clicking the Save button in the top ribbon and giving it a name. You can further modify the report to meet your requirements by including or excluding other tables or columns.

Ingest, Prep and Analyze:

Here we build on the work you completed in the previous example, where you loaded data using Dataflow Gen2, and ingest the additional tables (dimension and fact tables) of the sample Importers dimensional model described earlier in this blog. Next, you will use notebooks with the Spark runtime to transform and prepare the data. Finally, you will create a Power BI data model and build a report from scratch.
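The walkthrough below uses a pipeline for the ingest step. For code-first users, here is a hedged sketch of how the same landing step could be done from a notebook instead; it assumes the notebook has this lakehouse attached (so its Files area is mounted at /lakehouse/default/Files) and uses the same sample archive URL as the pipeline below.

import io
import zipfile
import requests

# Sketch of a code-first alternative to the pipeline ingest step below.
# Assumes the notebook has this lakehouse attached, so its Files section is
# mounted locally at /lakehouse/default/Files.
url = "https://assetsprod.microsoft.com/en-us/wwi-sample-dataset.zip"
response = requests.get(url, timeout=600)
response.raise_for_status()

# Unzip the sample archive straight into the lakehouse Files section.
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall("/lakehouse/default/Files")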

Data Ingestion

We will use the Copy data activity of a Data Factory pipeline to ingest sample data from the source (an Azure storage account) into the Files section of the lakehouse you created earlier.

Click on Data pipeline under New to create a new data pipeline.

Use the Copy assistant to ingest the data.

Next, set up an HTTP connection to import the sample Importers data into the Lakehouse. From the list of New sources, select View more, search for Http and select it.

URL: https://assetsprod.microsoft.com/en-us/wwi-sample-dataset.zip
Connection: Create a new connection
Connection name: wwisampledata
Data gateway: None
Authentication kind: Anonymous

Click Next, enable Binary copy, and choose ZipDeflate (.zip) as the Compression type since the source is a .zip file. Keep the other fields at their default values and click Next.

In the Connect to data destination window, specify the Root folder as Files and click Next. This will write the data to the Files section of the lakehouse.

Choose the File format as Binary for the destination. Click Next and then Save+Run. You can schedule pipelines to refresh data periodically. In this tutorial, we only run the pipeline once. The data copy process takes approximately 10-15 minutes to complete.

You can monitor the pipeline execution and activity in the Output tab. You can also view detailed data transfer information by selecting the glasses icon next to the pipeline name, which appears when you hover over the name.
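With the sample files landed in the Files section, the notebook-based preparation mentioned at the start of this section might look like the following sketch; the folder layout and table names are illustrative, not a prescription.

from pyspark.sql import functions as F

# Illustrative preparation step in a Fabric notebook (PySpark). The folder
# names under Files/ and the table list are placeholders for the sample data.
tables = ["dimension_city", "dimension_customer", "fact_sale"]

for name in tables:
    df = spark.read.parquet(f"Files/wwi-raw-data/{name}")

    # Light cleanup: trim string columns and drop exact duplicate rows.
    for col, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(col, F.trim(F.col(col)))

    # Save as a Delta table so it shows up under Tables in the lakehouse.
    df.dropDuplicates().write.mode("overwrite").format("delta").saveAsTable(name)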

In summary
The lakehouse architecture lets organizations combine the scalability, performance, and flexibility of a data lake with the structure and dependability of a data warehouse. With the next-generation Microsoft Fabric, organizations get a comprehensive analytics solution that simplifies operations while providing strong data protection, governance, and compliance.

Happy Reading!!
