
Apache Spark is an open-source parallel processing platform for large-scale data processing and analytics. Spark has become popular in “big data” processing contexts and is available in several platform implementations, including Microsoft Fabric, Azure Synapse Analytics, and Azure HDInsight.
In Microsoft Fabric, you can use Spark to ingest, process, and analyze data in a Lakehouse. Although every Spark implementation shares the fundamental techniques and code covered in this module, Microsoft Fabric’s integrated tools and Spark’s ability to operate in the same environment as your other data services make it simpler to integrate Spark-based data processing into your overall data analytics solution.
How to configure Spark Settings
When you submit a data processing job in the form of code, a driver program is launched, which uses a cluster-management object called the SparkContext to control how processing is distributed across the Spark cluster. In most cases, these details are abstracted away, so you only need to write the code that performs the required data operations.
Every workspace in Microsoft Fabric has a designated Spark cluster. Under the Data Engineering/Science area of the workspace settings, an administrator can control the Spark cluster’s configuration.


Specific configuration settings include:
Node Family: The type of virtual machines used for the Spark cluster nodes. Memory-optimized nodes often offer the best performance.
Runtime version: The version of Spark (and its dependent subcomponents) to be run on the cluster.
Spark Properties: Spark-specific settings that you can enable or modify for your cluster. You can see the list of available properties in the Apache Spark documentation: https://spark.apache.org/docs/latest/configuration.html#available-properties
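As a sketch of how these properties surface in code, you can also inspect or override a Spark property for the current session from a notebook (the property shown here is just an example):
# Inspect a built-in Spark property for the current session
print(spark.conf.get("spark.sql.shuffle.partitions"))
# Override the property for this session only (example value)
spark.conf.set("spark.sql.shuffle.partitions", "200")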
Microsoft Fabric’s Spark clusters include many commonly used libraries. To set additional default libraries or persist library specifications for code items, you need workspace admin permissions to create an environment and set it as the workspace’s default environment. For more information, see the library management documentation:
https://learn.microsoft.com/en-us/fabric/data-engineering/library-management
Running the Spark Code:
To edit and run Spark code in Microsoft Fabric, you can use notebooks, or you can define a Spark job.
Use a Spark dataframe to work with data:
The most commonly used data structure for working with structured data in Spark is the dataframe, which is provided as part of the Spark SQL library. Natively, Spark uses a data structure called a resilient distributed dataset (RDD), and although you can write code that works directly with RDDs, dataframes are the more typical choice. Similar to dataframes in the widely used Pandas Python library, Spark dataframes are designed to work efficiently in Spark’s distributed processing environment.
Create a new Lakehouse on the Synapse Data Engineering homepage and give it whatever name you choose.
Download the exercise’s data files from https://github.com/MicrosoftLearning/dp-data/raw/main/orders.zip.
Extract the zipped archive and verify that you have a folder named orders that contains CSV files named 2019.csv, 2020.csv, and 2021.csv.
Spark Jobs:

Create a workspace named DatahourTT with a Lakehouse named OrdersLH.

Return to the browser tab that contains your Lakehouse, and then upload the orders folder from your local computer (or lab virtual machine, if applicable) to the Lakehouse by using the Upload option for the Files folder in the Explorer pane.
Once the files have been uploaded, expand Files, select the orders folder, and confirm that the CSV files are present.


Create a notebook
To work with data in Apache Spark, you can create a notebook. Notebooks let you write and run code (in multiple languages) and annotate it with notes.

After a few seconds, a new notebook containing a single cell will open. A notebook can contain one or more cells, each of which can hold code or markdown (formatted text). To change the first cell (which is currently a code cell) to a markdown cell, select it and then use the M↓ button in the dynamic toolbar at its top right. When the cell becomes a markdown cell, the text it contains is rendered as formatted text.
Load data into a dataframe:
Now you can run the code that loads the data into a dataframe. Like Pandas dataframes in Python, Spark dataframes provide a common structure for working with rows and columns of data.
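The code for this cell isn't included here; a minimal sketch of reading one of the CSV files, assuming the files have no header row (the path and options are taken from the rest of this exercise):
df = spark.read.format("csv").option("header", "false").load("Files/orders/2019.csv")
display(df)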



Re-run the cell and review the output. With the header option set to false on the CSV read, Spark assigns default column names such as _c0 and _c1.

To define a schema, edit the code as follows, then load the data using the schema.
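The edited code isn't shown in the original; a sketch of defining and applying a schema follows. Apart from the columns referenced elsewhere in this exercise (OrderDate, CustomerName, Email, Item, Quantity), the column names and types are assumptions:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, FloatType

# Define a schema for the order data (columns beyond those used in this exercise are assumptions)
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType())
])

# Load the data using the explicit schema instead of default column names
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")
display(df)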


The dataframe contains only the data from the 2019.csv file. Modify the code so that the sales order data is read from all of the files in the orders folder by using a * wildcard in the file path:
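A sketch of the modified load, reusing the orderSchema defined above:
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv")
display(df)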

Explore data in a dataframe:
The dataframe object includes a wide range of functions that you can use to filter, group, and otherwise manipulate the data it contains.
customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())

An operation on a dataframe creates a new dataframe; in this example, the operation creates a new customers dataframe by selecting a specific subset of columns from the df dataframe.
Dataframes provide functions such as count and distinct that you can use to summarize and filter the data they contain.

Aggregate and group data in a dataframe
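The code cell referred to below isn't included in the original; a minimal sketch, assuming the df dataframe and the Item and Quantity columns used in this exercise:
# Group the rows by Item and sum the remaining numeric columns (here, Quantity)
productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)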

Run the code cell you added, and note that the results show the sum of order quantities grouped by product. The groupBy method groups the rows by Item, and the subsequent sum aggregate function is applied to all of the remaining numeric columns (in this case, Quantity).

Use Spark to transform data files

Run code (see the sketch after this list) to create a new dataframe from the original order data with the following transformations:
- Add Year and Month columns based on the OrderDate column.
- Add FirstName and LastName columns based on the CustomerName column.
- Filter and reorder the columns, removing the CustomerName column.
You can use the full power of the Spark SQL library to transform the data by filtering rows, deriving, removing, and renaming columns, and applying any other required data modifications.
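The transformation code isn't included in the original; a sketch of the steps described above, assuming the df dataframe and the column names used elsewhere in this exercise:
from pyspark.sql.functions import col, year, month, split

# Add Year and Month columns derived from OrderDate, and FirstName/LastName derived from CustomerName
transformed_df = df.withColumn("Year", year(col("OrderDate"))) \
    .withColumn("Month", month(col("OrderDate"))) \
    .withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)) \
    .withColumn("LastName", split(col("CustomerName"), " ").getItem(1))

# Filter and reorder the columns, removing the CustomerName column
transformed_df = transformed_df["OrderDate", "Year", "Month", "FirstName", "LastName", "Email", "Item", "Quantity"]
display(transformed_df.limit(5))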
Add a new cell with the following code to save the transformed dataframe in Parquet format (overwriting the data if it already exists):

transformed_df.write.mode("overwrite").parquet("Files/transformed_data/orders")
print("Transformed data saved!")

Parquet format is commonly preferred for data files that you will use for further analysis or ingestion into an analytical store. Parquet is a very efficient format that is supported by most large-scale data analytics systems. In fact, sometimes your data transformation requirement may simply be to convert data from another format (such as CSV) to Parquet!
Run the cell and wait for the message that the data has been saved. Then, in the Lakehouse pane on the left, in the … menu for the Files node, select Refresh, and select the transformed_data folder to verify that it contains a new folder named orders, which in turn contains one or more Parquet files.
Add a new cell with the following code to load a new dataframe from the parquet files in the transformed_data/orders folder:
orders_df = spark.read.format("parquet").load("Files/transformed_data/orders")
display(orders_df)

Save data in partitioned files
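Writing the data with partitionBy organizes the output into a subfolder hierarchy for each partition value. The original code cell isn't included; a sketch that partitions by the Year and Month columns added earlier (the dataframe name and output path are assumptions):
# Write the data partitioned by Year and Month, overwriting any existing output
orders_df.write.partitionBy("Year", "Month").mode("overwrite").parquet("Files/partitioned_data/orders.parquet")
print("Partitioned data saved!")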


Add a new cell with the following code to load a new dataframe from the orders.parquet file:
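The original cell isn't shown; a sketch that assumes the output path used in the previous sketch. You can also target a specific partition subfolder (for example, Year=2021) in the path:
orders_df = spark.read.format("parquet").load("Files/partitioned_data/orders.parquet")
display(orders_df)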

Work with tables and SQL
As you’ve seen, the native methods of the dataframe object work well for querying and examining data in files. However, many data analysts are more comfortable working with tables that they can query using SQL syntax. Spark provides a metastore in which relational tables can be defined, and the Spark SQL library that supplies the dataframe object also lets you use SQL statements to query tables in that metastore. By combining the flexibility of a data lake with the structured schema and SQL-based queries of a relational data warehouse, these Spark capabilities are what give rise to the term “data lakehouse”.
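The code that creates the table isn't included in the original; a minimal sketch that saves the df dataframe as a managed table named salesorders, using the delta format that Fabric Lakehouse tables are based on:
# Save the dataframe as a managed delta table in the Lakehouse metastore
df.write.format("delta").saveAsTable("salesorders")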

In the Lakehouse pane, in the … menu for the Tables folder, select Refresh. Then expand the Tables node and verify that the salesorders table has been created.

Run SQL code in a cell

The %%sql line at the beginning of the cell (called a magic) indicates that the Spark SQL language runtime should be used to run the code in this cell instead of PySpark.
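For example, a cell like the following queries the table created earlier (the table and column names assume the salesorders table and the order data used in this exercise):
%%sql
SELECT Item, SUM(Quantity) AS TotalQuantity
FROM salesorders
GROUP BY Item
ORDER BY TotalQuantity DESC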
Visualize data with Spark

To visualize the data as a chart, we’ll start by using the matplotlib Python library. This library is the core plotting library on which many others are based, and provides a great deal of flexibility in creating charts.


To use matplotlib, you must convert the Spark dataframe produced by the Spark SQL query into a Pandas dataframe.
The pyplot object is the central component of the matplotlib library; most charting functionality is built around it.
Although the chart works with the default settings, it offers a great deal of scope for customization.
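A minimal sketch that puts these steps together (the query, table, and column names are assumptions based on the salesorders table used earlier):
import matplotlib.pyplot as plt

# Run a Spark SQL query and convert the result to a Pandas dataframe for matplotlib
df_sales = spark.sql("SELECT Item, SUM(Quantity) AS TotalQuantity FROM salesorders GROUP BY Item").toPandas()

# Use the pyplot object to create and customize a bar chart
plt.clf()
plt.bar(x=df_sales["Item"], height=df_sales["TotalQuantity"], color="orange")
plt.title("Total Quantity by Item")
plt.xlabel("Item")
plt.ylabel("Quantity")
plt.xticks(rotation=70)
plt.show()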

Conclusion:
As demand for analytics and machine learning grows more complex, the integration of Spark notebooks with Fabric will continue to deepen, especially when they are used with Power BI to build dynamic, real-time dashboards. Spark notebooks are a crucial component of Microsoft Fabric that lets data professionals handle challenging workloads effectively, and their capabilities on the platform will keep improving as the data landscape evolves.
Happy Reading!!