Advanced analytics on big data with Azure - Tutorial

Raymond Tang | 8/11/2018

Microsoft Azure provides a number of data analytics related products and services. Users can tailor solutions to meet different requirements, for example a modern data warehouse architecture, advanced analytics with big data, or real-time analytics.

The following diagram shows a typical advanced analytics solution architecture on Azure.

https://api.kontext.tech/resource/9003ba6a-6c46-51c9-b004-e7573366097c

The architecture has four main layers:

  • Ingest: Azure Data Factory is used for data ingestion. Azure Data Factory is a hybrid and serverless data integration (ETL) service which works with data wherever it lives, in the cloud or on-premises, with enterprise-grade security.
  • Store: Data can be stored in Azure storage products including File, Disk, Blob, Queue, Archive and Data Lake Storage. Azure Data Lake Storage is a massively scalable data lake storage optimized for Apache Spark and Hadoop analytics engines.
  • Prep & Train: In this layer, data can be cleaned and transformed using Azure Databricks. Databricks is an Apache Spark based analytics platform. Python, Scala, R and other language interpreters can be used to analyse the data, and Spark machine learning libraries can be used to create models (see the sketch after this list).
  • Model & Serve: After data preparation and transformation, the insights or output data can be written into Azure storage products or databases. Cosmos DB is Microsoft's globally distributed multi-model database. Azure SQL Data Warehouse is a massively parallel processing (MPP) data warehouse designed for the cloud. The data can then be consumed via applications or analysis services. Data stored in Azure Data Lake Storage can also be used in SQL Data Warehouse through the PolyBase feature in SQL Server.
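To make the Prep & Train layer more concrete, the following is a minimal PySpark sketch of the kind of cleansing job that could run on Azure Databricks. The storage account, container and column names are made up for illustration; only the read-transform-write pattern matters here.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("PrepAndTrainSketch").getOrCreate()

# Read raw data from Azure Data Lake Storage Gen2 (hypothetical account and container)
raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/blogs_raw")

# Basic cleansing: remove duplicates and derive a proper date column
cleaned = raw.dropDuplicates().withColumn("DatePublished_DATE", to_date(col("DatePublished")))

# Write the curated data back to the lake for the Model & Serve layer
cleaned.write.mode("overwrite").parquet("abfss://curated@mydatalake.dfs.core.windows.net/blogs_curated")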

Objectives

In this page, I am going to demonstrate how to ingest data from an on-premises database into the cloud using Data Factory, transform the data using HDInsight Spark, and then load it into Azure SQL Data Warehouse. Eventually the data is consumed by Power BI.

The following diagram shows the steps required to complete the whole exercise.

https://api.kontext.tech/resource/3ed8233f-97b2-5b80-be28-f5b021b9a889

Prerequisites

  • Azure subscription
  • Azure storage account
  • Azure HDInsight Spark cluster
  • Azure SQL Data Warehouse instance (optional)
  • Power BI (optional)
  • Access to Azure portal
  • On-premises SQL Server instance or any other data source

Detailed steps

Create Azure Data Factory

https://api.kontext.tech/resource/8c59ccf2-d2fa-551c-aaba-ca0dda37d4cc

Once the resource is created, click the Author & Monitor button to go to the Azure Data Factory workspace.

https://api.kontext.tech/resource/f6506f10-590d-554e-8508-173a25b05a58

The workspace looks like the following:

https://api.kontext.tech/resource/40913c11-cc96-597f-a98d-dacba45d163f

Create new pipeline

Click the ‘Create pipeline’ button to create a new pipeline.

Create an activity to copy data from on-premise to Cloud

Drag a ‘Copy Data’ activity from the Data Transformation category to create an activity named CopyDataFromSQLServerToAzureBlobStorage.

https://api.kontext.tech/resource/199d0823-8464-5ff2-bc82-a403ae774bb8

Then click the Source tab to set up the source data.

https://api.kontext.tech/resource/b02a10c5-4d04-580e-8c3e-b46471df997e

Click the ‘+ New’ button to create a new source.

https://api.kontext.tech/resource/456b8b15-dd3d-5a41-bfc1-bc7388437c0b

Select SQL Server as the Data Store and then click the Finish button to go to the dataset UI to set up the properties for the dataset.

Name the dataset as Premise_Mecury_Db_context_BlogPost:

https://api.kontext.tech/resource/49eb952a-b1e7-57ea-a234-5e1f99dbef38

Then click the Connection tab to set up the connection for the database.

https://api.kontext.tech/resource/109aa45d-b7f2-5f8a-9650-79bee04a29dc

Click the ‘+ New’ button to create a linked service.

Now follow steps 6 to 15 on the following page to set up the linked service:

https://docs.microsoft.com/en-us/azure/data-factory/tutorial-hybrid-copy-portal#create-a-pipeline

Once the setup is successful, you can select a table in your local SQL Server database.

For example, I am using context.BlogPosts as the source table:

https://api.kontext.tech/resource/14fe4ba6-1dcd-53c5-9136-81db1333671b

You can also detect the schema, as the following screenshot shows:

https://api.kontext.tech/resource/7b88aa97-405c-5523-8a7f-a52c8383b8db

Go back to the pipeline:

https://api.kontext.tech/resource/e1ac5a33-4cf4-5c9f-8082-cf35a0254bc9

Now we can set up the Sink.

https://api.kontext.tech/resource/b6fde2ef-e0e2-5e4f-a63f-28a42102db27

Click the ‘+ New’ button to create a new dataset.

This time we use Azure Blob Storage as the Data Store:

https://api.kontext.tech/resource/917c13a1-1bc5-5551-97b4-d3b677d971d5

Click the Finish button to continue editing this dataset:

https://api.kontext.tech/resource/40a08661-c071-5cde-8c78-e5142bc97707

Then set up the Connection by creating a new linked service that connects to one of your existing storage accounts. Use Parquet as the file format.

You can also create a new storage account:

https://api.kontext.tech/resource/959a372f-f808-5b9b-a651-de860d24316b

The following screenshot shows my settings for the linked service to Azure Blob Storage.

https://api.kontext.tech/resource/0e043f08-f76a-55e9-b8b8-bf1fb129f02c

A file path is also required for writing into the blob container:

https://api.kontext.tech/resource/ad11caee-f3a7-5759-8e50-77cdb6041166

At this point, we have completed the steps to ingest data from the on-premises SQL Server database into the cloud. The following items are created in the data factory:

  • Data pipeline: pipeline1
  • Dataset for on-premise data: OnPremise_Mecury_Db_context_BlogPost
  • Dataset for Azure Blob Storage: DA_DEMO_ABS

Rename pipeline1 to DA-DEMO-Pipeline and then click the Publish All button to publish the changes.
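If you prefer scripting over the portal UI, the same pipeline can also be defined with the Azure Data Factory Python SDK (azure-mgmt-datafactory). The sketch below is only indicative: the credential values, resource group and factory name are placeholders, the two datasets and their linked services are assumed to have been created already (as above), and constructor arguments may differ slightly between SDK versions.

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, SqlSource, BlobSink, DatasetReference)

# Placeholder credentials and names - replace with your own values
credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<client-id>",
                                    client_secret="<client-secret>")
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Copy activity mirroring CopyDataFromSQLServerToAzureBlobStorage
copy_activity = CopyActivity(
    name="CopyDataFromSQLServerToAzureBlobStorage",
    inputs=[DatasetReference(reference_name="OnPremise_Mecury_Db_context_BlogPost")],
    outputs=[DatasetReference(reference_name="DA_DEMO_ABS")],
    source=SqlSource(),
    sink=BlobSink())

# Publish the pipeline definition to the data factory
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "DA-DEMO-Pipeline",
    PipelineResource(activities=[copy_activity]))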

Create an HDInsight Spark activity to transform data

Now drag a Spark activity into the pipeline:

https://api.kontext.tech/resource/44c6b905-59e6-5fd3-9d30-7ab5352f5ef7

Now we need to configure the HDInsight cluster for this Spark activity.

https://api.kontext.tech/resource/bbc43ddc-06c8-59b2-ae1d-a29293a2159c

For this tutorial, I am using a predefined HDInsight cluster and linking the Azure Storage account to it.

Then set up the script using the following code:

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .appName("SparkonADF - Transform") \
    .enableHiveSupport() \
    .getOrCreate()

# import here so that we can use the SQL functions
from pyspark.sql.functions import *

# read the raw Parquet files back into a data frame
blobPath = "wasb://demo@dademoabs.blob.core.windows.net/blogs_raw"
sqlContext = SQLContext(spark.sparkContext, spark)
blogs = sqlContext.read.parquet(blobPath)
blogs.registerTempTable("blogs")

# transformation to add a new date column
blogs_new = sqlContext.sql("SELECT *, CAST(DatePublished AS DATE) AS DatePublished_DATE FROM blogs")

# register temp table
blogs_new.registerTempTable("blogs_new")

# save as new Parquet files
blobPathNew = "wasb://demo@dademoabs.blob.core.windows.net/blogs_transformed"
blogs_new.write.mode("overwrite").parquet(blobPathNew)

https://api.kontext.tech/resource/534ac866-c75d-5268-a9ac-7702536fa692

In my environment, the script file is named transform.py and is stored in the sparktransform folder.

In YARN, we should be able to see the running application:

https://api.kontext.tech/resource/21b19bdd-a673-57fe-95fd-da283c13ea4a
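Once the activity has run successfully, a quick way to confirm the output is to read the transformed Parquet files back from a Spark shell or notebook on the cluster. This assumes the same wasb path used in the script above and an existing spark session:

# Read the transformed Parquet files back and inspect the derived column
transformed = spark.read.parquet("wasb://demo@dademoabs.blob.core.windows.net/blogs_transformed")
transformed.printSchema()
transformed.select("DatePublished", "DatePublished_DATE").show(5)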

Create another Copy Data activity to write transformed data into Azure SQL Data Warehouse

You can now create another Copy Data activity to load transformed data into Azure SQL Data Warehouse (Sink).

For more details, please refer to https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse

As I have limited credit for my current subscription, I will skip this step.
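For completeness, a rough alternative is to write the transformed data frame straight from Spark into Azure SQL Data Warehouse over JDBC, bypassing the Copy Data activity. The server, database, table and credential values below are placeholders, and the SQL Server JDBC driver must be available on the cluster; for large production loads, the Copy Data activity (or PolyBase) described in the linked documentation is usually the better option.

# Hypothetical JDBC write from the transformed data frame to Azure SQL Data Warehouse
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

blogs_new.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "dbo.BlogPosts_Transformed") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .mode("overwrite") \
    .save()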

Visualize the data in Power BI

Now we can visualize the data through Power BI. Whether the transformed data is stored in Azure Blob Storage or Azure SQL Data Warehouse, Power BI has connectors available to get data from both.

HDInsight Spark can also be used as a source if we save the transformed data as a table in Hive.

https://api.kontext.tech/resource/6a191a08-5d41-528c-ba40-39f914e4b608

For example, the following code snippet shows how to save the data frame as a table:

# save as table
blogs_new.write.mode("overwrite").saveAsTable("blogs_transformed")
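Since the table is saved through Hive, it becomes visible to the cluster's Spark Thrift endpoint, which is what the Power BI Spark connector queries. A quick sanity check from the same Spark session could look like this:

# confirm the Hive table exists and contains data
spark.sql("SHOW TABLES").show()
spark.sql("SELECT COUNT(*) AS row_count FROM blogs_transformed").show()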

In my case, I will use HDInsight Spark as the data source for visualization.

The detailed steps on how to use HDInsight Spark as a source are available here: https://docs.microsoft.com/en-us/power-bi/spark-on-hdinsight-with-direct-connect.

Input server:

https://api.kontext.tech/resource/d8e45035-c85d-52cb-b6bd-ec2ce61d3a59

Click the OK button to continue.

Then input the user name and password:

https://api.kontext.tech/resource/d5073061-72cd-51f3-8c85-0995791349da

Click the Connect button to continue.

The available Hive tables will be displayed:

https://api.kontext.tech/resource/888f72fa-03bb-5089-84a3-a774d560f87b

In this case, select blogs_transformed as the source.

Click the Load button to load the data into the workspace.

And then, we can visualize the data in Power BI:

https://api.kontext.tech/resource/990aa3b6-53c7-55d1-9c2b-d748bc1d42f9

Remember to enable the following option if creating running total measures.

https://api.kontext.tech/resource/4a4bc6af-2962-59dc-9f86-718edb0000fa

Once it is done, the report can also be published to powerbi.com:

https://api.kontext.tech/resource/4b14e52a-a70e-5f55-b51c-4ed7dd9a83ec

Once published, you can view the report on the website:

https://api.kontext.tech/resource/c9967789-747b-56e2-a831-908057a7458b

Summary

With Azure Data Factory, it is very easy to move data between on-premises and cloud environments by using the Copy Data activity; you can also use Spark, Databricks, and Machine Learning activities to perform transformations or create models.

The transformed data can then be stored in different storage products for different consumption scenarios.

