I’ve been following the Mobius project for a while and have been waiting for this day. .NET for Apache Spark v0.1.0 was published on GitHub on 2019-04-25. It provides high-performance APIs for programming Apache Spark applications with C# and F#. It is .NET Standard compliant and can run on Windows, Linux and macOS with .NET Core. This is great news for all .NET developers, similar to the ML.NET announcement last year.

* Image from https://www.nuget.org/profiles/spark

In this article, I’m going to show you how to install it and use it with some hands-on examples.

Refer to the following official repo on GitHub for more details:

https://github.com/dotnet/spark

The Mobius project is now deprecated and has been replaced by this brand new project (.NET for Apache Spark).

Installation

Install the following frameworks/tools if you have not already done so:

For Spark installation, refer to my following post if you don’t know how to install it:

Install Spark 2.2.1 in Windows

*Note: the version in the above guide is 2.2.1; for .NET for Apache Spark you need to install Spark 2.3.*, 2.4.0 or 2.4.1.

Then download and install a Microsoft.Spark.Worker release:

  • Select a Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page and download it to your local machine (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0\).
  • Create a new environment variable named DotnetWorkerPath with its value set to the directory where you placed Microsoft.Spark.Worker in the preceding step (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0), for example via setx as shown below.
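
If you prefer the command line, you can set the variable with the setx command (a sketch; adjust the path to wherever you extracted the worker) and then reopen your command prompt so the new variable takes effect:

setx DotnetWorkerPath "F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0"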


Create a dotnet Spark application

Run the following command to create a new folder:

mkdir dotnet-spark

Change directory to this folder and create a .NET Core console application:

cd dotnet-spark
dotnet new console

Add a reference to the NuGet package Microsoft.Spark:

dotnet add dotnet-spark.csproj package Microsoft.Spark

Open the Program.cs file in the dotnet-spark folder. The program currently looks like the following:

using System;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Replace the code with the following:

using System;
using System.Net;
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download a sample CSV data set from the NYC Open Data portal.
            var dataUrl = "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv";
            var localFileName = "stats.csv";
            using (var client = new WebClient())
            {
                client.DownloadFile(dataUrl, localFileName);
            }

            // Create (or reuse) a Spark session and load the CSV file into a DataFrame.
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Csv(localFileName);

            // Show the first 10 rows of the first four columns and print the inferred schema.
            df.Select("_c0", "_c1", "_c2", "_c3").Show(10);
            df.PrintSchema();
        }
    }
}
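
The program downloads a sample CSV file from the NYC Open Data portal, creates a SparkSession, reads the CSV into a DataFrame, shows the first 10 rows of the first four (auto-named) columns and finally prints the inferred schema.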

Publish the application

Run the following command to publish the project so that an executable file is generated:

dotnet publish -c Debug -r win-x64 -o app -f netcoreapp2.2

*Note: change the runtime identifier to match your own environment. The above command publishes for the Windows x64 runtime and copies the output to the app folder.

For Linux or macOS users, change the command accordingly, especially the runtime identifier (-r), as shown in the example below.
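
For example, the following command (a sketch, assuming a 64-bit Linux machine) publishes for the linux-x64 runtime:

dotnet publish -c Debug -r linux-x64 -o app -f netcoreapp2.2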

Navigate to the app output folder:

cd app

In the app folder, you will find these files (among many others):

  • microsoft-spark-2.3.x-0.1.0.jar
  • microsoft-spark-2.4.x-0.1.0.jar
  • dotnet-spark.exe

Run the application in Spark

Now, we can submit the job to run in Spark using the following command:

%SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar dotnet-spark

The last argument is the name of the executable file; it works with or without the .exe extension.
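
On Linux or macOS, the equivalent command (a sketch, assuming SPARK_HOME points to your Spark installation and you run it from the app folder) is:

$SPARK_HOME/bin/spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar dotnet-spark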

The output shows the first 10 rows of the selected columns followed by the printed schema.

Read data from HDFS
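
Replace the Main method with the following code to read a CSV-format table directly from HDFS. Update the HDFS URL to match your own NameNode address and file path.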

static void Main(string[] args)
{
    ReadHDFS();
}

static void ReadHDFS()
{
    // CSV-format data stored under a Hive warehouse directory on HDFS.
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";

    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read HDFS example")
        .GetOrCreate();

    // Read the CSV data into a DataFrame, show the first 10 rows and print the schema.
    var df = spark.Read().Csv(hdfsURL);
    df.Show(10);
    df.PrintSchema();
}

The sample output lists the first 10 rows of the table followed by its schema.

Read data from Hive
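
Replace the Main method with the following code to query Hive tables via the Hive metastore. Update the hive.metastore.uris value to match your own metastore Thrift service address.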

static void Main(string[] args)
{
    ReadFromHive();
}

static void ReadFromHive()
{
    var master = "local";

    // Build a session with Hive support enabled, pointing at the Hive metastore Thrift service.
    var spark = SparkSession
        .Builder()
        .Config("hive.metastore.uris", "thrift://localhost:9083")
        .AppName(".NET for Spark - Read from Hive example")
        .EnableHiveSupport()
        .Master(master)
        .GetOrCreate();

    // List the available databases.
    var df = spark.Sql("show databases");
    df.Show(10);

    // Query a Hive table.
    df = spark.Sql("select * from test_db.test_table");
    df.Show();
}
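
EnableHiveSupport turns on Hive integration for the session, so the Spark SQL statements above (show databases and the select query) run against the Hive metastore configured through hive.metastore.uris.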

The sample output lists the available databases followed by the rows of test_db.test_table.

Read and write parquet files
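
Replace the Main method with the following code, which reads the same CSV data from HDFS, adds a literal column, writes the result out as Parquet and then reads the Parquet files back into a new DataFrame.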

static void Main(string[] args)
{
    ReadAndWriteParquetFiles();
}

static void ReadAndWriteParquetFiles()
{
    var master = "local[2]";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read and write Parquet example")
        .Master(master)
        .GetOrCreate();

    // Read the CSV-format table from HDFS.
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
    var df = spark.Read().Csv(hdfsURL);

    // Add a literal string column and write the result out as Parquet files.
    df.WithColumn("NewColumn", Functions.Lit("Value").Cast("string"))
        .Write()
        .Parquet("test.parquet");

    // Read the Parquet files back into a new DataFrame and show the content.
    var df2 = spark.Read().Parquet("test.parquet");
    df2.Show();
}
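
Note that Spark writes test.parquet as a directory of part files rather than a single file, and the default save mode raises an error if the path already exists, so delete the previous output (or change the save mode) before re-running the example.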

Summary

As you can see in the above examples, you can use very similar fluent APIs to write Spark applications in C#. At the moment, not all of the Mobius APIs have been ported to this new project; for example, the SparkContext.Parallelize function is not implemented yet. But don’t worry, as it is only version 0.1.0 at the moment, and you can still complete the majority of common actions and tasks even with this version. Let’s look forward to future releases.
