Get Started on .NET 5 with Apache Spark

.NET for Apache Spark 1.0 was officially released on 14 October 2020. The release was built against .NET Core 3.1. Because .NET for Apache Spark targets .NET Standard, it should also work with .NET 5. This article shows how to use .NET 5 with Apache Spark.

* Image from dotnet/spark: .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

Installation

Install the following frameworks and tools if you have not already done so:

For Spark installation, refer to the following post if you are not sure how to install it:

Install Spark 3.0.0 on Windows 10

Then download and install a Microsoft.Spark.Worker release:

Select a Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page and download it to your local machine, for example to F:\big-data\Microsoft.Spark.Worker-1.0.0.
Create a new environment variable named DOTNET_WORKER_DIR (early pre-release versions used DotnetWorkerPath) and set its value to the directory where you placed Microsoft.Spark.Worker in the preceding step.

Info: To run .NET applications in a distributed cluster environment, the .NET worker (Microsoft.Spark.Worker) needs to be deployed to each node.
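Before submitting a job, it can help to confirm that the worker directory is actually visible to your process. The following is a minimal sketch, not part of .NET for Apache Spark itself: FindWorkerDir is a hypothetical helper, and it checks both DOTNET_WORKER_DIR and DotnetWorkerPath because the variable name has differed between releases.

```csharp
using System;
using System.IO;

static class WorkerCheck
{
    // Hypothetical helper: returns the first existing directory referenced by
    // any of the given environment variables, or null if none is usable.
    public static string FindWorkerDir(params string[] variableNames)
    {
        foreach (var name in variableNames)
        {
            var value = Environment.GetEnvironmentVariable(name);
            if (!string.IsNullOrWhiteSpace(value) && Directory.Exists(value))
            {
                return value;
            }
        }
        return null;
    }

    static void Main()
    {
        // Assumption: one of these two names is the one your release reads.
        var dir = FindWorkerDir("DOTNET_WORKER_DIR", "DotnetWorkerPath");
        Console.WriteLine(dir == null
            ? "Worker directory not found; check your environment variables."
            : $"Worker directory: {dir}");
    }
}
```

Remember that a newly created environment variable is only visible to terminals opened after it was set.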

https://api.kontext.tech/resource/10eab4ec-3067-5c8a-a359-f0eb0e123110

Create a dotnet Spark application

Run the following command to create a new folder:

mkdir dotnet5-spark

Change directory to this folder and create a .NET console application:

cd dotnet5-spark
dotnet new console

Add a reference to the NuGet package Microsoft.Spark:

dotnet add dotnet5-spark.csproj package Microsoft.Spark

Open the Program.cs file in the dotnet5-spark folder. The program currently looks like the following:

using System;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Replace the code with the following:

using System;
using System.Net;
using Microsoft.Spark.Sql;
using System.IO;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            /*Download a file from Internet*/
            var dataUrl = "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv";
            var localFileName = "stats.csv";
            using (var client = new WebClient())
            {
                client.DownloadFile(dataUrl, localFileName);
            }
            var localFilePath = Path.GetFullPath(localFileName);
            Console.WriteLine("Local CSV file path is: {0}", localFilePath);

            /*Create Spark session*/
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Csv($"file:///{localFilePath}");
            df.Select("_c0", "_c1", "_c2", "_c3").Show(10);
            df.PrintSchema();
        }
    }
}

Note: the path passed to the Csv function refers to a local file (file:///). On a single-node cluster this doesn't matter much, but it does matter in a cluster environment, where every node must be able to read the path.
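The program above builds the URI by prefixing the full path with file:///. On Linux or macOS an absolute path already starts with a slash, so that approach yields file:////... with an extra slash. The sketch below shows one way to normalize the path on either platform; FileUri.FromAbsolutePath is a hypothetical helper written for this article, not part of Microsoft.Spark.

```csharp
using System;

static class FileUri
{
    // Hypothetical helper: turns an absolute local path into a file:/// URI.
    // Backslashes become forward slashes and leading slashes are trimmed so
    // that exactly three slashes follow "file:".
    public static string FromAbsolutePath(string absolutePath)
    {
        var normalized = absolutePath.Replace('\\', '/');
        return "file:///" + normalized.TrimStart('/');
    }

    static void Main()
    {
        Console.WriteLine(FromAbsolutePath(@"F:\big-data\stats.csv")); // file:///F:/big-data/stats.csv
        Console.WriteLine(FromAbsolutePath("/home/user/stats.csv"));   // file:///home/user/stats.csv
    }
}
```

In the program above you could then pass FileUri.FromAbsolutePath(localFilePath) to Csv. The helper does not escape special characters, so treat it as a starting point only.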

Publish the application

Run the following command to publish the project so that an executable file is generated:

dotnet publish -c Debug -r win-x64 -o app -f net5.0
Microsoft (R) Build Engine version 16.8.0+126527ff1 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  Determining projects to restore...
  Restored F:\Projects\dotnet5-spark\dotnet5-spark.csproj (in 214 ms).
  dotnet5-spark -> F:\Projects\dotnet5-spark\bin\x64\Debug\net5.0\win-x64\dotnet5-spark.dll
  dotnet5-spark -> F:\Projects\dotnet5-spark\app\

*Remember to change the runtime identifier to match your own platform. The above command publishes for the Windows x64 runtime and copies the output to the app folder.

For Linux or macOS users, change the command accordingly, especially the -r argument.
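If you are unsure which value to pass to -r, the small sketch below derives a portable runtime identifier (such as win-x64, linux-x64 or osx-x64) from the operating system and CPU architecture and prints the matching publish command. PublishRid.GetRid is a hypothetical helper for illustration, and the mapping covers only the common cases.

```csharp
using System;
using System.Runtime.InteropServices;

static class PublishRid
{
    // Hypothetical helper: maps an OS and CPU architecture to a portable
    // runtime identifier of the form "<os>-<arch>".
    public static string GetRid(OSPlatform os, Architecture arch)
    {
        string osPart;
        if (os == OSPlatform.Windows) osPart = "win";
        else if (os == OSPlatform.Linux) osPart = "linux";
        else if (os == OSPlatform.OSX) osPart = "osx";
        else throw new PlatformNotSupportedException($"No RID mapping for {os}.");

        return $"{osPart}-{arch.ToString().ToLowerInvariant()}";
    }

    static void Main()
    {
        // Detect the platform this program is running on.
        var os = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? OSPlatform.Windows
               : RuntimeInformation.IsOSPlatform(OSPlatform.Linux) ? OSPlatform.Linux
               : OSPlatform.OSX;

        var rid = GetRid(os, RuntimeInformation.OSArchitecture);
        Console.WriteLine($"dotnet publish -c Debug -r {rid} -o app -f net5.0");
    }
}
```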

Navigate to the app output folder:

cd app

The app folder contains these files, among many others:

microsoft-spark-2-3_2.11-1.0.0.jar
microsoft-spark-2-4_2.11-1.0.0.jar
microsoft-spark-3-0_2.12-1.0.0.jar
dotnet5-spark.exe

The JAR files support different versions of Spark.

Warning: Spark 2.4.2 is not supported.
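To make the choice of JAR explicit, here is a minimal sketch that maps a Spark version to one of the file names listed above. The mapping is an assumption read off the file names (2-3 for Spark 2.3.x, 2-4 for Spark 2.4.x, 3-0 for Spark 3.0.x); SparkJar.ForSparkVersion is a hypothetical helper, and the release notes remain the authority on which patch versions are supported.

```csharp
using System;

static class SparkJar
{
    // Hypothetical helper: picks the published JAR that matches a Spark version.
    // The version-to-file mapping is inferred from the file names in the app folder.
    public static string ForSparkVersion(string sparkVersion)
    {
        if (sparkVersion == "2.4.2")
        {
            throw new NotSupportedException("Spark 2.4.2 is not supported.");
        }
        if (sparkVersion.StartsWith("2.3.", StringComparison.Ordinal))
        {
            return "microsoft-spark-2-3_2.11-1.0.0.jar";
        }
        if (sparkVersion.StartsWith("2.4.", StringComparison.Ordinal))
        {
            return "microsoft-spark-2-4_2.11-1.0.0.jar";
        }
        if (sparkVersion.StartsWith("3.0.", StringComparison.Ordinal))
        {
            return "microsoft-spark-3-0_2.12-1.0.0.jar";
        }
        throw new NotSupportedException($"No JAR listed for Spark {sparkVersion}.");
    }

    static void Main()
    {
        Console.WriteLine(ForSparkVersion("3.0.0")); // microsoft-spark-3-0_2.12-1.0.0.jar
    }
}
```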

Run the application in Spark

Now, we can submit the job to run in Spark using the following command:

%SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-3-0_2.12-1.0.0.jar dotnet5-spark

The last argument is the executable file name. It works with or without the extension.

Warning: In the official 1.0 release, the Java class name is org.apache.spark.deploy.dotnet.DotnetRunner, which differs from the name used in early pre-releases (org.apache.spark.deploy.DotnetRunner).

The output looks like the following:

https://api.kontext.tech/resource/aa5b40f6-375f-5836-868f-37071840ad3e

Read data from HDFS

        static void Main(string[] args)
        {
            ReadHDFS();
        }

        static void ReadHDFS()
        {
            var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
            var spark = SparkSession
                .Builder()
                .AppName(".NET for Spark - Read HDFS example")
                .GetOrCreate();
            var df = spark.Read().Csv(hdfsURL);
            df.Show(10);
            df.PrintSchema();
        }

Sample output:

https://api.kontext.tech/resource/d44e086d-fd29-5844-8c15-e1d1a67a138a

Read and write parquet files

        static void Main(string[] args)
        {
            ReadAndWriteParquetFiles();
        }

        static void ReadAndWriteParquetFiles()
        {
            var master = "local[2]";
            var spark = SparkSession
                .Builder()
                .AppName(".NET for Spark - Read and write Parquet example")
                .Master(master)
                .GetOrCreate();

            // No scheme in the path: it resolves against the default file system configured for Spark
            var hdfsURL = "/user/hive/warehouse/test_db.db/test_table";
            var df = spark.Read().Csv(hdfsURL);

            // Add a constant string column and write the result as Parquet files
            df.WithColumn("NewColumn", Functions.Lit("Value").Cast("string"))
                .Write()
                .Parquet("test.parquet");

            // Read the Parquet files back and display them
            var df2 = spark.Read().Parquet("test.parquet");
            df2.Show();
        }

The output looks like the following screenshot:

Summary

As the examples above show, you can write Spark applications in C# using fluent APIs that closely resemble Spark's own. F# is also supported. However, .NET for Apache Spark is not yet part of the standard Spark release. To track progress, follow JIRA [SPARK-27006] SPIP: .NET bindings for Apache Spark.

There are also some known issues. You can find more details on GitHub: .NET for Apache Spark release-1.0.0 Known Issues.