I’ve been following the Mobius project for a while and have been waiting for this day. .NET for Apache Spark v0.1.0 was published on GitHub on 2019-04-25. It provides high-performance APIs for programming Apache Spark applications with C# and F#. It is .NET Standard compliant and can run on Windows, Linux and macOS with .NET Core. This is great news for all .NET developers, similar to the ML.NET announcement last year.

* Image from https://www.nuget.org/profiles/spark

In this article, I’m going to show you how to install it and use it with some hands-on examples.

Refer to the following official repo on GitHub for more details:

https://github.com/dotnet/spark

The Mobius project is now deprecated and has been replaced by this brand new project (.NET for Apache Spark).

Installation

Install the following frameworks/tools if you have not already done so:

For Spark installation, refer to my following post if you don’t know how to install it:

Install Spark 2.2.1 in Windows

*Note: the version in the above guide is 2.2.1; for .NET for Apache Spark you need to install Spark 2.3.*, 2.4.0 or 2.4.1.

Then download and install a Microsoft.Spark.Worker release:

  • Select a Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page and download it to your local machine (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0\).
  • Create a new environment variable named DotnetWorkerPath with its value set to the directory where you placed Microsoft.Spark.Worker in the preceding step (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0), for example via setx as shown below.
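
If you prefer the command line, you can set the variable with the setx command (a sketch; adjust the path to wherever you extracted the worker) and then reopen your command prompt so the new variable takes effect:

setx DotnetWorkerPath "F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0"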


Create a dotnet Spark application

Run the following command to create a new folder:

mkdir dotnet-spark

Change directory to this folder and create a .NET Core console application:

cd dotnet-spark
dotnet new console

Add a reference to the NuGet package Microsoft.Spark:

dotnet add dotnet-spark.csproj package Microsoft.Spark

Open the Program.cs file in the dotnet-spark folder. The program currently looks like the following:

using System;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Replace the code with the following:

using System;
using System.Net;
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Download a sample CSV data set from the NYC Open Data portal.
            var dataUrl = "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv";
            var localFileName = "stats.csv";
            using (var client = new WebClient())
            {
                client.DownloadFile(dataUrl, localFileName);
            }

            // Create (or reuse) a Spark session and load the CSV file into a DataFrame.
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Csv(localFileName);

            // Show the first 10 rows of the first four columns and print the inferred schema.
            df.Select("_c0", "_c1", "_c2", "_c3").Show(10);
            df.PrintSchema();
        }
    }
}
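
The program downloads a sample CSV file from the NYC Open Data portal, creates a SparkSession, reads the CSV into a DataFrame, shows the first 10 rows of the first four (auto-named) columns and finally prints the inferred schema.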

Publish the application

Run the following command to publish the project so that an executable file is generated:

dotnet publish -c Debug -r win-x64 -o app -f netcoreapp2.2

*Note: change the runtime identifier to match your own environment. The above command publishes for the Windows x64 runtime and copies the output to the app folder.

For Linux or macOS users, change the command accordingly, especially the runtime identifier (-r), as shown in the example below.
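
For example, the following command (a sketch, assuming a 64-bit Linux machine) publishes for the linux-x64 runtime:

dotnet publish -c Debug -r linux-x64 -o app -f netcoreapp2.2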

Navigate to the app output folder:

cd app

In the app folder, you will find these files (among many others):

  • microsoft-spark-2.3.x-0.1.0.jar
  • microsoft-spark-2.4.x-0.1.0.jar
  • dotnet-spark.exe

Run the application in Spark

Now, we can submit the job to run in Spark using the following command:

%SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar dotnet-spark

The last argument is the name of the executable file; it works with or without the .exe extension.
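
On Linux or macOS, the equivalent command (a sketch, assuming SPARK_HOME points to your Spark installation and you run it from the app folder) is:

$SPARK_HOME/bin/spark-submit --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar dotnet-spark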

The output shows the first 10 rows of the selected columns followed by the printed schema.

Read data from HDFS
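
Replace the Main method with the following code to read a CSV-format table directly from HDFS. Update the HDFS URL to match your own NameNode address and file path.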

static void Main(string[] args)
{
    ReadHDFS();
}

static void ReadHDFS()
{
    // CSV-format data stored under a Hive warehouse directory on HDFS.
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";

    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read HDFS example")
        .GetOrCreate();

    // Read the CSV data into a DataFrame, show the first 10 rows and print the schema.
    var df = spark.Read().Csv(hdfsURL);
    df.Show(10);
    df.PrintSchema();
}

The sample output lists the first 10 rows of the table followed by its schema.

Read data from Hive
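
Replace the Main method with the following code to query Hive tables via the Hive metastore. Update the hive.metastore.uris value to match your own metastore Thrift service address.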

static void Main(string[] args)
{
    ReadFromHive();
}

static void ReadFromHive()
{
    var master = "local";

    // Build a session with Hive support enabled, pointing at the Hive metastore Thrift service.
    var spark = SparkSession
        .Builder()
        .Config("hive.metastore.uris", "thrift://localhost:9083")
        .AppName(".NET for Spark - Read from Hive example")
        .EnableHiveSupport()
        .Master(master)
        .GetOrCreate();

    // List the available databases.
    var df = spark.Sql("show databases");
    df.Show(10);

    // Query a Hive table.
    df = spark.Sql("select * from test_db.test_table");
    df.Show();
}
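
EnableHiveSupport turns on Hive integration for the session, so the Spark SQL statements above (show databases and the select query) run against the Hive metastore configured through hive.metastore.uris.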

The sample output lists the available databases followed by the rows of test_db.test_table.

Read and write parquet files
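
Replace the Main method with the following code, which reads the same CSV data from HDFS, adds a literal column, writes the result out as Parquet and then reads the Parquet files back into a new DataFrame.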

static void Main(string[] args)
{
    ReadAndWriteParquetFiles();
}

static void ReadAndWriteParquetFiles()
{
    var master = "local[2]";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read and write Parquet example")
        .Master(master)
        .GetOrCreate();

    // Read the CSV-format table from HDFS.
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
    var df = spark.Read().Csv(hdfsURL);

    // Add a literal string column and write the result out as Parquet files.
    df.WithColumn("NewColumn", Functions.Lit("Value").Cast("string"))
        .Write()
        .Parquet("test.parquet");

    // Read the Parquet files back into a new DataFrame and show the content.
    var df2 = spark.Read().Parquet("test.parquet");
    df2.Show();
}
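
Note that Spark writes test.parquet as a directory of part files rather than a single file, and the default save mode raises an error if the path already exists, so delete the previous output (or change the save mode) before re-running the example.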

Summary

As you can see in the above examples, you can use very similar fluent APIs to write Spark applications in C#. At the moment, not all of the Mobius APIs have been ported to this new project; for example, the SparkContext.Parallelize function is not implemented yet. But don’t worry, as it is only version 0.1.0 at the moment, and you can still complete the majority of common actions and tasks even with this version. Let’s look forward to future releases.
