.NET for Apache Spark Preview with Examples
I’ve been following the Mobius project for a while and have been waiting for this day. .NET for Apache Spark v0.1.0 was published on 2019-04-25 on GitHub. It provides high-performance APIs for programming Apache Spark applications with C# and F#. It is .NET Standard compliant and can run on Windows, Linux and macOS with .NET Core. This is great news for all .NET developers, similar to the ML.NET announcement last year.
* Image from https://www.nuget.org/profiles/spark
In this page, I’m going to show you how to install and use it with some hands-on examples.
Refer to the following official repo on GitHub for more details:
The Mobius project is now deprecated and replaced by this brand-new one (.NET for Apache Spark).
Installation
Install the following frameworks/tools if you have not done that:
- .NET Core 2.1 SDK
- Java 1.8 (Hadoop currently doesn’t work with JVM 1.9+)
- Apache Spark 2.4.1 (Apache Spark 2.4.2 is not supported yet!)
For Spark installation, refer to my following post if you don’t know how to install it:
*Note: the version in the above guide is 2.2.1. You need to install one of the 2.4.0, 2.4.1 or 2.3.* versions instead.
And then download and install Microsoft.Spark.Worker release:
- Select a Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page and download it onto your local machine (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0).
- Create a new environment variable named DotnetWorkerPath with its value set to the directory where you placed Microsoft.Spark.Worker in the preceding step (e.g., F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0).
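On Windows the variable can be created with setx; on Linux or macOS an exported variable serves the same purpose. A minimal sketch, assuming the worker was extracted to a home-directory path (adjust the path to wherever you placed it):

```shell
# Windows (takes effect in new shells only):
#   setx DotnetWorkerPath "F:\DataAnalytics\Microsoft.Spark.Worker-0.1.0"
# Linux/macOS equivalent -- the path below is an assumed example:
export DotnetWorkerPath="$HOME/Microsoft.Spark.Worker-0.1.0"
echo "$DotnetWorkerPath"
```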
Create a dotnet Spark application
Run the following command to create a new folder:
mkdir dotnet-spark
Change directory to this folder and create a dotnet core Console application:
cd dotnet-spark
dotnet new console
Add reference to Nuget package Microsoft.Spark:
dotnet add dotnet-spark.csproj package Microsoft.Spark
Open Program.cs file in dotnet-spark folder. The program currently looks like the following:
using System;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}
Replace the code with the following:
using System;
using System.Net;
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            var dataUrl = "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv";
            var localFileName = "stats.csv";

            // Download the sample CSV file to the local folder
            using (var client = new WebClient())
            {
                client.DownloadFile(dataUrl, localFileName);
            }

            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Csv(localFileName);
            df.Select("_c0", "_c1", "_c2", "_c3").Show(10);
            df.PrintSchema();
        }
    }
}
Publish the application
Run the following command to publish the project so that an executable file is generated:
dotnet publish -c Debug -r win-x64 -o app -f netcoreapp2.2
*Remember to change the runtime identifier and target framework to match your own environment. The above command publishes for the Windows x64 runtime and copies the output to the app folder.
For Linux or macOS users, change the above command accordingly, especially the runtime identifier (-r).
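For example, on other platforms the publish command might look like the following sketch (only the runtime identifier changes; keep the target framework that matches your installed SDK):

```shell
# Linux x64:
dotnet publish -c Debug -r linux-x64 -o app -f netcoreapp2.2
# macOS x64:
dotnet publish -c Debug -r osx-x64 -o app -f netcoreapp2.2
```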
Navigate to the app output folder:
cd app
In the app folder, these files exist (with many others):
- microsoft-spark-2.3.x-0.1.0.jar
- microsoft-spark-2.4.x-0.1.0.jar
- dotnet-spark.exe
Run the application in Spark
Now, we can submit the job to run in Spark using the following command:
%SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.deploy.DotnetRunner --master local microsoft-spark-2.4.x-0.1.0.jar dotnet-spark
The last argument is the name of the executable file; it works with or without the .exe extension.
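On Linux or macOS, the equivalent submission uses the spark-submit script instead of spark-submit.cmd. A sketch, assuming SPARK_HOME is set and the published output is the current directory:

```shell
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.deploy.DotnetRunner \
  --master local \
  microsoft-spark-2.4.x-0.1.0.jar \
  dotnet-spark
```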
The output looks like the following:
Read data from HDFS
static void Main(string[] args)
{
    ReadHDFS();
}

static void ReadHDFS()
{
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read HDFS example")
        .GetOrCreate();
    var df = spark.Read().Csv(hdfsURL);
    df.Show(10);
    df.PrintSchema();
}
Sample output:
Read data from Hive
static void Main(string[] args)
{
    ReadFromHive();
}

static void ReadFromHive()
{
    var master = "local";
    var spark = SparkSession
        .Builder()
        .Config("hive.metastore.uris", "thrift://localhost:9083")
        .AppName(".NET for Spark - Read from Hive example")
        .EnableHiveSupport()
        .Master(master)
        .GetOrCreate();
    var df = spark.Sql("show databases");
    df.Show(10);
    df = spark.Sql("select * from test_db.test_table");
    df.Show();
}
Sample output:
Read and write parquet files
static void Main(string[] args)
{
    ReadAndWriteParquetFiles();
}

static void ReadAndWriteParquetFiles()
{
    var master = "local[2]";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read and write Parquet example")
        .Master(master)
        .GetOrCreate();
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
    var df = spark.Read().Csv(hdfsURL);

    // Write parquet files
    df.WithColumn("NewColumn", Functions.Lit("Value").Cast("string"))
        .Write()
        .Parquet("test.parquet");

    // Read the parquet files back
    var df2 = spark.Read().Parquet("test.parquet");
    df2.Show();
}
Summary
As you can see from the above examples, you can use very similar fluent APIs to write Spark applications in C#. At the moment, not all of the Mobius APIs have been ported to this new project; for example, the SparkContext.Parallelize function is not implemented yet. But don’t worry, as it is only version 0.1.0 at the moment, and you can still complete the majority of common actions/tasks even with this version. Let’s look forward to future releases.