Get Started on .NET 5 with Apache Spark
.NET for Apache Spark 1.0 was officially released on 14th Oct 2020. This version was released together with .NET Core 3.0. Since .NET for Apache Spark is written against .NET Standard, it should work with .NET 5 too. This article shows how to use .NET 5 with Apache Spark.
* Image from dotnet/spark: .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Installation
1) Install the following frameworks/tools if you have not already done so:
- .NET 5.0 SDK
- Java 1.8 (or Java 11)
- Apache Spark 3.0
For Spark installation, refer to my following post if you don’t know how to install it:
And then download and install Microsoft.Spark.Worker release:
- Select a Microsoft.Spark.Worker release from the .NET for Apache Spark GitHub Releases page, download it to your local machine and extract it to a folder, for example F:\big-data\Microsoft.Spark.Worker-1.0.0.
- Create a new environment variable named DOTNET_WORKER_DIR (early preview releases used the name DotnetWorkerPath) with its value set to the directory where you placed Microsoft.Spark.Worker in the preceding step.
Create a dotnet Spark application
Run the following command to create a new folder:
mkdir dotnet5-spark
Change directory to this folder and create a .NET console application:
cd dotnet5-spark
dotnet new console
Add a reference to the NuGet package Microsoft.Spark:
dotnet add dotnet5-spark.csproj package Microsoft.Spark
Open the Program.cs file in the dotnet5-spark folder. The program currently looks like the following:
using System;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}
Replace the code with the following:
using System;
using System.Net;
using Microsoft.Spark.Sql;
using System.IO;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            /* Download a file from the Internet */
            var dataUrl = "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv";
            var localFileName = "stats.csv";
            using (var client = new WebClient())
            {
                client.DownloadFile(dataUrl, localFileName);
            }
            var localFilePath = Path.GetFullPath(localFileName);
            Console.WriteLine("Local CSV file path is: {0}", localFilePath);

            /* Create Spark session */
            var spark = SparkSession.Builder().GetOrCreate();
            var df = spark.Read().Csv($"file:///{localFilePath}");
            df.Select("_c0", "_c1", "_c2", "_c3").Show(10);
            df.PrintSchema();
        }
    }
}
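The code above reads the CSV with the reader's default options, so Spark names the columns _c0, _c1, and so on, and treats every value as a string. The following is a minimal sketch of an alternative, assuming the first row of the downloaded file holds column names (not verified here) and that stats.csv already exists in the working directory. It is meant as a drop-in replacement for Program.cs rather than as the definitive way to load this dataset.

```csharp
using System.IO;
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Assumption: stats.csv was downloaded earlier into the working directory.
            var localFilePath = Path.GetFullPath("stats.csv");

            var spark = SparkSession.Builder().GetOrCreate();

            var df = spark.Read()
                // Use the first row as column names instead of _c0, _c1, ...
                .Option("header", true)
                // Let Spark sample the data to guess column types (costs an extra pass).
                .Option("inferSchema", true)
                .Csv($"file:///{localFilePath}");

            df.PrintSchema();
            df.Show(10);
        }
    }
}
```

With the header option set, columns can be selected by their real names instead of positional names such as _c0.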
Publish the application
Run the following command to publish the project so that the executable file is generated:
dotnet publish -c Debug -r win-x64 -o app -f net5.0

Microsoft (R) Build Engine version 16.8.0+126527ff1 for .NET
Copyright (C) Microsoft Corporation. All rights reserved.

  Determining projects to restore...
  Restored F:\Projects\dotnet5-spark\dotnet5-spark.csproj (in 214 ms).
  dotnet5-spark -> F:\Projects\dotnet5-spark\bin\x64\Debug\net5.0\win-x64\dotnet5-spark.dll
  dotnet5-spark -> F:\Projects\dotnet5-spark\app\
*Remember to change the runtime identifier (the -r argument) to match your own platform. The above command publishes for Windows x64 and copies the output to the app folder.
For Linux or macOS users, change the above command accordingly, especially the -r argument (for example, linux-x64 or osx-x64).
Navigate to the app output folder:
cd app
In the app folder, these files exist (with many others):
- microsoft-spark-2-3_2.11-1.0.0.jar
- microsoft-spark-2-4_2.11-1.0.0.jar
- microsoft-spark-3-0_2.12-1.0.0.jar
- dotnet5-spark.exe
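The JAR files support different versions of Spark. Each file name encodes the Spark and Scala versions it targets; for example, microsoft-spark-3-0_2.12-1.0.0.jar is for Spark 3.0 built with Scala 2.12.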
Run the application in Spark
Now, we can submit the job to run in Spark using the following command:
%SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-3-0_2.12-1.0.0.jar dotnet5-spark
The last argument is the executable file name. It works with or without the .exe extension.
The output looks like the following:
Read data from HDFS
static void Main(string[] args)
{
    ReadHDFS();
}

static void ReadHDFS()
{
    var hdfsURL = "hdfs://0.0.0.0:19000/user/hive/warehouse/test_db.db/test_table";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read HDFS example")
        .GetOrCreate();
    var df = spark.Read().Csv(hdfsURL);
    df.Show(10);
    df.PrintSchema();
}
Sample output:
Read and write parquet files
static void Main(string[] args)
{
    ReadAndWriteParquetFiles();
}

static void ReadAndWriteParquetFiles()
{
    var master = "local[2]";
    var spark = SparkSession
        .Builder()
        .AppName(".NET for Spark - Read and write Parquet example")
        .Master(master)
        .GetOrCreate();
    var hdfsURL = "/user/hive/warehouse/test_db.db/test_table";
    var df = spark.Read().Csv(hdfsURL);

    // Write parquet files
    df.WithColumn("NewColumn", Functions.Lit("Value").Cast("string"))
        .Write()
        .Parquet("test.parquet");

    var df2 = spark.Read().Parquet("test.parquet");
    df2.Show();
}
The output looks like the following screenshot:
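One practical caveat with the Parquet write in this section: Spark's default save mode raises an error when the target path already exists, so running the example a second time fails on test.parquet. The following is a minimal sketch of one way around that, assuming it is acceptable to replace the previous output. To stay self-contained it builds a small stand-in DataFrame with spark.Range instead of reading the Hive table; that substitution and the application name are illustrative only.

```csharp
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession
                .Builder()
                .AppName(".NET for Spark - Overwrite Parquet sketch")
                .Master("local[2]")
                .GetOrCreate();

            // Stand-in data (assumption): five rows with a single "id" column,
            // plus the same literal column that the example above adds.
            var df = spark.Range(0, 5)
                .WithColumn("NewColumn", Functions.Lit("Value"));

            // SaveMode.Overwrite replaces any existing output instead of failing.
            df.Write()
                .Mode(SaveMode.Overwrite)
                .Parquet("test.parquet");

            // Read the files back to confirm the write succeeded.
            spark.Read().Parquet("test.parquet").Show();
        }
    }
}
```

SaveMode.Append or SaveMode.Ignore may be a better fit when existing data must be preserved.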
Summary
As the examples above show, you can write Spark applications in C# using fluent APIs that are very similar to Spark's native ones. F# is also supported. However, .NET for Apache Spark is not yet part of the standard Spark release. To follow progress, see JIRA [SPARK-27006] SPIP: .NET bindings for Apache Spark.
There are also some known issues. Please find more details on GitHub: .NET for Apache Spark release-1.0.0 Known Issues.
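As a small illustration of the fluent style mentioned in the summary, the sketch below chains a filter, a grouping and an ordering in one expression. It assumes nothing about the datasets used earlier: the data comes from spark.Range, and the bucket column and application name are hypothetical names chosen for the example.

```csharp
using Microsoft.Spark.Sql;

namespace dotnet_spark
{
    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession
                .Builder()
                .AppName(".NET for Spark - fluent API sketch")
                .GetOrCreate();

            // Stand-in data: ids 0..99 plus a derived "bucket" column (id modulo 3).
            var df = spark.Range(0, 100)
                .WithColumn("bucket", Functions.Expr("id % 3"));

            // Keep ids greater than 10, count the rows per bucket, sort by bucket.
            df.Filter(Functions.Col("id").Gt(10))
                .GroupBy("bucket")
                .Count()
                .OrderBy("bucket")
                .Show();
        }
    }
}
```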