This page summarizes the steps required to run and debug PySpark (the Python API for Spark) in Visual Studio Code.
Install Python from the official website:
The version I am using is 3.6.4 (32-bit); pip is bundled with this version.
Download Spark 2.3.3 from the following page:
If you are not sure how to install it, follow the instructions on this page:
*Remember to change the package to version 2.3.3.
There is a known bug in the latest Spark release (2.4.0), so I am using 2.3.3.
Since the installed Spark version is 2.3.3, we need to install the matching version of the pyspark package via the following command:
pip install pyspark==2.3.3
The versions must be consistent; otherwise you may encounter errors from the py4j package.
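To illustrate the consistency requirement, here is a small sketch (the `versions_match` helper is hypothetical, not part of pyspark) that compares the major.minor.patch components of the two versions:

```python
def versions_match(spark_version: str, pyspark_version: str) -> bool:
    """Return True when the Spark and pyspark versions are identical.

    A mismatch (e.g. Spark 2.3.3 with pyspark 2.4.0) typically surfaces
    as py4j errors at runtime.
    """
    return spark_version.split(".")[:3] == pyspark_version.split(".")[:3]

print(versions_match("2.3.3", "2.3.3"))  # True  - safe combination
print(versions_match("2.3.3", "2.4.0"))  # False - expect py4j errors
```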
You can run a PySpark script through the context menu item Run Python File in Terminal.
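For example, a minimal script like the following (the file name and data are illustrative) can be run this way; it starts a local Spark session, builds a small DataFrame, and prints an aggregation:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session running locally on all cores
spark = SparkSession.builder \
    .appName("HelloPySpark") \
    .master("local[*]") \
    .getOrCreate()

# Build a small DataFrame and run a simple aggregation
df = spark.createDataFrame(
    [("alice", 1), ("bob", 2), ("alice", 3)],
    ["name", "value"],
)
df.groupBy("name").sum("value").show()

spark.stop()
```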
Alternatively, you can debug your application in VS Code, as shown in the following screenshot:
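Debugging uses the standard VS Code Python debug configuration; a typical `.vscode/launch.json` for launching the currently open file looks like this (the configuration name is arbitrary):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}
```

With this in place, setting a breakpoint in the PySpark script and pressing F5 stops execution on the driver side as usual.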
You can install the Azure HDInsight Tools extension to submit Spark jobs from VS Code to your HDInsight cluster.
For more details, refer to the extension page: