
This page summarizes the steps required to run and debug PySpark (Spark for Python) in Visual Studio Code.

Install Python and pip

Install Python from the official website:

https://www.python.org/downloads/.

The version I am using is 3.6.4 (32-bit); pip ships with this version.
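To confirm which interpreter and pip you have, a quick check from Python works on any platform (any 3.x release from 3.4 onwards bundles pip via ensurepip):

```python
import sys

# Print the interpreter version that VS Code will use.
print("Python", sys.version.split()[0])

# pip has shipped with CPython since 3.4 (via ensurepip).
assert sys.version_info >= (3, 4), "expected Python 3.4+ so pip is bundled"
```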

Install Spark standalone edition

Download Spark 2.3.3 from the following page:

https://www.apache.org/dyn/closer.lua/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz

If you are not sure how to install it, follow this page:

Install Spark 2.2.1 in Windows

*Remember to change the package to version 2.3.3.

There is a known bug in the latest Spark release, 2.4.0, which is why I am using 2.3.3.
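If you would rather not change system-wide environment variables, they can also be set from Python before pyspark is imported. A minimal sketch; the paths below are illustrative and should point at the folder where you extracted the Spark package:

```python
import os

# Point Spark at the extracted folder (adjust these paths to your machine).
os.environ.setdefault("SPARK_HOME", r"C:\spark\spark-2.3.3-bin-hadoop2.7")
os.environ.setdefault("HADOOP_HOME", os.environ["SPARK_HOME"])

print("SPARK_HOME =", os.environ["SPARK_HOME"])
```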

Install pyspark package

Since the installed Spark version is 2.3.3, we need to install the matching pyspark version with the following command:

pip install pyspark==2.3.3

The versions need to match; otherwise you may encounter errors from the py4j package.
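A quick way to verify that the installed pyspark matches the Spark version is to compare the package's reported version string (the helper name and expected version below are just for illustration):

```python
def check_pyspark(expected="2.3.3"):
    """Report whether pyspark is importable and matches the Spark install."""
    try:
        import pyspark
    except ImportError:
        return "pyspark is not installed"
    if pyspark.__version__ == expected:
        return "OK: pyspark " + pyspark.__version__
    return "mismatch: pyspark %s vs Spark %s" % (pyspark.__version__, expected)

print(check_pyspark())
```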

Run PySpark code in Visual Studio Code

You can run a PySpark script through the context menu item Run Python File in Terminal.


Alternatively, you can debug your application in VS Code, as shown in the following screenshot:
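For debugging, a standard Python launch configuration is sufficient; a minimal .vscode/launch.json might look like the sketch below (the "Python: Current File" configuration is the default one offered by the VS Code Python extension):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}
```

With this in place, setting a breakpoint in the script and pressing F5 stops execution at that line.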


Run Azure HDInsights PySpark code

You can install the Azure HDInsight Tools extension to submit Spark jobs from VS Code to your HDInsight cluster.

For more details, refer to the extension page:

https://marketplace.visualstudio.com/items?itemName=mshdinsight.azure-hdinsight

Last modified by Raymond 3 months ago.
