Debug PySpark Code in Visual Studio Code

Posted 2 years ago

This page summarizes the steps required to run and debug PySpark (Spark for Python) in Visual Studio Code.

Install Python and pip

Install Python from the official website:

https://www.python.org/downloads/

I am using Python 3.6.4 (32-bit). Pip ships with this version.
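To confirm which interpreter VS Code will pick up, a short check like the one below may help. It is only a sketch: the function name describe_interpreter is made up for illustration, and it uses nothing beyond the standard library.

```python
import importlib.util
import platform
import sys


def describe_interpreter():
    """Collect the interpreter details mentioned in this step."""
    return {
        # Major.minor.patch of the running interpreter, e.g. "3.6.4"
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Word size of the interpreter build, e.g. "32bit" or "64bit"
        "bits": platform.architecture()[0],
        # True when the pip package can be found by this interpreter
        "has_pip": importlib.util.find_spec("pip") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, value)
```

If has_pip prints False, pip is probably missing from this particular interpreter, and the pip install step later on this page would likely fail.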

Install Spark standalone edition

Download Spark 2.3.3 from the following page:

https://www.apache.org/dyn/closer.lua/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz

If you are not sure how to install Spark, follow the steps on this page:

Install Spark 2.2.1 in Windows

*Remember to use the 2.3.3 package instead of 2.2.1.

I am using 2.3.3 because the latest Spark version, 2.4.0, has a bug.

Install pyspark package

Since the Spark version is 2.3.3, install the matching version of the pyspark package with the following command:

pip install pyspark==2.3.3

The two versions need to match; otherwise you may encounter errors from the py4j package.
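One way to check that the versions line up is a small script such as the following. It is a sketch under assumptions: the helper names are hypothetical, the expected value "2.3.3" is simply the Spark version used on this page, and only the first three dot-separated parts of each version are compared.

```python
def installed_pyspark_version():
    """Return the installed pyspark version string, or None if it is missing."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


def versions_match(spark_version, pyspark_version):
    """Compare the first three dot-separated parts of two version strings."""
    def parts(version):
        return tuple(version.strip().split(".")[:3])
    return parts(spark_version) == parts(pyspark_version)


if __name__ == "__main__":
    expected = "2.3.3"  # assumption: the Spark version downloaded earlier
    found = installed_pyspark_version()
    if found is None:
        print("pyspark is not installed; run: pip install pyspark==" + expected)
    elif versions_match(expected, found):
        print("pyspark", found, "matches Spark", expected)
    else:
        print("Mismatch: Spark", expected, "but pyspark", found)
```

If you installed a different Spark release, change the expected value accordingly.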

Run PySpark code in Visual Studio Code

You can run a PySpark script through the context menu item Run Python File in Terminal.

[Screenshot: the Run Python File in Terminal context menu item]

Alternatively, you can debug your application in VS Code, as shown in the following screenshot:

[Screenshot: debugging a PySpark application in VS Code]
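If you need a file to try with either approach, the following is a minimal sketch of a PySpark script. It is not the script from the screenshots: the sample rows, the application name and the helper functions are invented for illustration. It also assumes a Java runtime is reachable through JAVA_HOME or the PATH, which Spark needs but this page does not cover.

```python
import importlib.util
import os
import shutil


def sample_rows():
    """Made-up (category, value) rows used only for this demo."""
    return [("A", 12.4), ("B", 30.1), ("A", 100.0)]


def prerequisites_met():
    """True when pyspark can be found and a Java runtime appears to be available."""
    has_pyspark = importlib.util.find_spec("pyspark") is not None
    has_java = bool(os.environ.get("JAVA_HOME")) or shutil.which("java") is not None
    return has_pyspark and has_java


def main():
    # Imported here so the file still loads when pyspark is missing.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")  # local mode, using all available cores
             .appName("vscode-pyspark-demo")
             .getOrCreate())
    try:
        df = spark.createDataFrame(sample_rows(), ["category", "value"])
        totals = df.groupBy("category").sum("value")
        # A breakpoint on the next line lets you inspect `rows` in the debugger.
        rows = totals.collect()
        for row in rows:
            print(row["category"], row["sum(value)"])
    finally:
        spark.stop()


if __name__ == "__main__":
    if prerequisites_met():
        main()
    else:
        print("pyspark or Java was not found; complete the install steps first.")
```

Collecting the result into a plain Python list keeps the data in the driver process, which is the process the VS Code debugger is attached to, so the values can be inspected at a breakpoint.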

Run Azure HDInsight PySpark code

You can install the Azure HDInsight Tools extension to submit Spark jobs from VS Code to your HDInsight cluster.

For more details, refer to the extension page:

https://marketplace.visualstudio.com/items?itemName=mshdinsight.azure-hdinsight

Last modified by Raymond 9 months ago.