PySpark: some questions for a beginner

luc londea · 2022-11-30

Hello
I'm trying to understand how Spark works and I'm learning PySpark.

I already know Python and the Pandas library.

I understand that if I want to read a big CSV file with Pandas using a DataFrame, it may not work (or it will take a long time to read).

As such PySpark is an alternative.

I read some articles and I understood that the first thing to do is to create a SparkContext.

I understand that the SparkContext will manage the cluster, which will read the CSV file and transform the data.

So I had this code in a Jupyter notebook:
# Import SparkContext from the pyspark module
from pyspark import SparkContext

sc = SparkContext('local')
sc
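
With that context, my understanding is that I could then read the CSV as an RDD, something like this (just a sketch of what I have in mind; "data.csv" is a placeholder file name):

# Read the file as an RDD of text lines
rdd = sc.textFile("data.csv")

# The first line is the header with the column names
header = rdd.first()

# Keep only the data rows and count them
rows = rdd.filter(lambda line: line != header)
print(rows.count())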

If I execute this code twice, the second time I get an error because I can't have 2 Spark contexts.
Why can't I have 2 Spark contexts?

I wanted to try this:
# Import SparkContext from the pyspark module
from pyspark import SparkContext

sc1 = SparkContext('local')

sc2 = SparkContext('local')

I have 2 different names: sc1 and sc2. Even if I execute it only one time, I get an error. Why can't I have 2 Spark contexts sc1 and sc2?

thank you

Comments

Raymond · 2 years ago

Hi Luc,

Welcome to Kontext!

First, the API you are using is the older approach to establishing a session. For Spark 2 or 3, you can use the following approach to create a SparkSession:

    from pyspark.sql import SparkSession

    app_name = "PySpark Delta Lake - SCD2 Full Merge Example"
    master = "local"

    # Create the Spark session (one per application)
    builder = SparkSession.builder.appName(app_name) \
        .master(master)

    spark = builder.getOrCreate()

This will ensure one active session. 
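
Once the session exists, you can read your CSV file as a DataFrame directly from it. For example (a small sketch; the file path is just a placeholder):

    # Read the CSV into a DataFrame; header and schema inference are optional
    df = spark.read.csv("/path/to/big_file.csv", header=True, inferSchema=True)
    df.show(5)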

A Spark session lets you run your code interactively, so the session does not go away on its own. You can stop the SparkContext via its stop function:

pyspark.SparkContext.stop — PySpark 3.3.1 documentation (apache.org)
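
For example, something like this (a sketch, assuming spark, app_name and master are the ones from the snippet above):

    # Stop the underlying SparkContext; the session can no longer be used after this
    spark.sparkContext.stop()

    # A later getOrCreate() call will then build a brand new session
    spark = SparkSession.builder.appName(app_name).master(master).getOrCreate()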

If you want to create two Spark Sessions, you can submit two Spark jobs separately. The following diagram might be helpful to you:

Also can you explain more why you need two sessions in one script?

luc londea · 2 years ago

thank you

I thought that SparkSession was used when I wanted to work with DataFrames.

If I only need RDDs, I use SparkContext.

Isn't that true anymore or was I wrong the whole time?


If I want to create 2 Spark sessions, can I do this:

    app_name = "PySpark Delta Lake - SCD2 Full Merge Example"
    master = "local"
    app_name2 = "PySpark Delta Lake - SCD2 Full Merge Example 2"

    # Create Spark session builders
    builder1 = SparkSession.builder.appName(app_name) \
        .master(master)

    builder2 = SparkSession.builder.appName(app_name2) \
        .master(master)

    spark1 = builder1.getOrCreate()
    spark2 = builder2.getOrCreate()



Raymond · 2 years ago

From Spark 2.0, SparkSession is recommended as it encapsulates most of the APIs, including the SparkContext ones. You can still use SparkContext if you prefer:

sc = spark.sparkContext

The example you provided will end up with one session. 
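
You can check that yourself with a quick sketch based on the snippet you posted:

    # Both getOrCreate() calls return the same underlying session object
    print(spark1 is spark2)                             # True
    print(spark1.sparkContext is spark2.sparkContext)   # True

    # RDDs are still available through that shared context
    rdd = spark1.sparkContext.parallelize([1, 2, 3])
    print(rdd.collect())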
