PySpark: some questions for a beginner

luc londea · 2022-11-30

Hello,
I'm trying to understand how Spark works and I'm learning PySpark.

I know Python and the Pandas library.

I understand that if I want to read a big CSV file into a Pandas DataFrame, it may not work (or it will take a long time to read).

As such, PySpark is an alternative.

I read some articles and I understood that the first thing to do is to create a SparkContext.

I understand the SparkContext will manage the cluster, which will read the CSV file and transform the data.

So I had this code in a Jupyter notebook:
# Import SparkContext from the pyspark module
from pyspark import SparkContext

sc = SparkContext('local')
sc

If I execute this code twice, the 2nd time I will get an error because I can't have 2 Spark contexts.
Why can't I have 2 Spark contexts?

I wanted to try this:
# Import SparkContext from the pyspark module
from pyspark import SparkContext

sc1 = SparkContext('local')

sc2 = SparkContext('local')

I have 2 different names: sc1 and sc2. Even if I execute it only one time, I get an error. Why can't I have 2 Spark contexts sc1 and sc2?

thank you

Comments
Raymond · 3 years ago

Hi Luc,

Welcome to Kontext!

First, the API you are using is the old approach to establishing a session. For Spark 2 or 3, you can use the following approach to create a SparkSession:

    from pyspark.sql import SparkSession

    app_name = "PySpark Delta Lake - SCD2 Full Merge Example"
    master = "local"

    # Create Spark session
    builder = SparkSession.builder.appName(app_name) \
        .master(master)

    spark = builder.getOrCreate()

This will ensure one active session. 

A Spark session allows you to run your code interactively, hence the session will not be gone. You can stop the SparkContext via the stop function:

pyspark.SparkContext.stop — PySpark 3.3.1 documentation (apache.org)
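For example, here is a minimal sketch of stopping the active context before creating a new one in a notebook:

    from pyspark import SparkContext

    sc = SparkContext('local')
    # ... do some RDD work ...

    # Stop the active context; otherwise a second SparkContext('local') call raises an error
    sc.stop()

    sc = SparkContext('local')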

If you want to create two Spark sessions, you can submit two Spark jobs separately.

Also, can you explain more about why you need two sessions in one script?

luc londea · 3 years ago

thank you

I thought that SparkSession was used when I wanted to work with DataFrames.

If I only need RDDs, I use SparkContext.

Isn't that true anymore or was I wrong the whole time?


If I want to create 2 Spark sessions, can I do this:

    app_name = "PySpark Delta Lake - SCD2 Full Merge Example"
    master = "local"
    app_name2 = "PySpark Delta Lake - SCD2 Full Merge Example 2"

    # Create Spark session with Delta extension
    builder1 = SparkSession.builder.appName(app_name) \
        .master(master)

    builder2 = SparkSession.builder.appName(app_name2) \
        .master(master)

    spark1 = builder1.getOrCreate()
    spark2 = builder2.getOrCreate()



Raymond · 3 years ago

From Spark 2.0, SparkSession is recommended as it encapsulates most of the APIs, including the SparkContext ones. You can still use SparkContext if you prefer:

sc = spark.sparkContext
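For instance, here is a small sketch (the data and app name are made up) of mixing RDD and DataFrame work from one session:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").master("local").getOrCreate()

    # RDD-style work through the SparkContext exposed by the session
    sc = spark.sparkContext
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.sum())

    # DataFrame-style work through the same session
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()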

The example you provided will end up with only one session, because getOrCreate() returns the already-active session instead of creating a new one.
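You can verify that quickly (a sketch using made-up app names):

    from pyspark.sql import SparkSession

    spark1 = SparkSession.builder.appName("app 1").master("local").getOrCreate()
    spark2 = SparkSession.builder.appName("app 2").master("local").getOrCreate()

    # getOrCreate() returns the already-active session, so both names point to the same object
    print(spark1 is spark2)             # True
    print(spark1.sparkContext.appName)  # still "app 1"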
