PySpark: some questions for a beginner
Hello
I'm trying to understand how Spark works and I'm learning PySpark.
I know Python and the Pandas library.
I understand that if I want to read a big CSV file with Pandas into a DataFrame, it may not work (or it will take a long time to read).
As such, PySpark is an alternative.
I read some articles and I understood that the first thing to do is to create a SparkContext.
I understand that the SparkContext will manage the cluster, which will read the CSV file and transform the data.
So I had this code in a Jupyter notebook:
# Import SparkContext from the pyspark module
from pyspark import SparkContext
sc = SparkContext('local')
sc
If I execute this code twice, the second time I get an error because I can't have two Spark contexts.
Why can't I have two Spark contexts?
I wanted to try this:
# Import SparkContext from the pyspark module
from pyspark import SparkContext
sc1 = SparkContext('local')
sc2 = SparkContext('local')
Thank you
I thought that SparkSession was used when I wanted to work with DataFrames.
If I only need RDDs, I use SparkContext.
Isn't that true anymore, or was I wrong the whole time?
If I want to create two Spark sessions, can I do this:
app_name = "PySpark Delta Lake - SCD2 Full Merge Example" master = "local"
app_name2 = "PySpark Delta Lake - SCD2 Full Merge Example 2" # Create Spark session with Delta extension builder1 = SparkSession.builder.appName(app_name) \ .master(master)
builder2 = SparkSession.builder.appName(app_name2) \ .master(master) spark1 = builder1.getOrCreate()
spark2 = builder2.getOrCreate()
Since Spark 2.0, SparkSession is recommended as it encapsulates most of the APIs, including the SparkContext ones. You can still use SparkContext if you prefer:
sc = spark.sparkContext
The example you provided will end up with one session.
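A quick way to see this yourself (just a sketch, assuming a local PySpark install; the app names are placeholders):
from pyspark.sql import SparkSession

spark1 = SparkSession.builder.appName("first").master("local").getOrCreate()
spark2 = SparkSession.builder.appName("second").master("local").getOrCreate()

# getOrCreate() returns the session that is already active, so both variables point to the same object
print(spark1 is spark2)  # True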
Hi Luc,
Welcome to Kontext!
First, the API you are using is the old approach to establishing a session. For Spark 2 or 3, you can use the following approach to create a SparkSession:
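For example, a minimal sketch of the builder pattern (the app name and master values are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my-app") \
    .master("local") \
    .getOrCreate()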
This will ensure one active session.
A Spark session allows you to run your code interactively, hence the session will not go away on its own. You can stop the SparkContext via the stop function: pyspark.SparkContext.stop — PySpark 3.3.1 documentation (apache.org)
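For example, with the sc from your notebook (a quick sketch):
sc.stop()

# Once the old context is stopped, a new one can be created without the error about multiple contexts.
from pyspark import SparkContext
sc = SparkContext('local')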
If you want to create two Spark Sessions, you can submit two Spark jobs separately. The following diagram might be helpful to you:
Also, can you explain a bit more about why you need two sessions in one script?