PySpark: Convert Python Array/List to Spark Data Frame

In Spark, the SparkContext.parallelize function can be used to convert a Python list to an RDD, and the RDD can then be converted to a DataFrame object. The following sample code is based on Spark 2.x. On this page, I am going to show you how to convert the following list to a data frame: data = [('Category A' ...
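
As a quick illustration of that conversion, here is a minimal sketch assuming Spark 2.x or later; the sample rows and column names below are hypothetical stand-ins, since the article's full list is truncated above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('list-to-df').getOrCreate()

    # Hypothetical sample rows; the article's full list is truncated above.
    data = [('Category A', 100), ('Category B', 200)]

    # Via an RDD, as described above ...
    df = spark.sparkContext.parallelize(data).toDF(['category', 'value'])

    # ... or directly, skipping the explicit RDD step.
    df2 = spark.createDataFrame(data, ['category', 'value'])

    df.show()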

Comments
Raymond, 11 months ago
#1566 Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi venu,

There are several things you need to know:

  • The collect function requests all data in the data frame to be sent to your driver application, so it does not scale.
  • In later Spark versions, you can use the DataFrame APIs directly to transform the data instead of converting to an RDD and looping through it.
  • Similarly, for saving as CSV, you can also use the DataFrame APIs directly.

Thus, to utilize parallelism and to improve performance, I would suggest the following changes:

  1. Repartition your DataFrame df using the repartition function if there are appropriate partition keys.
  2. Use df directly to do all kinds of transformations. You can find more information here: pyspark.sql.DataFrame — PySpark 3.2.0 documentation (apache.org). Remember to read the documentation for your Spark version.
  3. Use df.write to save the data into HDFS, as in the sketch after this list.
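
To make those three steps concrete, here is a minimal sketch; the input data, column names, aggregation, and output path are hypothetical stand-ins for your own schema and HDFS location:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('df-transform').getOrCreate()

    # Hypothetical input; in practice df comes from your data source.
    df = spark.createDataFrame(
        [('Category A', 100), ('Category B', 120), ('Category A', 150)],
        ['category', 'value'])

    # 1. Repartition on a suitable key so work spreads across executors.
    df = df.repartition('category')

    # 2. Transform with DataFrame APIs instead of collect() plus a Python
    #    loop; these run in parallel on the executors, not on the driver.
    result = df.groupBy('category').agg(F.sum('value').alias('total'))

    # 3. Write straight to HDFS as CSV; no collect() needed.
    result.write.mode('overwrite').csv('hdfs:///tmp/output')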


venu, 11 months ago
#1565 Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi Raymond,


But it takes a lot of time because of df.collect().

Is there any way to speed up this process? I tried using --num-executors 5 in spark-submit, but there was no change in performance. If possible, please also suggest how we can leverage --num-executors in this case. Since it is a PySpark DataFrame, I also tried df1 = df.toPandas(), but there was no change in performance.

