86,750 views · 5 comments · 4 years ago · English

PySpark: Convert Python Array/List to Spark Data Frame

In Spark, the SparkContext.parallelize function can be used to convert a Python list to an RDD, and the RDD can then be converted to a DataFrame. The following sample code is based on Spark 2.x. In this page, I am going to show you how to convert the following list to a data frame: data = [('Category A' ...
Last modified by Administrator 3 years ago

Comments
#1714 · 10 months ago · Re: PySpark: Convert Python Array/List to Spark Data Frame

Thanks for the feedback, Ravi. Welcome to Kontext!

Quote from Ravi · 10 months ago · Re: PySpark: Convert Python Array/List to Spark Data Frame

Very nice code and explanation. Excellent feature in PySpark.


#1566 · 2 years ago · Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi venu,

There are several things you need to know:

  • The collect function requests all data in the DataFrame to be sent to your driver application, which is slow for large data sets.
  • In later Spark versions, you can use the DataFrame APIs directly to transform the data instead of converting to an RDD and looping through it.
  • Similarly, for saving as CSV, you can use the DataFrame write APIs directly.

Thus, to utilize parallelism and to improve performance, I would suggest the following changes:

  1. Repartition your DataFrame df using the repartition function if there are appropriate partition keys.
  2. Use df directly for all transformations. You can find more information here: pyspark.sql.DataFrame — PySpark 3.2.0 documentation (apache.org). Remember to read the documentation for your Spark version.
  3. Use df.write to save the data into HDFS.


Quote from venu · 2 years ago · Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi Raymond,

It takes a lot of time because of df.collect(). Is there any way to speed this process up? I tried --num-executors 5 in spark-submit but saw no change in performance. If possible, please also suggest how we can leverage --num-executors in this case. Since it is a PySpark DataFrame, I also tried df1 = df.toPandas(), but again no change in performance.

