PySpark: Convert Python Array/List to Spark Data Frame

2 years ago
#1566 Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi venu,

There are several things you need to know:

  • The collect function requests all data in the DataFrame to be sent to your driver application, which can be very slow for large datasets.
  • In later Spark versions, you can use DataFrame APIs directly to transform data instead of converting to an RDD and looping through it.
  • Similarly, for saving as CSV, you can use the DataFrame write APIs directly.

Thus, to utilize parallelism and to improve performance, I would suggest the following changes:

  1. Repartition your DataFrame df using the repartition function if there are appropriate partition keys.
  2. Use df directly for all kinds of transformations. You can find more information here: pyspark.sql.DataFrame — PySpark 3.2.0 documentation (apache.org). Remember to read the documentation for your Spark version.
  3. Use df.write to save the data into HDFS.



venu, 2 years ago
Re: PySpark: Convert Python Array/List to Spark Data Frame

Hi Raymond,


But it takes a lot of time because of df.collect().

Is there any way to speed up this process? I tried to use --num-executors 5 in spark-submit, but there was no change in performance. If possible, please also suggest how we can leverage --num-executors in this case. Since it's a PySpark DataFrame, I also tried df1 = df.toPandas(), but again there was no change in performance.

