PySpark: Convert Python Array/List to Spark Data Frame
Hi venu,
There are several things you need to know:
- The collect function requests all data in the DataFrame to be sent to your driver application, which can be slow and can exhaust driver memory.
- In later Spark versions, you can use DataFrame APIs directly to transform data instead of converting to an RDD and looping through it.
- Similarly, for saving as CSV, you can use the DataFrame write APIs directly.
Thus, to utilize parallelism and improve performance, I would suggest the following changes:
- Repartition your DataFrame df using the repartition function if there are appropriate partition keys.
- Use df directly for all transformations. You can find more information here: pyspark.sql.DataFrame — PySpark 3.2.0 documentation (apache.org). Remember to read the documentation for your Spark version.
- Use df.write to save the data into HDFS.
person venu access_time 2 years ago
Re: PySpark: Convert Python Array/List to Spark Data Frame
Hi Raymond,
But it takes a lot of time because of df.collect().
Is there any way to speed up this process? I tried using --num-executors 5 in spark-submit, but there was no change in performance. If possible, please also suggest how we can leverage --num-executors in this case. Since it's a PySpark DataFrame, I also tried df1 = df.toPandas(), but there was no change in performance.
Thanks for the feedback, Ravi. Welcome to Kontext!
person Ravi access_time 10 months ago
Re: PySpark: Convert Python Array/List to Spark Data Frame
Very nice code and explanation . Excellent feature in pyspark.