Place & process huge file in PySpark

Mrudula T · 2023-03-28

I have a requirement to place and process a huge file of 1 TB or more in HDFS.

The file is available in S3 in Avro format and arrives weekly.


There are a few steps being performed:

1) Ingestion - Copy the data from S3 to HDFS.

2) Extraction - Transform and filter the ingested data and store the result in HDFS.

3) Summary - Run some aggregations, generate a JSON file, and place it in HDFS.
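
For context, below is a minimal sketch of what this pipeline looks like in PySpark. The paths, column names, and spark-avro package version are placeholders for illustration, not the actual job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Paths, column names, and the package version below are placeholders.
spark = (
    SparkSession.builder
    .appName("weekly-avro-pipeline")
    # The spark-avro package must be available on the cluster.
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.2")
    .getOrCreate()
)

# 1) Ingestion: read the weekly Avro drop (from S3 directly, or from HDFS after the copy).
raw = spark.read.format("avro").load("s3a://my-bucket/weekly/data.avro")

# 2) Extraction: transform and filter, then persist the intermediate result in HDFS.
extracted = (
    raw.filter(F.col("status") == "ACTIVE")
       .withColumn("event_date", F.to_date("event_ts"))
)
extracted.write.mode("overwrite").parquet("hdfs:///data/extracted/")

# 3) Summary: aggregate and write the result as JSON to HDFS.
summary = (
    extracted.groupBy("event_date")
             .agg(F.count("*").alias("record_count"))
)
summary.write.mode("overwrite").json("hdfs:///data/summary/")
```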


Since the file is so large, I am unable to process it: the job either fails with memory errors or runs indefinitely and has to be killed.


Can you please suggest ways or approaches to tackle this issue?

Comments
Raymond · 2 years ago

Hi, I have a few suggestions for you:

  • Discuss with the source team that generated the big file to see if they can partition it properly.

  • Repartition your source data after creating a DataFrame from it. This may involve a lot of data shuffling if the source data is not partitioned (see the sketch after this list).

  • Instead of copying it over to HDFS and then processing it, can you try using Athena on AWS to read and transform the file? Or use AWS Redshift Spectrum? Once the data is transformed, save the result to S3 and only transport the final file to HDFS.

  • Alternatively, you could also use EMR / Glue to process the file first.
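
To illustrate the repartition suggestion, here is a minimal sketch. The input path, partition count, and column name are assumptions you would adjust to your own data and cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# Placeholder path; assumes the spark-avro package is on the classpath.
df = spark.read.format("avro").load("hdfs:///data/ingested/")

# Repartition into a larger number of evenly sized partitions so that no
# single task has to hold a huge chunk of the 1 TB file in memory.
# A rough starting point is (total input size / 128 MB) partitions,
# which for 1 TB lands around 8,000.
df = df.repartition(8000)

# If downstream aggregations group by a key, repartitioning by that key
# (hypothetical column "event_date") can reduce shuffling later on:
# df = df.repartition(8000, "event_date")

df.write.mode("overwrite").parquet("hdfs:///data/repartitioned/")
```

Tune the partition count rather than treating 8,000 as fixed; the goal is simply partitions small enough that individual tasks fit comfortably in executor memory.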
