Place & process a huge file in PySpark
I have a requirement to place and process a huge file (1 TB or more) in HDFS.
The file arrives weekly in S3 in Avro format.
There are a few steps being performed:
1) Ingestion - Copy the data from S3 to HDFS
2) Extraction - Transform and filter the ingested data, then store the resulting file in HDFS
3) Summary - Run some aggregations, generate a JSON file, and place it in HDFS
Because the file is so large, I am unable to process it: the job either fails with an out-of-memory error or runs seemingly forever and has to be killed.
Can you please suggest ways or approaches to tackle this issue?