Place & process huge file in PySpark
I have a requirement to place and process a huge file of 1 TB or more in HDFS.
The file is available in S3 in Avro format and is received weekly.
There are a few steps being performed (a simplified sketch of steps 2 and 3 follows the list):
1) Ingestion - Copy the data from S3 to HDFS.
2) Extraction - Transform and filter the ingested data and store the result in HDFS.
3) Summary - Run some aggregations, generate a JSON file, and place it in HDFS.
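Roughly, steps 2 and 3 look like the following (a simplified sketch; all paths, column names, and the filter condition are placeholders, and reading Avro assumes the spark-avro package is on the classpath):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weekly-avro-pipeline").getOrCreate()

# Extraction: read the ingested Avro data, filter it, and store the result.
# Reading Avro requires the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:<your Spark version> ...
raw = spark.read.format("avro").load("hdfs:///data/ingested/weekly/")
extracted = (raw
             .filter(F.col("event_type") == "purchase")       # placeholder filter
             .select("customer_id", "amount", "event_date"))  # placeholder columns
extracted.write.mode("overwrite").parquet("hdfs:///data/extracted/weekly/")

# Summary: run aggregations and write the result as JSON.
summary = (extracted
           .groupBy("customer_id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("txn_count")))
summary.write.mode("overwrite").json("hdfs:///data/summary/weekly/")
```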
Because the file is so large, I am unable to process it: the job either fails with out-of-memory errors or runs seemingly forever and has to be killed.
Can you please suggest ways or approaches to tackle this issue?
Comments
Raymond · 2 years ago
Hi, I have a few suggestions for you:
1) Discuss with the source team who generates the big file to see whether they can partition it properly.
2) Repartition your source data after creating a DataFrame from it. This may involve a lot of data shuffling if the source data is not partitioned (see the sketch after this list).
3) Instead of copying the file over to HDFS and then processing it, can you try using Athena on AWS to read and transform the file? Or use AWS Redshift Spectrum? Once the data is transformed, save the result to S3 and transport only the final file to HDFS.
4) Alternatively, you could also use EMR / Glue to process the file first.
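For suggestion 2 (and the general idea of transforming first and moving only the result), here is a minimal sketch that reads the Avro data directly from S3 and repartitions it before the heavy work. The bucket name, paths, and partition count are placeholders, and it assumes the hadoop-aws (s3a) connector and spark-avro package are configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-on-s3").getOrCreate()

# Read the Avro data straight from S3 via the s3a connector instead of
# copying the raw 1 TB file into HDFS first.
df = spark.read.format("avro").load("s3a://my-bucket/weekly/")

# Spread the rows across many partitions before the heavy transformations so
# that no single task has to hold an oversized chunk in memory. Around
# 128-256 MB per partition is a common rule of thumb, so ~1 TB works out to
# a few thousand partitions.
df = df.repartition(4000)

# ... apply your transformations / filters / aggregations here ...

# Write only the (much smaller) result to HDFS.
df.write.mode("overwrite").parquet("hdfs:///data/extracted/weekly/")
```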