Place & process huge file in PySpark
I have a requirement to place and process a huge file of 1 TB or more in HDFS.
The file is available in S3 in Avro format and is received weekly.
There are a few steps being performed (a simplified sketch of steps 2 and 3 follows the list):
1) Ingestion - Copy the data from S3 to HDFS.
2) Extraction - Transform and filter the ingested data and store the result in HDFS.
3) Summary - Run some aggregations, generate a JSON file, and place it in HDFS.
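Roughly, steps 2 and 3 look like the following (a simplified sketch; all paths, column names, and the filter condition are placeholders, and reading Avro assumes the spark-avro package is on the classpath):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weekly-avro-pipeline").getOrCreate()

# Extraction: read the ingested Avro data, filter it, and store the result.
# Reading Avro requires the spark-avro package, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:<your Spark version> ...
raw = spark.read.format("avro").load("hdfs:///data/ingested/weekly/")
extracted = (raw
             .filter(F.col("event_type") == "purchase")       # placeholder filter
             .select("customer_id", "amount", "event_date"))  # placeholder columns
extracted.write.mode("overwrite").parquet("hdfs:///data/extracted/weekly/")

# Summary: run aggregations and write the result as JSON.
summary = (extracted
           .groupBy("customer_id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("txn_count")))
summary.write.mode("overwrite").json("hdfs:///data/summary/weekly/")
```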
Because the file is so large, I am unable to process it: the job either fails with out-of-memory errors or runs seemingly forever and has to be killed.
Can you please suggest ways or approaches to tackle this issue?
Comments
Raymond · 2 years ago
Hi, I have a few suggestions for you:
1) Discuss with the source team who generates the big file to see whether they can partition it properly.
2) Repartition your source data after creating a DataFrame from it. This may involve a lot of data shuffling if the source data is not partitioned (see the sketch after this list).
3) Instead of copying the file over to HDFS and then processing it, can you try using Athena on AWS to read and transform the file? Or use AWS Redshift Spectrum? Once the data is transformed, save the result to S3 and transport only the final file to HDFS.
4) Alternatively, you could also use EMR / Glue to process the file first.
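For suggestion 2 (and the general idea of transforming first and moving only the result), here is a minimal sketch that reads the Avro data directly from S3 and repartitions it before the heavy work. The bucket name, paths, and partition count are placeholders, and it assumes the hadoop-aws (s3a) connector and spark-avro package are configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-on-s3").getOrCreate()

# Read the Avro data straight from S3 via the s3a connector instead of
# copying the raw 1 TB file into HDFS first.
df = spark.read.format("avro").load("s3a://my-bucket/weekly/")

# Spread the rows across many partitions before the heavy transformations so
# that no single task has to hold an oversized chunk in memory. Around
# 128-256 MB per partition is a common rule of thumb, so ~1 TB works out to
# a few thousand partitions.
df = df.repartition(4000)

# ... apply your transformations / filters / aggregations here ...

# Write only the (much smaller) result to HDFS.
df.write.mode("overwrite").parquet("hdfs:///data/extracted/weekly/")
```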