AWS EMR Debug - Container released on a *lost* node

Raymond · 2022-04-29

Recently I've been helping my team debug EMR Spark applications on AWS. One of the most frequent issues was containers being released on lost nodes - Container marked as failed: ... Diagnostics: Container released on a *lost* node. I'm documenting some common approaches I've taken to fix these issues, based on my previous experience with Spark on on-premises systems (CDH) and other cloud big data frameworks (for example, HDInsight on Azure and Dataproc on GCP). Hopefully this provides some insight into debugging your problem.

About containers and nodes

As shown in the following diagram, containers (Spark executors) run on cluster nodes. These containers are created by the NodeManager on each node, in coordination with the Application Master (the Spark driver container).

[Diagram: Spark Memory Management Overview]
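
If you want to see which containers an application is using and which nodes they run on, the YARN CLI can list them. This is only a quick sketch run on the master node; the application and attempt IDs below are placeholders:

# list the attempts of an application, then the containers (and hosts) of an attempt
yarn applicationattempt -list application_1651190000000_0001
yarn container -list appattempt_1651190000000_0001_000001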

From the error message itself, we can tell the nodes were lost. Thus to debug this issue, we can look into the possible root causes for losing nodes. There can be many reasons:

  • You are using Spot units for your Core or Task nodes and the instances were reclaimed.
  • The nodes became unhealthy because log or disk space usage exceeded the configured threshold.
  • Node VMs were accidentally deleted by someone.
  • The node's operating system crashed.
  • Network issues.
  • ...
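
Before digging into a specific cause, it helps to confirm which nodes YARN currently reports as lost or unhealthy. A quick sketch, run on the master node (the YARN CLI is available there by default on EMR):

# list every node known to the ResourceManager, including non-running ones
yarn node -list -all

# or only show the problematic states
yarn node -list -states UNHEALTHY,LOST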

Typical case - Spot units

When configuring instance fleets, you can choose Spot units. Spot Instances are cheaper than On-Demand ones; however, they can be reclaimed at any time, which causes interruptions to your applications. Usually this will not cause your Spark application to fail as long as there are sufficient resources in the cluster overall.

To ensure the stability of your EMR cluster, you could change from Spot Instances to On-Demand Instances at a higher cost, as sketched below.
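
For example, with the AWS CLI you can request On-Demand capacity for the fleets by using TargetOnDemandCapacity instead of TargetSpotCapacity. This is only a sketch: the cluster name, release label, instance types and counts are placeholders, and required options such as the service role and subnet are omitted for brevity.

aws emr create-cluster \
  --name "my-stable-cluster" \
  --release-label emr-6.6.0 \
  --applications Name=Spark \
  --instance-fleets \
    InstanceFleetType=MASTER,TargetOnDemandCapacity=1,InstanceTypeConfigs=['{InstanceType=m5.xlarge}'] \
    InstanceFleetType=CORE,TargetOnDemandCapacity=4,InstanceTypeConfigs=['{InstanceType=m5.xlarge}']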

Typical case - Disk space usage is above threshold

In many projects, we use EMR to read data from and write data to S3 directly.

[Diagram: AWS EMR Read and Write with S3]

In EMR Spark applications, EBS volumes (the disks attached to the cluster nodes) are important when handling large volumes of data, as they are used to store temporary data, cache files, user cache files, logs, HDFS data, etc. They can also be used to persist data, depending on the configured Spark storage level; for example, if memory is not sufficient, Spark can spill data to disk. If the attached disks fill up, the node becomes unhealthy and is eventually lost.

The following error message is a common one you may notice on the YARN ResourceManager's Nodes page:

1/2 local-dirs usable space is below configured utilization percentage/no more usable space [ /mnt/yarn : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/containers : used space above threshold of 90.0% ].
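
The two paths in this message are the NodeManager's local directories and log directories. If you want to confirm where they point on your cluster, you can check yarn-site.xml on the node; a quick sketch, assuming the default EMR configuration location /etc/hadoop/conf:

grep -A 1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
grep -A 1 'yarn.nodemanager.log-dirs' /etc/hadoop/conf/yarn-site.xml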

About utilization configuration

The threshold is configured by the YARN NodeManager property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage. For more details about this property, refer to the Hadoop documentation: Apache Hadoop 3.3.2 – NodeManager.

This property represents the maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. The check is run for every disk used by the NodeManager. The default value is 90, i.e. 90% of each disk can be used, and you can configure any value from 0 to 100.

To configure this, you can add it to the EMR configurations when setting up the cluster, under the yarn-site classification: Configure applications - Amazon EMR.
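
For example, you could save a configuration file like the following (the value 95 is only an example; pick a threshold that suits your workload):

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage": "95"
    }
  }
]

With the AWS CLI, you can then reference it via --configurations file://disk-health-config.json (the file name is a placeholder) together with your other create-cluster options.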

Check disk usage

You can log on to your instance fleet nodes via SSH to check the disk usage of your cluster:

ssh -i /path/to/your/private/key hadoop@{ip address of the node}
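
If you are not sure about the private IP addresses of the nodes, you can list the cluster's instances with the AWS CLI first (the cluster ID below is a placeholder); the output includes each instance's PrivateIpAddress:

aws emr list-instances --cluster-id j-1K48XXXXXXHCB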

Once logged into the server, you can then use Linux commands to check the disk usage.

For example, check mounted volume usage:

df /mnt

It prints out the current usage of the disk space.

Similarly, you can generate a detailed listing of all the folders under /mnt:

sudo du -h /mnt

Or a summary one:

sudo du -sh /mnt
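
To quickly spot the folders that consume the most space, you can sort the detailed listing; for example, two levels deep:

sudo du -h --max-depth=2 /mnt | sort -hr | head -20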

While your application is running, the results will keep changing.

Fix this issue

To fix the issue, you have several options:

  1. Turn off the disk usage check by setting yarn.nodemanager.disk-health-checker.enable to false.
  2. Increase the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage setting to 99 or 100.
  3. Increase the EBS volume size when setting up the cluster (see the sketch after this list).
  4. Increase your EMR cluster size by adding more Core/Task nodes.

Usually the first two options will not help much, and the first one is not recommended since the node will eventually be lost anyway once it consumes all available disk space. Thus the third and fourth options are recommended.
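
For the third option, the EBS volume size is part of the instance fleet (or instance group) configuration. Below is a minimal sketch of a core fleet definition that could be passed via --instance-fleets file://core-fleet.json; the instance type, capacity, volume type and size are placeholders, and the master/task fleets and other create-cluster options are omitted:

[
  {
    "Name": "CoreFleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 4,
    "InstanceTypeConfigs": [
      {
        "InstanceType": "m5.xlarge",
        "EbsConfiguration": {
          "EbsBlockDeviceConfigs": [
            {
              "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 256 },
              "VolumesPerInstance": 2
            }
          ]
        }
      }
    ]
  }
]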

Hopefully the above resolutions give you some tips for resolving your issue. The References section also links to official documentation about common approaches to address this problem.

If you have other ways to resolve similar problems, feel free to post a comment!

References

Hadoop on Windows - UNHEALTHY Data Nodes Fix

Resolve "Exit status: -100. Diagnostics: Container released on a *lost* node" error in Amazon EMR
