Apache Spark installation guides, performance tuning tips, general tutorials, etc.
*Spark logo is a registered trademark of Apache Spark.
When starting Spark shell in Windows 10 machine, I encountered an error - ERROR SparkUI: Failed to bind SparkUI. The detailed error message looks like the following: 20/12/13 20:47:34 ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Failed to bind to /0.0.0.0:4056: Service 'SparkUI' ...
Convert String to Date using Spark SQL
In Spark, function to_date can be used to convert string to date. This function is available since Spark 1.5.0. SELECT to_date('2020-10-23', 'yyyy-MM-dd'); SELECT to_date('23Oct2020', 'ddMMMyyyy'); Refer to the official documentation about all the datetime patterns. ...
.NET for Apache Spark v1.0.0 Released
.NET for Apache Spark v1.0.0 was released officially on 2020-10-14. This page summarizes some important resources for you to get started on .NET for Spark. *Image credit: https://github.com/dotnet/spark/raw/master/docs/img/dotnetsparklogo-6.png Release Notes on GitHub ...
Recently, one of my colleague asked me one question about Spark: for the same SQL statement on finding max value of partition column, different values are returned in Spark SQL and Hive/Impala SQL. The SQL statement looks like the following: SELECT MAX(PART_COL) FROM HiveDb.TestSQL; ...
This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
Turn off INFO logs in Spark
Spark is a robust framework with logging implemented in all modules. Sometimes it might get too verbose to show all the INFO logs. This article shows you how to hide those INFO logs in the console output. Log level can be setup using function pyspark.SparkContext.setLogLevel . The ...
Change Column Type in PySpark DataFrame
This article shows how to change column types of Spark DataFrame using Python. For example, convert StringType to DoubleType, StringType to Integer, StringType to DateType. Follow article Convert Python Dictionary List to PySpark DataFrame to construct a dataframe.
Add Constant Column to PySpark DataFrame
This article shows how to add a constant or literal column to Spark data frame using Python. Follow article Convert Python Dictionary List to PySpark DataFrame to construct a dataframe. +----------+---+------+ | Category| ID| Value| +----------+---+------+ |Category A| 1| ...
This article shows how to 'delete' column from Spark data frame using Python. Follow article Convert Python Dictionary List to PySpark DataFrame to construct a dataframe. +----------+---+------+ | Category| ID| Value| +----------+---+------+ |Category A| 1| 12.40| |Category B| ...
Change DataFrame Column Names in PySpark
Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Python. The following code snippet creates a DataFrame from a Python native dictionary list. PySpark SQL types are used to create the ...