Apache Spark installation guides, performance tuning tips, general tutorials, etc.

Spark's NTILE function divides the rows in each window into 'n' buckets numbered from 1 to at most 'n' (where n is the specified parameter). The following sample SQL uses the NTILE function to divide the records in each window into two buckets. SELECT TXN.*, NTILE(2) OVER (PARTITION BY ...
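
To make the truncated snippet concrete, below is a minimal runnable sketch for spark-shell (where the spark session and its implicits are predefined). The TXN view, its ACCT and AMT columns, and the ordering are hypothetical stand-ins rather than the article's actual schema.

// Hypothetical transaction data registered as the TXN view
val txn = Seq(("A", 100), ("A", 250), ("A", 75), ("B", 300), ("B", 50))
  .toDF("ACCT", "AMT")
txn.createOrReplaceTempView("TXN")

// NTILE(2) splits each partition's rows, ordered by AMT, into buckets 1 and 2
spark.sql("""
  SELECT TXN.*, NTILE(2) OVER (PARTITION BY ACCT ORDER BY AMT) AS BUCKET
  FROM TXN
""").show()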

DENSE_RANK is similar to Spark SQL - RANK Window Function. It calculates the rank of a value in a group of values: the result is one plus the number of distinct values that precede the current row in the ordering of a partition, so tied rows share the same rank. The returned values are sequential in each window, thus no gaps appear in the ranking ...
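
As a quick illustration of the no-gaps behavior, the sketch below can be run in spark-shell; the SCORES data and column names are made up for the example.

// Hypothetical scores containing a tie
val scores = Seq(("a", 10), ("b", 10), ("c", 20), ("d", 30)).toDF("ID", "SCORE")
scores.createOrReplaceTempView("SCORES")

// DENSE_RANK yields 1, 1, 2, 3 here; RANK over the same data would yield 1, 1, 3, 4
spark.sql("""
  SELECT ID, SCORE, DENSE_RANK() OVER (ORDER BY SCORE) AS DR
  FROM SCORES
""").show()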

RANK in Spark calculates the rank of a value in a group of values. It returns one plus the number of rows that precede the current row in the ordering of a partition, with tied rows receiving the same rank. The returned values are therefore not necessarily sequential: ties leave gaps in the ranking. The following sample SQL uses the RANK function without a PARTITION BY ...
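
The sketch below, runnable in spark-shell with made-up data, shows the gap that RANK produces after a tie.

// Hypothetical tied scores, as in the DENSE_RANK example above
val scores = Seq(("a", 10), ("b", 10), ("c", 20), ("d", 30)).toDF("ID", "SCORE")
scores.createOrReplaceTempView("SCORES")

// RANK yields 1, 1, 3, 4: rank 2 is skipped because two rows tie at rank 1
spark.sql("""
  SELECT ID, SCORE, RANK() OVER (ORDER BY SCORE) AS RNK
  FROM SCORES
""").show()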

ROW_NUMBER in Spark assigns a unique sequential number (starting from 1) to each record based on the ordering of rows within each window partition. It is commonly used to deduplicate data. The following sample SQL uses the ROW_NUMBER function without a PARTITION BY clause: SELECT TXN.*, ROW_NUMBER() OVER ...
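
Because deduplication is the most common use, here is a minimal spark-shell sketch of that pattern; the KEY/TXN_DATE/AMT schema and the keep-latest rule are assumptions for illustration.

// Hypothetical records with duplicate keys; keep the latest row per KEY
val txn = Seq(("k1", "2020-01-01", 10), ("k1", "2020-02-01", 20), ("k2", "2020-01-15", 5))
  .toDF("KEY", "TXN_DATE", "AMT")
txn.createOrReplaceTempView("TXN")

// ROW_NUMBER numbers rows 1..n within each KEY; keeping RN = 1 deduplicates
spark.sql("""
  SELECT KEY, TXN_DATE, AMT FROM (
    SELECT TXN.*, ROW_NUMBER() OVER (PARTITION BY KEY ORDER BY TXN_DATE DESC) AS RN
    FROM TXN
  ) t
  WHERE RN = 1
""").show()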

In the Spark-SQL CLI tool, query results omit headings (column names) by default. To display column names, update the Spark setting spark.hadoop.hive.cli.print.header. To make the change apply to all spark-sql sessions, edit the file $SPARK_HOME/conf/spark-defaults.conf and add the following ...
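
A sketch of the change: the property name is the one the article names; the value and placement follow the usual spark-defaults.conf pattern of one property and value per line.

spark.hadoop.hive.cli.print.header true

For a single session, the same setting can instead be passed on the command line:

spark-sql --conf spark.hadoop.hive.cli.print.header=true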

Apache Spark 3.0.1 Installation on Linux or WSL Guide

This article provides a step-by-step guide to installing the latest version of Apache Spark, 3.0.1, on a UNIX-like system (Linux) or Windows Subsystem for Linux (WSL). The instructions can be applied to Ubuntu, Debian, Red Hat, openSUSE, etc. If you are planning to configure Spark 3.0.1 on WSL ...
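
As a rough sketch of the usual download-and-extract steps (the Hadoop 3.2 package name below is an assumption; choose the build that matches your environment):

wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar xzf spark-3.0.1-bin-hadoop3.2.tgz
export SPARK_HOME=$(pwd)/spark-3.0.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH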

When installing a vanilla Spark on Windows or Linux, you may encounter the following error when invoking the spark-sql command: Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver. This error usually occurs when installing a Spark version without built-in Hadoop ...
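
For the "Hadoop free" builds, the workaround described in the Spark documentation is to point Spark at an existing Hadoop installation's jars, typically in $SPARK_HOME/conf/spark-env.sh. A minimal sketch, assuming the hadoop command is on your PATH:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)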

In my article Connect to Teradata database through Python, I demonstrated how to use the Teradata Python package or the Teradata ODBC driver to connect to Teradata. In this article, I'm going to show you how to connect to Teradata through JDBC drivers so that you can load data directly into Spark ...
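
A minimal Scala sketch of the JDBC read, runnable in spark-shell launched with the Teradata JDBC jar (for example via --jars terajdbc4.jar); the host, database, table, and credentials below are placeholders, not values from the article.

val df = spark.read
  .format("jdbc")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("url", "jdbc:teradata://myhost/DATABASE=mydb")  // placeholder host and database
  .option("dbtable", "mydb.my_table")                     // placeholder table
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()
df.show()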

In the article Connect to SQL Server in Spark (PySpark), I showed how to connect to SQL Server in PySpark. Data can be loaded via JDBC, ODBC, and Python drivers. In this article, I will directly use the JDBC driver to load data from SQL Server with Scala. Download the Microsoft JDBC Driver for SQL ...
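
A minimal Scala sketch of the equivalent SQL Server read, assuming the Microsoft JDBC driver jar is on the classpath (for example via --jars mssql-jdbc-8.4.1.jre8.jar); server, database, table, and credentials are placeholders.

val df = spark.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://localhost:1433;databaseName=testdb")  // placeholder server and database
  .option("dbtable", "dbo.my_table")                                     // placeholder table
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()
df.show()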

About 12 months ago, I shared an article about reading and writing XML files in Spark using Python. For many companies, Scala is still preferred for better performance and to utilize the full set of features that Spark offers. Thus, this article will provide examples of how to load XML files as ...
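
A minimal Scala sketch using the spark-xml package (launched, for example, with --packages com.databricks:spark-xml_2.12:0.10.0); the file path and the rowTag value are assumptions for illustration.

// Each <record> element becomes one row; nested elements become struct columns
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")       // assumed row element name
  .load("file:///tmp/sample.xml")   // placeholder path
df.printSchema()
df.show()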