Kontext Column

Created for everyone to publish data, programming and cloud-related articles.
Follow three steps to create your columns.


Learn more

Featured articles

52062 views · 14 likes · 2 years ago

Data partitioning is critical to data processing performance, especially for large volumes of data in Spark. Partitions in Spark won't span across nodes, though one node can contain more than one partition. When processing, Spark assigns one task for each partition and each worker thread ...
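As a rough plain-Python sketch of the idea (this is not Spark code; the helper names are invented for illustration), splitting a dataset into partitions and running one independent task per partition looks like this:

```python
# Plain-Python sketch of Spark-style partitioning: the dataset is split into
# partitions and one "task" processes each partition independently.
# make_partitions and run_task are illustrative names, not Spark APIs.

def make_partitions(data, num_partitions):
    """Split data into roughly equal chunks, one per partition."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

def run_task(partition):
    """One task per partition: here, just sum the partition's values."""
    return sum(partition)

data = list(range(1, 11))            # 1..10
partitions = make_partitions(data, 4)
results = [run_task(p) for p in partitions]
print(partitions)                    # 4 chunks; no chunk spans two "nodes"
print(sum(results))                  # 55 - same answer as a single pass
```

The per-partition results are combined at the end, which mirrors how Spark merges task outputs; skew (one partition much larger than the others) would show up here as one `run_task` call doing most of the work.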

33238 views · 0 likes · 2 years ago

In Spark, the SparkContext.parallelize function can be used to convert a Python list to an RDD, and the RDD can then be converted to a DataFrame object. The following sample code is based on Spark 2.x. On this page, I am going to show you how to convert the following list to a data frame: data = [('Category A' ...

19252 views · 0 likes · 2 years ago

Spark provides rich APIs to save data frames to many different file formats such as CSV, Parquet, ORC, Avro, etc. CSV is commonly used in data applications, though binary formats are gaining momentum nowadays. In this article, I am going to show you how to save a Spark data frame as a CSV file in ...

Install Hadoop 3.2.1 on Windows 10 Step by Step Guide
24193 views · 19 likes · 13 months ago

This detailed step-by-step guide shows you how to install the latest Hadoop (v3.2.1) on Windows 10. It also provides a temporary fix for bug HDFS-14084 (java.lang.UnsupportedOperationException INFO).

8017 views · 1 like · 6 months ago

This article shows you how to filter NULL/None values from a Spark data frame using Python. Function DataFrame.filter or DataFrame.where can be used to filter out null values.
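The null-filtering semantics can be sketched in plain Python; this mirrors what `DataFrame.filter` with an `isNotNull` condition does, but is not PySpark code (the sample rows are invented):

```python
# Keep only rows whose "name" value is not None, mirroring
# df.filter(df["name"].isNotNull()) in plain Python.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "c"},
]
non_null = [r for r in rows if r["name"] is not None]
print([r["id"] for r in non_null])  # [1, 3]
```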

33887 views · 0 likes · 3 years ago

This post shows how to derive a new column in a Spark data frame from a JSON array string column. I am running the code in Spark 2.2.1, though it is compatible with Spark 1.6.0 (with fewer JSON SQL functions). Refer to the following post to install Spark in Windows. Install Spark 2.2.1 in Windows ...

15478 views · 8 likes · 2 years ago

This page shows how to manually install Windows Subsystem for Linux (WSL) on a non-system drive. Open PowerShell as Administrator and run the following command to enable the WSL feature: Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux Run the following ...

29225 views · 7 likes · 2 years ago

From Spark 2.0, you can easily read data from the Hive data warehouse and also write/append new data to Hive tables. This page shows how to work with Hive in Spark, including: create a DataFrame from an existing Hive table; save a DataFrame to a new Hive table; append data to an existing Hive table via ...

Pandas DataFrame Plot - Pie Chart
9144 views · 0 likes · 10 months ago

This article provides examples of plotting pie charts using the pandas.DataFrame.plot function. The data I'm going to use is the same as in the other article Pandas DataFrame Plot - Bar Chart. I'm also using Jupyter Notebook to plot them. The DataFrame has 9 records: DATE TYPE ...

26516 views · 4 likes · 2 years ago

Spark is an analytics engine for big data processing. There are various ways to connect to a database in Spark. This page summarizes some common approaches to connect to SQL Server using Python as the programming language. For each method, both Windows Authentication and SQL Server ...

15630 views · 1 like · 3 years ago

PowerShell provides a number of cmdlets to retrieve the current date and time and to create TimeSpan objects. $current = Get-Date $end= Get-Date $diff= New-TimeSpan -Start $current -End $end Write-Output "Time difference is: $diff" $current = [System.DateTime]::Now $end= ...
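For comparison, the same time-difference computation can be sketched in Python with the standard datetime module (the fixed timestamps here are stand-ins for the Get-Date calls):

```python
from datetime import datetime

# Equivalent of the PowerShell snippet: capture two timestamps and compute
# the span between them. Subtracting datetimes yields a timedelta, which
# plays the role of New-TimeSpan's TimeSpan object.
start = datetime(2021, 1, 1, 10, 0, 0)   # stand-in for the first Get-Date
end = datetime(2021, 1, 1, 12, 30, 0)    # stand-in for the second Get-Date
diff = end - start
print(f"Time difference is: {diff}")     # 2:30:00
```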

34867 views · 2 likes · 3 years ago

SQLite is a self-contained and embedded SQL database engine. In .NET Core, Entity Framework Core provides APIs to work with SQLite. This page provides sample code to create a SQLite database using package Microsoft.EntityFrameworkCore.Sqlite . Create a .NET Core 2.x console application in ...

2667 views · 0 likes · 4 months ago

This page summarizes how to retrieve client and server IP addresses in ASP.NET Core applications. The client IP address can be retrieved via the HttpContext.Connection object. This property exists in both Razor page models and ASP.NET MVC controllers. Property RemoteIpAddress ...

3885 views · 0 likes · 6 months ago

This article shows how to change the column types of a Spark DataFrame using Python, for example, converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType. Follow the article Convert Python Dictionary List to PySpark DataFrame to construct a DataFrame.
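The underlying conversions can be sketched in plain Python on a single record (illustrative only; the article itself works on whole DataFrame columns, and the field names below are invented):

```python
from datetime import datetime

# String-to-typed conversions mirroring StringType -> DoubleType,
# IntegerType, and DateType casts on a single record.
record = {"price": "19.99", "qty": "3", "day": "2021-01-09"}
converted = {
    "price": float(record["price"]),                              # DoubleType
    "qty": int(record["qty"]),                                    # IntegerType
    "day": datetime.strptime(record["day"], "%Y-%m-%d").date(),   # DateType
}
print(converted)
```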

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
6235 views · 7 likes · 6 months ago

This detailed step-by-step guide shows you how to install the latest Hadoop v3.3.0 on Windows 10. It leverages the Hadoop 3.3.0 winutils tool, and WSL is not required. This version was released on July 14, 2020. It is the first release of the Apache Hadoop 3.3 line. There are significant changes compared with Hadoop 3.2.0, such as Java 11 runtime support, a protobuf upgrade to 3.7.1, scheduling of opportunistic containers, non-volatile SCM support in HDFS cache directives, etc.

Featured sites

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.

Articles about Apache Hadoop installation, performance tuning and general tutorials.

*The yellow elephant logo is a registered trademark of Apache Hadoop.

Code snippets and tips for various programming languages/frameworks.

Tutorials and information about Teradata.

Articles about ASP.NET Core 1.x, 2.x, 3.x and 5.0.

Everything about .NET Framework, .NET Core and .NET Standard.

PowerShell, CMD, Bash, ksh, sh, Perl, etc.

Latest articles

4 views · 0 likes · 53 minutes ago

From Visual Studio 16.6, local git tags cannot be pushed to the remote. The issue was reported at: https://developercommunity.visualstudio.com/idea/1043472/new-git-user-experience-cannot-push-tags.html. This was done on purpose. The Visual Studio team has added the tag push feature back ...

10 views · 0 likes · 5 days ago

Like other SQL engines, Spark also supports the PIVOT clause. PIVOT is usually used to calculate aggregated values for each value in a column, and the calculated values are included as columns in the result set. PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ] FOR ...
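What PIVOT computes can be sketched in plain Python: group the rows, then turn each value of the pivoted column into its own column (the sample rows below are invented for illustration):

```python
from collections import defaultdict

# Plain-Python sketch of PIVOT (sum(amount) FOR quarter IN ('Q1', 'Q2'))
# grouped by year: each quarter value becomes a column in the result row.
rows = [
    {"year": 2020, "quarter": "Q1", "amount": 100},
    {"year": 2020, "quarter": "Q2", "amount": 200},
    {"year": 2021, "quarter": "Q1", "amount": 150},
    {"year": 2021, "quarter": "Q2", "amount": 250},
]

pivoted = defaultdict(lambda: {"Q1": 0, "Q2": 0})
for r in rows:
    pivoted[r["year"]][r["quarter"]] += r["amount"]

print(dict(pivoted))
# {2020: {'Q1': 100, 'Q2': 200}, 2021: {'Q1': 150, 'Q2': 250}}
```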

6 views · 0 likes · 5 days ago

Spark SQL provides functions to calculate the covariance of a set of number pairs. There are two functions: covar_pop(expr1, expr2) and covar_samp(expr1, expr2). The first calculates population covariance while the second calculates sample covariance. Example: SELECT ...
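The difference between the two is only the divisor (n versus n - 1), as this plain-Python sketch of the formulas shows (the function names mirror the Spark ones but this is not Spark code):

```python
# Population vs sample covariance over paired values, mirroring the
# semantics of Spark SQL's covar_pop and covar_samp.
def covar_pop(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def covar_samp(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))  # 3.333...
```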

10 views · 0 likes · 5 days ago

In Spark SQL, function std, stddev or stddev_samp can be used to calculate the sample standard deviation of the values in a group. std(expr) stddev(expr) stddev_samp(expr) The first two functions are aliases of the stddev_samp function. SELECT ACCT ...
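Python's standard statistics module exposes the same sample-versus-population distinction, which makes the semantics easy to check outside of Spark:

```python
import statistics

# Sample standard deviation (divides by n-1), matching the semantics of
# Spark's stddev/stddev_samp; pstdev (divides by n) matches stddev_pop.
values = [1, 2, 3, 4, 5]
print(statistics.stdev(values))   # sample std dev: sqrt(2.5)
print(statistics.pstdev(values))  # population std dev: sqrt(2)
```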

5 views · 0 likes · 5 days ago

In Spark SQL, functions FIRST_VALUE (FIRST) and LAST_VALUE (LAST) can be used to find the first or the last value of a given column or expression for a group of rows. If parameter `isIgnoreNull` is specified as true, they return only non-null values (unless all values are null). first(expr[ ...
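The ignore-nulls behavior can be sketched in plain Python (the helper names mirror the Spark functions but are illustrative, not Spark APIs):

```python
# first(expr[, isIgnoreNull]) semantics: return the first value in the
# group, optionally skipping nulls (None); None only if all values are null.
def first(values, ignore_nulls=False):
    for v in values:
        if not ignore_nulls or v is not None:
            return v
    return None

def last(values, ignore_nulls=False):
    # last is just first over the reversed sequence
    return first(list(reversed(values)), ignore_nulls)

group = [None, 10, 20, None]
print(first(group))        # None  (nulls included by default)
print(first(group, True))  # 10    (first non-null)
print(last(group, True))   # 20    (last non-null)
```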

6 views · 0 likes · 6 days ago

Unlike traditional RDBMS systems, Spark SQL supports complex types such as array or map. There are a number of built-in functions to operate efficiently on array values. ArrayType columns can be created directly using the array or array_repeat function. The latter repeats one element multiple times ...
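Plain-Python analogues make the two creation functions easy to picture (list construction for array, repetition for array_repeat; these helpers mirror the SQL functions but are not Spark code):

```python
# Plain-Python analogues of Spark SQL's array() and array_repeat().
def array(*elements):
    return list(elements)

def array_repeat(element, count):
    # repeat one element `count` times
    return [element] * count

print(array(1, 2, 3))        # [1, 2, 3]
print(array_repeat("x", 3))  # ['x', 'x', 'x']
```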

7 views · 0 likes · 6 days ago

In Spark SQL, MapType is designed for key-value pairs, similar to the dictionary object type in many other programming languages. This article summarizes the commonly used map functions in Spark SQL. Function map is used to create a map. Example: spark-sql> select ...
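The alternating key/value argument style of map(k1, v1, k2, v2, ...) can be sketched in plain Python (the helper name make_map is invented for this sketch; Spark's SQL function is simply map):

```python
# Spark SQL's map(k1, v1, k2, v2, ...) builds a MapType value from
# alternating keys and values; a dict is the natural Python analogue.
def make_map(*args):
    keys, values = args[0::2], args[1::2]
    return dict(zip(keys, values))

print(make_map("a", 1, "b", 2))  # {'a': 1, 'b': 2}
```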

4 views · 0 likes · 6 days ago

The article Scala: Parse JSON String as Spark DataFrame shows how to convert a JSON string to a Spark DataFrame; this article shows the other way around - converting complex columns to a JSON string using the to_json function. Function 'to_json(expr[, options])' returns a JSON string with a ...
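In plain Python, json.dumps plays the same role as to_json for a dict-shaped value (a sketch of the serialization direction, not Spark code):

```python
import json

# to_json turns a struct/map/array value into a JSON string;
# json.dumps is the plain-Python counterpart for a nested dict.
row = {"id": 1, "tags": ["spark", "sql"]}
print(json.dumps(row))  # {"id": 1, "tags": ["spark", "sql"]}
```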

9 views · 0 likes · 6 days ago

JSON string values can be extracted using built-in Spark functions like get_json_object or json_tuple. The get_json_object function has two parameters: json_txt and path. The first is the JSON text itself, for example a string column in your Spark ...
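A tiny plain-Python walker over json.loads mirrors the path-extraction idea (dot paths only, no array indexing; this is a sketch, not Spark's implementation):

```python
import json

# Mirror get_json_object(json_txt, '$.a.b'): parse the JSON text, then
# walk the dotted path one key at a time.
def get_json_object(json_txt, path):
    obj = json.loads(json_txt)
    for key in path.lstrip("$.").split("."):
        obj = obj[key]
    return obj

doc = '{"user": {"name": "alice", "age": 30}}'
print(get_json_object(doc, "$.user.name"))  # alice
print(get_json_object(doc, "$.user.age"))   # 30
```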

6 views · 0 likes · 7 days ago

Spark SQL function from_json(jsonStr, schema[, options]) returns a struct value parsed from the given JSON string and format. Parameter options is used to control how the JSON is parsed. It accepts the same options as the JSON data source in Spark DataFrame reader APIs. The following code ...
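A plain-Python approximation parses the JSON string and exposes struct-style field access (SimpleNamespace stands in for the struct type; there is no schema enforcement in this sketch):

```python
import json
from types import SimpleNamespace

# from_json(jsonStr, schema) yields a struct value; parsing into
# SimpleNamespace objects gives comparable dotted field access.
parsed = json.loads('{"a": 1, "b": "x"}',
                    object_hook=lambda d: SimpleNamespace(**d))
print(parsed.a, parsed.b)  # 1 x
```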

11 views · 0 likes · 7 days ago

Similar to Convert String to Date using Spark SQL, you can convert a timestamp string to the Spark SQL timestamp data type. Function to_timestamp(timestamp_str[, fmt]) parses the `timestamp_str` expression with the `fmt` expression to a timestamp data type in Spark. Example ...
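The equivalent conversion in plain Python uses strptime, with % directives where Spark's fmt uses pattern letters like yyyy-MM-dd HH:mm:ss (the sample timestamp is invented):

```python
from datetime import datetime

# to_timestamp('2021-01-09 17:34:59', 'yyyy-MM-dd HH:mm:ss') in Spark;
# the same parse in Python uses strptime with % format directives.
ts = datetime.strptime("2021-01-09 17:34:59", "%Y-%m-%d %H:%M:%S")
print(ts)  # 2021-01-09 17:34:59
```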

8 views · 0 likes · 7 days ago

Function unix_timestamp() returns the UNIX timestamp of the current time. You can also specify an input timestamp value. Example: spark-sql> select unix_timestamp(); unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss) 1610174099 spark-sql> select unix_timestamp(current_timestamp ...
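A UNIX timestamp is just seconds since the epoch, which plain Python can reproduce; the example below assumes the timestamp is interpreted in UTC (that timezone assumption is mine, not stated in the article):

```python
from datetime import datetime, timezone

# unix_timestamp(ts) is seconds since 1970-01-01T00:00:00Z;
# datetime.timestamp() computes the same for an aware datetime.
dt = datetime(2021, 1, 9, 6, 34, 59, tzinfo=timezone.utc)
print(int(dt.timestamp()))  # 1610174099
```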

6 views · 0 likes · 7 days ago

Function current_date() or current_date can be used to return the current date at the start of query evaluation. Example: spark-sql> select current_date(); current_date() 2021-01-09 spark-sql> select current_date; current_date() 2021-01-09 *Parentheses are optional for this ...

8 views · 0 likes · 7 days ago

To load data from Hive in Python, there are several approaches: use PySpark with Hive enabled to directly load data from Hive databases using Spark SQL (Read Data from Hive in Spark 1.x and 2.x), or use ODBC or JDBC Hive drivers. Cloudera has implemented ODBC drivers for Hive and ...

7 views · 0 likes · 9 days ago

Spark LEAD function provides access to a row at a given offset that follows the current row in a window. This analytic function can be used in a SELECT statement to compare values in the current row with values in a following row. This function is like  Spark SQL - LAG Window Function .
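LEAD's offset semantics can be sketched over a plain Python list (ignoring the window's partitioning and ordering clauses; the sales figures are invented):

```python
# LEAD(value, offset) returns the value `offset` rows after the current
# row within the window, or a default past the end.
def lead(rows, index, offset=1, default=None):
    j = index + offset
    return rows[j] if j < len(rows) else default

sales = [100, 120, 90, 150]
print([lead(sales, i) for i in range(len(sales))])
# [120, 90, 150, None]  - each row sees the next row's value
```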