spark hadoop pyspark oozie hue

Run Multiple Python Scripts PySpark Application with yarn-cluster Mode

262   0   about 2 months ago

When submitting Spark applications to a YARN cluster, two deploy modes can be used: client and cluster. In client mode (the default), the Spark driver runs on the machine from which the application was submitted, while in cluster mode the driver runs on a node inside the cluster that YARN selects. On this page, I am goin...
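A minimal sketch of how the two deploy modes map onto a `spark-submit` command line. The helper name `build_spark_submit` and the file names `main.py`, `utils.py` and `jobs.py` are hypothetical and used only for illustration; the flags `--master`, `--deploy-mode` and `--py-files` are standard `spark-submit` options. This is not the article's own code, which is cut off above.

```python
import shlex


def build_spark_submit(main_script, py_files=(), deploy_mode="client", master="yarn"):
    """Build the spark-submit argument list for a PySpark application.

    main_script and py_files are hypothetical file names supplied by the caller.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        # Extra Python modules are shipped to the cluster as a comma-separated list.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(main_script)
    return cmd


# Cluster mode: the driver runs inside the cluster, so helper scripts
# must be shipped along with the main script.
cmd = build_spark_submit("main.py", py_files=["utils.py", "jobs.py"], deploy_mode="cluster")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The argument list can be passed to `subprocess.run(cmd)` on a machine where `spark-submit` is on the PATH.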

View detail
python pyspark pandas

Convert PySpark Row List to Pandas Data Frame

188   0   about 2 months ago

In Spark, it’s easy to convert a Spark DataFrame to a Pandas DataFrame with one line of code: df_pd = df.toPandas(). On this page, I am going to show you how to convert a list of PySpark Row objects to a Pandas data frame. Prepare the data frame The fo...

View detail
spark hadoop yarn oozie

Diagnostics: Container is running beyond physical memory limits

263   0   about 4 months ago

Scenario Recently I created an Oozie workflow that contains one Spark action. The Spark action master is yarn and the deploy mode is cluster. Each time the job has run for about 30 minutes, the application fails with errors like the following: Application applicatio...
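As background for this kind of error, here is a rough sketch of how much memory Spark requests from YARN for each executor container. It assumes the Spark 2.x default, where the memory overhead is 10% of the executor memory with a 384 MB minimum; the function name `yarn_container_mb` is hypothetical, and the article's actual diagnosis and fix are cut off above, so this is not necessarily the cause in that scenario.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None,
                      factor=0.10, min_overhead_mb=384):
    """Estimate the memory (MB) requested from YARN for one executor container.

    Assumes the Spark 2.x default overhead rule: max(10% of executor memory, 384 MB),
    unless an explicit overhead is supplied.
    """
    if overhead_mb is None:
        overhead_mb = max(int(executor_memory_mb * factor), min_overhead_mb)
    return executor_memory_mb + overhead_mb


# 4 GB executor with the default overhead
print(yarn_container_mb(4096))
# 2 GB executor: the 384 MB minimum applies
print(yarn_container_mb(2048))
# 4 GB executor with an explicit 1 GB overhead
print(yarn_container_mb(4096, overhead_mb=1024))
```

If a process in the container uses more physical memory than this total, YARN may kill it with a "running beyond physical memory limits" diagnostic; raising the overhead setting is one commonly tried adjustment.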

View detail
hadoop linux wsl

Install Hadoop 3.2.0 on Windows 10 using Windows Subsystem for Linux (WSL)

5,467   16   about 4 months ago

In my previous post, I showed how to configure a single node Hadoop instance on Windows 10. The steps are not too difficult to follow if you have Java programming backgr...

View detail
lite-log spark pyspark

Fix PySpark TypeError: field **: **Type can not accept object ** in type <class '*'>

573   0   about 4 months ago

When creating a Spark data frame using schemas, you may encounter errors like “field **: **Type can not accept object ** in type <class '*'>”. The actual error can vary; for instance, the following are some examples: field xxx: BooleanType can not accept object 100 in type ...

View detail
python spark pyspark

PySpark: Convert Python Array/List to Spark Data Frame

1,757   0   about 4 months ago

In Spark, the SparkContext.parallelize function can be used to convert a Python list to an RDD, and the RDD can then be converted to a DataFrame object. The following sample code is based on Spark 2.x. On this page, I am going to show you how to convert the following list to a data frame: data = [(...

View detail
teradata spark pyspark

Load Data from Teradata in Spark (PySpark)

810   0   about 4 months ago

In my article Connect to Teradata database through Python, I demonstrated how to use the Teradata Python package or the Teradata ODBC driver to connect to Teradata. In this article, I’m going to...

View detail
python spark hadoop pyspark

Read Hadoop Credential in PySpark

256   0   about 4 months ago

In one of my previous articles about Password Security Solution for Sqoop, I mentioned creating credentials using the hadoop credential command. The credentials are stored in JavaKey...

View detail
zeppelin spark hadoop linux sqoop hive wsl

Big Data Tools on Windows via Windows Subsystem for Linux (WSL)

537   0   about 5 months ago

This page summarizes the installation guides for big data tools on Windows through Windows Subsystem for Linux (WSL). ...

View detail
linux sqoop wsl

Sqoop Installation on Windows 10 using Windows Subsystem for Linux

376   0   about 5 months ago

This page summarizes the steps required to install Apache Sqoop (v1.4.7) in a Windows 10 environment via Windows Subsystem for Linux (WSL). Prerequisites If you have already installed Hadoop 3.2.0 in WSL, ignore the following steps as you don’t need to install it again. Follow...

View detail
spark linux wsl

Apache Spark 2.4.3 Installation on Windows 10 using Windows Subsystem for Linux

1,509   4   about 5 months ago

This page summarizes the steps to install the latest version 2.4.3 of Apache Spark on Windows 10 via Windows Subsystem for Linux (WSL). Prerequisites Follow either of the following pages to install WSL in a system or non-system drive on your Windows 10. ...

View detail
zeppelin spark linux wsl

Install Zeppelin 0.7.3 on Windows 10 using Windows Subsystem for Linux (WSL)

686   0   about 6 months ago

This page summarizes the steps to install Zeppelin version 0.7.3 on Windows 10 via Windows Subsystem for Linux (WSL). Version 0.8.1 When running Zeppelin in Ubuntu, the server may pick up one host address that is not accessible, for example, and the remote interprete...

View detail
hadoop hive wsl

Apache Hive 3.1.1 Installation on Windows 10 using Windows Subsystem for Linux

1,199   2   about 6 months ago

Previously, I demonstrated how to configure Apache Hive 3.0.0 on Windows 10. Apache Hive 3.0.0 Installation on Windows 10 Step by Step Guide...

View detail
lite-log hive

HiveServer2 Cannot Connect to Hive Metastore Resolutions/Workarounds

426   0   about 6 months ago

Since Hive 3.x, a new authentication feature for HiveServer2 clients has been added. When starting the HiveServer2 service (Hive version 3.0.0), you may encounter errors like: ‘HiveServer2 metastore.RetryingMetaStoreClient: RetryingMetaStoreClient trying reconnect as [username] (auth:S...

View detail
sql server hive

Configure a SQL Server Database as Remote Hive Metastore

693   0   about 6 months ago

In one of my previous posts, I showed how to configure Apache Hive 3.0.0 in Windows 10. Apache Hive 3.0.0 Installation on Windows 10 Step by Step Guide ...

View detail
lite-log linux wsl ubuntu

Install Windows Subsystem for Linux on a Non-System Drive

753   0   about 6 months ago

This page shows how to manually install Windows Subsystem for Linux (WSL) on a non-system drive. Enable Windows Subsystem for Linux system feature Open PowerShell as Administrator and run the following command to enable the WSL feature: Enable-WindowsOptionalFea...

View detail
kontext lite-log

Notification Email Address Change Notice

44   0   about 6 months ago

In the past months, this website has been using the following email address to deliver all notification messages to website users, such as registration confirmation emails, comment emails and so on. no-reply[at] However, I recently found that...

View detail
sql server java kerberos ntlm

JDBC Integrated Security, NTLM and Kerberos Authentication for SQL Server

679   0   about 6 months ago

With Microsoft SQL Server JDBC driver, you can connect to the database through SQL Server Authentication or Kerberos Authentication. This post summarizes the configurations required for each authentication method with coding examples. *NTLM block in the following diagram represents pure Jav...

View detail

Querying Teradata and SQL Server - Tutorial 1: The SELECT Statement

37,329   7   about 5 years ago

SELECT is one of the most commonly used statements. In this tutorial, I will cover the following items: Two of the principal query clauses—FROM and SELECT Data Types Built-in functions CASE expressions and variations like ISNULL and COALESCE. * The functio...

View detail
hadoop yarn hdfs

Install Hadoop 3.0.0 in Windows (Single Node)

22,642   30   about 2 years ago

This page summarizes the steps to install Hadoop 3.0.0 in your Windows environment. Reference page: ...

View detail

Install Teradata Express by Using VMware Player 6.0 in Windows

14,979   23   about 6 years ago

In this article, I am going to introduce how to install Teradata Express in virtual machines in Windows. Download software 1) Download VMware Player for Windows 32-bit and 64-bit from the following link (version 6.0): ...

View detail
core 2

Server.MapPath Equivalent in ASP.NET Core 2

14,049   2   about 3 years ago

In traditional applications, Server.MapPath is commonly used to generate an absolute path on the web server. However, this has been removed from ASP.NET Core. So what is the equivalent way of doing it?

View detail

Working with SQL Server Compact 4.0 using Entity Framework 6 and ADO.NET

12,927   0   about 6 years ago

SQL Server Compact 4.0 (CE 4.0) is a free SQL Server embedded database ideal for building standalone and occasionally connected applications for mobile devices, desktops, Web clients and others. In one of my projects, I used it as the database for logging errors, which assumes the errors will onl...

View detail
spark scala parquet

Write and Read Parquet Files in Spark/Scala

12,612   2   about 2 years ago

On this page, I’m going to demonstrate how to write and read parquet files in Spark/Scala by using the Spark SQLContext class. Reference What is parquet format? Go to the following project site to understand more about parquet. ...

View detail

Create ETL Project with Teradata through SSIS

11,957   4   about 5 years ago

Infosphere DataStage is adopted as ETL (Extract, Transform, Load) tool in many Teradata based data warehousing projects. With the Teradata ODBC and .NET data providers, you can also use the BI tools from Microsoft, i.e. SSIS. In my previous post, I demonstrated how to install Teradata Tool...

View detail
core identity core 2

Retrieve Identity username, email and other information in ASP.NET Core

11,452   0   about 3 years ago

The identity system in ASP.NET has evolved over time. If you are using ASP.NET Core, you have probably found that the User property is an instance of ClaimsPrincipal in controllers and Razor views. Thus, to retrieve the information, you need to use the claims.

View detail

Generate Formatted Excel Destination (Output) in SSIS Data Flow Task

10,991   0   about 6 years ago

SSIS (SQL Server Integration Services) provides a number of convenient tasks to enable data integration. Exporting data from a database to an Excel file is a common task in ETL (Extract, Transform, Load) projects. Users/customers frequently raise formatting requests regarding the Excel extract. To g...

View detail
java kerberos

Java Kerberos Authentication Configuration Sample & SQL Server Connection Practice

9,194   2   about 4 years ago

Overview Recently, I have been working on an ETL framework to load various source data (i.e. files, SQL Server, Oracle and Teradata) into Teradata. Due to some limitations, Java was chosen as the implementation language though IBM Infosphere DataStage is available to use. DataStage has p...

View detail
.net core entity-framework

SQLite in .NET Core with Entity Framework Core

9,094   2   about 2 years ago

SQLite is a self-contained and embedded SQL database engine. In .NET Core, Entity Framework Core provides APIs to work with SQLite. This page provides sample code to create a SQLite database using the package Microsoft.EntityFrameworkCore.Sqlite. Create sample project ...

View detail
python spark

PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame

8,807   0   about 10 months ago

This post shows how to derive a new column in a Spark data frame from a JSON array string column. I am running the code in Spark 2.2.1, though it is compatible with Spark 1.6.0 (with fewer JSON SQL functions). Prerequisites Refer to the following post to install Spark in Windows. ...

View detail
dotnet core angular core 2

Issue - Unable to get property 'apply' of undefined or null reference occurred in Angular 4.*, VS2017 15.3, ASP.NET Core 2.0

8,410   10   about 3 years ago

Issue Context After installing Visual Studio 2017 15.3 preview and the .NET Core 2.0 preview SDK, I upgraded one of my existing core projects to 2.0. The project was created using the ‘dotnet new angular’ SPA template. I also upgraded all the client app packages to the latest. For exa...

View detail
teradata python

Connect to Teradata database through Python

8,405   3   about 2 years ago

Teradata published an official Python module which can be used in DevOps projects. More details can be found at the following GitHub site: Install Teradata module ...

View detail

Connect to Teradata Virtual Machine Guest from Windows Host

7,705   16   about 5 years ago

In my previous posts about Querying Teradata and SQL Server, I logged into the virtual machine's graphical interface to manage the database. However, I found this consistently resource intensive, as my laptop has only 4 GB of memory. Instead, I will use text mode to start the virtual machine and co...

View detail
lite-log spark hdfs scala parquet

Write and Read Parquet Files in HDFS through Spark/Scala

7,384   0   about 2 years ago

In my previous post, I demonstrated how to write and read parquet files in Spark/Scala. The parquet file destination is a local folder. Write and Read Parquet Files in Spark/Scala In this page...

View detail
angular lite-log

ng is not recognized as an internal or external command (Windows 10)

7,257   2   about 12 months ago

Problem When you follow Angular CLI installation guide in Windows, you may encounter the following error: ng is not recognized as an internal or external command The resolutions are available in the following link: ...

View detail
hadoop hive

Apache Hive 3.0.0 Installation on Windows 10 Step by Step Guide

7,193   9   about 7 months ago

If you have been following my website, you would know I’ve published a number of articles about installing big data tools/framewo...

View detail

about 1 day ago

Hi Team,

At the step "Set up Hive HDFS folders", while creating a directory using hadoop fs -mkdir /tmp in cmd, the system throws an error.

mkdir: Your endpoint configuration is wrong

Please suggest how to resolve this.



about 2 days ago

Hi there,

I have installed Hadoop 3 as per the instructions mentioned above. Please suggest the steps to load data into Hadoop through cmd in Windows 10 and also to perform operations on it.



about 14 days ago


You got that error because Zeppelin cannot find the SQL Server JDBC driver.

Have you set up the dependencies for the interpreter as shown in the screenshot above?


Make sure Zeppelin installs this artifact from the internet successfully.

about 14 days ago

After running the notebook, I encountered this error:
java.lang.ClassNotFoundException: at Source) at java.lang.ClassLoader.loadClass(Unknown Source) at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Unknown Source) at org.apache.zeppelin.jdbc.JDBCInterpreter.createConnectionPool( at org.apache.zeppelin.jdbc.JDBCInterpreter.getConnectionFromPool( at org.apache.zeppelin.jdbc.JDBCInterpreter.getConnection( at org.apache.zeppelin.jdbc.JDBCInterpreter.executeSql( at org.apache.zeppelin.jdbc.JDBCInterpreter.interpret( at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret( at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun( at at org.apache.zeppelin.scheduler.ParallelScheduler$ at jaat java.util.concurrent.Executors$ Source) at Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$ Source) at Source)
Please help me fix it!

about 2 months ago

@Raymond Tang I ran it in the console. I did not double-click the cmd file.

about 2 months ago

When you run the cmd script, did you directly open the script file or run the command line in Command Prompt?

about 2 months ago

I saw your tutorial about the installation of Hadoop on Windows.
However, I am getting this error when I try to run the YARN daemons with start-yarn.cmd:

This file does not have an app associated with it for performing this action. Please install an app or, if one is already installed, create an association in the default apps settings page.

Do you know some solution for that?

Thanks in advance.

about 2 months ago

Did you follow all the exact steps in my post? It seems like the Java path (configured in environment variables) doesn't include some of the jar files Hadoop is using. However, it's hard to debug without access to your environment.

about 2 months ago

I'm using JDK 8.

I tried setting yarn-nodemanager-opts and yarn-resourcemanager-opts like the link you gave me, but no luck; the error is still there.

about 2 months ago


It seems your problem is similar to the following:

Are you using JDK9 or above?

Can you try with JDK 8? I have not tried with JDK 9 or above as it was not fully supported. It may work now but Java 8 is the one recommended from the official website. 

about 2 months ago

I can't start up the resource manager and node manager.

I got the error:


 WARN webapp.WebAppContext: Failed startup of context o.e.j.w.WebAppContext@53830483{/,file:///C:/Users/ASUS/AppData/Local/Temp/jetty-,UNAVAILABLE}{/cluster} Unable to provision, see the following errors:

1) Error injecting constructor, java.lang.NoClassDefFoundError: javax/activation/DataSource at org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver.<init>(

  at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebApp.setup(

  while locating org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver


about 2 months ago


Can you add more details about the question? I am not sure whether I understand correctly. If you want to execute that command, you can directly run it in bash/Terminal.

about 2 months ago

how to we execute "hdfs fsck" command, it's giving regarding the file system commands

about 3 months ago


I'm not sure whether I understand your question correctly. Once you import the data using Import or Direct Query, you can then customise it through Power Query Editor in Power BI:

about 3 months ago

Hi, please try this repository

Unfortunately, this is only updated to Hadoop 3.0.0.

I'm not sure whether it works for 3.2.0. You can give it a try. Please pay attention to the file about the author's declaration. 

about 3 months ago

You mention "native Hadoop executable (winutils.exe) is not included in the official Hadoop distribution and needs to be downloaded separately [or built locally]."  Do you happen to know where this can be downloaded for 3.2.0?

about 3 months ago


Thank you for your post.

Do you know how to customize the query in Power Query?

Couldn't find it in any documentation.


Joana Barbosa