Raymond Tang

Big Data Engineer, Full Stack .NET and Cross-Platform Software Engineer/Architect


I'm passionate about building data-driven, scalable, cloud-native applications and products.


Microsoft MVP (C#/.NET/Visual Studio, 2010-2016) | MCP | MCSE: Data Management and Analytics | Google Cloud Platform Certified Professional Data Engineer | AWS Certified Cloud Practitioner

Posts

Tags: tutorial, spark, how-to

5 views · 0 likes · 16 hours ago

Spark is a robust framework with logging implemented in all modules. Sometimes the console output can get too verbose when all the INFO logs are shown. This article shows you how to hide those INFO logs in the console output. Spark logging level: the log level can be set using the function pyspark.Spar...
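For a quick taste of the approach, here is a minimal sketch using SparkContext.setLogLevel (the app name is arbitrary):

from pyspark.sql import SparkSession

# Build (or reuse) a session.
spark = SparkSession.builder.appName("quiet-logs").getOrCreate()

# Raise the log level so INFO messages no longer reach the console.
# Valid levels include ALL, DEBUG, INFO, WARN, ERROR, FATAL and OFF.
spark.sparkContext.setLogLevel("WARN")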

Category: Spark

Tags: tutorial, pyspark, spark, how-to

3 views · 0 likes · 17 hours ago

This article shows how to change the column types of a Spark DataFrame using Python, for example converting StringType to DoubleType, StringType to IntegerType, or StringType to DateType. Construct a dataframe: follow article ...
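A minimal sketch of the cast() approach (the dataframe and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("cast-example").getOrCreate()
df = spark.createDataFrame([("1.5", "2020-07-19")], ["amount", "date"])

# cast() accepts either a DataType instance or a type name string;
# withColumn() replaces the original string-typed column.
df2 = (df.withColumn("amount", col("amount").cast(DoubleType()))
         .withColumn("date", col("date").cast("date")))
df2.printSchema()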

Category: Spark

Tags: tutorial, pyspark, spark, how-to

4 views · 0 likes · 17 hours ago

This article shows how to add a constant or literal column to a Spark data frame using Python. Construct a dataframe: follow article Convert Python Dicti...
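A minimal sketch using the lit() function (the dataframe and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("lit-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# lit() wraps a Python literal as a Column, so the same constant
# value is attached to every row.
df2 = df.withColumn("source", lit("manual"))
df2.show()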

Category: Spark

Tags: tutorial, pyspark, spark, how-to

4 views · 0 likes · 17 hours ago

This article shows how to 'delete' a column from a Spark data frame using Python. Construct a dataframe: follow article Convert Python Dictionary List to P...
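A minimal sketch using drop() (the dataframe is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-example").getOrCreate()
df = spark.createDataFrame([(1, "a", True)], ["id", "value", "flag"])

# DataFrames are immutable, so drop() returns a new DataFrame
# without the named column(s); the original df is unchanged.
df2 = df.drop("flag")
df2.printSchema()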

Category: Spark

Tags: tutorial, pyspark, spark, how-to

9 views · 0 likes · 18 hours ago

Column renaming is a common action when working with data frames. In this article, I will show you how to rename column names in a Spark data frame using Python. Construct a dataframe: the following code snippet creates a DataFrame from a Python native dictionary list. Py...
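A minimal sketch of two common renaming approaches (the names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-example").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "val"])

# Rename a single column; this is a no-op if "val" does not exist.
df2 = df.withColumnRenamed("val", "value")

# Rename every column at once, e.g. to upper case.
df3 = df.toDF(*[c.upper() for c in df.columns])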

Category: Spark

Apache Spark 3.0.0 Installation on Linux Guide

Tags: spark, linux, WSL

11 views · 0 likes · 19 hours ago

This article provides a step-by-step guide to install the latest version of Apache Spark 3.0.0 on a UNIX-like system (Linux) or Windows Subsystem for Linux (WSL). These instructions can be applied to Ubuntu, Debian, Red Hat, openSUSE, macOS, etc. Prerequisites Windows Subsyste...

Category: Spark

Install Apache Spark 3.0.0 on Windows 10

Tags: spark, pyspark, windows10

11 views · 0 likes · 21 hours ago

Spark 3.0.0 was released on 18 June 2020 with many new features. The feature highlights include adaptive query execution, dynamic partition pruning, ANSI SQL compliance, significant improvements in the pandas APIs, a new UI for Structured Streaming, up to 40x speedups for calling R user-defined fu...
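As a taste of one highlight: adaptive query execution is off by default in Spark 3.0.0 and can be enabled per session, a minimal sketch of which might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# AQE re-optimizes query plans at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
print(spark.conf.get("spark.sql.adaptive.enabled"))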

Category: Spark

Tags: pyspark, spark

33 views · 0 likes · 5 days ago

CSV is a commonly used data format. Spark provides rich APIs to load files from HDFS as data frames. This page provides examples of how to load CSV from HDFS using Spark. If you want to read a local CSV file in Python, refer to this page ...
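A minimal sketch of reading a CSV file from HDFS (the host, port and file path are placeholders for your own cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-from-hdfs").getOrCreate()

df = (spark.read
      .option("header", "true")       # first line holds column names
      .option("inferSchema", "true")  # sample the data to guess types
      .csv("hdfs://localhost:9000/data/example.csv"))
df.show(5)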

Category: Spark

Tags: linux, hadoop, hdfs, yarn

54 views · 0 likes · 5 days ago

This article provides step-by-step guidance to install Hadoop 3.3.0 on Linux distributions such as Debian, Ubuntu, Red Hat, openSUSE, etc. Hadoop 3.3.0 was released on July 14, 2020. It is the first release of the Apache Hadoop 3.3...

Category: Hadoop

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide

Tags: windows10, hadoop, yarn, hdfs

191 views · 0 likes · 8 days ago

This detailed step-by-step guide shows you how to install the latest Hadoop v3.3.0 on Windows 10. It leverages the Hadoop 3.3.0 winutils tool, and WSL is not required. This version was released on July 14, 2020 and is the first release in the Apache Hadoop 3.3 line. There are significant changes compared with Hadoop 3.2.0, such as Java 11 runtime support, a protobuf upgrade to 3.7.1, scheduling of opportunistic containers, non-volatile SCM support in HDFS cache directives, etc.

Category: Hadoop

Comments

Further update:

I could not compile all Hadoop 3.3.0 projects on Windows 10, but I could compile the winutils project successfully.

You can follow this guide to install Hadoop 3.3.0 on Windows 10 without using WSL:

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide



Andika · 9 days ago
Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide

Hey, I saw an article about how to install Hadoop 3.3.0 on Windows on this website a couple of days ago. But now it's gone. Was it deleted, or is there a way for me to find it?


Hi Andika,

I have just published the WSL version for Hadoop 3.3.0 installation on Windows 10:

Install Hadoop 3.3.0 on Windows 10 using WSL

The one you mentioned is about building Hadoop 3.3.0 on Windows 10. I have temporarily deleted it as I found some issues building the HDFS C/C++ project with CMake.

I'm still working on those unexpected issues and will republish the guide once they are resolved.

Please stay tuned.

BTW, the Hadoop 3.2.1 build instructions are fully tested if you want to try something now.


Regards,

Raymond


Andika · 9 days ago
Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide

Hey, I saw an article about how to install Hadoop 3.3.0 on Windows on this website a couple of days ago. But now it's gone. Was it deleted, or is there a way for me to find it?


To build Hadoop 3.3.0, follow the build instructions:

https://github.com/apache/hadoop/blob/rel/release-3.3.0/BUILDING.txt

Most of the steps are similar to those in this post.



Hi,

Your comment is not relevant to the content on this page. Is it a suggestion you want to make for the Kontext website?

For any suggestions, please post them at https://kontext.tech/forum/kontext-project


pavan · 18 days ago
Re: Connect to Teradata database through Python

By default, links will appear as follows in all browsers:

  • An unvisited link is underlined and blue
  • A visited link is underlined and purple
  • An active link is underlined and red
  • https://www.google.com/

According to https://github.com/baztian/jaydebeapi/issues/131, this issue has now been fixed.

This problem has been fixed; just upgrade JayDeBeApi to 1.2.3 and JPype1 to 0.7.5.
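For reference, a minimal, untested sketch of connecting to Teradata through JayDeBeApi with the upgraded packages (the host, database, credentials and driver JAR path are placeholders):

import jaydebeapi

# Teradata's JDBC driver class ships in terajdbc4.jar;
# replace the path, URL and credentials with your own.
conn = jaydebeapi.connect(
    "com.teradata.jdbc.TeraDriver",
    "jdbc:teradata://myhost/DATABASE=mydb",
    ["myuser", "mypassword"],
    jars="/path/to/terajdbc4.jar",
)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())
cur.close()
conn.close()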

Hi Tim,

Just an update on the previous issue (I understand you have fixed it, but I'd like to post my findings here too in case other people are interested).

I took the following steps to see if I could run the Hadoop daemons without Administrator rights:

  • Created a local computer account named hadoop.
  • Set up environment variables for this account.
  • Reconfigured the HDFS dfs locations for both data and namespace.
  • Formatted the namenode again using this local account:
hadoop namenode -format
  • Started the HDFS daemons:
start-dfs.cmd

The commands ran successfully without any errors.


  • Started the YARN daemons:
start-yarn.cmd

Very interestingly, this time NodeManager started successfully while ResourceManager could not, due to the following error:


org.apache.hadoop.service.ServiceStateException: java.io.IOException: Mkdirs failed to create file:/tmp/hadoop-yarn-hadoop/node-attribute

The YARN tmp folder is configured as follows:

<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///F:/tmp</value>
</property>

So I then tried the following steps:

  • Stopped all the running Hadoop daemons.
  • Deleted the existing tmp folder and recreated it using the hadoop local account.
  • Deleted the DFS folder and recreated it.
  • Reformatted the namenode.
  • Restarted HDFS: the services started successfully.
  • Started the YARN daemons: this time all the services started successfully without any errors.

I could verify that through the Resource Manager UI too.

So to summarize:

  • You don't necessarily need to create the tmp folder under your user directory.
  • You can run the Hadoop services without Administrator privileges on Windows, as long as the HDFS and tmp directories are set up correctly using the Windows account that runs the Hadoop daemons.

Hope the above helps.


Tim · 3 months ago
Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide

Hello,

So far I have been able to get around my need for admin rights by changing my config so that the tmp-nm folder is in my Documents folder instead of directly in tmp on the C drive.

However, it seems I still have some issues. Two of them seem to point to a wrong version of winutils.exe. I am running Windows 10 64-bit and am trying to get Hadoop 3.2.1 running. One symptom of the wrong version is this warning, repeated over and over in the YARN NodeManager window:

WARN util.SysInfoWindows: Expected split length of sysInfo to be 11. Got 7

Another was the failure code of a job I submitted from the Hive prompt to insert data into a table. The job details were found in the Hadoop cluster's local UI:

Application application_1589548856723_0001 failed 2 times due to AM Container for appattempt_1589548856723_0001_000002 exited with exitCode: 1639

Failing this attempt.Diagnostics: [2020-05-15 09:53:23.804]Exception from container-launch.

Container id: container_1589548856723_0001_02_000001

Exit code: 1639

Exception message: Incorrect command line arguments.

Shell output: Usage: task create [TASKNAME] [COMMAND_LINE] |
task isAlive [TASKNAME] |
task kill [TASKNAME]
task processList [TASKNAME]
Creates a new task jobobject with taskname
Checks if task jobobject is alive
Kills task jobobject
Prints to stdout a list of processes in the task
along with their resource usage. One process per line
and comma separated info per process
ProcessId,VirtualMemoryCommitted(bytes),
WorkingSetSize(bytes),CpuTime(Millisec,Kernel+User)

[2020-05-15 09:53:23.831]Container exited with a non-zero exit code 1639.


Some sites have said these two issues are symptoms of having the wrong winutils.exe.

I have some other issues that I'll wait to post until after I can get these fixed.

I have used the link in this article to get winutils.exe. I have also tried other winutils.exe builds I found out there. However, with the other ones I've tried, when starting YARN the NodeManager window fills with errors like:

2020-05-15 10:12:16,444 ERROR util.SysInfoWindows: java.io.IOException: Cannot run program "C:\Users\XXX\Documents\Big-Data\Hadoop\hadoop-3.2.1\bin\winutils.exe": CreateProcess error=216, This version of %1 is not compatible with the version of Windows you're running. Check your computer's system information and then contact the software publisher

So those ones are worse: I can't even get YARN started with those due to that error.

So with the version I am using now, I can get YARN to start, although I still get the "WARN util.SysInfoWindows: Expected split length of sysInfo to be 11. Got 7" warning, but the actual Hive insert fails anyway...

I appreciate the help. How do I find out whether a winutils.exe build is meant for Windows 10 64-bit and Hadoop 3.2.1?


Hi Tim,

On my computer, the paths for HADOOP_HOME and JAVA_HOME are all configured to locations without any spaces, as I was worried that spaces might cause problems in the applications.

That's also the reason most Windows Hadoop installation guides recommend configuring them with paths that contain no spaces. This is even more important for Hive installations.

So I think you are right: the issue was due to the space in your environment variables.

The JAVA_HOME environment variable is set up in the following file:

%HADOOP_HOME%\etc\hadoop\hadoop-env.cmd

And in Step 6 of this page, we've added classpaths for the JARs:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property> 
        <name>mapreduce.application.classpath</name>
        <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
    </property>
</configuration>

You can try changing them to absolute paths wrapped in double quotes to see if they work.

To save all the trouble, I would highly recommend making Java available in a path without spaces, or creating a symbolic link to the Java folder in a location whose path has no spaces.


Tim · 3 months ago
Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide

Hi Raymond,

OK, I got my work IT helpdesk to add my ID to the "Create symbolic links" policy. That worked fine, so I am now past the exitCode=1 CreateSymbolicLink error (1314): "A required privilege is not held by the client."

Now, it is throwing this error:

Application application_1589579676240_0001 failed 2 times due to AM Container for appattempt_1589579676240_0001_000002 exited with exitCode: 1

Failing this attempt.Diagnostics: [2020-05-15 17:57:08.681]Exception from container-launch.

Container id: container_1589579676240_0001_02_000001

Exit code: 1

Shell output: 1 file(s) moved.

"Setting up env variables"

"Setting up job resources"

"Copying debugging information"

C:\Users\XXX\Documents\Big-Data\tmp-nm\usercache\XXX\appcache\application_1589579676240_0001\container_1589579676240_0001_02_000001>rem Creating copy of launch script

C:\Users\XXX\Documents\Big-Data\tmp-nm\usercache\XXX\appcache\application_1589579676240_0001\container_1589579676240_0001_02_000001>copy "launch_container.cmd" "C:/Users/XXX/Documents/Big-Data/Hadoop/hadoop-3.2.1/logs/userlogs/application_1589579676240_0001/container_1589579676240_0001_02_000001/launch_container.cmd"

1 file(s) copied.

C:\Users\XXX\Documents\Big-Data\tmp-nm\usercache\XXX\appcache\application_1589579676240_0001\container_1589579676240_0001_02_000001>rem Determining directory contents

C:\Users\XXX\Documents\Big-Data\tmp-nm\usercache\XXX\appcache\application_1589579676240_0001\container_1589579676240_0001_02_000001>dir 1>>"C:/Users/XXX/Documents/Big-Data/Hadoop/hadoop-3.2.1/logs/userlogs/application_1589579676240_0001/container_1589579676240_0001_02_000001/directory.info"

"Launching container"

[2020-05-15 17:57:08.696]Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :

'C:\Program' is not recognized as an internal or external command,

operable program or batch file.

[2020-05-15 17:57:08.696]Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :

'C:\Program' is not recognized as an internal or external command,

operable program or batch file.


I think I know what is happening but don't know how to fix it. One of my first errors was in hadoop-config.cmd, where it checks if not exist %JAVA_HOME%\bin\java.exe, but the problem is the path C:\Program Files\Java\jre8 (please note the "Program Files" directory has a space in it). The result was that I had to modify hadoop-config.cmd to put double quotes around the check, making it:

if not exist "%JAVA_HOME%\bin\java.exe"

This resolved that problem. Then another place I found an issue was on this line:

for /f "delims=" %%A in ('%JAVA% -Xmx32m %HADOOP_JAVA_PLATFORM_OPTS% -classpath "%CLASSPATH%" org.apache.hadoop.util.PlatformName') do set JAVA_PLATFORM=%%A

This was failing for a similar reason: some values in the classpath were under C:\Program Files\..., and once it hit the space it blew up. For this line I just commented it out, since my HADOOP_JAVA_PLATFORM_OPTS is empty; I am not sure what I would have done had HADOOP_JAVA_PLATFORM_OPTS been populated. In any case, these were preliminary issues, all caused by the space in the C:\Program Files path. So when I saw this latest exception, I assumed it too was hitting that space in the Java path or in some classpath entry, but I am not sure where to modify it or how to work around it.

As of 5/15 6:10 pm EDT, this is my current issue; you may disregard the prior comments if you wish, since they are resolved... Thanks


Hi Tim,

I made similar changes to the ones you did:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>F:/big-data/data/tmp</value>
    </property>
</configuration>

And I could not start the YARN NodeManager service because of the following error:

Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Permissions incorrectly set for dir F:/big-data/data/tmp/filecache, should be rwxr-xr-x, actual value = rwxrwxr-x

This issue is recorded here:

I cannot resolve this problem without running the commands as Administrator. 

Based on the JIRA links, these issues should have been fixed. However, it may not work because my Windows account is not a local account or a domain account.

I will find some time to try using a local account directly (without a Microsoft account) to see if it works.

It seems you didn't hit any issues when changing the local tmp folder, is that correct?


Comment is deleted or blocked.


Hi Tim,

Have you checked the YARN web portal to see whether the Spark application was submitted successfully? You should be able to find more details there too (assuming you are running Spark with master set to yarn).

I'm working today and will try to replicate what you did on my machine when I am off work.




Comment is deleted or blocked.


Can you add the environment variables to your bash profile?

vi ~/.bashrc

Then insert the following lines (replacing the values with your paths as shown in your screenshot):

export HADOOP_HOME='/cygdrive/f/DataAnalytics/hadoop-3.0.0'
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME='/cygdrive/f/DataAnalytics/apache-hive-3.0.0-bin'
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*.jar

Save the file after inserting them.

It's very hard to debug without access to your environment. 


Praveen · 4 months ago
Re: Apache Hive 3.0.0 Installation on Windows 10 Step by Step Guide

Still facing the same issue...

Tried with the derby command. I have shown the echoed paths of Hive and Hadoop in the screenshot below...

Can you help me with this?



Columns

ML.NET is an open-source and cross-platform machine learning framework. With ML.NET, you can create custom ML models using C# or F# without having to leave the .NET ecosystem. This column publishes articles about ML.NET.

Code snippets for various programming languages/frameworks.

Data analytics with Google Cloud Platform.

Data analytics, application development with Microsoft Azure cloud platform.

Posts and tutorials about Power BI.

Apache Sqoop, a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

PowerShell, Bash, ksh, sh, Perl, etc.

General IT information for programming.