Forums

This forum is for users to share training-related information from external websites.

All training advertisements must be published in this forum only; otherwise the content will be blocked.

Discuss cloud computing technologies, learning resources, etc.

Please publish any Kontext website questions here, including feature suggestions, bug reports, and other feedback. Visit the Help Centre to learn how to use the Kontext platform efficiently.

New comments

Re: Load Data from Teradata in Spark (PySpark)
Raymond · 7 days ago

For the latest Teradata JDBC driver, only one JAR file is required, while earlier versions had two JAR files.

If you hit that error in Jupyter, it means the Teradata JDBC driver path has not been added to the classpath. You can add it via spark.jars when creating the SparkSession:

from pyspark.sql import SparkSession

# Pass the driver JAR(s) to Spark via spark.jars (comma-separated paths)
spark = SparkSession \
    .builder \
    .appName("Spark App") \
    .config("spark.jars", "/path/to/teradata/jdbc.jar,/path/to/another/jar.jar") \
    .getOrCreate()
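
Once the driver is on the classpath, a minimal read might look like the sketch below (host, database, credentials, and table name are placeholders, and the URL assumes the usual Teradata JDBC URL syntax):

# Hypothetical example: adjust host, database, credentials, and table
df = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://hostname/DATABASE=mydb") \
    .option("dbtable", "mytable") \
    .option("user", "username") \
    .option("password", "password") \
    .load()
df.show()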

Re: Load Data from Teradata in Spark (PySpark)
venu · 7 days ago

I only found this file: 'terajdbc4.jar'.

I installed PySpark in a Jupyter notebook and set the classpath in an environment variable.

But I'm still facing this issue: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

Re: Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
Raymond · 9 days ago

I need your system information:

OS Name:                   Microsoft Windows 10 Pro
OS Version:                10.0.19043 N/A Build 19043
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Workstation
OS Build Type:             Multiprocessor Free
...
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 94 Stepping 3 GenuineIntel ~2601 Mhz

You can get it from the systeminfo command.
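
For example, to print just the key fields, one option is to filter the output with findstr:

systeminfo | findstr /B /C:"OS Name" /C:"OS Version" /C:"System Type"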

Re: Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
Antonio · 11 days ago

I'm facing the same issue. Do you want my PowerShell version? It is:

Name                           Value
----                           -----
PSVersion                      5.1.19041.1023
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1023
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide
Raymond · 14 days ago

Hi,

Can you please add more details so that I can help you?

For Hadoop build-related questions, please publish them here:

Compile and Build Hadoop 3.2.1 on Windows 10 Guide - Hadoop Forum - Kontext

This article is about installing Hadoop with a pre-compiled binary package.

Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide
Antonio · 14 days ago

Hi, I'm using your guide with Hadoop 3.2.2, but when I compile the Maven project with the command

mvn package -Pdist -DskipTests -Dtar -Dmaven.javadoc.skip=true

there is an error:

Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.3.1:exec (pre-dist) on project hadoop-project-dist: Command execution failed.:

Any idea how to solve it?

Thanks

Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough
Gopinath · 19 days ago

Thank you Raymond!!

Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough
Raymond · 20 days ago

Hi Gopinath,

You are mostly correct. 

Here are a few items to consider:

  • There are also replicas in a normal production setup, which means the data will be written to the replica nodes too. 
  • HDFS NameNode decides which node stores each block based on the block placement policy. The HDFS client's location also impacts where the blocks are stored. For example, if the client is running on an HDFS data node, the blocks will be placed on that same node first. 
  • By default, each block corresponds to one partition in the RDD when reading from HDFS in Spark (see the quick check below).
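
A quick way to verify the last point is to read a file back from HDFS and inspect the partition count (the path below is a placeholder for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Partition Check").getOrCreate()

# Hypothetical path; point it at any file already stored on HDFS
df = spark.read.parquet("hdfs:///path/to/file")

# By default, the partition count follows the number of HDFS blocks
print(df.rdd.getNumPartitions())
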
Re: Data Partitioning in Spark (PySpark) In-depth Walkthrough
Gopinath · 20 days ago

Great article!!

And to answer your question: let's assume the size of one partition file created by Spark is 200 MB and it is written to HDFS with a block size of 128 MB. The file will then be distributed across 2 HDFS data nodes, and if we read this file back from HDFS, the Spark RDD will have 2 partitions (since the file is distributed across 2 HDFS data nodes).
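
A quick back-of-the-envelope check of that block count, using the sizes assumed above:

import math

file_size_mb = 200    # size of one partition file written by Spark
block_size_mb = 128   # HDFS block size

# ceil(200 / 128) = 2 blocks, and by default Spark creates
# one RDD partition per HDFS block when reading the file back
print(math.ceil(file_size_mb / block_size_mb))  # 2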

Please correct me if this is wrong.

Thank you!

Re: Install Hadoop 3.2.1 on Windows 10 Step by Step Guide
Raymond · 25 days ago

You are welcome. Hope it helps.