This site uses cookies to deliver our services. By using this site, you acknowledge that you have read and understand our Cookie and Privacy policy. Your use of Kontext website is subject to this policy. Allow Cookies and Dismiss

Read Text File from Hadoop in Zeppelin through Spark Context

1028 views 0 comments last modified about 7 months ago Raymond Tang

zeppelin spark hadoop rdd

Background

This page provides an example to load text file from HDFS through SparkContext in Zeppelin (sc).

Reference

The details about this method can be found at:

SparkContext.textFile

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#textFile-java.lang.String-int-

SqlContext

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/SQLContext.html

Prerequisites

Hadoop and Zeppelin

Refer to the following page to install Zeppelin and Hadoop in your environment if you don’t have one to play with.

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

Sample text file

In this example, I am going to use the file created in this tutorial:

Create a local CSV file

Step by step guide

Create a new note

Create a new note in Zeppelin with Note Name as ‘Test HDFS’:

image

Create data frame using RDD.toDF function

%spark
import spark.implicits._

// Read file as RDD
val rdd=sc.textFile("hdfs://0.0.0.0:19000/Sales.csv")

// Convert rdd to dataframe using toDF
val df = rdd.toDF
z.show(df)

The output:

image

As shown in the above screenshot, each line is converted to one row.

Let’s convert the string rows to string tuples.

Read CSV using spark.read

%spark
val df = spark.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
z.show(df)

image

Alternative method for converting RDD<String> to DataFrame

For previous Spark versions, you may need to convert RDD<String> to DataFrame using map functions.

%spark
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
//import spark.implicits._
import java.text.SimpleDateFormat
import java.util.Date

// Read file as RDD
val rdd=sc.textFile("hdfs://0.0.0.0:19000/Sales.csv")
val header = rdd.first()
val records = rdd.filter(row => row != header)

// create a data row
def row(line: List[String]): Row = { Row(line(0), line(1).toDouble) }

def dfSchema(columnNames: List[String]): StructType = {
  StructType(
      Seq(StructField("MonthOld", StringType, true),
      StructField("Amount", DoubleType, false))
      )
}
     
val headerColumns = header.split(",").to[List]    
val schema = dfSchema(headerColumns)
val data = records.map(_.split(",").to[List]).map(row)

//val df = spark.createDataFrame(data, schema)
//or
val df = new SQLContext(sc).createDataFrame(data, schema)
val df2 = df.withColumn("Month", from_unixtime(unix_timestamp($"MonthOld","dd/MM/yyyy"),"yyyy-MM-dd")).drop("MonthOld")

z.show(df2)

The result is similar to the previous one except the date format is also converted:

image

Related pages

Install Zeppelin 0.7.3 in Windows

1199 views   6 comments last modified about 7 months ago

This post summarizes the steps to install Zeppelin 0.7.3 in Windows environment. Tools and Environment GIT Bash Command Prompt Windows 10 Download Binary Package Download the latest binary package from the following website: ...

View detail

Install Hadoop 3.0.0 in Windows (Single Node)

4564 views   14 comments last modified about 7 months ago

This page summarizes the steps to install Hadoop 3.0.0 in your Windows environment. Reference page: https://wiki.apache.org/hadoop/Hadoop2OnWindows ...

View detail

Write and Read Parquet Files in Spark/Scala

1831 views   2 comments last modified about 7 months ago

In this page, I’m going to demonstrate how to write and read parquet files in Spark/Scala by using Spark SQLContext class. Reference What is parquet format? Go the following project site to understand more about parquet. ...

View detail

Resolve Hadoop RemoteException - Name node is in safe mode

93 views   0 comments last modified about 5 months ago

In Safe Mode, the HDFS cluster is read-only. After completion of block replication maintenance activity, the name node leaves safe mode automatically. If you try to delete files in safe mode, the following exception may raise: org.apache.hadoop.ipc.RemoteException(org.apac...

View detail

Configure Sqoop in a Edge Node of Hadoop Cluster

414 views   0 comments last modified about 5 months ago

This page continues with the following documentation about configuring a Hadoop multi-nodes cluster via adding a new edge node to configure administration or client tools. ...

View detail

Configure YARN and MapReduce Resources in Hadoop Cluster

164 views   0 comments last modified about 5 months ago

When configuring YARN and MapReduce in Hadoop cluster, it is very important to configure the memory and virtual processors correctly. If the configurations are incorrect, the nodes may not be able to start properly and the applications may not be able to run successfully. For example...

View detail

Add comment

Please login first to add comments.  Log in New user?  Register

Comments (0)

No comments yet.