Background

This page shows how to load a text file from HDFS through the SparkContext (sc) in Zeppelin.

Reference

The details about this method can be found at:

SparkContext.textFile

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#textFile-java.lang.String-int-

SQLContext

https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/SQLContext.html

Prerequisites

Hadoop and Zeppelin

Refer to the following page to install Zeppelin and Hadoop if you don’t already have an environment to experiment with.

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

Sample text file

In this example, I am going to use the file created in this tutorial:

Create a local CSV file

Step by step guide

Create a new note

Create a new note in Zeppelin with Note Name as ‘Test HDFS’:

image

Create data frame using RDD.toDF function

%spark
import spark.implicits._

// Read file as RDD
val rdd = sc.textFile("hdfs://0.0.0.0:19000/Sales.csv")

// Convert rdd to dataframe using toDF
val df = rdd.toDF
z.show(df)

The output:

image

As shown in the above screenshot, each line of the file becomes one row with a single string column (named value by default).

Next, let’s split each string row into separate fields.
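One way to turn each line into a tuple of strings can be sketched in plain Scala. The sample lines and the column names Month and Amount are assumptions made for illustration; the actual contents of Sales.csv are not shown on this page. The sketch also assumes exactly two comma-separated fields with no quoted commas.

```scala
// Hypothetical sample lines standing in for the contents of Sales.csv.
val lines = Seq("Month,Amount", "01/01/2018,100.5", "01/02/2018,200.0")

// Split one CSV line into a (String, String) tuple.
// Assumes exactly two comma-separated fields and no quoted commas.
def toTuple(line: String): (String, String) = {
  val cols = line.split(",", -1)
  (cols(0), cols(1))
}

// Drop the header line, then convert the remaining lines.
val header = lines.head
val tuples = lines.filter(_ != header).map(toTuple)
```

In a Zeppelin %spark paragraph the same function could be applied to the RDD, for example `val header = rdd.first()` followed by `rdd.filter(_ != header).map(toTuple).toDF("Month", "Amount")`, which should give a data frame with two string columns.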

Read CSV using spark.read

%spark
val df = spark.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
z.show(df)

image
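With only the header option set, the CSV reader loads every column as a string. If typed columns are wanted, one possible variation is to enable the reader's inferSchema option, as in the sketch below. It assumes the same HDFS path as above and a Zeppelin %spark paragraph where spark and z are predefined; typedDf is a name chosen here for illustration.

```scala
// Run inside a Zeppelin %spark paragraph, where `spark` and `z` are predefined.
val typedDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true") // extra pass over the data to guess column types
  .load("hdfs://0.0.0.0:19000/Sales.csv")

// Print the inferred column types, then display the data.
typedDf.printSchema()
z.show(typedDf)
```

Schema inference reads the data an additional time, so for large files an explicit schema may be the better choice.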

Alternative method for converting RDD[String] to DataFrame

For earlier Spark versions, you may need to convert the RDD[String] to a DataFrame using map functions.

%spark
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

// Read file as RDD and separate the header line from the data lines
val rdd = sc.textFile("hdfs://0.0.0.0:19000/Sales.csv")
val header = rdd.first()
val records = rdd.filter(row => row != header)

// Create a data row: first field as string, second field as double
def row(line: List[String]): Row = { Row(line(0), line(1).toDouble) }

// Fixed two-column schema; columnNames is accepted but not used
def dfSchema(columnNames: List[String]): StructType = {
  StructType(
    Seq(
      StructField("MonthOld", StringType, true),
      StructField("Amount", DoubleType, false)
    )
  )
}

val headerColumns = header.split(",").toList
val schema = dfSchema(headerColumns)
val data = records.map(_.split(",").toList).map(row)

//val df = spark.createDataFrame(data, schema)
//or
val df = new SQLContext(sc).createDataFrame(data, schema)

// Parse MonthOld as dd/MM/yyyy and write it back as yyyy-MM-dd in a new Month column
val df2 = df.withColumn("Month", from_unixtime(unix_timestamp(col("MonthOld"), "dd/MM/yyyy"), "yyyy-MM-dd")).drop("MonthOld")

z.show(df2)

The result is similar to the previous one except the date format is also converted:

image
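To make the date conversion concrete, the sketch below applies the same two patterns (dd/MM/yyyy in, yyyy-MM-dd out) to a single value in plain Scala using java.time. It is an illustration only, not what Spark runs internally: the helper name reformatDate and the sample date are assumptions, and unlike the column functions, LocalDate.parse throws a DateTimeParseException on malformed input.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Same patterns as in the withColumn call above.
val inputFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy")
val outputFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")

// Hypothetical helper: parse a dd/MM/yyyy string and re-format it as yyyy-MM-dd.
def reformatDate(value: String): String =
  LocalDate.parse(value, inputFormat).format(outputFormat)

// Hypothetical sample value.
val converted = reformatDate("25/12/2018")
```

Here converted holds the string 2018-12-25, which matches the format of the Month column produced by the data frame version.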

Last modified by Raymond 3 years ago.
