Raymond

Spark 3.0.1: Connect to HBase 2.4.1

2021-02-05

Spark does not ship with a built-in HBase connector. To connect to HBase from Spark, we can use the HBase Spark connector from the Apache hbase-connectors project or another third-party connector.

Prerequisites

If you don't have Spark or HBase available, follow these articles to install and configure them.

Spark

Apache Spark 3.0.1 Installation on Linux or WSL Guide

HBase

Install HBase in WSL - Pseudo-Distributed Mode

Prepare HBase table with data

Run the following commands in HBase shell to prepare a sample table that will be used in the following sections.

create 'Person', 'Name', 'Address'
put 'Person', '1', 'Name:First', 'Raymond'
put 'Person', '1', 'Name:Last', 'Tang'
put 'Person', '1', 'Address:Country', 'Australia'
put 'Person', '1', 'Address:State', 'VIC'

put 'Person', '2', 'Name:First', 'Dnomyar'
put 'Person', '2', 'Name:Last', 'Gnat'
put 'Person', '2', 'Address:Country', 'USA'
put 'Person', '2', 'Address:State', 'CA'

The table returns the following result when scanning:

scan 'Person'
ROW                             COLUMN+CELL
 1                              column=Address:Country, timestamp=2021-02-05T20:48:42.088, value=Australia
 1                              column=Address:State, timestamp=2021-02-05T20:48:46.750, value=VIC
 1                              column=Name:First, timestamp=2021-02-05T20:48:32.544, value=Raymond
 1                              column=Name:Last, timestamp=2021-02-05T20:48:37.085, value=Tang
 2                              column=Address:Country, timestamp=2021-02-05T20:49:00.692, value=USA
 2                              column=Address:State, timestamp=2021-02-05T20:49:04.972, value=CA
 2                              column=Name:First, timestamp=2021-02-05T20:48:51.653, value=Dnomyar
 2                              column=Name:Last, timestamp=2021-02-05T20:48:56.665, value=Gnat
2 row(s)

Build HBase Spark connector

If you don't want to build the packages yourself, go to the Summary section to download the binary package I built using the following approach. The built package is provided for testing and learning purposes only.

We need to build the HBase Spark connector for Spark 3.0.1 ourselves because a build for this Spark version is not published to the Maven Central repository.

Refer to the official repository (hbase-connectors/spark at master · apache/hbase-connectors) for more details.

1) Clone the repository using the following command:

git clone https://github.com/apache/hbase-connectors.git

2) Install Maven if it is not available on your WSL:

Install Maven on WSL

3) Change directory to the cloned repository:

cd hbase-connectors/

4) Build the project using the following command:

mvn -Dspark.version=3.0.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.2.4 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.0 -DskipTests -Dcheckstyle.skip -U clean package

The Spark and Hadoop version arguments need to match the versions you have installed. The HBase version argument is an exception, as described in the following note.

For the HBase version, I had to use 2.2.4 because the latest hbase-connectors code was based on that version. Otherwise the build fails with an error like this:

[ERROR] ../hbase-connectors/kafka/hbase-kafka-proxy/src/main/java/org/apache/hadoop/hbase/kafka/KafkaTableForBridge.java:[53,8] org.apache.hadoop.hbase.kafka.KafkaTableForBridge is not abstract and does not override abstract method getRegionLocator() in org.apache.hadoop.hbase.client.Table

Regardless of this, the built package also works with HBase 2.4.1.

Wait until the build is completed. 


The Spark connector JAR file is located at ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar.

Run Spark shell

For simplicity, I will use Spark Shell (Scala) directly for this demo. You can also use PySpark, Scala or another language supported by Spark to implement the same logic in a script.

Start Spark-Shell with HBase connector

Start Spark Shell using the following command:

spark-shell --jars ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar -c spark.ui.port=11111

Remember to change the path of the hbase-spark JAR file to your own location.

Once the Spark session is created successfully, the terminal shows the Spark welcome banner followed by the scala> prompt.
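Before moving on, it can be useful to confirm that the classes used in the next section are visible to the shell, because a missing JAR typically only surfaces later as an import error or a NoClassDefFoundError. The following is a minimal sketch that relies only on standard JVM class loading; the helper name isOnClasspath is hypothetical and is not part of Spark or the connector.

```scala
// Hypothetical helper: returns true if the named class can be located by the
// current thread's class loader. The class is not initialized, only looked up.
def isOnClasspath(className: String): Boolean =
  try {
    Class.forName(className, false, Thread.currentThread().getContextClassLoader)
    true
  } catch {
    // ClassNotFoundException: the class itself is missing.
    // LinkageError (e.g. NoClassDefFoundError): a class it depends on is missing.
    case _: ClassNotFoundException | _: LinkageError => false
  }

// The two classes imported later in this article.
Seq(
  "org.apache.hadoop.hbase.HBaseConfiguration",
  "org.apache.hadoop.hbase.spark.HBaseContext"
).foreach { name =>
  println(s"$name -> ${isOnClasspath(name)}")
}
```

If a class is reported as missing, the JAR that contains it is probably not on the Spark classpath. In that case it may help to add the JAR to --jars or to copy it into the jars directory of the Spark installation.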

Create DataFrame

1) First import the required classes:

import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.HBaseConfiguration

2) Create HBase configurations

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "127.0.0.1:10231")

3) Create HBase context

// Instantiate HBaseContext that will be used by the following code
new HBaseContext(spark.sparkContext, conf)

4) Create DataFrame

val hbaseDF = (spark.read.format("org.apache.hadoop.hbase.spark")
 .option("hbase.columns.mapping",
   "rowKey STRING :key," +
   "firstName STRING Name:First, lastName STRING Name:Last," +
   "country STRING Address:Country, state STRING Address:State"
 )
 .option("hbase.table", "Person")
 ).load()

The column mapping matches the column families and qualifiers of the Person table created above.

5) Show DataFrame schema

scala> hbaseDF.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(lastName,StringType,true), StructField(country,StringType,true), StructField(state,StringType,true), StructField(firstName,StringType,true), StructField(rowKey,StringType,true))

Note that the field order in the schema does not follow the order of the mapping.

6) Show data

hbaseDF.show()

The output should contain the two rows of the Person table, one per row key, with the five mapped columns.

So far, we have successfully loaded data from HBase in Spark 3.0.1.

You can also write to HBase from Spark. Refer to the API documentation for more details.
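The value of the hbase.columns.mapping option in step 4 is a comma-separated list of entries of the form "<column name> <type> <family:qualifier>", where :key denotes the row key. For tables with many columns it may be easier to generate that string from a list than to concatenate literals by hand. A minimal sketch in plain Scala, where ColumnMapping and personColumns are hypothetical names introduced only for illustration:

```scala
// Hypothetical helper type: one DataFrame column mapped to one HBase cell,
// or to the row key when hbaseColumn is ":key".
final case class ColumnMapping(sparkName: String, sparkType: String, hbaseColumn: String)

// Same columns as in step 4 of this article.
val personColumns = Seq(
  ColumnMapping("rowKey", "STRING", ":key"),
  ColumnMapping("firstName", "STRING", "Name:First"),
  ColumnMapping("lastName", "STRING", "Name:Last"),
  ColumnMapping("country", "STRING", "Address:Country"),
  ColumnMapping("state", "STRING", "Address:State")
)

// Each entry becomes "<name> <type> <family:qualifier>"; entries are joined with commas.
val mapping: String = personColumns
  .map(c => s"${c.sparkName} ${c.sparkType} ${c.hbaseColumn}")
  .mkString(", ")

println(mapping)
```

The resulting string can then be passed as .option("hbase.columns.mapping", mapping) in place of the concatenated literal.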

Use catalog

We can also define a catalog for the table Person created above and then use it to read data.

1) Define catalog

def catalog = s"""{
    |"table":{"namespace":"default", "name":"Person"},
    |"rowkey":"key",
    |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"firstName":{"cf":"Name", "col":"First", "type":"string"},
    |"lastName":{"cf":"Name", "col":"Last", "type":"string"},
    |"country":{"cf":"Address", "col":"Country", "type":"string"},
    |"state":{"cf":"Address", "col":"State", "type":"string"}
    |}
|}""".stripMargin

2) Use catalog

Now the catalog can be passed directly as the tableCatalog option:

import org.apache.hadoop.hbase.spark.datasources._

(spark.read
.options(Map(HBaseTableCatalog.tableCatalog->catalog))
.format("org.apache.hadoop.hbase.spark")
.load()).show()

The code can also be simplified as:

(spark.read.format("org.apache.hadoop.hbase.spark")
.option("catalog",catalog)
.load()).show()

Output:

scala> (spark.read
     | .options(Map(HBaseTableCatalog.tableCatalog->catalog))
     | .format("org.apache.hadoop.hbase.spark")
     | .load()).show()
+--------+------+---------+-----+---------+
|lastName|rowkey|  country|state|firstName|
+--------+------+---------+-----+---------+
|    Tang|     1|Australia|  VIC|  Raymond|
|    Gnat|     2|      USA|   CA|  Dnomyar|
+--------+------+---------+-----+---------+
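The catalog JSON repeats the same three attributes (cf, col, type) for every column, so it can also be assembled from a list of column definitions. The following is a sketch in plain Scala under that assumption; CatalogColumn and buildCatalog are hypothetical names, not part of the connector, and the generated text is meant to be equivalent to the hand-written catalog above rather than character-for-character identical.

```scala
// Hypothetical helper type: a DataFrame column plus its HBase column family and qualifier.
final case class CatalogColumn(name: String, cf: String, col: String, dataType: String = "string")

// Hypothetical helper: renders a catalog with the same structure as the one defined in step 1.
def buildCatalog(namespace: String, table: String, columns: Seq[CatalogColumn]): String = {
  val entries = columns.map { c =>
    s""""${c.name}":{"cf":"${c.cf}", "col":"${c.col}", "type":"${c.dataType}"}"""
  }
  Seq(
    "{",
    s""""table":{"namespace":"$namespace", "name":"$table"},""",
    """"rowkey":"key",""",
    """"columns":{""",
    entries.mkString(",\n"),
    "}",
    "}"
  ).mkString("\n")
}

// Same table and columns as in this article; the row key uses the special "rowkey" family.
val generatedCatalog = buildCatalog("default", "Person", Seq(
  CatalogColumn("rowkey", "rowkey", "key"),
  CatalogColumn("firstName", "Name", "First"),
  CatalogColumn("lastName", "Name", "Last"),
  CatalogColumn("country", "Address", "Country"),
  CatalogColumn("state", "Address", "State")
))

println(generatedCatalog)
```

The generated string can be passed to .option("catalog", generatedCatalog) in the same way as the hand-written catalog.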

Summary

Unfortunately, the connector packages for Spark 3.x are not published to the Maven Central repository yet.

To save the time of building the hbase-connectors project, you can download the package I built using WSL: Release 1.0.1 HBase Connectors for Spark 3.0.1 · kontext-tech/hbase-connectors.

Comments

Raymond #1841, 8 months ago

Hello, welcome to Kontext! Can you please try Spark 3.0.1 with the HBase connector I published to see whether the problem is caused by a mismatch of Spark versions? Your current version is 3.4.1 based on the screenshot you provided. If you do need it to work with Spark 3.4.1, we need to find the corresponding connector source and build a package from it.

(In reply to a comment that has been deleted or blocked.)

Raymond #1561, 3 years ago

Just to follow up on this one as I didn't hear back from you. Have you resolved this problem?

Raymond #1559, 3 years ago

Hi cansın,

What is your version of HBase?

Also, can you specify the full path to your HBase Spark connector JAR file? For example, in this article I use a path under my home directory (~/):

spark-shell --jars ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar


cansın tartıcı #1558, 3 years ago

Hi Raymond, thanks for the article.

I have managed to build my own JAR and start the shell with the following command:
spark-shell --jars hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar

But when I run the imports, I get the following error:

scala> import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseContext

scala> import org.apache.hadoop.hbase.HBaseConfiguration
<console>:24: error: object HBaseConfiguration is not a member of package org.apache.hadoop.hbase
       import org.apache.hadoop.hbase.HBaseConfiguration

Do you have any idea what might be wrong?


Administrator #1548, 3 years ago

Please contact us via the Contact us page and we will try to arrange a Teams session for you.


Pavan Kumar Yerravelly #1541, 3 years ago

Yes, they all are in current directory. Can we connect if possible?


Raymond #1540, 3 years ago

Are all those jars included in the current directory where you started spark-shell?

You can also put them manually into the jars directory of your Spark installation.


Pavan Kumar Yerravelly #1537, 3 years ago

I think there is no issue with the build, but I'm unable to connect to HBase from Spark. I'm using a Docker environment where ZooKeeper, HDFS, Spark, and HBase run in different containers in the same network.

Here are the jars I'm using.

spark-shell --jars hbase-spark-protocol-shaded-1.0.0.7.2.12.0-291.jar,htrace-core4-4.2.0-incubating.jar,hbase-shaded-protobuf-3.5.1.jar,protobuf-java-2.5.0.jar,hbase-protocol-2.4.8.jar,hbase-shaded-miscellaneous-3.5.1.jar,hbase-mapreduce-2.4.8.jar,hbase-server-2.4.8.jar,hbase-client-2.4.8.jar,hbase-common-2.4.8.jar,hbase-spark-1.0.1-SNAPSHOT.jar,hadoop-common-2.8.5.jar --files hbase-site.xml

I have almost all the required jars but I am still seeing the error below. I tried my best to debug the issue but didn't find a way to get rid of it. Please advise me how to resolve this or point me to any detailed documentation about the prerequisites.

java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/shaded/protobuf/generated/MasterProtos$MasterService$BlockingInterface
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

Raymond #1536, 3 years ago

I'm glad to hear that. Since the HBase minor version is slightly different, the package might cause unexpected problems, although this is unlikely because the major version is the same. Please let me know if you encounter an issue like that.

Pavan Kumar Yerravelly #1535, 3 years ago

Awesome!
That worked. Thanks again for your help, Raymond.

In reply to Raymond, 3 years ago:

Hi Pavan,

The issue you encountered is the same one I mentioned in the article; it is caused by incompatible versions of HBase and the connector code.

For the HBase version, I had to use 2.2.4 as the latest hbase-connectors code was based on that version.

So please try the following command:

mvn -Dspark.version=3.1.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.2.4 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.1 -DskipTests -Dcheckstyle.skip -U clean package

The built package should still work with HBase 2.4.7.

Regards,
Raymond
