Spark 3.0.1: Connect to HBase 2.4.1
Spark doesn't include a built-in HBase connector. We can use the HBase Spark connector or other third-party connectors to connect to HBase from Spark.
Prerequisites
If you don't have Spark or HBase available to use, you can follow these articles to configure them.
Spark
Apache Spark 3.0.1 Installation on Linux or WSL Guide
HBase
Install HBase in WSL - Pseudo-Distributed Mode
Prepare HBase table with data
Run the following commands in HBase shell to prepare a sample table that will be used in the following sections.
create 'Person', 'Name', 'Address'
put 'Person', '1', 'Name:First', 'Raymond'
put 'Person', '1', 'Name:Last', 'Tang'
put 'Person', '1', 'Address:Country', 'Australia'
put 'Person', '1', 'Address:State', 'VIC'
put 'Person', '2', 'Name:First', 'Dnomyar'
put 'Person', '2', 'Name:Last', 'Gnat'
put 'Person', '2', 'Address:Country', 'USA'
put 'Person', '2', 'Address:State', 'CA'
The table returns the following result when scanning:
scan 'Person'
ROW    COLUMN+CELL
 1     column=Address:Country, timestamp=2021-02-05T20:48:42.088, value=Australia
 1     column=Address:State, timestamp=2021-02-05T20:48:46.750, value=VIC
 1     column=Name:First, timestamp=2021-02-05T20:48:32.544, value=Raymond
 1     column=Name:Last, timestamp=2021-02-05T20:48:37.085, value=Tang
 2     column=Address:Country, timestamp=2021-02-05T20:49:00.692, value=USA
 2     column=Address:State, timestamp=2021-02-05T20:49:04.972, value=CA
 2     column=Name:First, timestamp=2021-02-05T20:48:51.653, value=Dnomyar
 2     column=Name:Last, timestamp=2021-02-05T20:48:56.665, value=Gnat
2 row(s)
Build HBase Spark connector
We need to build the HBase Spark connector for Spark 3.0.1 ourselves as it is not published to the Maven central repository.
Refer to the official repository (hbase-connectors/spark at master · apache/hbase-connectors) for more details.
1) Clone the repository using the following command:
git clone https://github.com/apache/hbase-connectors.git
2) Install Maven if it is not available on your WSL:
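For example, on a Debian or Ubuntu based WSL distribution, Maven can typically be installed with apt (this assumes apt is your package manager; use your distribution's equivalent if it differs):
sudo apt-get update
sudo apt-get install -y maven
mvn -version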
3) Change directory to the cloned repository:
cd hbase-connectors/
4) Build the project using the following command:
mvn -Dspark.version=3.0.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.2.4 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.0 -DskipTests -Dcheckstyle.skip -U clean package
The version arguments need to match your Hadoop, Spark and HBase versions. Note that the HBase version is set to 2.2.4 because the latest connector code is based on that version; regardless, the built package will also work with HBase 2.4.1.
Wait until the build is completed.
The Spark connector JAR file is located at ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar.
Run Spark shell
For simplicity, I will directly use Spark Shell (Scala) for this demo. You can use PySpark, Scala or other Spark-supported languages to implement the logic in a script.
Start Spark-Shell with HBase connector
Start Spark Shell using the following command:
spark-shell --jars ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar -c spark.ui.port=11111
Remember to change the hbase-spark package path to your own location.
Once the Spark session is created successfully, the terminal looks like the following screenshot:
Create DataFrame
1) First import the required classes:
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.HBaseConfiguration
2) Create HBase configurations
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "127.0.0.1:10231")
3) Create HBase context
// Instantiate HBaseContext that will be used by the following code
new HBaseContext(spark.sparkContext, conf)
4) Read data from the HBase table as a DataFrame:
// Read the HBase table into a DataFrame using the column mapping
val hbaseDF = (spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "rowKey STRING :key," +
    "firstName STRING Name:First, lastName STRING Name:Last," +
    "country STRING Address:Country, state STRING Address:State")
  .option("hbase.table", "Person")
  ).load()
scala> hbaseDF.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(lastName,StringType,true), StructField(country,StringType,true), StructField(state,StringType,true), StructField(firstName,StringType,true), StructField(rowKey,StringType,true))
hbaseDF.show()
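Once loaded, hbaseDF behaves like any other Spark DataFrame. As a quick sketch (not part of the original steps, using only the column names mapped above), it can be registered as a temporary view and queried with Spark SQL:
// Register the HBase-backed DataFrame as a temp view and query it with Spark SQL
hbaseDF.createOrReplaceTempView("person")
spark.sql("SELECT rowKey, firstName, lastName FROM person WHERE country = 'Australia'").show()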
Use catalog
We can also define a catalog for the table Person created above and then use it to read data.
1) Define catalog
def catalog = s"""{
    |"table":{"namespace":"default", "name":"Person"},
    |"rowkey":"key",
    |"columns":{
    |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    |"firstName":{"cf":"Name", "col":"First", "type":"string"},
    |"lastName":{"cf":"Name", "col":"Last", "type":"string"},
    |"country":{"cf":"Address", "col":"Country", "type":"string"},
    |"state":{"cf":"Address", "col":"State", "type":"string"}
    |}
    |}""".stripMargin
2) Use catalog
Now the catalog can be directly passed in as the tableCatalog option:
import org.apache.hadoop.hbase.spark.datasources._
(spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()).show()
The code can also be simplified as:
(spark.read.format("org.apache.hadoop.hbase.spark")
  .option("catalog", catalog)
  .load()).show()
scala> (spark.read
     |   .options(Map(HBaseTableCatalog.tableCatalog->catalog))
     |   .format("org.apache.hadoop.hbase.spark")
     |   .load()).show()
+--------+------+---------+-----+---------+
|lastName|rowkey|  country|state|firstName|
+--------+------+---------+-----+---------+
|    Tang|     1|Australia|  VIC|  Raymond|
|    Gnat|     2|      USA|   CA|  Dnomyar|
+--------+------+---------+-----+---------+
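The same catalog can also be used to write a DataFrame back to HBase. The snippet below is only a sketch and is not from the original article: the example row is made up, and the HBaseTableCatalog.newTable option (number of regions used if a new table is created) is assumed from the connector source, so verify the option names against the version you built.
import spark.implicits._
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog
// Sketch: append one extra row to the Person table using the catalog defined above
val newRows = Seq(("3", "Jane", "Doe", "Australia", "NSW")).toDF("rowkey", "firstName", "lastName", "country", "state")
(newRows.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.hadoop.hbase.spark")
  ).save()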
Summary
Unfortunately, the connector packages for Spark 3.x are not published to the Maven central repository yet.
To save time building the hbase-connectors project, you can download the one I built using WSL: Release 1.0.1 HBase Connectors for Spark 3.0.1 · kontext-tech/hbase-connectors.
What is the command line you used to build? The example I provided was for the following versions:
mvn -Dspark.version=3.0.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.2.4 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.0 -DskipTests -Dcheckstyle.skip -U clean package
If you are using the latest release code of hbase-connector, you can find the Spark and Hadoop versions here:
The HBase version in that release is 2.5.4.
Thank you for your feedback. Currently, I want to try using Spark 3 to read and write data with HBase. Is there a way to do this, given that the HBase version is quite old, 2.0.2 (HDP)? Your above method does not work with the command below:
mvn -Dspark.version=3.2.2 -Dscala.version=2.12.15 -Dscala.binary.version=2.12 -Dhbase.version=2.0.2 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.1.1 -DskipTests -Dcheckstyle.skip -U clean package
Based on my limited knowledge, that won't work as the referenced libraries are different. You can fork the repo and customize the referenced library versions to see if it works.
Hi Raymond thanks for the article.
I have managed to create my own jar and connect to the shell with the following command:
spark-shell --jars hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar
but when I write my imports I get the following error:
scala> import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseContext
scala> import org.apache.hadoop.hbase.HBaseConfiguration
<console>:24: error: object HBaseConfiguration is not a member of package org.apache.hadoop.hbase
import org.apache.hadoop.hbase.HBaseConfiguration
Do you have any idea what might be wrong?
Just to follow up on this one as I didn't hear back from you. Have you resolved this problem?
Hi cansın,
What is your version of HBase?
Also, can you specify the full path to your spark-hbase connector jar file? For example, in this article I am using the ~/ prefix:
spark-shell --jars ~/hbase-connectors/spark/hbase-spark/target/hbase-spark-1.0.1-SNAPSHOT.jar
Hi @Raymond,
Thanks for this informative article. I followed the steps mentioned in it, but I am seeing the below error while building the project. It would be great if you could help me resolve the issue.
Thanks in advance.
Hi Pravan,
Is your Maven version Apache Maven 3.6.0?
If you are using Spark 3.0.1 with HBase 2.4.1, you can directly try the one I built:
Thanks for pointing that out, @Raymond. My Hadoop, Spark, Scala, and HBase versions are 3.2.1, 3.1.1, 2.12, and 2.4.7 respectively.
Maven build:
mvn -Dspark.version=3.1.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.4.7 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.1 -DskipTests -Dcheckstyle.skip -U clean package
I have upgraded Maven and the issue is resolved, but now I am seeing a compilation error as below.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project hbase-kafka-proxy: Compilation failure
[ERROR] /home/ec2-user/git/spark-hbase/hbase-connectors/kafka/hbase-kafka-proxy/src/main/java/org/apache/hadoop/hbase/kafka/KafkaTableForBridge.java:[53,8] org.apache.hadoop.hbase.kafka.KafkaTableForBridge is not abstract and does not override abstract method getRegionLocator() in org.apache.hadoop.hbase.client.Table
I would be so grateful if you could help me with what I need to learn to resolve such issues.
Thank you so much for your help.
Hi Pavan,
The issue you encountered is the same one I mentioned in the article; it is caused by an incompatible combination of the HBase version and the connector code.
For the HBase version, I have to use 2.2.4 as the latest hbase-connectors code was based on that version.
So please try the following command:
mvn -Dspark.version=3.1.1 -Dscala.version=2.12.10 -Dscala.binary.version=2.12 -Dhbase.version=2.2.4 -Dhadoop.profile=3.0 -Dhadoop-three.version=3.2.1 -DskipTests -Dcheckstyle.skip -U clean package
The built package should still work with HBase 2.4.7.
Regards,
Raymond
Awesome!
That worked. Thanks again for your help, Raymond.
I think there is no issue with the build, but I'm unable to connect to HBase from Spark. I'm using a Docker environment where Zookeeper, HDFS, Spark, and HBase run in different containers on the same network.
Here are the jars I'm using.
spark-shell --jars hbase-spark-protocol-shaded-1.0.0.7.2.12.0-291.jar,htrace-core4-4.2.0-incubating.jar,hbase-shaded-protobuf-3.5.1.jar,protobuf-java-2.5.0.jar,hbase-protocol-2.4.8.jar,hbase-shaded-miscellaneous-3.5.1.jar,hbase-mapreduce-2.4.8.jar,hbase-server-2.4.8.jar,hbase-client-2.4.8.jar,hbase-common-2.4.8.jar,hbase-spark-1.0.1-SNAPSHOT.jar,hadoop-common-2.8.5.jar --files hbase-site.xml
I have almost all the required jars but am still seeing the error below. I tried my best to debug the issue but didn't find a way to get rid of this. Please advise me on how to resolve this, or redirect me if there is any detailed documentation about the prerequisites.
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/shaded/protobuf/generated/MasterProtos$MasterService$BlockingInterface
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
Are all those jars included in the current directory where you initiated the spark-shell?
You can manually put them into the jars directory of your Spark installation ($SPARK_HOME/jars).
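For example, something like the following (a rough sketch; it assumes SPARK_HOME is set and the jars listed in your spark-shell command sit in the current directory):
# Copy the connector and HBase jars into Spark's jars folder so they are always on the classpath
cp hbase-*.jar protobuf-java-2.5.0.jar htrace-core4-4.2.0-incubating.jar hadoop-common-2.8.5.jar "$SPARK_HOME/jars/"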
Yes, they are all in the current directory. Can we connect, if possible?
Please contact us via the Contact us page and we will try to arrange a Teams session for you.
Hello Raymond, I'm having some errors when building the connector. I am using Spark 3.2.2 on a separate cluster reading data from HBase 2.0.2 (on HDP), Hadoop version 3.1.1, Scala 2.12.15. Hope you will respond!