Read Hadoop Credential in PySpark
In one of my previous articles, about a password security solution for Sqoop, I mentioned creating a credential with the hadoop credential command. The credentials are stored through JavaKeyStoreProvider. Credential providers separate the use of sensitive tokens, secrets and passwords from the details of their storage and management.
The following commands create a credential named mydatabase.password twice: once in a JCEKS file in HDFS and once in a local JCEKS file.
# Store the password in HDFS
hadoop credential create mydatabase.password -provider jceks://hdfs/user/hue/mypwd.jceks
# Store the password locally
hadoop credential create mydatabase.password -provider jceks://file/home/user/mypwd.jceks
For jobs that run on a cluster, for example under YARN, create the credential in HDFS so that all worker nodes in the cluster can access it.
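The two provider paths used above differ only in the segment that follows jceks://. As a rough sketch of that convention, the helper functions below build a provider URI and flag a local-file provider when the Spark master is YARN. The names provider_uri and is_cluster_safe are hypothetical, not part of Hadoop or Spark, and only the two schemes shown in this article are handled. The check is a simple heuristic.

```python
def provider_uri(scheme: str, path: str) -> str:
    """Build a JCEKS provider URI such as jceks://hdfs/user/hue/mypwd.jceks."""
    if scheme not in ("hdfs", "file"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"jceks://{scheme}{path}"


def is_cluster_safe(uri: str, master: str) -> bool:
    """Heuristic: return False when a local-file provider is paired with YARN."""
    local_provider = uri.startswith("jceks://file/")
    return not (master.startswith("yarn") and local_provider)


# The same two locations as in the commands above
hdfs_uri = provider_uri("hdfs", "/user/hue/mypwd.jceks")
local_uri = provider_uri("file", "/home/user/mypwd.jceks")

print(hdfs_uri)
print(local_uri)
print(is_cluster_safe(local_uri, "yarn"))
```

A check like this could run before the Spark session is created, so that a misconfigured path fails early instead of on a worker node.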
Once the credential is created, you can use it in Sqoop by passing the credential name as a parameter. But what if you want to access the credential in Spark? In Scala, you can reference the Hadoop Java libraries for credentials directly. In Python it is less straightforward, because the Hadoop classes have to be reached through the JVM gateway that PySpark exposes.
Sample code to retrieve Hadoop credential in PySpark
from pyspark.sql import SparkSession

appName = "PySpark Hadoop Credential Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Replace the credential provider path accordingly
credential_provider_path = 'jceks://hdfs/user/hue/.jceks'
credential_name = 'mydatabase.password'

# Retrieve credential/password from Hadoop credential
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set('hadoop.security.credential.provider.path', credential_provider_path)
# getPassword returns a Java char[] exposed through Py4J
credential_raw = conf.getPassword(credential_name)

# Join the characters into a Python string
credential_str = ''.join(str(credential_raw[i]) for i in range(len(credential_raw)))

# Now you can use credential_str, for example as the database password
# in JDBC to load data from databases into a Spark data frame.
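To make the JDBC use case a little more concrete, here is a minimal sketch that wraps the retrieval in a function and builds the options for Spark's JDBC reader. The function names read_credential and jdbc_options, the FakeHadoopConf stand-in, the JDBC URL, table and user are all hypothetical and exist only so the sketch runs without a cluster. The option keys url, dbtable, user and password are the ones the Spark JDBC data source accepts. The sketch also assumes getPassword returns nothing when the alias is not found, and raises an error in that case.

```python
def read_credential(hadoop_conf, provider_path, alias):
    """Return the password stored under alias, or raise KeyError if missing."""
    hadoop_conf.set("hadoop.security.credential.provider.path", provider_path)
    raw = hadoop_conf.getPassword(alias)  # char[] in the JVM, None if not found
    if raw is None:
        raise KeyError(f"credential {alias!r} not found in {provider_path}")
    return "".join(str(raw[i]) for i in range(len(raw)))


def jdbc_options(url, table, user, password):
    """Collect the options expected by spark.read.format('jdbc')."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


class FakeHadoopConf:
    """Hypothetical stand-in for the JVM Configuration object (demo only)."""

    def __init__(self, secrets):
        self._secrets = secrets
        self.settings = {}

    def set(self, key, value):
        self.settings[key] = value

    def getPassword(self, alias):
        value = self._secrets.get(alias)
        return list(value) if value is not None else None


# Demo with the stand-in; the secret value is made up
fake = FakeHadoopConf({"mydatabase.password": "s3cret"})
password = read_credential(
    fake, "jceks://hdfs/user/hue/mypwd.jceks", "mydatabase.password")
options = jdbc_options(
    "jdbc:postgresql://dbhost:5432/mydb", "public.orders", "etl_user", password)

# With a live session, the same helpers would be used roughly like this:
# conf = spark.sparkContext._jsc.hadoopConfiguration()
# password = read_credential(conf, credential_provider_path, credential_name)
# df = spark.read.format("jdbc").options(**options).load()
```

Keeping the password out of logs and printed output is still your responsibility once it is a plain Python string.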
Access to the credential provider file
Anyone who can read your credential provider file can use the same approach to retrieve the credential value from the provider. It is therefore important to restrict access to the credential file to the users who actually need it.
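For a provider stored on the local file system, one way to check this is to inspect the file mode. The sketch below assumes a POSIX system and uses a throwaway temporary file in place of a real .jceks file. The helper names is_owner_only and restrict_to_owner are hypothetical. For a provider stored in HDFS, the equivalent restriction would be applied through HDFS permissions (for example with hdfs dfs -chmod) rather than with this code.

```python
import os
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """True when neither group nor others have any permission on the file."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path: str) -> None:
    """Set mode 0600: read and write for the owner only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demo on a temporary file standing in for a local credential provider file
with tempfile.NamedTemporaryFile(suffix=".jceks", delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)          # simulate a world-readable file
before = is_owner_only(demo_path)   # checked while group/others can read
restrict_to_owner(demo_path)
after = is_owner_only(demo_path)    # checked after tightening the mode
os.remove(demo_path)

print(before, after)
```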
More details about Hadoop credential API
Refer to the official page to learn more about Hadoop credential APIs: CredentialProvider API Guide.