PySpark - Read Parquet Files in S3


Raymond Tang · 8/21/2022

Code description

This code snippet provides an example of reading parquet files located in S3 buckets on AWS (Amazon Web Services). 

The bucket contains [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

The S3 bucket location is: `s3a://ursa-labs-taxi-data/2009/01/data.parquet`.

To run the script, we need to set up a dependency on the [Hadoop AWS package](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws), for example, `org.apache.hadoop:hadoop-aws:3.3.0`. This can easily be done by passing a configuration argument to `spark-submit`:

    spark-submit --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.0

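In a complete command, the application script follows the configuration flag; the script name below (`read_s3_parquet.py`) is only a placeholder for the snippet at the end of this page:

    # read_s3_parquet.py is a placeholder name for the script shown below
    spark-submit \
      --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.0 \
      read_s3_parquet.py
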
This can also be done via SparkConf:

    conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')
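
As a minimal sketch (mirroring the snippet further down), the package can be set on the same SparkConf that builds the session; it has to happen before the SparkSession is created, because dependencies are resolved when the underlying JVM starts:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf().setAppName('PySpark - Read from S3 Example').setMaster('local[1]')
    # The package dependency must be set before the SparkSession is created.
    conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')
    spark = SparkSession.builder.config(conf=conf).getOrCreate()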

Use temporary AWS credentials

In this code snippet, AWS `AnonymousAWSCredentialsProvider` is used. If the bucket is not public, we can use `TemporaryAWSCredentialsProvider` instead.

    conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    conf.set('spark.hadoop.fs.s3a.access.key', <access_key>)
    conf.set('spark.hadoop.fs.s3a.secret.key', <secret_key>)
    conf.set('spark.hadoop.fs.s3a.session.token', <token>)
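
If you need to obtain temporary credentials programmatically, one possible approach (an illustration, not part of the original snippet; the role ARN and session name are placeholders) is to request them from AWS STS with boto3 and plug the returned values into the settings above:

    import boto3

    # Assume an IAM role; the role ARN and session name are placeholders.
    sts = boto3.client('sts')
    response = sts.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-read-role',
        RoleSessionName='pyspark-read-s3'
    )
    creds = response['Credentials']

    conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
             'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    conf.set('spark.hadoop.fs.s3a.access.key', creds['AccessKeyId'])
    conf.set('spark.hadoop.fs.s3a.secret.key', creds['SecretAccessKey'])
    conf.set('spark.hadoop.fs.s3a.session.token', creds['SessionToken'])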

If you have used the AWS CLI or SAML tools to cache credentials locally (`~/.aws/credentials`), you don't need to specify the access keys, as long as the cached credentials have access to the S3 bucket you are reading from.
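
In that case, one option (shown as a sketch; the original snippet does not configure this) is to point S3A at the AWS SDK's default credential provider chain, which picks up cached credentials, environment variables and instance profiles:

    # Use cached local credentials (~/.aws/credentials), environment
    # variables or instance profiles via the AWS SDK default chain.
    conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
             'com.amazonaws.auth.DefaultAWSCredentialsProviderChain')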

Code snippet

    from pyspark.sql import SparkSession
    from pyspark import SparkConf
    
    app_name = "PySpark - Read from S3 Example"
    master = "local[1]"
    
    conf = SparkConf().setAppName(app_name).setMaster(master)
    
    # set up anonymous AWS credentials so the public bucket can be read without keys
    conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
             'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider')
    
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    file_path = 's3a://ursa-labs-taxi-data/2009/01/data.parquet'
    
    df = spark.read.format('parquet').load(file_path)
    
    df.explain(mode='formatted')
    
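
To sanity-check the load, you could also print the schema and preview a few rows (a suggested addition, not part of the original snippet):

    # Inspect the inferred schema and preview the first rows.
    df.printSchema()
    df.limit(5).show(truncate=False)
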
Tags: aws, cloud-storage, pyspark
