aws

17 items tagged with "aws"

5 Articles
12 Diagrams

Articles

AWS CDK Python - Add Environment Variables for CodeBuild Pipeline

2023-01-11
Cloud Computing

PySpark - Read Parquet Files in S3

This code snippet provides an example of reading Parquet files located in S3 buckets on AWS (Amazon Web Services). The bucket used is from the New York City taxi trip record data; the S3 location is `s3a://ursa-labs-taxi-data/2009/01/data.parquet`. To run the script, we need to set up the dependency on the Hadoop AWS package, for example `org.apache.hadoop:hadoop-aws:3.3.0`. This can easily be done by passing a configuration argument to spark-submit: `spark-submit --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.0`. It can also be set via SparkConf: `conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.0')`. Use temporary AWS credentials: in this code snippet, `AnonymousAWSCredentialsProvider` is used. If the bucket is not public, we can use `TemporaryAWSCredentialsProvider` instead: `conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')`, together with the `spark.hadoop.fs.s3a.access.key`, `spark.hadoop.fs.s3a.secret.key` and `spark.hadoop.fs.s3a.session.token` settings. If you have used the AWS CLI or SAML tools to cache local credentials (`~/.aws/credentials`), you don't need to specify the access keys, assuming the cached credential has access to the S3 bucket you are reading from.
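The configuration described above can be sketched as a small helper; this is a sketch rather than the article's exact code, and the connector version should match your Hadoop build:

```python
def s3a_anonymous_conf(hadoop_aws_version='3.3.0'):
    """Spark configuration entries for reading a public S3 bucket via s3a.

    Returned as a plain dict so it can be applied with SparkConf.setAll()
    or passed as spark-submit --conf flags.
    """
    return {
        # Hadoop AWS connector dependency; 3.3.0 is the version used in the snippet.
        'spark.jars.packages': f'org.apache.hadoop:hadoop-aws:{hadoop_aws_version}',
        # The NYC taxi bucket is public, so anonymous credentials suffice.
        'spark.hadoop.fs.s3a.aws.credentials.provider':
            'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider',
    }

# Typical usage (requires PySpark and network access to S3):
# from pyspark import SparkConf
# from pyspark.sql import SparkSession
# conf = SparkConf().setAll(list(s3a_anonymous_conf().items()))
# spark = SparkSession.builder.config(conf=conf).getOrCreate()
# df = spark.read.parquet('s3a://ursa-labs-taxi-data/2009/01/data.parquet')
```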

2022-08-21
Code Snippets & Tips

AWS EMR Debug - Container release on a *lost* node

2022-04-29
Cloud Computing

EMR - Expected schema-specific part at index : s3:

2022-04-28
Cloud Computing

AWS Certified Cloud Practitioner Notes

2020-04-11
Cloud Computing Forum

Diagrams

AWS Elastic Path Based Listener

This diagram shows how to create path-based routing on an ELB with ECS. For different paths, requests are routed to different services in ECS. Reference: Achieve path-based routing on an Application Load Balancer | AWS re:Post (repost.aws)
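Such a listener rule can also be created programmatically; a rough sketch of the payload for the boto3 `elbv2` `create_rule` API (all ARNs and the path pattern here are hypothetical):

```python
def path_rule(listener_arn, path_pattern, target_group_arn, priority):
    """Arguments for elbv2 create_rule that forward requests matching a
    path pattern to a target group (e.g. one backing an ECS service)."""
    return {
        'ListenerArn': listener_arn,
        'Priority': priority,
        # Path-based condition: requests whose path matches are forwarded.
        'Conditions': [{'Field': 'path-pattern', 'Values': [path_pattern]}],
        'Actions': [{'Type': 'forward', 'TargetGroupArn': target_group_arn}],
    }

# Usage (hypothetical ARNs):
# import boto3
# elbv2 = boto3.client('elbv2')
# elbv2.create_rule(**path_rule(listener_arn, '/api/*', api_tg_arn, 10))
```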

2023-03-28
Cloud Computing

PySpark Reading from S3

This diagram is used as an article feature image; it depicts reading data from an S3 bucket via PySpark.

2022-08-21
Solution Diagrams

AWS EMR Read and Write with S3

This diagram shows a typical EMR application that reads data from and writes data to S3. Reference: EMR File System (EMRFS) - Amazon EMR

2022-04-29
Solution Diagrams

AWS ETL Solution with Glue Diagram

This diagram shows one example of using AWS Glue to crawl, catalog and process data stored in S3. Data landed in the raw bucket is scanned by a Glue Crawler and the metadata is stored in the Glue Catalog. The Glue ETL job loads the raw data, performs transformations and eventually stores the processed data in the curated bucket. The processed files are scanned by another Glue Crawler. Processed data is then queried by Amazon Athena and can be further utilized in reporting and dashboards.
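Once the crawlers have populated the Glue Catalog, the curated data can be queried with Athena; a minimal sketch of building the parameters for boto3 `athena` `start_query_execution` (the database name and output location below are hypothetical):

```python
def athena_query_params(query, database, output_s3):
    """Parameters for athena start_query_execution against a Glue Catalog
    database; query results land in the given S3 output location."""
    return {
        'QueryString': query,
        'QueryExecutionContext': {'Database': database},
        'ResultConfiguration': {'OutputLocation': output_s3},
    }

# Usage (hypothetical names):
# import boto3
# athena = boto3.client('athena')
# athena.start_query_execution(**athena_query_params(
#     'SELECT COUNT(*) FROM trips', 'curated_db', 's3://my-athena-results/'))
```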

2022-01-29
Solution Diagrams

AWS Streaming Processing Diagrams

This diagram is used as the feature image for the AWS streaming processing diagram series.

2022-01-14
Solution Diagrams

AWS Batch Processing Diagrams

This diagram is used as the feature image for the AWS batch processing diagram series.

2022-01-14
Solution Diagrams

AWS Big Data Lambda Architecture for Streaming Analytics

This diagram shows a typical Lambda architecture streaming solution on AWS with Amazon Kinesis, AWS Glue, Amazon S3, Amazon Athena and Amazon QuickSight:
Amazon Kinesis - captures streaming data via Data Firehose, then transforms and analyzes the stream using Data Analytics; the analytics results are delivered to another Data Firehose stream. For batch processing, the captured streaming data can also be loaded directly into an S3 bucket.
Amazon S3 - stores streaming raw data and batch processed data.
AWS Glue - transforms batch data in S3 and stores the processed data into another bucket for consumption.
Amazon Athena - used to read data in S3 via SQL.
Amazon QuickSight - data visualization tool.
References: AWS IoT Streaming Processing Solution Diagram; AWS IoT Streaming Processing Solution Diagram w Glue
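For the capture step, producers push events into the Firehose delivery stream. A minimal sketch of encoding one event as a Firehose record (the stream name in the usage comment is hypothetical); newline-delimited JSON keeps the resulting S3 objects line-splittable downstream:

```python
import json


def firehose_record(event: dict) -> dict:
    """Encode one event as a Kinesis Data Firehose record.

    Appends a newline so that records batched into a single S3 object
    remain splittable by line.
    """
    return {'Data': (json.dumps(event) + '\n').encode('utf-8')}

# Usage (hypothetical stream name):
# import boto3
# firehose = boto3.client('firehose')
# firehose.put_record(DeliveryStreamName='iot-ingest',
#                     Record=firehose_record({'device': 'sensor-1', 'temp': 21.5}))
```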

2022-01-11
Solution Diagrams

AWS IoT Streaming Processing Solution Diagram w Glue

This diagram shows a typical streaming processing solution on AWS with Amazon Kinesis, AWS Glue, Amazon S3, Amazon Athena and Amazon QuickSight:
Amazon Kinesis - captures streaming data via Data Firehose and loads it into S3.
Amazon S3 - stores streaming raw data and batch processed data.
AWS Glue - transforms batch data in S3 and stores the processed data into another bucket for consumption.
Amazon Athena - used to read data in S3 via SQL.
Amazon QuickSight - data visualization tool.
A similar solution diagram using streaming transformation: AWS IoT Streaming Processing Solution Diagram.

2022-01-11
Solution Diagrams

AWS IoT Streaming Processing Solution Diagram

This diagram shows a typical streaming processing solution on AWS with Amazon Kinesis, Amazon S3, Amazon Athena and Amazon QuickSight:
Amazon Kinesis - captures streaming data via Data Firehose, then transforms and analyzes the stream using Data Analytics; the analytics results are delivered to another Data Firehose stream.
Amazon S3 - stores the processed streaming data.
Amazon Athena - used to read data in S3 via SQL.
Amazon QuickSight - data visualization tool.

2022-01-11
Solution Diagrams

AWS Batch Processing Solution Diagram (using AWS Glue)

This diagram shows a typical batch processing solution on AWS with Amazon S3, AWS Lambda, AWS Glue and Amazon Redshift: Amazon S3 is used to store staging data extracted from source systems on-premises or in the cloud. AWS Lambda is used to register data arrival in S3 buckets with ETL frameworks and trigger batch processes. AWS Glue is then used to integrate the data (merging, sorting, filtering, aggregating and transforming) and load it. Amazon Redshift is then used to store the transformed data. This diagram is forked from AWS Batch Processing Solution Diagram
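The Lambda trigger step can be sketched as follows, assuming a hypothetical Glue job name `raw-to-curated`. The handler parses the S3 put event and would start one Glue job run per arriving object; the actual boto3 call is left commented so the sketch stays self-contained:

```python
import urllib.parse


def s3_objects(event):
    """Extract (bucket, key) pairs from an S3 put event delivered to Lambda.

    Object keys in S3 events are URL-encoded, so decode them first.
    """
    for rec in event.get('Records', []):
        s3 = rec['s3']
        yield s3['bucket']['name'], urllib.parse.unquote_plus(s3['object']['key'])


def handler(event, context):
    """Lambda entry point: start one Glue job run per arriving object."""
    # glue = boto3.client('glue')
    runs = []
    for bucket, key in s3_objects(event):
        args = {'--source_path': f's3://{bucket}/{key}'}
        # glue.start_job_run(JobName='raw-to-curated', Arguments=args)
        runs.append(args)
    return runs
```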

2022-01-11
Solution Diagrams

AWS Batch Processing Solution Diagram

This diagram shows a typical batch processing solution on AWS with Amazon S3, AWS Lambda, Amazon EMR and Amazon Redshift: Amazon S3 is used to store staging data extracted from source systems on-premises or in the cloud. AWS Lambda is used to register data arrival in S3 buckets with ETL frameworks and trigger batch processes. Amazon EMR is then used to transform the data (for example, aggregations) and load it. Amazon Redshift is then used to store the transformed data. This pattern follows the traditional ETL pattern; you can also change it to an ELT pattern and do the transformations in Redshift directly. Amazon EMR can be replaced with many other products.
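The EMR transform work is typically submitted as a job flow step. A sketch of such a step definition for the boto3 `emr` `add_job_flow_steps` API, running a PySpark script via `command-runner.jar` (the step name and script path are hypothetical):

```python
def spark_step(name, script_s3_path):
    """An EMR step definition that runs a PySpark script via command-runner.jar."""
    return {
        'Name': name,
        # Keep the cluster alive even if this step fails.
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster', script_s3_path],
        },
    }

# Usage (hypothetical cluster id and script path):
# import boto3
# emr = boto3.client('emr')
# emr.add_job_flow_steps(JobFlowId='j-XXXXXXXX',
#                        Steps=[spark_step('daily-agg', 's3://my-scripts/job.py')])
```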

2022-01-11
Solution Diagrams

Kontext Cloud Diagram Example

This diagram is created for testing purposes, to validate whether the software can draw diagrams with Azure, GCP and AWS product SVG icons correctly.

2021-12-12
Solution Diagrams