Overview of AWS Glue

AWS Glue event based workflow

Migrating from an on-premises solution to AWS Glue

When you run AWS Glue, there are no servers or other infrastructure to manage. You pay only for the resources used while running jobs and for the metadata that is stored. If your organization is already invested in Informatica, DataStage, Talend, or a similar tool, developers may find it easy to pick up AWS Glue by using AWS Glue Studio. AWS Glue Studio makes it easy to visually create, run, and monitor AWS Glue ETL jobs. You can compose ETL jobs that move and transform data using a drag-and-drop editor, and AWS Glue automatically generates the code. You can then use the AWS Glue Studio job run dashboard to monitor ETL execution and ensure that your jobs are operating as intended.

It is important to remember, though, that the third-party connectors commonly available in other ETL tools may not be available (yet!). No Salesforce connector 🙂

If your company has already invested significantly in on-premises ETL pipelines, migration may be expensive.

Steps to Build your ETL jobs

  • Set up connections to source and target
  • Create crawlers to gather schemas of source and target data
  • Build ETL jobs using AWS Glue Studio
  • Schedule the ETL jobs and monitor them

Set up connections to source and target

All connections are set up using IAM roles. Connections to an RDBMS in the AWS ecosystem can be configured using IAM roles and made through the RDBMS connector.

For non-RDBMS data stores such as S3, the connection can be established through IAM roles that have access to read and update the respective S3 buckets.
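As a rough illustration of what "access to read/update" could look like, here is a minimal sketch that builds an IAM policy document for a single bucket. The bucket name `example-etl-bucket` and the helper `s3_read_write_policy` are hypothetical, and the exact set of actions your role needs may differ.

```python
import json


def s3_read_write_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read/update access to one bucket.

    Hypothetical helper for illustration; adjust the actions to your needs.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reads and writes are granted on the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# "example-etl-bucket" is a placeholder name, not a real bucket
policy = s3_read_write_policy("example-etl-bucket")
print(json.dumps(policy, indent=2))
```

A policy like this would be attached to the IAM role that the crawler or job assumes, so access is scoped to the buckets the pipeline actually touches.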

Create crawlers to gather schemas of source and target data

AWS Glue crawlers infer schemas from connected data stores and store metadata in the Data Catalog

AWS Glue crawlers connect to data stores using IAM roles that you configure. Once connected, you can choose which data stores the crawler should include, and it can crawl JSON, text files, system logs, relational database tables, etc. You can include or exclude patterns that the crawler infers schemas from. For example, if you don’t want the *.csv files in the S3 bucket to be crawled, you can exclude them. A crawler can run once or be set up to run on a given schedule. It stores its output in the Data Catalog; the output includes the format (e.g. JSON) and the schema.
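The crawler settings described above (role, target, exclusion patterns, schedule) map fairly directly onto the parameters of the boto3 Glue `create_crawler` call. Below is a minimal sketch that only builds the request; the helper `build_crawler_request` and all names, ARNs, and paths are made-up placeholders.

```python
def build_crawler_request(name, role_arn, database, s3_path,
                          exclusions=(), schedule=None):
    """Assemble keyword arguments for the boto3 Glue create_crawler call.

    Hypothetical helper: it builds the request and does not call AWS.
    """
    request = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database for the output
        "Targets": {
            "S3Targets": [{"Path": s3_path, "Exclusions": list(exclusions)}]
        },
    }
    if schedule is not None:
        # Omitting the schedule leaves the crawler to be run on demand
        request["Schedule"] = schedule
    return request


request = build_crawler_request(
    name="example-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-etl-bucket/raw/",
    exclusions=["**.csv"],          # glob pattern; ** crosses folder boundaries
    schedule="cron(0 2 * * ? *)",   # every day at 02:00 UTC
)

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as plain data makes it easy to review or version-control before anything is created in the account.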

Build ETL jobs using AWS Glue Studio

AWS Glue generates PySpark or Scala script

While you build the ETL job in AWS Glue Studio, the job references the source and target table schemas from the Data Catalog. Job arguments can be set up in the job, and the job can be scheduled based on events or time. Once the job is built, AWS Glue generates a PySpark or Scala script that is executed at run time.

Serverless means we pay only for processing and loading data (jobs) and for discovering data (crawlers), and both are billed by the second. For the AWS Glue Data Catalog, a monthly fee is paid for storing and accessing the metadata. The first million objects stored and the first million accesses are free.
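To make the billing model a bit more concrete, here is a back-of-the-envelope sketch. It assumes job capacity is measured in DPUs (Data Processing Units) priced per DPU-hour, and that catalog storage beyond the free tier is priced per block of 100,000 objects. The rates below are placeholders, not actual AWS prices, and any minimum billing duration is ignored; check the AWS Glue pricing page for your region.

```python
def job_run_cost(dpus: float, runtime_seconds: int,
                 rate_per_dpu_hour: float) -> float:
    """Estimate the cost of one job run with per-second billing.

    Ignores any minimum billing duration (a simplifying assumption).
    """
    dpu_hours = dpus * runtime_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


def catalog_storage_cost(objects_stored: int, rate_per_100k_objects: float,
                         free_objects: int = 1_000_000) -> float:
    """Estimate the monthly catalog storage cost above the free tier."""
    billable = max(0, objects_stored - free_objects)
    return billable / 100_000 * rate_per_100k_objects


# Placeholder rates for illustration only, not real AWS prices
run_cost = job_run_cost(dpus=10, runtime_seconds=900, rate_per_dpu_hour=0.5)
storage_cost = catalog_storage_cost(1_250_000, rate_per_100k_objects=1.0)

print(run_cost)      # 10 DPUs for 15 minutes = 2.5 DPU-hours -> 1.25
print(storage_cost)  # 250,000 billable objects = 2.5 blocks -> 2.5
```

Under this model a catalog that stays below one million objects has no storage cost, which is why small projects often see only the job and crawler charges.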

Scheduling and monitoring jobs

AWS provides logging in CloudWatch Logs

Knowledge of Python, PySpark, or Scala may be useful when troubleshooting or when working on a large project with many changes. Consider your team's strengths in these before you dive into AWS Glue.
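When troubleshooting, a little Python can pull a job run's log messages out of CloudWatch Logs. The sketch below is one way to do it, assuming the job writes to the log group `/aws-glue/jobs/error` and that log stream names start with the job run ID; verify both in your account. `fetch_job_log_messages` and the run ID in the usage comment are hypothetical.

```python
def fetch_job_log_messages(logs_client, job_run_id,
                           log_group="/aws-glue/jobs/error"):
    """Collect log messages for one job run from CloudWatch Logs.

    logs_client is expected to behave like boto3.client("logs").
    The default log group name is an assumption; adjust it if needed.
    """
    messages = []
    kwargs = {"logGroupName": log_group, "logStreamNamePrefix": job_run_id}
    while True:
        response = logs_client.filter_log_events(**kwargs)
        messages.extend(event["message"] for event in response.get("events", []))
        token = response.get("nextToken")
        if not token:
            # No further pages of results
            return messages
        kwargs["nextToken"] = token


# With boto3 installed and AWS credentials configured, usage might look like:
#   import boto3
#   for line in fetch_job_log_messages(boto3.client("logs"), "jr_0123example"):
#       print(line)
```

Passing the client in as a parameter keeps the function easy to test with a stub and avoids hard-coding a region or profile.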