Traditional ETL vs AWS Glue

ETL consists of Extraction, Transformation, and Loading.

Extraction

The process of retrieving data from one or more sources: CRM, ERP, operational systems, legacy data, social media data, third-party data, etc.

Transformation

The process of mapping, reformatting, conforming, and adding meaning to the data so that it is easier to consume. For example, converting currency from USD to JPY before storage.

Loading

Loading involves inserting the data into a target database or a data warehouse.
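
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the sample CSV data, the fixed exchange rate of 150 JPY per USD, the `orders` table, and the function names are all assumptions made for the example, and an in-memory SQLite database stands in for the target data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a CRM, ERP, etc.
SOURCE_CSV = """order_id,amount_usd
1,10.00
2,25.50
"""

# Assumed fixed exchange rate, for illustration only.
USD_TO_JPY = 150.0


def extract(raw_csv):
    """Extraction: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows, rate=USD_TO_JPY):
    """Transformation: convert each USD amount to JPY before storage."""
    return [
        (int(row["order_id"]), round(float(row["amount_usd"]) * rate))
        for row in rows
    ]


def load(records, conn):
    """Loading: insert the transformed records into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_jpy INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    conn.commit()


def run_etl(raw_csv=SOURCE_CSV):
    """Run extract, transform, and load in order, then read back the result."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(transform(extract(raw_csv)), conn)
    return conn.execute(
        "SELECT order_id, amount_jpy FROM orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Keeping each step in its own function mirrors the way ETL is usually described: the source, the transformation rule, or the target can each be swapped without touching the other two.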

[Figure: ETL process]

Cloud and ETL

With the growth of the internet, the explosion in connected devices, and the rise of devices that collect data (home automation devices, mobile phones, smart watches, etc.), the amount and variety of data have increased exponentially. To understand the variety, consider formats such as social media posts on TikTok, photos on Instagram, videos on YouTube, temperature readings from your home thermostat, or the history of people contacting your company on Twitter. These are in addition to the traditional transaction, sales, customer, and operational data that your company already collects and stores in its data warehouse. The advent of cloud, SaaS, and big data technologies has made it possible to store and process all this data.

Data integration tools (ETL is one part of them, mainly referring to the batch portion of data integration) need to complete these tasks at scale and on time. More and more, we see that many sources and targets are in the cloud. Consider the growth of Snowflake or Amazon Redshift in the data warehousing space.

| Serverless ETL | Traditional ETL |
| --- | --- |
| ETL pipeline jobs run as code on servers that are maintained off-premises or in the cloud | ETL pipeline jobs typically run on on-premises servers that must be maintained, sometimes by another team |
| ETL tools such as AWS Glue are called ETL as a service because they let users create, store, and run ETL jobs online | ETL tools are typically canvas-based, live on-premises, and require maintenance such as software updates |
| The code for serverless ETL operations can be customized to do whatever the developer wants in the ETL data pipeline | ETL operations are customized through canvas-based functions, each represented by an icon with a configurable UI |
[Figure: AWS Glue ETL process]
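
To make the point about customizable code more concrete, here is a rough sketch of transformation steps written as ordinary functions rather than configured on a canvas. It is plain Python and does not use the AWS Glue API; the record fields (`first_name`, `last_name`, `email`, `active`) and the step names are hypothetical. In a code-based ETL service, logic of this kind would live in the job script.

```python
def drop_inactive(record):
    """Filter step: return None to discard records that are not active."""
    return record if record["active"] else None


def mask_email(record):
    """Keep the first character and the domain, hide the rest of the address."""
    local, _, domain = record["email"].partition("@")
    return {**record, "email": f"{local[:1]}***@{domain}"}


def add_full_name(record):
    """Derive a new field from two existing ones."""
    full_name = f"{record['first_name']} {record['last_name']}"
    return {**record, "full_name": full_name}


# The pipeline is just an ordered list of functions, so steps can be
# added, removed, or reordered by editing code.
STEPS = [drop_inactive, mask_email, add_full_name]


def run_steps(records, steps=STEPS):
    """Apply each step in order; a step returning None drops the record."""
    output = []
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:
                break
        else:
            # Reached only when no step dropped the record.
            output.append(record)
    return output


if __name__ == "__main__":
    sample = [
        {"first_name": "Ada", "last_name": "Lovelace",
         "email": "ada@example.com", "active": True},
        {"first_name": "Bob", "last_name": "Smith",
         "email": "bob@example.com", "active": False},
    ]
    for row in run_steps(sample):
        print(row)
```

In a canvas-based tool, each of these steps would typically be a separate icon with its own configuration dialog; here each is a function the developer can change freely.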