AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.
AWS Glue works well with structured and semi-structured data, and its console lets you discover and transform the data and, through services such as Athena, query it. You can also use the console to edit the generated ETL scripts and run them on demand.
Components of AWS Glue
- Data Catalog: The centralized catalog that stores the metadata and structure of the data. You can point Hive and Athena at this catalog during setup, so both tools work against the same data without any change to configuration or methods.
- Database: A logical grouping of table definitions in the Data Catalog. It holds metadata only; the data itself stays in the source and target stores.
- Table: A metadata definition that describes the schema and location of data in a source or target store. Jobs refer to these tables as their sources and targets.
- Crawler and Classifier: A crawler scans a data store, such as an S3 location or a database reached over a JDBC connection, and uses classifiers to recognize the data format and infer the schema. It does not move the data; it creates or updates metadata tables in the Data Catalog.
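- Job: A job is the application that carries out the ETL task. Scripts are written in Python (PySpark) or Scala and run on a serverless Apache Spark environment that Glue provisions, so there is no EMR cluster or EC2 instance for you to manage. Lightweight Python shell jobs are also available.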
- Trigger: A trigger starts ETL job execution on demand, on a schedule, or in response to an event such as the completion of another job.
- Development endpoint: An environment for interactively developing and testing ETL scripts. Glue provisions the Spark environment behind it, and you can connect a notebook to it to run and debug job code.
- Notebook: A Jupyter notebook is a web-based IDE for developing and testing Scala or Python programs.
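To make the crawler and trigger components concrete, here is a minimal Python sketch that builds the request parameters for the Glue `CreateCrawler` and `CreateTrigger` API calls as exposed by boto3. The helper functions (`crawler_request`, `scheduled_trigger_request`, `register`) and all resource names, the bucket path, and the role ARN are hypothetical and exist only for illustration. The sketch assumes an S3 data source and a time-based trigger.

```python
from typing import Any, Dict


def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Keyword arguments for the Glue CreateCrawler API (S3 target only)."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def scheduled_trigger_request(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Keyword arguments for the Glue CreateTrigger API (time-based trigger)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",  # Glue expects the cron(...) wrapper
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def register(glue_client: Any, crawler: Dict[str, Any], trigger: Dict[str, Any]) -> None:
    """Send both requests through a boto3 Glue client (or a stand-in with the same methods)."""
    glue_client.create_crawler(**crawler)
    glue_client.create_trigger(**trigger)


# Hypothetical names, bucket, and role; replace with real resources before use.
crawler = crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
trigger = scheduled_trigger_request("nightly-sales-etl", "sales-etl-job", "0 2 * * ? *")

# With AWS credentials configured, the requests could be sent like this:
#   import boto3
#   register(boto3.client("glue"), crawler, trigger)
```

Keeping the request dictionaries separate from the client call makes it possible to inspect or unit-test the configuration without touching an AWS account.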
Key Features of AWS Glue
- After you configure a job, AWS Glue automatically generates the code structure that performs the ETL.
- You can modify the code and add extra features/transformations that you want to carry out on the data.
- With a Glue crawler, you can connect to data sources; the crawler automatically infers the schema and stores it as table definitions in the Data Catalog.
- The Data Catalog can store table and column statistics, which query engines can use to plan queries that are more efficient and cost-effective.
- With AWS Glue, you can also deduplicate your data. Glue provides an ML transform called FindMatches that identifies records referring to the same entity even when they do not match exactly, so that you can remove the duplicates.
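The idea behind fuzzy deduplication can be illustrated with a small standalone Python sketch. This is not FindMatches and does not use any Glue API: it swaps in plain string similarity from the standard library's `difflib` as a stand-in for the trained ML model. The field names, the sample records, and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: Dict[str, str], b: Dict[str, str], fields: List[str]) -> float:
    """Average character-level similarity over the chosen fields (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, a[f].lower().strip(), b[f].lower().strip()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def dedup(records: List[Dict[str, str]], fields: List[str],
          threshold: float = 0.85) -> List[Dict[str, str]]:
    """Keep the first record of each group of near-duplicates."""
    kept: List[Dict[str, str]] = []
    for rec in records:
        # A record is dropped if it is similar enough to one already kept.
        if not any(similarity(rec, k, fields) >= threshold for k in kept):
            kept.append(rec)
    return kept


# Hypothetical sample data: the first two rows describe the same customer.
customers = [
    {"name": "Jane Doe", "city": "Seattle"},
    {"name": "jane  doe", "city": "Seattle "},
    {"name": "John Smith", "city": "Austin"},
]
unique = dedup(customers, ["name", "city"])
print([c["name"] for c in unique])
```

A pairwise comparison like this is quadratic in the number of records, so it only suits small samples; a learned matcher is meant for cases where simple string rules are not enough.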