What is AWS Glue and what are its key features
AWS Glue is a fully managed, serverless service that speeds up the work of turning raw data into usable information by letting users extract, transform, and load (ETL) data without managing servers. It is part of Amazon Web Services and works with other AWS services such as Amazon S3, Amazon RDS, and Amazon Redshift as data sources and targets. Glue jobs run on Apache Spark and are written in Python or Scala, so the transformation code builds on widely used open-source tooling. Jobs typically process data in batches. The goal of the service is to lower the operational cost of working with data by handling provisioning, scheduling, and scaling in one place.
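AWS Glue can ingest data from many sources, including AWS data stores such as Amazon S3, Amazon RDS, Amazon Redshift, and Amazon DynamoDB, other databases reachable over JDBC, and the most popular file formats such as CSV, JSON, Avro, Parquet, and ORC. In short, AWS Glue can be thought of as a managed service for pulling in, transforming, and loading data from a wide range of sources.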
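The extract, transform, and load flow described above can be sketched in plain Python. This is a minimal illustration of the pattern only, not the AWS Glue API: the function names (`extract`, `transform`, `load`), the sample columns, and the in-memory source and sink are all assumptions made for the example. In a real Glue job the source and target would typically be Amazon S3 or a database.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: parse raw CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(records):
    """Transform: drop rows without an amount and cast fields to proper types."""
    cleaned = []
    for row in records:
        if not row["amount"]:
            continue  # skip incomplete rows
        cleaned.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(records, sink):
    """Load: write records as JSON Lines to a file-like sink."""
    for record in records:
        sink.write(json.dumps(record) + "\n")
    return len(records)


# Hypothetical sample input: the second row has no amount and is dropped.
RAW = "order_id,country,amount\n1,us,10.5\n2,de,\n3, fr ,7\n"

sink = io.StringIO()
written = load(transform(extract(RAW)), sink)
```

Keeping the three stages as separate functions mirrors how an ETL job is usually organized: the source or target can be swapped without touching the transformation logic.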
AWS Glue exposes the same APIs for defining and running jobs regardless of the data source, so every job is managed the same way, and the compute it relies on is provisioned by the service itself. Capacity is expressed in Data Processing Units (DPUs), so a job is scaled by assigning it more DPUs rather than by managing instances.
A feature comparison between Amazon EMR and AWS Glue is given in the following table:
Glue and EMR have some features in common. Both integrate with Amazon S3, and both can apply transformations to data from any source they can read.
How does AWS Glue compare to EMR
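AWS Glue is a data integration tool that is often cheaper than EMR for intermittent workloads, because you pay only while a job is running. Glue is billed by the DPU-hour, metered per second, so the monthly bill depends on how much capacity your jobs use and for how long. EMR is billed per instance for as long as the cluster is up, at a rate that depends on the instance type. AWS Glue does have limitations. Because it is serverless, you have less control over the underlying Spark environment than on EMR. Glue reads and writes common formats such as CSV, JSON, Avro, Parquet, and ORC, but less common formats may require custom code. Glue has been generally available since August 2017, yet debugging is still less direct than on a cluster you control: it relies mainly on job logs in Amazon CloudWatch and on development endpoints or interactive sessions.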
EMR, on the other hand, is a managed service for running open-source frameworks such as Apache Hadoop, Apache Spark, and Apache Hive, which makes it highly customizable: you can tailor a cluster to your needs. With AWS Glue, you pay for the DPU-hours your jobs consume. With EMR, you choose the number of nodes and the instance types, and with them the memory and storage, that your cluster will use. This flexibility is a real advantage, since it lets you choose the combination that meets your business needs most cost-effectively.
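As a rough sketch of how the two pricing models differ, the following assumes that Glue is billed per DPU-hour while a job runs and that EMR is billed per node-hour (the EC2 price plus an EMR surcharge) while the cluster is up. The function names and every rate below are hypothetical placeholders, not current AWS prices.

```python
def glue_job_cost(dpus, runtime_hours, rate_per_dpu_hour):
    """Cost of one Glue job run: capacity (DPUs) x runtime x hourly DPU rate."""
    return dpus * runtime_hours * rate_per_dpu_hour


def emr_cluster_cost(nodes, uptime_hours, ec2_rate_per_hour, emr_rate_per_hour):
    """Cost of an EMR cluster: each node pays the EC2 price plus the EMR surcharge."""
    return nodes * uptime_hours * (ec2_rate_per_hour + emr_rate_per_hour)


# Placeholder rates for illustration only; look up current prices for your region.
glue = glue_job_cost(dpus=10, runtime_hours=0.5, rate_per_dpu_hour=0.5)
emr = emr_cluster_cost(nodes=4, uptime_hours=24,
                       ec2_rate_per_hour=0.25, emr_rate_per_hour=0.0625)

print(f"Glue job run: ${glue:.2f}")
print(f"EMR cluster for a day: ${emr:.2f}")
```

The two figures are not a like-for-like comparison: the Glue number covers a single half-hour run, while the EMR number covers a cluster kept up for a full day. That difference is the point of the sketch. A cluster that runs all day at high utilization can come out ahead, so plug in your own workload before drawing a conclusion.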
When should you use AWS Glue over EMR
AWS Glue is a good alternative to EMR if you are just getting started with big data processing and do not yet have the resources to operate an EMR cluster. AWS Glue gives you one service that handles data ingestion, loading, transformation, and the provisioning of the underlying cluster. For workloads that run only occasionally, it is usually cheaper than keeping an EMR cluster running. AWS Glue can also be a good option for companies that already run their own Hadoop clusters but do not want to take on EMR as well.
To solve the following problems, AWS Glue is highly recommended:
Loading data (getting the data into Amazon S3): AWS Glue handles the loading and the transformations. Since you do not need to write Hadoop-specific code such as mappers and reducers for the import, this saves a lot of development time.
Data transformation: AWS Glue does this work well. Most common transformations are available, such as filtering, sorting, aggregating, grouping, and pivoting.
Clustering: AWS Glue has no dedicated clustering transform, but because jobs run on Apache Spark, a job script can call Spark MLlib algorithms such as k-means and bisecting (hierarchical) k-means.
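The transformations named in the list above (filtering, grouping, aggregating, and sorting) can be illustrated with a small plain-Python sketch. The `orders` records and the 4.0 threshold are made up for the example, and this is not Glue code: in a Glue job the same steps would normally be expressed on Spark DataFrames or Glue's own frame types.

```python
from collections import defaultdict

# Hypothetical sample records.
orders = [
    {"country": "US", "amount": 10.5},
    {"country": "DE", "amount": 3.0},
    {"country": "US", "amount": 4.5},
    {"country": "FR", "amount": 7.0},
]

# Filter: keep only orders of at least 4.0.
large = [o for o in orders if o["amount"] >= 4.0]

# Group and aggregate: total amount per country.
totals = defaultdict(float)
for o in large:
    totals[o["country"]] += o["amount"]

# Sort: countries by total, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```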
To solve the following problems, using EMR is preferred: