Overview of AWS Glue

Reading Time: 4 minutes

Migrating from an on-premises solution to AWS Glue: When you run AWS Glue, there are no servers or other infrastructure to manage. You pay only for the resources used while running the jobs and for the metadata that is stored. If your organization is already invested in Informatica, DataStage, Talend, etc., it may be easy for… Continue reading Overview of AWS Glue

AWS Glue (Serverless ETL)

Reading Time: < 1 minute

Contents:
- Introduction
- Traditional ETL vs AWS Glue
- Overview of AWS Glue
- Demo: Creating an ETL solution using AWS
- Some use cases for using AWS Glue
- Summary

What to read next?
- Overview of AWS Glue
- Traditional ETL vs AWS Glue
- Some use cases for using AWS Glue

ETL vs ELT: Must Know Differences

Reading Time: 5 minutes

What is ETL?

ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts the data from different RDBMS source systems, transforms it by applying calculations, concatenations, and so on, and then loads it into the data warehouse system. In ETL, data flows from the source to the target, and the transformation engine takes care of any data changes.

What is ELT?

ELT is a different approach to data movement. Instead of transforming the data before it is written, ELT lets the target system do the transformation. The data is first copied to the target and then transformed in place. ELT is usually used with NoSQL databases, Hadoop clusters, data appliances, or cloud installations.

Difference between ETL and ELT

The ETL and ELT processes differ in the following parameters:

Process
  ETL: Data is transformed at a staging server and then transferred to the data warehouse DB.
  ELT: Data remains in the DB of the data warehouse.

Code Usage
  ETL: Used for compute-intensive transformations and small amounts of data.
  ELT: Used for large amounts of data.

Transformation
  ETL: Transformations are done in the ETL server/staging area.
  ELT: Transformations are performed in the target system.

Time: Load
  ETL: Data is first loaded into staging and later loaded into the target system. Time intensive.
  ELT: Data is loaded into the target system only once. Faster.

Time: Transformation
  ETL: The ETL process needs to wait for transformation to complete. As data size grows, transformation time increases.
  ELT: In the ELT process, speed depends much less on the size of the data.

Time: Maintenance
  ETL: Needs high maintenance, as you need to select the data to load and transform.
  ELT: Low maintenance, as data is always available.

Implementation Complexity
  ETL: At an early stage, easier to implement.
  ELT: To implement the ELT process, an organization should have deep knowledge of tools and expert skills.

Support for Data Warehouse
  ETL: Used for on-premises, relational and structured data.
  ELT: Used in scalable cloud infrastructure, which supports structured and unstructured data sources.

Data Lake Support
  ETL: Does not support data lakes.
  ELT: Allows use of a data lake with unstructured data.

Complexity
  ETL: The ETL process loads only the important data, as identified at design time.
  ELT: This process involves development from the output backward and loading only relevant data.

Cost
  ETL: High costs for small and medium businesses.
  ELT: Low entry costs using online Software as a Service platforms.

Lookups
  ETL: Both facts and dimensions need to be available in the staging area.
  ELT: All data is available because extract and load occur in one single action.

Aggregations
  ETL: Complexity increases with the additional amount of data in the dataset.
  ELT: The power of the target platform can process a significant amount of data quickly.

Calculations
  ETL: Overwrites the existing column, or needs to append to the dataset and push to the target platform.
  ELT: Easily adds the calculated column to the existing table.

Maturity
  ETL: The process has been used for over two decades. It is well documented and best practices are easily available.
  ELT: Relatively new concept and complex to implement.

Hardware
  ETL: Most tools have unique hardware requirements that are expensive.
  ELT: Being SaaS, hardware cost is not an issue.

Support for Unstructured Data
  ETL: Mostly supports relational data.
  ELT: Support for unstructured data is readily available.

Summary:
- ETL stands for Extract, Transform and Load, while ELT stands for Extract, Load, Transform.
- In the ETL process, data flows from the source to staging to the target.
- ELT lets the target system do the transformation. No staging system is involved.
- ELT addresses many of the challenges of ETL but is expensive and requires niche skills to implement and maintain.
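To make the difference in ordering concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the target system. The table names, column names, and sample rows are assumptions made up for illustration; a real warehouse and ETL tool would look different. Both functions produce the same final table; the only difference is where the transformation runs.

```python
import sqlite3

# Hypothetical source rows: (first_name, last_name, unit_price, quantity)
SOURCE_ROWS = [
    ("Ada", "Lovelace", 10.0, 3),
    ("Alan", "Turing", 4.5, 2),
]


def run_etl(rows):
    """ETL: transform in the pipeline, then load only the finished result."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    # The transform step happens outside the target (the "staging" work).
    transformed = [
        (f"{first} {last}", price * qty) for first, last, price, qty in rows
    ]
    target.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
    return target.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


def run_elt(rows):
    """ELT: load the raw rows first, then let the target transform them."""
    target = sqlite3.connect(":memory:")
    target.execute(
        "CREATE TABLE raw_sales "
        "(first_name TEXT, last_name TEXT, unit_price REAL, quantity INTEGER)"
    )
    target.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?)", rows)
    # The transform step runs inside the target as SQL.
    target.execute(
        "CREATE TABLE sales AS "
        "SELECT first_name || ' ' || last_name AS customer, "
        "unit_price * quantity AS amount FROM raw_sales"
    )
    return target.execute(
        "SELECT customer, amount FROM sales ORDER BY customer"
    ).fetchall()


etl_result = run_etl(SOURCE_ROWS)
elt_result = run_elt(SOURCE_ROWS)
```

In the ELT variant the raw table stays in the target, so a new calculated column can be added later with another SQL statement instead of re-running the pipeline.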

OLAP Terms and Definitions

Reading Time: < 1 minute

Dimensions are lists of related terms used to organize your data. Thus, a natural Dimension name for the Members January, February and March might be Months. Dimensions, in turn, are used to construct Cubes, the multidimensional structures in which you store and model data.
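As a rough illustration of these terms, the sketch below models a tiny cube as a Python dictionary whose keys combine one member from each dimension. The Products dimension, the numbers, and the roll_up helper are all hypothetical; real OLAP servers store and index cubes very differently.

```python
from collections import defaultdict

# Two hypothetical dimensions and their members.
MONTHS = ["January", "February", "March"]
PRODUCTS = ["Phones", "Tablets"]

# A tiny "cube": each cell is addressed by one member from every dimension.
cube = {
    ("January", "Phones"): 120,
    ("January", "Tablets"): 40,
    ("February", "Phones"): 95,
    ("February", "Tablets"): 55,
    ("March", "Phones"): 130,
    ("March", "Tablets"): 60,
}


def roll_up(cells, axis):
    """Sum the cells over every dimension except the one at position `axis`."""
    totals = defaultdict(int)
    for members, value in cells.items():
        totals[members[axis]] += value
    return dict(totals)


by_month = roll_up(cube, axis=0)    # totals per member of Months
by_product = roll_up(cube, axis=1)  # totals per member of Products
```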

OLTP vs OLAP: What’s the Difference?

Reading Time: 7 minutes

What is OLAP?

Online Analytical Processing (OLAP) is a category of software tools that provide analysis of data for business decisions. OLAP systems allow users to analyze database information from multiple database systems at one time. The primary objective is data analysis, not data processing.

What is OLTP?

Online transaction processing, known as OLTP for short, supports transaction-oriented applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an organization. The primary objective is data processing, not data analysis.

Example of OLAP

Any data warehouse system is an OLAP system. Uses of OLAP include:
- A company might compare its mobile phone sales in September with sales in October, then compare those results with another location, which may be stored in a separate database.
- Amazon analyzes purchases by its customers to come up with a personalized homepage with products that are likely to interest them.

Example of OLTP system

An example of an OLTP system is an ATM center. Assume that a couple has a joint account with a bank. One day both reach different ATM centers at precisely the same time and want to withdraw the total amount present in their bank account. However, the person who completes the authentication process first will be able to get the money. In this case, the OLTP system makes sure that the withdrawn amount is never more than the amount present in the bank. The key point is that OLTP systems are optimized for transactional superiority instead of data analysis.

Other examples of OLTP systems are:
- Online banking
- Online airline ticket booking
- Sending a text message
- Order entry
- Adding a book to a shopping cart

Benefits of using OLAP services
- OLAP creates a single platform for all types of business analytical needs, which include planning, budgeting, forecasting, and analysis.
- The main benefit of OLAP is the consistency of information and calculations.
- Security restrictions can easily be applied to users and objects to comply with regulations and protect sensitive data.

Benefits of OLTP method
- It administers the daily transactions of an organization.
- OLTP widens the customer base of an organization by simplifying individual processes.

Drawbacks of OLAP service
- Implementation and maintenance depend on IT professionals, because traditional OLAP tools require a complicated modeling procedure.
- OLAP tools need cooperation between people of various departments to be effective, which might not always be possible.

Drawbacks of OLTP method
- If an OLTP system faces hardware failures, online transactions are severely affected.
- OLTP systems allow multiple users to access and change the same data at the same time, which often creates unexpected situations.

Difference between OLTP and OLAP

Process
  OLTP: It is an online transactional system. It manages database modification.
  OLAP: It is an online analysis and data retrieval process.

Characteristic
  OLTP: It is characterized by large numbers of short online transactions.
  OLAP: It is characterized by a large volume of data.

Functionality
  OLTP: It is an online database modifying system.
  OLAP: It is an online database query management system.

Method
  OLTP: Uses a traditional DBMS.
  OLAP: Uses the data warehouse.

Query
  OLTP: Insert, update, and delete information from the database.
  OLAP: Mostly select operations.

Table
  OLTP: Tables in an OLTP database are normalized.
  OLAP: Tables in an OLAP database are not normalized.

Source
  OLTP: OLTP and its transactions are the sources of data.
  OLAP: Different OLTP databases become the source of data for OLAP.

Data Integrity
  OLTP: An OLTP database must maintain data integrity constraints.
  OLAP: An OLAP database is not modified frequently. Hence, data integrity is not an issue.

Response time
  OLTP: Its response time is in milliseconds.
  OLAP: Response time is in seconds to minutes.

Data quality
  OLTP: The data in an OLTP database is always detailed and organized.
  OLAP: The data in the OLAP process might not be organized.

Usefulness
  OLTP: It helps to control and run fundamental business tasks.
  OLAP: It helps with planning, problem-solving, and decision support.

Operation
  OLTP: Allows read/write operations.
  OLAP: Mostly reads and rarely writes.

Audience
  OLTP: It is a customer-oriented process.
  OLAP: It is a market-oriented process.

Query Type
  OLTP: Queries in this process are standardized and simple.
  OLAP: Complex queries involving aggregations.

Back-up
  OLTP: Complete backup of the data combined with incremental backups.
  OLAP: Only needs a backup from time to time. Backup is less important than for OLTP.

Design
  OLTP: DB design is application oriented. Example: database design changes with industry, like Retail, Airline, Banking, etc.
  OLAP: DB design is subject oriented. Example: database design changes with subjects like sales, marketing, purchasing, etc.

User type
  OLTP: Used by data-critical users like clerks, DBAs and database professionals.
  OLAP: Used by data knowledge users like workers, managers, and CEOs.

Purpose
  OLTP: Designed for real-time business operations.
  OLAP: Designed for analysis of business measures by category and attributes.

Performance metric
  OLTP: Transaction throughput is the performance metric.
  OLAP: Query throughput is the performance metric.

Number of users
  OLTP: This kind of database allows thousands of users.
  OLAP: This kind of database allows only hundreds of users.

Productivity
  OLTP: It helps to increase users' self-service and productivity.
  OLAP: It helps to increase the productivity of business analysts.

Challenge
  OLTP: Data warehouses have historically been development projects, which may prove costly to build.
  OLAP: An OLAP cube is not an open SQL server data warehouse. Therefore, technical knowledge and experience are essential to manage the OLAP server.

Process
  OLTP: It provides fast results for daily used data.
  OLAP: It ensures that the response to a query is consistently quick.

Characteristic
  OLTP: It is easy to create and maintain.
  OLAP: It lets the user create a view with the help of a spreadsheet.

Style
  OLTP: Designed to have fast response time and low data redundancy, and is normalized.
  OLAP: A data warehouse is created uniquely so that it can integrate different data sources for building a consolidated database.

Summary:
- Online Analytical Processing is a category of software tools that analyze data stored in a database.
- Online transaction processing, known as OLTP for short, supports transaction-oriented applications in a 3-tier architecture.
- OLAP creates a single platform for all types of business analysis needs, which include planning, budgeting, forecasting, and analysis.
- OLTP is useful for administering the day-to-day transactions of an organization.
- OLAP is characterized by a large volume of data.
- OLTP is characterized by large numbers of short online transactions.
- A data warehouse is created uniquely so that it can integrate different data sources for building a consolidated database.
- An OLAP cube takes a spreadsheet and three-dimensionalizes the experience of analysis.
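The ATM example can be sketched in a few lines. The code below is only an illustration with Python's built-in sqlite3 module; the account table, the open_bank helper, and the withdraw function are hypothetical names, and a real banking system involves far more than this. The point is that the balance check and the update happen in one atomic statement inside a transaction, so the second of two competing withdrawals is refused.

```python
import sqlite3


def open_bank(balance):
    """Create an in-memory database with a single joint account (id 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO account VALUES (1, ?)", (balance,))
    conn.commit()
    return conn


def withdraw(conn, account_id, amount):
    """Withdraw only if the balance covers the amount; return True on success."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "UPDATE account SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        # rowcount is 0 when the WHERE clause rejected the withdrawal.
        return cursor.rowcount == 1


conn = open_bank(500)
first = withdraw(conn, 1, 500)   # first partner takes the full balance
second = withdraw(conn, 1, 500)  # second partner is refused
remaining = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```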

Difference Between Fact Table and Dimension Table

Reading Time: 9 minutes

Fact Table: A fact table is a primary table in a dimensional model. A fact table contains:
- Measurements/facts
- Foreign keys to dimension tables

Dimension table: A dimension table contains the dimensions of a fact.
- Dimension tables are joined to the fact table via a foreign key.
- Dimension tables are de-normalized tables.
- The dimension attributes are the various columns in a dimension table.
- Dimensions offer descriptive characteristics of the facts with the help of their attributes.
- There is no set limit on the number of dimensions.
- A dimension can also contain one or more hierarchical relationships.

Difference between Dimension table and Fact table

Definition
  Fact Table: Measurements, metrics or facts about a business process.
  Dimension Table: Companion table to the fact table; contains descriptive attributes used to constrain queries.

Characteristic
  Fact Table: Located at the center of a star or snowflake schema and surrounded by dimensions.
  Dimension Table: Connected to the fact table and located at the edges of the star or snowflake schema.

Design
  Fact Table: Defined by its grain, or most atomic level.
  Dimension Table: Should be wordy, descriptive, complete, and quality assured.

Task
  Fact Table: A measurable event for which dimension table data is collected; used for analysis and reporting.
  Dimension Table: Collection of reference information about a business.

Type of Data
  Fact Table: Could contain information like sales against a set of dimensions like Product and Date.
  Dimension Table: Every dimension table contains attributes that describe the details of the dimension. E.g., the Product dimension can contain Product ID, Product Category, etc.

Key
  Fact Table: Contains foreign keys that reference the dimension tables; together they typically form the fact table's key.
  Dimension Table: Has a primary key that the fact table references as a foreign key.

Storage
  Fact Table: Stores detailed atomic data.
  Dimension Table: Stores report labels and filter domain values.

Hierarchy
  Fact Table: Does not contain hierarchies.
  Dimension Table: Contains hierarchies. For example, Location could contain country, state, city, pin code, etc.
Types of facts

A fact table stores some basic unit of measurement of a business process. Some real-world examples include sales, phone calls, and orders. Facts fall into three types:

Additive
  Measures can be added across all dimensions.

Semi-Additive
  Measures can be added across some dimensions but not others.

Non-Additive
  Measures cannot be meaningfully added across any dimension, for example ratios and percentages.

Types of Dimensions:

Conformed Dimensions
  A conformed dimension means the same thing for every fact to which it relates. It is used in more than one star schema or data mart.

Outrigger Dimensions
  A dimension may have a reference to another dimension table. These secondary dimensions are called outrigger dimensions. This kind of dimension should be used carefully.

Shrunken Rollup Dimensions
  Shrunken rollup dimensions are a subset of the rows and columns of a base dimension. They are useful for developing aggregated fact tables.

Dimension-to-Dimension Table Joins
  Dimensions may have references to other dimensions. However, these relationships can be modeled with outrigger dimensions.

Role-Playing Dimensions
  A single physical dimension can be referenced multiple times in a fact table, with each reference linking to a logically distinct role for the dimension.

Junk Dimensions
  A collection of random transactional codes, flags or text attributes. It may not logically belong to any specific dimension.

Degenerate Dimensions
  A degenerate dimension has no corresponding dimension table, because it is derived from the fact table itself. It is used in transaction and accumulating snapshot fact tables.

Swappable Dimensions
  Used when the same fact table is paired with different versions of the same dimension.

Step Dimensions
  Sequential processes, like web page events, mostly have a separate row in a fact table for every step in the process. A step dimension tells where a specific step fits in the overall session.
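The difference between additive and semi-additive facts is easiest to see with numbers. The sketch below uses a made-up daily account snapshot (the rows, column layout, and helper name are assumptions for illustration): deposits are additive, while the closing balance is semi-additive because it can be summed across accounts but not across days.

```python
# Hypothetical daily snapshot rows: (day, account, deposits, closing_balance)
ROWS = [
    ("2024-01-01", "A", 100, 100),
    ("2024-01-01", "B", 50, 50),
    ("2024-01-02", "A", 20, 120),
    ("2024-01-02", "B", 0, 50),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(deposits for _, _, deposits, _ in ROWS)


def bank_balance_on(day):
    """Semi-additive: balances add up across accounts within a single day."""
    return sum(balance for d, _, _, balance in ROWS if d == day)


# Across days, use the balance on the last day instead of a sum.
last_day = max(day for day, _, _, _ in ROWS)
period_end_balance = bank_balance_on(last_day)

# Summing balances over days counts the same money more than once.
naive_sum = sum(balance for _, _, _, balance in ROWS)
```

Here the naive sum is larger than the money actually held at the end of the period, which is why reporting tools usually apply "last value" or "average" rather than "sum" to balances over time.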

Data Modelling with Erwin

Reading Time: 3 minutes

Definition: Data modeling is a process used to define and analyze data requirements needed to support the business processes. The process of data modeling involves professional data modelers working with business stakeholders, as well as potential users of the information system. Types: Conceptual data model: It is a set of technology independent specifications about the… Continue reading Data Modelling with Erwin

SMP vs MPP Architecture

Reading Time: 2 minutes

Databases such as Oracle, DB2, Sybase: Symmetrical Multi-Processing (SMP) architecture was once the champion of the data warehouse. It was rivalled by MPP because SMP had the following disadvantages: Failures of components did not result in a graceful decline in performance. Rather, the whole system failed and data was unrecoverable until the failure was resolved. Upon… Continue reading SMP vs MPP Architecture

Netezza

Reading Time: 2 minutes

IBM Netezza Overview; Netezza Architecture; Connecting to Netezza; Databases, Tables and Database Objects; Data Distribution; Loading and Unloading Tables; Statistics; Zone Maps and Clustered Base Tables; Materialized Views; Groom; Sequences; Transactions; Query and System Optimization; nz commands; Backup and Restore; Creating User and User Management; Query History; Managing Workloads; Managing Events; Createxid, Deletexid; NPS AMPP… Continue reading Netezza

What is Dimensional Model in Data Warehouse?

Reading Time: 5 minutes

What is Dimensional Model?

A dimensional model is a data structure technique optimized for data warehousing tools. The concept of Dimensional Modelling was developed by Ralph Kimball and consists of “fact” and “dimension” tables.

A dimensional model is designed to read, summarize, and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse. In contrast, relational models are optimized for addition, updating and deletion of data in a real-time Online Transaction System.

Dimensional and relational models each have their own way of storing data, with specific advantages. For instance, in the relational model, normalization and ER models reduce redundancy in data. A dimensional model, by contrast, arranges data so that it is easier to retrieve information and generate reports. Hence, dimensional models are used in data warehouse systems and are not a good fit for relational systems.

In this tutorial, you will learn:
- What is Dimensional Model?
- Elements of Dimensional Data Model: Fact, Dimension, Attributes, Fact Table, Dimension table
- Steps of Dimensional Modelling: Step 1) Identify the business process, Step 2) Identify the grain, Step 3) Identify the dimensions, Step 4) Identify the Fact, Step 5) Build Schema
- Rules for Dimensional Modelling
- Benefits of dimensional modeling

Elements of Dimensional Data Model

Fact

Facts are the measurements or metrics from your business process. For a Sales business process, a measurement would be the quarterly sales number.

Dimension

A dimension provides the context surrounding a business process event. In simple terms, dimensions give the who, what, and where of a fact. In the Sales business process, for the fact quarterly sales number, the dimensions would be:
- Who: Customer Names
- Where: Location
- What: Product Name

In other words, a dimension is a window to view information in the facts.

Attributes

Attributes are the various characteristics of a dimension. In the Location dimension, the attributes can be:
- State
- Country
- Zipcode, etc.

Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.

Fact Table

A fact table is a primary table in a dimensional model. A fact table contains:
- Measurements/facts
- Foreign keys to dimension tables

Dimension table

- A dimension table contains the dimensions of a fact.
- Dimension tables are joined to the fact table via a foreign key.
- Dimension tables are de-normalized tables.
- The dimension attributes are the various columns in a dimension table.
- Dimensions offer descriptive characteristics of the facts with the help of their attributes.
- There is no set limit on the number of dimensions.
- A dimension can also contain one or more hierarchical relationships.

Steps of Dimensional Modelling

The accuracy of your dimensional model determines the success of your data warehouse implementation. Here are the steps to create a dimensional model:
- Identify Business Process
- Identify Grain (level of detail)
- Identify Dimensions
- Identify Facts
- Build Schema (for example, a star schema)

The model should describe the Why, How much, When/Where/Who and What of your business process.

Step 1) Identify the business process

Identify the actual business process the data warehouse should cover. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. The selection of the business process also depends on the quality of data available for that process. It is the most important step of the data modelling process, and a failure here would have cascading and irreparable defects.

To describe the business process, you can use plain text, basic Business Process Modelling Notation (BPMN), or Unified Modelling Language (UML).

Step 2) Identify the grain

The grain describes the level of detail for the business problem/solution. Identifying it means identifying the lowest level of information for any table in your data warehouse. If a table contains sales data for every day, then it has daily granularity. If a table contains total sales data for each month, then it has monthly granularity.

During this stage, you answer questions like:
- Do we need to store all the available products or just a few types of products? This decision is based on the business processes selected for the data warehouse.
- Do we store the product sale information on a monthly, weekly, daily or hourly basis? This decision depends on the nature of the reports requested by executives.
- How do the above two choices affect the database size?

Example of Grain: The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis. So, the grain is “product sale information by location by the day.”

Step 3) Identify the dimensions

Dimensions are nouns like date, store, inventory, etc. Dimensions are where the descriptive data is stored. For example, the date dimension may contain data like year, month and weekday.

Example of Dimensions: The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.
- Dimensions: Product, Location and Time
- Attributes: For Product: Product key (the dimension's primary key, referenced as a foreign key from the fact table), Name, Type, Specifications
- Hierarchies: For Location: Country, State, City, Street Address, Name

Step 4) Identify the Fact

This step is closely tied to the business users of the system, because this is where they get access to data stored in the data warehouse. Most fact table columns hold numerical values like price or cost per unit.

Example of Facts: The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis. The fact here is the sum of sales by product by location by time.

Step 5) Build Schema

In this step, you implement the dimensional model. A schema is nothing but the database structure (the arrangement of tables). There are two popular schemas.

Star Schema

The star schema architecture is easy to design. It is called a star schema because its diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.
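The five steps can be sketched end to end with a small star schema. The example below uses Python's built-in sqlite3 module; every table name, column name, and data row is a made-up assumption that follows the CEO example above (grain: product sale by location by day), not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes, one primary key each.
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, country TEXT, state TEXT, city TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER, day INTEGER);

    -- Fact table at the center of the star.
    -- Grain: one row per product, per location, per day.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        sales_amount REAL,
        PRIMARY KEY (product_key, location_key, date_key)
    );

    INSERT INTO dim_product  VALUES (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Accessory');
    INSERT INTO dim_location VALUES (1, 'India', 'Karnataka', 'Bengaluru'), (2, 'USA', 'Texas', 'Austin');
    INSERT INTO dim_date     VALUES (20240101, 2024, 1, 1), (20240102, 2024, 1, 2);

    INSERT INTO fact_sales VALUES
        (1, 1, 20240101, 1000.0),
        (1, 1, 20240102, 1500.0),
        (2, 1, 20240101, 40.0),
        (1, 2, 20240101, 900.0);
    """
)

# The fact: sum of sales, viewed through the Product and Location dimensions.
sales_by_product_and_country = conn.execute(
    """
    SELECT p.name, l.country, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY p.name, l.country
    ORDER BY p.name, l.country
    """
).fetchall()
```

Because the grain is daily, the same fact table can answer monthly or yearly questions by joining dim_date and grouping on a coarser attribute; the reverse would not be possible if only monthly totals had been stored.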