Data Warehousing and Database Concepts: The Basics
The basic concept of a data warehouse is to provide a single version of the truth for a company's decision making and forecasting. A data warehouse is an information system that contains historical and cumulative data from one or more sources, and it simplifies an organization's reporting and analysis processes.
- Cloud Data Warehouse vs Traditional Data Warehouse Concepts
- Cloud Data Warehouse Concepts
- Conclusion: Traditional vs. Cloud Data Warehouse Concepts in Brief
- Data Warehouse Architecture: Traditional vs. Cloud
- New Data Warehouse Architectures
- Data Mart vs. Data Warehouse
- The Difference Between a Data Warehouse and a Database
- Data Lake vs. Data Warehouse
- 12 Cloud Data Warehouse Tools
- Data Warehouse Testing
Modern Data Warehouse
A modern data warehouse lets you easily bring together all your data at any scale and deliver insights to all your users through analytical dashboards, operational reports, or advanced analytics.
- BI and Data Warehousing
- Data Warehousing and Data Mining
- The New EDW: Meet the Big Data Stack
- Build An ETL Process with Examples
- ETL – load data into AWS Redshift
- Full vs Incremental Loading in ETL
- Best Cloud-Based and Open Source Tools
- Get Started with ETL
- ETL Testing
- Redshift Architecture, Pricing, and Performance
- Redshift Columnar Storage
- Redshift Cluster 101
- Snowflake and the Future of Data Warehousing
- Google BigQuery Architecture
Cloud Basics and Why Choose a Cloud Data Warehouse
Due to their architecture, cloud-based data warehouses offer some major advantages over traditional systems, such as scalability, reliability, security, and adaptability. Many organizations cite a lack of resources and expertise as barriers to implementing an on-site data warehouse solution; this is where cloud data warehouses become the preferred option.
- What is cloud computing?
- Cloud Computing Service Models – IaaS, PaaS, SaaS
- Cloud Computing Deployment Models – Public, Private & Hybrid
- Cloud Computing Basics – Compute
- Cloud Computing Basics – Storage
- Cloud Computing Basics – Network
- Cloud Computing Basics – Serverless
- What is a cloud data warehouse?
Introduction To Snowflake
Introduction to the powerful and unique features that Snowflake provides for ensuring your data is protected, secure, and available.
Connecting To Snowflake
This guide will walk you through how to connect to Snowflake. You may need to contact the administrator of your Snowflake account if you lack some of the credentials discussed below.
Loading/Unloading Data Into/From Snowflake
These topics describe the concepts and tasks for loading data into, and unloading (i.e., exporting) data from, Snowflake tables, including key concepts and best practices.
- Overview of Data Loading/Unloading
- Data Loading/Unloading Considerations
- Preparing to Load/Unload Data
- Bulk Loading/Unloading from Amazon S3
- Bulk Loading/Unloading from a Local File System
- Bulk Loading/Unloading from Microsoft Azure
- Loading Continuously Using Snowpipe
- Loading Using the Web Interface (Limited)
- Querying Data in Staged Files
- Querying Metadata for Staged Files
- Transforming Data During a Load
Database Objects and Querying
A database object is any defined object in a database that is used to store or reference data; anything created with a CREATE command is a database object. Examples of database objects include tables, views, sequences, and indexes.
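As a quick illustration using Python's built-in sqlite3 module (the table and view names below are made up for the example):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each CREATE statement defines a database object: a table, an index, a view.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
cur.execute("CREATE INDEX idx_orders_amount ON orders (amount)")
cur.execute("CREATE VIEW big_orders AS SELECT * FROM orders WHERE amount > 100")

cur.execute("INSERT INTO orders (amount) VALUES (50), (250)")
rows = cur.execute("SELECT amount FROM big_orders").fetchall()
print(rows)  # [(250.0,)] — only the order over 100 appears in the view
```

The view references the table but stores no data itself, which is exactly the "reference data" half of the definition above.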
Sharing Data In Snowflake
Snowflake data providers can share data that resides in different databases by using secure views. A secure view can reference objects such as schemas, tables, and other views from one or more databases, as long as these databases belong to the same account.
Managing Your Snowflake Account
These topics describe the administrative concepts and tasks associated with managing your Snowflake account, such as the account identifier. They are intended primarily for administrators (i.e., users with the ACCOUNTADMIN, SYSADMIN, or SECURITYADMIN roles).
Advanced Topics
Welcome to Advanced Topics. The goal of this course is to provide you with a deeper understanding of how to optimize your usage of Snowflake.
- Informatica Cloud Services and Snowflake Integration
- Informatica Snowflake Connector
- Informatica Snowflake JDBC Connection
- Informatica Snowflake Key Range Partitioning
- Informatica Snowflake Objects in Mappings
- Informatica Snowflake Sources in Mappings
- Informatica Snowflake Targets in Mappings
- Informatica Snowflake Lookups in Mappings
- Data Clustering
- Clustering Keys & Clustered Tables
- Using the Spark Snowflake Connector
Cloud Data Warehouse vs Traditional Data Warehouse Concepts
The shift to the cloud does not mean that traditional data warehouse ideas are dead; classical data warehouse theory underpins most of what cloud-based data warehouses do.
In this lesson, we’ll explain the traditional data warehouse concepts you need to know and the most important cloud ones from a selection of the top providers: Amazon, Google, and Microsoft. Finally, we’ll wrap up with a cost-benefit analysis of traditional vs. cloud data warehouses, so you know which one is right for you.
Let’s get started.
Traditional Data Warehouse Concepts (to Understand the Cloud Data Warehouse)
A data warehouse is any system that collates data from a wide range of sources within an organization. Data warehouses are used as centralized data repositories for analytical and reporting purposes.
A traditional data warehouse is located on-site in your company’s servers. You purchase the hardware, the server rooms and hire the staff to run it. They are also called on-premises, on-prem or on-premise data warehouses.
Facts, Dimensions, and Measures
The core building blocks of information in a data warehouse are facts, dimensions, and measures.
A fact is the part of your data that indicates a specific occurrence or transaction. For example, if your business sells flowers, some facts you would see in your data warehouse are:
- Sold 30 roses in-store for $19.99
- Ordered 500 new flower pots from China for $1500
- Paid salary of cashier for this month $1000
Several numbers can describe each fact, and we call these numbers measures. Some measures to describe the fact ‘ordered 500 new flower pots from China for $1500’ are:
- Quantity ordered – 500
- Cost – $1500
When analysts are working with data, they perform calculations on measures (e.g., sum, maximum, average) to glean insights. For example, you may want to know the average number of flower pots you order each month.
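The aggregation above can be sketched in Python; the monthly order quantities are made-up figures:

```python
# Hypothetical monthly flower-pot order quantities (made-up figures).
monthly_pot_orders = [500, 320, 410, 275]

# Analysts compute aggregates over measures to glean insights.
total = sum(monthly_pot_orders)
average = total / len(monthly_pot_orders)
maximum = max(monthly_pot_orders)

print(total, average, maximum)  # 1505 376.25 500
```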
A dimension categorizes facts and measures and provides structured labeling information for them – otherwise, they would just be a collection of unordered numbers! Some dimensions to describe the fact ‘ordered 500 new flower pots from China for $1500’ are:
- Country purchased from – China
- Time purchased – 1 pm
- Expected date of arrival – June 6th
You cannot perform calculations on dimensions explicitly, and doing so probably would not be very helpful – how can you find the ‘average arrival date for orders’? However, it is possible to create new measures from dimensions, and these are useful. For example, if you know the average number of days between the order date and arrival date, you can better plan stock purchases.
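Deriving such a measure from two date dimensions can be sketched like this (the dates are hypothetical):

```python
from datetime import date

# Hypothetical order records: (order_date, arrival_date) dimension values.
orders = [
    (date(2023, 5, 20), date(2023, 6, 6)),
    (date(2023, 5, 25), date(2023, 6, 3)),
]

# Derive a new measure (lead time in days) from the two date dimensions,
# then aggregate it like any other measure.
lead_times = [(arrival - ordered).days for ordered, arrival in orders]
avg_lead_time = sum(lead_times) / len(lead_times)
print(lead_times, avg_lead_time)  # [17, 9] 13.0
```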
Normalization and Denormalization (also relevant for Cloud Data Warehouses)
Normalization is the process of efficiently organizing data in a data warehouse (or any other place that stores data). The main goals are to reduce data redundancy – i.e., remove any duplicate data – and improve data integrity – i.e., improve the accuracy of data. There are different levels of normalization and no consensus for the ‘best’ method. However, all methods involve storing separate but related pieces of information in different tables.
There are many benefits to normalization, such as:
- Faster searching and sorting on each table
- Simpler tables make data modification commands faster to write and execute
- Less redundant data means you save on disk space, and so you can collect and store more data
Denormalization is the process of deliberately adding redundant copies or groups of data to already normalized data. It is not the same as un-normalized data. Denormalization improves the read performance and makes it much easier to manipulate tables into forms you want. When analysts work with data warehouses, they typically only perform reads on the data. Thus, denormalized data can save them vast amounts of time and headaches.
Benefits of denormalization:
- Fewer tables minimize the need for table joins, which speeds up data analysts' workflow and leads them to discover more useful insights in the data
- Fewer tables simplify queries leading to fewer bugs
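A minimal sketch of the trade-off, using plain Python dictionaries as stand-in tables (all names and values are made up):

```python
# Normalized: customer details live in one place, referenced by key.
customers = {1: {"name": "Alice", "city": "Austin"}}
orders = [{"order_id": 10, "customer_id": 1, "amount": 19.99}]

# Reading requires a join-like lookup across the two tables.
joined = [{**o, **customers[o["customer_id"]]} for o in orders]

# Denormalized: the same data with the customer fields copied into each order.
orders_denorm = [
    {"order_id": 10, "customer_id": 1, "amount": 19.99,
     "name": "Alice", "city": "Austin"},
]

# Reads are now a single scan, at the cost of redundant copies to keep in sync.
print(joined[0]["city"])  # Austin
```

Updating Alice's city in the normalized layout touches one row; in the denormalized layout it touches every order she ever placed, which is exactly the redundancy/read-speed trade-off described above.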
It would be wildly inefficient to store all your data in one massive table. So, your data warehouse contains many tables that you can join together to get specific information. The main table is called a fact table, and dimension tables surround it.
The first step in designing a data warehouse is to build a conceptual data model that defines the data you want and the high-level relationships between them.
In our conceptual model, we are storing Sales data and have three additional tables – Time, Product, and Store – that provide extra, more granular information about each sale. The fact table is Sales, and the others are dimension tables.
The next step is to define a logical data model. This model describes the data in detail in plain English without worrying about how to implement it in code.
In the logical model, we have filled out which information each table contains in plain English. Each of the Time, Product, and Store dimension tables has a Primary Key (PK) plus its descriptive attributes, while the Sales table contains three Foreign Keys (FK) so that it can quickly join with the other tables.
The final stage is to create a physical data model. This model tells you how to implement the data warehouse in code. It defines tables, their structure, and the relationship between them. It also specifies data types for columns, and everything is named as it will be in the final data warehouse, i.e., all caps and connected with underscores. Lastly, each dimension table starts with DIM_, and each fact table starts with FACT_.
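A sketch of what such a physical model might look like, here using Python's sqlite3 module; the column names and types are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Physical model sketch for the flower-shop example (illustrative names).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables start with DIM_, the fact table with FACT_.
cur.execute("CREATE TABLE DIM_TIME (TIME_ID INTEGER PRIMARY KEY, SALE_DATE TEXT)")
cur.execute("CREATE TABLE DIM_PRODUCT (PRODUCT_ID INTEGER PRIMARY KEY, PRODUCT_NAME TEXT)")
cur.execute("CREATE TABLE DIM_STORE (STORE_ID INTEGER PRIMARY KEY, STORE_CITY TEXT)")
cur.execute("""
    CREATE TABLE FACT_SALES (
        TIME_ID    INTEGER REFERENCES DIM_TIME (TIME_ID),
        PRODUCT_ID INTEGER REFERENCES DIM_PRODUCT (PRODUCT_ID),
        STORE_ID   INTEGER REFERENCES DIM_STORE (STORE_ID),
        UNITS_SOLD INTEGER,
        TOTAL_USD  REAL
    )
""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['DIM_PRODUCT', 'DIM_STORE', 'DIM_TIME', 'FACT_SALES']
```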
Now you know how to design a data warehouse, but there are a few nuances to fact and dimension tables that we’ll explain next.
Each business function – e.g., sales, marketing, finance – has a corresponding fact table.
Fact tables have two types of columns: dimension columns and fact columns. Dimension columns contain Foreign Keys (FK) that you use to join a fact table with a dimension table; these foreign keys are the Primary Keys (PK) of the corresponding dimension tables. Fact columns contain the actual data and measures to be analyzed, e.g., the number of items sold and the total dollar value of sales.
A factless fact table is a particular type of fact table that only has dimension columns. Such tables are useful for tracking events, such as student attendance or employee leave, as the dimensions tell you everything you need to know about the events.
A factless fact table tracking employee leave, for example, has no fact columns since you just need to know:
- What day they were off (DAY_ID).
- How long they were off (TIME_ID).
- Who was on leave (EMPLOYEE_ID).
- Their reason for being on leave, e.g., illness, holiday, doctor’s appointment, etc. (LEAVE_TYPE_ID).
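Since a factless fact table has no measures, analysis typically means counting event rows. A sketch using Python's sqlite3 module (all IDs are made up):

```python
import sqlite3

# Factless fact table for employee leave: dimension keys only, no measures.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE FACT_EMPLOYEE_LEAVE (
        DAY_ID INTEGER, TIME_ID INTEGER,
        EMPLOYEE_ID INTEGER, LEAVE_TYPE_ID INTEGER
    )
""")
cur.executemany(
    "INSERT INTO FACT_EMPLOYEE_LEAVE VALUES (?, ?, ?, ?)",
    [(1, 1, 101, 1), (1, 1, 102, 2), (2, 1, 101, 1)],
)

# With no measure columns, the analysis is a row count per dimension value.
days_off = cur.execute(
    "SELECT EMPLOYEE_ID, COUNT(*) FROM FACT_EMPLOYEE_LEAVE "
    "GROUP BY EMPLOYEE_ID ORDER BY EMPLOYEE_ID").fetchall()
print(days_off)  # [(101, 2), (102, 1)]
```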
Star Schema vs. Snowflake Schema
The above data warehouses have all had a similar layout. However, this is not the only way to arrange them.
The two most common schemas used to organize data warehouses are star and snowflake. Both methods use dimension tables that describe the information contained within a fact table.
The star schema takes the information from the fact table and splits it into denormalized dimension tables. The emphasis for the star schema is on query speed. Only one join is needed to link fact tables to each dimension, so querying each table is easy. However, since the tables are denormalized, they often contain repeated and redundant data.
The snowflake schema splits the fact table into a series of normalized dimension tables. Normalizing creates more dimension tables, and so reduces data integrity issues. However, querying is more challenging using the snowflake schema because you need more table joins to access the relevant data. So, you have less redundant data, but it is harder to access.
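The join-depth difference can be sketched with Python's sqlite3 module; the tiny tables and values below are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star: one denormalized store dimension, so one join from the fact table.
cur.execute("CREATE TABLE DIM_STORE_STAR (STORE_ID INTEGER, CITY TEXT, COUNTRY TEXT)")
cur.execute("INSERT INTO DIM_STORE_STAR VALUES (1, 'Austin', 'USA')")

# Snowflake: the same dimension normalized into two tables.
cur.execute("CREATE TABLE DIM_STORE_SNOW (STORE_ID INTEGER, CITY_ID INTEGER)")
cur.execute("CREATE TABLE DIM_CITY (CITY_ID INTEGER, CITY TEXT, COUNTRY TEXT)")
cur.execute("INSERT INTO DIM_STORE_SNOW VALUES (1, 10)")
cur.execute("INSERT INTO DIM_CITY VALUES (10, 'Austin', 'USA')")

cur.execute("CREATE TABLE FACT_SALES (STORE_ID INTEGER, TOTAL_USD REAL)")
cur.execute("INSERT INTO FACT_SALES VALUES (1, 19.99)")

star = cur.execute("""
    SELECT d.COUNTRY, f.TOTAL_USD FROM FACT_SALES f
    JOIN DIM_STORE_STAR d ON f.STORE_ID = d.STORE_ID
""").fetchone()
snow = cur.execute("""
    SELECT c.COUNTRY, f.TOTAL_USD FROM FACT_SALES f
    JOIN DIM_STORE_SNOW s ON f.STORE_ID = s.STORE_ID
    JOIN DIM_CITY c ON s.CITY_ID = c.CITY_ID
""").fetchone()
print(star, snow)  # same answer, but the snowflake query needs an extra join
```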
Now we’ll explain some more fundamental data warehouse concepts.
OLAP vs. OLTP
Online transaction processing (OLTP) is characterized by short write transactions that involve the front-end applications of an enterprise’s data architecture. OLTP databases emphasize fast query processing and only deal with current data. Businesses use these to capture information for business processes and provide source data for the data warehouse.
Online analytical processing (OLAP) allows you to run complex read queries and thus perform a detailed analysis of historical transactional data. OLAP systems help to analyze the data in the data warehouse.
Three Tier Architecture
Traditional data warehouses are typically structured in three tiers:
- Bottom Tier: A database server, typically an RDBMS, that extracts data from different sources using a gateway. Data sources fed into this tier include operational databases and other types of front-end data such as CSV and JSON files.
- Middle Tier: An OLAP server that either:
  - Directly implements the operations, or
  - Maps the operations on multidimensional data to standard relational operations, e.g., flattening XML or JSON data into rows within tables.
- Top Tier: The querying and reporting tools for data analysis and business intelligence.
Virtual Data Warehouse / Data Mart
Virtual data warehousing uses distributed queries on several databases, without integrating the data into one physical data warehouse.
Data marts are subsets of data warehouses oriented for specific business functions, such as sales or finance. A data warehouse typically combines information from several data marts in multiple business functions. Yet, a data mart contains data from a set of source systems for one business function.
Kimball vs. Inmon
There are two approaches to data warehouse design, proposed by Bill Inmon and Ralph Kimball. Bill Inmon is an American computer scientist who is recognized as the father of the data warehouse. Ralph Kimball is one of the original architects of data warehousing and has written several books on the topic.
The two experts had conflicting opinions on how data warehouses should be structured. This conflict has given rise to two schools of thought.
The Inmon approach is a top-down design. With the Inmon methodology, the data warehouse is created first and is seen as the central component of the analytic environment. Data is then summarized and distributed from the centralized warehouse to one or more dependent data marts.
The Kimball approach takes a bottom-up view of data warehouse design. In this architecture, an organization creates separate data marts, which provide views into single departments within an organization. The data warehouse is the combination of these data marts.
ETL vs. ELT
Extract, Transform, Load (ETL) describes the process of extracting the data from source systems (typically transactional systems), converting the data to a format or structure suitable for querying and analysis, and finally loading it into the data warehouse. ETL leverages a separate staging database and applies a series of rules or functions to the extracted data before loading.
Extract, Load, Transform (ELT) is a different approach to loading data. ELT takes the data from disparate sources and loads it directly into the target system, such as the data warehouse. The system then transforms the loaded data on-demand to enable analysis.
ELT offers quicker loading than ETL, but it requires a powerful system to perform the data transformations on-demand.
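A minimal ETL sketch in Python, with the transform step applied before loading into an in-memory SQLite "warehouse" (all names and values are made up):

```python
import sqlite3

# Extract: rows pulled from a hypothetical source system.
source_rows = [("rose", "19.99 USD"), ("tulip", "9.50 USD")]

def transform(row):
    # Transform: convert the price string into a numeric measure.
    name, price = row
    return (name, float(price.split()[0]))

staged = [transform(r) for r in source_rows]

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE FACT_SALES (PRODUCT TEXT, PRICE_USD REAL)")
warehouse.executemany("INSERT INTO FACT_SALES VALUES (?, ?)", staged)

total = warehouse.execute("SELECT SUM(PRICE_USD) FROM FACT_SALES").fetchone()[0]
print(round(total, 2))  # 29.49
```

Under ELT, the raw `"19.99 USD"` strings would be loaded first and the same conversion would run inside the target system on demand.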
Enterprise Data Warehouse vs Cloud Data Warehouse
An enterprise data warehouse is intended as a unified, centralized warehouse containing all transactional information in the organization, both current and historical. An enterprise data warehouse should incorporate data from all subject areas related to the business, such as marketing, sales, finance, and human resources.
These are the core ideas that make up traditional data warehouses. Now, let’s look at what cloud data warehouses have added on top of them.
Other links you may be interested in
To continue diving deeper into Teradata
Get the full Teradata course free
Learn about the Netezza database