Table of Contents
Data Warehouse
- In computing, a data warehouse is a system used for reporting and data analysis and is considered a core component of business intelligence.
- A data warehouse centralizes and consolidates large amounts of data from multiple source
- The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
- It is a relational database that is designed for query and analysis rather than for transaction processing.
- It usually contains historical data derived from transaction data, but it can include data from other sources.
- To perform analytics you need a data warehouse not a regular database
- OLAP (On-line Analytical Processing) is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations.
- RDS (MySQL..etc ) is an OLTP database, where there is detailed and current data, and a schema used to store transactional data
AWS Redshift
- Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service .
- It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more
- It costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
- Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL based clients and business intelligence (BI) tools using standard ODBC and JDBC connections.
- Queries are distributed and parallelized across multiple physical resources.
- Amazon Redshift uses replication and continuous backups to enhance availability and improve data durability and can automatically recover from component and node failures
- Redshift is a SQL based data warehouse used for analytics applications (Analytics DB)
- Example use cases: Sales Reporting, Health Care analytics
- It is suited for OLAP-based use cases (On Line Analytics Processing)
- Can store huge amount of data (a database), but can’t ingest huge amounts of data in real time (not like what Kinesis can do)
- You can launch & Configure it from AWS Console or through AWS Redshift APIs
- Redshift can:
- Fully recover from a node or component failure
- It automatically patches and performs data backup
- Backups can be stored for a user defined retention period
- Is 10 times faster than traditional SQL RDBMS
Availability and Durability
- Redshift automatically replicates all your data within your data warehouse cluster
- Redshift always keeps three copies of your data:
- The original one
- A replica on compute nodes (within the cluster)
- A backup copy on S3
- Cross Region Replication
- Redshift can asynchronously replicate your snapshots to S3 in another region for DR
Backup Retention
- Automatically patches and backs up (Snapshots) your data warehouse, storing the
- backups for a user-defined retention period in AWS S3.
- By default for one day (24 hours) but you can configure it from 0 – 35 days
- Automatic backups are stopped if you choose retention period of 0
- You have access to these automated snapshots during the retention period
- Manual backups are not deleted automatically, if you do not manually delete them, you will be charged standard S3 storage rates
- AWS Redshift currently supports only one AZ (no Multi-AZ option)
- You can restore from your backup to a new Redshift cluster in the same or a different AZ
Data Security
- Supports encryption of data at rest
- By default, AWS Redshift takes care of encryption key management
- You can choose to manage your own keys through
- HSM (Hardware Security Modules)
- AWS KMS (Key Management Service)
- Supports SSL Encryption, in-transit, between client applications and Redshift data warehouse cluster
- You can’t have direct access to your AWS Redshift cluster nodes
Amazon Redshift