Amazon Redshift

Data Warehouse

  • In computing, a data warehouse is a system used for reporting and data analysis and is considered a core component of business intelligence.
  • A data warehouse centralizes and consolidates large amounts of data from multiple source
  • The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
  • It is a relational database that is designed for query and analysis rather than for transaction processing.
    • It usually contains historical data derived from transaction data, but it can include data from other sources.
    • To perform analytics you need a data warehouse not a regular database
  • OLAP (On-line Analytical Processing) is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations.
  • RDS (MySQL..etc ) is an OLTP database, where there is detailed and current data, and a schema used to store transactional data

AWS Redshift

  • Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service .
  • It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more
  • It costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
  • Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL based clients and business intelligence (BI) tools using standard ODBC and JDBC connections.
  • Queries are distributed and parallelized across multiple physical resources.
  • Amazon Redshift uses replication and continuous backups to enhance availability and improve data durability and can automatically recover from component and node failures
  • Redshift is a SQL based data warehouse used for analytics applications (Analytics DB)
    • Example use cases: Sales Reporting, Health Care analytics
    • It is suited for OLAP-based use cases (On Line Analytics Processing)
    • Can store huge amount of data (a database), but can’t ingest huge amounts of data in real time (not like what Kinesis can do)
  • You can launch & Configure it from AWS Console or through AWS Redshift APIs
  • Redshift can:
    • Fully recover from a node or component failure
    • It automatically patches and performs data backup
    • Backups can be stored for a user defined retention period
    • Is 10 times faster than traditional SQL RDBMS

Availability and Durability

  • Redshift automatically replicates all your data within your data warehouse cluster
  • Redshift always keeps three copies of your data:
    • The original one
    • A replica on compute nodes (within the cluster)
    • A backup copy on S3
  • Cross Region Replication
    • Redshift can asynchronously replicate your snapshots to S3 in another region for DR

Backup Retention

  • Automatically patches and backs up (Snapshots) your data warehouse, storing the
    • backups for a user-defined retention period in AWS S3.
    • By default for one day (24 hours) but you can configure it from 0 – 35 days
  • Automatic backups are stopped if you choose retention period of 0
    • You have access to these automated snapshots during the retention period
  • Manual backups are not deleted automatically, if you do not manually delete them, you will be charged standard S3 storage rates
  • AWS Redshift currently supports only one AZ (no Multi-AZ option)
  • You can restore from your backup to a new Redshift cluster in the same or a different AZ

Data Security

  • Supports encryption of data at rest
  • By default, AWS Redshift takes care of encryption key management
  • You can choose to manage your own keys through
    • HSM (Hardware Security Modules)
    • AWS KMS (Key Management Service)
  • Supports SSL Encryption, in-transit, between client applications and Redshift data warehouse cluster
  • You can’t have direct access to your AWS Redshift cluster nodes
Amazon Redshift

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top