Amazon Redshift

Amazon Redshift is a fully managed, peta-byte scale data warehouse available as a pay as you go service.

Redshift provides a powerful and cost effective alternative to the traditional on premise data warehouses.

According to an interview by Matt Wood, General Manager, Data Sciences at Amazon Web Services, Redshift has more than 1000 global customers since its launch in February and is the fastest growing AWS service ever.

In this article, I list down some of the things you should know about Redshift both in terms of it’s features and limitations. The points listed here are based on AWS documentation and our experience while implementing Redshift for our customers and our product Cloudlytics.

Please share your feedback about the blog and add any other features or limitations you’ve experienced with Redshift.

1.Massively Parallel Processing (MPP) Architecture

Amazon Redshift has a Massively Parallel Processing Architecture. MPP enables Redshift to distribute and parallelize queries across multiple nodes. Apart from queries, the MPP architecture also enables parallel operations for data loads, backups and restores.
Redshift architecture is inherently parallel; there is no additional tuning or overheads for distribution of loads for the end users.

2.Redshift supports Single Node Clusters to 100 Nodes Clusters with up to 1.6 PB of storage

You can provision a Redshift cluster with from a single Node to 100 Nodes configuration depending on the processing and storage capacity required. Redshift nodes come in two sizes XL & 8XL. XL node comes with 2 TB attached storage and 8XL node comes with 16 TB attached storage.
Clusters can have a maximum of 32 XL nodes (64 TB) or 100 8XL nodes (1.6 PB).

3.Redshift does not support multi AZ deployments

Redshift clusters currently support only Single AZ deployments. You will not be able to access Redshift n case of an Availability Zone failure. An AZ failure will not affect the durability of your data, you can start using the cluster once the AZ is available. To ensure continuous access to your data, you can launch an additional cluster in different AZ. You can restore a new Redshift cluster in a different AZ by recreating it using the snap shots of the original cluster. Alternately, you can have a cluster running always in a different AZ, accessing the same set of data from S3.

4.Columnar Storage & Data Compression

Redshift provides columnar data storage. With Columnar data storage, all values for a particular column are stored contiguously on the disk in sequential blocks.

Columnar data storage helps reduce the I/O requests made to the disk compared to a traditional row based data storage. It also reduces the amount of data loaded from the disk improving the processing speed, as more memory is available for query executions.

As similar data is stored sequentially, Redshift compresses the data rather efficiently. Compression of data further reduces the amount of I/O required for queries.

5.Parallel uploads to Redshift are supported only for data stored in Amazon S3 &
DynamoDB

Redshift currently supports data imports/copy only from S3 and DynamoDB.  Using COPY command from S3 is the fastest way to load data into Redshift. COPY loads data in parallel and is much more efficient than Insert statement.

Redshift does not have support to load data in parallel from other sources. You will either have to use Insert statements or write scripts to first load data into S3 and then into Redshift. This could sometime be a complex process depending on the size and format of data available with you.

6.Redshift is Secure

Amazon provides various security features for Redshift just like all other AWS services.
Access Control can be maintained at the account level using IAM roles. For data base level access control, you can define Redshift database groups and users and restrict access to specific database and tables.

Redshift can be launched in Amazon VPC. You can define VPC security groups to restrict inbound access to your clusters.
Redshift allows data encryption for all data which is stored in the cluster as well as SSL encryption for data in transit.

7.Distribution Keys

Redshift achieves high query performance by distributing data evenly on all the nodes of a cluster and slices within a node.

A Redshift cluster is made of multiple nodes and each node has multiple slices. The number of slices is equal to the number of processor cores in a node. Each slice is allocated a portion of node’s memory and disk space. During query execution the data is distributed across slices, the slices operate in parallel to execute the queries.

To distribute data evenly among slices, you need to define a distribution key for a table while creating it. If a distribution key is defined during table creation, any data, which is loaded in the table, is distributed across nodes based on the distribution key value. Matching values from a distribution key column are stored together.

A good distribution key will ensure even load distribution across slices, uneven distributions will cause some slices to handle more load than others, and slows down the query execution.

If a distribution key is not defined for a column, the data is by default distributed in a round robin fashion by Redshift.

8.You cannot change the distribution key once a table is created

A distribution key for a table cannot be amended once it is created. This is very important to keep in mind while identifying the right distribution key for a table.

To change a distribution key, the only work around is to create a new table with the updated distribution key, load data into this table and rename the table as the original table after deleting the original table.

9.Redshift does not enforce Database Constraints or support Indexes

You can define database constraints like unique, primary and foreign keys but these constraints are informational only and are not enforced by Redshift. These constraints, though are used by Redshift to create query execution plans, ensuring optimal execution. If the primary key and foreign key constraints are correct, they should be declared while creating tables to have optimal executions.

Redshift also does not support creation of secondary indexes on columns.

10.Redshift does not automatically reclaim space that is freed on deletes or updates

Redshift is based on PostgreSQL version 8.0.2 and inherits some of its limitations. One such limitation is that Redshift does not reclaim and reuse the space freed up by delete or update commands. The free space left by deleted or updated records in large numbers can cost some extra processing.

Every update command in Redshift first deletes the existing row and then inserts a new record with the updated values.

To reclaim this unused space, you can run the Vacuum command. Vacuum command reclaims the freed space and also sorts data in the disk.

Ideally there would be very little updates or deletes once data is loaded in a data warehouse, but in case it does, you can run the Vacuum command.

11.Query Concurrency in a cluster

Redshift enforces a query concurrency limit of 15 on a cluster.

Queries in are executed in a queue, by default there is one queue per query cluster which can run up to five concurrent queries. Users can modify the configuration to allow up to 15 queries per queue and a maximum of 8 queues.

The concurrent queries for a cluster across queues is limited to a maximum of 15. Users cannot modify this configuration.

Tags: , , , ,