Manoeuvring Through the Big Data Highway (Shards) with Amazon Kinesis

Source: Screenshot from Official AWS Kinesis video

In my last post, we went over what Kinesis does and what it is capable of. In this post we will dive a bit deeper into the technical building blocks of Kinesis. And just to clear any doubts: yes, you will still understand this blog even if you’re not a web developer! So let’s pick up the Kinesis story where we left off. We know that Kinesis captures continuous streams of data and can process that data in real time. So what is this stream made of? How does data from different sources get processed logically without getting intermingled? That’s what we’re here to find out.

Shards – The Data Highway

So what is a Kinesis stream made of? Well, a Kinesis stream is made of one or more “shards”. So now you might be wondering: what is a shard? AWS defines a shard as:

“A Shard is a scaling unit for a stream. A shard is a uniquely identified group of data records in an Amazon Kinesis stream”

Whoa! That may have gone over the heads of many of you; don’t worry, it’s not as complicated as it sounds. Let me explain it this way: for now, let’s consider shards as carriers of data in a stream. Remember the lumberjack example from the previous post? We considered the logs of wood as the data and the stream of water as the Kinesis stream.


The Highway Example to the Rescue!

So now let’s take another example, this time of a highway. The highway is our Kinesis stream and the vehicles running on it are our data. For our scenario, let us consider a single-lane highway, where all the vehicles travel in one direction. As time passes, the number of vehicles on the highway keeps increasing, eventually leading to congestion and then a full traffic snarl-up! So what’s the solution? Simple: increase the capacity of the road. Widening the road adds lanes, so our single-lane highway becomes a two-lane highway. What if traffic gets congested again? Widen the road once more and make it a four-lane highway! The lanes here represent the “shards”. Want to increase the capacity of the road? Just increase the number of lanes. Want to increase the capacity of the Kinesis stream? Just increase the number of shards!

How do Shards make your Job a Walk in the Park?

So now we clearly understand the meaning of the sentence “A shard is a scaling unit for a stream”.
Fortunately, increasing the capacity of a Kinesis stream is not nearly as backbreaking as building a new lane on a highway! Adding shards to a Kinesis stream is just a matter of a few clicks. (We will include a step-by-step tutorial on how to build your own Kinesis app in a future blog, so stay tuned!)

Courtesy: AWS Website | Amazon Kinesis High Level Architecture

Here are a few facts about shards to ponder:

Each shard is capable of ingesting 1 MB/sec of data and up to 1,000 TPS (transactions per second)
HTTP “Puts” can range from 1 KB to a maximum of 50 KB
Data in a shard is stored for a maximum of 24 hours, i.e. within this time frame the data is available to be read, re-read, backfilled, analyzed, or moved to long-term storage
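To put the lane-counting idea into numbers, here is a rough sketch of how you might estimate the number of shards a stream needs from the per-shard write limits listed above. The function name, and the choice to treat 1 MB as 10**6 bytes, are my own assumptions for illustration; treat it as back-of-the-envelope arithmetic, not an official sizing formula.

```python
import math

# Per-shard write limits as quoted in the facts above.
SHARD_BYTES_PER_SEC = 1_000_000  # 1 MB/sec (assumed here to mean 10**6 bytes)
SHARD_RECORDS_PER_SEC = 1_000    # 1,000 put transactions per second


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Estimate how many shards are needed to absorb a given write load.

    Whichever limit is hit first (record count or byte volume) decides
    the answer. A stream always has at least one shard.
    """
    by_count = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / SHARD_BYTES_PER_SEC)
    return max(1, by_count, by_bytes)


if __name__ == "__main__":
    # Many small records: the record-count limit dominates.
    print(shards_needed(records_per_sec=3000, avg_record_bytes=200))
    # Fewer but larger records: the byte-volume limit dominates.
    print(shards_needed(records_per_sec=400, avg_record_bytes=10_000))
```

In highway terms: it does not matter whether the road is clogged by lots of small cars or a few huge trucks, you still need enough lanes for whichever fills the road first.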

Adding and removing shards is not only easy, it can also be done without disturbing the running application or the stream, i.e. one can increase or reduce the number of shards in “real time”. This is what gives Kinesis its qualities of “scalability” and “manageability”.
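If you prefer code to clicks, the same widening of the road can be sketched with the AWS SDK for Python (boto3). This is only a minimal sketch under some assumptions: it uses the `describe_stream_summary` and `update_shard_count` calls of the boto3 Kinesis client, the helper name `double_shard_count` is hypothetical, and the stream is assumed to already exist. The client is passed in as a parameter so the helper itself needs no AWS credentials to be read or tested.

```python
def double_shard_count(client, stream_name: str) -> int:
    """Ask Kinesis to double the number of open shards in a stream.

    `client` is expected to behave like boto3.client("kinesis").
    Returns the target shard count that was requested.
    """
    # Look up how many shards ("lanes") are currently open.
    summary = client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]

    # Request twice as many shards, spread evenly over the hash key space.
    target = current * 2
    client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
    return target
```

With real credentials you would create the client with `boto3.client("kinesis")` and call something like `double_shard_count(client, "my-stream")`, where "my-stream" is a placeholder name. The stream keeps accepting data while the resharding happens in the background.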


Partition Keys & Sequence Numbers: Just Like Lane Markings & Number Plates!

We have covered almost everything we need to know about shards. But wait, there is one small thing left: “partition keys”. So what are these partition keys?

Partition Keys are like Lane Markings

AWS defines them as:

“The partition key is used to group data by shard within the stream”

So, let’s get back to the highway example:
We can think of partition keys as the lane markings (those white markings) on the highway. It’s like saying traffic going to destination “abcd” takes this lane and traffic going to “efgh” takes that lane. Lane markings help categorize the traffic; likewise, partition keys help decide which data goes into which shard. Consider data coming into the Kinesis stream from three sources: Twitter, Facebook and some other random site. Each of these sources has its own unique “source id”. Ideally, we would want the data from these sources to be carried in separate shards. This is where partition keys come to the rescue. A partition key is specified by the application putting the data into the stream. If we use the source id as the partition key when inserting data, all records from a given source are automatically put into the same shard. (Two different keys can still land in the same shard, because the shard is chosen by hashing the key.)

Partition keys are Unicode strings with a maximum length limit of 256 characters. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards.
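To make the lane assignment a little more concrete, here is a small local sketch of the mapping described above: the partition key is hashed with MD5 into a 128-bit integer, and that integer falls into the hash key range owned by exactly one shard. The helper names are made up for illustration, and the ranges are assumed to split the hash space evenly across shards; a real stream reports its actual ranges per shard, and they may differ after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_hash_ranges(num_shards: int) -> list:
    """Split the 128-bit hash space into contiguous (start, end) ranges.

    Assumption: shards share the hash space evenly; the last shard
    absorbs any remainder.
    """
    width = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * width
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * width - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: list) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash value is outside every shard range")


if __name__ == "__main__":
    lanes = even_hash_ranges(4)  # a four-lane highway
    for source_id in ("twitter", "facebook", "some-other-site"):
        print(source_id, "-> shard", shard_for_key(source_id, lanes))
```

The same key always hashes to the same value, so it always travels in the same lane. Nothing stops two different keys from sharing a lane, though, which is why a handful of source ids will not necessarily spread out one per shard.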

Along with partition keys, there is something called the sequence number. Just as every vehicle on the road is uniquely identified by its registration number, every data record (the blob of data sent through a single “put” call) is identified by its sequence number.
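Here is a minimal sketch of where the partition key goes in and the sequence number comes out, using the `put_record` call of the boto3 Kinesis client. The helper name `put_event`, the JSON encoding of the payload and the idea of using the source id as the key are illustrative assumptions taken from the example above, not the only way to do it. As before, the client is passed in so the helper does not depend on AWS credentials.

```python
import json


def put_event(client, stream_name: str, source_id: str, payload: dict):
    """Put one record into a stream and return (shard id, sequence number).

    `client` is expected to behave like boto3.client("kinesis").
    """
    response = client.put_record(
        StreamName=stream_name,
        Data=json.dumps(payload).encode("utf-8"),  # the data blob
        PartitionKey=source_id,                    # the lane marking
    )
    # Kinesis replies with the shard that took the record and the
    # sequence number (the number plate) it assigned to it.
    return response["ShardId"], response["SequenceNumber"]
```

With a real client, something like `put_event(client, "my-stream", "twitter", {"text": "hello"})` would send one record, where "my-stream" is a placeholder stream name. Keeping the returned sequence number is handy if you later want to start reading from a particular record.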

Well, there you go: now you know what shards are! Stay tuned for our next blog, where we discuss how to set up your own Kinesis application in detail!
Don’t miss it, or any of our other posts! Subscribe to our blogs.
