Analytics Services

AWS Kinesis

Difference between SQS and Kinesis: SQS is a message queue service to decouple the application architecture. One or more elements put data in the queue and the other elements consume the data from that queue. Queues are designed to process data one to one, single entities.
Kinesis
- Producer: A Kinesis producer is any device which puts data records into a kinesis stream such as IoT devices, mobile applications, applications, EC2 instances or on-premises servers. Producer specify the stream name, partition key and the data to be added. The partition key allows Kinesis to select the particular shard used to ingest the data.
- Consumer: You are able to have multiple consumers such as EC2 instances running the Kinesis Consumer Library, Lambda functions set to invoke based on stream data records or Kinesis Firehose.
Kinesis Stream: 2 types of streams;
- Video Streams
- Data Streams:
Shard: This is a unit of throughput capacity. Each shard ingests up to 1MB/sec and 1000 records/sec, and emits up to 2MB/sec. To accommodate higher or lower throughput, the number of shards can be modified after the Kinesis stream is created using the API. A stream has at least one shard which provides 1MB of ingestion capacity and 2MB of consumption capacity. Shards can be added to streams to scale the performance on that stream.
- Calculation of number of shards: shards_required = max (ingestion_in_KB/1024, read_requirements_in_KB/2048)
- Per producer bandwidth = stream BW / number of producer
- Per consumer bandwidth = Stream BW / number of consumers (*enhanced fanout)
Kinesis Data Record: A data record is the basic entity written to and read from Kinesis streams. A data record consists of a sequence number, a partition key and a data blob. The data blob can be up to 1MB in size and not altered by kinesis anyway.
Enhanced Fan-out: you get to select specific consumers to guarantee a 2MB/s read throughput per BW.
Kinesis Agent can stream logs from your services.
Kinesis Producer Library aggregates and batches messasges via background processes if you have a lot of workloads with small messages. This is in order to optimise creation of shards.

Kinesis Data Stream

You need to have a data consumer such as Kinesis Data Analytics, EC2, Lambda, EMR etc.

Kinesis Data Firehose

It’s a managed service which can deliver real time data to supported services and mainly “Data Lake” services. It consumes data either from a kinesis stream or traditional producers. It currently supports a number of destinations such as HTTP endpoints, S3, Redshift, Elasticsearch and Splunk.
It also allows for data transformation during the movement between the producer and destination using lambda transformation functions such as apache logs to JSON or CSV.
Use Firehose when you want to take the data out from Kinesis into S3 or one of the other services for data processing. Firehose stores the data if you choose to get data from other producers than Kinesis.
Firehose can perform Ingestion-Transformation-Load activities by itself.
Firehose offers near realtime delivery where the buffer size in 1 MB fills or Buffer interval in 60 seconds passes and then deliver.
For Redshift, the data is copied to S3 first and then loaded into Redshift.

When to use Data Streams and Firehose

You need to use Data Streams for custom processing or sub-second latency
You need to use Firehose for use cases with no administration and latency of up to 60 seconds. It’s not ideal for real-time workloads.

Kinesis Data Analytics

Kinesis Data Analytics for SQL Applications is a Kinesis feature allowing realtime processing and analysis of stream data using SQL. Examples are time-series analytics, real time dashboards, metrics. During the retention window of kinesis or firehose streams, you would be able to query those data in the form of input streams tables with SQL queries. You can either operate SQL queries continuously or during a specified window. Outputs will be going into in-application output stream which is a virtual entity that would feed the data to the other AWS services.
SQL on streaming data: you look at aggregations (count, sum, min and so on) to take granular real-time data and turn them into insights.

Ingestion Best Practices

Optimisation for end to end latency
- Be mindful of per-shard max record and throughput limits. You can auto-scale your shard count. Scale conservatively to leave room for burst of traffic.
Optimisation for overall cost
- Leverage aggregation for high volume of small records
- Implement batching at producer (both buffer size and interval)
- Consider compression and encoding for smaller record sizes

Lambda for Kinesis Streams Processor

There are 3 execution models
- Poll-based: polling every second and go ahead and spin up an execution environment. Number of consumers are limited, be wary. Enhanced fan-out allows the scale of number of functions reading the streams
- Synchronous
- Asynchronous (event based)