AWS Kinesis vs. Kafka: Comparing Architectures, Features, and Cost
Event-driven architectures are very powerful, and they are a fundamental part of how modern software solutions are developed.
In this article we compare two technologies that power them: Apache Kafka® and AWS Kinesis. We will see how each can be used as the engine that drives an event-driven architecture, and which one is the best fit for you.
Introduction to Message Brokers
In very simple terms, a message broker is the central store where a software application can send and receive messages and data.
A useful analogy is how restaurant kitchens organize orders that come in. The order dockets are printed and placed on a rail, with magnets or clamps. They are typically sorted from oldest to newest, and there may be different streams depending on what was ordered, such as drinks or meals. Tasks are dispatched and the orders are prepared and served. Once the order is fulfilled, the docket is normally discarded, having served its purpose.
A message broker works in a very similar way. A software application uses the message broker protocol to send messages to, and receive messages from, a central store: the message broker. The message broker is designed to store these messages efficiently and to provide the data to any application that requests it.
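The idea can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not how Kafka or Kinesis are implemented; the `ToyBroker` class and the topic names are hypothetical and chosen to match the kitchen analogy.

```python
from collections import defaultdict


class ToyBroker:
    """A toy in-memory message broker: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def send(self, topic, message):
        """Append a message to a topic and return its position (offset)."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def receive(self, topic, offset=0, max_messages=10):
        """Return up to max_messages from a topic, starting at offset."""
        return self._topics[topic][offset:offset + max_messages]


# Two "rails" in the kitchen: one for meals, one for drinks.
broker = ToyBroker()
broker.send("meals", "burger")
broker.send("drinks", "lemonade")
broker.send("meals", "salad")

# A consumer reads the meal orders in the order they were placed.
meal_orders = broker.receive("meals")
```

Unlike the kitchen rail, this sketch keeps messages after they are read, so several applications can each read the same topic from their own offset.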
Introduction to Data Streaming
Data streaming is the process of continually generating and consuming messages, often via a message broker.
If we continue the restaurant analogy, during a lunchtime rush the order dockets will be coming in thick and fast, and it’s important to have a system in place where the orders are processed in the same order they were placed, and that the steps for creating each meal are delegated as efficiently as possible.
In software applications, data streaming is a lunchtime rush that never ends. A common use case is processing application metrics. Metrics are generated continuously and must be processed efficiently and in the order they were generated to extract the most value out of the data stream.
What Is Apache Kafka?
Apache Kafka is a distributed event store, or message broker, designed originally at LinkedIn and then open sourced in 2011.
The core paradigm in Apache Kafka is the producer and consumer. As you can infer, producers send messages to Kafka and consumers read them. Built on top of this concept is the Kafka Streams API, a powerful abstraction layer that you can use to build stream processing into your application.
Kafka is designed to be highly scalable — the application supports clusters of nodes, or brokers, and is able to efficiently distribute the read and write load as required.
Kafka clients are also designed to scale. Consumers can join together to form a consumer group, and the data is delivered to the group as a single entity; within the group, the topic's partitions are divided among the members. This allows consumer groups to consume massive amounts of data, and scale the load horizontally across a number of different hosts.
Apache Kafka is written in Java and Scala, and there are client libraries for almost every modern language. We have a producer and consumer example here.
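As a rough illustration of how work is shared inside a consumer group, the sketch below spreads a topic's partitions across the group's consumers in round-robin order. It is a simplification for illustration and not Kafka's actual assignor code; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, members):
    """Spread partitions across group members in round-robin order."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


# Six partitions shared by two consumers, then by three after one more joins.
before = assign_partitions(range(6), ["consumer-a", "consumer-b"])
after = assign_partitions(range(6), ["consumer-a", "consumer-b", "consumer-c"])
```

Each partition is read by only one consumer in the group, so adding consumers beyond the number of partitions leaves the extra consumers idle.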
What Is Amazon Kinesis?
Kinesis is a stream processing service developed and offered as a managed service by Amazon Web Services (AWS). Amazon offers it in four different configurations: Video Streams, Data Streams, Data Firehose, and Data Analytics.
With AWS Kinesis, the servers are hidden from you, but AWS gives you the ability to scale the capacity of the service to match the requirements of your application.
The Amazon Kinesis Client Library is offered in Java, Python, Ruby, Node.js, and .NET. AWS Kinesis is a proprietary solution, with tight integration to other AWS services. This makes it a prime candidate for those who are already invested in the AWS ecosystem.
Since the compute power is abstracted from the user, Kinesis users are charged per shard, the unit of capacity within a stream, and each shard has its performance capped at specific known values. For reads, each shard supports up to 5 transactions per second, up to a maximum total data read rate of 2 MB per second. For writes, each shard supports up to 1,000 records per second, up to a maximum total data write rate of 1 MB per second.
Customers need to match their desired performance with the number of shards necessary to facilitate it.
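A back-of-the-envelope sizing calculation based on the per-shard limits quoted above might look like the following. It is a sketch under stated assumptions: 1 MB is treated as 1,000,000 bytes, every consuming application reads the full stream and shares the 2 MB per second read limit, and the 5 read transactions per second limit is ignored. The function name `shards_needed` is hypothetical.

```python
import math

# Per-shard limits quoted in the text (1 MB taken as 1,000,000 bytes).
WRITE_RECORDS_PER_SHARD = 1_000      # records per second
WRITE_BYTES_PER_SHARD = 1_000_000    # bytes per second
READ_BYTES_PER_SHARD = 2_000_000     # bytes per second


def shards_needed(records_per_sec, avg_record_bytes, consumer_apps=1):
    """Estimate the shard count that satisfies all three limits."""
    write_bytes = records_per_sec * avg_record_bytes
    # Assumption: each consuming application reads the whole stream.
    read_bytes = write_bytes * consumer_apps
    return max(
        1,
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(write_bytes / WRITE_BYTES_PER_SHARD),
        math.ceil(read_bytes / READ_BYTES_PER_SHARD),
    )


# Many small records: the records-per-second limit dominates.
small_records = shards_needed(records_per_sec=5_000, avg_record_bytes=500)

# Fewer, larger records read by three applications: read throughput dominates.
large_records = shards_needed(records_per_sec=800, avg_record_bytes=4_000,
                              consumer_apps=3)
```

Whichever limit is hit first determines the shard count, so the result depends on record size and the number of readers as much as on the raw record rate.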
Architecture Comparison
Apache Kafka is a fairly traditional distributed application. The architecture of an Apache Kafka cluster consists of a collection of servers known as brokers. Together they offer a single service interface for producers and consumers to broker messages.
Brokers can be scaled vertically and horizontally, allowing a cluster to grow throughout the lifecycle of the applications using it, and Kafka handles the routing of records to the appropriate nodes and replicas.
Apache Kafka writes records durably by persisting them on disk. The writes are distributed among brokers via partitions, and messages are replicated to other broker nodes for redundancy, depending on the replication factor configured.
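To make the roles of partitions and the replication factor concrete, here is a simplified sketch. It assumes records are routed by hashing their key, and it places replicas on consecutive brokers. Both choices are stand-ins for illustration: Kafka's Java client hashes keys with murmur2 rather than CRC32, and the real replica placement logic is more involved. The function names and broker IDs are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the record key (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replica_brokers(partition, brokers, replication_factor):
    """Place a partition's replicas on consecutive brokers."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the broker count")
    start = partition % len(brokers)
    return [brokers[(start + i) % len(brokers)]
            for i in range(replication_factor)]


brokers = [1, 2, 3]
partition = partition_for("order-1042", num_partitions=6)
replicas = replica_brokers(partition, brokers, replication_factor=2)
```

Records with the same key always land in the same partition, which is what preserves their relative order, and a replication factor of 2 means each partition survives the loss of one broker.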
Traditionally, Apache ZooKeeper™ is used to help Kafka manage the maintenance of the cluster, but recently the Apache Kafka community has started to move away from the ZooKeeper dependency.
AWS Kinesis operates as a serverless interface. Once provisioned, there are multiple ways it can be configured to connect to your client applications, all of which require additional AWS infrastructure for connectivity.
The unit of scale in Kinesis is the shard, and shards can be modified using the AWS console, which allows you to control the size and retention policy of each Kinesis instance.
Brief Features Comparison
Apache Kafka is highly configurable. Producers write messages to topics, which are the basic unit of isolation. Each topic can be configured with a different number of partitions, replication factor, maximum payload size, retention policy, and other settings.
This gives you granular control of how each topic in your cluster operates, and allows you to optimize the usage of your cluster as efficiently as possible. It also allows you to tune these settings to respond to changing requirements.
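As an illustration of per-topic settings, the sketch below describes two topics with different partition counts and retention periods, and renders each as an argument list for the `kafka-topics.sh` tool that ships with Kafka. The `TopicSpec` class, the topic names, the chosen values, and the `localhost:9092` address are hypothetical; `retention.ms` and `max.message.bytes` are standard topic-level settings.

```python
from dataclasses import dataclass, field


@dataclass
class TopicSpec:
    """Desired settings for one topic (illustrative helper, not a Kafka API)."""
    name: str
    partitions: int
    replication_factor: int
    configs: dict = field(default_factory=dict)

    def create_command(self, bootstrap_server="localhost:9092"):
        """Build the kafka-topics.sh argument list that creates this topic."""
        args = [
            "kafka-topics.sh", "--create",
            "--bootstrap-server", bootstrap_server,
            "--topic", self.name,
            "--partitions", str(self.partitions),
            "--replication-factor", str(self.replication_factor),
        ]
        for key, value in sorted(self.configs.items()):
            args += ["--config", f"{key}={value}"]
        return args


# High-volume metrics kept for one day, with a 1 MiB message size cap.
metrics = TopicSpec("metrics", partitions=12, replication_factor=3,
                    configs={"retention.ms": 86_400_000,
                             "max.message.bytes": 1_048_576})

# Low-volume audit events kept for seven days.
audit = TopicSpec("audit", partitions=3, replication_factor=3,
                  configs={"retention.ms": 604_800_000})

print(" ".join(metrics.create_command()))
print(" ".join(audit.create_command()))
```

Keeping the settings per topic means a busy, short-lived stream and a quiet, long-lived one can share the same cluster without compromising on either.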
Apache Kafka has a rich ecosystem of supporting applications and add-ons. The Karapace project allows you to connect via REST API using the REST Proxy. The Schema Registry allows you to enforce a schema onto the data for each Kafka topic.
Kafka® Connect offers a pluggable solution for inputs and outputs, known as sources and sinks, and connects them with an Apache Kafka cluster. This allows you to create complex stream processing solutions with off-the-shelf products.
AWS Kinesis is not as configurable as Apache Kafka, but it does allow you to set the retention period for each stream. The default is 24 hours, and it can be extended up to 365 days at additional cost; records are deleted once the retention period expires.
The biggest drawcard Kinesis has is the simplicity of provisioning it and connecting it to your existing AWS infrastructure. It is designed for tight AWS integration, and has a large library of example deployments to help you set up what you need.
Costs Analysis
Apache Kafka is free, open source software. This makes the barrier to entry very low, but you will quickly encounter the complexity of provisioning, running, and maintaining a cluster in a production environment.
Deployments can start small, e.g. 3 small server instances, and can scale to hundreds of broker nodes or anywhere in between. Users need to consider the cost of the server infrastructure, storage, and network transfer to understand the total cost of their cluster, in addition to maintenance and other operational costs.
This is where a managed service can help. Instaclustr has offered Apache Kafka as a managed service since 2018, and has experts in-house who can help you run and maintain your cluster on AWS, GCP, or Azure. We can tailor a solution to your needs.
AWS Kinesis has two capacity modes: on-demand and provisioned.
On-demand will automatically scale itself to suit the workloads that are utilizing it. This is a simple solution to provision but could also lead to bill shock if your usage spikes dramatically.
Provisioned mode is more predictable: you select how many shards are required, and you are charged for each of them. Shards can be added or removed as necessary throughout the lifetime of your application.
Who Is Using What?
Apache Kafka has wide adoption across all industry sectors. The Kafka website asserts that “Kafka is used by thousands of companies including over 80% of the Fortune 100”. Instaclustr manages hundreds of Apache Kafka clusters for our customers located all over the world and supports a wide range of use cases, including using it internally ourselves!
AWS Kinesis has a number of case studies featuring well-known companies, such as Netflix and Sonos. Due to the proprietary nature of Kinesis, it is an AWS-only service and support is limited to that community.
Summary: Which Is Right for You?
Instaclustr is a huge champion of open source software, including Apache Kafka, and it has proven itself time and again for ourselves, our customers, and the industry as a reliable, widely supported streaming platform.
Apache Kafka is a mature project, and its industry-wide popularity ensures the development community is active and Kafka is constantly improving from the open source contributions that are being made on a daily basis. In addition to the core Apache Kafka software, the rich ecosystem of add-ons and connectors ensures Kafka can be used for almost any scenario you may require.
The Apache Kafka project is transparent about the direction it’s heading in; you can even view the active issues on the Kafka Jira board.
Apache Kafka can be complex to configure and operate, and to mitigate that there are a number of managed service offerings that take care of that complexity and let you focus on the application. Instaclustr offers our Kafka Managed Service on all of the major cloud providers — AWS, GCP, and Azure.
For those requiring the best performance available, a number of independent tests have shown that Kafka is able to achieve greater throughput and lower latencies than Kinesis.
AWS Kinesis is no slouch in performance, and its tight integration and simple deployment make it appealing for anyone already in the AWS ecosystem.
The lack of configuration options and the limited data retention policies make it more difficult to recommend for users who want more control of their data streaming platform. If you are not in the AWS ecosystem, it is even more difficult to recommend.
Apache Kafka offers rich configuration options, making it easy to customize for your use cases.
In combination with the better performance, unparalleled community, and support available, this makes Apache Kafka the most versatile solution for most applications.
Originally published at https://www.instaclustr.com on November 28th, 2022.