An honest AWS MSK review - July 2019
Almost nine months ago, AWS announced a new service: Managed Streaming for Apache Kafka, aka AWS MSK. AWS MSK is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data.
The service was announced as a public preview, and it was clear that using it in production was a bit risky. A handful of brave Apache Kafka enthusiasts jumped in to get their hands dirty with the new service. The reviews that went deep into the details were mainly negative: AWS MSK demonstrated solid performance but lacked a lot of important features. You can find a very good review of the first version of AWS MSK here.
So, almost nine months have passed and on May 30, 2019, AWS announced a generally available version of Amazon Managed Streaming for Apache Kafka. And it was just perfect timing for me. I had to do a POC for a streaming solution for one of the projects that I am involved in.
Technically, the problem is very simple. We have a cluster of machines running on AWS. Each machine generates log files and comes with Logstash preinstalled. All produced log files should be shipped to another AWS account and analyzed there by Apache Spark. The traffic is spiky: the cluster machines generate up to 500,000 messages per second, and each message is about 1 KB.
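As a quick sanity check of these numbers, here is a minimal Python sketch that converts the peak message rate into bandwidth. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and treats every message as exactly 1 KB; the function name is only illustrative.

```python
def peak_bandwidth_mb_per_sec(messages_per_sec: int, message_size_bytes: int = 1_000) -> float:
    """Convert a message rate into MB/sec, assuming decimal megabytes."""
    return messages_per_sec * message_size_bytes / 1_000_000


# Peak rate from the problem statement: 500,000 messages/sec of about 1 KB each.
peak = peak_bandwidth_mb_per_sec(500_000)
print(f"Peak ingest: {peak:.0f} MB/sec")
```

With 1 KB messages, a rate in thousands of records per second and a rate in MB/sec are the same number, which is why the two are used interchangeably in the rest of this review.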
Given that, I was thinking about reusing a pattern that is pretty common in the industry: Apache Kafka as the streaming platform, Logstash as the Kafka producer, and Apache Spark Streaming as the Kafka consumer.
Setting up and managing an Apache Kafka cluster is a resource-demanding task, and that's why using Apache Kafka as a service sounded like a nice idea. So I started the POC and tested the following aspects of AWS MSK: maintainability, performance, scalability, reliability, security, and cost.
Maintainability is really great. First of all, setting up a new cluster is easy. Just go to the AWS console and find the MSK service panel. After a couple of clicks plus a wait time of 20 minutes, your cluster will be ready. I’ve set up a new multi-AZ cluster of six m5.large machines with a replication factor of 3. The cluster comes with managed ZooKeeper. In terms of Kafka versions, you can choose between 1.1.1 and 2.1. Version 1.1.1 is really old, and I do not know why anyone starting fresh would choose it. 2.1 is OK but not the newest one; today, 2.3 is available.
Another big plus is monitoring. Like many other AWS services, AWS MSK has its monitoring integrated into CloudWatch. CloudWatch is not perfect, but it provides an all-in-one solution: you take an AWS service and you get monitoring with it. What I didn’t like about AWS MSK monitoring is that some important metrics come at an extra charge. For example, even very basic topic-level metrics cost extra money.
The support and documentation are good. If you follow this tutorial, you will end up with a working Apache Kafka cluster and an EC2 test machine that contains the native Kafka consumer, producer, and other tools. All the tickets I opened about the AWS MSK service were answered by the support team professionally and within a short period of time.
I will start the performance section with another strange issue. When you create an AWS MSK cluster, you can choose only the size of the underlying machines; there is no option to choose the instance type. This is weird, because the storage type and the network bandwidth are the main indicators of the amount of traffic that Apache Kafka will be able to handle. My general feeling is that i3 instances are a much better choice than m5 for Apache Kafka. But in the case of AWS MSK, we are stuck with the general-purpose m5 instance type. So no ephemeral storage option is available, and in order to get 10 Gbps of network bandwidth, we need to choose at least the m5.12xlarge instance size. This issue affects the cost of AWS MSK in a very negative way, and I will cover it in the cost section.
So, I started with a very basic setup: six m5.large machines and one topic with a replication factor of 1 and 6 partitions. The test machine was an m5.4xlarge. From the test machine, 5,000,000 records were sent. Here are the results:
- Max record rate: ±90K rec/sec (90 MB / sec)
- Avg latency: 302 ms
- Max latency: 600 ms
This is solid performance for such a basic configuration. It is far from what I am looking for, though. I tried to play with broker and cluster sizes, but I was never able to get much more than 100K rec/sec. I am sure that the problem was on my side, some kind of load tester configuration problem. So I asked the AWS support team to provide a recommended cluster configuration for ingesting 500K records per second. The support team kindly performed a load test for me and provided the following results:
- Cluster size: 15 brokers
- Machine type: kafka.m5.12xlarge
- Partitions per topic: 15
- Replication factor: 1
- Max record rate: ±310K rec/sec (310 MB /sec)
- Avg latency: 103 ms
This configuration can work for me. I have high spikes of 500K rec/sec, but the spikes last for a short period of time, and I can probably live with 310K rec/sec. So overall, I am almost satisfied with AWS MSK performance.
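To put the two load tests in perspective, the rough Python sketch below expresses each measured rate as a fraction of the 500K rec/sec peak and estimates how many records would have to be buffered upstream during a spike. The 60-second spike duration is a purely hypothetical value chosen for illustration (the real spike length is not measured here), and the function names are my own.

```python
PEAK_REC_PER_SEC = 500_000  # spike rate from the problem statement


def fraction_of_peak(measured_rec_per_sec: int, peak_rec_per_sec: int = PEAK_REC_PER_SEC) -> float:
    """Share of the peak rate that a measured configuration can absorb."""
    return measured_rec_per_sec / peak_rec_per_sec


def spike_backlog(capacity_rec_per_sec: int, spike_seconds: int,
                  peak_rec_per_sec: int = PEAK_REC_PER_SEC) -> int:
    """Records that pile up upstream while the spike exceeds cluster capacity."""
    deficit = max(0, peak_rec_per_sec - capacity_rec_per_sec)
    return deficit * spike_seconds


results = {
    "6 x m5.large (my test)": 90_000,
    "15 x m5.12xlarge (AWS support test)": 310_000,
}
for name, rate in results.items():
    print(f"{name}: {fraction_of_peak(rate):.0%} of peak")

# Hypothetical 60-second spike against the 310K rec/sec configuration.
print(f"Backlog after 60 s: {spike_backlog(310_000, 60):,} records")
```

Under these assumptions, "living with 310K rec/sec" means the producers (or whatever sits in front of them) must be able to hold the backlog until the spike ends.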
AWS MSK is not scalable. Currently, modifying a running cluster to add or remove broker nodes or to change the instance type is not supported; these operations require creating a new cluster. Only updating the cluster configuration (‘update-cluster-configuration’) and increasing the EBS storage associated with MSK brokers (‘update-broker-storage’) can be done without re-creating the cluster.
This is a problem for us. Apart from the regular spikes, our system has ‘big event’ days. On these days, the amount of produced logs can grow by hundreds of percent. We know about big events in advance, so a scalable cluster would allow us to add new brokers ahead of time and give the cluster some time to stabilize. After the ‘big event’ has finished, we could gradually remove the brokers.
The question is how we can work around this scalability issue. The simple approach is to keep the AWS MSK cluster overprovisioned all the time. That is OK when the price difference is not too high. Another option is to work hard around the ‘big event’ days: create a new, bigger cluster and use a tool like MirrorMaker to migrate from the old cluster to the new one, and then do the reverse when the event is finished. If I have to choose between these two options, I will choose to run the overprovisioned cluster. So now it is a question of price, and I will cover that later.
Even if rare, failures can occur. Broker machines can fail, a broker can become unreachable because of network issues, or there can even be a problem with a whole availability zone or region. Handling AWS region failures is out of scope here, but I do want to discuss the case of AZ or broker failures.
When you create an AWS MSK cluster, the number of brokers must be a multiple of the number of AZs, which is good. The probability of two machines in different AZs failing at the same time is really low.
But what happens when a broker or an AZ fails? How does it affect consumers and producers? What is the SLA provided by AWS for broker recovery?
In order to ensure that consumers and producers continue to work after an AZ failure, you have to set your topic replication factor to be greater than the number of brokers in the same availability zone. For example, if you have 3 brokers over 3 availability zones and one of the zones fails, a replication factor of 2 is enough to continue without downtime. A failure does affect Kafka cluster performance, though. In the case of an AZ failure, you have only about 66% of your cluster’s power left. So it is important that broker recovery is performed as fast as possible.
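The 3-broker example can be checked with a small Python model. This is only a sketch: the broker IDs, AZ names, and replica assignments below are hypothetical, written out by hand rather than produced by Kafka's real assignment logic, and the model only asks whether each partition still has at least one replica outside the failed zone.

```python
def available_after_az_failure(assignment, broker_az, failed_az):
    """Map each partition to True if at least one replica lives outside failed_az."""
    return {
        partition: any(broker_az[broker] != failed_az for broker in replicas)
        for partition, replicas in assignment.items()
    }


# Hypothetical 3-broker cluster with one broker per availability zone.
broker_az = {1: "az-a", 2: "az-b", 3: "az-c"}

# Hand-written partition -> replica broker lists for replication factors 1 and 2.
rf1 = {0: [1], 1: [2], 2: [3]}
rf2 = {0: [1, 2], 1: [2, 3], 2: [3, 1]}

print("RF=1:", available_after_az_failure(rf1, broker_az, "az-a"))
print("RF=2:", available_after_az_failure(rf2, broker_az, "az-a"))

# Share of brokers left after losing one zone.
surviving = sum(1 for az in broker_az.values() if az != "az-a")
print(f"Brokers left: {surviving}/{len(broker_az)}")
```

With a replication factor of 1, the partition hosted in the failed zone becomes unavailable, while with a replication factor of 2 every partition keeps one replica. Keep in mind that a surviving replica is necessary but not sufficient: producers using acks=all also depend on the topic's min.insync.replicas setting being satisfiable with the remaining brokers.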
Here you can find an explanation of the AWS MSK recovery process. I will just summarize it. MSK detects and automatically recovers from the most common failure scenarios for Multi-AZ clusters. When MSK detects one of these failures, it automatically replaces the unhealthy or unreachable broker with a new one. In addition, where possible, it reuses the storage from the old broker to reduce the amount of data that Apache Kafka needs to replicate. The availability impact is limited to the time required for MSK to complete the detection and recovery. After recovery, producer and consumer apps can continue to communicate with the same broker IP addresses that they used before the failure.
What about the SLA? Here is the answer from the AWS support team:
Detection and recovery time vary by the type of issue the cluster encounters. Some recovery actions might involve a broker node restart while others require a replacement. For example, if there is an underlying hardware failure in one of the broker EC2 nodes, once MSK detects the failure, recovery time includes launching a new broker node of the same instance type to replace the failed broker. When a broker instance is replaced, the ENI from the original broker is attached to the new broker so the existing BootstrapServer string can be used and you do not have to update your application with a new BootstrapServer string. Similarly, recovery time may vary if an issue with the underlying storage volume is detected and a new volume has to be launched and attached.
Bottom line: MSK has good reliability if you spread your cluster over different availability zones, use the right replication factor, and size the cluster’s capacity with spare room for broker failures. Just remember that the SLA is not really defined, and it can take long hours to recover failed brokers.
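Sizing with spare room for a failure can be sketched as follows. This is a back-of-the-envelope estimate, not a benchmark: it assumes throughput scales linearly with the number of brokers (which the load tests above do not prove), that brokers are spread evenly over the AZs, and that exactly one AZ is lost. The function name and its parameters are illustrative.

```python
def brokers_for_az_tolerance(target_rate: int, measured_rate: int,
                             measured_brokers: int, num_azs: int = 3) -> int:
    """Smallest broker count (a multiple of num_azs) that meets target_rate with one AZ down.

    measured_rate is the rate achieved by a reference cluster of measured_brokers brokers.
    """
    if num_azs < 2 or measured_rate <= 0 or measured_brokers <= 0:
        raise ValueError("need at least 2 AZs and a positive reference measurement")
    n = num_azs
    while True:
        surviving = n - n // num_azs  # brokers left after losing one AZ
        # Integer form of: surviving * (measured_rate / measured_brokers) >= target_rate
        if surviving * measured_rate >= target_rate * measured_brokers:
            return n
        n += num_azs


# Reference point: the AWS support test (15 brokers -> 310K rec/sec).
print(brokers_for_az_tolerance(310_000, 310_000, 15))  # keep 310K rec/sec with one AZ down
print(brokers_for_az_tolerance(500_000, 310_000, 15))  # keep the full 500K rec/sec peak
```

Under the linear-scaling assumption, keeping 310K rec/sec through an AZ outage takes 24 brokers instead of 15, and keeping the full peak takes 39, which makes the cost discussion below even more relevant.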
Well, there were many complaints about the security of the first MSK version. Most of the issues were fixed, and now MSK security is much better. I will cover only those that were important for my project.
First of all, MSK encrypts data at rest using an AWS KMS customer master key (CMK). When you create an MSK cluster, you can specify the CMK that AWS MSK should use to encrypt your data at rest. If you don’t specify a CMK, Amazon MSK creates an AWS managed CMK for you and uses it on your behalf.
MSK encrypts data in transit between the brokers of your MSK cluster. You can override this default at the time you create the cluster. Data is also encrypted in transit between Apache Kafka clients and the Amazon MSK service. For client-to-broker encryption, MSK brokers use public AWS Certificate Manager certificates. Therefore, any truststore that trusts Amazon Trust Services also trusts the certificates of Amazon MSK brokers. An example is the default JVM truststore. You can also use your own truststore for client-to-broker encryption, as long as it trusts Amazon Trust Services.
However, for client authentication MSK supports only certificates issued by AWS Certificate Manager Private Certificate Authority (ACM PCA). This is by design, and other certificates cannot be used for client authentication to MSK brokers.
Another problem is that I didn’t find a way to restrict a Kafka client’s permissions. I would like to grant certain EC2 machines permission to produce data only to a specific topic. It looks like this is impossible right now: any client that has produce permissions automatically gets permission to create and delete topics.
Finally, let’s discuss the cost of AWS MSK. Here you can find formulas for AWS MSK price calculations. Let’s go with a configuration that was provided by the AWS support team that handles 310 MB/sec: 15 brokers of m5.12xlarge. The monthly price will be:
Brokers charge: 31 days * 24 hours * 15 brokers * 5.04$ = 56,246$/month
Storage charge (7d retention): 310MB/sec * 3600s * 24h * 7d / 1000 * 0.1$ = ±18,750 $ / month
Total price: ±75,000 $
This price is before extra charges for metrics. Well, this is a very expensive Kafka cluster. I know that an on-prem cluster for the given amount of data would cost about 20% of this price. But of course, a comparison between a self-deployed on-prem cluster and a managed service is not fair. Let’s take a look at what is probably the biggest competitor of AWS MSK: the Confluent managed Kafka service. Here is an estimated monthly price for 1 MB/sec of traffic (example 2).
So it is about 751$/month for 1MB/sec. As I already mentioned, we have spikes of 500MB/sec, and I estimate our average traffic at 250 MB/sec. So 250 * 751 = 187,750$/month. Suddenly AWS MSK looks like a reasonably priced service.
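The arithmetic of this section can be reproduced with a short Python sketch. All prices are the ones quoted in this review (5.04$ per broker-hour, 0.1$ per GB-month, 751$/month per 1 MB/sec for Confluent) and may not match current price lists. The sketch also follows the same simplifications as the text: a 31-day month, decimal gigabytes, and Confluent pricing that scales linearly with traffic. The function and variable names are my own.

```python
HOURS_PER_MONTH = 31 * 24  # 31-day month, as in the formulas above


def msk_monthly_cost(brokers, broker_price_per_hour, throughput_mb_per_sec,
                     retention_days, storage_price_per_gb_month):
    """Return (broker_charge, storage_charge) in dollars per month."""
    broker_charge = HOURS_PER_MONTH * brokers * broker_price_per_hour
    # Data retained on disk at the given sustained throughput, in decimal GB.
    stored_gb = throughput_mb_per_sec * 3600 * 24 * retention_days / 1000
    storage_charge = stored_gb * storage_price_per_gb_month
    return broker_charge, storage_charge


broker_charge, storage_charge = msk_monthly_cost(
    brokers=15, broker_price_per_hour=5.04, throughput_mb_per_sec=310,
    retention_days=7, storage_price_per_gb_month=0.10,
)
msk_total = broker_charge + storage_charge

# Confluent estimate: 751$/month per 1 MB/sec, scaled to a 250 MB/sec average.
confluent_total = 250 * 751

print(f"MSK brokers:   {broker_charge:,.0f} $/month")
print(f"MSK storage:   {storage_charge:,.0f} $/month")
print(f"MSK total:     {msk_total:,.0f} $/month")
print(f"Confluent:     {confluent_total:,} $/month")
print(f"Confluent/MSK: {confluent_total / msk_total:.1f}x")
```

Evaluating the storage formula exactly as written gives about 18,750 $ per month, which is what brings the total to roughly 75,000 $. Note that the two totals are not sized identically: the MSK figure is for a cluster that sustains 310 MB/sec, while the Confluent figure is for the 250 MB/sec average.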
I have a long history of running things on AWS and, frankly speaking, I like AWS very much. Usually, there are a lot of benefits to using AWS managed services, like scalability, solid performance, easy cost management, and great technical support. But in the case of AWS MSK, it is a big ‘no’. I cannot take it to production, mainly because of the cost and scalability issues.
But if you are working in a small company/team and you are constrained in your ability to provide staffing, invest in infrastructure and maintain systems, plus your traffic is low and not spiky, AWS MSK can be a great deal for you.