Kafka Streams vs Alpakka

Limor Stotland
Published in Geek Culture
Jun 29, 2021


Proofpoint Cloud Access Security Broker (CASB) backend data processing leverages Apache Kafka for most of the asynchronous communication between microservices. Whenever we add a new piece of functionality or a new service, we ask ourselves: “What is the right stream processing technology for this use case?” Making a good choice early is crucial to avoid performance problems and costly refactoring down the road. For most of our data pipelines, the choice comes down to Kafka Streams vs Alpakka Kafka. After testing both and using them in production extensively, we’ve outlined three major considerations that we think are critical to making the decision.

Processing Time

Operations that require a long processing time have always been a pain point when writing Kafka-based applications. Although KIP-62 (“Moving the heartbeat to a dedicated thread”) made a lot of progress in this area, slow consumers can still fall victim to unnecessary rebalancing, or even enter an endless rebalance loop. Why is that? To decide whether a consumer is dead, the coordinator looks at both the last successful heartbeat time and the last poll time. If the consumer doesn’t poll during a long message-processing operation, it is considered dead and a consumer group rebalance takes place. When the slow consumer finally finishes its long processing operation, it tries to poll for new messages, but at that point it is no longer part of the consumer group. It is treated as a new consumer, and another rebalance takes place. Now consider what happens in a group with several consumers where, on every cycle, a different one takes too long to poll.
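The liveness check described above can be sketched as a toy model. This is not Kafka source code; the function and the constant names are illustrative. The key point after KIP-62 is that heartbeats and poll progress are tracked separately, so a consumer that heartbeats happily but stops polling is still evicted:

```python
# Toy model of the broker-side liveness check after KIP-62.
# Heartbeats (session.timeout.ms) and poll progress (max.poll.interval.ms)
# are tracked independently; BOTH must be fresh for the consumer to stay
# in the group. Values below are the usual client defaults, for illustration.

MAX_POLL_INTERVAL_MS = 300_000   # default max.poll.interval.ms (5 minutes)
SESSION_TIMEOUT_MS = 45_000      # default session.timeout.ms on recent clients

def is_consumer_alive(now_ms, last_heartbeat_ms, last_poll_ms):
    """A consumer survives only if it both heartbeats AND polls in time."""
    heartbeat_ok = (now_ms - last_heartbeat_ms) < SESSION_TIMEOUT_MS
    poll_ok = (now_ms - last_poll_ms) < MAX_POLL_INTERVAL_MS
    return heartbeat_ok and poll_ok

# A consumer stuck in a 6-minute processing step: the dedicated heartbeat
# thread keeps heartbeats flowing, but poll() hasn't been called for 360 s.
now = 1_000_000
print(is_consumer_alive(now, last_heartbeat_ms=now - 3_000,
                        last_poll_ms=now - 360_000))  # False -> rebalance
```

Even a perfectly healthy heartbeat cannot save a consumer that stops polling, which is exactly the trap slow processors fall into.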

There are two ways to solve this problem:

  1. Pausing the consumer. The poll loop keeps running, but no data is fetched from the paused partitions.
  2. Increasing max.poll.interval.ms or reducing max.poll.records. This comes with two risks: slower consumption, and a longer delay before noticing a consumer that is genuinely stuck during processing.
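Option 2 is a configuration trade-off, and a rough way to reason about it is to compute the worst-case per-record budget the two settings imply. The values below are illustrative, not recommendations:

```python
# Hypothetical tuning for a slow consumer: allow longer gaps between polls
# and fetch fewer records per poll. Values are for illustration only.
consumer_config = {
    "max.poll.interval.ms": 900_000,  # 15 minutes between polls before eviction
    "max.poll.records": 50,           # at most 50 records returned per poll
}

# Worst-case processing budget per record before risking eviction:
budget_ms = consumer_config["max.poll.interval.ms"] / consumer_config["max.poll.records"]
print(budget_ms)  # 18000.0 -> 18 seconds per record
```

The flip side is visible in the same numbers: a consumer that is genuinely stuck also gets a full 15 minutes before anyone notices.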

While Kafka Streams doesn’t support pausing the consumer, Alpakka depends on it.

The poll cycle in Alpakka pauses the consumer on all the partitions whose downstream stages can’t accept more data, and only then calls the poll method on the underlying Kafka consumer. As a result, the poll interval is never exceeded, even when the stream is busy and can’t process any actual data.
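That pause-based poll cycle can be illustrated with a toy consumer. This is not the real Alpakka or Kafka client API, just a sketch of the idea: partitions whose downstream is busy are paused before each poll, so poll() keeps running (keeping the consumer alive) without fetching data the stream cannot absorb:

```python
from collections import deque

# Toy stand-in for a Kafka consumer with per-partition pause/resume.
class ToyConsumer:
    def __init__(self, partitions):
        self.queues = {p: deque() for p in partitions}  # broker-side data
        self.paused = set()

    def pause(self, partitions):
        self.paused |= set(partitions)

    def resume(self, partitions):
        self.paused -= set(partitions)

    def poll(self):
        """Fetch one record from every non-paused partition that has data."""
        out = {}
        for p, q in self.queues.items():
            if p not in self.paused and q:
                out[p] = q.popleft()
        return out

c = ToyConsumer([0, 1])
c.queues[0].append("a")
c.queues[1].append("b")
c.pause([1])        # downstream for partition 1 is busy
print(c.poll())     # {0: 'a'} -- the poll still happened, partition 1 untouched
```

The poll call itself always completes, which is what keeps the coordinator happy; pausing only controls which partitions contribute data.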

Memory Consumption

The Akka Streams buffer is a really great feature. It bridges upstream and downstream stages running at different rates, it can improve performance, and it is the mechanism behind stream backpressure. However, there are no free rides: Akka buffers cost a lot of memory. And buffers are not the only thing that consumes memory. Each actor has a mailbox, each async stage comes with a small internal buffer for performance reasons, and so on.

Kafka Streams, on the other hand, is a library with a much more straightforward processing model by design. It supports neither additional buffers nor backpressure, and therefore does not require the extra memory that Alpakka does.

Parallelism

The processing parallelism in Kafka Streams is defined by the number of stream processor threads running on that stream. Since every stream thread runs its own consumer, the thread count translates directly into the consumer count. Kafka Streams parallelism is therefore always bounded by the number of partitions in the input topic.
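The bound is simple arithmetic, sketched below. This is a simplification that ignores multiple input topics and standby tasks, but it captures the core constraint: stream threads beyond the partition count simply sit idle.

```python
# Effective Kafka Streams parallelism: each stream thread owns at least one
# partition through its embedded consumer, so threads beyond the partition
# count receive no work. (Simplified model for a single input topic.)
def effective_parallelism(num_stream_threads: int, num_partitions: int) -> int:
    return min(num_stream_threads, num_partitions)

print(effective_parallelism(16, 12))  # 12 -- four of the threads have nothing to do
```

Scaling past this point requires repartitioning the input topic, not adding threads.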

In Alpakka the story is a bit more complicated. An Akka stream runs one consumer per graph, and this is not configurable. The parallelism is defined by the number of mapAsync and mapAsyncUnordered stages and their parallelism degree. By adding a buffer after the consumer and the right number of async stages with the right parallelism degree, you can achieve much higher concurrency than the number of partitions in the input topic. In theory, you can have thousands of threads processing messages from a single partition in parallel.

Let us not forget that async actions in Akka streams are accompanied by a small internal buffer (that leads us back to the memory consumption section). Topics with a lot of partitions can demand a lot of memory.
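A small simulation makes the point concrete. Plain Python threads stand in for a mapAsync stage with parallelism 64 here; the numbers are illustrative, not Alpakka itself. Records drawn from a single "partition" are processed with far more concurrency than the partition count of one would ever allow in Kafka Streams:

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

peak = 0      # highest number of records in flight at once
current = 0
lock = threading.Lock()

def enrich(record):
    """Simulated IO-bound enrichment: mostly waiting, not computing."""
    global peak, current
    with lock:
        current += 1
        peak = max(peak, current)
    time.sleep(0.05)          # stand-in for an external API call
    with lock:
        current -= 1
    return record

# 200 records from ONE partition, processed mapAsync-style with parallelism 64.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(enrich, range(200)))

print("peak in-flight records:", peak)
```

The peak in-flight count climbs well above one, which is exactly what Alpakka’s async stages buy you on IO-bound workloads, and exactly where those small per-stage buffers start to add up.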

Real Use-Cases

The core job of Proofpoint CASB is to find malicious activity and data leakage in cloud applications, raise alerts, and perform various remediations. To do so the system fetches user activities from the cloud providers and enriches them with additional information. This enrichment process makes several API calls and can potentially take a long time, at least on the scale of near real-time systems: hundreds of milliseconds to tens of seconds in the worst case. On the other hand, most of this time is spent waiting for IO operations to complete, not processing, so we can actually use very high parallelism per CPU core. The need for high concurrency, combined with the potential for long processing times, led us to base the enricher component on Alpakka. This way we avoid both the endless rebalance issue and being capped by the number of partitions in the input Kafka topic.

CASB reads data from many different cloud applications, which all have wildly different data schemas. One of the first things we do in our data pipeline is normalize all the input data into a common schema. This requires parsing JSON data, transforming it, and serializing it into an output topic, which is primarily a CPU-bound operation. It performs no IO and isn’t constrained by memory, because most of the allocations are very short-lived. Therefore, having more parallelism than the number of CPU cores is, in fact, counterproductive. These considerations combine to make it a very good use case for Kafka Streams.
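As a sketch of what such normalization looks like (the field names and event shapes below are invented for illustration and are not CASB’s actual schemas), the work is pure parse-transform-serialize, with no IO and only short-lived allocations:

```python
import json

def normalize(raw: str) -> str:
    """Map two hypothetical cloud-app event shapes into one common schema."""
    event = json.loads(raw)
    if "userPrincipalName" in event:
        # e.g. an Office-365-style activity record (hypothetical shape)
        common = {"user": event["userPrincipalName"], "action": event["operation"]}
    else:
        # e.g. a Google-style audit record (hypothetical shape)
        common = {"user": event["actor"]["email"], "action": event["eventName"]}
    return json.dumps(common, sort_keys=True)

print(normalize('{"userPrincipalName": "a@x.com", "operation": "FileDownloaded"}'))
# {"action": "FileDownloaded", "user": "a@x.com"}
```

Every record spends its life entirely on the CPU, so one busy thread per core is the sweet spot, and the extra concurrency machinery of Alpakka would only add overhead.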

In conclusion, investing some time to understand how the three core trade-offs we’ve outlined here apply to your case will help you build more robust, scalable, and future-proof data processing applications.
