The overlooked issue with DLQs in Kafka

Are we getting out of quarantine? 😉

Integrating data through Kafka isn’t really that different from any other systems integration pattern you’ve encountered – we have a producer and a consumer of data. However, in the synchronous, web-service way the consumer is ‘pulling’ data from the producer, whereas in the asynchronous case the data is ‘pushed’ to the consumer. The consumer normally also has to do something with this data immediately. It may be persisting it to a database through Kafka Connect, or performing some enrichment through Kafka Streams before passing it forward and ‘acking’ the consumption.

In the “happy case” (where we all live, right?) this is pretty straightforward and elegant, but what happens if something goes wrong? If, for instance, deserialization of a message fails or the database connection suddenly dies? Or if you have implemented event sourcing and a customer-changed event is consumed before the corresponding customer-created event?
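The deserialization case is worth making concrete, because it behaves differently from a dead database connection: the record itself is a “poison pill” that will never parse, so retrying is pointless. A minimal Python sketch of that distinction (the JSON payload format is just an illustrative assumption):

```python
import json

def deserialize(raw_bytes):
    """Try to deserialize a record payload; return (value, error)."""
    try:
        return json.loads(raw_bytes), None
    except (UnicodeDecodeError, ValueError) as exc:
        # A "poison pill": this record will never parse, no matter how
        # many times we retry -- unlike a transient connection failure.
        return None, exc

value, err = deserialize(b'{"customerId": 42}')  # parses fine
_, poison = deserialize(b'garbage{{')            # permanently broken
```

This is exactly why a blanket retry loop isn’t enough on its own – some failures are transient, others are permanent, and the two call for different handling.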

Strategies

This situation can’t be fully avoided and calls for some thoughtful design decisions when moving to an asynchronous, event-driven architecture. As always, there are a number of different strategies to choose from, which can roughly be grouped into the following three categories:

  1. Just drop the problematic record, send it to /dev/null! Maybe the database connection will be back up when the next message arrives (but it might be just a millisecond away…)! If you don’t really care about losing data this is probably the easiest alternative, yet the most naive one. Just bear in mind that if your producer is very active, you might be ditching a lot of information. Of course you can fetch the lost records from some log, or rewind your consumer group and replay the messages from Kafka – but these operations can be cumbersome and have other challenging aspects to take into account.
  2. Stop the world! Let the consumer instance die and wait for a human to investigate the situation and decide when it’s time to start again manually. Although this might seem dramatic, it might be the only viable option if consuming records in order is extremely important to you. An alternative, “semi-automatic” way could be to temporarily halt and retry a couple of times, with a clever backoff, before giving up.
  3. Send the record (or a reference to it) to a quarantine, commonly known as a “Dead Letter Queue” (DLQ), and move on to consuming the next record. Perhaps there was just something fishy with this particular record and the next one will be successfully consumed. Or perhaps some downstream system was temporarily unavailable, or some lookup based on data in that record failed for some reason.
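Strategies 2 and 3 are often combined: retry a few times with backoff, then give up on that record and route it to a DLQ. Here is a minimal sketch of that decision logic, deliberately kept free of any Kafka client calls – the `process` and `send_to_dlq` callbacks and the backoff parameters are hypothetical placeholders you would wire up to your own consumer and producer:

```python
import time

def handle_record(record, process, send_to_dlq,
                  max_attempts=3, base_backoff_s=0.01):
    """Try to process one record; on repeated failure, copy it to a DLQ.

    Returns "processed" or "dead-lettered" so the caller knows it can
    commit the offset and move on either way (strategy 3), instead of
    dying (strategy 2) or silently dropping the record (strategy 1).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            process(record)
            return "processed"
        except Exception:
            if attempt == max_attempts:
                send_to_dlq(record)  # copy, not move -- see below
                return "dead-lettered"
            # Exponential backoff before the next attempt.
            time.sleep(base_backoff_s * 2 ** (attempt - 1))

def always_fails(record):
    raise RuntimeError("downstream unavailable")

dlq = []
status = handle_record({"id": 1}, always_fails, dlq.append)
```

Note that a design like this trades strict ordering for progress: once a record is dead-lettered, later records on the same partition are consumed ahead of it.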

If your scenario is such that you neither want important data to be dropped nor your data integration to shut down, you often decide to go for strategy number 3. It might not come as a surprise that the “DLQ pattern” has been around as an enterprise integration pattern since the dawn of messaging – in other words, nothing that Kafka has introduced. In fact, DLQs are used almost everywhere Kafka is used, and it’s also a built-in feature of e.g. Kafka Connect. The experienced reader will of course note some fundamental differences in Kafka compared to the “traditional” DLQ pattern: 1) there are no “queues” in Kafka, so the name is a bit misleading, and 2) the record is not moved to a DLQ but rather copied to it.
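For Kafka Connect sink connectors, the built-in DLQ support is just configuration. A sketch of the relevant `errors.*` properties (the topic name and replication factor here are example values you would adapt):

```properties
# Tolerate record-level failures instead of killing the task,
# and copy the failed records to a dedicated DLQ topic.
errors.tolerance=all
errors.deadletterqueue.topic.name=my-sink-connector-dlq
errors.deadletterqueue.topic.replication.factor=3
# Add record headers describing why the record failed.
errors.deadletterqueue.context.headers.enable=true
```

Note that this DLQ mechanism applies to sink connectors; for your own consumers you have to build the equivalent behaviour yourself.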

Are we finished?

Ok, so we configure a DLQ to solve our situation and we’re all set, right?! We put the solution into production and let it run, end of story.

Wait a bit – is anyone monitoring the DLQs? Without a strategy for how to handle records entering our DLQs – what action to take – we’re effectively back at strategy number 1 above! Oh no – let’s put a monitoring tool in place and send an alarm to someone whenever a record is produced to a DLQ topic. Now we’re done!

Hold on – what should the receiver of the alarm do? Why did the record land on the DLQ? Was there an intermittent network glitch, or was something wrong with the serialization of the message? Even looking at the payload of a particular Kafka record can take time without the right tools. If circumstances were such that the record couldn’t be consumed at the time, but it can be now – how do we do that? Do we need to SSH to some machine and use a CLI tool to manually consume from the DLQ and produce the record again on the originating topic, so that the original consumer gets a new chance? What consequences will putting the message on the topic again (a duplicate) have for other consumers? We can categorize this last situation as a “retry”, and it’s a remarkably common scenario in an asynchronous world – yet Kafka offers no good out-of-the-box support for it.
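One way to blunt the duplicate problem for other consumers is to make them idempotent, for instance by tracking a stable business identifier for every record already processed. A minimal in-memory sketch (a real implementation would persist the seen-set, and the record shape here is purely hypothetical):

```python
class IdempotentConsumer:
    """Skips records whose business key has already been processed,
    so a replay from the DLQ does not cause double processing."""

    def __init__(self, process):
        self._process = process
        self._seen = set()  # in a real system, persist this state

    def consume(self, record):
        key = record["key"]  # a stable business identifier
        if key in self._seen:
            return "skipped-duplicate"
        self._process(record)
        self._seen.add(key)
        return "processed"

out = []
consumer = IdempotentConsumer(out.append)
first = consumer.consume({"key": "order-1", "value": 100})
replay = consumer.consume({"key": "order-1", "value": 100})
```

With consumers built this way, re-publishing a quarantined record to the originating topic becomes a safe operation rather than a source of new bugs.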

Conclusion

Using DLQs to handle unhappy consumer scenarios is often the only reasonable strategy we have. Yet handling the DLQs themselves is often overlooked, since it takes some effort to figure out what to do with the records that land there. All too often this becomes an afterthought, when the damage has already been done.

In the next blog post in this series we’ll focus more on the solution to this challenge: how do we handle failure scenarios in a reliable and scalable way, without rebuilding our existing consumers? Stay tuned!

Author:
Pär Eriksson
Solution Architect and Kafka Expert