DLQman to the rescue

Simplify error handling in your Kafka streaming service

This article is a follow-up to “The overlooked issue with DLQs in Kafka”, which focused on potential strategies for handling failure scenarios when streaming data. We have not yet seen a product that addresses the post-processing of DLQ messages, so we at Irori decided to build it ourselves. We call it DLQman.

Why do we need DLQman?

As we mentioned in the previous article, there are several ways to handle errors in a streaming architecture. By no means do we say that the DLQ pattern is always the best choice. This article is meant to help if that choice has already been made.

To recapitulate what we are doing here: assuming you are using a DLQ pattern, what happens if something in your environment causes thousands of messages to pile up on the DLQ?

The most obvious solution is perhaps to let the application itself (the one sending to the DLQ) contain logic to re-process messages from the DLQ. It would likely need some logic to decide what to do depending on the error or the contents of the message. This is perhaps fine if you have one application where this is needed, but what if you have 30? That quickly turns into a lot of code duplication and a lot of maintenance.

Imagine if there was an application that could ingest messages from any DLQ and index them to make them easy to categorise and search for. And what if this application made it easy to choose which messages to retry and which to skip…

DLQman to the rescue!

The type of exception that caused a message to be sent to the DLQ can often serve as the primary piece of information when deciding how to handle it. In our experience, a majority of DLQ messages in a streaming architecture are caused by connectivity issues or time-related problems such as “data is not yet available”. By making it possible to filter on message traits such as exception headers, DLQman makes it easy to quickly address the most urgent problems.
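As a sketch of what that filtering amounts to — the header name `exception-fqcn` is a placeholder for illustration, not DLQman's actual schema:

```python
from dataclasses import dataclass, field

# Sketch only: a minimal DLQ message shape with the exception class
# name carried in a header, as common dead-letter handlers do.
@dataclass
class DlqMessage:
    payload: bytes
    headers: dict = field(default_factory=dict)

def filter_by_exception(messages, exception_fqcn):
    """Select messages whose recorded exception class matches."""
    return [m for m in messages
            if m.headers.get("exception-fqcn") == exception_fqcn]

dlq = [
    DlqMessage(b"order-1", {"exception-fqcn": "java.net.ConnectException"}),
    DlqMessage(b"order-2", {"exception-fqcn": "com.example.DataNotYetAvailable"}),
    DlqMessage(b"order-3", {"exception-fqcn": "java.net.ConnectException"}),
]

# Pick out the connectivity failures, which are usually safe to retry in bulk.
connect_failures = filter_by_exception(dlq, "java.net.ConnectException")
```

One query like this can isolate, say, all connectivity failures across a large backlog in a single step.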

Design and architecture of DLQman

The idea of DLQman was inspired by several different cases we have been involved in over the years. Common to all of them is that there are usually a number of different data streams where this is applicable; each has its own source and retry topic, but they often share similar error scenarios.
For simplicity, DLQman does not require any client-side changes. Instead it assumes that standard dead-letter error handling is used, via Spring Cloud Stream, SmallRye Reactive Messaging or a Kafka Connect sink.

Each message is stored together with a lifecycle state. This state is central to the solution: it tracks whether messages are new, resent, dismissed or otherwise already handled. The state is then used in filters to select the most relevant data.
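The lifecycle can be pictured as a small state machine. The transition rules below are illustrative assumptions, not DLQman's exact rules:

```python
from enum import Enum

class State(Enum):
    NEW = "new"
    RESENT = "resent"
    DISMISSED = "dismissed"

# Assumed transitions for illustration: only NEW messages move on,
# and RESENT/DISMISSED are treated as terminal.
ALLOWED = {
    State.NEW: {State.RESENT, State.DISMISSED},
    State.RESENT: set(),
    State.DISMISSED: set(),
}

def advance(current, target):
    """Move a message to a new lifecycle state, rejecting illegal jumps."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Making illegal transitions impossible is what lets the filters trust the state field: a message marked dismissed stays dismissed.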

Workflow Definition

Each DLQ is assumed to be associated with one set of handling rules in a Workflow Definition. This contains information on how messages should be indexed and how they should be output, such as a resend topic.
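A Workflow Definition could be sketched like this; the field names are hypothetical stand-ins, not DLQman's real configuration keys:

```python
from dataclasses import dataclass

# Hypothetical shape of a Workflow Definition.
@dataclass(frozen=True)
class WorkflowDefinition:
    dlq_topic: str        # the DLQ this workflow subscribes to
    index_headers: tuple  # header names extracted as searchable traits
    resend_topic: str     # where handled messages are re-published

WORKFLOWS = {
    "orders.dlq": WorkflowDefinition(
        dlq_topic="orders.dlq",
        index_headers=("exception-fqcn", "original-topic"),
        resend_topic="orders.retry",
    ),
}

def workflow_for(dlq_topic):
    """Each DLQ maps to exactly one Workflow Definition."""
    return WORKFLOWS[dlq_topic]
```

Keeping this mapping one-to-one means a message's source topic alone determines how it is indexed and where a retry goes.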

Indexing Module

The Workflow Definition defines which topics are subscribed to. When messages arrive they are processed by the indexing module, which applies any indexing logic before passing the data to the persistence module for … persisting.
The original message, including its metadata, is always persisted as-is, alongside the metadata added by the processing logic.

API Module

The API module serves as the interface for managing the received messages. It enables us to list, filter and search for messages based on their indexed traits.
Messages can then be grouped together for handling. The filters used for this can be saved for reuse, ultimately allowing automated message handling where applicable.
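Saved filters could work roughly like this; the record shape and criteria keys are illustrative, not the actual API:

```python
# Sketch of saved, reusable filters over indexed traits and lifecycle state.
SAVED_FILTERS = {}

def save_filter(name, criteria):
    SAVED_FILTERS[name] = criteria

def apply_filter(name, records):
    """Return records matching every criterion of a previously saved filter."""
    criteria = SAVED_FILTERS[name]
    return [
        r for r in records
        # Merge state and traits into one flat view, then require all
        # criteria to match.
        if all({"state": r["state"], **r["traits"]}.get(k) == v
               for k, v in criteria.items())
    ]

records = [
    {"state": "new", "traits": {"exception-fqcn": "java.net.ConnectException"}},
    {"state": "resent", "traits": {"exception-fqcn": "java.net.ConnectException"}},
    {"state": "new", "traits": {"exception-fqcn": "com.example.ParseError"}},
]

save_filter("new-connect-errors",
            {"state": "new", "exception-fqcn": "java.net.ConnectException"})
selected = apply_filter("new-connect-errors", records)
```

A saved filter like this is what could later drive automated handling: run it on a schedule and resend whatever it selects.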

Handling Module

The Workflow Definition of the source DLQ tells the handling module where to send a message and what other logic to apply before sending it. Since the complete original message and all its metadata are stored, it is possible to resend the exact same contents if needed.
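Resending then amounts to re-publishing the stored original, sketched here with a stand-in for a real Kafka producer:

```python
def resend(record, resend_topic, send):
    """Re-publish the stored original message unchanged, then mark it resent."""
    original = record["original"]
    send(resend_topic, original["payload"], original["headers"])
    record["state"] = "resent"
    return record

# Stand-in for a real producer: just collects what would be sent.
sent = []
record = {
    "state": "new",
    "traits": {"exception-fqcn": "java.net.ConnectException"},
    "original": {"headers": {"trace-id": "abc123"},
                 "payload": b'{"orderId": 42}'},
}
resend(record, "orders.retry",
       lambda topic, payload, headers: sent.append((topic, payload, headers)))
```

Because only the stored original is published, a resend is byte-for-byte identical to the message the consumer failed on the first time.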

The current and the future

This project is currently being developed as an internal tool in the Irori toolbox. When we are satisfied with the initial version, we aim to release it under a yet-to-be-decided open-source licence later this year. The initial release will only have an API to integrate with, but further down the line we aim to introduce a UI component to strengthen the user experience.

If you are eager to contribute before we publish this publicly, please get in touch and we will sort something out.

Daniel Oldgren
Solution Architect and Kafka Expert