
Handling GDPR with Apache Kafka: How does a log forget?

If you follow the press around Apache Kafka you’ll probably know it’s pretty good at tracking and retaining messages, but sometimes removing messages is important too. GDPR is a good example of this as, amongst other things, it includes the right to be forgotten. This raises a very obvious question: how do you delete arbitrary data from Kafka? After all, its underlying storage mechanism is an immutable log.

As it happens, Kafka is a pretty good fit for GDPR. The regulatory regime specifies not only that users have the right to be forgotten, but also that they have the right to request a copy of their personal data. Companies are also required to keep detailed records of what data is used for, a requirement for which recording and tracking the messages that move from application to application is a boon.

How do you delete (or redact) data from Kafka?

The simplest way to remove messages from Kafka is simply to let them expire. By default, Kafka will keep data for one week (log.retention.hours=168), and you can tune this retention to an arbitrarily large (or small) period of time. There is also an Admin API that lets you delete messages explicitly if they are older than some specified time or offset. But businesses increasingly want to leverage Kafka's ability to keep data for longer periods of time, say for Event Sourcing architectures or as a source of truth. In such cases it's important to understand how to make long-lived data in Kafka GDPR compliant. For this, compacted topics are the tool of choice, as they allow messages to be explicitly deleted or replaced via their key.
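For illustration, here is a minimal sketch of the explicit-delete route using the Admin API, assuming a hypothetical orders topic with a single partition, connection settings in adminProps, and an offset you have already determined (for a time-based cut-off you would first look up the corresponding offset, e.g. with the consumer's offsetsForTimes):

// Remove everything before offset 42 in partition 0 of the (hypothetical) orders topic
try (AdminClient admin = AdminClient.create(adminProps)) {
    TopicPartition partition = new TopicPartition("orders", 0);
    admin.deleteRecords(Collections.singletonMap(partition, RecordsToDelete.beforeOffset(42L)))
         .all()
         .get();
}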

Data isn't removed from compacted topics in the same way as in a relational database. Instead, Kafka uses a mechanism closer to the ones used by Cassandra and HBase, where records are marked for removal and then deleted later, when the compaction process runs. Deleting a message from a compacted topic is as simple as writing a new message to the topic with the key you want to delete and a null value. When compaction runs, the message will be deleted forever.

// Create a record in a compacted topic in Kafka
producer.send(new ProducerRecord<>(CUSTOMERS_TOPIC, "Customer123", "Donald Duck"));
// Mark that record for deletion (a tombstone) when compaction runs
producer.send(new ProducerRecord<>(CUSTOMERS_TOPIC, "Customer123", null));
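For completeness, here is a sketch of creating such a compacted topic with the Admin API (the partition count and replication factor are illustrative; the important part is cleanup.policy=compact):

// Create the customers topic with log compaction enabled
NewTopic customers = new NewTopic(CUSTOMERS_TOPIC, 3, (short) 3)
        .configs(Collections.singletonMap("cleanup.policy", "compact"));
admin.createTopics(Collections.singleton(customers)).all().get();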

If the key of the topic is something other than the CustomerId, then you need some process to map the two. For example, if you have a topic of Orders, then you need a mapping of Customer to OrderId held somewhere. This could be an external system, or it could be another Kafka topic. To 'forget' a customer, simply look up their Orders and either explicitly delete them from Kafka or redact any customer information they contain. You can roll this into a process of your own, holding the customer->OrderId mappings in a database or in a Kafka topic, or you might implement the whole process with Kafka Streams.
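As a rough sketch of that process, assuming a hypothetical orderKeysForCustomer lookup (backed by a database, or by a table built from another topic) that returns the order keys belonging to a customer, 'forgetting' the customer is then just a matter of writing one tombstone per order:

// Hypothetical lookup of all order keys belonging to a customer
List<String> orderKeys = orderKeysForCustomer("Customer123");

// Write a tombstone (null value) for each of the customer's orders
for (String orderKey : orderKeys) {
    producer.send(new ProducerRecord<>(ORDERS_TOPIC, orderKey, null));
}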

There is a less common case, which is worth mentioning, where the key (which Kafka uses for ordering) is completely different from the key you want to be able to delete by. Let's say that you need to key your Orders by ProductId. This choice of key won't let you delete Orders for individual customers, so the simple method above wouldn't work. You can still achieve this by using a key that is a composite of the two: make the key [ProductId][CustomerId], then use a custom partitioner in the producer (see the producer config "partitioner.class") that extracts the ProductId and partitions only on that value. You can then delete messages using the mechanism discussed earlier, with the [ProductId][CustomerId] pair as the key.
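Here is a minimal sketch of such a partitioner, assuming string keys of the (illustrative) form [ProductId][CustomerId] and borrowing the murmur2 hashing used by Kafka's default partitioner:

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ProductIdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Partition on the ProductId prefix only, so ordering per product is preserved
        // but individual [ProductId][CustomerId] keys can still be tombstoned.
        String compositeKey = (String) key;
        String productId = compositeKey.substring(0, compositeKey.indexOf(']') + 1);
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return Utils.toPositive(Utils.murmur2(productId.getBytes(StandardCharsets.UTF_8))) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}

You would then register it on the producer with props.put("partitioner.class", ProductIdPartitioner.class.getName()).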

A quite different approach, suggested by Daniel Lebrero, is well worth mentioning. Messages are encrypted as they arrive, with an encryption key per user, and the encryption keys are stored in a compacted topic. When a user needs to be 'forgotten', only the encryption key has to be deleted, leaving all the user's data unintelligible, but intact, in the log. There are a couple of advantages to this approach: (a) the metadata associated with each user is much smaller, only one key-value pair per user (user->encryption key); (b) the immutability of the log that stores data long term is maintained. The disadvantage is that the process for handling redaction (i.e. encryption/decryption) sits on the critical path: either embedded in the producer/consumer or using short-lived ingress/egress topics for the unencrypted data. Daniel provides a proof of concept and notes some pitfalls he sees in the approach.
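This isn't Daniel's implementation, but a rough sketch of the core idea, assuming AES-GCM and a hypothetical PerUserCrypto helper: the per-user key bytes live in a compacted topic keyed by userId, every message for that user is encrypted on the way in and decrypted on the way out, and a single tombstone on the key topic is all it takes to 'forget' them:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

import java.nio.ByteBuffer;
import java.security.SecureRandom;

// Illustrative helper: one AES key per user, held in a compacted key topic.
public class PerUserCrypto {

    private static final int GCM_TAG_BITS = 128;
    private static final int IV_BYTES = 12;

    // Generate a fresh key for a user; the bytes would be produced to the compacted key topic.
    public static byte[] newKeyForUser() throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey().getEncoded();
    }

    public static byte[] encrypt(byte[] keyBytes, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"),
                new GCMParameterSpec(GCM_TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        // Prepend the IV so the message can be decrypted later
        return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
    }

    public static byte[] decrypt(byte[] keyBytes, byte[] ivAndCiphertext) throws Exception {
        ByteBuffer buffer = ByteBuffer.wrap(ivAndCiphertext);
        byte[] iv = new byte[IV_BYTES];
        buffer.get(iv);
        byte[] ciphertext = new byte[buffer.remaining()];
        buffer.get(ciphertext);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(keyBytes, "AES"),
                new GCMParameterSpec(GCM_TAG_BITS, iv));
        return cipher.doFinal(ciphertext);
    }
}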

What about the databases that I read data from or push data to?

Quite often you'll be in a pipeline where Kafka is moving data from one database to another using Kafka connectors. In this case, you need to delete the record in the originating database and have that propagate through Kafka to any Connect sinks you have downstream. If you're using CDC, this will just work: the delete will be picked up by the source connector, propagated through Kafka, and applied in the sinks. If you're not using a CDC-enabled connector, you'll need some custom mechanism for managing deletes.

How long does compaction take to delete a message?

By default, compaction will run periodically and won’t give you a clear indication of when a message will be deleted. Fortunately, you can tweak the settings for stricter guarantees. The best way to do this is to configure the compaction process to run continuously, then add a rate limit so that it doesn’t affect the rest of the system unduly:

# Ensure compaction runs continuously
log.cleaner.min.cleanable.ratio=0.00001
# Set a limit on compaction so there is bandwidth for regular activities
log.cleaner.io.max.bytes.per.second=1000000

Setting the cleanable ratio to 0 would make compaction run continuously. A small, positive value is used here so the cleaner doesn't execute when there is nothing to clean, but kicks in quickly as soon as there is. A sensible value for the log cleaner max I/O is [max I/O of disk subsystem] x 0.1 / [number of compacted partitions]. So if this computes to 1 MB/s, a topic of 100 GB will have its removed entries cleaned within about 28 hours (100 GB ÷ 1 MB/s ≈ 100,000 seconds ≈ 28 hours). You can of course tune this value to get the guarantees you need.

One final consideration is that partitions in Kafka are made up of a set of files, called segments, and the latest segment (the one being written to) isn't considered for compaction. This means that a low-throughput topic might accumulate messages in the latest segment for quite some time before it rolls and compaction kicks in. To address this, we can force the segment to roll after a defined period of time. For example, log.roll.hours=24 would force segments to roll every day if they haven't already hit their size limit.
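As a sketch (values illustrative), this can be set broker-wide or per topic:

# Broker-wide: roll active segments at least once a day
log.roll.hours=24
# Per-topic equivalent, in milliseconds
segment.ms=86400000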

Tuning and Monitoring

There are a number of configurations for tuning the compactor (see the log.cleaner.* properties in the docs), and the compaction process publishes JMX metrics on its progress. You can also set a topic to be both compacted and subject to expiry, so data is never held longer than the expiry time.
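For example, a topic-level configuration along these lines (retention value illustrative) combines compaction with a hard time ceiling:

# Compact by key, but also delete anything older than 30 days
cleanup.policy=compact,delete
retention.ms=2592000000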

In Summary

Kafka provides immutable topics where entries expire after a configured time, compacted topics where messages with specific keys can be flagged for deletion, and the ability to propagate deletes from database to database with CDC-enabled connectors.

