Microservices in the Apache Kafka Ecosystem - Q&A

The following are questions from the live webinar "Microservices in the Apache Kafka Ecosystem" with Ben Stopford. Ben has taken the time to answer some of these questions. If you missed the webinar or want to rewatch it, you can view the recording here.

How do you deal with invalid/erroneous data on a stream? In other words, a "poison message"?
Kafka doesn't have a built-in dead letter queue concept. You'd have to craft one yourself.
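A common pattern is to catch the failure and park the bad record on a separate "dead letter" topic so it doesn't block the partition. A rough sketch, with made-up topic names and plain string payloads for illustration:

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DeadLetterExample {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> dlqProducer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    try {
                        process(record.value()); // your business logic
                    } catch (Exception e) {
                        // Park the poison message on a dead letter topic instead of
                        // failing the partition over and over.
                        dlqProducer.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
                    }
                }
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) { /* ... */ }
}
```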

Is Kafka usable for logging BLOBs like PDF files or emails?
The default max message size is 1MB. This can be increased, but I wouldn't send very large files through Kafka. Emails and PDFs should be OK.
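If you do raise the limit, the settings have to agree across the topic/broker, producer and consumer. A sketch of the client side (the 5 MB figure is purely illustrative, not a recommendation):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class LargerMessageConfig {
    static final int FIVE_MB = 5 * 1024 * 1024;

    public static Properties producerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, FIVE_MB); // producer-side cap
        return p;
    }

    public static Properties consumerProps() {
        Properties p = new Properties();
        p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // The consumer must be able to fetch at least one full message.
        p.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, FIVE_MB);
        return p;
    }
    // On the topic itself, raise max.message.bytes (message.max.bytes at the broker level).
}
```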

How do you get around the issue where a compacted topic may contain duplicate keys between compactions?
Good question: to treat a compacted topic as a table you need to materialize it in something that will ensure uniqueness. Kafka Streams does this by deduplicating the keys as it puts them into the state store. So the key point is, you're using the compacted topic as a mechanism for evicting old versions, so your topic doesn't grow indefinitely.
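For illustration, this is roughly what that looks like in Kafka Streams (topic and store names are made up); the state store only ever holds the most recent value per key, regardless of what compaction has or hasn't cleaned up yet:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class CompactedTopicAsTable {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Read the compacted topic as a table; Kafka Streams materializes it
        // into a local state store, deduplicating keys as it goes.
        KTable<String, String> customers = builder.table(
                "customers",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customers-store"));

        // ... join against it or query it from here.
    }
}
```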

Can Kafka Streams be used to calculate metrics for a time frame, say 1min, 1 hr, 1 day, 1 week, etc?
Yes - the window size bounds the aggregate.
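For example, a per-minute count per key looks something like this (the topic name is made up; an hourly, daily or weekly metric is just a different window size):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class WindowedMetrics {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // One aggregate per key per one-minute window.
        KTable<Windowed<String>, Long> countsPerMinute = builder
                .stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
                .count();
    }
}
```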

I guess some kind of snapshot has to be used to avoid replaying a log from the beginning of history (and avoid historic views).
Use a compacted topic. That is essentially a snapshot, as it keeps the latest value for each key. It's not a snapshot in a transactional sense though.
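As a sketch, creating such a topic with the admin client looks like this (the name, partition count and replication factor are illustrative):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // cleanup.policy=compact keeps the latest value per key instead of
        // deleting by age, so the topic acts as an ever-current "snapshot".
        NewTopic customers = new NewTopic("customers", 6, (short) 3)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(List.of(customers)).all().get();
        }
    }
}
```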

For high-throughput, REST-API-based data ingestion, how would the Schema Registry impact performance?
The Schema Registry provides serializers/deserializers that plug into the producer and consumer. These cache schemas locally, so you don't actually hit the Schema Registry itself very often.
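For context, the serializer is simply configured on the producer (the URLs here are illustrative); it registers and fetches schemas once and then serves subsequent lookups from its local cache, so steady-state produce calls don't make registry round trips:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AvroProducerSetup {
    public static KafkaProducer<String, GenericRecord> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The Avro serializer talks to the Schema Registry once per schema,
        // then works from its local cache.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return new KafkaProducer<>(props);
    }
}
```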

In a typical microservices environment, acknowledgements are important, and are more or less provided by AMQP-style brokers (e.g. RabbitMQ). How do you achieve the same using Kafka?
ACKs are provided by committing the offsets you have read back to Kafka.
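A sketch of what that looks like: turn off auto-commit and commit only after the records have actually been processed, which is the closest analogue to an AMQP-style ack (topic and group names are made up):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CommitAsAck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-service");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record); // process first...
                }
                consumer.commitSync(); // ...then "ack" by committing the offsets
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) { /* ... */ }
}
```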

What is the recommended approach for taking a 'snapshot' of the data?
Snapshots of a partition are easy: you just read from the start to a certain offset. Creating a global, transactional snapshot of all data, as you might in a database, is not possible though.
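For a single partition, that amounts to capturing the end offset up front and reading up to it. A sketch (topic, partition and key/value types are illustrative):

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.*;

public class PartitionSnapshot {
    public static Map<String, String> snapshot(String topic, int partition) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Map<String, String> latestByKey = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition(topic, partition);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));
            long end = consumer.endOffsets(List.of(tp)).get(tp); // the snapshot boundary

            // Read from the start up to the captured end offset, keeping the
            // latest value seen for each key.
            while (consumer.position(tp) < end) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    latestByKey.put(r.key(), r.value());
                }
            }
        }
        return latestByKey;
    }
}
```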

Will the fact that tables can be stored in Kafka as compacted logs encourage people to basically share state?
A really good question. Thank you for asking this. I should first say that they're not really tables. They are compacted streams. These are just normal streams, with overwritten keys removed. But they look like tables when they are inside Kafka Streams.
So, the reason it's a great question is that we do end up using Kafka, in part, for sharing historic state, albeit in a rather limited way. This isn't like sharing a database though. It's more like sharing immutable files. Also, in practice, there's a tricky question here: how does a service stay independent and autonomous if it depends on shared state? The answer is that, in reality, you use relatively small tables. These tend to map to "dimensions" in relational parlance. Your "facts" will be infinite streams. These are the important entities: orders, trades and so on. So your services largely rely only on streaming facts (which don't create a stateful coupling), plus a few tables used for enrichment and the like (which do). The result is a pretty good balance between sharing state and staying independent.
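To make that concrete, here's a sketch of the typical shape: a stream of facts enriched against a small table of dimensions (topic names and string types are made up for illustration):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class OrderEnrichment {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Infinite stream of business facts, keyed by customer id.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Small, compacted "dimension" table, also keyed by customer id.
        KTable<String, String> customers =
                builder.table("customers", Consumed.with(Serdes.String(), Serdes.String()));

        // Enrich each order with the current customer record.
        KStream<String, String> enriched =
                orders.join(customers, (order, customer) -> order + " / " + customer);

        enriched.to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```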

Any plans for Kafka as a service offering?
Yep. On its way.

Kafka Security is in Beta as per the docs. Any plans or ETA on when it will be taken out of Beta?
Confluent had a round of pen testing done by an external agency and it passed so it’s unofficially safe to use now. We’ll be pushing it out of beta shortly.

How do you avoid time synchronization issues when joining two streams and one stream is delayed?
Good question. That's what windows are for. They allow you to buffer so you can do a join between streams. If you join using time, you will be dependent on the clocks of the systems that produced the messages. But if you join by a physical key, all will be good, so long as the delay is not greater than your window. One nice feature of Kafka Streams is that it can support quite large windows by buffering on disk.
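A sketch of a windowed, key-based stream-stream join (topic names and the 30-minute window are illustrative, not recommendations):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class StreamStreamJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> payments =
                builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        // Records on either side are buffered for up to 30 minutes, so a
        // delayed message still meets its counterpart if it arrives in time.
        KStream<String, String> matched = orders.join(
                payments,
                (order, payment) -> order + " matched with " + payment,
                JoinWindows.of(Duration.ofMinutes(30)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        matched.to("orders-with-payments", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```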
Can the tables be cached in Kafka topics? If so, do we have any metrics on size and other performance characteristics?
Yes, absolutely. In Kafka Streams a KTable is backed by a local state store (RocksDB), which is in turn backed by a Kafka topic. We're working on performance stats. Watch this space.

Why do you recommend a version in the key? Should it change for every state change of the entity? Isn't order in the log enough?
Good question: the version has no function inside the log or in Kafka Streams. I meant this as more general advice for this type of system, rather than something Kafka/Kafka Streams specific. In this case you would need the version if you wanted to amend the data at source (i.e. the single writer principle), as you'd request an update for a specific version of the entity, which would use a versioned compare-and-set.

How much control do you have over purging topics? For example, if you have multiple subscribers, can you purge one hour after the last client has read it?
You can control when segments are deleted (i.e. how long they stick around for). This can be updated dynamically at the topic level.
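One thing to note: retention is time- or size-based, not tied to whether consumers have read the data, so "one hour after the last client has read it" isn't expressible directly. Changing retention on the fly looks roughly like this (the topic name and the one-hour value are illustrative):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
        // Keep segments for one hour before they become eligible for deletion.
        AlterConfigOp setRetention = new AlterConfigOp(
                new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "3600000"),
                AlterConfigOp.OpType.SET);

        Map<ConfigResource, Collection<AlterConfigOp>> update =
                Map.of(topic, List.of(setRetention));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.incrementalAlterConfigs(update).all().get();
        }
    }
}
```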

How would you use Kafka to do things like query joins over your stream data? Is that a misuse of Kafka?
Use Kafka Streams.

There are a number of unofficial providers for .NET, which lack a lot of features. Are there any plans for an official .NET provider for Kafka?
Not currently, but we are developing a suite of clients inside Confluent, as wrappers around the C API.

Where can we get a simple/clean Kafka Streams + Connect code example? In particular, something that shows off the simplicity: something that consumes from Kafka and connects to MySQL via JDBC.
I’d start here: https://www.confluent.io/blog/how-to-build-a-scalable-etl-pipeline-with-kafka-connect

How does Kafka perform compared to RPC in terms of real-time, event-driven communication between microservices? Specifically with regard to response times.
Kafka really does broadcast, via topics, rather than RPC. Latencies in Kafka are typically around 10ms.

You said don't use Kafka for shopping carts, but a previous slide said to use Kafka between microservices; how does this work?
You can of course use Kafka for shopping carts. The main point is that it's probably not the right tool for that job. A user checking the items in their shopping cart has no real business significance. That means the request should likely be ephemeral, and hence a non-persistent transport could make better sense. But there's nothing to stop you doing it with Kafka. You'd typically create request/response topics, with a single partition each, between each pair of services you wish to connect. But the key point is: build your architecture around broadcast streams of business-significant events. Use point-to-point communication only when you really need it. Looking up the items in a cart could well be one of those times.

How can we effectively leverage an event-driven architecture and a message broker when we need atomic transactions and awareness of whether something was processed successfully?
You need Exactly Once processing, which is coming next year. This provides transactions around messaging and stream processing. Until then, I'd suggest sticking with idempotence. As a more general theme: do everything you can to avoid needing distributed transactions when you design your app.
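A sketch of what consumer-side idempotence can look like: dedupe by a message id before applying the effect. The id handling and the in-memory set here are purely illustrative; a real service would persist the processed ids, ideally in the same local transaction as its own writes.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.*;

public class IdempotentHandler {
    private final Set<String> processedIds = new HashSet<>(); // illustrative only

    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    String messageId = record.key(); // assumes the key is a unique message id
                    if (processedIds.add(messageId)) {
                        applyPayment(record.value()); // only applied once per id, even on redelivery
                    }
                }
                consumer.commitSync();
            }
        }
    }

    private void applyPayment(String payment) { /* ... */ }
}
```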
