Project Metamorphosis: Unveiling the next-gen event streaming platformLearn More

Seamless SIEM – Part 2: Anomaly Detection with Machine Learning and ksqlDB

We talked about how easy it is to send osquery logs to the Confluent Platform in part 1. Now, we’ll consume streams of osquery logs, detect anomalous behavior using machine learning (ML), and provide solutions for handling these anomalous events in ksqlDB and the Confluent Platform.

Behavioral anomaly detection

Learning how users and operating systems normally behave and detecting changes in their behavior are fundamental for anomaly detection. In this use case, osquery logs from one host are used to train a machine learning model so that it can isolate anomalous behavior in another host. We’ll use this model in ksqlDB to score real-time osquery logs and route them to different topics.

Machine learning model – Latent Dirichlet allocation (LDA)

LDA is a text mining technique that groups or clusters text documents into similar themes, characteristics, or LDA topics (topics in LDA are clusters of documents; they are not Apache Kafka® topics). An LDA model will be trained to learn the normal operating system behavior and score real-time osquery logs for anomalous behavior. It can be swapped out for a deep learning model or for one that consumes continuous data like a regression model. LDA specifically consumes discrete text data.

Anomaly detection using machine learning

Anomaly Detection Using Machine Learning

This diagram shows the big picture of how osquery events get routed using ksqlDB and machine learning. The user-defined function (UDF) enriches the events by calling a model service that provides a score. That score is used to route events to different topics: GOOD, BAD, and UGLY. GOOD means a score greater than .7, UGLY means a score between .3 and .7, and BAD means a score less than .3. The bad and ugly events can then be consumed by a real-time dashboard for viewing. The good events are used to feed back the machine learning training to produce updated models.

The good, the bad, and the ugly

The score provided by the machine learning model is used to categorize the osquery events. GOOD means that the event appears to be a normal osquery log. These logs are used to retrain the model so that it can recognize normal osquery logs better.

BAD means that the event meets the threshold of being a suspicious osquery log. These logs can be sent to a real-time dashboard or a security information and event management (SIEM) system like Splunk or IBM QRadar for a threat hunter to further investigate.

UGLY means that we are not confident enough to label the osquery log as GOOD or BAD, requiring human or even more advanced machine learning or artificial intelligence (AI) processing for further analysis. After analysis, these logs should be moved either to the GOOD topic, where they can retrain/retune the model to recognize these events, or to the BAD topic so that it can be investigated by a threat hunter.

Machine learning workflow

Machine Learning Workflow

Training a machine learning model is usually a batch process that can take hours to complete. We’ll follow this more typical use case where the model is trained as a batch process, although there are ways to train a model with stream processing (sometimes called online machine learning and using stochastic methods).

There are two pipelines in this workflow diagram which are setup in a Lambda Architecture: batch and streaming. The batch pipeline trains the LDA model and saves it in a place where the model server can load it. The streaming pipeline scores incoming logs in real time using the most recently generated LDA model it obtained from the batch pipeline.

The ksqlDB UDF scores incoming messages. It enriches the event with a new column called score. The score is obtained by calling the model server, which has loaded the latest LDA model. Subsequent ksqlDB queries are used to route good, bad, and ugly osquery logs to their corresponding topics.

Running the demo

# get the code
git clone

# change directory the location which builds the Confluent Platform
cd demo-scene/osquery/cp

# build and create the Kafka cluster
make build
make up

# show the running containers
make ps

# wait 30 seconds for everything to start then start a ksqldb shell
docker exec -it ksql-cli ksql http://ksql-server:8088

Creating a stream from osquery events with ksqlDB

ksqlDB Clusters > ksqlDB

Navigate to the ksqlDB page above, and create a stream from the topic processes by clicking on “Add a stream” if you have not yet done so from part 1. Set “Encoding” to “JSON” and keep the default values for the rest of the fields.Encoding: JSON

Creating sink connectors with ksqlDB

In the ksqlDB shell, paste the statements below. Their corresponding comments describe their purpose and how they fit into the first and second diagrams depicted above.

/* selects osquery process logs that belong only to host1 for training */
CREATE STREAM training as
    select * from processes
    where hostidentifier = 'host1'
    emit changes;

/* creates a connector to feed the model with training data */
    'connector.class' = 'FileStreamSinkConnector',
    'key.converter'  = '',
    'value.converter' = '',
    'topics' = 'TRAINING',
    'file' = '/project/cp/logs/train.log'

/* scores the osquery log using a UDF */
        hostidentifier as host, 
        columns['path'] as path, 
        columns['name'] as name, 
        columns['cmdline'] as cmdline,
        columns['cwd'] as cwd,
        columns['root'] as root,
        columns['state'] as state,
        ) as score
    from processes;

/* routes good logs to the good topic */
    select *
    from scored
    where score >= .7
    emit changes;

/* routes bad logs to the bad topic */
    select *
    from scored
    where score < .3 
    emit changes; 

/* routes ugly logs to the ugly topic */
    select *
    from scored
    where score >= .3 and score < .7
    emit changes;

/* creates a connector to feedback good logs to the model trainer */
    'connector.class' = 'FileStreamSinkConnector',
    'key.converter'  = '',
    'value.converter' = '',
    'topics' = 'GOOD',
    'file' = '/project/cp/logs/feedback.log'

Proceed to the topics page in Confluent Control Center and monitor the topics GOOD, BAD, and UGLY. The LDA model will need about 100,000 training events to perform properly. You will not get good scores until your training data reaches that amount. Once you have an adequate amount of training data, you can trigger an anomalous event in host2, such as head -c 1G </dev/urandom >myfile. Make sure you don’t run this command in host1 since it is being used for training. You should expect this event to have a low probability score and end up in the BAD topic.


Building upon part 1, we’ve now published osquery logs into the Confluent Platform and trained an LDA model to learn from it. Then we deployed the model and called it from a ksqlDB UDF to enrich it with a score. Subsequent ksqlDB statements consumed the scored events to route them to the topics GOOD, BAD, and UGLY.

There are many ways to expand this use case, including publishing the bad and ugly logs to a SIEM system or external datastores using Kafka Connect for additional investigation and real-time alerting. The ease by which we built a seamless SIEM pipeline using Confluent Platform, Kafka Connect, and ksqlDB helps with building a minimum viable product for any budding security project.

If you’re interested in learning more, download the Confluent Platform to get started with a complete event streaming platform built by the original creators of Apache Kafka.

Hubert Dulay is a systems engineer at Confluent. He has over 20 years of engineering experience focusing on big and fast data. He has deployed several machine learning and event streaming applications into production for numerous Fortune 500 companies and top NYC startups.

Did you like this blog post? Share it now

Subscribe to the Confluent blog

More Articles Like This

Announcing the Snowflake Sink Connector for Apache Kafka in Confluent Cloud

We are excited to announce the preview release of the fully managed Snowflake sink connector in Confluent Cloud, our fully managed event streaming service based on Apache Kafka®. Our managed […]

Announcing ksqlDB 0.10.0

We’re excited to announce the release of ksqlDB 0.10.0, available now in the standalone distribution and on Confluent Cloud! This version includes a first-class Java client, improved Apache Kafka® key […]

Unifying Streams and State: The Seamless Path to Real-Time

More than ever before, people demand immediacy in every aspect of their lives. Expectations for how we shop, bank, and commute have completely evolved over the last decade. When you […]

Sign Up Now

Start your 3-month trial. Get up to $200 off on each of your first 3 Confluent Cloud monthly bills

Nouvelles inscriptions uniquement.

En cliquant sur le bouton « inscription » ci-dessus, vous acceptez que nous traitions vos informations personnelles conformément à notre Politique de confidentialité.

En cliquant sur « Inscription » ci-dessus, vous acceptez les termes du/de la Conditions d'utilisation et de recevoir occasionnellement des e-mails publicitaires de la part de Confluent. Vous comprenez également que nous traiterons vos informations personnelles conformément à notre Politique de confidentialité.

Gratuit à vie sur un seul broker Kafka

Le logiciel permettra une utilisation illimitée dans le temps de fonctionnalités commerciales sur un seul broker Kafka. Après l'ajout d'un second broker, un compteur de 30 jours démarrera automatiquement sur les fonctionnalités commerciales. Celui-ci ne pourra pas être réinitialisé en revenant à un seul broker.

Sélectionnez un type de déploiement
Déploiement manuel
  • tar
  • zip
  • deb
  • rpm
  • docker
Déploiement automatique
  • kubernetes
  • ansible

En cliquant sur le bouton « télécharger gratuitement » ci-dessus, vous acceptez que nous traitions vos informations personnelles conformément à notre Politique de confidentialité.

En cliquant sur « Téléchargement gratuit » ci-dessus, vous acceptez la Contrat de licence Confluent et de recevoir occasionnellement des e-mails publicitaires de la part de Confluent. Vous acceptez également que vos renseignements personnels soient traitées conformément à notre Politique de confidentialité.

Ce site Web utilise des cookies afin d'améliorer l'expérience utilisateur et analyser les performances et le trafic sur notre site Web. Nous partageons également des informations concernant votre utilisation de notre site avec nos partenaires publicitaires, analytiques et de réseaux sociaux.