Anomalia Machina 1 — Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

Let’s build the Anomalia Machina!

A Steampunk Anomalia Machina

1 Introduction

For the next Instaclustr Technology Evangelism blog series we plan to build and use a platform-wide application to showcase Apache Kafka and Apache Cassandra working together, demonstrate some best practices, and run benchmarks with everything at very large scale. We are calling the application “Anomalia Machina”, or Anomaly Machine: a machine for large-scale anomaly detection over streaming data (see Functionality below for further explanation). We may also add other Instaclustr managed technologies to the mix, including Apache Spark and Elasticsearch. In this blog we’ll introduce the main motivations for the project and cover the functionality and initial test results (Cassandra only, initially).

1.1 Kafka as a Buffer

1.2 Scalability

  • Performance and business metrics instrumentation, collection, and reporting are scalable
  • The application code that writes data from Kafka to Cassandra, and the stream processing code (including reads from Cassandra), is scalable and has sufficient resources
  • The Cassandra database is initially populated with enough data that reads come predominantly from disk rather than cache, for added realism. This may take considerable time as a warm-up phase of the benchmark; alternatively, we may be able to generate the data once, save it, and restore it each time the benchmark is run or a new cluster is populated.

1.3 Deployment Automation

2 Functionality

Let’s look at the application domain in more detail. In the previous blog series on Kongo, a Kafka-focussed IoT logistics application, we persisted business “violations” to Cassandra via Kafka Connect for future use. For example, we could have used the data in Cassandra to check and certify that a delivery was free of violations across its complete storage and transportation chain.

  1. Data is sent to Cassandra for long term persistence
  2. Streams processing is triggered by the incoming events in real-time
  3. Historic data is requested from Cassandra
  4. Historic data is retrieved from Cassandra
  5. Historic data is processed, and
  6. A result is produced.
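The six steps above can be sketched end to end in a few lines. This is a minimal illustration only: an in-memory dict stands in for Cassandra, and all names (`historic_store`, `process_event`, `detect_anomaly`) and the 3-sigma threshold are hypothetical, not the actual application code.

```python
# Sketch of the six-step pipeline; a dict stands in for Cassandra.
import statistics

historic_store = {}  # key -> list of values ("Cassandra")

def detect_anomaly(history, new_value, threshold=3.0):
    """Steps 5-6: flag new_value if it is more than `threshold`
    standard deviations from the mean of the historic window."""
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return stdev > 0 and abs(new_value - mean) > threshold * stdev

def process_event(key, value):
    # 1. Persist the incoming event for the long term.
    history = historic_store.setdefault(key, [])
    # 2-4. Streams processing is triggered by the event; request
    #      and retrieve the historic data for this key.
    past = list(history)
    history.append(value)
    # 5-6. Process the history and produce a result.
    return detect_anomaly(past, value)

for v in [10, 11, 9, 10, 10, 11, 9, 10, 50]:
    result = process_event("sensor-1", v)
print(result)  # the final value, 50, is flagged as anomalous
```

In the real application, step 1 is a Cassandra write, steps 3-4 are a Cassandra read of the most recent values for the key, and step 5 runs the change point detector.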

2.1 Anomaly Detection

What sort of application combines streaming data with historical data in this sort of pipeline? Anomaly detection is an interesting use case. It is applicable to a wide range of application domains such as fraud detection, security and threat detection, website user analytics, sensors and IoT, system health monitoring, etc.

  • The data doesn’t need to be evenly spaced in time, so it may extend over a long period, making the query more demanding (i.e. the earlier data is more likely to be on disk rather than in cache).
  • The number of events needed may be dynamic, depending on the actual values and patterns and on how accurate the prediction needs to be.
  • Prediction quality is measured by accuracy, a confusion matrix, and how long it takes to detect a change point after it first occurs. In practice the algorithms typically work by dividing the data into two windows: one window of “normal” data before the change point, and a second window of “abnormal” data after it. The amount of data in the second window is variable, resulting in the detection lag.
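The two-window idea in the last bullet can be illustrated with a toy mean-shift detector. This is a sketch, not the algorithm the application uses; the window sizes and threshold are arbitrary illustrative parameters.

```python
# Two-window change point sketch: after each new point, compare the
# mean of a "normal" reference window against a small trailing
# "abnormal" window. All parameters are illustrative.
def detect_change(series, ref_size=10, threshold=2.0, min_abnormal=3):
    """Return the index at which a mean shift is detected, or None.
    Detection index minus the true change point is the detection lag."""
    for t in range(ref_size + min_abnormal, len(series) + 1):
        reference = series[t - ref_size - min_abnormal : t - min_abnormal]
        abnormal = series[t - min_abnormal : t]
        ref_mean = sum(reference) / len(reference)
        abn_mean = sum(abnormal) / len(abnormal)
        if abs(abn_mean - ref_mean) > threshold:
            return t - 1  # index of the point that triggered detection
    return None

# Stable around 0.0 for 15 points, then a level shift to 5.0.
data = [0.0] * 15 + [5.0] * 10
print(detect_change(data))  # 16: one point after the true change at 15
```

Note how the detector only fires once enough “abnormal” points have accumulated in the second window; that gap between the true change point (index 15) and the detection index is the detection lag mentioned above.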

2.2 Data models

Given the simple <key, value> data records, what will the data models look like for Kafka and C*?
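One plausible mapping, purely as a sketch (the keyspace, table, and column names are illustrative, not the final design): the Kafka side is just the <key, value> record, while the Cassandra side partitions by the record key and clusters by time, so the “most recent N values for this key” read the detector needs is a single-partition query.

```sql
-- Illustrative Cassandra schema (not necessarily the final design):
-- partition by the record key, cluster by time descending so the
-- most recent values for a key are stored together, newest first.
CREATE TABLE anomalia.events (
    key text,
    time timestamp,
    value double,
    PRIMARY KEY (key, time)
) WITH CLUSTERING ORDER BY (time DESC);

-- The historic-data read (steps 3-4 of the pipeline) then becomes:
-- SELECT value FROM anomalia.events WHERE key = ? LIMIT 50;
```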

3 Cassandra Test Implementation

We started building the application incrementally, writing an initial test application using Cassandra only. We wrote an event generator based on a random walk, which produces “interesting” data for the anomaly detector to look at; it can be tuned to produce more or fewer anomalies as required. This event generator was included in a simple Cassandra client and used to trigger the anomaly detection pipeline and the write and read paths to Cassandra.
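A random-walk generator of this kind can be sketched as follows. The function name and parameters (`anomaly_prob`, `anomaly_scale`) are hypothetical, but the idea matches the description above: mostly small up/down steps, with occasional large jumps whose frequency is tunable.

```python
import random

def event_generator(n, step=1.0, anomaly_prob=0.01, anomaly_scale=20.0,
                    seed=42):
    """Random walk: each event moves up or down by `step`; with
    probability `anomaly_prob` it instead jumps by `anomaly_scale`
    steps, producing an anomaly. Tune `anomaly_prob` for more or
    fewer anomalies. Parameters are illustrative only."""
    rng = random.Random(seed)
    value = 0.0
    for _ in range(n):
        if rng.random() < anomaly_prob:
            value += rng.choice([-1.0, 1.0]) * step * anomaly_scale
        else:
            value += rng.choice([-1.0, 1.0]) * step
        yield value

events = list(event_generator(1000, anomaly_prob=0.05))
print(len(events))  # 1000 events, roughly 5% of them large jumps
```

Seeding the generator makes runs reproducible, which is useful when comparing benchmark configurations against the same event stream.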

3.1 Initial Cassandra Results

In order to check that the anomaly pipeline is working correctly and to get an initial idea of the resource requirements for the pipeline application and baseline performance of the Cassandra cluster, we provisioned a small Cassandra cluster on Instaclustr on AWS using the smallest recommended Cassandra production AWS instances and a large single client instance as follows:

  • Cassandra cluster nodes: 8GB RAM, 2 cores, 250GB SSD, 450 Mbps dedicated EBS bandwidth.
  • Client instance: c5.2xlarge (8 cores, 16GB RAM, EBS bandwidth up to 3,500 Mbps).
  • 0:1 read:write operations ratio: if the probability is 0 then no checking is ever performed, so there are no reads and the load is write-only, giving a read:write ratio of 0:1.
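The read:write ratio above can be driven by a single “check probability” parameter: every event is written, and with some probability a historic read (and detector run) is also performed. A minimal sketch, with hypothetical names:

```python
import random

def run_workload(n_events, check_prob, seed=7):
    """Write every event; with probability `check_prob`, also read
    historic data and run the detector. Returns (reads, writes), so
    check_prob=0 gives a 0:1 read:write ratio (write-only load)."""
    rng = random.Random(seed)
    reads = writes = 0
    for _ in range(n_events):
        writes += 1                    # every event is persisted
        if rng.random() < check_prob:
            reads += 1                 # historic read + detector run
    return reads, writes

print(run_workload(10000, 0.0))  # (0, 10000): write-only, ratio 0:1
```

Raising the probability towards 1 moves the workload from the 0:1 write-only baseline towards a 1:1 read:write mix, letting a single parameter sweep out the benchmark’s operation ratios.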

4 Further Resources

  1. Anomaly Detection
  2. Change Detection
  3. An introduction to Anomaly Detection
  4. A Survey of Methods for Time Series Change Point Detection, doi:10.1007/s10115-016-0987-z
  5. Introduction to optimal changepoint detection algorithms
  6. A review of change point detection methods
  7. Unsupervised real-time anomaly detection for streaming data
  8. For some non-algorithmic cybernetics see: Cybernetic Zoo, a history of cybernetic animals and early robots
  9. Anomaly Detection With Kafka Streams
  10. Building a Streaming Data Hub with Elasticsearch, Kafka and Cassandra (An Example With Anomaly Detection)
  11. Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics


