Anomalia Machina 1 — Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

Let’s build the Anomalia Machina!

A Steampunk Anomalia Machina

1 Introduction

1.1 Kafka as a Buffer

1.2 Scalability

  • The Kafka load generator is scalable and has sufficient resources to produce the required high loads and traffic patterns (e.g. load spikes of increasing duration).
  • Performance and business metrics instrumentation, collection, and reporting are scalable.
  • The application code that writes data from Kafka to Cassandra, and the streams-processing code (including reads from Cassandra), is scalable and has sufficient resources.
  • The Cassandra database is initially populated with sufficient data to ensure that reads come predominantly from disk rather than cache, for added realism. This may require a considerable warm-up phase each time the benchmark is run, or we may be able to populate the data once, save it, and restore it for each benchmark run or to populate new clusters.
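The "load spikes of increasing duration" requirement can be made concrete with a minimal sketch of a load-profile generator. The function name, rates, and durations below are all hypothetical, not taken from the benchmark itself:

```python
def load_profile(base_rate, spike_rate, base_len, spike_lens):
    """Yield a per-second target event rate: a steady baseline
    interleaved with load spikes of increasing duration."""
    for spike_len in spike_lens:
        for _ in range(base_len):
            yield base_rate   # baseline period before each spike
        for _ in range(spike_len):
            yield spike_rate  # spike period, longer each time

# e.g. 60s of baseline at 1,000 events/s before each spike,
# with spikes of 10s, 20s, 30s at 5x the baseline rate
profile = list(load_profile(1000, 5000, 60, [10, 20, 30]))
```

A real load generator would feed such a profile to a rate-limited Kafka producer; the shape (step spikes of growing length) is the point of the sketch.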

1.3 Deployment Automation

2 Functionality

  1. Large volumes of streaming data are ingested into Kafka (at variable rates)
  2. Data is sent to Cassandra for long term persistence
  3. Streams processing is triggered by the incoming events in real-time
  4. Historic data is requested from Cassandra
  5. Historic data is retrieved from Cassandra
  6. Historic data is processed, and
  7. A result is produced.
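The seven steps above can be sketched end-to-end with in-memory stand-ins for the Kafka topic and the Cassandra table (a real implementation would use the Kafka consumer API and a Cassandra driver; the names and the 50-row window here are illustrative):

```python
from collections import defaultdict, deque

kafka_topic = deque()          # stand-in for the Kafka topic
cassandra = defaultdict(list)  # stand-in: event key -> time-ordered rows

def ingest(event):
    kafka_topic.append(event)  # step 1: ingest into Kafka

def process_next(detector, window=50):
    key, value = kafka_topic.popleft()  # step 3: triggered by incoming event
    cassandra[key].append(value)        # step 2: persist to Cassandra
    history = cassandra[key][-window:]  # steps 4-5: request/retrieve historic rows by key
    return detector(history)            # steps 6-7: process history, produce a result

ingest(("acct-1", 3.0))
result = process_next(lambda rows: max(rows) > 10)  # toy detector
```

Note that the write (step 2) and the key-based historic read (steps 4-5) hit the same table, which is what drives the read:write row ratios discussed later.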

2.1 Anomaly Detection

  • The simplest version is a single-variable method, which needs data only for the variable of interest (e.g. an account number, an IP address, etc.), not for other variables.
  • The data doesn’t need to be equally spaced in time, so it may extend over a long period, making the query more demanding (i.e. the earlier data is more likely to be on disk rather than in cache).
  • The number of events needed may be dynamic, depending on the actual values and patterns and on how accurate the prediction needs to be.
  • Prediction quality is measured by accuracy, a confusion matrix, and how long it takes to detect a change point after it first occurs. In practice the algorithms typically work by dividing the data into two windows: the first holds “normal” data before the change point, the second “abnormal” data after it. The amount of data in the second window is variable, which results in the detection lag.
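The two-window idea can be illustrated with a minimal change-point check. This is a sketch, not the benchmark's actual algorithm: it flags a change when the mean of the "abnormal" window after a candidate change point shifts by more than a threshold number of standard deviations from the "normal" window before it:

```python
import statistics

def change_detected(rows, split, threshold=3.0):
    """Two-window mean-shift check (illustrative only): compare the
    'abnormal' window after the candidate change point against the
    'normal' window before it, in units of the normal window's stddev."""
    normal, abnormal = rows[:split], rows[split:]
    mu = statistics.mean(normal)
    sigma = statistics.stdev(normal) or 1e-9  # guard against zero spread
    return abs(statistics.mean(abnormal) - mu) / sigma > threshold

data = [10, 11, 9, 10, 10, 11, 30, 31, 29]  # level shift after index 6
```

Because the second window starts small and grows as post-change events arrive, several events may be needed before the shift clears the threshold, which is exactly the detection lag described above.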

2.2 Data models

3 Cassandra Test Implementation

3.1 Initial Cassandra Results

  • Cassandra cluster: 3 nodes x 2 cores per node, EBS-backed tiny m4l-250: 8GB RAM, 2 cores, 250GB SSD, and 450 Mbps dedicated EBS bandwidth per node.
  • Client instance: c5.2xlarge: 8 cores, 16GB RAM, EBS bandwidth up to 3,500 Mbps.
  • 1:1 read:write operations ratio: if the read probability is 1.0, then every write is accompanied by a read of the most recent 50 rows for the event key. The read:write operations ratio is 1:1, but 50 times more rows are read than written.
  • 0:1 read:write operations ratio: if the probability is 0, then no checking is ever performed, so there are no reads and the load is write-only, giving a read:write ratio of 0:1.
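The relationship between the read probability and the row-level load can be captured in a couple of lines (the function name and the intermediate 0.5 case are illustrative; the 50-rows-per-read figure is from the workload above):

```python
def rows_per_event(read_probability, rows_per_read=50):
    """Expected Cassandra rows read and written per incoming event,
    given the probability that an event triggers an anomaly check."""
    rows_written = 1  # every event is persisted exactly once
    rows_read = read_probability * rows_per_read
    return rows_read, rows_written

assert rows_per_event(1.0) == (50, 1)  # 1:1 ops ratio, 50x more rows read
assert rows_per_event(0.0) == (0, 1)   # write-only workload
```

This makes clear why the 1:1 operations ratio is far more demanding than it sounds: at the row level it is really a 50:1 read:write workload.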

4 Further Resources

  1. Anomaly Detection
  2. Change Detection
  3. An introduction to Anomaly Detection
  4. A Survey of Methods for Time Series Change Point Detection, doi:10.1007/s10115-016-0987-z
  5. Introduction to optimal changepoint detection algorithms
  6. A review of change point detection methods
  7. Unsupervised real-time anomaly detection for streaming data
  8. For some non-algorithmic cybernetics see: Cybernetic Zoo, A history of cybernetic animals and early robots
  9. Anomaly Detection With Kafka Streams
  10. Building a Streaming Data Hub with Elasticsearch, Kafka and Cassandra (An Example With Anomaly Detection)
  11. Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics


