Geospatial Anomaly Detection: Part 1 — Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra

Part 1: The Problem and Initial Ideas

This blog will introduce the problem of Geospatial Anomaly Detection and investigate some initial Cassandra data models based on using latitude and longitude locations.

1. Space: The Final Frontier

Space: the final frontier. These are the voyages of the starship Enterprise. Its continuing mission: to explore strange new worlds. To seek out new life and new civilizations. To boldly go where no one has gone before! [Captain Picard]

[Image: black hole in a galaxy far away]
[Image: Project Blue Book archives]

2. The Geospatial Anomaly Detection Problem

Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. [Douglas Adams, The Hitchhiker’s Guide to the Galaxy]

[Image: Space]
[Image: Location and scale challenges]
[Image: Treasure Island]
[Image: Events spaced far apart]
[Image: Near events]
[Image: Flat earth theory]

3. Latitude and Longitude

To modify the Anomalia Machina application to work with geospatial data we need to (1) modify the Kafka load generator so that it produces data with a geospatial location as well as a value (i.e. we replace the original ID integer key with a geospatial key), (2) write the new data type to Cassandra, and (3) for a given geospatial key, query Cassandra for the nearest 50 events in reverse time order.
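As a rough illustration of step (1), the sketch below generates events that carry a latitude/longitude location as well as a value. The `GeoEvent` class, its field names, and the uniform random choice of coordinates are all assumptions made for this example; this is not the actual Anomalia Machina load generator.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoEvent:
    """Hypothetical event: a location (the new key), a timestamp and a value."""
    lat: float    # degrees, -90 (south pole) to 90 (north pole)
    long: float   # degrees, -180 to 180
    time_ms: int  # event time in milliseconds since the epoch
    value: float


def random_event(rng: random.Random) -> GeoEvent:
    """Create one event at a random location (assumed uniform in degrees)."""
    return GeoEvent(
        lat=rng.uniform(-90.0, 90.0),
        long=rng.uniform(-180.0, 180.0),
        time_ms=int(time.time() * 1000),
        value=rng.random(),
    )


rng = random.Random(42)  # fixed seed so that runs are repeatable
events = [random_event(rng) for _ in range(5)]
```

Picking latitude uniformly in degrees puts proportionally more points near the poles than a uniform spread over the Earth's surface would, which may or may not matter for a load test.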

CREATE TABLE latlong (
    country text,
    time timestamp,
    lat double,
    long double,
    PRIMARY KEY (country, time)
) WITH CLUSTERING ORDER BY (time DESC);
select * from latlong where country='nz' and lat = -39.1296 and long = 175.6358 limit 50;
“Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”
select * from latlong where country='nz' and lat = -39.1296 and long = 175.6358 limit 50 allow filtering;
[Image: Volcano]
[Image: Square and stationary earth]
[Image: Earth longitude and latitude]
[Image: Mercator map]
[Image: Earth cylindrical projection]
[Image: Haversine formula]
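The Haversine formula named in the last image gives the great-circle distance between two latitude/longitude points. The following is a minimal sketch of it, assuming a spherical Earth with a mean radius of 6371 km; the function and constant names are chosen for this example only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the Earth is treated as a sphere


def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(long2 - long1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating point rounding before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# One degree of latitude, and one degree of longitude at the equator
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
one_degree_long_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
# One degree of longitude at 60 degrees north is roughly half as long
one_degree_long_60n = haversine_km(60.0, 0.0, 60.0, 1.0)
```

With this radius, one degree comes out at roughly 111 km along a meridian or along the equator, and at about half that along the 60th parallel.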

4. Bounding Box

Using a simpler approximation for distance, such as a bounding box calculation, means we can use inequalities (>=, <=) to check whether a point (x2, y2) is approximately within some distance (d) of another point (x, y). This example is for simple (x, y) co-ordinates, as the calculation for latitude and longitude is more complex. It requires converting latitude and longitude to distance (each degree of latitude is approximately 111km everywhere, as lines of latitude are parallel, but a degree of longitude is about 111km at the equator and shrinks to zero at the poles), and careful handling of boundary conditions near the poles (90 and -90 degrees latitude) and near -180 and 180 degrees longitude (the “antimeridian”, which is the basis for the International Date Line, directly opposite the Prime Meridian):

[Image: Bounding box query]
select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering;
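The degree-to-distance conversion described above can be sketched as follows. This is an approximation under stated assumptions: 111 km per degree of latitude, a longitude degree scaled by the cosine of the centre latitude, and no support for boxes that reach a pole or cross the antimeridian (the sketch raises an error instead). The function names are hypothetical.

```python
import math

KM_PER_DEGREE = 111.0  # approximate length of one degree of latitude


def bounding_box(lat, long, d_km):
    """Return (min_lat, max_lat, min_long, max_long) around a point.

    Raises ValueError near the poles or the antimeridian, which this
    sketch does not handle.
    """
    dlat = d_km / KM_PER_DEGREE
    if abs(lat) + dlat >= 90.0:
        raise ValueError("box reaches a pole; not handled in this sketch")
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    dlong = d_km / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    if long - dlong < -180.0 or long + dlong > 180.0:
        raise ValueError("box crosses the antimeridian; not handled in this sketch")
    return (lat - dlat, lat + dlat, long - dlong, long + dlong)


def in_box(lat, long, box):
    """True if the point lies inside the box, using only inequalities."""
    min_lat, max_lat, min_long, max_long = box
    return min_lat <= lat <= max_lat and min_long <= long <= max_long


# A box extending 50 km in each direction from the example location
box = bounding_box(-39.1296, 175.6358, 50.0)
```

The four values returned are the limits that would go into the lat and long inequalities of a query. Because of the cosine term, the box is wider in degrees of longitude than in degrees of latitude at this location.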

5. Indexing

I wondered if indexing the latitude and longitude columns would remove the need for “allow filtering”. Indexing allows rows in Cassandra to be queried by columns other than just those in the partition key. Cassandra indexing options include clustering columns, secondary indexes and SASI indexes.

5.1 Clustering Columns

Even though we are using time as a clustering column already, we can also add latitude and longitude as clustering columns as follows:

CREATE TABLE latlong (
    country text,
    time timestamp,
    lat double,
    long double,
    PRIMARY KEY (country, lat, long, time)
) WITH CLUSTERING ORDER BY (lat DESC, long DESC, time DESC);
select * from latlong where country='nz' and lat= -39.58 and long >= 175.18 and long <= 176.08 limit 50;

5.2 Secondary Indexes

But what if we add secondary indexes instead? A Cassandra secondary index is an optional index on a column, but it should be used with caution. Let’s create some secondary indexes:

create index i1 on latlong (lat);
create index i2 on latlong (long);

5.3 SASI

There is another type of indexing available called SASI (SSTable Attached Secondary Index), which is included in Cassandra by default. SASI supports complex queries more efficiently than the default secondary indexes, including:

  • Wildcard search in string values.
  • Range queries.

create custom index i3 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex';
create custom index i4 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex';
