PostgreSQL JSON — Unlocking the Full Potential — Instaclustr

Instaclustr
10 min read · Jun 8, 2021


Unlocking the full potential of PostgreSQL JSON

1. What Is JSON?

JSON was designed as an open, lightweight, text-based data-interchange format for exchanging data between web browsers and backend servers. It works natively with JavaScript (hence the name “JavaScript Object Notation”, rather than “J’Son”, the only son of the previous Emperor of Spartax and a Marvel character), and as of 2017 it is an IETF standard (RFC 8259).

JSON has a very simple basic structure, but it can be nested, so you may end up with a more complex structure!

Let’s put “J’Son” in some JSON to demonstrate.

A JSON object consists of key/value pairs, surrounded by “{}” and separated by “,” as follows:
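
    {
      "name": "J'Son",
      "occupation": "Emperor of Spartax"
    }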

Values can just be more JSON objects, so nesting with arbitrary hierarchies is possible, for example, “fatherOf”:
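
    {
      "name": "J'Son",
      "occupation": "Emperor of Spartax",
      "fatherOf": {
        "name": "Peter Quill",
        "alias": "Star-Lord"
      }
    }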

Values can also be arrays (ordered lists of values), for example, “likesAndDislikes”, as in this example:
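
    {
      "name": "J'Son",
      "likesAndDislikes": ["ruling Spartax", "Star-Lord", "the Guardians of the Galaxy"]
    }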

But, even though JSON was designed for interoperability, it is increasingly used for storage, and thus begins our story…

JSON and Databases

One of the most interesting Percona talks I listened to recently was by Guy Harrison on “Next Generation Databases”. I later discovered that he has written a more in-depth book, in which the one-page JSON slide from the talk gets a whole chapter (Chapter 4) devoted to the topic.

A (very) brief summary is that there was an explosion of “document” oriented databases dedicated to specific unstructured data types (e.g. object, XML, JSON, etc.), but more recently there has been a convergence, as many databases, including “traditional” SQL databases, increasingly support multiple document types. PostgreSQL is no exception, and supports objects, XML, and JSON. Let’s have a look at PostgreSQL’s JSON support, illustrated by some simple examples using JSON NOAA tidal data from my blog series and Percona talk “Building and Scaling a Robust Zero-Code Data Pipeline With Open Source Technologies”.

2. PostgreSQL JSON Data Types

PostgreSQL has multiple JSON related data types:

  • Two data types for storing JSON data: json and jsonb.
  • A third type, jsonpath, which supports the SQL/JSON path language for efficient querying of JSON data.

Why does PostgreSQL need JSON-specific data types? To allow table columns to be created with JSON types, to store JSON data efficiently, to enforce JSON-specific rules, and to support JSON queries.

The first two types, json and jsonb, are just like any other PostgreSQL data types, so can be used to specify column types as follows:
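
    -- a sketch: the table name is illustrative; the jsonb column "reading"
    -- matches the NOAA tidal data examples later in this post
    CREATE TABLE tides_jsonb (
      id SERIAL PRIMARY KEY,  -- SERIAL automatically creates a unique primary key
      reading jsonb           -- a json column would be declared the same way, with type json
    );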

Note that we use a SERIAL pseudo-type which automatically creates a unique primary key.

What’s the difference between json and jsonb?

The json data type stores an exact copy of the JSON (in a text data type), which must be re-parsed on each use. It’s therefore faster to store, but slower to process. It also preserves the order of object keys and duplicate object keys, whereas jsonb doesn’t.

The jsonb data type stores the JSON in a “decomposed” binary format which is initially slower to store (due to the conversion overhead), but faster to subsequently process (as re-parsing isn’t required), and may use slightly more disk space. The main advantage is that jsonb supports indexing, which enables more efficient querying.

Here’s a table to summarize:
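
    Property               json                    jsonb
    Storage                exact text copy         decomposed binary format
    Storing (writes)       faster                  slower (conversion overhead)
    Processing (reads)     slower (re-parsing)     faster (no re-parsing)
    Key order/duplicates   preserved               not preserved
    Disk space             less                    may be slightly more
    Indexing               no                      yes (normal and GIN)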

The general consensus seems to be that, even though jsonb originally meant “JSON binary”, it really means “JSON better”, and should be the default JSON data type you use.

You add an index to a jsonb column like this:
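
    -- a GIN index over the entire jsonb column (names follow the sketch table above)
    CREATE INDEX idx_tides_reading ON tides_jsonb USING GIN (reading);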

Note: On jsonb you can use normal indexes or GIN indexes (which are more powerful). GIN enables some of the more powerful JSON operators (below), so please consult the detailed “jsonb indexing” documentation to find out more as I won’t mention indexing further here (e.g. you can index the entire JSON column, as in this example, or a subset).

As I discovered in my last blog on PostgreSQL Data Types and Java, PostgreSQL has a wealth of data types available, making it challenging to determine the mappings to SQL/JDBC and Java types. However, the job is easier for mapping JSON to jsonb, as there are only four primitive JSON types. PostgreSQL uses the following JSON mappings:
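
    JSON primitive type   PostgreSQL type   Notes
    string                text              \u0000 is disallowed, as are non-ASCII Unicode
                                            escapes if the database encoding is not UTF8
    number                numeric           NaN and infinity values are disallowed
    boolean               boolean           only lowercase true and false are accepted
    null                  (none)            SQL NULL is a different concept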

Why do the mappings matter? Well, unlike some other technologies (e.g. Elasticsearch), PostgreSQL doesn’t attempt to do any “intelligent” type conversions for you. For example, a JSON string with a timestamp in it will only be stored as a PostgreSQL text type. This implies that any conversion has to be done during subsequent processing (e.g. to order results over a “timestamp” field).

3. PostgreSQL JSON Creation

So far we’ve seen how to define a JSON column data type and associated index. But how do you actually get JSON data into it? Here’s an example of using the INSERT statement to create JSON data in the jsonb column called reading. Note that the data must be in correct JSON syntax. For example:
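
    -- a sketch based on the NOAA tidal data format; the station details
    -- and values are illustrative
    INSERT INTO tides_jsonb (reading) VALUES
      ('{"metadata": {"id": "8724580", "name": "Key West", "lat": "24.5557", "lon": "-81.8079"},
         "data": [{"t": "2021-06-08 04:18", "v": "0.597"}]}');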

Note that there are also a dozen dedicated JSON Creation Functions which convert SQL values, arrays or rows to json or jsonb (e.g. to_json, to_jsonb, etc).

4. PostgreSQL JSON Operators and Functions

To understand what sorts of things you may want to do with JSON data, I started by finding out what JSON operations are supported in JavaScript itself. JSON can be parsed to and from JavaScript objects and arrays, which means that all the JavaScript array operators work on parsed JSON as well, including find, concat, filter, join, slice, and splice!

What JSON operations are supported in PostgreSQL? I assumed there would be equivalent operators and, this being a database, support for create, read, update, and delete (CRUD) as well. I found the list of current JSON Functions and Operators, although the tables in version 12 are a bit easier to read. Following is a summary and some examples of the operators.

Get/Extract

There are six get/extract operators (accessors, and serializers) which work for both json and jsonb data types: “->”, “->>”, “#>”, and “#>>” (the “->” and “->>” operators each have integer and text variants, hence six). Half of these (“->”, “#>”) return json, and the other half (“->>”, “#>>”) return text. Note that they return NULL if the JSON doesn’t have the right structure for the request.

These operators (and the comparison and path operators below) are used in a SELECT statement like this:
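
    -- select just the jsonb reading column (table name from the earlier sketch)
    SELECT reading FROM tides_jsonb;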

Which returns all rows, showing only the json reading columns. This example drills down in the JSON column, and returns all the metadata fields as json:
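
    -- drill down one level: return the metadata object of every reading, as json
    SELECT reading -> 'metadata' FROM tides_jsonb;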

Operator Chaining

Operators can be chained. This example returns all the metadata name fields as text, and is an example of operator chaining, as “->” returns JSON, so you can just apply another operator to the results, and so on:
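
    -- "->" returns json, so "->>" can be applied to its result,
    -- returning the name as text
    SELECT reading -> 'metadata' ->> 'name' FROM tides_jsonb;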

This refinement is probably more useful as it returns the set of unique locations (duplicates are removed).
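
    -- DISTINCT removes duplicate locations
    SELECT DISTINCT reading -> 'metadata' ->> 'name' AS location FROM tides_jsonb;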

Comparison

All of the standard comparison operators work for jsonb (but not json). These can be used in the WHERE clause to filter the results. For example, this query returns the data json for the “Key West” location only:
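
    -- jsonb values can be compared directly; the right-hand side is a jsonb string
    SELECT reading -> 'data'
    FROM tides_jsonb
    WHERE reading -> 'metadata' -> 'name' = '"Key West"';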

Containment, Existence, and More

There are also 12 additional operators that work for jsonb only (“@>”, “<@”, “?”, “?|”, “?&”, “||”, “-”, “#-”, “@?”, “@@”; the deletion operator “-” has three variants, hence 12).

These include:

  • Containment operators (“@>” and “<@”)
  • Existence operators (“?”, “?|”, “?&”)
  • Concatenation operator (“||”)
  • Deletion operators (“-”, “#-”)
  • Path checking operators (“@?”, “@@”).

The containment, existence, and path checking operators all return boolean results, while the concatenation and deletion operators return jsonb results. See here for more information on containment and existence operator semantics. Note that containment is nested, but existence isn’t, and that concatenation only works at the “top level” (it’s not recursive).

The containment example returns all readings that contain the object {“name”:”Key West”} within the metadata field.
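
    -- "@>" tests whether the left jsonb value contains the right jsonb value
    SELECT reading FROM tides_jsonb
    WHERE reading -> 'metadata' @> '{"name": "Key West"}';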

The existence operator (“?”) checks if a string argument exists as an object key or an array element. This existence example returns all readings that have a ‘name’ field in the metadata:
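
    -- "?" tests whether the string exists as a top-level object key
    SELECT reading FROM tides_jsonb
    WHERE reading -> 'metadata' ? 'name';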

Note that as existence does not match object values, this example returns nothing:
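
    -- 'Key West' is an object value, not a key, so no rows are returned
    SELECT reading FROM tides_jsonb
    WHERE reading -> 'metadata' ? 'Key West';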

Processing Functions

There’s a very long list (42! Apparently “The Answer to the Ultimate Question of Life, the Universe, and Everything” was actually relatively mundane after all) of JSON Processing Functions, including functions (for example) to:

  • expand a json array into different data types
  • count elements
  • extract objects (similar to “#>” and “#>>” operators)
  • return the json keys
  • set and insert (which replaces or adds path elements)
  • path checking (exists, match, query)
  • pretty printing, and
  • typeof!

Changing a value deep in a json structure can result in some pretty complex SQL; and here’s another object updating example.
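As a simple sketch (using the illustrative table from earlier), jsonb_set can replace a value two levels down:

    -- jsonb_set(target, path, new_value) replaces the element at the given path
    UPDATE tides_jsonb
    SET reading = jsonb_set(reading, '{metadata,name}', '"Key West, FL"')
    WHERE reading -> 'metadata' ->> 'name' = 'Key West';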

The processing functions also explain the mystery of why there is no PostgreSQL data type corresponding to the JSON array type in the table above (hint: array isn’t a JSON primitive data type). You just use the processing functions to extract the array elements from jsonb, for example “Working with a JSONB Array of Objects in PostgreSQL”.

Aggregation

There are also json-specific aggregation functions (json_agg etc.), which allow you to aggregate (compute a single result from multiple inputs) other data types or json/jsonb into json/jsonb. You can also use standard aggregation functions after casting json values to the correct type. For example, this finds the northernmost locations:
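
    -- cast the latitude (stored as a json string) to numeric, then aggregate;
    -- the "lat" field is assumed from the NOAA metadata format
    SELECT reading -> 'metadata' ->> 'name' AS location,
           max((reading -> 'metadata' ->> 'lat')::numeric) AS latitude
    FROM tides_jsonb
    GROUP BY location
    ORDER BY latitude DESC;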

5. SQL/JSON Path

We mentioned that PostgreSQL has three JSON data types, but we’ve only explored two so far. What happened to the other one? The third data type is the jsonpath type, which implements support for the SQL/JSON path language in PostgreSQL to efficiently query JSON data. It provides similar functionality to XPath for XML. It’s a very powerful language that allows you to traverse json structures level by level and retrieve arbitrary sub-items using multiple accessor operators and nested filter expressions. You can have strict or lax error handling: with lax (the default), the path engine adapts the queried data to the path; with strict, an exception occurs. Here’s the complete list of jsonpath Operators and Methods (there are lots). And you can also use regular expressions.
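As a taste, here’s a sketch (against the illustrative table from earlier) using a path filter expression:

    -- "@?" checks whether the jsonpath returns any item for the jsonb value
    SELECT reading FROM tides_jsonb
    WHERE reading @? '$.metadata ? (@.name == "Key West")';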

This is a nice “gentle” introduction to JSON Path. And this (more in-depth coverage of PostgreSQL JSON and a roadmap) concludes that “PostgreSQL has the best implementation of JSON Path”.

6. JSON Validation

One of the Percona talks I watched recently was by David Stokes on JSON document validation. This talk was a good overview of why JSON validation is important for databases (you don’t want to get “bad” data into the database, and then get errors and have to fix them afterwards), and how the JSON Schema standard helps prevent this. There have been some experimental implementations of JSON Schema validation for earlier drafts of the standard, but I couldn’t find any current support. When I asked some of our PostgreSQL gurus, they said that data validation is traditionally done outside the database, so there’s nothing to stop the client code from validating JSON before inserting it into the database. For example, this is a recent Java JSON validator that is used by the Apache Camel JSON Validator component.

You can also apply “constraints” to JSON.
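For example, a minimal sketch of a CHECK constraint (on the illustrative table from earlier) that insists every reading has a metadata key:

    -- reject rows whose reading lacks a top-level "metadata" key
    ALTER TABLE tides_jsonb
      ADD CONSTRAINT reading_has_metadata CHECK (reading ? 'metadata');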

7. Conclusions

PostgreSQL is a mature database technology with excellent native support for JSON column data types and operations (coming up to 10 years now). It offers more than enough JSON functionality to launch new JSON database use cases, and users of document databases are increasingly taking advantage of this functionality and migrating to PostgreSQL. One notable success story that I came across recently was The Guardian newspaper’s migration of their mission-critical content publication application from MongoDB as the backend to PostgreSQL (“Bye bye Mongo, Hello Postgres”). The migration was motivated by the advanced JSON functionality (requiring minimal changes to their existing JSON data model), improved performance and availability, potential opportunities for leveraging SQL, and easier operation of cloud-hosted PostgreSQL.

However, one hurdle with migrating from one document database to another type of database is that the interfaces of document databases are invariably custom, making it impossible to simply unplug one database and plug in another. The Guardian’s success was largely due to the sophisticated approach they took to the migration. First, they built a new API for their application to interface with PostgreSQL so they could keep the old API operating. Then they built a new proxy sitting between the application and both APIs. This enabled them to run the application with both databases for a period of time while they migrated the content over and carried out tests, while gradually increasing traffic to PostgreSQL. Finally, after 10 months and 2.4 million articles successfully migrated to PostgreSQL JSON, they switched MongoDB off permanently.

The successful migration ticked off their goal of supporting their existing JSON data model and future SQL enhancements. JSON allows for a halfway house between developers, who want an unstructured or dynamic serialization, and database operations, who want a fairly rigid data definition schema. You can have both in PostgreSQL, literally at the same time in the same place. As we discovered in this blog, JSON is a full-fledged data type in PostgreSQL, so you can use it for SQL function return types, columns, views, anything really.

The migration to PostgreSQL also achieved another important goal, which was to have someone else manage their database, as (and I quote) “Database management is important and hard, and we’d rather not be doing it ourselves.”

Originally published at https://www.instaclustr.com on June 8, 2021.
