Contains and contained by lookups

It's been quite a long time since my last post. Truth is, I've been busy implementing the remaining lookups on all database backends and polishing the JSONField. Now, the field is pretty much complete and I will try to explain the implementation here. I'll probably split the explanation into multiple posts if it becomes too long. In this post, I'll explain the containment lookups, i.e. contains and contained_by.

The contains lookup for JSONField is different from the built-in contains lookup. Instead of checking for a matching substring in the field's value, the lookup is overridden to check whether the field's value contains the specified JSON path/value entries at the top level. To demonstrate:

>>> Dog.objects.create(name='Rufus', data={'breed': 'labrador', 'owner': 'Bob'})
>>> Dog.objects.create(name='Meg', data={'breed': 'collie', 'owner': 'Bob'})
>>> Dog.objects.create(name='Fred', data={})

>>> Dog.objects.filter(data__contains={'owner': 'Bob'})
<QuerySet [<Dog: Rufus>, <Dog: Meg>]>

>>> Dog.objects.filter(data__contains={'breed': 'collie'})
<QuerySet [<Dog: Meg>]>

On PostgreSQL, it's implemented using the @> operator.

So, this lookup:

Dog.objects.filter(data__contains={'breed': 'collie'})

becomes something like:

SELECT * FROM myapp_dog WHERE myapp_dog.data @> '{"breed": "collie"}'::jsonb

On MySQL and MariaDB, we can make use of the JSON_CONTAINS function. The function takes three arguments: the target JSON document, the candidate(s), and an optional path. Since we're only checking at the top level, we can omit the path or specify it as '$' (the root of the JSON document). The documentation isn't really clear about what happens if the path is omitted. It only says "candidate JSON document is contained within a target JSON document", without mentioning whether it searches deep into the target document or just the top level. From what I briefly tested, it seems to search only at the top level. However, let's just be explicit and specify the path as '$'.

The query becomes something like:

SELECT * FROM myapp_dog WHERE JSON_CONTAINS(myapp_dog.data, '{"breed": "collie"}', '$')

On Oracle Database and SQLite, there's no proper equivalent of JSON_CONTAINS. So, we have to check each key-value pair of the candidate against the target, with the comparisons chained by AND. Fun fact: I initially just skipped implementing this lookup on both backends. After implementing the transforms later, I realized the lookup can be implemented in this manner.

As always, things are unnecessarily more complex on Oracle Database. There is no function to easily extract a value from a JSON document. We are given two functions for this: JSON_QUERY and JSON_VALUE. The functions both require two arguments: the JSON document, and the path we want to extract.

Here's what's ridiculous: JSON_QUERY returns a string that represents the JSON object or array located at the path. If the path doesn't exist, it returns SQL NULL. If the path exists but the value located at the path is a JSON scalar, it also returns NULL. Meanwhile, JSON_VALUE does the opposite: it returns the SQL equivalent of scalar values and returns NULL for JSON objects and arrays.

I wouldn't really mind having these two functions if there were also a combination of the two. MariaDB has JSON_QUERY, JSON_VALUE, and JSON_EXTRACT, and you guessed it: JSON_EXTRACT is the combination of the first two.

There's a way to get around this, though: we can use COALESCE. However, it's not really needed for now, because we can still identify which function to use by looking at the right-hand side of the lookup (see the update at the bottom of this post). We will need COALESCE when we implement the transforms later.

So, the query for the lookup should be something like:

SELECT * FROM myapp_dog WHERE JSON_VALUE(myapp_dog.data, '$.breed') = 'collie'

If the candidate contains more than one key-value pair, we can chain the clauses using AND.
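For example, with a two-pair candidate, the generated WHERE clause would look something like this (a sketch based on the example Dog data above, not the exact query Django emits):

```sql
SELECT * FROM myapp_dog
WHERE JSON_VALUE(myapp_dog.data, '$.breed') = 'collie'
  AND JSON_VALUE(myapp_dog.data, '$.owner') = 'Bob'
```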

However, the problem doesn't end there. In this case, the candidate is just a shallow JSON object. Let's say I have this JSON object:

{
  "a1": {
    "a2": {
      "a3": "first"
    }
  },
  "b1": "second"
}

If I want to query using the following lookup:

Dog.objects.filter(data__contains={'a1': {'a2': {'a3': 'first'}}})

For some reason, the following WHERE clause doesn't match any data:

WHERE JSON_QUERY(myapp_dog.data, '$.a1') = '{"a2": {"a3": "first"}}'

After some fiddling around, I altered it to look like this:

WHERE JSON_QUERY(myapp_dog.data, '$.a1') = JSON_QUERY('{"a2": {"a3": "first"}}', '$')

and it works. I don't know why that's the case, though. It probably has something to do with LOB stuff, but I'm not very familiar with Oracle Database so I'm not really sure if that's the case.

Also, if you noticed, the previous $.breed = 'collie' query compared against the plain string 'collie' rather than the JSON-encoded '"collie"', because Oracle automatically unquotes the string returned by JSON_VALUE. This means that we'll have to do a json.loads() on the value in the Python code first.

Meanwhile, if the RHS is an integer or float, we shouldn't do a json.loads() because the Oracle backend will throw an error.

What about boolean values?

That's a lot of unnecessary cases to handle. I'm not surprised if there's more that I haven't caught.

Anyway, one trick I decided to use is to wrap the RHS with a dummy JSON object and use JSON_QUERY or JSON_VALUE, depending on the RHS.

So, this lookup:

Dog.objects.filter(data__contains={'a1': {'a2': {'a3': 'first'}}})

becomes:

SELECT * FROM myapp_dog WHERE JSON_QUERY(myapp_dog.data, '$.a1') = JSON_QUERY('{"val": {"a2": {"a3": "first"}}}', '$.val')

and this lookup:

Dog.objects.filter(data__contains={'breed': 'collie'})

becomes:

SELECT * FROM myapp_dog WHERE JSON_VALUE(myapp_dog.data, '$.breed') = JSON_VALUE('{"val": "collie"}', '$.val')

This way, I only have to differentiate whether the RHS is a scalar or not to decide whether to use JSON_QUERY or JSON_VALUE. We'll still end up using the JSON-encoded RHS for both types, so they don't differ much and we can just chain multiple key-value pairs with AND.
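As a rough Python sketch of the dummy-object trick (oracle_rhs is a made-up helper name for illustration, not the actual Django code):

```python
import json

def oracle_rhs(value):
    # Objects and arrays must go through JSON_QUERY, scalars through
    # JSON_VALUE; wrapping the value in {"val": ...} lets us JSON-encode
    # both kinds uniformly and extract them back with the '$.val' path.
    func = "JSON_QUERY" if isinstance(value, (list, dict)) else "JSON_VALUE"
    return "%s('%s', '$.val')" % (func, json.dumps({"val": value}))

print(oracle_rhs("collie"))
# JSON_VALUE('{"val": "collie"}', '$.val')
print(oracle_rhs({"a2": {"a3": "first"}}))
# JSON_QUERY('{"val": {"a2": {"a3": "first"}}}', '$.val')
```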

Another case is performing the lookup with an empty dictionary ({}) as the candidate. Basically, it means that the query should return any rows that are JSON objects, as any JSON object can be thought of as containing an empty JSON object. Kind of like how an empty set is a subset of any set.

On Oracle, we can only store valid JSON objects or arrays, not scalars. However, there's no function to check whether the JSON type is an object or an array. A trick I came up with is to use pattern matching with the '{%}' pattern. Basically, it matches data that "starts with { and ends with }", which is what we need.
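In SQL, the empty-candidate check would be something like this (a sketch of the idea, not necessarily the exact generated query):

```sql
-- data__contains={}: match any row whose stored JSON is an object
SELECT * FROM myapp_dog WHERE myapp_dog.data LIKE '{%}'
```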

That's it, now (I think) the lookup is complete on Oracle.

On SQLite, things are also a bit tricky, but not as much. There's a JSON_EXTRACT function, so we don't really have to worry about the left-hand side. However, as with Oracle, a JSON object or array RHS can't be compared directly. I think it's because SQLite minifies the stored data. To get around this, I just need to wrap the object or array RHS with the JSON() function.
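You can see the quirk with Python's sqlite3 module, assuming a SQLite build that includes the JSON1 functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# JSON_EXTRACT returns the minified object '{"a2":"first"}', so comparing
# against a hand-written literal with spaces fails, while wrapping the RHS
# in JSON() minifies it the same way and matches.
row = conn.execute(
    """SELECT JSON_EXTRACT('{"a1": {"a2": "first"}}', '$.a1') = '{"a2": "first"}',
              JSON_EXTRACT('{"a1": {"a2": "first"}}', '$.a1') = JSON('{"a2": "first"}')"""
).fetchone()
print(row)  # (0, 1)
```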

Meanwhile, for JSON null values, we have to use the JSON_TYPE function to differentiate them from SQL NULL, as I explained in the previous post. We can also use the function to check for the empty dictionary case.
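Again with Python's sqlite3 (assuming JSON1 is available), here's how JSON_TYPE separates the cases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A present JSON null reports the string 'null', a missing key reports
# SQL NULL, and the whole document reports 'object' (the empty-dict case).
present, missing, toplevel = conn.execute(
    """SELECT JSON_TYPE('{"a": null}', '$.a'),
              JSON_TYPE('{"a": null}', '$.b'),
              JSON_TYPE('{"a": null}')"""
).fetchone()
print(present, missing, toplevel)  # null None object
```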

That's pretty much it for the SQLite implementation.

The next lookup is the contained_by lookup. I won't go into details here, as it's basically the inverse of the contains lookup. On PostgreSQL, we just need to use <@ operator instead of @>. On MySQL and MariaDB, we just need to flip the arguments of JSON_CONTAINS.
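So, on MySQL and MariaDB, a contained_by query would look something like this (a sketch; the column is now the candidate and the Python object is the target):

```sql
SELECT * FROM myapp_dog
WHERE JSON_CONTAINS('{"breed": "collie", "owner": "Bob"}', myapp_dog.data, '$')
```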

On Oracle and SQLite, though... it's impossible (I think)! There's no way, in a single query, to check whether the stored data is completely contained in a Python object and return the matching rows. We can do the inverse, checking whether all of the key-value pairs in the Python object also exist in the database, because if they do, then the stored data is a superset of (or equal to) the Python object. Meanwhile, we cannot know what keys exist in the stored data without querying for it first. So, I had to leave the lookup unimplemented on these two backends.

Well, that's it for this post! Sorry if it's confusing, I don't really know how to explain this stuff in a simpler way because it really is kind of complex. The next post will be about key transforms, where things get even more fun and complex (or not? Heh, I don't know). Until then, bye!

Update 1: I realized I missed something for the Oracle implementation. If the Python dictionary includes a key with None as its value, the resulting part of the WHERE clause would be:

JSON_VALUE(myapp_dog.data, '$.somekey') = JSON_VALUE('{"val": null}', '$.val')

and that wouldn't work because it would be like NULL = NULL.

To check for an existing key with a null value, we need to use JSON_EXISTS combined with JSON_QUERY(...) IS NULL and JSON_VALUE(...) IS NULL. So, the correct WHERE clause is:

JSON_EXISTS(myapp_dog.data, '$.somekey') AND COALESCE(
    JSON_QUERY(myapp_dog.data, '$.somekey'), JSON_VALUE(myapp_dog.data, '$.somekey')
) IS NULL

Like I said, things are unnecessarily more complex on Oracle Database. On SQLite, we can make use of the JSON_TYPE function, which returns the string 'null' for a JSON null value that exists at a given path. Well, turns out we do need the COALESCE function after all!

Update 2: Simon (yes, that Simon) replied to my tweet, suggesting the use of a custom SQLite function to implement the contained_by lookup. I decided to implement the function for contains lookup and flip the arguments for contained_by lookup, kind of like how it's done on MySQL.

The implementation of the lookups is very simple. The function takes two arguments, the target and the candidate. We just need to json.loads() them both. If they're both dictionaries, we can use the comparison operator >= with the .items() of the dictionaries. Otherwise, just check if both objects are equal. That's it!
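Here's a minimal sketch of that idea using Python's sqlite3 module (the function name and registration details are my own illustration, not necessarily the exact Django code):

```python
import json
import sqlite3

def json_contains(target, candidate):
    # Both arguments arrive as JSON strings; deserialize them first.
    target, candidate = json.loads(target), json.loads(candidate)
    if isinstance(target, dict) and isinstance(candidate, dict):
        # dict_items comparison is set-like: True when every key-value
        # pair of the candidate is also present in the target.
        return target.items() >= candidate.items()
    return target == candidate

conn = sqlite3.connect(":memory:")
conn.create_function("JSON_CONTAINS", 2, json_contains)
# contains: column first; contained_by: just flip the arguments.
contains, contained_by = conn.execute(
    "SELECT JSON_CONTAINS(?, ?), JSON_CONTAINS(?, ?)",
    [
        '{"breed": "collie", "owner": "Bob"}', '{"owner": "Bob"}',
        '{"owner": "Bob"}', '{"breed": "collie", "owner": "Bob"}',
    ],
).fetchone()
print(contains, contained_by)  # 1 0
```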