Writing JSONField from scratch
Last week, I improved an existing Django package and implemented some SQLite
lookups for its JSONField
to get some sense of how Django model fields work.
This week, I tried to create a unified JSONField from scratch. It's unified in the sense that it works on all database backends supported by Django. I also set up Docker containers to run Django's test suite against all database backends with the help of django-docker-box (though I ended up making my own setup for Oracle since it's a bit more complex). However, I'm just going to talk about writing the field, as that's more interesting.
As before, the docs on how to write custom model fields are very handy. I started building my own JSONField and got it working in a relatively short time, though not without bugs and quirks I had to deal with on each database backend.
First, I wrote the code with SQLite as the top priority. I overrode the db_type method so it returns json. It's also the name of the data type used in MySQL and MariaDB for JSON values, so I guess it's a good start.
def db_type(self, connection):
return 'json'
This means that the SQL query for the table creation will be something like:
CREATE TABLE ... (..., "column_name" json)
Now, is there a json data type in SQLite, especially with the JSON1 extension available? Nope. I thought there was, as the JSON1 docs have some examples that use JSON as the data type of some columns. Anyway, the docs state that the extension stores JSON data as ordinary text.
You can actually use custom data type names in SQLite. It means we can still define our column's data type as json, so the database schema will look pretty nice.
However, SQLite uses dynamic typing: data types aren't determined by the container (i.e. the column) that holds the value; instead, the data type is associated with the value itself. SQLite also has something called type affinity, a system that tries to convert values inserted into a table to the preferred storage class of their column, which is called the column's affinity.
There's a breakdown of how SQLite determines column affinity in the data type docs. I recommend giving it a quick read.
Anyway, according to the docs, our json data type will have NUMERIC affinity. This means SQLite will try to store any values in the column as an INTEGER or REAL (in that order). This can lead to some problems.
When we retrieve a JSON value from the database, we would like it to be converted into its Python equivalent (bool, int, float, str, dict, list), despite it being stored as text in the database. To illustrate:
>>> value = {'name': 'John', 'age': 20, 'height': 180.3}
>>> obj = JSONModel.objects.create(value=value)
>>> obj = JSONModel.objects.get(id=obj.id)
>>> type(obj.value)
<class 'dict'>
That means the serialization (fancy word for translation) of the value should be handled automatically by our JSONField. In Django, we implement this by overriding two essential methods in our Field subclass: get_prep_value and from_db_value (for serialization and deserialization, respectively).
Thanks to Python's json library, we can use the json.dumps and json.loads functions for that purpose.
def get_prep_value(self, value):
"""Convert the value into a JSON string."""
if value is None:
return value
return json.dumps(value)
def from_db_value(self, value, expression, connection):
"""Convert the JSON string value into a Python object."""
if value is None:
return value
return json.loads(value)
Keep in mind that our field has NUMERIC affinity in the database. Say we want to insert a Python int as the value, for example 3. It's a valid JSON value. If we call json.dumps(3), we will get the Python string '3'. Suppose we create a new object of our model. The value will be inserted into the database like so:
INSERT INTO myapp_mymodel VALUES (..., '3', ...)
That SQL query is perfectly fine, since SQLite stores JSON as ordinary text. However, with NUMERIC affinity, SQLite will try to convert the value into an INTEGER in the database. It succeeds in doing so, therefore it's stored as an INTEGER.
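To see this affinity behavior outside Django, here's a quick sketch using Python's sqlite3 module (the table and column names are made up for illustration):
import sqlite3

conn = sqlite3.connect(':memory:')
# A declared type of "json" gets NUMERIC affinity in SQLite.
conn.execute('CREATE TABLE demo ("value" json)')
# Insert the JSON string '3' as text.
conn.execute("INSERT INTO demo VALUES ('3')")
print(conn.execute('SELECT "value", typeof("value") FROM demo').fetchone())
# (3, 'integer') -- SQLite coerced the text into an INTEGER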
If we go ahead and retrieve our object using Model.objects.get, our from_db_value will receive a Python int object 3 as the value from the database backend, instead of the str object '3'. Of course, calling json.loads(3) would raise an exception (a TypeError, to be precise).
We could go ahead and add an isinstance(value, (int, float)) check to our from_db_value method, but I find the best solution is to just ditch json as our db_type and use text instead, so our column will have TEXT affinity and no conversion will be done by SQLite.
Actually, SQLite will assign TEXT affinity if the declared type of a column contains the string "CHAR", "CLOB", or "TEXT". So, we could specify something like json text as our data type, but I'm not sure if that's something people would like. (I kind of like it, so I might use that later.) Anyway, that's just some SQLite magic.
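Here's roughly what my db_type override ends up looking like with that decision; returning json for the other backends here is a simplification on my part (Oracle gets its own treatment later):
def db_type(self, connection):
    if connection.vendor == 'sqlite':
        # "text" gives the column TEXT affinity, so SQLite won't try to
        # coerce JSON strings like '3' into numbers.
        return 'text'
    return 'json'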
For MySQL and MariaDB, we can use json as our data type and our JSONField would work without having to change anything else. That's pretty cool!
If we want to make it cooler, we can add an SQL CHECK constraint using the JSON_VALID function, which is available in SQLite, MySQL, and MariaDB. To do so, we can just override the db_check method in our field. This will prevent invalid JSON values from getting inserted into the database table.
def db_check(self, connection):
data = self.db_type_parameters(connection)
if connection.vendor in ('mysql', 'sqlite'):
return 'JSON_VALID(%(qn_column)s)' % data
return super().db_check(connection)
Note that connection.vendor is also 'mysql' for MariaDB. Also, on MariaDB 10.4.3 and up, the JSON_VALID function is automatically used as a CHECK constraint for the JSON data type.
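With the db_check override above, the table creation SQL on SQLite ends up looking something like this (the table and column names are just examples):
CREATE TABLE "myapp_jsonmodel" (..., "value" text CHECK (JSON_VALID("value")))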
Now, let's move on to PostgreSQL.
PostgreSQL provides two data types for JSON values: json and jsonb. In short, jsonb is better. It supports indexing and other cool stuff. You can see the JSON Types docs for more details.
django.contrib.postgres.fields.JSONField also uses jsonb. If you look at its source code, you can see that it doesn't override from_db_value. Why is that?
Well, Django uses psycopg2 as the database adapter for its PostgreSQL backend. It turns out psycopg2 already does JSON adaptation when it retrieves json and jsonb data from the database. The Python json module and json.loads are used, just like our from_db_value method does. You can verify this in the psycopg2 source code.
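To see this adaptation in action without Django, here's a rough sketch (the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect('dbname=test')  # placeholder DSN
cur = conn.cursor()
cur.execute("""SELECT '{"name": "John"}'::jsonb""")
print(cur.fetchone()[0])  # {'name': 'John'} -- already a dict, not a string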
Calling json.loads with the adapted value could throw a TypeError. It's kind of like what happened with SQLite earlier. We can easily handle this by modifying our method like this:
def from_db_value(self, value, expression, connection):
"""Convert the JSON string value into a Python object."""
if value is None or connection.vendor == 'postgresql':
return value
return json.loads(value)
However, I want to spice things up a bit. Let's allow a custom JSON encoder and decoder to be used in our serialization and deserialization process!
Both json.dumps and json.loads accept a keyword argument cls. It can be used to specify the class of a JSON encoder and decoder, respectively. The argument is None by default, which will make the functions use the built-in json.JSONEncoder and json.JSONDecoder.
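For example, a custom encoder could handle types that the built-in encoder can't, like datetime. This is just an illustrative sketch, not something taken from the field itself:
import datetime
import json

class DateTimeEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return obj.isoformat()
        return super().default(obj)

json.dumps({'created': datetime.datetime(2019, 6, 10)}, cls=DateTimeEncoder)
# '{"created": "2019-06-10T00:00:00"}'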
If we want to allow a custom JSON encoder and decoder, we need to pass them as arguments to our JSONField constructor and make them instance attributes. This way, we can pass the encoder and decoder to our json.dumps and json.loads calls like so:
def __init__(self, encoder=None, decoder=None, default=dict, *args, **kwargs):
self.encoder, self.decoder = encoder, decoder
super().__init__(default=default, *args, **kwargs)
def get_prep_value(self, value):
if value is None:
return value
return json.dumps(value, cls=self.encoder)
def from_db_value(self, value, expression, connection):
if value is None:
return value
return json.loads(value, cls=self.decoder)
We also need to override the deconstruct method accordingly:
def deconstruct(self):
name, path, args, kwargs = super().deconstruct()
if self.default is dict:
del kwargs['default']
if self.encoder is not None:
kwargs['encoder'] = self.encoder
if self.decoder is not None:
kwargs['decoder'] = self.decoder
return name, path, args, kwargs
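With that in place, using the field in a model looks something like this (a minimal sketch; DateTimeEncoder is the illustrative encoder from earlier):
from django.db import models

class JSONModel(models.Model):
    value = JSONField(encoder=DateTimeEncoder)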
You probably noticed that I left off the conditional for PostgreSQL in from_db_value. If we stick with that solution, we won't be able to use a custom decoder, since we would just return the value adapted by psycopg2.
A naive solution might be to call json.dumps on that value and call json.loads with our decoder, basically serializing it and deserializing it again. It could work, but that would be slow. We need a way to prevent psycopg2 from adapting the value to Python objects.
According to the docs, we can either cast the column to text in the query, or register a no-op loads with register_default_json (the registration is shared for the same database connection). If we choose the latter, we might break compatibility with contrib.postgres's JSONField, since it doesn't allow a custom decoder and relies on psycopg2's loads instead.
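For completeness, the no-op registration would look something like this with psycopg2.extras (we won't use it, for the compatibility reason above; the connection string is a placeholder):
import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=test')  # placeholder DSN
# Hand json/jsonb values back as raw strings instead of decoding them.
psycopg2.extras.register_default_json(conn, loads=lambda x: x)
psycopg2.extras.register_default_jsonb(conn, loads=lambda x: x)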
Thankfully, we can implement the former by overriding select_format. It's not documented as of this writing, but the docstring gives a clue on how it can be used. I found examples of overridden select_format in contrib's GIS fields. Apparently, we can do it like this:
def select_format(self, compiler, sql, params):
if compiler.connection.vendor == 'postgresql':
return '%s::text' % sql, params
return super().select_format(compiler, sql, params)
Therefore, from_db_value will always receive a string value for non-NULL values in the database, and we can call json.loads just like for the other backends.
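The resulting query then looks roughly like this (model and column names are just examples):
SELECT "myapp_jsonmodel"."id", "myapp_jsonmodel"."value"::text FROM "myapp_jsonmodel"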
Phew! Now we've got our JSONField working on SQLite, MySQL, MariaDB, and PostgreSQL. We've also allowed the use of a custom JSON encoder and decoder. Let's move on to the last database backend: Oracle.
Oracle can use the VARCHAR2, BLOB, or CLOB data types to store JSON values. The docs recommend using BLOB, but there are some downsides:
- When selecting data from a BLOB column, if you want to view it as printable text, you must use the SQL function to_clob.
- When performing insert or update operations on a BLOB column, you must explicitly convert character strings to BLOB format using the SQL function rawtohex.
I haven't verified whether that's the case with cx_Oracle (the Oracle Database adapter for Python). Since an implementation of JSONField on Oracle is already available and it uses CLOB instead, I'm going to use CLOB too. If BLOB turns out to be feasible to implement (without anything too complex), I might switch to BLOB.
Oracle also provides the IS JSON constraint for columns that hold JSON data. The docs recommend using this constraint in the table definition, because we won't be able to use the simple dot-notation syntax to query the JSON data otherwise.
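For reference, that dot-notation syntax looks something like this (the table alias, column, and key names here are made up):
SELECT t.data.name FROM myapp_jsonmodel t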
However, from my experience, the IS JSON constraint only accepts a JSON object or array as the value. This means we can only use dict or list Python objects, and not bool, int, float, or str, as our field's value.
To add that constraint to our field, we can override db_check like before:
def db_check(self, connection):
data = self.db_type_parameters(connection)
if connection.vendor in ('mysql', 'sqlite'):
return 'JSON_VALID(%(qn_column)s)' % data
if connection.vendor == 'oracle':
return '%(qn_column)s IS JSON' % data
return super().db_check(connection)
Another thing to note is that cx_Oracle returns a Python object of type LOB for values with the BLOB, CLOB, and NCLOB data types. We cannot simply use json.loads, as the decoder doesn't know how to decode LOB objects. To handle that, we must obtain the str equivalent of the LOB object by calling its .read method.
The oracle-json-field package overcomes this by subclassing TextField. TextField on Oracle is implemented using NCLOB in Django, and the database backend already has a converter that calls the .read method.
We can either modify the backend so it also does the same for our JSONField, or we can override the get_db_converters method in our field. I chose the latter to be consistent with my previous decisions.
The get_db_converters method isn't documented as of this writing, but we basically just need to create a list that contains the functions that we want to be run before returning the final value. Then, we extend that list with the one returned by the original get_db_converters.
I implemented mine like this:
def convert_jsonfield_value(value, expression, connection):
if connection.vendor == 'oracle':
return value.read()
return value
class JSONField(Field):
...
def get_db_converters(self, connection):
return [convert_jsonfield_value] + super().get_db_converters(connection)
Now, our from_db_value method will get the equivalent str of the LOB as the value.
And that's it! We've successfully implemented our own unified JSONField. We have yet to implement custom lookups and transforms, but that's on our to-do list. I made a draft PR to Django for this field today (which also happens to be my birthday!). If some things in that PR turn out differently from what I wrote here, maybe I've found some better approaches to implementing them.
Whoa, this turned out long. Anyway, if you want to leave feedback, feel free to comment below or on the PR; I'd love to hear your thoughts about this. I'll be back here with another post next week!