classification
Title: sqlite3 driver fails on four byte unicode strings coming from JSON_EXTRACT
Type: crash Stage:
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ghaering, serhiy.storchaka, zzzeek
Priority: normal Keywords:

Created on 2019-11-08 19:42 by zzzeek, last changed 2019-11-08 22:50 by zzzeek.

Messages (7)
msg356257 - (view) Author: mike bayer (zzzeek) * Date: 2019-11-08 19:42
When using unicode characters inside of JSON strings, values retrieved via the JSON_EXTRACT SQLite function fail to be decoded by the sqlite3 driver if they include four-byte unicode characters.

Version information for my build, which is Fedora 30:

Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> sqlite3.version
'2.6.0'
>>> sqlite3.sqlite_version
'3.26.0'


Demo as follows:

import json
import sqlite3

# two unicode strings, the second one has four byte character in it
good_data = "réve illé"
bad_data = "réve🐍 illé"

# create simple json structures
good_data_json = json.dumps({"foo": good_data})
bad_data_json = json.dumps({"foo": bad_data})

# all strings are valid utf-8
# data round trips correctly through json
assert json.loads(good_data_json.encode("utf-8").decode("utf-8")) == {
    "foo": good_data
}
assert json.loads(bad_data_json.encode("utf-8").decode("utf-8")) == {
    "foo": bad_data
}


conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("CREATE TABLE some_data (id INT, data JSON)")
cursor.executemany(
    "INSERT INTO some_data(id, data) VALUES(?, ?)",
    [(1, good_data_json), (2, bad_data_json)],
)

# we can retrieve the JSON objects as a whole from the DB, no issue
cursor.execute("SELECT some_data.data FROM some_data ORDER BY id")
assert cursor.fetchall() == [(good_data_json, ), (bad_data_json, )]

# when we use JSON_EXTRACT, then full utf-8 support is lost

# extract good value from JSON object
cursor.execute("""
    SELECT JSON_EXTRACT(some_data.data, '$."foo"')
    FROM some_data WHERE id=1
""")
assert cursor.fetchone()[0] == good_data

# extract bad value from JSON object; utf-8 failure
# sqlite3.OperationalError: Could not decode to UTF-8 column
# 'JSON_EXTRACT(some_data.data, '$."foo"')' with text 'r��ve������ ill��'
cursor.execute("""
    SELECT JSON_EXTRACT(some_data.data, '$."foo"')
    FROM some_data WHERE id=2
""")
assert cursor.fetchone()[0] == bad_data


output:


Traceback (most recent call last):
  File "test4.py", line 50, in <module>
    """)
sqlite3.OperationalError: Could not decode to UTF-8 column 'JSON_EXTRACT(some_data.data, '$."foo"')' with text 'r��ve������ ill��'


surprising to say the least as the SQLite driver has always been completely solid with all unicode, but there you go.
msg356262 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-08 20:56
sqlite> select JSON_EXTRACT('["t\u00e8\u015b\ud835\udd99"]', '$[0]');
tèś������
msg356263 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-08 20:57
Seems this is a bug in JSON_EXTRACT.
msg356266 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-08 21:36
I have reported a bug to SQLite. Will see an answer.

It could be possible to implement a workaround in Python if this behavior will not be fixed.
msg356268 - (view) Author: mike bayer (zzzeek) * Date: 2019-11-08 22:17
Hi where did you report it?    I don't see it on the mailing list or in their fossil tracker.
msg356269 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-08 22:22
It awaits moderator approval.
msg356270 - (view) Author: mike bayer (zzzeek) * Date: 2019-11-08 22:50
silly me thinking python devs had better access to SQLite devs :)
History
Date User Action Args
2019-11-08 22:50:54zzzeeksetmessages: + msg356270
2019-11-08 22:22:24serhiy.storchakasetmessages: + msg356269
2019-11-08 22:17:36zzzeeksetmessages: + msg356268
2019-11-08 21:36:52serhiy.storchakasetnosy: + ghaering
messages: + msg356266
2019-11-08 20:57:33serhiy.storchakasetmessages: + msg356263
2019-11-08 20:56:56serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg356262
2019-11-08 19:42:25zzzeekcreate