This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Custom functions in sqlite receive None on invalid UTF-8
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ingo Ruhnke, berker.peksag, ghaering, palaviv
Priority: normal Keywords: patch

Created on 2016-12-20 09:24 by Ingo Ruhnke, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
29021.patch palaviv, 2017-01-05 22:22 review
29021-fixed.patch palaviv, 2017-01-05 22:51 review
Messages (7)
msg283674 - (view) Author: Ingo Ruhnke (Ingo Ruhnke) Date: 2016-12-20 09:24
When a sqlite database contains invalid UTF-8 code in a TEXT column, Python can query that data normally when .text_factory is set appropriately. However when a custom function is created with .create_function() and applied to that column the custom function will receive 'None' as argument instead of the value of the column.

The following example demonstrate the issue:

Example:
--------

import sqlite3
import sys
import os

con = sqlite3.connect(":memory:")
con.text_factory = os.fsdecode

con.create_function("py_identity", 1, lambda x: x)

cur = con.cursor()
cur.execute("CREATE TABLE foo(bar TEXT)")

# insert some invalid UTF-8 into the database
cur.execute("INSERT INTO foo(bar) VALUES(cast(? AS TEXT))", [b"\xff"])

# try to call a custom function on the invalid UTF-8
cur.execute("SELECT "
            "  typeof(bar), "
            "  bar, " # this works
            "  py_identity(bar), " # this returns None instead of the content of 'bar'
            "  cast(py_identity(cast(bar as BLOB)) AS TEXT) " # this works around the issue
            "FROM foo")

for row in cur:
    print(row)

Output:
-------

('text', '\udcff', None, '\udcff')


Expected:
---------

('text', '\udcff', '\udcff', '\udcff')
msg284673 - (view) Author: Aviv Palivoda (palaviv) * Date: 2017-01-04 22:07
The problem is in _pysqlite_build_py_params function at connection.c. In case we have text we do the following code:

case SQLITE_TEXT:
      val_str = (const char*)sqlite3_value_text(cur_value);
      cur_py_value = PyUnicode_FromString(val_str);
      /* TODO: have a way to show errors here */
      if (!cur_py_value) {
             PyErr_Clear();
             Py_INCREF(Py_None);
             cur_py_value = Py_None;
      }
      break;

As you can see we call PyUnicode_FromString instead of text_factory.

I started making a patch to fix this by passing the text_factory to _pysqlite_build_py_params function but I currently have a problem with setting the result to the sqlite. User text_factory may return any type of object and I can't see how to handle that...
msg284700 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2017-01-05 03:53
We can just fallback to the current behavior (i.e. return None) if the return type of text_factory(val_str) is not str, bytes or bytearray. I think this is also somewhat similar to what we do for the return values of sqlite3_column_text().
msg284776 - (view) Author: Aviv Palivoda (palaviv) * Date: 2017-01-05 22:22
I actually was wrong and there is no problem with the type that is returned from the text_factory as it passes directly as argument for the create_function function.

Attached is a patch that change create_function and create_aggregate to use the text_factory when a TEXT data type is received. I am now using sqlite3_create_function_v2 but this is fine because we use sqlite3_stmt_readonly and it is from a newer version.

As for Ingo example code it still don't work with this current fix but this is due to a different problem. Now the _pysqlite_set_result function fail when we do PyUnicode_AsUTF8 on the result from py_identity. As this is a different problem I will fix this in a different patch.
msg284783 - (view) Author: Aviv Palivoda (palaviv) * Date: 2017-01-05 22:51
Actually had a small mistake in the patch I uploaded. Uploading a fixed one.
msg284924 - (view) Author: Aviv Palivoda (palaviv) * Date: 2017-01-07 16:12
After looking more into the _pysqlite_set_result function fail in Ingo example I think this is the expected behavior. The PyUnicode_AsUTF8 fails but this is expected as the value is an invalid UTF-8.
msg291726 - (view) Author: Aviv Palivoda (palaviv) * Date: 2017-04-15 18:18
In my patch I use sqlite3_create_function_v2 which was added in sqlite 3.7.3 (2010-10-08). There were a few problems after adding sqlite3_stmt_readonly in 284676cf2ac8 as can be seen in issue #29355. sqlite3_stmt_readonly was added in 3.7.4 (2010-12-07) so I guess using sqlite3_create_function_v2 will cause the same problems.

Do you think I should find another way to fix this issue? I can follow all the user functions and free on my own but that will make the code more complex. On the other hand I can add this fix in ifdef and the behavior will be different when using sqlite in versions smaller then 3.7.3.
History
Date User Action Args
2022-04-11 14:58:40adminsetgithub: 73207
2017-04-15 18:18:47palavivsetmessages: + msg291726
2017-01-07 16:12:13palavivsetmessages: + msg284924
2017-01-05 22:51:39palavivsetfiles: + 29021-fixed.patch

messages: + msg284783
2017-01-05 22:22:16palavivsetfiles: + 29021.patch
keywords: + patch
messages: + msg284776
2017-01-05 03:53:03berker.peksagsetversions: + Python 3.7
nosy: + berker.peksag

messages: + msg284700

stage: needs patch
2017-01-04 22:07:23palavivsetnosy: + palaviv
messages: + msg284673
2016-12-20 22:06:46ned.deilysetnosy: + ghaering
2016-12-20 09:24:51Ingo Ruhnkecreate