classification
Title: sqlite3 accepts strings it cannot (by default) return
Type: behavior Stage: resolved
Components: Unicode Versions: Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: unable to retrieve latin-1 encoded data from sqlite3
View: 6010
Assigned To: Nosy List: William.D..Colburn, ezio.melotti, jcea, terry.reedy
Priority: normal Keywords:

Created on 2012-12-26 06:59 by William.D..Colburn, last changed 2012-12-31 15:06 by jcea. This issue is now closed.

Messages (5)
msg178189 - (view) Author: William D. Colburn (William.D..Colburn) Date: 2012-12-26 06:59
This seems to be just like issue6010, which is closed as "not as bug" which is simple irresponsible on the part of Python.

The problem is that I can store data into sqlite3 which cannot be retrieved.  The data I encountered was a string with a \xa0 in it that was read from urllib2.  I deposited that data into an sqlite3 database with no trouble at all.  Later, however, sqlite3 would throw an exception when I tried to retrieve the data.

Martin v. Löwis says this is user error because sqlite3 assumes UTF-8. It makes no sense that storeable data cannot be retrieved and that the failure of sqlite3 is a user-error.  If sqlite3 assumes UTF-8, then the error should have been caught when I did the store because I was not storing UTF-8 data.  Accepting bad input and blaming the user later down the line is bad bug handling, especially because the two events might be separated by a tremendous gap of time and code which makes debugging quite hard.  I was only able to find the errant character by popping the streamed webpage into emacs as both python and firefox showed me a normal looking string with nothing wrong with it.
msg178451 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-29 01:04
Opening a duplicate issue to rant against the developers is not responsible behavior. Since you do not seem to understand Martin's 2.x solution, ask for help on python-list or elsewhere (and read below). The proper fix for multiple Unicode and text coding problems was and is to use Unicode for text, as we did and do in 3.x.

Note that while we link to sqlite3 with a Python interface, and choose that as the database to link to in the stdlib, we do not control sqlite3 itself. As documented and as Martin wrote, sqlite *assumes*, by default, that byte-encoded text handed to it is error-free utf-8 encoded. However, docs and Martin both say that you can override that assumption by replacing its text_factory. Sqlite should not reject *any* bytes because anything *could* be just what the use intended.

The problem of multiple byte encodings for text and of encoding info getting separated from encoded bytes is a general one. We constantly get questions on python-list like "how do I determine the real encoding of a web page if the encoding information is missing or wrong". We are doing our part to solve it by using unicode for text and pushing utf-8 as the one, true encoding that everyone should use whenever possible.

If you need more explanation, try python-list, as I said before.
msg178454 - (view) Author: William D. Colburn (William.D..Colburn) Date: 2012-12-29 01:15
I'm ranting against an improper bug closure and bad behavior from the
python sqlite3.  Treating invalid data as sometimes valid and sometime
as invalid is a problem.  Problems with a Python module are not
anuser-error, they are an implementation error.  Users should never
have to work around bugs in Python.

I'm saddened by your poor response to this issue.

On Fri, Dec 28, 2012 at 6:04 PM, Terry J. Reedy <report@bugs.python.org> wrote:
>
> Terry J. Reedy added the comment:
>
> Opening a duplicate issue to rant against the developers is not responsible behavior. Since you do not seem to understand Martin's 2.x solution, ask for help on python-list or elsewhere (and read below). The proper fix for multiple Unicode and text coding problems was and is to use Unicode for text, as we did and do in 3.x.
>
> Note that while we link to sqlite3 with a Python interface, and choose that as the database to link to in the stdlib, we do not control sqlite3 itself. As documented and as Martin wrote, sqlite *assumes*, by default, that byte-encoded text handed to it is error-free utf-8 encoded. However, docs and Martin both say that you can override that assumption by replacing its text_factory. Sqlite should not reject *any* bytes because anything *could* be just what the use intended.
>
> The problem of multiple byte encodings for text and of encoding info getting separated from encoded bytes is a general one. We constantly get questions on python-list like "how do I determine the real encoding of a web page if the encoding information is missing or wrong". We are doing our part to solve it by using unicode for text and pushing utf-8 as the one, true encoding that everyone should use whenever possible.
>
> If you need more explanation, try python-list, as I said before.
>
> ----------
> nosy: +terry.reedy
> resolution:  -> duplicate
> status: open -> closed
> title: sqlite3 accepts strings it cannot return -> sqlite3 accepts strings it cannot (by default) return
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue16783>
> _______________________________________
msg178586 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-12-30 16:32
> Treating invalid data as sometimes valid and sometime as invalid is a problem. 

What is valid is defined by your application.  AFAIU sqlite3 defaults to utf-8, but it's able to work with latin1 data as well.  The fact that you are mixing utf-8 and latin1 is an error in your application, and while it might be nice if sqlite3 warned you about it, it's not necessarily its responsibility.  Even thought it's a really bad idea, you might want to store data with different encodings and still being able to retrieve them using the right text_factory.

However while trying to reproduce the issue I noticed a possible inconsistency and reported it on #6010.
msg178629 - (view) Author: William D. Colburn (William.D..Colburn) Date: 2012-12-31 00:00
What it should do is be consistent (predictable) in it's handling of
input and output.  If it accepts unicode and outputs unicode, then it
should accept unicode and output unicode, not accept garbage and then
barf.  Valid data should be consistantly valid, and invalid data
should be consistently invalid.   It is bad behavior for a system to
allow invalid data in that can't be recovered or handled.

On Sun, Dec 30, 2012 at 9:32 AM, Ezio Melotti <report@bugs.python.org> wrote:
>
> Ezio Melotti added the comment:
>
>> Treating invalid data as sometimes valid and sometime as invalid is a problem.
>
> What is valid is defined by your application.  AFAIU sqlite3 defaults to utf-8, but it's able to work with latin1 data as well.  The fact that you are mixing utf-8 and latin1 is an error in your application, and while it might be nice if sqlite3 warned you about it, it's not necessarily its responsibility.  Even thought it's a really bad idea, you might want to store data with different encodings and still being able to retrieve them using the right text_factory.
>
> However while trying to reproduce the issue I noticed a possible inconsistency and reported it on #6010.
>
> ----------
> stage:  -> committed/rejected
> superseder:  -> unable to retrieve latin-1 encoded data from sqlite3
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue16783>
> _______________________________________
History
Date User Action Args
2012-12-31 15:06:16jceasetnosy: + jcea
2012-12-31 00:00:11William.D..Colburnsetmessages: + msg178629
2012-12-30 16:32:42ezio.melottisetsuperseder: unable to retrieve latin-1 encoded data from sqlite3
messages: + msg178586
stage: resolved
2012-12-29 01:15:41William.D..Colburnsetmessages: + msg178454
2012-12-29 01:04:23terry.reedysetstatus: open -> closed
title: sqlite3 accepts strings it cannot return -> sqlite3 accepts strings it cannot (by default) return
nosy: + terry.reedy

messages: + msg178451

resolution: duplicate
2012-12-26 06:59:16William.D..Colburncreate