classification
Title: Byte/string inconsistencies between different dbm modules
Type: behavior Stage: commit review
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: brett.cannon Nosy List: barry, brett.cannon, christian.heimes, gvanrossum, jcea, skip.montanaro
Priority: release blocker Keywords: needs review, patch

Created on 2008-09-06 21:59 by skip.montanaro, last changed 2008-11-25 19:19 by brett.cannon. This issue is now closed.

Files
File name Uploaded Description Edit
dumb.diff skip.montanaro, 2008-09-08 00:26
dumb_to_utf8.diff brett.cannon, 2008-11-20 00:33 Move dbm.dumb to UTF-8 internally with writing to disk in Latin-1
doc_dbm_strings.diff brett.cannon, 2008-11-21 00:53 Document that dbm implementations store in bytes and convert as needed
specify_open_encoding.diff brett.cannon, 2008-11-21 01:28
mydb2write.py skip.montanaro, 2008-11-21 15:42 dumbdbm write (Python 2.5)
mydb3read.py skip.montanaro, 2008-11-21 15:42 dbm.dumb read script (Python 3.0)
pickle2write.py brett.cannon, 2008-11-21 18:24 Write out a simple user-defined class in 2.x to a file
pickle3read.py brett.cannon, 2008-11-21 18:25 Read a pickle file for a user-defined class in Py3K
mydb2write.py brett.cannon, 2008-11-21 18:25 My version of Skip's acid code working with encoded Unicode
mydb3read.py brett.cannon, 2008-11-21 18:26 My version of Skip's acid code working that reads the 2.x-written db
Messages (29)
msg72711 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-09-06 21:59
Consider these two timeit commands:

py3k% python3.0 -m timeit -s 'import dbm.ndbm as db' -s 'f = 
db.open("/tmp/trash.db", "c")' 'for i in range(1000): f[str(i)] = str(i)'
100 loops, best of 3: 5.51 msec per loop
py3k% python3.0 -m timeit -s 'import dbm.dumb as db' -s 'f = 
db.open("/tmp/trash.db", "c")' 'for i in range(1000): f[str(i)] = str(i)'
Traceback (most recent call last):
  File "/Users/skip/local/lib/python3.0/timeit.py", line 297, in main
    x = t.timeit(number)
  File "/Users/skip/local/lib/python3.0/timeit.py", line 193, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 7, in inner
    for i in range(1000): f[str(i)] = str(i)
  File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 165, in 
__setitem__
    raise TypeError("keys must be bytes")
TypeError: keys must be bytes

Seems to me they should either both succeed or both fail.  What are keys
and values supposed to be for these modules?

Marking it as high priority.  When 3.0 is released all these modules
should probably agree.
msg72740 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-07 15:29
Making this into a release blocker just so someone will look at it.
Needs to be decided & fixed by rc2 at the very latest.
msg72743 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-09-07 17:03
Extra data point.  I tried

  f["1"] = "a"

and

  f[b"1"] = "a"

with dbm.{gnu,ndbm,dumb,sqlite}.  All worked with bytes.  A except
dbm.dumb worked with strings.  (dbm.sqlite is my little proof-of-concept
module currently in the sandbox.)
msg72745 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-07 17:33
How hard would it be to fix dbm.dumb to accept strings as well?
msg72757 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-09-08 00:26
I'm not sure.  I've never done anything with the io module.  Simply
eliminating the bytes checks and letting it try to write the string
yields:

  File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 170, in 
__setitem__
    self._addkey(key, self._addval(val))
  File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 138, in 
_addval
    f.write(val)
  File "/Users/skip/local/lib/python3.0/io.py", line 1224, in write
    return BufferedWriter.write(self, b)
  File "/Users/skip/local/lib/python3.0/io.py", line 1034, in write
    raise TypeError("can't write str to binary stream")

I suppose you'd have to check if val is an instance of str and if so,
encode it as utf-8.  I notice in the existing code that it's doing
some key decoding assuming latin-1.  That would be an incompatibility,
but I think assuming latin-1 is wrong.

That said, I've attached a patch which passes all current unit tests.

Skip
msg72963 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-10 14:12
I think this isn't quite right.

Ideally a fix should maintain several important properties:

(1) Be able to read databases written by Python 2.x.

(1a) Write databases readable by Python 2.x.

(2) Use the same mapping between str and bytes as the other *dbm
libraries have.

(2a) Return the same value for keys() as the other *dbm libraries
(except for order).

I think (2) means that we should use UTF-8 to convert str keys to bytes,
but (1) means we should use Latin-1 to convert keys to str *upon writing
the index*.  This will also satisfy (1a).  The keys maintained
internally should be kept in bytes to satsfy (2a).

PS. I noticed the dbm module still returns bytearrays for keys and
values. Is there a bug open to change those to bytes?
msg74655 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-11 02:05
For information, Python3 trunk fails on:
test.support.TestFailed: Traceback (most recent call last):
  File "Lib/test/test_dbm.py", line 157, in test_keys
    self.assert_('xxx' not in self.d)
TypeError: gdbm key must be bytes, not str
msg75537 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-11-05 23:21
The patch causes three errors:

======================================================================
ERROR: test_anydbm_access (test.test_dbm.TestCase-dbm.dumb)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 92, in
test_anydbm_access
    f = dbm.open(_fname, 'r')
  File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
    raise error[0]("need 'c' or 'n' flag to open new db")
dbm.error: need 'c' or 'n' flag to open new db

======================================================================
ERROR: test_anydbm_keys (test.test_dbm.TestCase-dbm.dumb)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 86, in
test_anydbm_keys
    f = dbm.open(_fname, 'r')
  File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
    raise error[0]("need 'c' or 'n' flag to open new db")
dbm.error: need 'c' or 'n' flag to open new db

======================================================================
ERROR: test_anydbm_read (test.test_dbm.TestCase-dbm.dumb)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 80, in
test_anydbm_read
    f = dbm.open(_fname, 'r')
  File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
    raise error[0]("need 'c' or 'n' flag to open new db")
dbm.error: need 'c' or 'n' flag to open new db

----------------------------------------------------------------------
Ran 16 tests in 0.429s

FAILED (errors=3)
msg75541 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-11-06 03:07
I don't see enough progress on this issue, and I'm not going to hold up
3.0rc2 for it, so I'm deferring.
msg76080 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-19 23:33
If you look at the 2.7 code all it requires of keys and values in
__setitem__ is that they are strings; there is nothing about Latin-1 in
terms of specific encoding (must be a 3.0 addition to make the
str/unicode transition the easiest). That would suggest to me that
assuming that previous DBs were written in Latin-1 is somewhat bogus as
people could have passed in any str encoded in any format as a DB key or
value.

Thus I think going down the UTF-8 route is the right thing to do for
string arguments. A quick look at _gdbmmodule.c supports this as it just
converts its arguments through PyArg_Parse("s#") to get its keys and
thus uses UTF-8 as the default encoding.
msg76082 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-19 23:52
OK, now I see why it is called 'dumb'; the thing literally just dumps
out the repr of two strings on each line for key/value pairs. To read it
just evals each line in the string. And whichdb detects this format by
looking for ' or " as the first character since that is what the repr
for str is. And that is why the Latin-1 decoding was used; it has the
proper repr for write-out.
msg76085 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-20 00:33
I have attached a file that does everything internally as UTF-8 but
reads and writes to disk as Latin-1. I added some unit tests to verify
that taking a character encoded in UTF-8 or as a string still works
properly regardless of whether one uses the string or bytes version of
the key.
msg76155 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 00:18
r67310 has the fix.
msg76156 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 00:53
I am re-opening this as a deferred blocker with a patch to document that
dbm implementations write out and return bytes, but that strings are
accepted and implicitly converted.
msg76157 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 01:28
Have another patch that fixes all open() calls to specify the file
encoding in dbm.dumb. Also caught one spot in _addkey() where
decode("Latin-1") was not being called.
msg76186 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-11-21 15:42
damn... my cc to report@bugs.python.org didn't work.  Here's the recap
(message to python-checkins):

    me> ... I thought Guido was of the opinion that the 3.0 version 
should
    me> be able to read dumb dbms written by earlier Python versions....

And write them.  From msg72963:

    (1) Be able to read databases written by Python 2.x.

    (1a) Write databases readable by Python 2.x.

Ah, but wait a minute.  I see your comment in msg76080:

    If you look at the 2.7 code all it requires of keys and values in
    __setitem__ is that they are strings; there is nothing about Latin-1 
in
    terms of specific encoding (must be a 3.0 addition to make the
    str/unicode transition the easiest).

The acid test.  I executed the attached mydb2write.py using Python 2.5 
then
executed the attached mydb3read.py using Python 3.0.  The output:

    % python2.5 mydb2write.py 
    1 abc
    2 [4, {4.2999999999999998: 12}]
    3 <__main__.C instance at 0x34bb70>
    % python3.0 mydb3read.py
    1 b'abc'
    2 [4, {4.2999999999999998: 12}]
    Traceback (most recent call last):
      File "mydb3read.py", line 13, in <module>
        print(3, pickle.loads(db['3']))
      File "/Users/skip/local/lib/python3.0/pickle.py", line 1329, in 
loads
        return Unpickler(file, encoding=encoding, errors=errors).load()
    _pickle.UnpicklingError: bad pickle data

so if the ability to read Python 2.x dumbdbm files is still a 
requirement I
think there's a little more work to do.
msg76187 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-11-21 15:44
I think the ability to read old files is essential. The ability to
write them is a mer nice-to-have.

On Fri, Nov 21, 2008 at 7:36 AM,  <skip@pobox.com> wrote:
>
>    me> ... I thought Guido was of the opinion that the 3.0 version should
>    me> be able to read dumb dbms written by earlier Python versions....
>
> And write them.  From msg72963:
>
>    (1) Be able to read databases written by Python 2.x.
>
>    (1a) Write databases readable by Python 2.x.
>
> Ah, but wait a minute.  I see your comment in msg76080:
>
>    If you look at the 2.7 code all it requires of keys and values in
>    __setitem__ is that they are strings; there is nothing about Latin-1 in
>    terms of specific encoding (must be a 3.0 addition to make the
>    str/unicode transition the easiest).
>
> The acid test.  I executed the attached mydb2write.py using Python 2.5 then
> executed the attached mydb3read.py using Python 3.0.  The output:
>
>    % python2.5 mydb2write.py
>    1 abc
>    2 [4, {4.2999999999999998: 12}]
>    3 <__main__.C instance at 0x34bb70>
>    % python3.0 mydb3read.py
>    1 b'abc'
>    2 [4, {4.2999999999999998: 12}]
>    Traceback (most recent call last):
>      File "mydb3read.py", line 13, in <module>
>        print(3, pickle.loads(db['3']))
>      File "/Users/skip/local/lib/python3.0/pickle.py", line 1329, in loads
>        return Unpickler(file, encoding=encoding, errors=errors).load()
>    _pickle.UnpicklingError: bad pickle data
>
> so if the ability to read Python 2.x dumbdbm files is still a requirement I
> think there's a little more work to do.
>
> cc'ing report@bugs.python.org to preserve the scripts with the ticket.
>
> Skip
>
>
> _______________________________________________
> Python-3000-checkins mailing list
> Python-3000-checkins@python.org
> http://mail.python.org/mailman/listinfo/python-3000-checkins
>
>
msg76188 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-11-21 15:45
Reading old dumbdbm files is essential. Writing them is not.
msg76189 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-11-21 15:48
[fix title]
msg76195 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 18:24
So the use of pickle is not fair as that doesn't round-trip if you
simply write out a file because py3k pickle doesn't like classic classes
and the new-style classes want copy_reg which is not in existence in
py3k since it was renamed. See pickle2write.py and pickle3read.py (which
are version-agnostic; just following Skip's naming) for the pickle failing.

Now if you skip that one use-case in the example of pickling a
user-defined class then everything works out. I have attached a
hacked-up version of Skip's scripts that take Unicode strings, encode
them in Latin-1 and UTF-8, and then decode them on the other side in
Py3K, all without issue.

In other words I think my solution works and pickle is the trouble-maker
in all of this.
msg76196 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-11-21 18:32
Brett> In other words I think my solution works and pickle is the
    Brett> trouble-maker in all of this.

I can buy that.  Should pickle map "copy_reg" to "copyreg"?  Is that the
only incompatibility?

Actually, it seems this ticket should be closed and another opened about
this pickle issue.

Skip
msg76197 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 18:40
On Fri, Nov 21, 2008 at 10:32, Skip Montanaro <report@bugs.python.org> wrote:
>
> Skip Montanaro <skip@pobox.com> added the comment:
>
> Brett> In other words I think my solution works and pickle is the
>    Brett> trouble-maker in all of this.
>
> I can buy that.  Should pickle map "copy_reg" to "copyreg"?  Is that the
> only incompatibility?
>
> Actually, it seems this ticket should be closed and another opened about
> this pickle issue.
>

Well, I still need a code review for my latest patch that changes the
docs, cleans up gdbm and ndbm exception messages, catches a place
where I didn't decode before write-out, and adds an 'encoding'
argument to all open() calls.
msg76198 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-11-21 19:01
One doc nit: There is still reference to ``gdbm`` and Dbm (or dbm) objects
when they should probably use ``dbm.gnu`` and ``dbm.ndbm``, respectively.

I'm confused by the encoding="Latin-1" args to _io.open for dbm.dumb.  I
thought the default enoding was going to be utf-8, and I see no way to
influence that.  Is that just to make sure all code points can be written
without error?

Skip
msg76199 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-21 19:54
On Fri, Nov 21, 2008 at 11:01, Skip Montanaro <report@bugs.python.org> wrote:
>
> Skip Montanaro <skip@pobox.com> added the comment:
>
> One doc nit: There is still reference to ``gdbm`` and Dbm (or dbm) objects
> when they should probably use ``dbm.gnu`` and ``dbm.ndbm``, respectively.
>

OK, I will fix that and upload a new patch at some point.

> I'm confused by the encoding="Latin-1" args to _io.open for dbm.dumb.  I
> thought the default enoding was going to be utf-8, and I see no way to
> influence that.  Is that just to make sure all code points can be written
> without error?
>

It's so that when writing out there won't be any errors. Since the
repr of strings are used instead of bytes the stuff passed in and
written to disk must be represented in an encoding that will never
complain about what bytes it gets. Latin-1 does this while UTF-8. And
since everything is being written and read in Latin-1 I figured I
might as well be thorough and specify the encoding.
msg76245 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-11-22 13:22
py3k patched with specify_open_encoding.diff passes test_dbm_dumb on my
Mac (Leopard, Intel).  Might as well assign this to Brett.  He seems to
be doing all the heavy lifting anyway. ;-)

Skip
msg76363 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-24 21:11
specify_open_encoding.diff has been committed in r67369.

I still need a review for doc_dbm_strings.diff, though, which clarifies
the docs, fixes one oversight in dbm.dumb, and extends testing to make
sure strings can be accepted.
msg76371 - (view) Author: Skip Montanaro (skip.montanaro) * (Python committer) Date: 2008-11-25 00:25
Brett> I still need a review for doc_dbm_strings.diff, though, which
    Brett> clarifies the docs, fixes one oversight in dbm.dumb, and extends
    Brett> testing to make sure strings can be accepted.

Was my comment

    http://bugs.python.org/msg76198

and your resonse

    http://bugs.python.org/msg76199

not sufficient?

If not, allow me to anoint said diff with virtual holy water now...

Skip
msg76378 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-25 01:11
On Mon, Nov 24, 2008 at 16:25, Skip Montanaro <report@bugs.python.org> wrote:
>
> Skip Montanaro <skip@pobox.com> added the comment:
>
> Brett> I still need a review for doc_dbm_strings.diff, though, which
>    Brett> clarifies the docs, fixes one oversight in dbm.dumb, and extends
>    Brett> testing to make sure strings can be accepted.
>
> Was my comment
>
>    http://bugs.python.org/msg76198
>
> and your resonse
>
>    http://bugs.python.org/msg76199
>
> not sufficient?
>

Wasn't sure if that meant everything not mentioned was fine.

> If not, allow me to anoint said diff with virtual holy water now...
>

=) Thanks!
msg76420 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2008-11-25 19:19
r67380 has the fix. Thanks for the review, Skip!
History
Date User Action Args
2008-11-25 19:19:36brett.cannonsetstatus: open -> closed
resolution: fixed
messages: + msg76420
2008-11-25 01:11:50brett.cannonsetmessages: + msg76378
2008-11-25 00:25:02skip.montanarosetmessages: + msg76371
2008-11-24 21:12:18brett.cannonsetcomponents: + Library (Lib)
stage: needs patch -> commit review
2008-11-24 21:11:59brett.cannonsetmessages: + msg76363
2008-11-22 13:22:25skip.montanarosetassignee: brett.cannon
messages: + msg76245
2008-11-21 22:45:32brett.cannonlinkissue4382 dependencies
2008-11-21 19:54:15brett.cannonsetmessages: + msg76199
2008-11-21 19:08:29hayposetnosy: - haypo
2008-11-21 19:01:43skip.montanarosetmessages: + msg76198
2008-11-21 18:40:03brett.cannonsetmessages: + msg76197
2008-11-21 18:32:02skip.montanarosetmessages: + msg76196
2008-11-21 18:26:36brett.cannonsetresolution: accepted -> (no value)
2008-11-21 18:26:22brett.cannonsetfiles: + mydb3read.py
2008-11-21 18:25:56brett.cannonsetfiles: + mydb2write.py
2008-11-21 18:25:16brett.cannonsetfiles: + pickle3read.py
2008-11-21 18:24:52brett.cannonsetfiles: + pickle2write.py
messages: + msg76195
stage: commit review -> needs patch
2008-11-21 15:48:13gvanrossumsetmessages: + msg76189
title: Re: r67310 - in python/branches/py3k: Lib/dbm/dumb.py Lib/test/test_dbm_dumb.py Misc/NEWS -> Byte/string inconsistencies between different dbm modules
2008-11-21 15:45:43gvanrossumsetmessages: + msg76188
2008-11-21 15:44:57gvanrossumsetmessages: + msg76187
title: Byte/string inconsistencies between different dbm modules -> Re: r67310 - in python/branches/py3k: Lib/dbm/dumb.py Lib/test/test_dbm_dumb.py Misc/NEWS
2008-11-21 15:42:53skip.montanarosetfiles: + mydb3read.py
2008-11-21 15:42:35skip.montanarosetfiles: + mydb2write.py
nosy: + skip.montanaro
messages: + msg76186
2008-11-21 15:17:21barrysetpriority: deferred blocker -> release blocker
2008-11-21 01:28:16brett.cannonsetfiles: + specify_open_encoding.diff
messages: + msg76157
2008-11-21 00:53:59brett.cannonsetstatus: closed -> open
files: + doc_dbm_strings.diff
messages: + msg76156
priority: release blocker -> deferred blocker
2008-11-21 00:18:16brett.cannonsetstatus: open -> closed
resolution: accepted
messages: + msg76155
2008-11-20 18:44:17brett.cannonsetkeywords: + needs review
stage: commit review
2008-11-20 00:33:03brett.cannonsetfiles: + dumb_to_utf8.diff
messages: + msg76085
2008-11-19 23:52:11brett.cannonsetmessages: + msg76082
2008-11-19 23:33:56brett.cannonsetnosy: + brett.cannon
messages: + msg76080
2008-11-07 13:30:55barrysetpriority: deferred blocker -> release blocker
2008-11-06 03:28:08skip.montanarosetnosy: - skip.montanaro
2008-11-06 03:07:50barrysetpriority: release blocker -> deferred blocker
nosy: + barry
messages: + msg75541
2008-11-05 23:21:07christian.heimessetnosy: + christian.heimes
messages: + msg75537
2008-10-11 02:05:15hayposetnosy: + haypo
messages: + msg74655
2008-10-02 21:04:09bboissinsetnosy: - bboissin
2008-10-02 20:54:11bboissinsetnosy: + bboissin
2008-10-02 20:14:59jceasetnosy: + jcea
2008-10-02 12:54:38barrysetpriority: deferred blocker -> release blocker
2008-09-26 22:19:04barrysetpriority: release blocker -> deferred blocker
2008-09-18 05:43:36barrysetpriority: deferred blocker -> release blocker
2008-09-10 14:12:40gvanrossumsetmessages: + msg72963
2008-09-09 13:12:15barrysetpriority: release blocker -> deferred blocker
2008-09-08 00:26:15skip.montanarosetfiles: + dumb.diff
keywords: + patch
messages: + msg72757
2008-09-07 17:33:10gvanrossumsetmessages: + msg72745
2008-09-07 17:03:21skip.montanarosetmessages: + msg72743
2008-09-07 15:29:28gvanrossumsetpriority: high -> release blocker
nosy: + gvanrossum
messages: + msg72740
2008-09-06 21:59:32skip.montanarocreate