Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Byte/string inconsistencies between different dbm modules #48049

Closed
smontanaro opened this issue Sep 6, 2008 · 29 comments
Closed

Byte/string inconsistencies between different dbm modules #48049

smontanaro opened this issue Sep 6, 2008 · 29 comments
Assignees
Labels
release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@smontanaro
Copy link
Contributor

BPO 3799
Nosy @gvanrossum, @smontanaro, @warsaw, @brettcannon, @jcea, @tiran
Files
  • dumb.diff
  • dumb_to_utf8.diff: Move dbm.dumb to UTF-8 internally with writing to disk in Latin-1
  • doc_dbm_strings.diff: Document that dbm implementations store in bytes and convert as needed
  • specify_open_encoding.diff
  • mydb2write.py: dumbdbm write (Python 2.5)
  • mydb3read.py: dbm.dumb read script (Python 3.0)
  • pickle2write.py: Write out a simple user-defined class in 2.x to a file
  • pickle3read.py: Read a pickle file for a user-defined class in Py3K
  • mydb2write.py: My version of Skip's acid code working with encoded Unicode
  • mydb3read.py: My version of Skip's acid code working that reads the 2.x-written db
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/brettcannon'
    closed_at = <Date 2008-11-25.19:19:36.244>
    created_at = <Date 2008-09-06.21:59:31.995>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'Byte/string inconsistencies between different dbm modules'
    updated_at = <Date 2008-11-25.19:19:36.202>
    user = 'https://github.com/smontanaro'

    bugs.python.org fields:

    activity = <Date 2008-11-25.19:19:36.202>
    actor = 'brett.cannon'
    assignee = 'brett.cannon'
    closed = True
    closed_date = <Date 2008-11-25.19:19:36.244>
    closer = 'brett.cannon'
    components = ['Library (Lib)']
    creation = <Date 2008-09-06.21:59:31.995>
    creator = 'skip.montanaro'
    dependencies = []
    files = ['11418', '12067', '12084', '12085', '12095', '12096', '12097', '12098', '12099', '12100']
    hgrepos = []
    issue_num = 3799
    keywords = ['patch', 'needs review']
    message_count = 29.0
    messages = ['72711', '72740', '72743', '72745', '72757', '72963', '74655', '75537', '75541', '76080', '76082', '76085', '76155', '76156', '76157', '76186', '76187', '76188', '76189', '76195', '76196', '76197', '76198', '76199', '76245', '76363', '76371', '76378', '76420']
    nosy_count = 6.0
    nosy_names = ['gvanrossum', 'skip.montanaro', 'barry', 'brett.cannon', 'jcea', 'christian.heimes']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3799'
    versions = ['Python 3.0']

    @smontanaro
    Copy link
    Contributor Author

    Consider these two timeit commands:

    py3k% python3.0 -m timeit -s 'import dbm.ndbm as db' -s 'f = 
    db.open("/tmp/trash.db", "c")' 'for i in range(1000): f[str(i)] = str(i)'
    100 loops, best of 3: 5.51 msec per loop
    py3k% python3.0 -m timeit -s 'import dbm.dumb as db' -s 'f = 
    db.open("/tmp/trash.db", "c")' 'for i in range(1000): f[str(i)] = str(i)'
    Traceback (most recent call last):
      File "/Users/skip/local/lib/python3.0/timeit.py", line 297, in main
        x = t.timeit(number)
      File "/Users/skip/local/lib/python3.0/timeit.py", line 193, in timeit
        timing = self.inner(it, self.timer)
      File "<timeit-src>", line 7, in inner
        for i in range(1000): f[str(i)] = str(i)
      File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 165, in 
    __setitem__
        raise TypeError("keys must be bytes")
    TypeError: keys must be bytes

    Seems to me they should either both succeed or both fail. What are keys
    and values supposed to be for these modules?

    Marking it as high priority. When 3.0 is released all these modules
    should probably agree.

    @smontanaro smontanaro added the type-bug An unexpected behavior, bug, or error label Sep 6, 2008
    @gvanrossum
    Copy link
    Member

    Making this into a release blocker just so someone will look at it.
    Needs to be decided & fixed by rc2 at the very latest.

    @smontanaro
    Copy link
    Contributor Author

    Extra data point. I tried

    f["1"] = "a"

    and

    f[b"1"] = "a"

    with dbm.{gnu,ndbm,dumb,sqlite}. All worked with bytes. A except
    dbm.dumb worked with strings. (dbm.sqlite is my little proof-of-concept
    module currently in the sandbox.)

    @gvanrossum
    Copy link
    Member

    How hard would it be to fix dbm.dumb to accept strings as well?

    @smontanaro
    Copy link
    Contributor Author

    I'm not sure. I've never done anything with the io module. Simply
    eliminating the bytes checks and letting it try to write the string
    yields:

    File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 170, in
    __setitem__
    self._addkey(key, self._addval(val))
    File "/Users/skip/local/lib/python3.0/dbm/dumb.py", line 138, in
    _addval
    f.write(val)
    File "/Users/skip/local/lib/python3.0/io.py", line 1224, in write
    return BufferedWriter.write(self, b)
    File "/Users/skip/local/lib/python3.0/io.py", line 1034, in write
    raise TypeError("can't write str to binary stream")

    I suppose you'd have to check if val is an instance of str and if so,
    encode it as utf-8. I notice in the existing code that it's doing
    some key decoding assuming latin-1. That would be an incompatibility,
    but I think assuming latin-1 is wrong.

    That said, I've attached a patch which passes all current unit tests.

    Skip

    @gvanrossum
    Copy link
    Member

    I think this isn't quite right.

    Ideally a fix should maintain several important properties:

    (1) Be able to read databases written by Python 2.x.

    (1a) Write databases readable by Python 2.x.

    (2) Use the same mapping between str and bytes as the other *dbm
    libraries have.

    (2a) Return the same value for keys() as the other *dbm libraries
    (except for order).

    I think (2) means that we should use UTF-8 to convert str keys to bytes,
    but (1) means we should use Latin-1 to convert keys to str *upon writing
    the index*. This will also satisfy (1a). The keys maintained
    internally should be kept in bytes to satsfy (2a).

    PS. I noticed the dbm module still returns bytearrays for keys and
    values. Is there a bug open to change those to bytes?

    @vstinner
    Copy link
    Member

    For information, Python3 trunk fails on:
    test.support.TestFailed: Traceback (most recent call last):
      File "Lib/test/test_dbm.py", line 157, in test_keys
        self.assert_('xxx' not in self.d)
    TypeError: gdbm key must be bytes, not str

    @tiran
    Copy link
    Member

    tiran commented Nov 5, 2008

    The patch causes three errors:

    ======================================================================
    ERROR: test_anydbm_access (test.test_dbm.TestCase-dbm.dumb)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 92, in
    test_anydbm_access
        f = dbm.open(_fname, 'r')
      File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
        raise error[0]("need 'c' or 'n' flag to open new db")
    dbm.error: need 'c' or 'n' flag to open new db

    ======================================================================
    ERROR: test_anydbm_keys (test.test_dbm.TestCase-dbm.dumb)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 86, in
    test_anydbm_keys
        f = dbm.open(_fname, 'r')
      File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
        raise error[0]("need 'c' or 'n' flag to open new db")
    dbm.error: need 'c' or 'n' flag to open new db

    ======================================================================
    ERROR: test_anydbm_read (test.test_dbm.TestCase-dbm.dumb)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/heimes/dev/python/py3k/Lib/test/test_dbm.py", line 80, in
    test_anydbm_read
        f = dbm.open(_fname, 'r')
      File "/home/heimes/dev/python/py3k/Lib/dbm/__init__.py", line 79, in open
        raise error[0]("need 'c' or 'n' flag to open new db")
    dbm.error: need 'c' or 'n' flag to open new db

    Ran 16 tests in 0.429s

    FAILED (errors=3)

    @warsaw
    Copy link
    Member

    warsaw commented Nov 6, 2008

    I don't see enough progress on this issue, and I'm not going to hold up
    3.0rc2 for it, so I'm deferring.

    @brettcannon
    Copy link
    Member

    If you look at the 2.7 code all it requires of keys and values in
    __setitem__ is that they are strings; there is nothing about Latin-1 in
    terms of specific encoding (must be a 3.0 addition to make the
    str/unicode transition the easiest). That would suggest to me that
    assuming that previous DBs were written in Latin-1 is somewhat bogus as
    people could have passed in any str encoded in any format as a DB key or
    value.

    Thus I think going down the UTF-8 route is the right thing to do for
    string arguments. A quick look at _gdbmmodule.c supports this as it just
    converts its arguments through PyArg_Parse("s#") to get its keys and
    thus uses UTF-8 as the default encoding.

    @brettcannon
    Copy link
    Member

    OK, now I see why it is called 'dumb'; the thing literally just dumps
    out the repr of two strings on each line for key/value pairs. To read it
    just evals each line in the string. And whichdb detects this format by
    looking for ' or " as the first character since that is what the repr
    for str is. And that is why the Latin-1 decoding was used; it has the
    proper repr for write-out.

    @brettcannon
    Copy link
    Member

    I have attached a file that does everything internally as UTF-8 but
    reads and writes to disk as Latin-1. I added some unit tests to verify
    that taking a character encoded in UTF-8 or as a string still works
    properly regardless of whether one uses the string or bytes version of
    the key.

    @brettcannon
    Copy link
    Member

    r67310 has the fix.

    @brettcannon
    Copy link
    Member

    I am re-opening this as a deferred blocker with a patch to document that
    dbm implementations write out and return bytes, but that strings are
    accepted and implicitly converted.

    @brettcannon
    Copy link
    Member

    Have another patch that fixes all open() calls to specify the file
    encoding in dbm.dumb. Also caught one spot in _addkey() where
    decode("Latin-1") was not being called.

    @smontanaro
    Copy link
    Contributor Author

    damn... my cc to report@bugs.python.org didn't work. Here's the recap
    (message to python-checkins):

    me\> ... I thought Guido was of the opinion that the 3.0 version 
    

    should
    me> be able to read dumb dbms written by earlier Python versions....

    And write them. From msg72963:

    (1) Be able to read databases written by Python 2.x.
    
    (1a) Write databases readable by Python 2.x.
    

    Ah, but wait a minute. I see your comment in msg76080:

    If you look at the 2.7 code all it requires of keys and values in
    \_\_setitem__ is that they are strings; there is nothing about Latin-1 
    

    in
    terms of specific encoding (must be a 3.0 addition to make the
    str/unicode transition the easiest).

    The acid test. I executed the attached mydb2write.py using Python 2.5
    then
    executed the attached mydb3read.py using Python 3.0. The output:

        % python2.5 mydb2write.py 
        1 abc
        2 [4, {4.2999999999999998: 12}]
        3 <__main__.C instance at 0x34bb70>
        % python3.0 mydb3read.py
        1 b'abc'
        2 [4, {4.2999999999999998: 12}]
        Traceback (most recent call last):
          File "mydb3read.py", line 13, in <module>
            print(3, pickle.loads(db['3']))
          File "/Users/skip/local/lib/python3.0/pickle.py", line 1329, in 
    loads
            return Unpickler(file, encoding=encoding, errors=errors).load()
        _pickle.UnpicklingError: bad pickle data

    so if the ability to read Python 2.x dumbdbm files is still a
    requirement I
    think there's a little more work to do.

    @gvanrossum
    Copy link
    Member

    I think the ability to read old files is essential. The ability to
    write them is a mer nice-to-have.

    On Fri, Nov 21, 2008 at 7:36 AM,  <skip@pobox.com> wrote:
    >
    >    me> ... I thought Guido was of the opinion that the 3.0 version should
    >    me> be able to read dumb dbms written by earlier Python versions....
    >
    > And write them.  From msg72963:
    >
    >    (1) Be able to read databases written by Python 2.x.
    >
    >    (1a) Write databases readable by Python 2.x.
    >
    > Ah, but wait a minute.  I see your comment in msg76080:
    >
    >    If you look at the 2.7 code all it requires of keys and values in
    >    __setitem__ is that they are strings; there is nothing about Latin-1 in
    >    terms of specific encoding (must be a 3.0 addition to make the
    >    str/unicode transition the easiest).
    >
    > The acid test.  I executed the attached mydb2write.py using Python 2.5 then
    > executed the attached mydb3read.py using Python 3.0.  The output:
    >
    >    % python2.5 mydb2write.py
    >    1 abc
    >    2 [4, {4.2999999999999998: 12}]
    >    3 <__main__.C instance at 0x34bb70>
    >    % python3.0 mydb3read.py
    >    1 b'abc'
    >    2 [4, {4.2999999999999998: 12}]
    >    Traceback (most recent call last):
    >      File "mydb3read.py", line 13, in <module>
    >        print(3, pickle.loads(db['3']))
    >      File "/Users/skip/local/lib/python3.0/pickle.py", line 1329, in loads
    >        return Unpickler(file, encoding=encoding, errors=errors).load()
    >    _pickle.UnpicklingError: bad pickle data
    >
    > so if the ability to read Python 2.x dumbdbm files is still a requirement I
    > think there's a little more work to do.
    >
    > cc'ing report@bugs.python.org to preserve the scripts with the ticket.
    >
    > Skip
    >
    >
    > _______________________________________________
    > Python-3000-checkins mailing list
    > Python-3000-checkins@python.org
    > http://mail.python.org/mailman/listinfo/python-3000-checkins
    >
    >

    @gvanrossum gvanrossum changed the title Byte/string inconsistencies between different dbm modules Re: r67310 - in python/branches/py3k: Lib/dbm/dumb.py Lib/test/test_dbm_dumb.py Misc/NEWS Nov 21, 2008
    @gvanrossum
    Copy link
    Member

    Reading old dumbdbm files is essential. Writing them is not.

    @gvanrossum
    Copy link
    Member

    [fix title]

    @gvanrossum gvanrossum changed the title Re: r67310 - in python/branches/py3k: Lib/dbm/dumb.py Lib/test/test_dbm_dumb.py Misc/NEWS Byte/string inconsistencies between different dbm modules Nov 21, 2008
    @brettcannon
    Copy link
    Member

    So the use of pickle is not fair as that doesn't round-trip if you
    simply write out a file because py3k pickle doesn't like classic classes
    and the new-style classes want copy_reg which is not in existence in
    py3k since it was renamed. See pickle2write.py and pickle3read.py (which
    are version-agnostic; just following Skip's naming) for the pickle failing.

    Now if you skip that one use-case in the example of pickling a
    user-defined class then everything works out. I have attached a
    hacked-up version of Skip's scripts that take Unicode strings, encode
    them in Latin-1 and UTF-8, and then decode them on the other side in
    Py3K, all without issue.

    In other words I think my solution works and pickle is the trouble-maker
    in all of this.

    @smontanaro
    Copy link
    Contributor Author

    Brett> In other words I think my solution works and pickle is the
    Brett> trouble-maker in all of this.

    I can buy that. Should pickle map "copy_reg" to "copyreg"? Is that the
    only incompatibility?

    Actually, it seems this ticket should be closed and another opened about
    this pickle issue.

    Skip

    @brettcannon
    Copy link
    Member

    On Fri, Nov 21, 2008 at 10:32, Skip Montanaro <report@bugs.python.org> wrote:

    Skip Montanaro <skip@pobox.com> added the comment:

    Brett> In other words I think my solution works and pickle is the
    Brett> trouble-maker in all of this.

    I can buy that. Should pickle map "copy_reg" to "copyreg"? Is that the
    only incompatibility?

    Actually, it seems this ticket should be closed and another opened about
    this pickle issue.

    Well, I still need a code review for my latest patch that changes the
    docs, cleans up gdbm and ndbm exception messages, catches a place
    where I didn't decode before write-out, and adds an 'encoding'
    argument to all open() calls.

    @smontanaro
    Copy link
    Contributor Author

    One doc nit: There is still reference to gdbm and Dbm (or dbm) objects
    when they should probably use dbm.gnu and dbm.ndbm, respectively.

    I'm confused by the encoding="Latin-1" args to _io.open for dbm.dumb. I
    thought the default enoding was going to be utf-8, and I see no way to
    influence that. Is that just to make sure all code points can be written
    without error?

    Skip

    @brettcannon
    Copy link
    Member

    On Fri, Nov 21, 2008 at 11:01, Skip Montanaro report@bugs.python.org wrote:

    Skip Montanaro skip@pobox.com added the comment:

    One doc nit: There is still reference to gdbm and Dbm (or dbm) objects
    when they should probably use dbm.gnu and dbm.ndbm, respectively.

    OK, I will fix that and upload a new patch at some point.

    I'm confused by the encoding="Latin-1" args to _io.open for dbm.dumb. I
    thought the default enoding was going to be utf-8, and I see no way to
    influence that. Is that just to make sure all code points can be written
    without error?

    It's so that when writing out there won't be any errors. Since the
    repr of strings are used instead of bytes the stuff passed in and
    written to disk must be represented in an encoding that will never
    complain about what bytes it gets. Latin-1 does this while UTF-8. And
    since everything is being written and read in Latin-1 I figured I
    might as well be thorough and specify the encoding.

    @smontanaro
    Copy link
    Contributor Author

    py3k patched with specify_open_encoding.diff passes test_dbm_dumb on my
    Mac (Leopard, Intel). Might as well assign this to Brett. He seems to
    be doing all the heavy lifting anyway. ;-)

    Skip

    @brettcannon
    Copy link
    Member

    specify_open_encoding.diff has been committed in r67369.

    I still need a review for doc_dbm_strings.diff, though, which clarifies
    the docs, fixes one oversight in dbm.dumb, and extends testing to make
    sure strings can be accepted.

    @brettcannon brettcannon added the stdlib Python modules in the Lib dir label Nov 24, 2008
    @smontanaro
    Copy link
    Contributor Author

    Brett> I still need a review for doc_dbm_strings.diff, though, which
    Brett> clarifies the docs, fixes one oversight in dbm.dumb, and extends
    Brett> testing to make sure strings can be accepted.

    Was my comment

    http://bugs.python.org/msg76198
    

    and your resonse

    http://bugs.python.org/msg76199
    

    not sufficient?

    If not, allow me to anoint said diff with virtual holy water now...

    Skip

    @brettcannon
    Copy link
    Member

    On Mon, Nov 24, 2008 at 16:25, Skip Montanaro <report@bugs.python.org> wrote:

    Skip Montanaro <skip@pobox.com> added the comment:

    Brett> I still need a review for doc_dbm_strings.diff, though, which
    Brett> clarifies the docs, fixes one oversight in dbm.dumb, and extends
    Brett> testing to make sure strings can be accepted.

    Was my comment

    http://bugs.python.org/msg76198

    and your resonse

    http://bugs.python.org/msg76199

    not sufficient?

    Wasn't sure if that meant everything not mentioned was fine.

    If not, allow me to anoint said diff with virtual holy water now...

    =) Thanks!

    @brettcannon
    Copy link
    Member

    r67380 has the fix. Thanks for the review, Skip!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants