Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF7 decoding is far too strict #48676

Closed
NickBarnes mannequin opened this issue Nov 25, 2008 · 18 comments
Closed

UTF7 decoding is far too strict #48676

NickBarnes mannequin opened this issue Nov 25, 2008 · 18 comments
Assignees
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@NickBarnes
Copy link
Mannequin

NickBarnes mannequin commented Nov 25, 2008

BPO 4426
Nosy @malemburg, @pitrou, @vstinner
Files
  • utf7patch: Patch to Python 2.6 to fix problems with UTF7 Unicode codec.
  • issue4426.patch
  • python-unicode-trunk-patch: New trunk patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2009-05-04.18:59:11.845>
    created_at = <Date 2008-11-25.11:11:55.649>
    labels = ['type-bug', 'expert-unicode']
    title = 'UTF7 decoding is far too strict'
    updated_at = <Date 2009-05-04.18:59:11.843>
    user = 'https://bugs.python.org/NickBarnes'

    bugs.python.org fields:

    activity = <Date 2009-05-04.18:59:11.843>
    actor = 'pitrou'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2009-05-04.18:59:11.845>
    closer = 'pitrou'
    components = ['Unicode']
    creation = <Date 2008-11-25.11:11:55.649>
    creator = 'Nick Barnes'
    dependencies = []
    files = ['12179', '13868', '13872']
    hgrepos = []
    issue_num = 4426
    keywords = ['patch', 'needs review']
    message_count = 18.0
    messages = ['76405', '76409', '76414', '76415', '76419', '76423', '76453', '76496', '76696', '76704', '77933', '87113', '87117', '87119', '87124', '87137', '87147', '87150']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'ggenellina', 'pitrou', 'vstinner', 'Nick Barnes']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4426'
    versions = ['Python 3.1', 'Python 2.7']

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Nov 25, 2008

    UTF-7 decoding raises an exception for any character not in the RFC2152
    "Set D" (directly encoded characters). In particular, it raises an
    exception for characters in "Set O" (optional direct characters), such
    as < = > [ ] @ etc. These characters can legitimately appear in
    UTF-7-encoded text, and should be decoded (as themselves). As it is,
    the UTF-7 decoder can't reliably be used to decode any UTF-7 text other
    than that encoded by Python's own UTF-7 encoder.

    Looking at the source of unicodeobject.c, the call to the SPECIAL macro
    on line 1009 has hardcoded second and third arguments of zero. Maybe
    changing the second argument to 1 would fix this. Maybe.

    @NickBarnes NickBarnes mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 25, 2008
    @vstinner
    Copy link
    Member

    Can you write some tests to help fixing this issue? Stupid example (I 
    don't know UTF-8 encoding):
    >>> all((byte.encode("utf-7") == byte) for byte in '<=>[]@')
    >>> all((byte.decode("utf-7") == byte) for byte in '<=>[]@')

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Nov 25, 2008

    # Note, this test covers issues 4425 and 4426

    # Direct encoded characters:
    set_d =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:?"

    # Optional direct characters:
    set_o = '!"#$%&*;<=>@[]^_`{|}'

    all((c.encode('utf7') == c) for c in set_d)
    all((c.decode('utf7') == c) for c in set_d)
    all((c.decode('utf7') == c) for c in set_o)

    @malemburg
    Copy link
    Member

    On 2008-11-25 12:11, Nick Barnes wrote:

    New submission from Nick Barnes <Nick.Barnes@pobox.com>:

    UTF-7 decoding raises an exception for any character not in the RFC2152
    "Set D" (directly encoded characters). In particular, it raises an
    exception for characters in "Set O" (optional direct characters), such
    as < = > [ ] @ etc. These characters can legitimately appear in
    UTF-7-encoded text, and should be decoded (as themselves). As it is,
    the UTF-7 decoder can't reliably be used to decode any UTF-7 text other
    than that encoded by Python's own UTF-7 encoder.

    Thanks for noticing this. Apparently, the UTF-7 codec is not used
    a lot by Python users, since it's been like this for years.

    The tests we have do check round-trip safety, but not the special
    characteristics of the UTF-7 codec.

    Also note that the code for the codec was contributed and is, AFAIK,
    not maintained by any of the Python developers.

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Nov 25, 2008

    Well, I could submit a diff for unicodeobject.c, but I have never
    contributed to Python (or used this particular tracking system) before.
    Is there a standard form for contributing changes? Unified diff?

    @vstinner
    Copy link
    Member

    Is there a standard form for contributing changes? Unified diff?

    Attach a patch file (unified diff, yes).

    @malemburg
    Copy link
    Member

    On 2008-11-25 19:56, Nick Barnes wrote:

    Nick Barnes <Nick.Barnes@pobox.com> added the comment:

    Well, I could submit a diff for unicodeobject.c, but I have never
    contributed to Python (or used this particular tracking system) before.
    Is there a standard form for contributing changes? Unified diff?

    Please send unified diffs and attach them to the ticket.

    While you're at it: perhaps you could try to refactor the code a bit
    in order to get rid off the many macros use throughout the codec.
    They make the code a lot less readable.

    Thanks,

    Marc-Andre Lemburg
    eGenix.com


    2008-11-12: Released mxODBC.Connect 0.9.3 http://python.egenix.com/

    :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Nov 27, 2008

    I'll try to get to this next week. Right now I'm snowed under. I don't
    promise to do any refactoring.

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Dec 1, 2008

    My original defect report here was incorrect, or possibly only relates
    to a particular older Python installation. It is still the case that
    UTF-7 decoding is fussier than it need be (decoding should be
    permissive), and is broken specifically for the '/' character (ASCII
    47). I'm probably going to rewrite the whole codec for greater clarity
    and decoding permissiveness.

    Any UTF-7 encoder has two boolean parameters: whether to base-64 encode
    whitespace (sp, ht, nl, cr), and whether to base-64 encode "set O"
    characters. The existing Python UTF-7 encoder says "no" to both of
    these. It would be useful to have them as options. How should encoding
    parameters such as these be passed? As setstate() methods on the
    IncrementalEncoder and StreamWriter objects? Or should I provide four
    separate codecs (retaining the existing behaviour in the 'utf7' codec,
    of course).

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented Dec 1, 2008

    Here is my patch. This is a rewrite of the UTF7 encoder and decoder.
    It now handles surrogate pairs correctly, so non-BMP characters work
    with this codec. And my motivating example ('/'.decode('utf7')) works
    OK. I'm not totally confident of the error-handling code here, but in
    'strict' mode this is definitely a better codec than the one it
    replaces. This patch is based on the Python 2.6 distribution.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 16, 2008

    I'm not in a position to comment on the encoding algorithm itself but I
    have a couple of comments:

    • I get the following compilation warning:
      Objects/unicodeobject.c: In function ‘PyUnicode_DecodeUTF7Stateful’:
      Objects/unicodeobject.c:1531: attention : ‘shiftOutStart’ may be used
      uninitialized in this function

    • shouldn't there be an additional test for the '/' behaviour (unless it
      is already there and I have overlooked it)?

    * your patch fails on another test:
    Traceback (most recent call last):
      File "Lib/test/test_unicode.py", line 538, in test_codecs_utf7
        self.assertRaises(UnicodeDecodeError, '+\xc1'.decode, 'utf-7')
    AssertionError: UnicodeDecodeError not raised

    @vstinner
    Copy link
    Member

    vstinner commented May 4, 2009

    Copy of msg76404 from duplicate issue (bpo-4425):
    ----
    '/'.encode('utf7') returns '+AC8-'. It should return '/'. See RFC
    2152.

    '/'.decode('utf7') raises an exception (this is a special case of a
    general problem with UTF-7 decoding, which I will report as a separate
    bug).
    ----

    @vstinner
    Copy link
    Member

    vstinner commented May 4, 2009

    I updated the patch to python trunk. I was hard because "patch -p1"
    failed at many places, so please double check the updated patch.

    Changes from utf7patch:

    • I removed the test "c >=0" in DECODE_DIRECT(c) because c type
      is "Py_UNICODE" and the type is always unsigned (right?)
    • I added "shiftOutStart = p;" at the beginning of
      PyUnicode_DecodeUTF7Stateful() to fix the gcc warning reported by
      pitrou.
    • I fixed the failing test in test_unicode.py (also reported by
      pitrou)
    • I added the tests listed in msg76414 (by nick)

    I didn't read the UTF-7 encoder or decoder code. I just updated the
    patch.

    @vstinner
    Copy link
    Member

    vstinner commented May 4, 2009

    (oops, i stripped spaces in my last patch)

    @pitrou
    Copy link
    Member

    pitrou commented May 4, 2009

    A quick comment on the patch: it seems to invert (quite futily I'd say)
    the meaning of the arguments given to PyUnicode_EncodeUTF7, which is an
    incompatible API change.
    I'm in favour of reworking this patch in order to keep the original API.
    If I'm not mistaken, it's just a matter of using the boolean complement
    of these two arguments (encodeSetO, encodeWhiteSpace) inside the body of
    PyUnicode_EncodeUTF7.

    @NickBarnes
    Copy link
    Mannequin Author

    NickBarnes mannequin commented May 4, 2009

    This was my first contribution to Python. I don't know what the rules
    are on changing the arguments of an internal function such as
    PyUnicode_EncodeUTF7(). Since I was rewriting the whole function
    anyway, I tried to give it arguments which made more sense with respect
    to the defining RFC. If you want us to revert to the original meanings
    of these arguments (so the third argument means "use base-64 encoding
    for characters in set O", and not "use direct encoding for characters in
    set O"), please can we have better names for them?

    The name "encodeSetO" was meaningless: the function encodes *all* the
    characters in the string. What the argument specifies is whether the
    "set O" characters are self-encoded or base-64 encoded. So maybe it
    should be called "base64SetO".

    Ditto for the "encodeWhiteSpace" name.

    Here's a trunk patch with the meaning of those parameters reverted, and
    better names.

    @pitrou
    Copy link
    Member

    pitrou commented May 4, 2009

    Thanks for the update. Functions like PyUnicode_EncodeUTF7() are part of
    the public C API, therefore their semantics can't be changed lightly.
    The patch looks ok to me, apart from minor style issues.

    @pitrou pitrou self-assigned this May 4, 2009
    @pitrou
    Copy link
    Member

    pitrou commented May 4, 2009

    Committed in r72283 and r72285. Thanks!

    @pitrou pitrou closed this as completed May 4, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants