Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_warnings: patch to avoid conversions from/to UTF-8 #63623

Closed
vstinner opened this issue Oct 28, 2013 · 16 comments
Closed

_warnings: patch to avoid conversions from/to UTF-8 #63623

vstinner opened this issue Oct 28, 2013 · 16 comments

Comments

@vstinner
Copy link
Member

BPO 19424
Nosy @vstinner, @ezio-melotti, @zware, @serhiy-storchaka, @vajrasky
Files
  • warnings.patch
  • remove_compiler_warning_pyunicode_compare_with_ascii_string.patch
  • issue19424-windows-warnings.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-10-29.23:37:08.640>
    created_at = <Date 2013-10-28.17:43:22.298>
    labels = ['expert-unicode']
    title = '_warnings: patch to avoid conversions from/to UTF-8'
    updated_at = <Date 2013-11-04.10:29:09.994>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2013-11-04.10:29:09.994>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-10-29.23:37:08.640>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2013-10-28.17:43:22.298>
    creator = 'vstinner'
    dependencies = []
    files = ['32395', '32419', '32427']
    hgrepos = []
    issue_num = 19424
    keywords = ['patch']
    message_count = 16.0
    messages = ['201558', '201655', '201657', '201680', '201688', '201689', '201691', '201692', '201694', '201697', '201703', '201749', '201752', '202013', '202022', '202113']
    nosy_count = 7.0
    nosy_names = ['vstinner', 'ezio.melotti', 'Arfrever', 'python-dev', 'zach.ware', 'serhiy.storchaka', 'vajrasky']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue19424'
    versions = ['Python 3.4']

    @vstinner
    Copy link
    Member Author

    Attached patch removes usage of _PyUnicode_AsString() to not convert strings from/to UTF-8.

    @serhiy-storchaka
    Copy link
    Member

    I don't see a benefit from this patch. _PyUnicode_AsString() is very fast in most cases (because source lines are mostly ASCII only). On other hand, the patch replaces one-time _PyUnicode_AsString() by multiple less effective (even for ASCII strings) operations.

    @vstinner
    Copy link
    Member Author

    I don't see a benefit from this patch.

    Oh, sorry, I forgot to explain the motivation. Performances of the warnings module are not critical module. The motivation here is to avoid to encoding string to UTF-8 for correctness. For example, _PyUnicode_AsString(filename) fails if the filename contains a surrogate character.

    >>> warnings.warn_explicit("text", RuntimeError, "filename", 5)
    filename:5: RuntimeError: text
    >>> warnings.warn_explicit("text", RuntimeError, "filename\udc80", 5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 8: surrogates not allowed

    Another example where a string to encoded to UTF-8 and decoded from UTF-8 a few instructions later:

    PyObject *to_str = PyObject_Str(item);
    err_str = _PyUnicode_AsString(to_str);
    ...
    PyErr_Format(PyExc_RuntimeError,  "...%s", err_str);

    Using "%R" avoids any encoding conversion.

    @serhiy-storchaka
    Copy link
    Member

    Well, it looks reasonable. But an action should be ASCII only string. So perhaps we should first check PyUnicode_IS_ASCII() and then use PyUnicode_1BYTE_DATA() and strcpy().

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 29, 2013

    New changeset 34e166d60f37 by Victor Stinner in branch 'default':
    Issue bpo-19424: Optimize PyUnicode_CompareWithASCIIString()
    http://hg.python.org/cpython/rev/34e166d60f37

    @vstinner
    Copy link
    Member Author

    "Well, it looks reasonable. But an action should be ASCII only string. So perhaps we should first check PyUnicode_IS_ASCII() and then use PyUnicode_1BYTE_DATA() and strcpy()."

    Such optimizations in warnings seem overkill, performances are not critical in this module.

    I optimized PyUnicode_CompareWithASCIIString() instead.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 29, 2013

    New changeset c7326aa0b69c by Victor Stinner in branch 'default':
    Issue bpo-19424: Fix the warnings module to accept filename containing surrogate
    http://hg.python.org/cpython/rev/c7326aa0b69c

    @vstinner
    Copy link
    Member Author

    @serhiy: Thanks for your review. I modified warnings.warn_explicit() to reject types different than str for the filename. I also removed the useless optimization for PyUnicode_Substring() when i=0.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 29, 2013

    New changeset 05e8dde3229c by Victor Stinner in branch 'default':
    Issue bpo-19424: Fix test_warnings for locale encoding unable to encode
    http://hg.python.org/cpython/rev/05e8dde3229c

    @vstinner
    Copy link
    Member Author

    Oops, while fixing this issue, I found a new issue related to warnings: I opened issue bpo-19442 "Python crashes when a warning is emitted during shutdown".

    This issue can now be fixed.

    @vajrasky
    Copy link
    Mannequin

    vajrasky mannequin commented Oct 30, 2013

    Victor, I found out that this commit http://hg.python.org/cpython/rev/34e166d60f37 gives me compiler warning.

    Objects/unicodeobject.c: In function ‘PyUnicode_CompareWithASCIIString’:
    Objects/unicodeobject.c:10583:22: warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
    gcc -pthread -c -Wno-unused-result -Werror=declaration-after-statement -g -O0 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE \

    Attached the patch to remove the compiler warning.

    @zware
    Copy link
    Member

    zware commented Oct 30, 2013

    Adding to Vajrasky's report, the same commit also adds 3 warnings when building on Windows:

    ..\Objects\unicodeobject.c(10588): warning C4018: '>' : signed/unsigned mismatch [P:\Projects\OSS\Python\cpython\PCbuild\pythoncore.vcxproj]
    ..\Objects\unicodeobject.c(10592): warning C4018: '>' : signed/unsigned mismatch [P:\Projects\OSS\Python\cpython\PCbuild\pythoncore.vcxproj]
    ..\Objects\unicodeobject.c(10594): warning C4018: '>' : signed/unsigned mismatch [P:\Projects\OSS\Python\cpython\PCbuild\pythoncore.vcxproj]

    Vajrasky's patch doesn't fix the Windows warnings (but doesn't create any new ones either); the attached patch does (but likely doesn't do anything for the other).

    I don't know nearly enough C to say whether my patch is any good or not, just enough to say it compiles, doesn't break anything immediately obvious, and removes the warnings on Windows.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 30, 2013

    New changeset 52ec6a3eeda5 by Victor Stinner in branch 'default':
    Issue bpo-19424: Fix a compiler warning
    http://hg.python.org/cpython/rev/52ec6a3eeda5

    @vajrasky
    Copy link
    Mannequin

    vajrasky mannequin commented Nov 3, 2013

    Py_ssize_t is signed long. size_it is unsigned long. In this case, I suppose we should avoid unsigned as much as possible in comparison with signed.

    So I think Zachary's patch is reasonable.

    What do you think, Victor?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 3, 2013

    New changeset 2ed8d500e113 by Victor Stinner in branch 'default':
    Issue bpo-19424: Fix a compiler warning on comparing signed/unsigned size_t
    http://hg.python.org/cpython/rev/2ed8d500e113

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 4, 2013

    New changeset 494f736f5945 by Victor Stinner in branch 'default':
    Issue bpo-19424: PyUnicode_CompareWithASCIIString() normalizes memcmp() result
    http://hg.python.org/cpython/rev/494f736f5945

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants