
Use Py_UCS4 instead of Py_UNICODE in unicodectype.c #49377

Closed
bupjae mannequin opened this issue Feb 2, 2009 · 54 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@bupjae
Mannequin

bupjae mannequin commented Feb 2, 2009

BPO 5127
Nosy @malemburg, @amauryfa, @vstinner, @ezio-melotti
Superseder
  • bpo-7663: narrow build incorrectly translates cases for non-BMP code points

Files
  • unicodectype_ucs4.patch
  • unicodectype_ucs4-2.patch
  • unicodectype_ucs4_3.patch: refreshed; don't forget to run makeunicodedata.py
  • unicodectype_ucs4_4.patch
  • unicodectype_ucs4_5.patch
  • unicodectype_ucs4_6.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/amauryfa'
    closed_at = <Date 2010-08-19.12:56:22.263>
    created_at = <Date 2009-02-02.01:31:39.408>
    labels = ['type-bug', 'expert-unicode']
    title = 'Use Py_UCS4 instead of Py_UNICODE in unicodectype.c'
    updated_at = <Date 2010-08-19.12:56:22.261>
    user = 'https://bugs.python.org/bupjae'

    bugs.python.org fields:

    activity = <Date 2010-08-19.12:56:22.261>
    actor = 'amaury.forgeotdarc'
    assignee = 'amaury.forgeotdarc'
    closed = True
    closed_date = <Date 2010-08-19.12:56:22.263>
    closer = 'amaury.forgeotdarc'
    components = ['Unicode']
    creation = <Date 2009-02-02.01:31:39.408>
    creator = 'bupjae'
    dependencies = []
    files = ['12934', '15047', '15058', '17909', '17911', '17924']
    hgrepos = []
    issue_num = 5127
    keywords = ['patch']
    message_count = 54.0
    messages = ['80924', '80928', '80971', '81044', '81048', '81049', '81051', '81052', '81053', '81057', '81061', '81081', '81090', '81115', '93562', '93585', '93587', '93588', '93589', '93590', '93594', '93595', '93596', '93604', '93611', '93615', '93617', '93636', '93659', '93662', '97441', '97447', '97535', '109509', '109513', '109518', '109519', '109522', '109523', '109526', '109527', '109529', '109530', '109626', '109631', '109644', '109646', '109687', '109689', '109692', '109693', '109695', '109795', '114360']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'Rhamphoryncus', 'vstinner', 'ezio.melotti', 'bupjae']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = '7663'
    type = 'behavior'
    url = 'https://bugs.python.org/issue5127'
    versions = ['Python 3.1', 'Python 3.2']

    @bupjae
    Mannequin Author

    bupjae mannequin commented Feb 2, 2009

    >>> license
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\site.py", line 372, in __repr__
        self.__setup()
      File "C:\Python30\lib\site.py", line 359, in __setup
        data = fp.read()
      File "C:\Python30\lib\io.py", line 1724, in read
        decoder.decode(self.buffer.read(), final=True))
      File "C:\Python30\lib\io.py", line 1295, in decode
        output = self.decoder.decode(input, final=final)
    UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165: illegal multibyte sequence
    >>> chr(0x10000)
    '\U00010000'
    >>> chr(0x11000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
    UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence
    >>>

    I also can't understand why chr(0x10000) and chr(0x11000) have different
    behavior.

    @bupjae bupjae mannequin added the topic-unicode label Feb 2, 2009
    @ezio-melotti
    Member

    Here (winxpsp2, Py3, cp850-terminal) the license works fine:
    >>> license
    Type license() to see the full license text

    and license() works as well.

    I get this output for the chr()s:
    >>> chr(0x10000)
    '\U00010000'
    >>> chr(0x11000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Programs\Python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
      File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position
    1-2: character maps to <undefined>

    I believe that chr(0x10000) and chr(0x11000) should have the opposite
    behavior.
    U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and
    should be printed (and possibly raise a UnicodeError, see bpo-5110),
    whereas U+11000 belongs to the 'Cn' category and should be escaped.

    On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000'
    and chr(0x11000) prints the char (actually I see two boxes, but that's
    likely a font issue rather than a Python problem). license() works fine too.

    Also note that with cp850 the error message is 'character maps to
    <undefined>' and with cp949 it is 'illegal multibyte sequence'.

    @amauryfa
    Member

    amauryfa commented Feb 2, 2009

    There were non-ascii characters in the Windows license file. This was
    corrected with r67860.

    I believe that chr(0x10000) and chr(0x11000) should have the
    opposite behavior.

    This other problem occurs because, on a narrow unicode build,
    Py_UNICODE_ISPRINTABLE takes a 16-bit integer.
    And indeed,

    >>> unicodedata.category(chr(0x10000 % 65536))
    'Cc'
    >>> unicodedata.category(chr(0x11000 % 65536))
    'Lo'
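
    A minimal C sketch of the truncation in question (the integer types are
    stand-ins for Py_UCS4 and a narrow build's 16-bit Py_UNICODE; this is not
    CPython source): passing a full code point to a 16-bit parameter silently
    keeps only the low 16 bits, which is exactly the % 65536 behavior above.

        #include <stdio.h>

        int main(void)
        {
            /* code points produced by joining surrogate pairs */
            unsigned int cp1 = 0x10000, cp2 = 0x11000;
            /* implicit narrowing, as when calling a Py_UNICODE-taking function */
            unsigned short u1 = (unsigned short)cp1, u2 = (unsigned short)cp2;
            printf("0x%05X -> 0x%04X\n", cp1, (unsigned)u1); /* 0x10000 -> 0x0000 ('Cc') */
            printf("0x%05X -> 0x%04X\n", cp2, (unsigned)u2); /* 0x11000 -> 0x1000 ('Lo') */
            return 0;
        }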

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    I don't understand the behaviour of unichr():

    Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
    >>> import unicodedata
    >>> unicodedata.category(u"\U00010000")
    'Lo'
    >>> unicodedata.category(u"\U00011000")
    'Cn'
    >>> unicodedata.category(unichr(0x10000))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Why does unichr() fail whereas \Uxxxxxxxx works?

    >>> len(u"\U00010000")
    2
    >>> ord(u"\U00010000")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found

    @ezio-melotti
    Member

    FWIW, on Python3 it seems to work:
    >>> import unicodedata
    >>> unicodedata.category("\U00010000")
    'Lo'
    >>> unicodedata.category("\U00011000")
    'Cn'
    >>> unicodedata.category(chr(0x10000))
    'Lo'
    >>> unicodedata.category(chr(0x11000))
    'Cn'
    >>> ord(chr(0x10000)), 0x10000
    (65536, 65536)
    >>> ord(chr(0x11000)), 0x11000
    (69632, 69632)
    
    I'm using a narrow build too:
    >>> import sys
    >>> sys.maxunicode
    65535
    >>> len('\U00010000')
    2
    >>> ord('\U00010000')
    65536

    On Python2 unichr() is supposed to raise a ValueError on a narrow build
    if the value is greater than 0xFFFF, but if the characters above
    0xFFFF can be represented with u"\Uxxxxxxxx" there should be a way to
    fix unichr() so it can return them. Python3 already does it with chr().

    Maybe we should open a new issue for this if it's not present already.

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    Since r56395, ord() and chr() accept and return surrogate pairs even in
    narrow builds.

    The goal is to remove most differences between narrow and wide unicode
    builds (except for string lengths, indices or slices)

    To address this problem, I suggest to change all functions in
    unicodectype.c so that they accept Py_UCS4 characters (instead of
    Py_UNICODE).
    This would be a binary-incompatible change; and --with-wctype-functions
    would have an effect only if sizeof(wchar_t)==4 (instead of the current
    condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

    @malemburg
    Member

    On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:

    > Since r56395, ord() and chr() accept and return surrogate pairs even in
    > narrow builds.
    >
    > The goal is to remove most differences between narrow and wide unicode
    > builds (except for string lengths, indices or slices)
    >
    > To address this problem, I suggest to change all functions in
    > unicodectype.c so that they accept Py_UCS4 characters (instead of
    > Py_UNICODE).

    -1.

    That would cause major breakage in the C API and is not in line with the
    intention of having a Py_UNICODE type in the first place.

    Users who are interested in UCS4 builds should simply use UCS4 builds.

    > This would be a binary-incompatible change; and --with-wctype-functions
    > would have an effect only if sizeof(wchar_t)==4 (instead of the current
    > condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

    --with-wctype-functions was scheduled for removal many releases ago,
    but I never got around to it. The only reason it's still there is
    that some Linux distributions use this config option (AFAIR, RedHat).
    I'd be +1 on removing the option in 3.0.1 or deprecating it in
    3.0.1 and removing it in 3.1.

    It's not useful in any way, and causes compatibility problems
    with regular builds.

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    amaury> Since r56395, ord() and chr() accept and return surrogate pairs
    amaury> even in narrow builds.

    Note: My examples are made with Python 2.x.

    amaury> The goal is to remove most differences between narrow and wide
    amaury> unicode builds (except for string lengths, indices or slices)

    It would be nice to get the same behaviour in Python 2.x and 3.x to help
    migration from Python2 to Python3 ;-)

    unichr() (in Python 2.x) documentation is correct. But I would appreciate
    support for surrogates in unichr(), which also means changing ord()'s behaviour.

    amaury> To address this problem, I suggest to change all functions in
    amaury> unicodectype.c so that they accept Py_UCS4 characters (instead of
    amaury> Py_UNICODE).

    Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
    characters (code > 0xffff).

    --

    I can open a new issue if you agree that we can change unichr() / ord()
    behaviour on narrow builds. Should we ask on the mailing list?

    @malemburg
    Member

    On 2009-02-03 14:14, STINNER Victor wrote:

    > amaury> Since r56395, ord() and chr() accept and return surrogate pairs
    > amaury> even in narrow builds.
    >
    > Note: My examples are made with Python 2.x.
    >
    > amaury> The goal is to remove most differences between narrow and wide
    > amaury> unicode builds (except for string lengths, indices or slices)
    >
    > It would be nice to get the same behaviour in Python 2.x and 3.x to help
    > migration from Python2 to Python3 ;-)
    >
    > unichr() (in Python 2.x) documentation is correct. But I would appreciate
    > support for surrogates in unichr(), which also means changing ord()'s
    > behaviour.

    This is not possible for unichr() in Python 2.x, since applications
    always expect len(unichr(x)) == 1.

    Changing ord() in Python 2.x would be easier, since this would only
    extend the range of returned values for UCS2 builds.

    > amaury> To address this problem, I suggest to change all functions in
    > amaury> unicodectype.c so that they accept Py_UCS4 characters (instead of
    > amaury> Py_UNICODE).
    >
    > Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
    > characters (code > 0xffff).
    >
    > I can open a new issue if you agree that we can change unichr() / ord()
    > behaviour on narrow builds. Should we ask on the mailing list?

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    > That would cause major breakage in the C API

    Not if you recompile. I don't see how this breaks the API at the C level.

    > and is not in line with the intention of having a Py_UNICODE
    > type in the first place.

    Py_UNICODE is still used as the allocation unit for unicode strings.

    To get correct results, we need a way to access the whole unicode
    database even on ucs2 builds; it's possible with the unicodedata module,
    why not from C?

    My motivation for the change is this post:
    http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    @vstinner
    Member

    vstinner commented Feb 3, 2009

    lemburg> This is not possible for unichr() in Python 2.x, since applications
    lemburg> always expect len(unichr(x)) == 1

    Oh, ok.

    lemburg> Changing ord() in Python 2.x would be easier, since
    lemburg> this would only extend the range of returned values for UCS2
    lemburg> builds.

    ord() of Python3 (narrow build) rejects surrogate characters:

    '\U00010000'
    >>> len(chr(0x10000))
    2
    >>> ord(0x10000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected string of length 1, but int found

    It looks like narrow builds with surrogates have some more problems...

    Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.

    Correct result (Python 2.5, wide build):

       $ python
       Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
       >>> unichr(0x10000)
       u'\U00010000'
       >>> unichr(0x10000).isalpha()
       True

    Error in Python3 (narrow build):

       marge$ ./python
       Python 3.1a0 (py3k:69105M, Feb  3 2009, 15:04:35)
       >>> chr(0x10000).isalpha()
       False
       >>> list(chr(0x10000))
       ['\ud800', '\udc00']
       >>> chr(0xd800).isalpha()
       False
       >>> chr(0xdc00).isalpha()
       False

    Unicode ranges, all in the category "Other, Surrogate":

    • U+D800..U+DB7F: Non Private Use High Surrogate
    • U+DB80..U+DBFF: Private Use High Surrogate
    • U+DC00..U+DFFF: Low Surrogate range
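
    For reference, the standard UTF-16 arithmetic that maps a non-BMP code
    point into the surrogate ranges listed above; a self-contained C sketch
    (the function name is illustrative, not a CPython API):

        #include <assert.h>
        #include <stdio.h>

        /* split a non-BMP code point into a UTF-16 surrogate pair */
        static void split_surrogates(unsigned int cp,
                                     unsigned short *hi, unsigned short *lo)
        {
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            cp -= 0x10000;
            *hi = (unsigned short)(0xD800 + (cp >> 10));   /* U+D800..U+DBFF */
            *lo = (unsigned short)(0xDC00 + (cp & 0x3FF)); /* U+DC00..U+DFFF */
        }

        int main(void)
        {
            unsigned short hi, lo;
            split_surrogates(0x10000, &hi, &lo);
            /* prints U+D800 U+DC00, matching list(chr(0x10000)) above */
            printf("U+10000 -> U+%04X U+%04X\n", (unsigned)hi, (unsigned)lo);
            return 0;
        }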

    @malemburg
    Member

    On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:


    >> That would cause major breakage in the C API
    >
    > Not if you recompile. I don't see how this breaks the API at the C level.

    Well, then try to look at such a change from a C extension
    writer's perspective.

    They'd have to change all their function calls and routines to work
    with Py_UCS4.

    Supporting both the old API and the new one would
    be nearly impossible and require either an adapter API or a lot
    of #ifdef'ery.

    Please remember that the public Python C API is not only meant for
    Python developers. Its main purpose is to be used by
    other developers extending or embedding Python, and those developers
    use different release cycles and want to support more than just the
    bleeding edge Python version.

    Python has a long history of providing very stable APIs, both in
    C and in Python.

    FWIW: The last major change in the C API (the change to Py_ssize_t
    from Python 2.4 to 2.5) has not even propagated to all major C
    extensions yet. It's only now that people start to realize problems
    with this, since their extensions start failing with segfaults
    on 64-bit machines.

    That said, we can of course provide additional UCS4 APIs for
    certain things and also provide conversion helpers between
    Py_UNICODE and Py_UCS4 where needed.

    >> and is not in line with the intention of having a Py_UNICODE
    >> type in the first place.

    > Py_UNICODE is still used as the allocation unit for unicode strings.
    >
    > To get correct results, we need a way to access the whole unicode
    > database even on ucs2 builds; it's possible with the unicodedata module,
    > why not from C?

    I must be missing some detail, but what does the Unicode database
    have to do with the unicodeobject.c C API ?

    > My motivation for the change is this post:
    > http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    There are certainly other ways to make Python deal with surrogates
    in more cases than the ones we already support.

    @ezio-melotti
    Member

    haypo> ord() of Python3 (narrow build) rejects surrogate characters:
    haypo> '\U00010000'
    haypo> >>> len(chr(0x10000))
    haypo> 2
    haypo> >>> ord(0x10000)
    haypo> TypeError: ord() expected string of length 1, but int found
    
    ord() works fine on Py3, you probably meant to do 
    >>> ord('\U00010000')
    65536
    or
    >>> ord(chr(0x10000))
    65536
    
    In Py3 it is also stated that it accepts surrogate pairs (see help(ord)).
    Py2 instead doesn't support them:
    >>> ord(u'\U00010000')
    TypeError: ord() expected a character, but string of length 2 found

    @amauryfa
    Member

    amauryfa commented Feb 3, 2009

    > I must be missing some detail, but what does the Unicode database
    > have to do with the unicodeobject.c C API ?

    Ah, now I understand your concerns. My suggestion is to change only the 20 functions in
    unicodectype.c: _PyUnicode_IsAlpha, _PyUnicode_ToLowercase... and no change in
    unicodeobject.c at all.
    They all take a single code point as argument, some also return a single code point.
    Changing these functions is backwards compatible.

    I attach a patch so we can argue over concrete code (tests are missing).

    Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0.

    The str.isalpha() (and others) methods did not change: they still split the surrogate pairs.
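
    To illustrate what Py_UCS4-taking ctype functions buy a narrow build: a
    caller can first join a surrogate pair into a single code point and query
    that. A minimal C sketch with stand-in type and function names (standard
    UTF-16 math; not the actual patch):

        #include <stdio.h>

        typedef unsigned short py_unicode; /* stand-in for a narrow Py_UNICODE */
        typedef unsigned int   py_ucs4;    /* stand-in for Py_UCS4 */

        /* join a UTF-16 surrogate pair into one code point */
        static py_ucs4 join_surrogates(py_unicode hi, py_unicode lo)
        {
            return 0x10000 + (((py_ucs4)(hi - 0xD800) << 10) | (py_ucs4)(lo - 0xDC00));
        }

        int main(void)
        {
            /* '\U00010000' is stored as the pair D800 DC00 on a narrow build */
            py_unicode buf[2] = { 0xD800, 0xDC00 };
            /* a Py_UCS4-taking property function could now see U+10000 whole */
            printf("joined: U+%05X\n", join_surrogates(buf[0], buf[1]));
            return 0;
        }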

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 4, 2009

    Surrogates aren't optional features of UTF-16, we really need to get
    this fixed. That includes .isalpha().

    We might keep the old public API for compatibility, but it should be
    clearly marked as broken for non-BMP scalar values.

    I don't see a problem with changing 2.x. The existing behaviour is
    broken for non-BMP scalar values, so surely nobody can claim dependence
    on it.

    @Rhamphoryncus Rhamphoryncus mannequin added the type-bug An unexpected behavior, bug, or error label Oct 4, 2009
    @malemburg
    Member

    Adam Olsen wrote:

    > Surrogates aren't optional features of UTF-16, we really need to get
    > this fixed. That includes .isalpha().

    We use UCS2 on narrow Python builds, not UTF-16.

    > We might keep the old public API for compatibility, but it should be
    > clearly marked as broken for non-BMP scalar values.

    That has always been the case. UCS2 doesn't support surrogates.

    However, we have been slowly moving into the direction of making
    the UCS2 storage appear like UTF-16 to the Python programmer.

    This process is not yet complete and will likely never complete,
    since it must still be possible to create things like lone
    surrogates for processing purposes, so care has to be taken
    when using non-BMP code points on narrow builds.

    > I don't see a problem with changing 2.x. The existing behaviour is
    > broken for non-BMP scalar values, so surely nobody can claim dependence
    > on it.

    No, but changing the APIs from 16-bit integers to 32-bit integers
    does require a recompile of all code using it. Otherwise you
    end up with segfaults.

    Also, the Unicode type database itself uses Py_UNICODE, so
    case mapping would fail for non-BMP code points.

    So if we want to support accessing non-BMP type information
    on narrow builds, we'd need to change the complete
    Unicode type database API to work with UCS4 code points and then
    provide a backwards compatible C API using Py_UNICODE. Due
    to the UCS2/UCS4 API renaming done in unicodeobject.h, this
    would amount to exposing both the UCS2 and the UCS4 variants
    of the APIs on narrow builds.

    With such an approach we'd not break the binary API and
    still get the full UCS4 range of code points in the type
    database. The change would be possible in Python 2.x and
    3.x (which now both use the same strategy w/r to change
    management).

    Would someone be willing to work on this ?

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > No, but changing the APIs from 16-bit integers to 32-bit integers
    > does require a recompile of all code using it.

    Is it acceptable between 3.1 and 3.2 for example? ISTM that other
    changes already require recompilation of extension modules.

    > Also, the Unicode type database itself uses Py_UNICODE, so
    > case mapping would fail for non-BMP code points.

    Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    >> No, but changing the APIs from 16-bit integers to 32-bit integers
    >> does require a recompile of all code using it.

    > Is it acceptable between 3.1 and 3.2 for example? ISTM that other
    > changes already require recompilation of extension modules.

    With the proposed approach, we'll keep binary compatibility, so
    this is not much of an issue.

    Note: Changes to the binary interface can be done in minor releases,
    but we should make sure that it's not possible to load an extension
    compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    >> Also, the Unicode type database itself uses Py_UNICODE, so
    >> case mapping would fail for non-BMP code points.
    >
    > Where, please? in unicodedata.c, getuchar and _getrecord_ex use Py_UCS4.

    The change affects the Unicode type database which is implemented
    in unicodectype.c, not the Unicode database, which already uses UCS4.

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > we should make sure that it's not possible to load an extension
    > compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    This is the case with this patch: today all these functions
    (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
    _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
    The patch removes the #defines: 3.1 modules that call
    _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

    > The change affects the Unicode type database which is implemented
    > in unicodectype.c, not the Unicode database, which already uses UCS4.

    Are you referring to the _PyUnicode_TypeRecord structure?
    The first three fields only contain values up to 65535, so they could
    use "unsigned short" even for UCS4 builds.
    All the other uses are precisely changed by the patch...
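
    A simplified model of the #define mangling described here (demo names,
    not the real headers): the generic name resolves at compile time to a
    build-specific symbol, so an extension built against one flavor references
    a symbol the other flavor doesn't export and is rejected at load time.

        #include <stdio.h>

        /* the generic name is a #define onto a build-specific symbol */
        #ifdef DEMO_WIDE_BUILD
        #  define demo_IsAlpha demo_UCS4_IsAlpha
        #else
        #  define demo_IsAlpha demo_UCS2_IsAlpha
        #endif

        static int demo_UCS2_IsAlpha(unsigned short ch) { return ch == 'A'; } /* stub */
        static int demo_UCS4_IsAlpha(unsigned int ch)   { return ch == 'A'; } /* stub */

        int main(void)
        {
            /* code compiled here links against exactly one mangled symbol;
               removing the #define changes the name, so stale binaries fail
               to load instead of silently misbehaving */
            printf("%d %d\n", demo_IsAlpha('A'), demo_UCS4_IsAlpha('A'));
            return 0;
        }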

    @ezio-melotti
    Member

    >> We might keep the old public API for compatibility, but it should be
    >> clearly marked as broken for non-BMP scalar values.

    > That has always been the case. UCS2 doesn't support surrogates.
    >
    > However, we have been slowly moving into the direction of making
    > the UCS2 storage appear like UTF-16 to the Python programmer.

    UCS2 died long ago; is there any reason why we keep using a UCS2 that
    "appears" like UTF-16 instead of real UTF-16?

    > This process is not yet complete and will likely never complete,
    > since it must still be possible to create things like lone
    > surrogates for processing purposes, so care has to be taken
    > when using non-BMP code points on narrow builds.

    I don't exactly know all the details of the current implementation, but
    -- from what I understand reading this (correct me if I'm wrong) -- it
    seems that the implementation is half-UCS2 to allow things like the
    processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
    work with surrogate pairs and hence with chars outside the BMP.

    What are the use cases for processing lone surrogates? Wouldn't it be
    better to use UTF-16 and disallow them (since they are illegal) and
    possibly provide some other way to deal with them (if it's really needed)?

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > we should make sure that it's not possible to load an extension
    > compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.

    > This is the case with this patch: today all these functions
    > (_PyUnicode_IsAlpha, _PyUnicode_ToLowercase) are actually #defines to
    > _PyUnicodeUCS2_* or _PyUnicodeUCS4_*.
    > The patch removes the #defines: 3.1 modules that call
    > _PyUnicodeUCS4_IsAlpha wouldn't load into a 3.2 interpreter.

    True, but we can do better. For narrow builds, the API currently
    exposes the UCS2 APIs. We'd need to expose the UCS4 APIs *in addition*
    to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    For wide builds, we don't need to change anything.

    > The change affects the Unicode type database which is implemented
    > in unicodectype.c, not the Unicode database, which already uses UCS4.

    > Are you referring to the _PyUnicode_TypeRecord structure?
    > The first three fields only contain values up to 65535, so they could
    > use "unsigned short" even for UCS4 builds.

    I haven't checked, but it's certainly possible to have a code point
    use a non-BMP lower/upper/title case mapping, so this should be
    made possible as well, if we're going to make changes to the type
    database.

    @malemburg
    Member

    This is off-topic for the tracker item, but I'll reply anyway:

    Ezio Melotti wrote:
    >>> We might keep the old public API for compatibility, but it should be
    >>> clearly marked as broken for non-BMP scalar values.
    > 
    >> That has always been the case. UCS2 doesn't support surrogates.
    > 
    >> However, we have been slowly moving into the direction of making
    >> the UCS2 storage appear like UTF-16 to the Python programmer.
    > 
    > UCS2 died long ago, is there any reason why we keep using an UCS2 that
    > "appears" like UTF-16 instead of real UTF-16?

    UCS2 is how we store Unicode in Python for narrow builds internally.
    It's a storage format, not an encoding.

    However, on narrow builds such as the Windows builds, you will sometimes
    want to create Unicode strings that use non-BMP code points. Since
    both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to
    expose a bit of automatic conversion at the Python level to make
    things easier for the programmer.

    >> This process is not yet complete and will likely never complete,
    >> since it must still be possible to create things like lone
    >> surrogates for processing purposes, so care has to be taken
    >> when using non-BMP code points on narrow builds.

    > I don't exactly know all the details of the current implementation, but
    > -- from what I understand reading this (correct me if I'm wrong) -- it
    > seems that the implementation is half-UCS2 to allow things like the
    > processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
    > work with surrogate pairs and hence with chars outside the BMP.
    >
    > What are the use cases for processing lone surrogates? Wouldn't it be
    > better to use UTF-16 and disallow them (since they are illegal) and
    > possibly provide some other way to deal with them (if it's really needed)?

    No, because Python is meant to be used for working on all Unicode
    code points. Lone surrogates are not allowed in transfer encodings
    such as UTF-16 or UTF-8, but they are valid Unicode code points and
    you need to be able to work with them, since you may want to construct
    surrogate pairs by hand or get lone surrogates as a result of slicing a
    Unicode string.

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    > We'd need to expose the UCS4 APIs *in addition*
    > to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    Why have two names for the same function? It's Python 3, after all.
    Or is this "no recompile" feature so important (as long as changes are
    clearly shown to the user)? It does not work on Windows, FWIW.

    > I haven't checked, but it's certainly possible to have a code point
    > use a non-BMP lower/upper/title case mapping, so this should be
    > made possible as well, if we're going to make changes to the type
    > database.

    OK, here is a new patch. Even if this does not happen with unicodedata
    up to 5.1, the table has only 175 entries, so memory usage is not
    dramatically increased.
    Py_UNICODE is no longer used at all in unicodectype.c.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > We'd need to expose the UCS4 APIs *in addition*
    > to those APIs and have the UCS2 APIs redirect to the UCS4 ones.

    > Why have two names for the same function? It's Python 3, after all.

    It's not the same function... the UCS2 version would take a
    Py_UNICODE parameter, the UCS4 version a Py_UCS4 parameter.

    I don't understand the comment about Python 3.x. FWIW, we're no
    longer in the "backwards incompatible changes are allowed" mode
    for 3.x.

    > Or is this "no recompile" feature so important (as long as changes are
    > clearly shown to the user)? It does not work on Windows, FWIW.

    There are generally two options for API changes within a
    major release branch:

    1. the changes are API backwards compatible and only the Python API
      version is changed

    2. the changes are not API backwards compatible; in such a case,
      Python has to reject imports of old module (as it always
      does on Windows), so the Python API version has to be changed
      *and* the import mechanism must reject the import

    The second option was used when transitioning from 2.4 to 2.5 due
    to the Py_ssize_t changes.

    We could do the same for 2.7/3.2, but if it's just needed for this
    one change, then I'd rather stick to implementing the first option.

    > I haven't checked, but it's certainly possible to have a code point
    > use a non-BMP lower/upper/title case mapping, so this should be
    > made possible as well, if we're going to make changes to the type
    > database.

    > OK, here is a new patch. Even if this does not happen with unicodedata
    > up to 5.1, the table has only 175 entries, so memory usage is not
    > dramatically increased.
    > Py_UNICODE is no longer used at all in unicodectype.c.

    Sorry, but this doesn't work: the functions have to return Py_UNICODE
    and raise an exception if the return value doesn't fit.

    Otherwise, you'd get completely wrong values in code downcasting
    the return value to Py_UNICODE on narrow builds.

    Another good reason to use two sets of APIs. The new set could
    indeed return Py_UCS4 values.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 5, 2009

    On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:

    > We use UCS2 on narrow Python builds, not UTF-16.
    >
    >> We might keep the old public API for compatibility, but it should be
    >> clearly marked as broken for non-BMP scalar values.
    >
    > That has always been the case. UCS2 doesn't support surrogates.
    >
    > However, we have been slowly moving into the direction of making
    > the UCS2 storage appear like UTF-16 to the Python programmer.
    >
    > This process is not yet complete and will likely never complete,
    > since it must still be possible to create things like lone
    > surrogates for processing purposes, so care has to be taken
    > when using non-BMP code points on narrow builds.

    Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made
    this quite clear.

    UTF-16 was designed as an easy transition from UCS-2. Indeed, if your
    code only does searches or joins existing strings then it will Just
    Work; declare it UTF-16 and you are done. We have a lot more work to
    do than that (as in this bug report), and we can't reasonably prevent
    the user from splitting surrogate pairs via poor code, but a 95%
    solution doesn't mean we suddenly revert all the way back to UCS-2.

    If the intent really was to use UCS-2 then a correctly functioning
    UTF-16 codec would join a surrogate pair into a single scalar value,
    then raise an error because it's outside the range representable in
    UCS-2. That's not very helpful though; obviously, it's much better to
    use UTF-16 internally.

    "The alternative (no matter what the configure flag is called) is
    UTF-16, not UCS-2 though: there is support for surrogate pairs in
    various places, including the \U escape and the UTF-8 codec."
    http://mail.python.org/pipermail/python-dev/2008-July/080892.html

    "If you find places where the Python core or standard library is doing
    Unicode processing that would break when surrogates are present you
    should file a bug. However this does not mean that every bit of code
    that slices a string at an arbitrary point (and hence risks slicing in
    the middle of a surrogate) is incorrect -- it all depends on what is
    done next with the slice."
    http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    @malemburg
    Member

    Adam Olsen wrote:
    > On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:
    >> We use UCS2 on narrow Python builds, not UTF-16.
    >>
    >>> We might keep the old public API for compatibility, but it should be
    >>> clearly marked as broken for non-BMP scalar values.
    >>
    >> That has always been the case. UCS2 doesn't support surrogates.
    >>
    >> However, we have been slowly moving into the direction of making
    >> the UCS2 storage appear like UTF-16 to the Python programmer.
    >>
    >> This process is not yet complete and will likely never complete
    >> since it must still be possible to create things line lone
    >> surrogates for processing purposes, so care has to be taken
    >> when using non-BMP code points on narrow builds.
    > 
    > Balderdash.  We expose UTF-16 code units, not UCS-2.  Guido has made
    > this quite clear.
    > 
    > UTF-16 was designed as an easy transition from UCS-2.  Indeed, if your
    > code only does searches or joins existing strings then it will Just
    > Work; declare it UTF-16 and you are done.  We have a lot more work to
    > do than that (as in this bug report), and we can't reasonably prevent
    > the user from splitting surrogate pairs via poor code, but a 95%
    > solution doesn't mean we suddenly revert all the way back to UCS-2.
    > 
    > If the intent really was to use UCS-2 then a correctly functioning
    > UTF-16 codec would join a surrogate pair into a single scalar value,
    > then raise an error because it's outside the range representable in
    > UCS-2.  That's not very helpful though; obviously, it's much better to
    > use UTF-16 internally.
    > 
    > "The alternative (no matter what the configure flag is called) is
    > UTF-16, not UCS-2 though: there is support for surrogate pairs in
    > various places, including the \U escape and the UTF-8 codec."
    > http://mail.python.org/pipermail/python-dev/2008-July/080892.html
    > 
    > "If you find places where the Python core or standard library is doing
    > Unicode processing that would break when surrogates are present you
    > should file a bug. However this does not mean that every bit of code
    > that slices a string at an arbitrary point (and hence risks slicing in
    > the middle of a surrogate) is incorrect -- it all depends on what is
    > done next with the slice."
    > http://mail.python.org/pipermail/python-dev/2008-July/080900.html

    All this is just nitpicking, really. UCS2 is a character set,
    UTF-16 an encoding.

    It so happens that when the Unicode consortium realized
    that 16 bits would not be enough to represent all scripts of the
    world, they added the concept of surrogates and reserved a few
    ranges of code points in UCS2 to represent these extra code
    points, which are not part of UCS2 but of its extension, UCS4.

    The conversion of these surrogate pairs to UCS4 code point
    values is what you find defined in UTF-16.

    If we were to implement Unicode using UTF-16 as storage format,
    we would not be able to store single lone surrogates, since these
    are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
    code points, etc.

    PEP-100 really says it all:

    http://www.python.org/dev/peps/pep-0100/
    

    """
    This [internal] format will hold UTF-16 encodings of the corresponding
    Unicode ordinals. The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    ...
    Future implementations can extend the 16 bit restriction to the
    full set of all UTF-16 addressable characters (around 1M
    characters).
    """

    Note that I wrote the PEP and worked on the implementation at a time
    when Unicode 2.x was still in wide-spread use (mostly on Windows)
    and 3.0 was just being released:

    http://www.unicode.org/history/publicationdates.html
    

    But all that is off-topic for this ticket, so please let's just
    stop such discussions.

    @Rhamphoryncus
    Mannequin

    Rhamphoryncus mannequin commented Oct 5, 2009

    On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <report@bugs.python.org> wrote:

    > All this is just nitpicking, really. UCS2 is a character set,
    > UTF-16 an encoding.

    UCS is a character set, for most purposes synonymous with the Unicode
    character set. UCS-2 and UTF-16 are both encodings of that character
    set. However, UCS-2 can only represent the BMP, while UTF-16 can
    represent the full range.

    > If we were to implement Unicode using UTF-16 as storage format,
    > we would not be able to store single lone surrogates, since these
    > are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
    > code points, etc.

    No. Internal usage may become temporarily ill-formed, but this is a
    compromise, and acceptable so long as we never export them to other
    systems.

    Not that I wouldn't *prefer* a system that wouldn't store lone
    surrogates, but.. pragmatics prevail.

    > Note that I wrote the PEP and worked on the implementation at a time
    > when Unicode 2.x was still in wide-spread use (mostly on Windows)
    > and 3.0 was just being released:
    >
    >        http://www.unicode.org/history/publicationdates.html

    I think you hit the nail on the head there. 10 years ago, unicode
    meant something different than it does today. That's reflected in PEP-100 and in the code. Now it's time to move on, switch to the modern
    terminology, modern usage, and modern specs.

    > But all that is off-topic for this ticket, so please let's just
    > stop such discussions.

    It needs to be discussed somewhere. It's a distraction from fixing
    the bug, but at least it's more private here. Would you prefer email?

    @amauryfa
    Member

    amauryfa commented Oct 6, 2009

    So the discussion is now on 2 points:

    1. Is the change backwards compatible? (at the code level, after
       recompilation). My answer is yes, because all known case
       transformations stay in the same plane: if you pass a char in the BMP,
       they return a char in the BMP; if you pass a code >0xFFFF, you get
       another code >0xFFFF. In other words: in narrow builds, when you pass
       Py_UNICODE, the answer will be correct even when downcast to
       Py_UNICODE. If you want, I can add checks to makeunicodedata.py to
       verify that future Unicode standards don't break this statement.

       "Naive" code that simply walks the Py_UNICODE* buffer will have
       identical behavior. (The current unicode methods are in this case.
       They should be fixed later.)

    2. Is this change acceptable for 3.2? I'd say yes, because existing
       extension modules that use these functions will need to be recompiled;
       the function names change, so the modules won't load otherwise. There is
       no need to change the API number for this.

    @malemburg
    Member

    It's not as easy as that.

    The functions for case conversion are used in a way that assumes they
    never fail (and indeed, the existing functions cannot fail).

    What we can do is change the input parameter to Py_UCS4, but not the
    Py_UNICODE output parameter, since that would cause lots of compiler
    warnings and implicit truncation on UCS2 builds, which would be a pretty
    disruptive change.

    However, this change would not really help anyone if there are no
    mappings from BMP to non-BMP or vice-versa, so I'm not sure whether this
    whole exercise is worth the effort.

    It appears to be better to just leave the case mapping APIs unchanged -
    or am I missing something ?

    The situation is different for the various Py_UNICODE_IS*() APIs: for
    these we can change the input parameter to Py_UCS4, remove the name
    mangling and add UCS2 helper functions to maintain API compatibility on
    UCS2 builds.
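
    A sketch of the compatibility scheme proposed here, with stand-in names
    and a stubbed property check (assumptions, not CPython source): the real
    function accepts the full code point range, and a thin UCS2 entry point
    widens its argument and forwards, preserving the old binary interface.

        #include <stdio.h>

        typedef unsigned int   py_ucs4; /* stand-in for Py_UCS4 */
        typedef unsigned short py_ucs2; /* stand-in for Py_UNICODE on UCS2 builds */

        /* new-style API: accepts the full code point range (check is a stub) */
        static int demo_IsPrintable_UCS4(py_ucs4 ch)
        {
            return ch >= 0x20 && ch != 0x7F && ch <= 0x10FFFF;
        }

        /* UCS2 helper kept for binary compatibility: widen and forward */
        static int demo_IsPrintable_UCS2(py_ucs2 ch)
        {
            return demo_IsPrintable_UCS4((py_ucs4)ch);
        }

        int main(void)
        {
            printf("%d %d\n", demo_IsPrintable_UCS2(0x41),
                              demo_IsPrintable_UCS4(0x10000)); /* 1 1 */
            return 0;
        }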

    @amauryfa
    Member

    amauryfa commented Oct 6, 2009

    > that would cause lots of compiler
    > warnings and implicit truncation on UCS2 builds

    Unfortunately, there is no such warning, or the initial problem we are trying
    to solve would have been spotted by such a warning (unicode_repr() calls
    Py_UNICODE_ISPRINTABLE() with a Py_UCS4 argument).

    gcc has a -Wconversion flag, (which I tried today on python) but it is far too
    verbose before version 4.3, and this newer version still has some false
    positives. http://gcc.gnu.org/wiki/NewWconversion

    But the most important thing is that implicit truncation on UCS2 builds is what
    happens already! The patch does not solve it, but at least it yields sensible
    results to wary code.
    Or can you imagine some (somewhat working) code whose behavior will be worse
    after the change?

    @ezio-melotti ezio-melotti changed the title from "UnicodeEncodeError - I can't even see license" to "Use Py_UCS4 instead of Py_UNICODE in unicodectype.c" on Jan 8, 2010
    @malemburg
    Member

    I don't see the point in changing the various conversion APIs in the unicode database to return Py_UCS4 when there are no conversions that map code points between BMP and non-BMP.

    In order to solve the problem in question (unicode_repr() failing), we should change the various property checking APIs to accept Py_UCS4 input data. This needlessly increases the type database size without real benefit.

    For that to work properly we'll have to either make sure that extensions get recompiled if they use these changed APIs, or we provide an additional set of UCS2 APIs that extend the Py_UNICODE input value to a Py_UCS4 value before calling the underlying Py_UCS4 API.

    @amauryfa
    Member

    amauryfa commented Jan 9, 2010

    > I don't see the point in changing the various conversion APIs in the
    > unicode database to return Py_UCS4 when there are no conversions that
    > map code points between BMP and non-BMP.

    For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.

    > In order to solve the problem in question (unicode_repr() failing),
    > we should change the various property checking APIs to accept Py_UCS4
    > input data. This needlessly increases the type database size without
    > real benefit.

    [I'm not sure I understand. For me the 'real benefit' is that it solves the problem in question.]

    Yes, this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the _PyUnicode_TypeRecords array needs 1068 more bytes.
    On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb);
    the memory usage of the just-started interpreter is about 32K larger (around 5M). These look like reasonable figures to me.

    > For that to work properly we'll have to either make sure that
    > extensions get recompiled if they use these changed APIs, or we
    > provide an additional set of UCS2 APIs that extend the Py_UNICODE
    > input value to a Py_UCS4 value before calling the underlying Py_UCS4
    > API.

    Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > I don't see the point in changing the various conversion APIs in the
    > unicode database to return Py_UCS4 when there are no conversions that
    > map code points between BMP and non-BMP.

    > For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.

    > In order to solve the problem in question (unicode_repr() failing),
    > we should change the various property checking APIs to accept Py_UCS4
    > input data. This needlessly increases the type database size without
    > real benefit.
    > [I'm not sure I understand. For me the 'real benefit' is that it solves the problem in question.]

    The problem in question is already solved by just changing the property
    checking APIs. Changing the conversion APIs fixes a non-problem, since there
    are no mappings that would require Py_UCS4 on a UCS2 build.

    > Yes, this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the _PyUnicode_TypeRecords array needs 1068 more bytes.
    > On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb);
    > the memory usage of the just-started interpreter is about 32K larger (around 5M). These look like reasonable figures to me.

    > For that to work properly we'll have to either make sure that
    > extensions get recompiled if they use these changed APIs, or we
    > provide an additional set of UCS2 APIs that extend the Py_UNICODE
    > input value to a Py_UCS4 value before calling the underlying Py_UCS4
    > API.

    > Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.

    Hmm, that's a good point.

    OK, you got me convinced: let's go for it then.

    @amauryfa amauryfa self-assigned this Jul 7, 2010
    @amauryfa
    Member

    amauryfa commented Jul 7, 2010

    Now I wonder whether it's reasonable to consider this character
    U+10000 (LINEAR B SYLLABLE B008 A)
    as printable with repr(). Yes, its category is "Lo", but is there a font which can display it?

    @ezio-melotti
    Member

    Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work.

    @malemburg
    Member

    Ezio Melotti wrote:


    > Given that '\U00010000'.isprintable() returns True, I would say yes. If someone needs to print this char and has an appropriate font to do it, I don't see why it shouldn't work.

    Note that Python3 will send printable code points as-is to the console,
    so whether or not a code point is considered printable should take the
    common availability of fonts being able to display the code point
    into account. Otherwise, a user would just see a square box instead of
    the much more useful escape sequence.

    The "printable" property is a Python invention, not a Unicode property,
    so we do have some freedom in deciding what is printable and what
    is not.

    In recent years the situation has just started clearing up
    for fonts covering the assigned BMP range, mostly due to Microsoft
    actively working to get their fonts to cover the Unicode 2.0 assigned
    code points (BMP only):

    http://support.microsoft.com/kb/287247
    

    The only font set I know of that tries to go beyond BMP is this
    shareware one:

    http://code2000.net/
    

    Most other fonts just cover small parts of the Unicode assigned
    code point ranges:

    http://unicode.org/resources/fonts.html
    http://www.wazu.jp/
    http://www.unifont.org/fontguide/
    

    I suppose that in a few years we'll see OS and GUIs mix and match the
    available fonts to display Unicode code points.

    Given the font situation, I don't think we should have repr()
    pass through printable non-BMP code points as is. Perhaps we shouldn't
    give those code points the printable property to begin with.

    @ezio-melotti
    Member

    [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]

    The current implementation considers printable all the characters except those defined in the Unicode character database as belonging to one of the following categories:

    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl (Separator, Line: '\u2028', LINE SEPARATOR)
    • Zp (Separator, Paragraph: '\u2029', PARAGRAPH SEPARATOR)
    • Zs (Separator, Space), other than ASCII space ('\x20')

    We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.

    > Note that Python3 will send printable code points as-is to the
    > console, so whether or not a code point is considered printable
    > should take the common availability of fonts being able to display
    > the code point into account. Otherwise, a user would just see a
    > square box instead of the much more useful escape sequence

    If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see bpo-5110), and that makes repr() barely usable.
    ascii() might be an alternative if the user wants to see the escape sequence instead of a square box.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    I suggest going ahead and applying this patch; at least it correctly selects "printable" characters, whatever that means.
    I filed bpo-9198 to decide whether chr(0x10000) should be printable.

    @malemburg
    Member

    Ezio Melotti wrote:


    > [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]
    >
    > The current implementation considers printable all the characters except those defined in the Unicode character database as belonging to one of the following categories:
    >
    > • Cc (Other, Control)
    > • Cf (Other, Format)
    > • Cs (Other, Surrogate)
    > • Co (Other, Private Use)
    > • Cn (Other, Not Assigned)
    > • Zl (Separator, Line: '\u2028', LINE SEPARATOR)
    > • Zp (Separator, Paragraph: '\u2029', PARAGRAPH SEPARATOR)
    > • Zs (Separator, Space), other than ASCII space ('\x20')

    > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.

    Without fonts, you can't print the code points, even if the Unicode
    database defines the code point as not having one of the above
    classes. And that's probably also the reason why the Unicode
    database doesn't define a printable property :-)

    I also find the use of Zl, Zp and Zs in the definition somewhat
    arbitrary: whitespace is certainly printable. This also doesn't
    match the isprint() C lib API:

    http://www.cplusplus.com/reference/clibrary/cctype/isprint/

    "A printable character is any character that is not a control character."

    > Note that Python3 will send printable code points as-is to the
    > console, so whether or not a code point is considered printable
    > should take the common availability of fonts being able to display
    > the code point into account. Otherwise, a user would just see a
    > square box instead of the much more useful escape sequence

    > If the concern is about the usefulness of repr() in the console, note that on the Windows terminal trying to display most of the characters results in an error (see bpo-5110), and that makes repr() barely usable.
    > ascii() might be an alternative if the user wants to see the escape sequence instead of a square box.

    That's a different problem, but indeed also related to the
    printable property which was introduced as part of the Unicode repr()
    change: if the console encoding cannot represent
    the printable code points, you get an error.

    I was never a fan of the Unicode repr() change to begin with. The
    repr() of an object should work in almost all cases. Being able to
    read the repr() of an object in clear text is only secondary.
    IMHO, allowing all printable code points to pass through unescaped
    was not beneficial. We have str() for getting readable representations
    of objects. Anyway, we're stuck with it now, so we have to work
    around the issues...

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > I suggest going ahead and applying this patch; at least it correctly selects "printable" characters, whatever that means.
    > I filed bpo-9198 to decide whether chr(0x10000) should be printable.

    +1

    @ezio-melotti
    Member

    Amaury, before applying the patch consider replacing the tab characters before the comments with spaces. The use of tabs is discouraged.

    Marc-Andre Lemburg wrote:

    > I was never a fan of the Unicode repr() change to begin with. The
    > repr() of an object should work in almost all cases.

    I still think that bpo-5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue.

    @malemburg
    Member

    Ezio Melotti wrote:

    > Marc-Andre Lemburg wrote:
    >> I was never a fan of the Unicode repr() change to begin with. The
    >> repr() of an object should work in almost all cases.
    >
    > I still think that bpo-5110 should be fixed (there's also a patch to fix the issue on Windows). If you agree please comment there and/or reopen that issue.

    Let's discuss this on bpo-9198.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    > consider replacing the tab characters before the comments with spaces

    It's actually already the case in my working copy.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    A new patch, generated on top of r82662

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > A new patch, generated on top of r82662

    Could you explain what this bit is about ?

    @@ -349,7 +313,7 @@
    configure Python using --with-wctype-functions. This reduces the
    interpreter's code size. */

    -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

     #include <wctype.h>

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    A new patch that doesn't remove an important check; it avoids a crash when the C macro is called with a huge number. Thanks Ezio.

    @amauryfa
    Member

    amauryfa commented Jul 8, 2010

    > Could you explain what this bit is about ?
    > -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    > +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

    On Windows at least, HAVE_USABLE_WCHAR_T is true; this means that Py_UNICODE can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore.

    Now that the unicode database functions claim to use Py_UCS4, the functions of wctype.h are usable only if they also support Py_UCS4.

    OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common.
    BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > Could you explain what this bit is about ?
    > -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    > +#if defined(Py_UNICODE_WIDE) && defined(WANT_WCTYPE_FUNCTIONS)

    > On Windows at least, HAVE_USABLE_WCHAR_T is true; this means that Py_UNICODE can be converted to wchar_t. But now that Py_UNICODE_ISSPACE() takes Py_UCS4, it cannot be converted to wchar_t anymore.

    Right, but you still have to check whether wchar_t is usable, don't
    you ? If wchar_t is 2 bytes on a non-Windows platform, using the
    wctypes functions won't work either.

    The line should probably read:

    #if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
    

    > Now that the unicode database functions claim to use Py_UCS4, the functions of wctype.h are usable only if they also support Py_UCS4.
    >
    > OTOH the symbol WANT_WCTYPE_FUNCTIONS is defined only if ./configure is called with --with-wctype-functions, I don't expect it to be common.
    > BTW, the comment says that "This reduces the interpreter's code size". I don't really agree, these functions are two-liners.

    True.

    The support for the wctype functions should have been removed long ago,
    since they cause subtle incompatibilities between Python builds. I should
    probably never have added it in the first place... people were worried
    about the size of the type record tables at the time, which is why
    I thought it would be a good idea to try to optionally use the C lib
    functions.

    The comment was true before the Python type tables were changed
    into a type record database: the switch used to remove the
    Python tables required for those functions. With the type records
    database, this is no longer the case, since the records are also
    being used for properties that are not exposed via wctype functions.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:


    > A new patch that doesn't remove an important check; it avoids a crash when the C macro is called with a huge number. Thanks Ezio.

    Could you please be more specific on what you changed ?

    At least visually, there don't appear to be any differences between
    v4 and v5 of the patch.

    Thanks,

    Marc-Andre Lemburg

    @ezio-melotti
    Member

    The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.)

    The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.
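
    For readers following along, a sketch of the shape of the guard being
    discussed, with a stubbed lookup (the real gettyperecord() in
    unicodectype.c indexes the generated two-level tables and returns a
    _PyUnicode_TypeRecord*; every name here is a demo stand-in):

        #include <stdio.h>

        typedef struct { int flags; } demo_type_record;

        /* record 0 holds the defaults used for out-of-range code points */
        static const demo_type_record demo_records[2] = { { 0 }, { 1 } };

        static const demo_type_record *demo_gettyperecord(unsigned int code)
        {
            int index;
            if (code > 0x10FFFF) /* the range check under discussion */
                index = 0;       /* huge values fall back to the default record */
            else
                index = 1;       /* stand-in for the real two-level table lookup */
            return &demo_records[index];
        }

        int main(void)
        {
            printf("%d %d\n", demo_gettyperecord(0x41)->flags,
                              demo_gettyperecord(0xFFFFFFFFu)->flags); /* 1 0 */
            return 0;
        }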

    @malemburg
    Member

    Ezio Melotti wrote:


    > The 'if' in 'gettyperecord'. (I would also rewrite that as "if (code > 0x10FFFF)", it looks more readable to me.)

    Ah, good catch !

    > The patch seems OK to me. In the NEWS message 'python' should be capitalized and I would also mention .isprintable() and possibly other functions that are affected directly -- mentioning repr() is good too, but it's only affected indirectly.

    @amauryfa
    Member

    amauryfa commented Jul 9, 2010

    str.isprintable() & co. are not changed by this patch, because they enumerate Py_UNICODE units and do not join surrogates. See bpo-9200.

    @amauryfa
    Member

    amauryfa commented Jul 9, 2010

    In this 6th patch, the wctype part was changed as suggested;
    there is one more condition, Py_UNICODE_WIDE:

    -#if defined(HAVE_USABLE_WCHAR_T) && defined(WANT_WCTYPE_FUNCTIONS)
    +#if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) && defined(Py_UNICODE_WIDE)
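
    A compilable sketch of that guard (simplified demo macro and a stubbed
    fallback, not the real definitions): the C library's iswspace() can only
    back the property check when wchar_t is wide enough for a Py_UCS4 value.

        #include <stdio.h>
        #include <wctype.h>

        #if defined(WANT_WCTYPE_FUNCTIONS) && defined(HAVE_USABLE_WCHAR_T) \
            && defined(Py_UNICODE_WIDE)
        /* wchar_t is 4 bytes here, so a full code point fits into wint_t */
        #  define DEMO_ISSPACE(ch) iswspace((wint_t)(ch))
        #else
        /* database-backed path, stubbed to a few ASCII whitespace points */
        static int DEMO_ISSPACE(unsigned int ch)
        {
            return ch == 0x20 || (ch >= 0x09 && ch <= 0x0D);
        }
        #endif

        int main(void)
        {
            printf("%d %d\n", DEMO_ISSPACE(0x20) != 0, DEMO_ISSPACE(0x41) != 0); /* 1 0 */
            return 0;
        }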

    @amauryfa
    Member

    Committed with r84177.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022