
Should repr() print unicode characters outside the BMP? #53444

Closed
amauryfa opened this issue Jul 8, 2010 · 15 comments
Assignees
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@amauryfa
Member

amauryfa commented Jul 8, 2010

BPO 9198
Nosy @malemburg, @loewis, @amauryfa, @abalkin, @vstinner, @ezio-melotti, @merwok
Files
  • issue9198.diff: Incomplete (but working) patch to "fix" displayhook.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2010-12-02.01:51:06.081>
    created_at = <Date 2010-07-08.08:53:04.028>
    labels = ['type-bug', 'invalid', 'expert-unicode']
    title = 'Should repr() print unicode characters outside the BMP?'
    updated_at = <Date 2010-12-02.01:51:06.080>
    user = 'https://github.com/amauryfa'

    bugs.python.org fields:

    activity = <Date 2010-12-02.01:51:06.080>
    actor = 'vstinner'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2010-12-02.01:51:06.081>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2010-07-08.08:53:04.028>
    creator = 'amaury.forgeotdarc'
    dependencies = []
    files = ['17915']
    hgrepos = []
    issue_num = 9198
    keywords = ['patch']
    message_count = 15.0
    messages = ['109520', '109528', '109531', '109533', '109534', '109535', '109536', '109550', '109555', '109617', '109620', '109702', '113149', '113734', '123033']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'belopolsky', 'Rhamphoryncus', 'vstinner', 'ezio.melotti', 'eric.araujo']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9198'
    versions = ['Python 3.2']

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    On wide unicode builds, '\U00010000'.isprintable() returns True, and repr() returns the character unmodified.
    Is this good behavior, given that very few fonts can display this character?

    Marc-Andre Lemburg wrote:

    The "printable" property is a Python invention, not a Unicode property,
    so we do have some freedom in deciding what is printable and what
    is not.

    The current implementation considers printable """all the characters except those characters defined in the Unicode character database as belonging to the following categories:

    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl Separator, Line ('\u2028', LINE SEPARATOR)
    • Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
    • Zs (Separator, Space) other than ASCII space('\x20').
      """

    We could also arbitrarily exclude all the non-BMP chars.
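
    For reference, a rough Python-level sketch of the quoted definition (the real check lives in the C implementation of str.isprintable(); the helper below is only an approximation):

    import unicodedata

    # Categories excluded by the quoted definition: the "Other" and
    # "Separator" categories, except the ASCII space.
    NONPRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}

    def is_printable(ch):
        # Approximation of str.isprintable() for a single character.
        if ch == " ":          # ASCII space is explicitly allowed
            return True
        return unicodedata.category(ch) not in NONPRINTABLE_CATEGORIES

    print(is_printable("\U00010000"))   # True: U+10000 has category "Lo"
    print("\U00010000".isprintable())   # True on a wide build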

    @amauryfa amauryfa added topic-unicode type-bug An unexpected behavior, bug, or error labels Jul 8, 2010
    @malemburg
    Member

    [Adding some bits from the discussion on bpo-5127 for better context]

    """
    Ezio Melotti wrote:

    >
    > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
    >
    > [This should probably be discussed on python-dev or in another issue, so feel free to move the
    conversation there.]
    >
    > The current implementation considers printable """all the characters except those characters
    defined in the Unicode character database as following categories are considered printable.
    > * Cc (Other, Control)
    > * Cf (Other, Format)
    > * Cs (Other, Surrogate)
    > * Co (Other, Private Use)
    > * Cn (Other, Not Assigned)
    > * Zl Separator, Line ('\u2028', LINE SEPARATOR)
    > * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
    > * Zs (Separator, Space) other than ASCII space('\x20')."""
    >
    > We could also arbitrary exclude all the non-BMP chars, but that shouldn't be based on the
    availability of the fonts IMHO.

    Without fonts, you can't print the code points, even if the Unicode
    database defines the code point as not having one of the above
    classes. And that's probably also the reason why the Unicode
    database doesn't define a printable property :-)

    I also find the use of Zl, Zp and Zs in the definition somewhat
    arbitrary: whitespace is certainly printable. This also doesn't
    match the isprint() C lib API:

    http://www.cplusplus.com/reference/clibrary/cctype/isprint/

    "A printable character is any character that is not a control character."
    """

    There are two aspects:

    • What to call a printable code point?

      I'd suggest following the C lib approach: all non-control
      characters.

    • Which criteria to use for Unicode repr()?

      Given the original intent of the extension to allow printable
      code points to pass through unescaped, it may be better to
      define "printable" based on the sys.stdout/sys.stderr encoding:

      A code point may pass through unescaped if it is
      printable per the above definition and does not cause problems
      with the sys.stdout/sys.stderr encoding.

      Since we can't apply this check on a per-character basis,
      I think we should only allow non-ASCII code points to pass through
      if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
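
    A minimal sketch of that criterion; the helper name and the exact set of accepted codecs are illustrative, not an agreed-on API:

    import codecs
    import sys

    UTF_CODECS = {"utf-8", "utf-16", "utf-16-le", "utf-16-be",
                  "utf-32", "utf-32-le", "utf-32-be"}

    def allow_non_ascii_passthrough():
        # Hypothetical check: let non-ASCII code points through unescaped only
        # when both stdout and stderr use a codec that can represent any code point.
        for stream in (sys.stdout, sys.stderr):
            enc = getattr(stream, "encoding", None)
            if enc is None or codecs.lookup(enc).name not in UTF_CODECS:
                return False
        return True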

    @ezio-melotti
    Member

    Regarding the fonts, I think that anyone who actually uses or needs to use characters outside the BMP probably has (now, or will in a few months/years) a font able to display them.
    I also tried to print the printable chars from U+FFFF to U+1FFFF on my linux terminal and about half of them were rendered correctly (the default font is DejaVu Sans Mono).

    The question is then whether we do more harm by hiding these chars behind escape sequences for the people who use them, or by showing them as boxes to the people who don't use them and don't have the right font.

    Regarding the categories that should be considered printable, I agree that the Zx categories could be considered printable, so the non-printable chars could be limited to the Cx categories.

    Since we can't apply this check based on a per character basis,
    I think we should only allow non-ASCII code points to pass through
    if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.

    If I understood correctly, you are suggesting to look at the sys.stdout/sys.stderr encoding and:

    • if it's a UTF-* encoding: allow all the non-ASCII (printable) codepoints (because they are the only encodings that can represent all the Unicode characters);
    • if it's not a UTF-* encoding: allow only ASCII (printable) codepoints.

    This would however introduce a regression. For example on Windows (where the encoding is usually not a UTF-* one) I would expect accented characters (at least the ones in the codepage I'm using -- and usually it matches the native language of the user) to be displayed correctly.
    A more accurate approach would be to actually try to encode the string and escape only the chars that can't be encoded (and also the ones that are not printable, of course), but this can't be done in repr() because repr() returns a Unicode string (in bpo-5110 I did it in sys.displayhook), and encoding the string there would mean doing it twice.
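
    A minimal string-level sketch of that approach (the encoding name is just an example); it escapes only the characters the target encoding cannot represent:

    def escape_unencodable(text, encoding="iso-8859-1"):
        data = text.encode(encoding, "backslashreplace")
        # The escapes produced by backslashreplace are plain ASCII, so decoding
        # with the same codec round-trips without errors.
        return data.decode(encoding)

    print(escape_unencodable("snowman: \u2603, accented: öäå"))
    # -> snowman: \u2603, accented: öäå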

    Also note that I might want to use repr() to get a representation of the object without necessarily sending it through sys.stdout. For example I could write it to a file or send it via mail (Roundup reports errors via mail showing a repr of the variables), and in both cases I might use/want UTF-8 even if sys.stdout is ASCII.

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    A more accurate approach would be to actually try to encode the string
    and escape only the chars that can't be encoded

    This is already the case with sys.stderr, it uses the "backslashreplace" error handler. Do you suggest the same for sys.stdout?

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    Yes, repr() should not depend on the user's terminal.

    @ezio-melotti
    Member

    This is already the case with sys.stderr, it uses the "backslashreplace"
    error handler. Do you suggest the same for sys.stdout?

    See http://bugs.python.org/issue5110#msg84965

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    The chapter "Rationale" in PEP-3138 explains why sys.stdout uses "strict" encoding, when sys.stderr uses "backslashreplace".

    It would be possible to use "backslashreplace" for stdout as well for interactive sessions, but the PEP also rejected this because it '''may add confusion of the kind "it works in interactive mode but not when redirecting to a file".'''
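
    For reference, the asymmetry described in the PEP is visible directly on the stream objects; the snippet below assumes a stdout/stderr encoding that cannot represent U+2603:

    import sys

    print(sys.stdout.errors, sys.stderr.errors)   # typically: strict backslashreplace

    snowman = "a snowman: \u2603"
    print(snowman, file=sys.stderr)   # never fails: the char is written as \u2603
    print(snowman)                    # raises UnicodeEncodeError under e.g. iso-8859-1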

    @ezio-melotti
    Member

    Yes, but as I said in the message I linked, that's *not* what I want to do.
    I want to change only the behavior of the interactive interpreter and only when the string sent to stdout is not encodable (so only when the encoding is not UTF-*).

    This can be done by changing sys.displayhook, but I haven't figured out yet how to do it. The default displayhook (Python/sysmodule.c:71) calls PyFile_WriteObject (Objects/fileobject.c:139), passing the object as is and stdout. PyFile_WriteObject then does the equivalent of sys.stdout.write(repr(obj)).
    This is all done by passing around unicode strings. Ideally we should try to encode the repr of the objects in displayhook using sys.stdout.encoding and 'backslashreplace', but then:

    1. we would have to decode the resulting byte string again before passing it to PyFile_WriteObject;
    2. we would have to find a way to write a bytestring to sys.stdout, but I don't think that's possible (keep in mind that sys.stdout could also be some other object).

    OTOH, even if the intermediate encoding/decoding step looks redundant, it shouldn't affect performance too much, because it's not that common to print a lot of text in the interactive interpreter, and even in those cases performance probably isn't so important. It would anyway be better to find another way to do it.

    @loewis
    Mannequin

    loewis mannequin commented Jul 8, 2010

    I think if you change it to stop considering non-BMP characters as printable, somebody will complain. If you change it in any other way, somebody will complain. Somebody will always complain - so you might as well leave things the way they are. Or you change it in a way that you think most users would consider useful - but don't expect consensus.

    @vstinner
    Member

    vstinner commented Jul 8, 2010

    amaury> Should repr() print unicode characters outside the BMP?

    Yes. I don't understand why characters outside the BMP should be treated differently from other characters. Is it a workaround for bogus operating systems? My Linux terminal (Konsole on KDE) is able to display Ugaritic characters (range starting at U+10383).

    amaury> it may be better to define "printable" based on the
    amaury> sys.stdout/sys.stderr encoding

    You cannot do that: the stdout and stderr encodings might be different, their error handlers are different, and repr() output is not always written to stdout or stderr. If you write repr() output to a file, you have to know the encoding of the file. How can I get the encoding? If sys.stdout is replaced by an io.StringIO() object (e.g. doctests), you don't have any "encoding" (StringIO only manipulates unicode objects, not bytes objects).

    ezio> I want to change only the behavior of the interactive interpreter

    This idea was rejected by the PEP.

    I agree that it "may add confusion of the kind: it works in interactive mode but not when redirecting to a file".

    I already noticed such a problem: the interactive interpreter adds '' to sys.path, so import behaves differently in the interpreter than in a script. It's annoying because it took me hours to understand why it was different.

    ezio> and only when the string sent to stdout is not encodable

    Which means setting sys.stdout.errors to something other than strict. I prefer to detect unicode problems earlier.

    E.g. if you set errors to 'replace', write() will never fail. If the output is used as input for another program (UNIX pipe), you will send "?" to the reader process; I'm not sure that is the expected behaviour. And if stdout is not a TTY, stdout uses the ASCII encoding (which will raise a Unicode error at the first non-ASCII character!).
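
    A small illustration of that difference at the codec level (a pipe reader would simply receive these bytes):

    text = "café \u2603"
    print(text.encode("ascii", "replace"))            # b'caf? ?'   -- data silently degraded
    print(text.encode("ascii", "backslashreplace"))   # b'caf\\xe9 \\u2603'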

    @ezio-melotti
    Member

    I suggest to:

    1. keep the current behavior for non-BMP chars (i.e. print them normally);
    2. change isprintable to consider the Zx categories printable (this will affect repr() too);
    3. change displayhook (NOT sys.stdout.encoding) to use backslashreplace when the string contains chars that are not encodable with sys.stdout.encoding *.

    * Note that this will affect only the objects that are converted with repr() in the interpreter, e.g. ">>> x = 'foo'; x", and *NOT* ">>> x = 'foo'; print(x)". Since the first behavior exists *only* in the interactive interpreter, it won't be inconsistent with normal programs ("x = 'foo'; x" in a program doesn't display anything).
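
    A rough pure-Python sketch of point 3, assuming the hook only needs to handle the UnicodeEncodeError case (the real change would live in Python/sysmodule.c):

    import builtins
    import sys

    def escaping_displayhook(value):
        # Behave like the default displayhook, but backslash-escape only the
        # characters that stdout's encoding cannot represent, instead of failing.
        if value is None:
            return
        builtins._ = None
        text = repr(value)
        encoding = getattr(sys.stdout, "encoding", None)
        if encoding is not None:
            try:
                text.encode(encoding)
            except UnicodeEncodeError:
                text = text.encode(encoding, "backslashreplace").decode(encoding)
        print(text)
        builtins._ = value

    # sys.displayhook = escaping_displayhook   # affects only the interactive prompt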

    @ezio-melotti
    Member

    Here is a patch to "fix" sys_displayhook (note: the patch is just a proof of concept -- it seems to work fine but I still have to clean it up, add comments, rename and reorganize some vars and add tests).
    This is an example output while using iso-8859-1 as IO encoding:

    wolf@linuxvm:~/dev/py3k$ PYTHONIOENCODING=iso-8859-1 ./python
    Python 3.2a0 (py3k:82643:82644M, Jul  9 2010, 11:39:25)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys; sys.stdout.encoding, sys.stdin.encoding
    ('iso-8859-1', 'iso-8859-1')
    >>> 'ascii string'
    'ascii string'  # works fine
    >>> 'some accented chars: öäå'
    'some accented chars: öäå'  # works fine - these chars are encodable
    >>> 'a snowman: \u2603'
    'a snowman: \u2603'  # non-encodable - the char is escaped instead of raising an error
    >>> 'snowman: \u2603, and accented öäå'
    'snowman: \u2603, and accented öäå' # only non-encodable chars are escaped
    >>> # the behavior of print is still the same:
    >>> print('some accented chars: öäå') 
    some accented chars: öäå
    >>> print('a snowman: \u2603')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 11: ordinal not in range(256)
    
    -------------------------------------
    
    While testing the patch with PYTHONIOENCODING=iso-8859-1 I also found this weird issue, which however is *not* related to the patch, since I managed to reproduce it on a clean py3k using PYTHONIOENCODING=iso-8859-1:
    >>> 'òàùèì  óáúéí  öäüëï'
    'ò�\xa0ùèì  óáúé�\xad  öäüëï'
    >>> 'òàùèì  óáúéí  öäüëï'.encode('iso-8859-1')
    b'\xc3\xb2\xc3\xa0\xc3\xb9\xc3\xa8\xc3\xac  \xc3\xb3\xc3\xa1\xc3\xba\xc3\xa9\xc3\xad  \xc3\xb6\xc3\xa4\xc3\xbc\xc3\xab\xc3\xaf'
    >>> 'òàùèì'.encode('utf-8')
    b'\xc3\x83\xc2\xb2\xc3\x83\xc2\xa0\xc3\x83\xc2\xb9\xc3\x83\xc2\xa8\xc3\x83\xc2\xac'

    I think there might be some conflict between the IO encoding that I specified and the one that my terminal actually uses, but I couldn't figure out what's going on exactly (it's also weird that only 'à' and 'í' are not displayed correctly). Unless this behavior is expected I'll open another issue about it.
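
    A plausible reconstruction of what happens, assuming the terminal itself uses UTF-8 while PYTHONIOENCODING forces iso-8859-1 on stdin:

    # Each character typed at the prompt arrives as UTF-8 bytes but is decoded
    # as iso-8859-1, so it turns into two latin-1 characters:
    "à".encode("utf-8").decode("iso-8859-1")   # 'Ã\xa0' -- U+00A0 NO-BREAK SPACE (Zs)
    "í".encode("utf-8").decode("iso-8859-1")   # 'Ã\xad' -- U+00AD SOFT HYPHEN (Cf)
    # The second byte of 'à' and 'í' lands in a non-printable category, so repr()
    # escapes exactly those two, while e.g. 'ò' becomes 'Ã²', which is printable.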

    @ezio-melotti
    Member

    Assigning to myself so that I'll remember to finish and commit the patch.

    @ezio-melotti ezio-melotti self-assigned this Aug 7, 2010
    @vstinner
    Member

    About bpo-9198.diff:

    • exit directly if !PyErr_ExceptionMatches(PyExc_UnicodeEncodeError) to avoid a useless level of indentation
    • why do you clear the exception before calling PyObject_Repr()? If you cannot execute code while an exception is active, you should maybe save/restore the original exception?
    • the code is long: it can maybe be moved to a subfunction
    • PyObject_CallMethod(buffer, "write", "(O)", encoded) to call write method
    • you can maybe just use the strict error handler when you decode the encoded variable: backslashreplace doesn't work for decoding anyway
    >>> b"a\xff".decode("ascii", "backslashreplace")
    ...
    TypeError: don't know how to handle UnicodeDecodeError in error callback
    >>> b"a\xff".decode("utf-8", "backslashreplace")
    ...
    TypeError: don't know how to handle UnicodeDecodeError in error callback

    Note: some encodings don't support backslashreplace, especially the mbcs encoding. But on Windows, sys.stdout.encoding is not mbcs but cpXXXX (e.g. cp850).

    @vstinner
    Member

    vstinner commented Dec 2, 2010

    I created a new issue for Ezio's proposal to patch sys.displayhook: bpo-10601.

    To answer the initial question, "Should repr() print unicode characters outside the BMP?": Marc-Andre, Ezio and I agree that we should keep the current behavior for non-BMP chars (i.e. print them normally), so I'm closing this issue.

    @vstinner vstinner closed this as completed Dec 2, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022