Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repr() should not escape non-ASCII characters #46882

Closed
atsuoishimoto mannequin opened this issue Apr 14, 2008 · 43 comments
Closed

repr() should not escape non-ASCII characters #46882

atsuoishimoto mannequin opened this issue Apr 14, 2008 · 43 comments
Labels
type-feature A feature request or enhancement

Comments

@atsuoishimoto
Copy link
Mannequin

atsuoishimoto mannequin commented Apr 14, 2008

BPO 2630
Nosy @malemburg, @gvanrossum, @birkenfeld, @atsuoishimoto, @amauryfa, @pitrou, @ericvsmith
Files
  • diff.txt
  • diff2.txt
  • diff3.txt
  • diff4.txt
  • docdiff1.txt
  • diff5.txt
  • diff6.txt
  • diff7_1.txt
  • diff8.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-06-11.18:38:55.202>
    created_at = <Date 2008-04-14.09:54:22.958>
    labels = ['type-feature']
    title = 'repr() should not escape non-ASCII characters'
    updated_at = <Date 2008-06-12.02:44:45.865>
    user = 'https://github.com/atsuoishimoto'

    bugs.python.org fields:

    activity = <Date 2008-06-12.02:44:45.865>
    actor = 'ishimoto'
    assignee = 'none'
    closed = True
    closed_date = <Date 2008-06-11.18:38:55.202>
    closer = 'georg.brandl'
    components = ['None']
    creation = <Date 2008-04-14.09:54:22.958>
    creator = 'ishimoto'
    dependencies = []
    files = ['10028', '10034', '10193', '10447', '10456', '10491', '10507', '10512', '10518']
    hgrepos = []
    issue_num = 2630
    keywords = ['patch']
    message_count = 43.0
    messages = ['65461', '65470', '65483', '65490', '65491', '65493', '65494', '65514', '65535', '65536', '65542', '65564', '65573', '65601', '65606', '66216', '66298', '66299', '66302', '66303', '66307', '66310', '66320', '66424', '66425', '67409', '67439', '67591', '67651', '67653', '67654', '67655', '67656', '67657', '67665', '67667', '67670', '67692', '67702', '67704', '67705', '68008', '68047']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'gvanrossum', 'georg.brandl', 'ishimoto', 'amaury.forgeotdarc', 'pitrou', 'eric.smith']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2630'
    versions = ['Python 3.0']

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 14, 2008

    In py3k, repr() escapes non-ASCII characters in Unicode to \uXXXX as
    Python 2. This is unpleasant feature if you are working with non-latin
    characters. This issue was once discussed by Hye-Shik Chang[1], but was
    rejected. Here's a new challenge for Python 3 to fix issue.

    In this patch, repr() converts special ascii characters such as "\t",
    "\r", "\n", but doesn't convert non-ASCII characters to \uXXXX form.
    Non-ASCII characters are converted by TextIOWrapper on printing. I set
    'errors' attribute of sys.stdout and sys.stderr to 'backslashreplace', so
    un-printable characters are converted to '\uXXXX' if your console
    cannot print such characters.

    This patch breaks five regr tests on my environment.
    I'll fix these tests if this patch is acceptable.

    [1] http://mail.python.org/pipermail/python-dev/2002-October/029443.html
    http://bugs.python.org/issue479898

    @atsuoishimoto atsuoishimoto mannequin added the type-feature A feature request or enhancement label Apr 14, 2008
    @gvanrossum
    Copy link
    Member

    I think this has potential, but it is too liberal. There are many more
    characters that cannot be assumed printable, e.g. many of the Latin-1
    characters in the range 0x80 through 0x9F. Isn't there some Unicode
    data table that shows code points that are safely printable?

    OTOH there are other potential use cases where it would be nice to see
    the \u escapes, e.g. when one is concerned about sequences that print
    the same but don't have the same content (e.g. pre-normalization).

    The backslashreplace trick is nice, I didn't even know about that. :-)

    @amauryfa
    Copy link
    Member

    What if we turn on the backslashreplace trick for some operations only?
    For example: sys_displayhook and sys_excepthook.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 15, 2008

    I think this has potential, but it is too liberal. There are many more
    characters that cannot be assumed printable, e.g. many of the Latin-1
    characters in the range 0x80 through 0x9F. Isn't there some Unicode
    data table that shows code points that are safely printable?

    As Michael Urman pointed out, we can use Unicode properties.
    Or we can define a set of non-printable characters (e.g.
    sys.nonprintablechars).

    OTOH there are other potential use cases where it would be nice to see
    the \u escapes, e.g. when one is concerned about sequences that print
    the same but don't have the same content (e.g. pre-normalization).

    For such cases, print(s.encode("ascii", "backslashreplace")) might work.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 15, 2008

    What if we turn on the backslashreplace trick for some operations only?
    For example: sys_displayhook and sys_excepthook.

    It would be difficult, since *_repr() API don't know who is the caller.

    @gvanrossum
    Copy link
    Member

    Atsuo: I missed Michael Urman's comment. Can you copy it here, or
    (better :-) write a patch that uses it?

    Amaury: I think it would be okay to use backslashreplace as the default
    error handler for sys.stderr. Probably not for sys.stdout or other
    files, since I'm sure many users prefer the errors when their data
    cannot be printed rather than silently writing \u escapes that might
    cause other code reading their output to choke. For sys.stderr though I
    think not having exceptions raised when attempting to print errors is
    very valuable.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 15, 2008

    Okay, I'll revise a patch later today.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 15, 2008

    I revised a patch against Python 3.0a4.

    • As-per suggestion from Michael Urman, unicode_repr()
      refers unicode database to determine characters to be
      hex-encoded.

    • sys.stdout doesn't use 'backslashreplace'.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 16, 2008

    I think sys.stdout need to have backslashreplace error handler.
    Without backslashreplace, print(listOfJapaneseString) prints nothing,
    but raises an exception. This is worse than Python2.

    @gvanrossum
    Copy link
    Member

    I don't think this is a good idea; I've explained why earlier on this issue.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 16, 2008

    Sorry, I missed to write "for interactive session".
    I agree for sys.stdout and other files should not have default
    backslashescape, but for iteractive session, I think sys.stdout can
    have backslasespape handler to avoid exceptions.

    @malemburg
    Copy link
    Member

    While it may be desirable to to have repr(unicode) return a non-ASCII
    string, the suggested approach is not suitable to solve the problem.

    repr() is usually used in logging and applications/users/tools don't
    expect to suddenly find non-ASCII or even mixed encodings in a log file.

    If you do want to have this more flexible, then make the encoding used
    by unicode_repr() adjustable, turn the existing code into a codec (e.g.
    "unicode-repr") and leave it setup as default.

    Users who wish to see non-ASCII repr(unicode) data can then adjust the
    used encoding to their liking.

    This is both more flexible and backwards compatible with 2.x.

    Also note that the separation of the Unicode database from the
    interpreter core was done to keep the interpreter footprint manageable.
    It's not a good idea to just dump the complete table set into
    unicodeobject.c via an #include. If you need to reference APIs from
    modules in C, the usual approach is to create a PyCObject which is then
    exported by the module (see e.g. the datetime module) and imported by
    code needing it.

    BTW: "printable" is not a defined term in Unicode. What is or is not
    printable really depends on the use case, e.g. there are quite a few
    code points in Unicode that don't result in any glyph being "printed" to
    the screen. A Unicode string could then look as if it had fewer code
    points than it actually does - which is not really what you want when
    debugging code or sifting through log files.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 17, 2008

    If you do want to have this more flexible, then make the encoding used
    by unicode_repr() adjustable, turn the existing code into a codec (e.g.
    "unicode-repr") and leave it setup as default.

    Turning code in unicode_repr() into a codec is good idea. I'll write two
    codecs(existing repr and new Unicode friendly codec) and post a revised
    patch later.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Apr 18, 2008

    Is a codec which encode() returns an Unicode allowed in Python3? I
    started to think codec is not nessesary, but python function is enough.

    @malemburg
    Copy link
    Member

    On 2008-04-18 05:35, atsuo ishimoto wrote:

    atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:

    Is a codec which encode() returns an Unicode allowed in Python3?

    Sure, why not ?

    I think you have to ask another question: Is repr() allowed to
    return a string (instead of Unicode) in Py3k ?

    If not, then unicode_repr() will have to check the return value of
    the codec and convert it back to Unicode as necessary.

    I started to think codec is not nessesary, but python function is enough.

    That's what we currently have with unicode_repr(), but it doesn't
    solve the problem.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented May 4, 2008

    New patch agaist current py3k branch.

    All the regr tests faild by my patch is now fixed as far as I
    can run.
    I also modified a doctest module a bit, so should be reviewed
    by module owners.

    @gvanrossum
    Copy link
    Member

    On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

    On 2008-04-18 05:35, atsuo ishimoto wrote:
    > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
    >
    > Is a codec which encode() returns an Unicode allowed in Python3?

    Sure, why not ?

    Actually, it is not. In Py3k, x.encode() always requires x to be a str
    (i.e. unicode) instance and return a bytes instance. y.decode()
    requires y to be a bytes instance and returns a str (i.e. unicode)
    instance.

    I think you have to ask another question: Is repr() allowed to
    return a string (instead of Unicode) in Py3k ?

    In Py3k, "strings" *are* unicode. The str data type is Unicode.

    If you're asking about repr() possibly returning a bytes instance,
    definitely not.

    If not, then unicode_repr() will have to check the return value of
    the codec and convert it back to Unicode as necessary.

    What codec?

    > I started to think codec is not nessesary, but python function is enough.

    That's what we currently have with unicode_repr(), but it doesn't
    solve the problem.

    I'm lost here.

    PS. Atsuo's PEP has now been checked in as PEP-3138. Discussion should
    start soon on the python-3000 list.

    @gvanrossum
    Copy link
    Member

    FWIW, I've uploaded diff3.txt to Rietveld:
    http://codereview.appspot.com/767

    Code review comments should be reflected here.

    I had to skip the change to Modules/unicodename_db.h which were too
    large for Rietveld to handle.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented May 6, 2008

    I forgot to mention to Modules/unicodename_db.h.

    The current unicodename_db.h looks it was generated
    by old Tools/unicode/makeunicodedata.py. This patch
    includes newly generated unicodename_db.h, but we
    can exclude the change if not necessary.

    @gvanrossum
    Copy link
    Member

    No need to change anything, the diff is just too big for the code
    review tool (Rietveld), but since it consists only of numbers we don't
    need to review it anyway. :)

    @malemburg
    Copy link
    Member

    On 2008-05-06 00:07, Guido van Rossum wrote:

    Guido van Rossum <guido@python.org> added the comment:

    On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:
    > On 2008-04-18 05:35, atsuo ishimoto wrote:
    > > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
    > >
    > > Is a codec which encode() returns an Unicode allowed in Python3?
    >
    > Sure, why not ?

    Actually, it is not. In Py3k, x.encode() always requires x to be a str
    (i.e. unicode) instance and return a bytes instance. y.decode()
    requires y to be a bytes instance and returns a str (i.e. unicode)
    instance.

    So you've limited the codec design to just doing Unicode<->bytes
    conversions ?

    The original codec design was to have the codec decide which
    types to take on input and to generate on output, e.g. to
    escape characters in Unicode (converting Unicode to Unicode),
    work on compressed 8-bit strings (converting 8-bit strings to
    8-bit strings), etc.

    > I think you have to ask another question: Is repr() allowed to
    > return a string (instead of Unicode) in Py3k ?

    In Py3k, "strings" *are* unicode. The str data type is Unicode.

    With "strings" I always refer to 8-bit strings, ie. 8-bit data that
    is encoded in some encoding.

    If you're asking about repr() possibly returning a bytes instance,
    definitely not.

    > If not, then unicode_repr() will have to check the return value of
    > the codec and convert it back to Unicode as necessary.

    What codec?

    The idea is to have a codec which takes the Unicode object and
    converts it to its repr()-value.

    Now, since you apparently cannot
    go the direct way anymore (ie. have the codec encode Unicode to
    Unicode), you'd have to first use a codec which converts the Unicode
    object to its repr()-value represented as bytes object and then
    convert the bytes object back to Unicode in unicode_repr().

    With the original design, this extra step wouldn't have been
    necessary.

    > > I started to think codec is not nessesary, but python function is enough.
    >
    > That's what we currently have with unicode_repr(), but it doesn't
    > solve the problem.

    I'm lost here.

    See my previous replies on this ticket.

    PS. Atsuo's PEP has now been checked in as PEP-3138. Discussion should
    start soon on the python-3000 list.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented May 6, 2008

    No need to change anything, the diff is just too big for the code
    review tool (Rietveld), but since it consists only of numbers we don't
    need to review it anyway. :)

    I wonder why unicodename_db.h have not updated after
    makeunicodedata.py was modified. If new makeunicodedata.py
    breaks something, I should remove the chage to unicodename_db.h
    from this patch (My patch works whether unicodename_db.h is
    updated or not.). I'll post a question to python-3000 list.

    @gvanrossum
    Copy link
    Member

    On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:

    So you've limited the codec design to just doing Unicode<->bytes
    conversions ?

    Yes. This was quite a conscious decision that was not taken lightly,
    with lots of community input, quite a while ago.

    The original codec design was to have the codec decide which
    types to take on input and to generate on output, e.g. to
    escape characters in Unicode (converting Unicode to Unicode),
    work on compressed 8-bit strings (converting 8-bit strings to
    8-bit strings), etc.

    Unfortunately this design made it hard to reason about the correctness
    of code, since (especially in Py3k, where bytes and str are more
    different than str and unicode were in 2.x) it's hard to write code
    that uses .encode() or .decode() unless it knows which codec is being
    used.

    IOW, when translated to 3.0, the design violates the general design
    principle that the *type* of a function's or method's return value
    should not depend on the *value* of one of the arguments.

    >> I think you have to ask another question: Is repr() allowed to
    >> return a string (instead of Unicode) in Py3k ?
    >
    > In Py3k, "strings" *are* unicode. The str data type is Unicode.

    With "strings" I always refer to 8-bit strings, ie. 8-bit data that
    is encoded in some encoding.

    You will have to change this habit or you will thoroughly confuse both
    users and developers of 3.0. "String" refers to the built-in "str"
    type which in Py3k is PyUnicode. For the PyString type we use the
    built-in type "bytes".

    > If you're asking about repr() possibly returning a bytes instance,
    > definitely not.
    >
    >> If not, then unicode_repr() will have to check the return value of
    >> the codec and convert it back to Unicode as necessary.
    >
    > What codec?

    The idea is to have a codec which takes the Unicode object and
    converts it to its repr()-value.

    Now, since you apparently cannot
    go the direct way anymore (ie. have the codec encode Unicode to
    Unicode), you'd have to first use a codec which converts the Unicode
    object to its repr()-value represented as bytes object and then
    convert the bytes object back to Unicode in unicode_repr().

    With the original design, this extra step wouldn't have been
    necessary.

    Why does everything have to be a codec?

    @malemburg
    Copy link
    Member

    On 2008-05-06 19:10, Guido van Rossum wrote:

    Guido van Rossum <guido@python.org> added the comment:

    On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
    > So you've limited the codec design to just doing Unicode<->bytes
    > conversions ?

    Yes. This was quite a conscious decision that was not taken lightly,
    with lots of community input, quite a while ago.

    > The original codec design was to have the codec decide which
    > types to take on input and to generate on output, e.g. to
    > escape characters in Unicode (converting Unicode to Unicode),
    > work on compressed 8-bit strings (converting 8-bit strings to
    > 8-bit strings), etc.

    Unfortunately this design made it hard to reason about the correctness
    of code, since (especially in Py3k, where bytes and str are more
    different than str and unicode were in 2.x) it's hard to write code
    that uses .encode() or .decode() unless it knows which codec is being
    used.

    IOW, when translated to 3.0, the design violates the general design
    principle that the *type* of a function's or method's return value
    should not depend on the *value* of one of the arguments.

    I understand where this concept originates and usual apply this
    rule to software design as well, however, in the particular case
    of codecs, the codec registry and its helper functions are merely
    interfaces to code that is defined elsewhere.

    In comparison, the approach is very much like getattr() - you know
    what the attribute is called, but know nothing about its type
    until you receive it from the function.

    The reason codecs where designed like this was to be able to
    easily stack them. For this to work, only the interfaces need
    to be defined, without restricting the codecs too much in terms
    of which types may be used.

    I'd suggest to lift the type restrictions from the general
    codecs.c access APIs (PyCodec_*), since they don't really belong
    there and instead only impose the limitation on PyUnicode and
    PyString methods .encode() and .decode().

    If you then also allow those methods to return *both*
    PyUnicode and PyString, you'd still have strong typing
    (only 1 of two possible types is allowed) and stacking
    streams or having codecs that work on PyUnicode->PyUnicode
    or PyString->PyString would still be accessible via
    .encode()/.decode().

    > >> I think you have to ask another question: Is repr() allowed to
    > >> return a string (instead of Unicode) in Py3k ?
    > >
    > > In Py3k, "strings" *are* unicode. The str data type is Unicode.
    >
    > With "strings" I always refer to 8-bit strings, ie. 8-bit data that
    > is encoded in some encoding.

    You will have to change this habit or you will thoroughly confuse both
    users and developers of 3.0. "String" refers to the built-in "str"
    type which in Py3k is PyUnicode. For the PyString type we use the
    built-in type "bytes".

    Well, I'm confused by the P3k use of terms (esp. because the
    C type names don't match the Python ones), which is why I'm
    talking about 8-bit strings and Unicode.

    Perhaps it's better to use PyString and PyUnicode.

    > > If you're asking about repr() possibly returning a bytes instance,
    > > definitely not.
    > >
    > >> If not, then unicode_repr() will have to check the return value of
    > >> the codec and convert it back to Unicode as necessary.
    > >
    > > What codec?
    >
    > The idea is to have a codec which takes the Unicode object and
    > converts it to its repr()-value.
    >
    > Now, since you apparently cannot
    > go the direct way anymore (ie. have the codec encode Unicode to
    > Unicode), you'd have to first use a codec which converts the Unicode
    > object to its repr()-value represented as bytes object and then
    > convert the bytes object back to Unicode in unicode_repr().
    >
    > With the original design, this extra step wouldn't have been
    > necessary.

    Why does everything have to be a codec?

    It doesn't. It's just that codecs are so easy to add, change
    and adjust that reusing the existing code is more attractive
    than reinventing the wheel every time you need to make
    a conversion from one text form to another adjustable in
    some way.

    In the case addresses by this ticket, I see the usefulness
    of having native language being written to the console using
    native glyphs, but there are so many drawbacks to this (see the
    discussion on the ticket and the mailing list), that
    I think there needs to be a way to adjust the mechanism
    or at least be able to revert to the existing repr() output.

    Furthermore, a codec implementation of what Atsuo has in mind
    would also be useful in other contexts, e.g. where you want
    to write PyUnicode to a stream without introducing line breaks.

    @gvanrossum
    Copy link
    Member

    I'd be happy to have a separate more relaxed API for stackable codecs,
    however, the API should not be overloaded on the .encode() and .decode()
    methods on str and bytes objects.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented May 27, 2008

    I updated a patch as per latest PEP.

    • io.TextIOWrapper doesn't provide API to change error handler
      at this time. I should update this patch after the API is
      provided.

    • This patch contains a fix for Tools/unicode/makeunicodedata.py
      in rev 63378.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented May 28, 2008

    docdiff1.txt contains a documentation for functions I added.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 1, 2008

    diff5.txt contains both code and documentation patch for PEP-3138.

    • In this patch, default error-handler of sys.stdout is always 'strict'.

    @birkenfeld
    Copy link
    Member

    Review:

    • Why is an empty string not printable? In any case, the empty string
      should be among the test cases for isprintable().

    • Why not use PyUnicode_DecodeASCII instead of
      PyUnicode_FromEncodedObject? It should be a bit faster.

    • If old-style string formatting gets "%a", .format() must get a "!a"
      specifier.

    • The ascii() and repr() tests should be expanded so that both test the
      same set of objects, and the expected differences. Are there tests for
      failing cases?

    • This is just "return ascii" (in builtin_ascii):
      + if (ascii == NULL)
      + return NULL;
      +
      + return ascii;

    • For PyBool_FromLong(1) and PyBool_FromLong(0) there is Py_RETURN_TRUE
      and Py_RETURN_FALSE. (You're not to blame, the rest of unicodeobject.c
      seems to use them too, probably a legacy.)

    • There appear to be some space indentations in tab-indented files like
      bltinmodule.c and vice versa (unicodeobject.c).

    • C docs/isprintable() docs: The spec
      + Characters defined in the Unicode character database as "Other"
      + or "Separator" other than ASCII space(0x20) are not considered
      + printable.
      is unclear, better say "All character except those ... are considered
      printable".

    • ascii() docs:
      + the non-ASCII
      + characters in the string returned by :func:`ascii`() are hex-escaped
      + to generate a same string as :func:`repr` in Python 2.

    should be

    "the non-ASCII characters in the string returned by :func:`repr` are
    backslash-escaped (with ``\x``, ``\u`` or ``\U``) to generate ...".

    • makeunicodedata: len(list(n for n in names if n is not None)) could
      better be expressed as sum(1 for n in names if n is not None).

    Otherwise, the patch is fine IMO. (I'm surprised that only so few tests
    needed adaptation, that's a sign that we're not testing Unicode enough.)

    @birkenfeld
    Copy link
    Member

    One more thing: with r63891 the encoding and errors arguments for the
    creation of sys.stderr were made configurable; you'll have to adapt the
    patch so that it defaults to backslashescape but can be overridden by
    PYTHONIOENCODING.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 3, 2008

    This patch contains following changes.

    • Added the new C API PyObject_ASCII() for consistency.
    • Added the new string formatting operater for str.format() and
      PyUnicode_FromFormat.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 3, 2008

    Thank you for your review!
    I filed a new patch just before I see your comments.

    On Tue, Jun 3, 2008 at 7:13 PM, Georg Brandl <report@bugs.python.org> wrote:

    Georg Brandl <georg@python.org> added the comment:

    Review:

    • Why is an empty string not printable? In any case, the empty string
      should be among the test cases for isprintable().

    Well, my intuition came from str.islower() was wrong. An empty string is
    printable, of cource.

    • Why not use PyUnicode_DecodeASCII instead of
      PyUnicode_FromEncodedObject? It should be a bit faster.

    Okay, thank you.

    • If old-style string formatting gets "%a", .format() must get a "!a"
      specifier.

    I added the format string in my latest patch.

    • The ascii() and repr() tests should be expanded so that both test the
      same set of objects, and the expected differences. Are there tests for
      failing cases?

    Okay, thank you.

    • This is just "return ascii" (in builtin_ascii):
    •   if (ascii == NULL)
      
    •       return NULL;
      
    •   return ascii;
      

    Fixed in my latest patch.

    • For PyBool_FromLong(1) and PyBool_FromLong(0) there is Py_RETURN_TRUE
      and Py_RETURN_FALSE. (You're not to blame, the rest of unicodeobject.c
      seems to use them too, probably a legacy.)

    Okay, thank you.

    • There appear to be some space indentations in tab-indented files like
      bltinmodule.c and vice versa (unicodeobject.c).

    I think bltinmodule.c is fixed with latest patch, but I don't know what
    is correct indentation for unicodeobject.c. I guess latest patch is
    acceptable.

    • C docs/isprintable() docs: The spec
    • Characters defined in the Unicode character database as "Other"
    • or "Separator" other than ASCII space(0x20) are not considered
    • printable.
      is unclear, better say "All character except those ... are considered
      printable".
    • ascii() docs:
    • the non-ASCII
    • characters in the string returned by :func:`ascii`() are hex-escaped
    • to generate a same string as :func:`repr` in Python 2.

    should be

    "the non-ASCII characters in the string returned by :func:`repr` are
    backslash-escaped (with ``\x``, ``\u`` or ``\U``) to generate ...".

    Okay, thank you.

    • makeunicodedata: len(list(n for n in names if n is not None)) could
      better be expressed as sum(1 for n in names if n is not None).

    I don't want to change here, because this is reversion of rev 63378.

    One more thing: with r63891 the encoding and errors arguments for the
    creation of sys.stderr were made configurable; you'll have to adapt the
    patch so that it defaults to backslashescape but can be overridden by
    PYTHONIOENCODING.

    I think sys.stderr should be default to 'backslashreplace' always. I'll
    post a messege to Py3k-list later.

    Otherwise, the patch is fine IMO. (I'm surprised that only so few tests
    needed adaptation, that's a sign that we're not testing Unicode enough.)

    Thank you very much! I'll file new patch soon.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 3, 2008

    BTW, are new C APIs and functions should be ported to Python 2.6 for
    compatibility, without modifing repr() itself? If so, I'll prepare a
    patch for Python 2.6.

    @birkenfeld
    Copy link
    Member

    ascii() should probably be in future_builtins.

    Whether the C API stuff and .isprintable() should be backported to 2.6
    is something for Guido to decide.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 3, 2008

    I updated the patch as per Georg's advice.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 3, 2008

    I'm sorry, I missed a file to be uploaded. diff7_1.txt is correct file.

    @gvanrossum
    Copy link
    Member

    Whether the C API stuff and .isprintable() should be backported to 2.6
    is something for Guido to decide.

    No way -- while all of this makes sense in Py3k, where all strings are
    Unicode, it would cause no end of problems in 2.6, and it would break
    backward compatibility badly.

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 4, 2008

    stringlib can be compiled for Python 2.6 now, but the '!a' converter is
    disabled by #ifdef for now.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 4, 2008

    Shall the method be called isprintable() or simply printable()? For the
    record, in the io classes, the writable()/readable() convention was chosen.

    @birkenfeld
    Copy link
    Member

    I would expect "abc".isprintable() give me a bool and "abc".printable()
    to return a printable string, as with "abc".lower() and "abc".islower().

    @pitrou
    Copy link
    Member

    pitrou commented Jun 4, 2008

    You are right, I had forgotton about lower()/islower().

    @birkenfeld
    Copy link
    Member

    Patch committed to Py3k branch in r64138. Thanks all!

    @atsuoishimoto
    Copy link
    Mannequin Author

    atsuoishimoto mannequin commented Jun 12, 2008

    Great, thank you!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants