Message 109528 - Python tracker



Author	lemburg
Recipients	amaury.forgeotdarc, ezio.melotti, lemburg
Date	2010-07-08.09:34:49
SpamBayes Score	0.000138904
Marked as misclassified	No
Message-id	<4C359BB8.6050508@egenix.com>
In-reply-to	<1278579187.34.0.662391507791.issue9198@psf.upfronthosting.co.za>

Content

[Adding some bits from the discussion on #5127 for better context]

"""
Ezio Melotti wrote:
> >
> > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> >
> > [This should probably be discussed on python-dev or in another issue, so feel free to move the
> > conversation there.]
> >
> > The current implementation considers printable """all the characters except those characters
> > defined in the Unicode character database as following categories are considered printable.
> >   * Cc (Other, Control)
> >   * Cf (Other, Format)
> >   * Cs (Other, Surrogate)
> >   * Co (Other, Private Use)
> >   * Cn (Other, Not Assigned)
> >   * Zl Separator, Line ('\u2028', LINE SEPARATOR)
> >   * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
> >   * Zs (Separator, Space) other than ASCII space('\x20')."""
> >
> > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the
> > availability of the fonts IMHO.

Without fonts, you can't print the code points, even if the Unicode
database defines the code point as not having one of the above
classes. And that's probably also the reason why the Unicode
database doesn't define a printable property :-)

I also find the use of Zl, Zp and Zs in the definition somewhat
arbitrary: whitespace is certainly printable. This also doesn't
match the isprint() C lib API:

http://www.cplusplus.com/reference/clibrary/cctype/isprint/

"A printable character is any character that is not a control character."
"""
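
To make the difference between the two definitions concrete, here is a minimal Python sketch. It assumes that "control character" in the C lib sense maps to Unicode category Cc, which is only an approximation of isprint(); the function names and the constant are hypothetical and not taken from any patch.

```python
import unicodedata

# Categories the quoted documentation lists as non-printable.
NONPRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable_by_category(ch):
    """Category rule from the quoted docs (hypothetical helper)."""
    # ASCII space is the one Zs character the docs still treat as printable.
    if ch == " ":
        return True
    return unicodedata.category(ch) not in NONPRINTABLE_CATEGORIES


def is_printable_c_style(ch):
    """C-lib-like rule: anything that is not a control character.

    Assumption: "control character" is read as Unicode category Cc.
    """
    return unicodedata.category(ch) != "Cc"
```

Under these assumptions, NO-BREAK SPACE ('\xa0', category Zs) and LINE SEPARATOR ('\u2028', category Zl) are non-printable by the category rule but printable by the C-style rule, while '\n' is non-printable under both.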

There are two aspects:

 * What to call a printable code point?

   I'd suggest following the C lib approach: all non-control
   characters.

 * Which criteria to use for Unicode repr()?

   Given the original intent of the extension to allow printable
   code points to pass through unescaped, it may be better to
   define "printable" based on the sys.stdout/sys.stderr encoding:

   A code point may pass through unescaped if it is
   printable per the above definition and does not cause problems
   with the sys.stdout/sys.stderr encoding.

   Since we can't apply this check on a per-character basis,
   I think we should only allow non-ASCII code points to pass through
   if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
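
The rule proposed above could be sketched roughly as follows. This is only an illustration, not the repr() implementation: the helper names are hypothetical, "control character" is again assumed to mean Unicode category Cc, and the stream encoding is compared by its normalized codec name.

```python
import codecs
import io
import unicodedata

# Encodings named in the proposal as safe for unescaped non-ASCII output.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}


def stream_allows_non_ascii(stream):
    """Return True if the stream's encoding is utf-8, utf-16 or utf-32."""
    encoding = getattr(stream, "encoding", None)
    if not encoding:
        return False
    try:
        # Normalize aliases such as "UTF8" to the canonical codec name.
        name = codecs.lookup(encoding).name
    except LookupError:
        return False
    return name in UNICODE_ENCODINGS


def may_pass_unescaped(ch, stream):
    """Decide whether repr() could leave ch unescaped for this stream."""
    if unicodedata.category(ch) == "Cc":
        return False  # control characters are always escaped
    if ord(ch) < 128:
        return True  # non-control ASCII is safe in any case
    return stream_allows_non_ascii(stream)


# Stand-ins for sys.stdout with different encodings.
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
latin1_stream = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
```

With these stand-in streams, may_pass_unescaped('\u20ac', utf8_stream) is true and may_pass_unescaped('\u20ac', latin1_stream) is false; in real use, sys.stdout or sys.stderr would be passed as the stream.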

History
Date	User	Action	Args
2010-07-08 09:34:52	lemburg	set	recipients: + lemburg, amaury.forgeotdarc, ezio.melotti
2010-07-08 09:34:50	lemburg	link	issue9198 messages
2010-07-08 09:34:50	lemburg	create