Author lemburg
Recipients amaury.forgeotdarc, ezio.melotti, lemburg
Date 2010-07-08.09:34:49
[Adding some bits from the discussion on #5127 for better context]

Ezio Melotti wrote:
> >
> > Ezio Melotti <> added the comment:
> >
> > [This should probably be discussed on python-dev or in another issue, so feel free to move the
> > conversation there.]
> >
> > The current implementation considers printable """all the characters except those characters
> > defined in the Unicode character database as following categories are considered printable.
> >   * Cc (Other, Control)
> >   * Cf (Other, Format)
> >   * Cs (Other, Surrogate)
> >   * Co (Other, Private Use)
> >   * Cn (Other, Not Assigned)
> >   * Zl Separator, Line ('\u2028', LINE SEPARATOR)
> >   * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
> >   * Zs (Separator, Space) other than ASCII space('\x20')."""
> >
> > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the
> > availability of the fonts IMHO.

Without fonts, you can't print the code points, even if the Unicode
database defines the code point as not having one of the above
classes. And that's probably also the reason why the Unicode
database doesn't define a printable property :-)

I also find the use of Zl, Zp and Zs in the definition somewhat
arbitrary: whitespace is certainly printable. This also doesn't
match the isprint() C lib API:

"A printable character is any character that is not a control character."

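To make the difference concrete, here is a minimal sketch comparing the
two definitions. The function names are made up for illustration, and
mapping the C lib notion of "control character" to Unicode category Cc
is my assumption, not something the C standard says:

```python
import unicodedata

# Categories excluded by the definition quoted above.
NON_PRINTABLE = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}

def printable_by_category(ch):
    """Quoted definition: outside the listed categories, or ASCII space."""
    return ch == " " or unicodedata.category(ch) not in NON_PRINTABLE

def printable_c_style(ch):
    """C-lib-like definition; assumes 'control' means category Cc."""
    return unicodedata.category(ch) != "Cc"

# Show where the two definitions agree and disagree.
for ch in ("a", " ", "\u00a0", "\u2028", "\n"):
    print(repr(ch), printable_by_category(ch), printable_c_style(ch))
```

The two functions agree on ordinary letters and on control characters,
and differ only on the separators: NO-BREAK SPACE and LINE SEPARATOR are
rejected by the first and accepted by the second.
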
There are two aspects:

 * What to call a printable code point ?

   I'd suggest following the C lib approach: all non-control
   code points are printable.

 * Which criteria to use for Unicode repr() ?

   Given the original intent of the extension to allow printable
   code points to pass through unescaped, it may be better to
   define "printable" based on the sys.stdout/sys.stderr encoding:

   A code point may pass through unescaped if it is
   printable per the above definition, and does not cause problems
   with the sys.stdout/sys.stderr encoding.

   Since we can't apply this check on a per-character basis,
   I think we should only allow non-ASCII code points to pass through
   if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
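
A rough sketch of that rule, just to illustrate the idea. The names
may_pass_unescaped() and is_printable() are hypothetical, "control" is
again assumed to mean category Cc, and this is not how repr() is
actually implemented:

```python
import codecs
import unicodedata

# Canonical codec names that can represent every code point.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

def is_printable(ch):
    # Assumption: "control character" means Unicode category Cc.
    return unicodedata.category(ch) != "Cc"

def may_pass_unescaped(ch, encoding):
    """Decide whether ch could be left unescaped for a stream encoding."""
    if not is_printable(ch):
        return False
    if ord(ch) < 0x80:
        # Printable ASCII is safe with any ASCII-compatible encoding.
        return True
    # codecs.lookup() normalises aliases such as "UTF8" or "utf_16".
    return codecs.lookup(encoding).name in UNICODE_ENCODINGS
```

In practice the encoding argument would come from sys.stdout.encoding
or sys.stderr.encoding. Byte-order variants such as utf-16-le are not
recognised by this sketch, and it does not deal with lone surrogates,
which are not control characters but still cannot be encoded.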
Date                 User     Action  Args
2010-07-08 09:34:52  lemburg  set     recipients: + lemburg, amaury.forgeotdarc, ezio.melotti
2010-07-08 09:34:50  lemburg  link    issue9198 messages
2010-07-08 09:34:50  lemburg  create