Message 93265 - Python tracker


Author	christoph
Recipients	christoph, ezio.melotti, ggenellina, lemburg
Date	2009-09-29.09:20:11
Message-id	<1254216013.11.0.953096626201.issue6412@psf.upfronthosting.co.za>

Content

> * U+0027 APOSTROPHE
hardcoded (see below)
> * U+00AD SOFT HYPHEN (SHY)
has the "Format (Cf)" property and thus is included automatically
> * U+2019 RIGHT SINGLE QUOTATION MARK
hardcoded (see below)

I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
...     u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
...     u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
: · · ״ ‧ ︓ ﹕ ： ' . ‘ ’ ․ ﹒ ＇ ．
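
For anyone who wants to check the points above, here is a minimal sketch that prints the code point, general category and name of each hardcoded character, plus U+00AD. It assumes Python 3 and only the standard unicodedata module; the names HARDCODED_A, HARDCODED_B and describe are made up for this illustration and do not appear in the patch.

```python
import unicodedata

# The two hardcoded lists from the message, written as Python 3 strings.
# HARDCODED_A / HARDCODED_B are illustrative names, not taken from the patch.
HARDCODED_A = [':', '\xb7', '\u0387', '\u05f4', '\u2027',
               '\ufe13', '\ufe55', '\uff1a']
HARDCODED_B = ["'", '.', '\u2018', '\u2019',
               '\u2024', '\ufe52', '\uff07', '\uff0e']


def describe(ch):
    """Return (code point label, general category, name) for one character."""
    return ('U+%04X' % ord(ch),
            unicodedata.category(ch),
            unicodedata.name(ch, '<unnamed>'))


# U+00AD is appended to show that it carries the Format (Cf) category,
# while none of the hardcoded characters do.
for ch in HARDCODED_A + HARDCODED_B + ['\u00ad']:
    print('%s  %s  %s' % describe(ch))
```

The category column should make it visible why SOFT HYPHEN can be picked up via its "Cf" property while the other characters cannot be selected by general category alone.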

Those cannot currently be extracted automatically, as neither
DerivedCoreProperties.txt nor the source file for the property
"Word_Break(C) = MidLetter or MidNumLet" is available to the script.
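
If such a source file were made available to the script, the extraction could look roughly like the sketch below. This is only an assumption about how it might be done, not what makeunicodedata.py does: it assumes Python 3 and the usual Unicode data file layout ("code point or range ; property value # comment"). The SAMPLE text is a hand-written stand-in rather than real Unicode data, and parse_property_lines is a hypothetical helper.

```python
# Hand-written sample in the "range ; value # comment" layout used by
# Unicode property data files. Illustrative only, NOT the real data.
SAMPLE = """\
# illustrative excerpt
0027          ; MidNumLet # APOSTROPHE
003A          ; MidLetter # COLON
0041..005A    ; ALetter # LATIN CAPITAL LETTER A..Z
2018..2019    ; MidNumLet # LEFT/RIGHT SINGLE QUOTATION MARK
"""


def parse_property_lines(text, wanted):
    """Collect characters whose property value is in `wanted`.

    Hypothetical helper: strips comments, splits on ';', and expands
    'XXXX..YYYY' ranges into individual characters.
    """
    chars = []
    for line in text.splitlines():
        data = line.split('#', 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(';')]
        if len(fields) < 2 or fields[1] not in wanted:
            continue
        first, _, last = fields[0].partition('..')
        start = int(first, 16)
        end = int(last, 16) if last else start
        chars.extend(chr(cp) for cp in range(start, end + 1))
    return chars


mid_chars = parse_property_lines(SAMPLE, {'MidLetter', 'MidNumLet'})
print(' '.join(mid_chars))
```

With a parser along these lines the hardcoded lists could be replaced by a lookup, at the cost of shipping one more data file with the script.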

As I said, the patch is only a second-best solution; the correct
approach would be to implement the word-breaking algorithm as described
in the latest standard. This patch is just an improvement over the
current situation.
