
Author christoph
Recipients christoph, ezio.melotti, ggenellina, lemburg
Date 2009-09-29.09:20:11
Message-id <1254216013.11.0.953096626201.issue6412@psf.upfronthosting.co.za>
Content
> * U+0027 APOSTROPHE
hardcoded (see below)
> * U+00AD SOFT HYPHEN (SHY)
has the general category "Format (Cf)" and is therefore included automatically
> * U+2019 RIGHT SINGLE QUOTATION MARK
hardcoded (see below)
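
The difference between the three characters can be checked against the general category that Python's own unicodedata module reports. The following is only a small Python 3 sketch (the session in this message is Python 2); the helper name general_categories is made up for illustration and is not part of the patch.

```python
import unicodedata

def general_categories(chars):
    """Map each character to its Unicode general category (e.g. 'Cf', 'Po')."""
    return {ch: unicodedata.category(ch) for ch in chars}

# The three characters asked about in the quoted message.
cats = general_categories(["\u0027", "\u00ad", "\u2019"])
for ch, cat in cats.items():
    print("U+%04X %-28s %s" % (ord(ch), unicodedata.name(ch), cat))
```

Only U+00AD comes out as Cf; the apostrophe and the right single quotation mark are punctuation, so a rule based on the Format category alone does not pick them up.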

I hardcoded some characters into Tools/unicode/makeunicodedata.py:
>>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
: · · ״ ‧ ︓ ﹕ : ' . ‘ ’ ․ ﹒ ' .
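
For readers on Python 3, here is a rough equivalent of the session above that prints code points and names instead of the raw glyphs, which are hard to tell apart in many fonts. The names FIRST_GROUP, SECOND_GROUP and describe are hypothetical; the two groups simply mirror the two lists joined in the expression above.

```python
import unicodedata

# Same characters as in the Python 2 session, kept in the same two groups.
FIRST_GROUP = [":", "\xb7", "\u0387", "\u05f4", "\u2027",
               "\ufe13", "\ufe55", "\uff1a"]
SECOND_GROUP = ["'", ".", "\u2018", "\u2019",
                "\u2024", "\ufe52", "\uff07", "\uff0e"]

def describe(chars):
    """Return (code point label, Unicode name) pairs for the given characters."""
    return [("U+%04X" % ord(c), unicodedata.name(c, "<unnamed>")) for c in chars]

rows = describe(FIRST_GROUP + SECOND_GROUP)
for code, name in rows:
    print(code, name)
```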

Those cannot currently be extracted automatically, as the script is given
neither DerivedCoreProperties.txt nor the source file for the property
"Word_Break(C) = MidLetter or MidNumLet".

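If such a property file were made available to the script, the extraction could look roughly like the sketch below. It assumes the usual UCD layout of "code point or range ; value # comment" (as used, for example, by the UCD's WordBreakProperty.txt); the function name parse_property_lines and the SAMPLE lines are illustrative only and are not taken from makeunicodedata.py or from the real data file.

```python
def parse_property_lines(lines, wanted=("MidLetter", "MidNumLet")):
    """Collect {code point: value} for lines whose property value is in `wanted`."""
    found = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] not in wanted:
            continue
        # The first field is either a single code point or "start..end".
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            found[cp] = fields[1]
    return found

# Illustrative excerpt in the assumed format, not the real data file.
SAMPLE = """\
# sample lines only
0027          ; MidNumLet # Po       APOSTROPHE
003A          ; MidLetter # Po       COLON
2018..2019    ; MidNumLet # Pi/Pf    single quotation marks
0041..005A    ; ALetter   # L&  [26] LATIN CAPITAL LETTER A..Z
""".splitlines()

result = parse_property_lines(SAMPLE)
for cp in sorted(result):
    print("U+%04X %s" % (cp, result[cp]))
```

With something along these lines the hardcoded list could be dropped, at the cost of shipping one more data file with the script.
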
As I said, the patch is only a second-best solution; the correct
approach would be to implement the word breaking algorithm as described in
the newest standard. This patch is just an improvement over the current
situation.
History
Date                 User       Action  Args
2009-09-29 09:20:13  christoph  set     recipients: + christoph, lemburg, ggenellina, ezio.melotti
2009-09-29 09:20:13  christoph  set     messageid: <1254216013.11.0.953096626201.issue6412@psf.upfronthosting.co.za>
2009-09-29 09:20:11  christoph  link    issue6412 messages
2009-09-29 09:20:11  christoph  create