Message 93267 - Python tracker


Author	lemburg
Recipients	christoph, ezio.melotti, ggenellina, lemburg
Date	2009-09-29.09:44:37
Message-id	<4AC1D703.9080002@egenix.com>
In-reply-to	<1254216013.11.0.953096626201.issue6412@psf.upfronthosting.co.za>

Content

Christoph Burgmer wrote:
> 
> Christoph Burgmer <cburgmer@ira.uka.de> added the comment:
> 
>> * U+0027 APOSTROPHE
> hardcoded (see below)
>> * U+00AD SOFT HYPHEN (SHY)
> has the "Format (Cf)" property and thus is included automatically
>> * U+2019 RIGHT SINGLE QUOTATION MARK
> hardcoded (see below)
> 
> I hardcoded some characters into Tools/unicode/makeunicodedata.py:
> >>> print ' '.join([u':', u'\xb7', u'\u0387', u'\u05f4', u'\u2027',
> u'\ufe13', u'\ufe55', u'\uff1a'] + [u"'", u'.', u'\u2018', u'\u2019',
> u'\u2024', u'\ufe52', u'\uff07', u'\uff0e'])
> : · · ״ ‧ ︓ ﹕ ： ' . ‘ ’ ․ ﹒ ＇ ．
> 
> Those cannot currently be extracted automatically, as neither
> DerivedCoreProperties.txt nor the source file for property
> "Word_Break(C) = MidLetter or MidNumLet" are provided in the script.

As long as those code points are defined somewhere in the Unicode
standard files, that's ok.

It would be good to add a comment explaining the above in the code.
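
A minimal sketch of how such a commented list might look, written in Python 3 syntax rather than the Python 2 of the quoted session. The names MIDLETTER, MIDNUMLET and describe are hypothetical and not taken from the patch. The split into two groups mirrors the two lists in the quoted print statement and is assumed to correspond to Word_Break = MidLetter and MidNumLet.

```python
import unicodedata

# Hypothetical names. These code points are hardcoded because the script
# does not read the Unicode file that defines the Word_Break property.
# Assumed grouping: first list MidLetter, second list MidNumLet.
MIDLETTER = ":\u00b7\u0387\u05f4\u2027\ufe13\ufe55\uff1a"
MIDNUMLET = "'.\u2018\u2019\u2024\ufe52\uff07\uff0e"


def describe(chars):
    """Return (code point, general category, name) for each character."""
    return [
        ("U+%04X" % ord(ch),
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in chars
    ]


for row in describe(MIDLETTER + MIDNUMLET):
    print(*row)

# U+00AD SOFT HYPHEN needs no hardcoding: its general category is Cf.
print(unicodedata.category("\u00ad"))
```

Printing the names next to the code points makes it easier to check the list against the Unicode data files by hand.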

BTW: It's better to use "if (...)" instead of backslash line joining.
With the parentheses, Python joins the lines automatically, and it
looks better.
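
To illustrate the two styles side by side, here is a small sketch. The function and parameter names are hypothetical and only stand in for whatever condition the patch actually tests.

```python
def check_backslash(ch, is_cased, is_format, hardcoded):
    # Explicit line joining: every continued line must end in a backslash,
    # and nothing (not even a space) may follow it.
    if is_cased or \
       is_format or \
       ch in hardcoded:
        return True
    return False


def check_parens(ch, is_cased, is_format, hardcoded):
    # Implicit line joining: lines inside the parentheses are joined
    # automatically, so no backslashes are needed.
    if (is_cased or
            is_format or
            ch in hardcoded):
        return True
    return False
```

Both functions behave identically; only the layout of the condition differs.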

> As I said, the patch is only a second best solution, as the correct
> path would be implementing the word breaking algorithm as described in
> the newest standard. This patch is just an improvement over the current
> situation.

We could handle the word-breaking in a separate new method.

For .title(), I think your patch is an improvement and it will
fix most of the cases that issue7008 mentions.
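
For reference, a short sketch of the apostrophe case in Python 3, where str.title() treats U+0027 as a word boundary. The titlecase helper is an ASCII-only regular-expression workaround along the lines of the one in the str.title() documentation. It is not the patch discussed here and does not implement the Unicode word-breaking algorithm.

```python
import re

text = "they're bill's friends"

# The letter after the apostrophe is treated as the start of a new word.
print(text.title())


def titlecase(s):
    # Treat a run of ASCII letters, optionally followed by an apostrophe
    # and more letters, as one word and capitalize only its first letter.
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?",
                  lambda mo: mo.group(0).capitalize(),
                  s)


print(titlecase(text))
```

The regex only covers ASCII letters and U+0027, so characters such as U+2019 would still need separate handling.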

History
Date	User	Action	Args
2009-09-29 09:44:39	lemburg	set	recipients: + lemburg, ggenellina, christoph, ezio.melotti
2009-09-29 09:44:38	lemburg	link	issue6412 messages
2009-09-29 09:44:38	lemburg	create