Message 124191 - Python tracker

Author	loewis
Recipients	Arfrever, barry, belopolsky, ezio.melotti, jhalcrow, lemburg, loewis, pitrou, valhallasw, vstinner
Date	2010-12-17.08:46:58
Message-id	<4D0B2381.7070408@v.loewis.de>
In-reply-to	<1292549692.05.0.442901398789.issue10254@psf.upfronthosting.co.za>

Content
> The logic suggested by Martin in msg120018 looks right to me, but the
> whole code seems to be unnecessarily complex.  (And comb1==comb may
> need to be changed to comb1>=comb.) I don't understand why linear
> search through "skipped" array is needed.  At the very least instead
> of adding their positions to the "skipped" list, used combining
> characters can be replaced by a non-character to be later skipped.

The skipped array keeps track of which characters have been integrated
into a base character, since they must not appear in the output.
Assume you have a sequence B,C,N,C,N,B (B: base character, C: combined,
N: not combined). You need to remember not to output the C characters,
whereas you still need to output the N characters. I don't think
replacing them with a non-character can work: which one would you
choose (that cannot also appear in the input)?
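
To make the B,C,N,C,N,B case concrete, here is a minimal toy sketch of the
bookkeeping. It is not the code from unicodedata.c: the names
compose_with_skipped, toy_combine and is_base are hypothetical, characters
are plain labels, and the real blocking rules on combining classes (the
comb1==comb test) are left out on purpose.

```python
def is_base(ch):
    # Assumption for this toy model: labels starting with "B" are bases.
    return ch.startswith("B")


def toy_combine(base, ch):
    # Assumption: only "C" merges into the base; "N" never does.
    return base + "+C" if ch == "C" else None


def compose_with_skipped(chars, combine):
    """Toy composition pass that records merged positions in `skipped`."""
    out = []
    i = 0
    n = len(chars)
    while i < n:
        base = chars[i]
        skipped = []  # positions already integrated into `base`
        j = i + 1
        # Walk the characters after the base, up to the next base.
        while j < n and not is_base(chars[j]):
            merged = combine(base, chars[j])
            if merged is not None:
                base = merged
                skipped.append(j)  # must not be output again
            j += 1
        out.append(base)
        # Output the characters that were not merged, in original order.
        for k in range(i + 1, j):
            if k not in skipped:  # the linear search through `skipped`
                out.append(chars[k])
        i = j
    return out


result = compose_with_skipped(["B", "C", "N", "C", "N", "B"], toy_combine)
```

Both C positions end up in skipped and are dropped, while both N
characters are still output after the composed base. Marking the merged
positions in place would need a marker value that can never occur in the
input, which is the objection above.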

The worst case (with respect to cskipped) is the maximum number of
characters that can get combined into a single base character. It used
to be (and I hope still is) 20 (decomposition of U+FDFA).
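
The figure can be checked against whatever Unicode database a given
Python build ships. The sketch below only measures the length of the
compatibility decomposition of U+FDFA; the helper name decomposed_length
is hypothetical, and treating this length as the bound on cskipped is the
assumption made in the paragraph above, not something the snippet proves.

```python
import unicodedata


def decomposed_length(ch, form="NFKD"):
    # Number of code points `ch` expands to under the given normalization form.
    return len(unicodedata.normalize(form, ch))


# U+FDFA is the character named above as the worst case.
fdfa_length = decomposed_length("\ufdfa")
print(fdfa_length, "code points; fits in a 20-entry array:", fdfa_length <= 20)
```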

History
Date	User	Action	Args
2010-12-17 08:47:00	loewis	set	recipients: + loewis, lemburg, barry, belopolsky, pitrou, vstinner, ezio.melotti, Arfrever, jhalcrow, valhallasw
2010-12-17 08:46:58	loewis	link	issue10254 messages
2010-12-17 08:46:58	loewis	create