classification
Title: Python lib re cannot handle Unicode properly due to narrow/wide bug
Type: enhancement Stage: resolved
Components: Regular Expressions Versions: Python 3.3
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, abacabadabacaba, belopolsky, ezio.melotti, gvanrossum, haypo, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, zbysz
Priority: normal Keywords:

Created on 2011-08-11 19:03 by tchrist, last changed 2011-11-16 22:44 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
utf16.py terry.reedy, 2011-08-22 21:18 Revised prototype w/ codepoint indexing/slicing
utf16.py terry.reedy, 2011-08-23 23:58 3rd version with improved iteration
Messages (59)
msg141917 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-11 19:03
Python is in flagrant violation of the very most basic premises of Unicode Technical Report #18 on Regular Expressions, which requires that a regex engine support Unicode characters as "basic logical units independent of serialization like UTF‑*".  Because sometimes you must specify ".." to match a single Unicode character -- whenever those code points are above the BMP and you are on a narrow build -- Python regexes cannot be reliably used for Unicode text.

 % python3.2
 Python 3.2 (r32:88445, Jul 21 2011, 14:44:19)
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import re
 >>> g = "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
 >>> print(g)
ᾲ
 >>> print(re.search(r'\w', g))
 <_sre.SRE_Match object at 0x10051f988>
 >>> p = "\N{MATHEMATICAL SCRIPT CAPITAL P}"
 >>> print(p)
𝒫
 >>> print(re.search(r'\w', p))
None
 >>> print(re.search(r'..', p))   # ← 𝙏𝙃𝙄𝙎 𝙄𝙎 𝙏𝙃𝙀 𝙑𝙄𝙊𝙇𝘼𝙏𝙄𝙊𝙉 𝙍𝙄𝙂𝙃𝙏 𝙃𝙀𝙍𝙀 
<_sre.SRE_Match object at 0x10051f988>
 >>> print(len(chr(0x1D4AB)))
2

That is illegal in Unicode regular expressions.
msg141930 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-08-12 00:09
This is an acknowledged problem with Python narrow builds, and applies to much more than just regex processing.
msg141992 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-12 22:21
Does the regex module handle these particular issues better?
msg141995 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-12 23:04
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Fri, 12 Aug 2011 22:21:59 -0000: 

> Does the regex module handle these particular issues better?

No, it currently does not.  One would have to ask Matthew directly, but I
believe it was because he was trying to stay compatible with re, sometimes
apparently even if that means being bug compatible.  I have brought it 
to his attention though, and at last report he was pondering the matter.

In contrast to how Python behaves on narrow builds, even though Java uses
UTF-16 for its internal representation of strings, its Java Pattern is
quite adamant about treating with logical code points alone.  Besides
running afoul of tr18, it is senseless to do otherwise.  A dot is one
Unicode code point, no matter whether you have 8-bit code units, 16-bit
code units, or 32-bit code units.  Similarly, character classes and their
negations only match entire code points, never pieces of the same.

ICU's regexes work the same way the normal Java Pattern library does.  
So too do Perl, Ruby, and Go.  Python is really the odd man out here.  

Almost.

One interesting counterexample is the vim editor.  It has dot match a
complete grapheme no matter how many code points that requires, because
we're dealing with user-visible characters now, not programmer-visible one.

It is an unreasonable burden to make the programmer deal with the
fine-grained details of low-level serialization schemes instead of at
least(*) the code point level of operations, which is the minimum for
getting real work done.  (*Note that tr18 admits that accessing text at the
code point level meets only programmer expectations, not those of the user,
and therefore to meet user expectations much more elaborate patterns must
necessarily be constructed than if logical groups of coarser granularity
than code points alone are supported.)

Python should not be subject to changing its behavior from one build to the
next.  This astonishing narrow-vs-wide build behavior makes it virtually
impossible to write portable code to work on arbitrary Unicode text. You
cannot even know whether you need to match one dot or two to get a single
code point, and similarly for character indexing, etc. Even identifiers
come into play.  Surrogates should be utterly nonexistent/invisible at
this, the normal level of operation.  

An API that minimally but uniformly deals with logical code points and
nothing finer in granularity is the only way to go here.  Please trust me
on this one.  Graphemes (tr18 Level 2) and collation elements (Level 3)
will someday build on that, but one must first support code points
properly. That's why it's a Level 1 requirement.

--tom
msg142005 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-13 02:59
In a narrow build, a codepoint in the astral plane is encoded as surrogate pair.

I could implement a workaround for it in the regex module, but I think that the proper place to fix it is in the language as a whole, perhaps by implementing PEP 393 ("Flexible String Representation").
msg142024 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-08-13 15:57
Tom, note that nobody is arguing that what you are requesting is a bad thing :)

As far as I know, Matthew is the only one currently working on the regex support in Python.  (Other developers will commit small fixes if someone proposes a patch, but no one that I've seen other than Matthew is working on the deeper issues.)  If you want to help out that would be great.

And as far as this particular issue goes, yes the difference between the narrow and wide build has been a known issue for a long time, but has become less and less ignorable as unicode adoption has grown. Martin's PEP that Matthew references is the only proposed fix that I know of.  There is a GSoc project working on it, but I have no idea what the status is.
msg142036 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-13 20:04
David Murray <report@bugs.python.org> wrote:

> Tom, note that nobody is arguing that what you are requesting is a bad
> thing :)

There looked to be minor some resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re,
even things that to my jaded seem like actual bugs.

There are bugs, and then there are bugs.

In my survey of Unicode support across 7 programming languages for OSCON

    http://training.perl.com/OSCON/index.html

I came across a lot of weirdnesses, especially as first when the learning
curve was high.  Sure, I found it odd that unlike Java, Perl, and Ruby,
Python didn't offer regular casemapping on strings, only the simple
character-based mapping.  But that doesn't make it a bug, which is why
I filed it as an feature/enhancement request/wish, not as a bug.

I always count as bugs not handling Unicode text the way Unicode says
it must be handled.  Such things would be:

    Emitting CESU-8 when told to emit UTF-8.

    Violating the rule that UTF-8 must be in the shortest possible encoding.

    Not treating a code point as a letter when the supported version of the
    UCD says it is.  (This can happen if internal rules get out of sync
    with the current UCD.)

    Claiming one does "the expected thing on Unicode" for case-insensitive
    matches when not doing what Unicode says you must minimally do: use at
    least the simple casefolds, if not in fact the full ones.

    Saying \w matches Unicode word characters when one's definition of
    word characters differs from that of the supported version of the UCD.

Supporting Unicode vX.Y.Z is more than adding more characters.  All the
behaviors specified in the UCD have to be updated too, or else you are just
ISO 10646.  I believe some of Python's Unicode bugs happened because folks
weren't aware which things in Python were defined by the UCD or by various
UTS reports yet were not being directly tracked that way.  That's why its
important to always fully state which version of these things you follow.

Other bugs, many actually, are a result of the narrow/wide-build untransparency.

There is wiggle room in some of these.  For example, which is the one that
applies to re, in that you could -- in a sense -- remove the bug by no longer
claiming to do case-insensitive matches on Unicode.  I do not find that very
useful. Javascript works this way: it doesn't do Unicode casefolding.  Java you
have to ask nicely with the extra UNICODE_CASE flag, aka "(?u)", used with the
CASE_INSENSITIVE, aka "(?i)".

Sometimes languages provide different but equivalent interfaces to the same
functionality.  For example, you may not support the Unicode property
\p{NAME=foobar} in patterns but instead support \N{foobar} in patterns and
hopefully also in strings.  That's just fine.  On slightly shakier ground but
still I think defensible is how one approaches support for the standard UCD
properties:

          Case_Folding    Simple_Case_Folding
     Titlecase_Mapping    Simple_Titlecase_Mapping
     Uppercase_Mapping    Simple_Uppercase_Mapping
     Lowercase_Mapping    Simple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to
directly supporting a Case_Folding property like \p{Case_Folding=s},
since "(?i)s" should be the same thing as "\p{Case_Folding=s}".

> As far as I know, Matthew is the only one currently working on the
> regex support in Python.  (Other developers will commit small fixes if
> someone proposes a patch, but no one that I've seen other than Matthew
> is working on the deeper issues.)  If you want to help out that would
> be great.

Yes, I actually would.  At least as I find time for it.  I'm a competent C
programmer and Matthew's C code is very well documented, but that's very
time consuming.  For bang-for-buck, I do best on test and doc work, making
sure things are actually working the way they say do.

I was pretty surprised and disappointed by how much trouble I had with
Unicode work in Python.  A bit of that is learning curve, a bit of it is
suboptimal defaults, but quite a bit of it is that things either don't work
the way Unicode says, or because something is altogether missing.  I'd like
to help at least make the Python documentation clearer about what it is
or is not doing in this regard.

But be warned: one reason that Java 1.7 handles Unicode more according to
the published Unicode Standard in its Character, String, and Pattern
classes is because when they said they'd be supporting Unicode 6.0.0,
I went through those classes and every time I found something in violation
of that Standard, I filed a bug report that included a documentation patch
explaining what they weren't doing right.  Rather than apply my rather
embarrassing doc patches, they instead fixed the code. :)

> And as far as this particular issue goes, yes the difference between
> the narrow and wide build has been a known issue for a long time, but
> has become less and less ignorable as unicode adoption has grown.

Little would make me happier than for Python to move to a logical character
model the way Go and Perl treat them.  I find getting bogged down by code
units to be abhorrently low level, and it doesn't matter whether these are
8-bit code units like in PHP or 16-bit code units like in Java.  The Nth
character of a string should always be its Nth logical code point not
its Nth physical code units.

In regular expressions, this is a clearly stated requirement in the Unicode
Standard (see tr18 RL1.1 and RL1.7).  However, it is more than that.  The
entire character processing model really really should be logical not
physical. That's because you need to have whole code points not broken-up
code units before you can build on the still higher-level components needed
to meet user expectations. These include user-visible graphemes (like an E
with both a combining acute and a combining up tack below) and linguistic
collating elements (like the letter <ch> in Spanish, <dzs> in Hungarian, or
<dd> in Welsh).

> Martin's PEP that Matthew references is the only proposed fix that I
> know of.  There is a GSoc project working on it, but I have no idea
> what the status is.

Other possible fixes are using UTF-8 or UTF-32.

One reason I don't like that PEP is because if you are really that
concerned wiht storage space, it is too prone to spoilers.  Neither UTF-8
nor UTF-32 have any spoilers, but that PEP does.  What I mean is that just
one lone code point in a huge string is enough to change the representation
of everything in that string.  I think of these as spoilers; strings that
are mostly non-spoilers with just a bit of spoiler in them are super super
common.  Often it's all ASCII plus just a few ISO-8859-1 or other non-ASCII
Latin characters.  Or it's all Latin with a couple of non-BMP mathematical
alphanumerics thrown in.  That kind of thing.

Consider this mail message. It contains exactly six non-ASCII code points.

    % uniwc `mhpath cur +drafts`
    Paras    Lines    Words   Graphs    Chars    Bytes File
       79      345     2796    16899    16899    16920 /home/tchrist/Mail/drafts/1183

Because it is in UTF-8, its memory profile in bytes grows only very
slightly over its character count.  However, if you adopt the PEP, then you
pay and pay and pay, very nearly quadrupling the memory profile for six
particular characters.  Now it takes 67596 bytes intead of 16920, just for
the sake of six code points.  Ouch!!

Why would you want to do that?  You say you are worried about memory, but
then you would do this sort of thing.  I just don't understand.

I may be wrong here, not least because I can think of possible extenuating
circumstances, but it is my impression that there there is an underlying
assumption in the Python community and many others that being able to
access the Nth character in a string in constant time for arbitrary N is
the most important of all possible considerations.  I

I don't believe that makes as much sense as people think, because I don't
believe character strings really are accessed in that fashion very often
at all.  Sure, if you have a 2D matrix of strings where a given row-column
pair yields one character and you're taking the diagonal you might want
that, but how often does that actually happen?  Virtually never: these
are strings and not matrices we're running FFTs on after all.  We don't
need to be able to load them into vector registers or anything the way
the number-crunching people do.

That's because strings are a sequence of characters: they're text.  Whether
reading text left to right, right to left, or even boustrophedonically, you're
always going one past the character you're currently at.  You aren't going to
the Nth character forward or back for arbitrary N.  That isn't how people deal
with text.  Sometimes they do look at the end, or a few in from the far end,
but even that can be handled in other fashions.

I need to see firm use-case data justifying this overwhelming need for O(1)
access to the the Nth character before I will believe it.  I think it is
too rare to be as concerned with as so many people bizarrely appear to be.
This attachment has serious consequences.  It is because of this attachment
that the whole narrow/wide build thing occurs, where people are willing to
discard a clean, uniform processing model in search of what I do not
believe a reasonable or realistic goal.  Even if they *were* correct, *far*
more bugs are caused by unreliability than by performance.

 * If you were truly concerned with memory use, you would simply use UTF-8.
 * If you were truly concerned with O(1) access time, you would always use UTF-32.
 * Anything that isn't one of these two is some sort of messy compromise.

I promise that nobody ever had a BMP vs non-BMP bug who was working with
either UTF-8 or UTF-32.  This only happens with UTF-16 and UCS-2, which
have all the disadvantages of both UTF-8 and UTF-32 combined yet none of
the advantages of either.  It's the worst of both worlds.

Because you're using UTF-16, you're already paying quite a bit of memory for
text processing in Python compared to doing so in Go or in Perl, which are
both UTF-8 languages.  Since you're already used to paying extra, what's
so wrong with going to purely UTF-32?  That too would solve things.

UTF-8 is not the memory pig people allege it is on Asian text.  Consider:

I saved the Tokyo Wikipedia page for each of these languages as
NFC text and generated the following table comparing them. I've grouped
the languages into Western Latin, Western non-Latin, and Eastern.

   Paras Lines Words Graphs Chars  UTF16 UTF8   8:16 16:8  Language

     519  1525  6300  43120 43147  86296 44023   51% 196%  English
     343   728  1202   8623  8650  17302  9173   53% 189%  Welsh
     541  1722  9013  57377 57404 114810 59345   52% 193%  Spanish
     529  1712  9690  63871 63898 127798 67016   52% 191%  French
     321   837  2442  18999 19026  38054 21148   56% 180%  Hungarian

     202   464   976   7140  7167  14336 11848   83% 121%  Greek
     348   937  2938  21439 21467  42936 36585   85% 117%  Russian

     355   788   613   6439  6466  12934 13754  106%  94%  Chinese, simplified
     209   419   243   2163  2190   4382  3331   76% 132%  Chinese, traditional
     461  1127  1030  25341 25368  50738 65636  129%  77%  Japanese
     410   925  2955  13942 13969  27940 29561  106%  95%  Korean

Where:

  * Paras is the number of blankline-separated  text spans.
  * Lines is the number of linebreak-separated  text spans.
  * Words is the number of whitespace-separated text spans.

  * Graphs is the number of Unicode extended grapheme clusters.
  * Chars  is the number of Unicode code points.

  * UTF16 is how many bytes it takes up stored as UTF-16.
  * UTF8  is how many bytes it takes up stored as  UTF-8.

  * 8:16 is the ratio of  UTF-8 size to UTF-16 size as a percentage.
  * 16:8 is the ratio of UTF-16 size to UTF-8  size as a percentage.

  * Language is which version of the Tokyo page we're talking
    about here.

Here are my observations:

 * Western languages that use the Latin script suffer terribly upon
   conversion from UTF-8 to UTF-16, with English suffering the most
   by expanding by 96% and Hungarian the least by expanding by 80%.
   All are huge.

 * Western languages that do not use the Latin script still suffer, but
   only 15-20%.

 * Eastern languages DO NOT SUFFER in UTF-8 the way everyone claims
   that they do!

To expand on the last point:

 * In Korean and in (simplified) Chinese, you get only 6% bigger in
   UTF-8 than in UTF-16.

 * In Japanese, you get only 29% bigger in UTF-8 than in UTF-16.

 * The traditional Chinese actually got smaller in UTF-8 than in
   UTF-16! In fact, it costs 32% to use UTF-16 over UTF-8 for this
   sample. If you look at the Lines and Words columns, it looks
   that this might be due to white space usage.

So UTF-8 isn't even too bad on Asian languages.

But you howl that it's variable width.  So?  You're already using
a variable-width encoding in Python on narrow builds.  I know you think
otherwise, but I'll prove this below.

Variable width isn't as bad as people claim, partly because fixed width is
not as good as they claim.  Think of the kind of operations that you
normally do on strings.  You want to to go to the next user-visible
grapheme, or to the end of the current word, or go to the start of the next
line.  UTF-32 helps you with none of those, and UTF-8 does not hurt them.
You cannot instantly go to a particular address in memory for any of those
unless you build up a table of offsets as some text editors sometimes do,
especially for lines.  You simply have to parse it out as you go.

Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
Perhaps someone could tell me why the Python documentation says it uses
UCS-2 on a narrow build.  I believe this to be an error, because otherwise
I cannot explain how you can have non-BMP code points in your UTF-8
literals in your source code.  And you clearly can.

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    super = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"
    print(super)

This is with a narrow build on Darwin:

    % python3.2 -c 'import sys; print(sys.maxunicode)'
    65535

    % export PYTHONIOENCODING=UTF-8 PERLUNICODE=S

    % python3.2 supertest | uniwc
       Paras    Lines    Words   Graphs    Chars    Bytes File
           0        1        1        8        8       29 standard input

    % python3.2 supertest | uniquote -x
    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}

Observations and conclusion:

 *  You are emitting 8 code points,  7  in the SMP not in the BMP.

 *  You clearly understand code points above your alleged maxunicode value.

 *  If you were actually using UCS-2, those would not be possible.

 *  I submit that this proves you are actually using UTF-16.  Q.E.D.

Yet you are telling people you are using UCS-2.  Why is that?  Since you
are already using a variable-width encoding, why the supercilious attitude
toward UTF-8?  UTF-16 has the same properties but costs you a lot more.  As
I said before, UTF-16 puts you in the worst of all worlds.

 * If you were truly concerned with memory use, you would simply use UTF-8.
 * If you were truly concerned with O(1) access time, you would always use UTF-32.
 * Anything that isn't one of these two is some sort of messy compromise.

But even with UTF-16 you *can* present to the user a view of logical
characters that doesn't care about the underlying clunkish representation.
The Java regex engine proves that, since "." always matches a single code
point no matter whether it is in the BMP or not.  Similarly, ICU's classes
operate on logical characters -- code points not units -- even though they
use UTF-16 languages.  The Nth code point does not care and should not care
how many units it takes to get there.  It is fine to have both a byte
interface *and* a character interface, but I don't believe having something
that falls in between those two is of any use whatsoever.  And if you don't
have a code point interface, you don't have a character interface.

This is my biggest underlying complaint about Python's string model, but
I believe it fixable, even if doing so exceeds my own personal background.

--tom
msg142037 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-13 20:26
> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
> Perhaps someone could tell me why the Python documentation says it uses
> UCS-2 on a narrow build.

There's a disagreement on that point between several developers. See an example sub-thread at:
http://mail.python.org/pipermail/python-dev/2010-November/105751.html

> Since you are already using a variable-width encoding, why the
> supercilious attitude toward UTF-8?

I think you are reading too much into these decisions. It's simply that no-one took the time to write an alternative implementation and demonstrate its superiority. I also believe the original implementation was UCS-2 and surrogate support was added progressively during the years. Hence the terminological mess and the ad-hoc semantics.

I agree that going with UTF-8 and a clever indexing scheme would be a better solution.
msg142038 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-13 20:57
There are occasions when you want to do string slicing, often of the form:

pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos : endpos]

To me that suggests that if UTF-8 is used then it may be worth profiling to see whether caching the last 2 positions would be beneficial.
msg142039 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-13 21:09
> There are occasions when you want to do string slicing, often of the form:
> 
> pos = my_str.index(x)
> endpos = my_str.index(y)
> substring = my_str[pos : endpos]
> 
> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.

And/or a lookup table giving the byte offset of, say, every 16th
character. It gives you a O(1) lookup with a relatively reasonable
constant cost (you have to scan for less than 16 characters after the
lookup).

On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
table would be 1/16. It could also be constructed lazily whenever more
than 2 positions are cached.
msg142041 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 01:07
Matthew Barnett <report@bugs.python.org> wrote
   on Sat, 13 Aug 2011 20:57:40 -0000: 

> There are occasions when you want to do string slicing, often of the form:

>   pos = my_str.index(x)
>   endpos = my_str.index(y)
>   substring = my_str[pos : endpos]

Me, I would probably give the second call to index the first  
index position to guarantee the end comes after the start:

    str  = "for finding the biggest of all the strings"
    x_at = str.index("big")
    y_at = str.index("the", x_at)
    some = str[x_at:y_at]
    print("GOT", some)

But here's a serious question: is that *actually* a common usage pattern
for accessing strings in Python?  I ask because it wouldn't even *occur* to
me to go at such a problem in that way.  I would have always just written
it this way instead:

    import re
    str  = "for finding the biggest of all the strings"
    some = re.search("(big.*?)the", str).group(1)
    print("GOT", some)

I know I would use the pattern approach, just because that's 
how I always do such things in Perl:

    $str  = "for finding the biggest of all the strings";
    ($some) = $str =~ /(big.*?)the/;
    print "GOT $some\n";

Which is obviously a *whole* lot simpler than the index approach:

    $str  = "for finding the biggest of all the strings";
    $x_at = index($str, "big");
    $y_at = index($str, "the", $x_at);
    $len  = $y_at - $x_at;
    $some = substr($str, $x_at, $len);
    print "GOT $some\n";

With no arithmetic and no need for temporary variables (you can't really
escape needing x_at to pass to the second call to index), it's all a
lot more WYSIWIG.  See how much easier that is?  

Sure, it's a bit cleaner and less noisy in Perl than it is in Python by
virtue of Perl's integrated pattern matching, but I would still use
patterns in Python for this, not index.  

I honestly find the equivalent pattern operations a lot easier to read and write
and maintain than I find the index/substring version.  It's a visual thing.  
I find patterns a win in maintainability over all that busy index monkeywork.  
The index/rindex and substring approach is one I almost never ever turn to.
I bet I use pattern matching 100 or 500 times for each time I use index, and
maybe even more.

I happen to think in patterns.  I don't expect other people to do so.  But
because of this, I usually end up picking patterns even if they might be a
little bit slower, because I think the gain in flexibility and especially
maintability more than makes up for any minor performance concerns.

This might also show you why patterns are so important to me: they're one
of the most important tools we have for processing text.  Index isn't,
which is why I really don't care about whether it has O(1) access.  

> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.

Notice how with the pattern approach, which is inherently sequential, you don't
have all that concern about running over the string more than once.  Once you
have the first piece (here, "big"), you proceed directly from there looking for
the second piece in a straightforward, WYSIWIG way.  There is no need to keep an
extra index or even two around on the string structure itself, going at it this way.

I would be pretty surprised if Perl could gain any speed by caching a pair of
MRU index values against its UTF-8 [but see footnote], because again, I think
the normal access pattern wouldn't make use of them.  Maybe Python programmers
don't think of strings the same way, though.  That, I really couldn't tell you.

But here's something to think about:

If it *is* true that you guys do all this index stuff that Perl programmers
just never see or do because of our differing comfort levels with regexes,
and so you think Python that might still benefit from that sort of caching 
because its culture has promoted a different access pattern, then that caching 
benefit would still apply even if you were retain the current UTF-16 representation
instead of going to UTF-8 (which might want it) or to UTF-32 (which wouldn't).

After all, you have the same variable-width caching issue with UTF-16 as with
UTF-8, so if it makes sense to have an MRU cache mapping character indices to
byte indices, then it doesn't matter whether you use UTF-8 or UTF-16!

However, I'd want some passive comparative benchmarks using real programs with
real data, because I would be suspicious of incurring the memory cost of two
more pointers in every string in the whole program.  That's serious.

--tom

FOOTNOTE: The Perl 6 people are thinking about clever ways to set up byte
          offset indices.  You have to do this if you want O(1) access to the
          Nth element for elements that are not simple code points even if you
          use UTF-32.  That's because they want the default string element to be
          a user visible grapheme, not a code point.  I know they have clever
          ideas, but I don't know how critical O(1) access truly is, nor whether 
	  it's worth the overhead this would require.  But perhaps it would be
	  extensible for other sorts of string elements, like locale-based
	  alphabetic units (like <ch>, <dz>, <ll>) or even words, and so would
	  prove interesting to try nonetheless.
msg142042 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 01:11
Antoine Pitrou <report@bugs.python.org> wrote
   on Sat, 13 Aug 2011 21:09:52 -0000: 

> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).

> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom
msg142043 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-14 02:06
You're right about starting the second search from where the first finished. Caching the position would be an advantage there.

The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space than the current format.

Regex isn't used as much as in Perl. BTW, the current re module was introduced in Python 1.5, the previous regex and regsub modules being removed in Python 2.5.
msg142044 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 02:35
>> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
>> Perhaps someone could tell me why the Python documentation says it uses
>> UCS-2 on a narrow build.

> There's a disagreement on that point between several developers. 
> See an example sub-thread at:

> 	http://mail.python.org/pipermail/python-dev/2010-November/105751.html

Some of those folks know what they're talking about, and some do not.

Most of the postings miss the mark.

Python uses UTF-16 for its narrow builds.  It does not use UCS-2.

The argument that it must be UCS-2 because it can store lone surrogates
in memory is spurious.

You have to read The Unicode Standard very *very* closely, but it is not
necessary that all internal buffers always be in well-formed UTF-whatever.
Otherwise it would be impossible to append a code unit at a time to buffer.
I could pull out the reference if I worked at it, because I've had to find
it before.  It's in there.  Trust me.  I know.

It is also spurious to pretend that because you can produce illegal output
when telling it to generate something in UTF-16 that it is somehow not using
UTF-16.  You have simply made a mistake.  You have generated something  that
you have promised you would not generate.   I have more to say about this below.

Finally, it is spurious to argue against UTF-16 because of the code unit
interface.  Java does exactly  the same thing as Python does *in all regards*
here, and no one pretends that Java is UCS-2.  Both are UTF-16.

It is simply a design error to pretend that the number of characters
is the number of code units instead of code points.  A terrible and
ugly one, but it does not mean you are UCS-2.

You are not.  Python uses UTF-16 on narrow builds.  

The ugly terrible design error is digusting and wrong, just as much in 
Python as in Java, and perhaps moreso because of the idiocy of narrow
builds even existing.  But that doesn't make them UCS-2.

If I could wave a magic wand, I would have Python undo its code unit
blunder and go back to code points, no matter what.  That means to stop
talking about serialization schemes and start talking about logical code
points.  It means that slicing and index and length and everything only
report true code points.  This horrible code unit botch from narrow builds
is most easily cured by moving to wide builds only.

However, there is more.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken
in a bunch of ways.  You should be raising as exception in all kinds of
places and you aren't.  I can see I need to bug report this stuff to.  
I don't to be mean about this.  HONEST!  It's just the way it is.

Unicode currently reserves 66 code points as noncharacters, which it 
guarantees will never be in a legal UTF-anything stream.  I am not talking 
about surrogates, either.

To start with, no code point which when bitwise added with 0xFFFE returns
0xFFFE can never appear in a valid UTF-* stream, but Python allow this
without any error.

That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes,
where NN is 00 through 10 in hex.  So that's 2 noncharacters times 17 
planes = 34 code points illegal for interchange that Python is passing 
through illegally.  

The remaining 32 nonsurrogate code points illegal for open interchange
are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
doesn't seem to care.

You simply cannot say you are generating UTF-8 and then generate a byte
sequence that UTF-8 guarantees can never occur.  This is a violation.

***SIGH***

--tom
msg142047 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-14 04:54
> It is simply a design error to pretend that the number of characters
> is the number of code units instead of code points.  A terrible and
> ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the number of code units.  This is a matter of "practicality beats purity".  Returning the number of code units is O(1) (num_of_bytes/2).  To calculate the number of characters it's instead necessary to scan all the string looking for surrogates and then count any surrogate pair as 1 character.  It was therefore decided that it was not worth to slow down the common case just to be 100% accurate in the "uncommon" case.

That said it would be nice to have an API (maybe in unicodedata or as new str methods?) able to return the number of code units, code points, graphemes, etc, but I'm not sure that it should be the default behavior of len().

> The ugly terrible design error is digusting and wrong, just as much
> in Python as in Java, and perhaps moreso because of the idiocy of
> narrow builds even existing.

Again, wide builds use twice as much the space than narrow ones, but one the other hand you can have fast and correct behavior with e.g. len().  If people don't care about/don't need to use non-BMP chars and would rather use less space, they can do so.  Until we agree that the difference in space used/speed is no longer relevant and/or that non-BMP characters become common enough to prefer the "correct behavior" over the "fast-but-inaccurate" one, we will probably keep both.

> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
> broken in a bunch of ways.  You should be raising as exception in
> all kinds of places and you aren't.

I am aware of some problems of the UTF-8 codec on Python 2.  It used to follow RFC 2279 until last year and now it's been updated to follow RFC 3629.
However, for backward compatibility, it still encodes/decodes surrogate pairs.  This broken behavior has been kept because on Python 2, you can encode every code point with UTF-8, and decode it back without errors:
>>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
>>>
and breaking this invariant would probably make more harm than good.  I proposed to add a "real" utf-8 codec on Python 2, but no one seems to care enough about it.

Also note that this is fixed in Python3:
>>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

>  I can see I need to bug report this stuff to.  

If you find other places where it's broken (both on Python 2 and/or Python 3), please do and feel free to add me to the nosy.  If you can also provide a failing test case and/or point to the relevant parts of the Unicode standard, it would be great.
msg142053 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 06:09
Ezio Melotti <ezio.melotti@gmail.com> added the comment:

>> It is simply a design error to pretend that the number of characters
>> is the number of code units instead of code points.  A terrible and
>> ugly one, but it does not mean you are UCS-2.

> If you are referring to the value returned by len(unicode_string), it
> is the number of code units.  This is a matter of "practicality beats
> purity".  Returning the number of code units is O(1) (num_of_bytes/2).
> To calculate the number of characters it's instead necessary to scan
> all the string looking for surrogates and then count any surrogate
> pair as 1 character.  It was therefore decided that it was not worth
> to slow down the common case just to be 100% accurate in the
> "uncommon" case.

If speed is more important than correctness, I can make any algorithm
infinitely fast.  Given the choice between correct and quick, I will 
take correct every single time.

Plus your strings our immutable! You know how long they are and they 
never change.  Correctness comes at a negligible cost.  

It was a bad choice to return the wrong answer.

> That said it would be nice to have an API (maybe in unicodedata or as
> new str methods?) able to return the number of code units, code
> points, graphemes, etc, but I'm not sure that it should be the default
> behavior of len().

Always code points, never code units.  I even use a class whose length
method returns the grapheme count, because even code points aren't good
enough.  Yes of course graphemes have to be counted.  Big deal.   How 
would you like it if you said to move three to the left in vim and 
it *didn't* count each graphemes as one position?  Madness.

>> The ugly terrible design error is digusting and wrong, just as much
>> in Python as in Java, and perhaps moreso because of the idiocy of
>> narrow builds even existing.

> Again, wide builds use twice as much the space than narrow ones, but
> one the other hand you can have fast and correct behavior with e.g.
> len().  If people don't care about/don't need to use non-BMP chars and
> would rather use less space, they can do so.  Until we agree that the
> difference in space used/speed is no longer relevant and/or that non-
> BMP characters become common enough to prefer the "correct behavior"
> over the "fast-but-inaccurate" one, we will probably keep both.

Which is why I always put loud warnings in my Unicode-related Python
programs that they do not work right on Unicode if running under
a narrow build.  I almost feel I should just exit.

>> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
>> broken in a bunch of ways.  You should be raising as exception in
>> all kinds of places and you aren't.

> I am aware of some problems of the UTF-8 codec on Python 2.  It used
> to follow RFC 2279 until last year and now it's been updated to follow
> RFC 3629.

Unicode says you can't put surrogates or noncharacters in a UTF-anything 
stream.  It's a bug to do so and pretend it's a UTF-whatever.

Perl has an encoding form, which it does not call "UTF-8", that you 
can use the UTF-8 algorithm on for any code point, include non-characters
and surrogates and even non-Unicode code points far above 0x10_FFFF, up
to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines.  It's the internal
format we use in memory.  But we don't call it real UTF-8, either.

It sounds like this is the kind of thing that would be useful to you.

> However, for backward compatibility, it still encodes/decodes
> surrogate pairs.  This broken behavior has been kept because on Python
> 2, you can encode every code point with UTF-8, and decode it back
> without errors:

No, that's not UTF-8 then.  By definition.  See the Unicode Standard.

>>>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
>>>>

> and breaking this invariant would probably make more harm than good.

Why?  Create something called utf8-extended or utf8-lax or utf8-nonstrict
or something.  But you really can't call it UTF-8 and do that.  

We actually equate "UTF-8" and "utf8-strict".  Our internal extended
UTF-8 is something else.  It seems like you're still doing the old
relaxed version we used to have until 2003 or so.  It seems useful
to be able to have both flavors, the strict and the relaxed one,
and to call them different things.  

Perl defaults to the relaxed one, which gives warnings not exceptions,
if you do things like setting PERLUNICODE to S or SD and such for the
default I/I encoding.  If you actually use "UTF-8" as the encoding on the stream, though, you
get the version that gives exceptions instead.  

    "UTF-8" = "utf8-strict" 	strictly by the standard, raises exceptions otherwise
    "utf8"			loosely only, emits warnings on encoding illegal things

We currently only emit warnings or raise exceptions on I/O, not on chr
operations and such.  We used to raise exceptions on things like
chr(0xD800), but that was a mistake caused by misunderstanding the in-
memory requirements being different from stream requirements.  It's
really really subtle and you have to read the standard very closely to
realize this.

So you are perfectly free to use chr(0x20FFFF) in your own code.  This is
really useful for out-of-band sentinels and such.  However, if you try to
send it out a loose utf8 stream, you get a mandatory warning, and if you
try to send it out a strict UTF-8 stream, you get an exception.

In fact, if you remember the old "migrate ASCII trick" from the tr program,
doing something like this to turn on the high bit to mark characters in some way:

    tr[\0-\x7F][\x80-\xFF]

        (that's what killed WordStar BTW, as they used that trick
         on their ASCII internally so couldn't port to 8-bit
         encodings.  Ooops.)

Given the full 32- or 64-bit (or higher) character range for internal use, 
you can actually do this as a sort of corresponding transform:

    tr[\0-\x{10_FFFF}][\x{20_0000}-\x{3F_FFFF}]

Just don't try to output it. :)  For internal use only.  Blah blah.

(Hm, I just realized you couldn't ever do that sort of thing at all on a
narrow build because you're stuck with UTF-16.  On a wide build, though,
you could, because you'd have UTF-32.  Not that I'm suggesting it!!!)

Yes, that's not necessarily the best way to do most of what one might
naively try using it for, but there are all kinds of intersting things you
can do when your characters' internal code points don't have the same upper
bound as Unicode.

It's taken us years to unravel all this Unicode stuff so it's usable.  We used
to a lot of really um unfortunate things, whether too many errors or too few.
I'm certainly not suggesting you go down those roads.  In some ways, Python's
Unicode support reminds me of ours from rather a long time ago.

We've worked pretty hard at Unicode in Perl for the last few years, although
even ten years ago we already supported \X, all regex properties, and full
casemapping and full casefolding.  So there's always been a strong Unicode
sensitivity in Perl.  It's just taken us a long long long long time to 
get all the kinks out.

I don't imagine most of the Python devel team knows Perl very well, and maybe
not even Java or ICU.  So I get the idea that there isn't as much awareness of
Unicode in your team as there tends to be in those others. From my point of
view, learning from other people's mistakes is a way to get ahead without
incurring all the learning-bumps oneself, so if there's a way to do that for
you, that could be to your benefit, and I'm very happy to share some of our
blunders so you can avoid them yourselves.

> I proposed to add a "real" utf-8 codec on Python 2, but no one seems
> to care enough about it.

Hm.  See previous paragraph. :)

> Also note that this is fixed in Python3:
>>>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Yes, I've noticed that Python3 is better about some of this,
but it doesn't detect the 66 noncharacter code points.  

I haven't checked on decoding yet, but I bet it's the same.

I think having something that does the lax Python2 way and something
else that does the stricter Standard way makes the most sense.  

>>  I can see I need to bug report this stuff to.  

> If you find other places where it's broken (both on Python 2 and/or
> Python 3), please do and feel free to add me to the nosy.  If you can
> also provide a failing test case and/or point to the relevant parts of
> the Unicode standard, it would be great.

I'll wait to report it till I have all my references at ready.

I can probably pretty easily find the part of the Unicode Standard where it
says no UTF can contain code points that are illegal for interchange.
Finding the part that explains that/why you can and indeed must be able to 
have them internally is going to be harder, but I know it's there.

Also, there is a tr18 update that adds a bit of clarification about how it
is sometimes ok to allow a regex engine in a UTF-16 language to find
unpaired surrogates, like checking whether "foo\x{D800}bar" matches
the pattern /\p{Cs}/.  You could never have that string read in from a valid
UTF-{8,16,32} stream, but because it can happen in your program, you have
to be able to match it.  So they finally admit this in the next tr18 update.
But it's still a bit odd, eh?  (And no, that doesn't make it UCS-2! :)

--tom
msg142054 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-14 07:15
> If speed is more important than correctness, I can make any algorithm
> infinitely fast.  Given the choice between correct and quick, I will 
> take correct every single time.

It's a trade-off.  Using non-BMP chars is fairly unusual (many real-world applications hardly use non-ASCII chars).  Slowing everything down just to allow non-BMP chars on narrow builds is not a good idea IMHO.  Wide builds can be used if one really wants len() and other methods to work properly with non-BMP chars.

> Plus your strings our immutable! You know how long they are and they 
> never change.  Correctness comes at a negligible cost. 

Sure, we can cache the len, but we still have to compute it at least once.  Also it's not just len(), but many other operations like slicing that are affected.

> Unicode says you can't put surrogates or noncharacters in a 
> UTF-anything stream.  It's a bug to do so and pretend it's a 
> UTF-whatever.

The UTF-8 codec described by RFC 2279 didn't say so, so, since our codec was following RFC 2279, it was producing valid UTF-8.  With RFC 3629 a number of things changed in a non-backward compatible way.  Therefore we couldn't just change the behavior of the UTF-8 codec nor rename it to something else in Python 2.  We had to wait till Python 3 in order to fix it.

> Perl has an encoding form, which it does not call "UTF-8", that you
> can use the UTF-8 algorithm on for any code point, include 
> non-characters and surrogates and even non-Unicode code points far
> above 0x10_FFFF, up to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit 
> machines.  It's the internal format we use in memory.  But we don't
> call it real UTF-8, either.

This sounds like RFC 2279 UTF-8.  It allowed up to 6 bytes (following the same encoding scheme) and had no restrictions about surrogates (at the time I think only BMP chars existed, so there were no surrogates and the Unicode consortium didn't decide that the limit was 0x10FFFF).

> It sounds like this is the kind of thing that would be useful to you.

I believe this is what the surrogateescape error handler does (up to 0x10FFFF).

> Why?  Create something called utf8-extended or utf8-lax or 
> utf8-nonstrict or something.  But you really can't call it UTF-8 and 
> do that. 

That's what we did in Python 3, but on Python 2 is too late to fix it, especially in a point release.  (Just to clarify, I don't think any of these things will be fixed in 2.7.  There won't be any 2.8, and major changes (especially backwards-incompatible ones) are unlikely to happen in a point release (e.g. 2.7.3), so it's better to focus on Python 3.  Minor bug fixes can still be done even in 2.7 though.)

> Perl defaults to the relaxed one, which gives warnings not exceptions,
> if you do things like setting PERLUNICODE to S or SD and such for the
> default I/I encoding.  If you actually use "UTF-8" as the encoding on 
> the stream, though, you get the version that gives exceptions 
> instead.

In Python we don't usually use warnings for this kind of things (also we don't have things like "use strict").

> I don't imagine most of the Python devel team knows Perl very well,
> and maybe not even Java or ICU.  So I get the idea that there isn't 
> as much awareness of Unicode in your team as there tends to be in
> those others.

I would say there are at least 5-10 Unicode "experts" in our team.  It might be true though that we don't always follow closely what other languages and the Unicode consortium do, but if people reports problem we are willing to fix them (so thanks for reporting them!).

> From my point of view, learning from other people's mistakes is a way
> to get ahead without incurring all the learning-bumps oneself, so if
> there's a way to do that for you, that could be to your benefit, and 
> I'm very happy to share some of our blunders so you can avoid them
> yourselves.

While I really appreciate the fact that you are sharing with us your experience, the solution found and applied in Perl might not always be the best one for Python (but it's still good to learn from others' mistakes).
For example I don't think removing the 0x10FFFF upper limit is going to happen -- even if it might be useful for other things.
Also regular expressions are not part of the core and are not used that often, so I consider problems with narrow/wide builds, codecs and the unicode type much more important than problems with the re/regex module (they should be fixed too, but have lower priority IMHO).
msg142064 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 16:54
Ezio Melotti <report@bugs.python.org> wrote
   on Sun, 14 Aug 2011 07:15:09 -0000:

>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream.  It's a bug to do so and pretend it's a
>> UTF-whatever.

> The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> codec was following RFC 2279, it was producing valid UTF-8.  With RFC
> 3629 a number of things changed in a non-backward compatible way.
> Therefore we couldn't just change the behavior of the UTF-8 codec nor
> rename it to something else in Python 2.  We had to wait till Python 3
> in order to fix it.

I'm a bit confused on this.  You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do things the
way you are doing them.  This is from the published Unicode Standard version 6.0.0,
chapter 3, Conformance.  It is a very important chapter.

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Python is in violation of that published Standard by interpreting noncharacter code
points as abstract characters and tolerating them in character encoding forms like
UTF-8 or UTF-16.  This explains that conformant processes are forbidden from doing this.

    Code Points Unassigned to Abstract Characters

     C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point
         as an abstract character.
       · The high-surrogate and low-surrogate code points are designated for surrogate
         code units in the UTF-16 character encoding form. They are unassigned to any
         abstract character.

==>  C2 A process shall not interpret a noncharacter code point as an abstract character.
       · The noncharacter code points may be used internally, such as for sentinel val-
         ues or delimiters, but should not be exchanged publicly.

     C3 A process shall not interpret an unassigned code point as an abstract character.
       · This clause does not preclude the assignment of certain generic semantics to
         unassigned code points (for example, rendering with a glyph to indicate the
         position within a character block) that allow for graceful behavior in the pres-
         ence of code points that are outside a supported subset.
       · Unassigned code points may have default property values. (See D26.)
       · Code points whose use has not yet been designated may be assigned to abstract
         characters in future versions of the standard. Because of this fact, due care in
         the handling of generic semantics for such code points is likely to provide better
         robustness for implementations that may encounter data based on future ver-
         sions of the standard.

Next we have exactly how something you call UTF-{8,16-32} must be formed.
*This* is the Standard against which these things are measured; it is not the RFC.

You are of course perfectly free to say you conform to this and that RFC, but you
must not say you conform to the Unicode Standard when you don't.  These are different
things.  I feel it does users a grave disservice to ignore the Unicode Standard in
this, and sheer casuistry to rely on an RFC definition while ignoring the Unicode
Standard whence it originated, because this borders on being intentionally misleading.

    Character Encoding Forms

     C8 When a process interprets a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall interpret that code unit sequence according to the corre-
         sponding code point sequence.
==>    · The specification of the code unit sequences for UTF-8 is given in D92.
       · The specification of the code unit sequences for UTF-16 is given in D91.
       · The specification of the code unit sequences for UTF-32 is given in D90.

     C9 When a process generates a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall not emit ill-formed code unit sequences.
       · The definition of each Unicode character encoding form specifies the ill-
         formed code unit sequences in the character encoding form. For example, the
         definition of UTF-8 (D92) specifies that code unit sequences such as <C0 AF>
         are ill-formed.

==> C10 When a process interprets a code unit sequence which purports to be in a Unicode char-
         acter encoding form, it shall treat ill-formed code unit sequences as an error condition
         and shall not interpret such sequences as characters.
       · For example, in UTF-8 every code unit of the form 110xxxx2 must be followed
         by a code unit of the form 10xxxxxx2. A sequence such as 110xxxxx2 0xxxxxxx2
         is ill-formed and must never be generated. When faced with this ill-formed
         code unit sequence while transforming or interpreting text, a conformant pro-
         cess must treat the first code unit 110xxxxx2 as an illegally terminated code unit
         sequence--for example, by signaling an error, filtering the code unit out, or
         representing the code unit with a marker such as U+FFFD replacement
         character.
       · Conformant processes cannot interpret ill-formed code unit sequences. How-
         ever, the conformance clauses do not prevent processes from operating on code
         unit sequences that do not purport to be in a Unicode character encoding form.
         For example, for performance reasons a low-level string operation may simply
         operate directly on code units, without interpreting them as characters. See,
         especially, the discussion under D89.
       · Utility programs are not prevented from operating on "mangled" text. For
         example, a UTF-8 file could have had CRLF sequences introduced at every 80
         bytes by a bad mailer program. This could result in some UTF-8 byte sequences
         being interrupted by CRLFs, producing illegal byte sequences. This mangled
         text is no longer UTF-8. It is permissible for a conformant program to repair
         such text, recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special case, and it
         must not be used in circumstances where it would cause security problems.
         There are important security issues associated with encoding conversion, espe-
         cially with the conversion of malformed text. For more information, see Uni-
         code Technical Report #36, "Unicode Security Considerations."

Here is the part that explains why Python narrow builds are actually UTF-16 not UCS-2,
and why its documentation needs to be updated:

    D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode
           encoding form if and only if it consists of a well-formed Unicode code unit sequence
           of that Unicode encoding form.
        · A Unicode string consisting of a well-formed UTF-8 code unit sequence is said
           to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a
           UTF-8 string for short.
        · A Unicode string consisting of a well-formed UTF-16 code unit sequence is said
           to be in UTF-16. Such a Unicode string is referred to as a valid UTF-16 string,
           or a UTF-16 string for short.
        · A Unicode string consisting of a well-formed UTF-32 code unit sequence is said
           to be in UTF-32. Such a Unicode string is referred to as a valid UTF-32 string,
           or a UTF-32 string for short.

==> Unicode strings need not contain well-formed code unit sequences under all conditions.
    This is equivalent to saying that a particular Unicode string need not be in a Unicode
    encoding form.

        · For example, it is perfectly reasonable to talk about an operation that takes the
           two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which
           contains an ill-formed UTF-16 code unit sequence, and concatenates them to
           form another Unicode string <004D D800 DF02 004D>, which contains a well-
           formed UTF-16 code unit sequence. The first two Unicode strings are not in
           UTF-16, but the resultant Unicode string is.

    [...]

     D14 Noncharacter: A code point that is permanently reserved for internal use and that
           should never be interchanged. Noncharacters consist of the values U+nFFFE and
           U+nFFFF (where n is from 0 to 1016) and the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.

     D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
           future assignment. Also known as an unassigned code point.
         · Surrogate code points and noncharacters are considered assigned code points,
           but not assigned characters.
         · For a summary classification of reserved and other types of code points, see
           Table 2-3.

    In general, a conforming process may indicate the presence of a code point whose use has
    not been designated (for example, by showing a missing glyph in rendering or by signaling
    an appropriate error in a streaming protocol), even though it is forbidden by the standard
    from interpreting that code point as an abstract character.

Here's how I read all that.

The noncharacters and the unpaired surrogates are illegal for interchange, and their
presence in a UTF means that that UTF is not conformant to the requirements for what
a UTF shall contain.  Nonetheless, internally it is necessary that all code points,
even noncharacter code points and surrogates, be representable, and doing so does not
mean that you are no longer are in that encoding form.  However, you must not allow
such things into a UTF stream, because doing so means that that stream is no longer
a UTF stream.

That's why I say that you are of conformance by having encoders and decoders of UTF
streams tolerate noncharacters.  You are not allowed to call something a UTF and do
non-UTF things with it, because this in violation of conformance requirement C2.
Therefore you must either (1) change what you are calling the thing you doing the
nonconforming thing to, or you must (2) change it to no longer do the nonconforming
thing.  If you do neither, then Python no longer conforms to the formal requirements
for handling such things as these are defined by the Unicode Standard, and therefore
that version of Python is no longer conformant to the version of the Unicode Standard
that it purports conformance to.  And yes, that's a long way of saying it's lying.

It's also why having noncharacters including surrogates in memory does *not* suddenly
mean that there are not stored in a UTF, because you have to be able to do that to
build up buffers per the concatenation example in conformance requirement D89.
Therefore, Python uses UTF-16 internally and should not say it uses UCS-2, because
that is inaccurate and incorrect; in short, it's wrong.  That doesn't help anybody.

At least, that's how I read the Unicode Standard.  Perhaps a more careful reading
than mine would admit alternate interpretations.  If you have not reread its Chapter
3 of late in its entirety, you probably want to do so.  There is quite a bit of
material there that is fundamental to any process that claims to be conformant with
the Unicode Standard.

I hope that makes sense.  These things can be extremely difficult to read, for they
are subtle and quick to anger. :)

--tom
msg142066 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 17:09
Ezio Melotti <report@bugs.python.org> wrote
   on Sun, 14 Aug 2011 07:15:09 -0000: 

> For example I don't think removing the 0x10FFFF upper limit is going to
> happen -- even if it might be useful for other things. 

I agree entirely.  That's why I appended a triple exclamation point to where I
said I certainly do not expect this.  It can only work fully on UTF-8ish systems
and up to 32 bits on UTF-32, and it is most emphatically *not* Unicode.  Yes,
there are things you can do with it, but it risks serious misunderstanding and
even noncomformance if not done very carefully.  The Standard does not forbid
such things internally, but you are not allowed to pass them around in
noninternal streams claiming they are real UTF streams.

> Also regular expressions are not part of the core and are not used
> that often, so I consider problems with narrow/wide builds, codecs and
> the unicode type much more important than problems with the re/regex
> module (they should be fixed too, but have lower priority IMHO).

One advantage of having an external library is the ability to update
it asynchronously.  Another is the possibility to swap in out altogether.
Perl only gained that ability, which Python has always had, some four
years ago with its 5.10 release.  To my knowledge, the only thing people
tend to use this for is to get Russ Cox's re2 library, which has very
different performance characteristics and guarantees that allow it to 
be used in potential starvation denial-of-service situations that the
normal Perl, Python, Java, etc regex engine cannot be safely used for.

-tom
msg142069 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-14 17:36
> > The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> > codec was following RFC 2279, it was producing valid UTF-8.  With RFC
> > 3629 a number of things changed in a non-backward compatible way.
> > Therefore we couldn't just change the behavior of the UTF-8 codec nor
> > rename it to something else in Python 2.  We had to wait till Python 3
> > in order to fix it.
> 
> I'm a bit confused on this.  You no longer fix bugs in Python 2?

In general, we try not to introduce changes that have a high probability
of breaking existing code, especially when what is being "fixed" is a
minor issue which almost nobody complains about.

This is even truer for stable branches, and Python 2 is very much a
stable branch now (no more feature releases after 2.7).

> That's why I say that you are of conformance by having encoders and decoders of UTF
> streams tolerate noncharacters.  You are not allowed to call something a UTF and do
> non-UTF things with it, because this in violation of conformance requirement C2.

Perhaps, but it is not Python's fault if the IETF and the Unicode
consortium have disagreed on what UTF-8 should be. I'm not sure what
people called "UTF-8" when support for it was first introduced in
Python, but you can't blame us for maintaining a consistent behaviour
across releases.
msg142070 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-14 17:46
> I'm a bit confused on this.  You no longer fix bugs in Python 2?

We do, but it's unlikely that we will introduce major changes in behavior.
Even if we had to get rid of narrow builds and/or fix len(), we would probably only do it in the next 3.x version (i.e. 3.3), and not in the next bug fix release of 3.2 (i.e. 3.2.2).

> That's why I say that you are of conformance by having encoders and
> decoders of UTF streams tolerate noncharacters.  You are not allowed
> to call something a UTF and do non-UTF things with it, because this
> in violation of conformance requirement C2.

This IMHO should be fixed, but it's another issue.

> If you have not reread its Chapter 3 of late in its entirety, you
> probably want to do so.  There is quite a bit of material there that
> is fundamental to any process that claims to be conformant with
> the Unicode Standard.

I am familiar with the Chapter 3, but admittedly I only read the parts that were relevant to the bugs I was fixing.  I never went through it checking that everything in Python matches the described behavior.
Thanks for pointing out the parts were Python doesn't follow the specs.
msg142076 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-14 19:06
Ezio Melotti <report@bugs.python.org> wrote
   on Sun, 14 Aug 2011 17:46:55 -0000: 

>> I'm a bit confused on this.  You no longer fix bugs in Python 2?

> We do, but it's unlikely that we will introduce major changes in behavior.

> Even if we had to get rid of narrow builds and/or fix len(), we would
> probably only do it in the next 3.x version (i.e. 3.3), and not in the
> next bug fix release of 3.2 (i.e. 3.2.2).

Antoine Pitrou <report@bugs.python.org> wrote
   on Sun, 14 Aug 2011 17:36:42 -0000:

> This is even truer for stable branches, and Python 2 is very much a
> stable branch now (no more feature releases after 2.7).

Does that mean you now go to 2.7.1, 2.7.2, etc?

I had thought that 2.6 was going to be the last, but then 2.7
ame out.  I think I remember Guido said something about there 
never being a 2.10, so I wasn't too surprised to see 2.7.  

--tom
msg142079 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-14 20:01
2.7 is the last 2.x.  There won't be any 2.8 (also I never heard that 2.6 was supposed to be the last).
We already have 2.7.2, and we will continue with 2.7.3, 2.7.4, etc for a few more years.  Eventually 2.7 will only get security fixes and the development will be focused on 3.x only.
msg142084 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-14 20:55
Tom, I appreciate your taking the time to help us improve our Unicode story. I agree that the compromises made a decade ago need to be revisited and revised.

I think it will help if you better understand our development process. Our current *intent* is that 'Python x.y' be a fixed language and that 'CPython x.y.0', '.1', '.2', etc be increasingly (and strictly -- no regressions) better implementations of Python x.y. (Of course, the distribution and installation names and up-to-now dropping of '.0' may confuse the distinction, but it is real.) As a consequence, correct Python x.y code that runs correctly on the CPython x.y.z implementation should run correctly on x.y.(z+1).

For the purpose of this tracker, a behavior issue ('bug') is a discrepancy between the documented intent of a supported Python x.y and the behavior of the most recent CPython x.y.z implementation thereof. A feature request is a design issue, a request for a change in the language definition (and in the corresponding .0 implementation). Most people (including you, obviously) that file feature requests regard them as addressing design bugs. But still, language definition bugs are different from implementation bugs.

Of course, this all assumes that the documents are correct and unambiguous. But accomplishing that can be as difficult as correct code. Obvious mistakes are quickly corrected. Ambiguities in relation to uncontroversial behavior are resolved by more exactly specifying the actual behavior. But ambiguities about behavior that some consider wrong, are messy. We can consult the original author if available, consult relevant tests if present, take votes, but some fairly arbitrary decision may be needed. A typical response may be to clarify behavior in the docs for the current x.y release and consider behavior changes for the next x.(y+1) release.

So the answer to your question, "Do we fix bugs?", is that we fix doc and implementation behavior bugs in the next micro x.y.z behavior bug-fix release and language design bugs in the next minor x.y language release. But note that language changes merely have to be improvements for Python in the future without necessarily worrying about whether a design decision made years ago was  or is a 'bug'.

The purpose of me discussing or questioning the 'type' of some of your issues is to *facilitate* change by getting the issue on the right track, in relation to our development process, as soon as possible.
msg142085 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-14 22:29
This is off-topic, but there was discussion on whether or not to have a 2.7. The decision was to focus on back-porting things that would make the eventual transition to 3.x easier.
msg142089 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-15 00:26
Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most other disagree), pretty well explains his viewpoint. The positive side is that we deliver more than we promise. The negative side is that by not promising what perhaps we should allows is not to deliver what perhaps we should.

*While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it so for the future for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Here is my current idea:

If the code unit stream contains any non-BMP characters (ie, surrogate pair of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence and the number of pairs. For indexing, look up the character index in the list of indexes by binary search and increment the codepoint index by the index of the index found to get the corresponding code unit index. (I have omitted the details needed avoid off-by-1 errors.)

This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python.

PS: The OSCON link in msg142036 currently gives me 404 not found
msg142091 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-15 01:28
Have a look here: http://98.245.80.27/tcpc/OSCON2011/gbu/index.html
msg142093 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 02:17
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Mon, 15 Aug 2011 00:26:53 -0000: 

> PS: The OSCON link in msg142036 currently gives me 404 not found

Sorry, I wrote 

     http://training.perl.com/OSCON/index.html

but meant 

     http://training.perl.com/OSCON2011/index.html

I'll fix it on the server in a short spell.

I am trying to keep the document up to date as I learn more, so it
isn't precisely the talk I gave in Portland.

> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

So I'm finding.  Perhaps that's why I keep getting confused. I do have a pretty firm
notion of what UCS-2 and UTF-16 are, and so I get sometimes self-contradictory results.
Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm not sure I
have found one, although the regex thing might count.

Thank you guys for being so helpful and understanding.

> They support non-BMP chars but only partially, because, BY DESIGN*,
> indexing and len are by code units, not codepoints. 

That's what Java did, too, and for the same reason.  Because they had
a UCS-2 implementation for Unicode 1.1 so when Unicode 2.0 came out
and they learned that they would need more than 16 bits, they piggybacked
UTF-16 onto the top of it instead of going for UTF-8 or UTF-32, and they're
still paying that price, and to my mind, heavily and continually.

Do you use Java?  It is very like Python in many of its 16-bit character issues.
Most of the length and indexing type functions address things by code unit
only, not copepoint.  But they would never claim to be UCS-2.

Oh, I realize why they did it.  For one thing, they had bytecode out there
that they had to support.  For another, they had some pretty low-level APIs
that didn't have enough flexibility of abstraction, so old source had to keep
working as before, even though this penalized the future.  Forever, kinda.

While I wish they had done better, and kinda think they could have, it
isn't my place to say.  I wasn't there (well, not paying attention) when
this was all happening, because I was so underwhelmed by the how annoyingly
overhyped it was.  A billion dollars of marketing can't be wrong, you know?
I know that smart people looked at it, seriously.  I just find the cure
they devised to be more in the problem set than the solution set.

I like how Python works on wide builds, especially with Python3. I was
pretty surprised that the symbolic names weren't working right on the
earlier version of the 2.6 wide build I tried them on.

I know have both wide and narrow builds installed of both 2.7 and 3.2,
so that shouldn't happen again.

> They are documented as being UCS-2 because that is what M-A Lemburg,
> the original designer and writer of Python's unicode type and the unicode-
> capable re module, wants them to be called. The link to msg142037,
> which is one of 50+ in the thread (and many or most other disagree),
> pretty well explains his viewpoint.

Count me as one of those many/most others who disagree. :)

> The positive side is that we deliver more than we promise. The
> negative side is that by not promising what perhaps we should allows
> is not to deliver what perhaps we should.

It is always better to deliver more than you say than to deliver less.

> * While I think this design decision may have been OK a decade ago for
>   a first implementation of an *optional* text type, I do not think it
>   so for the future for revised implementations of what is now *the*
>   text type. I think narrow builds can and should be revised and
>   upgraded to index, slice, and measure by codepoints. 

Yes, I think so, too.  If you look at the growth curve of UTF-8 alone,
it has followed a mathematically exponential growth curve in the 
first decade of this century.  I suspect that will turn into an S
surve with with aymtoptotic shoulders any time now.  I haven't looked
at it lately, so maybe it already has.  I know that huge corpora I work
with at work are all absolutely 100% Unicode now.  Thank XML for that.

> Here is my current idea:

> If the code unit stream contains any non-BMP characters (ie, surrogate
> pair of 16-bit code units), construct a sequence of *indexes* of such
> characters (pairs). The fixed length of the string in codepoints is
> n-k, where n is the number of code units (the current length) and k is
> the length of the auxiliary sequence and the number of pairs. For
> indexing, look up the character index in the list of indexes by binary
> search and increment the codepoint index by the index of the index
> found to get the corresponding code unit index. (I have omitted the
> details needed avoid off-by-1 errors.)

> This would make indexing O(log(k)) when there are surrogates. If that
> is really a problem because k is a substantial fraction of a 'large'
> n, then one should use a wide build. By using a separate internal
> class, there would be no time or space penalty for all-BMP text. I
> will work on a prototype in Python.

You are a brave man, and good.  Bravo!

It may be that that was the sort of thing that Larry was talking to me
about 6-8 months ago regarding how to construct a better way to access
strings by grapheme index.  

Everyone always talks about important they're sure O(1) access must be, and how they
therefore abosolutely have to have it no matter the tradeoffs.  But without two
implementations to compare real-world access patterns against, one really can't know.  
I know that index/rindex and substring operations are very rare in how I myself process
strings, but I have seen how Python people turn to those all the time when I would
reflexively use a pattern match.  So usage patterns may very; hence the desire for real
comparisons.  I'm perfectly willing to be convinced, but I want to see real data.

If I get time, I'll check into whether the Perl 6 people have any real data about that.
I had thought that that Parrot was currently using ICU4C for its string handling, which
may mean they're afflicted by UTF-16, something I wouldn't think they would tolerate,
especially since they need code points above 0x10_FFFF for their Normalization Form G
(Grapheme).  Piggy packing that on UTF-16 would require stealing some Private Use code
point to act as multilevel surrogates so that UTF-16 is infinitely extensible the way
UTF-8 is.  Not sure what I think about that, but it's been mentioned as a loophole
escape for when Unicode has to renege on its 21-bit promise.  I sure hope everyone has
stopped using UTF-16 by then myself.  It's trouble enough right now.

Hm, now that I think ICU about it, ICU just might use int sequences interally, so UTF-
32, for its own strings, so that might be it.  Yes, I see they too are going for O(1)
access.  Nonetheless, a careful enough UTF-16 implementation with a rich enough API is
able to access all of Unicode with no trouble.  It's just that the Java API from Sun is
not one such.  The Perl 6 spec is all about graphemes, and graphemes are all about code
points, which means an implementation could work around 16-bit code units so the user
never has to think about them. That would be just like the Perl 5 implmentation works
around 8-bit code units and never let the user notice it.

Hey, if you can build TCP out of IP, anything is possible. :)

--tom
msg142096 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 03:31
I wrote:

>> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

> So I'm finding.  Perhaps that's why I keep getting confused. I do have a pretty firm
> notion of what UCS-2 and UTF-16 are, and so I get sometimes self-contradictory results.
> Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm not sure I
> have found one, although the regex thing might count.

I just thought of one.  The casemapping functions don't work right on
Deseret, which is a non-BMP case-changing scripts.  That's one I submitted
as a bug, because I figure if the the UTF-8 decoder can decode the non-BMP
code points into paired UTF-16 surrogates, then the casing functions had
jolly well be able to deal with it.  If the UTF-8 decoder knows it is only
going to UCS-2, then it should have raised on exception on my non-BMP source.
Since it went to UTF-16, the rest of the language should have behaved accordingly.
Java does to this right, BTW, despite its UTF-16ness.

--tom
msg142097 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-15 04:18
>It is always better to deliver more than you say than to deliver less.
Except when promising too little is a copout.

>Everyone always talks about important they're sure O(1) access must be,

I thought that too until your challenge. But now that you mention it, indexing is probably not the bottleneck in most document processing. We are optimizing without measuring! We all know that is bad.

If done transparently, non-O(1) indexing should only be done when it is *needed*. And if it is a bottleneck, switch to a wide build -- or get a newer, faster machine.

I first used Python 1.3 on a 10 megahertz DOS machine. I just got a  multicore 3.+ gigahertz machine. Tradeoffs have changed and just as we use cycles (and space) for nice graphical interfaces, we should use some for global text support. In the same pair of machines, core memory jumped from 2 megabytes to 24 gigabytes. (And the new machine cost perhaps as much in adjusted dollars.) Of course, better unicode support should come standard with the OS and not have to be re-invented by every language and app.

Having promised to actually 'work on a prototype in Python', I decided to do so before playing. I wrote the following test:

tucs2 = 'A\U0001043cBC\U0001042f\U00010445DE\U00010428H'
tutf16= UTF16(tucs2)
tlist = ['A', '\U0001043c','B','C','\U0001042f','\U00010445',
         'D','E','\U00010428','H']
tlis2 = [tutf16[i] for i in range(len(tlist))]
assert tlist == tlis2

and in a couple hours wrote and debugged the class to make it pass (and added a couple of length tests). See the uploaded file.

Adding an __iter__ method to iterate by characters (with hi chars returned as wrapped length-1 surrogate pairs) instead of code units would be trivial. Adding the code to __getitem__ to handle slices should not be too hard. Slices containing hi characters should be wrapped. The cpdex array would make that possible without looking at the whole slice.

The same idea could be used to index by graphemes. For European text that used codepoints for pre-combined (accented) characters as much as possible, the overhead should not be too much. 

This may not be the best issue to attach this to, but I believe that improving the narrow build would allow fixing of the re/regex problems reported here.
msg142098 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 04:56
Keep in mind that we should be able to access and use lone surrogates too, therefore:
s = '\ud800'  # should be valid
len(s)  # should this raise an error? (or return 0.5 ;)?
s[0]  # error here too?
list(s)  # here too?

p = s + '\udc00'
len(p)  # 1?
s[0]  # '\U00010000' ?
s[1]  # IndexError?
list(p + 'a')  # ['\ud800\udc00', 'a']?

We can still decide that strings with lone surrogates work only with a limited number of methods/functions but:
1) it's not backward compatible;
2) it's not very consistent

Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'𐀀'
>>> len(_)
1
msg142102 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 06:53
Ezio Melotti <report@bugs.python.org> wrote on Mon, 15 Aug 2011 04:56:55 -0000: 

> Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
> >>> p
> '\ud800\udc00'
> >>> len(p)
> 2
> >>> p.encode('utf-16').decode('utf-16')
> '𐀀'
> >>> len(_)
> 1

    (For those who may not immediately realize from reading the surrogates,
     '𐀀' is code point 0x10000, the first non-BMP code point.  I piped it 
     through `uniquote -x` just to make sure.)

Yes, that makes perfect sense.  It's something of a buggy feature or featureful bug
that UTF-16 does this.  

When you are thinking of arbitrary sequences of code points, which is
something you have be able to do in memory but not in a UTF stream, then
one can say that one has four code points of anything in the 0 .. 0x10FFFF
range.  Those can be any arbitrary code points only (1) *while* in memory,
*and* assuming a (2) non-UTF16, ie UTF-32 or UTF-8 representation.  You
cannot do that with UTF-16, which is why it works only on a Python wide
build.  Otherwise they join up.

The reason they "join up" in UTF-16 is also the reason why unlike in regular
memory where you might be able to use an alternate representation like UTF-8 or
UTF-32, UTF streams cannot contain unpaired surrogates: because if that stream
were in UTF-16, you would never be able to tell the difference between a
sequence of a lead surrogate followed by a tail surrogate and the same thing
meaning just one non-BMP code point.  Since you would not be able to tell the
difference, it always only means the latter, and the former sense is illegal.
This is why lone surrogates are illegal in UTF streams.

In case it isn't obvious, *this* is the source of the [𝒜--𝒵] bug in all
the UTF-16 or UCS-2 regex languages. It is why Java 7 added \x{...}, so
that they can rewrite that as [\x{1D49C}--\x{1D4B5}] to pass the regex
compiler, so that it seems something indirect, not just surrogates.

That's why I always check it in my cross-language regex tests.  A 16-bit
language has to have a workaround, somehow, or it will be in trouble.

The Java regex compiler doesn't generate UTF-16 for itself, either. It
generates UTF-32 for its pattern.  You can see this right at the start of
the source code.  This is from the Java Pattern class:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Copy pattern to int array for convenience
        // Use double zero to terminate pattern
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        // Convert all chars into code points
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) {
                hasSupplementary = true;
            }
            temp[count++] = c;
        }

        patternLength = count;   // patternLength now in code points

See how that works?  They use an int(-32) array, not a char(-16) array!  It's reasonably
clever, and necessary.  Because it does that, it can now compile \x{1D49C} or erstwhile
embedded UTF-8 non-BMP literals into UTF-32, and not get upset by the stormy sea of
troubles that surrogates are. You can't have surrogates in ranges if you don't do
something like this in a 16-bit language.

Java couldn't fix the [𝒜--𝒵] bug except by doing the \x{...} indirection trick,
because they are stuck with UTF-16.  However, they actually can match the string
"𝒜" against the pattern "^.$", and have it fail on "^..$".   Yes, I know: the
code-unit length of that string is 2, but its regex count is just one dot worth.

I *believe* they did it that way because tr18 says it has to work that way, but
they may also have done it just because it makes sense.  My current contact at
Oracle doing regex support is not the guy who originally wrote the class, so I
am not sure.  (He's very good, BTW.  For Java 7, he also added named captures,
script properties, *and* brought the class up to conformance with tr18's 
"level 1" requirements.)

I'm thinking Python might be able to do in the regex engine on narrow builds the 
sort of thing that Java does.  However, I am also thinking that that might
be a lot of work for a situation more readily addressed by phasing out narrow
builds or at least telling people they should use wide builds to get that thing
to work.  

--tom

======================================================================

        ============================================
        ===>  QUASI OFF TOPIC ADDENDUM FOLLOWS  <===
        ============================================

The rest of this message is just a comparison demo of how Perl treats the
same sort of surrogate stuff that you were showing with Python. I'm not
sure that it as relevant to your problem as the Java part just was, since
we don't have UTF-16 the way Java does.  

One place it *may* be relevant to Python, which is the only reason I'm
including it, is that it shows what sort of elastic boundaries you have
with the old loose form utf8 that you no longer have with the current
strict definition of UTF-8 from the Unicode Standard.  

It turns out that you can't play those surrogate halvsies games in Perl,
because it thinks you're nuts. :)

   (Whether this is a bug or a feature may depend on where you're standing, but
    we're trying to conform with our reading of the Standard.  Things like the
    Unicode Collation Algorithm have third-party conformance suites, but I
    don't know of any for encoders/decoders, so one is back to deciding whether
    one has correctly understood the Standard.  As we all know, this isn't
    always all that easy.)

This is the clincher at the far end, so you can quit reading now if you want:

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

(Meanwhile, back to the quasi-off-topic monologue.)

Oh, you can have isolated surrogates in memory just fine.  We use abstract
code points not code units, and since we aren't UTF-16, there's no problem there:

    % perl -le 'print length("\x{D800}")'
    1
    % perl -le 'print length("\x{D800}\x{DC00}")'
    2
    % perl -le 'print length("\x{DC00}\x{D800}")'
    2

But you aren't allowed to *encode* those abstract characters into a UTF encoding form.  

        (The -CS switch is like having PERLUNICODE=S or PYTHONIOENCODING=utf8)

    % perl -CS -le 'print "\x{D800}\x{DC00}"'
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    ??????

Now, you might think my encoder did that, but it didn't.  Those "??????" are is from my
dumb Mac Terminal, *not* from the Perl utf8 encoder.  This is yet another reason, BTW, 
why using "?" for the replacement for a non-encodable code point is strongly discouraged 
by Unicode.  Otherwise it is too hard whether it is a "?" because it was in the original
data or a "?" because it was a replacment char.  Some of our encoders do use

     �  FFFD        REPLACEMENT CHARACTER
        * used to replace an incoming character whose value is unknown or unrepresentable in Unicode
        * compare the use of 001A as a control character to indicate the substitute function

But that isn't triggering here.  Run through uniquote shows that you got
illegal surrogates in your (loose) utf8 stream:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \x{D800}\x{DC00}

Use -b on uniquote to get bytes not hex chars:

    % perl -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    \xED\xA0\x80\xED\xB0\x80

Yes, I'm not fond of the dribbled-out warning, either.  It should of course be
an exception.  Part of what is going on here is that Perl's regular "utf8"
encoding is the (I believe) the same loose one that Python also uses, the old
one from before the restrictions. If you use the strict "UTF-8", then you get...

    % perl -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Unicode surrogate U+DC00 is illegal in UTF-8 at -e line 1.
    "\x{d800}" does not map to utf8.
    "\x{dc00}" does not map to utf8.
    \x{D800}\x{DC00}

Which is pretty whacky.  I'm *not* uniquoting that this time around, the
encoder is.   It did the \x{} substitution because it is absolutely
forbidden from ever emitting surrogates in a valid strict UTF-8 stream,
the way it will begrudgingly do in a loose utf8 stream (although the
result is not valid as strict UTF-8).

If you run with all warnings promoted into exceptions (as I almost always do),
then you get more reasonable behavior, maybe.   I'm going to use a command-line
switch that's just like saying

    use warnings "FATAL" => "all".

in the top-level scope.  First with loose utf8:

    % perl -Mwarnings=FATAL,all -CS -le 'print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 255

And next with strict UTF-8:

    % perl -Mwarnings=FATAL,all -le 'binmode(STDOUT, "encoding(UTF-8)"); print "\x{D800}\x{DC00}"' 
    Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
    Exit 25

So when you make them fatal, you get an exception -- which by not being caught,
causes the program to exit.  

    (The "Exit" is printed by my shell, and no, I'm sorry but I don't
     know why the status value differs.  Well, kinda I know.  It's
     because errno was 25 (ENOTTY) in the second case, but had been 0 in
     the first case so that turned into an unsigned char -1.)

If you *promise* Perl you know what you're doing and full speed ahead, you can
disable utf8 warnings, which will work to get you your two surrogates in the
outbound *loose* utf8 stream, as six bytes that look like UTF-8 but actually
"can't" be (called that) due to the Standard's rules:

    % perl -M-warnings=utf8  -CS -le 'print "\x{D800}\x{DC00}"'
    ??????
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -x
    \x{D800}\x{DC00}
    % perl -M-warnings=utf8 -CS -le 'print "\x{D800}\x{DC00}"' | uniquote -b
    \xED\xA0\x80\xED\xB0\x80

    (The weird module import is like saying 
        no warnings "utf8";
    as the outer scope.)

You can't get it with strict UTF-8 no matter how hard you try though.  Only
the loose one will put out invalid UTF-8.

So what about UTF-16?  Can we build things up piecemeal?  Turns out that
no, we can't.  I *think* this is a feature, but I'm not 100.000% sure.

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}\x{DC00}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

    % perl -MEncode -le 'print encode("UTF-16LE", "\x{D800}"), encode("UTF-16LE", "\x{DC00}")' | uniquote -b 
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    UTF-16 surrogate U+DC00 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You get all nulls because it refused to give you anything for surrogates.
Although you can force it for UTF-8 if you use the loose utf8 version that
is out of spec, there is just nothing you can to do encode into UTF-16 
code points that are surrogates.  Just can't.  There is no loose UTF-16.

Or that's what it looks like to me.  You can have surrogates in memory, but
those cannot make into a UTF-16 stream.  Or a UTF-32 stream either.

    % perl -MEncode -le 'print encode("UTF-32LE", "\x{D800}")' | uniquote -b
    UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
    \x00\x00\x00\x00

You have to encode integral code points, not halvsies:

    % perl -C0 -MEncode -le 'print encode("UTF-16BE", "\x{10000}")' | uniquote -b
    \xD8\x00\xDC\x00
    % perl -C0 -MEncode -le 'print encode("UTF-16LE", "\x{10000}")' | uniquote -b
    \x00\xD8\x00\xDC
    % perl -C0 -MEncode -le 'print encode("UTF-16",   "\x{10000}")' | uniquote -b
    \xFE\xFF\xD8\x00\xDC\x00

    % perl -C0 -MEncode -le 'print encode("UTF-32BE", "\x{10000}")' | uniquote -b
    \x00\x01\x00\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32LE", "\x{10000}")' | uniquote -b
    \x00\x00\x01\x00
    % perl -C0 -MEncode -le 'print encode("UTF-32",   "\x{10000}")' | uniquote -b
    \x00\x00\xFE\xFF\x00\x01\x00\x00

    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' | uniquote -b
    \xF0\x90\x80\x80
    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' | uniquote -x
    \x{10000}

    % perl -C0 -MEncode -le 'print encode("UTF-8",    "\x{10000}")' 
    𐀀
    % perl -CS          -le 'print                    "\x{10000}" ' 
    𐀀

tom++
msg142107 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-15 09:04
> Keep in mind that we should be able to access and use lone surrogates too, therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?
> s[0]  # error here too?
> list(s)  # here too?
> 
> p = s + '\udc00'
> len(p)  # 1?
> s[0]  # '\U00010000' ?
> s[1]  # IndexError?
> list(p + 'a')  # ['\ud800\udc00', 'a']?
> 
> We can still decide that strings with lone surrogates work only with a limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
> 
> Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
>>>> p
> '\ud800\udc00'
>>>> len(p)
> 2
>>>> p.encode('utf-16').decode('utf-16')
> '𐀀'
>>>> len(_)
> 1

Hi Tom,

welcome to Python land :-) Here's some more background information
on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in
Unicode objects and ones encoded in transfer formats by codecs.

We generally do allow lone surrogates, unassigned code
points, lone combining code points, etc. in Unicode objects
since Python needs to be able to work on all Unicode code points
and build strings with them.

The transfer format codecs do try to combine surrogates
on decoding data on UCS4 builds. On UCS2 builds they create
surrogate pairs as necessary. On output, those pairs will again
be joined to get round-trip safety.

It helps if you think of Python's Unicode objects using UCS2
and UCS4 instead of UTF-16/32. Python does try to make working
with UCS2 easy and in many cases behaves as if it were using
UTF-16 internally, but there are, of course, limits to this. In
practice, you only rarely get to see any of these special cases,
since non-BMP code points are usually not found in everyday
use. If they do become a problem for you, you have the option
of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started
Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the
terminology of those versions, some of which has changed in
more recent versions of Unicode.

For more background information, you might want take a look
at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened You'll also find that
collation and compression was already on the plate back then,
but since no one step forward, it wasn't implemented.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________
2011-10-04: PyCon DE 2011, Leipzig, Germany                50 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
msg142121 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-15 11:30
For what it's worth, I've had idea about string storage, roughly based on how *nix stores data on disk.

If a string is small, point to a block of codepoints.

If a string is medium-sized, point to a block of pointers to codepoint blocks.

If a string is large, point to a block of pointers to pointer blocks.

This means that a large string doesn't need a single large allocation.

The level of indirection can be increased as necessary.

For simplicity, all codepoint blocks contain the same number of codepoints, except the final codepoint block, which may contain fewer.

A codepoint block may use the minimum width necessary (1, 2 or 4 bytes) to store all of its codepoints.

This means that there are no surrogates and that different sections of the string can be stored in different widths to reduce memory usage.
msg142749 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-22 21:18
I improved UTF16.__getitem__ to handle negative indexes and slices. The later uses the same adjustment as for indexes. An __iter__ method is not needed as str.__iter__ used __getitem__. I will take further discussion of this prototype to python-ideas list.
msg142877 - (view) Author: Glenn Linderman (v+python) Date: 2011-08-24 07:04
In msg142098  Ezio said:
> Keep in mind that we should be able to access and use lone surrogates too, therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?

I say:
For streams and data types in which lone surrogates are permitted, a lone surrogate should be treated as and counted as a character (codepoint).

For streams and data types in which lone surrogates are not permitted, the assigned should be invalid, and raise an error; len would then never see it, and has no quandary.
msg143033 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 20:58
Wow.  A very educational discussion.  We will be referencing this issue for many years to come.

As long as the buck stops with me, I feel strongly that *today* changing indexing from O(1) to O(log N) is a bad idea, partly for technical reasons, partly because the Python culture isn't ready.  In 5 or 10 years we need to revisit this, and it wouldn't hurt if in the mean time we started seriously thinking about how to change our APIs so that O(1) indexing is not relied upon so much.  This may include rewriting tutorials to nudge users in the direction of using different idioms for text processing.

In the meantime, I think our best option is to switch CPython to the PEP 393 string implementation.  Despite its disadvantages (I understand the "spoiler" issue) is is generally no worse than a wide build, and there is working code today that we can optimize before 3.3 is released.

For Python implementations where this is not an option (I'm thinking Jython and IronPython, both of which are closely tied to a system string type that behaves like UTF-16) I hope that at least the regular expression behavior can be fixed so that "." matches a surrogate pair.  (Possibly they already behave that way, if they use a native regex library.)

In all cases, for future Python versions, we should tighten the codecs to reject data that the Unicode standard considers invalid (and we should offer separate non-strict codecs for situations where such invalid data needs to be processed).

I wish we could fix the codecs and the regex "." issue on narrow builds for Python versions before 3.3 (esp. 3.2 and 2.7), but I fear that this is considered too backwards incompatible (though for each specific fix we should consider this carefully).
msg143054 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-27 03:12
My proposal is better than log(N) in 2 respects.

1) There need only be a time penalty when there are non-BMP chars and indexing currently gives the 'wrong' answer and therefore when a time-penalty should be acceptable. Lookup for normal all-BMP strings could remain the same.

2) The penalty is log(K), where K in the number of non-BMP chars. In theory, O(logK) is as 'bad' as O(logN), for any fixed ratio K/N. In practice, the difference should be noticeable when there are just a few (say .01%) extended-range chars.

I am aware that this is an idea for the future, not now.
---

Fixing string iteration on narrow builds to produce code points the same
as with wide builds is easy and costs O(1) per code point (character), which is the same as the current cost. Then

>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'a\U0001043c': name(c)
'LATIN SMALL LETTER A'
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    for c in 'a\U0001043c': name(c)
ValueError: no such name

would work like it does on wide builds instead of failing.

I admit that it would be strange to have default iteration produce different items than default indexing (and indeed, str currently iterates by sequential indexing). But keeping them in sync means that buggy iteration is another cost of O(1) indexing.
msg143055 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-27 03:26
To me, making (default) iteration deviate from indexing is anathema.

However, there is nothing wrong with providing a library function that
takes a string and returns an iterator that iterates over code points,
joining surrogate pairs as needed. You could even have one that
iterates over characters (I think Tom calls them graphemes), if that
is well-defined and useful.
msg143056 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-27 04:09
PEP-393 will take care of iterating by code points.
Where would you have other iterators go? The string module?
Something else I have not thought of? Or something new?
msg143061 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-27 11:51
Guido van Rossum <report@bugs.python.org> wrote
   on Sat, 27 Aug 2011 03:26:21 -0000: 

> To me, making (default) iteration deviate from indexing is anathema.

So long is there's a way to interate through a string some other way
that by code unit, that's fine.  However, the Java way of 16-bit code
units is so annoying because there often aren't code point APIs, and 
so you get a lot of niggling errors creeping in.  This is part of why
I strongly prefer wide builds, so that code point and code unit are the
same thing again.

> However, there is nothing wrong with providing a library function that
> takes a string and returns an iterator that iterates over code points,
> joining surrogate pairs as needed. You could even have one that
> iterates over characters (I think Tom calls them graphemes), if that
> is well-defined and useful.

"Character" can sometimes be a confusing term when it means something
different to us programmers as it does to users.  Code point to mean the
integer is a lot clearer to us but to no one else.  At work I often just
give in and go along with the crowd and say character for the number that
sits in a char or wchar_t or Character variable, even though of course
that's a code point.  I only rebel when they start calling code units 
characters, which (inexperienced) Java people tend to do, because that
leads to surrogate splitting and related errors.

By grapheme I mean something the user perceives as a single character.  In
full Unicodese, this is an extended grapheme cluster.  These are code point
sequences that start with a grapheme base and have zero or more grapheme
extenders following it.  For our purposes, that's *mostly* like saying you
have a non-Mark followed by any number of Mark code points, the main
excepting being that a CR followed by a LF also counts as a single grapheme
in Unicode.

If you are in an editor and wanted to swap two "characters", the one 
under the user's cursor and the one next to it, you have to deal with
graphemes not individual code points, or else you'd get the wrong answer.
Imagine swapping the last two characters of the first string below,
or the first two characters of second one:

    contrôlée    contro\x{302}le\x{301}e
    élève        e\x{301}le\x{300}ve        

While you can sometimes fake a correct answer by considering things
in NFC not NFD, that's doesn't work in the general case, as there
are only a few compatibility glyphs for round-tripping for legacy
encodings (like ISO 8859-1) compared with infinitely many combinations
of combining marks.  Particularly in mathematics and in phonetics, 
you often end up using marks on characters for which no pre-combined
variant glyph exists.  Here's the IPA for a couple of Spanish words
with their tight (phonetic, not phonemic) transcriptions:

        anécdota    [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
        rincón      [rĩŋˈkõ̞n]

    NFD:
        ane\x{301}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinco\x{301}n      [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

    NFD:
        an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

So combining marks don't "just go away" in NFC, and you really do have to
deal with them.  Notice that to get the tabs right (your favorite subject :),
you have to deal with print widths, which is another place that you get
into trouble if you only count code points.

BTW, did you know that the stress mark used in the phonetics above
is actually a (modifier) letter in Unicode, not punctuation?

    # uniprops -a 2c8
    U+02C8 ‹ˈ› \N{MODIFIER LETTER VERTICAL LINE}
        \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned InSpacingModifierLetters Case_Ignorable CI Common Zyyy Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Print Spacing_Modifier_Letters Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Spacing_Modifier_Letters Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=BB Line_Break=Break_Before LB=BB Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _Case_Ignorable _X_Begin

That means those would all be matched by \w+, as unlike \p{alpha},
\p{word} includes not just \pL etc but also all the combining marks.
That's how you want it to work, although I think you have to use
regex not re in Python to get that.

Iterating by grapheme is easy in a regex engine that supports \X.
Instead of using "." to match a code point, you use a \X to match
a grapheme.  So the swapping problem goes away, and many others.
To capture a pair of graphemes for swapping you'd use (\X)(\X), and
to grab the first 6 graphemes without breaking them up you'd use \X{6}.
That means to interate by grpaheme you just split up your string one
\X at a time.

Here's a real-world example:

In the vim editor, when you're editing UTF-8 as I am this mail message,
because it is all about user-perceived characters, they actually use "." to
match an entire grapheme.  This is different form th eayw perl and
everybody else uses "." for a code point, not a grapheme.  If I did s/^.//
or s/.$// in vim, I would need s/^\X// or s/\X$// for in perl.  Similarly,
to swap "characters" with the "xp" command, it will grab the entire \X.
Put some of those phonetic transcriptions above into a vim buffer and play
with them to see what I mean.

Imagine using a format like "%-6.6s" on "contrôlée": that should produce
"contrô" not "contro".  That's because code points with the property
Bidi_Class=Non_Spacing_Mark (BC=NSM) do not advance the cursor, they just
stack up.

It gets even worse in that some code points advance the cursor by two
not by zero or one.  These include those with the East_Asian_Width
property value Full or Wide.  And they aren't always Asian characters,
either.  For example, these code points all have the EA=W property, so
take up to print columns:

     〈  U+2329 LEFT-POINTING ANGLE BRACKET
     〉  U+232A RIGHT-POINTING ANGLE BRACKET
     〃  U+3003 DITTO MARK
     〜  U+301C WAVE DASH
     〝  U+301D REVERSED DOUBLE PRIME QUOTATION MARK
     〞  U+301E DOUBLE PRIME QUOTATION MARK
     〟  U+301F LOW DOUBLE PRIME QUOTATION MARK

Perl's built-in string indexing, and hence its substrings, is strictly 
by code point and not by grapheme.  This is really frustrating at times,
because something like this screws up:

    printf "%-6.6", "contrôlée";
    printf "%-6.6", "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";

Logically, those should produce "contrô" and "a̠ˈne̞ɣ̞ð̞", but of course
when considering only code points, they won't.  Well, not unless the 
1st is in NFC, but there's no hope for the second.

Perl does have a grapheme cluster string class which provides a way 
to figure out the columns and also allows for substring operation by
grapheme. But it is not at all integrated into anything, which makes 
it tedious to use.

    use Unicode::GCString;  # on CPAN only, not yet in core

    my $string   = "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";
    my $gcstring = Unicode::GCString->new($string);
    my $colwidth = $gcstring->columns;
    if ($colwidth > 6) {
        print $gcstring->substr(0,6);
    } else {
        print " " x (6 - $colwidth);
        print $gcstring;
    }

Isn't that simply horrible?  You *will* get the right answer that way, but
what a pain!  Really, there needs to be a way for the built-in formatters
to understand graphemes.  But first, I think, you have to have the regex
engine understand them.  Matthew's regex does, because it supports \X.

There's a lot more to dealing with Unicode text than just extending the
character repertoire.  How much should fundamental to the language and how
much should be relegated to modules isn't always clear.  I do know I've had
to rewrite a *lot* of standard Unix tools to deal with Unicode properly.
For the wc(1) rewrite I only needed to consider graphemes with \X and 
Unicode line break sequences with \R, but other tools need better smarts.
For example, just getting the fmt(1) rewrite to wrap lines in paragraphs 
correctly requires understanding not just graphemes but the Unicode 
Linebreak Algorithm, which in turn relies upon understanding the print
widths for grapheme cluster strings and East Asian wide or full characters.

It's something you only want to do once and never think about again. :(

--tom
msg143088 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-27 20:23
Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one.

However, grapheme clustering does require look-ahead, which is a bit trickier. Assume s is a sanitized sequence of code points with unicode database entries. Ignoring line endings the following should work (I tested it with a toy definition of mark()):

def graphemes(s):
  sit = iter(s)
  try: graph = [next(sit)]
  except StopIteration: graph = []

  for cp in sit:
    if mark(cp):  
      graph.append(cp)
    else:
      yield combine(graph)
      graph = [cp]

  yield combine(graph)

I tested this with several input with
def mark(cp): return cp == '.'
def combine(l) return ''.join(l)

Python's object orientation makes formatting easy for the user. Assume someone does the hard work of writing (once ;-) a GCString class with a .__format__ method that interprets the format mini-language for graphemes, using a generalized version of your 'simply horrible' code. The might be done by adapting str.__format__ to use the grapheme iterator above. Then users should be able to write

>>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
"a̠ˈne̞ɣ̞ð̞"
(Note: Thunderbird properly displays characters with the marks beneath even though FireFox does not do so above or in its display of your message.)
msg143111 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-28 17:29
> PEP-393 will take care of iterating by code points.

Only for CPython. IronPython/Jython will still need a separate solution.

> Where would you have other iterators go? The string module?
> Something else I have not thought of? Or something new?

Undecided. But I think we may want to create a new module which
provides various APIs specifically for apps that need care when
dealing with Unicode.
msg143123 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-28 20:55
> But I think we may want to create a new module which
provides various APIs specifically for apps that need care when
dealing with Unicode.

I have started thinking that way too -- perhaps "unitools"?
It could contain the code point iterator for the benefit of other implementations. Actually, since 393 still allows insertion of surrogate values, it might not be completely useless for CPython either.
msg143392 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-02 10:05
> To start with, no code point which when bitwise added with 0xFFFE 
> returns 0xFFFE can never appear in a valid UTF-* stream, but Python
> allow this without any error.

> That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all 
> planes, where NN is 00 through 10 in hex.  So that's 2 noncharacters
> times 17 planes = 34 code points illegal for interchange that Python 
> is passing through illegally.  

> The remaining 32 nonsurrogate code points illegal for open interchange
> are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
> doesn't seem to care.

It's not entirely clear to me what the UTF-8 codec is supposed to do with this.

For example U+FFFE is <EF BF BE> in UTF-8, and this is valid according to table 3-7, Chapter 03[0]:
"""
Code points     1st byte  2nd byte  3rd byte
U+E000..U+FFFF  EE..EF    80..BF    80..BF
"""

Chapter 16, section 16.7 "Noncharacters" says[1]:
"""
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.
"""

and
"""
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them.
"""
seem to suggest that encoding them is forbidden.


"""
If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
"""
here decoding seems allowed, possibly with a replacement (that would depend on the error handler used though, so the default 'strict' would turn this in an error).


Chapter 03, after D14, says:
"""
In general, a conforming process may indicate the presence of a code point whose use has not been designated (for example, by showing a missing glyph in rendering or by signaling an appropriate error in a streaming protocol), even though it is forbidden by the standard from interpreting that code point as an abstract character.
"""

and in C7:
"""
If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text. In general, a noncharacter should be treated as an unassigned code point.
"""

This doesn't mention clearly what the codec is supposed to do.
On one hand, it suggests that an error can be raised, i.e. consider the noncharacter invalid like out-of-range codepoints (>U+10FFFF) or lone surrogates. 
On the other hand it says that they should be treated as an unassigned code point, i.e. encoded/decoded normally, and doesn't list them as invalid in table 3-7.


[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
[1]: http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf
msg143432 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-09-02 19:24
Ezio, that is a lot of nice work to track down those pieces of the standard. I think the operative phrase in many of those quotes is 'open interchange'. Codecs are also used for private storage. If I use the unassigned or private-use code points in a private project, I would use utf-8 to save the work between active sessions. That is quite fine under the standard. But I should not put files with such codings on the net for 'open interchange'. And if I receive them, the one thing I should not do is interpret them as meaningful abstract characters.

So the codec should allow for both public and private use. I have the impression that is does so now. A Python programmer should know whether the codec is being used for private (or local group) files or random stuff from the net, and therefore, what the appropriate error handling is. If they do not now, the docs could suggest that public text should normally be decoded with 'strict' or 'replace' and that 'ignore' should normally be reserved for local text that is known to intentionally have 'errors'.

I am pretty sure that the intent of prohibiting non-standard interpretation of code points as abstract characters is to prevent 'organic' evolution of the code point -- abstract character mapping in which anyone (or any company) who wants to do so creates a new pairing and promotes its wide recognition around the world. Conforming implementations are strict in both what they produce (publicly) *and* in what they accept (from public sources). Many now think that liberal acceptance of sloppy html promoted sloppy production of html.
msg143433 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-09-02 19:28
> So the codec should allow for both public and private use.

IIUC we have (or are planning) codecs that support the private use.
They are not called "utf-8" though.
msg143446 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-03 00:28
Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during encoding and decoding.

Chapter 03, D79 also says:
"""
To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.
"""

and this seems to imply that the only unencodable codepoint are the non-scalar values, i.e. surrogates and codepoints >U+10FFFF.  Noncharacters shouldn't thus receive any special treatment (at least during encoding).

Tom, do you agree with this?  What does Perl do with them?
msg143702 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-07 19:26
Ezio Melotti <report@bugs.python.org> wrote
   on Sat, 03 Sep 2011 00:28:03 -0000: 

> Ezio Melotti <ezio.melotti@gmail.com> added the comment:

> Or they are still called UTF-8 but used in combination with different error
> handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs
> should produce data that can be used for "open interchange", rejecting all the
> invalid data, both during encoding and decoding.

> Chapter 03, D79 also says:

>        To ensure that the mapping for a Unicode encoding form is one-to-one,
>        all Unicode scalar values, including those corresponding to
>        noncharacter code points and unassigned code points, must be mapped to
>        unique code unit sequences. Note that this requirement does not extend
>        to high-surrogate and low-surrogate code points, which are excluded by
>        definition from the set of Unicode scalar values.

> and this seems to imply that the only unencodable codepoint are the non-scalar
> values, i.e. surrogates and codepoints >U+10FFFF.  Noncharacters shouldn't
> thus receive any special treatment (at least during encoding).

> Tom, do you agree with this?  What does Perl do with them?

I agree that one needs to be able to encode any scalar value and
store it in memory in a designated character encoding form.

This is different from streams, though.

The 3 different Unicode "character encoding *forms*" -- UTF-8,
UTF-16, and UTF-32 -- certainly need to support all possible
scalar values.  These are the forms used to store code points in
memory.  They do not have BOMs, because one knows one's memory
layout.   These are specifically allowed to contain the
noncharacters:

    http://www.unicode.org/reports/tr17/#CharacterEncodingForm

    The third type is peculiar to the Unicode Standard: the noncharacter.
    This is a kind of internal-use user-defined character, not intended for
    public interchange.

The problem is that one must make a clean distinction between character
encoding *forms* and character encoding *schemes*.

    http://www.unicode.org/reports/tr17/#CharacterEncodingScheme

    It is important not to confuse a Character Encoding Form (CEF) and a CES.

    1. The CEF maps code points to code units, while the CES transforms
       sequences of code units to byte sequences.
    2. The CES must take into account the byte-order serialization of
       all code units wider than a byte that are used in the CEF.
    3. Otherwise identical CESs may differ in other aspects, such as the
       number of user-defined characters allowed.

    Some of the Unicode encoding schemes have the same labels as the three
    Unicode encoding forms. [...]

    As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, for
    example the serialized bytes for streaming data or in files; they may have
    either byte orientation, and a single BOM may be present at the start of the
    data. When the usage of the abbreviated designators UTF-16 or UTF-32 might
    be misinterpreted, and where a distinction between their use as referring to
    Unicode encoding forms or to Unicode encoding schemes is important, the full
    terms should be used. For example, use UTF-16 encoding form or UTF-16
    encoding scheme. They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
    respectively.

    The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16,
    UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

	* UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF32-LE are simple CESs.

        * UTF-16 and UTF-32 are compound CESs, consisting of an single, optional
          byte order mark at the start of the data followed by a simple CES.

I believe that what this comes down to is that you can have noncharacters in memory
as a CEF, but that you cannot have them in a CES meant for open interchange.
And what you do privately is a different, third matter.

What Perl does differs somewhat depending on whether you are just playing
around with encodings in memory verus using streams that have particular
encodings associated with them.  I belive that you can think of this as the
first being for CEF stuff and the second is for CES stuff.

Streams are strict.  Memory isn't.

Perl will never ever produce nor accept one of the 66 noncharacers on any
stream marked as one of the 7 character encoding schemes.  However, we
aren't always good about whether we generate an exception or whether we
return replacement characters.  

Here the first process created a (for the nonce, nonfatal) warning, 
whereas the second process raised an exception:

     %   perl -wle 'binmode(STDOUT, "encoding(UTF-16)")|| die; print chr(0xFDD0)' | 
	 perl -wle 'binmode(STDIN, "encoding(UTF-16)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    UTF-16:Unicode character fdd0 is illegal at -e line 1.
    Exit 255

Here the first again makes a warning, and the second returns a replacement
string because:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-8)")|| die; print chr(0xFDD0)' | 
	perl -wle 'binmode(STDIN, "encoding(UTF-8)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    "\x{fdd0}" does not map to utf8.
    92

If you call encode() manually, you have a lot clearer control over this, 
beause you can specify what to do with invalid characters (exceptions,
replacements, etc).

We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8", that
can produce and accept illegal characters, although by default it is still
going to generate a warning:

    %   perl -wle 'binmode(STDOUT, "encoding(utf8)")|| die; print chr(0xFDD0)' | 
	perl -wle 'binmode(STDIN, "encoding(utf8)")||die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    64976

I could talk about ways to control whether it's a warning or an exception
or a replacement string or nothing at all, but suffice to say such
mechanisms do exist.  I just don't know that I agree with the defaults.

I think a big problem here is that the Python culture doesn't use stream
encodings enough.  People are always making their own repeated and tedious
calls to encode and then sending stuff out a byte stream, by which time it
is too late to check.  This is a real problem, because now you cannot be
permissive for the CES but conservative for the CEF.  

In Perl this doesn't in practice happen because in Perl people seldom send
the result of encode() out a byte stream; they send things out character
streams that have proper encodings affiliated with them.  Yes, you can do
it, but then you lose the checks.  That's not a good idea.

Anything that deals with streams should have an encoding argument.  But
often/many? things in Python don't.  For example, subprocess.Popen
doesn't even seem to take an encoding argument.  This makes people do
things by hand too often.  In fact, subprocess.Popen won't even accept
normal (Python 3 Unicode) strings, which is a real pain.  I do think the
culture of calling .encode("utf8") all over the place needs to be
replaced with a more stream-based approach in Python.  I had another
place where this happens too much in Python besides subprocess.Popen but
I can't remember where it is right now.

Perl's internal name for the strict utf stuff is for example "utf-8-strict".
I think you probably want to distingish these, and make the default strict
the way we do with "UTF-8".  We do not ever allow nonstrict UTF-16 or UTF-32,
only sometimes nonstrict UTF-8 if you call it "utf8".

I quote a bit of the perlunicode manpage below which talks about this a bit.

Sorry it's taken me so long to get back to you on this.  I'd be happy to answer
any further questions you might have.

--tom

	PERLUNICODE(1)   Perl Programmers Reference Guide  PERLUNICODE(1)

	   Non-character code points
	       66 code points are set aside in Unicode as "non-character code
	       points".  These all have the Unassigned (Cn) General Category, and
	       they never will be assigned.  These are never supposed to be in
	       legal Unicode input streams, so that code can use them as sentinels
	       that can be mixed in with character data, and they always will be
	       distinguishable from that data.  To keep them out of Perl input
	       streams, strict UTF-8 should be specified, such as by using the
	       layer ":encoding('UTF-8')".  The non-character code points are the
	       32 between U+FDD0 and U+FDEF, and the 34 code points U+FFFE,
	       U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.  Some people are
	       under the mistaken impression that these are "illegal", but that is
	       not true.  An application or cooperating set of applications can
	       legally use them at will internally; but these code points are
	       "illegal for open interchange". Therefore, Perl will not accept
	       these from input streams unless lax rules are being used, and will
	       warn (using the warning category "nonchar", which is a sub-category
	       of "utf8") if an attempt is made to output them.

	   Beyond Unicode code points
	       The maximum Unicode code point is U+10FFFF.  But Perl accepts code
	       points up to the maximum permissible unsigned number available on
	       the platform.  However, Perl will not accept these from input
	       streams unless lax rules are being used, and will warn (using the
	       warning category "non_unicode", which is a sub-category of "utf8")
	       if an attempt is made to operate on or output them.  For example,
	       "uc(0x11_0000)" will generate this warning, returning the input
	       parameter as its result, as the upper case of every non-Unicode
	       code point is the code point itself.

	perl v5.14.0                2011-05-07                         26
msg143720 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-08 08:32
So to summarize a bit, there are different possible level of strictness:
  1) all the possible encodable values, including the ones >10FFFF;
  2) values in range 0..10FFFF;
  3) values in range 0..10FFFF except surrogates (aka scalar values);
  4) values in range 0..10FFFF except surrogates and noncharacters;

and this is what is currently available in Python:
  1) not available, probably it will never be;
  2) available through the 'surrogatepass' error handler;
  3) default behavior (i.e. with the 'strict' error handler);
  4) currently not available.

(note: this refers to the utf-8 codec in Python 3, but it should be true for the utf-16/32 codecs too once #12892 is fixed.  This whole message refers to codecs only and what they should (dis)allow.  What we use internally seems to work fine and doesn't need to be changed.)

Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).  The possible options are:
  * add a new codec (actually one for each UTF encoding);
  * add a new error handler that explicitly disallows noncharacters;
  * change the meaning of 'strict' to match option 4;

This depends on what should be the default behavior while dealing with noncharacters.  If they are rejected by default, then 'strict' should reject them.  However this would leave us without option 3 (something to encode all and only scalar values), and surrogatepass will be misnamed if it also allows noncharacters (and if it doesn't we will end up without option 2 too).  This is apparently what Perl does:
> Perl will never ever produce nor accept one of the 66 noncharacers
> on any stream marked as one of the 7 character encoding schemes. 


Implementation-wise, I think the 'strict' error handler should be the strictest one, because the codec must detects all the "problematic" chars and send them to the error handler that might then decide what to do with them.  I.e. if the codec detects noncharacters, sends them to the error handler, and the error handler is strict, an error will be raised; if it doesn't detect them, the error handler won't be able to do anything with them.  
Another option is to provide another codec that specifically detects them, but this means re-implementing a slightly different version of each codec (or possibly add an extra argument to the PyUnicode_{Encode,Decode}UTF* functions).

We could also decide to leave the handling of noncharacters as it is -- after all the Unicode standard doesn't seem to explicitly forbid them as it does with e.g. surrogates.


> We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
> that can produce and accept illegal characters, although by default it
> is still going to generate a warning

How did Perl implement this?  With two (or more) slightly different version of the same codec?
And how does Perl handle errors?  With some global options that turns (possibly specific) warnings into error (like python -We)?

Python has different codecs that encode/decode str/bytes and whenever they find a "problematic" char they send it to the error handler that might decide to raise an error, remove the char, replace it with something else, sending it back unchanged, generate a warning and so on.  In this way you can have different combinations of codecs and error handlers to get the desired behaviors.  (and FWIW in Python 'utf8' is an alias for 'UTF-8'.)

> I think a big problem here is that the Python culture doesn't use stream
> encodings enough.  People are always making their own repeated and tedious
> calls to encode and then sending stuff out a byte stream, by which time it
> is too late to check.
> [...]
> Anything that deals with streams should have an encoding argument.  But
> often/many? things in Python don't.
Several objects have an .encoding and .error attributes (e.g. sys.stdin/out), and they are used to encode/decode the text/bytes sent/read to/from them.  In other places we prefer the "explicit is better than implicit" approach and require the user (or some other higher-level layer) to encode/decode manually.

I'm not sure why you are saying that it's too late to check, and since the encoding/decoding happens only in a few places I don't think it's tedious at all (and often it's automatic too).
msg143735 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-09-08 18:56
On 9/8/2011 4:32 AM, Ezio Melotti wrote:
> So to summarize a bit, there are different possible level of strictness:
>    1) all the possible encodable values, including the ones>10FFFF;
>    2) values in range 0..10FFFF;
>    3) values in range 0..10FFFF except surrogates (aka scalar values);
>    4) values in range 0..10FFFF except surrogates and noncharacters;
>
> and this is what is currently available in Python:
>    1) not available, probably it will never be;
>    2) available through the 'surrogatepass' error handler;
>    3) default behavior (i.e. with the 'strict' error handler);
>    4) currently not available.
>
> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).  The possible options are:
>    * add a new codec (actually one for each UTF encoding);
>    * add a new error handler that explicitly disallows noncharacters;
>    * change the meaning of 'strict' to match option 4;

If 'strict' meant option 4, then 'scalarpass' could mean option 3. 
'surrogatepass' would then mean 'pass surragates also, in addition to 
non-char scalers'.
msg144256 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-18 22:45
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Thu, 08 Sep 2011 18:56:11 -0000: 

>On 9/8/2011 4:32 AM, Ezio Melotti wrote:

>> So to summarize a bit, there are different possible level of strictness:
>>    1) all the possible encodable values, including the ones>10FFFF;
>>    2) values in range 0..10FFFF;
>>    3) values in range 0..10FFFF except surrogates (aka scalar values);
>>    4) values in range 0..10FFFF except surrogates and noncharacters;

>> and this is what is currently available in Python:
>>    1) not available, probably it will never be;
>>    2) available through the 'surrogatepass' error handler;
>>    3) default behavior (i.e. with the 'strict' error handler);
>>    4) currently not available.

>> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).  The possible options are:
>>    * add a new codec (actually one for each UTF encoding);
>>    * add a new error handler that explicitly disallows noncharacters;
>>    * change the meaning of 'strict' to match option 4;

> If 'strict' meant option 4, then 'scalarpass' could mean option 3. 
> 'surrogatepass' would then mean 'pass surragates also, in addition to 
> non-char scalers'.

I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
to reject both surrogates *and* noncharacters. Here's something from the
published Unicode Standard's p.24 about noncharacter code points:

    • Noncharacter code points are reserved for internal use, such as for 
      sentinel values. They should never be interchanged. They do, however,
      have well-formed representations in Unicode encoding forms and survive
      conversions between encoding forms. This allows sentinel values to be
      preserved internally across Unicode encoding forms, even though they are
      not designed to be used in open interchange.

And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:

    C2 A process shall not interpret a noncharacter code point as an 
       abstract character.

        • The noncharacter code points may be used internally, such as for 
          sentinel values or delimiters, but should not be exchanged publicly.

I'd have to check the fine print, but I am pretty sure that "shall not" 
is an imperative form.  We have understand that to read that a comforming
process *must*not* do that.  It's because of that wording that in Perl,
using either of {en,de}code() with any of the "UTF-{8,16,32}" encodings,
including the LE/BE versions as appropriate, it will not produce nor accept
a noncharacter code point like FDD0 or FFFE.

Do you think we may perhaps have misread that conformance clause?

Using Perl's special, loose-fitting "utf8" encoding, you can get it do
noncharacter code points and even surrogates, but you have to suppress
certain things to make that happen quietly.  You can only do this with
"utf8", not any of the UTF-16 or UTF-32 flavors.  There we give them no 
choice, so you must be strict.  I agree this is not fully orthogonal.

Note that this is the normal thing that people do:

    binmode(STDOUT, ":utf8");

which is the *loose* version.  The strict one is "utf8-strict" or "UTF-8":

    open(my $fh, "< :encoding(UTF-8)", $pathname)

So it is a bit too easy to get the loose one.  We felt we had to do this
because we were already using the loose definition (and allowing up to
chr(2**32) etc) when the Unicode Consortium made clear what sorts of
things must not be accepted, or perhaps, before we made ourselves clear
on this.  This will have been back in 2003, when I wasn't paying very
close attention.

I think that just like Perl, Python has a legacy of the original loose
definition.  So some way to accommodate that legacy while still allowing
for a comformant application should be devised.  My concern with Python
is that people tend to make they own manual calls to encode/decode a lot
more often than they do in Perl.  That people that if you only catch it
on a stream encoding, you'll miss it, because they will use binary I/O
and miss the check.

--tom

    Below I show a bit of how this works in Perl.  Currently the builtin
    utf8 encoding is controlled somewhat differently from how the Encode
    module's encode/decode functions are.  Yes, this is not my idea of good.

    This shows that noncharacters and surrogates do not survive the
    encoding/decoding process for UTF-16:

        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFDD0)))' | uniquote -v
        \N{REPLACEMENT CHARACTER}
        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFFFE)))' | uniquote -v
        \N{REPLACEMENT CHARACTER}
        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xD800)))' | uniquote -v
        UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.

    If you pass a third argument to encode/decode, you can tell it what to
    do on error; an argument of 1 raises an exception.  Not supplying a
    third argument gets the "default" behavior, which varies by encoding.
    (The careful programmer is apt to want to pass in an appropropriate
     bit mask of things like DIE_ON_ERR, WARN_ON_ERR, RETURN_ON_ERR,
     LEAVE_SRC, PERLQQ, HTMLCREF, or XMLCREF.)

    With "utf8" vs "UTF-8" using encode(), the default behavior is to swap in 
    the Unicode replacement character for things that don't map to the given 
    encoding, as you saw above with UTF-16:

        % perl -C0 -MEncode -wle 'print encode("utf8", chr(0xFDD0))' | uniquote -v
        \N{U+FDD0}
        % perl -C0 -MEncode -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

        % perl -C0 -MEncode= -wle 'print encode("utf8", chr(0xD800))' | uniquote -v
        \N{U+D800}
        % perl -C0 -MEncode= -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

        % perl -C0 -MEncode=:all -wle 'print encode("utf8", chr(0x100_0000))' | uniquote -v
        \N{U+1000000}
        % perl -C0 -MEncode=:all -wle 'print encode("UTF-8", chr(0x100_0000))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

    With the builtin "utf8" encoding, which does *not* go through the
    Encode module, you instead control all this through lexical
    warnings/exceptions categories.   By default, you get a warning if
    you try to use noncharacter, surrogate, or nonunicode code points
    even on a loose utf8 stream (which is what -CS gets you):

        % perl -CS -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
        Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
        \N{U+FDD0}
        \N{U+D800}
        \N{U+1000000}

    Notice I didn't ask for warnings there, but I still got them.  This
    promotes all utf8 warnings into exceptions, thus dying on the first one
    it finds:

        % perl -CS -Mwarnings=FATAL,utf8 -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.

    You can control these separately.  For example, these all die of an
    exception:

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xFDD0)'   
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xD800)'   
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0x100_0000)' 
        Code point 0x1000000 is not Unicode, may not be portable at -e line 1.

    While these do not:

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "nonchar";     print chr(0xFDD0)'     | uniquote
        \N{U+FDD0}
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "surrogate";   print chr(0xD800)'     | uniquote
        \N{U+D800}
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "non_unicode"; print chr(0x100_0000)' | uniquote
        \N{U+1000000}

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings qw(nonchar surrogate non_unicode);
                        print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        \N{U+FDD0}
        \N{U+D800}
        \N{U+1000000}
msg144260 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-09-19 01:23
My long-ago memory is that 'should not' is slightly looser  in w3c parlance than 'must not'. However, it is a moot point if we decide to follow the 'should' in 3.3 for the default 'strict' mode, which both Ezio and I think we 'should' ;-). Our 'errors' parameter makes it easy to request something else, but it has to be explicit.
msg144266 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-19 11:11
We could also look at what other languages do and/or ask to the Unicode consortium [0].

[0]: http://www.unicode.org/consortium/distlist.html
msg144287 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-19 15:12
Ezio Melotti <report@bugs.python.org> wrote
   on Mon, 19 Sep 2011 11:11:48 -0000: 

> We could also look at what other languages do and/or ask to the
> Unicode consortium.

I will look at what Java does a bit later on this morning, which is the
only other commonly used language besides C that I feel even reasonably
competent at.  I seem to recall that Java changed its default behavior on
certain Unicode decoding issues from warnings to exceptions between one
release and the next, but can't remember any details.

As the Perl Foundation is a member of the Unicode Consortium and I am on
the mailing list, I suppose I could just ask them.  I feel a bit timid
though because the last thing I brought up there was based on a subtle
misunderstanding of mine regarding the IDC and Pattern_Syntax properties.
I hate looking dumb twice in row. :)

--tom
msg144289 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-09-19 15:28
Tom Christiansen wrote:
> 
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
> 
>     • Noncharacter code points are reserved for internal use, such as for 
>       sentinel values. They should never be interchanged. They do, however,
>       have well-formed representations in Unicode encoding forms and survive
>       conversions between encoding forms. This allows sentinel values to be
>       preserved internally across Unicode encoding forms, even though they are
>       not designed to be used in open interchange.
> 
> And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
> 
>     C2 A process shall not interpret a noncharacter code point as an 
>        abstract character.
> 
>         • The noncharacter code points may be used internally, such as for 
>           sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's
up to the applications to conform to Unicode or not and the
application also defines what "exchange" means in the above context.

Python itself needs to be able to deal with assigned non-character
code points as well as unassigned code points or code points that
are part of special ranges such as the surrogate ranges.

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because
we have a way to optionally allow these via an error handler,
but -1 on making changes that cause full range round-trip safety
of the UTF encodings to be lost without a way to turn the functionality
back on.
msg144307 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-19 18:22
No good news on the Java front.  They do all kinds of things wrong.  
For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8 
input stream, which is illegal.  There's more they do wrong, including 
in their documentation, but I won't bore you with their errors.

I'm going to seek clarification on some matters here.

--tom
msg144312 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-19 22:41
It appears that I'm right about surrogates, but wrong about
noncharacters.  I'm seeking a clarification there.

--tom
msg147776 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-16 22:44
Closing this bug as PEP 393 is now implemented and makes so-called "narrow builds" obsolete. Python now has an adaptative internal representation that is able to fit all unicode characters.
History
Date User Action Args
2011-11-16 22:44:03pitrousetstatus: open -> closed
resolution: out of date
messages: + msg147776

stage: resolved
2011-09-22 21:42:18belopolskysetnosy: + belopolsky
2011-09-21 12:18:47zbyszsetnosy: + zbysz
2011-09-19 22:41:38tchristsetmessages: + msg144312
2011-09-19 18:22:06tchristsetmessages: + msg144307
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
2011-09-19 15:28:35lemburgsetmessages: + msg144289
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
2011-09-19 15:12:27tchristsetmessages: + msg144287
2011-09-19 11:11:48ezio.melottisetmessages: + msg144266
2011-09-19 01:23:35terry.reedysetmessages: + msg144260
2011-09-18 22:45:29tchristsetmessages: + msg144256
2011-09-15 14:39:29abacabadabacabasetnosy: + abacabadabacaba
2011-09-08 18:56:10terry.reedysetmessages: + msg143735
2011-09-08 08:32:44ezio.melottisetmessages: + msg143720
2011-09-07 19:26:45tchristsetmessages: + msg143702
2011-09-03 00:28:03ezio.melottisetmessages: + msg143446
2011-09-02 19:28:47gvanrossumsetmessages: + msg143433
2011-09-02 19:24:22terry.reedysetmessages: + msg143432
2011-09-02 10:05:14ezio.melottisetmessages: + msg143392
2011-08-28 20:55:21terry.reedysetmessages: + msg143123
2011-08-28 17:29:24gvanrossumsetmessages: + msg143111
2011-08-27 20:23:39terry.reedysetmessages: + msg143088
2011-08-27 11:51:48tchristsetmessages: + msg143061
2011-08-27 04:09:03terry.reedysetmessages: + msg143056
2011-08-27 03:26:20gvanrossumsetmessages: + msg143055
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
2011-08-27 03:12:33terry.reedysetmessages: + msg143054
2011-08-26 20:58:44gvanrossumsetnosy: + gvanrossum
messages: + msg143033
2011-08-24 07:04:53v+pythonsetnosy: + v+python
messages: + msg142877
2011-08-23 23:58:05terry.reedysetfiles: + utf16.py
2011-08-22 21:28:48hayposetnosy: + haypo
2011-08-22 21:19:53terry.reedysetfiles: - utf16.py
2011-08-22 21:18:45terry.reedysetfiles: + utf16.py

messages: + msg142749
2011-08-15 11:30:37mrabarnettsetmessages: + msg142121
2011-08-15 09:04:58lemburgsetnosy: + lemburg

messages: + msg142107
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug
2011-08-15 06:53:34tchristsetmessages: + msg142102
2011-08-15 04:56:54ezio.melottisetmessages: + msg142098
2011-08-15 04:18:45terry.reedysetfiles: + utf16.py

messages: + msg142097
2011-08-15 03:31:13tchristsetmessages: + msg142096
2011-08-15 02:17:24tchristsetmessages: + msg142093
2011-08-15 01:28:58mrabarnettsetmessages: + msg142091
2011-08-15 00:26:52terry.reedysetmessages: + msg142089
2011-08-14 22:29:24terry.reedysetmessages: + msg142085
2011-08-14 20:55:17terry.reedysetmessages: + msg142084
2011-08-14 20:01:05ezio.melottisetmessages: + msg142079
2011-08-14 19:06:01tchristsetmessages: + msg142076
2011-08-14 17:46:55ezio.melottisetmessages: + msg142070
2011-08-14 17:36:42pitrousetmessages: + msg142069
2011-08-14 17:09:27tchristsetmessages: + msg142066
2011-08-14 16:54:49tchristsetmessages: + msg142064
2011-08-14 14:25:14jklothsetnosy: + jkloth
2011-08-14 07:15:08ezio.melottisetmessages: + msg142054
2011-08-14 06:09:44tchristsetmessages: + msg142053
2011-08-14 04:54:39ezio.melottisetnosy: + ezio.melotti
messages: + msg142047
2011-08-14 02:35:26tchristsetmessages: + msg142044
2011-08-14 02:06:31mrabarnettsetmessages: + msg142043
2011-08-14 01:11:52tchristsetmessages: + msg142042
2011-08-14 01:07:32tchristsetmessages: + msg142041
2011-08-13 21:09:51pitrousetmessages: + msg142039
2011-08-13 20:57:39mrabarnettsetmessages: + msg142038
2011-08-13 20:26:21pitrousetnosy: + pitrou
messages: + msg142037
2011-08-13 20:04:27tchristsetmessages: + msg142036
2011-08-13 15:57:58r.david.murraysetmessages: + msg142024
2011-08-13 02:59:32mrabarnettsetmessages: + msg142005
2011-08-13 00:56:39mrabarnettsetnosy: + mrabarnett
2011-08-12 23:04:53tchristsetmessages: + msg141995
2011-08-12 22:21:59terry.reedysetversions: + Python 3.3, - Python 2.7
nosy: + terry.reedy

messages: + msg141992

type: behavior -> enhancement
2011-08-12 18:02:36Arfreversetnosy: + Arfrever
2011-08-12 00:09:19r.david.murraysetnosy: + r.david.murray
messages: + msg141930
2011-08-11 19:03:55tchristcreate