The patch looks ok overall. I have a couple of questions.
http://bugs.python.org/review/14738/diff/4837/17114
File Objects/stringlib/codecs.h (right):
http://bugs.python.org/review/14738/diff/4837/17114#newcode165
Objects/stringlib/codecs.h:165: /* \xF0\x90\x80\80-\xF4\x8F\xBF\xBF --
10000-10FFFF */
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> Typo in comment? It looks like it should be "\xF0\x80\x80\80", not
> "\xF0\x90\x80\80".
Yes, there's a typo, but not the one you think: the "x" is missing in the last "\x80".
See the comment below -- the range \xF0\x80\x80\x80-\xF0\x80\xBF\xBF is invalid.
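To make the range argument concrete, here is a minimal sketch of the arithmetic behind a 4-byte sequence. The helper name decode_utf8_4byte is hypothetical and is not code from the patch; it assumes the caller guarantees four readable bytes at s. A sequence starting with \xF0\x80 decodes to a value below 0x10000, so it is an overlong form and is rejected, while \xF0\x90\x80\x80 is the first valid one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: decode one 4-byte UTF-8
   sequence at s. Assumes at least four readable bytes at s.
   Returns the code point (0x10000..0x10FFFF), or -1 if the four bytes
   are not a valid 4-byte sequence. */
int32_t decode_utf8_4byte(const unsigned char *s)
{
    assert(s != NULL);

    /* The lead byte of a 4-byte sequence must be \xF0..\xF4. */
    if (s[0] < 0xF0 || s[0] > 0xF4)
        return -1;

    /* The next three bytes must be continuation bytes (10xxxxxx). */
    for (size_t i = 1; i < 4; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;
    }

    /* 3 payload bits from the lead byte, 6 from each continuation byte. */
    int32_t ch = ((int32_t)(s[0] & 0x07) << 18) |
                 ((int32_t)(s[1] & 0x3F) << 12) |
                 ((int32_t)(s[2] & 0x3F) << 6) |
                 (int32_t)(s[3] & 0x3F);

    /* Below 0x10000 the sequence is overlong; above 0x10FFFF it is
       outside the Unicode range. */
    if (ch < 0x10000 || ch > 0x10FFFF)
        return -1;
    return ch;
}
```

With this sketch, \xF0\x90\x80\x80 yields 0x10000 and \xF4\x8F\xBF\xBF yields 0x10FFFF, while \xF0\x80\x80\x80 and \xF4\x90\x80\x80 are both rejected.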
http://bugs.python.org/review/14738/diff/4837/17114#newcode184
Objects/stringlib/codecs.h:184: \xF0\x80\x80\80-\xF0\x80\xBF\xBF -- fake
0000-FFFF */
Typo in comment -- missing "x" in last "\x80".
http://bugs.python.org/review/14738/diff/4837/17119
File Objects/unicodeobject.c (right):
http://bugs.python.org/review/14738/diff/4837/17119#newcode4772
Objects/unicodeobject.c:4772: while (endinpos < size && (starts[endinpos] &
0xC0) == 0x80)
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> Why do you need this, instead of "endinpos = end"?
What if the last three bytes are '\xF0\x90\x41'? \xF0 requires 4 bytes, so '\xF0\x90'
is an incomplete sequence (and can be ignored or replaced), but \x41 is a valid
encoding of "A".
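A small standalone sketch of that case, assuming the scan starts at the byte right after the lead byte. The function name skip_continuation_bytes and its parameters are hypothetical; only the loop condition is taken from the line under review.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the loop under review: starting at pos
   (assumed to be the byte after the lead byte), advance past continuation
   bytes (10xxxxxx) and stop at the first byte that is not one, or at the
   end of the buffer. Returns the index where the invalid range ends. */
size_t skip_continuation_bytes(const unsigned char *starts, size_t pos,
                               size_t size)
{
    assert(starts != NULL && pos <= size);

    while (pos < size && (starts[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

For the bytes \xF0 \x90 \x41 with pos = 1, the loop steps over \x90 and stops at index 2, so only '\xF0\x90' is reported as the incomplete sequence and \x41 is still decoded as "A". Setting the end position to the end of the buffer would instead swallow the "A" into the error range.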
http://bugs.python.org/review/14738/diff/4837/17119#newcode4829
Objects/unicodeobject.c:4829: wchar_t *unicode;
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> I don't have a Mac, did you actually check this code compiles and works?
At least it compiles (I checked by removing the __APPLE__ guard). I also don't have a
Mac.
Issue 14738: Amazingly faster UTF-8 decoding
(Closed)
Created 1 year ago by storchaka
Modified 11 months, 1 week ago
Reviewers: AntoinePitrou
Base URL: None
Comments: 13