#14738: Amazingly faster UTF-8 decoding (Closed)

Created:
1 year ago by storchaka
Modified:
11 months, 1 week ago
Reviewers:
pitrou
CC:
loewis, jcea, ronaldoussoren, mark.dickinson, bill.janssen_gmail.com, AntoinePitrou, haypo, ned.deily, ezio.melotti, Arfrever.FTA_GMail.Com, devnull_psf.upfronthosting.co.za, storchaka
Visibility:
Public.

Patch Set 1 #

Total comments: 6

Patch Set 2 #

Total comments: 7
File                           Chunks  Added  Removed  Comments
Objects/stringlib/asciilib.h   1       +1     -0       0
Objects/stringlib/codecs.h     3       +152   -87      3
Objects/stringlib/ucs1lib.h    1       +1     -0       0
Objects/stringlib/ucs2lib.h    1       +1     -0       0
Objects/stringlib/ucs4lib.h    1       +1     -0       0
Objects/stringlib/undef.h      1       +1     -0       0
Objects/unicodeobject.c        10      +189   -498     4

Messages

Total messages: 4
AntoinePitrou
A couple of cosmetic comments, I hope someone else can look at the patch in ...
1 year ago #1
storchaka
http://bugs.python.org/review/14738/diff/4836/17107 File Objects/stringlib/codecs.h (right): http://bugs.python.org/review/14738/diff/4836/17107#newcode113 Objects/stringlib/codecs.h:113: ch = (ch << 6) + ch2 - 030200; ...
1 year ago #2
AntoinePitrou
The patch looks ok overall. I have a couple of questions. http://bugs.python.org/review/14738/diff/4837/17114 File Objects/stringlib/codecs.h (right): ...
1 year ago #3
storchaka
1 year ago #4
http://bugs.python.org/review/14738/diff/4837/17114
File Objects/stringlib/codecs.h (right):

http://bugs.python.org/review/14738/diff/4837/17114#newcode165
Objects/stringlib/codecs.h:165: /* \xF0\x90\x80\80-\xF4\x8F\xBF\xBF --
10000-10FFFF */
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> Typo in comment? It looks like it should be "\xF0\x80\x80\80", not
> "\xF0\x90\x80\80".

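Yes, there's a typo, but not the one you think: the "x" is missing in the last "\x80".

See the comment below: the range \xF0\x80\x80\x80-\xF0\x80\xBF\xBF is invalid (these are overlong encodings), so the valid four-byte range really does start at \xF0\x90\x80\x80.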

http://bugs.python.org/review/14738/diff/4837/17114#newcode184
Objects/stringlib/codecs.h:184: \xF0\x80\x80\80-\xF0\x80\xBF\xBF -- fake
0000-FFFF */
Typo in comment -- missing "x" in last "\x80".
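
For readers following along, here is a minimal sketch of why the four-byte range starts at \xF0\x90\x80\x80. It is not code from the patch: `decode4` is a hypothetical helper written only for illustration, and it assumes the caller has already made sure four bytes are available.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: decode one 4-byte UTF-8
   sequence at s. Returns the code point, or -1 if the sequence is
   invalid. Assumes at least four readable bytes at s. */
int32_t
decode4(const unsigned char *s)
{
    uint32_t ch;

    if ((s[0] & 0xF8) != 0xF0)            /* lead byte must be 11110xxx */
        return -1;
    if ((s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80 ||
        (s[3] & 0xC0) != 0x80)            /* continuation bytes: 10xxxxxx */
        return -1;

    ch = ((uint32_t)(s[0] & 0x07) << 18) |
         ((uint32_t)(s[1] & 0x3F) << 12) |
         ((uint32_t)(s[2] & 0x3F) << 6)  |
          (uint32_t)(s[3] & 0x3F);

    if (ch < 0x10000)                     /* overlong: fits in fewer bytes */
        return -1;
    if (ch > 0x10FFFF)                    /* beyond the Unicode range */
        return -1;
    return (int32_t)ch;
}
```

With this sketch, \xF0\x90\x80\x80 decodes to 0x10000 and \xF4\x8F\xBF\xBF to 0x10FFFF, while anything starting with \xF0\x80 carries a value below 0x10000 and is rejected as overlong.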

http://bugs.python.org/review/14738/diff/4837/17119
File Objects/unicodeobject.c (right):

http://bugs.python.org/review/14738/diff/4837/17119#newcode4772
Objects/unicodeobject.c:4772: while (endinpos < size && (starts[endinpos] &
0xC0) == 0x80)
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> Why do you need this, instead of "endinpos = end"?

What if the last three bytes are '\xF0\x90\x41'? \xF0 requires 4 bytes, so '\xF0\x90'
is an incomplete sequence (and can be ignored or replaced), but \x41 is a valid
encoding of "A".
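
A small sketch of the scan described above, for illustration only. `incomplete_end` is a hypothetical name, and starting the scan at `startinpos + 1` is an assumption about the surrounding code, which is not shown in this review; only the loop condition is taken from the quoted line.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: given a lead byte at
   startinpos whose sequence is cut short, return the index just past
   the bytes that belong to the broken sequence. Continuation bytes
   (10xxxxxx) are consumed; the first other byte is left for the
   decoder to process normally. */
size_t
incomplete_end(const unsigned char *starts, size_t startinpos, size_t size)
{
    size_t endinpos = startinpos + 1;     /* assumption: skip the lead byte */

    while (endinpos < size && (starts[endinpos] & 0xC0) == 0x80)
        endinpos++;
    return endinpos;
}
```

For the input '\xF0\x90\x41' this returns 2, so only '\xF0\x90' is reported as the error span and \x41 is still decoded as "A". Setting "endinpos = end" would swallow the \x41 as well.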

http://bugs.python.org/review/14738/diff/4837/17119#newcode4829
Objects/unicodeobject.c:4829: wchar_t *unicode;
On 2012/05/09 17:45:57, AntoinePitrou wrote:
> I don't have a Mac, did you actually check this code compiles and works?

At least it compiles (I checked by removing the __APPLE__ guard). I don't have a
Mac either.