Rietveld Code Review Tool

#14874: Faster charmap decoding (Closed)

Created: 1 year ago by storchaka
Modified: 11 months, 1 week ago
Reviewers: pitrou
CC: mal_egenix.com, loewis, AntoinePitrou, haypo, ezio.melotti, devnull_psf.upfronthosting.co.za, storchaka
Visibility: Public.

Patch Set 1 #

Total comments: 4
Lib/codecs.py: 1 chunk, +1 line, -4 lines, 0 comments
Lib/encodings/cp037.py: 1 chunk, +1 line, -0 lines, 0 comments
Lib/encodings/cp500.py: 1 chunk, +1 line, -0 lines, 0 comments
Lib/encodings/hp_roman8.py: 2 chunks, +266 lines, -105 lines, 0 comments
Lib/encodings/iso8859_1.py: 1 chunk, +1 line, -0 lines, 0 comments
Lib/encodings/mac_latin2.py: 3 chunks, +266 lines, -137 lines, 0 comments
Lib/encodings/palmos.py: 2 chunks, +265 lines, -39 lines, 0 comments
Lib/encodings/ptcp154.py: 2 chunks, +265 lines, -128 lines, 0 comments
Objects/unicodeobject.c: 5 chunks, +48 lines, -15 lines, 2 comments
Tools/unicode/gencodec.py: 5 chunks, +6 lines, -3 lines, 2 comments

Messages

Total messages: 2
AntoinePitrou
A couple of comments. http://bugs.python.org/review/14874/diff/4980/Objects/unicodeobject.c File Objects/unicodeobject.c (right): http://bugs.python.org/review/14874/diff/4980/Objects/unicodeobject.c#newcode7722 Objects/unicodeobject.c:7722: if (!PyUnicode_Check(string) || !PyUnicode_GET_LENGTH(string)) { ...
11 months, 1 week ago #1
storchaka
11 months, 1 week ago #2
http://bugs.python.org/review/14874/diff/4980/Objects/unicodeobject.c
File Objects/unicodeobject.c (right):

http://bugs.python.org/review/14874/diff/4980/Objects/unicodeobject.c#newcode...
Objects/unicodeobject.c:7722: if (!PyUnicode_Check(string) ||
!PyUnicode_GET_LENGTH(string)) {
On 2012/06/16 18:44:00, AntoinePitrou wrote:
> Perhaps this is relaxing the length requirement a bit too much?

The main reason for relaxing the check is 257-character strings (a 257th
character, U+FFFE, is added to widen the string to UCS2). Further down in the
code, the hardcoded 256-character limit is replaced by a variable length. With
these changes, the code works for strings of any length. Unmapped characters are
simply ignored (lines 7801-7803). A string shorter than 256 characters describes
the same encoding as that string padded to length 256 with U+FFFE.
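
To illustrate the equivalence claimed above, here is a minimal sketch. It
assumes a Python 3 build that includes this change and uses the low-level
codecs.charmap_build() and codecs.charmap_encode() helpers. The 4-entry table
and the is_unmapped() helper are made up for the example and are not part of
the patch.

```python
import codecs

# Hypothetical 4-entry decoding table:
# byte 0 -> U+0000, 1 -> 'A', 2 -> 'B', 3 -> 'C'.
short_table = "\x00ABC"

# The same table padded to 256 entries with U+FFFE ("undefined").
padded_table = short_table + "\ufffe" * (256 - len(short_table))

# A 257-character table: one extra trailing U+FFFE.
widened_table = padded_table + "\ufffe"

# Build an encoding map from each table and encode the same text with it.
maps = [codecs.charmap_build(t)
        for t in (short_table, padded_table, widened_table)]
results = [codecs.charmap_encode("CAB", "strict", m) for m in maps]


def is_unmapped(ch, encoding_map):
    """Return True if ch cannot be encoded with encoding_map."""
    try:
        codecs.charmap_encode(ch, "strict", encoding_map)
    except UnicodeEncodeError:
        return True
    return False


print(results)
print(is_unmapped("Z", maps[0]))
```

All three tables should yield the same encoded bytes, and a character that
appears in none of them should be rejected in strict mode.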

http://bugs.python.org/review/14874/diff/4980/Tools/unicode/gencodec.py
File Tools/unicode/gencodec.py (right):

http://bugs.python.org/review/14874/diff/4980/Tools/unicode/gencodec.py#newco...
Tools/unicode/gencodec.py:105: if not isinstance(enc, tuple) and enc < 256:
On 2012/06/16 18:44:00, AntoinePitrou wrote:
> Why this change?

gencodec.py has gone too long without being used or updated. It now simply does
not work for many character mappings under Python 3, because in Python 3 you
cannot compare tuples and integers. parsecodes() can return either a number or a
tuple (for multibyte encodings). Without this change the script crashes on
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/GURMUKHI.TXT, for example.

The script contains a number of other errors that prevent its use for all
character mappings; I've corrected only the most necessary ones. Fully fixing
gencodec.py is a separate issue.
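
The tuple-versus-integer problem can be reproduced outside gencodec.py. The
sketch below is not the script's code: fits_in_one_byte() is a hypothetical
helper that only mirrors the guarded condition from line 105, and the sample
values are invented.

```python
def fits_in_one_byte(enc):
    """Mirror the guarded check: tuples (multibyte codes) never qualify."""
    return not isinstance(enc, tuple) and enc < 256


# Unguarded comparison: Python 3 refuses to order a tuple against an int.
try:
    (0xA1, 0xA2) < 256
    unguarded_failed = False
except TypeError:
    unguarded_failed = True

# Guarded comparison: isinstance() short-circuits before "<" is evaluated.
samples = [0x41, 300, (0xA1, 0xA2)]
flags = [fits_in_one_byte(enc) for enc in samples]

print(unguarded_failed, flags)
```

Because the isinstance() test comes first, the comparison with 256 is never
evaluated for a tuple, which is what avoids the crash.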
