classification
Title: HTMLParser cannot deal with mixture of arbitrary data and character reference
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: bones7456, liudongmiao@gmail.com
Priority: normal Keywords:

Created on 2009-07-31 07:45 by liudongmiao@gmail.com, last changed 2009-08-01 16:20 by liudongmiao@gmail.com. This issue is now closed.

Files
File name Uploaded Description
chinese.py liudongmiao@gmail.com, 2009-07-31 07:45
Messages (3)
msg91128 - Author: Liu DongMiao (liudongmiao@gmail.com) Date: 2009-07-31 07:45
HTMLParser (Python 2.6.2) cannot handle a mixture of arbitrary data and
character references.

In lines 365-373, replaceEntities(s) returns unichr(charref), which is a
unicode object and cannot be mixed with arbitrary data held in a str.

One possible fix: replace unichr(c) with unichr(c).encode('utf-8').
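
The idea behind this fix can be sketched as follows. This is only an illustration, written for Python 3 with bytes standing in for the Python 2 str type; replace_charrefs and CHARREF are hypothetical names, not part of HTMLParser, and UTF-8 is assumed as the encoding of the surrounding data.

```python
import re

# Numeric character references such as &#25991; or &#x6587; inside byte data.
CHARREF = re.compile(rb"&#([xX][0-9a-fA-F]+|[0-9]+);")


def replace_charrefs(data, encoding="utf-8"):
    """Replace numeric character references in a byte string.

    Each reference is encoded with the same encoding as the surrounding
    data, so the result stays a single byte string instead of a mix of
    bytes and text.
    """
    def repl(match):
        ref = match.group(1).decode("ascii")
        if ref[0] in "xX":
            code = int(ref[1:], 16)   # hexadecimal reference
        else:
            code = int(ref)           # decimal reference
        # Counterpart of unichr(c).encode('utf-8') from the message above.
        return chr(code).encode(encoding)

    return CHARREF.sub(repl, data)


# Raw UTF-8 bytes followed by a character reference.
mixed = "\u4e2d".encode("utf-8") + b"&#x6587;"
result = replace_charrefs(mixed)
```

The sketch only works when the caller knows the encoding of the byte data; with any other encoding the result would be inconsistent.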
msg91158 - Author: bones7456 (bones7456) Date: 2009-08-01 06:11
Another possible fix:
add these three lines to the head of the file:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
msg91164 - Author: Liu DongMiao (liudongmiao@gmail.com) Date: 2009-08-01 16:20
I think this should not be treated as a bug.

Since we don't know the encoding of the str, we can't handle str and
unicode together.

In my example, the str is in UTF-8, so I need to convert the unicode to a
str in UTF-8.

I will take bones' suggestion.
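
For comparison, the opposite direction also avoids the mix: when the encoding of the input is known, the bytes can be decoded to text before they reach the parser. The sketch below is an assumption-laden illustration for Python 3 (html.parser rather than the Python 2.6 HTMLParser module); AttrCollector is a hypothetical helper and UTF-8 is assumed for the input.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records every attribute value it sees."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with references resolved.
        self.values.extend(value for _, value in attrs)


# UTF-8 encoded input mixing a raw character and a character reference.
raw = "<a title='\u4e2d&#x6587;'>".encode("utf-8")

parser = AttrCollector()
# Decode first, so the parser only ever handles text.
parser.feed(raw.decode("utf-8"))
parser.close()
```

Here the encoding is handled once at the boundary, so nothing depends on a process-wide default encoding.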
History
Date User Action Args
2009-08-01 16:20:47 liudongmiao@gmail.com set status: open -> closed
type: compile error -> behavior
messages: + msg91164
nosy: bones7456, liudongmiao@gmail.com
2009-08-01 06:11:46 bones7456 set nosy: + bones7456
messages: + msg91158
2009-07-31 07:45:52 liudongmiao@gmail.com create