Issue 6611: HTMLParser cannot deal with mixture of arbitrary data and character reference

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50860

classification

Title:	HTMLParser cannot deal with mixture of arbitrary data and character reference
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.6

process

Created on 2009-07-31 07:45 by liudongmiao@gmail.com, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
chinese.py	liudongmiao@gmail.com, 2009-07-31 07:45

Messages (3)
msg91128 - (view)	Author: Liu DongMiao (liudongmiao@gmail.com)	Date: 2009-07-31 07:45
HTMLParser (Python 2.6.2) cannot deal with a mixture of arbitrary data and character references. In lines 365-373, replaceEntities(s) returns unichr(charref) as unicode, which cannot be mixed with arbitrary data in a str. A possible fix: replace unichr(c) with unichr(c).encode('utf-8').
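The proposed fix can be sketched in Python 3 terms, where the Python 2 str/unicode pair corresponds to bytes/str. This is only an illustration under assumptions, not the HTMLParser source: replace_charrefs is a hypothetical helper, it handles decimal references only, and the surrounding data is assumed to be UTF-8.

```python
import re

# Hypothetical helper, not part of HTMLParser: expands decimal character
# references (e.g. &#20013;) inside byte data that is assumed to be UTF-8.
_CHARREF = re.compile(rb"&#(\d+);")

def replace_charrefs(data, encoding="utf-8"):
    def _sub(match):
        # chr() yields text; encode it so it can be joined with the bytes
        # around it. This mirrors unichr(c).encode('utf-8') in Python 2.
        return chr(int(match.group(1))).encode(encoding)
    return _CHARREF.sub(_sub, data)

# Byte data that mixes raw UTF-8 characters with character references.
raw = "中文 &#20013;&#25991;".encode("utf-8")
fixed = replace_charrefs(raw)
```

The result stays a byte string throughout, so it only reads correctly if the caller's data really is in the encoding passed to the helper.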
msg91158 - (view)	Author: bones7456 (bones7456)	Date: 2009-08-01 06:11
Another possible fix: add these three lines to the head of the file:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')
msg91164 - (view)	Author: Liu DongMiao (liudongmiao@gmail.com)	Date: 2009-08-01 16:20
I think this should not be considered a bug. Since we don't know the encoding of the str, we can't deal with str and unicode together. In my example, the str is in UTF-8, so I need to convert the unicode to a UTF-8 str. I will take bones' suggestion.
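The point about not knowing the encoding can be illustrated with a small Python 3 sketch. It is not what HTMLParser in 2.6 does: decode_then_unescape is a hypothetical helper that decodes the bytes first and then expands references with the standard html.unescape function, and the sample data is an assumption.

```python
import html

def decode_then_unescape(data, encoding):
    # Decode first, so the references are expanded in text, not in bytes.
    return html.unescape(data.decode(encoding))

# The same bytes, interpreted under two different assumed encodings.
raw = "中文 &#20013;&#25991;".encode("utf-8")
as_utf8 = decode_then_unescape(raw, "utf-8")
as_latin1 = decode_then_unescape(raw, "latin-1")
```

The character references expand identically in both cases, but the raw bytes before them only come out as the intended characters when the caller supplies the right encoding, which the parser cannot guess on its own.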

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50860
2009-08-01 16:20:47	liudongmiao@gmail.com	set	status: open -> closed type: compile error -> behavior messages: + msg91164 nosy: bones7456, liudongmiao@gmail.com
2009-08-01 06:11:46	bones7456	set	nosy: + bones7456 messages: + msg91158
2009-07-31 07:45:52	liudongmiao@gmail.com	create