Message 161520 - Python tracker


Author	loewis
Recipients	Phil.Daintree, amaury.forgeotdarc, ezio.melotti, loewis, santoso.wijaya, xrg
Date	2012-05-24.15:58:06
Message-id	<1337875087.29.0.691856859543.issue11804@psf.upfronthosting.co.za>
In-reply-to

Content

This has nothing to do with XML 1.1 (so closing this report as "won't fix").

The UTF-8 text that you present works very well:

>>> import xml.parsers.expat
>>> p=xml.parsers.expat.ParserCreate(encoding="utf-8")
>>> p.Parse("<x>\xc3\x87</x>", 1)
1

The character LATIN CAPITAL LETTER C WITH CEDILLA is definitely supported in XML 1.0, so there is no need for XML 1.1 here.
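For readers on Python 3, a minimal sketch of the same check, assuming the input is passed as bytes (the session above is Python 2, where a plain string literal is already a byte string). The helper name parse_text is hypothetical and only serves this illustration:

```python
import unicodedata
import xml.parsers.expat


def parse_text(data: bytes) -> str:
    """Parse a complete document and return its character data (hypothetical helper)."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    # expat may deliver character data in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# The two bytes 0xC3 0x87 are the UTF-8 encoding of U+00C7.
text = parse_text(b"<x>\xc3\x87</x>")
print(unicodedata.name(text))
```

The two-byte sequence decodes to a single character, which is why XML 1.0 is sufficient here.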

If this still fails to parse for you, it may be because the input is actually different, e.g.

>>> p=xml.parsers.expat.ParserCreate(encoding="utf-8")
>>> p.Parse("<x>&#195;\x87</x>", 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

I.e., the input might contain the characters &, #, 1, 9, 5, ;, and then the byte \x87. That is ill-formed UTF-8, and the parser is right to choke on it. Even if the document were declared as XML 1.1, it would still be ill-formed, because it would still be invalid UTF-8.
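A rough Python 3 sketch of that failure, again assuming bytes input. It shows that the mixed input is rejected both by a plain UTF-8 decode and by expat, whereas a character reference to the code point itself (&#199; for U+00C7) parses. The names parse_text, mixed and the result variables are hypothetical, and the &#199; variant is offered only as an illustration, not as what the reporter's data should have contained:

```python
import xml.parsers.expat


def parse_text(data: bytes):
    """Return (text, error); exactly one of the two is None (hypothetical helper)."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate(encoding="utf-8")
    parser.CharacterDataHandler = chunks.append
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        return None, str(exc)
    return "".join(chunks), None


# A character reference for the first byte (195 == 0xC3) followed by a raw 0x87 byte.
mixed = b"<x>&#195;\x87</x>"

# A lone 0x87 is a continuation byte with no lead byte, so decoding fails.
try:
    mixed.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

mixed_text, mixed_error = parse_text(mixed)

# Character references name code points, not bytes: U+00C7 is decimal 199.
ref_text, ref_error = parse_text(b"<x>&#199;</x>")

print(decodes_as_utf8, mixed_error, ref_text)
```

Character references and raw encoded bytes cannot be combined to form one character, so an input like this has to be repaired before it reaches the parser.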
