Message 71373 - Python tracker


Author	edreamleo
Recipients	benjamin.peterson, edreamleo, pitrou
Date	2008-08-18.20:16:09
Message-id	<ffb592890808181316g7e3e9c2qfdb103a071dd73ba@mail.gmail.com>
In-reply-to	<1219085468.74.0.841290241444.issue3590@psf.upfronthosting.co.za>

Content

On Mon, Aug 18, 2008 at 1:51 PM, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> From the discussion on the python-3000, it looks like it would be nice
> if sax.parser handled both bytes and unicode streams.
>

> Edward, does your simple fix make sax.parser work entirely well with
> byte streams?

No. The sax.parser seems to have other problems.  Here is what I *think* I
know ;-)

1. A smallish .leo file (an xml file) containing a single non-ascii (utf-8)
encoded character appears to have been read correctly with Python 3.0.

2. A larger .leo file fails as follows (it's possible that the duplicate
error messages are a Leo problem):

Traceback (most recent call last):
Traceback (most recent call last):

  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString
  File "C:\leo.repo\leo-30\leo\core\leoFileCommands.py", line 1283, in parse_leo_file
    parser.parse(theFile) # expat does not support parseString

  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "c:\python30\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)

  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)
  File "c:\python30\lib\xml\sax\xmlreader.py", line 121, in parse
    buffer = file.read(self._bufsize)

  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()
  File "C:\Python30\lib\io.py", line 1670, in read
    eof = not self._read_chunk()

  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "C:\Python30\lib\io.py", line 1499, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))

  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\Python30\lib\io.py", line 1236, in decode
    output = self.decoder.decode(input, final=final)

  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
  File "C:\Python30\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 74: character maps to <undefined>

The same calls to sax read the file correctly on Python 2.5.
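
The cp1252 frames in the traceback suggest that the file was opened in text mode with the Windows default encoding, so a UTF-8 byte such as 0x81 fails in the io layer before expat ever sees it. A minimal sketch of one possible workaround, assuming the file really is UTF-8 and that handing the parser a binary stream is acceptable; TextCollector and parse_bytes_stream are hypothetical names, not part of Leo or xml.sax:

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers character data, for illustration only."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_bytes_stream(stream):
    """Parse a binary stream so that expat, not the io layer, decodes the bytes."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return "".join(handler.chunks)


# Stand-in for a .leo file: U+0141 is encoded in UTF-8 as the bytes C5 81,
# and 0x81 has no mapping in cp1252.
data = '<?xml version="1.0" encoding="utf-8"?><v>\u0141</v>'.encode("utf-8")
text = parse_bytes_stream(io.BytesIO(data))
```

With a real file, the equivalent would be passing the result of open(path, "rb") instead of the io.BytesIO object.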

It would be nice to have a message pinpoint the line and character offset of
the problem.
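
As a rough illustration of what such a message could be built from, the sketch below converts the byte offset carried by UnicodeDecodeError into a line number and a byte column. It assumes the whole file is available as bytes; locate_decode_error is a hypothetical helper, and the position in the traceback above may be relative to the chunk being decoded rather than to the start of the file.

```python
def locate_decode_error(data, encoding):
    """Return (line, column, byte) for the first undecodable byte, or None.

    line is 1-based; column is the 0-based byte offset within that line.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        prefix = data[:err.start]
        line = prefix.count(b"\n") + 1
        # rfind returns -1 when there is no newline, which makes the
        # line start at offset 0.
        line_start = prefix.rfind(b"\n") + 1
        return line, err.start - line_start, data[err.start]
    return None


sample = b"abc\nde\x81f"
location = locate_decode_error(sample, "cp1252")
```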

My vote would be for the code to work on both kinds of input streams. It
would save users considerable confusion if sax did the (tricky)
conversions automatically.
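
One way such automatic handling could look, as a sketch only: probe the stream with read(0) to see whether it yields str or bytes, then register it with the InputSource accordingly. parse_any_stream and NameCollector are hypothetical names, and the sketch assumes a Python 3 release whose expat reader accepts character streams; it is not a description of how xml.sax is implemented.

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource


class NameCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names, for illustration only."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_any_stream(stream, handler):
    """Parse either a text stream or a binary stream with the same call."""
    source = InputSource()
    # read(0) consumes nothing but reveals the type the stream produces.
    if isinstance(stream.read(0), str):
        source.setCharacterStream(stream)
    else:
        source.setByteStream(stream)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)


from_text = NameCollector()
parse_any_stream(io.StringIO("<leo><v/></leo>"), from_text)

from_bytes = NameCollector()
parse_any_stream(io.BytesIO(b"<leo><v/></leo>"), from_bytes)
```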

IMO, now would be the most convenient time to attempt this: there is a
certain freedom in having everything be partially broken :-)

Edward
--------------------------------------------------------------------
Edward K. Ream email: edreamleo@gmail.com
Leo: http://webpages.charter.net/edreamleo/front.html
--------------------------------------------------------------------
