Message 370043 - Python tracker



Author	steven.daprano
Recipients	remi.lapeyre, serhiy.storchaka, sidhant, skip.montanaro, steven.daprano
Date	2020-05-27.03:01:34
Message-id	<20200527025544.GS11884@ando.pearwood.info>
In-reply-to	<1590545387.05.0.387979829628.issue40762@roundup.psfhosted.org>

Content

On further thought, no, I don't think it would be a reasonable feature.

The user opens the CSV file, probably using the default encoding (UTF-8?)
but potentially using any encoding.

They collect some data as bytes. Those bytes could be from any unknown 
encoding. When they try writing those bytes to the CSV file, at best 
they get an explicit but confusing exception that the decoding failed, 
at worst they get data loss (mojibake).

    # Latin-1 to UTF-8 fails
    py> b = 'ßæ'.encode('latin-1')
    py> b.decode('utf-8')
    # raises UnicodeDecodeError: 'utf-8' codec can't decode 
    # byte 0xdf in position 0: invalid continuation byte

    # UTF-8 to Latin-1 loses data
    py> b = 'ßæ'.encode('UTF-8')
    py> b.decode('latin-1')
    # returns mojibake 'Ã\x9fÃ¦'

Short of outright banning the use of bytes (raise a TypeError), I think 
the current behaviour is least-worst.
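
For reference, a minimal sketch of what "the current behaviour" means in practice, and of the alternative where the caller decodes explicitly. It assumes the csv writer falls back to str() for non-string values, so bytes end up written as their repr. The write_row helper and its data_encoding parameter are hypothetical names made up for this illustration; they are not part of the csv module.

```python
import csv
import io


def write_row(values, data_encoding=None):
    """Write one CSV row to an in-memory buffer and return the text.

    data_encoding is a hypothetical parameter for this sketch only: when
    given, the caller's bytes values are decoded with it before writing.
    """
    if data_encoding is not None:
        values = [v.decode(data_encoding) if isinstance(v, bytes) else v
                  for v in values]
    buf = io.StringIO(newline='')
    csv.writer(buf).writerow(values)
    return buf.getvalue()


raw = 'ßæ'.encode('latin-1')

# Current behaviour: no decoding is attempted, the repr of the bytes
# object (including the b prefix) is what lands in the file.
as_repr = write_row([raw, 1])

# The caller knows where the bytes came from, so the caller names the
# encoding and decodes before handing the data to the writer.
as_text = write_row([raw, 1], data_encoding='latin-1')

print(repr(as_repr))
print(repr(as_text))
```

The first result is ugly but lossless and obvious at a glance. The second is only correct because the caller supplied the right encoding, which is information the csv module itself does not have.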

History
Date	User	Action	Args
2020-05-27 03:01:35	steven.daprano	set	recipients: + steven.daprano, skip.montanaro, serhiy.storchaka, remi.lapeyre, sidhant
2020-05-27 03:01:35	steven.daprano	link	issue40762 messages
2020-05-27 03:01:34	steven.daprano	create