
classification
Title: decoding functions in _codecs module accept str arguments
Type: behavior
Stage:
Components: Extension Modules
Versions: Python 3.0, Python 3.1

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: pitrou
Nosy List: amaury.forgeotdarc, benjamin.peterson, lemburg, pitrou, vstinner
Priority: release blocker
Keywords: patch

Created on 2009-01-07 23:48 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name               Uploaded                      Description
_codecs_bytes-2.patch   vstinner, 2009-01-17 01:31
mbdecode-unicode.patch  pitrou, 2009-01-22 11:04
Messages (12)
msg79384 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-07 23:48
The following function calls should raise a TypeError instead. Encoding
functions are fine (they only accept str).

>>> import codecs
>>> codecs.utf_8_decode('aa')
('aa', 2)
>>> codecs.utf_8_decode('éé')
('éé', 4)
>>> codecs.latin_1_decode('éé')
('éé', 4)
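
The behaviour requested above can be checked with a small sketch like the following. It assumes a Python 3 interpreter that includes the fix; the helper name `decode_rejects_str` is made up for this illustration and is not part of the `codecs` module.

```python
import codecs


def decode_rejects_str(decoder, text):
    """Return True if the low-level decoder raises TypeError for a str argument."""
    try:
        decoder(text)
    except TypeError:
        return True
    return False


# After the fix, decoding functions should refuse str input...
utf8_rejects = decode_rejects_str(codecs.utf_8_decode, 'aa')
latin1_rejects = decode_rejects_str(codecs.latin_1_decode, 'éé')

# ...while bytes input is decoded as usual: (decoded text, bytes consumed).
bytes_result = codecs.utf_8_decode('éé'.encode('utf-8'))

print(utf8_rejects, latin1_rejects, bytes_result)
```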
msg79387 - Author: STINNER Victor (vstinner) (Python committer) Date: 2009-01-08 00:59
Patch replacing "s*" parsing format by "y*" for:
 - utf_7_decode()
 - utf_8_decode()
 - utf_16_decode()
 - utf_16_le_decode()
 - utf_16_be_decode()
 - utf_16_ex_decode()
 - utf_32_decode()
 - utf_32_le_decode()
 - utf_32_be_decode()
 - utf_32_ex_decode()
 - unicode_escape_decode()
 - raw_unicode_escape_decode()
 - latin_1_decode()
 - ascii_decode()
 - charmap_decode()
 - mbcs_decode()

Using run_tests.sh, all tests pass (with 19 skipped tests). I guess
that not all of these functions are covered by tests :-/

Note: codecs documentation was already correct:

.. method:: Codec.decode(input[, errors])
   (...)
   *input* must be a bytes object or one which provides the read-only
   character buffer interface -- for example, buffer objects and memory
   mapped files.
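
From the Python side, the effect of switching the parsing format from "s*" to "y*" can be sketched roughly as follows: objects exposing the buffer interface are accepted, while str is not. This is an illustration run against a fixed interpreter, not part of the patch, and the variable names are arbitrary.

```python
import codecs

# Bytes-like objects (anything exposing a contiguous buffer) are accepted.
samples = [b'ab', bytearray(b'ab'), memoryview(b'ab')]
latin1_results = [codecs.latin_1_decode(s) for s in samples]
ascii_results = [codecs.ascii_decode(s) for s in samples]

# A str argument no longer passes argument parsing.
try:
    codecs.ascii_decode('ab')
    str_accepted = True
except TypeError:
    str_accepted = False

print(latin1_results, ascii_results, str_accepted)
```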
msg79402 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2009-01-08 09:29
On 2009-01-08 01:59, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> Patch replacing "s*" parsing format by "y*" for:
>  - utf_7_decode()
>  - utf_8_decode()
>  - utf_16_decode()
>  - utf_16_le_decode()
>  - utf_16_be_decode()
>  - utf_16_ex_decode()
>  - utf_32_decode()
>  - utf_32_le_decode()
>  - utf_32_be_decode()
>  - utf_32_ex_decode()
>  - latin_1_decode()
>  - ascii_decode()
>  - charmap_decode()
>  - mbcs_decode()

These are fine.

>  - unicode_escape_decode()
>  - raw_unicode_escape_decode()

These changes are in line with their C API codec interfaces as well,
but those particular codecs could well also be made to work on Unicode
input, since unescaping can well be applied to Unicode as well.

I'll probably open a new item for this.

> Using run_tests.sh, all tests pass (with 19 skipped tests). I guess
> that not all of these functions are covered by tests :-/

The mbcs codec is only available on Windows.

All others are tested by test_codecs.py.

Which ones are skipped in your case ?

> Note: codecs documentation was already correct:
> 
> .. method:: Codec.decode(input[, errors])
>    (...)
>    *input* must be a bytes object or one which provides the read-only
>    character buffer interface -- for example, buffer objects and memory
>    mapped files.

That's not entirely correct: codecs are allowed to accept any
object type and can also return any object type. It is up to them
to decide, e.g. a codec may accept both bytes and Unicode input
and always generate Unicode output when decoding.

I guess I have to review those doc changes...
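
As a rough illustration of the point that a codec chooses its own input and output types, the sketch below uses two codecs from the standard library. It assumes a current CPython 3 release, where `unicode_escape_decode()` takes both bytes and str and `rot_13` is a str-to-str codec; the variable names are arbitrary.

```python
import codecs

# unicode_escape_decode() accepts both bytes and str input and always
# produces str output. The input here is the six ASCII characters \u00e9.
from_bytes = codecs.unicode_escape_decode(b'\\u00e9')
from_str = codecs.unicode_escape_decode('\\u00e9')

# rot_13 is a text transform: str in, str out, in both directions.
rot13_decoded = codecs.decode('uryyb', 'rot_13')

print(from_bytes, from_str, rot13_decoded)
```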
msg79741 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-13 13:14
The patch is probably fine, but it would be nice to add some unit tests
for the new behaviour.
msg79843 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2009-01-14 08:40
On 2009-01-13 14:14, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> The patch is probably fine, but it would be nice to add some unit tests
> for the new behaviour.

+1 from my side.

Thanks for the patch, Victor.
msg79991 - Author: STINNER Victor (vstinner) (Python committer) Date: 2009-01-17 01:31
New patch:
 - Leave unicode_escape_decode() and raw_unicode_escape_decode()
   unchanged (they still accept unicode strings)
 - Test changes (reject unicode for most codecs' decode functions)
 - Write tests for unicode_escape_decode() and
   raw_unicode_escape_decode() (there were no tests for these functions)
msg80362 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-22 10:32
Committed in r68855, r68856. Thanks!
msg80363 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2009-01-22 10:46
IMO, Modules/cjkcodecs/multibytecodec.c should be changed as well.
msg80364 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-22 10:55
Right, I hadn't thought of that.
msg80365 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-22 11:04
"Fixing" multibytecodec.c produces a TypeError in the following test:

    def test_errorcallback_longindex(self):
        dec = codecs.getdecoder('euc-kr')
        myreplace  = lambda exc: ('', sys.maxsize+1)
        codecs.register_error('test.cjktest', myreplace)
        self.assertRaises(IndexError, dec,
                          'apple\x92ham\x93spam', 'test.cjktest')

TypeError: decode() argument 1 must be bytes or buffer, not str

Since the test is meant to test recovery from a misbehaving callback, I
guess the type of the input string is not really important and can be
changed to a bytes string instead. What do you think?

(in any case, here is a patch)
msg80366 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2009-01-22 11:56
The patch looks good. 
I think the missing b in test_errorcallback_longindex is an oversight
from when the tests were updated for py3k. You are right to change the test.
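
The change discussed above for the multibyte codecs can be sketched outside the test suite as follows, assuming an interpreter that includes the fix. The handler name `sketch.cjk_longindex` and the helper `outcome` are invented for this illustration; only the decoder name and the input come from the quoted test.

```python
import codecs
import sys

# Hypothetical handler name, used only in this sketch.
HANDLER = 'sketch.cjk_longindex'


def bad_position(exc):
    # Misbehaving callback: asks the decoder to resume far past the input.
    return ('', sys.maxsize + 1)


codecs.register_error(HANDLER, bad_position)
dec = codecs.getdecoder('euc-kr')


def outcome(data):
    """Return the name of the exception raised by decoding, or 'ok'."""
    try:
        dec(data, HANDLER)
    except Exception as exc:
        return type(exc).__name__
    return 'ok'


# str input is rejected before the error callback is ever consulted.
str_outcome = outcome('apple\x92ham\x93spam')
# bytes input reaches the callback, whose bogus position is then rejected.
bytes_outcome = outcome(b'apple\x92ham\x93spam')

print(str_outcome, bytes_outcome)
```

With bytes input the decoder gets as far as calling the error handler, which is the situation the original test was meant to exercise.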
msg80367 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2009-01-22 12:25
Committed in r68857, r68858.
History
Date                 User                Action  Args
2022-04-11 14:56:43  admin               set     nosy: + benjamin.peterson; github: 49124
2009-01-22 12:25:56  pitrou              set     status: open -> closed; resolution: fixed; messages: + msg80367
2009-01-22 11:56:18  amaury.forgeotdarc  set     messages: + msg80366
2009-01-22 11:04:50  pitrou              set     files: + mbdecode-unicode.patch; messages: + msg80365
2009-01-22 10:55:57  pitrou              set     status: closed -> open; resolution: fixed -> (no value); messages: + msg80364
2009-01-22 10:46:11  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc; messages: + msg80363
2009-01-22 10:32:07  pitrou              set     status: open -> closed; resolution: accepted -> fixed; messages: + msg80362
2009-01-22 10:10:20  pitrou              set     assignee: pitrou; resolution: accepted
2009-01-17 01:31:48  vstinner            set     files: - _codecs_bytes.patch
2009-01-17 01:31:42  vstinner            set     files: + _codecs_bytes-2.patch; messages: + msg79991
2009-01-14 08:40:04  lemburg             set     messages: + msg79843
2009-01-13 13:14:19  pitrou              set     messages: + msg79741
2009-01-08 09:29:20  lemburg             set     nosy: + lemburg; messages: + msg79402
2009-01-08 00:59:08  vstinner            set     files: + _codecs_bytes.patch; keywords: + patch; messages: + msg79387; nosy: + vstinner
2009-01-07 23:48:15  pitrou              create