classification
Title: PyArg_ParseTuple(): remove "t#" format
Type:
Stage:
Components: Interpreter Core
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: lemburg, loewis, vstinner
Priority: normal
Keywords: patch

Created on 2010-05-27 23:29 by vstinner, last changed 2010-06-10 09:43 by lemburg. This issue is now closed.

Files
File name Uploaded
getarg_remove_tdash.patch vstinner, 2010-05-28 10:33
getarg_remove_tdash-2.patch vstinner, 2010-06-06 21:09
Messages (11)
msg106627 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-27 23:29
"t#" format was introduced by r11803 (11 years ago): "Implement new format character 't#'. This is like s#, accepting an object that implements the buffer interface, but requires a buffer that contains 8-bit character data."

Python3 now has a strict separation between byte strings (the bytes and bytearray types) and Unicode strings (str), and it has the PyBuffer and PyCapsule APIs. The "t#" format can be replaced by "y#" or "y*".

Extract of getargs.c:

      /*TEO: This can be eliminated --- here only for backward
        compatibility */
    case 't': { /* 8-bit character buffer, read-only access */

In Python, the last function using "t#" is _codecs.charbuffer_encode() and I proposed to remove this function in #8838. We can also patch this function.

I don't know whether third-party modules use this format. I also don't know whether it can simply be removed or whether it should first raise a deprecation warning (but who would notice such a warning, since they are disabled by default?).
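
As a rough Python-level illustration of the bytes/str separation described above (a sketch, not code from the patch): binascii.hexlify is documented to take a bytes-like object, so from the caller's side it behaves like a bytes-only parser. The helper name accepts_argument is made up for this example.

```python
import binascii


def accepts_argument(func, arg):
    """Return True if func(arg) succeeds, False if it raises TypeError."""
    try:
        func(arg)
    except TypeError:
        return False
    return True


# bytes and bytearray are byte strings: both are accepted.
bytes_ok = accepts_argument(binascii.hexlify, b"spam")
bytearray_ok = accepts_argument(binascii.hexlify, bytearray(b"spam"))

# str is text, not a byte string: it is rejected with TypeError.
str_ok = accepts_argument(binascii.hexlify, "spam")

print(bytes_ok, bytearray_ok, str_ok)
```

An extension that relied on a byte-oriented format accepting text in Python2 would see the same kind of TypeError after porting, which is the compatibility question raised in this message.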
msg106641 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 08:14
Since Python3 completely removed the getcharbuffer interface
to which "t#" maps in Python2, "t#" does indeed no
longer serve any special purpose.

It's probably wise to just map "t#" to "y#" in order to ease
porting extensions from 2.x to 3.x.
msg106642 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 10:33
Patch to remove "t#":
 - Update c-api/arg.rst documentation
 - Replace "t#" format by "y#" in codecs.charbuffer_encode()
 - Add a note in Doc/whatsnew/3.2.rst (in Porting to Python 3.2)
msg106643 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 10:39
Given that "y#" is not (yet) in wide-spread use, it may actually make
more sense to replace "y#" with "t#" and introduce "t*" to replace
"y*".

"y#" and "y*" could then be set up as synonyms for "t#" and "t*".
msg106644 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 11:04
> Given that "y#" is not (yet) in wide-spread use, ...

t# is only used once (in codecs.charbuffer_encode()), whereas y# is used by ossaudiodev, socket and mmap modules (there are 8 functions using y#). There are 46 functions using y* format. y format is not used in Python3.

To me, it looks easier to just drop t# and continue to use y, y* and y# formats in Python3.

> "y#" and "y*" could then be setup as synonyms for "t#" and "t*"

If we have to keep backward compatibility, yes, t# can be kept as a synonym for y#. But I don't think that backward compatibility of the C API is important in Python3 because only a few 3rd party modules are compatible with Python3.

--

I prefer to use the y, y* and y# formats because they target the *bytes* type (the Python3 type for storing byte strings), whereas s# is used in Python2 to get text from the *str* type. Those are byte strings, but most Python2 programmers think of str as the character string type. I see the change from s# to y# as mirroring the change from str to bytes (the strict separation between bytes and str).
msg106648 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-28 11:30
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> Given that "y#" is not (yet) in wide-spread use, ...
> 
> t# is only used once (in codecs.charbuffer_encode()), whereas y# is used by ossaudiodev, socket and mmap modules (there are 8 functions using y#). There are 46 functions using y* format. y format is not used in Python3.
> 
> To me, it looks easier to just drop t# and continue to use y, y* and y# formats in Python3.

You are forgetting our main target: to get extension writers to
port their extensions to Python3. Changes to the Python core are
a lot easier to implement than getting thousands of extensions
ported.

"t#" is in wide-spread use, since it's the only way a Python2
extension can request access to an object's text data version.

"y#" was introduced with Python3, and there are only very few
extensions written for it.

Given these facts, it's better to drop "y#" and replace it with
"t#". This is easily done for the core modules and by adding
synonyms for "y#" we can also automatically take care of the
few Python3 extensions possibly using it.

>> "y#" and "y*" could then be setup as synonyms for "t#" and "t*"
> 
> If we have to keep backward compatibility, yes, t# can be kept as a synonym for y#. But I don't think that backward compatibility of the C API is important in Python3 because only a few 3rd party modules are compatible with Python3.

True, and that's why we have to make it easier for extension writers
to port their extensions rather than making it harder.

It is not too difficult to adjust a Python2 extension to work
in Python3 as well, so that's most likely the route that
many extension writers will take, hence the need to reduce the
number of differences between the Python2 and Python3 C API.

> --
> 
> I prefer to use the y, y* and y# formats because they target the *bytes* type (the Python3 type for storing byte strings), whereas s# is used in Python2 to get text from the *str* type. Those are byte strings, but most Python2 programmers think of str as the character string type. I see the change from s# to y# as mirroring the change from str to bytes (the strict separation between bytes and str).

That's not correct: "s#" is used in Python2 to get at the bytes
representation of an object, not the text version. "t#" was
specifically added to access a text version of the content.

In Python3, this distinction is no longer available (for whatever
reason), so only the bytes representation of the object remains.

Looking at the implementation again, I found that "y#" rejects
Unicode, while "s#" returns the default encoded version like
"t#" does in Python2.

So I have to correct what I said earlier:

"y#" is not the right replacement for "t#" in order to stay compatible
with its Python2 counterpart. The "t#" implementation in Python3 is not
compatible with the Python2 approach - it is, in fact, a totally
different parser, since Unicode no longer provides a buffer interface
and thus cannot be used as input for "t#".

The only compatible counterpart to the Python2 "t#" parser marker
in Python3 appears to be "s#".

I'll have to think about this some more, but seen in that light,
removing "t#" in Python3 may actually be a better strategy after
all - mostly to remove a misguided forward-porting attempt
and to reduce the number of surprises extension writers will
see when porting their apps to Python3.
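
A Python-level sketch of the difference described in this message. Two assumptions that do not come from this thread: codecs.readbuffer_encode (an undocumented helper) is taken to parse its argument in the "s*" style, and binascii.hexlify stands in for a bytes-only parser in the style of "y#" or "y*".

```python
import binascii
import codecs

text = "h\u00e9llo"  # contains one non-ASCII character

# "s*"-style parsing (assumed for readbuffer_encode): a str argument is
# accepted and exposed as its UTF-8 encoding. The second item of the
# result is the length of that buffer in bytes.
encoded, length = codecs.readbuffer_encode(text)

# Bytes-only parsing: the same str argument is rejected with TypeError.
try:
    binascii.hexlify(text)
    rejected = False
except TypeError:
    rejected = True

print(encoded, length, rejected)
```

The encoded form is what an "s#"-style parser hands to C code, which is why "s#" is the nearer match for code that expected "t#" to accept text.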
msg106652 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-28 12:08
On Friday 28 May 2010 13:30:22, you wrote:
> Looking at the implementation again, I found that "y#" rejects
> Unicode, while "s#" returns the default encoded version like
> "t#" does in Python2.

Oh, I didn't notice that.

> So I have to correct what I said earlier:
> 
> "y#" is not the right replacement for "t#" in order to stay compatible
> with its Python2 counterpart. The "t#" implementation in Python3 is not
> compatible with the Python2 approach - it is, in fact, a totally
> different parser, since Unicode no longer provides a buffer interface
> and thus cannot be used as input for "t#".
> 
> The only compatible counterpart to the Python2 "t#" parser marker
> in Python3 appears to be "s#".
> 
> I'll have to think about this some more, but seen in that light,
> removing "t#" in Python3 may actually be a better strategy after
> all - mostly to remove a misguided forward-porting attempt
> and to reduce the number of surprises extension writers will
> see when porting their apps to Python3.

So t#, s# and y# are all different. I'm waiting for your final decision.

"reduce the number of surprises extension writers ..." is a good argument in
favor of removing t# :-)
msg107229 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-06 21:09
New version of the patch:
 - charbuffer_encode() uses y* instead of y# format to accept modifiable buffer objects (e.g. bytearray)
 - Improve the documentation about the change

@lemburg: So, do you agree with my patch?
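
The distinction between read-only and modifiable buffers in the patch notes above can be seen from Python with memoryview, used here as a stand-in for the Py_buffer that a "*" format fills in. This is only a sketch of the buffer protocol's behaviour, not code from the patch.

```python
# A bytes object exports a read-only buffer; a bytearray exports a
# writable one.
readonly_bytes = memoryview(b"spam").readonly
readonly_bytearray = memoryview(bytearray(b"spam")).readonly

# While a buffer export is held, the bytearray cannot be resized, so the
# memory seen by the consumer stays valid.
data = bytearray(b"spam")
view = memoryview(data)
try:
    data.extend(b"!")
    resized = True
except BufferError:
    resized = False
finally:
    view.release()

# Once the export is released, resizing works again.
data.extend(b"!")

print(readonly_bytes, readonly_bytearray, resized, data)
```

A "*" format holds such an export until PyBuffer_Release() is called, which is what makes it safe to accept mutable objects; a "#" format only returns a pointer and a length.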
msg107284 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-07 21:50
No, because y*/y# are not correct replacements for t#. They don't
accept Unicode objects.

t# was meant to provide access to text data, so replacing it with a
parser code that is meant for binary data is not correct. The
closest Python3 gets to t# from Python2 is s# or s*, so please use
those in the NEWS entry and s* in charbuffer_encode().
msg107362 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-08 23:02
> t# was meant to provide access to text data, so replacing it with a
> parser code that is meant for binary data is not correct. The
> closest Python3 gets to t# from Python2 is s# or s*, so please use
> those in the NEWS entry and s* in charbuffer_encode().

Done. Patch committed as r81854 in 3.2; it also removes codecs.charbuffer_encode(). Commit blocked in 3.1 (r81855).
msg107450 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-10 09:43
Thanks.
History
Date User Action Args
2010-06-10 09:43:45 lemburg set messages: + msg107450
2010-06-08 23:13:56 vstinner set status: open -> closed; resolution: fixed
2010-06-08 23:02:02 vstinner set messages: + msg107362
2010-06-07 21:50:21 lemburg set messages: + msg107284
2010-06-06 21:09:15 vstinner set files: + getarg_remove_tdash-2.patch; messages: + msg107229
2010-05-28 13:18:02 pitrou set nosy: + loewis
2010-05-28 12:08:46 vstinner set messages: + msg106652
2010-05-28 11:30:20 lemburg set messages: + msg106648
2010-05-28 11:04:00 vstinner set messages: + msg106644
2010-05-28 10:39:50 lemburg set messages: + msg106643
2010-05-28 10:33:30 vstinner set files: + getarg_remove_tdash.patch; keywords: + patch; messages: + msg106642
2010-05-28 08:14:41 lemburg set nosy: + lemburg; messages: + msg106641
2010-05-27 23:29:07 vstinner create