classification
Title: nntplib module broken by str to unicode conversion
Type: crash Stage:
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: benjamin.peterson Nosy List: amaury.forgeotdarc, barry, benjamin.peterson, christian.heimes, haypo, hdima, orsenthil
Priority: release blocker Keywords: patch

Created on 2008-08-28 14:09 by hdima, last changed 2008-11-05 19:44 by christian.heimes. This issue is now closed.

Files
File name Uploaded Description Edit
nntplib.patch hdima, 2008-08-28 14:40 Patch to nntplib
nntp-bytes-2.patch haypo, 2008-11-04 18:27
Messages (20)
msg72090 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-08-28 14:09
The following commands fail badly:

>>> from nntplib import NNTP
>>> s = NNTP("free-text.usenetserver.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/py3k/Lib/nntplib.py", line 116, in __init__
    self.welcome = self.getresp()
  File "/py3k/Lib/nntplib.py", line 215, in getresp
    resp = self.getline()
  File "/py3k/Lib/nntplib.py", line 209, in getline
    elif line[-1:] in CRLF: line = line[:-1]
TypeError: 'in <string>' requires string as left operand, not bytes

Actually there are many places in nntplib module which need to be
converted to bytes, or socket input/output values need to be converted
from/to str. I think API need to be updated to pass user defined
encoding at some stages. I can make a patch later if needed.
msg72091 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-08-28 14:20
Yes, the module is unusable in the current state.
msg72092 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-08-28 14:40
I've attached the patch which adds encoding parameter to the NNTP class.
msg72774 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-09-08 12:57
I verified the patch against the trunk. It works fine and solves the issue.
But, I have a minor concern over 'ascii' as the default encoding, which you
have chosen.

For e.g, when I ran python3.0 nntplib.py (It would run tests, as the module
does not have an explicit test), I got the following error due to encoding.
(tests read comp.lang.python)

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 29:
ordinal not in range(128)

Setting the encoding to 'latin1', it passed.

Would 'latin1' be a better default encoding? Or should we leave it as 'ascii'.
msg72776 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-09-08 13:26
Actually RFC-977 said all characters must be in ASCII, but RFC-3977
changed default character set to UTF-8. So I think UTF-8 must be default
encoding, not Latin-1. Moreover Latin-1 can silently hide a real
encoding, for example:

>>> u'\u0422\u0435\u0441\u0442'.encode("koi8-r").decode("latin1")
u'\xf4\xc5\xd3\xd4'

Additionally in the future it would be a good idea to look in the
article headers for article body encoding.
msg72777 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-08 13:27
Setting the default encoding as "ascii" is very conservative until we
know the encoding actually used by the server.

Are you sure that comp.lang.python uses latin-1? RFC3977, which
re-defines the NNTP protocol, prefers utf-8 for the character set.

Is there a way to know the character set used by a server?
msg72778 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-09-08 13:36
When the default encoding 'ascii' failed, I tried, 'utf-8', even that failed to
retrieve the message headers from comp.lang.python. 
'latin1' succeeded. That was reason for my suggestion, considering 'latin1' to
be superset of 'ascii'. 

But, yes checking the encoding at the server and using that would be a good
idea. The for the default, we could follow whatever RFC3977 recommends.
msg72779 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-09-08 13:41
If I understand it correctly there is no "character set used by server"
because every article can be in different encoding. RFC-3977 say:

"""
The character set of article bodies SHOULD be indicated in the
article headers, and this SHOULD be done in accordance with MIME.
"""

But it's not always true, for example fido7.* groups known to use
"KOI-8R" encoding but I didn't find any relevant headers.
msg72780 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-09-08 13:54
RFC-3977 say the following about headers:

- The names of headers (e.g., "From" or "Subject") MUST be in
  US-ASCII.

- Header values SHOULD use US-ASCII or an encoding based on it, such
  as RFC 2047 [RFC2047], until such time as another approach has
  been standardised.  At present, 8-bit encodings (including UTF-8)
  SHOULD NOT be used because they are likely to cause
  interoperability problems.

But in practice for now there is no way to reliable find a header's
encoding.
msg72884 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-09-09 17:02
So, in effect, if we settle for ASCII as the default encoding, then the
attached patch is good enough. Someone should check this in. 
Should it go in py3k, release? Because, without this patch(which is again
simple), nntplib is almost unusable.
msg74733 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-14 11:50
Instead of ASCII, I think that it would be better to use ISO-8859-1 
since it's the most common charset.

New patch:
 - use ISO-8859-1 as the default charset
 - remove set_encoding() method: was it really needed
 - use makefile('r', encoding=self.encoding') to get a new 
TextIOWrapper with universal newline and automatic unicode decode => 
getline() is simplified and I removed CRLF
msg74737 - (view) Author: Dmitry Vasiliev (hdima) Date: 2008-10-14 12:11
Oh, you need to read the comments first:

- Use of ISO-8859-1 it's a bad idea here. See msg72776 for details.
Moreover RFC-3977 explicitly say about UTF-8, so I think we need to use
UTF-8.
- Maybe set_encoding() isn't needed but you need to have a possibility
to change encoding after object creation. Because different groups can
use different encodings. But with makefile() addition you just remove
this possibility.
msg74739 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-14 12:20
Ok for UTF-8 which is a superset of ASCII and raise an error when 
trying to decode Latin1 or KOI-8.

> You need to have a possibility to change encoding after object 
creation

If you share a connection for the different groups, you will have to 
take care of the side effets of set_encoding(). But if you consider 
that set_encoding() is a must-have, ok, forget my patch (because using 
makefile(), it's not possible to change the charset) ;-)
msg74889 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-10-17 01:39
This issue does not seem close to resolution and the nntplib is not
critical enough to block the release, so I'm deferring this to rc3.
msg75157 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-24 10:06
As I did for POP3 (poplib) and IMAP4 (imaplib), nntplib should use 
bytes instead of characters because the charset may be different for 
each message and nntplib is unable to retreive the charset from the 
email header.
msg75395 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-30 22:54
After testing many different emails servers (POP3, IMAP4 and NNTP), I 
think that the best type is bytes and not str (unicode). Attached 
patch changes nntplib to use bytes:
 - write tests using Gmane NNTP public server
 - use bytes everywhere except for the commands (str), converted to 
bytes using ASCII charset
 - factorizy post() + ihave(): create _post()
 - replace line[-1] == '\n' by line.endswith(b'\n')
 - replace e.response[:3] == '480' by e.response.startswith(b'480')
 - use format() method because str+int is forbidden. Eg. see body() 
method: id can be an integer (or an unicode string).

I like startswith() / endswith() because it's more readable and it 
works with string of 1 byte: line[-1:] == b'\n' vs 
line.endswith(b'\n').

TODO:
 - test authentication
 - test methods marked as "TODO" in the testcase: 
    * Check newgroups(), newnews(), list()
    * Check xhdr(), xover(), xgtitle(), xpath()
    * Check post(), ihave()
 - does .format() work with bytes? body(b'1') may creates the 
command "BODY b'1'" which is wrong <= should I also use bytes for the 
commands?

I don't know NNTP enough to test all methods, and I don't have any 
account to test the authentication :-/ But most common commands should 
work (see the testcase).
msg75396 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-10-30 23:00
I haven't been able to examine the patch in depth, but one thing I see
is that the test case needs to call test.support.requires("network").
msg75481 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-11-04 00:08
Please fix the test as Benjamin suggests ("network").  While I'd prefer
a more fleshed out test suite, I think in time you can add that.  For
now, this is still an improvement and likely the best we're going to get
before 3.0.  Thanks for working on this.

Go ahead and make the suggested change and land the code.
msg75500 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-04 18:27
Le Tuesday 04 November 2008 01:08:25 Barry A. Warsaw, vous avez écrit :
> Please fix the test as Benjamin suggests ("network")

Done: see new attached patch.
msg75528 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-11-05 19:44
Applied in revision r67108
History
Date User Action Args
2008-11-05 19:44:46christian.heimessetstatus: open -> closed
nosy: + christian.heimes
resolution: accepted -> fixed
messages: + msg75528
2008-11-04 18:27:29hayposetfiles: + nntp-bytes-2.patch
messages: + msg75500
2008-11-04 18:19:06hayposetfiles: - nntp-bytes.patch
2008-11-04 02:48:54benjamin.petersonsetassignee: benjamin.peterson
2008-11-04 00:08:24barrysetkeywords: - needs review
resolution: accepted
messages: + msg75481
2008-10-31 00:18:08hayposetkeywords: + needs review
2008-10-30 23:00:26benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg75396
2008-10-30 22:54:52hayposetfiles: + nntp-bytes.patch
messages: + msg75395
2008-10-28 15:05:37hayposetfiles: - nntplib_unicode.patch
2008-10-24 10:06:54hayposetmessages: + msg75157
2008-10-24 09:32:13barrysetpriority: deferred blocker -> release blocker
2008-10-17 01:39:28barrysetpriority: release blocker -> deferred blocker
nosy: + barry
messages: + msg74889
2008-10-14 12:20:41hayposetmessages: + msg74739
2008-10-14 12:11:58hdimasetmessages: + msg74737
2008-10-14 11:50:14hayposetfiles: + nntplib_unicode.patch
nosy: + haypo
messages: + msg74733
2008-10-08 20:42:19benjamin.petersonsetpriority: critical -> release blocker
2008-09-09 17:02:30orsenthilsetmessages: + msg72884
2008-09-08 13:54:50hdimasetmessages: + msg72780
2008-09-08 13:41:56hdimasetmessages: + msg72779
2008-09-08 13:36:09orsenthilsetmessages: + msg72778
2008-09-08 13:27:51amaury.forgeotdarcsetmessages: + msg72777
2008-09-08 13:26:41hdimasetmessages: + msg72776
2008-09-08 12:57:37orsenthilsetnosy: + orsenthil
messages: + msg72774
2008-08-28 14:40:09hdimasetfiles: + nntplib.patch
keywords: + patch
messages: + msg72092
2008-08-28 14:20:54amaury.forgeotdarcsetpriority: critical
nosy: + amaury.forgeotdarc
messages: + msg72091
2008-08-28 14:09:21hdimacreate