Issue 10284: NNTP should accept bytestrings for username and password

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54493

classification

Title:	NNTP should accept bytestrings for username and password
Type:	behavior	Stage:	needs patch
Components:	Library (Lib), Unicode	Versions:	Python 3.3, Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	eric.araujo, hynek, jelie, pitrou, r.david.murray, vstinner
Priority:	normal	Keywords:

Created on 2010-11-01 20:22 by jelie, last changed 2022-04-11 14:57 by admin.

Messages (21)
msg120161 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 20:22
> +# - all commands are encoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message data (POST, IHAVE)
> +# - all responses are decoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message data (ARTICLE, HEAD, BODY)

It does not seem to work on my news server (news.trigofacile.com):

    print(s.descriptions('*'))

A UnicodeEncodeError exception is raised. I do not know what is happening. The same command works fine with Python 2.7 and 3.0.

Also, for AUTHINFO, be careful that nntplib should not send a UTF-8 string but a byte string. (I have not tested.) The username and the password are byte strings.
msg120163 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-11-01 20:26
What's the exception? If there were any escaped bytes in the string returned by descriptions, you would get an error when you try to print them. This could be a design problem.
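
To illustrate the "escaped bytes" mentioned above, here is a minimal sketch of how the "surrogateescape" error handler behaves. The sample bytes are an assumption chosen for illustration; they are not taken from the reporter's server.

```python
# Bytes that are not valid UTF-8: a Latin-1 encoded "É" followed by ASCII.
raw = b"\xc9ric"

# Decoding with "surrogateescape" never fails: the undecodable byte 0xC9
# is carried inside the str as the lone surrogate U+DCC9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode, which is what writing to an output stream normally
# performs, rejects the lone surrogate.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(ascii(text), round_tripped, strict_encode_failed)
```

So a string containing escaped bytes survives a round trip through the same handler, but cannot be printed as is.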
msg120168 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 20:50
Traceback (most recent call last):
  File "nntplib-test.py", line 10, in <module>
    print(s.descriptions('*'))
  File "C:\Program Files\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 3285-3287: character maps to <undefined>

The code was previously working fine with Python 3.1.

Looking in more detail:

(resp, descs) = s.descriptions('*')
clefs = list(descs.keys())
clefs.sort()
for clef in clefs:
    print(clef), print(b[clef])

Traceback (most recent call last):
  File "nntplib-test.py", line 14, in <module>
    print(clef), print(b[clef])
  File "C:\Program Files\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0152' in position 0: character maps to <undefined>

That is another error. It corresponds to a description containing « Œ », yet in UTF-8.

So you mean the issue comes from the MS-DOS console I am using. Well, that seems right. No problem in the Python IDLE!

Maybe this issue should be closed then? (Yet, something changed because print() worked fine with the previous version...)
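
The second traceback can be reproduced without a news server, since cp850 has no mapping for U+0152. Below is a minimal sketch of one possible workaround for printing on such a console; `console_safe` is a hypothetical helper, not something nntplib provides.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the console codec cannot encode
    replaced by backslash escapes (hypothetical helper)."""
    # sys.stdout.encoding can be None when stdout is not a regular text stream.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# cp850, the code page named in the traceback, cannot encode U+0152.
try:
    "Œ".encode("cp850")
    cp850_can_encode = True
except UnicodeEncodeError:
    cp850_can_encode = False

print(console_safe("Œuvres", encoding="cp850"))
```

The escaped output is less readable than the real character, but the loop over the descriptions no longer aborts on the first unencodable one.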
msg120169 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-11-01 20:52
FTR, a UTF-8 string is a byte string.
msg120170 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 20:58
Yes, you're right. I meant to say that AUTHINFO is not expecting a UTF-8-encoded string. For instance:

    AUTHINFO USER Éric

is valid and should not always be transformed by nntplib to:

    AUTHINFO USER Ã‰ric

News servers do a byte-string comparison (as specified in RFC 4643). So if « Éric » is the expected user name, then it is this very name that is expected!
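
The difference between the two command lines above can be sketched in a few lines. The choice of Latin-1 as the server-side encoding is an assumption made for the example; the point is only that the two byte strings differ.

```python
name = "Éric"

utf8_bytes = name.encode("utf-8")      # 5 bytes: 0xC3 0x89 then "ric"
latin1_bytes = name.encode("latin-1")  # 4 bytes: 0xC9 then "ric"

# A server that stored the Latin-1 bytes compares byte strings,
# so the UTF-8 form of the same name does not match.
matches = utf8_bytes == latin1_bytes

# Reading the UTF-8 bytes as Windows-1252 gives the garbled form shown above.
mojibake = utf8_bytes.decode("cp1252")

print(utf8_bytes, latin1_bytes, matches)
```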
msg120172 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2010-11-01 21:00
But É cannot be transferred as is. It needs to be encoded to bytes using some encoding. What encoding is correct?
msg120173 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-11-01 21:05
Éric: UTF-8 (IIUC the RFC says "SHOULD be UTF-8").

Julien: yes, there are differences in the way printing to the console works between 2.x and 3.x, and this has caused some surprises for Windows users, where the default console codec is a bit limited. So yes, I'm going to close this issue. Reopen it if you find it is not a Windows console problem.
msg120177 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 21:18
David: no, the RFC does not mention UTF-8 for AUTHINFO. Please note the subtlety:

    command =/ authinfo-sasl-command /
               authinfo-user-command /
               authinfo-pass-command

    authinfo-sasl-command = "AUTHINFO" WS "SASL" WS mechanism
                            [WS initial-response]
    authinfo-user-command = "AUTHINFO" WS "USER" WS username
    authinfo-pass-command = "AUTHINFO" WS "PASS" WS password

    initial-response = base64-opt
    username = 1*user-pass-char
    password = 1*user-pass-char
    user-pass-char = B-CHAR

    ; U- means based on UTF-8, excluding NUL CR and LF
    ; B- means based on bytes, excluding NUL CR and LF
    U-CHAR = CTRL / TAB / SP / A-CHAR / UTF8-non-ascii
    B-CHAR = CTRL / TAB / SP / %x21-FF

It is not for nothing that B-CHAR is explicitly mentioned, and not U-CHAR. That is why I insist on this point, and I fear the new nntplib implementation using UTF-8 is breaking the NNTP protocol in some places...
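
As a rough sketch of what the grammar above permits, the check below accepts any non-empty byte string without NUL, CR or LF. It assumes that CTRL covers the remaining control bytes, as in RFC 3977; `is_valid_user_pass` is a hypothetical name, not an nntplib function.

```python
# B-CHAR, as quoted above, is any byte except NUL, CR and LF.
FORBIDDEN = frozenset(b"\x00\r\n")

def is_valid_user_pass(data: bytes) -> bool:
    """Check data against 1*B-CHAR: at least one byte, none forbidden."""
    return len(data) >= 1 and not any(byte in FORBIDDEN for byte in data)

print(is_valid_user_pass(b"\xc9ric"))   # a raw 8-bit byte is allowed
print(is_valid_user_pass(b"a\r\nb"))    # CR and LF are not
```

Note that nothing in this check involves a character encoding: the grammar is defined on bytes only.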
msg120179 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 21:28
Éric: there is no notion of encoding in a few NNTP commands. Regarding AUTHINFO, the real string that I should have written is:

    AUTHINFO USER \xC9ric

7-bit bytes are considered to be encoded in ASCII. 8-bit bytes are just 8-bit bytes. No encoding.

The news client and the news server have to agree on the setting. Authentication occurs between them. I can imagine the news client in ISO-8859-1 and the news server in ISO-8859-15, and a password with a « € » sign in it. Then the password will not be the "same" (when entered on the keyboard), but will match in bytes!

I hope my explanation was clear enough now. No encoding, just byte strings here!
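
The ISO-8859-1 versus ISO-8859-15 scenario can be checked directly. The password value below is made up for the example.

```python
password = "pass€"

# What a client in an ISO-8859-15 locale would put on the wire.
wire_bytes = password.encode("iso-8859-15")

# How the same bytes read to software that assumes ISO-8859-1:
# byte 0xA4 is the currency sign there, not the euro sign.
seen_as_latin1 = wire_bytes.decode("iso-8859-1")

# ISO-8859-1 has no euro sign at all.
try:
    password.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

print(wire_bytes, ascii(seen_as_latin1), latin1_can_encode)
```

The bytes on the wire are identical on both sides even though the two systems would display, and ask the user to type, a different character.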
msg120181 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-01 21:37
Maybe the bug should be reopened -- or the subject changed -- because the real issue is when I read:

    # Incompatible changes from the 2.x nntplib:
    # - all commands are encoded as UTF-8 data (using the "surrogateescape"
    #   error handler), except for raw message data (POST, IHAVE)
    # - all responses are decoded as UTF-8 data (using the "surrogateescape"
    #   error handler), except for raw message data (ARTICLE, HEAD, BODY)

    # UTF-8 is the character set for all NNTP commands and responses: they
    # are automatically encoded (when sending) and decoded (and receiving)
    # by this class.

That is not true. What about XOVER/OVER answers? They contain raw message data. Why aren't they excluded? And XHDR/HDR answers? (As I see that HEAD is excluded, then so should OVER and HDR...)

And AUTHINFO, as I have just explained in the comments here.
msg120190 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-11-02 00:01
That's not what you opened the bug about, though, according to the title. I discussed the headers-in-things-other-than HEAD/ARTICLE, and Antoine was of the opinion that they were "supposed" to be utf-8 and that in any case using surrogate escape was good enough in context. (Headers could also, of course, be MIME transfer encoded, but in that case the header decode will turn them into the correct unicode, assuming they were encoded correctly). Perhaps this design decision needs to be revisited, but if so you'll need a different example of a problem, and so I think this ticket should remain closed and you should open a new one. Two new ones, actually, since AUTHINFO is yet a different problem. And given that the standard you quote specifies bytes without an encoding, there may be no solution that works (that is, the standard appears to be broken, since most people expect to be able to use text strings for passwords).
msg120263 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-02 22:06
David, the headers are not at all supposed to be "utf-8" encoded. For instance, have a look at the cn.bbs.comp.lang.python newsgroup:

    http://groups.google.fr/group/cn.bbs.comp.lang.python

If you look at the source of the articles, you will for instance see that the Subject: header field is not MIME-encoded. It is directly written in gb2312. That's how news works in the wild. Please do not break nntplib in Python 3.2!

Regarding AUTHINFO, the specification is not broken at all. Bytes are expected, not strings in a particular encoding. Well, most people are in fact confused when they speak about encodings -- me included :-) The specification is pretty clear: NNTP expects bytes. And my text string is "\xC9ric", that's all. Please also do not break nntplib when providing such strings on class instantiation.
msg120266 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-02 22:14
> And my text string is "\xC9ric", that's all.

You mean b"\xC9ric", right?

> If you look at the source of the articles, you will for instance see
> that the Subject: header field is not MIME-encoded. It is directly
> written in gb2312.

How is an NNTP client supposed to guess the encoding? Either a header is MIME-encoded, or it follows the RFC 3977 recommendation of UTF-8 (“The content of a header SHOULD be in UTF-8”), or it's unreadable.
msg120273 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-02 22:43
Antoine, a news client could guess it because of the Content-Type: header field (in this example, it mentions charset="gb2312"). Yet, articles without a Content-Type: header field exist in the wild... There is no way to always make the right guess, unfortunately. News clients try to do their best :-) Yes, I mean b"\xC9ric". 4 bytes.
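
A minimal sketch of that guess, for non-compliant articles only: take the charset parameter of the Content-Type: header field and use it to redecode a header that was decoded as UTF-8 with "surrogateescape". The helper names and the sample article are assumptions made for the example, and the heuristic can of course guess wrong.

```python
from email import message_from_bytes

def guess_charset(raw_headers: bytes, default: str = "utf-8") -> str:
    """Return the charset parameter of Content-Type, or default if absent."""
    return message_from_bytes(raw_headers).get_content_charset() or default

def recover_text(escaped: str, charset: str) -> str:
    """Undo a UTF-8/surrogateescape decode and redecode with the guessed charset."""
    raw = escaped.encode("utf-8", errors="surrogateescape")
    return raw.decode(charset, errors="replace")

# A made-up article header block with a raw gb2312 Subject: field.
raw_subject = "你好".encode("gb2312")
raw_headers = (
    b"Subject: " + raw_subject + b"\r\n"
    b'Content-Type: text/plain; charset="gb2312"\r\n'
    b"\r\n"
)

# Simulate a header value decoded as UTF-8 with "surrogateescape".
escaped_subject = raw_subject.decode("utf-8", errors="surrogateescape")

subject = recover_text(escaped_subject, guess_charset(raw_headers))
```

Because "surrogateescape" round-trips the original bytes, the client loses nothing by receiving a str and can still apply its own guess afterwards.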
msg120276 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-02 22:48
> Antoine, a news client could guess it because of the Content-Type:
> header field (in this example, it mentions charset="gb2312").
> Yet, articles without a Content-Type: header field exist in the
> wild...

Unless I'm mistaken, Content-Type should only apply to the body, not the headers. Either the headers use UTF-8 (RFC 3977), or they should be MIME-encoded. Everything else is undecodable.

> There is no way to always make the right guess, unfortunately.
> News clients try to do their best :-)

Well, a news client built on nntplib could also try to do its best :)

> Yes, I mean b"\xC9ric". 4 bytes.

Ok, perhaps we should allow bytes username and password.
msg120279 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-02 22:56
> Unless I'm mistaken, Content-Type should only apply to the body, not the
> headers. Either the headers use UTF-8 (RFC 3977), or they should be
> MIME-encoded. Everything else is undecodable.

Yes, of course. Such articles are not RFC-compliant. You're not mistaken when you mention that the Content-Type: header field applies to the body. I was just answering about how the encoding could be guessed.
msg120341 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-11-03 19:32
What I meant by saying that the spec was broken is that the user is going to be typing the password at a keyboard. The keyboard will generate scan codes. Those scan codes will get interpreted through a system-specific chain of processes until some bytes or some unicode characters are generated. What's to say that the password typed on the keyboard where the password is set up is going to be a binary match for the password entered on the keyboard used for authentication?

Which doesn't change the fact that if the spec calls for binary, nntplib should support binary.
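
One way such support could look is sketched below. This is not the nntplib implementation: `authinfo_line` is a hypothetical helper, and encoding str credentials as UTF-8 is an assumption made for the example.

```python
def authinfo_line(keyword: str, credential) -> bytes:
    """Build an AUTHINFO USER/PASS command line from a str or bytes credential.

    Hypothetical helper, not part of nntplib.
    """
    if isinstance(credential, str):
        # Assumption: text credentials are encoded as UTF-8; callers who
        # need other bytes on the wire pass bytes themselves.
        credential = credential.encode("utf-8")
    if not credential or any(b in b"\x00\r\n" for b in credential):
        raise ValueError("credential must be non-empty and contain no NUL, CR or LF")
    return b"AUTHINFO " + keyword.encode("ascii") + b" " + bytes(credential) + b"\r\n"

print(authinfo_line("USER", b"\xc9ric"))  # bytes pass through untouched
print(authinfo_line("USER", "Éric"))      # str is encoded as UTF-8
```

Passing bytes through unchanged leaves the choice of encoding to the caller, which is the only party that might know what the server expects.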
msg120343 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-03 20:24
OK, I understand. I believe it works fine in practice, because people often use ASCII-only characters... I assume it is going to be a problem when the passwords contain 8-bit characters. I doubt it will work fine if I use different news readers on different localized systems... Interesting question. I will ask how it should be handled.
msg120350 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-03 22:27
I quote what Russ Allbery has just answered on news.software.nntp:

It's completely unspecified what encoding to use for AUTHINFO USER/PASS, which is one of the problems fixed by SASL. Clients should always use SASL where possible because of things like this. None of the legacy authentication mechanisms (for protocols besides NNTP, as well) support character sets.

If they have to fall back to AUTHINFO USER/PASS, they're unfortunately just going to have to guess. Most clients previously probably just sent whatever bytes across the wire that corresponded to the local character set encoding of the username and password. In practice, using anything other than ASCII in passwords with AUTHINFO USER/PASS is not going to be portable and won't work reliably.

> How do current news readers send them to news servers?
> And how news servers should decode them?

News servers probably can't do anything better than just accepting them as a byte stream and doing a byte-by-byte comparison against local configuration.
msg121088 - (view)	Author: Julien ÉLIE (jelie)	Date: 2010-11-12 23:34
RFC 4616 about SASL PLAIN:

    The mechanism consists of a single message, a string of [UTF-8] encoded [Unicode] characters, from the client to the server. The client presents the authorization identity (identity to act as), followed by a NUL (U+0000) character, followed by the authentication identity (identity whose password will be used), followed by a NUL (U+0000) character, followed by the clear-text password. As with other SASL mechanisms, the client does not provide an authorization identity when it wishes the server to derive an identity from the credentials and use that as the authorization identity.

    [...]

    The authorization identity (authzid), authentication identity (authcid), password (passwd), and NUL character deliminators SHALL be transferred as [UTF-8] encoded strings of [Unicode] characters.

That's one of the reasons why AUTHINFO SASL is better than AUTHINFO USER. It also allows whitespaces (a few news servers do not parse well whitespaces in user names or passwords after AUTHINFO USER/PASS -- imagine " test" with a leading space). Solved with SASL.
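
The message layout quoted above can be sketched as follows. The function names are hypothetical, the string preparation that RFC 4616 also describes is left out, and sending the message base64-encoded as the initial response of AUTHINFO SASL follows the grammar quoted earlier in this issue. PLAIN transmits the password in clear text, so this is only sensible over an encrypted connection.

```python
import base64

def sasl_plain_message(authcid: str, passwd: str, authzid: str = "") -> bytes:
    """Build the PLAIN message: authzid NUL authcid NUL passwd, in UTF-8."""
    for part in (authzid, authcid, passwd):
        if "\0" in part:
            raise ValueError("NUL is not allowed inside an identity or password")
    return "\0".join((authzid, authcid, passwd)).encode("utf-8")

def authinfo_sasl_plain(authcid: str, passwd: str, authzid: str = "") -> bytes:
    """Build an AUTHINFO SASL PLAIN line with a base64 initial response."""
    initial = base64.b64encode(sasl_plain_message(authcid, passwd, authzid))
    return b"AUTHINFO SASL PLAIN " + initial + b"\r\n"

# A non-ASCII user name and a password with a leading space both survive.
line = authinfo_sasl_plain("Éric", " test")
```

Since the whole message is base64-encoded, neither the leading space nor the non-ASCII user name can confuse the server's command parser.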
msg121092 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-12 23:44
Hello Julien,

> That's one of the reasons why AUTHINFO SASL is better than AUTHINFO
> USER. It also allows whitespaces (a few news servers do not parse
> well whitespaces in user names or passwords after AUTHINFO USER/PASS
> -- imagine " test" with a leading space). Solved with SASL.

If you want to contribute SASL auth for NNTP, it might make sense to have a dedicated module to provide SASL mechanisms (and then let imaplib reuse that module). Of course, not all mechanisms need to be provided at the start.

(FWIW, I had written a patch providing generic SASL support for Twisted years ago: http://twistedmatrix.com/trac/ticket/2015 )

History
Date	User	Action	Args
2022-04-11 14:57:08	admin	set	github: 54493
2013-10-25 07:58:10	christian.heimes	set	versions: + Python 3.3, Python 3.4, - Python 3.2
2012-02-05 16:45:25	hynek	set	nosy: + hynek
2011-07-07 10:50:55	vstinner	set	nosy: + vstinner components: + Unicode
2010-11-12 23:44:24	pitrou	set	messages: + msg121092
2010-11-12 23:34:29	jelie	set	messages: + msg121088
2010-11-03 22:27:57	jelie	set	messages: + msg120350
2010-11-03 20:24:07	jelie	set	messages: + msg120343
2010-11-03 19:32:21	r.david.murray	set	messages: + msg120341
2010-11-03 16:33:05	pitrou	set	status: closed -> open title: Exception raised when decoding NNTP newsgroup descriptions -> NNTP should accept bytestrings for username and password resolution: not a bug -> stage: resolved -> needs patch
2010-11-02 22:56:21	jelie	set	messages: + msg120279
2010-11-02 22:48:20	pitrou	set	messages: + msg120276
2010-11-02 22:43:15	jelie	set	messages: + msg120273
2010-11-02 22:14:24	pitrou	set	nosy: + pitrou messages: + msg120266
2010-11-02 22:06:05	jelie	set	messages: + msg120263
2010-11-02 00:01:57	r.david.murray	set	messages: + msg120190
2010-11-01 21:37:21	jelie	set	messages: + msg120181
2010-11-01 21:28:28	jelie	set	messages: + msg120179
2010-11-01 21:18:23	jelie	set	messages: + msg120177
2010-11-01 21:05:56	r.david.murray	set	status: open -> closed nosy: eric.araujo, r.david.murray, jelie messages: + msg120173 components: + Library (Lib), - Extension Modules, Unicode resolution: not a bug stage: resolved
2010-11-01 21:00:10	eric.araujo	set	messages: + msg120172
2010-11-01 20:58:19	jelie	set	messages: + msg120170
2010-11-01 20:52:00	eric.araujo	set	nosy: + eric.araujo messages: + msg120169
2010-11-01 20:50:10	jelie	set	messages: + msg120168
2010-11-01 20:26:55	r.david.murray	set	nosy: + r.david.murray messages: + msg120163
2010-11-01 20:25:49	jelie	set	components: + Unicode
2010-11-01 20:22:39	jelie	create