
classification
Title: NNTP should accept bytestrings for username and password
Type: behavior
Stage: needs patch
Components: Library (Lib), Unicode
Versions: Python 3.3, Python 3.4

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.araujo, hynek, jelie, pitrou, r.david.murray, vstinner
Priority: normal
Keywords:

Created on 2010-11-01 20:22 by jelie, last changed 2022-04-11 14:57 by admin.

Messages (21)
msg120161 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 20:22
> +# - all commands are encoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message data (POST, IHAVE)
> +# - all responses are decoded as UTF-8 data (using the "surrogateescape"
> +#   error handler), except for raw message data (ARTICLE, HEAD, BODY)

It does not seem to work on my news server (news.trigofacile.com):

print(s.descriptions('*'))

This raises a UnicodeEncodeError.  I do not know what is happening.
The same command works fine with Python 2.7 and 3.0.



Also, for AUTHINFO, be careful: nntplib should not send a UTF-8 string but a byte string.  (I have not tested this.)
The username and the password are byte strings.
msg120163 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-01 20:26
What's the exception?  If there were any escaped bytes in the string returned by descriptions, you would get an error when you try to print them.

This could be a design problem.
msg120168 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 20:50
Traceback (most recent call last):
  File "nntplib-test.py", line 10, in <module>
    print(s.descriptions('*'))
  File "C:\Program Files\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 3285-3287: character maps to <undefined>


The code was previously working fine with Python 3.1.



Looking in more detail:

(resp, descs) = s.descriptions('*')

clefs = list(descs.keys())
clefs.sort()
for clef in clefs:
    print(clef), print(b[clef])



Traceback (most recent call last):
  File "nntplib-test.py", line 14, in <module>
    print(clef), print(b[clef])
  File "C:\Program Files\Python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0152' in position 0: character maps to <undefined>


That is a different error.  It corresponds to a description containing « Œ », even though that description is in UTF-8.

So you mean the issue comes from the MS-DOS console I am using.
Well, that seems right.
No problem in the Python IDLE!

Maybe this issue should be closed then?

(Yet, something changed because print() worked fine with the previous version...)
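
The console failure above can be reproduced without a news server. This is a minimal sketch, assuming the console codec is cp850 as shown in the traceback; the helper name `safe_for_console` is hypothetical and not part of nntplib.

```python
def safe_for_console(text, encoding="cp850"):
    """Return text with characters the console codec lacks shown as escapes."""
    # backslashreplace substitutes \uXXXX escapes instead of raising.
    data = text.encode(encoding, errors="backslashreplace")
    return data.decode(encoding)


description = "\u0152uvres"  # starts with « Œ », which cp850 cannot represent

try:
    description.encode("cp850")
    encodable = True
except UnicodeEncodeError:
    # Same failure as print() on a cp850 console.
    encodable = False

print(encodable)
print(safe_for_console(description))
```

The string itself is fine; only the conversion to the console's character map fails, which is why the same code works in IDLE.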
msg120169 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-01 20:52
FTR, a UTF-8 string *is* a byte string.
msg120170 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 20:58
Yes, you're right.
I meant to say that AUTHINFO is not expecting a UTF-8-encoded string.

For instance:

AUTHINFO USER Éric

is valid and should not always be transformed by nntplib to:

AUTHINFO USER Ã‰ric


News servers do a byte-string comparison (as specified in RFC 4643).  So if « Éric » is the expected user name, then it is this very name that is expected!
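
To make the difference concrete, here is a small sketch of the bytes that would go on the wire for the same user name under two encodings. It assumes the server was configured with the ISO-8859-1 form of the name; that choice is only an illustration.

```python
name = "\u00c9ric"  # « Éric »

# The same text yields different bytes depending on the encoding chosen.
latin1_line = b"AUTHINFO USER " + name.encode("iso-8859-1")
utf8_line = b"AUTHINFO USER " + name.encode("utf-8")

print(latin1_line)               # b'AUTHINFO USER \xc9ric'
print(utf8_line)                 # b'AUTHINFO USER \xc3\x89ric'
print(latin1_line == utf8_line)  # False: a byte-wise comparison fails
```

A server that compares bytes will accept only one of these two lines, whichever matches its stored credential.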
msg120172 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-01 21:00
But É cannot be transferred as is.  It needs to be encoded to bytes using some encoding.  What encoding is correct?
msg120173 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-01 21:05
Éric: UTF-8 (IIUC the RFC says "SHOULD be UTF-8").

Julien: yes, there are differences in the way printing to the console works between 2.x and 3.x, and this has caused some surprises for Windows users, where the default console codec is a bit limited.  So yes, I'm going to close this issue.  Reopen it if you find it is not a Windows console problem.
msg120177 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 21:18
David:  no, the RFC does not mention UTF-8 for AUTHINFO.
Please note the subtlety:

   command =/ authinfo-sasl-command /
        authinfo-user-command /
        authinfo-pass-command

   authinfo-sasl-command = "AUTHINFO" WS "SASL" WS mechanism
        [WS initial-response]
   authinfo-user-command = "AUTHINFO" WS "USER" WS username
   authinfo-pass-command = "AUTHINFO" WS "PASS" WS password

   initial-response = base64-opt
   username = 1*user-pass-char
   password = 1*user-pass-char
   user-pass-char = B-CHAR

   ;   U- means based on UTF-8, excluding NUL CR and LF
   ;   B- means based on bytes, excluding NUL CR and LF
   U-CHAR     = CTRL / TAB / SP / A-CHAR / UTF8-non-ascii
   B-CHAR     = CTRL / TAB / SP / %x21-FF


It is not for nothing that B-CHAR is explicitly mentioned, and *not* U-CHAR.
That is why I insist on this point, and I fear the new nntplib implementation using UTF-8 is breaking the NNTP protocol in some places...
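
The grammar quoted above can be checked mechanically. This is a rough sketch, assuming (as the quoted comment says) that B-CHAR is any byte except NUL, CR and LF; the function name `is_valid_user_pass` is hypothetical.

```python
FORBIDDEN = frozenset(b"\x00\r\n")  # NUL, CR, LF as integer byte values


def is_valid_user_pass(value: bytes) -> bool:
    """Check `1*user-pass-char`: one or more bytes, none of NUL, CR, LF."""
    return len(value) > 0 and not any(byte in FORBIDDEN for byte in value)


print(is_valid_user_pass(b"\xc9ric"))     # a lone 0xC9 byte is allowed
print(is_valid_user_pass(b"bad\r\nname")) # CR and LF are not
```

Note that the check never decodes anything: validity is defined on bytes, so a sequence that is not valid UTF-8 still passes.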
msg120179 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 21:28
Éric:  there is no notion of encoding in a few NNTP commands.
Regarding AUTHINFO, the real string that I should have written is:

AUTHINFO USER \xC9ric

7-bit bytes are considered to be encoded in ASCII.
8-bit bytes are just 8-bit bytes.  No encoding.

The news client and the news server have to agree on the setting.  Authentication occurs between them.
I can imagine the news client in ISO-8859-1 and the news server in ISO-8859-15, and a password containing a « € » sign.  Then the password will not be the "same" (when entered on the keyboard), but will match in bytes!


I hope my explanation was clear enough now.
No encoding, just byte strings here!
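
The ISO-8859-1 versus ISO-8859-15 scenario can be illustrated as follows. This sketch assumes the byte in question is 0xA4, which is « ¤ » in ISO-8859-1 and « € » in ISO-8859-15.

```python
# « ¤ » typed on a client whose locale is ISO-8859-1.
client_bytes = "\u00a4".encode("iso-8859-1")
# « € » stored by a server administrator working in ISO-8859-15.
server_bytes = "\u20ac".encode("iso-8859-15")

print(client_bytes, server_bytes)
print(client_bytes == server_bytes)  # the characters differ, the bytes match

try:
    "\u20ac".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    # ISO-8859-1 has no euro sign at all.
    euro_in_latin1 = False
print(euro_in_latin1)
```

The byte-wise comparison succeeds even though the two users see different characters, which is exactly why the protocol treats these fields as bytes.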
msg120181 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-01 21:37
Maybe the bug should be reopened -- or the subject changed -- because the real issue is when I read:

# Incompatible changes from the 2.x nntplib:
# - all commands are encoded as UTF-8 data (using the "surrogateescape"
#   error handler), except for raw message data (POST, IHAVE)
# - all responses are decoded as UTF-8 data (using the "surrogateescape"
#   error handler), except for raw message data (ARTICLE, HEAD, BODY)

# UTF-8 is the character set for all NNTP commands and responses: they
# are automatically encoded (when sending) and decoded (and receiving)
# by this class.

That is not true.
What about XOVER/OVER answers?  They contain raw message data.  Why aren't they excluded?
And XHDR/HDR answers?
(As I see that HEAD is excluded, then so should OVER and HDR...)

And AUTHINFO, as I have just explained in the comments here.
msg120190 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-02 00:01
That's not what you opened the bug about, though, according to the title.

I raised the question of headers in things other than HEAD/ARTICLE, and Antoine was of the opinion that they were "supposed" to be utf-8 and that in any case using surrogateescape was good enough in context.  (Headers could also, of course, be MIME transfer-encoded, but in that case the header decode will turn them into the correct unicode, assuming they were encoded correctly.)

Perhaps this design decision needs to be revisited, but if so you'll need a different example of a problem, and so I think this ticket should remain closed and you should open a new one.

Two new ones, actually, since AUTHINFO is yet a different problem.  And given that the standard you quote specifies bytes without an encoding, there may be *no* solution that works (that is, the standard appears to be broken, since most people expect to be able to use text strings for passwords).
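
The "surrogateescape is good enough" argument rests on the fact that undecodable bytes survive a round trip. A minimal sketch, using a gb2312-encoded header line as assumed sample data:

```python
# Header bytes as a server might send them: gb2312, not UTF-8.
raw = "Subject: \u4f60\u597d".encode("gb2312")

# Decoding as UTF-8 with surrogateescape never fails; each invalid
# byte becomes a lone surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
has_escapes = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(has_escapes)
print(restored == raw)
print(restored.decode("gb2312"))
```

The intermediate `text` is not printable or meaningful as it stands; the caller has to re-encode it and decode again with the right charset, which is the extra step a client built on this design must know about.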
msg120263 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-02 22:06
David, the headers are not at all supposed to be "utf-8" encoded.
For instance, have a look at the cn.bbs.comp.lang.python newsgroup:

http://groups.google.fr/group/cn.bbs.comp.lang.python

If you look at the source of the articles, you will for instance see that the Subject: header field is not MIME-encoded.  It is directly written in gb2312.

That's how news works in the wild.  Please do not break nntplib in Python 3.2!


Regarding AUTHINFO, the specification is not broken at all.  Bytes are expected, not strings in a particular encoding.  Well, most people are in fact confused when they speak about encodings -- me included :-)
The specification is pretty clear:  NNTP expects bytes.  And my text string is "\xC9ric", that's all.  Please also do not break nntplib when providing such strings on class instantiation.
msg120266 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-02 22:14
> And my text string is "\xC9ric", that's all.

You mean b"\xC9ric", right?

> If you look at the source of the articles, you will for instance see
> that the Subject: header field is not MIME-encoded.  It is directly
> written in gb2312.

How is an NNTP client supposed to guess the encoding? Either a header is MIME-encoded, or it follows the RFC 3977 recommendation of UTF-8 (“The content of a header SHOULD be in UTF-8”), or it's unreadable.
msg120273 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-02 22:43
Antoine, a news client could guess it because of the Content-Type: header field (in this example, it mentions charset="gb2312").
Yet, articles without a Content-Type: header field exist in the wild...
There is no way to always make the right guess, unfortunately.
News clients try to do their best :-)


Yes, I mean b"\xC9ric".  4 bytes.
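
The Content-Type guess described above might look like the sketch below. It is a best-effort heuristic rather than anything the RFCs sanction; the function names and regular expressions are assumptions made for illustration, and folded (multi-line) header fields are not handled.

```python
import re

CHARSET_RE = re.compile(
    rb'^Content-Type:.*?charset="?([A-Za-z0-9_.:-]+)"?', re.I | re.M)
SUBJECT_RE = re.compile(rb'^Subject:[ \t]*(.*?)\r?$', re.I | re.M)


def guess_charset(raw_headers: bytes, default: str = "utf-8") -> str:
    """Return the charset named in Content-Type, or the default."""
    match = CHARSET_RE.search(raw_headers)
    return match.group(1).decode("ascii") if match else default


def decode_subject(raw_headers: bytes) -> str:
    """Decode the raw Subject value using the guessed charset."""
    match = SUBJECT_RE.search(raw_headers)
    if match is None:
        return ""
    value = match.group(1)
    try:
        return value.decode(guess_charset(raw_headers), errors="replace")
    except LookupError:
        # Unknown charset name: fall back to UTF-8.
        return value.decode("utf-8", errors="replace")


sample = (b"Subject: " + "\u4f60\u597d".encode("gb2312") + b"\r\n"
          b'Content-Type: text/plain; charset="gb2312"\r\n\r\n')
print(guess_charset(sample))
print(decode_subject(sample))
```

When no Content-Type field is present the guess falls back to UTF-8, and undecodable bytes are replaced rather than raising.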
msg120276 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-02 22:48
> Antoine, a news client could guess it because of the Content-Type:
> header field (in this example, it mentions charset="gb2312").
> Yet, articles without a Content-Type: header field exist in the
> wild...

Unless I'm mistaken, Content-Type should only apply to the body, not the
headers. Either the headers use UTF-8 (RFC 3977), or they should be
MIME-encoded. Everything else is undecodable.

> There is no way to always make the right guess, unfortunately.
> News clients try to do their best :-)

Well, a news client built on nntplib could also try to do its best :)

> Yes, I mean b"\xC9ric".  4 bytes.

Ok, perhaps we should allow bytes username and password.
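
One way to allow both types is sketched below. This is not nntplib's API: the helper `authinfo_line` and its `encoding` parameter are hypothetical, and defaulting text to UTF-8 is an assumption, since the thread establishes that no encoding is specified for AUTHINFO USER/PASS.

```python
def authinfo_line(keyword: str, credential, encoding: str = "utf-8") -> bytes:
    """Build an AUTHINFO USER/PASS line from a str or bytes credential."""
    if isinstance(credential, str):
        # Text has to be encoded somehow; the default is only a guess.
        credential = credential.encode(encoding)
    elif not isinstance(credential, (bytes, bytearray)):
        raise TypeError("credential must be str or bytes")
    # Bytes are sent untouched, so the caller controls the exact octets.
    return (b"AUTHINFO " + keyword.encode("ascii") + b" "
            + bytes(credential) + b"\r\n")


print(authinfo_line("USER", b"\xc9ric"))
print(authinfo_line("USER", "\u00c9ric"))
```

Passing bytes gives the caller full control over what the server compares against, while passing text keeps the convenient behaviour for ASCII-only credentials.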
msg120279 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-02 22:56
> Unless I'm mistaken, Content-Type should only apply to the body, not the
> headers.  Either the headers use UTF-8 (RFC 3977), or they should be
> MIME-encoded.  Everything else is undecodable.

Yes, of course.  Such articles are not RFC-compliant.  You're not mistaken when you mention that the Content-Type: header field applies to the body.
I was just answering about how the encoding could be guessed.
msg120341 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-03 19:32
What I meant by saying that the spec was broken is that the user is going to be typing the password at a keyboard.  The keyboard will generate scan codes.  Those scan codes will get interpreted through a system-specific chain of processes until some bytes or some unicode characters are generated.  What's to say that the password typed on the keyboard where the password is set up is going to be a binary match for the password entered on the keyboard used for authentication?

Which doesn't change the fact that if the spec calls for binary, nntplib should support binary.
msg120343 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-03 20:24
OK, I understand.  I believe it works fine in practice, because people often use ASCII-only characters...  I assume it is going to be a problem when the passwords contain 8-bit characters.
I doubt it will work fine if I use different news readers on different localized systems...

Interesting question.  I will ask how it should be handled.
msg120350 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-03 22:27
I quote what Russ Allbery has just answered on news.software.nntp:


It's completely unspecified what encoding to use for AUTHINFO USER/PASS,
which is one of the problems fixed by SASL.  Clients should always use
SASL where possible because of things like this.  None of the legacy
authentication mechanisms (for protocols besides NNTP, as well) support
character sets.

If they have to fall back to AUTHINFO USER/PASS, they're unfortunately
just going to have to guess.  Most clients previously probably just sent
whatever bytes across the wire that corresponded to the local character
set encoding of the username and password.

In practice, using anything other than ASCII in passwords with AUTHINFO
USER/PASS is not going to be portable and won't work reliably.

> ** How do current news readers send them to news servers?
> ** And how news servers should decode them?

News servers probably can't do anything better than just accepting them as
a byte stream and doing a byte-by-byte comparison against local
configuration.
msg121088 - (view) Author: Julien ÉLIE (jelie) Date: 2010-11-12 23:34
RFC 4616 about SASL PLAIN:

   The mechanism consists of a single message, a string of [UTF-8]
   encoded [Unicode] characters, from the client to the server.  The
   client presents the authorization identity (identity to act as),
   followed by a NUL (U+0000) character, followed by the authentication
   identity (identity whose password will be used), followed by a NUL
   (U+0000) character, followed by the clear-text password.  As with
   other SASL mechanisms, the client does not provide an authorization
   identity when it wishes the server to derive an identity from the
   credentials and use that as the authorization identity.
[...]
   The authorization identity (authzid), authentication identity
   (authcid), password (passwd), and NUL character deliminators SHALL be
   transferred as [UTF-8] encoded strings of [Unicode] characters.


That's one of the reasons why AUTHINFO SASL is better than AUTHINFO USER.  It also allows whitespace (a few news servers do not parse whitespace well in user names or passwords after AUTHINFO USER/PASS -- imagine " test" with a leading space).  Solved with SASL.
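
The PLAIN message layout quoted above can be sketched as follows. The function name is hypothetical, and base64-encoding the message for use as the initial response of AUTHINFO SASL follows the `initial-response = base64-opt` rule quoted earlier in this thread.

```python
import base64


def sasl_plain_initial_response(authcid: str, passwd: str,
                                authzid: str = "") -> str:
    """Return the base64 initial response for AUTHINFO SASL PLAIN."""
    for field in (authzid, authcid, passwd):
        if "\0" in field:
            # NUL is the field separator, so it cannot appear in a field.
            raise ValueError("NUL is not allowed inside a field")
    # authzid NUL authcid NUL passwd, all encoded as UTF-8.
    message = "\0".join((authzid, authcid, passwd)).encode("utf-8")
    return base64.b64encode(message).decode("ascii")


response = sasl_plain_initial_response(" test", "p\u00e4ss word")
print("AUTHINFO SASL PLAIN " + response)
```

Because the whole message is base64-encoded, leading spaces and non-ASCII characters in the user name or password reach the server unambiguously.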
msg121092 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-12 23:44
Hello Julien,

> That's one of the reasons why AUTHINFO SASL is better than AUTHINFO
> USER.  It also allows whitespace (a few news servers do not parse
> whitespace well in user names or passwords after AUTHINFO USER/PASS
> -- imagine " test" with a leading space).  Solved with SASL.

If you want to contribute SASL auth for NNTP, it might make sense to
have a dedicated module to provide SASL mechanisms (and then let imaplib
reuse that module). Of course, not all mechanisms need to be provided at
the start.

(FWIW, I had written a patch providing generic SASL support for Twisted
years ago: http://twistedmatrix.com/trac/ticket/2015 )
History
Date  User  Action  Args
2022-04-11 14:57:08  admin  set  github: 54493
2013-10-25 07:58:10  christian.heimes  set  versions: + Python 3.3, Python 3.4, - Python 3.2
2012-02-05 16:45:25  hynek  set  nosy: + hynek
2011-07-07 10:50:55  vstinner  set  nosy: + vstinner
    components: + Unicode
2010-11-12 23:44:24  pitrou  set  messages: + msg121092
2010-11-12 23:34:29  jelie  set  messages: + msg121088
2010-11-03 22:27:57  jelie  set  messages: + msg120350
2010-11-03 20:24:07  jelie  set  messages: + msg120343
2010-11-03 19:32:21  r.david.murray  set  messages: + msg120341
2010-11-03 16:33:05  pitrou  set  status: closed -> open
    title: Exception raised when decoding NNTP newsgroup descriptions -> NNTP should accept bytestrings for username and password
    resolution: not a bug ->
    stage: resolved -> needs patch
2010-11-02 22:56:21  jelie  set  messages: + msg120279
2010-11-02 22:48:20  pitrou  set  messages: + msg120276
2010-11-02 22:43:15  jelie  set  messages: + msg120273
2010-11-02 22:14:24  pitrou  set  nosy: + pitrou
    messages: + msg120266
2010-11-02 22:06:05  jelie  set  messages: + msg120263
2010-11-02 00:01:57  r.david.murray  set  messages: + msg120190
2010-11-01 21:37:21  jelie  set  messages: + msg120181
2010-11-01 21:28:28  jelie  set  messages: + msg120179
2010-11-01 21:18:23  jelie  set  messages: + msg120177
2010-11-01 21:05:56  r.david.murray  set  status: open -> closed
    nosy: eric.araujo, r.david.murray, jelie
    messages: + msg120173
    components: + Library (Lib), - Extension Modules, Unicode
    resolution: not a bug
    stage: resolved
2010-11-01 21:00:10  eric.araujo  set  messages: + msg120172
2010-11-01 20:58:19  jelie  set  messages: + msg120170
2010-11-01 20:52:00  eric.araujo  set  nosy: + eric.araujo
    messages: + msg120169
2010-11-01 20:50:10  jelie  set  messages: + msg120168
2010-11-01 20:26:55  r.david.murray  set  nosy: + r.david.murray
    messages: + msg120163
2010-11-01 20:25:49  jelie  set  components: + Unicode
2010-11-01 20:22:39  jelie  create