classification
Title: nntplib cleanup
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Dmitry.Jemerov, giampaolo.rodola, ncoghlan, pitrou, r.david.murray
Priority: normal Keywords: patch

Created on 2010-07-23 16:54 by Dmitry.Jemerov, last changed 2010-09-29 15:05 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
nntplib_cleanup.patch Dmitry.Jemerov, 2010-07-23 16:54
nntplib_cleanup2.patch pitrou, 2010-09-13 19:27
nntplib_cleanup3.patch pitrou, 2010-09-14 14:08
nntplib_cleanup4.patch pitrou, 2010-09-15 21:17
nntplib_cleanup5.patch pitrou, 2010-09-19 17:16
nntplib_cleanup6.patch pitrou, 2010-09-24 20:47
nntplib_cleanup7.patch pitrou, 2010-09-25 16:15
Messages (19)
msg111364 - (view) Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 16:54
The patch performs an extensive cleanup of nntplib:
 - Change API methods to return strings instead of bytes. This breaks API compatibility, but given that the parameters need to be passed as strings and many of the returned values would need to be passed to other API methods, I consider the current API to be broken. I've discussed this with Brett at the EuroPython sprint, and he agrees.
 - Add tests.
 - Add pending deprecation warnings for xgtitle() and xpath() methods, which are not useful in modern environments.
 - Use named tuples for returned values where appopriate.
 - Modernize the implementation a little bit.
 - Clean up the docstrings.
msg111369 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-23 17:56
I haven't looked at the patch in detail, but if my quick skim is correct, it looks like you are decoding using latin1.  If you don't provide any way to get at the original bytes (which I presume you don't if you have "converted the API to return strings"), then it will be difficult and non-obvious to correctly decode messages using a content transfer encoding of 8bit and some other character set.  (Of course, it is currently impossible to do this using the Py3k email package either, but I'm working on fixing that).
msg111383 - (view) Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 20:23
This is an issue only for the actual article content, right? I'll be happy to extend the API to allow getting the original bytes of the article content, while keeping the rest of API (like group names) string-based.
msg111404 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-07-24 00:48
Correct, if by 'article content' you mean what is returned by the article command.  So I think it is necessary to be able to get bytes for two commands: article and body.  Then for symmetry it should also be possible to get bytes from the head command.  (It is actually possible for headers to include 8bit data, but it is a violation of the RFC and both rare outside of spam and impossible to handle sensibly in most cases...and it may be the case that most if not all nttpservers sanitize such headers, I don't have any experience in that area.)

On the other hand, this tells me that a possible use case I had been wondering about in email is in fact a valid concern: being able to parse a data stream that contains both strings and bytes :(  That is, one might use nntplib in a mode where you pull the headers as text and the body as bytes and want to hand it off to email to build a Message object from those parts.
msg115794 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-07 18:59
Note that according to RFC 3977, “The character set for all NNTP commands is UTF-8”.

But it also says this about multi-line data blocks:

   Note that texts using an encoding (such as UTF-16 or UTF-32) that may
   contain the octets NUL, LF, or CR other than a CRLF pair cannot be
   reliably conveyed in the above format (that is, they violate the MUST
   requirement above).  However, except when stated otherwise, this
   specification does not require the content to be UTF-8, and therefore
   (subject to that same requirement) it MAY include octets above and
   below 128 mixed arbitrarily.

IMO, it should decode/encode by default using utf-8 (with the "surrogateescape" error handler for easy round-tripping with non-compliant servers), except for raw articles (bodies / envelopes) where bytes should be returned.
msg116342 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-13 19:27
Here is an updated patch, but it's a work in progress.

Since we are breaking compatibility anyway, I think a larger cleanup is deserved. For example:
- remove old exception aliases
- make return types consistent (for example, newgroups() should returned structured data as list() does)
- use datetime.datetime objects instead of consuming and producing dates and times as 6-character strings

Additional useful features could be added:

- automatic querying of CAPABILITIES on connection
- a higher-level over() method able to parse the response as a list of header dicts (using LIST OVERVIEW.FMT), with appropriately decoded values (thanks email.header.decode_header())
- ...

Also, we should add an internal mock NNTP server to easily test features. Relying on gmane is nice for simple basic tests, but not much more.

Does it sound reasonable?
msg116349 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-13 20:39
Assuming we can break backward compatibility, it sounds fine to me.
msg116400 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-14 14:08
Here is a further, still work-in-progress, patch. I'm posting it here so that interested people can give advice.
msg116878 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-19 17:16
Here is the current state of the cleanup work. Coding is roughly finished; most methods are unit-tested. The documentation remains to be done. Review welcome.
msg116884 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-19 20:54
To make the distinction easier to remember, would it help if the methods that are currently set to return bytes instead accepted the typical encoding+errors parameters, with parallel *b APIs to get at the raw bytes?

My concern with the current API is that there isn't a clear indicator during normal programming as to which APIs return strings and which return the raw bytes and hence require further decoding.
msg116886 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-19 21:09
> To make the distinction easier to remember, would it help if the
> methods that are currently set to return bytes instead accepted the
> typical encoding+errors parameters, with parallel *b APIs to get at
> the raw bytes?

Not really, no. For raw messages, which encoding+errors must be used
depends on the returned contents, it's not something the client can know
up front; moreover, different parts of the returned bytes may need
decoding using different encodings (for example if there are several
MIME parts to the message). People should use the email package to parse
the raw messages, as I assume they already do in 2.x.

Apart from raw message bodies, NNTP data has well-defined encodings and
that's why I can take and return unicode (although as stated, I also use
surrogateescape to be fault-tolerant in the face of broken servers).

> My concern with the current API is that there isn't a clear indicator
> during normal programming as to which APIs return strings and which
> return the raw bytes and hence require further decoding.

That's a documentation issue. I haven't touched the docs yet :)
msg116903 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-20 02:04
I tested this against my existing py3k nttp client code.

Why is it a good thing to make file a keyword only parameter? But given that it is, line 720 needs to use it as such, which omission means your missing at least one test :)

On line 193 you index fmt, and *then* you check the length.  When the number of tokens is too long, an IndexError is raised.  (Note that the offending overview line comes from the gmane group comp.lang.python.mime.devel, and the offense is an extra '' field on one of the records.  No idea why it got added to that one record.  Looks like it is message 701, artid <4111BBA9.3040302@harvee.org>)

Could the 'date' field in the xover headers also be a DateTime rather than a string?  And :bytes and :lines be ints?  Or is that being to DWIMish?  If the date field isn't returned as a datetime, though, should there be a helper method for converting it, or should we just assume that email.utils mktime_tz and parsedate_tz?

Am I correct that the purpose of _NNTPBase is to make testing easier?

There seem to be only three lines of code in common between the file and the non-file case in _getlongresp.  I think it would be clearer to make them two different routines, or at least move the three common lines to the top and then do an if test on file is None (that is, put the simpler, non-file case first).

My little nttp client script doesn't really test very much of the nttplib interface, nor is it very complex.  The change to xover considerably simplified that part of my script, so I very much like that change.  I was also able to drop my implementation of decode_header.  So overall this patch is significant improvement, IMO.

I haven't given the code as thorough a review as might be optimal, but I certainly think you are headed in the right direction.
msg116904 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-20 02:08
Gah.  I reviewed patch4, didn't noticed today's update :(
msg116959 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-20 18:20
> Gah.  I reviewed patch4, didn't noticed today's update :(

Oops, I'm sorry for that. It turns out most of your comments are useful anyway.

> Why is it a good thing to make file a keyword only parameter?

I was thinking that if we want to add other function parameters, it will
be possible to do so while keeping the (mnemonically useful) convention
that the file parameter is last.

You are right that testing that parameter is still on the TODO list...

> On line 193 you index fmt, and *then* you check the length.  When the
> number of tokens is too long, an IndexError is raised.

Yes, this is fixed in patch5 :)

> Could the 'date' field in the xover headers also be a DateTime rather
> than a string?  And :bytes and :lines be ints?  Or is that being to
> DWIMish?

We could indeed parse the xover/over fields, but the issue is that there
can be extra fields which therefore won't get parsed (or not all of them
if we decide to recognize some e.g. "Xref").

Also, people could want to get at the non-parsed representation
(especially for the date field), which means we would need two APIs with
two different levels of abstraction.

> If the date field isn't returned as a datetime, though, should there
> be a helper method for converting it, or should we just assume that
> email.utils mktime_tz and parsedate_tz?

parsedate or parsedate_tz is the right thing to use indeed (dates follow
the same format as in e-mails). As for the from and subject fields, the
provided decode_header() is very recommended (as the __main__ block
shows).

> Am I correct that the purpose of _NNTPBase is to make testing easier?

Yes.

> There seem to be only three lines of code in common between the file
> and the non-file case in _getlongresp.  I think it would be clearer to
> make them two different routines, or at least move the three common
> lines to the top and then do an if test on file is None (that is, put
> the simpler, non-file case first).

I'll take a look.

> My little nttp client script doesn't really test very much of the
> nttplib interface, nor is it very complex.  The change to xover
> considerably simplified that part of my script, so I very much like
> that change.  I was also able to drop my implementation of
> decode_header.  So overall this patch is significant improvement, IMO.

Thank you :)
msg117016 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-21 02:01
OK.  It doesn't seem all that likely that we'll be adding parameters, but you never know.  So I'm OK with file becoming keyword only.

As for the overview headers...if they aren't decoded through decode_header already, why are they unicode?  If they haven't been decoded, I'd expect them to be bytes.  Or does the standard really say that these headers, extracted from the bytes message data, will be valid utf-8?  (I realize that as things stand now with email5 you'll in fact *have* to decode them in order to process them with decode_header, but that doesn't change the fact that if they are unicode I would expect that they'd be *fully* decoded.)

But I take your point about the datatypes otherwise.  Also, when we get to email6 all headers should be turned into Header objects, and there will be a way to get a datetime out of a Date Header object.
msg117208 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-23 16:56
Yes, unescaped utf-8 is explicitly allowed in headers by RFC 3977:

   The content of a header SHOULD be in UTF-8.  However, if an
   implementation receives an article from elsewhere that uses octets in
   the range 128 to 255 in some other manner, it MAY pass it to a client
   or server without modification.  Therefore, implementations MUST be
   prepared to receive such headers, and data derived from them (e.g.,
   in the responses from the OVER command, Section 8.3), and MUST NOT
   assume that they are always UTF-8.

(I guess this means NNTP is now more modern than SMTP :-)).

Actually, there's a test with an actual utf-8 header in the unit tests.
(as for “implementations MUST be prepared to receive such headers, and data derived from them”, this is accounted for by using surrogateescape).
msg117331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-24 20:47
Here is a patch adding tests for article(), head() and body(), as well as for the "file" parameter.
As far as code is concerned, this should now be complete. Documentation remains to be updated :)
msg117377 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-25 16:15
And here is a patch integrating doc fixes and updates.
msg117618 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-29 15:05
I've committed the latest patch in r85111.
History
Date User Action Args
2010-09-29 15:05:01pitrousetstatus: open -> closed
resolution: fixed
messages: + msg117618

stage: patch review -> resolved
2010-09-25 16:16:06pitrousetfiles: + nntplib_cleanup7.patch

messages: + msg117377
2010-09-24 20:47:54pitrousetfiles: + nntplib_cleanup6.patch

messages: + msg117331
2010-09-23 16:56:46pitrousetmessages: + msg117208
2010-09-21 02:01:45r.david.murraysetmessages: + msg117016
2010-09-20 18:20:34pitrousetmessages: + msg116959
2010-09-20 02:08:53r.david.murraysetmessages: + msg116904
2010-09-20 02:04:43r.david.murraysetmessages: + msg116903
2010-09-19 21:09:36pitrousetmessages: + msg116886
2010-09-19 20:54:15ncoghlansetnosy: + ncoghlan
messages: + msg116884
2010-09-19 17:16:15pitrousetfiles: + nntplib_cleanup5.patch

messages: + msg116878
stage: patch review
2010-09-15 21:17:57pitrousetfiles: + nntplib_cleanup4.patch
2010-09-14 14:08:35pitrousetfiles: + nntplib_cleanup3.patch

messages: + msg116400
2010-09-14 11:10:12pitroulinkissue1926 dependencies
2010-09-13 20:47:13giampaolo.rodolasetnosy: + giampaolo.rodola
2010-09-13 20:39:22r.david.murraysetmessages: + msg116349
2010-09-13 19:27:24pitrousetfiles: + nntplib_cleanup2.patch

messages: + msg116342
stage: patch review -> (no value)
2010-09-07 18:59:19pitrousetnosy: + pitrou
messages: + msg115794
2010-07-24 00:48:46r.david.murraysettype: enhancement
messages: + msg111404
stage: patch review
2010-07-23 20:23:19Dmitry.Jemerovsetmessages: + msg111383
2010-07-23 17:56:47r.david.murraysetnosy: + r.david.murray
messages: + msg111369
2010-07-23 16:54:15Dmitry.Jemerovcreate