nntplib cleanup #53606

DmitryJemerov · 2010-07-23T16:54:16Z

BPO	9360
Nosy	@ncoghlan, @pitrou, @giampaolo, @bitdancer
Files	nntplib_cleanup.patch nntplib_cleanup2.patch nntplib_cleanup3.patch nntplib_cleanup4.patch nntplib_cleanup5.patch nntplib_cleanup6.patch nntplib_cleanup7.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-09-29.15:05:01.245>
created_at = <Date 2010-07-23.16:54:15.762>
labels = ['type-feature', 'library']
title = 'nntplib cleanup'
updated_at = <Date 2010-09-29.15:05:01.244>
user = 'https://bugs.python.org/DmitryJemerov'

bugs.python.org fields:

activity = <Date 2010-09-29.15:05:01.244>
actor = 'pitrou'
assignee = 'none'
closed = True
closed_date = <Date 2010-09-29.15:05:01.245>
closer = 'pitrou'
components = ['Library (Lib)']
creation = <Date 2010-07-23.16:54:15.762>
creator = 'Dmitry.Jemerov'
dependencies = []
files = ['18159', '18872', '18882', '18894', '18931', '19002', '19011']
hgrepos = []
issue_num = 9360
keywords = ['patch']
message_count = 19.0
messages = ['111364', '111369', '111383', '111404', '115794', '116342', '116349', '116400', '116878', '116884', '116886', '116903', '116904', '116959', '117016', '117208', '117331', '117377', '117618']
nosy_count = 5.0
nosy_names = ['ncoghlan', 'pitrou', 'giampaolo.rodola', 'r.david.murray', 'Dmitry.Jemerov']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue9360'
versions = ['Python 3.2']

DmitryJemerov · 2010-07-23T16:54:13Z

The patch performs an extensive cleanup of nntplib:

Change API methods to return strings instead of bytes. This breaks API compatibility, but given that the parameters need to be passed as strings and many of the returned values would need to be passed to other API methods, I consider the current API to be broken. I've discussed this with Brett at the EuroPython sprint, and he agrees.
Add tests.
Add pending deprecation warnings for xgtitle() and xpath() methods, which are not useful in modern environments.
Use named tuples for returned values where appopriate.
Modernize the implementation a little bit.
Clean up the docstrings.

bitdancer · 2010-07-23T17:56:47Z

I haven't looked at the patch in detail, but if my quick skim is correct, it looks like you are decoding using latin1. If you don't provide any way to get at the original bytes (which I presume you don't if you have "converted the API to return strings"), then it will be difficult and non-obvious to correctly decode messages using a content transfer encoding of 8bit and some other character set. (Of course, it is currently impossible to do this using the Py3k email package either, but I'm working on fixing that).

DmitryJemerov · 2010-07-23T20:23:19Z

This is an issue only for the actual article content, right? I'll be happy to extend the API to allow getting the original bytes of the article content, while keeping the rest of API (like group names) string-based.

bitdancer · 2010-07-24T00:48:46Z

Correct, if by 'article content' you mean what is returned by the article command. So I think it is necessary to be able to get bytes for two commands: article and body. Then for symmetry it should also be possible to get bytes from the head command. (It is actually possible for headers to include 8bit data, but it is a violation of the RFC and both rare outside of spam and impossible to handle sensibly in most cases...and it may be the case that most if not all nttpservers sanitize such headers, I don't have any experience in that area.)

On the other hand, this tells me that a possible use case I had been wondering about in email is in fact a valid concern: being able to parse a data stream that contains both strings and bytes :( That is, one might use nntplib in a mode where you pull the headers as text and the body as bytes and want to hand it off to email to build a Message object from those parts.

pitrou · 2010-09-07T18:59:19Z

Note that according to RFC 3977, “The character set for all NNTP commands is UTF-8”.

But it also says this about multi-line data blocks:

Note that texts using an encoding (such as UTF-16 or UTF-32) that may
contain the octets NUL, LF, or CR other than a CRLF pair cannot be
reliably conveyed in the above format (that is, they violate the MUST
requirement above). However, except when stated otherwise, this
specification does not require the content to be UTF-8, and therefore
(subject to that same requirement) it MAY include octets above and
below 128 mixed arbitrarily.

IMO, it should decode/encode by default using utf-8 (with the "surrogateescape" error handler for easy round-tripping with non-compliant servers), except for raw articles (bodies / envelopes) where bytes should be returned.

pitrou · 2010-09-13T19:27:20Z

Here is an updated patch, but it's a work in progress.

Since we are breaking compatibility anyway, I think a larger cleanup is deserved. For example:

remove old exception aliases
make return types consistent (for example, newgroups() should returned structured data as list() does)
use datetime.datetime objects instead of consuming and producing dates and times as 6-character strings

Additional useful features could be added:

automatic querying of CAPABILITIES on connection
a higher-level over() method able to parse the response as a list of header dicts (using LIST OVERVIEW.FMT), with appropriately decoded values (thanks email.header.decode_header())
...

Also, we should add an internal mock NNTP server to easily test features. Relying on gmane is nice for simple basic tests, but not much more.

Does it sound reasonable?

bitdancer · 2010-09-13T20:39:22Z

Assuming we can break backward compatibility, it sounds fine to me.

pitrou · 2010-09-14T14:08:25Z

Here is a further, still work-in-progress, patch. I'm posting it here so that interested people can give advice.

pitrou · 2010-09-19T17:16:03Z

Here is the current state of the cleanup work. Coding is roughly finished; most methods are unit-tested. The documentation remains to be done. Review welcome.

ncoghlan · 2010-09-19T20:54:15Z

To make the distinction easier to remember, would it help if the methods that are currently set to return bytes instead accepted the typical encoding+errors parameters, with parallel *b APIs to get at the raw bytes?

My concern with the current API is that there isn't a clear indicator during normal programming as to which APIs return strings and which return the raw bytes and hence require further decoding.

pitrou · 2010-09-19T21:09:37Z

To make the distinction easier to remember, would it help if the
methods that are currently set to return bytes instead accepted the
typical encoding+errors parameters, with parallel *b APIs to get at
the raw bytes?

Not really, no. For raw messages, which encoding+errors must be used
depends on the returned contents, it's not something the client can know
up front; moreover, different parts of the returned bytes may need
decoding using different encodings (for example if there are several
MIME parts to the message). People should use the email package to parse
the raw messages, as I assume they already do in 2.x.

Apart from raw message bodies, NNTP data has well-defined encodings and
that's why I can take and return unicode (although as stated, I also use
surrogateescape to be fault-tolerant in the face of broken servers).

My concern with the current API is that there isn't a clear indicator
during normal programming as to which APIs return strings and which
return the raw bytes and hence require further decoding.

That's a documentation issue. I haven't touched the docs yet :)

bitdancer · 2010-09-20T02:04:40Z

I tested this against my existing py3k nttp client code.

Why is it a good thing to make file a keyword only parameter? But given that it is, line 720 needs to use it as such, which omission means your missing at least one test :)

On line 193 you index fmt, and *then* you check the length. When the number of tokens is too long, an IndexError is raised. (Note that the offending overview line comes from the gmane group comp.lang.python.mime.devel, and the offense is an extra '' field on one of the records. No idea why it got added to that one record. Looks like it is message 701, artid <4111BBA9.3040302@harvee.org>)

Could the 'date' field in the xover headers also be a DateTime rather than a string? And :bytes and :lines be ints? Or is that being to DWIMish? If the date field isn't returned as a datetime, though, should there be a helper method for converting it, or should we just assume that email.utils mktime_tz and parsedate_tz?

Am I correct that the purpose of _NNTPBase is to make testing easier?

There seem to be only three lines of code in common between the file and the non-file case in _getlongresp. I think it would be clearer to make them two different routines, or at least move the three common lines to the top and then do an if test on file is None (that is, put the simpler, non-file case first).

My little nttp client script doesn't really test very much of the nttplib interface, nor is it very complex. The change to xover considerably simplified that part of my script, so I very much like that change. I was also able to drop my implementation of decode_header. So overall this patch is significant improvement, IMO.

I haven't given the code as thorough a review as might be optimal, but I certainly think you are headed in the right direction.

bitdancer · 2010-09-20T02:08:53Z

Gah. I reviewed patch4, didn't noticed today's update :(

pitrou · 2010-09-20T18:20:33Z

Gah. I reviewed patch4, didn't noticed today's update :(

Oops, I'm sorry for that. It turns out most of your comments are useful anyway.

Why is it a good thing to make file a keyword only parameter?

I was thinking that if we want to add other function parameters, it will
be possible to do so while keeping the (mnemonically useful) convention
that the file parameter is last.

You are right that testing that parameter is still on the TODO list...

On line 193 you index fmt, and *then* you check the length. When the
number of tokens is too long, an IndexError is raised.

Yes, this is fixed in patch5 :)

Could the 'date' field in the xover headers also be a DateTime rather
than a string? And :bytes and :lines be ints? Or is that being to
DWIMish?

We could indeed parse the xover/over fields, but the issue is that there
can be extra fields which therefore won't get parsed (or not all of them
if we decide to recognize some e.g. "Xref").

Also, people could want to get at the non-parsed representation
(especially for the date field), which means we would need two APIs with
two different levels of abstraction.

If the date field isn't returned as a datetime, though, should there
be a helper method for converting it, or should we just assume that
email.utils mktime_tz and parsedate_tz?

parsedate or parsedate_tz is the right thing to use indeed (dates follow
the same format as in e-mails). As for the from and subject fields, the
provided decode_header() is very recommended (as the __main__ block
shows).

Am I correct that the purpose of _NNTPBase is to make testing easier?

Yes.

There seem to be only three lines of code in common between the file
and the non-file case in _getlongresp. I think it would be clearer to
make them two different routines, or at least move the three common
lines to the top and then do an if test on file is None (that is, put
the simpler, non-file case first).

I'll take a look.

My little nttp client script doesn't really test very much of the
nttplib interface, nor is it very complex. The change to xover
considerably simplified that part of my script, so I very much like
that change. I was also able to drop my implementation of
decode_header. So overall this patch is significant improvement, IMO.

Thank you :)

bitdancer · 2010-09-21T02:01:45Z

OK. It doesn't seem all that likely that we'll be adding parameters, but you never know. So I'm OK with file becoming keyword only.

As for the overview headers...if they aren't decoded through decode_header already, why are they unicode? If they haven't been decoded, I'd expect them to be bytes. Or does the standard really say that these headers, extracted from the bytes message data, will be valid utf-8? (I realize that as things stand now with email5 you'll in fact *have* to decode them in order to process them with decode_header, but that doesn't change the fact that if they are unicode I would expect that they'd be *fully* decoded.)

But I take your point about the datatypes otherwise. Also, when we get to email6 all headers should be turned into Header objects, and there will be a way to get a datetime out of a Date Header object.

pitrou · 2010-09-23T16:56:46Z

Yes, unescaped utf-8 is explicitly allowed in headers by RFC 3977:

The content of a header SHOULD be in UTF-8. However, if an
implementation receives an article from elsewhere that uses octets in
the range 128 to 255 in some other manner, it MAY pass it to a client
or server without modification. Therefore, implementations MUST be
prepared to receive such headers, and data derived from them (e.g.,
in the responses from the OVER command, Section 8.3), and MUST NOT
assume that they are always UTF-8.

(I guess this means NNTP is now more modern than SMTP :-)).

Actually, there's a test with an actual utf-8 header in the unit tests.
(as for “implementations MUST be prepared to receive such headers, and data derived from them”, this is accounted for by using surrogateescape).

pitrou · 2010-09-24T20:47:35Z

Here is a patch adding tests for article(), head() and body(), as well as for the "file" parameter.
As far as code is concerned, this should now be complete. Documentation remains to be updated :)

pitrou · 2010-09-25T16:15:10Z

And here is a patch integrating doc fixes and updates.

pitrou · 2010-09-29T15:05:01Z

I've committed the latest patch in r85111.

DmitryJemerov mannequin added the stdlib Python modules in the Lib dir label Jul 23, 2010

bitdancer added the type-feature A feature request or enhancement label Jul 24, 2010

pitrou closed this as completed Sep 29, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nntplib cleanup #53606

nntplib cleanup #53606

DmitryJemerov mannequin commented Jul 23, 2010

DmitryJemerov mannequin commented Jul 23, 2010

bitdancer commented Jul 23, 2010

DmitryJemerov mannequin commented Jul 23, 2010

bitdancer commented Jul 24, 2010

pitrou commented Sep 7, 2010

pitrou commented Sep 13, 2010

bitdancer commented Sep 13, 2010

pitrou commented Sep 14, 2010

pitrou commented Sep 19, 2010

ncoghlan commented Sep 19, 2010

pitrou commented Sep 19, 2010

bitdancer commented Sep 20, 2010

bitdancer commented Sep 20, 2010

pitrou commented Sep 20, 2010

bitdancer commented Sep 21, 2010

pitrou commented Sep 23, 2010

pitrou commented Sep 24, 2010

pitrou commented Sep 25, 2010

pitrou commented Sep 29, 2010

nntplib cleanup #53606

nntplib cleanup #53606

Comments

DmitryJemerov mannequin commented Jul 23, 2010

DmitryJemerov mannequin commented Jul 23, 2010

bitdancer commented Jul 23, 2010

DmitryJemerov mannequin commented Jul 23, 2010

bitdancer commented Jul 24, 2010

pitrou commented Sep 7, 2010

pitrou commented Sep 13, 2010

bitdancer commented Sep 13, 2010

pitrou commented Sep 14, 2010

pitrou commented Sep 19, 2010

ncoghlan commented Sep 19, 2010

pitrou commented Sep 19, 2010

bitdancer commented Sep 20, 2010

bitdancer commented Sep 20, 2010

pitrou commented Sep 20, 2010

bitdancer commented Sep 21, 2010

pitrou commented Sep 23, 2010

pitrou commented Sep 24, 2010

pitrou commented Sep 25, 2010

pitrou commented Sep 29, 2010