Issue 9360: nntplib cleanup

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53606

classification

Title:	nntplib cleanup
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Dmitry.Jemerov, giampaolo.rodola, ncoghlan, pitrou, r.david.murray
Priority:	normal	Keywords:	patch

Created on 2010-07-23 16:54 by Dmitry.Jemerov, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
nntplib_cleanup.patch	Dmitry.Jemerov, 2010-07-23 16:54
nntplib_cleanup2.patch	pitrou, 2010-09-13 19:27
nntplib_cleanup3.patch	pitrou, 2010-09-14 14:08
nntplib_cleanup4.patch	pitrou, 2010-09-15 21:17
nntplib_cleanup5.patch	pitrou, 2010-09-19 17:16
nntplib_cleanup6.patch	pitrou, 2010-09-24 20:47
nntplib_cleanup7.patch	pitrou, 2010-09-25 16:15

Messages (19)
msg111364 - (view)	Author: Dmitry Jemerov (Dmitry.Jemerov)	Date: 2010-07-23 16:54
The patch performs an extensive cleanup of nntplib: - Change API methods to return strings instead of bytes. This breaks API compatibility, but given that the parameters need to be passed as strings and many of the returned values would need to be passed to other API methods, I consider the current API to be broken. I've discussed this with Brett at the EuroPython sprint, and he agrees. - Add tests. - Add pending deprecation warnings for xgtitle() and xpath() methods, which are not useful in modern environments. - Use named tuples for returned values where appopriate. - Modernize the implementation a little bit. - Clean up the docstrings.
msg111369 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-07-23 17:56
I haven't looked at the patch in detail, but if my quick skim is correct, it looks like you are decoding using latin1. If you don't provide any way to get at the original bytes (which I presume you don't if you have "converted the API to return strings"), then it will be difficult and non-obvious to correctly decode messages using a content transfer encoding of 8bit and some other character set. (Of course, it is currently impossible to do this using the Py3k email package either, but I'm working on fixing that).
msg111383 - (view)	Author: Dmitry Jemerov (Dmitry.Jemerov)	Date: 2010-07-23 20:23
This is an issue only for the actual article content, right? I'll be happy to extend the API to allow getting the original bytes of the article content, while keeping the rest of API (like group names) string-based.
msg111404 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-07-24 00:48
Correct, if by 'article content' you mean what is returned by the article command. So I think it is necessary to be able to get bytes for two commands: article and body. Then for symmetry it should also be possible to get bytes from the head command. (It is actually possible for headers to include 8bit data, but it is a violation of the RFC and both rare outside of spam and impossible to handle sensibly in most cases...and it may be the case that most if not all nttpservers sanitize such headers, I don't have any experience in that area.) On the other hand, this tells me that a possible use case I had been wondering about in email is in fact a valid concern: being able to parse a data stream that contains both strings and bytes :( That is, one might use nntplib in a mode where you pull the headers as text and the body as bytes and want to hand it off to email to build a Message object from those parts.
msg115794 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-07 18:59
Note that according to RFC 3977, “The character set for all NNTP commands is UTF-8”. But it also says this about multi-line data blocks: Note that texts using an encoding (such as UTF-16 or UTF-32) that may contain the octets NUL, LF, or CR other than a CRLF pair cannot be reliably conveyed in the above format (that is, they violate the MUST requirement above). However, except when stated otherwise, this specification does not require the content to be UTF-8, and therefore (subject to that same requirement) it MAY include octets above and below 128 mixed arbitrarily. IMO, it should decode/encode by default using utf-8 (with the "surrogateescape" error handler for easy round-tripping with non-compliant servers), except for raw articles (bodies / envelopes) where bytes should be returned.
msg116342 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-13 19:27
Here is an updated patch, but it's a work in progress. Since we are breaking compatibility anyway, I think a larger cleanup is deserved. For example: - remove old exception aliases - make return types consistent (for example, newgroups() should returned structured data as list() does) - use datetime.datetime objects instead of consuming and producing dates and times as 6-character strings Additional useful features could be added: - automatic querying of CAPABILITIES on connection - a higher-level over() method able to parse the response as a list of header dicts (using LIST OVERVIEW.FMT), with appropriately decoded values (thanks email.header.decode_header()) - ... Also, we should add an internal mock NNTP server to easily test features. Relying on gmane is nice for simple basic tests, but not much more. Does it sound reasonable?
msg116349 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-13 20:39
Assuming we can break backward compatibility, it sounds fine to me.
msg116400 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-14 14:08
Here is a further, still work-in-progress, patch. I'm posting it here so that interested people can give advice.
msg116878 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-19 17:16
Here is the current state of the cleanup work. Coding is roughly finished; most methods are unit-tested. The documentation remains to be done. Review welcome.
msg116884 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2010-09-19 20:54
To make the distinction easier to remember, would it help if the methods that are currently set to return bytes instead accepted the typical encoding+errors parameters, with parallel *b APIs to get at the raw bytes? My concern with the current API is that there isn't a clear indicator during normal programming as to which APIs return strings and which return the raw bytes and hence require further decoding.
msg116886 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-19 21:09
> To make the distinction easier to remember, would it help if the > methods that are currently set to return bytes instead accepted the > typical encoding+errors parameters, with parallel *b APIs to get at > the raw bytes? Not really, no. For raw messages, which encoding+errors must be used depends on the returned contents, it's not something the client can know up front; moreover, different parts of the returned bytes may need decoding using different encodings (for example if there are several MIME parts to the message). People should use the email package to parse the raw messages, as I assume they already do in 2.x. Apart from raw message bodies, NNTP data has well-defined encodings and that's why I can take and return unicode (although as stated, I also use surrogateescape to be fault-tolerant in the face of broken servers). > My concern with the current API is that there isn't a clear indicator > during normal programming as to which APIs return strings and which > return the raw bytes and hence require further decoding. That's a documentation issue. I haven't touched the docs yet :)
msg116903 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-20 02:04
I tested this against my existing py3k nttp client code. Why is it a good thing to make file a keyword only parameter? But given that it is, line 720 needs to use it as such, which omission means your missing at least one test :) On line 193 you index fmt, and then you check the length. When the number of tokens is too long, an IndexError is raised. (Note that the offending overview line comes from the gmane group comp.lang.python.mime.devel, and the offense is an extra '' field on one of the records. No idea why it got added to that one record. Looks like it is message 701, artid <4111BBA9.3040302@harvee.org>) Could the 'date' field in the xover headers also be a DateTime rather than a string? And :bytes and :lines be ints? Or is that being to DWIMish? If the date field isn't returned as a datetime, though, should there be a helper method for converting it, or should we just assume that email.utils mktime_tz and parsedate_tz? Am I correct that the purpose of _NNTPBase is to make testing easier? There seem to be only three lines of code in common between the file and the non-file case in _getlongresp. I think it would be clearer to make them two different routines, or at least move the three common lines to the top and then do an if test on file is None (that is, put the simpler, non-file case first). My little nttp client script doesn't really test very much of the nttplib interface, nor is it very complex. The change to xover considerably simplified that part of my script, so I very much like that change. I was also able to drop my implementation of decode_header. So overall this patch is significant improvement, IMO. I haven't given the code as thorough a review as might be optimal, but I certainly think you are headed in the right direction.
msg116904 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-20 02:08
Gah. I reviewed patch4, didn't noticed today's update :(
msg116959 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-20 18:20
> Gah. I reviewed patch4, didn't noticed today's update :( Oops, I'm sorry for that. It turns out most of your comments are useful anyway. > Why is it a good thing to make file a keyword only parameter? I was thinking that if we want to add other function parameters, it will be possible to do so while keeping the (mnemonically useful) convention that the file parameter is last. You are right that testing that parameter is still on the TODO list... > On line 193 you index fmt, and then you check the length. When the > number of tokens is too long, an IndexError is raised. Yes, this is fixed in patch5 :) > Could the 'date' field in the xover headers also be a DateTime rather > than a string? And :bytes and :lines be ints? Or is that being to > DWIMish? We could indeed parse the xover/over fields, but the issue is that there can be extra fields which therefore won't get parsed (or not all of them if we decide to recognize some e.g. "Xref"). Also, people could want to get at the non-parsed representation (especially for the date field), which means we would need two APIs with two different levels of abstraction. > If the date field isn't returned as a datetime, though, should there > be a helper method for converting it, or should we just assume that > email.utils mktime_tz and parsedate_tz? parsedate or parsedate_tz is the right thing to use indeed (dates follow the same format as in e-mails). As for the from and subject fields, the provided decode_header() is very recommended (as the __main__ block shows). > Am I correct that the purpose of _NNTPBase is to make testing easier? Yes. > There seem to be only three lines of code in common between the file > and the non-file case in _getlongresp. I think it would be clearer to > make them two different routines, or at least move the three common > lines to the top and then do an if test on file is None (that is, put > the simpler, non-file case first). I'll take a look. > My little nttp client script doesn't really test very much of the > nttplib interface, nor is it very complex. The change to xover > considerably simplified that part of my script, so I very much like > that change. I was also able to drop my implementation of > decode_header. So overall this patch is significant improvement, IMO. Thank you :)
msg117016 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-09-21 02:01
OK. It doesn't seem all that likely that we'll be adding parameters, but you never know. So I'm OK with file becoming keyword only. As for the overview headers...if they aren't decoded through decode_header already, why are they unicode? If they haven't been decoded, I'd expect them to be bytes. Or does the standard really say that these headers, extracted from the bytes message data, will be valid utf-8? (I realize that as things stand now with email5 you'll in fact have to decode them in order to process them with decode_header, but that doesn't change the fact that if they are unicode I would expect that they'd be fully decoded.) But I take your point about the datatypes otherwise. Also, when we get to email6 all headers should be turned into Header objects, and there will be a way to get a datetime out of a Date Header object.
msg117208 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-23 16:56
Yes, unescaped utf-8 is explicitly allowed in headers by RFC 3977: The content of a header SHOULD be in UTF-8. However, if an implementation receives an article from elsewhere that uses octets in the range 128 to 255 in some other manner, it MAY pass it to a client or server without modification. Therefore, implementations MUST be prepared to receive such headers, and data derived from them (e.g., in the responses from the OVER command, Section 8.3), and MUST NOT assume that they are always UTF-8. (I guess this means NNTP is now more modern than SMTP :-)). Actually, there's a test with an actual utf-8 header in the unit tests. (as for “implementations MUST be prepared to receive such headers, and data derived from them”, this is accounted for by using surrogateescape).
msg117331 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-24 20:47
Here is a patch adding tests for article(), head() and body(), as well as for the "file" parameter. As far as code is concerned, this should now be complete. Documentation remains to be updated :)
msg117377 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-25 16:15
And here is a patch integrating doc fixes and updates.
msg117618 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-29 15:05
I've committed the latest patch in r85111.

History
Date	User	Action	Args
2022-04-11 14:57:04	admin	set	github: 53606
2010-09-29 15:05:01	pitrou	set	status: open -> closed resolution: fixed messages: + msg117618 stage: patch review -> resolved
2010-09-25 16:16:06	pitrou	set	files: + nntplib_cleanup7.patch messages: + msg117377
2010-09-24 20:47:54	pitrou	set	files: + nntplib_cleanup6.patch messages: + msg117331
2010-09-23 16:56:46	pitrou	set	messages: + msg117208
2010-09-21 02:01:45	r.david.murray	set	messages: + msg117016
2010-09-20 18:20:34	pitrou	set	messages: + msg116959
2010-09-20 02:08:53	r.david.murray	set	messages: + msg116904
2010-09-20 02:04:43	r.david.murray	set	messages: + msg116903
2010-09-19 21:09:36	pitrou	set	messages: + msg116886
2010-09-19 20:54:15	ncoghlan	set	nosy: + ncoghlan messages: + msg116884
2010-09-19 17:16:15	pitrou	set	files: + nntplib_cleanup5.patch messages: + msg116878 stage: patch review
2010-09-15 21:17:57	pitrou	set	files: + nntplib_cleanup4.patch
2010-09-14 14:08:35	pitrou	set	files: + nntplib_cleanup3.patch messages: + msg116400
2010-09-14 11:10:12	pitrou	link	issue1926 dependencies
2010-09-13 20:47:13	giampaolo.rodola	set	nosy: + giampaolo.rodola
2010-09-13 20:39:22	r.david.murray	set	messages: + msg116349
2010-09-13 19:27:24	pitrou	set	files: + nntplib_cleanup2.patch messages: + msg116342 stage: patch review -> (no value)
2010-09-07 18:59:19	pitrou	set	nosy: + pitrou messages: + msg115794
2010-07-24 00:48:46	r.david.murray	set	type: enhancement messages: + msg111404 stage: patch review
2010-07-23 20:23:19	Dmitry.Jemerov	set	messages: + msg111383
2010-07-23 17:56:47	r.david.murray	set	nosy: + r.david.murray messages: + msg111369
2010-07-23 16:54:15	Dmitry.Jemerov	create