classification
Title: There is no way to get a list of available codecs
Type: behavior Stage:
Components: Library (Lib), Unicode Versions: Python 3.3, Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: dmi.baranov, doerwalter, ezio.melotti, lemburg, ncoghlan, paul.moore, vstinner
Priority: normal Keywords:

Created on 2013-04-30 09:51 by paul.moore, last changed 2013-05-02 16:00 by dmi.baranov.

Files
File name Uploaded Description
codecs_searchers.py dmi.baranov, 2013-05-02 13:45
Messages (13)
msg188147 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2013-04-30 09:51
The codecs module allows the user to register additional codecs, but does not offer a means of getting a list of registered codecs.

This is important, for example, in a tool to re-encode files. It is reasonable to expect that such a tool would offer a list of supported encodings, to assist the user. For example, the -l option of the iconv command.
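
Until such an API exists, a rough approximation of `iconv -l` can be built from the standard library alone. The sketch below is a workaround, not an official listing mechanism: it walks the modules of the `encodings` package and asks `codecs.lookup()` for each canonical name. The helper name `stdlib_codec_names` is made up for illustration, and codecs added by third parties through `codecs.register()` are not visible to it.

```python
import codecs
import encodings
import pkgutil


def stdlib_codec_names():
    """Return the sorted canonical names of codecs shipped in `encodings`."""
    names = set()
    for _finder, name, _is_pkg in pkgutil.iter_modules(encodings.__path__):
        # 'aliases' is the alias table and '_'-prefixed modules are private
        # helpers; neither is a codec module.
        if name == "aliases" or name.startswith("_"):
            continue
        try:
            names.add(codecs.lookup(name).name)
        except LookupError:
            # Platform-specific codecs (for example 'mbcs') may be missing.
            continue
    return sorted(names)
```

Printing `"\n".join(stdlib_codec_names())` gives one canonical name per line.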
msg188247 - (view) Author: Dmi Baranov (dmi.baranov) * Date: 2013-05-01 23:59
I think it's not possible while the codecs registry contains search callbacks (a stateless registry).
msg188252 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-05-02 06:46
On 02.05.2013 01:59, Dmi Baranov wrote:
> 
> Dmi Baranov added the comment:
> 
> I think its not possible while codecs registry contains search callbacks (stateless-registry)

It is possible: we'd just need to invent a way to ask search functions
for the list of available codecs, e.g. by moving from plain function
objects to CodecSearchFunction objects.
msg188267 - (view) Author: Dmi Baranov (dmi.baranov) * Date: 2013-05-02 13:45
I think "function" is a bit misleading. I suggest something like CodecsSearcher. Please look at the attached implementation (dirty code, just to start a discussion about interfaces, lazy caches, etc.).
msg188268 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-05-02 14:41
This is actually similar to the problem with getting the list of modules an importer provides (that is, we don't currently have an officially defined method in the importer protocol for that, although pkgutil.iter_importer_modules implicitly looks for an "iter_modules" method, due to the old import emulation used until Python 3.2).

I see three possibilities:

1. Use independent purpose specific protocols to get a list of entries out of these objects.

2. Create a new, common protocol for extracting lists of entries from search hooks like importers and codec search functions

3. Use the existing __iter__ protocol

I'm currently thinking option 3 might be a reasonable way forward. That is, if a codec search hook wants to provide a listing of available codecs, it can just define __iter__ in addition to __call__. Importers could define __iter__ in addition to the other methods in the importer API.

Thoughts?
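
For illustration, option 3 might look roughly like the sketch below. The class name `MappingSearchHook`, the helper `make_alias`, and the codec name `demo_utf8` are assumptions made up for this example; only `codecs.register()`, `codecs.lookup()` and `codecs.CodecInfo` are existing APIs. The hook stays a plain callable for lookups and additionally supports iteration for listing.

```python
import codecs


class MappingSearchHook:
    """Hypothetical search hook backed by a fixed set of CodecInfo objects."""

    def __init__(self, codec_infos):
        self._by_name = {info.name: info for info in codec_infos}

    def __call__(self, encoding):
        # Existing protocol: return a CodecInfo, or None to decline the name.
        return self._by_name.get(encoding)

    def __iter__(self):
        # Proposed addition: iterate over the codecs this hook can list.
        return iter(self._by_name.values())


def make_alias(name, target):
    """Build a CodecInfo that reuses an existing codec under another name."""
    base = codecs.lookup(target)
    return codecs.CodecInfo(base.encode, base.decode, name=name)


hook = MappingSearchHook([make_alias("demo_utf8", "utf-8")])
codecs.register(hook)  # any callable is accepted as a search function
```

After registration, `"abc".encode("demo_utf8")` goes through the hook, and `list(hook)` yields the single `CodecInfo` it knows about.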
msg188269 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2013-05-02 14:45
The point of using a function is to allow the function special handling of the encoding name, which goes beyond a simple map lookup. For example, you could do the following:

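   # appendcodec.py
   import codecs

   def search_function(encoding):
      # Handle any name of the form "append-<suffix>"; decline all others.
      if not encoding.startswith("append-"):
         return None

      suffix = encoding[len("append-"):]

      def encode(s, errors="strict"):
         # Return the encoded bytes and the number of characters consumed.
         return ((s + suffix).encode("utf-8", errors), len(s))

      def decode(b, errors="strict"):
         b = bytes(b)
         s = b.decode("utf-8", errors)
         # Guard against an empty suffix: s[:-0] would drop everything.
         if suffix and s.endswith(suffix):
            s = s[:-len(suffix)]
         # Return the decoded text and the number of bytes consumed.
         return (s, len(b))

      return codecs.CodecInfo(encode, decode, name=encoding)

   codecs.register(search_function)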

   $ python
   Python 3.3.1 (default, Apr 29 2013, 15:35:47)
   [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.24)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import appendcodec
   >>> 'foo'.encode('append-bar')
   b'foobar'
   >>> b'foobar'.decode('append-bar')
   'foo'

The search function can't return a list of codec names in this case, as the list is infinite.
msg188270 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-05-02 14:47
On 02.05.2013 16:41, Nick Coghlan wrote:
> 
> Nick Coghlan added the comment:
> 
> This is actually similar to the problem with getting the list of modules an importer provides (that is, we don't currently have an officially defined method in the importer protocol for that, although pkgutil.iter_importer_modules implicitly looks for an "iter_modules" method, due to the old import emulation used until Python 3.2).
> 
> I see three possibilities:
> 
> 1. Use independent purpose specific protocols to get a list of entries out of these objects.
> 
> 2. Create a new, common protocol for extracting lists of entries from search hooks like importers and codec search functions
> 
> 3. Use the existing __iter__ protocol
> 
> I'm currently thinking option 3 might be a reasonable way forward. That is, if a codec search hook wants to provide a listing of available codecs, it can just define __iter__ in addition to __call__. Importers could define __iter__ in addition to the other methods in the importer API.
> 
> Thoughts?

Too obscure :-)

Let the object expose a method: .list_codecs() -> returns a list
of supported codecs as CodecInfo objects.

We may also deprecate the .__call__() in favor of:
.find_codec(encoding) -> return codec implementing encoding.
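
A minimal sketch of that method-based protocol, for discussion only. The method names `.list_codecs()` and `.find_codec()` come from the proposal above; the class names `CodecSearcher` and `SingleCodecSearcher` and the codec name `demo_ascii` are assumptions made up for this example. `__call__` is kept as a thin wrapper so the object still works with today's `codecs.register()`.

```python
import codecs


class CodecSearcher:
    """Hypothetical base class for the proposed search object protocol."""

    def find_codec(self, encoding):
        """Return a CodecInfo for `encoding`, or None if it is not handled."""
        return None

    def list_codecs(self):
        """Return a list of CodecInfo objects for the supported codecs."""
        return []

    def __call__(self, encoding):
        # Delegating keeps the object usable as a plain search function.
        return self.find_codec(encoding)


class SingleCodecSearcher(CodecSearcher):
    """Searcher that provides exactly one codec."""

    def __init__(self, info):
        self._info = info

    def find_codec(self, encoding):
        return self._info if encoding == self._info.name else None

    def list_codecs(self):
        return [self._info]


ascii_info = codecs.lookup("ascii")
searcher = SingleCodecSearcher(
    codecs.CodecInfo(ascii_info.encode, ascii_info.decode, name="demo_ascii")
)
codecs.register(searcher)
```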
msg188271 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2013-05-02 14:51
@doerwalter In that case, I'd take the view that such a codec should simply not return anything. The discovery mechanism can be limited to returning only statically discoverable codec names (and it can be documented as such).

The original use case was to support functionality like iconv -l. Omitting edge cases like this is probably reasonable in that context.
msg188272 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-05-02 14:53
On 02.05.2013 16:45, Walter Dörwald wrote:
> ...
> The search function can't return a list of codec names in this case, as the list is infinite.

True.

The search object will have to be allowed to raise a
NotImplementedError or some other error/return value
to signal that the list of supported codecs is not available.

Note that the search object should only return a list of
supported canonical encoding names with .list_codecs(),
not all possible ones :-)
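
One way the NotImplementedError signal could be consumed, as a sketch under stated assumptions: `StaticSearcher`, `DynamicSearcher` and `listable_codecs` are hypothetical names, and `.list_codecs()` returns CodecInfo objects as in the original proposal. A searcher that builds codecs on demand raises, and the collecting function skips it.

```python
import codecs


class StaticSearcher:
    """Hypothetical searcher with a fixed, enumerable set of codecs."""

    def __init__(self, names):
        self._names = list(names)

    def list_codecs(self):
        return [codecs.lookup(name) for name in self._names]


class DynamicSearcher:
    """Hypothetical searcher for an open-ended family of codec names."""

    def list_codecs(self):
        raise NotImplementedError("codec names are generated on demand")


def listable_codecs(searchers):
    """Collect CodecInfo objects, skipping searchers that cannot list theirs."""
    found = []
    for searcher in searchers:
        try:
            found.extend(searcher.list_codecs())
        except NotImplementedError:
            continue
    return found
```

With `[StaticSearcher(["utf-8", "ascii"]), DynamicSearcher()]` as input, only the two statically known codecs are reported.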
msg188273 - (view) Author: Dmi Baranov (dmi.baranov) * Date: 2013-05-02 15:35
My +1 for __iter__ with a default of `raise StopIteration`; it is a more elegant solution than declaring and guaranteeing the interfaces (based on collections.abc.Callable and collections.abc.Iterator).

Paul, returning an iterable of CodecInfo objects gives much more flexibility than returning the names of codecs (what if you have a few codecs with the same name in different SearchObjects?).

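As I see it, you would like to use this as:

import codecs

encoded_data = b'abc'
# codecs.registered_codecs() is the proposed API; it does not exist yet.
for codec in codecs.registered_codecs():
 decoded_data, _ = codec.decode(encoded_data)
 if decoded_data == 'cba': # cracked
  break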

What about backward compatibility with the Lib/encodings modules (the initial item in interp->codec_search_path)? Can we skip anything in the search path if it does not support iteration?
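
Skipping non-iterable entries could look like the sketch below. Because the interpreter's codec search path is not exposed to Python code, the example takes an explicit `search_path` list; `registered_codecs` here is a stand-alone stand-in for the proposed `codecs.registered_codecs()`, and `ListingHook` and `legacy_search` are hypothetical names.

```python
import codecs


def registered_codecs(search_path):
    """Yield CodecInfo objects from the hooks that support iteration."""
    for hook in search_path:
        try:
            entries = iter(hook)
        except TypeError:
            # Plain search functions are not iterable; skip them silently.
            continue
        yield from entries


class ListingHook:
    """Hypothetical hook that is both callable and iterable."""

    def __init__(self, names):
        self._infos = [codecs.lookup(name) for name in names]

    def __call__(self, encoding):
        for info in self._infos:
            if info.name == encoding:
                return info
        return None

    def __iter__(self):
        return iter(self._infos)


def legacy_search(encoding):
    """Old-style search function with no listing support."""
    return None


search_path = [legacy_search, ListingHook(["ascii", "utf-8"])]
```

Iterating over `registered_codecs(search_path)` ignores `legacy_search` and yields the two codecs known to `ListingHook`.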
msg188274 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-05-02 15:40
On 02.05.2013 16:53, Marc-Andre Lemburg wrote:
> 
> Marc-Andre Lemburg added the comment:
> 
> On 02.05.2013 16:45, Walter Dörwald wrote:
>> ...
>> The search function can't return a list of codec names in this case, as the list is infinite.
> 
> True.
> 
> The search object will have to be allowed to raise a
> NotImplementedError or some other error/return value
> to signal that the list of supported codecs is not available.
> 
> Note that the search object should only return a list of
> supported canonical encoding names with .list_codecs(),
> not all possible ones :-)

Scratch that last sentence. Returning CodecInfo instances,
as I originally wrote, is a better way to go.
msg188275 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2013-05-02 15:43
On 2 May 2013 16:35, Dmi Baranov <report@bugs.python.org> wrote:

> Paul, result as iterable of CodecInfo objects is gives much more
> flexibility than the names of codecs (whats if you will have a few codecs
> with the same name in different SearchObjects?)

Works for me. My usage would be

def list_supported_codecs():
  for codec in codecs.registered_codecs():
    print(codec.name)
msg188277 - (view) Author: Dmi Baranov (dmi.baranov) * Date: 2013-05-02 16:00
Sorry for the additional noise: currently there is no way to change codec_search_path. Something similar to sys.path_hooks would be a great way to increase the level of customization (maybe I have a faster codec?).
History
Date User Action Args
2013-05-02 16:00:48 dmi.baranov set messages: + msg188277
2013-05-02 15:43:58 paul.moore set messages: + msg188275
2013-05-02 15:40:54 lemburg set messages: + msg188274
2013-05-02 15:35:50 dmi.baranov set messages: + msg188273
2013-05-02 14:53:14 lemburg set messages: + msg188272
2013-05-02 14:51:01 paul.moore set messages: + msg188271
2013-05-02 14:47:27 lemburg set messages: + msg188270
2013-05-02 14:45:58 doerwalter set nosy: + doerwalter; messages: + msg188269
2013-05-02 14:41:44 ncoghlan set messages: + msg188268
2013-05-02 13:45:45 dmi.baranov set files: + codecs_searchers.py; messages: + msg188267
2013-05-02 06:46:15 lemburg set messages: + msg188252
2013-05-01 23:59:18 dmi.baranov set nosy: + dmi.baranov; messages: + msg188247; components: + Library (Lib)
2013-04-30 10:21:39 vstinner set nosy: + vstinner
2013-04-30 09:56:17 ezio.melotti set nosy: + lemburg, ncoghlan
2013-04-30 09:51:57 paul.moore create