classification
Title: Provide mimetypes.sniff API as stdlib
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: corona10 Nosy List: Jim.Jewett, YoSTEALTH, berker.peksag, corona10, gvanrossum, serhiy.storchaka, xtreak
Priority: normal Keywords: patch

Created on 2020-06-02 06:42 by corona10, last changed 2020-07-30 14:25 by corona10.

Pull Requests
URL Status Linked Edit
PR 20720 open corona10, 2020-06-08 13:42
Messages (12)
msg370591 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-06-02 06:42
The current mimetypes.guess_type API guesses file types based on file extensions.
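For reference, a minimal sketch of the existing extension-based behaviour, using only the documented `mimetypes.guess_type` call. The file names are made up for illustration and the files do not need to exist.

```python
import mimetypes

# guess_type looks only at the name; the file is never opened.
mime, encoding = mimetypes.guess_type("report.pdf")
print(mime, encoding)

# A compression suffix is reported separately as the encoding.
mime_gz, encoding_gz = mimetypes.guess_type("archive.tar.gz")
print(mime_gz, encoding_gz)
```

Because only the name is inspected, renaming a file changes the result even though the contents stay the same.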

However, there is a more accurate method, which is called sniffing.

Some languages, such as Go (https://golang.org/pkg/net/http/#DetectContentType), provide a MIME sniffing API, implemented according to the standard published at https://mimesniff.spec.whatwg.org/
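As a rough illustration of what content sniffing means, here is a minimal sketch that matches a few well-known magic numbers at the start of the data. The name `sniff`, its signature, and the short signature table are assumptions made for this example; this is not the API or the implementation proposed in the PR, and the WHATWG algorithm covers many more cases.

```python
# Hypothetical helper for illustration only; not the proposed stdlib API.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def sniff(data: bytes, default: str = "application/octet-stream") -> str:
    """Return a MIME type guessed from the leading bytes of data."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # No known signature matched; fall back to a generic binary type.
    return default


print(sniff(b"%PDF-1.7\n"))
print(sniff(b"hello"))
```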


I have a sample implementation of this:
https://github.com/corona10/mimesniff/blob/master/mimesniff/mimesniff.py
But the API will be changed to fit the mimetypes API.


So I would like to provide a mimetypes.sniff API rather than a new stdlib package such as mimesniff.
msg370602 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-06-02 10:41
I am pinging some of the core developers who have recently worked on this module.
Sorry if this topic is not interesting to you :(

I would like to hear your opinions on providing this API in the stdlib.
There are three points I would like to make with this proposal.

1. It will provide type detection in a more precise way.
2. There is a good standard (WHATWG) that defines which formats will be supported.
3. I am eager to maintain this module as an active core developer.
msg374150 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-07-23 22:09
This looks like a useful addition. I hope someone will take up the review!
msg374387 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-07-27 15:31
> This looks like a useful addition. I hope someone will take up the review!

Thank you, Guido!
I also think that this API would be a good addition to the standard library and that it would be very useful!

I hope that someone will take an interest in this issue ;)
msg374438 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2020-07-27 23:15
The standard itself says that it only applies to content served over http; if the content is retrieved by ftp or from a file system, then you should trust that.  I don't notice that in the code you pointed to.

So maybe filetype is the right answer if the data isn't coming over the network?  For WHATWG demonstration code, it is reasonable to assume that, but in Python, at a minimum, you should document the assumption prominently in the docs and docstring.
msg374439 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-07-27 23:21
Whether the data was retrieved over a network has nothing to do with it.

There are complementary ways of guessing what data you are working with -- guess based on the filename extension or sniff based on the contents of the file (or downloaded data).

There are a zillion reasons why the filename could be a lie -- e.g. a user could pick the wrong extension, or rename a file, or a tool could save a file using the wrong extension or no extension at all. Then again sometimes the contents of the file might not be enough, e.g.
```
foo() // bar
```
is both valid Python and valid JavaScript. :-)
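A small sketch of how the two kinds of guess can complement each other. The helper names `sniff_pdf_or_png` and `describe` are hypothetical and exist only for this example; `mimetypes.guess_type` is the existing stdlib call.

```python
import mimetypes


def sniff_pdf_or_png(data: bytes):
    """Hypothetical, tiny content sniffer used only for this example."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return None  # the contents alone are not enough


def describe(filename: str, data: bytes):
    """Return (guess from the name, guess from the contents)."""
    by_name, _ = mimetypes.guess_type(filename)
    by_content = sniff_pdf_or_png(data)
    return by_name, by_content


# A PDF saved with the wrong extension: the two guesses disagree.
print(describe("scan.png", b"%PDF-1.4\n"))

# Source code has no magic number, so only the name helps here.
print(describe("script.py", b"foo() // bar\n"))
```

What to do when the two guesses disagree is left to the caller.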
msg374467 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2020-07-28 05:56
There are a zillion reasons a filename could be wrong -- but the standard
says to trust the filesystem.  So if it sniffs based on contents, it isn't
quite following the standard.  It is probably still a useful tool, but it
won't be the One Right Way, and it isn't even clear that it should replace
current heuristics.

msg374471 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-07-28 07:07
I think that both functions for detecting file type, by name and by content, are useful in different circumstances. We have similar more specific detection functions sndhdr and imghdr.

But I am not sure whether it should be a part of the mimetypes module or separate module. Should it use sndhdr and imghdr modules for audio and image types? Should it be a wrapper to the libmagic library (https://linux.die.net/man/3/libmagic) or reimplement it in Python?

If we add the code for detecting the file type based on algorithms used in browsers, should we not also add the code for detecting the text encoding based on other algorithms used in browsers, or is that too much?
msg374509 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-07-28 16:35
> I think that both functions for detecting file type, by name and by content

I think so too. MIME sniffing is not meant to replace the method based on the file extension. Both APIs should be provided.

> should we not also add the code for detecting the text encoding based on other algorithms used in browsers

I have already added code for text encoding detection based on the WHATWG standard, so if this API lands, yes, text encoding detection will be supported (e.g. utf-16be).
IMHO, there would be use cases, since Python is used a lot today for text data handling (for example crawling and data pre-processing).
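One simple form of encoding detection is checking for a byte order mark. The sketch below is an assumption about what such a check could look like, not the code in the PR; `sniff_bom` is a hypothetical name, while the `codecs.BOM_*` constants are real stdlib values.

```python
import codecs

# Byte order marks and the encodings they indicate. None of these three
# prefixes is a prefix of another, so the order of the checks is free.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_bom(sample))
```

Data without a BOM returns `None` here; a real detector would need further heuristics for that case.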

One might ask whether a standard written for browsers is appropriate for a Python stdlib module.
My answer is that the WHATWG standard could be one of the best standards to follow if we decide to provide MIME sniffing.

The standard handles MIME types that are widely used in the real world, not only by browsers but also by HTTP servers and other software.

One of the big burdens of maintaining MIME type detection is deciding how many MIME types should be supported.
Luckily, WHATWG can be a strong standard on which to base that decision.
msg374511 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2020-07-28 17:04
When the standard says "trust the filename" it is talking to the
application, not to the sniffing library. The library should provide the
tool for applications to follow the standard, but I don't see a reason why
we would have to enforce how applications call the library. Since we agree
there are use cases beyond what the standard has thought of for combining
sniffing the data and guessing based on the filename, we should make that
possible, the standard's exhortations notwithstanding.

Python is not a browser; a browser could be an application written in
Python. Python therefore should provide tools that are useful to implement
a browser.
msg374593 - (view) Author: (YoSTEALTH) * Date: 2020-07-29 23:13
The start and end positions of the signature must be accounted for; not all file signatures start at ``0`` or at ``< 512`` bytes.

Rather than writing all the signatures manually, it might be a good idea to use an already collected resource like https://www.garykessler.net/library/file_sigs.html
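The point about signature positions can be sketched with a table that stores an offset next to each magic number. The function name `sniff_at_offsets` and the table are hypothetical, and the MIME type strings are illustrative labels rather than values taken from the WHATWG standard.

```python
# (offset, magic bytes, MIME type). Entries are illustrative only.
_OFFSET_SIGNATURES = [
    (0, b"%PDF-", "application/pdf"),
    (257, b"ustar", "application/x-tar"),  # magic field of a POSIX tar header
    (0x8001, b"CD001", "application/x-iso9660-image"),  # far beyond 512 bytes
]


def sniff_at_offsets(data: bytes):
    """Return the first MIME type whose magic bytes appear at its offset."""
    for offset, magic, mime in _OFFSET_SIGNATURES:
        # Slicing past the end yields shorter bytes, so short input is safe.
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


tar_header = bytes(257) + b"ustar\x00" + bytes(250)
print(sniff_at_offsets(tar_header))
```

A sniffer that reads only a fixed-size prefix of the data would miss the last entry, which is the trade-off raised above.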
msg374615 - (view) Author: Dong-hee Na (corona10) * (Python committer) Date: 2020-07-30 14:25
https://www.garykessler.net/library/file_sigs.html looks like a good resource for this kind of API.

However, I would like to follow a well-known standard from WHATWG, W3C, etc.
History
Date User Action Args
2020-07-30 14:25:30 corona10 set messages: + msg374615
2020-07-29 23:13:21 YoSTEALTH set nosy: + YoSTEALTH; messages: + msg374593
2020-07-28 17:04:18 gvanrossum set messages: + msg374511
2020-07-28 16:35:08 corona10 set messages: + msg374509
2020-07-28 07:07:29 serhiy.storchaka set messages: + msg374471
2020-07-28 05:56:06 Jim.Jewett set messages: + msg374467
2020-07-27 23:21:48 gvanrossum set messages: + msg374439
2020-07-27 23:15:18 Jim.Jewett set nosy: + Jim.Jewett; messages: + msg374438
2020-07-27 15:31:42 corona10 set messages: + msg374387
2020-07-23 22:09:32 gvanrossum set nosy: + gvanrossum; messages: + msg374150
2020-06-08 13:42:45 corona10 set keywords: + patch; stage: patch review; pull_requests: + pull_request19929
2020-06-02 10:41:15 corona10 set nosy: + berker.peksag, serhiy.storchaka, xtreak; messages: + msg370602
2020-06-02 10:05:55 vstinner set nosy: - vstinner
2020-06-02 06:43:52 corona10 set title: Provide mimetypes.sniff API -> Provide mimetypes.sniff API as stdlib
2020-06-02 06:43:44 corona10 set title: Implement mimetypes.sniff -> Provide mimetypes.sniff API
2020-06-02 06:42:50 corona10 set assignee: corona10
2020-06-02 06:42:45 corona10 create