msg370591 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-06-02 06:42 |
The current mimetypes.guess_type API guesses file types based on file extensions.
However, there is a more accurate method, which is called sniffing.
Some languages, like Go (https://golang.org/pkg/net/http/#DetectContentType), provide a MIME sniffing API, and the method is implemented according to a standard published at https://mimesniff.spec.whatwg.org/
I have a sample implementation of this:
https://github.com/corona10/mimesniff/blob/master/mimesniff/mimesniff.py
But the API interface will be changed to fit the mimetypes API.
So I would like to provide a mimetypes.sniff API rather than a new stdlib package like mimesniff.
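To illustrate the difference between the two approaches, here is a minimal sketch. The `sniff` function and its tiny `_SIGNATURES` table are hypothetical and written only for this illustration; they are not the linked implementation and not a proposed stdlib signature. The byte patterns are the well-known magic numbers for these formats.

```python
import mimetypes
from typing import Optional

# Hypothetical, heavily shortened signature table (magic bytes -> MIME type).
# A real sniffer following the WHATWG standard covers many more patterns.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
)


def sniff(header: bytes) -> Optional[str]:
    """Return a MIME type guessed from the leading bytes, or None."""
    for magic, mime in _SIGNATURES:
        if header.startswith(magic):
            return mime
    return None


# PNG content stored under a misleading ".jpg" name.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

by_name = mimetypes.guess_type("photo.jpg")[0]  # looks only at the extension
by_content = sniff(png_bytes)                   # looks only at the bytes
print(by_name, by_content)
```

The extension-based guess never reads the data, so it cannot notice that the file is actually a PNG.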
|
msg370602 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-06-02 10:41 |
I am pinging some of the core developers who recently worked on this module.
Sorry if this topic is not interesting to you :(
I would like to hear what you think about providing this API in the stdlib.
There are three things I'd like to highlight in this proposal:
1. It provides detection in a more precise way.
2. There is a good standard (WHATWG) that defines which formats are supported.
3. I am eager to maintain this module as an active core developer.
|
msg374150 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2020-07-23 22:09 |
This looks like a useful addition. I hope someone will take up the review!
|
msg374387 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-07-27 15:31 |
> This looks like a useful addition. I hope someone will take up the review!
Thank you, Guido!
I also think that this API would be a good addition to the standard library and that it would be very useful!
I hope that someone will take an interest in this issue ;)
|
msg374438 - (view) |
Author: Jim Jewett (Jim.Jewett) * |
Date: 2020-07-27 23:15 |
The standard itself says that it only applies to content served over http; if the content is retrieved by ftp or from a file system, then you should trust that. I don't notice that in the code you pointed to.
So maybe filetype is the right answer if the data isn't coming over the network? For whatwg demonstration code, it is reasonable to assume that, but in python -- at a minimum, you should document the assumption prominently in the docs and docstring.
|
msg374439 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2020-07-27 23:21 |
Whether the data was retrieved over a network has nothing to do with it.
There are complementary ways of guessing what data you are working with -- guess based on the filename extension or sniff based on the contents of the file (or downloaded data).
There are a zillion reasons why the filename could be a lie -- e.g. a user could pick the wrong extension, or rename a file, or a tool could save a file using the wrong extension or no extension at all. Then again sometimes the contents of the file might not be enough, e.g.
```
foo() // bar
```
is both valid Python and valid JavaScript. :-)
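A minimal sketch of using the two guesses side by side, assuming a toy content check that only recognizes PNG. The names `TypeReport`, `sniff_png`, and `report` are hypothetical and exist only for this illustration.

```python
import mimetypes
from typing import NamedTuple, Optional

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


class TypeReport(NamedTuple):
    by_name: Optional[str]     # guess from the filename extension
    by_content: Optional[str]  # guess from the leading bytes

    @property
    def conflict(self) -> bool:
        # Only a conflict when both guesses exist and disagree.
        return (
            self.by_name is not None
            and self.by_content is not None
            and self.by_name != self.by_content
        )


def sniff_png(data: bytes) -> Optional[str]:
    """Toy sniffer (assumption): recognizes PNG and nothing else."""
    return "image/png" if data.startswith(PNG_MAGIC) else None


def report(filename: str, data: bytes) -> TypeReport:
    return TypeReport(mimetypes.guess_type(filename)[0], sniff_png(data))


# A renamed file: the name says text, the bytes say PNG.
renamed = report("notes.txt", PNG_MAGIC + b"\x00" * 8)
# Ambiguous source text: the bytes alone give this toy sniffer nothing.
ambiguous = report("foo.py", b"foo() // bar\n")
print(renamed, renamed.conflict)
print(ambiguous, ambiguous.conflict)
```

What to do when the two guesses disagree, or when one of them is missing, is left to the caller.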
|
msg374467 - (view) |
Author: Jim Jewett (Jim.Jewett) * |
Date: 2020-07-28 05:56 |
There are a zillion reasons a filename could be wrong -- but the standard
says to trust the filesystem. So if it sniffs based on contents, it isn't
quite following the standard. It is probably still a useful tool, but it
won't be the One Right Way, and it isn't even clear that it should replace
current heuristics.
|
msg374471 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2020-07-28 07:07 |
I think that both functions for detecting file type, by name and by content, are useful in different circumstances. We have similar more specific detection functions sndhdr and imghdr.
But I am not sure whether it should be a part of the mimetypes module or a separate module. Should it use the sndhdr and imghdr modules for audio and image types? Should it be a wrapper around the libmagic library (https://linux.die.net/man/3/libmagic) or reimplement it in Python?
If we add the code for detecting the file type based on algorithms used in browsers, should not we add also the code for detecting the text encoding based on other algorithms used in browsers, or it is too much?
|
msg374509 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-07-28 16:35 |
> I think that both functions for detecting file type, by name and by content
I think so too. MIME sniffing would not be a replacement for the method based on the file extension. Both APIs should be provided.
> should not we add also the code for detecting the text encoding based on other algorithms used in browsers
I have already added code for text encoding detection based on the WHATWG standard, so if this API lands, text encoding detection will be supported (e.g. utf-16be).
IMHO, there would be use cases, since Python is used a lot today for text data handling (for example crawling and data pre-processing).
One might ask whether a standard written for browsers is appropriate for a Python stdlib module.
My answer is that the WHATWG standard could be one of the best standards to follow if we decide to provide MIME sniffing.
The standard handles MIME types that are widely used in the real world, not only by browsers but also by HTTP servers and other software.
One of the big burdens of maintaining MIME type detection is deciding how many MIME types should be supported.
Luckily, WHATWG can serve as a strong standard for making that decision.
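As a rough illustration of what encoding detection can look like, here is a minimal sketch that only checks for a byte order mark (BOM). It assumes BOM checking is the relevant mechanism; the function name `sniff_bom` is hypothetical, and the linked implementation may work differently.

```python
import codecs
from typing import Optional

# BOM prefixes and the encoding each one indicates.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_bom(sample))
```

Data without a BOM yields `None` here, so a caller would still need a fallback such as a declared charset or a default.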
|
msg374511 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2020-07-28 17:04 |
When the standard says "trust the filename" it is talking to the
application, not to the sniffing library. The library should provide the
tool for applications to follow the standard, but I don't see a reason why
we would have to enforce how applications call the library. Since we agree
there are use cases beyond what the standard has thought of for combining
sniffing the data and guessing based on the filename, we should make that
possible, the standard's exhortations notwithstanding.
Python is not a browser; a browser could be an application written in
Python. Python therefore should provide tools that are useful to implement
a browser.
|
msg374593 - (view) |
Author: (YoSTEALTH) * |
Date: 2020-07-29 23:13 |
The start and end positions of a signature must be accounted for; not all file signatures start at ``0`` or within the first ``512`` bytes.
Rather than writing all the signatures manually, it might be a good idea to use an already collected resource like https://www.garykessler.net/library/file_sigs.html
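A minimal sketch of offset-aware matching, to make the point concrete. The table layout and the names `_OFFSET_SIGNATURES`, `sniff_at_offsets`, and `bytes_needed` are hypothetical. The offsets are those of the tar `ustar` magic (257) and the ISO 9660 `CD001` identifier (0x8001); the MIME labels are illustrative.

```python
from typing import Optional

# (offset, magic bytes, MIME type)
_OFFSET_SIGNATURES = (
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (257, b"ustar", "application/x-tar"),
    (0x8001, b"CD001", "application/x-iso9660-image"),
)


def sniff_at_offsets(data: bytes) -> Optional[str]:
    """Match each signature at its own offset instead of only at byte 0."""
    for offset, magic, mime in _OFFSET_SIGNATURES:
        # Slicing past the end of short input yields b"", which never matches.
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


def bytes_needed() -> int:
    """How many leading bytes a caller must read to test every signature."""
    return max(offset + len(magic) for offset, magic, _ in _OFFSET_SIGNATURES)


tar_like = b"\x00" * 257 + b"ustar"
print(sniff_at_offsets(tar_like), bytes_needed())
```

With this table, a reader limited to the first 512 bytes would still detect the tar signature but could never see the ISO 9660 one.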
|
msg374615 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-07-30 14:25 |
https://www.garykessler.net/library/file_sigs.html looks like a good resource for this kind of API.
However, I would like to follow a well-known standard from WHATWG, W3C, etc.
|
msg379460 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-10-23 18:10 |
I am closing this issue as rejected!
During the sprint, I heard a lot of opinions from core devs including Guido, Tal, and Christian.
The overall conclusion for me is not to add this at this time.
If the mimetypes module is extracted from the stdlib into a PyPI package, we can discuss adding this feature at that time!
Thank you everyone for the discussion!
|
msg379461 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2020-10-23 18:15 |
Dong-hee, I recommend that you turn this into a 3rd party package on PyPI
yourself. That way your effort and code will live on!
|
msg379470 - (view) |
Author: Tal Einat (taleinat) * |
Date: 2020-10-23 19:48 |
> Dong-hee, I recommend that you turn this into a 3rd party package on PyPI yourself.
+1
|
msg379520 - (view) |
Author: Dong-hee Na (corona10) * |
Date: 2020-10-24 05:29 |
@gvanrossum, @taleinat
I've already published the MIME sniffing package on PyPI ;)
https://pypi.org/project/mimesniff/
The interface is similar to imghdr.what :)
|
|
Date | User | Action | Args |
2022-04-11 14:59:31 | admin | set | github: 85018 |
2020-10-24 05:29:15 | corona10 | set | messages:
+ msg379520 |
2020-10-23 19:48:42 | taleinat | set | messages:
+ msg379470 |
2020-10-23 18:15:00 | gvanrossum | set | messages:
+ msg379461 |
2020-10-23 18:10:33 | corona10 | set | status: open -> closed
nosy:
+ taleinat, christian.heimes messages:
+ msg379460
resolution: rejected stage: patch review -> resolved |
2020-07-30 14:25:30 | corona10 | set | messages:
+ msg374615 |
2020-07-29 23:13:21 | YoSTEALTH | set | nosy:
+ YoSTEALTH messages:
+ msg374593
|
2020-07-28 17:04:18 | gvanrossum | set | messages:
+ msg374511 |
2020-07-28 16:35:08 | corona10 | set | messages:
+ msg374509 |
2020-07-28 07:07:29 | serhiy.storchaka | set | messages:
+ msg374471 |
2020-07-28 05:56:06 | Jim.Jewett | set | messages:
+ msg374467 |
2020-07-27 23:21:48 | gvanrossum | set | messages:
+ msg374439 |
2020-07-27 23:15:18 | Jim.Jewett | set | nosy:
+ Jim.Jewett messages:
+ msg374438
|
2020-07-27 15:31:42 | corona10 | set | messages:
+ msg374387 |
2020-07-23 22:09:32 | gvanrossum | set | nosy:
+ gvanrossum messages:
+ msg374150
|
2020-06-08 13:42:45 | corona10 | set | keywords:
+ patch stage: patch review pull_requests:
+ pull_request19929 |
2020-06-02 10:41:15 | corona10 | set | nosy:
+ berker.peksag, serhiy.storchaka, xtreak messages:
+ msg370602
|
2020-06-02 10:05:55 | vstinner | set | nosy:
- vstinner
|
2020-06-02 06:43:52 | corona10 | set | title: Provide mimetypes.sniff API -> Provide mimetypes.sniff API as stdlib |
2020-06-02 06:43:44 | corona10 | set | title: Implement mimetypes.sniff -> Provide mimetypes.sniff API |
2020-06-02 06:42:50 | corona10 | set | assignee: corona10 |
2020-06-02 06:42:45 | corona10 | create | |