classification
Title: Request: Human readable byte amounts in the standard library
Type: enhancement
Stage:
Components:
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: FlipperPA, Jason Stelzer, eric.smith, martin.panter, miserlou2, mivade, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-10-10 17:10 by miserlou2, last changed 2017-10-29 11:23 by martin.panter.

Messages (10)
msg304061 - (view) Author: Rich (miserlou2) Date: 2017-10-10 17:10
This problem is an _extremely_ common one, a problem that almost any Python project of a reasonable size will encounter.

Given a number of bytes, say 123901842, format this as '123.9MB'.

The reason I strongly think that this should be included in the standard library is that if you look for how to do this through a Google search, there is an incredible number of different solutions on StackOverflow, in blog posts and forum posts, and in various libraries which provide this functionality, with varying levels of functionality and safety. You can also find different implementations of solutions to this problem inside of pretty much every major Python project (Django, etc.). In fact, I don't think I can think of any other function that gets copy-pasted into a project's 'util.py' file more commonly.

I think this functionality should be provided in the standard math package, with arguments that allow specifying SI/NIST formatting and the number of significant digits to display. Implementing this would strongly cut down on the amount of cargo-cult Python programming in the world.

I'm willing to implement this if there's a consensus that it should be included.

Thanks!
Rich Jones
msg304065 - (view) Author: Michael DePalatis (mivade) * Date: 2017-10-10 18:14
This would be a useful feature, but I don't think it quite fits in the math package. It might make more sense to use this with string formatting, for example:

"{:h}".format(filesize)

where I use h as the format specifier since it doesn't appear to be taken yet and would be in line with "human-readable" options in tools like ls.
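
Since "h" is not an existing format type, one way to try the idea out today is an int subclass whose __format__ method intercepts it. This is only a sketch: the class name HumanBytes, the choice of decimal prefixes, and the fixed single decimal place are all assumptions made for illustration, not part of any proposal.

```python
class HumanBytes(int):
    """Hypothetical int subclass that understands an "h" format type."""

    _PREFIXES = ["", "k", "M", "G", "T", "P"]

    def __format__(self, spec):
        if not spec.endswith("h"):
            # Every other format spec falls back to normal int formatting.
            return super().__format__(spec)
        value = float(self)
        index = 0
        # Divide by 1000 until the value fits under the current prefix.
        while abs(value) >= 1000 and index < len(self._PREFIXES) - 1:
            value /= 1000
            index += 1
        return f"{value:.1f} {self._PREFIXES[index]}B"


filesize = HumanBytes(123901842)
print("{:h}".format(filesize))  # 123.9 MB
print("{:d}".format(filesize))  # 123901842
```

A real "h" type would have to live in the built-in int and float formatting code, so that it works on plain numbers without a wrapper class.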
msg304070 - (view) Author: Timothy Allen (FlipperPA) Date: 2017-10-10 19:26
This would be a benefit to my team, for sure. I can't even tell you how many different solutions we currently use to make file sizes human readable - at least three.
msg304076 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-10 20:00
Do you mean decimal or binary prefixes? 123901842 bytes can be formatted as "118.2 MiB". In different areas decimal and binary prefixes can be more common. For example the volume of hard disks usually is specified with decimal prefixes, but the volume of RAM -- with binary prefixes (32 GiB, not 34.4 GB). Sometimes even mixed prefixes can be used (1 MB = 1024000 bytes).
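
The difference between the two prefix families can be checked with a few lines of Python. The helper name humanize and the suffix lists below are assumptions for illustration; the only thing that changes between the two conventions is the base (1000 or 1024) and the suffix spelling.

```python
DECIMAL = ["B", "kB", "MB", "GB", "TB"]     # powers of 1000
BINARY = ["B", "KiB", "MiB", "GiB", "TiB"]  # powers of 1024


def humanize(num_bytes, base, suffixes):
    """Scale num_bytes by repeated division and attach the matching suffix."""
    value = float(num_bytes)
    for suffix in suffixes[:-1]:
        if abs(value) < base:
            return f"{value:.1f} {suffix}"
        value /= base
    # Anything larger than the table allows is reported in the last unit.
    return f"{value:.1f} {suffixes[-1]}"


print(humanize(123901842, 1000, DECIMAL))  # 123.9 MB
print(humanize(123901842, 1024, BINARY))   # 118.2 MiB

ram = 32 * 2**30
print(humanize(ram, 1024, BINARY))         # 32.0 GiB
print(humanize(ram, 1000, DECIMAL))        # 34.4 GB
```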

And when we talk about human-readability, we can't ignore localization. For many people "123,9 МБ" looks more human-readable than "123.9 MB".

This is a complex problem and needs a complex solution. You can start by writing a special-purpose package and publishing it on PyPI. Maybe there are existing packages that solve this problem.
msg304084 - (view) Author: Rich (miserlou2) Date: 2017-10-10 21:33
Yep, as I mentioned, it should be configurable to use either format. Localization is an excellent point as well, so, all in all, the optional arguments to the function are format, significant digits, and delimiter. That's not an unreasonable amount of configurability.

It's not a complex problem; the solutions are fairly simple, but there are many ways to shoot yourself in the foot when rolling your own. There are already many packages which attempt this, most of which aren't used by any serious projects, which instead use their own implementations. There are just as many snippets of partial solutions floating around the internet as well. There is no canonical way to solve this common problem.

This is exactly why this common functionality should be added to the standard library, so that this extremely common function doesn't have to be imported from some-random-jamook's-untrustworthy-project-on-PyPI or rewritten from scratch for every project.
msg304094 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2017-10-10 23:24
A library implementing this should definitely go on PyPI first to shake out design issues. Then we'd need a PEP.

As someone who has a simplistic version of this code around, and who's done a bit of string formatting, I can assure you that there are a lot of issues to be thought through.

- Should it be 0.5M, or 500K? Is there a cutoff to switch over? Is it configurable?
- What if I prefer 1000K to 1M? Or even 1,000.0K or 1.000,0K (localized)?
- How many decimals to display? Can you suppress trailing zeros? Should it be 1.0M, or 1M?
- Space between the number and the units?
- As you mention, MiB vs. MB. And there are schemes where MB means 1000^2 and others where MB means 2^20. And as Serhiy says, schemes where 1 MB = 1024000 bytes. How should all of these be handled?
- Localization. Should this be specified as a locale, use the current locale, just ignored (see PEP 378 for an example), or have the delimiters passed in?
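
To make a few of the questions in the list above concrete, here is a rough sketch of what exposing them as keyword arguments might look like. The function name format_size and every parameter name are hypothetical. The sketch also hard-codes one answer to the cutoff question (it always switches units at the base, so it prints 500.0 kB rather than 0.5 MB) and sidesteps locales by taking the decimal point as an argument.

```python
def format_size(num_bytes, *, binary=False, precision=1,
                strip_zeros=False, space=" ", decimal_point="."):
    """Hypothetical helper exposing some of the design choices as arguments."""
    base = 1024 if binary else 1000
    suffixes = (["B", "KiB", "MiB", "GiB", "TiB"] if binary
                else ["B", "kB", "MB", "GB", "TB"])
    value = float(num_bytes)
    index = 0
    # Switch to the next unit as soon as the value reaches the base.
    while abs(value) >= base and index < len(suffixes) - 1:
        value /= base
        index += 1
    text = f"{value:.{precision}f}"
    if strip_zeros and "." in text:
        # "1.0" -> "1", "1.50" -> "1.5"
        text = text.rstrip("0").rstrip(".")
    text = text.replace(".", decimal_point)
    return f"{text}{space}{suffixes[index]}"


print(format_size(1000000))                                 # 1.0 MB
print(format_size(1000000, strip_zeros=True))               # 1 MB
print(format_size(500000))                                  # 500.0 kB
print(format_size(123901842, space="", decimal_point=","))  # 123,9MB
print(format_size(123901842, binary=True))                  # 118.2 MiB
```

Even this small version already has five keyword arguments and still leaves thousands separators, the 1024000-byte scheme, and the cutoff threshold unaddressed.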

I'm sure there are other decisions to be made.

If you're serious about this (and I hope you are!), then I think finding current best practices both within and outside of the Python universe should be researched.

I do like the idea of the "h" format specifier, which would be an argument to move it into the stdlib. That said, we never did come to agreement on something much simpler: engineering notation for floats (see issue 8060), because no one put the time into writing up a concrete proposal.
msg304178 - (view) Author: Jason Stelzer (Jason Stelzer) Date: 2017-10-11 20:01
Just pointing out that this exists and seems active.

https://github.com/tbielawa/bitmath

Perhaps include some or all of it in core python? Crazier things have happened.
msg304179 - (view) Author: Rich (miserlou2) Date: 2017-10-11 20:21
I think bitmath would be overkill to include in its entirety, but maybe their solution is a good one.

There is also:

https://pypi.python.org/pypi/byteformat/
https://pypi.python.org/pypi/datasize
https://pypi.python.org/pypi/hurry.filesize
https://pypi.python.org/pypi/hfilesize/
https://humanfriendly.readthedocs.io/en/latest/
https://pypi.python.org/pypi/humanize

and a bajillion other solutions here: https://stackoverflow.com/questions/1094841/reusable-library-to-get-human-readable-version-of-file-size and elsewhere, which I think really underscores how common this problem is.

(Although I don't _particularly want_ this to expand beyond the scope of this single function, it does seem that given the amount of "Python for Humans" stuff out there, there could be an argument made for adding a "humanize" package into the standard library.)
msg304180 - (view) Author: Jason Stelzer (Jason Stelzer) Date: 2017-10-11 20:30
I often speak in generalizations and half thoughts. Feel free to cherry pick as much or as little as you want.

Including a core shim of whatever is agreed to be the minimalist functionality with a SEE ALSO note or clue as to where to start would:

* Resolve the basic out-of-the-box stuff.

* Eliminate a lot of boring DIY stuff for people who reach for an editor first and search second.

* Give new users much firmer footing.

* Give some everything and the kitchen sink projects wider exposure.
msg305186 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-10-29 11:23
Ken Kundert started a related discussion a while back on Python-ideas: <https://www.mail-archive.com/search?l=mid&q=20160830203427.GE2363@kundert.designers-guide.com>. This was about SI-prefixed units in general; not restricted to bytes. Also, the “timeit” module already does auto-scaling across nsec, usec, msec, and sec.
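
The kind of auto-scaling meant here can be sketched in a few lines. This is written in the same spirit as the time reporting in timeit but is not a copy of its code; the function name format_time and the default precision are assumptions.

```python
# Unit table using the nsec/usec/msec/sec names mentioned above,
# ordered from largest to smallest.
UNITS = [("sec", 1.0), ("msec", 1e-3), ("usec", 1e-6), ("nsec", 1e-9)]


def format_time(seconds, precision=3):
    """Report a duration in the largest unit that does not exceed it."""
    for unit, scale in UNITS:
        if seconds >= scale:
            break
    # Durations below 1 nsec fall through the loop and stay in nsec.
    return f"{seconds / scale:.{precision}g} {unit}"


print(format_time(3))       # 3 sec
print(format_time(0.0025))  # 2.5 msec
print(format_time(12e-6))   # 12 usec
```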

I think supporting decimal SI prefixes (for µs, mL, km, MW, etc) is more important than the binary versions (KiB, MiB, GiB). And units of 1,024,000 are definitely too niche.

I think a new format type “:h” may be the way forward. Perhaps it would add an SI prefix, and then the user could append their unit:

>>> f"{123901842:h}B"  # Six significant digits by default (like “:g”)
"123.902 MB"
>>> f"{123901842:.5h}B"  # Drop trailing zeros
"123.9 MB"
>>> f"{12:+6h}m"  # Sign and width options may be useful
"  +12 m"
>>> f"{12e99:h}m"  # Exponential notation for extreme values
"1.2e100 m"
>>> f"{12e99:H}m"  # Capitalize E, INF, etc (but not k for kilo-, etc)
"1.2E100 m"
>>> f"{123901:#.5h}m"  # Alternative form keeps trailing zeros
"123.90 km"
>>> f"{123:.2h}m"  # Precision < 3 may not be respected
"123 m"
>>> f"{123:#.2h}m"  # Maybe alternative form could respect the precision
"0.12 km"
>>> f"{123901842:.4h}B".replace(" ", "")  # Avoid the space
"123.9MB"
>>> f"{123901842:.4h}B".replace(" ", "&nbsp;")  # Alternative space
"123.9&nbsp;MB"
>>> f"{123901842:.4h}B".replace(".", ",")  # Alternative to decimal point
"123,9 MB"
>>> f"{12e-6:h}sec"  # Non-ASCII by default
"12 µsec"
>>> f"{12e-6:h}sec".replace("\N{MICRO SIGN}", "u")  # ASCII compatibility
"12 usec"

Squares and cubes may be a minor stumbling block: 0.001 m² is one thousand square millimetres, but f"{0.001:.3h}m²" would return "1 mm²".
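
A rough way to experiment with this today, without a new format type, is a plain function. The sketch below is only an approximation of the behaviour proposed above. The name si_format and its digits and power arguments are hypothetical, values outside the pico to tera range are simply left in the outermost prefix rather than switching to exponential notation, and power is one possible way around the squares problem: it makes each prefix step worth 1000**power.

```python
# Prefixes from pico to tera; the empty string marks "no prefix".
PREFIXES = ["p", "n", "\N{MICRO SIGN}", "m", "", "k", "M", "G", "T"]
ZERO = PREFIXES.index("")


def si_format(value, unit, digits=6, power=1):
    """Hypothetical helper: pick an SI prefix and append the unit.

    power is the exponent attached to the unit (2 for m², 3 for m³).
    """
    factor = 1000.0 ** power
    scaled = float(value)
    index = ZERO
    if scaled != 0:
        # Move towards larger prefixes while the number is too big ...
        while abs(scaled) >= factor and index < len(PREFIXES) - 1:
            scaled /= factor
            index += 1
        # ... and towards smaller prefixes while it is below 1.
        while abs(scaled) < 1 and index > 0:
            scaled *= factor
            index -= 1
    return f"{scaled:.{digits}g} {PREFIXES[index]}{unit}"


print(si_format(123901842, "B"))            # 123.902 MB
print(si_format(123901842, "B", digits=4))  # 123.9 MB
print(si_format(12e-6, "sec"))              # 12 µsec
print(si_format(0.001, "m²"))               # 1 mm² (the misleading result)
print(si_format(0.001, "m²", power=2))      # 1000 mm²
```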
History
Date                 User              Action  Args
2017-10-29 11:23:36  martin.panter     set     nosy: + martin.panter; messages: + msg305186
2017-10-11 20:30:48  Jason Stelzer     set     messages: + msg304180
2017-10-11 20:21:52  miserlou2         set     messages: + msg304179
2017-10-11 20:01:01  Jason Stelzer     set     nosy: + Jason Stelzer; messages: + msg304178
2017-10-10 23:24:44  eric.smith        set     nosy: + eric.smith; messages: + msg304094
2017-10-10 21:33:56  miserlou2         set     messages: + msg304084
2017-10-10 20:00:32  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg304076
2017-10-10 19:26:08  FlipperPA         set     nosy: + FlipperPA; messages: + msg304070
2017-10-10 18:14:40  mivade            set     nosy: + mivade; messages: + msg304065
2017-10-10 17:10:53  miserlou2         create