I am trying to write an application to convert bytes to kb to mb to gb to tb. Here\'s what I have so far:
def size_format(b):
if b < 1000:
r
Dividing bytes to get a human-readable answer may seem easy, right? Wrong!
Every other answer is incorrect and contains floating point rounding bugs that cause incorrect output such as "1024 KiB" instead of "1 MiB". They shouldn't feel sad about it, though, since it's a bug that even Android's OS programmers had in the past, and tens of thousands of programmer eyes never noticed the bug in the world's most popular StackOverflow answer either, despite years of people using that old Java answer.
So what's the problem? Well, it's due to the way that floating point rounding works. A float such as "1023.95" will actually round up to "1024.0" when told to format itself as a single-decimal number. Most programmers don't think about that bug, but it COMPLETELY breaks the "human readable bytes" formatting. So their code thinks "Oh, 1023.95, that's fine, we've found the correct unit since the number is less than 1024", but they don't realize that it will get rounded to "1024.0" which SHOULD be formatted as the NEXT size-unit.
Furthermore, many of the other answers are using horribly slow code with a bunch of math functions such as pow/log, which may look "neat" but completely wrecks performance. Most of the other answers use crazy if/else nesting, or other performance-killers such as temporary lists, live string concatenation/creation, etc. In short, they waste CPU cycles doing pointless, heavy work.
Most of them also forget to include larger units, and therefore only support a small subset of the most common filesizes. Given a larger number, such code would output something like "1239213919393491123.1 Gigabytes", which is silly. Some of them won't even do that, and will simply break if the input number is larger than the largest unit they've implemented.
Furthermore, almost none of them handle negative input, such as "minus 2 megabytes", and completely break on such input.
They also hardcode very personal choices such as precision (how many decimals) and unit type (metric or binary). Which means that their code is barely reusable.
So... okay, we have a situation where the current answers aren't correct... so why not do everything right instead? Here's my function, which focuses on both performance and configurability. You can choose between 0-3 decimals, and whether you want metric (power of 1000) or binary (power of 1024) representation. It contains some code comments and usage examples, to help people understand why it does what it does and what bugs it avoids by working this way. If all the comments are deleted, it would shrink the line numbers by a lot, but I suggest keeping the comments when copypasta-ing so that you understand the code again in the future. ;-)
from typing import List, Union
class HumanBytes:
METRIC_LABELS: List[str] = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
BINARY_LABELS: List[str] = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]
PRECISION_OFFSETS: List[float] = [0.5, 0.05, 0.005, 0.0005] # PREDEFINED FOR SPEED.
PRECISION_FORMATS: List[str] = ["{}{:.0f} {}", "{}{:.1f} {}", "{}{:.2f} {}", "{}{:.3f} {}"] # PREDEFINED FOR SPEED.
@staticmethod
def format(num: Union[int, float], metric: bool=False, precision: int=1) -> str:
"""
Human-readable formatting of bytes, using binary (powers of 1024)
or metric (powers of 1000) representation.
"""
assert isinstance(num, (int, float)), "num must be an int or float"
assert isinstance(metric, bool), "metric must be a bool"
assert isinstance(precision, int) and precision >= 0 and precision <= 3, "precision must be an int (range 0-3)"
unit_labels = HumanBytes.METRIC_LABELS if metric else HumanBytes.BINARY_LABELS
last_label = unit_labels[-1]
unit_step = 1000 if metric else 1024
unit_step_thresh = unit_step - HumanBytes.PRECISION_OFFSETS[precision]
is_negative = num < 0
if is_negative: # Faster than ternary assignment or always running abs().
num = abs(num)
for unit in unit_labels:
if num < unit_step_thresh:
# VERY IMPORTANT:
# Only accepts the CURRENT unit if we're BELOW the threshold where
# float rounding behavior would place us into the NEXT unit: F.ex.
# when rounding a float to 1 decimal, any number ">= 1023.95" will
# be rounded to "1024.0". Obviously we don't want ugly output such
# as "1024.0 KiB", since the proper term for that is "1.0 MiB".
break
if unit != last_label:
# We only shrink the number if we HAVEN'T reached the last unit.
# NOTE: These looped divisions accumulate floating point rounding
# errors, but each new division pushes the rounding errors further
# and further down in the decimals, so it doesn't matter at all.
num /= unit_step
return HumanBytes.PRECISION_FORMATS[precision].format("-" if is_negative else "", num, unit)
print(HumanBytes.format(2251799813685247)) # 2 pebibytes
print(HumanBytes.format(2000000000000000, True)) # 2 petabytes
print(HumanBytes.format(1099511627776)) # 1 tebibyte
print(HumanBytes.format(1000000000000, True)) # 1 terabyte
print(HumanBytes.format(1000000000, True)) # 1 gigabyte
print(HumanBytes.format(4318498233, precision=3)) # 4.022 gibibytes
print(HumanBytes.format(4318498233, True, 3)) # 4.318 gigabytes
print(HumanBytes.format(-4318498233, precision=2)) # -4.02 gibibytes
By the way, the hardcoded PRECISION_OFFSETS
is created that way for maximum performance. We could have programmatically calculated the offsets using the formula unit_step_thresh = unit_step - (0.5/(10**precision))
to support arbitrary precisions. But it really makes NO sense to format filesizes with massive 4+ trailing decimal numbers. That's why my function supports exactly what people use: 0, 1, 2 or 3 decimals. Thus we avoid a bunch of pow and division math. This decision is one of many small attention-to-detail choices that make this function FAST. Another example of performance choices was the decision to use a string-based if unit != last_label
check to detect the end of the List, rather than iterating by indices and seeing if we've reached the final List-index. Generating indices via range()
or tuples via enumerate()
is slower than just doing an address comparison of Python's immutable string objects stored in the _LABELS
lists, which is what this code does instead!
Sure, it's a bit excessive to put that much work into performance, but I hate the "write sloppy code and only optimize after all the thousands of slow functions in a project makes the whole project sluggish" attitude. The "premature optimization" quote that most programmers live by is completely misunderstood and used as an excuse for sloppiness. :-P
I place this code in the public domain. Feel free to use it in your projects, both freeware and commercial. I actually suggest that you place it in a .py
module and change it from a "class namespace" into a normal module instead. I only used a class to keep the code neat for StackOverflow and to make it easy to paste into self-contained python scripts if you don't want to use modules.
Enjoy and have fun! :-)