Floating point numbers

Published on: Sun Oct 15 2023

If you have ever worked with numbers on computers, you have likely noticed the stark difference between integers and real numbers. Integers make sense and behave as expected. However, whenever real numbers are required, there are suddenly multiple choices for representation: fractions, decimals and floats. What are these weird names?

Fractions should make sense to you if you remember number classes from math. A fraction is a pair of integers - a numerator and a denominator - and works just like you'd expect. Decimal and float numbers are approximations of real numbers that function similarly to each other. Let's focus on floats for now.

Float is shorthand for floating point number. Sometimes you will see double, which stands for double precision float (64 bits); regular floats are therefore single precision and take up half as much memory - 32 bits.
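A quick way to see the size difference is from Python (whose own float type is a double under the hood):

```python
import struct

print(struct.calcsize('f'))  # 4 bytes = 32 bits, single precision
print(struct.calcsize('d'))  # 8 bytes = 64 bits, double precision
```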

What is the floating point, you may be wondering? Remember scientific notation? Numbers like 10^8 * 3, which is an approximation of the speed of light in meters per second. Because the number is so large compared to other speeds, we care about its order of magnitude more than its exact digits. The "floating" comes from the fact that we could represent the same value by "floating" the fractional point through a change of exponent - 10^9 * 0.3 - so the point "floats".
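A tiny Python illustration of the same value written with the point "floated" to different positions:

```python
# Three spellings of the same quantity - only the position of the point moves.
print(3e8 == 0.3e9 == 30e7)  # True - all three are 300000000.0
```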

Floats are the exact same concept, except they use base 2 for the exponential representation, so a value like 256 is represented as 2^8 * 1, and a number like 192 can be represented as 2^8 * 0.11 (base 2). Let me take a quick detour if you are not used to fractional notation in binary. If you remember radixes, 0.25 is a shorthand for 0 * 10^0 + 2 * 10^-1 + 5 * 10^-2, or alternatively 0 + 2/10 + 5/100. In binary it's the same concept, except the number we raise to different powers is 2. So 0.11 is a shorthand for 0*2^0 + 1*2^-1 + 1*2^-2, or 0 + 1/2 + 1/4. So 0.11 in base 2 is the fraction 3/4, which should hopefully make the 192 representation make sense now.
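If you would rather check this with code than on paper, here is a small Python sketch that expands the binary digits the same way:

```python
# Expand the binary fraction 0.11 digit by digit: 1*2**-1 + 1*2**-2.
binary_digits = "11"
value = sum(int(d) * 2**-(i + 1) for i, d in enumerate(binary_digits))
print(value)           # 0.75, i.e. 3/4
print(2**8 * value)    # 192.0, matching 2^8 * 0.11 (base 2)
```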

With me so far? If not, I suggest playing with numbers (using the fully expanded notation) in different bases. While building an intuition for what I'm talking about, you will also get to see the elegance of positional notation.

A decimal notation is a shorthand for fractional notation - 0.11 is really 11/100 - and because the denominator is always a power of 10, there are fractions that are not representable exactly. Everyone is familiar with 1/3 requiring an infinitely long decimal representation. Powers of 2 as denominators leave even more fractions unrepresentable exactly - 1/10 has to be approximated as a fraction with a power-of-2 denominator. And this is the exact reason why some calculators show noise after adding a couple of tenths together.
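You can see that noise directly in Python, which uses double precision floats under the hood:

```python
from fractions import Fraction

print(0.1 + 0.2)         # 0.30000000000000004 - the "noise" from rounding
print(0.1 + 0.2 == 0.3)  # False

# The closest double to 0.1 is actually this power-of-2 fraction:
print(Fraction(0.1))     # 3602879701896397/36028797018963968
```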

Let's dive into the hardware implementation details, taking the widely used 32-bit representation as an example. As you may guess, there are 2 groups of bits: one for the exponent (the power we raise 2 to) and one for the significand, or mantissa - the most significant digits of the value. The most common implementation is the IEEE 754 standard, which is implemented by virtually all modern processors. It allocates 8 bits to the exponent and 24 bits to the mantissa. There are a couple of gotchas though. One - the exponent field does not use all 256 of its possible values for regular numbers: the all-ones pattern is reserved to accommodate positive and negative infinities and NaN (not a number) representations. These values are useful for making arithmetic operations more robust - dividing some value by zero results in an infinity instead of a hardware signalled exception. And because floating point numbers are not intended for highly precise calculations, working with them becomes simpler.

0 | 00000000 | 00000000000000000000000
sign bit | exponent bits | mantissa (significand) bits
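To poke at this layout yourself, here is a small Python sketch that slices a 32-bit float into the fields shown above:

```python
import struct

def float32_fields(x):
    """Return the (sign, exponent, mantissa) bit fields of x as a 32-bit float."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

print(float32_fields(1.0))           # (0, 127, 0) - the exponent is stored with a bias of 127
print(float32_fields(-256.0))        # (1, 135, 0) - 135 - 127 = 8, i.e. 2^8
print(float32_fields(float('inf')))  # (0, 255, 0) - the reserved all-ones exponent
```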

I just mentioned positive and negative infinities, and if you're familiar with different negative number representations in hardware, you might be wondering how negative floating point numbers are represented. Well, it's done with a sign bit. But wait, wouldn't that make it a 33-bit representation? Yes it would! But engineers were clever and made the convention that values should be represented in such a way that the first bit of the mantissa is always one, so we don't need to store it in hardware! Hang on, how do we then represent zero? Simple - the convention only applies if the exponent is not 0.
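Building on the field-extraction sketch above, you can reconstruct a value by hand, including the implicit leading one (this sketch assumes a normal number, i.e. a non-zero, non-reserved exponent):

```python
def reconstruct(sign, exponent, mantissa):
    # For normal numbers the 23 stored mantissa bits sit behind an implicit leading 1,
    # and the exponent is stored with a bias of 127.
    assert 0 < exponent < 255, "only normal numbers are handled in this sketch"
    significand = 1 + mantissa / 2**23
    return (-1)**sign * significand * 2**(exponent - 127)

print(reconstruct(1, 135, 0))          # -256.0
print(reconstruct(0, 134, 0b1 << 22))  # 192.0 - the normalized form 2^7 * 1.1 (base 2)
```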

This hardware representation leads to several interesting side-effects. For one, there are 2 zero values: a positive and a negative one. Additionally there are subnormal numbers - numbers near zero whose mantissa has several leading zeroes. These are equivalent to "inefficient" scientific notation: 10^10 * 0.03 instead of 10^8 * 3. Older hardware did not handle such values well and would provide erroneous results, however nowadays they can be used without much worry.
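Both side-effects are easy to observe from Python (the values below are for double precision, since that is what Python floats use):

```python
import math, sys

print(0.0 == -0.0)               # True, even though the bit patterns differ
print(math.copysign(1.0, -0.0))  # -1.0 - the sign bit is still there

print(sys.float_info.min)        # 2.2250738585072014e-308, smallest normal double
print(5e-324)                    # smallest subnormal double - even closer to zero
print(5e-324 / 2)                # 0.0 - below that we underflow to zero
```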

Now that you know how floats work, you might be wondering why they are not the default everywhere. There are several reasons. For one, they are imprecise: trying to add 1 to 17 million in a single precision float will not result in 17 million and one, because there are not enough bits in the mantissa, so the 1 gets lost. This may be an acceptable tradeoff in some uses, but not in others - you wouldn't want your user id to be confused with someone else's, right? Another reason is that floating point operations are slower. Before 2 values can be added their exponents need to be equalized, so on most hardware floating point operations take noticeably longer to execute than their integer counterparts.
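Python floats are doubles, so to demonstrate this with 17 million we can round-trip the value through a 32-bit encoding using the struct module:

```python
import struct

def to_float32(x):
    # Round-trip a Python float (a double) through a single precision 32-bit encoding.
    return struct.unpack('f', struct.pack('f', x))[0]

print(to_float32(17_000_000.0))        # 17000000.0 - representable exactly
print(to_float32(17_000_000.0 + 1.0))  # 17000000.0 - the +1 is lost; above 2^24 a
                                       # 32-bit float can only hold every other integer
```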

And the final reason is that different hardware may represent floating point values differently. Yes, the standard defines the bit layout, but it leaves some of the special value representation up to the implementation, and not every implementation adheres to the standard exactly. This means one can't just send raw floating point values over the network and expect the other end to read the same value without performing some sort of validation or marshalling.
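As a sketch of what such marshalling can look like, Python's struct module lets you pin down both the format and the byte order explicitly instead of relying on whatever the local machine does:

```python
import struct

value = 0.1

# Encode as IEEE 754 single precision, big-endian ("network order").
wire_bytes = struct.pack('>f', value)
print(wire_bytes.hex())  # 3dcccccd

# The receiving side decodes with the same explicit format.
decoded = struct.unpack('>f', wire_bytes)[0]
print(decoded)           # 0.10000000149011612 - note the single precision
                         # rounding compared to the double 0.1
```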