Number Representation & IEEE-754

Number Representation Refresher

Positional Number Systems

A positional number system (aka: one based on exponentiation) is defined by a base and a set of digits, where a number is a sequence of those digits, like so:

\boxed{ \text{Let } b = ? \text{ and } D = \{ d_0,d_1,d_2,d_3,...,d_n \} } \\ \small\textit{Where $b$ is the base and $D$ is the possible digits} \\~\\ \textit{A number is a sequence $n_i \in D$}

Example: Value of a Number

In general:

n_4 n_3 n_2 n_1 = n_1 \times b^0 + n_2 \times b^1 + n_3 \times b^2 + n_4 \times b^3

A specific example with a base 10 number:

1 9 3 7 = 7 \times 10^0 + 3 \times 10^1 + 9 \times 10^2 + 1 \times 10^3

A specific example with a base 2 number:

1 1 0 1 = 1 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 1 \times 2^3
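The expansion above can be sketched in a few lines of Python. This is only an illustration: the function name positional_value is made up for these notes, and digits are assumed to be given most significant first.

```python
def positional_value(digits, base):
    """Return the value of a digit sequence written most-significant digit first."""
    value = 0
    # The rightmost digit has position 0, so walk the digits in reverse.
    for position, digit in enumerate(reversed(digits)):
        if not 0 <= digit < base:
            raise ValueError(f"digit {digit} is not valid in base {base}")
        value += digit * base ** position
    return value


print(positional_value([1, 9, 3, 7], 10))  # 1937
print(positional_value([1, 1, 0, 1], 2))   # 13
```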

Binary

Modern computers represent numbers in binary (b=2, D = \{0,1\}).

This is simpler on hardware, but requires more digits.

Refresher: Successive Division (from: CS2640)

Divide by 2 and keep track of remainder (0 or 1)
Truncate quotient after each division

Don’t Get Tripped Up!: Integer Division and Remainder
If you’re doing decimal division rather than integer division (easy to do if you’re using a calculator instead of your head), you can tell the remainder is 1 if the quotient ends in .5.
If you’re doing decimal division and the quotient doesn’t end in .0 or .5, you forgot to do integer division. (Truncate everything after the point and divide only the integer.)

Example: Decimal to Binary with Successive Division

Let’s convert 633 to binary with successive division.

633	1
316	0
158	0
79	1
39	1
19	1
9	1
4	0
2	0
1	1
0

We read the remainders from the bottom up, so the answer is:

633_{10} = 1001111001_2
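As a quick way to check a hand conversion, successive division can be sketched in Python. The name to_binary is just a label for this sketch, and it assumes a non-negative integer input.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by successive division."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        # divmod gives the truncated quotient and the remainder (0 or 1).
        n, remainder = divmod(n, 2)
        remainders.append(str(remainder))
    # The first remainder is the least significant bit, so read bottom-up.
    return "".join(reversed(remainders))


print(to_binary(633))  # 1001111001
```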

Significance

Significance: How much a digit contributes to the overall value of a number.

Example: Most and Least Significance

9378 = 9_4 3_3 7_2 8_1 \\ \small\textit{(subscripts show that digit's significance)}

The most significant digit is 9.
The least significant digit is 8.

Loss of Significance

Loss of Significance: The loss of significant digits (and therefore accuracy) that we can incur when storing numbers or doing arithmetic.

\boxed{ \text{Absolute Error: } AE = |X - X_s| } \\~\\ \boxed{ \text{Relative Error: } RE = \left| \frac{X - X_s}{X} \right| } \\ \small \textit{where $X$ is the original value, and} \\ \textit{where $X_s$ is the stored/approximated value}

Example: Truncation and calculating error

Suppose a decimal computer that can only store numbers with 5 significant digits.

We have the following measurement and want to store it in the computer.

Sample (i)	Value
1	15.0783073

Q: The value has 9 significant digits, yet our computer can only store 5. How can we store it in the computer?

A: We can truncate the number, dropping the least significant digits.

For sample 1, this means storing only 15.078.

Q: How much absolute and relative error did we just incur by doing this?

AE = | X - X_s | = | 15.0783073 - 15.078 | = 0.0003073

RE = \left| \frac{X - X_s}{X} \right| = \left| \frac{ 15.0783073 - 15.078}{ 15.0783073} \right| \approx 0.00002038

Thus, we’ve incurred a loss of significance with an absolute error of 0.0003073, or a relative error of about 0.002038%.
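The same calculation can be reproduced with Python's decimal module, which keeps the decimal digits exact. The helper truncate_to_sig_digits is a hypothetical name for this sketch, and it assumes a nonzero input.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_to_sig_digits(x, digits):
    """Truncate (not round) a nonzero Decimal to the given number of significant digits."""
    # adjusted() is the exponent of the most significant digit (1 for 15.07...).
    quantum = Decimal(1).scaleb(x.adjusted() - digits + 1)
    return x.quantize(quantum, rounding=ROUND_DOWN)


x = Decimal("15.0783073")            # original value X
x_s = truncate_to_sig_digits(x, 5)   # stored value X_s
abs_err = abs(x - x_s)
rel_err = abs_err / abs(x)

print(x_s)      # 15.078
print(abs_err)  # 0.0003073
print(rel_err)  # roughly 0.00002038
```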

Remember: There are errors everywhere.
Even instruments can have error.

Floating Point Numbers

Floating-Point: A number with a point in it, where the point can “float” to any position among the digits.

Scientific Notation

\boxed{ \text{Normalized Scientific Notation: } m \times b^n } \\~\\ \small \textit{(where $1 \le |m| \lt b$, i.e., $1 \le |m| \lt 10$ in decimal)}

Scientific Notation: A way to express big or small numbers more conveniently.

Converting to Scientific Notation:

m: Move the decimal point until only one digit is on the left of the point.
b^n: Multiply the number by the base (b) raised to the # of times you moved the point (n).
- (Remember that n has to be negative if the point moved to the right!)

Example: Scientific notation of a typical decimal

Q: Find the scientific notation representation of 347.5601.

Left Term: We move the decimal left by two digits.

347.5601 = 3.475601 \times ?

Right Term: Base is ten and we only moved twice, so we multiply by 10^2

347.5601 = 3.475601 \times 10^2

Example: Scientific notation for binary floating-point

Q: Find the binary scientific notation of 101.1011_2

A: 1.011011_2 \times 10_2^{10}

Remember, 10_2^{10} is binary for 2_{10}^2 (because the base is 2 and we moved the point left twice)

Important: A property of non-zero binary numbers (floating-points and integers) is that the MSB, meaning the leading non-zero digit, is always 1 (or else it wouldn’t be the MSB)
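Normalizing a binary numeral like the one in the example above can be sketched as follows. The function normalize_binary is a made-up helper that works on strings of 0s and 1s, assumes a positive value, and returns the exponent as an ordinary decimal integer.

```python
def normalize_binary(bits):
    """Return (mantissa, exponent) so that bits == mantissa x 2**exponent.

    `bits` is a positive binary numeral such as "101.1011".
    The mantissa is returned as a string with exactly one leading 1.
    """
    integer_part, _, fraction_part = bits.partition(".")
    digits = integer_part + fraction_part
    first_one = digits.index("1")  # position of the MSB (raises ValueError for zero)
    # How many places the point must move to sit right after the MSB.
    exponent = len(integer_part) - 1 - first_one
    tail = digits[first_one + 1:].rstrip("0") or "0"
    return "1." + tail, exponent


print(normalize_binary("101.1011"))  # ('1.011011', 2)
print(normalize_binary("0.0011"))    # ('1.1', -3)
```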

Integer Representation

Standard Signed Integer Refresher

A standard signed integer is a 32-bit two’s complement pattern where the MSB is the sign bit.

Value Range: [-2^{31}, 2^{31} - 1]

Floating-Point Representation

IEEE-754

The binary pattern is split into:

Sign: 1 bit (0 pos, 1 neg)
Exponent: 8 bits
- Bias: 127
- 0000 0000 is reserved to set the entire number to zero (the mantissa bits are all zero too)
- 1111 1111 is reserved for infinity (use the sign bit above to set \pm\infin; the mantissa bits are all zero)
Mantissa: 23 bits
- “One Plus Notation”: We only store the digits after the decimal point and assume the number always starts with 1.
- (This is just the remaining bits, assuming a total of 32 bits)

Example: How IEEE-754 was made (one plus notation and exponent)

IEEE-754 is based on the scientific notation that was previously discussed.

One Plus Notation

We know that the MSB is always 1 (otherwise it wouldn’t be the MSB), so we can save a bit by not storing it.

We’ll just assume every number starts with “1.” (this is one plus notation)
This complicates representing zero, but we’ll address that later.

The Three Parts

Hence, we only need three things:

\textcolor{green}{\pm} \; 1.\textcolor{green}{\text{mantissa}} \times 10_2^{\textcolor{green}{\text{exponent}}}

Sign
Mantissa (the part after the point)
Exponent (we can ignore the base because it’s always two, or 10_2 in binary)

Exponent

The designers of IEEE-754 decided to dedicate 8 bits to the exponent

Reserving Zero: One plus notation can’t represent zero, because every number starts with “1.” (e.g., an all-zero mantissa with an exponent of zero would be 1.0 \times 2^0 = 1, not 0)
- So, we reserve the pattern 0000 0000 to signal that the whole number should be zero.
Reserving Infinity: The designers also wanted to be able to represent infinity.
- So, we reserve the pattern 1111 1111 to signal \pm infinity.
Bias: The designers didn’t want to have a second sign bit, so they decided to add a bias to the exponent to guarantee the stored value is never negative. (We do this by adding 127_{10} to every exponent.)

Very Important: We can still represent negative exponents! It’s just that to store them in IEEE-754, we add the bias.
(So, actual exponent = stored exponent - 127)

Why Bias?

The designers wanted to remove the exponent’s sign bit to:

Simplify hardware (no extra sign bit handling needed)
Simplify comparison (e.g., checking if 127 - 127 > 126 - 127 is unnecessary and involves negative numbers when you can just do 127 > 126)

Even More: Okay… but I really wanna know why…

Biasing preserves ordering: in two’s complement, a negative number has a larger raw bit pattern than a positive one because of the sign bit shenanigans, while biased exponents sort the same way as their bit patterns. Plain unsigned comparisons Just Work (TM).
Clean zero/infinity signals and balanced range: we can reserve 0000 0000 and 1111 1111 for zero and infinity (and the exponent field has no -0 and +0 to finagle with, like a sign-magnitude field would); and we get to dedicate the remaining unreserved patterns to [-126, 127], which is balanced around zero thanks to the bias. In the explanation this all was self-evident/obvious, but the answer to “which came first, the bias or the balance” is “both”.
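One way to see the comparison benefit is to check that positive floats sort in the same order as their raw 32-bit patterns read as unsigned integers. This is a small sketch using Python's struct module; float_bits is a hypothetical helper, and the claim is only checked for positive values (negative numbers need the sign bit handled separately).

```python
import struct


def float_bits(x):
    """Return the 32-bit single-precision pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# Increasing positive values, all exactly representable in single precision.
values = [0.15625, 0.5, 1.0, 11.25, 1957.1875]
patterns = [float_bits(v) for v in values]

# The raw bit patterns are in increasing order too.
print(patterns == sorted(patterns))  # True
```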

Example: Converting a Binary Floating-Point to IEEE-754

I. Binary Floating-Point to IEEE-754

Q: Convert 1.01101 \times 10^{11} to IEEE-754.

The sign is positive, so the sign bit is 0.
We need to bias the exponent before we can store it.

\begin{aligned} \text{Exponent} + \text{Bias} &= 11_2 + 127_{10} \\ &= 130_{10} \\ &= 10000010_2 \end{aligned}

Thus, the biased exponent is 10000010

The mantissa is 01101, so we pad the right until it’s 23 bits. Thus, the padded mantissa is 01101000000000000000000
Altogether, this means the IEEE-754 representation of 1.01101 \times 10^{11} is: 0 10000010 01101000000000000000000 (sign, exponent, mantissa)

II. IEEE-754 to Decimal

Q: Now, convert 0 10000010 01101000000000000000000 to a decimal.

We can see/calculate:

Sign bit is positive.
Exponent is 3 (unbiased by doing 10000010_2 - 127_{10})
Mantissa is 1.40625

How We Got Mantissa:

First, let’s account for one plus notation: 01101 -> 1.01101.

Then, 1.01101_2 = \left[ 1 + \frac{1}{4} + \frac{1}{8} + \frac{1}{32} = 1.40625 \right]_{10}

In other words, we are doing:

1.\text{mantissa bits as fraction} \times 2^\text{unbiased exponent}

The final answer fills in as: 1.40625 \times 2^3 = 11.25_{10}
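The decoding steps can be sketched in Python and cross-checked against the machine's own interpretation of the same 32 bits. The function decode_single is a made-up helper; it expects the bits in the standard sign, exponent, mantissa order and only handles ordinary numbers (not the reserved all-zero or all-one exponents).

```python
import struct


def decode_single(bits):
    """Decode a 'sign exponent mantissa' bit string (spaces optional) by hand."""
    bits = bits.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127          # remove the bias
    fraction = int(bits[9:], 2) / 2**23         # mantissa bits as a fraction
    return sign * (1 + fraction) * 2**exponent  # one plus notation


# Sign 0, biased exponent 130, mantissa 01101 padded to 23 bits.
pattern = "0 10000010 " + "01101" + "0" * 18
print(decode_single(pattern))  # 11.25

# Cross-check: let struct reinterpret the same 4 bytes as a float.
raw = int(pattern.replace(" ", ""), 2).to_bytes(4, "big")
print(struct.unpack(">f", raw)[0])  # 11.25
```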

Example: Converting a Decimal to IEEE-754 and back

Q: Convert 1957.18765 to IEEE-754

Convert integer part into binary

Successive Division of 1957:

1957	1
978	0
489	1
244	0
122	0
61	1
30	0
15	1
7	1
3	1
1	1
0

Integer Portion: 11110100101

Convert the fractional part.

Remember: We store 23 bits in the mantissa (assuming 32-bit floating point).
The integer portion already consumes 10 bits (we skip the MSB thanks to one plus notation, if you’re confused by the fact we have 11 bits)
Thus, we’ll only convert the fractional part up to 13 bits of precision.

Successive Multiplication of 0.18765:

0.18765	0
0.3753	0
0.7506	1
0.5012	1
0.0024	0
0.0048	0
0.0096	0
0.0192	0
0.0384	0
0.0768	0
0.1536	0
0.3072	0
0.6144	1

Differences from Successive Division:
We multiply by 2 rather than divide.
We record the digit before the point as the next bit, then truncate it before multiplying again.

In this case, we read the string from top to bottom:

0011000000001

Combining these two parts, we get:

11110100101.0011000000001

To Scientific Notation

Now, we want to convert our binary floating-point to scientific notation.

11110100101.0011000000001

In scientific notation, after moving the point left ten times, it is

1.11101001010011000000001_2 \times 10_2^{1010}

Note: Be not afraid. The right side is just 2^{10} in decimal

Sign, Exponent, Mantissa

Our sign is 0 (positive).

The exponent, after biasing, is 10 + 127 = 137 (decimal), or 10001001_2 (binary).

The mantissa is everything after the point: 11101001010011000000001

Therefore, our final number is:

0 10001001 11101001010011000000001
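The hand conversion can be checked by asking Python to pack the value as a 32-bit float and printing the three fields. The helper single_fields is a name invented for this sketch. One caveat: the hardware rounds to the nearest representable value rather than truncating, so in general the last mantissa bit can differ from a truncated hand conversion (for this value the two agree).

```python
import struct


def single_fields(x):
    """Return the sign, exponent, and mantissa bit strings of x as a 32-bit float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]


print(" ".join(single_fields(1957.18765)))
# 0 10001001 11101001010011000000001
```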

Machine Epsilon: The smallest number \epsilon such that 1.0 + \epsilon > 1 in floating-point arithmetic.
In other words, it’s the smallest number you can add to 1 such that the result isn’t 1.
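e.g., for single precision (23 mantissa bits, with any extra bits truncated), this is \epsilon = 2^{-23}, the value of the last mantissa bit: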
1.0 + 2^{-23} > 1.0
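A small sketch to confirm the single-precision figure: bump the bit pattern of 1.0 by one to get the next representable value, then measure the gap. The helper next_single_after is hypothetical and assumes a positive, finite input.

```python
import struct


def next_single_after(x):
    """Return the next representable single-precision value above a positive x."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    # Adding 1 to the raw pattern sets the lowest mantissa bit.
    return struct.unpack(">f", struct.pack(">I", word + 1))[0]


eps = next_single_after(1.0) - 1.0
print(eps == 2**-23)  # True
```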

Exponent Conversion

To add two floating-points with different exponents, you need to make their exponents equal.

Do this by manipulating the floating-point with the smaller exponent to match the bigger one.

Example: Why do we convert exponents this way?

To add two numbers with different exponents, you need to convert one of the numbers to use the same exponent as the other.

Suppose you want to add these two: 5.3782 \times 10^{-5} \\~\\ 2.7641 \times 10^{-7}

Q: Should you:

Convert the one with the smaller exponent, or

2.7641 \times 10^{-7} = 0.027641 \times 10^{-5} \approx 0.0276 \times 10^{-5}

Convert the one with the bigger exponent?

5.3782 \times 10^{-5} = 537.82 \times 10^{-7}

Note: Assume our machine can only fit 5 digits of magnitude.

A: We should do the first method (make the smaller one match the bigger one)

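The second method doesn’t even result in a valid (normalized) floating-point number, and it misaligns the digits: the shifted number needs room for more digits before the point. The first method only drops the least significant digits of the smaller number.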
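The first method can be sketched with Python's decimal module. The function align and the choice of one digit before the point plus four after it (5 digits total, truncated) are assumptions made for this illustration.

```python
from decimal import Decimal, ROUND_DOWN


def align(mantissa, exponent, target_exponent, quantum=Decimal("0.0001")):
    """Rewrite mantissa x 10**exponent using target_exponent, truncating to the quantum."""
    shifted = mantissa.scaleb(exponent - target_exponent)
    return shifted.quantize(quantum, rounding=ROUND_DOWN)


big = (Decimal("5.3782"), -5)
small = (Decimal("2.7641"), -7)

# Shift the number with the smaller exponent so both use 10**-5.
aligned = align(*small, target_exponent=big[1])
total = big[0] + aligned

print(aligned)  # 0.0276
print(total)    # 5.4058  (times 10**-5)
```

Only the trailing digits 41 of the smaller number are lost, which is the least damaging place to lose them.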

IEEE Double-Precision Standard

Like IEEE-754 single precision, but with 64 bits.

The binary pattern is split into:

Sign: 1 bit (0 pos, 1 neg)
Exponent: 11 bits
- Bias: 1023
- 0000 0000 000 is reserved to set entire number to zero.
- 1111 1111 111 is reserved for infinity (use above sign bit to set \pm\infin)
Mantissa: 52 bits
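The 64-bit layout can be inspected the same way as the 32-bit one. This sketch splits a Python float (which is a double on typical platforms) into the three fields; double_fields is a made-up helper name.

```python
import struct


def double_fields(x):
    """Return the sign, exponent, and mantissa bit strings of x as a 64-bit float."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{word:064b}"
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = double_fields(11.25)
print(sign, exponent, mantissa[:5])  # 0 10000000010 01101
print(int(exponent, 2) - 1023)       # 3
```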