Number Representation & IEEE-754

Number Representation Refresher

Positional Number Systems

A positional number system (aka: one based on exponentiation) is defined by a set of digits, where numbers are sequences like so:

\boxed{ \text{Let } b = ? \text{ and } D = \{ d_0,d_1,d_2,d_3,...,d_n \} } \\ \small\textit{Where $b$ is the base and $D$ is the possible digits} \\~\\ \textit{A number is a sequence $n_i \in D$}

Example: Value of a Number

In general:

n_4 n_3 n_2 n_1 = n_1 \times b^0 + n_2 \times b^1 + n_3 \times b^2 + n_4 \times b^3

A specific example with a base 10 number:

1 9 3 7 = 7 \times 10^0 + 3 \times 10^1 + 9 \times 10^2 + 1 \times 10^3

A specific example with a base 2 number:

1 1 0 1 = 1 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 1 \times 2^3

Binary

Modern computers represent numbers in binary (b=2, D = \{0,1\}).

Refresher: Successive Division (from: CS2640)

Don’t Get Tripped Up!: Integer Division and Remainder

Example: Decimal to Binary with Successive Division

Let’s convert 633 to binary with successive division.

6331
3160
1150
791
391
191
91
40
20
11
0

We read the number from the bottom-up, so the answer is:

633_{10} = 1001111001_2

Significance

Significance: How much a digit contributes to the overall value of a number.

Example: Most and Least Significance

9378 = 9_4 3_3 7_2 8_1 \\ \small\textit{(subscripts show that digit's significance)}

Loss of Significance

Loss of Significance: We can incur a loss of significance when storing numbers or doing arithmetic.

\boxed{ \text{Absolute Error: } AE = |X - X_s| } \\~\\ \boxed{ \text{Relative Error: } RE = \left| \frac{X - X_s}{X} \right| } \\ \small \textit{where $X$ is the original value, and} \\ \textit{where $X_s$ is the stored/approximated value}

Example: Truncation and calculating error

Suppose a decimal computer than can only store numbers with 5 significant digits.

We have the following measurement and want to store it in the computer.

Sample (i)Value
115.0783073

Q: The value is 9 significant digits, yet our computer can only store 5. How can we store it in the computer?

A: We can truncate the number, dropping the least significant digits.


Q: How much absolute and relative error did we just incur by doing this?

A:

AE = | X - X_s | = | 15.0783073 - 15.078 | = 0.0003073

RE = \left| \frac{X - X_s}{X} \right| = \left| \frac{ 15.0783073 - 15.078}{ 15.0783073} \right| \approx 0.00002038

Remember: There are errors everywhere.

Floating Point Numbers

Floating-Point: A number with a point in it.

Scientific Notation

\boxed{ \text{Normalized Scientific Notation: } m \times b^n } \\~\\ \small \textit{(where $1 \le |m| \lt 10$)}

Scientific Notation: A way to express big or small numbers more conveniently.

Converting to Scientific Notation:

  1. m: Move the decimal point until only one digit is on the left of the point.
  2. b^n: Multiply the number by the base (b) raised to the # of times you moved the point (n).
Example: Scientific notation of a typical decimal

Q: Find the scientific notation representation of 347.5601.

A:

  1. Left Term: We move the decimal left by two digits.

347.5601 = 3.475601 \times ?

  1. Right Term: Base is ten and we only moved twice, so we multiply by 10^2

347.5601 = 3.475601 \times 10^2

Example: Scientific notation for binary floating-point

Q: Find the binary scientific notation of 101.1011_2

A: 1.011011 \times 10^{10}

Important: A property of non-zero binary numbers (floating-points and integers) is that the MSB is 1 (or else it wouldn’t be the MSB)

Integer Representation

Standard Signed Integer Refresher

A standard signed integer is a 32-bit pattern where the MSB is the sign bit.

Floating-Point Representation

IEEE-754

The binary pattern is split into:

  1. Sign: 1 bit (0 pos, 1 neg)
  2. Exponent: 8 bits
  3. Mantissa: 23 bits
Example: How IEEE-754 was made (one plus notation and exponent)

IEEE-754 is based on the scientific notation that was previously discussed.

One Plus Notation

We know that the MSB is always 1 (otherwise it wouldn’t be the MSB), so we can save a bit by not storing it.

The Three Parts

Hence, we only need three things:

\textcolor{green}{\pm} \; 1.\textcolor{green}{\text{mantissa}} \times 10_2^{\textcolor{green}{\text{exponent}}}

  1. Sign
  2. Mantissa (the part after the point)
  3. Exponent (we can ignore the base because it’s always two, or 10_2 in binary)

Exponent

The designers of IEEE-754 decided to dedicate 8 bits to the exponent

  1. Reserving Zero: If exponent is zero, we see it conflicts with the one plus notation (it would erroneously be 1)
    • So, we reserve the pattern 0000 0000 to signal that the whole number should be zero.
  2. Reserving Infinity: The designers also wanted to be able to represent infinity.
    • So, we reserve the pattern 1111 1111 to signal \pm infinity.
  3. Bias: The designers didn’t want to have a second sign bit, so they decided to add a bias to the exponent to guarantee positivity. (We do this by adding 127_{10} to every number.)

Very Important: We can still represent negative exponents! It’s just that to store them in IEEE-754, we add the bias.

Why Bias?

The designers wanted to remove the exponent’s sign bit to:

  1. Simplify hardware (no extra sign bit handling needed)
  2. Simplify comparison (e.g., checking if 127 - 127 > 126 - 127 is unnecessary and invokes negative numbers when you can just do 127 > 126)
Even More: Okay… but I really wanna know why…
  1. Biasing enforces lexicographic ordering (e.g., in two’s complement, a negative number could have a higher bit pattern than a larger number due to the sign bit shenanigans, while the biased version doesn’t introduce this complexity. Bitwise operations Just Work (TM))
  2. Clean zero/infinity signals and balanced range. (e.g., we can reserve 0000 0000 and 1111 1111 for zero and infinity (and there’s no wasted finagling with -0 and +0 like in two’s complement); and we get to dedicate the remaining unreserved bits towards [-126, 127], which is balanced around zero thanks to the bias. In the explanation this all was self-evident/obvious, but the answer to “which came first, the bias or the balance” is “both”)
Example: Converting a Binary Floating-Point to IEEE-754

I. Binary Floating-Point to IEEE-754

Q: Convert 1.01101 \times 10^{11} to IEEE-754.

A:

  1. The sign is positive, so the sign bit is 0.

  2. We need to bias the exponent before we can store it.

\begin{aligned} \text{Exponent} + \text{Bias} &= 11_2 + 127_{10} \\ &= 130_{10} \\ &= 10000010_2 \end{aligned}

Thus, the biased exponent is 10000010

  1. The mantissa is 01101, so we pad the right until it’s 23 bits. Thus, the padded mantissa is 01101000000000000000000

  2. Altogether, this means the IEEE-754 representation 1.01101 \times 10^{11} in IEEE-754 is: 0 01101000000000000000000 10000010 (sign, mantissa, exponent)


II. IEEE-754 to Decimal

Q: Now, convert 0 01101000000000000000000 10000010 to a decimal.

A:

We can see/calculate:

  1. Sign bit is positive.
  2. Exponent is 3 (unbiased by doing 10000010_2 - 127_{10})
  3. Mantissa is 1.40625

How We Got Mantissa:

First, let’s account for one plus notation: 01101 -> 1.01101.

Then, 1.01101_2 = \left[ 1 + \frac{1}{4} + \frac{1}{8} + \frac{1}{32} = 1.40625 \right]_{10}

In other words, we are doing:

1.\text{mantissa bits as fraction} \times 2^\text{unbiased exponent}

The final answer fills in as: 1.40625 \times 2^3 = 11.25_{10}

Example: Converting a Decimal to IEEE-754 and back

Q: Convert 1957.18765 to IEEE-754

  1. Convert integer part into binary

Successive Division of 1957:

19571
9780
4891
2440
1220
611
300
151
71
31
11
0
  1. Convert the floating-point part.

Remember: We store 23 bits in the mantissa (assuming 32-bit floating point).

Successive Multiplcation of 0.18765:

0.187650
0.37530
0.75061
2.50121
1.00240
0.00480
0.00960
0.01920
0.03840
0.07680
0.15360
0.30720
0.61441

Differences from Successive Division:

In this case, we read the string from top-to bottom:


Combining these two parts, we get:

11110100101.0011000000001

  1. To Scientific Notation

Now, we want to convert our binary floating-point to scientific notation.

11110100101.0011000000001

In scientific, after moving the decimal left ten times, it is

1.11101001010011000000001_2 \times 10_2^{1010}

Note: Be not afraid. The right side is just 2^{10} in decimal

  1. Sign, Exponent, Mantissa

Our sign is 0 (positive).

The exponent, after biasing, is 10 + 127 = 137 (decimal), or 10001001_2 (binary).

The mantissa is everything after the decimal point. 11101001010011000000001

Therefore, our final number is:

0 10001001 11101001010011000000001

Machine Epsilon: The smallest number \epsilon such that 1.0 + \epsilon > 1 in floating-point arithmetic.

1.0 + 2^{-23} > 1.0

Exponent Conversion

To add two floating-points with different exponents, you need to make their exponents equal.

Example: Why do we convert exponents this way?

To add two numbers with different exponents, you need to convert one of the numbers to use the same exponent as the other.

Suppose you want to add these two: one 5.3782 \times 10^{-5} \\~\\ 2.7641 \times 10^{-7}

Q: Should you:

  1. Convert the one with the smaller exponent, or

2.7641 \times 10^{-7} = 0.0276 \times 10^{-5}

  1. Convert the one with the bigger exponent?

5.3782 \times 10^{-5} = 537.82 \times 10^{-7}

Note: Assume our machine can only fit 5 bits magnitude.

A: We should do the first method (make the smaller one match the bigger one)

IEEE Double-Precision Standard

Like IEEE-754, but with 64 bits.

The binary pattern is split into:

  1. Sign: 1 bit (0 pos, 1 neg)
  2. Exponent: 11 bits
  3. Mantissa: 52 bits