Number Representation & IEEE-754
A positional number system (i.e., one based on powers of a base) is defined by a base and a set of digits, where numbers are sequences of those digits like so:
\boxed{
\text{Let } b = ? \text{ and } D = \{ d_0,d_1,d_2,d_3,...,d_n \}
} \\
\small\textit{Where $b$ is the base and $D$ is the possible digits}
\\~\\
\textit{A number is a sequence $n_i \in D$}
In general:
n_4 n_3 n_2 n_1 =
n_1 \times b^0 +
n_2 \times b^1 +
n_3 \times b^2 +
n_4 \times b^3
A specific example with a base 10 number:
1 9 3 7 =
7 \times 10^0 +
3 \times 10^1 +
9 \times 10^2 +
1 \times 10^3
A specific example with a base 2 number:
1 1 0 1 =
1 \times 2^0 +
0 \times 2^1 +
1 \times 2^2 +
1 \times 2^3
Example: Value of a Number
Modern computers represent numbers in binary (b=2, D = \{0,1\}).
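As a minimal sketch of the positional expansion (the function name `positional_value` is an illustrative assumption, not something from the course material), the value of a digit sequence can be computed in Python like so:

```python
def positional_value(digits, base):
    """Return the value of a digit sequence (most significant digit first)."""
    value = 0
    # Walk from the least significant digit, whose weight is base**0.
    for power, digit in enumerate(reversed(digits)):
        if not 0 <= digit < base:
            raise ValueError(f"digit {digit} is not valid in base {base}")
        value += digit * base**power
    return value


print(positional_value([1, 9, 3, 7], 10))  # 1937
print(positional_value([1, 1, 0, 1], 2))   # 13
```

The two calls mirror the base 10 and base 2 examples: each digit is multiplied by the base raised to its position, counting from zero at the right.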
Don’t Get Tripped Up!: Integer Division and Remainder
- If you’re doing decimal division rather than integer division (easy to do if you’re using a calculator instead of your head), the remainder is 1 when the quotient ends in .5 and 0 when the quotient is a whole number.
- If you’re doing decimal division and the quotient has any other fractional part, you forgot to do integer division in an earlier step. (Truncate everything after the point and divide only the integer.)
Let’s convert 633 to binary with successive division.
Number | Remainder |
---|---|
633 | 1 |
316 | 0 |
158 | 0 |
79 | 1 |
39 | 1 |
19 | 1 |
9 | 1 |
4 | 0 |
2 | 0 |
1 | 1 |
0 | |
We read the remainders from the bottom up, so the answer is:
633_{10} = 1001111001_2
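The successive-division procedure can be sketched in Python as follows. This is a minimal illustration; the name `to_binary` is an assumption made for the example.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by successive division."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)  # integer quotient and remainder in one step
        remainders.append(str(r))
    # The first remainder is the least significant bit, so read bottom-up.
    return "".join(reversed(remainders))


print(to_binary(633))  # 1001111001
```

Using `divmod` sidesteps the decimal-division pitfall entirely, because it always performs integer division.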
Significance: How much a digit contributes to the overall value of a number.
9378 = 9_4 3_3 7_2 8_1 \\
\small\textit{(subscripts show that digit's significance)}
Example: Most and Least Significance
Loss of Significance: We can incur a loss of significance when storing numbers or doing arithmetic.
\boxed{
\text{Absolute Error: } AE = |X - X_s|
} \\~\\
\boxed{
\text{Relative Error: } RE = \left|
\frac{X - X_s}{X}
\right|
} \\
\small
\textit{where $X$ is the original value, and} \\
\textit{where $X_s$ is the stored/approximated value}
Suppose a decimal computer that can only store numbers with 5 significant digits. We have the following measurement and want to store it in the computer.
Sample (i) | Value |
---|---|
1 | 15.0783073 |
Q: The value has 9 significant digits, yet our computer can only store 5. How can we store it in the computer?
A: We can truncate the number, dropping the least significant digits: 15.078
Q: How much absolute and relative error did we just incur by doing this?
A:
AE = | X - X_s |
= | 15.0783073 - 15.078 |
= 0.0003073
RE = \left| \frac{X - X_s}{X} \right|
= \left| \frac{ 15.0783073 - 15.078}{ 15.0783073} \right|
\approx 0.00002038
So the absolute error is 0.0003073, or a relative error of 0.002038%.
Example: Truncation and calculating error
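The two error formulas can be checked with a short Python sketch. It uses the standard `decimal` module so the decimal digits are handled exactly. The function names are illustrative assumptions.

```python
from decimal import Decimal


def absolute_error(x, x_s):
    """AE = |X - X_s|"""
    return abs(x - x_s)


def relative_error(x, x_s):
    """RE = |(X - X_s) / X|"""
    return abs((x - x_s) / x)


x = Decimal("15.0783073")  # original measurement
x_s = Decimal("15.078")    # value after truncating to 5 significant digits

print(absolute_error(x, x_s))  # 0.0003073
print(relative_error(x, x_s))  # roughly 0.00002038
```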
Remember: There are errors everywhere.
- Even instruments can have error.
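Floating-Point: A number with a radix point in it, whose position can "float" because the number is stored as a mantissa scaled by a power of the base.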
\boxed{ \text{Normalized Scientific Notation: } m \times b^n } \\~\\ \small \textit{(where $1 \le |m| \lt b$; for decimal, $1 \le |m| \lt 10$)}
Scientific Notation: A way to express big or small numbers more conveniently.
Converting to Scientific Notation:
Q: Find the scientific notation representation of 347.5601.
A:
347.5601 = 3.475601 \times ?
347.5601 = 3.475601 \times 10^2
Q: Find the binary scientific notation of 101.1011_2
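A: 1.011011_2 \times 10_2^{10} (the base and exponent are written in binary, so this is 1.011011_2 \times 2^2: the point moved left two places)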
Important: A property of non-zero binary numbers (floating-points and integers) is that the MSB is 1 (or else it wouldn’t be the MSB)
A single-precision IEEE-754 number is a 32-bit pattern where the MSB is the sign bit.
The binary pattern is split into:
- 1 sign bit
- 8 exponent bits
  - 0000 0000 is reserved to set the entire number to zero
  - 1111 1111 is reserved for infinity (use the sign bit above to set \pm\infty)
- 23 mantissa bits (the digits after the implied leading 1.)
IEEE-754 is based on the scientific notation that was previously discussed.
We know that the MSB is always 1 (otherwise it wouldn’t be the MSB), so we can save a bit by not storing it: every number is assumed to start with “1.” (this is one plus notation).
Hence, we only need three things:
\textcolor{green}{\pm} \; 1.\textcolor{green}{\text{mantissa}} \times 10_2^{\textcolor{green}{\text{exponent}}}
The designers of IEEE-754 decided to dedicate 8 bits to the exponent, reserving two patterns:
- 0000 0000 to signal that the whole number should be zero.
- 1111 1111 to signal \pm infinity.
Very Important: We can still represent negative exponents! It’s just that to store them in IEEE-754, we add the bias (127 for single precision).
- (So, actual exponent = stored exponent - 127)
The designers wanted to remove the exponent’s sign bit: that lets us reserve 0000 0000 and 1111 1111 for zero and infinity (and there’s no wasted finagling with -0 and +0 like in sign-magnitude); and we get to dedicate the remaining unreserved patterns towards [-126, 127], which is balanced around zero thanks to the bias. (In the explanation this all seems self-evident, but the answer to “which came first, the bias or the balance” is “both”.)
Q: Convert 1.01101_2 \times 10_2^{11} to IEEE-754.
A:
The sign is positive, so the sign bit is 0.
We need to bias the exponent before we can store it.
\begin{aligned} \text{Exponent} + \text{Bias} &= 11_2 + 127_{10} \\ &= 130_{10} \\ &= 10000010_2 \end{aligned}
Thus, the biased exponent is 10000010
The mantissa is 01101, so we pad the right with zeros until it’s 23 bits.
Thus, the padded mantissa is 01101000000000000000000
Altogether, this means the IEEE-754 representation of 1.01101_2 \times 10_2^{11} is: 0 10000010 01101000000000000000000
(sign, exponent, mantissa)
Q: Now, convert 0 10000010 01101000000000000000000 to decimal.
A:
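We can see/calculate:
- Sign: 0, so the number is positive.
- Exponent: 10000010_2 = 130_{10}, and 130 - 127 = 3.
- Mantissa: 1.40625 (derived below).
How We Got Mantissa:
First, let’s account for one plus notation: 01101 -> 1.01101.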
Then, 1.01101_2 = \left[ 1 + \frac{1}{4} + \frac{1}{8} + \frac{1}{32} = 1.40625 \right]_{10}
In other words, we are doing:
1.\text{mantissa bits as fraction} \times 2^\text{unbiased exponent}
The final answer fills in as: 1.40625 \times 2^3 = 11.25_{10}
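The decoding steps above can be sketched in Python. This is a minimal illustration that assumes a normalized number (the reserved exponent patterns are not handled). The function name `decode_single` is an assumption for the example.

```python
def decode_single(sign, exponent, mantissa):
    """Decode IEEE-754 single-precision fields given as bit strings.

    Sketch only: assumes a normalized number, so the reserved exponent
    patterns 00000000 and 11111111 are not treated specially.
    """
    bias = 127
    # One plus notation: restore the implied leading 1, then add each
    # mantissa bit with weight 2**-1, 2**-2, ...
    significand = 1.0
    for i, bit in enumerate(mantissa, start=1):
        if bit == "1":
            significand += 2.0**-i
    value = significand * 2.0 ** (int(exponent, 2) - bias)
    return -value if sign == "1" else value


print(decode_single("0", "10000010", "01101000000000000000000"))  # 11.25
```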
Q: Convert 1957.18765 to IEEE-754
Successive Division of 1957:
Number | Remainder |
---|---|
1957 | 1 |
978 | 0 |
489 | 1 |
244 | 0 |
122 | 0 |
61 | 1 |
30 | 0 |
15 | 1 |
7 | 1 |
3 | 1 |
1 | 1 |
0 | |
Remember: We store 23 bits in the mantissa (assuming 32-bit floating point).
- The integer portion already consumes 10 bits (it has 11 bits, but with one plus notation we don’t store the MSB).
- Thus, we’ll only convert the fraction up to 13 bits of precision (23 - 10 = 13).
Successive Multiplication of 0.18765:
Fraction | Integer part after doubling |
---|---|
0.18765 | 0 |
0.3753 | 0 |
0.7506 | 1 |
0.5012 | 1 |
0.0024 | 0 |
0.0048 | 0 |
0.0096 | 0 |
0.0192 | 0 |
0.0384 | 0 |
0.0768 | 0 |
0.1536 | 0 |
0.3072 | 0 |
0.6144 | 1 |
Differences from Successive Division:
- We multiply by 2 rather than divide.
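- The bit we record is the integer part of each product (rather than a remainder); we then drop that integer part and keep doubling only the fraction.
In this case, we read the string from top to bottom:
0011000000001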
Combining these two parts, we get:
11110100101.0011000000001
Now, we want to convert our binary floating-point to scientific notation.
11110100101.0011000000001
In scientific notation, after moving the point left ten places, it is
1.11101001010011000000001_2 \times 10_2^{1010}
Note: Be not afraid. The right side, 10_2^{1010}, is just 2^{10} written in binary.
Our sign is 0 (positive).
The exponent, after biasing, is 10 + 127 = 137 (decimal), or 10001001_2 (binary).
The mantissa is everything after the binary point: 11101001010011000000001
Therefore, our final number is:
0 10001001 11101001010011000000001
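A hand conversion like this can be cross-checked with Python's standard `struct` module, which packs a value as a 32-bit float. The helper name `single_fields` is an assumption for this sketch. Note that `struct` rounds to the nearest representable value rather than truncating, so in general the last mantissa bit could differ from a truncated hand result. For this value the two agree.

```python
import struct


def single_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x as a 32-bit float."""
    # Pack as a big-endian single-precision float, then reinterpret the
    # same four bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]


print(single_fields(1957.18765))
# ('0', '10001001', '11101001010011000000001')
```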
Machine Epsilon: The smallest number \epsilon such that 1.0 + \epsilon > 1 in floating-point arithmetic.
- In other words, it’s the smallest number you can add to 1 such that the result isn’t 1.
- e.g., for single precision, \epsilon = 2^{-23} (the weight of the last of the 23 mantissa bits):
1.0 + 2^{-23} > 1.0
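As a rough sketch, the single-precision value can be found experimentally in Python. Python floats are double precision, so the hypothetical helper `to_single` rounds each result to single precision via `struct`. The halving search only tries powers of two and relies on round-to-nearest behaviour, so treat it as an illustration rather than a definition.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


eps = 1.0
# Keep halving while 1 + eps/2 is still distinguishable from 1 in single precision.
while to_single(1.0 + eps / 2) > 1.0:
    eps /= 2

print(eps == 2.0**-23)  # True
```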
To add two floating-point numbers with different exponents, you first need to rewrite one of them so that both use the same exponent.
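Suppose you want to add these two:
5.3782 \times 10^{-5} \\~\\ 2.7641 \times 10^{-7}
Q: Should you:
- rewrite the smaller number to use the larger exponent: 2.7641 \times 10^{-7} = 0.0276 \times 10^{-5}
- or rewrite the larger number to use the smaller exponent: 5.3782 \times 10^{-5} = 537.82 \times 10^{-7}
Note: Assume our machine can only fit 5 digits of magnitude.
A: We should do the first method (make the smaller one match the bigger one). Shifting the smaller number only drops its least significant digits (0.027641 becomes 0.0276), so it costs the least accuracy.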
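The first method can be sketched with Python's `decimal` module. The sketch assumes the machine keeps one digit before the point and four after it, and truncates anything beyond that. This is one interpretation of the 5-digit limit, not something the notes specify.

```python
from decimal import Decimal, ROUND_DOWN

a = Decimal("5.3782e-5")
b = Decimal("2.7641e-7")

# Express both mantissas relative to the larger exponent, 10**-5.
a_m = a.scaleb(5)  # 5.3782
# Shifting the smaller number right drops its least significant digits.
b_m = b.scaleb(5).quantize(Decimal("0.0001"), rounding=ROUND_DOWN)  # 0.0276

total = a_m + b_m
print(total)  # 5.4058, i.e. 5.4058 x 10^-5
```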
Double precision: Like single-precision IEEE-754, but with 64 bits.
The binary pattern is split into:
- 1 sign bit
- 11 exponent bits
  - 0000 0000 000 is reserved to set the entire number to zero.
  - 1111 1111 111 is reserved for infinity.
- 52 mantissa bits