Encoding Basics III - Signed Number Representations

It started elegant in mathematics but ends up ugly in Engineering.

Transplanted Post

After setting up this website I am gradually transferring all my previous blogs stored locally and in other blog system towards here. This is one of them. Thus, you should be aware that the date and time displayed on this blog is not accurate.

Post Series

This posts belongs to the series: Encoding Basics. You can see all posts in this series here.

Expected Reader Experience: None

This post is good for readers having zero or very limited knowledge on the topic: Encoding. However, since this post belongs to a series, you are expected to master the previous posts in the series.

Bits as Possibilities, NOT Numbers

The last post we mentioned that 1111 1010 represents -6 instead of 250 in computers. Why do we think it should be 250? Because by normal conversion procedure mentioned in the last post, if we convert 1111 1010 to base-10 we get 250; if we convert 250 in base-10 to base-2 we get 1111 1010. What is wrong?

The problem is that our computer does not use this simple conversion scheme in representing base-10 numbers as base-2 numbers. If it does - how do we represent negative numbers? We cannot add a sign like what we do in mathematics - remember, computers do not have signs. They only have 0s and 1s. So, instead of

0000 0000 = 0
0000 0001 = 1
0000 0010 = 2
...
1111 1110 = 254
1111 1111 = 255

We can devise a scheme that:

0000 0000 = -128
0000 0001 = -127
0000 0010 = -126
...
1111 1110 = 126
1111 1111 = 127

So as to accommodate negative numbers. What about numbers beyond -128 and 127? You have to use more bits to represent them. In fact, the common data type of signed integer number in many programming language is int, which often has 4 bytes. This means it can represents \(2^{32}\) different possibilities. Instead of representing \(0\) to \(2^{32}-1\), we can let them represent \(- 2^{31}\) to \(2^{31} -1\).

How is this even allowed? After all, mathematically, 0000 0000 in binary is never -128 in base-10. How could we impose such a definition which is mathematically incorrect? The reason is that the 0000 0000 here is not a number, but a possibility. A Byte of information gives us 256 possibilities and we can use these to represent any 256 distinct numbers: 0 to 255, or -128 to 127, or all odd numbers within 1 to 511…

The key thing is to select a correspondence scheme meaningfully so that we can represent numerical values as complete as possible with minimal complexity in terms of both machine process on operating them and human readability. Hence, we cannot just use the mathematical conversion as the correspondence scheme because it does not allow us to represent negative number.

True Form, and Two’s Complement

Representing Negative Numbers

The most intuitive way of representing negative numbers is to imitate what we do in mathematics. We can append a “sign” at the front of each binary sequence, using 0 as positive and 1 as negative. Thus, when machines interpret the numbers, they first have to know the first bit is not related to its value but its sign. For example, 0000 1001 is 9 but 1000 1001 is -9. The full list is:

0000 0000 = 0
0000 0001 = 1
0000 0010 = 2
...
0111 1111 = 127
1000 0000 = -0
1000 0001 = -1
1000 0010 = -2
...
1111 1110 = -126
1111 1111 = -127

This is the True From (or Sign/Magnitude) scheme (that is, we say that 1111 1110 is the true form of number -126). Let us get a little deeper than this. With convention mathematical conversion, 1000 0001 is 129, but as a true form it really represents -1. Interpret this another way, we have

\[ T = \begin{cases} C, & \text{if C < M } \\ M - C, & \text{if C >= M}\end{cases} \]

Or,

\[ T = \begin{cases} C, & \text{if C < M } \\ - (C \bmod (M+1)), & \text{if C >= M}\end{cases} \]

where T is the true value, C is the value obtained by converting the true form number to base-10 using mathematical conversion, and M is the minimal positive number that cannot be expressed by a fixed count of n bits (\(M = 2^{n-1}\); for example, \(M=2^{(8-1)}=128\) in this case where \(n=8\)).

This works, but clearly not optimally. For one, why are we having a 0 and a -0? This will cause problems in computer operation (The computer will report 0 != -0 since 0000 0000 != 1000 0000), and it really does not look elegant.

The second way of writing the true value inspires a more elegant solution is Two’s Complement, using the same idea of modulo. What about:

\[ T = C \bmod (2M) \]

Since this is an modulo operation, we take T value as the one that falls into the integers range \( [-M,M-1]\). This gives:

0000 0000 = 0
0000 0001 = 1
0000 0010 = 2
...
0111 1111 = 127
1000 0000 = -128
1000 0001 = -127
1000 0010 = -126
...
1111 1110 = -2
1111 1111 = -1

This scheme is called Two’s Complement. This is how positive and negative integers are represented in modern computers. Why use Two’s Complement? It does not have two 0s, and it is also intuitive (if you think about modulo). The greatest benefits is that it allows the same logic gate set up to perform addition and subtraction as subtracting is just adding a negative number.

That is why in the last post we get -6 on performing NOT on 5. 5 is 0000 0101 and a NOT turns it to 1111 1010 which represents -6 by Two’s Complement.

Heuristics

There are two other schemes of representing negative numbers other than True Form and Two’s Complement, which are rarely used.

One’s Complement:

\[ T = C \bmod (2M-1) \]

Since this is an modulo operation, we take T value as the one that falls into the integers range \( [-(M-1),M-1]\). This gives:

0000 0000 = 0
0000 0001 = 1
0000 0010 = 2
...
0111 1111 = 127
1000 0000 = -127
1000 0001 = -126
1000 0010 = -125
...
1111 1110 = -1
1111 1111 = -0

Offset Binary:

\[ T = (C+M) \bmod (2M) \]

Since this is an modulo operation, we take T value as the one that falls into the integers range \( [-M,M-1]\). This gives:

0000 0000 = -128
0000 0001 = -127
...
0111 1110 = -2
0111 1111 = -1
1000 0000 = 0
1000 0001 = 1
1000 0010 = 2
...
1111 1110 = 126
1111 1111 = 127

There are some heuristics for easy conversion between them:

For a positive number:
- Its True Form, One’s Complement and Two’s Complement are the same.
- Its Offset Binary is obtained by taking NOT on the signed bit of its Two’s Complement.
For a negative number:
- Its One’s Complement is obtained by taking NOT on all bits of its True Form, except the signed bit.
- Its Two’s Complement is obtained by adding 1 to its One’s Complement.
- Its Offset Binary is obtained by taking NOT on the signed bit of its Two’s Complement.

All these can be easily proven by modulo mathematics.

For example,

True Value (Base-10)	True Form	One’s Complement	Two’s Complement	Offset Binary
5	0000 0101	0000 0101	0000 0101	1000 0101
35	0010 0011	0010 0011	0010 0011	1010 0011
-5	1000 0101	1111 1010	1111 1011	0111 1011
-35	1010 0011	1101 1100	1101 1101	0101 1101

To summarize:

A Byte of Binary Digits	True Value (Interpreted as True Form)	True Value (Interpreted as One’s Complement)	True Value (Interpreted as Two’s Complement)	True Value (Interpreted as Offset Binary)
0000 0000	0	0	0	-128
0000 0001	1	1	1	-127
0000 0010	2	2	2	-126
…
0111 1110	126	126	126	-2
0111 1111	127	127	127	-1
1000 0000	-0	-127	-128	0
1000 0001	-1	-126	-127	1
…
1111 1101	-125	-2	-3	125
1111 1110	-126	-1	-2	126
1111 1111	-127	-0	-1	127

Floating-Point Numbers, and Other Numerical Types

Now, think about how to represent decimal numbers. In the previous post we know \(47.625_{(10)}=101111.101_{(2)}\). Hence, what about storing this number with 9 bits, the first 6 for its integer parts and the last three for its decimal parts? This works, but it certainly brings troubles when some numbers with some different counts of integer and decimal digits need to be represented. A solution is to allocate a fixed number of bits for the integer and a fixed number of bits for the decimal. For example, we use two bytes for each decimal number, with the first byte representing its integer part and the second byte representing its decimal part.

This is called Fixed-point representation, in the sense that it fixes the position of the decimal point in each binary sequence that is supposed to represent a number. However, the accuracy and range obtained from this scheme is not satisfactory. If you are interested, you can delve deeper by following up the links below:

For reasons and the current solution (IEEE 754), read this.
For a visualization of IEEE 754, use this.
Note that both the significand and the exponent in IEEE 754 format use the True Form representation (the sign/magnitude way) instead of Two’s Complement (Strictly speaking, exponent is defined to be positive in IEEE 754 so that does not really matter). Why? Read this.

Conclusion

Just knowing the principle about the bits as possibilities and 1-1 correspondence is not sufficient. Under the principle, there can be numerous practical scheme that implements the principle like the four for integer and the various IEEE standards for floating-point numbers. It is important to choose wisely. Next up, we are using the possibilities that bits give us for characters and symbols.

Buy me a cup of coffee?

Post author: Long Yu
Post link: https://reimirno.github.io/2021/08/07/Encoding-Basics-3/
Copyright Notice: All articles in this blog are licensed under unless otherwise stated.