

## Digital Building Blocks

### 5.1 INTRODUCTION

Up to this point, we have examined the design of combinational and sequential circuits using Boolean equations, schematics, and HDLs. This chapter introduces more elaborate combinational and sequential building blocks used in digital systems. These blocks include arithmetic circuits, counters, shift registers, memory arrays, and logic arrays. These building blocks are not only useful in their own right, but they also demonstrate the principles of hierarchy, modularity, and regularity. The building blocks are hierarchically assembled from simpler components such as logic gates, multiplexers, and decoders. Each building block has a welldefined interface and can be treated as a black box when the underlying implementation is unimportant. The regular structure of each building block is easily extended to different sizes. In Chapter 7, we use many of these building blocks to build a microprocessor.

### 5.2 ARITHMETIC CIRCUITS

Arithmetic circuits are the central building blocks of computers. Computers and digital logic perform many arithmetic functions: addition, subtraction, comparisons, shifts, multiplication, and division. This section describes hardware implementations for all of these operations.

### 5.2.1 Addition

Addition is one of the most common operations in digital systems. We first consider how to add two 1-bit binary numbers. We then extend to N -bit binary numbers. Adders also illustrate trade-offs between speed and complexity.

## 5

### 5.1 Introduction

5.2 Arithmetic Circuits
5.3 Number Systems
5.4 Sequential Building Blocks
5.5 Memory Arrays
5.6 Logic Arrays
5.7 Summary

Exercises
Interview Questions



Figure 5.1 1-bit half adder

$$
\begin{array}{r}
1 \\
0001 \\
+0101 \\
\hline 0110
\end{array}
$$

Figure 5.2 Carry bit


Figure 5.3 1-bit full adder

## Half Adder

We begin by building a 1-bit half adder. As shown in Figure 5.1, the half adder has two inputs, $A$ and $B$, and two outputs, $S$ and $C_{\text {out }} . S$ is the sum of $A$ and $B$. If $A$ and $B$ are both $1, S$ is 2 , which cannot be represented with a single binary digit. Instead, it is indicated with a carry out $C_{\text {out }}$ in the next column. The half adder can be built from an XOR gate and an AND gate.

In a multi-bit adder, $C_{\text {out }}$ is added or carried in to the next most significant bit. For example, in Figure 5.2, the carry bit shown in blue is the output $C_{\text {out }}$ of the first column of 1-bit addition and the input $C_{\text {in }}$ to the second column of addition. However, the half adder lacks a $C_{\text {in }}$ input to accept $C_{\text {out }}$ of the previous column. The full adder, described in the next section, solves this problem.

## Full Adder

A full adder, introduced in Section 2.1, accepts the carry in $C_{\text {in }}$ as shown in Figure 5.3. The figure also shows the output equations for $S$ and $C_{\text {out }}$.

## Carry Propagate Adder

An $N$-bit adder sums two $N$-bit inputs, $A$ and $B$, and a carry in $C_{\text {in }}$ to produce an $N$-bit result $S$ and a carry out $C_{\text {out }}$. It is commonly called a carry propagate adder (CPA) because the carry out of one bit propagates into the next bit. The symbol for a CPA is shown in Figure 5.4; it is drawn just like a full adder except that $A, B$, and $S$ are busses rather than single bits. Three common CPA implementations are called ripple-carry adders, carry-lookahead adders, and prefix adders.

## Ripple-Carry Adder

The simplest way to build an $N$-bit carry propagate adder is to chain together $N$ full adders. The $C_{\text {out }}$ of one stage acts as the $C_{\text {in }}$ of the next stage, as shown in Figure 5.5 for 32-bit addition. This is called a ripplecarry adder. It is a good application of modularity and regularity: the full adder module is reused many times to form a larger system. The ripplecarry adder has the disadvantage of being slow when $N$ is large. $S_{31}$ depends on $C_{30}$, which depends on $C_{29}$, which depends on $C_{28}$, and so forth all the way back to $C_{\mathrm{in}}$, as shown in blue in Figure 5.5. We say that the carry ripples through the carry chain. The delay of the adder, $t_{\text {ripple }}$,


Figure 5.4 Carry propagate adder
grows directly with the number of bits, as given in Equation 5.1, where $t_{F A}$ is the delay of a full adder.

$$
\begin{equation*}
t_{\text {ripple }}=N t_{F A} \tag{5.1}
\end{equation*}
$$

## Carry-Lookahead Adder

The fundamental reason that large ripple-carry adders are slow is that the carry signals must propagate through every bit in the adder. A carrylookahead adder (CLA) is another type of carry propagate adder that solves this problem by dividing the adder into blocks and providing circuitry to quickly determine the carry out of a block as soon as the carry in is known. Thus it is said to look abead across the blocks rather than waiting to ripple through all the full adders inside a block. For example, a 32 -bit adder may be divided into eight 4 -bit blocks.

CLAs use generate $(G)$ and propagate $(P)$ signals that describe how a column or block determines the carry out. The $i$ th column of an adder is said to generate a carry if it produces a carry out independent of the carry in. The $i$ th column of an adder is guaranteed to generate a carry $C_{i}$ if $A_{i}$ and $B_{i}$ are both 1 . Hence $G_{i}$, the generate signal for column $i$, is calculated as $G_{i}=A_{i} B_{i}$. The column is said to propagate a carry if it produces a carry out whenever there is a carry in. The $i$ th column will propagate a carry in, $C_{i-1}$, if either $A_{i}$ or $B_{i}$ is 1 . Thus, $P_{i}=A_{i}+B_{i}$. Using these definitions, we can rewrite the carry logic for a particular column of the adder. The $i$ th column of an adder will generate a carry out $C_{i}$ if it either generates a carry, $G_{i}$, or propagates a carry in, $P_{i} C_{i-1}$. In equation form,

$$
\begin{equation*}
C_{i}=A_{i} B_{i}+\left(A_{i}+B_{i}\right) C_{i-1}=G_{i}+P_{i} C_{i-1} \tag{5.2}
\end{equation*}
$$

The generate and propagate definitions extend to multiple-bit blocks. A block is said to generate a carry if it produces a carry out independent of the carry in to the block. The block is said to propagate a carry if it produces a carry out whenever there is a carry in to the block. We define $G_{i: j}$ and $P_{i: j}$ as generate and propagate signals for blocks spanning columns $i$ through $j$.

A block generates a carry if the most significant column generates a carry, or if the most significant column propagates a carry and the previous column generated a carry, and so forth. For example, the generate logic for a block spanning columns 3 through 0 is

$$
\begin{equation*}
G_{3: 0}=G_{3}+P_{3}\left(G_{2}+P_{2}\left(G_{1}+P_{1} G_{0}\right)\right) \tag{5.3}
\end{equation*}
$$

A block propagates a carry if all the columns in the block propagate the carry. For example, the propagate logic for a block spanning columns 3 through 0 is

$$
\begin{equation*}
P_{3: 0}=P_{3} P_{2} P_{1} P_{0} \tag{5.4}
\end{equation*}
$$

Using the block generate and propagate signals, we can quickly compute the carry out of the block, $C_{i}$, using the carry in to the block, $C_{j-1}$.

$$
\begin{equation*}
C_{i}=G_{i: j}+P_{i: j} C_{j-1} \tag{5.5}
\end{equation*}
$$

Schematics typically show signals flowing from left to right. Arithmetic circuits break this rule because the carries flow from right to left (from the least significant column to the most significant column).


Throughout the ages, people have used many devices to perform arithmetic. Toddlers count on their fingers (and some adults stealthily do too). The Chinese and Babylonians invented the abacus as early as 2400 BC. Slide rules, invented in 1630 , were in use until the 1970's, when scientific hand calculators became prevalent. Computers and digital calculators are ubiquitous today. What will be next?

Figure 5.6(a) shows a 32-bit carry-lookahead adder composed of eight 4-bit blocks. Each block contains a 4-bit ripple-carry adder and some lookahead logic to compute the carry out of the block given the carry in, as shown in Figure 5.6(b). The AND and OR gates needed to compute the single-bit generate and propagate signals, $G_{i}$ and $P_{i}$, from $A_{i}$ and $B_{i}$ are left out for brevity. Again, the carry-lookahead adder demonstrates modularity and regularity.

All of the CLA blocks compute the single-bit and block generate and propagate signals simultaneously. The critical path starts with computing $G_{0}$ and $G_{3: 0}$ in the first CLA block. $C_{\text {in }}$ then advances directly to $C_{\text {out }}$ through the AND/OR gate in each block until the last. For a large adder, this is much faster than waiting for the carries to ripple through each consecutive bit of the adder. Finally, the critical path through the last block contains a short ripple-carry adder. Thus, an N -bit adder divided into $k$-bit blocks has a delay

$$
\begin{equation*}
t_{C L A}=t_{p g}+t_{p g_{-} \text {block }}+\left(\frac{N}{k}-1\right) t_{\mathrm{AND}_{-} \mathrm{OR}}+k t_{F A} \tag{5.6}
\end{equation*}
$$

Figure 5.6 (a) 32-bit carrylookahead adder (CLA), (b) 4-bit CLA block

where $t_{p g}$ is the delay of the individual generate/propagate gates (a single AND or OR gate) to generate $P_{i}$ and $G_{i}, t_{p g_{-} \text {block }}$ is the delay to find the generate/propagate signals $P_{i: j}$ and $G_{i: j}$ for a $k$-bit block, and $t_{\text {AND_OR }}$ is the delay from $C_{\text {in }}$ to $C_{\text {out }}$ through the final AND/OR logic of the $k$-bit CLA block. For $N>16$, the carry-lookahead adder is generally much faster than the ripple-carry adder. However, the adder delay still increases linearly with $N$.

## Example 5.1 RIPPLE-CARRY ADDER AND CARRY-LOOKAHEAD ADDER DELAY

Compare the delays of a 32-bit ripple-carry adder and a 32-bit carry-lookahead adder with 4-bit blocks. Assume that each two-input gate delay is 100 ps and that a full adder delay is 300 ps .

Solution: According to Equation 5.1, the propagation delay of the 32-bit ripplecarry adder is $32 \times 300 \mathrm{ps}=9.6 \mathrm{~ns}$.

The CLA has $t_{p g}=100 \mathrm{ps}, t_{\text {pg_block }}=6 \times 100 \mathrm{ps}=600 \mathrm{ps}$, and $t_{\mathrm{AND}_{-} \mathrm{OR}}=2 \times 100 \mathrm{ps}=$ 200 ps . According to Equation 5.6 , the propagation delay of the 32 -bit carry-lookahead adder with 4 -bit blocks is thus $100 \mathrm{ps}+600 \mathrm{ps}+(32 / 4-1) \times 200 \mathrm{ps}+(4 \times 300 \mathrm{ps})=$ 3.3 ns , almost three times faster than the ripple-carry adder.

## Prefix Adder*

Prefix adders extend the generate and propagate logic of the carrylookahead adder to perform addition even faster. They first compute $G$ and $P$ for pairs of columns, then for blocks of 4 , then for blocks of 8 , then 16 , and so forth until the generate signal for every column is known. The sums are computed from these generate signals.

In other words, the strategy of a prefix adder is to compute the carry in $C_{i-1}$ for each column $i$ as quickly as possible, then to compute the sum, using

$$
\begin{equation*}
S_{i}=\left(A_{i} \oplus B_{i}\right) \oplus C_{i-1} \tag{5.7}
\end{equation*}
$$

Define column $i=-1$ to hold $C_{\text {in }}$, so $G_{-1}=C_{\text {in }}$ and $P_{-1}=0$. Then $C_{i-1}=$ $G_{i-1:-1}$ because there will be a carry out of column $i-1$ if the block spanning columns $i-1$ through -1 generates a carry. The generated carry is either generated in column $i-1$ or generated in a previous column and propagated. Thus, we rewrite Equation 5.7 as

$$
\begin{equation*}
S_{i}=\left(A_{i} \oplus B_{i}\right) \oplus G_{i-1:-1} \tag{5.8}
\end{equation*}
$$

Hence, the main challenge is to rapidly compute all the block generate signals $G_{-1:-1}, G_{0:-1}, G_{1:-1}, G_{2:-1}, \ldots, G_{N-2:-1}$. These signals, along with $P_{-1:-1}, P_{0:-1}, P_{1:-1}, P_{2:-1}, \ldots, P_{N-2:-1}$, are called prefixes.

Early computers used ripplecarry adders, because components were expensive and ripple-carry adders used the least hardware. Virtually all modern PCs use prefix adders on critical paths, because transistors are now cheap and speed is of great importance.

Figure 5.7 shows an $N=16$-bit prefix adder. The adder begins with a precomputation to form $P_{i}$ and $G_{i}$ for each column from $A_{i}$ and $B_{i}$ using AND and OR gates. It then uses $\log _{2} N=4$ levels of black cells to form the prefixes of $G_{i: j}$ and $P_{i: j}$. A black cell takes inputs from the upper part of a block spanning bits $i: k$ and from the lower part spanning bits $k-1: j$. It combines these parts to form generate and propagate signals for the entire block spanning bits $i: j$ using the equations

$$
\begin{gather*}
G_{i: j}=G_{i: k}+P_{i: k} G_{k-1: j}  \tag{5.9}\\
P_{i: j}=P_{i: k} P_{k-1: j} \tag{5.10}
\end{gather*}
$$

In other words, a block spanning bits $i: j$ will generate a carry if the upper part generates a carry or if the upper part propagates a carry generated in the lower part. The block will propagate a carry if both the upper and

Figure 5.7 16-bit prefix adder


Legend

lower parts propagate the carry. Finally, the prefix adder computes the sums using Equation 5.8.

In summary, the prefix adder achieves a delay that grows logarithmically rather than linearly with the number of columns in the adder. This speedup is significant, especially for adders with 32 or more bits, but it comes at the expense of more hardware than a simple carry-lookahead adder. The network of black cells is called a prefix tree.

The general principle of using prefix trees to perform computations in time that grows logarithmically with the number of inputs is a powerful technique. With some cleverness, it can be applied to many other types of circuits (see, for example, Exercise 5.7).

The critical path for an $N$-bit prefix adder involves the precomputation of $P_{i}$ and $G_{i}$ followed by $\log _{2} N$ stages of black prefix cells to obtain all the prefixes. $G_{i-1:-1}$ then proceeds through the final XOR gate at the bottom to compute $S_{i}$. Mathematically, the delay of an $N$-bit prefix adder is

$$
\begin{equation*}
t_{P A}=t_{p g}+\log _{2} N\left(t_{p g_{-p r e f i x}}\right)+t_{\text {XOR }} \tag{5.11}
\end{equation*}
$$

where $t_{p g_{-} \text {prefix }}$ is the delay of a black prefix cell.

## Example 5.2 PREFIX ADDER DELAY

Compute the delay of a 32 -bit prefix adder. Assume that each two-input gate delay is 100 ps .

Solution: The propagation delay of each black prefix cell $t_{p g_{\text {_prefix }}}$ is 200 ps (i.e., two gate delays). Thus, using Equation 5.11, the propagation delay of the 32 -bit prefix adder is $100 \mathrm{ps}+\log _{2}(32) \times 200 \mathrm{ps}+100 \mathrm{ps}=1.2 \mathrm{~ns}$, which is about three times faster than the carry-lookahead adder and eight times faster than the ripple-carry adder from Example 5.1. In practice, the benefits are not quite this great, but prefix adders are still substantially faster than the alternatives.

## Putting It All Together

This section introduced the half adder, full adder, and three types of carry propagate adders: ripple-carry, carry-lookahead, and prefix adders. Faster adders require more hardware and therefore are more expensive and power-hungry. These trade-offs must be considered when choosing an appropriate adder for a design.

Hardware description languages provide the + operation to specify a CPA. Modern synthesis tools select among many possible implementations, choosing the cheapest (smallest) design that meets the speed requirements. This greatly simplifies the designer's job. HDL Example 5.1 describes a CPA with carries in and out.

## HDL Example 5.1 ADDER

## SystemVerilog

module adder 非( parameter $N=8$ )
(input logic $[\mathrm{N}-1: 0] \mathrm{a}, \mathrm{b}$,
input logic cin, output logic $[\mathrm{N}-1: 0] \mathrm{s}$, output logic cout);
assign $\{$ cout, $s\}=a+b+c i n$; endmodule

## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD_UNSIGNED.ALL;
entity adder is
    generic(N: integer:=8);
    port(a,b: in STD_LOGIC_VECTOR(N-1 downto 0);
            cin: in STD_LOGIC;
            s: out STD_LOGIC_VECTOR(N-1 downto 0);
            cout: out STD_LOGIC);
end;
architecture synth of adder is
    signal result: STD_LOGIC_VECTOR(N downto 0);
begin
    result<= ("0" & a) + ("0" & b ) + cin;
    s <= result(N-1 downto 0);
    cout <= result(N);
end;
```



Figure 5.8 Synthesized adder


Figure 5.9 Subtractor: (a) symbol, (b) implementation

### 5.2.2 Subtraction

Recall from Section 1.4.6 that adders can add positive and negative numbers using two's complement number representation. Subtraction is almost as easy: flip the sign of the second number, then add. Flipping the sign of a two's complement number is done by inverting the bits and adding 1.

To compute $Y=A-B$, first create the two's complement of $B$ : Invert the bits of $B$ to obtain $\bar{B}$ and add 1 to get $-B=\bar{B}+1$. Add this quantity to $A$ to get $Y=A+\bar{B}+1=A-B$. This sum can be performed with a single CPA by adding $A+\bar{B}$ with $C_{\text {in }}=1$. Figure 5.9 shows the symbol for a subtractor and the underlying hardware for performing $Y=A-B$. HDL Example 5.2 describes a subtractor.

### 5.2.3 Comparators

A comparator determines whether two binary numbers are equal or if one is greater or less than the other. A comparator receives two $N$-bit binary numbers $A$ and $B$. There are two common types of comparators.

HDL Example 5.2 SUBTRACTOR

## SystemVerilog

```
module subtractor 非(parameter N=8)
    (input logic [N-1:0]a, b,
    output logic [N-1:0] y);
    assign y=a-b;
endmodule
```


## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD_UNSIGNED.ALL;
entity subtractor is
    generic(N: integer:= 8);
    port(a,b: in STD_LOGIC_VECTOR(N-1 downto 0);
        y: out STD_LOGIC_VECTOR(N-1 downto 0));
end;
architecture synth of subtractor is
begin
    y<= a - b;
end;
```



Figure 5.10 Synthesized subtractor


An equality comparator produces a single output indicating whether $A$ is equal to $B(A==B)$. A magnitude comparator produces one or more outputs indicating the relative values of $A$ and $B$.

The equality comparator is the simpler piece of hardware. Figure 5.11 shows the symbol and implementation of a 4-bit equality comparator. It first checks to determine whether the corresponding bits in each column of $A$ and $B$ are equal using XNOR gates. The numbers are equal if all of the columns are equal.

Magnitude comparison of signed numbers is usually done by computing $A-B$ and looking at the sign (most significant bit) of the result as shown in Figure 5.12. If the result is negative (i.e., the sign bit is 1 ), then $A$ is less than $B$. Otherwise $A$ is greater than or equal to $B$. This comparator, however, functions incorrectly upon overflow. Exercises 5.9 and 5.10 explore this limitation and how to fix it.

Figure 5.11 4-bit equality comparator: (a) symbol, (b) implementation


Figure 5.12 N -bit signed comparator

## HDL Example 5.3 COMPARATORS

## SystemVerilog

```
module comparator 非(parameter N = 8)
    (input logic[N-1:0] a,b,
    output logiceq, neq, 1t, 1te, gt, gte);
```

    assign eq \(=(a==b)\);
    assign neq \(=(a!=b)\);
    assign 1t \(=(a<b)\);
    assign 1te \(=(a<=b)\);
    assign gt \(=(a>b)\);
    assign gte \(=(a>=b)\);
    endmodule

## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
entity comparators is
    generic(N: integer:=8);
    port(a,b: in STD_LOGIC_VECTOR(N-1 downto 0);
        eq, neq, 1t, 1te, gt, gte: out STD_LOGIC);
end;
architecture synth of comparator is
begin
    eq <= '1'when (a=b) else '0';
    neq <= '1' when (a/= b) else '0';
    1t <= '1' when (a<b) else '0';
    1te<='1' when (a<= b) else '0';
    gt <='1'when (a>b) else '0';
    gte<='1' when (a>= b) else '0';
end;
```



Figure 5.13 Synthesized comparators


Figure 5.14 ALU symbol

HDL Example 5.3 shows how to use various comparison operations for unsigned numbers.

### 5.2.4 ALU

An Arithmetic/Logical Unit (ALU) combines a variety of mathematical and logical operations into a single unit. For example, a typical ALU

Table 5.1 ALU operations

| ALUControl $_{1: 0}$ | Function |
| :--- | :--- |
| 00 | Add |
| 01 | Subtract |
| 10 | AND |
| 11 | OR |

might perform addition, subtraction, AND, and OR operations. The ALU forms the heart of most computer systems.

Figure 5.14 shows the symbol for an N -bit ALU with N -bit inputs and outputs. The ALU receives a 2-bit control signal ALUControl that specifies which function to perform. Control signals will generally be shown in blue to distinguish them from the data. Table 5.1 lists typical functions that the ALU can perform.

Figure 5.15 shows an implementation of the ALU. The ALU contains an N -bit adder and $N$ two-input AND and OR gates. It also contains inverters and a multiplexer to invert input $B$ when $A L U C o n t r o l_{0}$ is asserted. A 4:1 multiplexer chooses the desired function based on ALUControl.

More specifically, if $A L U C o n t r o l=00$, the output multiplexer chooses $A+B$. If $A L U C o n t r o l=01$, the ALU computes $A-B$. (Recall from Section 5.2.2 that $\bar{B}+1=-B$ in two's complement arithmetic. Because ALUControl $l_{0}$ is 1 , the adder receives inputs $A$ and $\bar{B}$ and an asserted carry in, causing


Figure 5.15 N -bit ALU


Figure 5.16 ALU symbol with output flags

Figure 5.17 N -bit ALU with output flags
it to perform subtraction: $A+\bar{B}+1=A-B$.) If $A L U C o n t r o l=10$, the ALU computes $A$ AND $B$. If $A L U C o n t r o l=11$, the ALU performs $A$ OR $B$.

Some ALUs produce extra outputs, called flags, that indicate information about the ALU output. Figure 5.16 shows the ALU symbol with a 4-bit ALUFlags output. As shown in the schematic of this ALU in Figure 5.17, the ALUFlags output is composed of the $N, Z, C$, and $V$ flags that indicate, respectively, that the ALU output is negative or zero or that the adder produced a carry out or overflowed. Recall that the most significant bit of a two's complement number is 1 if it is negative and 0 otherwise. Thus, the $N$ flag is connected to the most significant bit of the ALU output, Result ${ }_{31}$. The $Z$ flag is asserted when all of the bits of Result are 0 , as detected by the $N$-bit NOR gate in Figure 5.17. The C flag is asserted when the adder produces a carry out and the ALU is performing addition or subtraction ( $A L U$ Control $_{1}=0$ ).


Overflow detection, as shown on the left side of Figure 5.17, is trickier. Recall from Section 1.4.6 that overflow occurs when the addition of two same signed numbers produces a result with the opposite sign. So, $V$ is asserted when all three of the following conditions are true: (1) the ALU is performing addition or subtraction (ALUControl ${ }_{1}=0$ ), (2) $A$ and Sum have opposite signs, as detected by the XOR gate, and, as
detected by the XNOR gate, (3) either $A$ and $B$ have the same sign and the adder is performing addition $\left(A L U\right.$ Control $\left._{0}=0\right)$ or $A$ and $B$ have opposite signs and the adder is performing subtraction (ALUControl ${ }_{0}=1$ ). The 3 -input AND gate detects when all three conditions are true and asserts $V$.

The HDL for an N -bit ALU with output flags is left to Exercises 5.11 and 5.12. There are many variations on this basic ALU that support other functions, such as XOR or equality comparison.

### 5.2.5 Shifters and Rotators

Shifters and rotators move bits and multiply or divide by powers of 2 . As the name implies, a shifter shifts a binary number left or right by a specified number of positions. There are several kinds of commonly used shifters:

- Logical shifter—shifts the number to the left (LSL) or right (LSR) and fills empty spots with 0's.
Ex: 11001 LSR $2=00110 ; 11001$ LSL $2=00100$
- Arithmetic shifter-is the same as a logical shifter, but on right shifts fills the most significant bits with a copy of the old most significant bit (msb). This is useful for multiplying and dividing signed numbers (see Sections 5.2.6 and 5.2.7). Arithmetic shift left (ASL) is the same as logical shift left (LSL).

Ex: 11001 ASR $2=11110 ; 11001$ ASL $2=00100$

- Rotator-rotates number in a circle such that empty spots are filled with bits shifted off the other end.

Ex: 11001 ROR $2=01110 ; 11001$ ROL $2=00111$
An $N$-bit shifter can be built from $N \mathrm{~N}: 1$ multiplexers. The input is shifted by 0 to $N-1$ bits, depending on the value of the $\log _{2} N$-bit select lines. Figure 5.18 shows the symbol and hardware of 4 -bit shifters. The operators $\ll, \gg$, and $\ggg$ typically indicate shift left, logical shift right, and arithmetic shift right, respectively. Depending on the value of the 2 -bit shift amount shamt 1.0 $^{0}$, the output $Y$ receives the input $A$ shifted by 0 to 3 bits. For all shifters, when shamt ${ }_{1: 0}=00, Y=A$. Exercise 5.18 covers rotator designs.

A left shift is a special case of multiplication. A left shift by $N$ bits multiplies the number by $2^{N}$. For example, $000011_{2} \ll 4=110000_{2}$ is equivalent to $3_{10} \times 2^{4}=48_{10}$.

An arithmetic right shift is a special case of division. An arithmetic right shift by $N$ bits divides the number by $2^{N}$. For example, $11100_{2}$ $\ggg 2=11111_{2}$ is equivalent to $-4_{10} / 2^{2}=-1_{10}$.


shamt $_{1: 0}$
$A_{3: 0} \xrightarrow{\ell_{2}} \rightarrow \ggg>Y_{3: 0}$

(a)

(b)

(c)

Figure 5.18 4-bit shifters: (a) shift left, (b) logical shift right, (c) arithmetic shift right

Figure 5.19 Multiplication: (a) decimal, (b) binary

| 230 | multiplicand | 0101 |
| :---: | :---: | :---: |
| $\times \quad 42$ | multiplier | $\times 0111$ |
| 460 | partial | 0101 |
| +920 | products | 0101 |
| 9660 |  | +0000 |
|  | result | 0100011 |

$230 \times 42=9660$
(a)
$5 \times 7=35$
(b)

### 5.2.6 Multiplication*

Multiplication of unsigned binary numbers is similar to decimal multiplication but involves only 1's and 0's. Figure 5.19 compares multiplication in decimal and binary. In both cases, partial products are formed by multiplying a single digit of the multiplier with the entire multiplicand. The shifted partial products are summed to form the result.

In general, an $N \times N$ multiplier multiplies two $N$-bit numbers and produces a 2 N -bit result. The partial products in binary multiplication are either the multiplicand or all 0 's. Multiplication of 1-bit binary numbers is equivalent to the AND operation, so AND gates are used to form the partial products.

Signed and unsigned multiplication differ. For example, consider $0 x F E \times 0 x F D$. If these 8 -bit numbers are interpreted as signed integers, they represent -2 and -3 , so the 16 -bit product is $0 x 0006$. If these numbers are interpreted as unsigned integers, the 16 -bit product is $0 x F B 06$. Notice that in either case, the least significant byte is $0 x 06$.

Figure 5.20 shows the symbol, function, and implementation of an unsigned $4 \times 4$ multiplier. The unsigned multiplier receives the multiplicand and multiplier, $A$ and $B$, and produces the product $P$. Figure $5.20(\mathrm{~b})$ shows how partial products are formed. Each partial product is a single multiplier bit ( $B_{3}, B_{2}, \mathrm{~B}_{1}$, or $B_{0}$ ) AND the multiplicand bits ( $A_{3}, A_{2}, A_{1}, A_{0}$ ). With $N$-bit operands, there are $N$ partial products and $N-1$ stages of 1-bit adders. For example, for a $4 \times 4$ multiplier, the partial product of the first row is $B_{0}$ AND $\left(A_{3}, A_{2}, A_{1}, A_{0}\right)$. This partial product is added to the shifted second partial product, $B_{1}$ AND $\left(A_{3}, A_{2}, A_{1}, A_{0}\right)$. Subsequent rows of AND gates and adders form and add the remaining partial products.

The HDL for signed and unsigned multipliers is in HDL Example 4.33. As with adders, many different multiplier designs with different speed/cost trade-offs exist. Synthesis tools may pick the most appropriate design given the timing constraints.

A multiply accumulate operation multiplies two numbers and adds them to a third number, typically the accumulated value. These operations, also called MACs, are often used in digital signal processing (DSP) algorithms such as the Fourier transform, which requires a summation of products.


Figure $5.204 \times 4$ multiplier:
(a) symbol, (b) function,
(c) implementation

### 5.2.7 Division*

Binary division can be performed using the following algorithm for N -bit unsigned numbers in the range $\left[0,2^{N-1}\right]$ :

```
R'=0
for i=N-1 to 0
    R={\mp@subsup{R}{}{\prime}<< 1, A A}
    D=R-B
    if D<0 then }\quad\mp@subsup{Q}{i}{}=0,\mp@subsup{R}{}{\prime}=R\quad//R<
    else }\quad\mp@subsup{Q}{i}{\prime}=1,\mp@subsup{R}{}{\prime}=D\quad//R\geq
R= R'
```

The partial remainder $R$ is initialized to $0\left(R^{\prime}=0\right)$, and the most significant bit of the dividend $A$ becomes the least significant bit of $R\left(R=\left\{R^{\prime} \ll 1, A_{i}\right\}\right)$. The divisor $B$ is subtracted from this partial remainder to determine whether it fits $(D=R-B)$. If the difference $D$ is negative (i.e., the sign bit of $D$ is 1 ), then the quotient bit $Q_{i}$ is 0 and the difference is discarded. Otherwise, $Q_{i}$ is 1 , and the partial remainder is updated to be the difference. In any event, the partial remainder is then doubled (left-shifted by one column), the next most significant bit of $A$ becomes the least significant bit of $R$, and the process repeats. The result satisfies $\frac{A}{B}=Q+\frac{R}{B}$.

Figure 5.21 Array divider


Figure 5.21 shows a schematic of a 4-bit array divider. The divider computes $A / B$ and produces a quotient $Q$ and a remainder $R$. The legend shows the symbol and schematic for each block in the array divider. Each row performs one iteration of the division algorithm. Specifically, each row calculates the difference $D=R-B$. (Recall that $R+\bar{B}+1=R-B$ ). The signal $N$ indicates whether $D$ is negative. So a row's multiplexer select lines receive the most significant bit of $D$, which is 1 when the difference is negative. The quotient $\left(Q_{i}\right)$ is 0 when $D$ is negative and 1 otherwise. The multiplexer passes $R$ to the next row if the difference is negative and $D$ otherwise. The following row shifts the new partial remainder left by one bit, appends the next most significant bit of $A$, and then repeats the process.

The delay of an $N$-bit array divider increases proportionally to $N^{2}$ because the carry must ripple through all $N$ stages in a row before the sign is determined and the multiplexer selects $R$ or $D$. This repeats for all $N$ rows. Division is a slow and expensive operation in hardware and therefore should be used as infrequently as possible.

### 5.2.8 Further Reading

Computer arithmetic could be the subject of an entire text. Digital Arithmetic, by Ercegovac and Lang, is an excellent overview of the entire field. CMOS VLSI Design, by Weste and Harris, covers high-performance circuit designs for arithmetic operations.

### 5.3 NUMBER SYSTEMS

Computers operate on both integers and fractions. So far, we have only considered representing signed or unsigned integers, as introduced in Section 1.4. This section introduces fixed- and floating-point number systems that can represent rational numbers. Fixed-point numbers are analogous to decimals; some of the bits represent the integer part, and the rest represent the fraction. Floating-point numbers are analogous to scientific notation, with a mantissa and an exponent.

### 5.3.1 Fixed-Point Number Systems

Fixed-point notation has an implied binary point between the integer and fraction bits, analogous to the decimal point between the integer and fraction digits of an ordinary decimal number. For example, Figure 5.22(a) shows a fixed-point number with four integer bits and four fraction bits. Figure 5.22 (b) shows the implied binary point in blue, and Figure 5.22(c) shows the equivalent decimal value. The integer bits are called the bigh word and the fraction bits are called the low word.
(a) 01101100
(b) 0110.1100
(c) $2^{2}+2^{1}+2^{-1}+2^{-2}=6.75$

Figure 5.22 Fixed-point notation of 6.75 with four integer bits and four fraction bits
(a) 0010.0110
(b) 1010.0110
(c) 1101.1010

Figure 5.23 Fixed-point representation of $\mathbf{- 2 . 3 7 5 :}$ (a) absolute value, (b) sign and magnitude, (c) two's complement

Fixed-point number systems are commonly used for banking and financial applications that require precision but not a large range. Digital signal processing (DSP) applications also often use fixed-point numbers because the computations are faster and consume less power than they would in floating-point.

Figure 5.24 Fixed-point two's complement conversion

Figure 5.25 Addition: (a) binary fixed-point, (b) decimal equivalent

Signed fixed-point numbers can use either two's complement or sign/ magnitude notation. Figure 5.23 shows the fixed-point representation of -2.375 using both notations with four integer and four fraction bits. The implicit binary point is shown in blue for clarity. In sign/magnitude form, the most significant bit is used to indicate the sign. The two's complement representation is formed by inverting the bits of the absolute value and adding a 1 to the least significant (rightmost) bit. In this case, the least significant bit position is in the $2^{-4}$ column.

Like all binary number representations, fixed-point numbers are just a collection of bits. There is no way of knowing the existence of the binary point except through agreement of those people interpreting the number.

## Example 5.3 ARITHMETIC WITH FIXED-POINT NUMBERS

Compute $0.75+-0.625$ using fixed-point numbers.
Solution: First convert 0.625 , the magnitude of the second number, to fixed-point binary notation. $0.625 \geq 2^{-1}$, so there is a 1 in the $2^{-1}$ column, leaving $0.625-0.5=$ 0.125 . Because $0.125<2^{-2}$, there is a 0 in the $2^{-2}$ column. Because $0.125 \geq 2^{-3}$, there is a 1 in the $2^{-3}$ column, leaving $0.125-0.125=0$. Thus, there must be a 0 in the $2^{-4}$ column. Putting this all together, $0.625_{10}=0000.1010_{2}$.

Use two's complement representation for signed numbers so that addition works correctly. Figure 5.24 shows the conversion of -0.625 to fixed-point two's complement notation.

Figure 5.25 shows the fixed-point binary addition and the decimal equivalent for comparison. Note that the leading 1 in the binary fixed-point addition of Figure $5.25(\mathrm{a})$ is discarded from the 8 -bit result.

### 5.3.2 Floating-Point Number Systems*

Floating-point numbers are analogous to scientific notation. They circumvent the limitation of having a constant number of integer and fraction bits, allowing the representation of very large and very small numbers.

| 0000.1010 | Binary Magnitude |
| ---: | :--- |
| 1111.0101 | One's Complement |
| $+\quad 1$ | Add 1 |
| 1111.0110 | Two's Complement |
|  |  |
| 0000.1100 |  |
| +1111.0110 |  |
| 10000.0010 |  |
| (a) | (b) 0.75 |

(b)

Like scientific notation, floating-point numbers have a sign, mantissa (M), base (B), and exponent (E), as shown in Figure 5.26. For example, the number $4.1 \times 10^{3}$ is the decimal scientific notation for 4100 . It has a mantissa of 4.1 , a base of 10 , and an exponent of 3 . The decimal point floats to the position right after the most significant digit. Floating-point numbers are base 2 with a binary mantissa. 32 bits are used to represent 1 sign bit, 8 exponent bits, and 23 mantissa bits.

## Example 5.4 32-BIT FLOATING-POINT NUMBERS

Show the floating-point representation of the decimal number 228.
Solution: First convert the decimal number into binary: $228_{10}=11100100_{2}=$ $1.11001_{2} \times 2^{7}$. Figure 5.27 shows the 32 -bit encoding, which will be modified later for efficiency. The sign bit is positive (0), the 8 exponent bits give the value 7 , and the remaining 23 bits are the mantissa.

In binary floating-point, the first bit of the mantissa (to the left of the binary point) is always 1 and therefore need not be stored. It is called the implicit leading one. Figure 5.28 shows the modified floating-point representation of $228_{10}=11100100_{2} \times 2^{0}=1.11001_{2} \times 2^{7}$. The implicit leading one is not included in the 23 -bit mantissa for efficiency. Only the fraction bits are stored. This frees up an extra bit for useful data.

We make one final modification to the exponent field. The exponent needs to represent both positive and negative exponents. To do so, float-ing-point uses a biased exponent, which is the original exponent plus a constant bias. 32-bit floating-point uses a bias of 127. For example, for the exponent 7 , the biased exponent is $7+127=134=10000110_{2}$. For the exponent -4 , the biased exponent is: $-4+127=123=$ $01111011_{2}$. Figure 5.29 shows $1.11001_{2} \times 2^{7}$ represented in floatingpoint notation with an implicit leading one and a biased exponent of

| 1 bit | 8 bits 23 bits |  |
| :---: | :---: | :---: |
| 0 | 00000111 | 11100100000000000000000 |
| Sign | Exponent | Mantissa |


| 1 bit | 8 bits 23 bits |  |
| :---: | :---: | :---: |
| 0 | 00000111 | 11001000000000000000000 |
| Sign | Exponent | Fraction |


| 1 bit | 8 bits 23 bits |  |
| :---: | :---: | :---: |
| 0 | 10000110 | 11001000000000000000000 |
| Sign <br> Expenent |  | Fraction |
|  |  |  |

$$
\pm \mathrm{M} \times \mathrm{B}^{\mathrm{E}}
$$

Figure 5.26 Floating-point numbers

As may be apparent, there are many reasonable ways to represent floating-point numbers. For many years, computer manufacturers used incompatible floating-point formats. Results from one computer could not directly be interpreted by another computer.

The Institute of Electrical and Electronics Engineers solved this problem by creating the IEEE 754 floating-point standard in 1985 defining floating-point numbers. This floating-point format is now almost universally used and is the one discussed in this section.

Figure 5.27 32-bit floating-point version 1

Figure 5.28 32-bit floating-point version 2

Figure 5.29 IEEE 754 floatingpoint notation

Floating-point cannot represent some numbers exactly, like 1.7. However, when you type 1.7 into your calculator, you see exactly 1.7, not $1.69999 .$. . To handle this, some applications, such as calculators and financial software, use binary coded decimal ( $B C D$ ) numbers or formats with a base 10 exponent. BCD numbers encode each decimal digit using four bits with a range of 0 to 9 . For example, the BCD fixed-point notation of 1.7 with four integer bits and four fraction bits would be 0001.0111. Of course, nothing is free. The cost is increased complexity in arithmetic hardware and wasted encodings (A-F encodings are not used), and thus decreased performance. So for compute-intensive applications, floating-point is much faster.
$134(7+127)$. This notation conforms to the IEEE 754 floating-point standard.

## Special Cases: $\mathbf{0}, \pm \infty$, and NaN

The IEEE floating-point standard has special cases to represent numbers such as zero, infinity, and illegal results. For example, representing the number zero is problematic in floating-point notation because of the implicit leading one. Special codes with exponents of all 0's or all l's are reserved for these special cases. Table 5.2 shows the floating-point representations of $0, \pm \infty$, and NaN . As with sign/magnitude numbers, floating-point has both positive and negative $0 . \mathrm{NaN}$ is used for numbers that don't exist, such as $\sqrt{-1}$ or $\log _{2}(-5)$.

## Single- and Double-Precision Formats

So far, we have examined 32-bit floating-point numbers. This format is also called single-precision, single, or float. The IEEE 754 standard also defines 64-bit double-precision numbers (also called doubles) that provide greater precision and greater range. Table 5.3 shows the number of bits used for the fields in each format.

Excluding the special cases mentioned earlier, normal single-precision numbers span a range of $\pm 1.175494 \times 10^{-38}$ to $\pm 3.402824 \times 10^{38}$. They have a precision of about seven significant decimal digits (because $2^{-24} \approx 10^{-7}$ ). Similarly, normal double-precision numbers span a range of $\pm 2.22507385850720 \times 10^{-308}$ to $\pm 1.79769313486232 \times 10^{308}$ and have a precision of about 15 significant decimal digits.

Table 5.2 IEEE 754 floating-point notations for $\mathbf{0}, \pm \infty$, and NaN

| Number | Sign | Exponent | Fraction |
| :--- | :---: | :---: | :---: |
| 0 | X | 00000000 | 00000000000000000000000 |
| $\infty$ | 0 | 11111111 | 00000000000000000000000 |
| $-\infty$ | 1 | 11111111 | 00000000000000000000000 |
| NaN | X | 11111111 | Non-zero |

Table 5.3 Single- and double-precision floating-point formats

| Format | Total Bits | Sign Bits | Exponent Bits | Fraction Bits |
| :---: | :---: | :---: | :---: | :---: |
| single | 32 | 1 | 8 | 23 |
| double | 64 | 1 | 11 | 52 |

## Rounding

Arithmetic results that fall outside of the available precision must round to a neighboring number. The rounding modes are: round down, round up, round toward zero, and round to nearest. The default rounding mode is round to nearest. In the round to nearest mode, if two numbers are equally near, the one with a 0 in the least significant position of the fraction is chosen.

Recall that a number overflows when its magnitude is too large to be represented. Likewise, a number underflows when it is too tiny to be represented. In round to nearest mode, overflows are rounded up to $\pm \infty$ and underflows are rounded down to 0 .

## Floating-Point Addition

Addition with floating-point numbers is not as simple as addition with two's complement numbers. The steps for adding floating-point numbers with the same sign are as follows:

1. Extract exponent and fraction bits.
2. Prepend leading 1 to form the mantissa.
3. Compare exponents.
4. Shift smaller mantissa if necessary.
5. Add mantissas.
6. Normalize mantissa and adjust exponent if necessary.
7. Round result.
8. Assemble exponent and fraction back into floating-point number.

Figure 5.30 shows the floating-point addition of $7.875\left(1.11111 \times 2^{2}\right)$ and $0.1875\left(1.1 \times 2^{-3}\right)$. The result is $8.0625\left(1.0000001 \times 2^{3}\right)$. After the fraction and exponent bits are extracted and the implicit leading 1 is prepended in steps 1 and 2, the exponents are compared by subtracting the smaller exponent from the larger exponent. The result is the number of bits by which the smaller number is shifted to the right to align the implied binary point (i.e., to make the exponents equal) in step 4 . The aligned numbers are added. Because the sum has a mantissa that is greater than or equal to 2.0, the result is normalized by shifting it to the right one bit and incrementing the exponent. In this example, the result is exact, so no rounding is necessary. The result is stored in floating-point notation by removing the implicit leading one of the mantissa and prepending the sign bit.

### 5.4 SEQUENTIAL BUILDING BLOCKS

This section examines sequential building blocks, including counters and shift registers.

Floating-point arithmetic is usually done in hardware to make it fast. This hardware, called the floating-point unit (FPU), is typically distinct from the central processing unit (CPU). The infamous floating-point division (FDIV) bug in the Pentium FPU cost Intel $\$ 475$ million to recall and replace defective chips. The bug occurred simply because a lookup table was not loaded correctly.

Figure 5.30 Floating-point addition


Figure 5.31 Counter symbol


Figure 5.32 N -bit counter

### 5.4.1 Counters

An $N$-bit binary counter, shown in Figure 5.31, is a sequential arithmetic circuit with clock and reset inputs and an $N$-bit output Q. Reset initializes the output to 0 . The counter then advances through all $2^{N}$ possible outputs in binary order, incrementing on the rising edge of the clock.

Figure 5.32 shows an N -bit counter composed of an adder and a resettable register. On each cycle, the counter adds 1 to the value stored in the register. HDL Example 5.4 describes a binary counter with asynchronous reset.

Other types of counters, such as Up/Down counters, are explored in Exercises 5.47 through 5.50.

HDL Example 5.4 COUNTER

## SystemVerilog

```
module counter 非(parameter N = 8)
    (input logicclk,
        input logic reset,
        output logic [N-1:0] q);
```

    always_ff @(posedge clk, posedge reset)
        if (reset) \(q<=0\);
        else \(\quad \mathrm{q}<=\mathrm{q}+1\);
    endmodule

## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD_UNSIGNED.ALL;
entity counter is
    generic(N: integer := 8);
    port(clk,reset: in STD_LOGIC;
        q: out STD_LOGIC_VECTOR(N-1 downto 0));
end;
architecture synth of counter is
begin
    process(clk, reset) begin
        if reset then q<=(OTHERS => '0');
        elsif rising_edge(clk) then q<= q + '1';
        end if;
    end process;
end;
```



Figure 5.33 Synthesized counter

### 5.4.2 Shift Registers

A shift register has a clock, a serial input $S_{\text {in }}$, a serial output $S_{\text {out }}$, and $N$ parallel outputs $Q_{N-1: 0}$, as shown in Figure 5.34. On each rising edge of the clock, a new bit is shifted in from $S_{\text {in }}$ and all the subsequent contents are shifted forward. The last bit in the shift register is available at $S_{\text {out }}$. Shift registers can be viewed as serial-to-parallel converters. The input is provided serially (one bit at a time) at $S_{\text {in }}$. After $N$ cycles, the past $N$ inputs are available in parallel at $Q$.

A shift register can be constructed from $N$ flip-flops connected in series, as shown in Figure 5.35. Some shift registers also have a reset signal to initialize all of the flip-flops.



Figure 5.34 Shift register symbol

Figure 5.35 Shift register schematic

Figure 5.36 Shift register with parallel load

Don't confuse shift registers with the shifters from Section 5.2.5. Shift registers are sequential logic blocks that shift in a new bit on each clock edge. Shifters are unclocked combinational logic blocks that shift an input by a specified amount.


A related circuit is a parallel-to-serial converter that loads $N$ bits in parallel, then shifts them out one at a time. A shift register can be modified to perform both serial-to-parallel and parallel-to-serial operations by adding a parallel input $D_{N-1: 0}$, and a control signal Load, as shown in Figure 5.36. When Load is asserted, the flip-flops are loaded in parallel from the $D$ inputs. Otherwise, the shift register shifts normally. HDL Example 5.5 describes such a shift register.

## Scan Chains*

Shift registers are often used to test sequential circuits using a technique called scan chains. Testing combinational circuits is relatively straightforward. Known inputs called test vectors are applied, and the outputs are checked against the expected result. Testing sequential circuits is more difficult, because the circuits have state. Starting from a known initial condition, a large number of cycles of test vectors may be needed to put the circuit into a desired state. For example, testing that the most significant bit of a 32 -bit counter advances from 0 to 1 requires resetting the counter, then applying $2^{31}$ (about two billion) clock pulses!

To solve this problem, designers like to be able to directly observe and control all the state of the machine. This is done by adding a test mode in which the contents of all flip-flops can be read out or loaded with desired values. Most systems have too many flip-flops to dedicate individual pins to read and write each flip-flop. Instead, all the flip-flops in the system are connected together into a shift register called a scan chain. In normal operation, the flip-flops load data from their $D$ input and ignore the scan chain. In test mode, the flip-flops serially shift their contents out and shift in new contents using $S_{\text {in }}$ and $S_{\text {out }}$. The load multiplexer is usually integrated into the flip-flop to produce a scannable flip-flop. Figure 5.38 shows the schematic and symbol for a scannable flip-flop and illustrates how the flops are cascaded to build an N -bit scannable register.

For example, the 32 -bit counter could be tested by shifting in the pattern 011111... 111 in test mode, counting for one cycle in normal mode, then shifting out the result, which should be 100000. . .000. This requires only $32+1+32=65$ cycles.

## HDL Example 5.5 SHIFT REGISTER WITH PARALLEL LOAD

## SystemVerilog

```
module shiftreg非(parameter N = 8)
    (input logic clk,
    input logic reset,load,
    input logic sin,
    input logic[N-1:0]d,
    output logic [N-1:0] q,
    output logic sout);
```

    always_ff @(posedge clk, posedge reset)
    if (reset) \(q<=0\);
    else if (load) \(\quad\) q \(<=d\);
    else \(q<=\{q[N-2: 0]\), sin \(\}\);
    assign sout \(=q[N-1]\);
    endmodule

## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
entity shiftreg is
    generic(N: integer := 8);
    port(clk, reset: in STD_LOGIC;
        load,sin: in STD_LOGIC;
        d: in STD_LOGIC_VECTOR(N-1 downto 0);
        q: out STD_LOGIC_VECTOR(N-1 downto 0);
        sout: out STD_LOGIC);
end;
architecture synth of shiftreg is
begin
    process(clk, reset) begin
        if reset ='1' then q<=(OTHERS => '0');
        elsif rising_edge(clk) then
            if load then q<= d;
            else q <= q(N-2 downto 0) & sin;
            end if;
        end if;
    end process;
    sout<=q(N-1);
end;
```



Figure 5.37 Synthesized shiftreg


Figure 5.38 Scannable flip-flop: (a) schematic, (b) symbol, and (c) $\boldsymbol{N}$-bit scannable register


Figure 5.39 Generic memory array symbol


Figure $5.404 \times 3$ memory array: (a) symbol, (b) function


Figure $5.41 \mathbf{3 2} \mathbf{~ K b}$ array: depth = $2^{10}=1024$ words, width $=32$ bits


Figure 5.42 Bit cell

### 5.5 MEMORY ARRAYS

The previous sections introduced arithmetic and sequential circuits for manipulating data. Digital systems also require memories to store the data used and generated by such circuits. Registers built from flip-flops are a kind of memory that stores small amounts of data. This section describes memory arrays that can efficiently store large amounts of data.

The section begins with an overview describing characteristics shared by all memory arrays. It then introduces three types of memory arrays: dynamic random access memory (DRAM), static random access memory (SRAM), and read only memory (ROM). Each memory differs in the way it stores data. The section briefly discusses area and delay trade-offs and shows how memory arrays are used, not only to store data but also to perform logic functions. The section finishes with the HDL for a memory array.

### 5.5.1 Overview

Figure 5.39 shows a generic symbol for a memory array. The memory is organized as a two-dimensional array of memory cells. The memory reads or writes the contents of one of the rows of the array. This row is specified by an Address. The value read or written is called Data. An array with $N$-bit addresses and $M$-bit data has $2^{N}$ rows and $M$ columns. Each row of data is called a word. Thus, the array contains $2^{N} M$-bit words.

Figure 5.40 shows a memory array with two address bits and three data bits. The two address bits specify one of the four rows (data words) in the array. Each data word is three bits wide. Figure 5.40 (b) shows some possible contents of the memory array.

The depth of an array is the number of rows, and the width is the number of columns, also called the word size. The size of an array is given as depth $\times$ width. Figure 5.40 is a 4 -word $\times 3$-bit array, or simply $4 \times 3$ array. The symbol for a 1024 -word $\times 32$-bit array is shown in Figure 5.41. The total size of this array is 32 kilobits ( Kb ).

## Bit Cells

Memory arrays are built as an array of bit cells, each of which stores 1 bit of data. Figure 5.42 shows that each bit cell is connected to a wordline and a bitline. For each combination of address bits, the memory asserts a single wordline that activates the bit cells in that row. When the wordline is HIGH, the stored bit transfers to or from the bitline. Otherwise, the bitline is disconnected from the bit cell. The circuitry to store the bit varies with memory type.

To read a bit cell, the bitline is initially left floating $(Z)$. Then the wordline is turned ON, allowing the stored value to drive the bitline to 0 or 1 . To write a bit cell, the bitline is strongly driven to the desired
value. Then the wordline is turned ON , connecting the bitline to the stored bit. The strongly driven bitline overpowers the contents of the bit cell, writing the desired value into the stored bit.

## Organization

Figure 5.43 shows the internal organization of a $4 \times 3$ memory array. Of course, practical memories are much larger, but the behavior of larger arrays can be extrapolated from the smaller array. In this example, the array stores the data from Figure 5.40(b).

During a memory read, a wordline is asserted, and the corresponding row of bit cells drives the bitlines HIGH or LOW. During a memory write, the bitlines are driven HIGH or LOW first and then a wordline is asserted, allowing the bitline values to be stored in that row of bit cells. For example, to read Address 10, the bitlines are left floating, the decoder asserts wordline ${ }_{2}$, and the data stored in that row of bit cells (100) reads out onto the Data bitlines. To write the value 001 to Address 11, the bitlines are driven to the value 001, then wordline $3_{3}$ is asserted and the new value (001) is stored in the bit cells.

## Memory Ports

All memories have one or more ports. Each port gives read and/or write access to one memory address. The previous examples were all singleported memories.

Multiported memories can access several addresses simultaneously. Figure 5.44 shows a three-ported memory with two read ports and one write port. Port 1 reads the data from address $A 1$ onto the read data output $R D 1$. Port 2 reads the data from address $A 2$ onto


Figure $5.43 \mathbf{4 \times 3}$ memory array


Figure 5.44 Three-ported memory


Robert Dennard, 1932-. Invented DRAM in 1966 at IBM. Although many were skeptical that the idea would work, by the mid-1970s DRAM was in virtually all computers. He claims to have done little creative work until, arriving at IBM, they handed him a patent notebook and said, "put all your ideas in there." Since 1965, he has received 35 patents in semiconductors and microelectronics. (Photo courtesy of IBM.)


Figure 5.45 DRAM bit cell
$R D 2$. Port 3 writes the data from the write data input WD3 into address $A 3$ on the rising edge of the clock if the write enable WE3 is asserted.

## Memory Types

Memory arrays are specified by their size (depth $\times$ width) and the number and type of ports. All memory arrays store data as an array of bit cells, but they differ in how they store bits.

Memories are classified based on how they store bits in the bit cell. The broadest classification is random access memory (RAM) versus read only memory ( $R O M$ ). RAM is volatile, meaning that it loses its data when the power is turned off. ROM is nonvolatile, meaning that it retains its data indefinitely, even without a power source.

RAM and ROM received their names for historical reasons that are no longer very meaningful. RAM is called random access memory because any data word is accessed with the same delay as any other. In contrast, a sequential access memory, such as a tape recorder, accesses nearby data more quickly than faraway data (e.g., at the other end of the tape). ROM is called read only memory because, historically, it could only be read but not written. These names are confusing, because ROMs are randomly accessed too. Worse yet, most modern ROMs can be written as well as read! The important distinction to remember is that RAMs are volatile and ROMs are nonvolatile.

The two major types of RAMs are dynamic RAM (DRAM) and static RAM (SRAM). Dynamic RAM stores data as a charge on a capacitor, whereas static RAM stores data using a pair of cross-coupled inverters. There are many flavors of ROMs that vary by how they are written and erased. These various types of memories are discussed in the subsequent sections.

### 5.5.2 Dynamic Random Access Memory (DRAM)

Dynamic RAM (DRAM, pronounced "dee-ram") stores a bit as the presence or absence of charge on a capacitor. Figure 5.45 shows a DRAM bit cell. The bit value is stored on a capacitor. The nMOS transistor behaves as a switch that either connects or disconnects the capacitor from the bitline. When the wordline is asserted, the nMOS transistor turns ON, and the stored bit value transfers to or from the bitline.

As shown in Figure 5.46(a), when the capacitor is charged to $V_{D D}$, the stored bit is 1 ; when it is discharged to GND (Figure 5.46(b)), the stored bit is 0 . The capacitor node is dynamic because it is not actively driven HIGH or LOW by a transistor tied to $\mathrm{V}_{D D}$ or GND.

Upon a read, data values are transferred from the capacitor to the bitline. Upon a write, data values are transferred from the bitline to

the capacitor. Reading destroys the bit value stored on the capacitor, so the data word must be restored (rewritten) after each read. Even when DRAM is not read, the contents must be refreshed (read and rewritten) every few milliseconds, because the charge on the capacitor gradually leaks away.

### 5.5.3 Static Random Access Memory (SRAM)

Static RAM (SRAM, pronounced "es-ram") is static because stored bits do not need to be refreshed. Figure 5.47 shows an SRAM bit cell. The data bit is stored on cross-coupled inverters like those described in Section 3.2. Each cell has two outputs, bitline and $\overline{\mathrm{bitline}}$. When the wordline is asserted, both nMOS transistors turn on, and data values are transferred to or from the bitlines. Unlike DRAM, if noise degrades the value of the stored bit, the cross-coupled inverters restore the value.

### 5.5.4 Area and Delay

Flip-flops, SRAMs, and DRAMs are all volatile memories, but each has different area and delay characteristics. Table 5.4 shows a comparison of these three types of volatile memory. The data bit stored in a flip-flop is available immediately at its output. But flip-flops take at least 20 transistors to build. Generally, the more transistors a device has, the more area, power, and cost it requires. DRAM latency is longer than that of SRAM because its bitline is not actively driven by a transistor. DRAM must wait for charge to move (relatively) slowly from the capacitor to the bitline. DRAM also fundamentally has lower throughput than SRAM, because it must refresh data

Figure 5.46 DRAM stored values


Figure 5.47 SRAM bit cell

## Table 5.4 Memory comparison

| Memory <br> Type | Transistors per <br> Bit Cell | Latency |
| :---: | :---: | :--- |
| flip-flop | $\sim 20$ | fast |
| SRAM | 6 | medium |
| DRAM | 1 | slow |



Figure $5.4816 \times 32$ register file with two read ports and one write port


Figure 5.49 ROM bit cells containing 0 and 1
periodically and after a read. DRAM technologies such as synchronous DRAM (SDRAM) and double data rate ( $D D R$ ) SDRAM have been developed to overcome this problem. SDRAM uses a clock to pipeline memory accesses. DDR SDRAM, sometimes called simply DDR, uses both the rising and falling edges of the clock to access data, thus doubling the throughput for a given clock speed. DDR was first standardized in 2000 and ran at 100 to 200 MHz . Later standards, DDR2, DDR3, and DDR4, increased the clock speeds, with speeds in 2015 being over 1 GHz .

Memory latency and throughput also depend on memory size; larger memories tend to be slower than smaller ones if all else is the same. The best memory type for a particular design depends on the speed, cost, and power constraints.

### 5.5.5 Register Files

Digital systems often use a number of registers to store temporary variables. This group of registers, called a register file, is usually built as a small, multiported SRAM array, because it is more compact than an array of flip-flops.

Figure 5.48 shows a 16 -register $\times 32$-bit three-ported register file built from a three-ported memory similar to that of Figure 5.44. The register file has two read ports $(A 1 / R D 1$ and $A 2 / R D 2)$ and one write port (A3/WD3). The 4-bit addresses, $A 1, A 2$, and $A 3$, can each access all $2^{4}=16$ registers. So, two registers can be read and one register written simultaneously.

### 5.5.6 Read Only Memory

Read only memory (ROM) stores a bit as the presence or absence of a transistor. Figure 5.49 shows a simple ROM bit cell. To read the cell, the bitline is weakly pulled HIGH. Then the wordline is turned ON. If the transistor is present, it pulls the bitline LOW. If it is absent, the bitline remains HIGH. Note that the ROM bit cell is a combinational circuit and has no state to "forget" if power is turned off.

The contents of a ROM can be indicated using dot notation. Figure 5.50 shows the dot notation for a 4 -word $\times 3$-bit ROM containing the data from Figure 5.40. A dot at the intersection of a row (wordline) and a column (bitline) indicates that the data bit is 1 . For example, the top wordline has a single dot on Data ${ }_{1}$, so the data word stored at Address 11 is 010 .

Conceptually, ROMs can be built using two-level logic with a group of AND gates followed by a group of OR gates. The AND gates produce all possible minterms and hence form a decoder. Figure 5.51 shows the ROM of Figure 5.50 built using a decoder and OR gates. Each dotted row in Figure 5.50 is an input to an OR gate in Figure 5.51.


For data bits with a single dot, in this case Data $a_{0}$, no OR gate is needed. This representation of a ROM is interesting because it shows how the ROM can perform any two-level logic function. In practice, ROMs are built from transistors instead of logic gates to reduce their size and cost. Section 5.6.3 explores the transistor-level implementation further.

The contents of the ROM bit cell in Figure 5.49 are specified during manufacturing by the presence or absence of a transistor in each bit cell. A programmable ROM ( $P R O M$, pronounced like the dance) places a transistor in every bit cell but provides a way to connect or disconnect the transistor to ground.

Figure 5.52 shows the bit cell for a fuse-programmable ROM. The user programs the ROM by applying a high voltage to selectively blow

Figure $5.51 \mathbf{4 \times 3}$ ROM implementation using gates
Figure $5.504 \times 3$ ROM: dot notation



Fujio Masuoka, 1944-. Received a $\mathrm{Ph} . \mathrm{D}$. in electrical engineering from Tohoku University, Japan. Developed memories and highspeed circuits at Toshiba from 1971 to 1994. Invented Flash memory as an unauthorized project pursued during nights and weekends in the late 1970s. Flash received its name because the process of erasing the memory reminds one of the flash of a camera. Toshiba was slow to commercialize the idea; Intel was first to market in 1988. Flash has grown into a $\$ 25$ billion per year market. Dr. Masuoka later joined the faculty at Tohoku University and is working to develop a 3-dimensional transistor.


Flash memory drives with Universal Serial Bus (USB) connectors have replaced floppy disks and CDs for sharing files because Flash costs have dropped so dramatically.
fuses. If the fuse is present, the transistor is connected to GND and the cell holds a 0 . If the fuse is destroyed, the transistor is disconnected from ground and the cell holds a 1 . This is also called a one-time programmable ROM, because the fuse cannot be repaired once it is blown.

Reprogrammable ROMs provide a reversible mechanism for connecting or disconnecting the transistor to GND. Erasable PROMs (EPROMs, pronounced "e-proms") replace the nMOS transistor and fuse with a floating-gate transistor. The floating gate is not physically attached to any other wires. When suitable high voltages are applied, electrons tunnel through an insulator onto the floating gate, turning on the transistor and connecting the bitline to the wordline (decoder output). When the EPROM is exposed to intense ultraviolet (UV) light for about half an hour, the electrons are knocked off the floating gate, turning the transistor off. These actions are called programming and erasing, respectively. Electrically erasable PROMs (EEPROMs, pronounced "e-e-proms" or "double-e proms") and Flash memory use similar principles but include circuitry on the chip for erasing as well as programming, so no UV light is necessary. EEPROM bit cells are individually erasable; Flash memory erases larger blocks of bits and is cheaper because fewer erasing circuits are needed. In 2015, Flash memory cost about $\$ 0.35$ per GB , and the price continues to drop by 30 to $40 \%$ per year. Flash has become an extremely popular way to store large amounts of data in portable battery-powered systems such as cameras and music players.

In summary, modern ROMs are not really read only; they can be programmed (written) as well. The difference between RAM and ROM is that ROMs take a longer time to write but are nonvolatile.

### 5.5.7 Logic Using Memory Arrays

Although they are used primarily for data storage, memory arrays can also perform combinational logic functions. For example, the Data output of the ROM in Figure 5.50 is the XOR of the two Address inputs. Likewise Data $_{0}$ is the NAND of the two inputs. A $2^{N}$-word $\times M$-bit memory can perform any combinational function of $N$ inputs and $M$ outputs. For example, the ROM in Figure 5.50 performs three functions of two inputs.

Memory arrays used to perform logic are called lookup tables (LUTs). Figure 5.53 shows a 4 -word $\times 1$-bit memory array used as a lookup table to perform the function $Y=A B$. Using memory to perform logic, the user can look up the output value for a given input combination (address). Each address corresponds to a row in the truth table, and each data bit corresponds to an output value.


Figure 5.53 4-word $\times$ 1-bit memory array used as a lookup table

### 5.5.8 Memory HDL

HDL Example 5.6 describes a $2^{N}$-word $\times M$-bit RAM. The RAM has a synchronous enabled write. In other words, writes occur on the rising edge of the clock if the write enable we is asserted. Reads occur immediately. When power is first applied, the contents of the RAM are unpredictable.

HDL Example 5.7 describes a 4 -word $\times 3$-bit ROM. The contents of the ROM are specified in the HDL case statement. A ROM as small as this one may be synthesized into logic gates rather than an array. Note that the seven-segment decoder from HDL Example 4.24 synthesizes into a ROM in Figure 4.20.

### 5.6 LOGIC ARRAYS

Like memory, gates can be organized into regular arrays. If the connections are made programmable, these logic arrays can be configured to perform any function without the user having to connect wires in specific ways. The regular structure simplifies design. Logic arrays are mass produced in large quantities, so they are inexpensive. Software tools allow users to map logic designs onto these arrays. Most logic arrays are also reconfigurable, allowing designs to be modified without replacing the hardware. Reconfigurability is valuable during development and is also useful in the field, because a system can be upgraded by simply downloading the new configuration.

This section introduces two types of logic arrays: programmable logic arrays (PLAs), and field programmable gate arrays (FPGAs). PLAs, the

Programmable ROMs can be configured with a device programmer like the one shown below. The device programmer is attached to a computer, which specifies the type of ROM and the data values to program. The device programmer blows fuses or injects charge onto a floating gate on the ROM. Thus the programming process is sometimes called burning a ROM.


## HDL Example 5.6 RAM

## SystemVerilog

```
module ram 非(parameter N=6,M=32)
    (input logic clk,
    input logic we,
    input logic [N-1:0] adr.
    input logic[M-1:0] din,
    output logic [M-1:0] dout);
```

logic [M-1:0] mem [2**N-1:0];
always_ff @(posedge clk)
if (we) mem [adr] <= din;
assign dout $=$ mem[adr];
endmodule

## VHDL

```
library IEEE; use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD_UNSIGNED.ALL;
entity ram_array is
    generic(N: integer := 6; M: integer := 32);
    port(clk,
        we: in STD_LOGIC;
        adr: in STD_LOGIC_VECTOR(N-1 downto 0);
        din: in STD_LOGIC_VECTOR(M-1 downto 0);
        dout: out STD_LOGIC_VECTOR(M-1 downto 0));
end;
architecture synth of ram_array is
    type mem_array is array ((2**N-1) downto 0)
        of STD_LOGIC_VECTOR (M-1 downto 0);
        signal mem: mem_array;
begin
    process(clk) begin
        if rising_edge(clk) then
            if we then mem(TO_INTEGER(adr)) <= din;
            end if;
        end if;
    end process;
    dout <= mem(TO_INTEGER(adr));
end;
```



Figure 5.54 Synthesized ram
older technology, perform only combinational logic functions. FPGAs can perform both combinational and sequential logic.

### 5.6.1 Programmable Logic Array

Programmable logic arrays (PLAs) implement two-level combinational logic in sum-of-products (SOP) form. PLAs are built from an AND array followed by an OR array, as shown in Figure 5.55. The inputs (in true and complementary form) drive an AND array, which produces implicants, which in turn are ORed together to form the outputs. An $M \times N \times P$-bit PLA has $M$ inputs, $N$ implicants, and $P$ outputs.

HDL Example 5.7 ROM

```
SystemVerilog
module rom(input logic [1:0] adr,
    output logic [2:0] dout):
    always_comb
    case(adr)
        2'b00: dout = 3'b011;
        2'b01: dout = 3'b110;
        2'b10: dout = 3'b100;
        2'b11: dout = 3'b010;
    endcase
endmodule

\section*{VHDL}
```

1ibrary IEEE; use IEEE.STD_LOGIC_1164.a11;

```
1ibrary IEEE; use IEEE.STD_LOGIC_1164.a11;
entity rom is
entity rom is
    port(adr: in STD_LOGIC_VECTOR(1 downto 0);
    port(adr: in STD_LOGIC_VECTOR(1 downto 0);
dout: out STD_LOGIC_VECTOR(2 downto 0));
dout: out STD_LOGIC_VECTOR(2 downto 0));
end;
end;
architecture synth of rom is
architecture synth of rom is
begin
begin
    process(al1) begin
    process(al1) begin
    case adr is
    case adr is
        when "00" => dout <= "011";
        when "00" => dout <= "011";
        when "01" => dout <= "110";
        when "01" => dout <= "110";
        when "10" => dout <= "100";
        when "10" => dout <= "100";
        when "11" => dout <= "010";
        when "11" => dout <= "010";
        end case;
        end case;
    end process;
    end process;
end;
```

```
end;
```

```


Figure \(5.55 \boldsymbol{M} \times \boldsymbol{N} \times \boldsymbol{P}\)-bit PLA

Figure 5.56 shows the dot notation for a \(3 \times 3 \times 2\)-bit PLA performing the functions \(X=\bar{A} \bar{B} C+A B \bar{C}\) and \(Y=A \bar{B}\). Each row in the AND array forms an implicant. Dots in each row of the AND array indicate which literals comprise the implicant. The AND array in Figure 5.56 forms three implicants: \(\bar{A} \bar{B} C, A B \bar{C}\), and \(A \bar{B}\). Dots in the OR array indicate which implicants are part of the output function.

Figure 5.57 shows how PLAs can be built using two-level logic. An alternative implementation is given in Section 5.6.3.

ROMs can be viewed as a special case of PLAs. A \(2^{M}\)-word \(\times N\)-bit ROM is simply an \(M \times 2^{M} \times N\)-bit PLA. The decoder behaves as an AND plane that produces all \(2^{M}\) minterms. The ROM array behaves as an OR plane that produces the outputs. If the function does not depend on all \(2^{M}\) minterms, a PLA is likely to be smaller than a ROM. For example, an 8 -word \(\times 2\)-bit ROM is required to perform

Figure \(5.563 \times 3 \times 2\)-bit PLA: dot notation

Figure \(5.573 \times 3 \times 2\)-bit PLA using two-level logic

the same functions performed by the \(3 \times 3 \times 2\)-bit PLA shown in Figures 5.56 and 5.57.

Simple programmable logic devices (SPLDs) are souped-up PLAs that add registers and various other features to the basic AND/OR planes. However, SPLDs and PLAs have largely been displaced by FPGAs, which are more flexible and efficient for building large systems.

\subsection*{5.6.2 Field Programmable Gate Array}

A field programmable gate array (FPGA) is an array of reconfigurable gates. Using software programming tools, a user can implement designs on the FPGA using either an HDL or a schematic. FPGAs are more
powerful and more flexible than PLAs for several reasons. They can implement both combinational and sequential logic. They can also implement multilevel logic functions, whereas PLAs can only implement two-level logic. Modern FPGAs integrate other useful features such as built-in multipliers, high-speed I/Os, data converters including analog-to-digital converters, large RAM arrays, and processors.

FPGAs are built as an array of configurable logic elements (LEs), also referred to as configurable logic blocks (CLBs). Each LE can be configured to perform combinational or sequential functions. Figure 5.58 shows a general block diagram of an FPGA. The LEs are surrounded by input/output elements (IOEs) for interfacing with the outside world. The IOEs connect LE inputs and outputs to pins on the chip package. LEs can connect to other LEs and IOEs through programmable routing channels.

Two of the leading FPGA manufacturers are Altera Corp. and Xilinx, Inc. Figure 5.59 shows a single LE from Altera's Cyclone IV FPGA introduced in 2009. The key elements of the LE are a 4 -input lookup table (LUT) and a 1-bit register. The LE also contains configurable multiplexers to route signals through the LE. The FPGA is configured by specifying the contents of the lookup tables and the select signals for the multiplexers.

Figure 5.58 General FPGA layout


FPGAs are the brains of many consumer products, including automobiles, medical equipment, and media devices like MP3 players. The Mercedes Benz S-Class series, for example, has over a dozen Xilinx FPGAs or PLDs for uses ranging from entertainment to navigation to cruise control systems. FPGAs allow for quick time to market and make debugging or adding features late in the design process easier.


Figure 5.59 Cyclone IV Logic Element (LE)
(Reproduced with permission from the Altera Cyclone \({ }^{\text {TM }}\) IV Handbook © 2010 Altera Corporation.)

The Cyclone IV LE has one 4-input LUT and one flip-flop. By loading the appropriate values into the lookup table, the LUT can be configured to perform any function of up to four variables. Configuring the FPGA also involves choosing the select signals that determine how the multiplexers route data through the LE and to neighboring LEs and IOEs. For example, depending on the multiplexer configuration, the LUT may receive one of its inputs from either data 3 or the output of the LE's own register. The other three inputs always come from data 1, data 2, and data 4. The data 1-4 inputs come from IOEs or the outputs of other LEs, depending on routing external to the LE. The LUT output either goes directly to the LE output for combinational functions, or it can be fed through the flip-flop for registered functions. The flip-flop input comes from its own LUT output, the data 3 input, or the register output of the previous LE. Additional hardware includes support for addition using the carry chain hardware, other multiplexers for routing, and flip-flop enable and reset. Altera groups 16 LEs together to create a logic array block (LAB) and provides local connections between LEs within the LAB.

In summary, the Cyclone IV LE can perform one combinational and/or registered function which can involve up to four variables. Other brands of FPGAs are organized somewhat differently, but the same
general principles apply. For example, Xilinx's 7-series FPGAs use 6-input LUTs instead of 4-input LUTs.

The designer configures an FPGA by first creating a schematic or HDL description of the design. The design is then synthesized onto the FPGA. The synthesis tool determines how the LUTs, multiplexers, and routing channels should be configured to perform the specified functions. This configuration information is then downloaded to the FPGA. Because Cyclone IV FPGAs store their configuration information in SRAM, they are easily reprogrammed. The FPGA may download its SRAM contents from a computer in the laboratory or from an EEPROM chip when the system is turned on. Some manufacturers include an EEPROM directly on the FPGA or use one-time programmable fuses to configure the FPGA.

\section*{Example 5.5 FUNCTIONS BUILT USING LEs}

Explain how to configure one or more Cyclone IV LEs to perform the following functions: (a) \(X=\bar{A} \bar{B} C+A B \bar{C}\) and \(Y=A \bar{B}\) (b) \(Y=J K L M P Q R\); (c) a divide-by-3 counter with binary state encoding (see Figure 3.29(a)). You may show interconnection between LEs as needed.

Solution: (a) Configure two LEs. One LUT computes \(X\) and the other LUT computes \(Y\), as shown in Figure 5.60. For the first LE, inputs data 1, data 2, and data 3 are \(A, B\), and \(C\), respectively (these connections are set by the routing channels). data 4 is a don't care but must be tied to something, so it is tied to 0 . For the second LE, inputs data 1 and data 2 are \(A\) and \(B\); the other LUT inputs are don't cares and are tied to 0 . Configure the final multiplexers to select the combinational outputs from the LUTs to produce \(X\) and \(Y\). In general, a single LE can compute any function of up to four input variables in this fashion.



Figure 5.60 LE configuration for two functions of up to four inputs each

Figure 5.61 LE configuration for one function of more than four inputs

Figure 5.62 LE configuration for FSM with two bits of state
(b) Configure the LUT of the first LE to compute \(X=J K L M\) and the LUT on the second LE to compute \(Y=X P Q R\). Configure the final multiplexers to select the combinational outputs \(X\) and \(Y\) from each LE. This configuration is shown in Figure 5.61. Routing channels between LEs, indicated by the dashed blue lines, connect the output of LE 1 to the input of LE 2 . In general, a group of LEs can compute functions of \(N\) input variables in this manner.
(c) The FSM has two bits of state ( \(S_{1: 0}\) ) and one output (Y). The next state depends on the two bits of current state. Use two LEs to compute the next state from the current state, as shown in Figure 5.62. Use the two flip-flops, one from


each LE, to hold this state. The flip-flops have a reset input that can be connected to an external Reset signal. The registered outputs are fed back to the LUT inputs using the multiplexer on data 3 and routing channels between LEs, as indicated by the dashed blue lines. In general, another LE might be necessary to compute the output \(Y\). However, in this case \(Y=S_{0}^{\prime}\), so \(Y\) can come from LE 1. Hence, the entire FSM fits in two LEs. In general, an FSM requires at least one LE for each bit of state, and it may require more LEs for the output or next state logic if they are too complex to fit in a single LUT.

\section*{Example 5.6 LE DELAY}

Alyssa P. Hacker is building a finite state machine that must run at 200 MHz . She uses a Cyclone IV FPGA with the following specifications: \(t_{L E}=381\) ps per LE, \(t_{\text {setup }}=76 \mathrm{ps}\), and \(t_{p c q}=199 \mathrm{ps}\) for all flip-flops. The wiring delay between LEs is 246 ps . Assume the hold time for the flip-flops is 0 . What is the maximum number of LEs her design can use?

Solution: Alyssa uses Equation 3.13 to solve for the maximum propagation delay of the logic: \(t_{p d} \leq T_{c}-\left(t_{p c q}+t_{\text {setup }}\right)\).
Thus, \(t_{p d}=5 \mathrm{~ns}-(0.199 \mathrm{~ns}+0.076 \mathrm{~ns})\), so \(t_{p d} \leq 4.725 \mathrm{~ns}\). The delay of each LE plus wiring delay between LEs, \(t_{\text {LE+wire }}\), is \(381 \mathrm{ps}+246 \mathrm{ps}=627 \mathrm{ps}\). The maximum number of LEs, \(N\), is \(N t_{L E+w i r e} \leq 4.725 \mathrm{~ns}\). Thus, \(N=7\).

\subsection*{5.6.3 Array Implementations*}

To minimize their size and cost, ROMs and PLAs commonly use pseudonMOS or dynamic circuits (see Section 1.7.8) instead of conventional logic gates.

Figure 5.63(a) shows the dot notation for a \(4 \times 3\)-bit ROM that performs the following functions: \(X=A \oplus B, Y=\bar{A}+B\), and \(Z=\bar{A} \bar{B}\). These are the same functions as those of Figure 5.50 , with the address inputs renamed \(A\) and \(B\) and the data outputs renamed \(X, Y\), and \(Z\). The pseudo-nMOS implementation is given in Figure 5.63(b). Each decoder output is connected to the gates of the nMOS transistors in its row. Remember that in pseudo-nMOS circuits, the weak pMOS transistor pulls the output HIGH only if there is no path to GND through the pulldown (nMOS) network.

Pull-down transistors are placed at every junction without a dot. The dots from the dot notation diagram of Figure 5.63(a) are left visible in Figure 5.63(b) for easy comparison. The weak pull-up transistors pull the output HIGH for each wordline without a pull-down transistor. For example, when \(A B=11\), the 11 wordline is HIGH and transistors on \(X\) and \(Z\) turn on and pull those outputs LOW. The \(Y\) output has no

Many ROMs and PLAs use dynamic circuits in place of pseudo-nMOS circuits. Dynamic gates turn the pMOS transistor ON for only part of the time, saving power when the pMOS is OFF and the result is not needed. Aside from this, dynamic and pseudo-nMOS memory arrays are similar in design and behavior.


Figure 5.63 ROM implementation: (a) dot notation, (b) pseudo-nMOS circuit

transistor connecting to the 11 wordline, so \(Y\) is pulled HIGH by the weak pull-up.

PLAs can also be built using pseudo-nMOS circuits, as shown in Figure 5.64 for the PLA from Figure 5.56. Pull-down (nMOS) transistors are placed on the complement of dotted literals in the AND array and on dotted rows in the OR array. The columns in the OR array are sent through an inverter before they are fed to the output bits. Again, the blue dots from the dot notation diagram of Figure 5.56 are left visible in Figure 5.64 for easy comparison.

\subsection*{5.7 SUMMARY}

This chapter introduced digital building blocks used in many digital systems. These blocks include arithmetic circuits such as adders, subtractors, comparators, shifters, multipliers, and dividers; sequential circuits such as counters and shift registers; and arrays for memory and logic. The chapter also explored fixed-point and floating-point representations of fractional numbers. In Chapter 7, we use these building blocks to build a microprocessor.

Adders form the basis of most arithmetic circuits. A half adder adds two 1-bit inputs, \(A\) and \(B\), and produces a sum and a carry out. A full adder extends the half adder to also accept a carry in. \(N\) full adders can be cascaded to form a carry propagate adder (CPA) that adds two \(N\)-bit numbers. This type of CPA is called a ripple-carry adder because the carry ripples through each of the full adders. Faster CPAs can be constructed using lookahead or prefix techniques.

A subtractor negates the second input and adds it to the first. A magnitude comparator subtracts one number from another and determines the relative value based on the sign of the result. A multiplier forms partial products using AND gates, then sums these bits using full adders. A divider repeatedly subtracts the divisor from the partial remainder and checks the sign of the difference to determine the quotient bits. A counter uses an adder and a register to increment a running count.

Fractional numbers are represented using fixed-point or floating-point forms. Fixed-point numbers are analogous to decimals, and floating-point numbers are analogous to scientific notation. Fixed-point numbers use ordinary arithmetic circuits, whereas floating-point numbers require more elaborate hardware to extract and process the sign, exponent, and mantissa.

Large memories are organized into arrays of words. The memories have one or more ports to read and/or write the words. Volatile memories, such as SRAM and DRAM, lose their state when the power is turned off. SRAM is faster than DRAM but requires more transistors. A register file is a small multiported SRAM array. Nonvolatile memories, called ROMs, retain their state indefinitely. Despite their names, most modern ROMs can be written.

Arrays are also a regular way to build logic. Memory arrays can be used as lookup tables to perform combinational functions. PLAs are composed of dedicated connections between configurable AND and OR arrays; they only implement combinational logic. FPGAs are composed of many small lookup tables and registers; they implement combinational and sequential logic. The lookup table contents and their interconnections can be configured to perform any logic function. Modern FPGAs are easy to reprogram and are large and cheap enough to build highly sophisticated digital systems, so they are widely used in low- and medium-volume commercial products as well as in education.

\section*{Exercises}

Exercise 5.1 What is the delay for the following types of 64-bit adders? Assume that each two-input gate delay is 150 ps and that a full adder delay is 450 ps .
(a) a ripple-carry adder
(b) a carry-lookahead adder with 4-bit blocks
(c) a prefix adder

Exercise 5.2 Design two adders: a 64-bit ripple-carry adder and a 64-bit carrylookahead adder with 4 -bit blocks. Use only two-input gates. Each two-input gate is \(15 \mu^{2}\), has a 50 ps delay, and has 20 fF of total gate capacitance. You may assume that the static power is negligible.
(a) Compare the area, delay, and power of the adders (operating at 100 MHz and 1.2 V ).
(b) Discuss the trade-offs between power, area, and delay.

Exercise 5.3 Explain why a designer might choose to use a ripple-carry adder instead of a carry-lookahead adder.

Exercise 5.4 Design the 16 -bit prefix adder of Figure 5.7 in an HDL. Simulate and test your module to prove that it functions correctly.

Exercise 5.5 The prefix network shown in Figure 5.7 uses black cells to compute all of the prefixes. Some of the block propagate signals are not actually necessary. Design a "gray cell" that receives G and \(P\) signals for bits \(i: k\) and \(k-1: j\) but produces only \(\mathrm{G}_{i: j}\), not \(P_{i, j}\). Redraw the prefix network, replacing black cells with gray cells wherever possible.

Exercise 5.6 The prefix network shown in Figure 5.7 is not the only way to calculate all of the prefixes in logarithmic time. The Kogge-Stone network is another common prefix network that performs the same function using a different connection of black cells. Research Kogge-Stone adders and draw a schematic similar to Figure 5.7 showing the connection of black cells in a Kogge-Stone adder.

Exercise 5.7 Recall that an N -input priority encoder has \(\log _{2} \mathrm{~N}\) outputs that encodes which of the \(N\) inputs gets priority (see Exercise 2.36).
(a) Design an N -input priority encoder that has delay that increases logarithmically with \(N\). Sketch your design and give the delay of the circuit in terms of the delay of its circuit elements.
(b) Code your design in an HDL. Simulate and test your module to prove that it functions correctly.

Exercise 5.8 Design the following comparators for 32-bit unsigned numbers. Sketch the schematics.
(a) not equal
(b) greater than or equal to
(c) less than

Exercise 5.9 Consider the signed comparator of Figure 5.12.
(a) Give an example of two 4-bit signed numbers \(A\) and \(B\) for which a 4-bit signed comparator correctly computes \(A<B\).
(b) Give an example of two 4-bit signed numbers \(A\) and \(B\) for which a 4-bit signed comparator incorrectly computes \(A<B\).
(c) In general, when does the \(N\)-bit signed comparator operate incorrectly?

Exercise 5.10 Modify the \(N\)-bit signed comparator of Figure 5.12 to correctly compute \(A<B\) for all \(N\)-bit signed inputs \(A\) and \(B\).

Exercise 5.11 Design the 32-bit ALU shown in Figure 5.15 using your favorite HDL. You can make the top-level module either behavioral or structural.

Exercise 5.12 Design the 32-bit ALU shown in Figure 5.17 using your favorite HDL. You can make the top-level module either behavioral or structural.

Exercise 5.13 Write a testbench to test the 32-bit ALU from Exercise 5.11. Then use it to test the ALU. Include any test vector files necessary. Be sure to test enough corner cases to convince a reasonable skeptic that the ALU functions correctly.

Exercise 5.14 Repeat Exercise 5.13 for the ALU from Exercise 5.12.
Exercise 5.15 Build an Unsigned Comparison Unit that compares two unsigned numbers \(A\) and \(B\). The unit's input is the ALUFlags signal ( \(N, Z, C, V\) ) from the ALU of Figure 5.16, with the ALU performing subtraction: \(A-B\). The unit's outputs are \(H S, L S, H I\), and \(L O\), which indicate that \(A\) is higher than or the same as \((H S)\), lower than or the same as \((L S)\), higher \((H I)\), or lower \((L O)\) than \(B\).
(a) Write minimal equations for \(H S, L S, H I\), and \(L O\) in terms of \(N, Z, C\), and \(V\).
(b) Sketch circuits for \(H S, L S, H I\), and \(L O\).

Exercise 5.16 Build a Signed Comparison Unit that compares two signed numbers \(A\) and \(B\). The unit's input is the ALUFlags signal ( \(N, Z, C, V\) ) from the ALU of Figure 5.16, with the ALU performing subtraction: \(A-B\). The unit's outputs are \(G E, L E, G T\), and \(L T\), which indicate that \(A\) is greater than or equal to (GE), less than or equal to \((L E)\), greater than \((G T)\), or less than \((L T) B\).
(a) Write minimal equations for \(G E, L E, G T\), and \(L T\) in terms of \(N, Z, C\), and \(V\).
(b) Sketch circuits for \(G E, L E, G T\), and \(L T\).

Exercise 5.17 Design a shifter that always shifts a 32 -bit input left by 2 bits. The input and output are both 32 bits. Explain the design in words and sketch a schematic. Implement your design in your favorite HDL.

Exercise 5.18 Design 4-bit left and right rotators. Sketch a schematic of your design. Implement your design in your favorite HDL.

Exercise 5.19 Design an 8 -bit left shifter using only \(242: 1\) multiplexers. The shifter accepts an 8 -bit input \(A\) and a 3 -bit shift amount, shamt \(_{2: 0}\). It produces an 8 -bit output Y. Sketch the schematic.

Exercise 5.20 Explain how to build any N -bit shifter or rotator using only \(N \log _{2} N\) 2:1 multiplexers.

Exercise 5.21 The funnel shifter in Figure 5.65 can perform any \(N\)-bit shift or rotate operation. It shifts a 2 N -bit input right by \(k\) bits. The output \(Y\) is the \(N\) least significant bits of the result. The most significant \(N\) bits of the input are called \(B\) and the least significant \(N\) bits are called \(C\). By choosing appropriate values of \(B, \mathrm{C}\), and \(k\), the funnel shifter can perform any type of shift or rotate. Explain what these values should be in terms of \(A\), shamt, and \(N\) for
(a) logical right shift of \(A\) by shamt
(b) arithmetic right shift of \(A\) by shamt
(c) left shift of \(A\) by shamt
(d) right rotate of \(A\) by shamt
(e) left rotate of \(A\) by shamt


Figure 5.65 Funnel shifter

Exercise 5.22 Find the critical path for the \(4 \times 4\) multiplier from Figure 5.20 in terms of an AND gate delay ( \(t_{\mathrm{AND}}\) ) and an adder delay \(\left(t_{\mathrm{FA}}\right)\) What is the delay of an \(N \times N\) multiplier built in the same way?

Exercise 5.23 Find the critical path for the \(4 \times 4\) divider from Figure 5.21 in terms of a \(2: 1\) mux delay ( \(t_{\mathrm{MUX}}\) ), an adder delay ( \(t_{\mathrm{FA}}\) ), and an inverter delay \(\left(t_{\mathrm{INV}}\right)\). What is the delay of an \(N \times N\) divider built in the same way?

Exercise 5.24 Design a multiplier that handles two's complement numbers.
Exercise 5.25 A sign extension unit extends a two's complement number from \(M\) to \(N(N>M)\) bits by copying the most significant bit of the input into the upper bits of the output (see Section 1.4.6). It receives an \(M\)-bit input \(A\) and produces an \(N\)-bit output \(Y\). Sketch a circuit for a sign extension unit with a 4-bit input and an 8 -bit output. Write the HDL for your design.

Exercise 5.26 A zero extension unit extends an unsigned number from \(M\) to \(N\) bits ( \(N>M\) ) by putting zeros in the upper bits of the output. Sketch a circuit for a zero extension unit with a 4 -bit input and an 8 -bit output. Write the HDL for your design.

Exercise 5.27 Compute \(111001.000_{2} / 001100.000_{2}\) in binary using the standard division algorithm from elementary school. Show your work.

Exercise 5.28 What is the range of numbers that can be represented by the following number systems?
(a) 24-bit unsigned fixed-point numbers with 12 integer bits and 12 fraction bits
(b) 24-bit sign and magnitude fixed-point numbers with 12 integer bits and 12 fraction bits
(c) 24-bit two's complement fixed-point numbers with 12 integer bits and 12 fraction bits

Exercise 5.29 Express the following base 10 numbers in 16-bit fixed-point sign/ magnitude format with eight integer bits and eight fraction bits. Express your answer in hexadecimal.
(a) -13.5625
(b) 42.3125
(c) -17.15625

Exercise 5.30 Express the following base 10 numbers in 12-bit fixed-point sign/ magnitude format with six integer bits and six fraction bits. Express your answer in hexadecimal.
(a) -30.5
(b) 16.25
(c) -8.078125

Exercise 5.31 Express the base 10 numbers in Exercise 5.29 in 16-bit fixed-point two's complement format with eight integer bits and eight fraction bits. Express your answer in hexadecimal.

Exercise 5.32 Express the base 10 numbers in Exercise 5.30 in 12-bit fixed-point two's complement format with six integer bits and six fraction bits. Express your answer in hexadecimal.

Exercise 5.33 Express the base 10 numbers in Exercise 5.29 in IEEE 754 singleprecision floating-point format. Express your answer in hexadecimal.

Exercise 5.34 Express the base 10 numbers in Exercise 5.30 in IEEE 754 singleprecision floating-point format. Express your answer in hexadecimal.

Exercise 5.35 Convert the following two's complement binary fixed-point numbers to base 10. The implied binary point is explicitly shown to aid in your interpretation.
(a) 0101.1000
(b) 1111.1111
(c) 1000.0000

Exercise 5.36 Repeat Exercise 5.35 for the following two's complement binary fixed-point numbers.
(a) 011101.10101
(b) 100110.11010
(c) 101000.00100

Exercise 5.37 When adding two floating-point numbers, the number with the smaller exponent is shifted. Why is this? Explain in words and give an example to justify your explanation.

Exercise 5.38 Add the following IEEE 754 single-precision floating-point numbers.
(a) \(\mathrm{C} 0123456+81 \mathrm{C} 564 \mathrm{~B} 7\)
(b) D0B10301 + D1B43203
(c) \(5 \mathrm{EF} 10324+5 \mathrm{E} 039020\)

Exercise 5.39 Add the following IEEE 754 single-precision floating-point numbers.
(a) C0D20004 + 72407020
(b) C0D20004 + 40DC0004
(c) \((5\) FBE \(4000+3\) FF80000 \()+\) DFDE4000 (Why is the result counterintuitive? Explain.)

Exercise 5.40 Expand the steps in section 5.3.2 for performing floating-point addition to work for negative as well as positive floating-point numbers.

Exercise 5.41 Consider IEEE 754 single-precision floating-point numbers.
(a) How many numbers can be represented by IEEE 754 single-precision floatingpoint format? You need not count \(\pm \infty\) or NaN .
(b) How many additional numbers could be represented if \(\pm \infty\) and NaN were not represented?
(c) Explain why \(\pm \infty\) and NaN are given special representations.

Exercise 5.42 Consider the following decimal numbers: 245 and 0.0625 .
(a) Write the two numbers using single-precision floating-point notation. Give your answers in hexadecimal.
(b) Perform a magnitude comparison of the two 32-bit numbers from part (a). In other words, interpret the two 32 -bit numbers as two's complement numbers and compare them. Does the integer comparison give the correct result?
(c) You decide to come up with a new single-precision floating-point notation. Everything is the same as the IEEE 754 single-precision floating-point standard, except that you represent the exponent using two's complement instead of a bias. Write the two numbers using your new standard. Give your answers in hexadecimal.
(e) Does integer comparison work with your new floating-point notation from part (d)?
(f) Why is it convenient for integer comparison to work with floating-point numbers?

Exercise 5.43 Design a single-precision floating-point adder using your favorite HDL. Before coding the design in an HDL, sketch a schematic of your design. Simulate and test your adder to prove to a skeptic that it functions correctly. You may consider positive numbers only and use round toward zero (truncate). You may also ignore the special cases given in Table 5.2.

Exercise 5.44 In this problem, you will explore the design of a 32-bit floating-point multiplier. The multiplier has two 32 -bit floating-point inputs and produces a 32 -bit floating-point output. You may consider positive numbers only and use round toward zero (truncate). You may also ignore the special cases given in Table 5.2.
(a) Write the steps necessary to perform 32-bit floating-point multiplication.
(b) Sketch the schematic of a 32-bit floating-point multiplier.
(c) Design a 32 -bit floating-point multiplier in an HDL. Simulate and test your multiplier to prove to a skeptic that it functions correctly.

Exercise 5.45 In this problem, you will explore the design of a 32-bit prefix adder.
(a) Sketch a schematic of your design.
(b) Design the 32-bit prefix adder in an HDL. Simulate and test your adder to prove that it functions correctly.
(c) What is the delay of your 32-bit prefix adder from part (a)? Assume that each two-input gate delay is 100 ps.
(d) Design a pipelined version of the 32-bit prefix adder. Sketch the schematic of your design. How fast can your pipelined prefix adder run? You may assume a sequencing overhead ( \(\left.t_{p c q}+t_{\text {setup }}\right)\) of 80 ps. Make the design run as fast as possible.
(e) Design the pipelined 32-bit prefix adder in an HDL.

Exercise 5.46 An incrementer adds 1 to an \(N\)-bit number. Build an 8 -bit incrementer using half adders.

Exercise 5.47 Build a 32-bit synchronous Up/Down counter. The inputs are Reset and \(U p\). When Reset is 1 , the outputs are all 0 . Otherwise, when \(U p=1\), the circuit counts up, and when \(U p=0\), the circuit counts down.

Exercise 5.48 Design a 32 -bit counter that adds 4 at each clock edge. The counter has reset and clock inputs. Upon reset, the counter output is all 0 .

Exercise 5.49 Modify the counter from Exercise 5.48 such that the counter will either increment by 4 or load a new 32 -bit value, \(D\), on each clock edge, depending on a control signal Load. When Load \(=1\), the counter loads the new value \(D\).

Exercise 5.50 An N-bit Johnson counter consists of an N-bit shift register with a reset signal. The output of the shift register \(\left(S_{\text {out }}\right)\) is inverted and fed back to the input \(\left(S_{\text {in }}\right)\). When the counter is reset, all of the bits are cleared to 0 .
(a) Show the sequence of outputs, \(Q_{3: 0}\), produced by a 4-bit Johnson counter starting immediately after the counter is reset.
(b) How many cycles elapse until an N -bit Johnson counter repeats its sequence? Explain.
(c) Design a decimal counter using a 5-bit Johnson counter, ten AND gates, and inverters. The decimal counter has a clock, a reset, and ten one-hot outputs \(Y_{9: 0}\). When the counter is reset, \(Y_{0}\) is asserted. On each subsequent cycle, the next output should be asserted. After ten cycles, the counter should repeat. Sketch a schematic of the decimal counter.
(d) What advantages might a Johnson counter have over a conventional counter?

Exercise 5.51 Write the HDL for a 4-bit scannable flip-flop like the one shown in Figure 5.38. Simulate and test your HDL module to prove that it functions correctly.

Exercise 5.52 The English language has a good deal of redundancy that allows us to reconstruct garbled transmissions. Binary data can also be transmitted in redundant form to allow error correction. For example, the number 0 could be coded as 00000 and the number 1 could be coded as 11111 . The value could then be sent over a noisy channel that might flip up to two of the bits. The receiver could reconstruct the original data because a 0 will have at least three of the five received bits as 0 's; similarly a 1 will have at least three 1 's.
(a) Propose an encoding to send \(00,01,10\), or 11 encoded using five bits of information such that all errors that corrupt one bit of the encoded data can be corrected. Hint: the encodings 00000 and 11111 for 00 and 11, respectively, will not work.
(b) Design a circuit that receives your five-bit encoded data and decodes it to \(00,01,10\), or 11 , even if one bit of the transmitted data has been changed.
(c) Suppose you wanted to change to an alternative 5-bit encoding. How might you implement your design to make it easy to change the encoding without having to use different hardware?

Exercise 5.53 Flash EEPROM, simply called Flash memory, is a fairly recent invention that has revolutionized consumer electronics. Research and explain how Flash memory works. Use a diagram illustrating the floating gate. Describe how a bit in the memory is programmed. Properly cite your sources.

Exercise 5.54 The extraterrestrial life project team has just discovered aliens living on the bottom of Mono Lake. They need to construct a circuit to classify the aliens by potential planet of origin based on measured features available from the NASA probe: greenness, brownness, sliminess, and ugliness. Careful consultation with xenobiologists leads to the following conclusions:
- If the alien is green and slimy or ugly, brown, and slimy, it might be from Mars.
- If the critter is ugly, brown, and slimy, or green and neither ugly nor slimy, it might be from Venus.
- If the beastie is brown and neither ugly nor slimy or is green and slimy, it might be from Jupiter.

Note that this is an inexact science; for example, a life form which is mottled green and brown and is slimy but not ugly might be from either Mars or Jupiter.
(a) Program a \(4 \times 4 \times 3\) PLA to identify the alien. You may use dot notation.
(b) Program a \(16 \times 3\) ROM to identify the alien. You may use dot notation.
(c) Implement your design in an HDL.

Exercise 5.55 Implement the following functions using a single \(16 \times 3\) ROM. Use dot notation to indicate the ROM contents.
(a) \(X=A B+B \bar{C} D+\bar{A} \bar{B}\)
(b) \(Y=A B+B D\)
(c) \(Z=A+B+C+D\)

Exercise 5.56 Implement the functions from Exercise 5.55 using a \(4 \times 8 \times 3\) PLA. You may use dot notation.

Exercise 5.57 Specify the size of a ROM that you could use to program each of the following combinational circuits. Is using a ROM to implement these functions a good design choice? Explain why or why not.
(a) a 16-bit adder/subtractor with \(\mathrm{C}_{\text {in }}\) and \(\mathrm{C}_{\text {out }}\)
(b) an \(8 \times 8\) multiplier
(c) a 16-bit priority encoder (see Exercise 2.36)

Exercise 5.58 Consider the ROM circuits in Figure 5.66. For each row, can the circuit in column I be replaced by an equivalent circuit in column II by proper programming of the latter's ROM?


Figure 5.66 ROM circuits

Exercise 5.59 How many Cyclone IV FPGA LEs are required to perform each of the following functions? Show how to configure one or more LEs to perform the function. You should be able to do this by inspection, without performing logic synthesis.
(a) the combinational function from Exercise 2.13(c)
(b) the combinational function from Exercise 2.17(c)
(c) the two-output function from Exercise 2.24
(d) the function from Exercise 2.35
(e) a four-input priority encoder (see Exercise 2.36)

Exercise 5.60 Repeat Exercise 5.59 for the following functions.
(a) an eight-input priority encoder (see Exercise 2.36)
(b) a 3:8 decoder
(c) a 4-bit carry propagate adder (with no carry in or out)
(d) the FSM from Exercise 3.22
(e) the Gray code counter from Exercise 3.27

Exercise 5.61 Consider the Cyclone IV LE shown in Figure 5.59. According to the datasheet, it has the timing specifications given in Table 5.5.
(a) What is the minimum number of Cyclone IV LEs required to implement the FSM of Figure 3.26?
(b) Without clock skew, what is the fastest clock frequency at which this FSM will run reliably?
(c) With 3 ns of clock skew, what is the fastest frequency at which the FSM will run reliably?

Table 5.5 Cyclone IV timing
\begin{tabular}{ll}
\multicolumn{1}{c}{ Name } & Value (ps) \\
\hline\(t_{p c q}, t_{c c q}\) & 199 \\
\hline\(t_{\text {setup }}\) & 76 \\
\hline\(t_{\text {hold }}\) & 0 \\
\hline\(t_{p d}\) (per LE) & 381 \\
\hline\(t_{\text {wire }}\) (between LEs) & 246 \\
\hline\(t_{\text {skew }}\) & 0 \\
\hline
\end{tabular}

Exercise 5.62 Repeat Exercise 5.61 for the FSM of Figure 3.31(b).
Exercise 5.63 You would like to use an FPGA to implement an M\&M sorter with a color sensor and motors to put red candy in one jar and green candy in another. The design is to be implemented as an FSM using a Cyclone IV FPGA. According to the data sheet, the FPGA has timing characteristics shown in Table 5.5. You would like your FSM to run at 100 MHz . What is the maximum number of LEs on the critical path? What is the fastest speed at which the FSM will run?

\section*{Interview Questions}

The following exercises present questions that have been asked at interviews for digital design jobs.

Question 5.1 What is the largest possible result of multiplying two unsigned \(N\)-bit numbers?

Question 5.2 Binary coded decimal ( \(B C D\) ) representation uses four bits to encode each decimal digit. For example, \(42_{10}\) is represented as \(01000010_{\text {BCD }}\). Explain in words why processors might use BCD representation.

Question 5.3 Design hardware to add two 8-bit unsigned BCD numbers (see Question 5.2). Sketch a schematic for your design, and write an HDL module for the BCD adder. The inputs are \(A, B\), and \(C_{\text {in }}\), and the outputs are \(S\) and \(C_{\text {out }} . C_{\text {in }}\) and \(C_{\text {out }}\) are 1 -bit carries and \(A, B\), and \(S\) are 8 -bit BCD numbers.```

