Just as the grade-school algorithm for multiplying decimal numbers involves the multiplication table and a great deal of addition, it turns out that the largest part of a binary multiplier is concerned with addition. The binary analogue of the multiplication table is simply an AND gate: if both digits are one, their product is one; otherwise at least one of them is a zero and the product is zero.
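As a minimal sketch (the function name is illustrative, not from the text), the one-bit "multiplication table" is nothing more than the AND of the two bits:

```python
def bit_product(a, b):
    """The binary multiplication table: a single AND gate suffices."""
    return a & b

# All four table entries: only 1 x 1 gives 1.
table = {(a, b): bit_product(a, b) for a in (0, 1) for b in (0, 1)}
```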

If *X* is an integer between -2^(*p*-1) and 2^(*p*-1)-1 and
we write *X*[*j*] to denote the *j*th bit
of the *p*-bit twos-complement representation of *X*, then

    X = -X[p-1] 2^(p-1) + sum_{j=0}^{p-2} X[j] 2^j                    (2)

Note that the bits are numbered starting with zero. The reservation
of one of the *p* bits to store sign information leads to some
awkwardness in notation, but the above is conventional.
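The expansion above is easy to check in software; a quick sketch (the helper names are illustrative, not from the text):

```python
def bit(x, j, p):
    """X[j]: the j-th bit of the p-bit twos-complement representation of x."""
    return ((x + (1 << p)) >> j) & 1 if x < 0 else (x >> j) & 1

def from_bits(bits):
    """Reassemble x from its bits; the sign bit carries weight -2^(p-1)."""
    p = len(bits)
    return -bits[p - 1] * 2 ** (p - 1) + sum(b << j for j, b in enumerate(bits[: p - 1]))

# Round-trip over every 4-bit twos-complement value:
assert all(from_bits([bit(x, j, 4) for j in range(4)]) == x for x in range(-8, 8))
```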

Let us consider only positive data and coefficients for the moment. Using a superscript + to denote this assumption, equation (1) can be rewritten as

    Y+ = sum_{i=0}^{n-1} sum_{j=0}^{p-2} sum_{k=0}^{q-2} C_i[j] A_i[k] 2^(j+k)    (3)

where *p* and *q* are the number of bits in the (signed)
representation of the *C*s and *A*s, respectively.
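Equation (3) can be sanity-checked numerically. The sketch below assumes positive operands, so the sign bits are zero and the sums stop at *p*-2 and *q*-2 (variable names are illustrative):

```python
def sop_bitwise(C, A, p, q):
    """Triple sum over i, j, k of C_i[j] * A_i[k] * 2^(j+k), as in (3)."""
    total = 0
    for i in range(len(C)):                      # the i loop: products
        for j in range(p - 1):                   # the j loop: bits of C_i
            for k in range(q - 1):               # the k loop: bits of A_i
                cij = (C[i] >> j) & 1
                aik = (A[i] >> k) & 1
                total += (cij & aik) << (j + k)  # single-bit product: an AND gate
    return total

C, A = [3, 5, 2], [4, 1, 6]
assert sop_bitwise(C, A, p=4, q=4) == sum(c * a for c, a in zip(C, A))
```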

In an electronic circuit implementing a binary arithmetic function, power-of-two factors have no physical realization. Instead they are handled by keeping track, in the design of the circuit, of the significance of each bit. The bits themselves are represented by distinguishable electrical signals appearing at a particular place and time within the circuit.

The product of two bits is trivially computed by the standard logic AND function once the bits are brought together in time and space. Organizing the single-bit terms of (3) for efficient summation is the key to a successful design.

Let us identify the three nested summation symbols on the right-hand
side of (3) by the associated iteration variable. In
keeping with the terminology of software implementation
we will call the first summation symbol ``the *i* loop,'' *etc*.
We can now
concisely characterize a realization of (1) by describing
how it organizes the three loops.

A modern general-purpose computer has a rather large amount of chip
area dedicated to the *j* and *k* loops. One would therefore
naturally choose to implement the *i* loop in software with the
computation of each product taken as an atomic step.

It is possible to duplicate this approach in a programmable logic
chip, but the ``parallel multiplier'' structure consumes a very large
number of cells. Furthermore, the resulting structure is slow compared
to the intrinsic speed of the cells because carry bits must propagate
across the entire structure. If it is not fast enough, more space may
be committed to build more multipliers operating in parallel. We
would then say that the *i* loop has been partially parallelized.

The earliest digital computers operated on one pair of bits at a time.
In such a computer one of the inner loops, say the *k* loop,
would be realized serially under the control of sequencing hardware
while the *j* loop might be realized in a multiply subroutine and the
*i* loop would be explicit in the application program. This approach
requires only a tiny amount of hardware but, since that hardware
operates at a limited speed, it is painfully slow.
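The multiply subroutine of such a machine is the classic shift-and-add loop. A sketch under the positive-operand assumption (names are illustrative): the loop over the bits of *c* is the *j* loop, and each word-wide addition inside it would itself be performed bit-serially by the hardware, which is the *k* loop.

```python
def multiply(a, c, p):
    """Shift-and-add multiply: the j loop of a 'multiply subroutine'.
    Each addition here stands for a bit-serial k loop in the hardware."""
    product = 0
    for j in range(p - 1):        # the sign bit of a positive c is zero
        if (c >> j) & 1:
            product += a << j     # in early machines this add was bit-serial
    return product

assert multiply(7, 5, p=4) == 35
```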

An important class of mechanisms has the *k* loop implemented
serially with *i* and *j* parallel. Programmable logic chips
are designed to implement two-input binary addition with reasonable
efficiency, and by a clever feedback arrangement the *n* unsigned
products can be computed in *q*-1 passes through *n* adders.

This structure has the
advantage of a clear relationship to (1) and makes effective use
of high-level macro structures provided by the programmable logic chip
vendor. On the other hand, still more adders (or passes) are required
to sum the products and, even with chip structures optimized for binary
addition, the carry propagation delay may impose a speed limit if *p*+*q*
is large.
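A software model of this serial-*k* arrangement might look as follows (assumed names; a sketch, not the vendor structure). Within each pass the *n* adders work side by side (the *i* dimension), and each performs one full-width addition (the *j* dimension); note that the per-product results still have to be summed afterwards, as the text observes.

```python
def serial_k_multipliers(C, A, q):
    """q-1 passes (the serial k loop); within each pass, n adders (i)
    each perform one full-width addition (the parallel j dimension)."""
    n = len(C)
    prod = [0] * n
    for k in range(q - 1):              # one pass per magnitude bit of the A's
        for i in range(n):              # n adders operating in parallel
            if (A[i] >> k) & 1:
                prod[i] += C[i] << k    # full-width add: j is parallel
    return prod

C, A = [3, 5], [2, 3]
assert serial_k_multipliers(C, A, q=3) == [6, 15]
```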

This circuit can be improved by exchanging the *i* and *j*
loop realizations. The resulting circuit uses fewer cells in the
adder tree, in part because the tree has work to do in every clock
period and in part because the carry bits do not accumulate at the
top of the tree.

An implementation of this architecture with a 2:1 tree of synchronous binary adders and a final feedback accumulator is the standard FPGA solution for the sum-of-products problem. The adder tree is an example of a pipeline structure.
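The improved arrangement might be modeled as follows (an illustrative sketch, not the vendor macro): on each pass *k*, bit *k* of each *A* AND-selects the corresponding *C*, a 2:1 adder tree sums the selected terms across *i*, and a feedback accumulator applies the 2^*k* weight.

```python
def sop_adder_tree(C, A, q):
    """Each pass k: AND-select the C_i by bit k of A_i, sum them in a
    (conceptual) 2:1 adder tree, then fold the tree output into the
    accumulator with the appropriate power-of-two shift."""
    acc = 0
    for k in range(q - 1):
        tree_sum = sum(c for c, a in zip(C, A) if (a >> k) & 1)  # adder tree
        acc += tree_sum << k                                     # feedback accumulator
    return acc

C, A = [3, 5, 2], [2, 3, 1]
assert sop_adder_tree(C, A, q=3) == 3 * 2 + 5 * 3 + 2 * 1
```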

Mon Dec 11 17:02:42 CST 2000