next up previous
Next: Fixed Coefficients Up: High Order Sum-of-Products in Cellular Logic Previous: Introduction

Bit-Serial Multiplication

Just as the grade-school algorithm for multiplying decimal numbers involves the multiplication table and a great deal of addition, it turns out that the largest part of a binary multiplier is concerned with addition. The binary analogue of the multiplication table is simply an AND gate: if both digits are one, their product is one; otherwise at least one of them is a zero and the product is zero.
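The one-bit "multiplication table" can be checked directly; here Python's `&` operator stands in for the AND gate (a minimal illustration, not part of the original text):

```python
# The binary multiplication table is just logical AND:
# the product of two bits is 1 only when both bits are 1.
for a in (0, 1):
    for b in (0, 1):
        assert a * b == a & b  # one-bit multiply == AND gate
```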

If X is an integer between $-2^{p-1}$ and $2^{p-1}-1$ and we write X[j] to denote the jth bit of the p-bit two's-complement representation of X, then

$$ X = -X[p-1]\,2^{p-1} + \sum_{j=0}^{p-2} X[j]\,2^{j} \eqno(2) $$

Note that the bits are numbered starting with zero. The reservation of one of the p bits to store sign information leads to some awkwardness in notation, but the above is conventional.
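The bit-numbering and sign conventions above can be exercised in a few lines. This is an illustrative sketch with a fixed width p = 8; the helper names `bit` and `from_bits` are my own, not from the text:

```python
def bit(x, j, p=8):
    """X[j]: the jth bit of the p-bit two's-complement representation of x."""
    return ((x & ((1 << p) - 1)) >> j) & 1

def from_bits(x, p=8):
    """Reconstruct x per equation (2): -X[p-1]*2**(p-1) + sum of X[j]*2**j."""
    return -bit(x, p - 1, p) * 2 ** (p - 1) + sum(
        bit(x, j, p) * 2 ** j for j in range(p - 1))

# Round-trips for the full signed 8-bit range, including the sign bit.
for x in (-128, -57, -1, 0, 1, 127):
    assert from_bits(x) == x
```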

Let us consider only positive data and coefficients for the moment. Using a superscript + to denote our assumption, equation (1) can be rewritten as

$$ Y^{+} = \sum_{i=1}^{n} C_i^{+} A_i^{+} = \sum_{i=1}^{n} \sum_{j=0}^{p-2} \sum_{k=0}^{q-2} C_i[j]\, A_i[k]\, 2^{j+k} \eqno(3) $$

where p and q are the number of bits in the (signed) representation of the Cs and As, respectively.
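The triple sum of single-bit AND terms can be checked against the ordinary sum of products. This is a software sketch of the arithmetic only, not of any circuit; the function name and test values are mine:

```python
def bit(x, j):
    return (x >> j) & 1

def sop_bitwise(C, A, p, q):
    """Sum of products via the triple loop of single-bit AND terms of (3).
    Assumes all C[i] and A[i] are non-negative, so the sign bits are zero."""
    total = 0
    for i in range(len(C)):            # the i loop
        for j in range(p - 1):         # the j loop
            for k in range(q - 1):     # the k loop
                total += (bit(C[i], j) & bit(A[i], k)) << (j + k)
    return total

C, A = [3, 5, 7], [2, 4, 6]
assert sop_bitwise(C, A, p=8, q=8) == sum(c * a for c, a in zip(C, A))
```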

In an electronic circuit implementing a binary arithmetic function, power-of-two factors have no physical realization. Instead they are handled by keeping track, in the design of the circuit, of the significance of each bit. The bits themselves are represented by distinguishable electrical signals appearing at a particular place and time within the circuit.
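The bookkeeping of bit significance has a familiar software counterpart: multiplying by a power of two moves bits to positions of higher significance without any arithmetic on the bits themselves (a small illustration, not from the original):

```python
# A power-of-two factor needs no gates: multiplying by 2**k only
# relabels which position carries which significance (a left shift).
x = 0b1011                   # 11
assert x << 3 == x * 2 ** 3  # same bit pattern, shifted significance
```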

The product of two bits $C_i[j]\,A_i[k]$ is trivially computed by the standard logic AND function once the bits are brought together in time and space. Organizing the $n(p-1)(q-1)$ single-bit terms of (3) for efficient summation is the key to a successful design.

Let us identify the three nested summation symbols on the right-hand side of (3) by the associated iteration variable. In keeping with the terminology of software implementation we will call the first summation symbol ``the i loop,'' etc. We can now concisely characterize a realization of (1) by describing how it organizes the three loops.

A modern general-purpose computer has a rather large amount of chip area dedicated to the j and k loops. One would therefore naturally choose to implement the i loop in software with the computation of each product $C_i A_i$ taken as an atomic step.

It is possible to duplicate this approach in a programmable logic chip, but the resulting ``parallel multiplier'' structure consumes a very large number of cells. Furthermore, the structure is slow compared to the intrinsic speed of the cells because carry bits must propagate across the entire structure. If it is not fast enough, more space may be committed to build more multipliers operating in parallel. We would then say that the i loop has been partially parallelized.

The earliest digital computers operated on one pair of bits at a time. In such a computer one of the inner loops, say the k loop, would be realized serially under the control of sequencing hardware while the j loop might be realized in a multiply subroutine and the i loop would be explicit in the application program. This approach requires only a tiny amount of hardware but, since that hardware operates at a limited speed, it is painfully slow.

An important class of mechanisms has the k loop implemented serially with i and j parallel. Programmable logic chips are designed to implement two-input binary addition with reasonable efficiency, and by a clever feedback arrangement the n unsigned products can be computed in q-1 passes through n adders:

$$ P_i \leftarrow 2 P_i + A_i[k]\, C_i^{+}, \qquad k = q-2, q-3, \ldots, 0 $$

with each $P_i$ initialized to zero; after the final pass $P_i = C_i^{+} A_i^{+}$.

This structure has the advantage of a clear relationship to (1) and makes effective use of high-level macro structures provided by the programmable logic chip vendor. On the other hand, still more adders (or passes) are required to sum the products and, even with chip structures optimized for binary addition, the carry propagation delay may impose a speed limit if p+q is large.
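The serial-k feedback arrangement can be sketched as a shift-and-add recurrence: each pass updates all n accumulators at once, scanning the data bits from most significant down. This is one plausible realization of the structure described above, with my own function name and test data:

```python
def products_serial_k(C, A, q):
    """Compute the n unsigned products with the k loop serial:
    each of the q-1 passes updates P[i] <- 2*P[i] + A[i][k]*C[i]
    for every i, scanning data bits from k = q-2 down to k = 0."""
    n = len(C)
    P = [0] * n
    for k in range(q - 2, -1, -1):     # q-1 serial passes
        for i in range(n):             # i parallel in hardware
            P[i] = 2 * P[i] + ((A[i] >> k) & 1) * C[i]
    return P

C, A = [3, 5, 7], [2, 4, 6]
assert products_serial_k(C, A, q=8) == [6, 20, 42]
```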

The circuit above can be improved by exchanging the i and j loop realizations. The resulting circuit uses fewer cells in the adder tree, in part because the tree has work to do in every clock period and in part because the carry bits do not accumulate at the top of the tree.

$$ Y^{+} = \sum_{k=0}^{q-2} \Bigl( \sum_{i=1}^{n} A_i[k]\, C_i^{+} \Bigr) 2^{k} $$

An implementation of this architecture with a 2:1 tree of synchronous binary adders and a final feedback accumulator is the standard FPGA solution for the sum-of-products problem. The adder tree is an example of a pipeline structure.
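The tree-plus-accumulator organization can be simulated in software: each clock period k, the 2:1 tree sums the single-bit partial products across i, and a feedback accumulator combines the clock periods. This is a behavioral sketch only (the pipelining itself is not modeled), with names and data of my own choosing:

```python
def sop_tree_accumulate(C, A, q):
    """Each clock period k, a 2:1 adder tree sums A[i][k]*C[i] across i;
    a feedback accumulator combines periods: acc <- 2*acc + tree_output."""
    acc = 0
    for k in range(q - 2, -1, -1):          # k serial, MSB first
        layer = [((A[i] >> k) & 1) * C[i] for i in range(len(C))]
        while len(layer) > 1:               # one level of the 2:1 tree
            layer = [sum(layer[m:m + 2]) for m in range(0, len(layer), 2)]
        acc = 2 * acc + layer[0]
    return acc

C, A = [3, 5, 7], [2, 4, 6]
assert sop_tree_accumulate(C, A, q=8) == sum(c * a for c, a in zip(C, A))
```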


Stephen W. Nuchia
Mon Dec 11 17:02:42 CST 2000