\documentclass[12pt,draft]{article}
\usepackage{amsmath}
\newcommand{\repu}{\operatorname{repu}}
\newcommand{\reps}{\operatorname{reps}}
\newcommand{\rep}{\operatorname{rep}}
%\usepackage[letterpaper,margin={1.25in,1.25in},twoside,nofoot]{geometry}
%\usepackage{layout}
\begin{document}
%\layout
\title{High~Order~Sum-of-Products in~Cellular~Logic}
\author{Steve Nuchia}
\maketitle
\begin{abstract}
Multi-Dimensional Digital Signal Processing (DSP) applications
commonly require calculation of sums of products having tens to
hundreds of terms. This paper presents a novel digital logic
structure for computing these sums.
The computational structure presented here is efficient in the sense
that it uses less time and/or fewer logic gates compared to other
methods. We assume a target implementation environment consisting of
a large number of simple logic cells, such as that provided by
commercial FPGA (Field Programmable Gate Array) chips.
The structure presented here is a pipeline with unstructured internal feedback.
Because the feedback is unstructured it is impractical to apply our
design technique by hand; a computer program is presented that generates
pipeline configurations by heuristic search.
Other novel features of the design include
combination of techniques to handle sign extension and rounding
without waiting for carry propagation and
a system-level technique for organizing 2-dimensional
data samples in a memory array.
These techniques make it possible to
build a $15\times15$ FIR filter core for 10~MHz 8-bit radar data using just two
inexpensive off-the-shelf chips.
\end{abstract}
\thispagestyle{empty}
\section{Introduction}
We wish to specify a digital circuit that will compute
\begin{equation}\label{genprod}
\sum_{i=1}^n C_i A_i
\end{equation}
where the $C_i$ are constants and the $A_i$ are inputs to the circuit.
Two's complement fixed point representation is assumed throughout.
In typical applications
the quantities of interest will be scaled by some constant
factor: to represent real-world quantities in the case of the $A$s or
unitless fractions in the case of the $C$s. Since these scaling factors
flow through the calculation in an obvious way it is no loss
of generality to regard all quantities as integers throughout the
exposition of the algorithm. Any desired power-of-two scaling may
be applied to the result by re-designating the weight of the output bits.
Scaling other than by a power of two requires a combination of output
scaling and prescaling of the coefficients.
We are concerned here with synchronous pipelined digital logic systems.
A synchronous digital circuit is one in which all computations occur
in discrete steps called clock periods. In a pipelined
circuit, intermediate results are stored internally and the ultimate result
emerges some number of clock periods after the inputs are accepted.
Synchronous circuit elements store or \emph{latch} their outputs
between clocks; this storage is sufficient for the construction of a pipeline.
We assume that the vector of $n$ data values
$\mathcal{A}=(A_1, A_2,\cdots,A_n)$ is one of a series of such vectors
arriving at the inputs of the circuit at intervals. One measure of the
performance of a circuit is the minimum period between arrivals.
Another measure is the time required to produce the sum once the
vector has been accepted. The latter quantity is known as the
\emph{latency} of the circuit while the former is the reciprocal of its
\emph{throughput.} The number of logic elements dedicated to the
circuit is the third metric; one desires high throughput, low latency,
and an economy of logic \emph{gates}.
Latency and throughput are influenced by both the logical design of the
sum-of-products circuit and by the speed of the digital circuit elements
used to realize the design. The logic family determines the minimum
clock period for a simple \emph{synchronous logic element}.
The \emph{circuit topology} determines the latency and throughput in terms
of the clock period.
The clock period, however, is not independent of the topology. A design that
uses more complicated logical structures will have a longer minimum clock
period due to electrical effects. Carry propagation networks are a
classic example of the kind of logic that adversely affects clock period.
Our goal, then, is to design a sum-of-products circuit that
operates without long-range carry propagation and with acceptable
efficiency.
\section{Bit-Serial Multiplication}
Just as the grade-school algorithm for multiplying decimal numbers
involves the multiplication table and a great deal of addition, it
turns out that the largest part of a binary multiplier is concerned with
addition. The binary analogue of the multiplication table is simply
an AND gate: if both digits are one, their product is one; otherwise
at least one of them is a zero and the product is zero.
If $X$ is an integer between $-2^{p-1}$ and $2^{p-1}-1$ and
we write $X[j]$ to denote the $j$th bit
of the $p$-bit two's-complement representation of $X,$ then
\begin{equation}\label{twocompdefn}
X=\left( \sum_{j=0}^{p-2} 2^j X[j]\right) - 2^{p-1} X[p-1].
\end{equation}
Note that the bits are numbered starting with zero. The reservation
of one of the $p$ bits to store sign information leads to some
awkwardness in notation, but the above is conventional.
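Equation (\ref{twocompdefn}) is easy to exercise numerically. The following Python sketch (ours, for illustration; the bit-list convention matches the zero-based numbering above) interprets a $p$-bit two's complement bit vector:

```python
def interpret_twos_complement(bits):
    """Two's complement interpretation: bits[j] is X[j], least
    significant bit first; the last bit carries weight -2^(p-1)."""
    p = len(bits)
    magnitude = sum(2**j * bits[j] for j in range(p - 1))
    return magnitude - 2**(p - 1) * bits[p - 1]

# 4-bit examples, least significant bit first
print(interpret_twos_complement([1, 0, 1, 0]))  # bits of 0b0101 -> 5
print(interpret_twos_complement([1, 1, 0, 1]))  # bits of 0b1011 -> -5
```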
Let us consider only positive data and coefficients for the moment.
Using a superscript $+$ to denote our assumption, equation
(\ref{genprod}) can be rewritten as
\begin{equation}\label{expprod}
\sum_{i=1}^n C_i^+ A_i^+=
\sum_{i=1}^n \sum_{j=0}^{p-2} \sum_{k=0}^{q-2} 2^{j+k} C_i[j]A_i[k],
\end{equation}
where $p$ and $q$ are the number of bits in the (signed)
representation of the $C$s and $A$s, respectively.
In an electronic circuit implementing a binary arithmetic function,
power-of-two factors have no physical realization. Instead they are
handled by keeping track, in the design of the circuit, of the
significance of each bit. The bits themselves are represented by
distinguishable electrical signals appearing at a particular place and
time within the circuit.
The product of two bits $C_i[j]A_i[k]$ is trivially computed by the
standard logic AND function once the bits are brought together in
time and space. Organizing the $n\times (p-1)\times (q-1)$ single-bit terms
of (\ref{expprod}) for efficient summation is the key to a successful design.
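To see that nothing is lost in the bit-level expansion, the right-hand side of (\ref{expprod}) can be compared with the direct sum of products in a short Python sketch (our own illustration; positive operands only, as assumed above):

```python
def bit(x, j):
    return (x >> j) & 1

def sum_of_products_direct(C, A):
    return sum(c * a for c, a in zip(C, A))

def sum_of_products_bitwise(C, A, p, q):
    # The triple sum: one AND gate per (i, j, k) single-bit term.
    total = 0
    for c, a in zip(C, A):
        for j in range(p - 1):
            for k in range(q - 1):
                total += 2**(j + k) * (bit(c, j) & bit(a, k))
    return total

C = [3, 5, 7]; A = [2, 4, 6]   # positive, fit in p = q = 4 signed bits
print(sum_of_products_bitwise(C, A, 4, 4) == sum_of_products_direct(C, A))
```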
Let us identify the three nested summation symbols on the right-hand
side of (\ref{expprod}) by the associated iteration variable. In
keeping with the terminology of software implementation
we will call the first summation symbol ``the $i$ loop,'' \emph{etc}.
We can now
concisely characterize a realization of (\ref{genprod}) by describing
how it organizes the three loops.
A modern general-purpose computer has a rather large amount of chip
area dedicated to the $j$ and $k$ loops. One would therefore
naturally choose to implement the $i$ loop in software with the
computation of each product $C_i A_i$ taken as an atomic step.
It is possible to duplicate this approach in a programmable logic
chip but the ``parallel multiplier'' structure consumes a very large
number of cells. Furthermore, the resulting structure is slow compared
to the intrinsic speed of the cells because carry bits must propagate
across the entire structure. If it is not fast enough more space may
be committed to build more multipliers operating in parallel. We
would then say that the $i$ loop has been partially parallelized.
The earliest digital computers operated on one pair of bits at a time.
In such a computer one of the inner loops, say the $k$ loop,
would be realized serially under the control of sequencing hardware
while the $j$ loop might be realized in a multiply subroutine and the
$i$ loop would be explicit in the application program. This approach
requires only a tiny amount of hardware but, since that hardware
operates at a limited speed, it is painfully slow.
An important class of mechanisms has the $k$ loop implemented
serially with $i$ and $j$ parallel. Programmable logic chips
are designed to implement two-input binary addition with reasonable
efficiency and by a clever feedback arrangement the $n$ unsigned
products can be computed in $q-1$ passes through $n$ adders:
\begin{equation}
\sum_{i=1}^n C_i^+ A_i^+ =
\sum_{i=1}^n \sum_{k=0}^{q-2} 2^k A_i[k] C_i^+.
\end{equation}
This structure has the
advantage of a clear relationship to (\ref{genprod}) and makes effective use
of high-level macro structures provided by the programmable logic chip
vendor. On the other hand still more adders (or passes) are required
to sum the products and, even with chip structures optimized for binary
addition, the carry propagation delay may impose a speed limit if $p+q$
is large.
The circuit above can be improved by exchanging the $i$ and $j$
loop realizations. The resulting circuit uses fewer cells in the
adder tree, in part because the tree has work to do in every clock
period and in part because the carry bits do not accumulate at the
top of the tree.
\begin{equation}\label{bitserial}
\sum_{i=1}^n C_i^+ A_i^+ =
\sum_{k=0}^{q-2} 2^k \sum_{i=1}^n A_i[k] C_i^+.
\end{equation}
An implementation of this architecture with a 2:1 tree
of synchronous binary adders and a final feedback accumulator is the
standard FPGA solution for the sum-of-products problem. The adder tree
is an example of a pipeline structure.
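A Python sketch of (\ref{bitserial}) (ours; the names are illustrative) makes the roles of the loops explicit: the inner sum is the adder tree's work for one clock period, the outer loop is the accumulator:

```python
def serial_k_sum(C, A, q):
    """For each data bit position k, sum the coefficients C_i gated by
    A_i[k] (the adder tree); the accumulator applies the 2^k weight."""
    acc = 0
    for k in range(q - 1):
        partial = sum(c for c, a in zip(C, A) if (a >> k) & 1)
        acc += partial << k
    return acc

print(serial_k_sum([3, 5, 7], [2, 4, 6], 4))  # 3*2 + 5*4 + 7*6 = 68
```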
\subsection{Fixed Coefficients}
When the $C$ values are fixed by the application (an important special
case) it becomes possible to eliminate any explicit representation of
them from the circuit. One way to take advantage of this is to
specify as constants the bit vectors representing the coefficients
and allow standard logic simplification software to
eliminate all the AND gates together with many of the intermediate
stages of the adder tree. This is less effective than one might
hope, however, because the dedicated
carry propagation logic has a rigid structure and because the carry
bits propagating through intermediate stages will tend to ``fill in''
eliminated bit positions.
To take full advantage of fixed bits in the $C$ values requires a
specialized approach.
One such approach leads to a class of designs which were shown by Mintzer to
be well suited to cell-based programmable logic implementation.
These \emph{distributed arithmetic} designs are a specialization of the
serial-$k$ architecture. The $n$ coefficients are partitioned into
sufficiently small subsets $\{P_m\}$ so that the pre-computed sums
$S_m=\sum_{i\in P_m} C_i A_i[k]$ may be efficiently retrieved from
a lookup table structure. Carry propagation is eliminated in the first
rank of the computing structure and, more importantly, zero bits in the
coefficients lead to ``don't care'' conditions in the lookup table logic
that can be efficiently exploited by the logic compiler.
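As an illustration only (this is our sketch, not Mintzer's circuit), the partition-and-lookup idea can be modeled in Python with one table per coefficient group; the group size of four is a hypothetical stand-in for whatever the cell architecture supports efficiently:

```python
from itertools import product

def build_da_tables(C, group=4):
    """Precompute the subset sums S_m: for each group of coefficients,
    the table maps the group's data bits A_i[k] to the gated sum."""
    tables = []
    for start in range(0, len(C), group):
        chunk = C[start:start + group]
        tables.append({bits: sum(c for c, b in zip(chunk, bits) if b)
                       for bits in product((0, 1), repeat=len(chunk))})
    return tables

def da_sum(tables, A, q, group=4):
    acc = 0
    for k in range(q - 1):                   # serial-k loop
        for m, table in enumerate(tables):   # one lookup per group
            bits = tuple((a >> k) & 1 for a in A[m * group:(m + 1) * group])
            acc += table[bits] << k
    return acc

C = [3, 5, 7, 2, 6, 1]; A = [2, 4, 6, 3, 1, 5]
tables = build_da_tables(C)
print(da_sum(tables, A, 4) == sum(c * a for c, a in zip(C, A)))
```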
More recently Aoki, Sawada, and Higuchi demonstrated a construction
of a generalized signed-weight representation for an arbitrary fixed
vector of coefficients that results in carry propagation-free adders.
In the present paper we do not assume the coefficients are fixed and
so we will not delve further into these techniques. We note, however, that
our technique can be applied to summing the partitioned sums obtained
by exploiting constant coefficients. The ideas are orthogonal.
In [?], Liu, Wu, Raghupathy and Chen examine the problem of DSP
circuit performance from the algorithmic level, where they are free to
choose to do something better than computing the sum of products. Their
examples provide a motivating sense of the scale of interesting DSP problems.
\subsection{Round-Robin Parallelism}
The algorithms considered so far are all serial in the sense that they
consider only one sum-of-products calculation at a time. Such an
algorithm can be extended beyond its intrinsic performance limits by having
multiple copies working in parallel.
If a series of computations is needed at an aggregate rate
$m$ times faster than the available circuit can perform them, the
successive problems can be parceled out to $m$ identical instances of
the algorithm round-robin fashion. As each instance completes its
computation the sequence of results is reassembled for downstream
consumption.
This scheme has the advantage that very little engineering is required
to scale a working algorithm up to a higher data rate. On the other hand,
it can do no better than $m$ times the original hardware requirement.
It is also worth noting that, in practice, the data values in
successive computations are often closely related. A round-robin
circuit will generally be imperfectly able to exploit this structure.
\section{System-Level Design}
One reason for the popularity of the serial-$k$ architecture is that
it allows the $\mathcal{A}$ data to be fed into the circuit over several
clock periods. If $\mathcal{A}$ contains a total of $n q$ bits the serial-$k$
architecture will require only $n$ communication paths (wires) between
the data source and the sum-of-products circuit.
\subsection{Overlapping Data Sets}
In a digital signal filtering application there is an underlying
sequence of data values $\mathcal{A}=(A_0, A_1, A_2,\cdots)$ and successive
input vectors are overlapping subranges
\[V_t = (A_{mt-n+1}, A_{mt-n+2},\cdots , A_{mt-1}, A_{mt}).\]
One customarily chooses
a reset state for the internal latches equivalent to assuming $A_i=0$
for all $i<0.$ While $m=1$ in many applications, \emph{decimation filters}
are an important exception. There the \emph{stride}, $m,$
is the decimation factor.
After $V_{t-1}$ has been input, all but the last $m$ of the
values in $V_t$ have been communicated to the circuit. In this circumstance
we require only $m$ wires to supply the circuit with the new inputs. When
$m=1$ the data stream is said to be \emph{serial}; this is frequently
the case in audio applications.
The $n-m$ overlapping data values must be retained inside the sum-of-products
circuit from one problem to the next. The shift register arrangement that
accomplishes this is the well known \emph{tapped delay line}.
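A behavioral Python sketch (ours) of the tapped delay line with stride $m$; the zero reset state corresponds to the $A_i=0$ for $i<0$ convention above:

```python
from collections import deque

def tapped_delay_line(stream, n, m=1):
    """Yield successive overlapping vectors V_t of length n, stride m,
    with the delay line initialized to zeros (the reset state)."""
    taps = deque([0] * n, maxlen=n)   # the shift register
    for t, sample in enumerate(stream):
        taps.append(sample)           # oldest value falls off the far end
        if (t + 1) % m == 0:
            yield tuple(taps)

for v in tapped_delay_line([1, 2, 3, 4, 5], n=3, m=2):
    print(v)   # (0, 1, 2) then (2, 3, 4)
```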
\subsection{Two Dimensional Problems}
In one-dimensional DSP problems it is frequently the case that
infinite impulse response filters provide acceptable performance
and therefore very wide sum-of-products circuits are seldom
required. Two-dimensional problems, on the other hand, often
involve image data that would be ``smeared'' by the asymmetry of
an IIR filter, and so finite impulse response (FIR) filters are
desirable in such applications. The work reported here began
in response to just such a requirement.
Two dimensional DSP problems arise in both realtime and non-realtime contexts.
In offline image processing an algorithm may access the numbers encoding the
image in any convenient order. Realtime problems, the kind typically
associated with FPGA solutions, often are constrained by the scanning
structure of the image acquisition or communication mechanisms.
Thus we assume that the sequence $\mathcal{A}$ represents samples
from a two dimensional space, scanned in some specified sequence.
The image processing paradigm involves applying a function to a
geometric neighborhood of a point in the image space. The data
representing this neighborhood (i.e. $V_t$) is scattered through
$\mathcal{A}$ by the scanning protocol. Reassembling $V_t$ is one
of the system-level design challenges in two-dimensional DSP implementation.
The topology of the image space and the timing of the data stream
will determine how the circuit must handle edge effects. Realizing
the edge effect policy appropriate to the context is the other
characteristic two-dimensional DSP challenge.
Common topologies include the video image, a rectangle, and the $360^\circ$
radar image, a cylinder. When sequence timing is based on the classical
analog implementations there are gaps at the edges of the space that
can be logically filled with zeros or handled in more exotic ways, such
as reflection.
If the sequences have been retimed to eliminate the gaps then implementation
of edge correction may require multiple coefficient sets selected
dynamically based
on the relationship between an input vector and the edges of the space.
These problems are independent of the mechanism used to sum the
products except for the observation that packed (retimed) data may preclude
exploiting the constant-coefficient assumption. Let us now consider
the problem of gathering a coherent $V_t$ at the inputs of the
sum-of-products circuit.
\subsection{FIFO RAM Delay Lines}
In a common formulation, $\mathcal{A}$ consists of a sequence
of scan lines, each consisting of $s$ adjacent samples. Then,
ignoring edge effects, the input vector is the $2I+1$ by $2J+1$
patch centered at $A_t$:
\[V_t=\left\{\begin{matrix}
A_{t-I s-J} & \hdotsfor{4} \\
\hdotsfor{2} & A_{t-s} & \hdotsfor{2} \\
\dots & A_{t-1} & A_t & A_{t+1} & \dots \\
\hdotsfor{2} & A_{t+s} & \hdotsfor{2} \\
\hdotsfor{4} & A_{t+I s+J}
\end{matrix}\right\}.\]
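Ignoring edge effects, the gathering of $V_t$ from the scanned stream can be sketched in Python (our illustration; in hardware the window would be realized with FIFO RAM line delays rather than a software buffer):

```python
from collections import deque

def neighborhoods(stream, s, I, J):
    """Gather the (2I+1)-by-(2J+1) patch around each sample from a
    raster-scanned stream with scan lines of s samples. Patches are
    emitted only once the window is full (edge effects ignored)."""
    span = 2 * I * s + 2 * J + 1         # past samples that must be held
    window = deque(maxlen=span)
    for sample in stream:
        window.append(sample)
        if len(window) == span:
            centre = span - 1 - (I * s + J)   # index of A_t in the window
            yield [[window[centre + r * s + c]
                    for c in range(-J, J + 1)]
                   for r in range(-I, I + 1)]

first = next(neighborhoods(range(100), s=5, I=1, J=1))
print(first)   # [[0, 1, 2], [5, 6, 7], [10, 11, 12]]
```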
\section{Modeling Background}
In this section we introduce some background ideas and notation that will
be helpful in understanding the algorithm.
\subsection{Representation Properties}
The two's complement representation has several properties, familiar
to designers of digital logic, that are responsible for its near-universal
use in general purpose computers. For convenience we now introduce some
formal notation for binary representations and review the relevant
properties.
The unsigned interpretation of a vector of $n$ bits is given by
\[
x = \repu_n^{-1} X = \sum_{i=0}^{n-1} 2^i X[i].
\]
Rather than give a bit-by-bit definition of the representation
function it will suffice to define it as the inverse of the
interpretation function:
\begin{equation}\label{repu}
\repu_n^{-1} \repu_n x \cong x \mod 2^n,
\end{equation}
for non-negative $x.$
The two's complement (signed) representation can be defined by concatenating
a sign bit onto an unsigned bit vector:
\[\reps_n x =
\begin{cases}
0:\repu_{n-1}x & \text{if $0 \le x < 2^{n-1},$} \\
1:\repu_{n-1}(x + 2^{n-1}) & \text{if $-2^{n-1} \le x < 0.$}
\end{cases}
\]
The interpretation function for the two's complement representation
was given in equation (\ref{twocompdefn}); it can also be defined
in terms of the unsigned interpretation function:
\[
x=\reps_n^{-1} X=\repu_{n-1}^{-1}X - 2^{n-1}X[n-1].
\]
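These definitions translate directly into a Python sketch (ours; for machine-width integers the representation functions reduce to simple maskings):

```python
def repu(n, x):
    """repu_n: represent non-negative x as an n-bit pattern (mod 2^n)."""
    return x & ((1 << n) - 1)

def repu_inv(n, X):
    """repu_n^{-1}: unsigned interpretation of the low n bits."""
    return X & ((1 << n) - 1)

def reps_inv(n, X):
    """reps_n^{-1}: two's complement interpretation, via the identity
    reps_n^{-1} X = repu_{n-1}^{-1} X - 2^{n-1} X[n-1]."""
    return repu_inv(n - 1, X) - (1 << (n - 1)) * ((X >> (n - 1)) & 1)

# the sign bit plus the unsigned tail recovers -5 from 0b1011
print(reps_inv(4, 0b1011))   # -5
```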
\subsubsection{Sum Property}
Given integers $x$ and $y$ such that $x,$ $y,$ and $x+y$ are
representable as $n$-bit two's-complement numbers,
\begin{equation}\label{sumprop}
\reps_n (x+y) = \repu_n (\repu_n^{-1} X + \repu_n^{-1} Y),
\end{equation}
where $X = \reps_n x$ and $Y = \reps_n y.$
The sum property means that hardware adders can be built without
regard to the interpretation --- signed or unsigned --- of the
bit vectors being added. This holds when the output bit vector
has the same length as both input bit vectors. When
the sum is not representable we say that \emph{overflow} or
\emph{underflow} has occurred.
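The sum property is easily exercised: the following Python sketch (ours) adds bit patterns with a width-limited adder and checks the signed interpretation of the result:

```python
N = 8
mask = (1 << N) - 1

def to_bits(x):          # reps_N and repu_N: both are the low N bits
    return x & mask

def from_signed(X):      # reps_N^{-1}
    return X - (1 << N) if X >> (N - 1) else X

# the same adder hardware serves signed and unsigned interpretations
for x, y in [(100, -29), (-60, -50), (37, 80)]:
    s = (to_bits(x) + to_bits(y)) & mask   # fixed-width addition
    assert from_signed(s) == x + y         # holds while x+y is representable
print("sum property verified")
```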
\subsubsection{Truncation Properties}
When the length of a bit vector is changed the value represented
by the new vector can be related to the value represented by the
original vector. While additional relationships can be derived,
we will need only those that preserve the interpretation of the
vectors. These follow directly from the definitions above:
\begin{gather*}
\repu_{n-1}^{-1} \repu_n x \cong x \mod 2^{n-1}, \\
\repu_{n-1}^{-1} X = \repu_n^{-1} X\qquad \text{if $X[n-1]=0.$}
\end{gather*}
For signed data
\[
\reps_{n-1}^{-1} X =
\reps_{n}^{-1} X\qquad \text{if $X[n-2]=X[n-1]$}
\]
since
\[
\reps_{n-1}^{-1} X - \reps_n^{-1} X =
(-2^{n-2}X[n-2]) - (-2^{n-1}X[n-1]+2^{n-2}X[n-2]).
\]
\subsubsection{Sign Extension Properties}
The value-preserving truncation properties lead to the
sign extension rules
\[
\repu_{n+1} \repu_n^{-1} X = 0:X
\]
and
\[
\reps_{n+1} \reps_n^{-1} X = X[n-1]:X.
\]
\subsubsection{Addition Without Overflow}
Sign extension gives us a mechanism for computing sums
without worrying about overflow or underflow. If $-2^{n-1}\le x,y<2^{n-1}$
then $-2^n\le x+y<2^n$ and so the sum is representable as an $n+1$-bit
two's complement number. The $n+1$-bit sum of the extended input vectors
represents the sum of the input quantities. Since the sign extension
rule differs for signed and unsigned numbers the hardware must be specialized
to the interpretation.
If the inputs are known to represent unsigned quantities then the
``extension'' consists of zero bits which will not influence the
output of the addition circuit. In this case bit $n$ of the
output is the ``carry out'' bit of the adder.
\subsubsection{Multiplication by Two}
Concatenating a zero bit onto the low-order end of a bit
vector is equivalent to doubling the number it represents:
\[
\rep_{n+1} 2x = (\rep_n x):0
\]
for both signed and unsigned representations. We will consider
division in section \ref{rounding} when we make provisions for
rounding the output value.
\subsubsection{Additive Inverse Property}
Since $\reps_n -x$ and $\reps_n x$ must add to zero in
a fixed-width binary adder, it must be the case that
\[
\reps_n^{-1} (\reps_n -x \oplus \reps_n x) \cong 0 \mod 2^n
\]
(where $\oplus$ represents fixed-width addition) or, equivalently,
\[
\repu_n (\repu_n^{-1} \reps_n -x + \repu_n^{-1} \reps_n x) = \repu_n 0.
\]
The logical ``not'' function can be written as
$\overline{x}=1-x$ for a single bit and is defined by
\[
\overline{X}[i]=\overline{X[i]}
\]
for a bit vector. Since (for representable $x$)
\[
\repu_n^{-1} \overline{\repu_n x} = \sum_{i=0}^{n-1} 2^i(1 - X[i])
=(2^n - 1) - x
\]
it is easy to verify the well-known formula
\[
\reps_n -x = \overline{\reps_n x} \oplus \reps_n 1.
\]
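The negation formula can be verified exhaustively for small widths with a Python sketch (ours):

```python
n = 8
mask = (1 << n) - 1

def neg(X):
    # reps_n(-x) = NOT(reps_n x) (+) reps_n 1, with fixed-width addition
    return (~X + 1) & mask

# check the identity over the whole n-bit range, modulo 2^n
for x in range(-(1 << (n - 1)), 1 << (n - 1)):
    assert neg(x & mask) == (-x) & mask
print("negation identity holds")
```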
\subsection{Logic Pipelines}
Pipelined digital logic is a structure familiar, or at least
accessible, to most circuit designers. Likewise, the mathematical
properties of acyclic pipeline circuits can be formalized as
a composition of functions. The circuits we wish to describe,
however, are pipelines with feedback (directed cycles) and we
need to develop some machinery for reasoning about them.
For the reader who is more comfortable with an intuitive
approach to circuit design the following will serve to
introduce the mathematics behind our circuit. For the
mathematician, the central difficulty is to get a handle on
the role of \emph{time} in these circuits.
\subsubsection{Combinational Logic}
A combinational logic network is one that is free of synchronous
elements. This is not a particularly helpful definition ---
we will use only the fact that a combinational circuit that
is also free of directed cycles implements a pure function. The outputs
depend only on the value of the inputs and not on their history.
The circuit is implemented using wires, transistors, \emph{etc}.
This obvious fact has two important consequences. First, to be
useful the circuit is attached to a source of inputs and those
inputs change over time. Secondly, the outputs do not instantaneously
change in response to a change in the inputs; some finite delay is
involved.
Quite sophisticated treatments of this delay have been studied.
For our purposes it will suffice to consider just the gross maximum
delay: the time between the inputs becoming \emph{stable} and the
outputs reliably taking on the function value corresponding to
those inputs. To a very loose first approximation the lower bound on
delay for realizations of a given function is proportional to the
logarithm of the number of input bits.
\subsubsection{Synchronous Logic}
A \emph{flip-flop} is a logic element with a special input known
as a \emph{clock}. In effect, the flip-flop samples its other inputs
on each active \emph{edge} (state transition) of the clock signal.
The output of a flip-flop is, in general, a function of the input
and the output(s) at the time of the clock edge.
A variety of named flip-flop configurations exist, distinguished
by the function of the inputs and output they compute. All can
be modeled as a combinational circuit combined with a ``D'' flip-flop
or \emph{latch} circuit element.
A latch has a single input and a single output. The input signal is
copied to the output on each clock edge.
A synchronous logic circuit is composed of combinational circuits
with latched outputs. The latches share a common clock signal which
serves to divide time into discrete intervals. Ignoring the ultimate
inputs, signals change only on clock edges and, so long as each
combinational circuit has delay less than the minimum clock period,
it is possible to ignore timing details.
The latches also give synchronous circuits \emph{memory}: the
output may in general depend on the history of input values
rather than on just their instantaneous values. The memory
property also makes it possible to continue ``working on'' a
set of input information after the input signals have changed.
This is the foundation of logic pipelining.
\subsubsection{Pipeline Theory}
In a pipelined circuit a sequence of synchronous logic
\emph{stages} operate on the data, passing intermediate results
from one to the next. Such circuits are carefully engineered for a particular
combination of hardware platform characteristics and required
performance and can be quite efficient.
Consider a series of computations, each having the form of
(\ref{genprod}). The coefficients $\{C_i\}$ are constant within the
series but each sum $S_t$ is assumed to have an independent sequence
of data values $V_t=\{A_i\}$.
The sum $S_t$ is represented implicitly by the vector $V_t$.
Suppose that a set of $m$ functions exists so that
$S_t=(f_m\circ f_{m-1}\circ \ldots \circ f_2\circ f_1)(V_t)$.
Clearly each intermediate form
$(f_h\circ \ldots \circ f_1)(V_t), 1\le h\le m$
is also a representation of $S_t$.
Suppose the set of functions $\{f_h\}$ is chosen so that each can be
realized efficiently in hardware. The relative value assigned to
space (chip area) and time efficiency is application dependent and
will generally determine the number of stages. If these stages are
simply composed they will compute $S_t$ from $V_t$ with a
delay equal to the sum\footnote{Actually, the delay may be substantially
less than the sum of worst-case delays.
The idea is to take advantage of the fact that delay is not uniform;
the delay from any given input to any given output may be considerably
less than the worst-case delay of a stage.
Much recent work in computer
arithmetic has focused on composing stages to minimize overall latency.
See [?] for recent developments applied to parallel multipliers.}
of the propagation delay in each stage.
If, on the other hand, the intermediate results are latched then the
delay will be at least $m$ times the longest of the stage delays.
The resulting pipeline still computes the correct value, but one
must be careful to account for the delay.
If at (discrete) time $t$ the stored result
from stage $m$ is $S_t$ then the input to stage $h$ is
$(f_{h-1}\circ \ldots \circ f_1)(V_{t+1+m-h})$
and at time $t+1$ stage $m$ will output $S_{t+1}$.
The canonical notational mechanism for reasoning about pipelines
is the time/space diagram; see [?] for a careful treatment in
the intuitive style.
If the set of pipeline stage functions is chosen well the sum of the
stage areas (hardware cost) will be small and the stage delays
will be balanced.
The pipeline as a whole completes one computation per clock period,
corresponding to the input vector presented $m$ delay periods
earlier. The application will generally determine the minimum rate of
computation and maximum allowable latency.
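The timing bookkeeping can be made concrete with a small behavioral simulation in Python (our sketch; the stage functions are illustrative one-step operations, not hardware stages):

```python
def simulate_pipeline(stages, inputs):
    """Clock an m-stage latched pipeline: on each tick every stage applies
    its function to the previous stage's latched output, so each result
    emerges m clocks after the corresponding input is accepted."""
    m = len(stages)
    latches = [None] * m
    outputs = []
    for tick in range(len(inputs) + m):          # extra ticks flush the pipe
        v = inputs[tick] if tick < len(inputs) else None
        # all latches update simultaneously from the previous clock's values
        latches = [stages[0](v) if v is not None else None] + [
            stages[h](latches[h - 1]) if latches[h - 1] is not None else None
            for h in range(1, m)]
        if latches[-1] is not None:
            outputs.append(latches[-1])
    return outputs

# three stages compute ((x*2)+1)**2: one result per clock, latency 3
stages = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
print(simulate_pipeline(stages, [1, 2, 3]))   # [9, 25, 49]
```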
\subsubsection{Horizontal Partitioning and the Logical Radius}
A pipeline stage computes a vector-valued function of a vector
of bits. One may consider the coordinate functions one by one,
$f(V)=(f_1(V),f_2(V),\cdots,f_n(V)).$
The delay for the function as a whole is the maximum of the
coordinate function delays.
Often the coordinate functions will depend on only a subset
of the input vector components. If $U_i \subset V$ and $f'_i(U_i)=f_i(V)$
then the size of $U_i$ is the \emph{logical radius} of the $i$th
component of $f.$ Since the delay associated with $f_i$ is a
function of the radius, balancing the radii of the component
functions is a guideline in choosing the pipeline decomposition
of a computational task.
\subsection{Conventional Iterative Sum of Products}
Practical serial-$k$ sum-of-products circuits are based on the fastest
available adder circuits in the target implementation technology.
Because the $V_t$ data are spread out over several clock periods
at the inputs, the results must be accumulated at some stage.
Since the adders take only two binary numbers as inputs, a wide
sum-of-products will require a tree of adders.
The accumulation could be done at the top (wide end) of the tree:
\[
\sum_{i=1}^n \left(\sum_{k=0}^{q-2} 2^k C_i A_i[k]\right),
\]
but there would be $p+q-2$ bits in each term entering the summation
tree ($i$ loop). In the preferred circuit
\begin{equation}\label{conv}
\sum_{k=0}^{q-2} 2^k \left(\sum_{i=1}^n C_i A_i[k]\right)
\end{equation}
the adder inputs are only $p-1$ bits wide.
The conventional circuit can be modeled in the following way.
The input is formally a vector containing $V_t$ and $k;$ in practice, $k$
is implicit. The first stage of the pipeline is the AND gate
multiplication:
\[
f_0(k,A_1,A_2,\cdots,A_n)=(k,A_1[k]C_1,\cdots,A_n[k]C_n).
\]
Intermediate stages add pairs of elements:
\[
f_h(k,X_1,X_2,\cdots)=(k,X_1+X_2,\cdots),\quad 1\le h < m.
\]
In an optimized implementation $f_0$ would likely be subsumed by
the $f_1$ hardware. If the number of inputs is not a power of two
some values will be copied from input to output (delayed)
in one or more stages.
Each bit in the output of an adder is a function of all the input
bits having the same or lower weight --- the radius is the entire
input vector. Without dedicating an
extravagant amount of hardware to the problem, computing an
$m$-bit wide sum requires time proportional\footnote{
Xilinx documentation gives the following data ...
} to $1+\frac{m}{10},$ approximately.
Each rank in the summation tree requires adders one bit wider than
those in the previous rank. Sign extension is accomplished by
an arrangement logically equivalent to
wiring the leftmost bit of each input vector to two inputs.
The critical path in such circuits arises because the sign bit cannot
be duplicated until its value is known.
After $\lceil\log_2 n\rceil$ stages all the terms have been
combined into a single
$(p+\log_2 n)$-bit integer and the input to the final stage is the
vector $(k,X_k)$ where
\[
X_k=\sum_{i=1}^n A_i[k] C_i
\]
corresponds to the parenthesized expression in (\ref{conv}).
The final stage has the following structure:
\begin{equation}\label{convaccum}
f_m(k,X_k)=2^{q-1}X_k+\frac{1}{2} f_m(k-1,X_{k-1}),
\end{equation}
with $f_m(k,X)=0$ for $k<0$.
The peculiar form of $f_m$ is chosen to correspond to
an efficient hardware design. It can be implemented
with an adder only one bit wider than that of the
final summation stage, plus a shift register for the low-order bits.
A more straightforward computation, such as
\[
f'_{m}(k,X_k)=2^k X_k+f'_{m}(k-1,X_{k-1}),
\]
would require an adder $q-2$ bits wider.
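A Python sketch (ours) of the recurrence confirms that it accumulates the weighted sum; in this form the final value emerges scaled by two, a power-of-two factor absorbed by re-designating the weights of the output bits:

```python
def accumulate(Xs, q):
    """Accumulator in the style of (\\ref{convaccum}):
    f(k) = 2^(q-1) X_k + f(k-1)/2, with f(k) = 0 for k < 0.
    In hardware the halving is a one-bit shift into a shift register;
    here the division is exact because f stays even through the loop."""
    f = 0
    for X in Xs:                  # k = 0, 1, ..., q-2
        f = (1 << (q - 1)) * X + f // 2
    return f

q = 5
Xs = [3, 1, 4, 1]                 # X_0 .. X_{q-2}
direct = sum(X << k for k, X in enumerate(Xs))
print(accumulate(Xs, q) == 2 * direct)   # the sum, scaled by 2
```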
The conventional circuit is easy to specify and understand. Establishing
its correctness is straightforward. When the data are narrow enough,
available adder circuits can implement the conventional circuit with a
sufficiently high clock rate and there is no need for a more
complex solution.
\section{Unstructured Feedback in a Summation Pipeline}
In the conventional iterative sum-of-products circuit there
is a well-defined transition from the simply pipelined summation
tree to a purely iterative final stage. As has proved to be the
case for the partial product reduction trees in fast parallel
multipliers, better results are obtained when the simple structural
paradigm is abandoned. We follow the parallel multiplier developments
by limiting carry propagation distance and extend them by distributing
the iterative (feedback) elements throughout the tree.
\subsection{The Generalized Full Adder}
The canonical binary adder is a cascade of elements known as
\emph{full adders,} so called to distinguish them from their constituent
\emph{half adders}. Each full adder is a circuit with three inputs and
two outputs. Customarily two of the inputs are connected to the two
input bits of a particular weight. The two outputs give the unsigned
representation of the sum of the inputs. The low-order output becomes
the adder output bit of the corresponding weight and the high-order
(carry-out) output is fed back to the third (carry-in) input of the
next full adder in the cascade.
Low-latency adders use a variety of techniques to compute the
carry bits more quickly than a cascade of full adders could, but
when the result is going to go through further sum reduction there
is no pressing need to compute the high-order bits fully. Breaking
the feedback path results in a set of independent full adders in
place of an integrated binary adder. Each full adder can quickly
reduce up to three input bits to two output bits
\emph{representing the same quantity}, passing the
result to the next stage.
While it is not hard to see that a pipeline stage of independent
full adders preserves the value represented by the vector
passing through it, it is not obvious that the process terminates
in a single binary number without special arrangements. It does,
given a suitably generalized full adder: the entire pipeline can be
clocked at a rate limited only by the delay of a single full adder.
The generalization that is needed is to allow control inputs
to mask the arithmetic inputs --- the generalized full adder
computes a sum of the form $\sum c_i a_i,$ where the $c_i$ are ones if
not physically connected. For the remainder of this paper
we assume as our unit of hardware cost a logic cell that is a slight
simplification of the Xilinx FPGA logic cell. Each cell computes
an arbitrary two-bit function of up to four input bits and optionally
latches the result. The generalized full adder, constrained to a total
of four inputs, maps directly onto the simplified Xilinx logic cell.
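A Python model of the cell's arithmetic behaviour (an illustration of the four-input, two-output budget, not a netlist):

```python
def generalized_full_adder(a, b, cin, mask):
    """Four inputs, two outputs: computes mask*a + b + cin as (carry, sum).

    'mask' is a control input gating the arithmetic input a, so the
    initial AND stage of the sum-of-products pipeline is absorbed into
    the same cell.  The total, at most 3, always fits in two output bits.
    """
    total = (a & mask) + b + cin
    return total >> 1, total & 1   # (carry out, sum out)
```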
The initial AND stage of the sum-of-products pipeline is obviously
accommodated by the generalized full adder model. The control signals
are also essential for the feedback arrangement, where they are used
to implement rules corresponding to that for $k<0$
in equation (\ref{convaccum}).
It is occasionally useful to allow inputs of different weights as well; as
long as the total value of the inputs cannot exceed three times the
base weight for the cell this is valid.
\subsection{Data Conditioning}
There are several issues that are peripheral to the main algorithm
but must nevertheless be addressed. Discussion of these points is
collected in the following sections.
\subsubsection{Magnitude Bound}
Based on $\sum_i |C_i|$ we can derive a bound on $|S_t|$: with input
magnitudes bounded by $\max_i |A_i|$, we have
$|S_t| \le \max_i |A_i| \sum_i |C_i|$. This bound determines the maximum
information-bearing bit position, and tree expansion can be limited
to that column.
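As a small worked sketch (ours, assuming hypothetical 8-bit unsigned samples):

```python
def highest_useful_column(coeffs, data_bits=8):
    """Bound the sum by max|A| * sum|C_i| and return the index of the
    highest bit column that can carry information (unsigned data)."""
    bound = (2 ** data_bits - 1) * sum(abs(c) for c in coeffs)
    return bound.bit_length() - 1   # columns above this are never set
```

For example, with coefficients summing in magnitude to 100, the bound is $255\cdot 100 = 25500 < 2^{15}$, so no column above bit 14 need be built.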
\subsubsection{Saturation}
\subsubsection{Rounding}\label{rounding}
Rounding introduces a constant (single weighted bit)
offset into the mix and allows the low-order bits to be discarded
once the carries out of them have been computed.
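A minimal arithmetic sketch of the scheme (standard round-to-nearest, not specific to this design): the half-LSB offset is injected as a constant bit, and the low $q$ bits are simply dropped afterwards.

```python
def round_via_offset(x, q):
    """Round-to-nearest by adding the constant bit 2^(q-1) into the sum
    and discarding the low q bits once carries are resolved.
    (Ties round upward.)"""
    return (x + (1 << (q - 1))) >> q
```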
\subsubsection{Signed Data}
Translating the data to excess-$2^{p-1}$ notation leads to another
constant offset, which can be combined with the rounding term.
Once the data sign compensation term has been computed, ``translation''
amounts to nothing more than inverting the sign (high-order) bit of each datum.
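As a sketch of the arithmetic (standard two's-complement identities, not taken from the design itself): writing $\hat A_i$ for $A_i$ with its sign bit inverted, the excess form satisfies $\hat A_i = A_i + 2^{p-1}$, so
\[
\sum_{i=1}^n C_i A_i = \sum_{i=1}^n C_i \hat A_i \;-\; 2^{p-1}\sum_{i=1}^n C_i .
\]
Since the $C_i$ are constants, the compensation term $2^{p-1}\sum_i C_i$ is itself a constant and can be merged with the rounding offset of section~\ref{rounding}.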
\subsection{Signed Coefficients}
A treatment like that for signed data would require keeping track
of the sum of the data values. This could be done (for contiguous samples)
with a subtractor and an accumulator, but there is a better way:
separate the terms with negative coefficients and subtract their
partial sum. In two's complement form the subtraction becomes a counting
problem; the bits of the negative partial sum are inverted as they
become known, and the extra $+1$ of the complement is fed in as a
constant, joined with the other constant offsets.
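A sketch of the identity at work (standard two's-complement negation, our notation): split the sum by coefficient sign,
\[
S=\sum_{C_i>0} C_i A_i \;-\; N, \qquad N=\sum_{C_i<0} |C_i|\, A_i ,
\]
and use $-N = \overline{N} + 1$ at the working width. The bits of $N$ can thus be inverted individually as they emerge from the tree, while the extra $+1$ joins the other constant offsets already in the mix.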
\section{Design Generation and Proof of Correctness}
\subsection{Time Signature Algebra}
\subsection{Search Heuristics}
\section{Conclusion}
\end{document}