Practical serial-k sum-of-products circuits are based on the fastest available adder circuits in the target implementation technology. Because the data are spread out over several clock periods at the inputs, the results must be accumulated at some stage. Since the adders take only two binary numbers as inputs, a wide sum-of-products will require a tree of adders.
The accumulation could be done at the top (wide end) of the tree:
but there would be p+q-2 bits in each term entering the summation tree (i loop). In the preferred circuit
the adder inputs are only q-1 bits wide.
The conventional circuit can be modeled in the following way. The input is formally a vector containing and k; in practice, k is implicit. The first stage of the pipeline is the AND gate multiplication:
Intermediate stages add pairs of elements:
In an optimized implementation would likely be subsumed by the hardware. If the number of inputs is not a power of two some values will be copied from input to output (delayed) in one or more stages.
Each bit in the output of an adder is a function of all the input bits having the same or lower weight -- the radius is the entire input vector. Without dedicating an extravagant amount of hardware to the problem, computing an m-bit wide sum requires time proportional to approximately.
Each rank in the summation tree requires adders one bit wider than those in the previous rank. Sign extention is accomplished by an arrangement logically equivalent to wiring the leftmost bit of each input vector to two inputs. The critical path in such circuits is that the sign bit cannot be duplicated until its value is known.
After stages all the terms have been combined into a single -bit integer and the input to the final stage is the vector where
corresponds to the parenthesized expression in (8). The final stage has the following structure:
with for k<0.
The peculiar form of is chosen to correspond to an efficient hardware design. It can be implemented with an adder only one bit wider than that of the final summation stage, plus a shift register for the low-order bits. A more straightforward computation, such as
would require an adder q-2 bits wider.
The conventional circuit is easy to specify and understand. Establishing its correctness is straightforward. When the data are narrow enough, available adder circuits can implement the conventional circuit with a sufficently high clock rate and there is no need for a more complex solution.