## Implementing a Single-coefficient Multiplier

#### Features

- Theory of Developing a Single-coefficient Multiplier
- Implementation using an AT40K Series FPGA for an 8-bit Single-coefficient Multiplier
- Coefficient Look-Up Table is Easily Re-Configurable
- Viewlogic Workview Office Reference Designs for 8-bit and 16-bit Implementations

## Introduction

Atmel's AT40K Series of Field Programmable Gate Arrays (FPGAs) is specifically designed for computation-intensive Digital Signal Processing (DSP) applications. The FPGA architecture provides an octagonal-shaped core cell with orthogonal and diagonal cell-to-cell connections, allowing each core cell to communicate directly with its eight nearest neighbors without using the slower bussing network. This architecture is well suited to array multiplication and to multiplication-oriented applications such as image processing, real-time video, telecommunications and control systems.

The Single-coefficient Multiplier presented in this application note uses an algorithm of simple additions and single-bit multiplications, implemented as a series of Look-Up Tables (LUTs) or Read-only Memory (ROM) macros followed by a stage of addition. Multiple stages of addition may be required, depending on the bit width of the product vector. The reference design has been completed in schematic form using Viewlogic<sup>®</sup> Workview Office<sup>®</sup>; however, re-targeting it to either VHDL<sup>®</sup> or Verilog<sup>®</sup> would not be a difficult task.

## The Algorithm

Performing decimal multiplication by hand is a fairly tedious process, especially if we do not know our multiplication tables. To simplify the process, one typically breaks the problem down into simpler steps: one performs single-digit multiplications with carries to the next column, then adds all the intermediate products to obtain the final product. Consider the following example:

$170 \times 3 = 0 + 210 + 300 = 510$

$170 \times 63 = 510^{(1)} + 10{,}200^{(2)} = 10{,}710$

Notes:

1. 510 = 3 x 170
2. 10,200 = 6 x 170 x 10



Programmable SLI AT40K AT40KAL AT94K

# Application Note

Rev. 3038A-FPSLI-3333





Alternatively, we could simplify and speed up this process by committing a larger number of multiplication tables to memory; for a person, however, that is not a very viable option.

With the devices available today, especially those that are re-configurable, this alternative becomes a very viable solution. To implement a Single-coefficient Multiplier in hardware, one first constructs a series of hexadecimal multiplication tables holding the products of \$00 through \$0F and the multiplicand value. As with longhand multiplication, one then adds the intermediate products to obtain the final product. Consider the following example:

$AA \times 3F = 9F6^{(1)} + 1FE0^{(2)} = 29D6$

Notes:

1. 9F6 = F x AA
2. 1FE0 = 3 x AA x 10

| Product | Value | Product | Value |
|---------|-------|---------|-------|
| 0 x AA  | 000   | 8 x AA  | 550   |
| 1 x AA  | 0AA   | 9 x AA  | 5FA   |
| 2 x AA  | 154   | A x AA  | 6A4   |
| 3 x AA  | 1FE   | B x AA  | 74E   |
| 4 x AA  | 2A8   | C x AA  | 7F8   |
| 5 x AA  | 352   | D x AA  | 8A2   |
| 6 x AA  | 3FC   | E x AA  | 94C   |
| 7 x AA  | 4A6   | F x AA  | 9F6   |
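The table-lookup scheme can also be modeled in software. The following Python sketch is only an illustration and is not taken from the reference design: the names `build_table` and `multiply_by_coefficient` are hypothetical, and the input is assumed to be an unsigned 8-bit value.

```python
def build_table(coefficient):
    """Return the 16 products of 0x0..0xF and the fixed coefficient."""
    return [digit * coefficient for digit in range(16)]


def multiply_by_coefficient(table, value):
    """Multiply an unsigned 8-bit value by the table's coefficient.

    Each hexadecimal digit of the input selects one table entry. The entry
    for the upper digit is weighted by 0x10 (a 4-bit left shift) before the
    two intermediate products are added.
    """
    low_digit = value & 0x0F
    high_digit = (value >> 4) & 0x0F
    return table[low_digit] + (table[high_digit] << 4)


table_aa = build_table(0xAA)
product = multiply_by_coefficient(table_aa, 0x3F)
print(f"{product:04X}")  # prints 29D6, i.e. 0x9F6 + 0x1FE0
```

No general multiplication happens once the table exists: the model needs only two lookups, one shift and one addition per product.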

Furthermore, it is possible to implement optimizations that significantly reduce the addition logic required. Because the intermediate product of the upper digit is shifted left by four bits before the addition, the four least significant bits of the lower-digit product pass through the addition phase unchanged and need no adder logic, see Figure 1.
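This optimization can be checked with a short model. Only the upper eight bits of the lower-digit product overlap the shifted upper-digit product, so only those bits need to enter the adder. The Python sketch below is an illustrative model under that reading, not the reference design; the function name is hypothetical, and both the coefficient and the input are assumed to be unsigned 8-bit values.

```python
def multiply_with_passthrough(coefficient, value):
    """Model an 8-bit single-coefficient multiply with LSB pass-through."""
    low_product = (value & 0x0F) * coefficient          # 12-bit LUT output
    high_product = ((value >> 4) & 0x0F) * coefficient  # 12-bit LUT output

    # The four LSBs of the low product never overlap the shifted high
    # product, so they are routed straight to the result.
    passthrough = low_product & 0xF

    # Only the upper 8 bits of the low product are added to the high product.
    upper_sum = (low_product >> 4) + high_product
    assert upper_sum < (1 << 12)  # the sum always fits in 12 bits

    return (upper_sum << 4) | passthrough


print(f"{multiply_with_passthrough(0xAA, 0x3F):04X}")  # prints 29D6
```

In this model the adder is 12 bits wide, and the lower four bits of the 16-bit product come directly from the lower-digit look-up.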






### Implementation

An 8-bit implementation of the Single-coefficient Multiplier requires two arrays of twelve unique LUTs; each array provides the results ranging from zero to fifteen times the multiplicand. Atmel's 4-input core cell in the AT40K Series of FPGAs can be configured as a 16 x 1 ROM, implementing each LUT easily and efficiently. The contents of each 16 x 1 ROM are bit-sliced according to the coefficient multiplication table, see Table 1. For consistency, this application note will continue to use the data from the previous example.

|               | R11 | R10 | R9 | R8 | R7 | R6 | R5 | R4 | R3 | R2 | R1 | R0 |
|---------------|-----|-----|----|----|----|----|----|----|----|----|----|----|
| 0 x AA = 000> | 0   | 0   | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  |
| 1 x AA = 0AA> | 0   | 0   | 0  | 0  | 1  | 0  | 1  | 0  | 1  | 0  | 1  | 0  |
| 2 x AA = 154> | 0   | 0   | 0  | 1  | 0  | 1  | 0  | 1  | 0  | 1  | 0  | 0  |
| 3 x AA = 1FE> | 0   | 0   | 0  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 0  |
| 4 x AA = 2A8> | 0   | 0   | 1  | 0  | 1  | 0  | 1  | 0  | 1  | 0  | 0  | 0  |
| 5 x AA = 352> | 0   | 0   | 1  | 1  | 0  | 1  | 0  | 1  | 0  | 0  | 1  | 0  |
| 6 x AA = 3FC> | 0   | 0   | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 0  | 0  |
| 7 x AA = 4A6> | 0   | 1   | 0  | 0  | 1  | 0  | 1  | 0  | 0  | 1  | 1  | 0  |
| 8 x AA = 550> | 0   | 1   | 0  | 1  | 0  | 1  | 0  | 1  | 0  | 0  | 0  | 0  |
| 9 x AA = 5FA> | 0   | 1   | 0  | 1  | 1  | 1  | 1  | 1  | 1  | 0  | 1  | 0  |
| A x AA = 6A4> | 0   | 1   | 1  | 0  | 1  | 0  | 1  | 0  | 0  | 1  | 0  | 0  |
| B x AA = 74E> | 0   | 1   | 1  | 1  | 0  | 1  | 0  | 0  | 1  | 1  | 1  | 0  |
| C x AA = 7F8> | 0   | 1   | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 0  | 0  | 0  |
| D x AA = 8A2> | 1   | 0   | 0  | 0  | 1  | 0  | 1  | 0  | 0  | 0  | 1  | 0  |
| E x AA = 94C> | 1   | 0   | 0  | 1  | 0  | 1  | 0  | 0  | 1  | 1  | 0  | 0  |
| F x AA = 9F6> | 1   | 0   | 0  | 1  | 1  | 1  | 1  | 1  | 0  | 1  | 1  | 0  |

Table 1. 16 x 1 ROM Look-Up Table Data for \$AA
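Each column of Table 1 is the content of one 16 x 1 ROM. The Python sketch below shows one way such contents could be derived for an arbitrary coefficient. It is an illustration only: the function name `rom_contents` is hypothetical, and packing address 0 into the least significant bit of the 16-bit word is an assumption, although it does reproduce the \$DB6C value this note quotes for the R7 LUT with a coefficient of \$55.

```python
def rom_contents(coefficient, result_bit):
    """Return the 16-bit contents of the 16 x 1 ROM for one result bit.

    Bit n of the returned word is bit `result_bit` of n * coefficient,
    i.e. the value the ROM outputs when its 4-bit address is n.
    Assumption: address 0 is stored in the least significant bit.
    """
    word = 0
    for address in range(16):
        bit = (address * coefficient >> result_bit) & 1
        word |= bit << address
    return word


# One array of twelve LUTs: R0 through R11 for the coefficient 0xAA.
luts_aa = [rom_contents(0xAA, r) for r in range(12)]
for r, word in enumerate(luts_aa):
    print(f"R{r}: {word:04X}")
```

Reading a column of Table 1 from the bottom row (F) to the top row (0) gives the same 16-bit word as the corresponding entry of `luts_aa`.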

The optimized 8-bit implementation of a Single-coefficient Multiplier requires a total of twenty-four 16 x 1 ROM LUTs and a 12-bit Ripple-carry Adder. An optimized 16-bit Single-coefficient Multiplier would require a total of eighty 16 x 1 ROM LUTs, two 20-bit Ripple-carry Adders, and a 24-bit Ripple-carry Adder. A comparison between a Single-coefficient Multiplier and a standard multiplier is shown in Table 2 for both 8-bit and 16-bit implementations.

Table 2. Implementation of Various Multipliers on AT40KAL Series FPGA

| Multiplier                           | Speed (MHz) | Core Cell Count |
|--------------------------------------|-------------|-----------------|
| 8-bit Single-coefficient Multiplier  | 56          | 37              |
| 8 x 8 Unsigned Multiplier            | 44          | 64              |
| 16-bit Single-coefficient Multiplier | 22          | 145             |
| 16 x 16 Unsigned Multiplier          | 23          | 256             |
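The adder arrangement described for the 16-bit case (two 20-bit adders followed by a 24-bit adder) can be modeled as a two-stage adder tree. The Python sketch below is one plausible reading of that arrangement, not a transcription of the reference design. The function name and the way the partial products are paired are assumptions, and both operands are assumed to be unsigned 16-bit values.

```python
def multiply16(coefficient, value):
    """Model a 16-bit single-coefficient multiply built from 4-bit lookups."""
    # Four 20-bit partial products, one per hexadecimal digit of the input.
    p = [((value >> (4 * i)) & 0xF) * coefficient for i in range(4)]

    # First stage: two 20-bit additions. The four LSBs of the lower partial
    # product in each pair bypass the adder.
    sum_lo = (((p[0] >> 4) + p[1]) << 4) | (p[0] & 0xF)  # digits 0 and 1
    sum_hi = (((p[2] >> 4) + p[3]) << 4) | (p[2] & 0xF)  # digits 2 and 3

    # Second stage: one 24-bit addition. The eight LSBs of sum_lo bypass it.
    return (((sum_lo >> 8) + sum_hi) << 8) | (sum_lo & 0xFF)


print(f"{multiply16(0x00AA, 0x003F):08X}")  # prints 000029D6
```

With 4 x 20 = 80 one-bit ROMs feeding this tree, the model matches the LUT count quoted for the 16-bit implementation.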





## Changing the Constant Coefficient Dynamically

Atmel's AT40K Series of FPGAs can be dynamically re-configured. The series supports CacheLogic<sup>®</sup> re-configuration, which allows the contents of the 16 x 1 ROM LUTs to be updated with new data without disrupting the rest of the system. Further information on CacheLogic Mode 4 re-configuration is available from the Atmel Programmable SLI Group under a Non-Disclosure Agreement (NDA), as this protects our customers from the reverse engineering of their designs.

For this example, let's assume that an 8-bit Single-coefficient Multiplier has been implemented in the embedded AT40KAL FPGA within the AT94K FPSLIC<sup>™</sup> device. The built-in CacheLogic interface is then used to dynamically re-configure one of the 16 x 1 ROM LUTs in twelve instructions. The following code snippet, executed by the embedded megaAVR<sup>®</sup> microcontroller, re-configures one of the R7 LUTs of the Single-coefficient Multiplier to contain \$DB6C for a coefficient of \$55.

    ldi  rTemp, (1 - 1)      ; Load X-coordinate of R7 LUT
    out  FPGAX, rTemp
    ldi  rTemp, (12 - 1)     ; Load Y-coordinate of R7 LUT
    out  FPGAY, rTemp
    ldi  rTemp, 0b0000110    ; Details of FPGAZ only under NDA
    out  FPGAZ, rTemp
    ldi  rTemp, 0b01101011   ; Load new LUT data contents
    out  FPGAD, rTemp
    ldi  rTemp, 0b0000111    ; Details of FPGAZ only under NDA
    out  FPGAZ, rTemp
    ldi  rTemp, 0b01100111   ; Load new LUT data contents
    out  FPGAD, rTemp

The previous example demonstrated the instructions necessary to re-configure one 16 x 1 ROM Look-Up Table. Let's now consider the overall time required to re-configure an entire Single-coefficient Multiplier. The calculations assume worst-case timing, where commands are issued to load the FPGAX, FPGAY and FPGAZ registers for every ROM Look-Up Table. Both the *ldi* and *out* instructions are single-cycle instructions in the megaAVR architecture. The maximum time to re-configure the entire Single-coefficient Multiplier depends on the frequency of the microcontroller system clock ($f_{CLK}$), the number of ROM Look-Up Tables needing re-configuration (nROM), and the number of microcontroller clock cycles needed to re-configure one ROM Look-Up Table (nCycles). The equation for computing the maximum time for re-configuration is:

$$T_{MAX} = \frac{1}{f_{CLK}} \times nROM \times nCycles$$

The calculations for  $T_{MAX}$  have been performed for both 8-bit and 16-bit Single-coefficient Multipliers, assuming operation at 25 MHz, see Table 3.

| Multiplier                           | T<sub>MAX</sub> (μs) |
|--------------------------------------|----------------------|
| 8-bit Single-coefficient Multiplier  | 11.52                |
| 16-bit Single-coefficient Multiplier | 38.40                |
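The figures in the table can be reproduced directly from the equation. The following Python sketch is only a worked check; the function name is hypothetical. It takes nCycles as 12 (the twelve single-cycle instructions of the snippet above) and nROM as 24 and 80 (the LUT counts stated for the 8-bit and 16-bit implementations).

```python
def max_reconfig_time_us(f_clk_hz, n_rom, n_cycles=12):
    """Return T_MAX in microseconds: (1 / f_CLK) * nROM * nCycles."""
    return n_rom * n_cycles / f_clk_hz * 1e6


print(f"{max_reconfig_time_us(25e6, 24):.2f}")  # 8-bit, 24 LUTs: prints 11.52
print(f"{max_reconfig_time_us(25e6, 80):.2f}")  # 16-bit, 80 LUTs: prints 38.40
```

Because this is the worst case, any scheme that avoids reloading FPGAX, FPGAY or FPGAZ for every LUT would only shorten these times.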

### Optimization

The Single-coefficient Multiplier has been optimized to be as efficient as possible with respect to maximum speed and core cell usage. Further speed enhancements may be achieved by pipelining the design with single or multiple stages. Pipelining will drastically increase the number of FPGA core cells used; however, the performance boost may justify the cost in the target system.





Atmel Programmable SLI Hotline (408) 436-4119

Atmel Programmable SLI e-mail fpga@atmel.com

FAQ Available on web site

#### © Atmel Corporation 2002.

Atmel Corporation makes no warranty for the use of its products, other than those expressly contained in the Company's standard warranty which is detailed in Atmel's Terms and Conditions located on the Company's web site. The Company assumes no responsibility for any errors which may appear in this document, reserves the right to change devices or specifications detailed herein at any time without notice, and does not make any commitment to update the information contained herein. No licenses to patents or other intellectual property of Atmel are granted by the Company in connection with the sale of Atmel products, expressly or by implication. Atmel's products are not authorized for use as critical components in life support devices or systems.

ATMEL<sup>®</sup>, megaAVR<sup>®</sup> and CacheLogic<sup>®</sup> are the registered trademarks of Atmel. FPSLIC<sup>™</sup> is the trademark of Atmel.

Viewlogic<sup>®</sup> and Workview Office<sup>®</sup> are the registered trademarks of Viewlogic Systems, Inc.; Verilog<sup>®</sup> is the registered trademark of Gateway Design Automation Corporation; VHDL<sup>®</sup> is the registered trademark of Cadence Design Systems, Inc. Other terms and product names may be the trademarks of others.
