FPGA 2001 Conference Report

Introduction

This is my report on the Nineth ACM Symposium on Field Programmable Gate Arrays (FPGA'01), held at the Doubletree Hotel in Monterey, California on 12th-13th February 2001.

I typed all this up more or less on the go, using my trusty Psion 5mx palmtop. As a result, some of the English is a little fractured though I've tried to clean bits up. On the plus side, you should get a good idea of what the talks were about.

Usual disclaimers apply: the opinions or research of anyone mentioned here may quite possibly bear no reality to my understanding of them. If you're using this as a reliable data source, you are quite mad. But it hopefully gives a useful flavour of the conference, which I found enjoyable and interesting.

Quick Reference

Monday 12th February

The soundproofing in the hotel could do with a little work; my next door neighbour's stertorisms came clearly through the wall. At least it wasn't a honeymooning couple. Anyhow, I crawled out of bed at the unspeakable hour of 6:45am so that I could get to the hotel business centre when it opened and have my poster abstract copied onto transparency.

Remarkably, all went to plan and I was left to catch up with my travelogue in the hotel lobby, waiting for breakfast to start and listening to the rain drum on the roof. Apparently the gag here is "if you don't like the Monterey weather, wait 24 hours." A couple weeks back they had temperatures of 80F. Now that just ain't right.

All things considered, it looked like I had brought the rain with me from the UK. Whoops. I hoped that Texas would be drier.

Breakfast was coffee, bagels and pastries; I had started to miss my Weetabix, just a little bit, but not enough to complain. A small European table formed, with two guys from a UK startup, Exilent?? in Bristol, Gerhard from Uni Mannheim and a student and his professor from a French university just outside Paris. The Bristol guys were here to check out the competition and spread the word, Gerhard was speaking on his bachelor's research topic (video compression), and the French student was presenting a poster on some research into optimal FPGA configurations.

I found the poster area and put mine up; the poster boards were quite massive, and I could comfortably have put my stuff on A3 paper and still have it fit. Oh well. We then filed into the lecture theatre for the first session.

Scott Hauck opened up with the thanks for everyone who had contributed to getting this conference running. Sponsors were Altera, Xilinx, whatever Lucent are calling themselves this week, Cypress and Actel.

Martine Schlag took over and gave all the admin details.

Placement and Routing

Carl Ebeling chaired this session, giving a concise introduction to the general theme of the session before handing over to Mike Hutton.

Timing-Driven Placement for Hierarchical PLDs

Michael Hutton from Altera presented this talk.

They were aiming to improve timing-driven compilation for their hierarchical FLEX device. They were allowing a doubling of compile time for a 30% improvement in fmax. They didn't want to do too much extra coding so had to stick with their recursive partitioning flow.

The APEX PLD family (20K400 example device) is a hierarchical device. 10 lookup elements per LAB, 16 LAB per MegaLAB, 4 MegaLAB columns and 26 rows giving 16640 LEs. Interconnect is hierarchical. Routing is local wire, GH wire, H or HH wire, and V or VV wire in order of increasing reach.

Because of this, if your part falls partly outside of a hierarchy you suddenly pay a significant price in increased delay.

Their basic recursive partitioning algorithm proceeds with 6 phases in total. To improve it they weight the edges of the critical path in their graph so that more emphasis is put on getting the critical path right. Note that the longest path isn't necessarily most critical because it can have lax timing constraints. The algorithm must be able to compensate for heuristic mistakes, since the problem is NP hard.

The key idea was "slack" which leads to a clearer understand of what is and is not critical. There is also an adaptive delay estimator which predicts future cuts.

Definitions: "path" is from a delay source to a destination.

"Slack" on a path is its constraint minus its delay.

"Slack" on an edge is the worst case (smallest) slack over all relevant paths.

"Slack profile" shows edges with a given slack value. Normally is a bell curve on slack vs number of edges.

Critical path/edge is one with slack <= 0. Typically, few edges are critical (less than 10%).

To predict future timing: it's inaccurate in early phases. Criticality of a path depends how local edges are later cut relative to other paths. So we use a probabilistic estimator based on experience with other designs, so we can guess how likely are future local cuts.

Delay(wire) = actual delay if known, phase-local otherwise. This approach gives about 10-15% improvement in fmax (register-to-register performance), and is a very minor change to the original algorithm.

Other trick: phase loops. We do iterative improvement (2 or 3 times) on phases to "correct errors" and make further improvements. We limit the number of edges in the graph to cut.

As an example, we have a typical slack profile, do a partition and some edges got "longer" since they were cut and have longer communication delay. Re-weight edges we missed and re-partition. Then iterate improvement via phase loops for best-possible solution before next phase.

On 20 industrial designs we get a 36% improvement in fmax, for a 1.9X increase in total time including 2.9X placement time increase. You can tune it by restricting number of retries, e.g. 15% fmax improvement for 1.4X, 1.7X times increase.

Questions: a technical one on definition of phase-local and knowing how likely additional cuts would be. Another one on the effect of very regular designs e.g. pipelined - in this case Mike said that the software wouldn't help you because your critical path is the datapath.

LRoute: A Delay Minimal Router for Hierarchical CPLDs

James Lee from Synopsys presented this talk.

He showed a set of the four main routing models (array, row, sea of gates and hierarchical). Hierarchical is popular because it scales nicely.

The routing algorithms all hang off the maze routing algorithm; variants include A*, Negotiated A* and Orthogonal.

He was looking at the Vantis Mach 5 CPLD. It has Blocks held inside a Segment.

The two basic kinds of constraint in CPLD routing; hard (architecture constraints) and soft (performance considerations). The hard constraints must be satisfied, and may make a circuit unroutable.

The Mach 5 has constraints such as: a source may only connect to one block interconnect wire segment (other constraints were listed).

The problem can be represented as a graph with edges mapping to switches, vertices to wires. In the Maze Routing Algorithm we propagate a "wave" up the graph to the first sink and use that as a partial path. However, you're then committed to this partial path and it may prove a rod for your own back.

At this point it wasn't desperately clear what was going on. There was no explanation as to what these paths correspond to in a placed and routed design.

The approach to route a net optimally is to use dynamic programming and use an "unrolled" version of the graph. Instead of say three levels in a hierarchy (L1,L2,L3) you can spread it out to say 5 levels, so all routing arrows go to the right rather than some pointing back to earlier levels. This makes life easier.

He outlined the particular dynamic programming algorithm. Check the paper if you're interested in the details; I couldn't type that quickly. Looks like you aim to pick the route with the best partial graph.

You then put the thing into a Lagrangian Relaxation framework, relating nets and their use of each wire segment. You set it up so that no two nets can use the same routing resource. He specified the Lagrangian subproblem, which attaches weights to each routing resource and now allows overlap but penalises it.

This is known as the Lagrangian Multiplier Problem and is solved by an iterative approach using a subgradient method, set up so that congested nodes get higher cost and less congested nodes get lower cost. At each stage you recalculate a parameter lambda, initially set to 1.

Experimental results contrasting the LRoute, Maze Routing and dp algorithms; LRoute could route all of them, the other two algorithms couldn't, and LRoute (with one notable exception) wasn't much slower than dp.

We then looked at relaxing soft constraints without letting circuit performance deteriorate. It wasn't entirely clear what the result graphs presented here actually meant, but apparently this helps routability and speed.

Questions: one about fractional results coming out of the Lagrange algorithm. James said that they just look at the integer parts.

A Crosstalk-Aware Timing-Driven Router for FPGAs

Steven Wilton from Uni British Columbia gave this talk. As he started up his Powerpoint I noticed that the laptop systray was amazingly full of crap. Windows, eh?

Crosstalk is where a voltage change on one wire causes a change on other, unconnected wires. The basic approach is to separate clocked and unclocked nets. But also, crosstalk can affect the delay of signals and perhaps mean missing trigger signals.

The delay of a net is the time to charge or discharge capacitance C along the net. There's fixed C (propertional to length) , switch C, and coupling C which is what causes crosstalk.

One solution - put wires further apart. Vaughn Betz said you can make some wires isolated and put crosstalk-causing signals on those. Works well but is limited.

Another approach is to look at signal activities; are they switching in the same direction? Worst case is them switching in opposite direction. So try routing together wires which switch in same direction.

Make the simplifying pessimistic assumption that we are always switching in opposite direction, but if adjacent track isn't used then the Ccouple is only half. Should we route Critical Path nets next to unused wires? Yes, but you can end up going around the houses. How about doing this only when you need to?

This is the subject of the talk. Use the VPR as a baseline routing algorithm, modifying timing model and cost function for crosstalk. (VPR is apparently well known among people who try to route FPGAs.)

Timing model estimates speed of circuit post-route, and is done during routing to choose between alternative routes.

VPR uses a fixed capacitance C for each track; we change this to include coupling capacitance. This depends on whether neighbour tracks are used.

Cost function in VPR has delay component and congestion component. We modify the delay function using a new timing model. Look at making signals understand the effect they have on other already-routed wires; new function Penalty() as the increase in delay for all neighbours given this route.

The implementation that Steve has is about 20% slower in runtime than vanilla VPR. How good is it?

The statutory table of comparison circuits was shown; just adding the enhanced timing model told us that this consideration made a 4% improvement over VPR, after you take out the logic blocks. Adding the new router made more like an 8% improvement.

Adding more tracks per channel means the router can do a better job, much as you'd expect.

The improvement is free (no area penalty) and crosstalk issue will be more important as technology shrinks; these tests were at 0.8u.

Questions: how does the "7% free" claim match the extra tracks usage? It's because the extra tracks were available to vanilla and enhanced VPR.

Runtime and Quality Tradeoffs in FPGA Placement and Routing

Chandra Mulpuri from Uni Washington presented this talk.

He noted that place and route in CAD mapping was very time-consuming. But for some apps this has a disproprotionate effect on effectiveness because you do P&R lots of times, e.g. SAT solvers where you have to P&R for each new problem.

He was looking at the tradeoff aspects of speeding up P&R. His target was a XC4000E using parallel pathfinder and VPR benchmarks. He compared critical path delays and P&R runtimes.

Looked at simulated annealing, characterised by the number of moves tried per temperature.

Looked at force-directed algorithm, where CLBs are connected by "springs". Characterised by the stopping criterion (how much can it twitch and be regarded as stopped).

Other placement algorithms examined; Xilinx (effort level), FM recursive bipartitioning, Scatter (ultra-fast, arbitrary placement).

Comparing all these on critical path delay vs P&R time they are all over the shop, no clear winner.

Pathfinder router algorithm; nets negotiate for resource where criticality denotes negotiating power. This was tweaked a bit so that rip-and-reroute only happened for shared resources.

Comparing this against Xilinx, Hierarchical etc. Simple always gives fastest routing calculation, Pathfinder always gives the least delay in the resulting circuit.

When all routers used the same placement there was a much clearer result. See the graph in the paper, but in summary: annealing gives fastest and best solution but less so when shorter routing times allowed. Graphs weren't up for long enough to draw concise conclusions.

Looking at the effect that the FPGA size had; 75% utilisation seems to be a turning point.

So how to speed P&R? Can 5x speed increase for 38% quality loss. Extra resources don't really help.

Individual faster algorithms don't give overall speedup. But best placement and fast router makes a significant difference overall.

Questions: why did more resource slow down CAD algorithms? Looks like having to search a lot more before finding a solution.

Do algorithms take advantage of SIMD or similar? Don't know.

Coffee Break

The poster session coming up was previewed by Carl. Topics included:

In addition, there were posters for all this session's talks. This was quite a good idea as it allowed you to catch up with any point that wasn't clear during the talk.

In the event I didn't have time to look at any other posters as I was kept busy trying to explain SRPT to interested passers-by. It seemed to generate a reasonable amount of interest and some useful questions.

Technology Mapping

Steven Wilton chaired this session, noting that both papers were from UCLA.

Performance Driven Mapping for CPLD Architectures

Deming Chen presented this.

He noted the challenges for PLA mapping; sharing of inputs and p-terms (product terms) among different outputs in same block; wide fan-ins, 2-level minimization which is NP-hard.

He showed a general PLA structure with AND then OR arrays.

He chose to represent a Boolean network by a DAG, and defined a root set as a cluster rooted at the node set.

The problem is to cover a boolean network with (k,m,p) feasible clusters which are PLA-able, minimising delay. k inputs, p pterms, m outputs.

You find the delay info for each node by using a labelling scheme such as DAGmap or FlowMap. He was looking to extend DAGmap. He showed an example for a (3,3,2) feasible PLA, labeling (3,3,1) feasible clusters. You try and grab as big a group as poss which is (3,3,1) feasible, using a dynamic programming approach.

From this labelling we have the required time of the network which is the max label in the labelled network.

To map a labelled graph, start from outputs (POs) working towards inputs (PIs) and keep a list of notes in label-decreasing slack-time increasing order. Take opportunity to merge overlapping clusters in a "sibling merge", "SlackTime relaxation" or in the worst case do a "Duplication" move.

Then pack clusters to reduce area further. Collapse PLA into all fanout PLAs, or bin-pack PLA in a maximum shared-input first-fit-decreasing binpack.

This reduces your netowrk into one with a few (3,3,1) PLAs as nodes.

Trade of area vs delay. Use threshold control to put ceilings on fanouts and / or pterms.

Tried comparing with TEMPLA, which aims for area reduction only. Also with MAXPLUS II from Altera (ooh there's a question begging to be asked). Both comparisons in terms of area and delay, and in terms of (10,12,4) PLAs. TEMPLA used little less area but lots more delay. MPII used less area to but quite a bit more delay. So this new approach seems to give goot delay reduction for a little area increase.

Question: can you apply it to the 20K FLEX device which has a product term optimization thingy? He didn't think so.

Simultaneous Logic Decomposition with Technlogy Mapping in FPGA Designs

Michael Gang Chen from UCLA VLSI CAD lab gave this talk.

He described the normal approach where you optimize tech-indep, decompose logic then map to technology. He showed what happens when you decompose a network with 5 gate inputs to 2 gate inputs max; depth increase unless you do expensive optimisation.

He's trying to combine logic decomp and tech mapping. He can explore all structural decomps in a W-bounded bool netowrk, givingb best bounded decomp for FPGA network to date.

Goal; deconp with minimal depth after mapping. nown NP-hard for W >= K >= 5.

Existing algorithms include tech-decomp, dmig and dogma.

Uses a mapping graph which encodes circuit structures in a single graph, adding "choice" nodes where alll inputs are logically equiv and "ugates" which are two choice nodes and fanins, and also cycles. You mutate a standard graph into one using these new constructs.

FPGAs are difficut because they have LUTs which needs a different algorithm. Doesn't work if cycles are involved! Combined minimisation for depth and area are NP-hard

SLDMap, their new approach, generates an initial W-bounded network. Make mapping graph, depth optimal labelling, label relaxaton, area minimisation, decomp selection then decomp mapping.

You enumerate all decomps for each AND gates; use of ugates keeps this manageable.

Depth optimal labelling is key. Label is minimum depth in mapped network. Sort nodes in pseudo top order -pseudo because of cycles. Label ugates with their minimum depth. You can show this approach is at least as good as DAGmap in terms of depth.

Area minimisation done by label relaxation, working back from the required arrival time for each node. Generate delay/area pairs generation for each ugate. Then select the decomposition with minimal depth and area.

Did an experiment using standard flow comparing dmig, dogma and new SLDMap. SLDMap was about 10% less in depth and 6% less in area than the other two.

Questions: runtime for big circuits? Longer than other approaches, 5-10%(?) Did you compare with Xilinx? Not rigorously. How does it apply to a datapath design where lots of pipelines? Then it won't help you much. Benchmarks quite small, why? Difficult to get large commercial circuits.

Lunch

I was hoping to get a look at other posters, and had some success.

Routing Architectures

Tom Kean from Algotronix chaired this session.

Using Sparse Crossbars within LUT Clusters

Guy Lemieux from Uni Toronto presented this talk.

He set the context by describing research into routing tracks for island-style FPGAs. He then noted that modern FPGAs replaced a single LUT with a cluster of LUTs. We assumed until now that a crossbar between them was fine. But what if the connection is sparse?

Guy wanted to reduce the area overhead of all these crossbars. Given a cluster of N LUTs of size k, with inputs and outputs distributed around the cluster, how sparse can you make the connectors while still leaving it routable? What are the tradeoffs?

Delay might increase if you have to route around part of the cluster to come in into a less crowded side.

He gave specifics about the routing architecture in which he was working. Benchmarks were the MCNC circuits (which are quite small and old, largest is 8000 LUTs). Plea for more circuits!

For each circuit, use the smallest feasible square FPGA

Vary Fc (the level of interconnect) and see what happens to channel width (number of tracks). Sparse clusters add 2-5 extra tracks, which is well under 10%. For very small clusters there's no real difference. So we're not straining the routing network.

Regarding number of transistors, and hence area, small sparse clusters save maybe 2%. But large spare clusters save a heckuva lot more, maybe 10%.

So what's the best Fc for a given size of LUT and cluster? Best Fc is 50% when N=6 LUTs, 4-input LUTs, saving 10% of area. If LUT has 5 inputs then Fc=40%. He illustrated several other cases, indicating that the wider and more numerous the LUTs in a cluster, the more area can be saved. But you do need to add some spare inputs into the cluster to help routability.

Delaywise, there seems to be no impact! Runtime for P&R goes up 200%-300% when you contrast old and new VPRs.

Questions: why did you depopulated part of the cluster and not another part in an early slide? He did, the slide wasn't clear. Can you extrapolate result to a datapath design? Hasn't looked at it, looks tricky. Depopulation pattern make any difference? Possibly, but not much difference on average. Did you assume symmetrical function of the LUT? Yes, and also each LUT has same switch pattern.

Detailed Routing Architectures for Embedded Programmable Logic IP Cores

Peter Hallschmid from Uni British Columbia presented this.

He gave an example block diagram of a programmable IP core; programmable logic embedded in an ASIC. Motivation why; late incorporation of algorithms, fast upgrade,amortise development cost over family of chips, and programmable builtin self test (BIST).

He thinks FPGA cores should be able to take any size and shape, but current ones don't and may not be routable enough. He's looking at detailed routing including a rectangular switch block, as opposed to a normal square one.

The "Imran" switch block is the base square switch block; some tracks terminate and use the "Wilton" pattern, and pass-thrutracks use disjoint pattern. Works well for segmented architectures.

Replicate several of these patterns into congruent subswitchblocks. Gives precise definition in paper.

Fs is number of connections per track, which is 3 for square blocks, and is 8 for the short side of our block; large is bad because it congests and increases capacitive load.

MODn are switch blocks where proportion n of connections have been removed. Removal pattern found by experimentation.

Looking at best block for given track and aspect ratios, best track ratio for each aspect ratio in best block, what improvement given, what is penalty for using rectangular FPGAs.

MOD2 is best for track ratio 1 to 1.5, mod4 best above this. True for all aspect ratios tested. Used MPMC and VPR for experiments.

Best track ratio: high aspect ratio makes larger area.

Best block and track ratio is 8% denser but a bit slower than baseline. Aspect ratio 2:1 isn't much worse, 10:1 is a lot worse.

In future, looking at effect of CLB pin distribution and looking at L-shaped cores.

NB: This is only a small part of UBC's SoC research.

Questions: how did you decide which connections to remove in depopulation? Combination of guesses and experiment. How much did it vary between different organisations? 5-10%.

Mixing Buffers and Pass Transistors in FPGA Routing Architectures

Mike Sheng from Uni Toronto gave this talk, once he'd fought his way past the QuickTime 4 upgrade screen...

He wants to improve delay without paying too much in area.

Goal: identify the FPGA bits that delay the most. Using the architecture given in [Betz99].

Logic block: 4 4LUTs, 10 ins, 4 outs

Switch block Fs=3. Can only switch track i in one channel to track i in another channel.

Connection block Fc(in) = 0.6 Fc(out)=0.25.

Two key outputs from CAD flow are the critical path delay and the total FPGA area needed.

Uses MCNC circuits of 1K to 8K LUTs.

Critical path delay comprises logic block delay and routing delay. Used SPICE sim to work out numbers. Average 38% logic block delay, 62% routing delay. Routing delay breaks down 60% source buffer, 30% input MUX, 9% other.

Why source buffer delay high? Long chain of pass transistors. High fanout nets need a certain amount of buffering close to source and pass transistors near sink.

Propose mixed buffer-pass architecture with three different types of track. Fraction of Planar-Pass tracks is important - drops track count, hence area too but can up delay. Lots of graphs were presented but the results weren't summarised well.

He illustrated examples with 20% PP 80% Mixed Buffer; 20%PP, 80% Planar Buffer; and quite a few others. Scored by the product of the area and the delay.

In future, looking at different routing wire segment length and try depopulating the switch block for routing wires switched by "Mixed" topology.

Questions: critical path in mixed arch uses same number of buffers and switches as in original? No, first run with minimum number of tracks then add 20% to reduce stress. If using tristate and pass transistors, why not use dedicated driver to improve performance? Well, that's what their pass transistors are, apparently. Do you include SRAM bits in your area count? Yes.

Coffee

New posters:

The mid-afternoon slump was clear to see on people's faces, but coffee and chocolate chip cookies were there in quantity which led to a general genuflection in blood sugar levels. I had a chat with the Altera Excalibur exhibitioner, demoing their Nios SoC playing MP3s and showing a block diagram of the ARM922 and MIPS-based SoCs due out Real Soon Now.

Applications

Ray Andraka of Andraka Consulting chaired the final session of presentations. Kudos to the program committee for

  1. sticking to time, and
  2. scheduling the length of sessions against anticipated audience attention spans.

Reprogrammable Network Packet Processing on the Field Programmable Port Extender (FPX)

John Lockwood gave this talk. He's from the Uni of Washington but his lab address is Montana. Go figure.

He's looking at extending networks with FPGAs to increase functionality and performance.

FPX allows modules to be dynamically installed in a network, providing an open platform for trialling firewalls, routers etc.

Firewalls can classify, selectively forward, encrypt/decrypt and otherwise mangle packets.

FPX sits between the switch fabric and the fibreoptic line card. WU have built a 20Gb/s switch with 8 ports, and can add an FPX module to e.g. make the port into an IP router.

They use the Utopia data interface which takes some of the worry off. You can route specific flows to specific modules, and can reprogram modules while other modules continue to operate. Can also share resource such as SRAM, SDRAM.

FPX has separate data, SDRAM and SRAM interfaces. Also has clock, reset, yankout warning signals. Defined by VHDL module.

Has layers for frame / IP packet / UDP / Application around the networking hardware module.

The platform has a reprogrammable application device (RAD) on an FPGA, offchip memory, and a network infrastructure device (NID) on another FPGA to route to / from modules and reprogram them. The NID is the focus of this talk.

NID can check packets for errors, control reprogramming, virtual circuit tables for routing between ports, and a switching function; 4-port soft core switch. Flows are identified by virtual circuits, virtual paths, ingress port, and a lookup table for identifying output ports.

Default card operation is bypass hardware -- nothing happens! Then you can reprogram. Can egress process, ingress process, ingress and egress, loopback, and chained egress processing to combine modules.

He gave an example configuration for an Internet router.

Can also mix hardware and software; they've got a Pentium on a card (the Smart Port Card, SPC) and stick the SPC between line card and FPX. So you can migrate functionality on the critical path, once you're happy the software works.

To reprogram hardware there's an external switch controller which can send in control cells to the FPX which recognises them and writes them into its configuration RAM. Then send a reconfigure command and the FPX does a reprogram with the new data in configuration RAM.

At the moment they're looking at how to reprogram with a dynamic hardware plugin module, reprogramming e.g. a certain fraction of CLB columns in a Virtex.

FPX is being used in 6-monthly workshops at WU.

He went a fair way over time but it was quite interesting.

Questions: taken offline.

Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining

Kris Gaj from George Mason gave this talk. I'd seen the poster beforehand so had an idea where it was going.

He gave a brief history of secret key ciphers from DES up to AES (Rijndael) including how the AES selection contest was run. His group participated in the second round of selection where suitability for hardware implementation was considered.

Two key parameters; latency (time for encryption of single datablock) and throughput. Throughput is the most important. Most secret key ciphers haev a number of "rounds" of similar computations, leading to an iterative architecture.

They did all 5 candidates plus Triple DES in Virtex, and Rijndael was 7 times greater throughput than Triple DES. Also looked at area usage - didn't get the info down before this slide went.

Pipelining would be cool; outer and inner round pipelining depends where we are in the processing loop. You can increase area and get speed increase more or less linear. But can we do better?

Yep, do it inside the cipher round. This is inner round pipelining. Area doesn't increase much because there's only 1 round; we've just added registers. You whack up throughput with small area increase. You can then mix inner and outer round pipelining for better results.

Latency goes up, since nothing's for free.

Doing this for the candidates, 3DES and Rijndael didn't need anything like as many stages as the others. Clock frequency was similar for all in both inner and mixed pipelining, plus or minus 10%.

Throughput in mixed pipelining Serpent and Twofish were best, about 25% faster than Rijndael. Serpent and Rijndael have low latency even after pipelining.

Area requirements? All fit in a Virtex with less than 50% util. But mixed pipelines needs multiple devices for all except 3DES.

NSA did only outer round pipelining and said Serpent was 4x faster than all except Rijndael.

For full mixed pipelining, speed proportional to block size and inv to min clock period.

Questions: offline as well.

I'd probably vote for this as the most interesting talk.

Algorithmic Transformations in the Implementation of K-means Clustering on Reconfigurable Hardware

Miriam Leeser from Northeastern Uni gave the last talk. Again, there was a useful poster I'd seen earlier.

She's looking at analysing satellite image data on reconfigurable hardware.

For each pixel in satellite data you get 10-20 channels of data (not just red, green, blue!) 1 data cube of 614x512 is 134MB!

Algorithm is unsupervised, looking to compress down to 16 clusters = 150KB/image. Very lossy! But retain features of interest.

Cluster spectrally similar pixels together. Helps to detect real changes and isn't sensitive to atmospheric distortions. You combine pixels which are close to each other and similar to each other. Needs a distance metric to define "close". Iterative algorithm, sensitive to initial values.

Do this in reconfig hardware because lots of small parallel computations. Try to minimise bitwdiths to get a smaller datapath.

Normal distance metric used are Euclidean, Maximum and Manhattan (taxicab). Manhattan and Maximum are much cheaper in computation terms so use a linear combination of them.

Did a bunch of data-independent experiments looking at relative variance and misclassification rate. Linear combination was best, but not much better than Manhattan at high dimensions.

Did some experiments on real AVIRIS data but in general there wasn't much difference, maybe 6% increase in variance

Image analysers liked Manhattan distance results.

Data truncation checks showed that you could severely truncate data (6 out of 12 input bits) and not lose cluster quality.

Put onto an Annapolis Wildstar board, Virtex 1000) 180 times faster than host Pentium III computer.

Manhattan in software doesn't make anything like as much difference as in hardware..

Questions: don't you have to store all the data anyway? Yes, but you don't have to process it. Did you use things like Pentium SIMD instructions in software? No, but didn't do anything special with the hardware either..

Dinner

The FPGA'01 crowd gathered in the De Anza ballroom for dinner at 6pm. It was laid out with about ten to a table which was about right; everyone was still buzzing from the events of the day so there was lots to talk about. Tim from Belfast was encouraging us to come along to FPL'01, promising an excellent few days in Belfast and a large quantity of Guinness. The food was OK, the wine flowed freely and time simply flew by.

At 7:30pm a panel of veterans from industry and academe rose to the top table to tackle the question "Is marriage in the cards for programmable logic, microprocessors and ASICs?" The answer appeared to be a qualified "yes" from each panellist, with David Papworth from Intel being closest to dissent. Moore's Law seemed to be the driver behind the predictions of pretty much all of the panel, observing that FPGAs had gone from glue logic in the early years, through system components, to full system on chips.

The panel took questions for an hour, which was about right for useful stuff to come out without the audience getting bored.

Afterwards a bunch of us Brits and some Canadians headed down to Pete's brewpub to sink some very decent beer and porter; a good evening out. Note well: Canadians can drink beer until it comes out their ears.

Tuesday 13th February

There was a noticeable drop in the number of people at breakfast first thing. Can't think why. :-)

Reconfigurable Computing

Steve Trimberger from Xilinx chaired this session.

Attacking the Semantic Gap Between Application Programming Languages and Configurable Hardware

Greg Snider from HP presented this talk. This had been flagged as the one to watch today.

He noted that the semantic gap had become wider as computers got more parallel (e.g. VLIW) and FPGAs had aggravated this. He saw an opportunity to accelerate apps with massive parallelism, aiming the hardware at scientists rather than chip designers.

Which language? Contrast Application languages (C, Java etc) with H/W languages (VHDL, Handel C). The former are sequential, hard to make parallel, and the latter need expensive tools and are hard to debug and integrate.

Their strategy is to combine best features of both. A worthy aim! But how?

He emphasised a conventional app language being compelling because you can use a standard compiler and debugger, debug on a PC, then when you are ready for hardware you use a HW compiler.

Uses a state machine parallel programming model, very simple : use many of them to exploit parallelism. Arbitrary topology allowed, hierarchicla decomp also.

So which language? You can use pretty much any one, but they've gone for a subsetted OO language like C++ or Java. (Ooh! Question coming on.) Examples given in Java for conciseness.

Gave example of a counter. Seems to be important to know what data you've got, how it interacts, public vs private issue.

Compiler written in Java, compiles Java or C++ source to configuration bits. Portable to different targets. Heavy optimisation used, very important in compiler. Do static expression evaluation to simplify stuff. Gave example of byte concatenation, which in C is horrible. Their compiler reduces it down to just wires!

Compiled the Serpent AES candidate, optimised down by 3 orders of magnitude. Pipeline of 32 machines in C++; 3200 CLBs and 77MHz obtained without tuning.

Questions: do they use annotations to provide extra information? No - they take the opposite approach to what I expected, and are trying to hide any hardware aspects from the user so annotations and pragmas are against their philosophy.

Matching and Searching Analysis for Parallel Hardware Implementation on FPGAs

Pedro Diniz from USC presented this talk, having to wrestle with the laptop to get started properly. (All software sucks!)

He's looking to take a sequential C program and get it automatically onto a target. He had to defer to the next speaker while the laptop was balking.

They're compiling C into VHDL and then doing estimation, generating auxiliary modules and targeting. Also compile a portion of the original source onto your conventional host PC.

Focusing on system-level compiler doing things like loop unrolling, memory access optimisation. Can partition across multiple FPGAs, using an estimation tool to get a quick guess whether the design will fit into a single device.

Using behavioural synthesis to manage storage and comms stuff.

They want to do apps like pattern matching over a string. Gave example of unrolling a naive C implementation of this pattern matching.

To make this work, use symbolic predicate composition ("symbolic execution"), select the right set of side effects by making a set of multiplexors for each variable and select the right variable final value by selecting the correct mux control values.

Handling breaks and continues is hard (which isn't surprising given the work Bernard Carre et al did on control flow).

Construct a control flow graph, extract symbolic predicates, get a core datapath, get an augmented datapath. Can handle regular array accesses and nested loops.

Compiler analyses SUIF, a general RTL-level imperative language. Uses off the shelf synthesis tools, targeting the Xilinx XCV1000BG560.

Did some simple match and search kernel compilations; figures given didn't really say much other than that the examples were very small.

Further work is being done on symbolic analysis techniques, combined loop unrolling to cut down memory accesses. He ran on quite a bit.

Questions: how does your narrow problem set expand to more general solutions? Doesn't really; it was a valid point.

Evaluation of the Streams-C C-to-FPGA Compiler: An Applications Perspective

Jan Frigo from Los Alamos presented this talk.

They compared optimised hand-coded VHDL to that produced by the Streams-C compiler for existing applications.

Streams-C is a subset of C functions and libraries, aimed to be easy to use. The compiler generates RTL, targeting Virtex devices. 5-10fold improvement in develop time, trade off 1-4fold increase in area and down to halving of the speed.

Streams-C is parallel and aimed at computing streams of data. Collection of processes, separate address spaces, high rate sync and low rate asynch communication.

Compiler analyses the code, generates VHDL and then RTL. Can then use synthesis tool of choice. Compiler can also convert code into C++ for a conventional PC target to do a simulation.

For apps, example is 4-tap FIR filter for RF data processing. Can also do image processing such as contrast enhancement and yesterday's K-means clustering algorithm.

Kmeans was done in a systolic process array; 99% of algorithm time was found to be in finding the closest pixel so obvious target for reprogrammable logic.

Contrasting performance versus productivity; big variation in area increased, but much better in terms of productivity and runs at similar speed.

Currently working on Streams-C version 2.

Questions: have you done comparisons with more general RTL? No; were looking more at immediate effect for the user.

The Effect of Reconfigurable Units in Superscalar Processors

Jorge Carrillo from Uni Toronto presented this talk.

Defined reconfigurable processor as a conventional uProc coupled with reconfigurable resources. Using Toronto's OneChip. Wanted to know how to combine to get best features of both technologies. Needed to do full software simulation to find out detailed information.

Differs from things like Garp and Napa because he's focused on the interface rather than the chip.

OneChip is MIPS-like, has a reconfigurable function unit (RFU), does dynamic scheduling and reconfiguration. Enhanced with superscalar pipeline; RFU integrated inside pipeline, in parallel with execution and memory stages.

Developed Sim-OneChip simulator to model the chip. No simulation of FPGA fabric. Compiles the C source code with ssgcc to produce a onechip binary, which is then executed by the simulator. User interface to RFU defined by the OneChip library.

Apps evaluated were taken from MediaBench, profiled using GNU gprof to pick things to go into RFU. Did audio and video coding, image compression and public key encryption. Used 3 data size and 4 experiments; in and out of order, with and without RFU. Assumed that memory accesses dominate logic and memory access time is cycle.

In audio encoding, for example, RFU gives most speedup. In JPEG encoding, out-of-order execution gives most speedup.

In these apps, parallel execution of RFU doesn't really seem necessary. No more than 2 FPGA contexts are needed. Configurations are implementable. OneChip speeds up kernels and apps, but the biggest gains come from dynamic scheduling.

Future work looking at automatic compilation, picking up pieces of C to put into RFU.

Questions: estimates on how large fractions of code are when going into RFU? No figures to hand but doesn't reckon things like JPEG very big.

Coffee Break

The posters introduced by Steve were:

I had a chat with the HP guys; they were focusing on making their compiler transparent to non-hardware people. Current work includes getting array usage to be compiled properly and efficiently.

Pipelined Routing Architectures

Andre DeHon from Cal Tech chaired this session.

Interconnect Pipelining in a Throughput-Intensive FPGA Architecture

Amit Singh from UCSB presented this talk.

He noted that FPGAs still aren't attractive for throughput-intensive apps as they're slower than custom logic. Last year he proposed Wave-Steered FPGA fabric which was targeted at DSP apps. Now looking at fine granularity pipelining and pipelined interconnect.

Wave steering differs from normal pipelines because there are two out of phase clocks and so only one half of the data path is active at any one time. You map each node in the graph into PTL logic through a 2:1 mux, though it wasn't clear what this graph actually represented.

Implementing an area multiplier in 4-level, 2-level and nonpipelined shows that the 4 and 2 are about the same performance, significantly better than unpipelined.

You decompose a function into a network of wavesteered blocks, connected via pipelined interconnects operating at a single clock frequency. If there's long interconnect which is longer than the clock period then you add intermediate latches to ensure that signals come in at the proper time.

Wave steering starting condition: OK as long as there are no stalls inside a logic block, so late arrivals can be OK, and you might delay early arrivals to get the data to arrive consecutively.

Their FPGA device uses hierarchical routing with wide channels.

If you underuse the logic resources in a cluster that enables you to route random circuits.

Examples of DSP designs at 625MHz were given in terms of logic blocks needed, but data wasn't presented with any context. so was hard to interpret.

There's a tradeoff between number of latches used and frequency achieved. Also presented some data on power and temperature performance. Can save power with signal reordering technique.

It was clear that Amit knew what he was talking about but he left me way behind. I had no idea where he was going or why he was going there. He showed that he'd got Wave-steered blocks to work, but not any reason why people should be interested - what things did it give you?

Questions: were random circuits random combinational? Yes. Then how to handle sequential combinational? Someone else has looked at 1-hot encoding. Doesn't using flops and buffers to skew inputs lock you into a particular manufacturing process? Probably.

The Case for Registered Routing Switches in FPGAs

Deshanand "Des" Singh from Uni Toronto presented this talk.

He pointed out that his approach allowed signals to traverse long interconnect in more than one clock cycle, which would be handy.

Sequential retiming structurally transforms a synchronous circuit while retaining functionality and improving performance. You push registers around to cut down on critical path.

Represent synchronous circuit by a DAG; each vertex is a combinational cell and has a label with number ofo registers moved from outputs to inputs., each edge is a connection, and has a weight related to numbr of registers traversed.

Retiming problem changes weights by changing registers pulled onto or pulled off the wire, and is a constraint satisfaction problem. Ensure weights are nonnegative. Timing constraint that each path with nonzero delay must have weight at least 1.

We retime after place&route. We then know routing delays completely, and they dominate gate delays.

Propose two architecture changes to help. A registered routing switch inputs a signal, and can either register it or shove it straight through. Also add a register per logic block. Gives the retiming algorithm greater freedom to move stuff around.

Two stage mapping algorithm to use these changes: retiming aware routing which puts long routes onto registered routing tracks,and architecture constrained routing to retime circuit given constraints of the architecture i.e. where the spare registers are.

He noted the extra constraints on the retiming problem regarding weights and number of connections.

The algorithm is to enumerate constraints and store them, then thow out satisfied constraints in an iterative procedure.

The retiming aware routing routes the circuit normally to start with, then pretend all tracks are registered, then permute routes on a track by track basis, retiming and seeing if that helps, measuring criticality for each track. Sort by criticality and then permute them so the most critical ones end up on routes with registered switches.

Experimental results weren't immediately clear from the crowded graph, but the best area/delay tradeoff was around 0.25 giving a 12-20% speedup.

Questions: are the results critical path? No, they're normalised speedup. What if you allow tool to add extra registers increasing pipeline latency of design? Tradeoffs weren't too good.

Lunch

While waiting for lunch I had a chat with Russ Tessier from Uni Massachussetts, then after gobbling down food went for a stroll up and down Monterey's main shopping streets to stretch my legs and get some fresh air. You can have too much of lecture theatres.

I wandered out to Fisherman's Wharf and spent a very pleasant hour watching all the wildlife in the harbour. A sea otter was bringing up shellfish from the harbour bottom, breaking them open on its chest and trying to fend off a hungry seabird. Seagulls and pelicans buzzed past, and seals were draped over pretty much every available surface.

Several seals were swimming around the harbour. One big brute came right underneath the pier I was on; he was easily eight feet long, bulky but cruised past very gracefully. A posse of seals were camped on a buoy by a moored ship, and occasionally barked loudly when an intruder attempted to join them.

Issues in FPGA-based Systems

Chuck Stroud from Uni North Carolina chaired this session. His accent was to die for; a real high-quality Southern drawl.

Configuration Compression for FPGA-based Embedded Systems

Andreas Dandalis from USC presented this talk.

He described the problem with configuration memory. FPGAs contain a configuration controller along with the configuration memory which holds a new configuration. SRAM-based FPGAs use external memory to store configurations - too big to hold inside! Virtex needs from 0.34 to 33.52 Mbits!

You can compress the bitstream offline and decompress it on the fly. Decompression rate determines configuration time. Can't compress lossy! Use an LZ algorithm, dictionary-based. Not as efficient as Huffman but is simpler.

LZ looks for longest possible string of symbols previously seen and replace it with the dictionary index. Several variants; we'll look at LZW which checks globally rather than locally.

He referenced some previous relevant work by Hauck et al.

Compressing, we compress the config bitstream in reverse order.

Once dictionary built, we remove unreferenced entries and merge common prefix strings. There's a bit of a tradeoff for memory requirement vs index expansion.

Decompressing, we use an index and dictionary computed by the offline compression, decompress the data and unreverse the config bitstream. The reversing is to avoid needing a stack for decompression.

Experiments done on a Virtex for some AES candidates and DSP algorithms, with bitstream sizes from 1.7 to 6.1 Mbits and with an 8bit symbol length.

The pruning of the dictionary can, for our examples, take compression to 80% of original size, close to the lower bound of 70% imposed by information content of the data.

In future looking at compressing a set of configurations rather than just one.

Questions:how sparse were the designs? They used the smallest device that the design fitted into, so "not very". What did dictionary entries look like? Very flat tree, so no "magic words"!

A Memory Coherence Technique for Online Transient Error Recovery of FPGA Configurations

Wei-Je Huang from Stanford Center of Reliable Computing presented this talk.

Wanted a dependable system by using reconfiguring property of FPGA. Need concurrent error detection, be able to recover from the error whether transient (this talk's focus) or permanent.

Partition the configuration columns of the (partially reconfigurable) FPGA into configuration frames. These contain LUT entries and interconnect states.

For a dependable dual system, have two FPGAs coupled to separate memory and EEPROM, running different apps simultaneously. Each FPGA has a controller, monitoring the error signal from the other FPGA. If error detected, the controller sets off the other FPGA's transient error recovery.

Transient faults caused by radiation. Two categories: configuration transients (errors in config data, so look like permanent faults) and system transients (errors in user data).

To detect config transients, use config readback and comparison with period background scrubbing. Correct by writeback of config.

To fix system transients, use roll-back or roll-forward.

Main topic - memory coherence issue. LUT can be either fixed logic or user memory modules. For speed, do two concurrent processes; system level user ops, and config read/writeback. Could then have a memory coherence issue (race condition).

Several strategies possible to fix this. Segregated memory problems been suggested before, but hsa extra P&R constraints. Instead use stall-when-write or dirty bit strategy.

Stall-on-write locks columns, and stall user op when it tries to write to a locked column. Dirty bit has a state register to indicate whether shared memory modified by other process, so you know when not to overwrite.

Time overhead for these two approaches: 1 faulty frame in configuration gives much smaller time overhead for the dirty bit strategy. Larger RAM size doesn't make much odds. Same for different read/write latency.

Note that we need a mapping betwen user memory location and physical location.

Area overhead was 24% extra for a codeword FIFO with dirty-bit strategy, time overhead less than 0.1%.

Questions: technical one about Virtex bitstream corruption from a readback for that columm; you need to stop the system clock. Apparently that's not happened for Mr. Huang. Taken offline.

Run-Time Defect Tolerance using JBits

Prasanna Sundararajan from Xilinx presented this talk.

This is a software technique to tolerate defects, using JBits tools from Xilinx and testing on Virtex FPGAs. It facilitates runtime reconfig in the presence of defects.

Defect tolerance helps manufacturers to increase yield, being able to use defective parts. Previous work has been done in both hardware and software. Software approaches map circuits around known defects.

Need to be able to find the defects, bypass them have a solution that works with current FPGAs, and allow fast circuit customisation for runtime reconfig.

JBits has user code in Java, allows fiddling with individual elements in device and is supported by a range of tools and libraries. There's a hardware interface to support this fiddling.

Why JBits and RTPCore? Run-time parametrisable inputs, fast circuit generation and customisation, and supports runtime reconfig.

RTPCore has three modes to allow skipping of defects: skip CLB, skip column and skip row. Depends what your timing and alignment constrainst are.

Finding defects is hard, so we'll work out how to bypass known defects first. (Uh?) Defective CLBs aren't too bad to track, defective wires are tolerated by marking them as "used" in the routing database.

They built a defect tolerant constant multiplier. Couldn't find a suitable defective board which supported JBits so marked random CLBs as buggered.

Useful for lowspeed designs, moderately dense, in FPGAs like Virtex.

In future solve the problem with timing issue, get a proper test methodology with JBits, and work on on-line testing and fault tolerance.

Questions: won't there be a problem doing this in the field, maintaining such a defect database? Will probably need hardware support.

Coffee

Final bunch of posters:

Applications in Image/Video Compression

I had no research interest in this area, and had seen the relevant posters, so skipped this session and left early in order to get up to San Francisco before dark.

The Return

I had the car out of the garage just after 3pm, and after a bit of faffing around in the back streets of Monterey stumbled onto Highway 1 northbound.

The roads weren't too busy to start with, and it was a still and bright afternoon as the road wound around the edge of the bay towards Santa Cruz. There was a brief hold-up in the city itself - some traffic lights had broken so reverted to a 4-way stop - but once clear of Santa Cruz the other cars faded away and I sped further north.

The sun was on its way to setting, the waves were rolling over the rocks and a little early evening mist hung over the edge of the coastline. The scene was, and I use this word not lightly, beautiful. I'd move house to here in a shot, if only they would learn how to make tea properly.

At Half Moon Bay I swung east onto 92, which brought me over the hills and presented the far mountains with a clear cap of snow on them. California was definitely having a peculiar February, weather-wise. 92 brought me to 101 and I headed south, away from the airport, reasoning that accomodation would be cheaper that way.

Motel 6 had a newly renovated set of rooms about four miles down the road, so I pulled off at that intersection, topped up my gasoline and went to see if I could find a room. As it happens, I only just squeaked in; the clerk (Mae) commented that for some reason today it had been "just crazy". I asked her what time in the mornings 101 got horrible. She found this highly amusing. "Weeeell," she said, "it's haaarrrrrible around six thirty..."

The motel had a Hobee's attached and I went there for a plate of chicken nachos and a beer - nothing special, but a welcome plain meal after two days of conference food. I then went to my room with the intention of sacking out, only to be foiled by back-to-back new episodes of Buffy and Angel. Well come on, you'd need to have superhuman control to ignore an opportunity like that...



Web pages maintained by Adrian Hilton