The second day of FPL started at 8:30am, half an hour earlier than yesterday and a cruel trick to play on anyone who'd over-indulged at the banquet last night. That said, the main hall was reasonably full for the opening keynote.
Another over-long title, which could for instance hve been shortened to "Where are the FPGAs in 3G mobiles?" Jan Rabaey from UCB opened the session with this. There was a small question in the back of my mind about why a US-based academic was talking on this topic when the 3G mobiles and networks were being designed and built mostly by European companies; but hey...
He pointed out that the wireless market was growing pretty much in a Fibonacci series, and asked where we were going from today's voice-oriented market. The answer was "data". He said that SMS had really taken off in Europe, and i-mode in Japan had had simiar success, indeed more than WAP to date.
For timescales, he reckoned 3G phones in 2003, with GPRS coming in next year and Bluetooth the year after.
Compelling issues in wireless; performance. The receivers are mostly digital, high data tates come with major processing requirements, and you need better spectral efficiency because it'll be crowded.
He presented a block diagram of a wireless transceiver, going through source and channel encode/decode, needing about 7900 million ops per second on a single 384 kbs UTRA W-CDMA channel. Wireless needs are rising far faster than Moore's law! It's an arithmetic progression.
Even then, the efficiency of processors in terms of energy per MIP is dropping - though he didn't say which processors he'd been looking at.
Second issue, after performance, is that you need to cut down on energy dissipation because you drain batteries too quickly, cutting down mobility. Also true for base stations! And battery capacity is not growing anything like as quickly - only 5% per year.
How to solve? Trade off efficiency and flexibility. At one end is direct-mapped hardware (ASIC) and the other a general microprocessor. In the middle is a HW-reconfig processor and a SW-programmable DSP. There are 3 orders of magnitude in the ASCI-uPROC tradeoff.
He gave an example of a digital baseband receiver on a DSP and an ASIC. The ASIC used 1/150th of the power!
The third issue is flexibility. There are many wireless standards, lots of unpredictable services, adaptive solutions use the spectrum more efficiently, and cost-of-design, time-toomarket issues stop us making an ASIC for each thing.
He showed data rates of 2G (e.g. GSM) at 9.6kbs, 2.5G (GPRS) at 64-384kbs, and 3G (WCDMA) at 384-2000Kbs.
Currently the infrastructure base stations have a bunch of cards with a bunch of ASICs to cope with the different standards. He showed advertising material of new phone/PDA services. [But out of date!]
For architecture reuse, he pointed out that reuse comes in generations: standard cell, then IP blocks, then architecture, then ICs. We're getting hold of generation 2 at the moment, and hope to manage architecture reuse RSN. Like the PC, we want everyone building on much the same motherboard. This gives us a bunch of standard SW too.
What platform-based design does is to protect the designers from too many choices! Jan envisages a platform with a small RF front end; analogue doesn't scale! There's an embedded uPROC and DSPs, FPGA, reconfigurable state machiness, a dedicated DSP and a reconfig datapath. It's heterogeneous because there are very different tasks and timescales over the whole phone; contrast teh UI to the antenna! It's reconfigurable because you can do a wide range of things quickly and efficiently.
Why reconfigurability? Low-energy roadmap. Voltage must be considered as a design variable, and you must reduce waste (switching capacitance). This is actually best-implemented in an ASIC, but a reconfig platform (when implemented correctly) can do a good job.
Why not do all mobile processing on FPGAs? Lots of boards, energy consumption will be high because of heterogenity. Should be able to be reconfigurable at higher level e.g. datapaths rather than just bit level.
He described the Pleiades reconfigurable architecture which was a first stab at a mobile-time reconfig platform with reconfigurable interconnect between blocks.
The reconfigurable chip MAIA presented last year had an ARM 8, some FPGAs, MACs, ALUs, a low-power voice code, and ran at 40MHz at 1V. It was a factor of 6 less efficient than an ASIC in terms of power, but much better than a conventional uPROC.
The Chameleon reconfig processor is nearly all reconfig in terms of area usage, but does have an embedded uPROC and a fast bus. It can handle 50 CDMA channels or 16 UMTS, with the different phases overlaid on the FPGA.
The MorphICs Dynamic Reconfigurable Architecture has a bunch of fixed core plus a DRA block to do WCDMA, CDMA, GSM etc. They say: first focus on apps and algorithms, then built set of configurable parameterisable block to do it well; don't add too much flexibility!
Not just dataflow benefits; in network and protocol stack processing you get the same efficiency gains.
So next generation wireless systems need flexible high-performance energy-efficient platforms, and reconfig platforms can provide this.
Change title to "Making silicon squishy". John McCaskill from GMD Sankt Austin. A very interesting talk.
He compared ASICs, FPGAs and evolvable electronics to interfaces and the biomolecular domain; molecular electronics, configurable DNA computing and molecular evolution respectively.
Evolution with FPGAs; they've build a big machine with 150-odd FPGAs to do simulations of molecular evolution on the timescale of the lifetime of Earth. It uses probabilistic kinetics, processing individual sequences at monomer level. The machine was built in 1993 and outperforms current workstations by 3 orders of magnitude.
Central concept of a "construction system" which is a collection of complex components which interact to reproduce one another using a common mechanism, similar to life as we know it. Self-replication is do-able; replicating catalysts as well is tricky!
On the FPGA array they assembled a large 2D array where a complex of long molecular strings were interacting; in 3D you get clusters occurring which are key to replication.
2nd generation computer "POLYP" has 16 boards with optical interconnect. Looking at self- and autonomous- replication.
In microfluidic systems they built 0,1, 2D microflow reactors. Important element is to change the direction of the flow; the microvalve is the key component here. Problem is that the valves are still about 1sq mm - too big!
Trading off flexibility vs Density you go from dynamically switchable valve arrays to full custom microfluidics, the densest way to create a routing pattern. Compromises with bistable plugs or valves available, under increasing controllability by user.
You have a set of reactor cell elements, with switching elements built up by plugs or valves, and reactant flowing through.
A universal switching element is built from programmable plugs with connections between upper and lower layers. Use these to route a network, for instance to combine substances in many different ways (combinatorial catalysis).
To implement these, use photomasks to polymerise the plugs, and to reset the chip use organic solvents or raised temperature.
Beebe did some work on hydrogels where switching elements expanded and contracted depending on the pH of substance flowing past.
Reconfigurable molecular computation is configurable DNA computing. 3 levels: intermolecular, intramolecular and supramolecular. Applies to hard combinatorial problems e.g. crypto, associative memories, distributed databases.
In DNA processing, waiting for two objects to meet takes ages. Put in lots of control molecules to match patterns and it's a lot quicker.
Maximal clique problem; find largest subset of notes in a graph which is completely connected. It's NP-complete. Define position-dependent encoding of 1 and 0 with DNA sequences ("CGGAT = 0 and the fifth position"). Implement graph with an optical mask where edges related to a binary map, and mask out where edges are absent.
To solve: start with all possible solutions, choose subsets depending on values at different sites. They flow molecules through a multipath datapath, clocked by a magnetic tugging on magnetic "beads".
Optically configurable DNA selection uses a photo-polymerisation process of attaching the DNA to the beads, optical valves controlling whether the molecules go through a particular "cell" in the reactor. They've got this working for single molecules.
The future lies in combining selection steps with thermal amplification steps. Aiming for dynamically reconfigurable DNA computing with full evolving reactors, DNA-controlled intelligent construction of other nano-devices.
It's actually quite cheap to make microfluidic systems, since you only have to make the frame once and then customise it for each app.
M Renovell from U Montpellier spoke about this. It was a collaborative work between Monpellier, Catalina and UCB.
He commented that this conference was mainly aimed at design rather than testing. The objective of his presentation was to give a flavour of the task of testing FPGAs.
The method of testing ASIC circuits is to proceed from high level to low level, specification to voltages. They try to verify the circuit before it's mass-produced - otherwise, very expensive if errors found! Once mass-produced, just check that the fab process hasn't introduced a glitch; normally, not a congenital problem but rather a spot defect caused by e.g. dust on a mask. They try to detect these by injection of faults (e.g. "stuck-at") using automatic test pattern generators.
They use the structure (netlist) of the circuit to generate tests.
FPGAs differ from ASICs because they're programmable. They locate an object (e.g. ALU) and test it by firing vectors into it. They try to cover all the objects in the device. Note that each reconfiguration of the FPGA takes a while, so is correspondingly expensive.
Aim is to get a structural approach to test FPGAs. ASICs have a heterogenous structure so they can test them block-by-block; FPGAs are homogeneous so that approach is doomed.
Trick is to minimise the number of test configurations (not necessarily the test vectors). Need to develop a specific test approach: consider logic cells, interconnect and memory separately. So test array of logic cells, then array of interconnect, then array of memory cells.
For each array, consider a specific fault model for the logic. Start with an elementary element, then the cell, then the array of cells, at each step building on our work for the previous steps and trying to keep number of configurations low.
For testing a cell, for instance, connect directly from pins to cell inputs and cell outputs to pins. But try to pack as many different cell tests as possible in one config to stop problem with many configurations. Could route through other cells which are defined as "PASS" configs.
Typical test of classic mux: stuck-at 0 and 1 for each input, in isolation. This gives 4 configurations, with two vectors for each configuration.
Only need 5 test configurations to test the standard Xilinx cell!
Next, connect logic cells in cascading XOR function so can test entire FPGA with 5 test configurations.
Routing-wise there are 3 configurations for entire array.
Phew! A long and brain-stretching first session. It had over-run by about fifteen minutes, so they took ten minutes off the break. Still, the on-tap coffee and currant bread enabled rapid re-charging of batteries. For the second morning session I went for the Mobile Communications track, since it coincided with my professional interest.
I'm prepared to pay a large sum of money (tens of schillings, at least) to anyone who'll give their pet project a cool acronym which is obscene in at least one language. Why hasn't anyone produced the Serial Multiplexing Evolutionary Generator? I'm disappointed.
Jürgen Becker from U Darmstadt was up again, building on his comms presentation from yesterday. He pointed out that by 2010 there would be up to 1.7 billion mobile phone users. New apps such as internet / multimedia are coming on-line.
Comms are cell-based, local, regional or global, with bandwidths dropping as you get further out.
Need to reduce power consumption, have flexible systems and high-performance systems. High end DSPs draw too much power. We're looking at 0.1 micron in 2005 and 0.05 micro in 2011. Becker reckons about 6000 MIPS needed for future DSP processing.
Flexible SoCs trade power and performance for flexibility, and have efficient CAD support. They minimise risk of obsolescence, which is handy.
DReAM is a SoC. Links to RAM, cores and ASICs via Amba High Performance Bus bridge. The bridge at 64 bit can handle 800Mbit/s peak performance. DReAM itself is coarse-grain application-tailored controlled by a global communication unit which can dynamically reconfig DReAM. It has local and global comms between blocks (RPUs). Configuration management units throughout the chip hold up to four configurations.
The RPUs have two Reconfigurable Arithmetic Processing Units with two dual-port RAMs and a single controller. A spreading data path unit does combinations work. The RAP can add, subtract, accumulate at 16 bit and multiply, divide, or MAC at 8 bit.
Pre-synthed IP cores can be mapped onto DReAM.
As an application, they did a CDMA-based RAKE receiver which can cope with signals being reflected off various objects, throwing out the inter-symbol interference. You multiply the data signal with a fast-oscillating pseud-random sequence which makes it easier to pick out the real signal. You have a number of correlation arms which are aligned to produce your original broadcast data.
Each RAKE finger takes an oversampled signal through a multiplexer and does some arithmetic magic. One finger needs 4 RPUs in DReAM. The PN codes are generated directly with spreading data path units of RPUs.
They put 4 rake fingers on DReAM with a data rate of 1.5Mb/s. Woo. Becker reckons that you need flexible SoC for 3G mobile and DReAM is an example of it. He went quite a bit over time, and probably didn't need to.
[The "traffic light" time indicator on the speaker's desk is a good idea, but having a klaxon go off when the red light comes on would be entertaining too.]
One guy asked a question, but it went on so long that I lost track of what he was asking. It seemed to be something about how flexible DReAM was. The chair had to intervene eventually.
Another long title. Herr Blaickner was from the home team, TI Villach. His contents slide was a warnng, with seven separate sections. In a scheduled slot of 25 minutes I don't think you want more than 15 slides, and 10 would be better.
He smmarised the upcoming system architectures for 3G mobiles. UMTS is the big one, dominated by WCDMA, but there are also the higher-bandwidth ones like HIPERLAN and DVB.
He suggested that suitable technology for the software radio and base-stations would be FPGAs with 2 million logic gates, along with a library of scaleable communication sub-modules. He pointed out that you could address the power consumption problem by combining FPGAs with static logic.
He showed typical architectures for a coherent digital receiver and a generic software radio receiver.
This talk was concerned with phase correction to synchronise waves. It was getting into electromagnetic wave theory at this point, which was losing me slightly. He showed the architectur needed to do a sequential parameter search using phase rotation. Apparently feed-forward structures are preferable because there are no latency restrictions, unlike feedback structures.
A key computation element is CORDIC, part of a pipeline. It munges vectors in a variety of useful ways. He showed how it fitted into a carrier phase synchronizer. He displayed error plots and performance results for various phase angles, though the scaling wasn't clear (if any was intended, I'm not sure if it was relevant).
They tried it out in FLEX and APEX devices, with APEX managing 250 Mb/s, twice as good as FLEX, both needing 1500-3000 units (cells? blocks?).
They're looking at building on this work for more advanced signal analysis.
He gave a design cycle, but it looked fairly standard, then summarised. He just about ran over time. Why was everyone running over today but not yesterday? Peculiar.
I wimped out of the next two talks as neither track offered stuff in or near my area. Instead I took the opportunity to get some postcards sent off.
I talked to some of the guys from TI Villach about Villach, and a bunch of us compared notes on Austrian, Californian, British and Texan weather. The conclusion was that people preferred Texas because you knew where you stood with it; also, the house prices were reasonable there.
For the first part of the afternoon session, I went for the student papers on the grounds of solidarity. Besides, I've noticed that it tends to be the students who come up with the really radical ideas. Some of them may not work out, but at least they've been tried! Thomas Hoffman, whose life my LaTeX files had made a misery, chaired the session.
Andreas Dandells from USC gave this presentation.
He gave the example of an FPGA munching on a problem and generating problem-instance dependent logic, then reconfiguring itself without any external host.
He gave motivation: reduced mapping time, reduced configuration time. He also presented the other side of the coin with the issue of feasibility of this approach given current commercial device limitations. His work aimed to make self-reconfig on commercial devices possible, with device-independence and performance comparable with regular designs.
The basic approach is to abstract the dynamic logic onto the dynamic reconnect data, so you don't have to access the configuration bitstream.
Logic functions abstracted using "logiclets", and the connections between them are what is changed.
Each interconnect has an addreess which is stored in memory. They use on-chip memory to store it. They've tried string matching, shortest path and genetic programming approaches.
For string matching gave FSM which was transformed by changing which states it hops back to (the "next state" logic). They got it running at 110MHz on a Virtex in one context.
For shortest-path they used the Bellman-Ford algorithm, with an adjacency list based mapping in data memory. Again, this went into a Virtex and ran in better O() area and time efficiency than with Dandalis's work on this.
For generic programming the known problem is that the program representation is specific to the problem. If we evolve on the host, there's a reconfig time penalty. On-chip evolution chooses terminal and function sets for a tree representation of the program. So your evolution logic just changes which trees are trialled.
The implementation performs as well as contemporary implementations but occupies lots more area. But it is feasible on commercial devices. He outlined some future work.
Kai Woska from U Siegen gave this presentation. He's an undergrad but gave a confident presentation, kudos.
The reason why the model was developed was to gain knowledge of how the FPGA worked, and to be able to trial changes to FPGA architectures.
The actual FPGA has 4096 cells, 5 layers of routing, and variable databus width.
In Verilog they designed in a hierarchy of modules. The main control logic is a simple state machine.
He noted that the XC6216 has wildcard decoders which can have more than one output line active; they had to do some structural investigation of the real decoders to produce an efficient implementation.
Some of the register handling had the modules auto-generated by C-code, since over 2000 similar instances were present in a regular pattern.
For testing on an UltraSparc 1 with 576MB RAM, using Cadence Verilog XL, they had 100,000 modules using 800MB+ RAM and taking 1 hour to simulate!
Instead used NC Verilog, giving a 70MB simulation snapshot. Simulation could be done in seconds. However, there was no signal tracing so debugging was hard.
They used the gained information in work of generating a fault-tolerant FPGA.
Jernej Andrejas from U Ljubljana spoke now. He was looking at DSP implementations of multiplication and division. He used VHDL, Xilinx devices and Synopsys tools.
The devices they generated were parametrised by sizes of operands.
For parallel multiplication they used carry propagation. They formed a carry-save multiplier from several carry-lookahead adders.
The slowest resulting circuit on an XCS40PQ208-4 was the basic one; adding CLAs made it better, adding carry-save made it best, and VHDL multiplication operators were a little worse than carry-save. It did, however, save quite a lot of space.
A pipeline multiplier formed similarly was fastest because there was a smaller carry delay time. Parallel-serial multipliers were more compact and had approximately linear area growth.
For a 16×16 bit multiplication, 429 CLBs were used by generic pipeline; bigger than LogicCore macro or parallel-serial multiplier but quicker.
They built a multiplier according to ADSP 2181, which gives a product valid after fewer periods than max given certain constraints on the input. It's smaller and faster than the classic multiplier. Virtex allowed fewer CLBs and increased frequency.
A parallel non-restoring divider gave 227ns delay for 16 bit divided by 6 bit numbers.
So FPGA functions are not as fast as they are in DSP processors.
[Mathmo hat on: why bother? We know that the problem is NP-complete!]
Andreas Dandalis from USC was back! He stated the well-known SAT problem of satisfying a binary equation in n variables.
They customise hardware for each SAT instance, use massive parallelism. There are a number of known algorithms for SAT. Backtracking algorithms tell you whether the search is complete, which might be handy.
He elaborated on the backtracking algorithm. A key timing for solutions is the unit propagation time (the time for a decision to be checked and implications to be found.) In this solution they aimed to have multiple pipelines with the goal of speeding up the UPT.
The basic architecture was a main control unit to pass out data and control backtracking, and PEs to check clause constraints and perform indications.
To enable parallel pipelines they broke clauses up into groups so each pipeline had an independent variable set. At the end, they merge the decisions to get a consistent result.
They modified the algorithm to use these parallel pipelines, where a conflict in any pipe caused all pipelines to backtrack. The clause area dominates any FPGA implementation. The speedup is complex to estimate; he gave a formula but it was big.
They built a C++ simulator to simulate architectures so that they could find what worked best.
Getting pipelines up to 65 gave a speedup of 10 over single pipeline on DIMACS test set. There's a balance to strike in number of pipelines. You could measure parameters at runtime and reconfigure dynamically for the optimum number. This gives the best results.
Everyone was finishing well within time here, which was a nice change from this morning.
This was presented by Abdellah Touhafi from Vrije U, Brussels. New title: "The Hydra".
He noted that FPGAs have a fixed granularity, and so does the reconfigurable computing system itself. If apps can be nicely pipelined and structured, fine; if not, you get gate bloat and performance drop. Memory bandwidth and interconnect latency can also be a problem.
A similar performance drop is found in parallel computing systems, known as "Amdahl's Law". It says that you cn only give so many special resources and expect performance gain. With FPGAs there comes a point where comms overhead swamps processing time.
Using a run time reconfigurable system, if you are careful and make it scalable you can change the system granularity and hence get close to the optimum architecture to solve the problem.
An app is seen as a set of communicating contexts, each of which accomplishes a certain task. A context is a specialised circuit with behaviour, state set, input ports, output ports, and if more than one node is needed then there is a need for synchronisation.
Nodes are built around a scalable reconfiguration handler to send and receive synch info to/from nodes, saving state data and handling data exchange. The dynamic reconfig data path implements the active context (partially) and has four near-neighbour connections.
The reconfig controller manages resources without great overhead and is general enough to apply to a range of apps. It is implemented using a fast FPGA rather than a RISC core so no extra comms links are needed, and can be specialised for target apps.
Between nodes are dedicated sync lines for nanosecond syncs.
He presented diagrams of how the datapath, memory layer and controller appear. It is implemented on one SUReCA board with four computing nodes, one master controller, three clocks, and multiple boards can be connected.
The master controller links to the host via EPP. It controls the programmable clocks' rates, and generates the virtual pin control signals.
The datapath layer is implemented with a XC6264-2 RPU.
The control layer is a scalable RISC on a XC013E-3 at 20MHz.
The local databuses of the four nodes share one bus controlled by the master, and it sets them up like this. Nodes are in a pipeline. Interconnected boards can form pipeline or mesh.
There is 4Mb of static RAM per node, 16 bits wide.
The session finished well early so I had time to look around the posters. A reasonably wide variety of stuff including some educational curriculae for teaching FPL subjects.
For the last session I picked the Applications track, wanting to see how FPGA theory was being applied in the Real World.
Yamamoto from Tokyo Denki U gave this presentation. He noted that discrete time Markov chains, queuing models and Petri nets have been used to analyse prallel systems.
Problems: DTMCs have a huge number of states, queuing model requires a large network so it's hard to solve analytically. Instead can apply Monte Carlo simulation, but still takes a long time.
They proposed and implemented Reconfigurable Stochastic Model Simulator. Uses Taico, a descriptive language of DTMC and queuing model. Translates to HDL and then synthesisted.
The user writes a model of target parallel system in Taico, then translated to HDL to produce a sim circuit. This is laid out and executed on FLEMING, a reconfigurable machine.
For the Taico description, divide it into elements operating in parallel. Describe each element as a DTMC with some interactions, and then specify what you want to measure during simulation. He presented a simple state transition diagram of a pair of elements and the description of one element in Taico.
The queuing model of the target system is built up by a construction using basic components such as call source, branch, selector etc. with specified parameters (e.g. max number of items in a queue).
The Taico-HDL translator is written in C. The resulting simulation circuit is a set of elements each reading from a RNG, reading in other element states and writing out its own new state value. Queue elements have fairly intuitive constructs (e.g. a queue is a FIFO buffer).
The RNG is a 32-bit generator.
FLEMING = Flexible Logic EMulatIon eNGine. It has six Reconfigurable Units (XC5215) and 3 Switching Elements (Lattice components).
They tried modelling the arbitration protocol of IEEE Futurebus and multiprocessors with a few different architectures. The model descriptions are fairly simple and readable in appearance. For a big circuit, the user can divide it into several devices.
The major time was logic and layout synthesis (10s of minutes each) and could run Futurebus simulation at 9.1Msteps/set in 345 CLBs = 10350 gates. Random number generator takes 243 CLBs at 3.2Msteps/sec.
Compared to UltraSparc 300MHz workstation, speedup of simulation was around 2 orders of magnitude. But remember overhead of synthesis! So system is useful if huge number of steps required.
Liam Marnane of Cork U presented this talk.
3DCam is intended for electronics inspection of PCBs. 80MHz pixel rate from sensor, 10MHz pixel rate for 3D data and a 1kx1k sensor array. The aim of this project is to show it's feasible to be used.
The camera uses a liquid crystal fringe projector generating a sine wave and projecting it onto the object changing the phase to get 8 images, and generate 3d data from that.
Need fast accurate FLCs, fast CMOS sensor, fast hardware for fringe evaluation, high speed host interface.
The sensor & evaluation hardware evaluates the 3D data from the fringe images by using an FPGA, using Phase Measuring Triangulation (heavy trig).
Need an arctan of a fraction as the key calculation. Then use Chinese Remainder Theorem to extend height measurement range.
Needs to store 8 frames of data, subtract, do the arctan, do the Chinese remainder, and tidy up. There must be null values assigned in places to stop overflow.
He showed the gates used to do subtraction, arctan (using the CORDIC evaluator, fewer pins required than for a lookup table). The Chinese remainder theorem was a bit hairy but still only basic operations. The finetune uses external SRAM to fine-tune the height result.
Can use a serial architecture so only need three frames storage, and get same latency as parallel version. Using Virtex and Virtex-E FPGAs because -E has the LVDS on it. Parallel is 50% bigger than the serial in terms of slice and pin resources.
The CORDIC is slowest part, coming in above the required 80MHz speed at the Virtex -5 rating.
So the Virtex FPGAs can hack it. I/O mainly determines FPGA selection.
Mr. Melnikoff from U Birmingham presented this. They are trying to implement a speech recognition system using programmable logic. He noted existing software products such as ViaVoice. Also wanted to find out about parallelisable algorithms on FPGAs and whether speech recognition was better than in software.
First, pre-process to throw away useless data. Then decoding: turn observations (filtered data) into phones (sub-word units). There are 49 basic phones for English. Then post-processing to turn phones to text. Difficult stage is decoding.
Speech recognition theory: sequence of observations Ot. Each model Mi represents a phone. Find sequence of models M which best describes O. Can't compute directly so use a Hidden Markov Model. Purely statistical, finite state machine. Looking for most probable sequence of HMMs.
This sequence of HMMs tells you the phones which have been spoken. Speech recognition uses Viterbi decoding: some FPGA cores do this, but they aren't up to the scale that Birmingham are using. Described using a state-time trellis. He showed the macro and micro versions of the decision nodes. All ops can be reduced to comparisons and additions by working in the log domain. There's also a lot of locality and parallelism.
In the past, used multiple PEs, one per HMM node. Big issue in past was load balancing. Not an issue with FPGA.
He showed the architecture, and a synthesised node column. Targeted at an XCV 1000 with 49 phones and statues. Overflowed the LUTs by about 25%, LUTs by about 15%, BlockRAMs by a factor of 6. Would run at 84MHz.
Could sample at a mere 100Hz so have 10ms to process. Decoding rate implies we could roughen it down.
Could store off-chip but have 2500bit bottleneck! Distributed RAM/ROM possible but we're short of resources anyway. BlockRAM is ideal, but not enough space. So look at reconfiguration.
Cut the column into parts, and partially reconfigure at runtime. Can we do it in 10ms?
Motivation: current speech processing loads the processor so it'd be nice to take the load off.
In the future he's looking to integrate into a complete system, improve the recogniser to use pairs or triples of phones. Big node count increases! Need to investigate RAM vs run time reconfiguration.
I spent most of the evening figuring out some stuff about
compiling SPARK into FPGAs. I think I've got something, but really
need a good drawing platform to play with. Dinner was at the
Café Mosser again.
Wednesday 30th August
Checking out of the hotel this morning, I found that Mastercard was accepted after all. Seems like I got duff gen. Ah well, I paid in echtes Geld anyway.
The walk to the Congress Centre seemed especially long this morning, probably because of having to schlep a suitcase full of hiking gear along with me. I took it as today's exercise. Since most of the rest of the day would likely be spent on my butt, I had to take aerobic activity where I could find it.
Tom Kean from Algotronix gave this talk, subtitled "Opportunities for the new Commercial Architectures". It took him and several more of the world's finest FPL minds a good 15-20 minutes to get Windows displaying the presentation at the correct resolution on the projector. So next time someone spouts off about Windows' user-friendliness you are entitled to say "B******!"
[Annoyance for this year's conferences; mobile phones. GSM means that you can annoy people Europe-wide. Customisable ring tones don't help.]
Anyway, Tom's friendly Scottish burr finally opened his presentation. He summarised Algotronix's work over the past couple of year. It's been mainly "soft" work like IP research, architecture investigation and some security applications.
He listed the 6 Silicon startups who are chasing Altera and Xilinx, along with a summary of what they were doing. There's a wide range of business models and architectures.
He described the "first wave" of FPL devices, based around PAL, where AMD triumphed in 1984 by buying Monolithic Memories and its 22V10 architecture. This echoes today's LUT-based architectures.
The second wave was Xilinx and its LUT. It took on AMD and won because of two inflection points: CMOS let you put everything on one chip, and they had a fabless business model. AMD stagnated, and left the business in 1999.
He proposed that two more inflection points were appearing. You can put an entire system on chip, and the Silicon IP business model reduces barriers to entry for new companies. He thought that IP was better for customers than companies.
The progress of FPL was categorised into three "waves": SPLD, FPGA/CPLD and CSoC. First ave gave write-once control memory, customising after manufacture. Second wave allows multiple reconfig during development. Third wave provides reconfig after deployment, needing security and implying a capable processor on the chip!
He outlined how an FPGA with downloaded updates sustains revenue much better than an ASIC.
He defined configurable System on Chip a scombining FPL with a standard uController and memory on a single die, giving a single chip solution for embedded systems. Close coupling increases performance, reduces board area, power and cost.
He gave the example of Triscend's CSoC, and described the likely business model change for third wave where instead of three parties providing broad segment IP cores, CAD tools and FPL respectively, one company provides everything for one narrow vertical market segment.
He classified SoCs as high density catalogue FPGA (system on programmable chip), configurable SoC such as Triscend, and specialised SoC with FPL as an IP block.
He noted the start of appearance of evidence that the third wave is on its way; significant investment into startup FPL, and mainstream FPGA vendors playing catchup.
The questions provoked a fair aamount of debate, and it was good to see that FPL still arouses passion in people.
This was a survey of the FPL activity in asia. The speaker noted that although much work has been done in Europe and USA, interesting things were also happening in Asia. He was focussing mainly on Japan in the talk, but other countries had stuff going on too.
He listed the main reconfig platforms, mostly from the universities but also from firms like Mitsubishi.
RM-I to IV from Kobe has been going awhile with the latest having 16 Xilinx FPGAs with local memory and central interconnect. Used for LSI CAD but laso for image processing. Up to two orders of magnitude greater performance than PCs or workstations.
Mitsubishi's RASH has up to 6 boards on a CompactPCI bus along with a single PC board as controller. More than one unit can be connected using Ethernet. Ech board has 8 ALtera Flex10Ks along with SRAM. The FPGAs are on a 32bit locla bus. It can DESsearch 2000 times faster than a fast Pentium PC.
So there are lots of platforms, some standalone and some attached boards. Most use Xilinx though some use Altera. For programming they use HDL, DataFlow C (for FLEMING) or Extended C.
For comms contrl there are also quite a few systems. NTT has made some, unsurprisingly! YARDS; yet another redefinable system with four FPGAs and can be programmed remotely over an ATM network. NTT's ATTRACTOR has a collection of cards (FPGA, LUT, ATM-SW etc) with 1Gbs serial link. For instance you can have an IP cut-through (tunnelling?) router on ATTRACTOR.
He described RHiNET which ais a flexible netowrk for interconnecting PCs. It uses a Flex10K as a primitive message handler. The optical interconnect on it runs at 1.5Gbps at 133MHz.
Reconfig systems have been found to give high performance and flexibiity, and allow remote reconfig. He noted lots of other apps given in the acompanying paper.
The new devices for reconfig systems which have been produced; FPccA from Hiroshima U, and Proteus Lite from NTT. FPccA has 12 ALUs running at 25Mflops. Proteus is for comms stuff and has a grid of basic cells with mostly horizontal communications. They've made a programmable ATM adptor with it.
Flexible architectures include Plastic Cell Architecture from NTT (see related short paper by Shizawa-san). Also several multicontext FPGAs, two of which are from NEC. PCA has in each cell a builtin part and a FPL "plastic" part with asynch operation and messaging. It can be dynamicaly reconfigured. First protoype is 0.35u CMOS on a 1cm^2 die.
He outilined how a typical multicontext FPGA works, noting Fujitsu's patent of 1990 of a multicontext PLD.
The DRL from NEC has 8 contexts each with 9000 usable gates, and can run DES in multiple contexts at 6 times PC speeds. Keio U has done some work on Virtual Hardware (WASMII) on the DRL, programmed using a data flow language which is used to generate VHDL. It uses 16 control blocks static on the DRL, and the other 32 being dynamic.
They have done small apps with this such as neural network emulation, running bout the same speed as current PCs. The context switching provides a significant voverhead, though the NEC DRAM MPGA might help solve this.
In the future they are looking at combined FPGA and CPU chips.
I talked to a German chap who's doing a PhD part time like myself. His firm gives him up to half his work time to study, which isn't at all bad. We discussed among other things the problem of errors, business screwups and programmer egos.
I went for the Methodology and Technology track to start the second morning session.
George Constantinides from Imperial gave this presentation. He was looking at using multiple wordlength datapaths across the array, making sure that when you bind resources (adders, multipliers etc.) to do these operations then the resources used are minimal.
The outstanding question is on high-level synthesis which he was going to try to address. He was looking principally at multiplication since multipliers take up much more space than adders, and are less easy to build up hierarchically in parallel.
He showed a bit-parallel multiplier built up by full adders. He showed how breaking it into smaller multipliers by breaking the carry paths resulted in increased resources, especially routing.
He showed a possible alternative by ensuring that each column or row of FA's only belonged to one of the two multipliers at any one time. There is increased complexity of the basic FA here, though. For a static reconfig design on a Xilinx 4K you get 938 CLBs, but 290CLBs for a non-reconfig multiplier, so there's a huge increase in area. So sharing small multipliers is a losing proposition.
Dynamic reconfiguration? Neither Xilinx 62K nor Virtex are really suitable.
Try to restrict to one multiplication on one multiplier in one clock cycle. Can now restate problem as a minimized graph colouring problem.
He pointed out that you might not want a minimm colour solution, giving an example where you could share a 32×16 multiplier only with more than minimum colours. But apparently this isn't a common situation.
He described the basic algorithm of getting a minimum colouring only looking at the structure, not wordlengths, then iteratively swap and inherit colours of nodes to greedily reach a local minimum.
Compared to blind approach and a wordlength-only approach, this new approach is much better. It's not quite as good as simulated annealing but is a lot quicker (several orders of magnitude).
So use a multiplier array for a single mult per clock cycle, and use this Swap and Inherit to give a fast good solution.
Milan Vasilko from Bournemouth Uni presented this. They're trying to take a behavioural description of a problem, then creating a design implementation on a reconfigurable system.
He noted the previous work done on partitioning system models. As regards behavioural partitioning, it allows good exploraton of the design space but can be simplistic about the target. There's also inaccurate resource consumption estimation.
Their motivation was that they'd been frustrated by having to go through many blind iterations, and wanted to estimate the configuration overhead during synthesis for many technologies.
A basic problem: D = (A+B)×C
Design constraints are target technology, system clock, latency / throughput and others. He drew the simple dataflow graph.
Problems then include resource allocation, resource binding (for shared resources). Once done, characterise individual computations with parameterised library routines.
For static logic synthesis, if your result didn't fit in the device then you've have to iterate synthesis. But for reconfigurable design you have to do it for each reconfig which is very frustrating.
Temporal floorplanning is used instead, trying to place the computations into the 3D space of FPGA area against time. Then construct a schedule to be fed into the controller.
You can now exploit resource-sharing at architectural and fine-grained device level, sharing resources between unrelated blocks.
Need to assume fully specced problem, synchronous 1-clock architecture, only route data between configs via the interface and shared regs,and external config controller.
Decided to use generic algorithms because it looks like they work in a reasonable time and seem suited for this kind of problem. He recapped how genetic algorithms work in general. They applied it to a complex labelled graph using a steady state GA with tournament selection, and initialised by a greedy algorithm with random temporal floorplanning generation and a consistency check.
For crossover they did module allocation, X-Y and X-Y-Z coords swap, and for mutation they additionally did 3D "shaking". After each application they did a consistency check and correction. Fitness of a design was based on latency (configuration and execution).
They tried it on an XC6200 with 3 small benchmarks against manually-optimised designs. He showed an example solution in pretty colours with the schedule; lots of time was devoted to reconfig rather than computing. Compared to manually optimised, between 0 and 27% inefficiencies were had.
He went well over time, mostly because there was quite a bit of info presented (e.g. genetic algorithm modus operandi) which could have been left out, and could have skipped through his conclusions more quickly.
Kasul Aoyama from NTT Communication Science Labs presented this talk.
He reviewed the main types of reconfigurable elements: LUTs, PLA and MUX. They propose a new threshold logic reconfigurable element based on vMOS. It can be quickly reonfigured and implement any symmetric function.
Threshold logic gates are a general form of conventional gates, working like a neural net neuron computing a weighted sum of inputs and triggering on a threshold. The sum goes onto a floating gate which goes through an inverter to do the thresholding. The vMOS inverter diagram was given but my VLSI knowledge wasn't enough to pick out the key elements. There seemed to be an element of CMOS to drive the output. It has a variable threshold set by some capacitance hanging off the main gate?
The element has k+1 4-input vInverters feeding into one big vInverter. The principle of operation wasn't entirely clear. Maybe you vary the threshold capacitance to reconfigure it? There are k variable going in and k+1 bits of configuration data going in.
He gave an example for XOR of two variables. The threshold was bounced up for state 1 (corresponding to 01 or 10) forcing result to 0, and dropped down for states 0 and 2, forcing 1, then it went through the inverter to give 0110 - OK!
He showed how you store configuration data using switches controlled by an Initialization bit to control switching to the control inputs.
They have designed an element on this principle computing any 3-element symmetric function. He showed the gate diagram. I didn't know what some of the symbols meant so can't give much of an interpretation.
They've simulated the circuit with 3-input logic functions like AND, NOR and it seems to work.
They have fabricated it and it's not too hard. They need 3 layer CMOS?
Another long title. How about "Change about, spot the laggards"? Andrzej Krasniewski from Warsaw U presented this.
Normal testing checks all possible modes of blocks, interconnections etc. as far as possible. Tries to check that the device will work in the field. Called application independent testing.
Problem is, insufficient coverage of timing-related faults. Can't examine more than a small fraction of interconnects.
Idea is to use FPGA preconfigured for app: application dependent testing.
Given a module doing a user function, decompose it into several sections, then test one section at a time using a BIST block to emulate the rest of the FPGA. Once test done, BIST logic "vanishes" so no overhead. They've got an exact method for doing this, called combinationally exhaustive testing.
They enhanced this to make it oriented towrds finding delay faults. Designed BIST scheme to satisfy timing preservation reqt, and modified each LUT to be more vulnerable to delay faults by modifying blocks to be XOR fns. He gave an example where a simple net of LUTs where before 2 out of 4096 pairs of inputs showed the problem of a single delay fault, but as XORs 64 of 4096 inputs showed problems.
But introducing XORs can make zero-activity nodes and make other problems, though in practice this isn't very likely. Untestable path delay faults can be missed. Also, it can flag false transition delays.
To solve this, modify LUTs so in-out transition pattern is preserved, as is blocking capability, but nevertheless output activity is higher. He showed how to do it but it's a bit hard to explain in text. You can get a significnt improvement in output activity. They have a library of optimal activity functions, the trick is to find one that has the right transition and blocking patterns.
They did a study on estimating the improvement from random testing with a thought experiment based on a network of 4-input LUTs, looking at worse-case and optimum activity functions for each LUT and introducing different delay faults. In terms of the sequence length needed to detect a fault with P = 0.999, the new method needed about two orders of magnitude fewer checks compared with worst case.
So this gives you much better testability to detect delay faults. The worst-case output activity functions in fact include AND, OR, NAND, NOR.
On the home straight now. The lunch was reasonably good, though the salad we started with was laced with mayonnaise which was annoying. I talked with guys from TU Darmstadt, Altera, IRISA and Mexico, comparing notes on research, the conference and things to do around Villach.
For the final set of presentations I went for the Applications track.
Shuichi Ichikawa from Toyohashi U presented this. He's been looking at solving computationally hard problems using massive parallelism. It's known that Boolean SAT is a suitable problem, but are there any others?
Yes! "Is a graph Ga isomorphic to any subgraph of Gb?" It's widely applicable to problems and is generally NP-complete. Ullman has proposed the most widely used algorithm. People have used a variety of custom hardware to solve it, or related problems.
He formulated the problem by using an adjacency matrix to represent edges. There's a formula for edge density which can tell you roughly how hard the problem is.
The enumeration algorithm checks all possible mappings of vertices, depth first search, simple but impractical. Ullmann's algorithm is a tree search with pruning, having a condition to satisfy at each node and pruning if it's not met.
He outlined the refinement procedure which was complex but apparently is the mathematical statement of "adjacent vertices in A are adjacent in B".
He gave an elementary circuit to provide partial checking of the refinement procedure, and showed it fully parallel. But it needs a lot of hardware!
Reducing the required hardware is done by adding sequencers to go row by row or column by column. You can also time-share hardware to some extent.
He proposes a newish idea, adding another necessary condition to assist pruning, that subgraphs of Ga are isomorphic to subgraphs of Gb.
Using a Lucent 2C FPGA he worked out the required hardware; with the most compact algorithm it's 160 processing units at 35.9MHz for 15 nodes in Ga and Gb. Execution time is 160 secs, compared to 1200 secs for software on a Pentium II. Area multiplied by time is about constant, with the proposed algorithm doing better than most other variants. They built a small prototype on a Lucent card which worked as predicted.
He noted that larger graphs could be managed but execution time would increase exponentially.
Martyn Edwards from UMIST presented this. They've been trying to implement SDF graphs in hardware and software.
SDF is usually used in digital signal processing. They handle regular computations on infinite streams of data. They're interested in lowish data rates with hairy processing. You define an app by an SDF graph (diagram given) in terms of actors doing actions and passing tokens.
He sketched the architecture you need to execute it, using a fifo to buffer actions. Some research going on about synthesis in SW or HW.
Does it help to use dynamic reconfigurable FPGAs? If so, what do we reconfigure?
He defined a valid schedule for the SDF which says what order the actors are executed in, and tries not to deadlock things. He defined the maximum size buffer for an edge, and noted that different schedules may have different total buffer sizes.
You then draw an SDF dependency graph which says what the token flow will be through the implementation and how many times each actor fires.
Various implementations given. One unit per actor does the example in 21 cycles, one unit for all needs 32 cycles. He showed the corresponding hardware needed for a unit with registers and muxes, mainly. Replacing FIFOs with register arrays takes about 25% off the cell usage on the given example.
Multiple units per actor possible; gave example of B being split across 2. Gain a little time, but still lots of idle time. Why not reconfigure, since data coming in slowly? Example increased time slightly but removed one unit. Tradeoff!
He showed the modified hardware to implement this, and the architecture of an actor. This later is done by a FSM for each, with a single global FSM to kick things off.
Future work includes trying a real example and playing with it to get performance improvements.
I sneaked in to the tail end of George Milne's talk on behavioural language compilation as I switched to the Compilation track. I need to look at Circal to see how it differs from SCSP or SRPT.
Valeri Skylarov from U Aveiro, Portugal gave this last lecture. He was looking at FSM with modifiable functionality. He's using RAM blocks as the main method of implementation.
He outlined how FSMs are formed and how they can be implemented in a combinatorial circuit. He gave an example FSM and its implementation, and noted that naive implementations use up a shedload of RAM.
To be cunning and cut down on data, use state encodings so that you recognise each state by looking at just one bit. This can cut down RAM usage by 50%ish. You can use unused bits to identify certain transitions. NB: in normal machines the number of inputs is normally greater than the number of state bits.
He showed a big state transition table describing his example. I didn't get what all the columns meant. Apparently you could implement it in only 2 CLBs.
He then presented a pseudo-architecture diagram whose purpose or meaning wasn't clear.
The main conditions for providing address codes were given: codes with different superscripts must be different, you want to find the minimum value of S (predefined size of codes) and you must encode the variables in each subset such that their indexes all fit in a subset of the bits. Or something like that.
I'm not entirely sure exactly when I lost the plot on this one, but I was well out of my depth by now.
Experimental results: he implemented a number of machines on an XC4010XL, and RAM-based FSM was pretty much always better than One-hot encoding, and a good deal better than binary encoding. Looks as if commercial FPGAs are good at implementing FSMs in RAM blocks.
Prof. Grunbacher did the thanks, Patrick Lysaght thanked the committee, and Prof. Hartenstein thanked the attendees. Und damit basta!
FPL had laid on a coach to take us to Klagenfurt, though it took some while to get out of Villach as we reversed up a one-way street to pick up a couple of delegates who had gone to their hotel. Then the driver stopped in at his depot, maybe to get his sandwiches or something. Anyway, we got to Klagenfurt in plenty of time.
The flight to Vienna was quite uneventful, though we landed with only half an hour before the Heathrow flight was due to take off. A reasonably rapid transit through the airport resulted. The Heathrow plane was quite full, and about ten of we FPL guys were on it. The two hour flight provided enough time for the airline to screen "The Bachelor" on the drop-down movie screens. Quite silly, but entertaining.
We waited patiently at the luggage carousel in Heathrow as cases bumped past. Eventually it became clear that no more cases were emerging, and ominously the only people left seemed to be people who'd also been on the Klagenfurt flight. Sure enough, our luggage hadn't made it out of Vienna with us. I thanked my lucky stars that my wallet and car keys were in my carry-on bag. One of the Embedded Solutions guys was concerned that his keys may have still been in his case. I hope that they weren't.
Anyhow, I picked up the car from long-term parking and drove home, arriving back at 11:30pm. Exhausted, I crashed, knowing that I had to be at work next day, and it was an early start...