Publication date: 04 May 2010
Customizable processor cores first hit the market in the late 1990s, and since then many improvements have been made in the way these processors can be customized. Much of the customization process is now highly automated, both to speed it up and to guarantee that the processor core can’t be broken along the way. So why should you consider customizing a processor core instead of using one “off the shelf”?
Some people think that all RISC processors deliver about the same performance per clock cycle. This assumption is wrong for processor cores that have been customized for a specific application. By customizing the processor, designers can get a significant performance improvement for each clock cycle. A designer can add custom instructions to a processor's ISA (instruction set architecture), which will increase the processor core's size, which in turn increases the processor core's average power dissipation per clock cycle. However, if the new instructions dramatically cut the total clock cycles required to perform a given workload, then the total energy consumed (average power multiplied by total execution time) can be substantially reduced.
Example: A 20% increase in power dissipated per clock cycle, offset by a 3x speedup in task execution, actually reduces energy consumption by 60% (1.2 / 3 = 0.4 of the original energy). This reduction in required task-execution cycles allows the system either to spend much more time in a low-power sleep state or to reduce the processor's clock frequency and core operating voltage, leading to further reductions in both dynamic and leakage power.
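The arithmetic behind that example can be checked with a short sketch. It assumes a fixed clock frequency, so execution time scales directly with cycle count; the function name `relative_energy` is made up for illustration.

```python
def relative_energy(power_increase: float, speedup: float) -> float:
    """Energy of the customized core relative to the baseline (1.0 = equal).

    Energy = average power x execution time. Assuming a fixed clock,
    execution time scales with cycle count, so the ratio is
    (1 + power_increase) / speedup.
    """
    return (1.0 + power_increase) / speedup


# Figures from the example: 20% more power per cycle, 3x fewer cycles.
ratio = relative_energy(power_increase=0.20, speedup=3.0)
savings = 1.0 - ratio
print(f"relative energy: {ratio:.2f}, savings: {savings:.0%}")
```

The same function shows where the break-even point lies: the customization saves energy only when the speedup exceeds the factor by which per-cycle power grew.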
What's the alternative to building the acceleration right into the processor? With standard RISC processors, the answer is to build your own RTL blocks for the acceleration. That means months of lengthy complex verification cycles. In most RTL blocks, the FSM (finite state machine) contains nothing but control details. And most of the design and verification risk is in the FSM, due to its complexity.
A late design change made to an RTL acceleration block is much more likely to affect the FSM than the datapath because the FSM contains most of the design complexity. Configurable processors reproduce the RTL block’s ability to have wide datapaths—which can be guaranteed correct-by-construction by the processor vendor—while reducing the risks associated with FSM design because processor-based FSMs are firmware programmable.
Adding functions to a Tensilica processor core never compromises the underlying base Xtensa instruction set, thereby ensuring availability of a robust ecosystem of third party application software and development tools. Configurable Xtensa processor cores are always compatible with major operating systems, debug probes and ICE solutions; and always come with a complete, automatically generated software-development tool chain including: an advanced integrated development environment based on the ECLIPSE framework; a world-class optimizing, vectorizing compiler; a cycle-accurate, SystemC-compatible simulation model and instruction set simulator; the full industry-standard GNU tool chain; and EDA synthesis scripts.
Customized processors can be used as an alternative to hand-coded RTL blocks by adding the same datapath configurations as implemented in RTL accelerator blocks. These datapath configurations include deep pipelines, parallel execution units, task-specific state registers, and wide data buses to local and global memories. This broad customizability allows configured processors to sustain the same high computation throughput and support the same data interfaces as RTL hardware accelerators.
However, control of customizable processor datapaths is very different from the RTL counterparts. Cycle-by-cycle control of a processor’s datapaths is not frozen in a hardware FSM’s state transitions. Instead, the processor-based FSM is implemented in firmware, which greatly reduces the amount of effort needed to fix an algorithm bug or to add new features. In a firmware-controlled FSM, control-flow decisions occur in branches, load and store operations implement memory accesses, and computations become explicit sequences of general-purpose and application-specific instructions.
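To make the contrast concrete, here is a minimal sketch of a firmware-style FSM, written in Python purely for illustration. The states, events, and the framing controller they describe are hypothetical and do not come from any Tensilica product. The point is that the control flow lives in a table that software can change, so a late design change is a firmware edit rather than a new hardware FSM.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical states and events for a small framing controller.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("IDLE", "start"): "HEADER",
    ("HEADER", "header_ok"): "PAYLOAD",
    ("HEADER", "header_bad"): "IDLE",
    ("PAYLOAD", "end"): "IDLE",
}


def run(events: Iterable[str], table: Dict[Tuple[str, str], str],
        state: str = "IDLE") -> str:
    """Walk the FSM; unknown (state, event) pairs leave the state unchanged."""
    for event in events:
        state = table.get((state, event), state)
    return state


# Original behavior: a bad header drops the frame and returns to IDLE.
before = run(["start", "header_bad"], TRANSITIONS)

# Late design change: retry the header instead of dropping the frame.
patched = dict(TRANSITIONS)
patched[("HEADER", "header_bad")] = "HEADER"
after = run(["start", "header_bad", "header_ok"], patched)

print(before, after)
```

In real firmware the table lookup would be ordinary branches and loads, but the design choice is the same: the datapath stays fixed while the control sequence remains editable.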
By using a unique customized processor, you make it much harder for competitors to copy your ideas. You get higher performance, lower processor operating power, and you'll have a version of a microprocessor that no one else can buy. In addition, no one else can get the automatically-generated matching software tool chain unless you provide it to them so no one can program the processors in your ASIC unless you allow it.
So even if someone else gets your customized processor, they can't take advantage of the task-specific ISA optimizations you've made unless you provide them with the tools to do so. Plus your optimized processor will get better performance, operate at lower required clock rates, and consume less energy than the industry-standard, fixed-ISA microprocessor cores.
Worried that others won't know how to program it? You'd be amazed how often customizable processors are in the data plane - handling functions that are never touched by the main operating system once the firmware is installed. You program the video firmware once and then it just works. Same for audio and communications. There are tremendous opportunities for unique optimized processors doing those tasks that no one else will need to program once you've optimized your firmware.
The process of creating a configured processor core for a specific application can be highly automated. Compilers are now available that can examine the C code for a particular task or algorithm and suggest processor extensions that will speed up that task or algorithm. These compilers can provide near-immediate feedback to a design team, which can significantly shorten the design cycle.
For example, Tensilica's XPRES Compiler can automatically analyze C code, identify critical inner loops and other tuning opportunities, and create many trial processor configurations with customized extensions that boost performance. The XPRES Compiler creates a graph showing the different execution-speed/gate-count trade-offs for the analyzed code. That way, the design team can decide what trade-offs it wants to make between added gates (area) and increased performance (cycle count).
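Choosing a point on that kind of trade-off curve can be sketched as a small selection problem. This is not the XPRES Compiler's algorithm or output format; the trial configurations and their gate and cycle counts below are invented for illustration, and `pareto_front` and `best_under_budget` are hypothetical helper names.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    name: str
    gates: int   # added gates (area cost), hypothetical numbers
    cycles: int  # cycles to run the workload, hypothetical numbers


def pareto_front(configs: List[Config]) -> List[Config]:
    """Drop any config that another one beats or ties on both axes
    while being strictly better on at least one."""
    front = []
    for c in configs:
        dominated = any(
            o.gates <= c.gates and o.cycles <= c.cycles
            and (o.gates < c.gates or o.cycles < c.cycles)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.gates)


def best_under_budget(configs: List[Config],
                      gate_budget: int) -> Optional[Config]:
    """Fastest non-dominated config that fits the gate budget, if any."""
    fits = [c for c in pareto_front(configs) if c.gates <= gate_budget]
    return min(fits, key=lambda c: c.cycles) if fits else None


trials = [
    Config("base", 0, 1_000_000),
    Config("A", 10_000, 600_000),
    Config("B", 15_000, 650_000),  # dominated by A: more gates, more cycles
    Config("C", 30_000, 350_000),
    Config("D", 60_000, 300_000),
]
choice = best_under_budget(trials, gate_budget=40_000)
print(choice)
```

With a 40,000-gate budget the sketch picks configuration C; raising the budget to 60,000 gates would buy the last 50,000 cycles at double the area, which is exactly the kind of judgment the graph leaves to the design team.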
Tensilica's tool chain automates the process of designing a configurable processor. This automation allows Tensilica to guarantee that the results are correct by construction. The development process includes the following steps:
1. Compile the original C/C++ application and run the XPRES Compiler. The XPRES Compiler analyzes millions of possible instruction combinations and presents the designer with alternatives.
2. The designer selects the "best" configuration for the target gate count. Optionally, the designer can manually refine the generated configuration.
3. Build the processor using Tensilica's standard Xtensa Processor Generator flow. The Xtensa Processor Generator creates an RTL description of the configured processor and generates tailored versions of all necessary software development tools including the compiler, assembler, debugger, and instruction set simulator. It also generates a C or SystemC simulation model of the processor and EDA synthesis scripts. No manual work is required to match software development tools and processor.
4. Compile the original, unmodified C code to run on the customized processor core. It will take advantage of all customizations.
Note that if an ASIC designer uses the XPRES Compiler to create a customized processor, no modifications need to be made to the original C code or any other C code to use these new instructions. The compiler automatically exploits those new instructions.
When you base your design, or parts of it, on a standard processor core, it's much easier for someone else to copy. After all, they can use that same processor core in their design. But when you customize a processor, even just a little bit, how's someone going to copy your customizations? That core is yours and yours alone. No other company can replicate your version of that configured, task-specific processor. This helps keep the design pirates away.
With Tensilica's Xtensa processors, when you customize them, you get a matching software tool chain that is optimized for your changes. While compilers for other processors might be able to run your software, they won't achieve the levels of performance you can get by using your optimized compiler on your optimized processor core.
Optimization adds special registers (sized to the natural data types of the tasks to be performed) and execution units that efficiently perform task-specific algorithms, often in one or two clock cycles. This design approach keeps clock rates and energy consumption low.
You say you're not a processor designer and don't know how to add registers and execution units to a processor? You probably know Verilog. With Tensilica's automated process, you just need to write a few lines of Verilog-like code. Our processor generator automatically figures out how to modify the processor to get the results you want.
A processor core’s main bus represents a significant I/O bottleneck – so much so that processor cores have lacked the I/O bandwidth required by many tasks performed in SOCs.
However, there’s a new breed of processor that relieves the congestion on the main bus by supporting other means for high-performance I/O. Tensilica’s Xtensa processors let you achieve data transfer rates that can match those of hand-designed RTL blocks. Tensilica offers four ways to directly communicate without using the main system bus.
1. The XLMI (Xtensa Local Memory Interface) bus is a simple, fast, single-cycle bus that performs data transfers much faster than the main system bus because it is not designed to support multiple bus masters. Instead, it can be used to tightly couple Xtensa processors used in dataflow instantiations, as well as integrate existing hardwired, high-performance complex state machine logic into the Xtensa processor’s memory space. It can be configured up to a full 128-bit width, delivering 3.2 GB/s (peak) low-latency bandwidth.
2. Ports act like GPIO (general-purpose I/O) and are wires that directly connect two Xtensa processors or an Xtensa processor to external RTL. Port connections can be up to 1024 wires wide, allowing wide data types to be transferred easily without the need for multiple load/store operations. Ports are particularly useful to convey control and status information.
3. Queues are like FIFOs and provide a high-speed mechanism to transfer streaming data without buffering. From the programmer’s viewpoint, input queues and output queues behave like traditional processor registers, without the bandwidth limitations of local and system memory access. Queues can sustain data rates as high as one transfer every clock cycle, or over 350 Gbits/sec per Queue added to an Xtensa processor.
4. Memory Lookup Interfaces are useful for connecting RAMs for table lookups or for connecting long-latency hardware computation units. Memories connected to these Lookup interfaces can be read and written directly from the processor datapath without using load and store instructions.
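The peak figures quoted in the list follow from interface width multiplied by transfers per second. The article does not state the clock frequencies behind them, so the sketch below works backward to the clock each figure implies. Two assumptions are made here and are not from the source: 1 GB means 10^9 bytes, and a Queue is taken to be 1024 bits wide, matching the stated maximum Port width.

```python
def peak_bits_per_second(width_bits: int, clock_hz: float) -> float:
    """Peak throughput with one full-width transfer every clock cycle."""
    return width_bits * clock_hz


def implied_clock_hz(bits_per_second: float, width_bits: int) -> float:
    """Clock frequency needed to reach a given peak rate at a given width."""
    return bits_per_second / width_bits


# XLMI: 128-bit width, 3.2 GB/s peak (assuming 1 GB = 1e9 bytes).
xlmi_clock = implied_clock_hz(3.2e9 * 8, 128)

# Queue: "over 350 Gbits/sec"; the 1024-bit width is an assumption.
queue_clock = implied_clock_hz(350e9, 1024)

print(f"XLMI implied clock:  {xlmi_clock / 1e6:.1f} MHz")
print(f"Queue implied clock: {queue_clock / 1e6:.1f} MHz")
```

Under these assumptions the XLMI figure corresponds to a 200 MHz clock and the Queue figure to a clock of roughly 342 MHz or higher, so the two numbers appear to be quoted at different operating points.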
All of these features, once specified by the designer, are automatically added to the Xtensa processor and are fully modeled by Tensilica’s Xtensa Processor Generator. The full behavior of the interface is automatically reflected in the custom software development tools, instruction set simulator, bus functional model and EDA scripts. And because it’s automated using Tensilica’s patented technology, it’s pre-verified and correct by construction, so there is no need to re-verify the processor.
In many industries, standards often change and new algorithms must be incorporated into your next product, or even an existing product. If you’ve hard coded the algorithm in RTL, there’s no changing that after the chip is made. However, if you implement the design in a processor, by definition you can make firmware changes after the silicon is made.
This is the standard reason to use a processor instead of RTL. But why is this an especially important reason to use customizable processors? Because, with customizable processors, you can embed the logic you were going to put in an external RTL block right into the processor’s data path. You can also customize the data path to the exact width required. If you need to process 56-bit data, you can process that all in one chunk rather than two less efficient 32-bit operations.
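The 56-bit case can be illustrated with a small model. On a 32-bit datapath, one 56-bit addition becomes two narrower additions plus carry handling; on a datapath sized to the data, it is a single operation. The sketch below models both in Python with explicit masks. It is only a behavioral illustration, and the function names are hypothetical; a real custom instruction would be described in the vendor's extension language, which is not shown here.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1
MASK56 = (1 << 56) - 1


def add56_on_32bit(a: int, b: int) -> int:
    """Add two 56-bit values using only 32-bit-wide pieces.

    The low 32 bits and the high 24 bits are added separately, and the
    carry out of the low half is fed into the high half.
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK24
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK24
    lo = a_lo + b_lo
    carry = lo >> 32                      # 0 or 1
    hi = (a_hi + b_hi + carry) & MASK24   # wrap at 56 bits total
    return (hi << 32) | (lo & MASK32)


def add56_native(a: int, b: int) -> int:
    """Model of a single add on a 56-bit-wide datapath."""
    return (a + b) & MASK56


x = 0x123456FFFFFFFF  # low half is all ones, so adding 1 forces a carry
split = add56_on_32bit(x, 1)
native = add56_native(x, 1)
print(hex(split), hex(native))
```

Both paths give the same answer; the difference is that the split version spends several instructions on masking and carry propagation that the right-sized datapath does not need.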
Your processor, therefore, becomes much more efficient, and the need to offload tasks to dedicated hardwired blocks of RTL is greatly reduced if not eliminated. So these functions, which previously were in RTL, are now software programmable in the processor itself.
Thinking about moving up to a bigger core because you need more processor performance? That bigger processor will not only take more area, but will also use more power. What if, instead, you started with a highly efficient base architecture and just added the performance where you need it? You’d get a lean, mean processing machine, without the overhead of a bigger general-purpose processor.
Sure, you could design your own processor or DSP. Or design your accelerator functions in RTL. But the design cycle usually takes over a year, and that figure might not include all the time needed to design the matching software tool chain – the compiler, debugger, instruction set simulator, etc. – if you’re designing a processor. And there’s no guarantee that the exact processor specification or RTL design you pick to implement is the best one.
Instead, you can use a customizable processor and compare and contrast different alternatives to pick the best architecture. Then, when you’ve met your performance/area/power goals, you can trust Tensilica’s automated process of generating a matching complete software tool chain for the exact processor you’ve designed. Hundreds of designs have been completed this way, and Tensilica guarantees that the software tools, as well as the processor, are correct by construction.
The engineering manpower needed to develop and verify custom RTL acceleration hardware is greatly reduced. A processor-centric ASIC design approach permits graceful software-based project-schedule recovery when (never if) a bug is discovered. And the time-consuming RTL verification cycle is removed. You just need to verify that the processor is doing what you think you asked it to do, rather than the standard hardwired RTL verification challenges.
With automated tools, why not try to customize a processor for your exact requirements? You’ll be amazed at the flexible options you have to get the efficiency you need from a programmable processor-based solution. With Tensilica’s tools, you can test out different configurations and options to see what gives you the best combination of area, power, and performance for your exact application.
No longer do you need to worry that you’re not a processor designer. Even if you’ve never designed a processor before, if you know what your application needs (what you might have put in an RTL block in old designs), you can use Tensilica’s automated process to design a processor.
No, you don’t have to go into the processor RTL and make any modifications. As a matter of fact, we don’t want you to touch the RTL that’s made from our Xtensa Processor Generator.
Instead, you use a Verilog-like language to describe the features you want to add. Our automated process then generates the processor extensions to match your requirements, along with the matching software tool chain.