<<<
:sectnums:
== Application-Specific Processor Configuration

Thanks to the processor's configuration options, which are mainly defined via the top entity VHDL generics, the SoC can be
tailored to application-specific requirements. Note that this chapter does not focus on optional _SoC features_ like
IO/peripheral modules. Rather, it gives ideas on how to optimize for _overall goals_ like performance and area.

[NOTE]
Please keep in mind that optimizing the design in one direction (like performance) will also affect other potential
optimization goals (like area and energy).

=== Optimize for Performance

The following points show some concepts to optimize the processor for performance regardless of the costs
(i.e. increasing area and energy requirements):

* Enable all performance-related RISC-V CPU extensions that implement dedicated hardware accelerators instead of
emulating operations entirely in software: `M`, `C`, `Zfinx`
* Enable mapping of complex CPU operations to dedicated hardware: `FAST_MUL_EN => true` to use DSP slices for
multiplications, `FAST_SHIFT_EN => true` to use a fast barrel shifter for shift operations.
* Implement the instruction cache: `ICACHE_EN => true`
* Use as much _internal_ memory as possible to reduce memory access latency: `MEM_INT_IMEM_EN => true` and
`MEM_INT_DMEM_EN => true`, maximize `MEM_INT_IMEM_SIZE` and `MEM_INT_DMEM_SIZE`
* _To be continued..._

=== Optimize for Size

The NEORV32 is a size-optimized processor system that is intended to fit into tiny niches within large SoC designs or to
be used as a customized microcontroller in really tiny / low-power FPGAs (like Lattice iCE40). Here are some ideas on how to
make the processor even smaller while maintaining its _general purpose system_ concept and maximum RISC-V compatibility.

**SoC**

* This is obvious, but exclude all unused optional IO/peripheral modules from synthesis via the processor configuration generics.
* If an IO module provides an option to configure the number of "channels", constrain this number to the actually required value
(e.g. the PWM module `IO_PWM_NUM_CH` or the external interrupt controller `XIRQ_NUM_CH`).
* Disable the instruction cache (`ICACHE_EN => false`) if the design only uses processor-internal IMEM and DMEM memories.
* _To be continued..._

**CPU**

* Use the _embedded_ RISC-V CPU architecture extension (`CPU_EXTENSION_RISCV_E`) to reduce block RAM utilization.
* The compressed instructions extension (`CPU_EXTENSION_RISCV_C`) requires additional logic for the decoder but also
reduces program code size by approximately 30%.
* If not explicitly used/required, exclude the CPU standard counters `[m]instret[h]` (number of instructions) and
`[m]cycle[h]` (number of cycles) from synthesis by disabling the `Zicntr` ISA extension (note that this is not RISC-V compliant).
* Map CPU shift operations to a small and iterative shifter unit (`FAST_SHIFT_EN => false`).
* If you have unused DSP blocks available, you can map multiplication operations to those slices instead of using
LUTs to implement the multiplier (`FAST_MUL_EN => true`).
* If there is no need to execute division in hardware, use the `Zmmul` extension instead of the full-scale `M` extension.
* Disable CPU extensions that are not explicitly used.
* _To be continued..._

=== Optimize for Clock Speed

The NEORV32 Processor and CPU are designed to provide minimal logic between register stages to keep the
critical path as short as possible. When enabling additional extensions or modules, the impact on the existing logic is
also kept at a minimum to prevent timing degradation. If there is a major impact on existing logic (example: many physical
memory protection address configuration registers), the VHDL code automatically adds additional register stages to maintain
the critical path length. Obviously, this increases operation latency.

In order to optimize for a minimal critical path (= maximum clock speed) the following points should be considered:

* Complex CPU extensions (in terms of hardware requirements) should be avoided (examples: floating-point unit, physical memory protection).
* Large carry chains (>32-bit) should be avoided (i.e. constrain the HPM counter sizes via `HPM_CNT_WIDTH`).
* If the target FPGA provides sufficient DSP resources, CPU multiplication operations can be mapped to DSP slices (`FAST_MUL_EN => true`),
reducing LUT usage and critical path impact while also increasing overall performance.
* Use the synchronous (registered) RX path configuration of the external memory interface (`MEM_EXT_ASYNC_RX => false`).
* _To be continued..._

[NOTE]
The short and fixed-length critical path allows the core to be integrated into existing clock domains, so neither
clock-domain crossing nor sub-clock generation is required. However, for very high clock frequencies (this is
technology / platform dependent) clock-domain crossing becomes crucial for chip-internal connections.

=== Optimize for Energy

There are no _dedicated_ configuration options to optimize the processor for energy (minimal consumption;
energy/instruction ratio) yet. However, a reduced processor area (<<_optimize_for_size>>) will also reduce
static energy consumption.

To optimize your setup for low-power applications, you can make use of the CPU sleep mode (`wfi` instruction).
Put the CPU into sleep mode whenever possible. Disable all processor modules that are not actually used (exclude them
from synthesis if they will _never_ be used; disable the module via its control register if the module is not
_currently_ used). When in sleep mode, you can keep a timer module running (MTIME or the watchdog) to wake up
the CPU again. Since the wake-up is triggered by _any_ interrupt, the external interrupt controller can also
be used to wake up the CPU. This way, all timers (and all other modules) can be deactivated as well.

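As a rough illustration of the sleep pattern described above, the following C sketch puts the CPU to sleep with `wfi`
until an interrupt handler signals that there is work to do. The names `wakeup_pending`, `cpu_wfi` and
`sleep_until_wakeup` are hypothetical and not taken from the NEORV32 software framework; the interrupt handler that
sets the flag is assumed to exist and is not shown.

```c
#include <stdbool.h>

/* Hypothetical wake-up flag; an interrupt handler (not shown) is assumed to set it. */
static volatile bool wakeup_pending = false;

/* Execute a single wait-for-interrupt instruction.
 * On non-RISC-V hosts this compiles to nothing, so the sketch can be built anywhere. */
static inline void cpu_wfi(void) {
#if defined(__riscv)
  __asm__ volatile ("wfi");
#endif
}

/* Keep the CPU asleep until the (assumed) interrupt handler reports an event.
 * Note: an interrupt arriving between the flag check and the wfi is only noticed
 * on the next wake-up; real code would have to guard this window. */
static void sleep_until_wakeup(void) {
  while (!wakeup_pending) {
    cpu_wfi(); /* the CPU resumes here after an interrupt; the flag is then re-checked */
  }
  wakeup_pending = false; /* consume the event */
}
```

A main loop would typically call `sleep_until_wakeup()` whenever it runs out of work, so that the CPU is only active
while an event is being processed.
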
.Processor-internal clock generator shutdown
[TIP]
If _no_ IO/peripheral module is currently enabled, the processor's internal clock generator circuit will be
shut down, reducing switching activity and thus dynamic energy consumption.