<<< :sectnums: === Custom Functions Unit (CFU) The Custom Functions Unit is the central part of the <<_zxcfu_isa_extension>> and represents the actual hardware module, which can be used to implement _custom RISC-V instructions_. The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or program memory requirements when implemented entirely in software. Some potential application fields and exemplary use-cases might include: * **AI:** sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel * **Cryptographic:** bit substitution and permutation * **Communication:** conversions like binary to gray-code; multiply-add operations * **Image processing:** look-up-tables for color space transformations * implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32 [NOTE] The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators (like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped <<_custom_functions_subsystem_cfs>>. A comparison of all NEORV32-specific chip-internal hardware extension options is provided in the user guide section https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules]. :sectnums: ==== CFU Instruction Formats The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom` opcodes to identify the instructions implemented by the CFU and to differentiate between the different instruction formats. The according binary encoding of these opcodes is shown below: * `custom-0`: `0001011` RISC-V standard, used for <<_cfu_r3_type_instructions>> * `custom-1`: `0101011` RISC-V standard, used for <<_cfu_r4_type_instructions>> * `custom-2`: `1011011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type A * `custom-3`: `1111011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type B :sectnums: ===== CFU R3-Type Instructions The R3-type CFU instructions operate on two source registers `rs1` and `rs2` and return the processing result to the destination register `rd`. The actual operation can be defined by using the `funct7` and `funct3` bit fields. These immediates can also be used to pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual functionality is entirely user-defined. Example operation: `rd <= rs1 xnor rs2` .CFU R3-type instruction format image::cfu_r3type_instruction.png[align=center] * `funct7`: 7-bit immediate (further operand data or function select) * `rs2`: address of second source register (32-bit source data) * `rs1`: address of first source register (32-bit source data) * `funct3`: 3-bit immediate (further operand data or function select) * `rd`: address of destination register (for the 32-bit processing result) * `opcode`: `0001011` (RISC-V "custom-0" opcode) .RISC-V compatibility [NOTE] The CFU R3-type instruction format is compliant to the RISC-V ISA specification. .Instruction encoding space [NOTE] By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation a total of 1024 custom R3-type instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values). :sectnums: ===== CFU R4-Type Instructions The R4-type CFU instructions operate on three source registers `rs1`, `rs2` and `rs2` and return the processing result to the destination register `rd`. The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual functionality is entirely user-defined. Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]` .CFU R4-type instruction format image::cfu_r4type_instruction.png[align=center] * `rs3`: address of third source register (32-bit source data) * `rs2`: address of second source register (32-bit source data) * `rs1`: address of first source register (32-bit source data) * `funct3`: 3-bit immediate (further operand data or function select) * `rd`: address of destination register (for the 32-bit processing result) * `opcode`: `0101011` (RISC-V "custom-1" opcode) .RISC-V compatibility [NOTE] The CFU R4-type instruction format is compliant to the RISC-V ISA specification. .Unused instruction bits [NOTE] The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits are ignored by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve compatibility with future ISA spec. versions. .Instruction encoding space [NOTE] By using the `funct3` bit field entirely for selecting the actual operation a total of 8 custom R4-type instructions can be implemented (3-bit -> 8 different values). :sectnums: ===== CFU R5-Type Instructions The R5-type CFU instructions operate on four source registers `rs1`, `rs2`, `rs3` and `r4` and return the processing result to the destination register `rd`. As all bits of the instruction word are used to encode the five registers and the opcode, no further immediate bits are available to specify the actual operation. There are two different R5-type instruction with two different opcodes available. Hence, only two R5-type operations can be implemented out of the box. Example operation: `rd <= rs1 & rs2 & rs3 & rs4` .CFU R5-type instruction A format image::cfu_r5type_instruction_a.png[align=center] .CFU R5-type instruction B format image::cfu_r5type_instruction_b.png[align=center] * `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data) * `rs3`: address of third source register (32-bit source data) * `rs2`: address of second source register (32-bit source data) * `rs1`: address of first source register (32-bit source data) * `rd`: address of destination register (for the 32-bit processing result) * `opcode`: `1011011` (RISC-V "custom-2" opcode) and/or `1111011` (RISC-V "custom-3" opcode) .RISC-V compatibility [IMPORTANT] The RISC-V ISA specifications does not specify a R5-type instruction format. Hence, this instruction format is NEORV32-specific. .Instruction encoding space [IMPORTANT] There are no immediate fields in the CFU R5-type instruction so the actual operation is specified entirely by the opcode resulting in just two different operations out of the box. However, another CFU instruction (like a R3-type instruction) can be used to "program" the actual operation of a R5-type instruction by writing operation information to a CFU-internal "command" register. :sectnums: ==== Using Custom Instructions in Software The custom instructions provided by the CFU can be used in plain C code by using **intrinsics**. Intrinsics behave like "normal" C functions but under the hood they are a set of macros that hide the complexity of inline assembly. Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when using custom instructions. Each intrinsic will be compiled into a single 32-bit instruction word providing maximum code efficiency. .CFU Example Program [TIP] There is an example program for the CFU, which shows how to use the _default_ CFU hardware module. This example program is located in `sw/example/demo_cfu`. The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in `sw/lib/include/neorv32_cpu_cfu.h`: .CFU instruction prototypes [source,c] ---- neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instructions neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) // R5-type instruction A neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) // R5-type instruction B ---- The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded if not needed. Each intrinsic function requires several arguments depending on the instruction type/format: * `funct7` - 7-bit immediate (R3-type only) * `funct3` - 3-bit immediate (R3-type, R4-type) * `rs1` - source operand 1, 32-bit (R3-type, R4-type) * `rs2` - source operand 2, 32-bit (R3-type, R4-type) * `rs3` - source operand 3, 32-bit (R3-type, R4-type, R5-type) * `rs4` - source operand 4, 32-bit (R4-type, R4-type, R5-type) The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2`, `rs3` and `r4` arguments pass the actual data to the CFU. These register arguments can be populated with variables or literals. The following example shows how to pass arguments: .CFU instruction usage example [source,c] ---- uint32_t tmp = some_function(); ... uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123); uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, (uint32_t)some_array[i]); uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp); ---- :sectnums: ==== CFU Control and Status Registers (CFU-CSRs) The CPU provides up to four control and status registers (<<_cfureg, `cfureg*`>>) to be used within the CFU. These CSRs are mapped to the "custom user-mode read/write" CSR address space, which is explicitly reserved for platform-specific application by the RISC-V spec. For example, these CSRs can be used to pass additional operands to the CFU, to obtain additional results, to check processing status or to program operation modes. .CFU CSR Access Example [source,c] ---- neorv32_cpu_csr_write(CSR_CFUREG0, 0xabcdabcd); // write data to CFU CSR 0 uint32_t tmp = neorv32_cpu_csr_read(CSR_CFUREG3); // read data from CFU CSR 3 ---- .Additional CFU-internal CSRs [TIP] If more than four CFU-internal CSRs are required the designer can implement an "indirect access mechanism" based on just two of the default CSRs: one CSR is used to configure the index while the other is used as alias to exchange data with the indexed CFU-internal CSR - this concept is similar to the RISC-V Indirect CSR Access Extension Specification (`Smcsrind`). :sectnums: ==== Custom Instructions Hardware The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`. CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of the current clock cycle. Operations can also take several clock cycles to complete (like multiplications) and may also include internal states and memories. The CFU's internal control unit takes care of interfacing the custom user logic to the CPU pipeline. .CFU Hardware Example & More Details [TIP] The default CFU hardware module already implement some exemplary instructions that are used for illustration by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which is highly commented to explain the available signals, implementation options and the handshake with the CPU pipeline. .CFU Hardware Resource Requirements [NOTE] Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using the according operands for the CFU hardware) will add one or two, respectively, additional read ports to the core's register file significantly increasing resource requirements. .CFU Access [NOTE] The CFU is accessible from all privilege modes (including CFU-internal registers accessed via the indirects CSR access mechanism). It is the task of the CFU designers to add according access-constraining logic if certain CFU states shall not be exposed to all privilege levels (i.e. exncryption keys). .CFU Execution Time [NOTE] The CFU has to complete computation within a **bound time window**. Otherwise, the CFU operation is terminated by the hardware and an illegal instruction exception is raised. See section <<_cpu_arithmetic_logic_unit>> for more information. .CFU Exception [NOTE] The CFU can intentionally raise an illegal instruction exception by not asserting the `done` at all causing an execution timeout. For example this can be used to signal invalid configurations/operations to the runtime environment. See the CFU's VHDL file for more information.