neorv32/docs/datasheet/cpu_cfu.adoc

259 lines
12 KiB
Plaintext

<<<
:sectnums:
=== Custom Functions Unit (CFU)
The Custom Functions Unit is the central part of the <<_zxcfu_isa_extension>> and represents
the actual hardware module, which can be used to implement _custom RISC-V instructions_.
The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
use-cases might include:
* **AI:** sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to gray-code; multiply-add operations
* **Image processing:** look-up-tables for color space transformations
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32
[NOTE]
The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators
(like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped
<<_custom_functions_subsystem_cfs>>. A comparison of all NEORV32-specific chip-internal hardware extension
options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].
:sectnums:
==== CFU Instruction Formats
The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction
space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed
Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom` opcodes to identify the instructions implemented
by the CFU and to differentiate between the different instruction formats. The according binary encoding of these
opcodes is shown below:
* `custom-0`: `0001011` RISC-V standard, used for <<_cfu_r3_type_instructions>>
* `custom-1`: `0101011` RISC-V standard, used for <<_cfu_r4_type_instructions>>
* `custom-2`: `1011011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type A
* `custom-3`: `1111011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type B
:sectnums:
===== CFU R3-Type Instructions
The R3-type CFU instructions operate on two source registers `rs1` and `rs2` and return the processing result to
the destination register `rd`. The actual operation can be defined by using the `funct7` and `funct3` bit fields.
These immediates can also be used to pass additional data to the CFU like offsets, look-up-tables addresses or
shift-amounts. However, the actual functionality is entirely user-defined.
Example operation: `rd <= rs1 xnor rs2`
.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]
* `funct7`: 7-bit immediate (further operand data or function select)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0001011` (RISC-V "custom-0" opcode)
.RISC-V compatibility
[NOTE]
The CFU R3-type instruction format is compliant to the RISC-V ISA specification.
.Instruction encoding space
[NOTE]
By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation a total of 1024 custom
R3-type instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values).
:sectnums:
===== CFU R4-Type Instructions
The R4-type CFU instructions operate on three source registers `rs1`, `rs2` and `rs2` and return the processing
result to the destination register `rd`. The actual operation can be defined by using the `funct3` bit field.
Alternatively, this immediate can also be used to pass additional data to the CFU like offsets, look-up-tables
addresses or shift-amounts. However, the actual functionality is entirely user-defined.
Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`
.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0101011` (RISC-V "custom-1" opcode)
.RISC-V compatibility
[NOTE]
The CFU R4-type instruction format is compliant to the RISC-V ISA specification.
.Unused instruction bits
[NOTE]
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits
are ignored by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve
compatibility with future ISA spec. versions.
.Instruction encoding space
[NOTE]
By using the `funct3` bit field entirely for selecting the actual operation a total of 8 custom R4-type
instructions can be implemented (3-bit -> 8 different values).
:sectnums:
===== CFU R5-Type Instructions
The R5-type CFU instructions operate on four source registers `rs1`, `rs2`, `rs3` and `r4` and return the
processing result to the destination register `rd`. As all bits of the instruction word are used to encode the
five registers and the opcode, no further immediate bits are available to specify the actual operation. There
are two different R5-type instruction with two different opcodes available. Hence, only two R5-type operations
can be implemented out of the box.
Example operation: `rd <= rs1 & rs2 & rs3 & rs4`
.CFU R5-type instruction A format
image::cfu_r5type_instruction_a.png[align=center]
.CFU R5-type instruction B format
image::cfu_r5type_instruction_b.png[align=center]
* `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data)
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `1011011` (RISC-V "custom-2" opcode) and/or `1111011` (RISC-V "custom-3" opcode)
.RISC-V compatibility
[IMPORTANT]
The RISC-V ISA specifications does not specify a R5-type instruction format. Hence, this instruction
format is NEORV32-specific.
.Instruction encoding space
[IMPORTANT]
There are no immediate fields in the CFU R5-type instruction so the actual operation is specified entirely
by the opcode resulting in just two different operations out of the box. However, another CFU instruction
(like a R3-type instruction) can be used to "program" the actual operation of a R5-type instruction by
writing operation information to a CFU-internal "command" register.
:sectnums:
==== Using Custom Instructions in Software
The custom instructions provided by the CFU can be used in plain C code by using **intrinsics**. Intrinsics
behave like "normal" C functions but under the hood they are a set of macros that hide the complexity of inline assembly.
Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when using custom
instructions. Each intrinsic will be compiled into a single 32-bit instruction word providing maximum code efficiency.
.CFU Example Program
[TIP]
There is an example program for the CFU, which shows how to use the _default_ CFU hardware module.
This example program is located in `sw/example/demo_cfu`.
The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`:
.CFU instruction prototypes
[source,c]
----
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instructions
neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) // R5-type instruction A
neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) // R5-type instruction B
----
The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
if not needed. Each intrinsic function requires several arguments depending on the instruction type/format:
* `funct7` - 7-bit immediate (R3-type only)
* `funct3` - 3-bit immediate (R3-type, R4-type)
* `rs1` - source operand 1, 32-bit (R3-type, R4-type)
* `rs2` - source operand 2, 32-bit (R3-type, R4-type)
* `rs3` - source operand 3, 32-bit (R3-type, R4-type, R5-type)
* `rs4` - source operand 4, 32-bit (R4-type, R4-type, R5-type)
The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2`, `rs3`
and `r4` arguments pass the actual data to the CFU. These register arguments can be populated with variables or
literals. The following example shows how to pass arguments:
.CFU instruction usage example
[source,c]
----
uint32_t tmp = some_function();
...
uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);
uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, (uint32_t)some_array[i]);
uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp);
----
:sectnums:
==== CFU Control and Status Registers (CFU-CSRs)
The CPU provides up to four control and status registers (<<_cfureg, `cfureg*`>>) to be
used within the CFU. These CSRs are mapped to the "custom user-mode read/write" CSR address space, which is
explicitly reserved for platform-specific application by the RISC-V spec. For example, these CSRs can be used
to pass additional operands to the CFU, to obtain additional results, to check processing status or to program
operation modes.
.CFU CSR Access Example
[source,c]
----
neorv32_cpu_csr_write(CSR_CFUREG0, 0xabcdabcd); // write data to CFU CSR 0
uint32_t tmp = neorv32_cpu_csr_read(CSR_CFUREG3); // read data from CFU CSR 3
----
.Additional CFU-internal CSRs
[TIP]
If more than four CFU-internal CSRs are required the designer can implement an "indirect access mechanism" based
on just two of the default CSRs: one CSR is used to configure the index while the other is used as alias to exchange
data with the indexed CFU-internal CSR - this concept is similar to the RISC-V Indirect CSR Access Extension
Specification (`Smcsrind`).
:sectnums:
==== Custom Instructions Hardware
The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.
CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal control unit takes care of
interfacing the custom user logic to the CPU pipeline.
.CFU Hardware Example & More Details
[TIP]
The default CFU hardware module already implement some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals, implementation options and the handshake with the CPU pipeline.
.CFU Hardware Resource Requirements
[NOTE]
Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using
the according operands for the CFU hardware) will add one or two, respectively, additional read ports to
the core's register file significantly increasing resource requirements.
.CFU Access
[NOTE]
The CFU is accessible from all privilege modes (including CFU-internal registers accessed via the indirects CSR
access mechanism). It is the task of the CFU designers to add according access-constraining logic if certain CFU
states shall not be exposed to all privilege levels (i.e. exncryption keys).
.CFU Execution Time
[NOTE]
The CFU has to complete computation within a **bound time window**. Otherwise, the CFU operation is terminated
by the hardware and an illegal instruction exception is raised. See section <<_cpu_arithmetic_logic_unit>>
for more information.
.CFU Exception
[NOTE]
The CFU can intentionally raise an illegal instruction exception by not asserting the `done` at all causing an
execution timeout. For example this can be used to signal invalid configurations/operations to the runtime
environment. See the CFU's VHDL file for more information.