<<<
:sectnums:
=== Custom Functions Unit (CFU)

The Custom Functions Unit is the central part of the <<_zxcfu_isa_extension>> and represents
the actual hardware module, which can be used to implement _custom RISC-V instructions_.

The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
use-cases might include:

* **AI:** sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel
* **Cryptography:** bit substitution and permutation
* **Communication:** conversions like binary to Gray code; multiply-add operations
* **Image processing:** look-up tables for color space transformations
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32

[NOTE]
The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators
(like block-based AES encryption). These kinds of accelerators should be implemented as a memory-mapped
<<_custom_functions_subsystem_cfs>>. A comparison of all NEORV32-specific chip-internal hardware extension
options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].
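To make the use-cases listed above more concrete, the following sketch gives plain C reference models of two of them: the binary to Gray code conversion and a byte-wise (sub-word) addition of two 32-bit words. The function names are purely illustrative and are not part of the NEORV32 software framework. Operations like these need several base ISA instructions in software, which is what makes them plausible candidates for a single custom instruction.

```c
#include <assert.h>
#include <stdint.h>

// Binary to Gray code conversion (software reference model).
uint32_t ref_bin_to_gray(uint32_t x) {
  return x ^ (x >> 1);
}

// Byte-wise addition: the four bytes of a and b are added independently,
// carries never cross a byte boundary (each byte wraps around on overflow).
uint32_t ref_add8x4(uint32_t a, uint32_t b) {
  uint32_t low7 = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu); // add the lower 7 bits of each byte
  uint32_t msb  = (a ^ b) & 0x80808080u;                  // add the top bits without carry-out
  return low7 ^ msb;
}

// Small usage example with hand-checked values.
void ref_models_demo(void) {
  assert(ref_bin_to_gray(2u) == 3u);
  assert(ref_add8x4(0x01ff7f80u, 0x0101017fu) == 0x020080ffu);
}
```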
:sectnums:
==== CFU Instruction Formats

The custom instructions executed by the CFU use a specific opcode space in the `rv32` 32-bit instruction
space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("Guaranteed
Non-Standard Encoding Space"). The NEORV32 CFU uses the `custom` opcodes to identify the instructions implemented
by the CFU and to differentiate between the different instruction formats. The corresponding binary encoding of these
opcodes is shown below:

* `custom-0`: `0001011` RISC-V standard, used for <<_cfu_r3_type_instructions>>
* `custom-1`: `0101011` RISC-V standard, used for <<_cfu_r4_type_instructions>>
* `custom-2`: `1011011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type A
* `custom-3`: `1111011` NEORV32-specific, used for <<_cfu_r5_type_instructions>> type B
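As a rough illustration of the opcode list above, the following host-side C sketch classifies a 32-bit instruction word by its 7-bit opcode field (bits 6:0). The enumeration and the function name are made up for this example; the actual decoding is done in hardware by the CPU.

```c
#include <assert.h>
#include <stdint.h>

// Instruction formats selected by the four "custom" opcodes (illustrative names).
typedef enum {
  CFU_FMT_NONE, // not a CFU instruction
  CFU_FMT_R3,
  CFU_FMT_R4,
  CFU_FMT_R5_A,
  CFU_FMT_R5_B
} cfu_format_t;

// Classify an instruction word by its opcode field (bits 6:0).
cfu_format_t cfu_classify(uint32_t instr_word) {
  switch (instr_word & 0x7fu) {
    case 0x0b: return CFU_FMT_R3;   // custom-0: 0001011
    case 0x2b: return CFU_FMT_R4;   // custom-1: 0101011
    case 0x5b: return CFU_FMT_R5_A; // custom-2: 1011011
    case 0x7b: return CFU_FMT_R5_B; // custom-3: 1111011
    default:   return CFU_FMT_NONE;
  }
}

// Usage example: only the lowest 7 bits matter for the classification.
void cfu_classify_demo(void) {
  assert(cfu_classify(0xfffff00bu) == CFU_FMT_R3);
  assert(cfu_classify(0x00000033u) == CFU_FMT_NONE); // opcode 0110011 is not a custom opcode
}
```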
:sectnums:
===== CFU R3-Type Instructions

The R3-type CFU instructions operate on two source registers `rs1` and `rs2` and return the processing result to
the destination register `rd`. The actual operation can be defined by using the `funct7` and `funct3` bit fields.
These immediates can also be used to pass additional data to the CFU like offsets, look-up table addresses or
shift amounts. However, the actual functionality is entirely user-defined.

Example operation: `rd <= rs1 xnor rs2`

.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]

* `funct7`: 7-bit immediate (further operand data or function select)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0001011` (RISC-V "custom-0" opcode)

.RISC-V compatibility
[NOTE]
The CFU R3-type instruction format is compliant with the RISC-V ISA specification.

.Instruction encoding space
[NOTE]
By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation, a total of 1024 custom
R3-type instructions can be implemented (7 bits + 3 bits = 10 bits -> 1024 different values).
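The field list above follows the standard RISC-V R-type layout (`funct7` in bits 31:25, `rs2` in 24:20, `rs1` in 19:15, `funct3` in 14:12, `rd` in 11:7, `opcode` in 6:0). A minimal host-side sketch of how such an instruction word could be assembled, and how `funct7` and `funct3` combine into one 10-bit function select, might look as follows. All function and macro names are hypothetical helpers, and using `funct7` as the upper bits of the function select is an arbitrary choice made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM0 0x0bu // 0001011

// Assemble an R3-type instruction word.
// rd, rs1 and rs2 are register numbers (0..31), not register contents.
uint32_t cfu_r3_encode(uint32_t funct7, uint32_t funct3,
                       uint32_t rd, uint32_t rs1, uint32_t rs2) {
  assert(funct7 < 128u && funct3 < 8u);
  assert(rd < 32u && rs1 < 32u && rs2 < 32u);
  return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
         (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM0;
}

// Combine funct7 and funct3 into a single function select (0..1023).
uint32_t cfu_r3_function_id(uint32_t instr_word) {
  uint32_t funct7 = (instr_word >> 25) & 0x7fu;
  uint32_t funct3 = (instr_word >> 12) & 0x7u;
  return (funct7 << 3) | funct3;
}

// Software reference for the example operation "rd <= rs1 xnor rs2".
uint32_t ref_xnor(uint32_t rs1, uint32_t rs2) {
  return ~(rs1 ^ rs2);
}

// Usage example: funct7 = 0, funct3 = 0, rd = x10, rs1 = x11, rs2 = x12.
void cfu_r3_demo(void) {
  assert(cfu_r3_encode(0u, 0u, 10u, 11u, 12u) == 0x00c5850bu);
  assert(ref_xnor(0xffff0000u, 0x00ff00ffu) == 0x00ffff00u);
}
```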
:sectnums:
===== CFU R4-Type Instructions

The R4-type CFU instructions operate on three source registers `rs1`, `rs2` and `rs3` and return the processing
result to the destination register `rd`. The actual operation can be defined by using the `funct3` bit field.
Alternatively, this immediate can also be used to pass additional data to the CFU like offsets, look-up table
addresses or shift amounts. However, the actual functionality is entirely user-defined.

Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`

.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]

* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0101011` (RISC-V "custom-1" opcode)

.RISC-V compatibility
[NOTE]
The CFU R4-type instruction format is compliant with the RISC-V ISA specification.

.Unused instruction bits
[NOTE]
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits
are ignored by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve
compatibility with future versions of the ISA specification.

.Instruction encoding space
[NOTE]
By using the `funct3` bit field entirely for selecting the actual operation, a total of 8 custom R4-type
instructions can be implemented (3 bits -> 8 different values).
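The R4-type layout can be sketched on the host in the same way. The example below assumes the standard RISC-V R4-type field positions (`rs3` in bits 31:27, bits 26:25 left at zero, the remaining fields as in the R-type layout) and adds a software reference for the example operation. The helper names are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_OPCODE_CUSTOM1 0x2bu // 0101011

// Assemble an R4-type instruction word.
// rd, rs1, rs2 and rs3 are register numbers (0..31), not register contents.
uint32_t cfu_r4_encode(uint32_t funct3, uint32_t rd,
                       uint32_t rs1, uint32_t rs2, uint32_t rs3) {
  assert(funct3 < 8u);
  assert(rd < 32u && rs1 < 32u && rs2 < 32u && rs3 < 32u);
  // Bits [26:25] are never set here, so they stay all-zero.
  return (rs3 << 27) | (rs2 << 20) | (rs1 << 15) |
         (funct3 << 12) | (rd << 7) | CFU_OPCODE_CUSTOM1;
}

// Software reference for "rd <= (rs1 * rs2 + rs3)[31:0]".
// Unsigned 32-bit arithmetic wraps around, which keeps only the low 32 bits.
uint32_t ref_mul_add(uint32_t rs1, uint32_t rs2, uint32_t rs3) {
  return rs1 * rs2 + rs3;
}

// Usage example: funct3 = 3, rd = x10, rs1 = x11, rs2 = x12, rs3 = x13.
void cfu_r4_demo(void) {
  uint32_t word = cfu_r4_encode(3u, 10u, 11u, 12u, 13u);
  assert(word == 0x68c5b52bu);
  assert(((word >> 25) & 0x3u) == 0u); // unused bits are zero
  assert(ref_mul_add(3u, 4u, 5u) == 17u);
}
```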
:sectnums:
===== CFU R5-Type Instructions

The R5-type CFU instructions operate on four source registers `rs1`, `rs2`, `rs3` and `rs4` and return the
processing result to the destination register `rd`. As all bits of the instruction word are used to encode the
five registers and the opcode, no further immediate bits are available to specify the actual operation. There
are two different R5-type instructions with two different opcodes available. Hence, only two R5-type operations
can be implemented out of the box.

Example operation: `rd <= rs1 & rs2 & rs3 & rs4`

.CFU R5-type instruction A format
image::cfu_r5type_instruction_a.png[align=center]

.CFU R5-type instruction B format
image::cfu_r5type_instruction_b.png[align=center]

* `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data)
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `1011011` (RISC-V "custom-2" opcode, instruction A) or `1111011` (RISC-V "custom-3" opcode, instruction B)

.RISC-V compatibility
[IMPORTANT]
The RISC-V ISA specification does not specify an R5-type instruction format. Hence, this instruction
format is NEORV32-specific.

.Instruction encoding space
[IMPORTANT]
There are no immediate fields in the CFU R5-type instruction, so the actual operation is specified entirely
by the opcode, resulting in just two different operations out of the box. However, another CFU instruction
(like an R3-type instruction) can be used to "program" the actual operation of an R5-type instruction by
writing operation information to a CFU-internal "command" register.
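The "command register" idea from the note above can be illustrated with a small behavioral model in C. This is only a sketch of the concept: the register, the set of selectable operations and their encoding are invented for this example, and a real CFU would implement the same behavior in VHDL.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical operations that can be selected for R5-type instruction A.
typedef enum { R5_OP_AND4 = 0, R5_OP_OR4 = 1, R5_OP_XOR4 = 2 } r5_op_t;

// Model of a CFU-internal "command" register (name and encoding are made up).
static uint32_t cfu_cmd_reg = R5_OP_AND4;

// Models an R3-type instruction that writes rs1 to the command register
// and returns the previous command as processing result.
uint32_t model_r3_set_command(uint32_t rs1) {
  uint32_t previous = cfu_cmd_reg;
  cfu_cmd_reg = rs1;
  return previous;
}

// Models R5-type instruction A: the operation is chosen by the command register.
uint32_t model_r5_instr_a(uint32_t rs1, uint32_t rs2, uint32_t rs3, uint32_t rs4) {
  switch (cfu_cmd_reg) {
    case R5_OP_AND4: return rs1 & rs2 & rs3 & rs4;
    case R5_OP_OR4:  return rs1 | rs2 | rs3 | rs4;
    case R5_OP_XOR4: return rs1 ^ rs2 ^ rs3 ^ rs4;
    default:         return 0u; // unknown command: this sketch simply returns zero
  }
}

// Usage example: the same R5-type instruction performs two different operations.
void r5_command_demo(void) {
  model_r3_set_command(R5_OP_AND4);
  assert(model_r5_instr_a(0xf0u, 0x3cu, 0xffu, 0x7eu) == 0x30u);
  model_r3_set_command(R5_OP_XOR4);
  assert(model_r5_instr_a(1u, 2u, 4u, 8u) == 15u);
}
```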
:sectnums:
==== Using Custom Instructions in Software

The custom instructions provided by the CFU can be used in plain C code by using **intrinsics**. Intrinsics
behave like "normal" C functions but under the hood they are a set of macros that hide the complexity of inline assembly.
Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when using custom
instructions. Each intrinsic will be compiled into a single 32-bit instruction word providing maximum code efficiency.

.CFU Example Program
[TIP]
There is an example program for the CFU, which shows how to use the _default_ CFU hardware module.
This example program is located in `sw/example/demo_cfu`.

The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`:

.CFU instruction prototypes
[source,c]
----
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instructions
neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) // R5-type instruction A
neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) // R5-type instruction B
----

The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
if not needed. Each intrinsic function requires several arguments depending on the instruction type/format:

* `funct7` - 7-bit immediate (R3-type only)
* `funct3` - 3-bit immediate (R3-type, R4-type)
* `rs1` - source operand 1, 32-bit (R3-type, R4-type, R5-type)
* `rs2` - source operand 2, 32-bit (R3-type, R4-type, R5-type)
* `rs3` - source operand 3, 32-bit (R4-type, R5-type)
* `rs4` - source operand 4, 32-bit (R5-type only)

The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2`, `rs3`
and `rs4` arguments pass the actual data to the CFU. These register arguments can be populated with variables or
literals. The following example shows how to pass arguments:

.CFU instruction usage example
[source,c]
----
uint32_t tmp = some_function();
...
uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);
uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, (uint32_t)some_array[i]);
uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp);
----
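One possible way to keep application code portable is to hide an intrinsic behind a small wrapper that falls back to plain C when the CFU is not available, for example when the same code is unit-tested on a host PC. The sketch below makes several assumptions that are not taken from the framework: `USE_NEORV32_CFU` is a hypothetical build switch, `neorv32.h` is assumed to be the header that pulls in the intrinsics, and the CFU is assumed to implement XNOR for `funct7 = 0` and `funct3 = 0`.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__riscv) && defined(USE_NEORV32_CFU) // hypothetical build switch
#include <neorv32.h> // assumed to provide neorv32_cfu_r3_instr()

// Assumption: the CFU implements XNOR for funct7 = 0 and funct3 = 0.
static inline uint32_t xnor32(uint32_t a, uint32_t b) {
  return neorv32_cfu_r3_instr(0x00, 0x0, a, b);
}
#else
// Portable fallback with the same semantics, used when no CFU is available.
static inline uint32_t xnor32(uint32_t a, uint32_t b) {
  return ~(a ^ b);
}
#endif

// Usage example: callers do not need to know which variant is active.
void xnor32_demo(void) {
  assert(xnor32(0xffff0000u, 0x00ff00ffu) == 0x00ffff00u);
}
```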
:sectnums:
==== CFU Control and Status Registers (CFU-CSRs)

The CPU provides up to four control and status registers (<<_cfureg, `cfureg*`>>) to be
used within the CFU. These CSRs are mapped to the "custom user-mode read/write" CSR address space, which is
explicitly reserved for platform-specific applications by the RISC-V spec. For example, these CSRs can be used
to pass additional operands to the CFU, to obtain additional results, to check processing status or to program
operation modes.

.CFU CSR Access Example
[source,c]
----
neorv32_cpu_csr_write(CSR_CFUREG0, 0xabcdabcd); // write data to CFU CSR 0
uint32_t tmp = neorv32_cpu_csr_read(CSR_CFUREG3); // read data from CFU CSR 3
----

.Additional CFU-internal CSRs
[TIP]
If more than four CFU-internal CSRs are required, the designer can implement an "indirect access mechanism" based
on just two of the default CSRs: one CSR is used to configure the index while the other is used as an alias to exchange
data with the indexed CFU-internal CSR - this concept is similar to the RISC-V Indirect CSR Access Extension
Specification (`Smcsrind`).
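The indirect access mechanism described in the tip above can be sketched as a behavioral model in C. The number of internal registers, the wrap-around of out-of-range indices and all names are assumptions made for this illustration. On real hardware the two model functions would correspond to CSR accesses (for example via `neorv32_cpu_csr_write` and `neorv32_cpu_csr_read`) to whichever two `cfureg*` CSRs the designer picks as index and alias.

```c
#include <assert.h>
#include <stdint.h>

#define CFU_NUM_INTERNAL_REGS 16u // hypothetical number of CFU-internal registers

static uint32_t cfu_internal[CFU_NUM_INTERNAL_REGS]; // CFU-internal register bank
static uint32_t cfu_index;                           // state of the "index" CSR

// Models a write to the index CSR: selects one internal register.
// Out-of-range indices wrap around in this sketch.
void model_index_write(uint32_t idx) {
  cfu_index = idx % CFU_NUM_INTERNAL_REGS;
}

// Models a write to the alias CSR: data goes to the selected internal register.
void model_alias_write(uint32_t data) {
  cfu_internal[cfu_index] = data;
}

// Models a read from the alias CSR: returns the selected internal register.
uint32_t model_alias_read(void) {
  return cfu_internal[cfu_index];
}

// Usage example: two CSRs give access to all internal registers.
void indirect_access_demo(void) {
  model_index_write(5u);
  model_alias_write(0xabcdabcdu);
  model_index_write(9u);
  model_alias_write(0x12345678u);
  model_index_write(5u);
  assert(model_alias_read() == 0xabcdabcdu);
}
```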
:sectnums:
==== Custom Instructions Hardware

The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.

CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal control unit takes care of
interfacing the custom user logic to the CPU pipeline.

.CFU Hardware Example & More Details
[TIP]
The default CFU hardware module already implements some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals, implementation options and the handshake with the CPU pipeline.

.CFU Hardware Resource Requirements
[NOTE]
Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using
the corresponding operands in the CFU hardware) will add one or two additional read ports, respectively, to
the core's register file, which significantly increases resource requirements.

.CFU Access
[NOTE]
The CFU is accessible from all privilege modes (including CFU-internal registers accessed via the indirect CSR
access mechanism). It is the task of the CFU designers to add appropriate access-constraining logic if certain CFU
states shall not be exposed to all privilege levels (e.g. encryption keys).

.CFU Execution Time
[NOTE]
The CFU has to complete computation within a **bounded time window**. Otherwise, the CFU operation is terminated
by the hardware and an illegal instruction exception is raised. See section <<_cpu_arithmetic_logic_unit>>
for more information.

.CFU Exception
[NOTE]
The CFU can intentionally raise an illegal instruction exception by not asserting the `done` signal at all, causing an
execution timeout. For example, this can be used to signal invalid configurations/operations to the runtime
environment. See the CFU's VHDL file for more information.
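
The interplay of multi-cycle operations, the execution time window and the intentional timeout can be pictured with a very small behavioral model. This is only a sketch under stated assumptions: `MODEL_TIMEOUT_CYCLES` is a made-up bound (the real limit is defined by the CPU, see the section referenced above), the bit-wise inversion is a placeholder operation, and the real handshake is implemented in VHDL rather than C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TIMEOUT_CYCLES 64u // hypothetical bound, not the real hardware value

typedef struct {
  uint32_t result;  // processing result (valid only if illegal is false)
  bool     illegal; // true if an illegal instruction exception would be raised
} cfu_outcome_t;

// Models one CFU operation that asserts "done" after `latency` cycles
// (0 = combinatorial). If `never_done` is set, "done" is withheld on purpose.
cfu_outcome_t model_execute(uint32_t value, uint32_t latency, bool never_done) {
  cfu_outcome_t out = { 0u, false };
  for (uint32_t cycle = 0u; cycle <= MODEL_TIMEOUT_CYCLES; cycle++) {
    bool done = !never_done && (cycle >= latency);
    if (done) {
      out.result = ~value; // placeholder operation: bit-wise inversion
      return out;
    }
  }
  out.illegal = true; // time window exceeded: operation is terminated
  return out;
}

// Usage example: fast operation completes, withheld "done" ends in an exception.
void timeout_demo(void) {
  assert(model_execute(0u, 3u, false).result == 0xffffffffu);
  assert(model_execute(0u, 3u, true).illegal);
}
```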