raid/riscv64: Optimize xor_gen_rvv by leiwen2025 · Pull Request #418 · intel/isa-l

leiwen2025 · 2026-05-28T09:20:56Z

sg2044:
      new: xor_gen_warm: runtime = 3062177 usecs, bandwidth 50580 MB in 3.0622 sec = 17579.80 MB/s
      old: xor_gen_warm: runtime = 3062535 usecs, bandwidth 31581 MB in 3.0625 sec = 10564.11 MB/s

Signed-off-by: WenLei <lei.wen2@zte.com.cn>

sunyuechi · 2026-06-03T09:15:26Z

This adds manual unrolling on top of the standard RVV path, but the extra
gain isn't consistent across machines (likely because sg2044's vector unit
is out-of-order while k3's is in-order):

sg2044  #412 = 16517 MB/s   #418 = 17579 MB/s   (418 faster)
k3      #412 = 16295 MB/s   #418 = 15406 MB/s   (418 slower)

For a portable implementation I'd lean toward the plain standard-RVV style
(#412) unless there's a stronger case across more machines that the manual
unrolling is a net win.
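
For readers comparing the two styles, here is a rough portable-C model of the difference. It is not the code from #412 or #418: `CHUNK` stands in for the bytes one vector operation would cover, the unroll factor of 4 is an assumption, and `xor_chunk`, `xor_gen_plain` and `xor_gen_unrolled` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the number of bytes one vector op processes. */
#define CHUNK 16

/* XOR n bytes at offset off from all nsrc sources into dest. */
static void xor_chunk(int nsrc, size_t off, size_t n, uint8_t **src, uint8_t *dest)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t acc = src[0][off + i];
        for (int v = 1; v < nsrc; v++)
            acc ^= src[v][off + i];
        dest[off + i] = acc;
    }
}

/* Plain strip-mined loop: one chunk per iteration; the same loop
 * also handles the short tail, as a vsetvli-style loop would. */
void xor_gen_plain(int nsrc, size_t len, uint8_t **src, uint8_t *dest)
{
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        xor_chunk(nsrc, off, n, src, dest);
        off += n;
    }
}

/* Manually unrolled variant: four chunks per iteration (assumed factor),
 * then the plain loop for whatever is left. */
void xor_gen_unrolled(int nsrc, size_t len, uint8_t **src, uint8_t *dest)
{
    size_t off = 0;
    while (len - off >= 4 * CHUNK) {
        xor_chunk(nsrc, off + 0 * CHUNK, CHUNK, src, dest);
        xor_chunk(nsrc, off + 1 * CHUNK, CHUNK, src, dest);
        xor_chunk(nsrc, off + 2 * CHUNK, CHUNK, src, dest);
        xor_chunk(nsrc, off + 3 * CHUNK, CHUNK, src, dest);
        off += 4 * CHUNK;
    }
    while (off < len) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        xor_chunk(nsrc, off, n, src, dest);
        off += n;
    }
}
```

Both functions compute the same result; the unrolled one only changes how much independent work is exposed per loop iteration, which is the part that may help or hurt depending on the core.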

leiwen2025 · 2026-06-04T08:28:22Z

> This adds manual unrolling on top of the standard RVV path, but the extra gain isn't consistent across machines (likely because sg2044's vector unit is out-of-order while k3's is in-order):
> sg2044  #412 = 16517 MB/s   #418 = 17579 MB/s   (418 faster)
> k3      #412 = 16295 MB/s   #418 = 15406 MB/s   (418 slower)
> For a portable implementation I'd lean toward the plain standard-RVV style (#412) unless there's a stronger case across more machines that the manual unrolling is a net win.

Thanks for the review and for sharing the benchmark numbers.
How about creating a VLEN=128 branch for the unrolled loop? The unrolled version is modeled after the ARM NEON 128-bit implementation, so the gains are probably to be expected mainly on VLEN=128 machines like sg2044. Platforms with wider vectors (like k3) would continue to use the plain standard-RVV style (#412).
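
A minimal sketch of what such a VLEN-based branch could look like, assuming the caller already knows the vector length in bits (for example, from the RVV `vlenb` CSR, which holds VLEN/8). The enum and `select_xor_impl` are hypothetical names and are not part of isa-l.

```c
/* Hypothetical identifiers for the two code paths under discussion. */
typedef enum {
    XOR_IMPL_PLAIN,    /* plain standard-RVV style (#412) */
    XOR_IMPL_UNROLLED  /* manually unrolled loop (#418)   */
} xor_impl_t;

/* Choose a path from the vector register length in bits.
 * Assumption: only VLEN == 128 takes the unrolled path. */
xor_impl_t select_xor_impl(unsigned vlen_bits)
{
    return (vlen_bits == 128) ? XOR_IMPL_UNROLLED : XOR_IMPL_PLAIN;
}
```

This keys the choice on VLEN alone, so it cannot tell apart two VLEN=128 cores with different microarchitectures.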

sunyuechi · 2026-06-04T09:15:59Z

This might be more of a microarchitecture / memory-throughput / OoO-vs-in-order
effect than a VLEN effect. As for the NEON style, I'm just guessing, but it may
be written that way because NEON is fixed-length and can't be written the RVV way.

One example is k230 (VLEN=128, in-order, RVV 1.0):

pr 412
  xor_gen_warm: runtime =    3062030 usecs, bandwidth 2280 MB in 3.0620 sec = 744.90 MB/s
pr 418
  xor_gen_warm: runtime =    3061614 usecs, bandwidth 1936 MB in 3.0616 sec = 632.42 MB/s
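
To put the numbers quoted in this thread on a common scale, the relative change of #418 against #412 can be computed as below. `rel_change_pct` is a hypothetical helper, not part of the isa-l perf tooling, and its inputs are the MB/s figures reported above.

```c
/* Relative change of a candidate throughput against a baseline, in percent.
 * Positive means the candidate is faster. */
double rel_change_pct(double base_mbps, double cand_mbps)
{
    return (cand_mbps - base_mbps) / base_mbps * 100.0;
}
```

With the reported figures this gives roughly +6.4% on sg2044 (16517 to 17579 MB/s), about -5.5% on k3 (16295 to 15406 MB/s), and about -15.1% on k230 (744.90 to 632.42 MB/s).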

raid/riscv64: Optimize xor_gen_rvv

1182fc3

Signed-off-by: WenLei <lei.wen2@zte.com.cn>

raid/riscv64: Optimize xor_gen_rvv #418
leiwen2025 wants to merge 1 commit into intel:master from leiwen2025:xor_gen_opt_rv64
