raid/riscv64: Optimize xor_gen_rvv #418

Open: leiwen2025 wants to merge 1 commit into intel:master from leiwen2025:xor_gen_opt_rv64

@leiwen2025 leiwen2025 commented May 28, 2026

Contributor
sg2044:
      new: xor_gen_warm: runtime = 3062177 usecs, bandwidth 50580 MB in 3.0622 sec = 17579.80 MB/s
      old: xor_gen_warm: runtime = 3062535 usecs, bandwidth 31581 MB in 3.0625 sec = 10564.11 MB/s

Signed-off-by: WenLei <lei.wen2@zte.com.cn>
@sunyuechi

Contributor

This adds manual unrolling on top of the standard RVV path, but the extra
gain isn't consistent across machines (likely because sg2044's vector unit
is out-of-order while k3's is in-order):

sg2044  #412 = 16517 MB/s   #418 = 17579 MB/s   (418 faster)
k3      #412 = 16295 MB/s   #418 = 15406 MB/s   (418 slower)

For a portable implementation I'd lean toward the plain standard-RVV style
(#412) unless there's a stronger case across more machines that the manual
unrolling is a net win.
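
For readers unfamiliar with the "plain standard-RVV style" mentioned above, here is a minimal sketch of a strip-mined XOR parity loop in C. It is not the code from #412 or #418 (neither is shown in this thread, and the real routines may well be assembly). The function name `xor_gen_plain_sketch`, the argument checks, and the LMUL=8 choice are assumptions made for illustration. It follows the xor_gen convention that the last pointer in the array is the destination. On compilers without RVV support it falls back to a scalar loop that pretends each pass handles up to 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>   /* RVV C intrinsics, only on RVV-enabled builds */
#endif

/*
 * Hypothetical name. array[0 .. vects - 2] are the sources and
 * array[vects - 1] is the destination.
 * Returns 0 on success, 1 on invalid arguments (an assumption of this sketch).
 */
int xor_gen_plain_sketch(int vects, int len, void **array)
{
    if (vects < 3 || len < 0 || array == NULL)
        return 1;

    uint8_t *dest = (uint8_t *)array[vects - 1];
    size_t n = (size_t)len;
    size_t pos = 0;

    while (pos < n) {
#if defined(__riscv_vector)
        /* Ask the hardware how many bytes it will process in this pass. */
        size_t vl = __riscv_vsetvl_e8m8(n - pos);
        vuint8m8_t acc = __riscv_vle8_v_u8m8((const uint8_t *)array[0] + pos, vl);
        for (int j = 1; j < vects - 1; j++) {
            vuint8m8_t s = __riscv_vle8_v_u8m8((const uint8_t *)array[j] + pos, vl);
            acc = __riscv_vxor_vv_u8m8(acc, s, vl);
        }
        __riscv_vse8_v_u8m8(dest + pos, acc, vl);
#else
        /* Scalar stand-in: pretend each pass handles up to 16 bytes. */
        size_t vl = (n - pos < 16) ? (n - pos) : 16;
        for (size_t i = 0; i < vl; i++) {
            uint8_t acc = ((const uint8_t *)array[0])[pos + i];
            for (int j = 1; j < vects - 1; j++)
                acc ^= ((const uint8_t *)array[j])[pos + i];
            dest[pos + i] = acc;
        }
#endif
        pos += vl;
    }
    return 0;
}
```

The loop has no separate tail handling and no fixed chunk size: the vector length granted per pass absorbs both, which is why this shape is portable across VLEN values. A manually unrolled variant would process several such chunks per iteration and needs extra code for the remainder.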

@leiwen2025

Contributor Author

> This adds manual unrolling on top of the standard RVV path, but the extra gain isn't consistent across machines (likely because sg2044's vector unit is out-of-order while k3's is in-order):
>
> sg2044  #412 = 16517 MB/s   #418 = 17579 MB/s   (418 faster)
> k3      #412 = 16295 MB/s   #418 = 15406 MB/s   (418 slower)
>
> For a portable implementation I'd lean toward the plain standard-RVV style (#412) unless there's a stronger case across more machines that the manual unrolling is a net win.

Thanks for the review and for sharing the benchmark numbers.
How about creating a VLEN=128 branch for the unrolled loop? The unrolled version is modeled after the ARM NEON 128-bit implementation, so the gains are probably expected mainly on VLEN=128 machines like sg2044. Platforms with wider vectors (like k3) would continue using the plain standard-RVV style (#412).
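
A minimal sketch of what such a VLEN-keyed branch could look like in C. Everything here is hypothetical: the enum, the function names, and the rule "unrolled only when VLEN is exactly 128 bits" are illustrations of the proposal, not code from this PR. The probe uses `__riscv_vsetvlmax_e8m1()`, which returns VLEN/8 for 8-bit elements at LMUL=1, and reports 0 on builds without RVV.

```c
#include <stddef.h>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical labels for the two loop styles discussed in this thread. */
enum xor_impl { XOR_IMPL_PLAIN, XOR_IMPL_UNROLLED };

/*
 * VLEN in bits, or 0 when the build has no RVV support.
 * Note: this must only run on hardware that actually implements the
 * vector extension; a real dispatcher would check that first.
 */
size_t probe_vlen_bits(void)
{
#if defined(__riscv_vector)
    return __riscv_vsetvlmax_e8m1() * 8;
#else
    return 0;
#endif
}

/* Proposed rule (an assumption): unrolled loop only on VLEN=128. */
enum xor_impl choose_xor_impl(size_t vlen_bits)
{
    return (vlen_bits == 128) ? XOR_IMPL_UNROLLED : XOR_IMPL_PLAIN;
}
```

A caller would evaluate `choose_xor_impl(probe_vlen_bits())` once at startup and cache the result. VLEN may not be the right key on its own, though: the k230 numbers later in this thread come from a VLEN=128 machine where #418 is slower.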

@sunyuechi

Contributor

This might be more of a microarchitecture / memory-throughput / OoO-vs-in-order
effect than a VLEN one. As for the NEON style, I'm just guessing, but it may be
unrolled that way because NEON is fixed-length and can't be written the RVV way.

One example is k230 (VLEN=128, in-order, RVV 1.0):

PR #412
  xor_gen_warm: runtime =    3062030 usecs, bandwidth 2280 MB in 3.0620 sec = 744.90 MB/s
PR #418
  xor_gen_warm: runtime =    3061614 usecs, bandwidth 1936 MB in 3.0616 sec = 632.42 MB/s
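
To put the three machines on one scale, the relative change of #418 against #412 can be computed from the MB/s figures quoted in this thread. The helper below is a small sketch; the function name is hypothetical and the inputs are only the numbers posted above.

```c
/* Relative change of `candidate` versus `baseline`, in percent.
 * Positive means the candidate is faster. */
double rel_change_pct(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

/* Figures from this thread (MB/s), #412 as baseline and #418 as candidate:
 *   sg2044: rel_change_pct(16517.0, 17579.0)  is about  +6.4 %
 *   k3:     rel_change_pct(16295.0, 15406.0)  is about  -5.5 %
 *   k230:   rel_change_pct(744.90, 632.42)    is about -15.1 %
 */
```

On these numbers the unrolled loop gains about 6% on the one out-of-order machine and loses between 5% and 15% on the two in-order ones, regardless of whether VLEN is 128.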
