Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816
Open
LwhJesse wants to merge 1 commit into
Open
Replace custom CUDA CSysMatrix matvec with cuSPARSE BSR SpMV#2816LwhJesse wants to merge 1 commit into
LwhJesse wants to merge 1 commit into
Conversation
6 tasks
This was referenced May 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR replaces the custom CUDA
CSysMatrixblock-sparse matrix-vector product kernel with cuSPARSE BSR SpMV for the CUDA square-block path.This is not only a performance or maintenance refactor. It is primarily a correctness fix.
The previous custom CUDA kernel has two independent numerical correctness problems:
nVar != nEqn), but that path was never semantically correct and can produce wrong results.Because of these issues, the previous CUDA kernel was not semantically safe for all inputs accepted by the interface. For the square-block CUDA path, the matrix storage is already BSR-like, so this PR replaces the custom kernel with cuSPARSE BSR SpMV instead of extending a path with known correctness mismatches.
What changed
GPUMatrixVectorProductAddkernel.CSysMatrix<ScalarType>::GPUMatrixVectorProductwith cuSPARSE BSR SpMV for the CUDA path.nVar != nEqn.Why the previous CUDA kernel is being replaced
The motivation here is not only simplification. The previous CUDA kernel had correctness risks that were reproducible at the matrix-vector-product level.
Problem 1: square-block routing bug
The old kernel derives the output component from a fixed
threadNo-basedblockRow, while the actual matrix entry being processed changes with the loop index:But the true dense-block row of the current matrix entry is determined by the current
index, not by the fixed thread id.That means a single thread can process entries belonging to different local block rows, while still accumulating everything into one fixed output component. This creates an algorithmic routing mismatch.
A minimal synthetic square-block case demonstrates it directly:
Expected CPU reference output:
Observed outputs:
In this example, one CUDA lane processes both
index = 0andindex = 30. Those two entries belong to different local block rows, but the old kernel routes both contributions to the same output component.Problem 2: rectangular block layouts were accepted, but not semantically consistent with the CPU path
CSysMatrixhas a general block interface with shapenVar x nEqn.The CPU path uses the correct input-vector stride:
However, the old CUDA kernel uses:
That is not semantically consistent with the CPU rectangular-block matvec definition when
nVar != nEqn.So although the old CUDA kernel would execute for rectangular blocks, that execution was not actually equivalent to the CPU rectangular-block matvec semantics. In other words, the rectangular CUDA path appeared to be accepted by the implementation, but its indexing was not consistent with the CPU rectangular-block definition.
A minimal synthetic rectangular case shows this clearly:
Observed outputs:
This is not just a corner-case API detail. It shows that the previous CUDA kernel could silently accept a rectangular block layout and produce a wrong result without any explicit failure.
Why explicit rejection is better than silent execution
This PR intentionally supports only the CUDA square-block case:
For
nVar != nEqn, the new CUDA path now fails explicitly instead of silently running an incorrect algorithm.So this PR does not reduce the set of rectangular CUDA cases that were correctly supported before, because the old kernel did not provide a semantically correct rectangular CUDA implementation in the first place.
The behavior change is:
That is a correctness improvement, not a compatibility regression.
Why cuSPARSE BSR is the right replacement here
For the CUDA square-block path, the existing storage is BSR-like in the sense relevant here:
row_ptrcol_indmatrixas dense block payloadsThe cuSPARSE path uses the matching BSR block direction for the existing block payload layout and provides a well-tested implementation of the required operation.
Given the observed correctness problems in both square and rectangular cases, using cuSPARSE for the valid square-block CUDA path is more robust than preserving a separate custom implementation for that path.
Validation
Build
nvccfailed Meson's CUDA compiler sanity check before SU2 compilation. This happened before building the modified source files and was not a cuSPARSE link failure.1. Synthetic square-block validation
Compared:
Result:
2. Synthetic rectangular-block validation
Compared:
Result:
3. Real SU2 integration cases
Ran 6 existing runnable implicit Krylov cases with four-way comparison:
Tested cases:
Summary:
new CPUmatcheddevelop CPUin 6/6 cases.new CUDAmatchednew CPUin 6/6 cases.develop CUDAdiffered from CPU / new CUDA in 5/6 cases.GPUassert,cuSPARSEassert, illegal memory access, or segmentation fault was observed.NaNflags in all four runs, so that behavior was not new-CUDA-specific.For cases without standard residual columns, the final
history.csvrow was compared across all common numeric fields.Scope
This PR changes only the CUDA matrix-vector product implementation for the square-block
CSysMatrixpath.It does not change:
Follow-up work
This PR focuses on correctness first.
The current implementation creates/destroys cuSPARSE descriptors and allocates/frees the SpMV workspace inside each matvec call. That can be optimized later by caching:
cusparseSpMV_preprocess()stateFollow-up draft
I have also prepared a follow-up draft PR in my fork to show the next intended step after this correctness PR:
That draft is based on this PR branch, not directly on
develop, so it shows only the incremental resource-lifecycle change: caching the cuSPARSE handle, BSR matrix descriptor, and SpMV workspace after the cuSPARSE BSR path introduced here.Related Work
This replaces the earlier narrower draft PR #2813, which attempted to fix only the contribution-routing issue inside the existing custom CUDA kernel. After additional testing, the scope changed to replacing the square-block CUDA path with cuSPARSE BSR SpMV and explicitly rejecting unsupported rectangular CUDA blocks.
PR Checklist
--warnlevel=3when using meson).pre-commit run --allto format old commits.