AVX2 decoder: N-way codeblock interleave for the step-1 VLC/MEL decode #285
osamu620 wants to merge 2 commits
Block-level SIMD decodes one codeblock at a time, but the first step of that
decode -- recovering the VLC and MEL segments of the HTJ2K cleanup pass, what
this decoder calls "step 1" -- is a serial dependency chain that leaves the
core's execution units under-utilised (IPC well below peak). The way to fill
those idle slots is to decode several *independent* codeblocks together and let
the out-of-order engine overlap their chains. That interleaving cannot live
inside the per-block SIMD decoder, which by definition sees only one codeblock;
the code that knows about "a row of codeblocks in a subband" sits one level up,
in subband::pull_line. This commit adds the batching seam there, kept
ISA-neutral so that shared decode code is touched in just one place and any
decoder can opt in.
- A cb_decoder_batch_fun32 typedef and an optional decode_cb32_batch function
pointer in codeblock_fun, alongside the existing decode_cb32/decode_cb64 and
chosen by the same runtime CPU-feature dispatch. NULL by default; an ISA
sets it only if it has a batch kernel.
- codeblock::decode_row(blocks, count) gathers a subband row's decodable
32-bit codeblocks into parallel arrays and calls the batch in windows of 64,
marking zero-blocks directly. It falls back to per-block decode() when no
batch is registered (NULL) or the row is 64-bit, reproducing the exact 1-way
behaviour and error handling.
- subband::pull_line decodes the row through decode_row instead of looping
decode() per block.
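
The dispatch described in the list above can be sketched roughly as follows. This is a toy model, not the code from the commit: every `toy_*` name, the batch function signature, and the `row_is_64bit` parameter are assumptions made for illustration. Only the overall shape (an optional pointer that is NULL by default, windows of 64, zero-blocks marked directly, per-block fallback) is taken from the description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a codeblock; the real class carries far more state.
struct toy_block {
  bool zero_block = false;  // nothing to decode, only needs marking
  bool decoded    = false;
  int  decoded_by = 0;      // 1 = per-block path, 2 = batch path (illustration only)
};

// Assumed shape of a batch entry: decode `count` blocks in one call.
typedef bool (*toy_batch_fun32)(toy_block** blocks, std::size_t count);

struct toy_codeblock_fun {
  toy_batch_fun32 decode_cb32_batch = nullptr;  // NULL unless an ISA registers a kernel
};

// Stand-in for the existing 1-way decode().
inline bool toy_decode_one(toy_block& b) {
  b.decoded = true;
  b.decoded_by = 1;
  return true;
}

// Stand-in for a kernel that an ISA could register.
inline bool toy_batch_kernel(toy_block** blocks, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    blocks[i]->decoded = true;
    blocks[i]->decoded_by = 2;
  }
  return true;
}

inline bool toy_decode_row(const toy_codeblock_fun& fun, toy_block* blocks,
                           std::size_t count, bool row_is_64bit) {
  bool ok = true;
  // Fallback: no batch registered, or a 64-bit row. Same behaviour as before.
  if (fun.decode_cb32_batch == nullptr || row_is_64bit) {
    for (std::size_t i = 0; i < count; ++i)
      ok = toy_decode_one(blocks[i]) && ok;
    return ok;
  }
  const std::size_t window = 64;  // batch is called in windows of 64 blocks
  toy_block* batch[64];
  std::size_t n = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (blocks[i].zero_block) {   // zero-blocks are marked directly, never batched
      blocks[i].decoded = true;
      continue;
    }
    batch[n++] = &blocks[i];
    if (n == window) {
      ok = fun.decode_cb32_batch(batch, n) && ok;
      n = 0;
    }
  }
  if (n > 0) ok = fun.decode_cb32_batch(batch, n) && ok;  // final partial window
  return ok;
}
```

With the pointer left at NULL every block goes through `toy_decode_one`, which is the "no-op by itself" property the commit relies on.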
Keeping this half ISA-neutral means a SIMD kernel plugs in with just its own
batch function and one dispatch line, with no further changes to that shared
code (the AVX2 kernel follows in the next commit; AVX-512/NEON could reuse it
unchanged). By itself it is a no-op -- with decode_cb32_batch NULL every block
still takes the existing 1-way decode() -- so it carries zero risk to current
output and can be reviewed on its own.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The first step of decoding a codeblock -- recovering the VLC and MEL segments of
the HTJ2K cleanup pass ("step 1") -- is a serial per-quad dependency chain and
the single hottest decode kernel (~42% of 8-bit AVX2 decode, execution-bound at
IPC ~2.8). Codeblocks are independent, so interleaving several of them lets the
out-of-order engine overlap the otherwise-serial chains.
Add a templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes
N same-dimension codeblocks in lockstep, driven by a new batched entry
(ojph_decode_codeblock_avx2_batch) that groups a subband row's same-width blocks
N-at-a-time and falls back to the 1-way path for edge/odd/invalid blocks. The
per-block validation (cb_decode_setup) and the second step -- MagSgn decode plus
the optional SgnProp/MagRef passes (cb_decode_finish) -- are factored into shared
helpers so the 1-way and batched entries stay in sync. Wiring: set
decode_cb32_batch on the AVX2 dispatch (it hooks into the ISA-agnostic
codeblock::decode_row added previously).
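
As a rough illustration of the lockstep idea (not the actual decode_cb_step1_vlc_nway<N> kernel), the sketch below runs N independent serial chains in one loop so that each iteration holds N unrelated dependency chains for the out-of-order engine to overlap. The recurrence in `toy_step` is an arbitrary placeholder for the per-quad VLC/MEL state update, and all `toy_*` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder serial step: the new state depends on the previous state of the
// same stream, so one stream on its own cannot be parallelised.
inline std::uint32_t toy_step(std::uint32_t state, std::uint8_t byte) {
  return (state * 1664525u + 1013904223u) ^ byte;
}

// 1-way reference: a single dependency chain.
inline std::uint32_t toy_decode_1way(const std::uint8_t* data, std::size_t len) {
  std::uint32_t s = 0;
  for (std::size_t i = 0; i < len; ++i) s = toy_step(s, data[i]);
  return s;
}

// N-way lockstep: N chains advance together. The chains never read each
// other's state, so the result per stream matches the 1-way path exactly.
template <int N>
inline void toy_decode_nway(const std::uint8_t* const data[N], std::size_t len,
                            std::uint32_t out[N]) {
  std::uint32_t s[N];
  for (int k = 0; k < N; ++k) s[k] = 0;
  for (std::size_t i = 0; i < len; ++i)
    for (int k = 0; k < N; ++k)   // inner loop over streams, same step index
      s[k] = toy_step(s[k], data[k][i]);
  for (int k = 0; k < N; ++k) out[k] = s[k];
}
```

The sketch requires all N streams to have the same length, which loosely mirrors the kernel's restriction to same-dimension codeblocks; anything that does not fit would take the 1-way function instead.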
The step-1 kernel is pure scalar; its per-stream loops carry an OJPH_UNROLL
hint, a small portable macro added to ojph_arch.h next to OJPH_FORCE_INLINE:
_Pragma("GCC unroll 8") on GCC/Clang, empty on MSVC (which has no such pragma
and would otherwise warn C4068).
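
A macro of that kind could look roughly like the sketch below. `TOY_UNROLL` is a hypothetical name and the exact preprocessor conditions in ojph_arch.h may differ; only the `_Pragma("GCC unroll 8")` expansion and the empty MSVC definition come from the description above.

```cpp
#include <cassert>

// Hypothetical mirror of the OJPH_UNROLL hint described above.
#if defined(_MSC_VER) && !defined(__clang__)
  #define TOY_UNROLL                          /* MSVC: expands to nothing */
#elif defined(__GNUC__) || defined(__clang__)
  #define TOY_UNROLL _Pragma("GCC unroll 8")  /* must directly precede a loop */
#else
  #define TOY_UNROLL
#endif

// The hint only affects code generation; the loop's result is unchanged.
inline int toy_sum(const int* v, int n) {
  int acc = 0;
  TOY_UNROLL
  for (int i = 0; i < n; ++i) acc += v[i];
  return acc;
}
```

Using `_Pragma` rather than `#pragma` is what lets the hint live inside a macro, since a `#pragma` directive cannot appear in a macro body.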
N defaults to 2 (OJPH_DEC_BATCH_N). On Zen 5 N=2 is the AVX2 sweet spot: the
bulkier AVX2 step-2 runs between batches, so N=4's ~48KB step-1 working set
saturates the 48KB L1d and thrashes; N=2 also keeps the unrolled per-stream
state in registers for the 32-bit (i386) build this file serves.
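
A compile-time default of this kind, together with the N-at-a-time grouping it controls, might be sketched as follows. `TOY_DEC_BATCH_N` and `toy_group_row` are hypothetical names, and treating "same-width" as "N consecutive blocks of equal width" is an assumption; the real batched entry may group differently and also checks validity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mirror of OJPH_DEC_BATCH_N: defaults to 2, overridable with
// e.g. -DTOY_DEC_BATCH_N=4 on the compiler command line.
#ifndef TOY_DEC_BATCH_N
#define TOY_DEC_BATCH_N 2
#endif
static_assert(TOY_DEC_BATCH_N >= 1, "batch width must be at least 1");

struct toy_plan {
  std::size_t batched = 0;  // blocks that would go through the N-way kernel
  std::size_t one_way = 0;  // blocks that would take the unchanged 1-way path
};

// Walk a row of block widths. N consecutive blocks of equal width form one
// batch; anything left over (row edge, width mismatch) is decoded 1-way.
inline toy_plan toy_group_row(const std::vector<int>& widths) {
  const std::size_t n = TOY_DEC_BATCH_N;
  toy_plan plan;
  std::size_t i = 0;
  while (i < widths.size()) {
    bool same = (n > 1) && (i + n <= widths.size());
    for (std::size_t k = 1; same && k < n; ++k)
      same = (widths[i + k] == widths[i]);
    if (same) { plan.batched += n; i += n; }
    else      { plan.one_way += 1; i += 1; }
  }
  return plan;
}
```

For a row of widths 64, 64, 64, 64, 32 with the default N=2, the first four blocks form two batches and the narrower edge block falls back to the 1-way path.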
Whole-decode, best-of-15 ojph_expand to tmpfs on Zen 5, single-thread, vs a
1-way reference build: 8-bit lossless -10%, 16-bit lossless -8.7%, 8-bit lossy
-8.3% (wall clock -9.5% / -8.6% / -8.6%). Byte-identical to the 1-way path on
8/16-bit lossless, lossy, the 512x8 streaming shape, and a Kakadu multipass
stream (SgnProp/MagRef coverage); 82/82 ctest pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dear @osamu620, Thank you for this PR. Kind regards,
@osamu620 lossless benchmark round-trip passes!
Dear @osamu620, Thank you for all the effort you put in the previous PR. I think we should keep this for the time being to see if other parties are interested. Thank you again. Kind regards,
Follow-up to #276. The first step of decoding a codeblock — recovering the VLC and MEL segments of the HTJ2K cleanup pass (this decoder's "step 1") — is a serial dependency chain and the single hottest decode kernel (~42% of 8-bit AVX2 decode, IPC ~2.8). Codeblocks are independent, so decoding several in lockstep lets the out-of-order engine overlap their chains.
Two commits
ISA-agnostic plumbing. That interleaving can't live inside the per-block SIMD decoder, which by definition sees only one codeblock; the code that knows about "a row of codeblocks in a subband" sits one level up, in subband::pull_line, so the seam belongs there. This adds an optional decode_cb32_batch pointer (next to decode_cb32/decode_cb64, selected by the same CPU-feature dispatch) and a codeblock::decode_row that gathers a row and either calls the batch or falls back to the exact 1-way decode() (when no batch is registered, or for 64-bit blocks). No functional change on its own: that shared code is touched in just one place, future ISAs plug in with only a kernel + one dispatch line, and current output can't change.
AVX2 kernel. A templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes N same-width codeblocks in lockstep, driven by ojph_decode_codeblock_avx2_batch, which groups a subband row's same-width blocks N-at-a-time and falls back to 1-way for edge/odd/invalid blocks. Per-block validation and the second step (MagSgn + optional SgnProp/MagRef) are factored into shared helpers so the 1-way and batched paths stay in sync.
Performance
Single-thread, ojph_expand → tmpfs, taskset -c0, best-of-15, vs a 1-way reference build (Zen 5):
Whole-process includes the PPM write, so the decode-kernel gain is larger.
OJPH_DEC_BATCH_N defaults to 2, the measured Zen 5 sweet spot (the bulkier AVX2 step-2 between batches makes N=4's larger step-1 working set thrash L1; N=2 also keeps the unrolled per-stream state in registers for 32-bit builds). It's compile-overridable, and it can never regress: anything not batched takes the unchanged 1-way path.
Correctness
Byte-identical to the 1-way path on 8/16-bit lossless, lossy, the 512×8 streaming shape, and a Kakadu multipass stream (SgnProp/MagRef coverage); 82/82 ctest.
Built clean (0 warnings) locally across GCC on Linux (AVX2-only, AVX-512-enabled, and no-SIMD configs), MSVC VS2022 (x64 + Win32/32-bit), and MinGW UCRT64 GCC 16.
@palemieux — would you mind running this against the lossless benchmark round-trip, as you did on #276? Marking draft until that and CI are green.
🤖 Generated with Claude Code