Conversation
…put reads

The existing `Tensor.getDataAsFloatArray()` allocates a fresh `float[]` on every call and copies from the underlying off-heap buffer into it. In a steady-state inference loop this is a per-frame allocation proportional to the output tensor size — for a 144x192 single-channel depth output that's ~110 KB per call, multiplied by however many output tensors a model returns. Profiling on Android (Perfetto) showed this driving substantial ART GC pressure that ended up stealing CPU back from the inference thread itself via run-queue contention.

The off-heap `FloatBuffer` that backs the output `Tensor` is already owned by the native side and exposed internally via the package-private `getRawDataBuffer()`. `copyDataInto(FloatBuffer dst)` lets callers reuse a pre-allocated destination across calls, eliminating the per-call `float[]` allocation while keeping ownership unambiguous: the destination is caller-owned, so it's safe to hand off to async consumers without racing against the next `forward()` overwriting native memory.

Implemented for float32 (zero-copy bulk put) and float16 (per-element half->float widening, mirroring `getDataAsFloatArray`). Other dtypes inherit the base-class `IllegalStateException` default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
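The float16 path described above — per-element half->float widening into a caller-supplied `FloatBuffer` — can be sketched standalone. The bit manipulation below is the standard IEEE 754 binary16-to-binary32 conversion; `widenInto` is a hypothetical stand-in for the fp16 branch of `copyDataInto(FloatBuffer)`, not the actual `Tensor` implementation, since that class isn't shown here.

```java
import java.nio.BufferOverflowException;
import java.nio.FloatBuffer;
import java.nio.ShortBuffer;

public class Fp16Widen {
    // Standard IEEE 754 binary16 -> binary32 widening via bit manipulation.
    static float halfToFloat(short h) {
        int bits = h & 0xFFFF;
        int sign = (bits & 0x8000) << 16; // move sign to bit 31
        int exp = (bits >>> 10) & 0x1F;   // 5-bit exponent field
        int mant = bits & 0x3FF;          // 10-bit mantissa field
        if (exp == 0x1F) {                // Inf / NaN: all-ones exponent
            return Float.intBitsToFloat(sign | 0x7F800000 | (mant << 13));
        }
        if (exp == 0) {
            if (mant == 0) return Float.intBitsToFloat(sign); // signed zero
            // Subnormal half: normalize until the implicit leading 1 appears.
            while ((mant & 0x400) == 0) {
                mant <<= 1;
                exp--;
            }
            exp++;
            mant &= 0x3FF;
        }
        // Re-bias exponent: half bias 15 -> float bias 127 (difference 112).
        return Float.intBitsToFloat(sign | ((exp + 112) << 23) | (mant << 13));
    }

    // Sketch of the fp16 widening path: per-element conversion starting at
    // dst's current position, writing src.remaining() floats.
    static void widenInto(ShortBuffer src, FloatBuffer dst) {
        if (dst.remaining() < src.remaining()) throw new BufferOverflowException();
        while (src.hasRemaining()) dst.put(halfToFloat(src.get()));
    }

    public static void main(String[] args) {
        // 0x3C00 = 1.0, 0x4000 = 2.0, 0xC000 = -2.0 in binary16
        ShortBuffer src = ShortBuffer.wrap(new short[] {0x3C00, 0x4000, (short) 0xC000});
        FloatBuffer dst = FloatBuffer.allocate(3);
        widenInto(src, dst);
        System.out.println(dst.get(0) + " " + dst.get(1) + " " + dst.get(2)); // 1.0 2.0 -2.0
    }
}
```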
Extends the previous commit (which added `copyDataInto(FloatBuffer)` for float32 / float16) to cover every dtype that the `Tensor` class supports:

- `copyDataInto(ByteBuffer)`: int8
- `copyDataIntoUnsigned(ByteBuffer)`: uint8
- `copyDataInto(IntBuffer)`: int32
- `copyDataInto(LongBuffer)`: int64
- `copyDataInto(DoubleBuffer)`: float64
- `copyDataInto(ShortBuffer)`: float16 (raw fp16 bits, no widening)

Mirrors the asymmetry of `getDataAsByteArray` (int8) vs `getDataAsUnsignedByteArray` (uint8): calling `copyDataInto(ByteBuffer)` on a uint8 tensor throws, and `copyDataIntoUnsigned(ByteBuffer)` on int8 throws, just like the array accessors. The two methods are intentionally separate even though the underlying bits are identical, so a misuse caused by a dtype switch surfaces as an exception instead of silently producing values with the wrong sign interpretation.

The float16 `ShortBuffer` overload writes the raw 16-bit half-precision bits without widening; `copyDataInto(FloatBuffer)` on the same tensor still performs the half->float widening, matching the dual `getDataAsShortArray` / `getDataAsFloatArray` accessors that float16 already exposes.

Tests cover each new variant's happy path plus the asymmetric int8/uint8 rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
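The intentionally separate signed/unsigned accessors can be illustrated with a minimal standalone sketch. `DType` and `TensorSketch` below are illustrative stand-ins, not the real `Tensor` class; the point is that both methods copy the same bits but each one rejects the other's dtype:

```java
import java.nio.ByteBuffer;

public class SignednessGuard {
    enum DType { INT8, UINT8 } // stand-in for the real dtype enum

    static final class TensorSketch {
        final DType dtype;
        final ByteBuffer data; // stands in for the native off-heap buffer

        TensorSketch(DType dtype, byte[] raw) {
            this.dtype = dtype;
            this.data = ByteBuffer.wrap(raw);
        }

        // Valid only on int8 tensors; throws on anything else.
        void copyDataInto(ByteBuffer dst) {
            if (dtype != DType.INT8) {
                throw new IllegalStateException("copyDataInto(ByteBuffer) on " + dtype + " tensor");
            }
            dst.put(data.duplicate());
        }

        // Valid only on uint8 tensors. The copied bits are identical to the
        // signed path; the split exists so a dtype mix-up throws instead of
        // silently flipping the sign interpretation downstream.
        void copyDataIntoUnsigned(ByteBuffer dst) {
            if (dtype != DType.UINT8) {
                throw new IllegalStateException("copyDataIntoUnsigned(ByteBuffer) on " + dtype + " tensor");
            }
            dst.put(data.duplicate());
        }
    }

    public static void main(String[] args) {
        TensorSketch t = new TensorSketch(DType.UINT8, new byte[] {(byte) 0xFF, 0x01});
        t.copyDataIntoUnsigned(ByteBuffer.allocate(2)); // happy path
        try {
            t.copyDataInto(ByteBuffer.allocate(2)); // wrong accessor for uint8
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```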
meredithbayne approved these changes (Apr 29, 2026)
Nice improvements!
Summary
Adds `Tensor.copyDataInto(<TypedBuffer>)` to the Android Java API for every dtype, mirroring the version up for review on pytorch/executorch#19171. Eliminates the per-call `float[]` / `int[]` / etc. allocation that `getDataAs*Array()` performs by letting the caller supply (and reuse) a destination buffer.

Motivation
Profiling on Android (Perfetto) showed `output.toTensor().dataAsFloatArray` driving substantial ART GC pressure in the depth-inference loop. Each call allocates a fresh Java `float[]` sized to the tensor's element count and bulk-copies from the underlying off-heap buffer into it. For a model with two `[1, 1, 144, 192]` fp32 outputs that's ~110 KB × 2 = 220 KB of Java-heap churn per inference. At 2 fps that's 440 KB/sec just from reading outputs — enough to add several visible young-generation GCs per second, and ~14% of the inference wall time was being lost to run-queue contention as GC suspended app threads.

The native side already exposes the underlying off-heap `FloatBuffer` as a zero-copy view of the C++ tensor's `data_ptr()` (via the package-private `getRawDataBuffer`). The new `copyDataInto` API gives external callers a public way to drain that buffer into a destination they own and reuse across calls.

API
- float16 gets both a `FloatBuffer` overload (per-element half→float widening) and a `ShortBuffer` overload (raw fp16 bits, no widening).
- int8 / uint8 keep the `getDataAsByteArray` vs `getDataAsUnsignedByteArray` distinction — calling the wrong one throws.
- Writes begin at the destination's current position (standard `Buffer.put` semantics).
- An upfront `dst.remaining()` capacity check means an undersized destination throws `BufferOverflowException` before any partial widening is observed (matching the all-or-nothing semantics of the bulk-`put` paths).

Caller-side usage
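A caller-side sketch of the reuse pattern follows. Since the `Tensor` class isn't available standalone here, `fillSimulatedOutput` is a hypothetical stand-in for the `tensor.copyDataInto(dst)` call; the buffer lifecycle (allocate once, `clear()`/`flip()` per frame, zero per-frame allocation) is the part being illustrated:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class ReuseLoop {
    // Stand-in for tensor.copyDataInto(dst); the real call drains the native
    // off-heap buffer into dst starting at dst's current position.
    static void fillSimulatedOutput(FloatBuffer dst, float value) {
        while (dst.hasRemaining()) dst.put(value);
    }

    public static void main(String[] args) {
        int elements = 144 * 192; // depth-output size from the description
        // Allocate the destination ONCE, outside the inference loop: direct
        // and native-ordered, as a native-side bulk copy would expect.
        FloatBuffer dst = ByteBuffer.allocateDirect(elements * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        for (int frame = 0; frame < 3; frame++) {
            dst.clear();                      // reset position/limit; no allocation
            fillSimulatedOutput(dst, frame);  // real code: tensor.copyDataInto(dst)
            dst.flip();                       // ready to read elements [0, limit)
            float first = dst.get(0);         // consume without an extra copy
            System.out.println("frame " + frame + " first=" + first);
        }
    }
}
```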
Test plan
- `TensorTest.kt` covers each variant's happy path, asymmetric int8/uint8 rejection, fp16 raw-bits vs widened paths, position-respecting writes, and `BufferOverflowException` on undersized destinations (including the fp16 widening path's upfront check).
- `Tensor.java` compiles standalone with `javac` 17.
- `google-java-format` (1.23.0, project standard) and `ktfmt` (Meta style, matches the `spotless { kotlin { ktfmt() } }` config) both clean.
- `EstimateDepth.kt`: forthcoming change to swap `dataAsFloatArray` for `copyDataInto` + a 2-slot encoder pool.

🤖 Generated with Claude Code