[Blob] Support blob-write-null-on-missing-file for HTTP URLs. by wwj6591812 · Pull Request #8219 · apache/paimon

wwj6591812 · 2026-06-12T08:30:06Z

Background

blob-write-null-on-missing-file already allows Flink writes to store NULL instead of failing when a descriptor BLOB points to a missing file on filesystem-based URIs (file://, hdfs://, oss://, etc.).

However, for http:// / https:// URIs, UriReaderFactory.exists() previously always returned true for HTTP readers, so the pre-check never ran. As a result, missing HTTP resources (e.g. CDN image URLs returning 404) were only detected during the actual blob read in BlobFormatWriter, causing the whole write task to fail.

This is a common issue when ingesting external image URLs into Paimon blob tables.

Purpose

This PR makes blob-write-null-on-missing-file work for HTTP/HTTPS descriptor URLs.

Primary path (pre-check):

Add real HTTP existence checks via HttpClientUtils.exists() (HEAD first, fallback to lightweight Range: bytes=0-0 GET when HEAD is not supported)
Wire HttpUriReader.exists() into UriReaderFactory.exists()
Reuse the existing Flink write path in FlinkRowWrapper.isNullAt() so missing HTTP descriptors are treated as NULL before write

Fallback path (write-time):

In BlobFormatWriter, when writeNullOnMissingFile is enabled, catch HTTP 404 during blob read and write NULL with a warning instead of failing the task
Thread the option through BlobFileContext → MultipleBlobFileWriter / ExternalStorageBlobWriter → BlobFileFormat

With this change, bad HTTP URLs are skipped gracefully while the job continues, and the failing URL is logged for troubleshooting.

Tests

HttpClientUtilsTest
- HTTP 200 → exists() == true
- HTTP 404 → exists() == false
- HEAD not allowed (405) → fallback GET check
- isNotFoundError() helper
UriReaderFactoryTest
- HTTP missing resource → exists() == false
- HTTP available resource → exists() == true
BlobFormatWriterTest
- write-path 404 fallback writes NULL when enabled
- 404 still fails when option is disabled
FlinkRowWrapperTest
- HTTP 404 descriptor → isNullAt() == true when checking enabled
- HTTP 200 descriptor → isNullAt() == false when checking enabled

JingsongLi · 2026-06-13T02:39:23Z

+                        blobUri(blob),
+                        blobFieldName,
+                        e);
+                writeNullElement();


The missing-file fallback can leave stale bytes in the blob file. addElement writes MAGIC_NUMBER_BYTES before opening/reading the referenced blob, then on a 404 it records the row as NULL and returns without rewinding/truncating out. A 404 from newInputStream() already leaves the magic bytes behind; if a later non-null blob is written to the same file, BlobFileMeta computes that later blob's offset as 0 because the NULL length does not advance offsets, while the actual bytes start after the orphan magic bytes. That will make subsequent blob descriptors/readback point at the wrong offset and corrupt data.

Could we check existence before writing, or stage the blob into a temp buffer/file and only append to out after the read succeeds? It would also be good to add a test that writes a missing HTTP descriptor followed by a valid blob and reads the valid blob back.

@JingsongLi

Thanks for catching this — you are absolutely right.

The write-time 404 fallback previously wrote MAGIC_NUMBER_BYTES to out before opening the referenced blob. When the read failed, we recorded a NULL entry but left orphan magic bytes in the file. Because NULL entries do not advance blob offsets, a later non-null blob could get a descriptor offset of 0 while the actual payload started after those stale bytes, which would corrupt subsequent reads.

I have fixed this by staging the blob payload (magic number + bytes) in memory first and only appending to out after the referenced blob is fully read successfully. If the read fails with a 404 and blob-write-null-on-missing-file is enabled, we now write NULL without touching out at all.

I also added testMissingHttpBlobFollowedByValidBlobPreservesReadback, which writes a missing HTTP descriptor followed by a valid blob and verifies that the valid blob can be read back correctly via BlobFileMeta / BlobFormatReader.

The primary pre-check path (UriReaderFactory.exists() for HTTP + FlinkRowWrapper.isNullAt()) remains unchanged and should prevent most missing URLs from reaching the write path; the staged write is now a safe fallback for the remaining cases.

Please take another look when you have a chance. Thanks again!

JingsongLi · 2026-06-14T02:52:16Z

-        write(MAGIC_NUMBER_BYTES);
-
-        long blobPos = out.getPos();
+        ByteArrayOutputStream stagedPayload = new ByteArrayOutputStream();


Could we avoid staging the entire payload in memory here? This writer is used for normal blob writes too, so the new ByteArrayOutputStream + toByteArray() path can allocate roughly 2x the blob size before writing. Since BLOBs are expected to be large and blob.target-file-size can be much larger than task memory, a valid large blob can now OOM even when writeNullOnMissingFile is false. One safer approach is to open the source stream before appending anything to out; if opening hits 404, write NULL, otherwise keep the previous streaming copy path.

@JingsongLi

Thanks for the follow-up — you're right that staging the full payload in memory is not acceptable for normal large blob writes.

I've removed the ByteArrayOutputStream path and switched to the approach you suggested: open the source stream before appending anything to out. If opening fails with HTTP 404 and blob-write-null-on-missing-file is enabled, we record NULL without touching out; otherwise we keep the original streaming copy path (magic number + chunked read/write). This avoids the orphan-magic-byte issue for the common HTTP 404-at-open case without adding extra memory overhead for valid blobs.

testMissingHttpBlobFollowedByValidBlobPreservesReadback still passes. Please take another look when you have a chance.

JingsongLi · 2026-06-14T02:52:27Z

+     * support HEAD, a lightweight GET with {@code Range: bytes=0-0} is used as fallback.
+     */
+    public static boolean exists(String uri) throws IOException {
+        int statusCode = headStatusCode(uri);


Is it safe to make the HEAD result authoritative here? Some HTTP resources, including signed or GET-only URLs, can reject HEAD or return a different status than GET even though the payload is readable. With this change, a HEAD 404/403 would make exists() return false or throw before trying the range GET, so an existing blob could be treated as missing or fail the write. Could we fall back to Range: bytes=0-0 GET for non-success HEAD statuses as well, at least for 403/404, and only return false when the GET also returns 404?

@JingsongLi

Good point — making HEAD authoritative is unsafe for signed or GET-only URLs.

I've updated exists() so that only a successful HEAD (200) returns true immediately. For any non-200 HEAD status (including 403/404, as well as 405/501), we now fall back to a lightweight Range: bytes=0-0 GET and only return false when that GET also returns 404. Added tests for HEAD 404/403 with a readable range GET, and for HEAD 404 + GET 404.

Thanks again for the careful review!

wwj6591812 · 2026-06-14T14:50:15Z

@JingsongLi
Hi，Please CC, Thx

JingsongLi reviewed Jun 13, 2026

View reviewed changes

wwj6591812 force-pushed the support_blob_write_null_0612 branch from 63a534b to 7a2cc62 Compare June 13, 2026 05:18

JingsongLi reviewed Jun 14, 2026

View reviewed changes

[Blob] Support blob-write-null-on-missing-file for HTTP URLs.

b6181d6

wwj6591812 force-pushed the support_blob_write_null_0612 branch from 7a2cc62 to b6181d6 Compare June 14, 2026 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219

[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219
wwj6591812 wants to merge 1 commit into
apache:masterfrom
wwj6591812:support_blob_write_null_0612

wwj6591812 commented Jun 12, 2026

Uh oh!

JingsongLi Jun 13, 2026

Uh oh!

wwj6591812 Jun 13, 2026

Uh oh!

JingsongLi Jun 14, 2026

Uh oh!

wwj6591812 Jun 14, 2026

Uh oh!

JingsongLi Jun 14, 2026

Uh oh!

wwj6591812 Jun 14, 2026

Uh oh!

wwj6591812 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wwj6591812 commented Jun 12, 2026

Background

Purpose

Tests

Uh oh!

JingsongLi Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

wwj6591812 Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

wwj6591812 Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

wwj6591812 Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

wwj6591812 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants