Skip to content

[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219

Open
wwj6591812 wants to merge 1 commit into
apache:masterfrom
wwj6591812:support_blob_write_null_0612
Open

[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219
wwj6591812 wants to merge 1 commit into
apache:masterfrom
wwj6591812:support_blob_write_null_0612

Conversation

@wwj6591812

Copy link
Copy Markdown
Contributor

Background

blob-write-null-on-missing-file already allows Flink writes to store NULL instead of failing when a descriptor BLOB points to a missing file on filesystem-based URIs (file://, hdfs://, oss://, etc.).

However, for http:// / https:// URIs, UriReaderFactory.exists() previously always returned true for HTTP readers, so the pre-check never ran. As a result, missing HTTP resources (e.g. CDN image URLs returning 404) were only detected during the actual blob read in BlobFormatWriter, causing the whole write task to fail.

This is a common issue when ingesting external image URLs into Paimon blob tables.

Purpose

This PR makes blob-write-null-on-missing-file work for HTTP/HTTPS descriptor URLs.

Primary path (pre-check):

  • Add real HTTP existence checks via HttpClientUtils.exists() (HEAD first, fallback to lightweight Range: bytes=0-0 GET when HEAD is not supported)
  • Wire HttpUriReader.exists() into UriReaderFactory.exists()
  • Reuse the existing Flink write path in FlinkRowWrapper.isNullAt() so missing HTTP descriptors are treated as NULL before write

Fallback path (write-time):

  • In BlobFormatWriter, when writeNullOnMissingFile is enabled, catch HTTP 404 during blob read and write NULL with a warning instead of failing the task
  • Thread the option through BlobFileContextMultipleBlobFileWriter / ExternalStorageBlobWriterBlobFileFormat

With this change, bad HTTP URLs are skipped gracefully while the job continues, and the failing URL is logged for troubleshooting.

Tests

  • HttpClientUtilsTest
    • HTTP 200 → exists() == true
    • HTTP 404 → exists() == false
    • HEAD not allowed (405) → fallback GET check
    • isNotFoundError() helper
  • UriReaderFactoryTest
    • HTTP missing resource → exists() == false
    • HTTP available resource → exists() == true
  • BlobFormatWriterTest
    • write-path 404 fallback writes NULL when enabled
    • 404 still fails when option is disabled
  • FlinkRowWrapperTest
    • HTTP 404 descriptor → isNullAt() == true when checking enabled
    • HTTP 200 descriptor → isNullAt() == false when checking enabled

blobUri(blob),
blobFieldName,
e);
writeNullElement();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missing-file fallback can leave stale bytes in the blob file. addElement writes MAGIC_NUMBER_BYTES before opening/reading the referenced blob, then on a 404 it records the row as NULL and returns without rewinding/truncating out. A 404 from newInputStream() already leaves the magic bytes behind; if a later non-null blob is written to the same file, BlobFileMeta computes that later blob's offset as 0 because the NULL length does not advance offsets, while the actual bytes start after the orphan magic bytes. That will make subsequent blob descriptors/readback point at the wrong offset and corrupt data.

Could we check existence before writing, or stage the blob into a temp buffer/file and only append to out after the read succeeds? It would also be good to add a test that writes a missing HTTP descriptor followed by a valid blob and reads the valid blob back.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingsongLi

Thanks for catching this — you are absolutely right.

The write-time 404 fallback previously wrote MAGIC_NUMBER_BYTES to out before opening the referenced blob. When the read failed, we recorded a NULL entry but left orphan magic bytes in the file. Because NULL entries do not advance blob offsets, a later non-null blob could get a descriptor offset of 0 while the actual payload started after those stale bytes, which would corrupt subsequent reads.

I have fixed this by staging the blob payload (magic number + bytes) in memory first and only appending to out after the referenced blob is fully read successfully. If the read fails with a 404 and blob-write-null-on-missing-file is enabled, we now write NULL without touching out at all.

I also added testMissingHttpBlobFollowedByValidBlobPreservesReadback, which writes a missing HTTP descriptor followed by a valid blob and verifies that the valid blob can be read back correctly via BlobFileMeta / BlobFormatReader.

The primary pre-check path (UriReaderFactory.exists() for HTTP + FlinkRowWrapper.isNullAt()) remains unchanged and should prevent most missing URLs from reaching the write path; the staged write is now a safe fallback for the remaining cases.

Please take another look when you have a chance. Thanks again!

@wwj6591812 wwj6591812 force-pushed the support_blob_write_null_0612 branch from 63a534b to 7a2cc62 Compare June 13, 2026 05:18
write(MAGIC_NUMBER_BYTES);

long blobPos = out.getPos();
ByteArrayOutputStream stagedPayload = new ByteArrayOutputStream();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid staging the entire payload in memory here? This writer is used for normal blob writes too, so the new ByteArrayOutputStream + toByteArray() path can allocate roughly 2x the blob size before writing. Since BLOBs are expected to be large and blob.target-file-size can be much larger than task memory, a valid large blob can now OOM even when writeNullOnMissingFile is false. One safer approach is to open the source stream before appending anything to out; if opening hits 404, write NULL, otherwise keep the previous streaming copy path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingsongLi

Thanks for the follow-up — you're right that staging the full payload in memory is not acceptable for normal large blob writes.

I've removed the ByteArrayOutputStream path and switched to the approach you suggested: open the source stream before appending anything to out. If opening fails with HTTP 404 and blob-write-null-on-missing-file is enabled, we record NULL without touching out; otherwise we keep the original streaming copy path (magic number + chunked read/write). This avoids the orphan-magic-byte issue for the common HTTP 404-at-open case without adding extra memory overhead for valid blobs.

testMissingHttpBlobFollowedByValidBlobPreservesReadback still passes. Please take another look when you have a chance.

* support HEAD, a lightweight GET with {@code Range: bytes=0-0} is used as fallback.
*/
public static boolean exists(String uri) throws IOException {
int statusCode = headStatusCode(uri);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to make the HEAD result authoritative here? Some HTTP resources, including signed or GET-only URLs, can reject HEAD or return a different status than GET even though the payload is readable. With this change, a HEAD 404/403 would make exists() return false or throw before trying the range GET, so an existing blob could be treated as missing or fail the write. Could we fall back to Range: bytes=0-0 GET for non-success HEAD statuses as well, at least for 403/404, and only return false when the GET also returns 404?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingsongLi

Good point — making HEAD authoritative is unsafe for signed or GET-only URLs.

I've updated exists() so that only a successful HEAD (200) returns true immediately. For any non-200 HEAD status (including 403/404, as well as 405/501), we now fall back to a lightweight Range: bytes=0-0 GET and only return false when that GET also returns 404. Added tests for HEAD 404/403 with a readable range GET, and for HEAD 404 + GET 404.

Thanks again for the careful review!

@wwj6591812 wwj6591812 force-pushed the support_blob_write_null_0612 branch from 7a2cc62 to b6181d6 Compare June 14, 2026 08:23
@wwj6591812

Copy link
Copy Markdown
Contributor Author

@JingsongLi
Hi,Please CC, Thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants