[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219
[Blob] Support blob-write-null-on-missing-file for HTTP URLs.#8219wwj6591812 wants to merge 1 commit into
Conversation
| blobUri(blob), | ||
| blobFieldName, | ||
| e); | ||
| writeNullElement(); |
There was a problem hiding this comment.
The missing-file fallback can leave stale bytes in the blob file. addElement writes MAGIC_NUMBER_BYTES before opening/reading the referenced blob, then on a 404 it records the row as NULL and returns without rewinding/truncating out. A 404 from newInputStream() already leaves the magic bytes behind; if a later non-null blob is written to the same file, BlobFileMeta computes that later blob's offset as 0 because the NULL length does not advance offsets, while the actual bytes start after the orphan magic bytes. That will make subsequent blob descriptors/readback point at the wrong offset and corrupt data.
Could we check existence before writing, or stage the blob into a temp buffer/file and only append to out after the read succeeds? It would also be good to add a test that writes a missing HTTP descriptor followed by a valid blob and reads the valid blob back.
There was a problem hiding this comment.
Thanks for catching this — you are absolutely right.
The write-time 404 fallback previously wrote MAGIC_NUMBER_BYTES to out before opening the referenced blob. When the read failed, we recorded a NULL entry but left orphan magic bytes in the file. Because NULL entries do not advance blob offsets, a later non-null blob could get a descriptor offset of 0 while the actual payload started after those stale bytes, which would corrupt subsequent reads.
I have fixed this by staging the blob payload (magic number + bytes) in memory first and only appending to out after the referenced blob is fully read successfully. If the read fails with a 404 and blob-write-null-on-missing-file is enabled, we now write NULL without touching out at all.
I also added testMissingHttpBlobFollowedByValidBlobPreservesReadback, which writes a missing HTTP descriptor followed by a valid blob and verifies that the valid blob can be read back correctly via BlobFileMeta / BlobFormatReader.
The primary pre-check path (UriReaderFactory.exists() for HTTP + FlinkRowWrapper.isNullAt()) remains unchanged and should prevent most missing URLs from reaching the write path; the staged write is now a safe fallback for the remaining cases.
Please take another look when you have a chance. Thanks again!
63a534b to
7a2cc62
Compare
| write(MAGIC_NUMBER_BYTES); | ||
|
|
||
| long blobPos = out.getPos(); | ||
| ByteArrayOutputStream stagedPayload = new ByteArrayOutputStream(); |
There was a problem hiding this comment.
Could we avoid staging the entire payload in memory here? This writer is used for normal blob writes too, so the new ByteArrayOutputStream + toByteArray() path can allocate roughly 2x the blob size before writing. Since BLOBs are expected to be large and blob.target-file-size can be much larger than task memory, a valid large blob can now OOM even when writeNullOnMissingFile is false. One safer approach is to open the source stream before appending anything to out; if opening hits 404, write NULL, otherwise keep the previous streaming copy path.
There was a problem hiding this comment.
Thanks for the follow-up — you're right that staging the full payload in memory is not acceptable for normal large blob writes.
I've removed the ByteArrayOutputStream path and switched to the approach you suggested: open the source stream before appending anything to out. If opening fails with HTTP 404 and blob-write-null-on-missing-file is enabled, we record NULL without touching out; otherwise we keep the original streaming copy path (magic number + chunked read/write). This avoids the orphan-magic-byte issue for the common HTTP 404-at-open case without adding extra memory overhead for valid blobs.
testMissingHttpBlobFollowedByValidBlobPreservesReadback still passes. Please take another look when you have a chance.
| * support HEAD, a lightweight GET with {@code Range: bytes=0-0} is used as fallback. | ||
| */ | ||
| public static boolean exists(String uri) throws IOException { | ||
| int statusCode = headStatusCode(uri); |
There was a problem hiding this comment.
Is it safe to make the HEAD result authoritative here? Some HTTP resources, including signed or GET-only URLs, can reject HEAD or return a different status than GET even though the payload is readable. With this change, a HEAD 404/403 would make exists() return false or throw before trying the range GET, so an existing blob could be treated as missing or fail the write. Could we fall back to Range: bytes=0-0 GET for non-success HEAD statuses as well, at least for 403/404, and only return false when the GET also returns 404?
There was a problem hiding this comment.
Good point — making HEAD authoritative is unsafe for signed or GET-only URLs.
I've updated exists() so that only a successful HEAD (200) returns true immediately. For any non-200 HEAD status (including 403/404, as well as 405/501), we now fall back to a lightweight Range: bytes=0-0 GET and only return false when that GET also returns 404. Added tests for HEAD 404/403 with a readable range GET, and for HEAD 404 + GET 404.
Thanks again for the careful review!
7a2cc62 to
b6181d6
Compare
|
@JingsongLi |
Background
blob-write-null-on-missing-filealready allows Flink writes to storeNULLinstead of failing when a descriptor BLOB points to a missing file on filesystem-based URIs (file://,hdfs://,oss://, etc.).However, for
http:///https://URIs,UriReaderFactory.exists()previously always returnedtruefor HTTP readers, so the pre-check never ran. As a result, missing HTTP resources (e.g. CDN image URLs returning 404) were only detected during the actual blob read inBlobFormatWriter, causing the whole write task to fail.This is a common issue when ingesting external image URLs into Paimon blob tables.
Purpose
This PR makes
blob-write-null-on-missing-filework for HTTP/HTTPS descriptor URLs.Primary path (pre-check):
HttpClientUtils.exists()(HEAD first, fallback to lightweightRange: bytes=0-0GET when HEAD is not supported)HttpUriReader.exists()intoUriReaderFactory.exists()FlinkRowWrapper.isNullAt()so missing HTTP descriptors are treated asNULLbefore writeFallback path (write-time):
BlobFormatWriter, whenwriteNullOnMissingFileis enabled, catch HTTP 404 during blob read and writeNULLwith a warning instead of failing the taskBlobFileContext→MultipleBlobFileWriter/ExternalStorageBlobWriter→BlobFileFormatWith this change, bad HTTP URLs are skipped gracefully while the job continues, and the failing URL is logged for troubleshooting.
Tests
HttpClientUtilsTestexists() == trueexists() == falseisNotFoundError()helperUriReaderFactoryTestexists() == falseexists() == trueBlobFormatWriterTestNULLwhen enabledFlinkRowWrapperTestisNullAt() == truewhen checking enabledisNullAt() == falsewhen checking enabled