Problem
When wrapping an OpendalStore in a DataFusion session context, paths get double-prefixed, causing file lookups to fail.
Minimal reproduction:
let op = Operator::from_uri("hf://buckets/lhoestq/datasets")?;
let store = OpendalStore::new(op);
let ctx = SessionContext::new();
let url = Url::parse("hf://buckets/lhoestq/datasets")?;
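// register_object_store expects an Arc<dyn ObjectStore>
ctx.register_object_store(&url, std::sync::Arc::new(store));
// Fails: paths are double-prefixed
ctx.read_parquet("hf://buckets/lhoestq/datasets/...", ParquetReadOptions::default()).await?;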
What happens: DataFusion's DefaultObjectStoreRegistry matches on scheme + authority only (hf://buckets), so a query against hf://buckets/lhoestq/datasets/prompts.chat/data/file.parquet passes the full path lhoestq/datasets/prompts.chat/data/file.parquet to the store. But the Operator already has root /lhoestq/datasets/, so OpendalStore concatenates:
root + path = /lhoestq/datasets/ + lhoestq/datasets/prompts.chat/data/file.parquet
= /lhoestq/datasets/lhoestq/datasets/prompts.chat/data/file.parquet ❌
The only workaround I found is writing a custom ObjectStore wrapper that strips the registered prefix before passing operations to OpendalStore, then restores the full path in metadata so DataFusion's contains() check succeeds.
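To make the workaround concrete, here is a minimal sketch of just the path arithmetic such a wrapper needs, written against plain strings so it does not depend on any particular `object_store` version. The names `strip_registered_prefix` and `restore_registered_prefix` are hypothetical; a real wrapper would apply the first to every incoming `Path` before delegating to OpendalStore, and the second to every location returned in metadata.

```rust
/// Hypothetical helper: remove the registered prefix from a path handed over
/// by DataFusion. Returns None when the path is not under the prefix.
fn strip_registered_prefix<'a>(prefix: &str, location: &'a str) -> Option<&'a str> {
    let prefix = prefix.trim_matches('/');
    let location = location.trim_start_matches('/');
    if prefix.is_empty() {
        return Some(location);
    }
    let rest = location.strip_prefix(prefix)?;
    if rest.is_empty() {
        // The location is exactly the prefix.
        Some("")
    } else {
        // Match whole path segments only, so "lhoestq/datasets2/..." is
        // not treated as being under "lhoestq/datasets".
        rest.strip_prefix('/')
    }
}

/// Hypothetical helper: put the registered prefix back on a store-relative
/// path, so that the caller sees the same full path it asked for.
fn restore_registered_prefix(prefix: &str, relative: &str) -> String {
    let prefix = prefix.trim_matches('/');
    let relative = relative.trim_start_matches('/');
    match (prefix.is_empty(), relative.is_empty()) {
        (true, _) => relative.to_string(),
        (false, true) => prefix.to_string(),
        (false, false) => format!("{prefix}/{relative}"),
    }
}
```

With the prefix `/lhoestq/datasets/`, stripping `lhoestq/datasets/prompts.chat/data/file.parquet` yields `prompts.chat/data/file.parquet`, which is what the operator with that root expects; restoring gives back the original full path.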
Technical deep-dive
The full path transformation chain
1. Registration key extraction — DefaultObjectStoreRegistry::register_store() calls get_url_key():
fn get_url_key(url: &Url) -> String {
    format!("{}://{}", url.scheme(), &url[url::Position::BeforeHost..url::Position::AfterPort])
}
This means hf://buckets/lhoestq/datasets and hf://buckets both produce the key "hf://buckets". There's no path component in the key.
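The collision can be checked without pulling in DataFusion. The function below is a std-only approximation of the key extraction, not DataFusion's code: `approx_url_key` is a hypothetical name, it works on a raw string instead of a parsed `url::Url`, and unlike the `BeforeHost..AfterPort` slice it would keep any userinfo in the authority.

```rust
/// Std-only approximation of the registry key: keep "scheme://authority"
/// and drop the path, query, and fragment.
/// Assumption: the input contains "://"; otherwise None is returned.
fn approx_url_key(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#', or at the end of input.
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    Some(format!("{scheme}://{}", &rest[..end]))
}
```

Both `hf://buckets/lhoestq/datasets` and `hf://buckets` map to the same key under this model, so a store registered at the longer URL is effectively registered for everything under `hf://buckets`.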
2. Store lookup: When read_parquet("hf://buckets/lhoestq/datasets/...") is called, DataFusion calls registry.get_store(url) → get_url_key("hf://buckets/lhoestq/datasets") → "hf://buckets" → matches our registered store. The full path component of the URL, not a path relative to the registered URL, is then passed to the store's methods.
3. Listing path — In ListingTableUrl::list_prefixed_files():
- The URL is parsed into a ListingTableUrl with prefix = Path::from_url_path("/lhoestq/datasets/prompts.chat/data/...")
- The prefix is normalized to lhoestq/datasets/prompts.chat/data/... (leading / stripped)
- If it looks like a single file (no trailing /), DataFusion calls store.head(&full_prefix) directly with lhoestq/datasets/prompts.chat/data/...
- If it looks like a directory, DataFusion calls store.list(Some(&prefix)) with the same path
4. OpendalStore::get_opts() path handling — The store receives location lhoestq/datasets/prompts.chat/data/... and passes it directly to the OpenDAL operator:
async fn get_opts(&self, location: &Path, options: GetOptions) -> ... {
    let raw_location = percent_decode_path(location.as_ref());
    // raw_location = "lhoestq/datasets/prompts.chat/data/..."
    // Operator internally does: root + path
    //   = "/lhoestq/datasets/" + "lhoestq/datasets/prompts.chat/data/..."
    //   = "/lhoestq/datasets/lhoestq/datasets/prompts.chat/data/..." ❌
}
There is no way to tell OpendalStore what prefix was registered in the outer context. It just blindly concatenates operator.root + incoming_path.
Why this is tricky
- DataFusion's registry key is scheme://authority (no path), so there's no way to register at a more specific path.
- OpendalStore has no concept of a "registered prefix"; it just concatenates whatever path it receives with the operator's root.
How to fix it
Would using an empty root in the HF implementation be a possible solution, @kszucs @Xuanwo?
Or let me know if you have other ideas or if this issue should be in datafusion instead.
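To illustrate why an empty root might help, here is a simplified model of the root + path join described above. `join_root` is a hypothetical stand-in, not OpenDAL's actual normalization code, and whether the HF service could resolve the repository from a path under an empty root is exactly the open question.

```rust
/// Simplified model of joining an operator root with an incoming path.
/// Assumption: the result is always absolute, with no duplicate slashes
/// at the join point.
fn join_root(root: &str, path: &str) -> String {
    let root = root.trim_matches('/');
    let path = path.trim_start_matches('/');
    if root.is_empty() {
        // Empty root ("/" or ""): the incoming path is used as-is.
        format!("/{path}")
    } else {
        format!("/{root}/{path}")
    }
}
```

Under this model, root `/lhoestq/datasets/` with the path DataFusion sends reproduces the doubled `/lhoestq/datasets/lhoestq/datasets/...`, while root `/` yields `/lhoestq/datasets/prompts.chat/data/file.parquet`, which matches the full path DataFusion expects to see.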
Environment
opendal: 0.57.x
object_store_opendal: 0.57.x
object_store: 0.13.x
datafusion: 54.x