OpendalStore double-prefixing when used with DataFusion's object store registry #7786

Description

@lhoestq

Problem

When wrapping an OpendalStore in a DataFusion session context, paths get double-prefixed, causing file lookups to fail.

Minimal reproduction:

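use std::sync::Arc;

use datafusion::prelude::*;
use object_store_opendal::OpendalStore;
use opendal::Operator;
use url::Url;

let op = Operator::from_uri("hf://buckets/lhoestq/datasets")?;
let store = Arc::new(OpendalStore::new(op));

let ctx = SessionContext::new();
let url = Url::parse("hf://buckets/lhoestq/datasets")?;
ctx.register_object_store(&url, store);

// Fails: paths are double-prefixed
ctx.read_parquet("hf://buckets/lhoestq/datasets/...", ParquetReadOptions::default()).await?;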

What happens: DataFusion's DefaultObjectStoreRegistry matches on scheme + authority only (hf://buckets), so a query against hf://buckets/lhoestq/datasets/prompts.chat/data/file.parquet passes the full path lhoestq/datasets/prompts.chat/data/file.parquet to the store. But the Operator already has root /lhoestq/datasets/, so OpendalStore concatenates:

root + path = /lhoestq/datasets/ + lhoestq/datasets/prompts.chat/data/file.parquet
            = /lhoestq/datasets/lhoestq/datasets/prompts.chat/data/file.parquet  ❌

The only workaround I found is writing a custom ObjectStore wrapper that strips the registered prefix before passing operations to OpendalStore, then restores the full path in metadata so DataFusion's contains() check succeeds.
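For reference, the path handling inside such a wrapper can be sketched with plain strings. This is only a rough sketch, not the actual wrapper: `strip_registered_prefix` and `restore_registered_prefix` are hypothetical helper names, and the real wrapper would apply them inside its `ObjectStore` methods (strip before delegating to OpendalStore, restore on the returned metadata).

```rust
/// Hypothetical helper: drop the registered prefix from a path handed over by DataFusion.
/// Returns None when the path does not live under the prefix.
fn strip_registered_prefix<'a>(prefix: &str, path: &'a str) -> Option<&'a str> {
    let prefix = prefix.trim_matches('/');
    let path = path.trim_start_matches('/');
    if prefix.is_empty() {
        return Some(path);
    }
    let rest = path.strip_prefix(prefix)?;
    // Only accept a match on a path-segment boundary ("a/b" must not match "a/bc").
    if rest.is_empty() {
        Some(rest)
    } else {
        rest.strip_prefix('/')
    }
}

/// Hypothetical helper: put the prefix back on a path returned by the inner store,
/// so the location DataFusion sees still starts with the prefix it asked for.
fn restore_registered_prefix(prefix: &str, inner_path: &str) -> String {
    let prefix = prefix.trim_matches('/');
    let inner = inner_path.trim_start_matches('/');
    match (prefix.is_empty(), inner.is_empty()) {
        (true, _) => inner.to_string(),
        (false, true) => prefix.to_string(),
        (false, false) => format!("{prefix}/{inner}"),
    }
}

fn main() {
    let prefix = "lhoestq/datasets";
    let incoming = "lhoestq/datasets/prompts.chat/data/file.parquet";
    let inner = strip_registered_prefix(prefix, incoming).expect("path is under the prefix");
    println!("to OpendalStore:    {inner}");
    println!("back to DataFusion: {}", restore_registered_prefix(prefix, inner));
}
```

The segment-boundary check matters because a plain string prefix match would also accept sibling paths such as lhoestq/datasets-other/...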


Technical deep-dive

The full path transformation chain

1. Registration key extraction: DefaultObjectStoreRegistry::register_store() calls get_url_key():

fn get_url_key(url: &Url) -> String {
    format!("{}://{}", url.scheme(), &url[url::Position::BeforeHost..url::Position::AfterPort])
}

This means hf://buckets/lhoestq/datasets and hf://buckets both produce the key "hf://buckets". There's no path component in the key.
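The key collision can be checked without pulling in DataFusion. The snippet below is a simplified std-only stand-in, not the real get_url_key(): `url_key` is a hypothetical name, and it assumes a plain scheme://host[:port]/path URL with no userinfo.

```rust
/// Simplified stand-in for the key derivation: keep `scheme://authority`, drop the path.
/// Assumes a URL of the form scheme://host[:port]/path (no userinfo handling).
fn url_key(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#', or at the end of the string.
    let end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    Some(format!("{scheme}://{}", &rest[..end]))
}

fn main() {
    let with_path = url_key("hf://buckets/lhoestq/datasets");
    let without_path = url_key("hf://buckets");
    println!("{with_path:?} {without_path:?}");
    // Both URLs collapse to the same registry key.
    assert_eq!(with_path, without_path);
}
```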

2. Store lookup: When read_parquet("hf://buckets/lhoestq/datasets/...") is called, DataFusion calls registry.get_store(url) → get_url_key("hf://buckets/lhoestq/datasets") → "hf://buckets" → matches our registered store. The full URL path is then passed to the store's methods.

3. Listing path — In ListingTableUrl::list_prefixed_files():

  • The URL is parsed into a ListingTableUrl with prefix = Path::from_url_path("/lhoestq/datasets/prompts.chat/data/...")
  • The prefix is normalized to lhoestq/datasets/prompts.chat/data/... (leading / stripped)
  • If it looks like a single file (no trailing /), DataFusion calls store.head(&full_prefix) directly with lhoestq/datasets/prompts.chat/data/...
  • If it looks like a directory, DataFusion calls store.list(Some(&prefix)) with the same path
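The listing step above can be modelled roughly as follows. This is a sketch of the behaviour as described here, not DataFusion's code: `StoreCall` and `plan_store_call` are hypothetical names, glob handling is ignored, and trimming the trailing / is an assumption about how the normalized path looks.

```rust
/// Hypothetical model of the call DataFusion ends up making on the store.
#[derive(Debug, PartialEq)]
enum StoreCall {
    Head(String),
    List(String),
}

/// Normalize the URL path and pick head() for a file-like URL (no trailing '/')
/// or list() for a directory-like one.
fn plan_store_call(url_path: &str) -> StoreCall {
    // Assumption: the normalized path carries no leading or trailing '/'.
    let normalized = url_path.trim_matches('/').to_string();
    if url_path.ends_with('/') {
        StoreCall::List(normalized)
    } else {
        StoreCall::Head(normalized)
    }
}

fn main() {
    // Either way, the path handed to the store still starts with lhoestq/datasets.
    println!("{:?}", plan_store_call("/lhoestq/datasets/prompts.chat/data/file.parquet"));
    println!("{:?}", plan_store_call("/lhoestq/datasets/prompts.chat/data/"));
}
```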

4. OpendalStore::get_opts() path handling — The store receives location lhoestq/datasets/prompts.chat/data/... and passes it directly to the OpenDAL operator:

async fn get_opts(&self, location: &Path, options: GetOptions) -> ... {
    let raw_location = percent_decode_path(location.as_ref());
    // raw_location = "lhoestq/datasets/prompts.chat/data/..."
    // Operator internally does: root + path
    // = "/lhoestq/datasets/" + "lhoestq/datasets/prompts.chat/data/..."
    // = "/lhoestq/datasets/lhoestq/datasets/prompts.chat/data/..."  ❌
}

There is no way to tell OpendalStore what prefix was registered in the outer context. It just blindly concatenates operator.root + incoming_path.
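The concatenation itself is easy to reproduce in isolation. The function below is a simplified string model of root + path, not OpenDAL's implementation, and `resolve_under_root` is a hypothetical name. It also shows, under the same simplification, what the join would produce if the operator root were just /.

```rust
/// Simplified model of how a rooted operator resolves a relative path: root + path.
fn resolve_under_root(root: &str, path: &str) -> String {
    let root = root.trim_end_matches('/');
    let path = path.trim_start_matches('/');
    format!("{root}/{path}")
}

fn main() {
    let incoming = "lhoestq/datasets/prompts.chat/data/file.parquet";

    // Root taken from the hf://buckets/lhoestq/datasets URI: the prefix appears twice.
    let doubled = resolve_under_root("/lhoestq/datasets/", incoming);
    assert_eq!(
        doubled,
        "/lhoestq/datasets/lhoestq/datasets/prompts.chat/data/file.parquet"
    );

    // With an empty root ("/"), the same incoming path resolves without duplication.
    let single = resolve_under_root("/", incoming);
    assert_eq!(single, "/lhoestq/datasets/prompts.chat/data/file.parquet");

    println!("{doubled}\n{single}");
}
```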

Why this is tricky

  1. DataFusion's registry key is scheme://authority (no path) — there's no way to register at a more specific path.
  2. OpendalStore has no concept of a "registered prefix" — it just concatenates whatever path it receives with the operator's root.

How to fix it

Would using an empty root in the HF implementation be a possible solution, @kszucs @Xuanwo?

Or let me know if you have other ideas or if this issue should be in datafusion instead.

Environment

  • opendal: 0.57.x
  • object_store_opendal: 0.57.x
  • object_store: 0.13.x
  • datafusion: 54.x
