Prerequisites
Describe the bug
When attempting to write data directly to an S3 path using the Python bindings, the underlying Rust engine panics with the following error if executed inside a thread without an active Tokio reactor:
PanicException: there is no reactor running
This is highly reproducible in distributed execution frameworks like Ray (Ray Data pipelines) or even standard Python concurrent.futures.ThreadPoolExecutor.
The root cause appears to be an impedance mismatch at the PyO3 FFI boundary: Vortex's S3Store and async I/O modules implicitly assume they run inside an active Tokio runtime context, but Python worker threads (such as Ray's C++ event loop workers) have no Tokio runtime unless one is created for them.
Steps to Reproduce
Run the write operation with an S3 path inside a standard Python thread pool or a Ray worker:
import concurrent.futures

import vortex

# Assuming `vortex_batch` is a valid Vortex Array or PyArrow Table
# and `vortex.io` is the relevant API entry point.
def write_vortex_s3(data):
    # This will trigger the panic because the thread lacks a Tokio reactor
    vortex.io.write(data, "s3://my-bucket/test.vx")

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.submit(write_vortex_s3, vortex_batch).result()
Expected Behavior
To make the Python API robust for distributed environments, the bindings should ideally handle this gracefully. I suggest two potential solutions:
1. FFI Runtime Fallback: If an S3 path is detected at the Python boundary, the Rust side should check for a runtime and, if missing, wrap the async I/O call in a local runtime (e.g., tokio::runtime::Builder::new_current_thread().enable_all().build().unwrap().block_on(...)).
2. Expose In-Memory / NativeFile API: Expose an interface that accepts memory buffers or pyarrow.NativeFile, allowing users to bypass the Rust S3Store entirely and handle the S3 streaming via pyarrow.fs or boto3 cleanly on the Python side.
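In the meantime, the second idea can be approximated from user code by writing to a local temporary file and uploading it from Python. The sketch below is only an illustration under assumptions I have not verified: that writing to a local path does not hit the same panic in a worker thread, and that holding the file on local disk is acceptable. `write_then_upload`, `write_local`, and `upload` are hypothetical names, not part of Vortex.

```python
import os
import tempfile
from typing import Any, Callable


def write_then_upload(
    data: Any,
    write_local: Callable[[Any, str], None],
    upload: Callable[[str, str], None],
    dest_uri: str,
    suffix: str = ".vx",
) -> None:
    """Write `data` to a temporary local file, then hand that file to `upload`.

    `write_local(data, path)` is assumed to be a writer that works on local
    paths (for example `vortex.io.write`, if local writes are unaffected).
    `upload(local_path, dest_uri)` is any Python-side uploader.
    """
    # The temporary directory and its contents are removed on exit,
    # whether or not the upload succeeds.
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "part" + suffix)
        write_local(data, local_path)
        upload(local_path, dest_uri)
```

Inside the worker function, `write_local` would be `vortex.io.write` and `upload` a small wrapper around boto3 or pyarrow.fs that parses the bucket and key out of `dest_uri`. This keeps all S3 traffic out of the Rust S3Store, at the cost of an extra copy through local disk.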
Environment
Vortex Python Version: vortex-data 0.74.0
OS: Linux
Execution Context: Ray Data / Python ThreadPoolExecutor