openblas_set_num_threads() is silently overridden when built with USE_OPENMP #5806

@nh2

Summary

When OpenBLAS is built with USE_OPENMP, calling openblas_set_num_threads(1) before a BLAS operation has no lasting effect.

The thread count is reset to omp_get_max_threads() on every BLAS call, making the API a silent no-op.

Users who call openblas_set_num_threads(1) hoping to get deterministic BLAS results will eventually find that it does not work: the results remain nondeterministic because the setting is overridden.

The workaround is to run with the OMP_NUM_THREADS=1 environment variable set. But it seems very wrong that this cannot be done from code, and that a function explicitly named for setting the number of threads does not actually do that.

Call chain demonstrating the problem

  1. User calls openblas_set_num_threads(1) which calls goto_set_num_threads(1), setting blas_cpu_number = 1.
  2. User then calls cblas_sgemm(); code is in void CNAME() in gemm.c.
  3. Inside, args.nthreads = get_gemm_optimal_nthreads(MNK) calls num_cpu_avail(3).
  4. num_cpu_avail() calls omp_get_max_threads(), which returns the OpenMP default (all CPUs, since OMP_NUM_THREADS is unset).
  5. if (blas_cpu_number != openmp_nthreads) is true (1 != N).
  6. goto_set_num_threads(openmp_nthreads) overrides blas_cpu_number back to N.
  7. num_cpu_avail() returns blas_cpu_number (now N), which flows to GEMM_THREAD(..., args.nthreads) then to exec_blas(num_cpu, queue) then to #pragma omp parallel for.
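
The steps above can be modeled in a few lines of C. This is a simplified, self-contained sketch and not the actual OpenBLAS source: the model_ functions are hypothetical stand-ins for goto_set_num_threads(), num_cpu_avail(), and omp_get_max_threads(), and the value of 8 threads is an assumption standing in for "all CPUs".

```c
#include <assert.h>

/* Hypothetical stand-ins, NOT the real OpenBLAS code. */
static int assumed_omp_max_threads = 8; /* assumed OpenMP default (all CPUs) */
static int blas_cpu_number = 8;         /* OpenBLAS's own thread counter */

/* Stand-in for omp_get_max_threads(): never changes, because nothing
 * in this model ever tells "OpenMP" about the user's request. */
int model_omp_get_max_threads(void) {
    return assumed_omp_max_threads;
}

/* Models goto_set_num_threads(): only updates blas_cpu_number. */
void model_goto_set_num_threads(int num_threads) {
    if (num_threads < 1) num_threads = 1;
    blas_cpu_number = num_threads;
}

/* Models num_cpu_avail() under USE_OPENMP as described in steps 4 to 7:
 * a mismatch with the OpenMP value is "repaired" in OpenMP's favor. */
int model_num_cpu_avail(void) {
    int openmp_nthreads = model_omp_get_max_threads();
    if (blas_cpu_number != openmp_nthreads) {
        model_goto_set_num_threads(openmp_nthreads);
    }
    return blas_cpu_number;
}

/* Requests 1 thread, then returns what the next BLAS call would use. */
int threads_used_after_requesting_one(void) {
    model_goto_set_num_threads(1);   /* step 1 */
    assert(blas_cpu_number == 1);    /* the request is stored ... */
    return model_num_cpu_avail();    /* ... and then overridden (steps 4-7) */
}
```

In this model, threads_used_after_requesting_one() returns the assumed OpenMP value (8) rather than the requested 1, which mirrors the behavior reported here.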

Root cause

goto_set_num_threads() does NOT call omp_set_num_threads() -- confirmed by searching the entire driver/ directory (zero results). So the OpenMP runtime's idea of the thread count is never updated, and num_cpu_avail() always re-syncs blas_cpu_number from omp_get_max_threads().

Expected behavior

openblas_set_num_threads(1) should durably limit OpenBLAS to 1 thread, even when built with USE_OPENMP. The only current workaround is setting OMP_NUM_THREADS=1 in the environment, which is a process-global side effect affecting all OpenMP consumers.

Suggested fix

When USE_OPENMP is defined, goto_set_num_threads() should also call omp_set_num_threads(num_threads) so that subsequent calls to omp_get_max_threads() in num_cpu_avail() return the value the user requested.

I have not tested this yet, but it seems to me the most sensible approach.
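
A rough sketch of what the suggested change could look like, again as a simplified model rather than a patch against the real source. The sketch_ functions and the starting value of 8 are hypothetical. When compiled with OpenMP enabled it uses the real omp_set_num_threads() and omp_get_max_threads(); otherwise it falls back to small stubs so that the logic can still be exercised.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs (assumption: 8 threads by default) for builds without OpenMP. */
static int stub_omp_max_threads = 8;
static void omp_set_num_threads(int n) { stub_omp_max_threads = n; }
static int omp_get_max_threads(void) { return stub_omp_max_threads; }
#endif

static int blas_cpu_number = 8; /* hypothetical initial value */

/* Sketch of goto_set_num_threads() with the suggested extra call. */
void sketch_goto_set_num_threads(int num_threads) {
    if (num_threads < 1) num_threads = 1;
    blas_cpu_number = num_threads;
    /* Suggested addition: keep the OpenMP runtime in sync. */
    omp_set_num_threads(num_threads);
}

/* Same re-sync logic as described in the call chain, unchanged. */
int sketch_num_cpu_avail(void) {
    int openmp_nthreads = omp_get_max_threads();
    if (blas_cpu_number != openmp_nthreads) {
        sketch_goto_set_num_threads(openmp_nthreads);
    }
    return blas_cpu_number;
}

/* Requests a thread count, then returns what the next BLAS call would use. */
int threads_used_after_fix(int requested) {
    sketch_goto_set_num_threads(requested);
    return sketch_num_cpu_avail();
}
```

With the extra call in place, the comparison in sketch_num_cpu_avail() no longer sees a mismatch, so a request for 1 thread survives. Note that omp_set_num_threads() also affects other OpenMP parallel regions subsequently started from the calling thread, which may or may not be acceptable for the real fix.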

Practical impact

This breaks deterministic computation for any library that links OpenBLAS and tries to force single-threaded BLAS via the documented API. Multi-threaded floating-point reductions in exec_blas() produce nondeterministic results due to different summation orders across runs.
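
The summation-order effect can be shown in isolation. The following is a minimal illustration of floating-point non-associativity with hand-picked values; it is not taken from exec_blas(), and the function names are made up for this example.

```c
#include <assert.h>

/* Sums x[0..n-1] from first to last element, rounding to float each step. */
float sum_left_to_right(const float *x, int n) {
    volatile float acc = 0.0f; /* volatile forces single-precision rounding */
    for (int i = 0; i < n; i++) acc = acc + x[i];
    return acc;
}

/* Sums the same elements from last to first. */
float sum_right_to_left(const float *x, int n) {
    volatile float acc = 0.0f;
    for (int i = n - 1; i >= 0; i--) acc = acc + x[i];
    return acc;
}

/* Returns 1 if the two orders disagree on {1e8, -1e8, 1}, else 0. */
int orders_disagree(void) {
    const float x[3] = { 1e8f, -1e8f, 1.0f };
    /* Left to right: (1e8 - 1e8) + 1 = 1.
     * Right to left: 1 - 1e8 rounds to -1e8 in float, then + 1e8 = 0. */
    return sum_left_to_right(x, 3) != sum_right_to_left(x, 3);
}
```

When several threads accumulate partial sums, the order in which those partial results are combined can vary between runs, which is how the same inputs can yield different outputs.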

I found this when trying to make llama-cpp deterministic with its --threads 1 option, which turned out not to work.

Version

Tested on commit 8cecf899e (v0.3.32).
