Skip to content

Enable user to take ownership of MPI#722

Merged
otbrown merged 41 commits into
QuEST-Kit:develfrom
otbrown:mpi-update
May 28, 2026
Merged

Enable user to take ownership of MPI#722
otbrown merged 41 commits into
QuEST-Kit:develfrom
otbrown:mpi-update

Conversation

@otbrown
Copy link
Copy Markdown
Collaborator

@otbrown otbrown commented Apr 14, 2026

Closes #712.

  • QuEST no longer uses MPI_COMM_WORLD everywhere. Instead it uses mpiCommQuest (which will very often just be a duplicate of MPI_COMM_WORLD). This aligns with best practise for library developers, as it helps avoid avoid clashes with user MPI calls.
  • Added a new custom environment constructor which lest the user positively declare that they want to take ownership of MPI. QuEST relinquishes responsibility, and will not attempt to initialise or finalise MPI itself. It will still throw if there are an incompatible number of MPI processes.
  • Added some minimal testing... We should probably look at environment testing in more detail at some point! In fairness, I think if the mpiCommQuest stuff was broken essentially any QuEST program would also be broken.
  • Swapped Irecv and Isend in the comms routine as is the first thing I was asked about when we encountered issues with slingshot-11. It doesn't make any difference.
  • New Subcommunicator interface lets the user create a custom QuESTEnv on communicator that is not MPI_COMM_WORLD. This interface is guarded behind compile-time flags as it requires bringing the MPI header into the interface, which means all consumers of libQuEST need to include MPI as well. This interface is not yet documented or tested. I think this is tolerable in the short term as it should only appeal to advanced users.

Speaking of tests! Test results on my laptop:

otbrown@oliver-win:~/quantum/quest/source/QuEST/build$ mpirun -n 4 ./tests/tests

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 0
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       4

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   10
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  CPU + OMP + MPI

Randomness seeded to: 2200603699
===============================================================================
All tests passed (293841 assertions in 275 test cases)

and min_example output:

Date:
2026-04-13T18:54:07+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled....................1
    isMpiSubCommunicatorCompiled.....0
    isGpuCompiled....................0
    isOmpCompiled....................1
    isCuQuantumCompiled..............0
  [deployment]
    isMpiEnabled............1
    doesUserOwnMpi..........0
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........754.7 GiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............35
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........37
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............17
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]

Qureg:
  [deployment]
    isMpiEnabled.....1
    isGpuEnabled.....0
    isOmpEnabled.....1
  [dimension]
    isDensMatr.....0
    numQubits......20
    numCols........N/A
    numAmps........2^20 = 1048576
  [distribution]
    numNodes.....2^3 = 8
    numCols......N/A
    numAmps......2^17 = 131072 per node
  [memory]
    cpuAmps...........2 MiB per node
    gpuAmps...........N/A
    cpuCommBuffer.....2 MiB per node
    gpuCommBuffer.....N/A
    globalTotal.......32 MiB

Qureg (20 qubit statevector, 1048576 qcomps over 8 nodes, 4 MiB per node):
    -0.00062991+0.00025551i   |0⟩        [node 0]
    (8.3557e-5)-0.00063022i   |1⟩
    -0.00053343-0.00011416i   |2⟩
    0.00028553-0.00018421i    |3⟩
    -0.00094126-0.00097974i   |4⟩
    0.00016001+0.00042058i    |5⟩
    0.00073393-0.002013i      |6⟩
    0.00061578+0.00046316i    |7⟩
    0.00048523+0.00068937i    |8⟩
    -0.00081878-0.00060096i   |9⟩
    0.0012373+0.00049652i     |10⟩
    0.0001958+0.00014134i     |11⟩
    0.0013159-0.00040911i     |12⟩
    -0.00061634+0.00094838i   |13⟩
    0.00019392-0.00049259i    |14⟩
    -0.00031263+0.00018272i   |15⟩
                ⋮
    -0.00053361+0.00010033i   |1048560⟩  [node 7]
    0.0017006-0.00053801i     |1048561⟩
    0.0010772-(1.5397e-5)i    |1048562⟩
    -0.00085736-0.00062965i   |1048563⟩
    0.00083238-0.00053071i    |1048564⟩
    -0.00025421+0.00064579i   |1048565⟩
    -0.00069857-0.00043407i   |1048566⟩
    (-9.9973e-5)-0.00094092i  |1048567⟩
    0.00092175-0.00068456i    |1048568⟩
    -0.00026758-0.00027345i   |1048569⟩
    0.00034301+0.00011848i    |1048570⟩
    0.00026923-0.00043647i    |1048571⟩
    -0.00088784-0.00011044i   |1048572⟩
    -0.00037642-0.00087563i   |1048573⟩
    0.00031388+0.00031749i    |1048574⟩
    0.00014548+0.00030294i    |1048575⟩

Total probability: 1

JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
161633       QuEST_min+   00:00:04
161633.batch      batch   00:00:04
161633.exte+     extern   00:00:04
161633.0     min_examp+   00:00:01

There is a caveat here: distributed tests on Cirrus keep failing, but this appears to be true of devel as well as this branch, so I'll do some more digging and raise it as a separate issue once I have a better of idea of what is going on.

otbrown and others added 21 commits February 27, 2026 15:53
…for SubComm, because of course it enforces CXX
…ything but it makes MPI library implementers less nervous
@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented Apr 16, 2026

On the plus side, a distributed QFT benchmark circuit runs perfectly correctly:

QFT Benchmark on 34 Qubits
Date:
2026-04-14T15:24:08+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST QFT Benchmark

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled...........1
    isGpuCompiled...........0
    isOmpCompiled...........1
    isCuQuantumCompiled.....0
  [deployment]
    isMpiEnabled............1
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........1.475 TiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............36
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........38
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............18
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]


Performing QFT on 34 qubits.
Time to run QFT Circuit: 420.544s.

Correctness checks:
Statevector norm: 1
First amplitude: (7.62939e-06, 0)
JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
162993       QFT_bench+   00:07:08
162993.batch      batch   00:07:08
162993.exte+     extern   00:07:08
162993.0            qft   00:07:05

@otbrown otbrown requested a review from JPRichings April 16, 2026 11:56
Comment thread quest/include/environment.h Outdated
*/
void initCustomQuESTEnv(int useDistrib, int useGpuAccel, int useMultithread);

void initCustomMpiQuESTEnv(int useDistrib, int userOwnsMpi, int useGpuAccel, int useMultithread);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want a new initialiser so its really clear where there is a different setup? Also this probably braked compatibility for older software.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a new initialiser! This is effectively a C interface, so no overloading allowed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I had missread this I thought it was replacing initCustomQueSTEnv hence the concern. Once again reading comprehension issue must do better.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is more subtle than I might have liked, but the alternative would be an initialiser that was way too long!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initCustomMpiCommQuESTEnv?

@JPRichings
Copy link
Copy Markdown
Contributor

Looks good not run yet.

My only comment is above and raises a wider point where we need to think about how we managed change in the programming interface as we probably want to stabilise as much as possible and only add to existing functionality.

@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented Apr 27, 2026

Looks good not run yet.

My only comment is above and raises a wider point where we need to think about how we managed change in the programming interface as we probably want to stabilise as much as possible and only add to existing functionality.

Agreed, If we're being proper about it, breaking changes to the API require a new major version release, and I don't think we have the time or the inclination currently.

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented Apr 28, 2026

Remaining testing required:

  • world comm -1 rank

  • world_comm as subcomm

  • sub comm only 1 rank

  • 50/50 split of world comm

  • Random splitting (4 rank subcomm or similar)

Copy link
Copy Markdown

@iarejula-bsc iarejula-bsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me, really like the idea! I was wondering if this design would allow running quregs in parallel, each with their own communicator (different envs for each one, maybe?), or if that would be impossible with the current implementation. I think that would be a highly interesting feature. As my team is trying to apply malleability to QuEST, that would be a huge win for us.

A few small comments below, but note this is an untested review.

Comment thread quest/src/comm/comm_config.cpp Outdated
Comment thread quest/src/comm/comm_config.cpp Outdated
Comment thread quest/src/comm/comm_config.cpp Outdated
Comment thread quest/src/api/subcommunicator.cpp Outdated
@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented May 15, 2026

In subcommunicator.cpp:17, userQuestComm appears to be duplicated into mpiCommQuest before the normal environment validation runs. If validation later fails under a throwing or custom error handler, the duplicated communicator may be left in mpiCommQuest with no cleanup, which is an MPI resource leak. Haven't had time to trace this fully so it may be a non-issue depending on execution order, but probably worth a quick check.

We now check comm_isInit before comm_setMpiComm, and comm_setMpiComm itself checks that the duplication is successful and therefore (at least according to the standard) if newComm was valid.

@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented May 15, 2026

I note from some testing that the error_... functions are definitely not doing what I expected them to, but that is not a Friday night problem.

Edit: ye olde print debugging indicates the exit function should be called. Could be a timing issue with the process not being called before MPI hits an error. Should definitely revisit those error functions at some point.

@TysonRayJones
Copy link
Copy Markdown
Member

Sorry for my own delay! A horrifying week and another on the way, but I have most of today to rip through the outstanding stuff for v4.3 (🤞 🤞 )

As in if we are running on HPC with GPUs should we run all Qureg in the same mode? i.e. Distributed, GPU aware,... If we end up with a Qureg on CPU and one on GPU for example then I think we need to consider how we are controlling that carefully as I don't think that was the intended change here

I originally misinterpreted you and typed the below response - which I now realise you already understand. Kept it for posterity!

I note the autodeployer already relaxes the requirement that all Qureg have the same mode. A user can already prepare multiple Quregs simultaneously that differ in their multithreadedness, GPU-acceleration and distribution - that has been true since v4.0. Presently, createQureg will actually attempt to automatically choose the deployments per-Qureg based on the Qureg size and available hardware. The main new challenge here (as far as I can tell) is updating the autodeployer to be cognizant of only the resources available to the processes within its subcommunicator.

Of the available resources N nodes of compute of a given type (we could split out GPUs and CPUs on the same node here as different "nodes") which ones are allocated to a given Qureg or FullStateDiagMatr?

Ooh that's a super interesting idea; similar to an automatic choosing of a subcommunicator for a Qureg (too big for one process, to small to distribute across the 1028 available). Sounds very ambitious, maybe best we first try to enable superusers to do such things manually through this MPI opening.

I think my remaining confusion is how do you propose picking the communicator set by setDefaultQuESTCommunicator();?

That would be the communicator that QuEST first uses if you do not override it (like right now, but ofc we don't use the world group), which includes all available nodes. Just a way for a user to stop setting their custom comm upon subsequently created Qureg.

However I think we'd be better off letting the user set deployment parameters, and then dealing with the MPI ourselves. The envelope of MPI processes for QuEST would then be initialisation time only, but we could enable 'sub-deployment' for specific Quregs. This avoids the challenges of even more mpi.h in the public headers, and I suspect makes it easier to avoid orphaned processes and deadlocks.

Oh interesting - so we would let users have "full MPI control" at QuESTEnv initialisation, but only limited MPI control of individual Qureg? That's fine by me! I note I am okay not to coddle the MPI superusers, especially if the alternative is to regrettably restrict their freedom/experimentation with custom MPI - but QuEST is of course open source and hackable. I suppose the greatest priority may be to keep the (internal) design simple and transparent so it's easy to tinker and extend.

One of the issues is that Quregs deployed in different comms are (possibly) incompatible with one another, so all multi-qureg functions would need to either reject them, or otherwise handle the inter-comm communication, which would be another significant re-write with performance implications.

Probably different comm Qureg should just be incompatible - just like some existing multi-Qureg functions do not accept a CPU-only and GPU pair (like here). Probably ends up a single new validation function - or rather, just a new subcommunicator-specific line in the existing quregsAreCompatible validation! If we think up a usecase where it's very useful to make different-comm Qureg compatible, we could introduce a migrateQuregCommunicator or similar hijinks.

I'm much more relaxed about introducing the new API as it is, and then potentially changing it in the future, as this is explicitly advanced functionality hidden behind an advanced interface. I expect the vast majority of users will continue to want to use all the available MPI ranks! (Or indeed no MPI ranks, and just CPU multithreading or a GPU...)

Ok that's fair enough! Especially if it all ends up in a dedicated MPI-specific header. Totally agreed the more general functionality (Qureg specific generalisation) is a future design problem.

When I get a chance, I will comb through the different diff with some nits 😉

@otbrown
Copy link
Copy Markdown
Collaborator Author

otbrown commented May 18, 2026

When I get a chance, I will comb through the different diff with some nits 😉

Thanks Tyson! Good to know that we already have a concept of incompatible Quregs, so it wouldn't be a totally new concept within the codebase.

@iarejula-bsc: Let us know how you would like to be recorded (by name or username) in the contribution file for your contribution here!

@iarejula-bsc
Copy link
Copy Markdown

Hi, i have been really busy and i didnt take a look over the last discussion, is there anything i should help?

@iarejula-bsc: Let us know how you would like to be recorded (by name or username) in the contribution file for your contribution here!

You can use my full name: Íñigo Aréjula Aísa

Comment thread quest/include/environment.h
@otbrown otbrown mentioned this pull request May 28, 2026
@otbrown otbrown merged commit 79e8824 into QuEST-Kit:devel May 28, 2026
236 of 238 checks passed
@TysonRayJones
Copy link
Copy Markdown
Member

Would have been handy to know this is ready for review before being merged! Will note some minor comments now we can address in the future.

Can we add an example in examples/isolated/, like custom_mpi.c, which shows how to use the new functions? I still don't fully appreciate how initCustomMpiCommQuESTEnv and initCustomMpiQuESTEnv are to be used together

Copy link
Copy Markdown
Member

@TysonRayJones TysonRayJones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address these when I get a sec, but here's some feedback! (marking my progress with 👀 )

(addressed most in #762)

int isMultithreaded;
int isGpuAccelerated;
int isDistributed;
bool userOwnsMpi;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an int, or the other fields updated to bool, since they are all on equal footing. The existing is* fields are always set to 0 or 1. Possibly there was confusion with the arguments of initCustomQuESTEnv() which accepts similarly named arguments as -1 in order for the auto-deployer to override them - that never reaches the QuESTEnv.


#if QUEST_COMPILE_MPI && QUEST_COMPILE_SUBCOMM

#include <stdbool.h>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the C bool being imported here? This is an internal C++ file and can use the inbuilt C++ bool

}

// initialise QuEST around that communicator
initCustomMpiQuESTEnv(useDistrib, userOwnsMpi, useGpuAccel, useMultithread);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (initCustomMpiCommQuESTEnv) doesn't validate the input parameters, and is calling another API function (initCustomMpiQuESTEnv) which does validate. When the latter throws, it will use the latter's validation error message; __func__ will report the latter threw the error, not the former. So the user will see the name of a function they did not call


#if QUEST_COMPILE_MPI
MPI_Comm comm_getMpiComm() {
return mpiCommQuest;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably check that mpiCommQuest != MPI_COMM_NULL and otherwise throw an internal error, to protect against an internal caller prematurely accessing it, before it is overwritten from its default

#if QUEST_COMPILE_MPI
#include <mpi.h>

static MPI_Comm mpiCommQuest = MPI_COMM_NULL;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a global singleton and should use the convention of beginning with global or global_.

Also, the Quest suffix seems redundant given it's a variable private to this internal QuEST function

Comment thread quest/src/core/errors.cpp
Comment on lines +159 to +162
void error_commInvalidMpiComm() {

raiseInternalError("The supplied MPI communicator was MPI_COMM_NULL, or duplication failed.");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplication failure fine as an internal error, but we can check for MPI_COMM_NULL at user-input validation stage

Comment thread quest/src/core/errors.cpp
Comment on lines +189 to +192
void error_commDoubleSetMpiComm() {

raiseInternalError("An attempt was made to set mpiCommQuest after it had already been set, as indicated by mpiCommQuest != MPI_COMM_NULL.");
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally caught by user validation

REQUIRE( (env.isMultithreaded == 0 || env.isMultithreaded == 1) );
REQUIRE( (env.isGpuAccelerated == 0 || env.isGpuAccelerated == 1) );
REQUIRE( (env.isDistributed == 0 || env.isDistributed == 1) );
REQUIRE( (env.userOwnsMpi == 0 || env.userOwnsMpi == 1) );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pointless given userOwnsMpi is a bool

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, artifact from when it wasn't!


#include "quest/include/config.h"

#if QUEST_COMPILE_MPI && QUEST_COMPILE_SUBCOMM
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not thrilled about guarding the entire header, and the contextualising of exposing subcommunicator as "compiling". Semantics like "expose subcommunicator" might have been better, since all the internal trickery of "don't expose internal subcommunicator-specific functions to other TUs" is brittle when QUEST_COMPILE_MPI already safely guards exposure to MPI types.

This is a half-thought for a future me

Comment thread quest/src/core/errors.cpp
raiseInternalError("A function attempted to communicate via more messages than permitted (since there would be more uniquely-tagged messages than the tag upperbound).");
}

void error_commDoubleSetMpiComm() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is hard to parse (double?!), and asymmetric with the existing error_commAlreadyInit. Could be clarified to:

error_commAlreadyHasSetMpiComm`

* Advanced initialiser which lets the user positively declare that they take responsibility for MPI.
* This means we assume they have called MPI_Init, and that they will call MPI_Finalize.
*/
void initCustomMpiQuESTEnv(int useDistrib, bool userOwnsMpi, int useGpuAccel, int useMultithread);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the user needs to resolve MPI in order to own it, it seems fair to move this into the same "optional advanced-MPI user" module needed by subcommunicator. I think relegating this to an alternate header is better give this is anticipated to be tentatively, and is opaquely named. It can then live in a dedicated doc group, e.g. advanced_env, or similarly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants