Skip to content

libmicrokit: add API for resuming x86 VCPU#431

Open
dreamliner787-9 wants to merge 3 commits into
seL4:mainfrom
au-ts:x86_vcpu_resume
Open

libmicrokit: add API for resuming x86 VCPU#431
dreamliner787-9 wants to merge 3 commits into
seL4:mainfrom
au-ts:x86_vcpu_resume

Conversation

@dreamliner787-9

@dreamliner787-9 dreamliner787-9 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Depends on seL4/seL4#1534, I've worked around this for now by including vmenter.h in libmicrokit/include/microkit.h.

Previously, the Microkit event handler had no mechanism on x86 to resume a vCPU while simultaneously waiting for an incoming notification. This PR adds microkit_vcpu_x86_deferred_resume() to libmicrokit, providing a cleaner way to integrate x86 vCPU resumption with the Microkit event handler.

Due to a kernel limitation in void NORETURN slowpath(syscall_t syscall) of src/arch/x86/c_traps.c, only the VMM's TCB bound Notification is checked for a pending Notification. An endpoint object can't passed to seL4_VMEnter() so PPCs won't work while the VCPU is running. Technically, PPCs still work if the VCPU isn't resumed, but I feel like we should not expose this footgun to users.

@dreamliner787-9 dreamliner787-9 force-pushed the x86_vcpu_resume branch 2 times, most recently from a0cf31f to 862ed5c Compare March 26, 2026 03:32
@dreamliner787-9 dreamliner787-9 force-pushed the x86_vcpu_resume branch 2 times, most recently from 58c7a1e to 764504a Compare March 31, 2026 10:44
@midnightveil

Copy link
Copy Markdown
Collaborator

For the record: I'm not entirely convinced this is the best way the API could look/behave, it feels quite ugly.

@dreamliner787-9 dreamliner787-9 force-pushed the x86_vcpu_resume branch 2 times, most recently from 59d8991 to 190c9d6 Compare May 20, 2026 05:26
@dreamliner787-9

Copy link
Copy Markdown
Contributor Author

For the record: I'm not entirely convinced this is the best way the API could look/behave, it feels quite ugly.

Yes I agree, as discussed internally, there isn't a clean way to integrate the way virtualisation works on x86 seL4 with the Microkit event loop. Unless we allow the user to override the main() entrypoint and let the user call into the library in some way...

@Indanz

Indanz commented May 20, 2026

Copy link
Copy Markdown

For the record: I'm not entirely convinced this is the best way the API could look/behave, it feels quite ugly.

Yes I agree, as discussed internally, there isn't a clean way to integrate the way virtualisation works on x86 seL4 with the Microkit event loop.

Why can't the Microkit event loop be x86 virtualisation aware? Or have a x86 version of the event loop that calls VMEnter instead of recv. You can add a special function that enters this x86 virtualisation aware event loop and never returns. Then the user only needs to do one special call during init once. If the tooling is aware of all this, it could do this automatically on return of the init function too and the user wouldn't need to do anything special (other than being aware that a VCPU task runs both normal code and a VM).

@dreamliner787-9

dreamliner787-9 commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

For the record: I'm not entirely convinced this is the best way the API could look/behave, it feels quite ugly.

Yes I agree, as discussed internally, there isn't a clean way to integrate the way virtualisation works on x86 seL4 with the Microkit event loop.

Why can't the Microkit event loop be x86 virtualisation aware? Or have a x86 version of the event loop that calls VMEnter instead of recv. You can add a special function that enters this x86 virtualisation aware event loop and never returns. Then the user only needs to do one special call during init once.

I've done some thinking on what you've said and I like it a lot. My take on your idea is that we will have one API that looks like:

static void microkit_x86_vcpu_on(uint64_t rip, uint64_t prim_proc_ctl, uint32_t irq_info)

The user calls this before returning from init() or notified() once during the PD's lifetime. This switches the normal event loop into a x86 virtualisation aware mode. Then the Microkit will automatically resume the VCPU as part of the event loop. No deferred calls like right now.

For argument passing to seL4_VMEnter(), that will be initially taken care of with this call. For subsequent VM exits due to notifications or faults, we can easily hook it to the existing read/write VMCS API. That is, any writes to the guest RIP, Primary Processor Controls or IRQ info get written to local variables, then it will be written to message registers right before the event loop calls seL4_VMEnter(). Rather than having to do a read/write VMCS syscall.

Under this model, the user just need to be aware that a PD with a VCPU will be executing guest code if init() or notified() returns and the PD get scheduled.

Edit: it maybe beneficial to have a microkit_x86_vcpu_off as well. So that this model is similar to the ARM's VCPU restart and stop APIs.

If the tooling is aware of all this, it could do this automatically on return of the init function too and the user wouldn't need to do anything special (other than being aware that a VCPU task runs both normal code and a VM).

This would be ideal, but VCPUs are resumed by user code at runtime so I'm unsure how this could fit in.

@dreamliner787-9 dreamliner787-9 changed the title libmicrokit: add microkit_vcpu_x86_deferred_resume() call libmicrokit: add API for resuming x86 VCPU May 21, 2026
@Indanz

Indanz commented May 21, 2026

Copy link
Copy Markdown

If the tooling is aware of all this, it could do this automatically on return of the init function too and the user wouldn't need to do anything special (other than being aware that a VCPU task runs both normal code and a VM).

This would be ideal, but VCPUs are resumed by user code at runtime so I'm unsure how this could fit in.

Resuming a VCPU is just calling seL4_VMEnter(). That is no different from a code flow point of view than blocking on seL4_ReplyRecv(). Either can be called after return of the user callback function.

Considering x86 VCPU tasks can't receive IPC calls anyway, but only notifications, for Microkit it would make sense to model non-notification returns from VMEnter as Protected procedure calls. Or be more specific and add another callback function for those.

(I'm ignoring the practical implementation split between generic Microkit and libvm or whatever virtualisation aware code you have. For a more virtualisation agnostic Microkit it would help to move stuff more into libraries and less in the Microkit framework.)

Comment thread libmicrokit/include/microkit.h Outdated
Comment thread libmicrokit/include/microkit.h
Comment thread libmicrokit/src/main.c Outdated
Comment thread libmicrokit/src/main.c Outdated
Comment thread libmicrokit/src/main.c Outdated
Comment thread libmicrokit/src/main.c Outdated
Comment thread libmicrokit/src/main.c Outdated
Comment thread libmicrokit/src/main.c Outdated
Comment thread tool/microkit/src/sdf.rs Outdated
Comment thread tool/microkit/src/sdf.rs Outdated
@dreamliner787-9 dreamliner787-9 force-pushed the x86_vcpu_resume branch 3 times, most recently from a85ab86 to eae14da Compare June 2, 2026 05:41
Comment thread libmicrokit/include/microkit.h
Comment thread libmicrokit/include/microkit.h Outdated
Comment thread libmicrokit/include/microkit.h Outdated
Comment on lines +537 to +538
/* On x86, a TCB can only have one bound VCPU at any given time.
* So we don't take a `microkit_child vcpu` here. */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't that mean that you want to take that argument away from microkit_vcpu_x86_write_vmcs/microkit_vcpu_x86_read_vmcs too? (And any other x86 virtualisation functions there may be.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, the reason why I haven't done that is because I don't know how VM with multiple VCPUs will look like yet, because there are multiple approaches you can take.

You could allow one VMM PD access to multiple VCPU caps to configure them and/or handle faults, i.e. you will have an implicit "big VMM lock". This is currently how ARM VM are modelled in Microkit. For x86, we will need additional trampoline threads that resume the VCPUs. The downside is that you have more context switching, but you would be able to replicate the ARM model. In this case we would need to pass microkit_child vcpu to all the VCPU APIs.

Or, you could adopt a "multi VMM" approach where each PD only have access 1 VCPU. Then the user is responsible for creating 1 PD per VCPU, handle IPIs with explicit channels between the PDs and synchronise access to any global VMM state themselves. This model would incur less context switching and there isn't a "big lock" like the previous one. But there are more complications: you would need to ensure that all the VMM PDs have access to the same resources (memory regions, interrupts, etc) for fault handling, and channels between the VMM PDs for every possible combinations of IPIs (n(n-1) channels). In this case we would not need to pass microkit_child vcpu to all the VCPU APIs.

Arguably the "multi VMM" approach is already possible on both ARM and x86 in Microkit. But at this point I'm not sure whether supporting the "big VMM lock" model on x86 is appropriate so I'm keeping the option open.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you say argues against dropping the parameter for microkit_vcpu_x86_on/off too though.

I strongly advise against some trampoline construction that shuffles all VCPU handling to one VMM task: The reason to have multiple VCPUs is to use multiple cores, so they all will be on different cores. If you handle all VCPUs with one VMM task, all operations become cross-core with huge overhead: First the fault/trap needs to be propagated to the core the VMM is on, and then any reply needs to be funnelled back.

The added value of handling multiple, independent virtual machines with one VMM is approximately zero.

This means that also on Arm, you do not want to use the model with one central VMM handling multiple VCPUs, even though it's simpler.

But there are more complications: you would need to ensure that all the VMM PDs have access to the same resources (memory regions, interrupts, etc) for fault handling,

Why? A VMM doesn't need access to the virtual machine's memory, interrupts etc.

and channels between the VMM PDs for every possible combinations of IPIs (n(n-1) channels).

Why do you need IPIs? Why would one VMM care about what others do?

If you need some kind of central control, then have a global VMM manager which communicates with all the VMMs handling VCPUs. Then you need only n channels and have only one decision point.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly advise against some trampoline construction that shuffles all VCPU handling to one VMM task: The reason to have multiple VCPUs is to use multiple cores, so they all will be on different cores. If you handle all VCPUs with one VMM task, all operations become cross-core with huge overhead: First the fault/trap needs to be propagated to the core the VMM is on, and then any reply needs to be funnelled back.

I agree on the cross-core overhead. I wouldn't recommend it for anything production grade. The reason I brought it up is outlined below.

The added value of handling multiple, independent virtual machines with one VMM is approximately zero.

If you mean handling multiple independent VMs, I agree. However, if you mean handling multiple VCPUs of a single VM with one VMM, I do see value in it from a prototyping perspective.

We currently have the single-VMM + multi-VCPU model working on our ARM VMM, and it was a non-invasive evolution from the single-VCPU code. While x86 would require a trampoline approach, some Microkit changes for multiple VCPUs in one PD, and incur a performance hit. It offers a much simpler development path to a working prototype. It allows the user to get a basic multi-core VM booting first, and then mature their implementation into a performant multi-VMM design later.

That said, I see the argument for keeping the x86 VCPU API aligned with the current kernel implementation. If you feel the prototyping benefit isn't worth accommodating, I'm happy to remove the microkit_child vcpu arguments from all the x86 VCPU APIs. However, since dropping them across the rest of the x86 VCPU APIs would be a breaking change, I'd prefer to handle that API restructuring in a separate, dedicated PR so we can keep this one unblocked and non-breaking.

Why? A VMM doesn't need access to the virtual machine's memory, interrupts etc.

On x86, the VMM needs access to the VM's memory primarily for fault handling. For example, consider an EPT VM Exit when the guest accesses an emulated I/O APIC. The architecture tells you the faulting guest physical address and whether it was a read or a write, but unlike ARM, it does not provide the read/write width or the target CPU registers. To emulate the operation, the VMM must take the instruction pointer, walk the guest page tables, fetch the faulting instruction from guest RAM, and decode it. If multiple VMMs are managing different VCPUs for a single VM, all of them need access to guest RAM to do this.

Additionally, if the VMM is emulating a storage or network device (e.g., VirtIO), it naturally needs access to guest RAM to process DMA requests.

Regarding interrupts, this is necessary for pass-through devices. Since mainline seL4 does not yet support posted interrupts, the physical interrupt must be routed to the VMM so the injection process can be emulated in software.

Why do you need IPIs? Why would one VMM care about what others do?

The guest OS will send IPIs between any pair of VCPUs. If a single VM is split across multiple VMMs, they must coordinate to inject these IPIs into the target VCPUs. Connecting the VMMs with an $n(n-1)$ channel mesh is the brute-force way to handle this. The central control approach you mentioned would solve this problem at a cost of more context switching.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Microkit production grade or a prototype toolkit? That's the choice that you need to make, and be very clear about what your design goals are.

On x86, the VMM needs access to the VM's memory primarily for fault handling.

Wonderful, that explains where virtio came from.

Additionally, if the VMM is emulating a storage or network device (e.g., VirtIO), it naturally needs access to guest RAM to process DMA requests.

That's why I hate virtio, because it forces write access.

the physical interrupt must be routed to the VMM so the injection process can be emulated in software.

The seL4 manual doesn't explain how IRQ injection works on x86, but I assume it's by manipulating VMCS and the return message for seL4_VMEnter(). You don't need access to the interrupts for that, you just need to know when the real interrupt happened, and you do that via notifications, not the IRQ objects themselves. You also don't want to handle all interrupts on all cores, so this would be partitioned anyway.

The guest OS will send IPIs between any pair of VCPUs.

Fair enough, that's a good reason.

Connecting the VMMs with an $n(n-1)$ channel mesh is the brute-force way to handle this. The central control approach you mentioned would solve this problem at a cost of more context switching.

You don't need full-blown channels for just IPIs though, you could limit it to one shared memory page and $n$ notification caps.

But anyway, IPIs are just passing info through, it's pretty stateless. I was talking more about VMMs needing to communicate between them for internal reasons.

Comment thread libmicrokit/include/microkit.h Outdated
Comment thread docs/manual.md Outdated
Comment thread docs/manual.md Outdated
Comment thread docs/manual.md Outdated
Comment thread docs/manual.md
Comment thread libmicrokit/src/main.c Outdated
Because you cannot pass an endpoint object with the syscall.

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Add:
microkit_vcpu_x86_on()
microkit_vcpu_x86_off()

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Add
microkit_vcpu_x86_on()
microkit_vcpu_x86_off()

Signed-off-by: Bill Nguyen <bill.nguyen@unsw.edu.au>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants