Handling of Guest-Mode MONITOR and MWAIT

Gabriel L. Somlo

Last updated: Tue. Feb. 05, 2014
Feedback to: somlo at cmu dot edu

The MONITOR and MWAIT instructions have become a very popular method for modern operating systems to implement Idle Threads in an energy-efficient way. These instructions started shipping on x86_64 chips (Intel and AMD) in the 2006 time frame. According to specs, the OS must check for their availability via CPUID before utilizing them, to avoid encountering an Invalid Opcode exception.
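
As a rough illustration of the CPUID check a well-behaved OS is expected to perform, the sketch below tests bit 3 of ECX from CPUID leaf 1, which is where MONITOR/MWAIT support is reported. The helper names are hypothetical, and the code assumes a GCC or Clang toolchain that provides <cpuid.h>; it is not taken from any particular OS.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

/* CPUID leaf 1 reports MONITOR/MWAIT support in bit 3 of ECX. */
#define CPUID_1_ECX_MONITOR (1u << 3)

/* Pure helper (hypothetical name): decode the bit from a CPUID.01H:ECX value. */
static bool ecx_has_monitor(uint32_t ecx)
{
    return (ecx & CPUID_1_ECX_MONITOR) != 0;
}

/* Query the running CPU; reports "unsupported" on non-x86 builds. */
static bool cpu_has_monitor_mwait(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;            /* leaf 1 not available */
    return ecx_has_monitor(ecx);
#else
    return false;
#endif
}
```

An idle thread would call something like cpu_has_monitor_mwait() once at startup and fall back to HLT when it returns false.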

The problem is that, on one hand, KVM does not support MONITOR and MWAIT, issuing Invalid Opcode exceptions to any guest attempting to execute such an instruction. On the other hand, Mac OS X (prior to 10.8, a.k.a. Mountain Lion) will invoke the instructions from its default idle thread (AppleIntelCPUPowerManagement.kext) indiscriminately, without checking CPUID. My hypothesis is that since any and all Intel-based Mac computers ever shipped came with CPUs that supported MONITOR and MWAIT, OS developers at Apple used to simply assume the instructions would always be present on any hardware configuration they could possibly care to support.

As explained in more detail in the specs and in my Idle Thread slides, MONITOR will "arm" the CPU monitoring hardware to watch for writes to a given memory location; a write to that location will "trigger" the monitoring hardware; and MWAIT will put the CPU core to sleep if the monitoring hardware is armed, until such time that a write, interrupt, or some other event triggers it. If unarmed, MWAIT simply behaves as a NOP.

Architecturally, both MONITOR and MWAIT are equivalent to a NOP instruction, and the monitoring hardware is invisible to the CPU programmer. MWAIT can be seen as a prolonged, low-power NOP instruction whose duration is determined by the above-mentioned "invisible" state in the CPU silicon. Additionally, the CPU programmer must expect spurious wakeups (MWAIT may wake before it's "supposed to"), and therefore use the instruction from within a polling loop.
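
The polling-loop requirement can be sketched with a toy software model of the monitoring hardware. Real MONITOR and MWAIT normally cannot be issued from user space, so everything below (struct monitor_hw, model_monitor, model_mwait, idle_until_set) is a hypothetical stand-in, and the model treats every MWAIT sleep as ending immediately, i.e. as a possibly spurious wakeup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one core's monitoring hardware (hypothetical names). */
struct monitor_hw {
    const volatile int *addr;  /* location being watched */
    bool armed;                /* the "invisible" armed state */
};

/* MONITOR: arm the hardware on the given location. */
static void model_monitor(struct monitor_hw *hw, const volatile int *addr)
{
    hw->addr = addr;
    hw->armed = true;
}

/*
 * MWAIT: returns true if the core would have entered a sleep state,
 * false if it acted as a NOP because the monitor was not armed.
 * In this model any sleep ends right away (a "spurious" wakeup).
 */
static bool model_mwait(struct monitor_hw *hw)
{
    bool slept = hw->armed;

    hw->armed = false;
    return slept;
}

/*
 * Idle loop: the flag is re-checked after arming and after every
 * wakeup, because MWAIT returning says nothing about why it woke.
 * Returns the number of MWAIT rounds; max_iters bounds the demo.
 */
static int idle_until_set(struct monitor_hw *hw, volatile int *flag,
                          int max_iters)
{
    int iters = 0;

    while (*flag == 0 && iters < max_iters) {
        model_monitor(hw, flag);
        if (*flag != 0)      /* close the race between check and arm */
            break;
        model_mwait(hw);
        iters++;
    }
    return iters;
}
```

The second check of the flag, between MONITOR and MWAIT, is what keeps a write that lands just before arming from being missed.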

The remainder of this section discusses a variety of ways in which KVM can offer sufficient support for MONITOR and MWAIT to enable Mac OS X guest execution.

1. Allow MONITOR/MWAIT to be executed in guest mode.

Currently, KVM configures the physical CPU in a way that causes it to trap out of guest mode back into host mode (a.k.a. perform a VM exit) when one of a list of privileged instructions (e.g. HLT, but also MONITOR and MWAIT) is encountered. One of the early patches I encountered simply removed MONITOR and MWAIT from that list, preventing a VM exit upon their encounter during guest (a.k.a. VMX-non-root or L>0) mode.
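
In VMX terms, the "list" is a bit mask in the primary processor-based VM-execution controls. The sketch below shows, under that assumption, what removing MONITOR and MWAIT from the list amounts to. The constant names follow the style used in the Linux kernel sources and the bit positions are those given in the Intel SDM; the function allow_guest_monitor_mwait is hypothetical and is not the patch referred to above.

```c
#include <stdint.h>

/* Primary processor-based VM-execution control bits (Intel SDM). */
#define CPU_BASED_HLT_EXITING      0x00000080u  /* bit 7  */
#define CPU_BASED_MWAIT_EXITING    0x00000400u  /* bit 10 */
#define CPU_BASED_MONITOR_EXITING  0x20000000u  /* bit 29 */

/*
 * Hypothetical helper: given the control word KVM would program,
 * return one in which MONITOR and MWAIT no longer cause a VM exit.
 * All other exit conditions (e.g. HLT) are left untouched.
 */
static uint32_t allow_guest_monitor_mwait(uint32_t exec_control)
{
    return exec_control &
           ~(CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING);
}
```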

According to the spec (see V3, S25-3, pp25-8), under certain conditions (which happen to be met by the OS X idle thread), guest-mode MWAIT will always default to being treated as a NOP, never entering a low-power sleep state. This causes each guest VCPU to always utilize 100% of a host core, regardless of the actual level of guest activity.

2. Emulate MONITOR/MWAIT as NOP

As mentioned above, KVM currently requests that, among other instructions, MONITOR and MWAIT generate a VM exit and be handled in host (rather than guest) mode. The current handler for both instructions generates an Invalid Opcode exception, a behavior consistent with the non-support for the instructions advertised via CPUID.

My current stable patch against KVM replaces the handler for MONITOR and MWAIT with one that emulates the NOP instruction (i.e., skips the current MONITOR or MWAIT and re-enters VM guest mode execution from the immediately following instruction). As before, these are short, non-power-saving NOP instructions, and therefore each idling VCPU will utilize 100% of the available cycles of a physical core on the host.
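
The essence of NOP emulation is "advance the guest instruction pointer past the instruction and resume the guest". Below is a minimal, self-contained sketch of that idea; struct toy_vcpu and the toy_handle_* functions are hypothetical stand-ins, not KVM code. A real VMX exit handler would read the instruction length from the VMCS rather than hard-code it.

```c
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for a VCPU's register state. */
struct toy_vcpu {
    uint64_t rip;   /* guest instruction pointer */
};

/* MONITOR encodes as 0F 01 C8 and MWAIT as 0F 01 C9: three bytes each. */
#define MONITOR_MWAIT_INSN_LEN 3u

/*
 * Emulate a NOP: skip the trapped instruction.
 * Returns 1, meaning "handled, re-enter guest mode".
 */
static int toy_handle_nop(struct toy_vcpu *vcpu, unsigned int insn_len)
{
    vcpu->rip += insn_len;
    return 1;
}

static int toy_handle_monitor(struct toy_vcpu *vcpu)
{
    return toy_handle_nop(vcpu, MONITOR_MWAIT_INSN_LEN);
}

static int toy_handle_mwait(struct toy_vcpu *vcpu)
{
    return toy_handle_nop(vcpu, MONITOR_MWAIT_INSN_LEN);
}
```

Because the handler returns immediately, an idle loop built on these instructions spins at full speed, which is where the 100% host core utilization comes from.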

As a workaround, it is possible to force Mac OS X to revert to a HLT-based idle thread by removing the default MONITOR/MWAIT one:

	sudo rm -rf /System/Library/Extensions/AppleIntelCPUPowerManagement.kext

This reduces host core utilization to single digits during guest idle times, since the guest VCPUs are removed from scheduling and execution on the host while halted.

Due to its simplicity and relative cleanliness, this combined approach may be a viable long-term solution: NOP-based MONITOR/MWAIT emulation will allow booting Mac OS X from factory-default install media. Once installed, the guest may be "optimized" for power consumption and host CPU utilization by forcing it to fall back to a HLT-based idle thread.

3. Emulate MWAIT as HLT

Assuming the requirement to run a completely unmodified OS X guest install, we must support the default MONITOR/MWAIT idle thread in production, and attempt to alleviate host CPU utilization without the option of falling back to the HLT-based version. An interesting observation is that, on single-processor systems, MWAIT behaves very much like HLT: there is no other (V)CPU to trigger the monitoring hardware, and therefore MWAIT will only wake when a hardware interrupt is asserted.

If we attempted to emulate MWAIT as (something similar to) HLT, while continuing to treat MONITOR as a NOP, we might be able to reduce host CPU utilization at the price of having the MWAIT-based idle thread be somewhat "sluggish" (waking up "late", on hardware interrupt, as opposed to "on time" when another VCPU writes to the monitored memory location). This may have a negative impact on e.g. the real-time performance of the OS X guest, but considering we're already running under virtualization, you can't lose what you ain't never had :)

We'd have to pay attention to the fact that the OS X idle thread runs with interrupts disabled (RFLAGS.IF=0), but sets %ecx=1 to make MWAIT wake on interrupt regardless. This experimental patch implements MWAIT as an always-interruptible (regardless of RFLAGS.IF) version of HLT. The patch works well enough on a single-processor guest, reducing host CPU utilization during guest idle to about 15%. However, when booting on an SMP guest, OS X crashes with an "HPET not found" panic, which could indicate any number of problems:

imperfect HPET emulation under SMP
not all VCPUs are guaranteed to receive hardware (e.g., timer) interrupts that would wake them from MWAIT
more ACPI/SeaBIOS/Chameleon bugs

These issues are currently under investigation, so please stay tuned...
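
To make the wake-up rule from this section concrete, here is a small sketch comparing three predicates: plain HLT, an MWAIT that honors the interrupt-break bit in ECX (bit 0), and the "always interruptible" simplification described above. The function names are hypothetical, only maskable interrupts are modeled, and this is an illustration of the rule rather than the code of the experimental patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* MWAIT extension bit: treat interrupts as break events even if masked. */
#define MWAIT_ECX_INTERRUPT_BREAK (1u << 0)

/* HLT only wakes on a maskable interrupt when RFLAGS.IF is set. */
static bool hlt_wakes(bool irq_pending, bool rflags_if)
{
    return irq_pending && rflags_if;
}

/* MWAIT honoring ECX bit 0: a masked interrupt wakes it if the bit is set. */
static bool mwait_wakes_strict(bool irq_pending, bool rflags_if, uint32_t ecx)
{
    bool break_on_masked = (ecx & MWAIT_ECX_INTERRUPT_BREAK) != 0;

    return irq_pending && (rflags_if || break_on_masked);
}

/* Simplification described in the text: ignore RFLAGS.IF altogether. */
static bool mwait_wakes_patch(bool irq_pending, bool rflags_if)
{
    (void)rflags_if;   /* deliberately unused */
    return irq_pending;
}
```

For the OS X idle thread (RFLAGS.IF=0, %ecx=1), the strict and simplified variants agree, while plain HLT would never wake.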

4. Emulate the monitoring hardware

Although the spec strongly warns against assuming any connection between the size of the memory chunk being monitored and the size of a cache line, it is obvious that MONITOR and MWAIT are implemented on top of the processor's cache coherence protocol (e.g., MESI). A plausible approach could work like this:

MONITOR counts as a memory access, ensuring that the monitored memory area counts as a valid (i.e., M, E, or S state) cache entry on the respective CPU core, in addition to setting the "armed" flag.
A write to the monitored memory area from another CPU core will cause everyone else's (including the MONITOR-ing core's) corresponding cache line to be invalidated (I state). When a monitored cache line is invalidated, the "armed" flag is also turned off (i.e., the monitoring hardware is "triggered" or "disarmed").
MWAIT acts as a NOP if it finds the monitor "disarmed"; otherwise, it enters a C-state and waits for a triggering write, or interrupt, etc.
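
The three steps above can be sketched as a tiny state machine over a single cache line. This is a deliberately simplified model of my guess at the mechanism, not a description of any real CPU: the type and function names are hypothetical, only one line is tracked per core, and the coherence traffic a real MONITOR would generate is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

enum mesi { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* One core's view of the monitored cache line (hypothetical model). */
struct toy_core {
    enum mesi line;
    bool armed;
};

/* MONITOR counts as an access: the line becomes valid, the flag is set. */
static void toy_monitor(struct toy_core *c)
{
    if (c->line == MESI_INVALID)
        c->line = MESI_SHARED;
    c->armed = true;
}

/*
 * A write by core `writer` invalidates the line everywhere else;
 * invalidating a monitored line also disarms ("triggers") the monitor.
 */
static void toy_remote_write(struct toy_core *cores, size_t n, size_t writer)
{
    for (size_t i = 0; i < n; i++) {
        if (i == writer) {
            cores[i].line = MESI_MODIFIED;
            continue;
        }
        cores[i].line = MESI_INVALID;
        cores[i].armed = false;
    }
}

/* MWAIT sleeps only if the monitor is still armed; otherwise it is a NOP. */
static bool toy_mwait_sleeps(const struct toy_core *c)
{
    return c->armed;
}
```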

A relatively straightforward way to emulate MONITOR and MWAIT in KVM would be to utilize the virtual MMU module to write-protect MONITOR-ed memory areas, and handle the subsequent write faults (by emulating the actual guest write from within the host, and updating the state of the emulated monitoring hardware accordingly). This approach has one major caveat: memory monitoring necessarily happens at page-level granularity, which is typically significantly larger than the extent of a (few) cache line(s) typically used on real hardware. It is true that the exact extent of the monitored memory area is advertised via CPUID, but OS X is already known to have a poor track record of honoring CPUID.
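
The granularity caveat comes down to simple address arithmetic. The sketch below assumes 4 KiB pages, and the test values further assume a 64-byte monitored range; both numbers are illustrative assumptions, and the helper names are hypothetical. It shows that a write can fault on the protected page without touching the range real hardware would have watched.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB guest pages */

/* Guest frame number containing a guest physical address. */
static uint64_t gfn_of(uint64_t gpa)
{
    return gpa >> TOY_PAGE_SHIFT;
}

/* With page-level write protection, any write to the same page faults. */
static bool write_faults(uint64_t monitored_gpa, uint64_t write_gpa)
{
    return gfn_of(monitored_gpa) == gfn_of(write_gpa);
}

/* What real hardware would watch: only [monitored_gpa, monitored_gpa + len). */
static bool write_in_monitored_range(uint64_t monitored_gpa, uint64_t len,
                                     uint64_t write_gpa)
{
    return write_gpa >= monitored_gpa && write_gpa - monitored_gpa < len;
}
```

Every address for which write_faults() is true but write_in_monitored_range() is false costs a trip through the host without corresponding to a wakeup on real hardware.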

This new experimental patch implements MONITOR/MWAIT by emulating the monitoring hardware on top of the KVM MMU, as described above. As an optimization step (to avoid a TLB shootdown each time the write-protection on a monitored page is switched on or off), the patch is implemented as follows:

we assume a (very) limited number of MONITOR-ed memory locations is used by the guest (Mac OS X only utilizes one such location, which is shared by all instances of its idle thread).
only the first MONITOR on a given page causes it to be write-protected.
write operations cause a fault and are handled (i.e., emulated) in host mode, but do not switch the page back to being writable; instead, they set a "recently accessed" flag for the page.
periodically, a cleanup pass will relinquish stale monitored pages that have not had their "recently accessed" flag set since the previous pass. This step is not yet implemented in the current version of the patch.
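
The bookkeeping implied by the list above can be sketched as a small fixed-size table. This is an illustration of the scheme only, not the patch itself: the table size, the 4 KiB page assumption, and all names (mon_table, on_monitor, on_write_fault, cleanup_pass) are hypothetical, and the actual write-protection and write emulation are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MONITORED 4     /* assumption: very few monitored pages */
#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB guest pages */

struct mon_page {
    uint64_t gfn;
    bool in_use;
    bool recent;            /* "recently accessed" flag */
};

struct mon_table {
    struct mon_page slot[MAX_MONITORED];
};

/* Returns true only if the page was newly write-protected by this call. */
static bool on_monitor(struct mon_table *t, uint64_t gpa)
{
    uint64_t gfn = gpa >> TOY_PAGE_SHIFT;
    struct mon_page *free_slot = NULL;

    for (size_t i = 0; i < MAX_MONITORED; i++) {
        if (t->slot[i].in_use && t->slot[i].gfn == gfn)
            return false;               /* already protected: nothing to do */
        if (!t->slot[i].in_use && !free_slot)
            free_slot = &t->slot[i];
    }
    if (!free_slot)
        return false;                   /* table full: page is not tracked */
    free_slot->gfn = gfn;
    free_slot->in_use = true;
    free_slot->recent = false;
    return true;
}

/* Write fault: page stays protected, only the flag is set. */
static bool on_write_fault(struct mon_table *t, uint64_t gpa)
{
    uint64_t gfn = gpa >> TOY_PAGE_SHIFT;

    for (size_t i = 0; i < MAX_MONITORED; i++) {
        if (t->slot[i].in_use && t->slot[i].gfn == gfn) {
            t->slot[i].recent = true;
            return true;                /* write would be emulated in host */
        }
    }
    return false;                       /* not a monitored page */
}

/* Release pages not written since the previous pass; returns the count. */
static int cleanup_pass(struct mon_table *t)
{
    int released = 0;

    for (size_t i = 0; i < MAX_MONITORED; i++) {
        if (!t->slot[i].in_use)
            continue;
        if (t->slot[i].recent) {
            t->slot[i].recent = false;  /* give it another period */
        } else {
            t->slot[i].in_use = false;  /* stale: make writable again */
            released++;
        }
    }
    return released;
}
```

The return value of on_monitor() marks the only point at which a TLB shootdown for write-protection would be needed; repeated MONITORs and writes on an already tracked page never toggle the protection.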

Similarly to the MWAIT as HLT method, this patch only works reliably on single-VCPU guests. The patch sometimes works with '-smp 2,cores=2' (about 30% of the time) as shown in this screenshot. Some other times, the emulated disk controller (AHCI) hangs. Other times, as well as with any attempt at SMP higher than 2, we get the dreaded HPET panic. I'm currently looking for ways to first explain, then debug this behavior.