Owning the Image Object File Format, the Compiler Toolchain, and the Operating System: Solving Intractable Performance Problems Through Vertical Engineering

May 6th, 2016

Closing Down Another Attack Vector

As the Windows kernel continues its quest for ever-stronger security features and exploit mitigations, the existence of fixed addresses in memory continues to undermine the advances in this area, as attackers can combine data corruption vulnerabilities with stack and instruction pointer control in order to bypass SMEP, DEP, and countless other architectural defense-in-depth techniques. In some cases, entire mitigations (such as CFG) are undone due to their reliance on a single, well-known static address.

In the latest builds of Windows 10 Redstone 1, aka “Anniversary Update”, the kernel takes a much stronger stance toward Kernel Address Space Layout Randomization (KASLR), employing an arsenal of tools that can only be available to an operating system developer that also happens to own the world’s most commercially successful compiler, and the world’s most pervasive executable object image format.

The Page Table Entry Array

One of the most unique aspects of the Windows kernel is the reliance on a fixed kernel address to represent the virtual base address of an array of page table entries that describes the entire virtual address space, and the usage of a self-referencing entry which acts as a pivot describing the page directory for the space itself (and, on x64 systems, describing the page directory table itself, and the page map level 4 itself).

This elegant solution allows instant O(1) translation of any virtual address to its corresponding PTE, and with the correct shifts and base addresses, a conversion into the corresponding PDE (and PPE/PXE on x64 systems). For example, the function MmGetPhysicalAddress only needs to work as follows:

PHYSICAL_ADDRESS
MmGetPhysicalAddress (
    _In_ PVOID Address)
{
    MMPTE TempPte;
    /* Check if the PXE/PPE/PDE is valid */
    if (
#if (_MI_PAGING_LEVELS == 4)
       (MiAddressToPxe(Address)->u.Hard.Valid) &&
#endif
#if (_MI_PAGING_LEVELS >= 3)
       (MiAddressToPpe(Address)->u.Hard.Valid) &&
#endif
       (MiAddressToPde(Address)->u.Hard.Valid))
    {
       /* Check if the PTE is valid */
       TempPte = *MiAddressToPte(Address);
       /* ... remainder of the function omitted ... */
    }
}

Each iteration of the MMU table walk uses simple MiAddressTo* macros such as the one below, which in turn rely on hard-coded static addresses.

/* Convert an address to a corresponding PTE */
#define MiAddressToPte(x) \
   ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + PTE_BASE))
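The macro above is the 32-bit, non-PAE form (4-byte PTEs, hence the shift left by 2). As a rough sketch of the same arithmetic on x64, assuming the long-standing fixed base of 0xFFFFF68000000000, 48-bit virtual addresses and 8-byte PTEs (the helper and constant names here are mine, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the historical fixed x64 base of the PTE array. */
#define PTE_BASE_X64 0xFFFFF68000000000ULL
/* 48-bit virtual addresses with 4 KB pages leave 36 bits of page number. */
#define VPN_MASK     0xFFFFFFFFFULL

/* Address of the PTE describing va: index the array by virtual page number. */
static uint64_t AddressToPte(uint64_t va)
{
    /* Only canonical addresses are meaningful (bits 63..47 all equal). */
    assert((va >> 47) == 0 || (va >> 47) == 0x1FFFF);
    return PTE_BASE_X64 + (((va >> 12) & VPN_MASK) << 3);
}

/* The PTE array is itself mapped through the self-referencing entry, so
   the PDE for va is simply "the PTE of the PTE". */
static uint64_t AddressToPde(uint64_t va)
{
    return AddressToPte(AddressToPte(va));
}
```

Feeding a PTE address back through the same function is the self-referencing trick at work: AddressToPde(0) comes out as 0xFFFFF6FB40000000, with no memory access needed to find it.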

As attackers have figured out, however, this “elegance” has notable security implications. For example, if a write-what-where is mitigated by the existence of a read-only page (which, in Linux, would often imply requiring the WP bit to be disabled in CR0), a Windows attacker can simply direct the write-what-where attack toward the pre-computed PTE address in order to disable the WriteProtect bit, and then follow that by the actual write-what-where on the data.

Similarly, if an exploit is countered by SMEP, which causes an access violation when a Ring 0 Code Segment’s Instruction Pointer (CS:RIP) points to a Ring 3 PTE, the exploit can simply use a write-what-where (if one exists), or ROP (if the stack can be controlled), in order to mark the target user-mode PTE containing malicious code, as a Ring 0 page.
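To see why a single corrupted PTE is enough in both cases, here is a small model of the architectural x64 PTE flag bits involved. It only looks at the leaf PTE (real hardware also consults the upper paging levels), and the macro and function names are illustrative rather than the kernel's own:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_VALID (1ULL << 0)   /* P: translation is present            */
#define PTE_WRITE (1ULL << 1)   /* R/W: set means the page is writable  */
#define PTE_OWNER (1ULL << 2)   /* U/S: set means user-mode accessible  */
#define PTE_NX    (1ULL << 63)  /* XD: set means not executable         */

/* A read-only page is only read-only while the R/W bit stays clear. */
static bool PteIsWritable(uint64_t pte)
{
    return (pte & PTE_VALID) && (pte & PTE_WRITE);
}

/* "User page" is decided purely by the U/S bit. */
static bool PteIsUserPage(uint64_t pte)
{
    return (pte & PTE_VALID) && (pte & PTE_OWNER);
}

/* Simplified SMEP rule: a ring 0 instruction fetch faults on a user page. */
static bool SmepBlocksKernelFetch(uint64_t pte)
{
    return PteIsUserPage(pte);
}
```

In this model, setting PTE_WRITE on a read-only page or clearing PTE_OWNER on a user page flips the answer, which is why knowing where the PTE lives is so valuable to an attacker holding a write primitive.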

Other PTE-based attacks are also possible, such as by using write-what-where vulnerabilities to redirect a PTE to a different physical address which is controlled by the attacker (undocumented APIs available to Administrators will leak the physical address space of the OS, and some physical addresses are also leaked in the registry or CPU registers).

Ultimately, the list goes on and on, and many excellent papers exist on the topic. It’s clear that Microsoft needed to address this limitation of the operating system (or clever optimization, as some would call it). Unfortunately, a number of obstacles exist:

  • Using virtual-mapped tables based on the EPROCESS structure (as Linux and OS X do) causes significant performance impact, as pointer chasing the different tables now causes cache misses and page translations. This becomes even worse when thinking about multi-processor systems, and the cache waste that this causes (where the TLB may end up getting filled with the various global (locked) pages corresponding to the page tables of various processes, instead of only the current process).
  • Changing the address of the PTE array has a number of compatibility concerns, as PTE_BASE is actually documented in ntddk.h, part of the Windows Driver Kit. Additionally, once the new address is discovered, attackers can simply adjust their exploits to use the appropriate static address based on the version of the operating system.
  • Randomizing the address of the PTE array means that Windows memory manager functions can no longer use a static constant or preprocessor definition for the address, but must instead access a global address which contains it. Forcing every processor to dereference a single global address every single time a virtual memory operation (allocation, protection, page walk, fault, etc…) is performed is a significant performance hit, especially on multi-socket, NUMA systems.
  • Dealing with the global variable problem above by creating cache-aligned copies of the address in a per-processor structure causes a waste of precious kernel storage space (for example, assuming a 64-byte cache line and 640 processors, 40KB of physical memory are used to replicate the variable most efficiently). However, on NUMA systems, one would also want the page containing this data to be local to the node, so we might imagine an overhead of 4KB per socket. In practice, this wouldn’t be quite as bad, as Windows already has a per-NUMA-node-allocated, per-processor, cache-aligned list of critical kernel variables: the Kernel Processor Control Block (KPRCB).

In a normal world, the final bullet would probably be the most efficient solution: sacrificing what is today a modest amount of physical memory (or re-using such a structure) for dealing with effects of global access. Yet, locating this per-processor data would still not be cheap: most operating systems access such a structure by relying on a segment register such as FS or GS on x86 and x64 systems, or use special CPU registers such as those located on CP15 inside of ARM processors. At the very least, this causes more pointer dereferences and potentially complex microcode accesses. But if we own the compiler and the output format, can’t we think outside the box?

Dynamic Relocation Generation

When the Portable Executable (PE) file format was created, its designers realized an important issue: if compiled code made absolute references to data or functions, these hardcoded pointer values might become invalid if the operating system loaded the executable binary at a different base address than its preferred address. Originally a corner case, the advent of user-mode ASLR made this a common occurrence and new reality.

In order to deal with such rebasing operations, the PE format includes the definition of a special data directory entry called the Relocation Table Directory (IMAGE_DIRECTORY_ENTRY_BASERELOC). In turn, this directory includes a number of tables, each of which is an array of entries. Each entry ultimately describes the offset of a piece of code that is accessing an absolute virtual address, and the required adjustment that is needed to fix up the address. On a modern x64 binary, the only possible fixup is an absolute delta (increment or decrement), but more exotic architectures such as MIPS and ARM had different adjustments based on how absolute addresses were encoded on such processors.
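As a minimal sketch of what processing that directory looks like, the routine below walks the blocks and applies the x64 "absolute delta" fixup. The block header layout (page RVA, block size, then 16-bit entries holding a 4-bit type and a 12-bit page offset) follows the PE format; the type and function names are my own stand-ins rather than the Windows loader's, and an image already mapped at its RVAs is assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMAGE_REL_BASED_ABSOLUTE 0   /* padding entry, nothing to fix up */
#define IMAGE_REL_BASED_DIR64    10  /* add the delta to a 64-bit value  */

/* Mirrors the IMAGE_BASE_RELOCATION header. */
typedef struct {
    uint32_t VirtualAddress;  /* RVA of the 4 KB page this block covers */
    uint32_t SizeOfBlock;     /* header plus entries, in bytes          */
} BASE_RELOCATION_BLOCK;

/* Apply every DIR64 fixup in relocs to image, adding delta
   (actual load base minus preferred base, modulo 2^64). */
static void ApplyBaseRelocations(uint8_t *image, size_t imageSize,
                                 const uint8_t *relocs, size_t relocSize,
                                 uint64_t delta)
{
    size_t offset = 0;

    assert(image != NULL && relocs != NULL);
    while (relocSize - offset >= sizeof(BASE_RELOCATION_BLOCK)) {
        BASE_RELOCATION_BLOCK block;
        memcpy(&block, relocs + offset, sizeof(block));
        if (block.SizeOfBlock < sizeof(block) ||
            block.SizeOfBlock > relocSize - offset)
            break;                              /* malformed table: stop */

        size_t count = (block.SizeOfBlock - sizeof(block)) / sizeof(uint16_t);
        const uint8_t *entries = relocs + offset + sizeof(block);

        for (size_t i = 0; i < count; i++) {
            uint16_t entry;
            uint64_t value;
            memcpy(&entry, entries + i * sizeof(entry), sizeof(entry));
            unsigned type = entry >> 12;        /* high 4 bits: fixup type  */
            size_t rva = (size_t)block.VirtualAddress + (entry & 0x0FFF);

            if (type != IMAGE_REL_BASED_DIR64)
                continue;                       /* ABSOLUTE is just padding */
            if (rva > imageSize || imageSize - rva < sizeof(value))
                continue;                       /* site outside the image   */
            memcpy(&value, image + rva, sizeof(value));
            value += delta;
            memcpy(image + rva, &value, sizeof(value));
        }
        offset += block.SizeOfBlock;
    }
}
```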

These relocations work great to adjust hardcoded virtual addresses that correspond to code or data within the image itself – but if there is a hard-coded access to 0xC0000000, an address which the compiler has no understanding of, and which is not part of the image, it can’t possibly emit relocations for it – this is a meaningless data dereference. But what if it could?

In such an implementation, all accesses to a particular magic hardcoded address could be described as such to the compiler, which could then work with the linker to generate a similar relocation table – but instead of describing addresses within the image, it would describe addresses outside of the image, which, if known and understood by the PE parser, could be adjusted to the new location of the hard-coded external data address. By doing so, compiled code would continue to access what appears to be a single literal value, and no global variable would ever be needed, cancelling out any disadvantages associated with the randomization of this address.

Indeed, the new build of the Microsoft C Compiler, which is expected to ship with Visual Studio 15 (now in preview), supports a special annotation that can be associated with constant values that correspond to external virtual addresses. Upon usage of such a constant, the compiler will ensure that accesses are done in a way that does not “break up” the address, but rather causes its absolute value to be expressed in code (i.e.: “mov rax, 0xC0000000”). Then, the linker collects the RVAs of such locations and builds structures of type IMAGE_DYNAMIC_RELOCATION_ENTRY, as shown below:

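typedef struct _IMAGE_DYNAMIC_RELOCATION_ENTRY {
   PVOID Symbol;
   DWORD BaseRelocSize;
// IMAGE_BASE_RELOCATION BaseRelocations[0];
} IMAGE_DYNAMIC_RELOCATION_ENTRY;

When all entries have been written in the image, an IMAGE_DYNAMIC_RELOCATION_TABLE structure is written, with the type below:

typedef struct _IMAGE_DYNAMIC_RELOCATION_TABLE {
   DWORD Version;
   DWORD Size;
// IMAGE_DYNAMIC_RELOCATION DynamicRelocations[0];
} IMAGE_DYNAMIC_RELOCATION_TABLE;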

The RVA of this table is then written into the IMAGE_LOAD_CONFIG_DIRECTORY, which has been extended with the new field DynamicValueRelocTable and whose size has now been increased:

   ULONGLONG DynamicValueRelocTable;         // VA

Now that we know how the compiler and linker work together to generate the data, the next question is who processes it?

Runtime Dynamic Relocation Processing

In the Windows boot architecture, as the kernel is a standard PE file loaded by the boot loader, it is therefore the boot loader’s responsibility to process the import table of the kernel, and load other required dependencies, to generate the security cookie, and to process the static (standard) relocation table. However, the boot loader does not have all the information required by the memory manager in order to randomize the address space as Windows 10 Redstone 1 now does – this remains the purview of the memory manager. Therefore, as far as the boot loader is concerned, the static PTE_BASE address is still the one to use, and indeed, early phases of boot still use this address (and associated PDE/PPE/PXE base addresses and self-referencing entry).

This clearly implies that it is not considered part of a PE loader’s job to process the dynamic relocation table, but rather the job of the component that creates the dynamic address space map, which has now been enlightened with this knowledge. In the most recent builds, this is done by MiRebaseDynamicRelocationRegions, which eventually calls MiPerformDynamicFixups. This routine locates the PE file’s Load Configuration Directory, gets the RVA (now a VA, thanks to relocations done by the boot loader) of the Dynamic Relocation Table, and begins parsing it. At this moment, it only supports version 1 of the table. Then, it loops through each entry, adjusting the absolute address with the required delta to point to the new PTE_BASE address.
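The fixup loop itself is conceptually tiny. The sketch below is my own simplified model, not the kernel routine: it flattens the per-symbol IMAGE_BASE_RELOCATION blocks into a plain list of RVAs, and the structure and function names are hypothetical. Each recorded site holds a 64-bit immediate expressed against the old PTE base, and gets the delta to the new base added to it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for one dynamic relocation entry. */
typedef struct {
    uint64_t Symbol;           /* identifies the external address relocated */
    const uint32_t *SiteRvas;  /* RVAs of the 64-bit immediates encoding it */
    uint32_t SiteCount;
} DYNAMIC_FIXUP_GROUP;

/* Shift every recorded site by (newBase - oldBase).
   Returns the number of sites patched. */
static uint32_t PerformDynamicFixups(uint8_t *image, size_t imageSize,
                                     const DYNAMIC_FIXUP_GROUP *group,
                                     uint64_t oldBase, uint64_t newBase)
{
    uint64_t delta = newBase - oldBase;  /* modulo 2^64, either direction */
    uint32_t patched = 0;

    assert(image != NULL && group != NULL);
    for (uint32_t i = 0; i < group->SiteCount; i++) {
        uint32_t rva = group->SiteRvas[i];
        uint64_t value;

        if (rva > imageSize || imageSize - rva < sizeof(value))
            continue;                    /* site outside the image: skip */
        memcpy(&value, image + rva, sizeof(value));
        value += delta;
        memcpy(image + rva, &value, sizeof(value));
        patched++;
    }
    return patched;
}
```

Because only the immediates embedded in the instruction stream change, the compiled code still looks like it is using a literal constant, which is the whole point of the design.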

It is important to note that the memory manager only calls MiPerformDynamicFixups on the binaries that it knows require such fixups due to the use of PTE_BASE: the kernel (ntoskrnl.exe) and the HAL (hal.dll). As such, this is not (yet) intended as a generic mechanism for all PE files to allow dynamic relocations of hard-coded addresses toward ASLRed regions of memory – but rather a highly vertically integrated feature specifically designed for dealing with the randomization of the PTE array, and the components that have hardcoded dependencies on it.

As such, even if one were to discover the undocumented annotation which allows the new version of the compiler to generate such tables, no component would currently parse such a table.

Sneaky Side Effects

A few interesting details are of note in the implementation. The first is that the initial version of the implementation, which shipped in build 14316, contained a static address in the loader block, which corresponded to the PTE base address that the loader had selected, and was then overwritten by a new fixed PTE base address (0xFFFFFA00`00000000 on x64).

The WDK, which contains the PTE_BASE address for developers to see (but apparently not use!) also contained this new address, and the debugger was updated to support it. This was presumably done to gauge the impact of changing the address in any way – and indeed we can see release notes referring to certain AV products breaking around the time this build was released. I personally noticed this change by disassembling MmGetPhysicalAddress to see if the PTE base had been changed (a normal part of my build analysis).

The next build, 14332, seemingly contained no changes: reverse engineering of the function showed usage of the same address once again. However, as I was playing around with the !pte extension in the debugger, I noticed that a new address was now being used – and that on a separate machine, this address was different again. Staring in IDA/Hex-Rays, I could not understand how this was possible, as MmGetPhysicalAddress was clearly using the same new base as 14316!

It is only once I unassembled the function in WinDBG that I noticed something strange – the base address had been modified to a different value. This led me to the hunt for the dynamic relocation table mechanism. But this is an important point about this implementation – it offers a small amount of “security through obscurity” as a side-effect: attackers or developers attempting to ‘dynamically discover’ the value of the PTE base by analyzing the kernel file on disk will hit a roadblock – they must look at the kernel file in memory, once relocations have been made. Spooky!


It is often said that all software engineering decisions and features lie somewhere between the four quadrants of security, performance, compatibility and functionality. As such, as an example, the only way to increase security without affecting functionality is to impact compatibility and performance. Although randomizing the PTE_BASE does indeed cause potential compatibility issues, we’ve seen here how control of the compiler (and the underlying linked object file) can allow implementers to “cheat” and violate the security quadrant, in a similar way that silicon vendors can often work with operating system vendors in order to create overhead-free security solutions (one major advantage that Apple has, for example).

Closing “Heaven’s Gate”

December 30th, 2015

Brief Overview of WoW64

“Heaven’s Gate” refers to a technique first popularized by the infamous “Roy G. Biv” of 29a fame, and later re-published in Valhalla #1. Cited and improved in various new forms, and even seen in the wild used by the Vawtrak banking malware, it centers around the fact that on a 64-bit Windows OS, seeing as how all kernel-mode components always execute in 64-bit mode, the address space, core OS structures (EPROCESS, PEB, etc…), and code segments for processes are all initially set up for 64-bit “long mode” execution, regardless of the process actually being hosted by a 32-bit executable binary.

In fact, on 64-bit Windows, the first piece of code to execute in *any* process, is always the 64-bit NTDLL, which takes care of initializing the process in user-mode (as a 64-bit process!). It’s only later that the Windows-on-Windows (WoW64) interface takes over, loads a 32-bit NTDLL, and execution begins in 32-bit mode through a far jump to a compatibility code segment. The 64-bit world is never entered again, except whenever the 32-bit code attempts to issue a system call. The 32-bit NTDLL that was loaded, instead of containing the expected SYSENTER instruction, actually contains a series of instructions to jump back into 64-bit mode, so that the system call can be issued with the SYSCALL instruction, and so that parameters can be sent using the x64 ABI, sign-extending as needed.

This process is accurately described in many sources, including in the Windows Internals books, so if you’re interested in reading more, you can do so, but I’ll spare additional details here.

Enter Heaven’s Gate

Heaven’s Gate, then, refers to subverting the fact that a 64-bit NTDLL exists (and a 64-bit heap, PEB and TEB), and manually jumping into the long-mode code segment without having to issue a system call and being subjected to the code flow that WoW64 will attempt to enforce. In other words, it gives one the ability to create “naked” 64-bit code, which will be able to run covertly, including issuing system calls, without the majority of products able to intercept and/or introspect its execution:

  • Microsoft’s EMET, as well as a myriad of similar tools and sandboxes, only hook/protect the 32-bit NTDLL for WoW64 processes, under the assumption that the 64-bit NTDLL can’t be reached in any other way. The mitigations can therefore be bypassed using Heaven’s Gate. The same technique has been used by the Phenom malware to bypass AV solutions.
  • When debugging a 32-bit application with a 64-bit debugger (such as WinDBG), you will initially see the 64-bit state (heap, stack, NTDLL, TEB, etc…). Since this state is uninteresting, as it only contains the WoW64 system call layer, manual commands and extensions must be used to investigate the 32-bit state instead — and so in order to avoid this, even Microsoft often recommends using the 32-bit WinDBG instead, which will provide a much more seamless debugging experience and show the 32-bit state of the process. Other 3rd party debuggers, which are 32-bit only, will also behave the same way. The problem, therefore, is that by using Heaven’s Gate, there IS now interesting 64-bit state, that these debuggers will miss.
  • Many emulation/detonation engines will, upon seeing a 32-bit executable, emulate it using x86 instructions. They will either ignore or be unable to handle x64 instructions, as they never expect them to run. In fact, this was recently shown by a blog post over at Hexacorn. Heaven’s Gate allows such x64 instructions to run, rendering the x86 code into “dummy” code for misdirection purposes.

Memory Restrictions

These and other “benefits” make Heaven’s Gate a tool of choice for malicious code. However, there always existed an interesting limitation in 32-bit applications running under WoW64: even when executing in 64-bit long-mode, addresses above the 4 GB boundary could never be allocated (in fact, addresses above 2 GB could normally never be used for compatibility purposes, unless the image was linked with /LARGEADDRESSAWARE; the switch was originally designed to support /3GB x86 server environments, but outgrew its original intent to allow full 4 GB addresses under WoW64, a fact leveraged by many 32-bit games and browsers even today).

Using a kernel debugger and the !vad command, it’s simple to see why, such as on this Windows 7 system, where I’ve typed the command before the process has any chance of executing even a single instruction — not even NTDLL has loaded here, folks. This is an interesting view of what are the “earliest” memory structures you can find in a WoW64 process (at least on Windows 7).


Note that a giant VAD at the end, highlighted in teal, occupies the entire 64-bit portion of the address space. Let’s see what !vad has to say about it:


Seeing as how it’s configured as a “NoChange” and “OneSecured” VAD, it cannot be freed or modified in any way. This is further confirmed by the commit charge of -1.

On Windows 8 and later, however, the output changes, as you can see below. Note that I’ve re-used the same colors as in the Windows 7 output for clarity (and the uncolored VADs correspond to the CFG entries).


The 64-bit NTDLL is actually loaded in 64-bit address space now! And we have not one, but two teal-colored VADs, which surround it, re-creating the “no man’s land” just as on Windows 7 and earlier. This change was briefly mentioned, I believe, by Matt Miller (of skape fame) at one of Microsoft’s BlackHat presentations: it made it a bit harder to guess the location of the 64-bit NTDLL by simply adding a fixed size to the 32-bit NTDLL. In my screenshot, since this is a CFG-enabled process, the VADs don’t exactly envelop NTDLL — rather they surround the native CFG bitmap + NTDLL, but the point remains.

This change in NTDLL load behavior also had the likely intended side effect of making hooks in 64-bit NTDLL extremely hard, or outright impossible. You see, without consuming an enormous amount of space, it’s simply not possible to overwrite an x64 instruction with a call or jmp to an absolute 64-bit address efficiently. Instead, hooking engines will allocate a “trampoline” that is within the 32-bit address range of the hooked function, and use a much smaller 5 byte 32-bit relative jump, which happens to fit nicely in the “hotpatch aware” region that Microsoft binaries have (or anyone linking with /hotpadmin). The trampoline then uses the full 64-bit absolute jump instruction.
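A quick sketch of why the trampoline has to live nearby: the 5-byte jmp carries a signed 32-bit displacement, so its target must sit within roughly 2 GB of the hook site, while an absolute jump needs 14 bytes. The function names are hypothetical; the encodings (E9 rel32, and FF 25 00000000 followed by the 8-byte target) are the standard x64 ones:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode "jmp rel32" placed at address from and landing on to.
   Returns false when to is out of reach of a 32-bit displacement. */
static bool EncodeRelativeJump(uint64_t from, uint64_t to, uint8_t out[5])
{
    /* The displacement is measured from the end of the 5-byte instruction. */
    int64_t disp = (int64_t)(to - (from + 5));
    uint32_t rel;

    assert(out != 0);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;
    rel = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(rel >> (8 * i));   /* little-endian rel32 */
    return true;
}

/* Encode "jmp qword ptr [rip+0]" followed by the absolute target:
   position independent, but 14 bytes long. */
static void EncodeAbsoluteJump(uint64_t to, uint8_t out[14])
{
    assert(out != 0);
    out[0] = 0xFF;
    out[1] = 0x25;
    out[2] = out[3] = out[4] = out[5] = 0;        /* disp32 = 0 */
    for (int i = 0; i < 8; i++)
        out[6 + i] = (uint8_t)(to >> (8 * i));    /* little-endian target */
}
```

A hook site in a module loaded high in the 64-bit address space simply cannot reach a trampoline below 4 GB with the short form: EncodeRelativeJump returns false for such a pair.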

As you’ve figured out by now, if the trampoline needs to be within 2GB, but there are two large VADs blocking off all 64-bit addresses around NTDLL, this hooking technique is dead in the water. Other, more complex and error-prone techniques must (and can) be used instead.

Nevertheless, nothing stops Heaven’s Gate on Windows 8. There are some minor WoW64 changes which one must adapt to, and accessing or hooking 64-bit NTDLL becomes harder.

Control Flow Guard and WoW64

In Windows 10, a new exploit mitigation is introduced called Control Flow Guard, or CFG. It too, has been rather well described in multiple sources, so I won’t go into details inside of this post. The important piece to remember about CFG is that all indirect function calls are now subject to an additional compiler-generated check, which is implemented by NTDLL: only valid function prologues (within 8 bytes of alignment) can be the target of such a call. Valid function prologues, in turn, are marked by a bit being set in a very large bitmap (bit array) structure, which describes the entire user-mode address space (all 128TB of it!). I previously posted on some interesting changes this required in the memory manager, as this bit array obviously becomes quite large (2 TB, in fact).
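The 2 TB figure falls straight out of the granularity: one bit per 8 bytes of address space means one bitmap byte per 64 bytes. Here is a deliberately simplified model of that arithmetic and of the lookup (the real NTDLL check encodes more state per slot; the names below are mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CFG_BYTES_PER_BIT 8ULL   /* granularity used in this model */

/* Bitmap size needed to describe addressSpaceBytes of address space. */
static uint64_t CfgBitmapBytes(uint64_t addressSpaceBytes)
{
    return addressSpaceBytes / CFG_BYTES_PER_BIT / 8;
}

/* Mark the 8-byte slot containing address as a valid call target. */
static void CfgMarkValidTarget(uint8_t *bitmap, uint64_t address)
{
    uint64_t bit = address / CFG_BYTES_PER_BIT;
    assert(bitmap != 0);
    bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

/* Test whether the slot containing address is marked. */
static bool CfgIsValidTarget(const uint8_t *bitmap, uint64_t address)
{
    uint64_t bit = address / CFG_BYTES_PER_BIT;
    assert(bitmap != 0);
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

CfgBitmapBytes(128ULL << 40) evaluates to 2ULL << 40, i.e. 128 TB of user address space needs a 2 TB bit array.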

What’s not been documented too clearly in most research is that on 64-bit systems, there are in fact not one, but two CFG bitmaps: one for 32-bit code, and one for 64-bit code. The addresses of both of these bitmaps are stored in the per-process working set structure (called MMWSL). This structure is pointed to by the MMSUPPORT structure inside of EPROCESS (i.e.: PsGetCurrentProcess()->Vm.VmWorkingSetList), but a unique thing about it is that it’s stored in a region of memory called “hyperspace”, which is at a fixed address… much like the per-process page table entry array. On recent 64-bit systems, this hard-coded address is 0xFFFFF58010804000, a fact I pointed out in a previous blog post addressing the 64-bit address space of Windows 8.1 and later.

As one can see in the symbols that WinDBG can dump, the MMWSL structure contains a field:

+0x1f8 UserVaInfo       : _MI_USER_VA_INFO

And inside of MI_USER_VA_INFO, we can find an array:

+0x0c8 CfgBitMap : [2] _MI_CFG_BITMAP_INFO

Whose two entries correspond to the following enumeration:

 CfgBitMapNative = 0n0
 CfgBitMapWow64 = 0n1
 CfgBitMapMax = 0n2

Clearly, thus, a 64-bit Windows 10 kernel contains not one, but two CFG bitmaps. And indeed, the 32-bit NTDLL will utilize the address of the WoW64 bitmap, while the 64-bit NTDLL will utilize the Native bitmap. But why use two separate bitmaps? What separates a WoW64 bitmap from a native bitmap? One would imagine that 64-bit code is marked as executable in the native bitmap, and 32-bit code is marked as executable in the WoW64 bitmap… but that’s not quite the full story.

At verification time, indeed, it is the version of NTDLL that is being used, which determines which bitmap will be looked at. But how does the OS populate the bits?

In CFG-aware versions of Windows, the CFG bitmap is touched through two paths: MiCommitVadCfgBits, and MiCfgMarkValidEntries. These, in turn, correspond to either intrinsic CFG modifications (side-effects of allocating, protecting and/or mapping executable memory), or explicit CFG modifications (effect of calling SetValidCallTargets). Both of these paths will eventually call MiSelectCfgBitMap, whose pseudo-code is shown below.


As is quite clear from the code, any private memory allocations below the 64-bit boundary will be marked only in the 32-bit bitmap, while the opposite applies to the 64-bit bitmap. In fact, this is the result of an optimization: instead of having two 2TB bit arrays for each processor execution mode, a single 2TB array is used for 64-bit native code, while a single 32MB array is used for 32-bit native code, greatly reducing address space consumption.
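The selection logic described here can be sketched roughly as follows. This is my reading of the behaviour as described, not the actual MiSelectCfgBitMap: the function name, its parameters, and the assumption that only WoW64 processes ever use the second bitmap are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as the enumeration dumped from the symbols. */
typedef enum {
    CfgBitMapNative = 0,
    CfgBitMapWow64  = 1,
    CfgBitMapMax    = 2
} CFG_BITMAP_TYPE;

#define FOUR_GB 0x100000000ULL

/* Pick the bitmap that records call targets for an allocation at address.
   The choice depends on where the memory lives, not on what kind of code
   is going to run there. */
static CFG_BITMAP_TYPE SelectCfgBitMap(bool isWow64Process, uint64_t address)
{
    if (isWow64Process && address < FOUR_GB)
        return CfgBitMapWow64;    /* small bitmap covering the low 4 GB */
    return CfgBitMapNative;       /* large bitmap for everything else   */
}
```

Note what is absent from the inputs: nothing tells the routine whether the bytes being marked are x86 or x64 instructions.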

Closing the Gate

Basing the decision of which CFG bitmap to populate on the virtual address of the executable allocation creates an obvious dichotomy: 64-bit code, if running in a 32-bit address range, will instantly trip up CFG, because the NTDLL library that is active in that environment is the 64-bit version, which will check the 64-bit bitmap, which will not have any bits set in the 0-4 GB range. Similarly, any 32-bit code must be running below the 4 GB boundary, else the 32-bit NTDLL’s CFG validation routine will trip up, as the 32-bit bitmap isn’t even large enough to account for addresses above 4 GB.

A naive solution is therefore proposed: simply allocate 64-bit code above the 4GB range, and the problem goes away. There is, of course, a problem with this approach: the NoChange VADs which block the entire > 4 GB region of memory and mark it unusable, leaving 64-bit NTDLL as the only valid allocation in that address range.

In Windows 10, these two factors combined result in the inability to execute any useful 64-bit code in a 32-bit application/WoW64 process, because the two restrictions combine, creating an impossible condition. You may be tempted to dismiss the reality by stating that all the 64-bit malicious code has to do is not to have been compiled with CFG. In this case, the compiler should not be emitting calls to the validation routine. However, this misses a critical point: it’s not the process’ own executable code/shellcode which are necessarily performing the 64-bit CFG checks — it’s the 64-bit NTDLL itself, or any other additional 64-bit DLLs you may have injected through the initial 64-bit shellcode, into your own process.
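The "impossible condition" is easy to see once both restrictions are written down as predicates. The following is a toy model of a WoW64 process on Windows 10 for memory the process allocates itself; all names are hypothetical and the rules are just the ones described in the text:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FOUR_GB 0x100000000ULL

typedef enum { BitmapNative, BitmapWow64 } BITMAP_KIND;

/* Restriction 1: the NoChange VADs leave nothing allocatable above 4 GB. */
static bool CanAllocateAt(uint64_t address)
{
    return address < FOUR_GB;
}

/* The bitmap that gets populated is chosen by address alone. */
static BITMAP_KIND MarkedIn(uint64_t address)
{
    return address < FOUR_GB ? BitmapWow64 : BitmapNative;
}

/* The bitmap that gets checked is chosen by which NTDLL does the checking. */
static BITMAP_KIND ConsultedBy(bool ntdllIs64Bit)
{
    return ntdllIs64Bit ? BitmapNative : BitmapWow64;
}

/* Restriction 2: a target only validates if both sides agree on the bitmap. */
static bool CallTargetUsable(bool ntdllIs64Bit, uint64_t address)
{
    return CanAllocateAt(address) &&
           MarkedIn(address) == ConsultedBy(ntdllIs64Bit);
}
```

For the 32-bit NTDLL, any address below 4 GB works. For the 64-bit NTDLL there is no address at all that satisfies both predicates: below 4 GB the wrong bitmap was marked, and above 4 GB the allocation cannot exist.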

Even worse, even if no other 64-bit DLLs are imported, some core system functionality, implemented by NTDLL, also validates the CFG bitmap: Exceptions, User-Mode Callbacks, and APCs. Any usage of these system mechanisms, because they always initially execute in 64-bit mode, will cause a CFG violation if the target is not in the bitmap — which it cannot possibly be. The same goes for higher level functionality like using the Thread Pool, or any other callback-based mechanism owned by NTDLL in 64-bit mode. For example, because kernel-mode injects user-mode APCs through the 64-bit NTDLL, the user-mode APC routine cannot possibly be a custom, non-DLL function: it would’ve been impossible to allocate it > 4 GB, and the APC dispatcher will validate the CFG bits for any address < 4 GB, and be unable to find it.

Perhaps the best example of these unexpected side-effects is to analyze what Heaven’s Gate-using malware often does to gain some usefulness in the hidden 64-bit context: it will look up LdrLoadDll inside of NTDLL.DLL and attempt to load additional 64-bit DLLs, such as kernel32.dll. With some coercing (as some of the articles I linked to at the beginning showed), this can be made to work. The problem, in a CFG-aware NTDLL.DLL, is that LdrpCallInitRoutines will perform a CFG bitmap check before calling the DllMain of this DLL. As the DLL will be loaded in 32-bit address space, the WoW64 CFG bitmap will be marked, and not the Native CFG bitmap, causing the 64-bit NTDLL to believe that DllMain is not a valid indirect call target, and crash the process.

Suffice it to say, although it still is possible to have a very simple 64-bit piece of code, even possibly performing some system calls, execute in the hidden 64-bit world of a WoW64 process/32-bit application, any attempts to load additional DLLs, use APCs, handle exceptions or user-mode callbacks in 64-bit mode will result in the process crashing, as a CFG violation will be tripped. For most intents and purposes, therefore, CFG has a potentially unintended side-effect: it closes down Heaven’s Gate.

Reopening the gate is left as an exercise to the reader ;-)

Final Note

Astute readers may have noticed the following discrepancies, especially if following along on their own systems:

  • Windows 8.1 Update 3 does have support for CFG
  • We saw three, not two VADs, on my Windows 8.1 Update 3 screenshot
  • This post mentions how Windows 10 closes Heaven’s Gate, but not Windows 8.1 Update 3

The key is in dumping the MI_USER_VA_INFO structure on such a system:

+0x060 CfgBitMap : [3] _MI_CFG_BITMAP_INFO

Three entries? Let’s take a look:

 CfgBitMapNative = 0n0
 CfgBitMapWow64 = 0n1
 CfgBitMapWow64NativeLow = 0n2
 CfgBitMapMax = 0n3

This explains the three, not two VADs in my dump: in the original CFG implementation on Windows 8.1, 64-bit code could live in the 32-bit address range, as the Native bitmap had a “Wow64Low” portion. In Windows 10, this is now gone (saving 32MB of address space) — Native code is only aware of the 64-bit address ranges.

What are Little PatchGuards Made Of?

June 22nd, 2015

A number of excellent PatchGuard articles have been written around what PatchGuard is, how to bypass it, what triggers it uses, its obfuscation techniques, and more.

But for some reason, nobody has published a full list of everything that PatchGuard actually verifies. Microsoft used to have a website that listed the first 7 checks, but nothing beyond that.

I asked around at conferences, and the answer I got was that the code was too complex to analyze, and nobody really wanted to take the time to figure out every single check. I had my own private list of checks I knew PatchGuard does (through runtime analysis), but I was surprised to see the real reason nobody’s bothered to analyze this…

… Microsoft’s own public debugger (known as WinDBG) tells you — why bother reversing? :)

Lo and behold, the 39 different checks in PatchGuard on Windows 8.1 Update. There are a few more in Windows 10; I guess they’re not yet documented.

Arg1: 00000000, Reserved
Arg2: 00000000, Reserved
Arg3: 00000000, Failure type dependent information
Arg4: 00000000, Type of corrupted region, can be
0 : A generic data region
1 : Modification of a function or .pdata
2 : A processor IDT
3 : A processor GDT
4 : Type 1 process list corruption
5 : Type 2 process list corruption
6 : Debug routine modification
7 : Critical MSR modification
8 : Object type
9 : A processor IVT
a : Modification of a system service function
b : A generic session data region
c : Modification of a session function or .pdata
d : Modification of an import table
e : Modification of a session import table
f : Ps Win32 callout modification
10 : Debug switch routine modification
11 : IRP allocator modification
12 : Driver call dispatcher modification
13 : IRP completion dispatcher modification
14 : IRP deallocator modification
15 : A processor control register
16 : Critical floating point control register modification
17 : Local APIC modification
18 : Kernel notification callout modification
19 : Loaded module list modification
1a : Type 3 process list corruption
1b : Type 4 process list corruption
1c : Driver object corruption
1d : Executive callback object modification
1e : Modification of module padding
1f : Modification of a protected process
20 : A generic data region
21 : A page hash mismatch
22 : A session page hash mismatch
23 : Load config directory modification
24 : Inverted function table modification
25 : Session configuration modification
102 : Modification of win32k.sys

I have to admit, there are some things I didn’t realize PatchGuard would actually think about protecting, such as the Local APIC. It’s also interesting to see some more esoteric hooks in the list as well, such as PsEstablishWin32Callout protection. I also did not realize PatchGuard now protects the DRIVER_OBJECT structure — indeed, hooking a major function will now give you code 0x1C. And finally, the protection of protected processes means that technically something such as Mimikatz’s “MimiDrv” may crash some machines in the wild.

I usually try to avoid talking about PatchGuard since I’m glad it’s giving AV companies hell, but I can’t have been the only person that never noticed that the checks were documented in the debugger all along, hidden behind a simple command (it makes sense that Microsoft wouldn’t want their own support engineers to be wondering what on Earth they’re looking at):

!analyze -show 109

I can’t even take credit for discovering this on my own. Reading Microsoft’s famous “NT Debugging” blog made me realize that this had been there all along.


Analyzing MS15-050 With Diaphora

May 14th, 2015

One of the most common ways that I glean information on new and upcoming features in releases of Windows is, obviously, to use reverse engineering tools such as IDA Pro and look at changed functions and variables, which usually imply a change in functionality.

Of course, such changes can also reveal security fixes, but those are a lot harder to notice at the granular level of diff-analysis that I perform as part of understanding feature changes.

For those types of fixes, reverse engineers and security experts often use a specialized diffing tool, such as BinDiff. Recently, such tools have either become obsolete, abandoned, or cost-prohibitive. A good friend of mine, Joxean Koret (previously of Hex-Rays fame, un-coincidentally), has recently developed a Python plugin for IDA Pro called “Diaphora” (diaforá, the Greek word for “difference”).

In this blog post, we’ll analyze the recent MS15-050 patch and do a very quick walk-through of how to use Diaphora.


Installing the plugin is as easy as going over to the GitHub page, downloading the repository as a .zip file, and extracting the contents into the appropriate directory (I chose IDA’s plugin folder, but this can be anything you wish).

As long as your IDAPython is configured correctly (which has been the default in IDA for many releases), clicking on File, Script file… should let you select a .py file.


Generating the initial baseline

The first time you run Diaphora, you’ll be making the initial SQLite database. If you don’t have Hex-Rays, or disable the “Use the decompiler if available” flag, this process only takes a few seconds. Otherwise, with Hex-Rays enabled, you’ll be spending most of the time waiting for the decompiler to run on the entire project. Depending on code complexity, this could take a while.

This SQLite database will essentially contain the assembly and pseudo-code in a format easily parsable by the plugin, as well as all your types, enumerations, and decompiler data (such as your annotations and renamed variables). In this case, I had an existing fairly well-maintained IDB for the latest version of the Service Control Manager for Windows 7 SP1, which had actually not changed since 2012. My pseudo-code had over 3 years to grow into a well-documented, thoroughly structured IDA database.

Diff me once, importing your metadata

The second run of Diaphora (which, at this point, should be on your new, fresh binary) is where you direct it to the initial SQLite database from the step above and select your diffing options. The default set I use is in the screenshot below.


This second run can take much longer than the first: not only are you generating a second database, but you are then running all of the diffing algorithms that Diaphora implements (which you can customize). Once the run is complete, Diaphora will show you identical code (“Best Matches”), close matches (“Partial Matches”), and Unidentifiable Matches. This is where comparing a very annotated IDB with a fresh IDB for purposes of security research can have problems.

Since I renamed many of the static global variables, any code using them in their renamed format would appear different from the original “loc_325345” format that IDA uses by default. Any function prototypes which I manually fixed up would also appear different (Hex-Rays is especially bad with variable argument __stdcall on x86), as well as any callers of those functions.

So in the initial analysis, I got tons of “Partial Matches” and very few “Best Matches”. Nothing was unmatched, however.

One of the great parts of Diaphora, however, is that you can then confirm that the functions are truly identical. Since we’re talking about files which have symbols, it makes sense to claim that ScmFooBar is identical to ScmFooBar. This will now import all the metadata from your first IDB to the other, and then give you the option of re-running the analysis stage.

At this point, I have taken all of the 3 years of research I had on one IDB, and instantly (well, almost) merged it to a brand new IDB that covers a more recent version of the binary.

Diff me twice, locating truly changed code

Now that the IDBs have been “synced up”, the second run should identify true code changes — new variables that have been added, structures that changed, and new code paths. In truth, those were identified the first time around, but hidden in the noise of all the IDB annotation changes. Here’s an incredible screenshot of what happened the second time I ran Diaphora.

First, note how almost all the functions are now seen as identical:

And then, on the Partial Matches tab… we see one, and only one function. This is likely what MS15-050 targeted (the description in the Security Bulletin is that this fixed an “Impersonation Level Check” — the function name sounds like it could be related to an access check!).

Now that we have our only candidate for the fix delivered in this update, we can investigate what the change actually was. We do this by right-clicking on the function and selecting “Diff pseudo-code”. The screenshot below is Diaphora’s output:


At this point, the vulnerability is pretty clear. In at least some cases where an access check is being made due to someone calling the Service Control Manager, the impersonation level isn’t verified — meaning that someone with an Anonymous SYSTEM token (for example) could pass off as actually being a SYSTEM caller, and therefore be able to perform actions that only SYSTEM could do. In fact, in this case, we see that the Authentication ID (LUID) of 0x3E7 is checked, which is actually SYSTEM_LUID, making our example reality.
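To make the missing check concrete, here is a minimal C sketch. The `caller_token` struct and both function names are hypothetical stand-ins for illustration, not the Service Control Manager's actual code. The only facts taken from the text are the SYSTEM LUID value (0x3E7) and that the impersonation level went unverified. The enum follows the ordering of Windows' SECURITY_IMPERSONATION_LEVEL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same ordering as Windows' SECURITY_IMPERSONATION_LEVEL. */
enum imp_level {
    LEVEL_ANONYMOUS,
    LEVEL_IDENTIFICATION,
    LEVEL_IMPERSONATION,
    LEVEL_DELEGATION
};

#define SYSTEM_LUID_LOW 0x3E7u  /* low part of SYSTEM_LUID */

/* Hypothetical, simplified view of the caller's token. */
struct caller_token {
    uint32_t auth_id_low;    /* authentication ID (LUID), low part */
    uint32_t auth_id_high;   /* authentication ID (LUID), high part */
    enum imp_level level;    /* impersonation level of the token */
};

/* Flawed check: trusts the authentication ID alone. */
bool is_system_unpatched(const struct caller_token *t)
{
    assert(t != NULL);
    return t->auth_id_high == 0 && t->auth_id_low == SYSTEM_LUID_LOW;
}

/* Stricter check: the LUID only counts if the token is at a level
 * that actually lets the server act as the caller. */
bool is_system_patched(const struct caller_token *t)
{
    return is_system_unpatched(t) && t->level >= LEVEL_IMPERSONATION;
}
```

A token carrying the SYSTEM LUID at identification level passes the first function and fails the second, which is the gap described above.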

At this point, I won’t yet go into the details on which Service Control Manager calls exactly are vulnerable to this incorrect access check (ScAccessCheck, which is normally used, actually isn’t vulnerable, as it calls NtAccessCheck), or how this vulnerability could be used for local privilege escalation, because I wanted to give kudos to Joxean for this amazing plugin and get more people aware of its existence.

Perhaps we’ll keep the exploitation for a later post? For some ideas, read up on James Forshaw’s excellent Project Zero blog post, in which he details another case of poor impersonation checks in the operating system.


How Control Flow Guard Drastically Caused Windows 8.1 Address Space and Behavior Changes

January 22nd, 2015

Windows 8.1 radically changes the address space layout of the system by finally removing the 44-bit limitation which I described in one of the earliest blog posts on this website (and which Wikipedia even links to!). This is a little-known detail about the operating system, and an odd thing for Microsoft not to emphasize with more aplomb, especially given that 8.1 is considered a “patch” of Windows 8.

Now, you may think that 16 TB to 256 TB is a meaningless change since no applications currently use even a fraction of that space, but the main benefit of this change is not the ability to allocate additional memory, but rather the increased entropy space available for Address Space Layout Randomization (ASLR), especially given that Windows 8 introduced High Entropy ASLR (HEASLR), Top-down Randomization and Anonymous Memory Randomization.

Additionally, another key change was done in Windows 8.1 that is not mentioned anywhere. As Pavel Lebedinsky, one of the lead SDETs on the Memory Manager and an extremely helpful individual, indicated on one of the blog posts from Mark Russinovich:

1. Reserved memory does contribute to commit charge, because the memory manager charges commit for pagetable space necessary to map the entire reserved range. On 64 bit this can be a significant number (reserving 1 TB of memory will consume approximately 2 GB of commit).

This means that attempting to reserve the full 8 TB of memory on Windows 7 results in 16 GB of commit, which is beyond most people’s commit limit, especially at the time. In Windows 8.1, reserving the full 128 TB would, at the same rate, result in 256 GB of commit being used, which only a beefy server would tolerate. While such large memory reservations are unusual, they do have usefulness in certain scenarios related to security and low-level testing. This Windows behavior prevented such reservations from reliably working, but in Windows 8.1, the limitation has been removed!
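The "2 GB per TB" rate quoted above falls out of simple page table arithmetic: one 8-byte PTE for every 4 KB page. The C sketch below counts only the leaf page tables and ignores the upper paging levels, which add a fraction of a percent, so it yields the approximate figure rather than an exact charge. The function and macro names are mine.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ull  /* x64 small page */
#define PTE_SIZE_BYTES  8ull     /* one 64-bit PTE per page */

/* Bytes of leaf page tables needed to map a reservation.
 * Upper-level tables are deliberately not counted. */
uint64_t pagetable_commit_bytes(uint64_t reserved_bytes)
{
    assert(reserved_bytes % PAGE_SIZE_BYTES == 0);
    return (reserved_bytes / PAGE_SIZE_BYTES) * PTE_SIZE_BYTES;
}
```

Feeding it 1 TB returns 2 GB and 8 TB returns 16 GB, matching the numbers in the text; 128 TB works out to 256 GB.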

Indeed, you can easily test this by using the TestLimit tool from the Windows Internals Book, and run it with the -r option (and preferably with a large enough block size). Here’s a screenshot of hitting the 128 TB reservation:


And here’s the resulting view in VMMap, which does not show the expected page table commit charge, but rather a much smaller size (256 MB).


So why did Microsoft change this behavior in Windows 8.1? Well, Windows 10, as well as Windows 8.1 Update 3 (November Update) make this clear. As I previously tweeted, these OS versions enable Control Flow Guard (CFG), a feature that lay dormant in the first versions of Windows 8.1. In order to function, CFG requires the use of optimized bitmaps in order to determine the validity of indirect calls, and on 64-bit Windows, this bitmap requires 2 TB of space. Not only would this cut the Windows 8 address space by 25%, it would’ve also resulted in 4 GB of per-process commit!

Here’s a screenshot of Process Hacker showing how all CFG-enabled processes now use 2 TB of virtual address space:


The final effect of this change from 8 TB to 128 TB is that the kernel address space layout has significantly changed. And sadly, the !address extension in WinDBG is broken and continues to show the Windows 8 address space layout (which I expanded on during my Blackhat 2013 talk), while the Windows Internals book is stuck on Windows 7 and doesn’t even cover Windows 8 or higher.

Therefore, I publish below what I believe to be the only public source of information on the Windows 8.1 x64 memory layout. One of the benefits of this new layout is that it is now extremely easy to determine where an address is coming from by using its first 5 or 6 nibbles. For example, 0xFFFFD… is a kernel stack, 0xFFFFC… is paged pool, 0xFFFFF8… is a loaded image (driver or kernel), and 0xFFFFE… is nonpaged pool.

Start                End                  Size    Description
FFFF0000`00000000    FFFF07FF`FFFFFFFF    8TB     Memory Hole
FFFF0800`00000000    FFFFAFFF`FFFFFFFF    168TB   Unused Space
FFFFB000`00000000    FFFFBFFF`FFFFFFFF    16TB    System Cache
FFFFE000`00000000    FFFFEFFF`FFFFFFFF    16TB    Nonpaged Pool
FFFFF000`00000000    FFFFF67F`FFFFFFFF    6.5TB   Unused Space
FFFFF780`00000000    FFFFF780`00000FFF    4K      Shared User Data
FFFFF780`C0000000    FFFFF780`FFFFFFFF    1GB     WS Hash Table
FFFFF781`00000000    FFFFF791`3FFFFFFF    65GB    Paged Pool WS
FFFFF791`40000000    FFFFF799`3FFFFFFF    32GB    WS Hash Table
FFFFF799`40000000    FFFFF7A9`7FFFFFFF    65GB    System Cache WS
FFFFF7A9`80000000    FFFFF7B1`7FFFFFFF    32GB    WS Hash Table
FFFFF7B1`80000000    FFFFF7FF`FFFFFFFF    314GB   Unused Space
FFFFF900`00000000    FFFFF97F`FFFFFFFF    512GB   Session Space
FFFFF980`00000000    FFFFFA70`FFFFFFFF    1TB     Dynamic VA Space
Table describing the various 64-bit memory ranges in Windows 8.1
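The nibble trick can be sketched in a few lines of C. This only encodes the four prefixes called out in the text (kernel stack, paged pool, loaded image, nonpaged pool); the enum and function names are my own, and anything else is simply reported as "other" rather than guessed at.

```c
#include <assert.h>
#include <stdint.h>

enum va_region {
    VA_OTHER,
    VA_PAGED_POOL,      /* 0xFFFFC... */
    VA_KERNEL_STACK,    /* 0xFFFFD... */
    VA_NONPAGED_POOL,   /* 0xFFFFE... */
    VA_LOADED_IMAGE     /* 0xFFFFF8... */
};

/* Classify a Windows 8.1 x64 kernel address by its leading nibbles. */
enum va_region classify_kernel_va(uint64_t va)
{
    assert((va >> 48) == 0xFFFF);      /* kernel-half address expected */

    if ((va >> 40) == 0xFFFFF8)        /* first six nibbles */
        return VA_LOADED_IMAGE;

    switch (va >> 44) {                /* first five nibbles */
    case 0xFFFFC: return VA_PAGED_POOL;
    case 0xFFFFD: return VA_KERNEL_STACK;
    case 0xFFFFE: return VA_NONPAGED_POOL;
    default:      return VA_OTHER;
    }
}
```

Shifting right by 44 bits leaves the top five nibbles of a 64-bit address, and shifting by 40 leaves the top six, which is all the lookup needs.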

Sheep Year Kernel Heap Fengshui: Spraying in the Big Kids’ Pool

December 29th, 2014

The State of Kernel Exploitation

The typical write-what-where kernel-mode exploit technique usually relies on either modifying some key kernel-mode data structure, which is easy to do locally on Windows thanks to poor Kernel Address Space Layout Randomization (KASLR), or on redirecting execution to a controlled user-mode address, which will now run with Ring 0 rights.

Relying on a user-mode address is an easy way not to worry about the kernel address space, and to have full control of the code within a process. Editing the tagWND structure or the HAL Dispatch Table are two very common vectors, as are many others.

However, with Supervisor Mode Execution Prevention (SMEP), also called Intel OS Guard, this technique is no longer reliable — a direct user-mode address cannot be used, and other techniques must be employed instead.

One possibility is to disable SMEP Enforcement in the CR4 register through Return-Oriented Programming, or ROP, if stack control is possible. This has been covered in a few papers and presentations.

Another related possibility is to disable SMEP Enforcement on a per-page basis, taking a user-mode page and marking it as a kernel page by making the required changes in the page level translation mapping entries. This has also been talked about in at least one presentation, and, if accepted, a future SyScan 2015 talk from a friend of mine will also cover this technique. Additionally, if accepted, an alternate version of the technique will be presented at INFILTRATE 2015, by yours truly.

Finally, a theoretical possibility is being able to transfer execution (through a pointer, callback table, etc) to an existing function that disables SMEP (and thus bypassing KASLR), but then somehow continues to give the attacker control without ROP — nobody has yet found such a function. This would be a type of Jump-Oriented Programming (JOP) attack.

Nonetheless, all of these techniques continue to leverage a user-mode address as the main payload (nothing wrong with that). However, one must also consider the possibility of using a kernel-mode address for the attack, which means that no ROP and/or PTE hacking is needed to disable SMEP in the first place.

Obviously, this means that the function to perform the malicious payload’s work already exists in the kernel, or we have a way of bringing it into the kernel. In the case of a stack/pool overflow, this payload probably already comes with the attack, and the usual tricks have been employed there in order to get code execution. Such attacks are particularly common in true ‘remote-remote’ attacks.

But what of write-what-where bugs, usually the domain of the local (or remote-local) attacker? If we have user-mode code execution available to us, to execute the write-what-where, we can obviously continue using the write-what-where exploit to repeatedly fill an address of our choice with the payload data. This presents a few problems however:

  • The write-what-where may be unreliable, or corrupt adjacent data. This makes it hard to use it to ‘fill’ memory with code.
  • It may not be obvious where to write the code — having to deal with KASLR as well as Kernel NX. On Windows, this is not terribly hard, but it should be recognized as a barrier nonetheless.

This blog post introduces what I believe to be two new techniques, namely a generic kernel-mode heap spraying technique which results in executable memory, followed by a generic kernel-mode heap address discovery technique, bypassing KASLR.

Big Pool

Experts of the Windows heap manager (called the pool) know that there are two different allocators (three, if you’re being pedantic): the regular pool allocator (which can use lookaside lists that work slightly differently than regular pool allocations), and the big/large page pool allocator.

The regular pool is used for any allocations that fit within a page, so either 4080 bytes on x86 (8 bytes for the pool header, and 8 bytes used for the initial free block), or 4064 bytes on x64 (16 bytes for the pool header, 16 bytes used for the initial free block). The tracking, mapping, and accounting of such allocations is handled as part of the regular slush of kernel-mode memory that the pool manager owns, and the pool headers link everything together.

Big pool allocations, on the other hand, take up one or more pages. They’re used for anything over the sizes above, as well as when the CacheAligned type of pool memory is used, regardless of the requested allocation size — there’s no way to easily guarantee cache alignment without dedicating a whole page to an allocation.
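As a quick sanity check of those thresholds, here is a small C sketch that derives the regular pool limits from the header and free block sizes quoted above and decides which allocator a request would land in. The function names are hypothetical, and the rule is only as precise as the description in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 4096u

/* Largest request the regular pool can serve, per the figures above. */
size_t regular_pool_max(bool is_x64)
{
    size_t header     = is_x64 ? 16 : 8;  /* pool header */
    size_t free_block = is_x64 ? 16 : 8;  /* initial free block */
    return PAGE_BYTES - header - free_block;
}

/* True when a request would be served by the big pool allocator:
 * either it does not fit, or cache alignment was requested. */
bool goes_to_big_pool(size_t bytes, bool is_x64, bool cache_aligned)
{
    assert(bytes > 0);
    return cache_aligned || bytes > regular_pool_max(is_x64);
}
```

This gives 4080 bytes on x86 and 4064 bytes on x64, so a request of one byte more, or any cache-aligned request, goes to the big pool.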

Because there’s no room for a header, these pages are tracked in a separate “Big Pool Tracking Table” (nt!PoolBigPageTable), and the pool tags, which are used to identify the owner of an allocation, are also not present in the header (since there isn’t one!), but rather in the table as well. Each entry in this table is represented by a POOL_TRACKER_BIG_PAGES structure, documented in the public symbols:

    +0x000 Va : Ptr32 Void
    +0x004 Key : Uint4B
    +0x008 PoolType : Uint4B
    +0x00c NumberOfBytes : Uint4B

One thing to be aware of is that the Virtual Address (Va) is OR’ed with a bit to indicate if the allocation is freed or allocated — in other words, you may have duplicate Va’s, some freed, and at most one allocated. The following simple WinDBG script will dump all the big pool allocations for you:

r? @$t0 = (nt!_POOL_TRACKER_BIG_PAGES*)@@(poi(nt!PoolBigPageTable))
r? @$t1 = *(int*)@@(nt!PoolBigPageTableSize) / sizeof(nt!_POOL_TRACKER_BIG_PAGES)
.for (r @$t2 = 0; @$t2 < @$t1; r? @$t2 = @$t2 + 1)
{
    r? @$t3 = @$t0[@$t2];
    .if (@@(@$t3.Va != 1))
    {
        .printf "VA: 0x%p Size: 0x%lx Tag: %c%c%c%c Freed: %d Paged: %d CacheAligned: %d\n", @@((int)@$t3.Va & ~1), @@(@$t3.NumberOfBytes), @@(@$t3.Key >> 0 & 0xFF), @@(@$t3.Key >> 8 & 0xFF), @@(@$t3.Key >> 16 & 0xFF), @@(@$t3.Key >> 24 & 0xFF), @@((int)@$t3.Va & 1), @@(@$t3.PoolType & 1), @@(@$t3.PoolType & 4) == 4
    }
}
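For readers who prefer C to debugger script, the same per-entry decoding can be sketched as follows. The struct mirrors the x86 symbol dump above with pointers narrowed to 32-bit integers; the helper names are mine, and the bit meanings (low bit of Va for freed, bit 0 of PoolType for paged) are taken from the script rather than from any official header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* x86 layout of POOL_TRACKER_BIG_PAGES, per the symbol dump. */
struct pool_tracker_big_pages {
    uint32_t va;               /* +0x000, low bit set when freed */
    uint32_t key;              /* +0x004, pool tag */
    uint32_t pool_type;        /* +0x008 */
    uint32_t number_of_bytes;  /* +0x00c */
};

/* Virtual address with the freed bit masked out. */
uint32_t entry_address(const struct pool_tracker_big_pages *e)
{
    assert(e != NULL);
    return e->va & ~1u;
}

bool entry_is_freed(const struct pool_tracker_big_pages *e)
{
    assert(e != NULL);
    return (e->va & 1u) != 0;
}

bool entry_is_paged(const struct pool_tracker_big_pages *e)
{
    assert(e != NULL);
    return (e->pool_type & 1u) != 0;
}

/* Unpack the four tag characters, lowest byte first, NUL-terminated. */
void entry_tag(const struct pool_tracker_big_pages *e, char tag[5])
{
    assert(e != NULL);
    for (int i = 0; i < 4; i++)
        tag[i] = (char)((e->key >> (8 * i)) & 0xFF);
    tag[4] = '\0';
}
```

The tag is stored with its first character in the lowest byte, which is why the script shifts by 0, 8, 16, and 24 in that order.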

Why are big pool allocations interesting? Unlike small pool allocations, which can share pages, and are hard to track for debugging purposes (without dumping the entire pool slush), big pool allocations are easy to enumerate. So easy, in fact, that the undocumented KASLR-be-damned API NtQuerySystemInformation has an information class specifically designed for dumping big pool allocations, including not only their size, their tag, and their type (paged or nonpaged), but also their kernel virtual address!

As previously presented, this API requires no privileges, and only in Windows 8.1 has it been locked down against low integrity callers (Metro/Sandboxed applications).

With the little snippet of code below, you can easily enumerate all big pool allocations:

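// Note: This is poor programming (hardcoding 4MB).
// The correct way would be to issue the system call
// twice, and use the resultLength of the first call
// to dynamically size the buffer to the correct size
bigPoolInfo = RtlAllocateHeap(RtlGetProcessHeap(),
                              0,
                              4 * 1024 * 1024);
if (bigPoolInfo == NULL) goto Cleanup;
res = NtQuerySystemInformation(SystemBigPoolInformation,
                               bigPoolInfo,
                               4 * 1024 * 1024,
                               &resultLength);
if (!NT_SUCCESS(res)) goto Cleanup;
printf("TYPE     ADDRESS\tBYTES\tTAG\n");
for (i = 0; i < bigPoolInfo->Count; i++)
{
    // One row per allocation: type, address, size, and 4-character tag
    printf("%s0x%p\t0x%lx\t%c%c%c%c\n",
            bigPoolInfo->AllocatedInfo[i].NonPaged == 1 ?
            "Nonpaged " : "Paged    ",
            bigPoolInfo->AllocatedInfo[i].VirtualAddress,
            bigPoolInfo->AllocatedInfo[i].SizeInBytes,
            bigPoolInfo->AllocatedInfo[i].Tag[0],
            bigPoolInfo->AllocatedInfo[i].Tag[1],
            bigPoolInfo->AllocatedInfo[i].Tag[2],
            bigPoolInfo->AllocatedInfo[i].Tag[3]);
}
Cleanup:
if (bigPoolInfo != NULL)
    RtlFreeHeap(RtlGetProcessHeap(), 0, bigPoolInfo);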

Pool Control

Obviously, it’s quite useful to have all these handy kernel-mode addresses. But what can we do to control their data, and not only be able to read their address?

You may be aware of previous techniques where a user-mode attacker allocates a kernel-object (say, an APC Reserve Object), which has a few fields that are user-controlled, and which then has an API to get its kernel-mode address. We’re essentially going to do the same here, but rely on more than just a few fields. Our goal, therefore, is to find a user-mode API that can give us full control over the kernel-mode data of a kernel object, and additionally, to result in a big pool allocation.

This isn’t as hard as it sounds: anytime a kernel-mode component allocates over the limits above, a big pool allocation is done instead. Therefore, the exercise reduces itself to finding a user-mode API that can result in a kernel allocation of over 4KB, whose data is controlled. And since Windows XP SP2 and later enforce kernel-mode non-executable memory, the allocation should be executable as well.

Two easy examples may pop up in your head:

  1. Creating a local socket, listening on it, connecting from another thread, accepting the connection, and then issuing a write of > 4KB of socket data, but not reading it. This will result in the Ancillary Function Driver for WinSock (AFD.SYS), also affectionately known as “Another F*cking Driver”, allocating the socket data in kernel-mode memory. Because the Windows network stack functions at DISPATCH_LEVEL (IRQL 2), and paging is not available, AFD will use a nonpaged pool buffer for the allocation. This is great, because until Windows 8, nonpaged pool is executable!
  2. Creating a named pipe, and issuing a write of > 4KB of data, but not reading it. This will result in the Named Pipe File System (NPFS.SYS) allocating the pipe data in a nonpaged pool buffer as well (because NPFS performs buffer management at DISPATCH_LEVEL as well).

Ultimately, #2 is a lot easier, requiring only a few lines of code, and being much less conspicuous than using sockets. The important thing you have to know is that NPFS will prefix our buffer with its own internal header, which is called a DATA_ENTRY. Each version of NPFS has a slightly different size (XP- vs 2003+ vs Windows 8+).

I’ve found that the cleanest way to handle this, and not to worry about offsets in the final kernel payload, is to internally handle this in the user-mode buffer with the right offsets. And finally, remember that the key here is to have a buffer that’s at least the size of a page, so we can force the big pool allocator.

Here’s a little snippet that takes all this into account and will have the desired effects:

UCHAR payLoad[PAGE_SIZE - 0x1C + 44];
// Fill the first page with 0x41414141, and the next page
// with INT3's (simulating our payload). On x86 Windows 7
// the size of a DATA_ENTRY is 28 bytes (0x1C).
RtlFillMemory(payLoad,  PAGE_SIZE - 0x1C,     0x41);
RtlFillMemory(payLoad + PAGE_SIZE - 0x1C, 44, 0xCC);
// Write the data into the kernel
res = CreatePipe(&readPipe,
                 &writePipe,
                 NULL,
                 sizeof(payLoad));
if (res == FALSE) goto Cleanup;
res = WriteFile(writePipe,
                payLoad,
                sizeof(payLoad),
                &resultLength,
                NULL);
if (res == FALSE) goto Cleanup;
// extra code goes here...

Now all we need to know is that NPFS uses the pool tag ‘NpFr’ for the read data buffers (you can find this out by using the !pool and !poolfind commands in WinDBG). We can then change the earlier KASLR-defeating snippet to hard-code the pool tag and expected allocation size, and we can instantly find the kernel-mode address of our buffer, which will fully match our user-mode buffer.

Keep in mind that the “Paged vs. Nonpaged” flag is OR’ed into the virtual address (this is different from the structure in the kernel, which tracks free vs. allocated), so we’ll mask that out, and also make sure you align the size to the pool header alignment (it’s enforced even for big pool allocations). Here’s that snippet, for x86 Windows:

// Based on pooltag.txt, we're looking for the following:
// NpFr - npfs.sys - DATA_ENTRY records (r/w buffers)
for (entry = bigPoolInfo->AllocatedInfo;
     entry < (PSYSTEM_BIGPOOL_ENTRY)bigPoolInfo +
                                    bigPoolInfo->Count;
     entry++)
{
    if ((entry->NonPaged == 1) &&
        (entry->TagUlong == 'rFpN') &&
        (entry->SizeInBytes == ALIGN_UP(PAGE_SIZE + 44,
                                        ULONGLONG)))
    {
        // Mask out the flag bit first, then skip the 0x41 page
        printf("Kernel payload @ 0x%p\n",
               (PVOID)(((ULONG_PTR)entry->VirtualAddress & ~1) +
                       PAGE_SIZE));
        break;
    }
}
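The two bits of arithmetic in that lookup, the aligned size to match on and the address of the payload page, are easy to get wrong, so here they are as a standalone C sketch. It assumes the x86 Windows 7 values used in this post (a 0x1C-byte DATA_ENTRY) plus an 8-byte pool alignment, which is my assumption for x86; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES        4096u
#define DATA_ENTRY_BYTES  0x1Cu  /* x86 Windows 7, per the text */
#define POOL_ALIGN_BYTES  8u     /* assumed x86 pool alignment */

/* Round value up to a power-of-two boundary. */
uint32_t align_up(uint32_t value, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Size the big pool table should report for a pipe write of
 * user_bytes: NPFS header plus data, rounded to pool alignment. */
uint32_t expected_big_pool_size(uint32_t user_bytes)
{
    return align_up(user_bytes + DATA_ENTRY_BYTES, POOL_ALIGN_BYTES);
}

/* Start of the second page of the allocation, given the reported
 * address whose low bit carries the nonpaged flag. */
uint32_t payload_address(uint32_t reported_va)
{
    return (reported_va & ~1u) + PAGE_BYTES;
}
```

With the buffer from earlier (PAGE_SIZE - 0x1C + 44 bytes), the expected size comes out to 4144 bytes. Note that the mask has to be applied before the page offset is added, since `+` binds tighter than `&` in C.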

And here’s the proof in WinDBG:

Kernel Malloc

Voila! Package this into a simple “kmalloc” helper function, and now you too, can allocate executable, kernel-mode memory, at a known address! How big can these allocations get? I’ve gone up to 128MB without a problem, but this being non-paged pool, make sure you have the RAM to handle it. Here’s a link to some sample code which implements exactly this functionality.

An additional benefit of this technique is that not only can you get the virtual address of your allocation, you can even get the physical address! Indeed, as part of the undocumented Superfetch API that I first discovered and implemented in my meminfo tool, which has now been supplanted by the RAMMap utility from SysInternals, the memory manager will happily return the pool tag, virtual address, and physical address of our allocation.

Here’s a screenshot of RAMMap showing another payload allocation and its corresponding physical address (note that the 0x1000 difference is since the command-line PoC biases the pointer, as you saw in the code).


Next Steps

Now, for full disclosure, there are a few additional caveats that make this technique a bit less sexy in 2015 — and why I chose to talk about it today, and not 8 years ago when I first stumbled upon it:

1) Starting with Windows 8, nonpaged pool allocations are now non-executable. This means that while this trick still lets you spray the pool, your code will require some sort of NX bypass first. So you’ve gone from bypassing SMEP to bypassing kernel-mode NX.

2) In Windows 8.1, the API to get the big pool entries and their addresses is no longer usable by low-integrity callers. This significantly reduces the usefulness in local-remote attacks, since those are usually launched through sandboxed applications (Flash, IE, Chrome, etc) and/or Metro containers.

Of course, there are some ways around this: a sandbox escape is often used in local-remote attacks anyway, so #2 can become moot. As for #1, some astute researchers have already figured out that NX was not fully deployed. For example, Session Pool allocations are STILL executable on newer versions of Windows, but only on x86 (32-bit). I leave it as an exercise to readers to figure out how this technique can be extended to leverage that (hint: there’s a ‘Big Session Pool’).

But what about a modern, 64-bit version of Windows, say even Windows 10? Well, this technique appears to be mostly dead on such systems. Or is it? Is everything truly NX in the kernel, or are there still some sneaky ways to get some executable memory, and to get its address? I’ll be sure to blog about it once Windows 14 is out the door in 2022.

PE Trick #1: A Codeless PE Binary File That Runs

September 29th, 2014


One of the annoying things about my Windows Internals/Security research is when every single component and mechanism I’ve looked at in the last six months has ultimately resulted in me finding very interesting design bugs, which I must now wait on Microsoft to fix before being able to talk further about them. As such, I have to take a small break from kernel-specific research (although I hope to lift the veil over at least one issue at the No Such Conference in Paris this year). And so, in the next few blog posts, probably inspired by having spent too much time talking with my friend Ange Albertini, I’ll be going over some neat PE tricks.


Write a portable executable (PE/EXE) file which can be spawned through a standard CreateProcess call and will result in STATUS_SUCCESS being returned as well as a valid Process Handle, but will not:

  • Contain any actual x86/x64 assembly code section (i.e.: the whole PE should be read-only, no +X section)
  • Run a single instruction of what could be construed as x86 assembly code, which is part of the file itself (i.e.: random R/O data should not somehow be forced into being executed as machine code)
  • Crash or make any sort of interactive/visible notice to the user, event log entry, or other error condition.

Interestingly, this was actually a real-world situation that I was asked to provide a solution for, not a mere mental exercise. The idea was being able to prove, in a court of law, that no “foreign” machine code had executed as a result of this executable file having been launched (i.e.: obviously the kernel ran some code, and the loader ran too, but all this is pre-existing Microsoft OS code). Yet, the PE file had to not only be valid, but to also return a valid process handle to the caller.


HEADER:00000000 .686p
HEADER:00000000 .mmx
HEADER:00000000 .model flat
HEADER:00000000 ; Segment type: Pure data
HEADER:00000000 HEADER segment page public 'DATA' use32
HEADER:00000000 assume cs:HEADER
HEADER:00000000 __ImageBase dw 5A4Dh ; PE magic number
HEADER:00000002 dw 0 ; Bytes on last page of file
HEADER:00000004 dd 4550h ; Signature
HEADER:00000008 dw 14Ch ; Machine
HEADER:0000000A dw 0 ; Number of sections
HEADER:0000000C dd 0 ; Time stamp
HEADER:00000010 dd 0 ; Pointer to symbol table
HEADER:00000014 dd 0 ; Number of symbols
HEADER:00000018 dw 0 ; Size of optional header
HEADER:0000001A dw 2 ; Characteristics
HEADER:0000001C dw 10Bh ; Magic number
HEADER:0000001E db 0 ; Major linker version
HEADER:0000001F db 0 ; Minor linker version
HEADER:00000020 dd 0 ; Size of code
HEADER:00000024 dd 0 ; Size of initialized data
HEADER:00000028 dd 0 ; Size of uninitialized data
HEADER:0000002C dd 7FBE02F8h ; Address of entry point
HEADER:00000030 dd 0 ; Base of code
HEADER:00000034 dd 0 ; Base of data
HEADER:00000038 dd 400000h ; Image base
HEADER:0000003C dd 4 ; Section alignment
HEADER:00000040 dd 4 ; File alignment
HEADER:00000044 dw 0 ; Major operating system version
HEADER:00000046 dw 0 ; Minor operating system version
HEADER:00000048 dw 0 ; Major image version
HEADER:0000004A dw 0 ; Minor image version
HEADER:0000004C dw 4 ; Major subsystem version
HEADER:0000004E dw 0 ; Minor subsystem version
HEADER:00000050 dd 0 ; Reserved 1
HEADER:00000054 dd 40h ; Size of image
HEADER:00000058 dd 0 ; Size of headers
HEADER:0000005C dd 0 ; Checksum
HEADER:00000060 dw 2 ; Subsystem
HEADER:00000062 dw 0 ; Dll characteristics
HEADER:00000064 dd 0 ; Size of stack reserve
HEADER:00000068 dd 0 ; Size of stack commit
HEADER:0000006C dd 0 ; Size of heap reserve
HEADER:00000070 dd 0 ; Size of heap commit
HEADER:00000074 dd 0 ; Loader flag
HEADER:00000078 dd 0 ; Number of data directories
HEADER:0000007C HEADER ends
HEADER:0000007C end

As per Corkami, in Windows 7 and higher, you’ll want to make sure that the PE is at least 252 bytes on x86, or 268 bytes on x64.
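If you want to reproduce the bytes of the listing yourself, the C sketch below writes each non-zero field at the offset shown in the disassembly and leaves everything else zero. Padding the file out to 268 bytes with zeros is my assumption, since the listing itself stops at offset 0x7C, and the helper names are mine. This only builds the buffer; it makes no claim about how a given Windows version will treat the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TINY_PE_SIZE 268u  /* x64-compatible minimum quoted above */

/* Store a 16-bit value in little-endian order. */
static void put16(uint8_t *buf, size_t off, uint16_t v)
{
    buf[off]     = (uint8_t)(v & 0xFF);
    buf[off + 1] = (uint8_t)(v >> 8);
}

/* Store a 32-bit value in little-endian order. */
static void put32(uint8_t *buf, size_t off, uint32_t v)
{
    put16(buf, off,     (uint16_t)(v & 0xFFFF));
    put16(buf, off + 2, (uint16_t)(v >> 16));
}

/* Fill buf with the non-zero fields of the listing; the rest is 0. */
void build_tiny_pe(uint8_t *buf, size_t len)
{
    assert(len >= TINY_PE_SIZE);
    memset(buf, 0, len);
    put16(buf, 0x00, 0x5A4D);      /* "MZ" */
    put32(buf, 0x04, 0x4550);      /* "PE\0\0" signature */
    put16(buf, 0x08, 0x014C);      /* Machine: i386 */
    put16(buf, 0x1A, 0x0002);      /* Characteristics: executable image */
    put16(buf, 0x1C, 0x010B);      /* Optional header magic: PE32 */
    put32(buf, 0x2C, 0x7FBE02F8);  /* Address of entry point */
    put32(buf, 0x38, 0x00400000);  /* Image base */
    put32(buf, 0x3C, 4);           /* Section alignment, doubles as e_lfanew */
    put32(buf, 0x40, 4);           /* File alignment */
    put16(buf, 0x4C, 4);           /* Major subsystem version */
    put32(buf, 0x54, 0x40);        /* Size of image */
    put16(buf, 0x60, 2);           /* Subsystem: Windows GUI */
}
```

The value at offset 0x3C is worth a second look: the DOS header's e_lfanew field lives there, so the section alignment of 4 is also what points the loader at the PE signature at offset 4.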

Here’s a 64 byte Base64 representation of a .gz file containing the 64-bit compatible (268 byte) executable:



There is one non-standard machine configuration in which this code will actually still crash (while still returning STATUS_SUCCESS in CreateProcess). This is left as an exercise to the reader.


The application executes and exits successfully. But as you can see, no code is present in the binary. How does it work? Do you have any other solutions which satisfy the challenge?

The Case Of The Bloated Reference Count: Handle Table Entry Changes in Windows 8.1

June 17th, 2014


As part of my daily reverse engineering and peering into Windows Internals, I started noticing a strange effect in Windows 8.1 whenever looking at the reference counts of various objects with tools such as WinDBG, Process Explorer, and Process Hacker: seemingly gigantic values on x64 Windows, and smaller, yet still incredibly large values on x86.

For the uninitiated, reference counts (internally called pointer counts), and their cousin handle counts, are the Windows kernel’s way of keeping track of open instances to a certain object (such as a file, registry key, or mutex) in order to implement automatic cleanup and garbage collection. Windows system tools such as Process Explorer or Process Hacker often have handy interfaces for looking at the objects to which a process currently has references, by analyzing the process handle table.

Looking at Opened Handles and their Properties

In the screenshot below, you can see me looking at the first few handles of the Windows shell, Explorer.exe. Particularly, I am interested in the “DBWinMutex” mutex, at handle 0x44.

What this mutex does is gate access to Windows’ debug buffer, used by the OutputDebugString API, so it’s likely that you’ll see it used in many other processes as well. Since Explorer has at least one component using that API, it has a handle opened to it. Let’s go find out how many other components have a handle to it, by double-clicking and looking at its properties.

Pretty striking, isn’t it? While the handle count, which keeps track of actual handles to the object (implying that (Zw)OpenEvent was used to obtain the reference) is 14 and makes sense given the large number of processes that use the debug buffer to print various trace messages, the reference count, which is meant to include those handles plus any other additional internal kernel component references (which can bypass handles altogether and use the ObReferenceObject family of APIs to safely reference an object), is actually 491351! While it’s technically possible for such a large number of kernel references to exist to the object, it’s highly unlikely, and if one checks the reference counts on other objects, similarly large numbers appear. What’s going on?

Using the Windows Debugger to Dump Object Information

First, let’s make sure this isn’t a bug in Process Explorer. Tools that peer into undocumented structures are often prone to breaking on subtle changes in the kernel, so I like to use the Windows Kernel Debugger (WinDBG) to validate what user-mode tools are showing. After all, the debugger dumps the raw memory of the object, which is the ground truth. As you can see below, we can use the handy !object extension to go find the object.

32767 Shades of Reference Bias

As you can see, we’re not really getting anywhere here – WinDBG shows an equally large value (458,584) although it’s not quite the same as Process Explorer’s. In fact, it’s exactly:

491351 – 458584 = 32767 (0x7FFF)

This can’t be a coincidence, can it? In fact, looking at other objects in Process Explorer, and comparing the reference count with WinDBG shows a similar pattern – not only are the numbers huge, but Process Explorer is always off by 0x7FFF. I also noticed a second pattern – the more handles that the object had, the bigger the reference count was, and always by a factor of around, or almost, 32767. In this case, dividing 458584 references by 14 handle counts gives us 32756 references-per-handle – close enough. Doing the opposite math on 491351 references gives us 14.995 handles.

Having worked on Process Explorer previously, I knew that as part of the code which handles the properties dialog and queries information on the object, the tool opens its own handle to the object, temporarily creating 15 handles. Something became clear: there is now a bias in the reference count of objects, based on the number of handles. However, this bias is not exactly 32767, so something else must be going on.
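The arithmetic so far is easy to check in a few lines of C. This is only a sketch of the back-of-the-envelope math: the function and macro names are made up for illustration, and the constants are simply the values read from Process Explorer and WinDBG in this post.

```c
#include <assert.h>

/* Reference counts reported for DBWinMutex in this post. */
#define PROCEXP_REFS 491351L   /* Process Explorer */
#define WINDBG_REFS  458584L   /* WinDBG !object   */

/* Gap between the two tools' readings of the same object. */
long tool_discrepancy(long procexp_refs, long windbg_refs)
{
    return procexp_refs - windbg_refs;
}

/* Whole references per open handle; the remainder is discarded. */
long refs_per_handle(long refs, long handles)
{
    assert(handles > 0);
    return refs / handles;
}
```

With these numbers, tool_discrepancy(PROCEXP_REFS, WINDBG_REFS) gives 32767 (0x7FFF), while refs_per_handle(WINDBG_REFS, 14) gives 32756, a little short of the full 0x7FFF per handle.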

Globally Searching for Opened Handles with Process Explorer

On a hunch, I decided to take a look at what would happen if I used Process Explorer’s “Find Handle or DLL” functionality, which searches all handles, system-wide, in order to find any which contain the name that the user entered. Because Windows only returns a list of PIDs and Handle Values, Process Explorer then has to attach to the process associated with the PID (since handles are local to each process) and then open the handle so that it can query its name. Let’s see what the search returned:

Fourteen processes have handles open to the DBWinMutex object. Let’s see what happened to the reference count…

The reference count went down to 491337. Which happens to be – wait for it – exactly 14 references less than what we had before. Repeating the exercise a few more times perfectly reproduces this behavior. Each time a new search is done, 14 processes are found (with 1 handle each), and the reference count goes down by 14 again.

The Per-Handle Reference Bias Revealed

At this point, we can infer the following two patterns:

  • Each time a new handle is opened to an object, the reference count goes up by 0x7FFF, or 32767, on x64 Windows. On x86 Windows, the same behavior is seen by the way, but with 0x1F instead.
  • Each time an existing handle to an object is used, the reference count goes down by 1.
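The two observations can be captured in a toy model. To be clear, this is not kernel code: toy_object, toy_open_handle and toy_use_handle are hypothetical names, and the model only reproduces the behavior inferred from the tools in this post.

```c
#include <assert.h>

/* Per-handle bias observed on x64 in this post; x86 showed 0x1F instead. */
#define HANDLE_BIAS_X64 0x7FFFL

/* Toy stand-in for an object header: only the pointer count is modelled. */
struct toy_object {
    long pointer_count;
};

/* Opening a handle raises the pointer count by the whole bias. */
void toy_open_handle(struct toy_object *obj)
{
    assert(obj != 0);
    obj->pointer_count += HANDLE_BIAS_X64;
}

/* Each use of an existing handle lowers the pointer count by one. */
void toy_use_handle(struct toy_object *obj)
{
    assert(obj != 0);
    obj->pointer_count -= 1;
}
```

Opening 14 handles and then using each one once leaves the toy count exactly 14 below 14 * 0x7FFF, which is the same drop of 14 seen after every "Find Handle or DLL" search.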

The last part in this exercise was trying to understand where this data is coming from. The last bullet point above suggests that there is some sort of per-handle reference count, so I used the !handle extension in WinDBG to locate the handle entry for Explorer’s (PID 4440 as seen earlier) handle to DBWinMutex (handle 0x44 as seen earlier). I used flag 2 to request the object information as well. As you’ll see below, this gave me the pointer to the handle table entry, which I’ve highlighted in green. We can then use WinDBG’s symbol information to dump the entry using the dt command with the _HANDLE_TABLE_ENTRY type inside the nt module.

As someone who has often dumped handle table entries in the debugger, the structure was striking to me, as it was very different from anything I had seen before. In fact, handle table entries only really stored two things before – the pointer to the object, and the granted access mask to the object. Yes, a few flags were used, but definitely nothing like we see above in Windows 8.1.

The New Handle Table Entry Format

Here are the big changes from previous versions of Windows, on x64:

  • Instead of storing the full 64-bit pointer to the object header, Windows now only stores a 44-bit pointer. The bottom four bits are inferred to be all zeroes, as all 64-bit allocations, code, and stack locations are 16-byte aligned, while the top sixteen bits are inferred to be all ones, as defined by the AMD64 architecture’s rules for canonical addresses (there must now be a dozen algorithms in Windows which rely on these bits having pre-defined, unchanging values!).
  • Three of the assumed bits are re-used to store the three handle attributes (inherited, audited, protected), while a fourth is used to store the lock bit for the handle entry.
  • Finally, the remaining 16 bits are now used to store an inverted reference count which keeps track of the number of times that a handle has been used by a process. This count begins at 0x7FFF and counts down toward zero with each additional reference made through the handle. The object’s reference count (i.e.: the pointer count field in the object header) is biased by the sum of the inverted reference counts held in every handle to the object.
  • Because the access mask is only 25 bits if you ignore the generic access rights (which are always translated into specific rights), additional bits can be used for flags. One such bit is used, the others are spare.
  • This leaves an unused 32-bit value that was wasted for alignment purposes on earlier versions of Windows. In Windows 8.1, this is now used to store the TypeInfo field, which is the Object Type Index in the Object Type Index Table (nt!ObTypeIndexTable). Dereferencing this index quickly reveals the object type for this handle, without having to even look at the object header.
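The pointer compression described in the list above can be sketched in C. The bit positions used here (lock bit at bit 0, the 16-bit count at bits 1 to 16, the three attributes at bits 17 to 19, and the 44 pointer bits at the top) are an assumption made for illustration, as are all of the hte_ names; the post only gives the field widths.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED bit positions, for illustration only. */
#define HTE_REFCNT_SHIFT   1
#define HTE_REFCNT_MASK    0xFFFFu
#define HTE_POINTER_SHIFT  20
#define HTE_REFCNT_START   0x7FFFu   /* starting value given in the text */

/* Pack an object header address and an inverted count into one 64-bit value. */
uint64_t hte_pack(uint64_t object_pointer, uint32_t refcnt)
{
    /* Drop the 4 alignment bits and the 16 canonical bits: 44 bits remain. */
    uint64_t bits44 = (object_pointer >> 4) & 0xFFFFFFFFFFFULL;
    return (bits44 << HTE_POINTER_SHIFT) |
           ((uint64_t)(refcnt & HTE_REFCNT_MASK) << HTE_REFCNT_SHIFT);
}

/* Rebuild the full address: low 4 bits are zero, top 16 bits are ones. */
uint64_t hte_object_pointer(uint64_t entry)
{
    uint64_t bits44 = entry >> HTE_POINTER_SHIFT;
    return (bits44 << 4) | 0xFFFF000000000000ULL;
}

/* Number of times the handle was used: start value minus what is left. */
uint32_t hte_use_count(uint64_t entry)
{
    uint32_t refcnt = (uint32_t)((entry >> HTE_REFCNT_SHIFT) & HTE_REFCNT_MASK);
    assert(refcnt <= HTE_REFCNT_START);
    return HTE_REFCNT_START - refcnt;
}
```

Packing a kernel-style address and unpacking it again returns the original value, which only works because the 20 dropped bits are fully predictable.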

On x86 Windows, the structure is different, but the changes are semantically similar:

  • No assumptions can be made on the top bits, so the entry continues to store a pointer to the object header, in which the bottom 3 bits are re-used to store the lock bit and 2 of the handle attributes (inherited, audited) as all x86 allocations are 8 byte aligned.
  • Because the granted access mask is only 25 bits, the remaining 7 bits can now be used to store the missing attribute flag (protected), leaving 6 bits to store the reference count. As such, the reference count starts at 0x1F instead, on x86 systems.
  • There is no additional space lost due to alignment, so there is no space to store the TypeInfo field.


As you can see, Windows 8.1 not only introduces a major rewrite of the handle table entry format, but these seemingly internal data structure changes also have a visible side effect when using the Windows Debugger or other tools to analyze reference counts on objects, something which driver developers often have to do (as do support professionals when troubleshooting leaks).

Additionally, for forensic analysts, the fact that there is now a per-handle “reference count”, which Microsoft should’ve really called an inverted access count, allows one to get a very detailed understanding of the number of times a handle has been used (and thus perhaps glean insight into unusual uses of the handle).

On a final note, this is a really good example of the type of Windows Internals analysis that one can do without doing any actual “black room” reverse engineering – I didn’t have to open IDA a single time or look at a single line of assembly code to discover and understand this functionality. By merely interacting with the system, deducing logic, and looking at state changes, the behavior became clear. If you ever note any other interesting Windows functionality or behavior that you’ve never been able to explain, feel free to leave a comment!

Protected Processes Part 3 : Windows PKI Internals (Signing Levels, Scenarios, Root Keys, EKUs & Runtime Signers)

December 28th, 2013


In this last part of our series on protected processes in Windows 8.1, we’re going to be taking a look at the cryptographic security that protects the system from the creation or promotion of arbitrary processes to protected status, as well as at how the system is extensible to provide options for 3rd party developers to create their own protected processes.

In the course of examining these new cryptographic features, we’ll also be learning about Signing Levels, a concept introduced in Windows 8. Finally, we’ll examine how the Code Integrity Library DLL (Ci.dll) is responsible for approving the creation of a protected process based on its associated signing level and digital certificate.

Signing Levels in Windows 8

Before Windows 8.1 introduced the protection level (which we described in Part 1 and Part 2), Windows 8 instituted the Signing Level, also sometimes referred to as the Signature Level. This undocumented number was a way for the system to differentiate the different types of Windows binaries, something that became a requirement for Windows RT as part of its requirement to prohibit the execution of Windows “desktop” applications. Microsoft counts among these any application that did not come from the Windows Store and/or which was not subjected to the AppContainer sandboxing technology enforced by the Modern/Metro programming model (meanwhile, the kernel often calls these “packaged” applications).

I covered Signing Levels in my Breakpoint 2012 presentation, and clrokr, one of the developers behind the Windows RT jailbreak, blogged about them as well. Understanding signing levels was critical for the RT jailbreak: Windows introduced a new variable, SeILSigningPolicy, which determined the minimum signing level allowed for non-packaged applications. On x86, this was read from the registry, and assumed to be zero, while on ARM, this was hard-coded to “8”, which as you can see from clrokr’s blog, corresponds to “Microsoft” – in effect allowing only Microsoft-signed applications to run on the RT desktop. The jailbreak, then, simply sets this value to “0”.
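As a rough sketch of what a "minimum signing level" policy means, the check below assumes a plain numeric comparison between the image's level and the policy value. That comparison rule and the function name are assumptions for illustration; the kernel's actual logic is not shown in this post.

```c
#include <assert.h>
#include <stdbool.h>

#define SIGNING_LEVEL_MICROSOFT 8   /* value hard-coded on ARM, per the text */

/* ASSUMPTION: an unpackaged image may run when its level meets the policy. */
bool unpackaged_image_allowed(int image_level, int se_il_signing_policy)
{
    assert(image_level >= 0 && se_il_signing_policy >= 0);
    return image_level >= se_il_signing_policy;
}
```

Under this model, a level 4 image is rejected while the policy is 8 and accepted once the policy is set to 0, which is the effect the jailbreak relied on.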

Another side effect of Signing Levels was that the “ProtectedProcess” bit in EPROCESS was removed — whether or not a Windows 8 process is protected for DRM purposes (such as Audiodg.exe, which handles audio decoding) was now implied from the value in the “SignatureLevel” field instead.

Signing Levels in Windows 8.1

In Windows 8.1, these levels have expanded to cover some of the needs introduced by the expansion of protected processes. The official names Microsoft uses for them are shown in Table 1 below. In addition, the SeILSigningPolicy variable is no longer initialized through the registry. Instead, it is set through the Secure Boot Signing Policy, a signed configurable policy blob which determines which binaries a Windows 8.1 computer is allowed to run. The value on 8.1 RT, however, remains the same – 8 (Microsoft), still prohibiting desktop application development.

Table 1: Windows 8.1 Signing Levels

| Signing Level | Name |
| 2 | Custom 0 |
| 3 | Custom 1 |
| 5 | Custom 2 |
| 7 | Custom 3 / Antimalware |
| 9 | Custom 4 |
| 10 | Custom 5 |
| 11 | Dynamic Code Generation |
| 13 | Windows Protected Process Light |
| 14 | Windows TCB |
| 15 | Custom 6 |

Furthermore, unlike the Protection Level that we saw in Parts 1 and 2, which is a process-wide value most often used for determining who can do what to a process, the Signature Level is in fact subdivided into both an EXE signature level (the “SignatureLevel” field in EPROCESS) as well as a DLL signature level (the “SectionSignatureLevel” field in the EPROCESS structure). While the former is used by Code Integrity to validate the signature level of the primary module binary, the latter is used to set the minimum level at which DLLs on disk must be signed in order to be allowed to load in the process. Table 2, which follows, describes the internal mapping used by the kernel in order to assign a given Signature Level for each particular Protected Signer.

Table 2: Protected Signers to Signing Level Mappings

| Protected Signer | EXE Signature Level | DLL Signature Level |
| PsProtectedSignerCodeGen | Dynamic Code Generation | Store |
| PsProtectedSignerAntimalware | Custom 3 / Antimalware | Custom 3 / Antimalware |
| PsProtectedSignerWinTcb | Windows TCB | Windows TCB |
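One way to picture the mapping above is as a small lookup table. The sketch below is illustrative only: the toy_signer enum is a hypothetical stand-in whose numeric values are not the kernel's, and while the level values 7, 11 and 14 come from Table 1, the value 6 for Store is not listed there and is taken from the SE_SIGNING_LEVEL_STORE constant in the Windows SDK's winnt.h.

```c
#include <assert.h>

/* Signing levels: 7, 11, 14 are from Table 1; 6 (Store) is from winnt.h. */
enum signing_level {
    LEVEL_STORE           = 6,
    LEVEL_ANTIMALWARE     = 7,
    LEVEL_DYNAMIC_CODEGEN = 11,
    LEVEL_WINDOWS_TCB     = 14
};

/* Hypothetical stand-ins for the signers in Table 2; NOT the kernel's values. */
enum toy_signer {
    TOY_SIGNER_CODEGEN,
    TOY_SIGNER_ANTIMALWARE,
    TOY_SIGNER_WINTCB,
    TOY_SIGNER_COUNT
};

struct signer_levels {
    enum signing_level exe_level;   /* checked against the main EXE */
    enum signing_level dll_level;   /* minimum level for loaded DLLs */
};

/* One row per signer, in the same order as the toy_signer enum. */
static const struct signer_levels k_signer_levels[TOY_SIGNER_COUNT] = {
    { LEVEL_DYNAMIC_CODEGEN, LEVEL_STORE       },
    { LEVEL_ANTIMALWARE,     LEVEL_ANTIMALWARE },
    { LEVEL_WINDOWS_TCB,     LEVEL_WINDOWS_TCB }
};

struct signer_levels levels_for_signer(enum toy_signer signer)
{
    assert((int)signer >= 0 && (int)signer < TOY_SIGNER_COUNT);
    return k_signer_levels[signer];
}
```

Note how the code generation signer is the only row where the EXE and DLL levels differ.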

Scenarios and Signers

When the Code Integrity library receives a request from the kernel to validate an image (i.e.: to perform page hash or image hash signature checks), the kernel sends both the signing level (which it determined based on its internal mapping matching Table 2 from above) as well as a bit mask called the Secure Required. This bit mask explains to Code Integrity why image checking is being done. Table 3, shown below, describes the possible values for Secure Required.

Table 3: Secure Required Bit Flags

| Bit Value | Description |
| 0x01 | Driver Image. Checks must be done on x64, ARM, or if linked with /INTEGRITYCHECK. |
| 0x02 | Protected Image. Checks must be done in order to allow the process to run protected. |
| 0x04 | Hotpatch Driver Image. Checks must be done to allow driver to hotpatch another driver. |
| 0x08 | Protected Light Image. Checks must be done in order to allow the process to run PPL. |
| 0x10 | Initial Process Image. Check must be done for User Mode Code Signing (UMCI) reasons. |

Based on this bit mask as well as the signing level, the Code Integrity library converts this information into a Scenario. Scenarios describe the signing policy associated with a specific situation in which signature checking is being done.
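Expressed as C constants, the bit mask might look like the following. The bit values are the ones listed above; the SR_ macro names and the two helper functions are invented for this sketch and are not Code Integrity's own identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values from the table above; the names are hypothetical. */
#define SR_DRIVER_IMAGE          0x01u
#define SR_PROTECTED_IMAGE       0x02u
#define SR_HOTPATCH_DRIVER_IMAGE 0x04u
#define SR_PROTECTED_LIGHT_IMAGE 0x08u
#define SR_INITIAL_PROCESS_IMAGE 0x10u
#define SR_ALL_KNOWN             0x1Fu

/* True when the check is being done to run a process protected or as a PPL. */
bool sr_is_protection_request(unsigned secure_required)
{
    assert((secure_required & ~SR_ALL_KNOWN) == 0);
    return (secure_required & (SR_PROTECTED_IMAGE | SR_PROTECTED_LIGHT_IMAGE)) != 0;
}

/* True when the check concerns a driver load or a hotpatch driver. */
bool sr_is_driver_request(unsigned secure_required)
{
    assert((secure_required & ~SR_ALL_KNOWN) == 0);
    return (secure_required & (SR_DRIVER_IMAGE | SR_HOTPATCH_DRIVER_IMAGE)) != 0;
}
```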

The system supports a total of 18 scenarios, and their goal is three-fold: to determine the minimum hash algorithm that is allowed for the signature check, to determine whether only a particular, specific Signer is allowed for this scenario (a Signer is identified by the content hash of the certificate used to sign the image), and to determine which signature level that Signer is allowed to bestow.

Table 4 below describes the standard Scenarios and their associated Security Required, Signing Level, and minimum Hash Algorithm requirements.

Table 4: Scenario Descriptions and Hash Requirements

| Scenario | Secure Required | Signing Level | Hash Algorithm |
| 0 | N/A | Windows TCB | CALG_SHA_256 |
| 1 | Hotpatch Image | Windows | CALG_SHA_256 |
| 4 | Protected Image | Authenticode | CALG_SHA1 |
| 5 | Driver Image | N/A | CALG_SHA1 |
| 7 | N/A | Dynamic Code Generation | CALG_SHA_256 |
| 9 | N/A | Custom 0 | CALG_SHA_256 |
| 10 | N/A | Custom 1 | CALG_SHA_256 |
| 11 | N/A | Custom 2 | CALG_SHA_256 |
| 12 | N/A | Custom 3 | CALG_SHA_256 |
| 13 | N/A | Custom 4 | CALG_SHA_256 |
| 14 | N/A | Custom 5 | CALG_SHA_256 |
| 15 | N/A | Custom 6 | CALG_SHA_256 |
| 16 | N/A | Windows Protected Light | CALG_SHA_256 |
| 18 | N/A | Unchecked or Invalid | CALG_SHA1 |

* Used for checking the Global Revocation List (GRL)
** Used for checking ELAM drivers

From this table we can see three main types of scenarios:

  • Those designed to match to a specific signing level that is being requested (0, 1, 2, 3, 6, 9, 10, 11, 12, 13, 14, 15, 16)
  • Those designed to support a specific “legacy” scenario, such as driver loads or DRM protected processes (1, 4, 5)
  • Those designed for specific internal requirement checks of the cryptographic engine (8, 17)

As expected, with Microsoft recommending the usage of SHA256 signatures recently, this type of signature is enforced on all their internal scenarios, with SHA1 only being allowed on driver and DRM protected images, Windows Store applications, and other generic Microsoft-signed binaries (presumably for legacy support).
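The hash requirement can be modelled as a simple lookup keyed on the scenario number. The sketch below only contains the scenarios whose rows appear in the table above, uses the standard ALG_ID values for CALG_SHA1 (0x8004) and CALG_SHA_256 (0x800C), and returns 0 for anything else; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Standard ALG_ID values, defined locally to keep the sketch portable. */
#define CALG_SHA1    0x8004u
#define CALG_SHA_256 0x800Cu

struct scenario_row {
    int      scenario;
    unsigned min_hash;
};

/* Only the scenarios listed in the table above. */
static const struct scenario_row k_scenarios[] = {
    { 0,  CALG_SHA_256 }, { 1,  CALG_SHA_256 }, { 4,  CALG_SHA1    },
    { 5,  CALG_SHA1    }, { 7,  CALG_SHA_256 }, { 9,  CALG_SHA_256 },
    { 10, CALG_SHA_256 }, { 11, CALG_SHA_256 }, { 12, CALG_SHA_256 },
    { 13, CALG_SHA_256 }, { 14, CALG_SHA_256 }, { 15, CALG_SHA_256 },
    { 16, CALG_SHA_256 }, { 18, CALG_SHA1    }
};

/* Minimum hash algorithm for a scenario, or 0 if the row is not listed. */
unsigned min_hash_for_scenario(int scenario)
{
    size_t i;
    assert(scenario >= 0);
    for (i = 0; i < sizeof(k_scenarios) / sizeof(k_scenarios[0]); i++) {
        if (k_scenarios[i].scenario == scenario) {
            return k_scenarios[i].min_hash;
        }
    }
    return 0;
}
```

A Secure Boot Signing Policy that rewrites the table would, in this model, simply swap out the contents of k_scenarios.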

The scenario table described in Table 4 is what normally ships with Code Integrity on x86 and x64 systems. On ARM, SHA256 is a minimum requirement for almost all scenarios, as the linked MSDN page above explained. And finally, like many of the other cryptographic behaviors in Code Integrity that we’ve seen so far, the table is also fully customizable by a Secure Boot Signing Policy.

When such a policy is present, the table above can be rewritten for all but the legacy scenarios, and custom minimum hash algorithms can be enforced for each scenario as needed. Additionally, the level to scenario mappings are also customizable, and the policy can also specify which “Signers”, identified by their certificate content hash, can be used for which Scenario, as well as the maximum Signing Level that a Signer can bestow.

Accepted Root Keys

Let’s say that the Code Integrity library has received a request to validate the page hashes of an image destined to run with a protection level of Windows TCB, and thus presumably with Scenario 0 in the standard configuration. What prevents an unsigned binary from satisfying the scenario, or perhaps a test-signed binary, or even a perfectly validly signed binary, but from a random 3rd party company?

When Code Integrity performs its checks, it always remembers the Security Required bit mask, the Signature Level, and the Scenario. The first two are used early on to decide which root certificate authorities will be allowed to participate in the signature check: different requests are subject to different accepted root keys, as per Table 5 below.

Note that in these tables, PRS refers to “Product Release Services”, the internal team within Microsoft that is responsible for managing the PKI process and HSM which ultimately signs every officially released Microsoft product.

Table 5: Accepted Root Keys

| Secure Required | Signing Level | Accepted Root Keys |
| Protected Image | N/A | PRS Only |
| Hotpatch Image | N/A | System and Self Signed Only |
| Driver Image | N/A | PRS Only |
| N/A | Store | Windows and PRS Only |
| N/A | Windows | Windows and PRS Only |
| N/A | Windows TCB | PRS Only |
| N/A | Authenticode | PRS, Windows, Trusted Root |
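The table above translates naturally into a function that returns a set of acceptable roots. This is a sketch under stated assumptions: the rows keyed on the Secure Required mask are tested before the rows keyed on the signing level (the post does not say which takes precedence), the macro and function names are invented, and the level value 6 for Store comes from winnt.h rather than from this post.

```c
#include <assert.h>

/* Hypothetical bit flags naming the root keys mentioned in the table. */
#define ROOT_PRS         0x01u
#define ROOT_WINDOWS     0x02u
#define ROOT_TRUSTED     0x04u
#define ROOT_SYSTEM      0x08u
#define ROOT_SELF_SIGNED 0x10u

/* Secure Required bits (values from the post). */
#define SR_DRIVER_IMAGE    0x01u
#define SR_PROTECTED_IMAGE 0x02u
#define SR_HOTPATCH_IMAGE  0x04u

/* Signing levels: 4, 12, 14 appear in the post; 6 (Store) is from winnt.h. */
#define LEVEL_AUTHENTICODE 4
#define LEVEL_STORE        6
#define LEVEL_WINDOWS      12
#define LEVEL_WINDOWS_TCB  14

/* Returns a mask of accepted roots, or 0 when no row applies. */
unsigned accepted_roots(unsigned secure_required, int signing_level)
{
    assert(signing_level >= 0 && signing_level <= 15);

    /* ASSUMPTION: Secure Required rows win over signing level rows. */
    if (secure_required & SR_PROTECTED_IMAGE) return ROOT_PRS;
    if (secure_required & SR_HOTPATCH_IMAGE)  return ROOT_SYSTEM | ROOT_SELF_SIGNED;
    if (secure_required & SR_DRIVER_IMAGE)    return ROOT_PRS;

    switch (signing_level) {
    case LEVEL_STORE:
    case LEVEL_WINDOWS:      return ROOT_WINDOWS | ROOT_PRS;
    case LEVEL_WINDOWS_TCB:  return ROOT_PRS;
    case LEVEL_AUTHENTICODE: return ROOT_PRS | ROOT_WINDOWS | ROOT_TRUSTED;
    default:                 return 0;
    }
}
```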

Additionally, Table 6 below describes overrides that can apply based on debug options or other policy settings which can be present in the Secure Boot Signing Policy:

Table 6: Accepted Root Key Overrides

| Option | Effect on Root Key Acceptance |
| Policy Option 0x80 | Enables DMD Test Root |
| Policy Option 0x10 | Enables Test Root |
| /TESTSIGNING in BCD | Enables Test Root for Store and Windows TCB Signing Levels. Also enables System Root, Self Signed Root and allows an Incomplete Signing Chain for other levels. |

Two final important exceptions apply to the root key selection. First, when a custom Secure Boot Signing Policy is installed, and it contains custom signers and scenarios, then absolutely all possible root keys, including incomplete chains, are allowed. This is because it will be the policy that determines which Signer/Hash, Scenario/Level mapping is valid for use, not a hard-coded list of keys.

The second exception is that certain signature levels are “runtime customizable”. We’ll talk more about these near the end of this post, but for now, keep in mind that for any runtime customizable level, all root keys are also accepted. We’ll see that this is because just like with custom signing policies, runtime customizable levels have additional policies based on the signer and other data.

As you can see, this first line of defense prohibits, for example, a non-PRS-signed image from ever being loaded as a driver or as a DRM-protected process. It also prevents any such image from ever reaching a signing level of Windows TCB (thus prohibiting the underlying protection level from ever being granted).

Of course, just looking at root keys can’t be enough — the Windows Root Key is used to sign everything from a 3rd party WHQL driver to an ELAM anti-malware process to a DRM-protected 3rd party Audio Processing Object. Additional restrictions exist in place to ensure the proper usage of keys for the appropriately matching signature level.

Modern PKI enables this through the presence of Enhanced Key Usage (EKU) extensions in a digital signature certificate, which are simply described by their unique OID (Object Identifier, a common format for X.509 certificates that describes object types).

Enhanced Key Usages (EKUs)

After validating that an image is signed with an appropriate certificate that belongs to one of the allowed root keys, the next step is to decide the signing level that the image is allowed to receive, once again keeping in mind the security required bit mask.

First of all, a few checks are made to see which root authority ultimately signed the image, and whether or not any failures are present, keeping account of debug or developer policy options that may have been enabled. These checks will always result in the Unsigned (1), Authenticode (4) or Microsoft (8) signature level being returned, regardless of other factors.

In the success cases, the following EKUs, shown in Table 7, are used in making the first-stage determination:

Table 7: EKU to Signing Level Mapping

| EKU OID Value | EKU OID Name | Granted Signing Level |
|  | Store | Store * |
|  | Code Generator | Dynamic Code Generation |
|  | Publisher | Microsoft |
|  | Hardware Driver Verification | Microsoft |
|  | System Component Verification | Windows |
|  | Kits Component | Microsoft ** |
|  | TCB Component | Windows TCB |
|  | Third Party Application Component | Authenticode |
|  | Software Extension Verification | Microsoft |

* Configurable by Secure Boot Signing Policy
** Only if Secure Boot Signing Policy Issued by Windows Kits Publisher

Next, the resulting signature level is compared with the initial desired signature level. If the level fails to dominate the desired level, a final check is made to see if the signing level is runtime customizable, and if so, this case is handled separately as we’ll see near the end of this post.

Finally, if the resulting signature level is appropriate given the requested level, a check is made to see if the Security Required includes bits 2 (Protected Image) and/or 8 (Protected Light Image). If the latter is present, and if the Windows signature level (12) is requested, two additional EKUs are checked for their presence — at least one must be in the certificate:

  • Protected Process Light Verification
  • Protected Process Verification

In the former case, i.e.: a Security Required bit mask indicating a Protected Image, then if the Windows TCB signature level (14) was requested, only the latter EKU is checked.
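The two cases just described can be condensed into a single predicate. This is a minimal sketch: the function name and boolean parameters are hypothetical, the Protected Light case is tested first when both bits are set (an assumption), and any combination the post does not describe is treated as having no extra EKU requirement.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the post. */
#define SR_PROTECTED_IMAGE        0x02u
#define SR_PROTECTED_LIGHT_IMAGE  0x08u
#define SIGNING_LEVEL_WINDOWS     12
#define SIGNING_LEVEL_WINDOWS_TCB 14

/*
 * has_ppl_eku: certificate carries "Protected Process Light Verification".
 * has_pp_eku:  certificate carries "Protected Process Verification".
 */
bool protection_ekus_ok(unsigned secure_required, int requested_level,
                        bool has_ppl_eku, bool has_pp_eku)
{
    assert(requested_level >= 0 && requested_level <= 15);

    /* PPL request at the Windows level: either EKU is enough. */
    if ((secure_required & SR_PROTECTED_LIGHT_IMAGE) &&
        requested_level == SIGNING_LEVEL_WINDOWS) {
        return has_ppl_eku || has_pp_eku;
    }
    /* Protected request at the Windows TCB level: only the latter EKU counts. */
    if ((secure_required & SR_PROTECTED_IMAGE) &&
        requested_level == SIGNING_LEVEL_WINDOWS_TCB) {
        return has_pp_eku;
    }
    /* ASSUMPTION: other combinations carry no additional EKU requirement. */
    return true;
}
```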

System Components

You can right-click on any PE file in Windows Explorer which has an embedded certificate and click on the “Digital Signatures” tab in the “Properties” window that you select from the context menu. By double-clicking on the certificate entry, and then clicking on “View Certificate”, you can scroll down to the “Enhanced Key Usages” line and see which EKUs are present in the certificate.

Here’s some screenshots of a few system binaries, which should now reveal familiar EKUs based on what we’ve seen so far.

First of all, here’s Audiodg.exe. All it has is the “Windows” EKU.


Next up, here’s Maps.exe, which has the “Store” EKU:


And finally, Smss.exe, which has both the “Windows” and “Windows TCB” EKU, as well as the “Windows Process Light Verification” EKU.


Runtime Signers 

We’ve mentioned a few cases where the system checks if a signature level is runtime customizable, and if so, proceeds to additional checks. As of Windows 8.1, in the absence of a Secure Boot Signing Policy, only level 7 fits this bill, which corresponds to “Custom 3 / Antimalware” from our first table. If a policy is present, then all the signature levels that have “Custom” in them unsurprisingly also become customizable, as well as the “Windows Protected Process Light” (13) level.

Once a level is determined to be customizable, the Code Integrity library checks if the signing level matches that of any of the registered runtime signers. If there’s a match, the next step is to authenticate the certificate information chain with the policy specified in the runtime signer registration data. This information can include an array of EKUs, which must be present in order to pass the test, as well as the content hash of at least one signer, of the appropriate hash length and hashing algorithm.

If all policy elements pass the test, then the requested signature level will be granted, bypassing any other default system EKU or root key checks.
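The matching steps above can be sketched as follows. The structures are hypothetical and simplified (one content hash per registered signer, EKUs held as OID strings); they are not the layout of the real runtime signer blob, which this post does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SIGNER_EKUS 3   /* limit of 3 EKUs mentioned in the post */

/* Hypothetical, simplified view of one registered runtime signer. */
struct runtime_signer {
    int                  level;          /* signing level it applies to */
    unsigned             hash_alg;       /* e.g. 0x800C for CALG_SHA_256 */
    const unsigned char *content_hash;
    size_t               hash_len;
    const char          *required_ekus[MAX_SIGNER_EKUS];
    size_t               eku_count;
};

/* Hypothetical view of the certificate that signed the image. */
struct cert_info {
    unsigned             hash_alg;
    const unsigned char *content_hash;
    size_t               hash_len;
    const char *const   *ekus;
    size_t               eku_count;
};

static bool cert_has_eku(const struct cert_info *cert, const char *eku)
{
    size_t i;
    for (i = 0; i < cert->eku_count; i++) {
        if (strcmp(cert->ekus[i], eku) == 0) return true;
    }
    return false;
}

/* Level must match, then hash algorithm, length and bytes, then every EKU. */
bool runtime_signer_matches(const struct runtime_signer *signer, int level,
                            const struct cert_info *cert)
{
    size_t i;
    assert(signer != NULL && cert != NULL);
    assert(signer->eku_count <= MAX_SIGNER_EKUS);

    if (signer->level != level) return false;
    if (signer->hash_alg != cert->hash_alg) return false;
    if (signer->hash_len != cert->hash_len) return false;
    if (memcmp(signer->content_hash, cert->content_hash, signer->hash_len) != 0) return false;
    for (i = 0; i < signer->eku_count; i++) {
        if (!cert_has_eku(cert, signer->required_ekus[i])) return false;
    }
    return true;
}
```

In this model a certificate may carry extra EKUs beyond the required ones and still match, since only the presence of the registered EKUs is tested.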

How does the system register such runtime signers? The Code Integrity library contains two API calls, SeRegisterSigningInformation and SeUnregisterSigningInformation, through which runtime signers can be registered and deregistered. The kernel makes these calls from SeRegisterElamCertResources, which runs either when an Early-Launch Anti Malware (ELAM) driver has loaded (subject to the rules surrounding obtaining an ELAM certificate), or, more interestingly, at runtime when instructed to do so by a user-mode caller.

That’s right: it is indeed possible, by calling the NtSetSystemInformation API with the SystemRegisterElamCertificateInformation information class, to pass the full path to a non-loaded ELAM driver binary. By using SeValidateFileAsImageType, the kernel will call into the Code Integrity library to check if the image is signed, using Scenario 17, which you’ll recall from Table 4 above is the ELAM scenario. If user-mode did not pass in a valid ELAM driver, the request will simply fail.

Once SeRegisterElamCertResources is called in either of these cases, it calls SepParseElamCertResources on the MICROSOFTELAMCERTIFICATEINFO section in order to parse an MSELAMCERTINFOID resource. Here is, for example, a screenshot of the resource data matching this name in Microsoft’s Windows Defender ELAM driver (Wdboot.sys):


This data is formatted according to the following rules below, which can be used in an .rc file when building your own ELAM driver. The sample data from the Windows Defender ELAM driver is also shown alongside in bold for easier comprehension.

MicrosoftElamCertificateInfo  MSElamCertInfoID
    <# of Entries, Max 3>,  -> 1
    L"Content Hash n\0",    -> f6f717a43ad9abddc8cefdde1c505462535e7d1307e630f9544a2d14fe8bf26e
    <Hash Algorithm n>,     -> 0x800C (CALG_SHA_256)
    L"EKU1n;EKU2n;EKU3n\0", ->;
    … up to 2 more blocks …

This data is then packaged up into the runtime signer blob that is created by the CiRegisterSigningInformation API and will be used for comparisons when the signing level matches. Note that the kernel always passes in “7” as the signature level for the signer, since the kernel API is explicitly designed for ELAM purposes.

On the other hand, the internal CiRegisterSigningInformation API can be used for arbitrary signing levels, as long as the current policy allows it and the levels are runtime customizable. Also note that the limitation on up to 3 EKUs and 3 Signers is also enforced by the kernel and not by the Code Integrity library.

Running as Anti-Malware Protected Process Light

In the previous posts we explained some of the protections offered to PPLs and the different signers and levels available. In this post, we started by seeing how the presence of EKUs and particular root authority keys causes the system to allow or prevent a certain binary from loading with the requested signature level (and thus protection level), as well as how DLLs can be prohibited from loading in such processes unless they too match a minimum signing level.

This should explain why a process like Smss or Csrss is allowed to run with a given protection level, but it doesn’t quite explain why MsMpEng.exe or NisSrv.exe are allowed to run as PPLs, because their certificate EKUs, shown below, don’t match any specially handled level:


However, by taking a look at the last section on runtime signers, and by using the CertUtil utility to dump the content hash of the certificate used to sign the Windows Defender binaries, you’ll note a distinct match between the information present in the resource section of the driver and the information in the certificate. See below for both the signature hash and the EKU presence:


Because of the ELAM driver, this specific hash and EKU are registered as a runtime signer. When the service launches, recall that by using the Protected Service functionality we saw in the previous post, the Windefend service requests a Win32 protection level of 3, or an NT protection level of 0x31. In turn, this translates to Signing Level 7. Because this level is runtime customizable, the runtime signer check is then performed, and the hash and EKU are matched.
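To see where 0x31 comes from, the NT protection level can be split into its parts. The sketch below assumes the one-byte layout from the kernel headers in the earlier posts of this series (protection type in the low three bits, signer in the high four bits); that layout is not restated in this post, and the function names are made up for illustration.

```c
#include <assert.h>

/* ASSUMED layout: bits 0-2 = type, bit 3 = audit, bits 4-7 = signer. */
unsigned protection_type(unsigned char level)
{
    return level & 0x7u;
}

unsigned protection_signer(unsigned char level)
{
    return (level >> 4) & 0xFu;
}

/* Combine a signer and a type into a single protection level byte. */
unsigned char make_protection(unsigned signer, unsigned type)
{
    assert(signer <= 0xFu && type <= 0x7u);
    return (unsigned char)((signer << 4) | type);
}
```

Under that assumption, 0x31 decodes to signer 3 with type 1, and make_protection(3, 1) rebuilds the same byte.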

As we mentioned above, the SystemRegisterElamCertificateInformation information class can be used to request parsing of an ELAM driver’s resource section in order to register a runtime signer. It turns out that this undocumented information class is exposed through the new InstallELAMCertificateInfo API in Windows 8.1, which any 3rd party can legitimately call in order to tap into this behavior, as long as the driver is ELAM signed.

You don’t actually need to have any code in the ELAM driver, just enough of a valid PE image such that the kernel-mode loader can parse the .rsrc section and recover the MicrosoftElamCertificateInfo resource section.

Furthermore, recall that for runtime signers, all the usual root key and EKU checks are gone, instead relying on the policy that was registered. In other words, the system allows you to function as your own 3rd party CA, and issue certificates with custom content hashes for different signers. Or better yet, it is possible to attach custom EKUs to one’s binaries, in order to separate other binaries your organization may be signing.


We have covered these new cryptographic features in great detail. Now I’d like to point out a few observations about the shortcomings and potential issues inherent to them.

As great and extensible as the new PPL system (and its accompanying PKI infrastructure) is, it is not without its own risks. For one thing, any company with an ELAM certificate can now create buggy user-mode processes (remember folks, these are AV companies we’re talking about…) that you not only can’t debug, but also can’t terminate from user-mode. Although yes, on platforms without SecureBoot, this would be possible by simply using a kernel debugger or custom drivers, imagine less tech-savvy users stuck without being able to use Task Manager.

Additionally, a great deal of reliance seems to have been put on EKUs, which were relatively unknown in the past and mostly only used to define a certificate as being for “SSL” vs “Code Signing”. One can only hope that the major CAs are smart enough to have filters in place to avoid arbitrary EKUs being associated with 3rd party Authenticode certificates. Otherwise, as long as a signature level accepts a non-PRS root key, the infrastructure could easily be fooled by an EKU that a CA has allowed into a certificate.

Finally, as with all PKI implementations, this one is not without its own share of bugs. I have independently discovered means to bypass some of the guarantees being made around PPLs, and to illegitimately create an Antimalware process, as I posted in this picture.  I obviously don’t have an ELAM certificate (and the system is not in test-signing mode), so this is potentially a problem. I’ve reported the issue to Microsoft and am awaiting more information/feedback before talking about this issue further, in case it is a legitimate bug that needs to be fixed.

Conclusion and Future Work

In this final post on protected processes, we delved deeply into the PKI that is located within the Code Integrity library in Windows 8.1, and we saw how it provides cryptographic boundaries around protected processes, PPLs, and signature levels reserved for particular usages.

At the same time, we talked about how custom signing policies, delivered through Secure Boot, can customize this functionality, and saw up to 6 “Custom” signing levels that can be defined through such a policy. Finally, we looked at how some of these signing levels, namely the Antimalware level by default, can be extended through runtime signers that can be registered either pre- or post-boot through special resource sections in ELAM drivers, thus leading to custom 3rd party PPLs.

In the near future, I intend to contribute patches to Process Hacker in order to add a new column to the process tree view which would show the process protection level in its native NT form, as this data is available through the NtQueryInformationProcess API call in Windows 8.1. The tooltip for this data would then show the underlying Signer and Level, based on the kernel headers I pasted in the earlier blog posts.

Last but not least, the term “Secure Boot Signing Policy” appears numerous times without a full explanation as to what this is, how to register one, and what policies such a construct can contain. It only seems fair to dedicate the next post to this topic – stay tuned!


The contents of this blog series could not have been made possible without the help and contributions of:

  • lilhoser
  • myriachan
  • msuiche

The Evolution of Protected Processes Part 2: Exploit/Jailbreak Mitigations, Unkillable Processes and Protected Services

December 10th, 2013


In this continuing series on the improvements of the protected process mechanism in Windows, we’ll move on past the single use case of LSASS protection and pass-the-hash mitigation through the Protected Process Light (PPL) feature, and into generalized system-wide use cases for PPLs.

In this part, we’ll see how Windows uses PPLs to guard critical system processes against modification and how this has prevented the Windows 8 RT jailbreak from working on 8.1. We’ll also take a look at how services can now be configured to run as a PPL (including service hosts), and how the PPL concept brings yet another twist to the unkillable process argument and semantics.

System Protected Processes

To start the analysis, let’s begin with a simple WinDBG script (you should collapse it into one line) to dump the current PID, name, and protection level of all running processes:

lkd> !for_each_process "
r? @$t0 = (nt!_EPROCESS*) @#Process;
.if @@(@$t0->Protection.Level) 
.printf /D \"%08x <b>[%70msu]</b> level: <b>%02x</b>\\n\",

The output on my rather clean Windows 8.1 32-bit VM, with LSA protection enabled as per the last post, looks something like below. I’ve added the actual string representation of the protection level for clarity:


As a reminder, the protection level is a bit mask composed of the Protected Signer and the Protection Type:

PsProtectedSignerNone = 0n0
PsProtectedSignerAuthenticode = 0n1
PsProtectedSignerCodeGen = 0n2
PsProtectedSignerAntimalware = 0n3
PsProtectedSignerLsa = 0n4
PsProtectedSignerWindows = 0n5
PsProtectedSignerWinTcb = 0n6
PsProtectedSignerMax = 0n7
PsProtectedTypeNone = 0n0
PsProtectedTypeProtectedLight = 0n1
PsProtectedTypeProtected = 0n2
PsProtectedTypeMax = 0n3
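
To make the bit mask concrete, here is a minimal sketch of how a protection level byte splits into its two parts. It assumes the Protection Type sits in the low 3 bits, followed by a single audit bit, with the Protected Signer in the high 4 bits; that layout is consistent with the 0x31, 0x51 and 0x52 values quoted in this series. The struct and helper names are hypothetical, not kernel names.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed layout of the one-byte protection level:
   bits 0-2 = Protection Type, bit 3 = Audit, bits 4-7 = Protected Signer. */
typedef struct {
    unsigned type;
    unsigned audit;
    unsigned signer;
} DecodedProtection;

DecodedProtection decode_protection(uint8_t level)
{
    DecodedProtection d;
    d.type   = level & 0x7u;
    d.audit  = (level >> 3) & 0x1u;
    d.signer = (level >> 4) & 0xFu;
    return d;
}

/* Names follow the PsProtectedSigner* values listed above. */
const char *signer_name(unsigned signer)
{
    static const char *const names[] = {
        "None", "Authenticode", "CodeGen", "Antimalware",
        "Lsa", "Windows", "WinTcb"
    };
    return signer < 7 ? names[signer] : "Unknown";
}

/* Names follow the PsProtectedType* values listed above. */
const char *type_name(unsigned type)
{
    static const char *const names[] = { "None", "ProtectedLight", "Protected" };
    return type < 3 ? names[type] : "Unknown";
}

/* Hypothetical helper: print a level together with its decoded parts. */
void print_protection(uint8_t level)
{
    DecodedProtection d = decode_protection(level);
    printf("%02x = %s / %s\n", (unsigned)level,
           signer_name(d.signer), type_name(d.type));
}
```

Under these assumptions, 0x31 decodes to signer 3 (Antimalware) with type 1 (ProtectedLight), and 0x52 decodes to signer 5 (Windows) with type 2 (Protected).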

This output shows that the System process (the unnamed process), as has been the case since Vista, continues to be a full-fledged protected process, alongside the Software Protection Platform Service (Sppsvc.exe).

The System process is protected because of its involvement in Digital Rights Management (DRM) and because it might contain sensitive handles and user-mode data that a local Administrator could have accessed in previous versions of Windows (such as XP). It stands to reason that Sppsvc.exe is protected due to similar DRM-like reasons, and we’ll shortly see how the Service Control Manager (SCM) knew to launch it with the right protection level.

The last protected process we see is Audiodg.exe, which also hails from the Vista days. Note that because Audiodg.exe can load non-Windows, 3rd party “System Audio Processing Objects” (sAPOs), it only uses the Authenticode Signer, allowing it to load the DLLs associated with the various sAPOs.

We also see a number of “WinTcb” PPLs – TCB here referring to “Trusted Computing Base”. For those familiar with Windows security and tokens, this is not unlike the SeTcbPrivilege (Act as part of the Operating System) that certain highly privileged tokens can have. We can think of these processes as essentially the user-mode root chain of trust provided by Windows 8.1. We’ve already seen that SMSS is responsible for launching LSASS with the right protection level, so it would make sense to also protect the creator. Very shortly, we’ll revisit what actual “protection” is really provided by the different levels.

Finally, we see the protected LSASS process as expected, followed by two “Antimalware” PPLs – the topic of which will be the only focus of Part 3 of this series – and one “Windows” PPL associated with a service host. Just like the SPP service, we’ll cover this one in the “Protected Services” section below.

Jailbreak and Exploit Mitigation

Note that it’s interesting that Csrss.exe was blessed with a protection level as well. It isn’t responsible for launching any special protected processes and doesn’t have any interesting data in memory like LSASS or the System process do. It has, however, gained a very nefarious reputation in recent years as being the source of multiple Windows exploits – many of which actually require running inside its confines for the exploit to function. This is due to the fact that a number of highly privileged specialized APIs exist in Win32k.sys and are meant only to be called by Csrss (as well as the fact that on 32-bit, Csrss has the NULL page mapped, and it also handles much of VDM support).

Because the Win32k.sys developers did not expect local code injection attacks to be an issue (they require Administrator rights, after all), many of these APIs didn’t even have SEH, or had other assumptions and bugs. Perhaps most famously, one of these, discovered by j00ru, and still unpatched, has been used as the sole basis of the Windows 8 RT jailbreak. In Windows 8.1 RT, this jailbreak is “fixed”, by virtue that code can no longer be injected into Csrss.exe for the attack. Similar Win32k.sys exploits that relied on Csrss.exe are also mitigated in this fashion.

Protected Access Rights

Six years ago in my Vista-focused protected process post, I enumerated the documented access rights which were not being granted to protected processes. In Windows 8.1, this list has changed to a dynamic table of elements of the type below:

+0x000 DominateMask        : Uint4B
+0x004 DeniedProcessAccess : Uint4B
+0x008 DeniedThreadAccess  : Uint4B
PAGE:821AD398 ; _RTL_PROTECTED_ACCESS RtlProtectedAccess[]
PAGE:821AD398 <0,   0, 0>                [None]
PAGE:821AD398 <2,   0FC7FEh, 0FE3FDh>    [Authenticode]
PAGE:821AD398 <4,   0FC7FEh, 0FE3FDh>    [CodeGen]
PAGE:821AD398 <8,   0FC7FFh*, 0FE3FFh*>  [Antimalware]
PAGE:821AD398 <10h, 0FC7FFh*, 0FE3FFh*>  [Lsa]
PAGE:821AD398 <3Eh, 0FC7FEh, 0FE3FDh>    [Windows]
PAGE:821AD398 <7Eh, 0FC7FFh*, 0FE3FFh*>  [WinTcb]

Access to protected processes (and their threads) is gated by the PspProcessOpen (for process opens) and PspThreadOpen (for thread opens) object manager callback routines, which perform two checks.

The first, done by calling PspCheckForInvalidAccessByProtection (which in turn calls RtlTestProtectedAccess and RtlValidProtectionLevel), uses the DominateMask field in the structure above to determine if the caller should be subjected to access restrictions (based on the caller’s protection type and protected signer). If the check fails, a second check is performed by comparing the desired access mask with either the “DeniedProcessAccess” or “DeniedThreadAccess” field in the RtlProtectedAccess table. As in the last post, clicking on any of the function names will reveal their implementation in C.
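
As a rough illustration of the first check, the sketch below models the dominance test using only the DominateMask column from the table above. It rests on assumptions of mine: that bit N of a mask means "dominates signer N", that an unprotected target is always accessible, and that a caller with a lower protection type never dominates. The function name and argument layout are hypothetical and are not the actual RtlTestProtectedAccess signature.

```c
#include <stdbool.h>
#include <stdint.h>

/* DominateMask column from the RtlProtectedAccess dump, indexed by signer. */
static const uint32_t kDominateMask[7] = {
    0x00, /* None         */
    0x02, /* Authenticode */
    0x04, /* CodeGen      */
    0x08, /* Antimalware  */
    0x10, /* Lsa          */
    0x3E, /* Windows      */
    0x7E  /* WinTcb       */
};

/* Sketch of the dominance test under the assumptions stated above. */
bool dominates(unsigned src_signer, unsigned src_type,
               unsigned dst_signer, unsigned dst_type)
{
    if (dst_type == 0) {
        return true;            /* target is not protected at all */
    }
    if (src_type < dst_type) {
        return false;           /* e.g. a PPL caller against a full PP target */
    }
    if (src_signer >= 7 || dst_signer >= 7) {
        return false;           /* outside the known signer range */
    }
    /* Assumed meaning: bit N set means the source dominates signer N. */
    return ((kDominateMask[src_signer] >> dst_signer) & 1u) != 0;
}
```

Read this way, the masks say that WinTcb (0x7E) covers every signer from Authenticode through WinTcb, Windows (0x3E) covers everything except WinTcb, and each of the remaining signers only covers itself.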

Based on the denied access rights above, we can see that when the source process does not “dominate” the target protected process, only the 0x3801 (~0xFC7FE) access mask is allowed, corresponding to PROCESS_QUERY_LIMITED_INFORMATION, PROCESS_SUSPEND_RESUME, PROCESS_TERMINATE, and PROCESS_SET_LIMITED_INFORMATION (the latter of which is a new Windows 8.1 addition).

On the thread side, THREAD_SET_LIMITED_INFORMATION, THREAD_QUERY_LIMITED_INFORMATION, THREAD_SUSPEND_RESUME, and THREAD_RESUME are the rights normally given, the latter being another new Windows 8.1 access bit.
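
The arithmetic behind these two lists can be checked with a small sketch. The access right values are the ones defined in the Windows SDK headers; the 20-bit width of the mask and the helper name are assumptions made for illustration.

```c
#include <stdint.h>

/* Access right values as defined in the Windows SDK headers. */
#define PROCESS_TERMINATE                  0x0001u
#define PROCESS_SUSPEND_RESUME             0x0800u
#define PROCESS_QUERY_LIMITED_INFORMATION  0x1000u
#define PROCESS_SET_LIMITED_INFORMATION    0x2000u

#define THREAD_SUSPEND_RESUME              0x0002u
#define THREAD_SET_LIMITED_INFORMATION     0x0400u
#define THREAD_QUERY_LIMITED_INFORMATION   0x0800u
#define THREAD_RESUME                      0x1000u

/* Assumption: the denied masks only describe the low 20 bits of an access mask. */
#define ACCESS_BITS_MASK                   0xFFFFFu

/* Hypothetical helper: turn a "denied" column value into the rights left over. */
uint32_t allowed_from_denied(uint32_t denied)
{
    return ~denied & ACCESS_BITS_MASK;
}
```

With the standard denied values from the table, allowed_from_denied(0xFC7FE) gives 0x3801, which is exactly the four process rights above OR-ed together, and allowed_from_denied(0xFE3FD) gives 0x1C02, the four thread rights.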

Pay attention to the output above, however, and you’ll note that this is not always the case!

Unkillable Processes

In fact, processes with a Protected Signer of Antimalware, Lsa, or WinTcb only grant 0x3800 (~0xFC7FF), which in other words prohibits the PROCESS_TERMINATE right. For that same group, we can see that THREAD_SUSPEND_RESUME is prohibited as well.
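
The per-signer difference can be expressed as a simple table lookup. This is only a sketch built from the two denied columns in the dump above; the type and function names are hypothetical, and it models the second check alone, for a caller that does not dominate the target.

```c
#include <stdbool.h>
#include <stdint.h>

#define PROCESS_TERMINATE      0x0001u
#define THREAD_SUSPEND_RESUME  0x0002u

typedef struct {
    uint32_t denied_process;
    uint32_t denied_thread;
} DeniedAccess;

/* Denied columns from the RtlProtectedAccess dump, indexed by signer. */
static const DeniedAccess kDenied[7] = {
    { 0x00000, 0x00000 }, /* None         */
    { 0xFC7FE, 0xFE3FD }, /* Authenticode */
    { 0xFC7FE, 0xFE3FD }, /* CodeGen      */
    { 0xFC7FF, 0xFE3FF }, /* Antimalware  */
    { 0xFC7FF, 0xFE3FF }, /* Lsa          */
    { 0xFC7FE, 0xFE3FD }, /* Windows      */
    { 0xFC7FF, 0xFE3FF }  /* WinTcb       */
};

/* True when a non-dominating caller may still request these process rights. */
bool process_access_allowed(unsigned target_signer, uint32_t desired)
{
    if (target_signer >= 7) {
        return false;
    }
    return (desired & kDenied[target_signer].denied_process) == 0;
}

/* Same test for thread handles. */
bool thread_access_allowed(unsigned target_signer, uint32_t desired)
{
    if (target_signer >= 7) {
        return false;
    }
    return (desired & kDenied[target_signer].denied_thread) == 0;
}
```

For example, process_access_allowed(5, PROCESS_TERMINATE) holds for a Windows-signed target, while the same request against signer 3 (Antimalware), 4 (Lsa) or 6 (WinTcb) is refused.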

This is now Microsoft’s 4th system mechanism that attempts to prevent critical system process termination. If you’ll recall, Windows Server 2003 introduced the concept of “critical processes”, which Task Manager would refuse to kill (and cause a bugcheck if killed with other tools), while Windows 2000 had introduced hard-coded paths in Task Manager to prevent their termination.

Both of these approaches had flaws: malware on Windows 2000 would often call itself “Csrss.exe” to avoid user-initiated termination, while calling RtlSetProcessIsCritical on Vista allowed malware to crash the machine when killed by AV (and also prevent user-initiated termination through Task Manager). Oh, and LSASS was never a critical process – but if you killed it, SMSS would notice and take down the machine. Meanwhile, AV companies were left at the mercy of process-killing malware, until Vista SP1 added object manager filtering, which allowed removing the PROCESS_TERMINATE right that could be granted to a handle.

It would seem like preventing PROCESS_TERMINATE to LSASS, TCB processes, and anti-virus processes is probably the mechanism that makes the most sense – unlike all other approaches which relied on obfuscated API calls or hard-coded paths, the process protection level is a cryptographic approach that cannot be faked (barring a CA/PKI failure).

Launching Protected Services

As SMSS is created by the System process, and it, in turn, creates LSASS, the SCM, and CSRSS, it makes sense for all of these processes to inherit some sort of protection level based on the implicit process creation logic in each of them. But how did my machine know to launch the SPP service protected? And why did I have one lone PPL service host? It turns out that in Windows 8.1, the Service Control Manager now has the capability of supporting services that need to run with a specific protection level, as well as performing similar work as the kernel when it comes to defending against access to them.

In Windows 8.1, when the SCM reads the configuration for each service, it eventually calls ScReadLaunchProtected which reads the “LaunchProtected” value in the service key. As you can see below, my “AppXSvc” service, for example, has this set to the value “2”.


You’ll see the “sppsvc” service with this value set to “1”, and you’ll see “Windefend” and “WdNisSvc” at “3”. All of these match the new definitions in the Winsvc.h header:

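// Service LaunchProtected types supported
#define SERVICE_LAUNCH_PROTECTED_NONE                    0
#define SERVICE_LAUNCH_PROTECTED_WINDOWS                 1
#define SERVICE_LAUNCH_PROTECTED_WINDOWS_LIGHT           2
#define SERVICE_LAUNCH_PROTECTED_ANTIMALWARE_LIGHT       3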

The SCM saves the value in the SERVICE_RECORD structure that is filled out by ScAddConfigInfoServiceRecord, and when the service is finally started by ScLogonAndStartImage, it is converted to a protection level by using the g_ScProtectionMap array of tagScProtectionMap structures. WINDOWS becomes 0x52, WINDOWS_LIGHT  becomes 0x51, and ANTIMALWARE_LIGHT becomes 0x31 – the same values shown at the very beginning of the post.

+0x000 ScmProtectionLevel   : Uint4B
+0x004 Win32ProtectionLevel : Uint4B
+0x008 NtProtectionLevel    : Uint4B
.data:00441988 ; tagScProtectionMap g_ScProtectionMap[]
.data:00441988 <0, 0, 0>    [None]
.data:00441988 <1, 1, 52h>  [Windows Protected]
.data:00441988 <2, 2, 51h>  [Windows Light]
.data:00441988 <3, 3, 31h>  [Antimalware Light]
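
The conversion itself amounts to a table lookup, which can be sketched as follows. The entries are copied from the dump above; the struct, array and function names are hypothetical stand-ins, not the SCM's own symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three fields of the dumped tagScProtectionMap entries. */
typedef struct {
    uint32_t scm_level;    /* the LaunchProtected value from the service key */
    uint32_t win32_level;
    uint32_t nt_level;
} ScProtectionMapEntry;

static const ScProtectionMapEntry kScProtectionMap[] = {
    { 0, 0, 0x00 }, /* None              */
    { 1, 1, 0x52 }, /* Windows Protected */
    { 2, 2, 0x51 }, /* Windows Light     */
    { 3, 3, 0x31 }  /* Antimalware Light */
};

/* Hypothetical lookup: returns 1 and fills *nt_level on a match, 0 otherwise. */
int launch_protected_to_nt(uint32_t launch_protected, uint8_t *nt_level)
{
    size_t count = sizeof(kScProtectionMap) / sizeof(kScProtectionMap[0]);
    for (size_t i = 0; i < count; i++) {
        if (kScProtectionMap[i].scm_level == launch_protected) {
            *nt_level = (uint8_t)kScProtectionMap[i].nt_level;
            return 1;
        }
    }
    return 0; /* unknown LaunchProtected value */
}
```

Feeding it the registry values mentioned earlier, 1 (sppsvc) maps to 0x52, 2 (AppXSvc) to 0x51, and 3 (Windefend, WdNisSvc) to 0x31.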

This now explains why NisSrv.exe (WdNisSvc) and MsMpEng.exe (Windefend) were running as “Antimalware”, a Protected Signer we haven’t talked about so far, but which will be the sole focus of Part 3 of this series.

In addition, the command-line Sc.exe utility has also been updated, with a new argument “qprotection”, as seen in the screenshot below:


Protected SCM Operations

When analyzing the security around protected services, an interesting conundrum arises: when modifying a service in any way, or even killing it, applications don’t typically act on the process itself, but rather communicate by using the SCM API, such as by using ControlService or StopService. In turn, responding to these remote commands, the SCM itself acts on its subordinate services.

Because the SCM runs with the “WinTcb” Protected Signer, it “dominates” all other protected processes (as we saw in RtlTestProtectedAccess), and the access checks would be bypassed. In other words, a user with only SCM privileges could use the APIs to affect the services, even if they were running with a protection level. However, this is not the case, as you can see in my attempt below to pause the AppX service, to change its configuration, and to stop it: only the latter was successful.


This protection is afforded by new behavior in the Service Control Manager that guards the RDeleteService, RChangeServiceConfigW, RChangeServiceConfig2W, RSetServiceObjectSecurity, and RControlService remote function calls (RPC server stubs). All of these stubs ultimately call ScCheckServiceProtectedProcess which performs the equivalent of the PspProcessOpen access check we saw the kernel do.

As you can see in the C representation of ScCheckServiceProtectedProcess that I’ve linked to, the SCM will gate access to protected services to anyone but the TrustedInstaller service SID. Other callers will get their protection level queried, and be subjected to the same RtlTestProtectedAccess API we saw earlier. Only callers that dominate the service’s protection level will be allowed to perform the corresponding SCM APIs – with the interesting exception around the handling of the SERVICE_CONTROL_STOP opcode in the RControlService case.

As the code shows, this opcode is allowed for Windows and Windows Light services, but not for Antimalware Light services – mimicking, in a way, the protection that the kernel affords to such processes. Here’s a screenshot of my attempt to stop Windows Defender:
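
The decision described in the last few paragraphs can be summarized in a short sketch. This is a simplified model of the behavior as explained in the text, not the actual ScCheckServiceProtectedProcess code: the function name, its parameters and the order of the checks are assumptions, and the dominance result is passed in as a precomputed flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Control codes and LaunchProtected values as defined in Winsvc.h. */
#define SERVICE_CONTROL_STOP                         0x00000001u
#define SERVICE_CONTROL_PAUSE                        0x00000002u

#define SERVICE_LAUNCH_PROTECTED_NONE                0u
#define SERVICE_LAUNCH_PROTECTED_WINDOWS             1u
#define SERVICE_LAUNCH_PROTECTED_WINDOWS_LIGHT       2u
#define SERVICE_LAUNCH_PROTECTED_ANTIMALWARE_LIGHT   3u

/* Hypothetical model of the SCM gate. is_control_request distinguishes the
   RControlService path from the configuration and delete paths. */
bool scm_operation_allowed(bool caller_is_trusted_installer,
                           bool caller_dominates_service,
                           uint32_t service_launch_protected,
                           bool is_control_request,
                           uint32_t control_code)
{
    if (service_launch_protected == SERVICE_LAUNCH_PROTECTED_NONE) {
        return true;    /* ordinary service, nothing extra to check */
    }
    if (caller_is_trusted_installer) {
        return true;    /* the TrustedInstaller service SID is exempt */
    }
    if (caller_dominates_service) {
        return true;    /* caller's protection level dominates the service's */
    }
    if (is_control_request && control_code == SERVICE_CONTROL_STOP) {
        /* Stop is tolerated for Windows and Windows Light services only. */
        return service_launch_protected != SERVICE_LAUNCH_PROTECTED_ANTIMALWARE_LIGHT;
    }
    return false;
}
```

Under this model, an unprotected caller can stop the AppX service (Windows Light) but cannot pause or reconfigure it, and cannot stop Windows Defender (Antimalware Light), which matches the results described above.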



In this post, we’ve seen how PPL’s usefulness extends beyond merely protecting LSASS against injection and credential theft.  The protected process mechanism in Windows 8.1 also takes on a number of other roles, such as guarding other key processes against modification or termination, preventing the Windows RT jailbreak, and ultimately obsoleting the “critical process” flag introduced in older Windows versions (as a side effect, it is no longer possible to kill Smss.exe with Task Manager in order to crash a machine!). We’ve also seen how the Service Control Manager has knowledge of protected processes and allows “protected services” to run, guarding access to them just as the kernel would.

Finally, and perhaps most interestingly to some readers, we’ve also seen how Microsoft is able to protect its antivirus solution (Windows Defender) with the protected process functionality as well, including even preventing the termination of its process and/or the stopping of its service. Following the EU lawsuits and the DOJ settlement, it was obviously impossible for Microsoft to withhold this capability from 3rd parties.

In the next post in this series, we’ll focus exclusively on how a developer can write an Antimalware PPL application, launch it, and receive the same level of protection as Windows Defender.  The post will also explore mechanisms that exist (if any) to prevent such a developer from doing so for malicious purposes.