Solution to Challenge

The clock has ticked past midnight, so it’s now time to reveal the solution to my previous challenge. When I say “Solution” I mean what I and others are aware to be the currently best method. Nobody else has found anything better, and the two “winners” have presented the same solution (which Windows itself uses).

Since the question originally came to me from a developer at Microsoft, and I mentionned this, it was safe to assume that the method Windows used was probably “the right answer”. However, the hard part was explaining what exactly it was doing.

Correct solutions came, in order, from Matt Miller, Razvan Hobeanu and Ken Johnson. These are some of my favorite blogs to read and people I respect most, so I was honoured that they took the time to write up a solution (thanks to everyone else as well!). I will present a “full” solution, including the 64-bit implementation, and the actual code in the kernel responsible for this hack.

Before I start however, there’s one esoteric solution from Myria which I thought was funny enough to be shared. She proposed, roughly: 1) SetThreadAffinityMask(GetCurrentThread(), 1); 2) return 0;

This cute answer will first force the thread to run on CPU 0, then return… CPU 0. Technically this is true, but it’s also completely useless for the actual purpose on why you’d want to know the CPU number in the first place.

Which brings us to the actual correct solution. Most people correctly identified the routine responsible for the code, RtlGetCurrentProcessorNumber, which is what kernel32’s GetCurrentProcessorNumber forwards to. Note that the WOW64 version actually forwards to NtGetCurrentProcessorNumber, and that this Native API also does exist on 32-bit versions of Windows, and reads the value stored in the PCR. While this is a simple solution, it involves an expensive system call. So let’s go back to the user-mode Rtl routine. The raw assembly code is as follows:

mov ecx, 03Bh
lsl eax, ecx
shr eax, 0Eh
retn

When I first saw this code, I didn’t even know what the LSL instruction did, as I had never encountered it. The Intel Manual explains that LSL stands for “Load Segment Limit”, which is a nice way to get the limit for a selector in the GDT without actually having access to the GDT itself. 0x3B is a rather weird selector, but I recognized it as 0x38 masked with 0x3. The former is the selector for the TEB, and the latter is called the RPL Mask, and selects the proper ring level (User-Mode is Ring 3, so RPL is 3). Converting this to nice C code using MSVC 2005’s intrinsics and the NDK (which has internal definitions), this function looks something like:

ULONG
RtlGetCurrentProcessorNumber(VOID)
{
ULONG SegmentLimit;

//
// Get the current segment limit of the TEB
//
SegmentLimit = __segmentlimit(KGDT_R3_TEB | RPL_MASK);

//
// Get the CPU number from the limit. Each processor has its TEB
// selector with a limit composed of the CPU number in the 14th to 19th bits.
//
return (SegmentLimit >> 14);
}

This explains what the code does, and in some sense, how it does it. However, what exactly is the CPU number doing there? Is this some sort of x86 feature? Is it added during each context switch, at boot-up, etc?

The answer lies in the KeStartAllProcessors routine in the kernel, where the following piece of assembly executes:

mov     ebx, [ebp-2Ch]
mov     eax, [ebp-328h]
shl     eax, 0Eh
mov     [ebx+38h], ax
mov     eax, [ebp-328h]
shl     eax, 0Eh
xor     eax, [ebx+3Ch]
and     eax, 0F0000h
xor     [ebx+3Ch], eax

With some help from IDA, we can make this a bit nicer and update some lines:

INIT:008F6605                 mov     ebx, [ebp+ProcessorState.SpecialRegisters.Gdtr.HighWord]
INIT:008F66D6                 mov     eax, [ebp+i]

And of course, [ebx+38h] is the KGDT_R3_TEB entry in the GDT. Because this routine initializes all processors, it loops them, and i contains the current CPU number in the loop. The processor state contains the pointer to the actual GDT for this processor. Therefore, this is a specific hack that was added, and is fully dependent on the OS, which has to be Windows 2003 or newer.

Finally, on x64 versions, the selector used is actually 0x53, based on the 0x50 TEB selector in 64-bit mode. In WOW64 however, a fake WOW system call to NtGetCurrentProcessorNumber is done instead.

Full credit for this hack and the code behind it should go to Neill Clift, who came up with it.

Challenge of the Week (Month?)

Here’s a nice challenge question I got from a very ingenious developer working at Microsoft… now that I’ve found the solution, I thought I should ask it out in the open.

Correct, complete and full answer gets you a nice prize [ie: your name and solution published ;)].

Find the fastest (total cycles) and smallest (total size) method of obtaining the current CPU number that current thread is executing on, on a Windows 2003 or higher computer (ie: this solution can take advantage of any API or system improvements added to NT 5.2+).

  1. You may use an API call if you wish, but be aware that the actual call and stack operations will count in your total.
  2. You may duplicate the contents of an API call, but be aware that you must explain what your code does in detail. Inlining an API you understand nothing about is not a complete solution.
  3. Code must work from user-mode. You can write a kernel driver or user a native function, but the total cycles spend on the ring transition will be factored in your total, plus any size of code spent in kernel-mode.

Email solutions to aionescu at gmail dot com. Posts questions in the comments if you have any.

Heap Tagging is Broken

While developping the Native Development Library (NDL) that I’m working on, I attempted to play with a very undocumented feature of the Rtl Heap APIs: Tagging.

If you’ve used the familiar ExAllocatePool APIs in kernel-mode, then you’re already familiar with tagging. The Heap Manager supports the same idea, but allows you to define your own string tags of arbitrary size. This is done by a rather complex set of global flags, special APIs with strange string formatting (RtlCreateTagHeap), and a hidden little macro in winnt.h. Here’s how heap tagging works in the NDL:

A function called NdlpAllocateMemoryInternal allows the caller (the NDL) to allocate memory from the NDL Heap with a specific size, flags, and tag. The tag here is an index that we can define ourselves, such as NDL_STRING_TAG which is 0x2. Then, the NDL has other internal and/or external functions which allocate memory. For example, the LPC routines need to allocate PORT_MESSAGEs or other structures, so NDL_COMMUNICATIONS_TAG is used when calling NdlpAllocateMemoryInternal. There is also NdlpAllocateString, which uses NDL_STRING_TAG. Finally, users of the NDL (your application itself) gets an API called NdlAllocateMemory. You only provide the size and flags, and internally the NDL will set the NDL_USER_TAG to your allocation.

So far so good.

Now there’s two cool things we can do. First, the RtlQueryTagHeap API allows you to obtain statistics on each tag. Allocations, frees, and bytes allocated. This can give you a nice memory map of the NDL’s current memory usage. Even better however, by using RtlWalkHeap, the NDL can scan for all active NDL_USER_TAG allocations. This is useful, since when your native application returns, an internal call to NdlUnregisterApplication is made. When this happens, the assumption is made that your code is done executing (unless you’ve registered as a “resident” application), so in order to promote good programming and to catch leaks, RtlWalkHeap is called, and all active heap entries are scanned. If a block with the NDL_USER_TAG tag index is found, a debug message is printed out, saying that a heap entry at 0xFOO of size 0xBAR is leaking. We can then use the User-Mode Stack Trace Database support and the AllocatorBackTraceIndex of the heap entry to give a complete stack trace on where this allocation was made.

So far so good. Or Not.

Turns out I was getting Tag Indeces such as 0x8007, 0x8004, etc. It seems that all heap allocations were instead indexed with 0x8000 | CurrentAllocationIndex. This wasn’t helpful at all, so I started analyzing the problem.

The first one is the way in which heap tags are generated and then saved. To generate a tag, you use the MAKE_HEAP_TAG macro in winnt.h. This macro takes a “Tag base”, which is what RtlCreateTagHeap returns to you, as well as a tag index, which you define yourself, for example 0x2. The operation that’s done is Base | (Index << 18). So for index 2, with a base of 0x40000, this gives us 0xC0000. The problem is that when RtlpUpdateTagEntry is done, the code does the following: shr ebx, 12h and ebx, 0FFFF0FFFh EBX contains the heap flags, which are the actual HEAP_XXX flags ORed with the tag. Suppose we didn't use any flags, and are just sending our heap tag, 0xC0000. The result of this operation will be 3, not 2, because nothing is done to take into account the heap tag base. However, this bug should cause us to get tag indeces that are off-by-one, not in the 0x8000 range. So more must be going on. Recall that ebx also contains the typical heap flags. Some heap flags are as small as 0x8, others are bigger such as 0x100, and others yet are as high as 0x40000000. You can start seeing how this can corrupt this check. To make matters worse, when using a stack trace database, the heap understands that it's working in "debugging mode", so it calls a different set of APIs, such as RtlAllocateHeapSlowly and RtlDebugAllocateHeap. The latter ORs in some flags by default, such as Heap->ForceFlags, as well as HEAP_DISABLE_VALIDATION_CHECKS and HEAP_USER_SETTABLE_FLAGS. In my case, the total mask of the flags being ORed in was 0x50100000. Let’s bring in our heap tag, and the total becomes 501C0000. Let’s do the broken EBX code again, and the tag index becomes 0x407. Now RtlpUpdateTagEntry will check if 0x407 is above Heap->HighestTagIndex, and since I’ve created a lot less then 1031 tags, it will think this is a “pseudo-tag”. A pseudo-tag is the combinaiton of HEAP_PSEUDO_TAG_MASK and the curent allocaition index…and you’ve gussed it, that mask is 0x8000.

Thankfully, I was able to find a workaround for the NDL, although not with a small (but not critical) loss of functionality. First, I disabled support for stack backtraces. It makes finding your leak a big harder, but it’s not the end of the world, since this functionality is provided as a small benefit anyway. Since the stack trace functions are exported by Rtl, I will simply modify NdlAllocateMemory to capture the trace by itself. I can then use RtlSetUserFlagsHeap to associate the backtrace index or another similar device. If I want to get more evil, I can probably also play with the _HEAP_ENTRY structure itself and set the backtrace index myself.

The second “fix” was not to use the MAKE_HEAP_TAG macro at all, and ignore the “Tag base”. This solves the off-by-one problem but won’t work very reliably because it can conflict with actual heap flags.

This problem is on Win 2000 and XP. I haven’t checked Windows 2003 or Vista yet, but it’s possible that Vista fixed it after Adrian’s rewrite of code for higher security.

DR (Debug Register) Safety/Reliability and Accounting Features in Windows 2003

As some of you may know, Windows 2000 and even XP suffered from multiple validation/sanitation lacks in DR handling during Context<->Trap Frame conversion. The former is the CONTEXT structure used by Win32, and the latter refers to the KTRAP_FRAME structure used in NT. Many APIs such as Set/GetThreadContext, NtContinue, VDM Stuff, User-mode APCs and User-mode Exception Handling as well as Win32k User-mode Callbacks will eventually convert from one form of the structure to the other. These structures contain the entire CPU state (the KTRAP_FRAME doesn’t contain FPU/NPX Stuff, this is saved on the thread’s kernel stack instead), such as segments, registers and eflags.

You can imagine that a really poorly written kernel would allow you to do something like this in user-mode:

Context.SegCs = KGDT_R0_CODE;
NtSetThreadContext(Thread, &Context); and this would save the Ring 0 CS Selector into the KTRAP_FRAME, which is used when returning back to user-mode, thus giving you Ring 0 access.

Of course, DaveC wasn’t that stupid.

The NT Kernel heavily validates (or “sanitizes” EFLAGS and the fs, ds, es, cs selectors, as well as ensures DR6 and DR7 are valid). However, older versions of Windows did not fully ensure the safety of these registers. In case you didn’t know, the DRs, or Debug Registers, are a series of 32-bit registers on the x86 CPU provided for hardware breakpoints and other debugger support. DR0, 1, 2 and 3 are used to hold the addresses of the hardware breakpoints, while DR6 is a status register, and DR7 is a control register.

Already, you can guess that you really don’t want user-mode to give you kernel-mode pointers in DR0-3. The kernel would be blissfully unware that you’ve just set breakpoints in kernel space, and crash when those pointers were hit. Windows 2000 does validate for this.

However, consider the scenario where the caller sets proper user-mode addresses. The kernel will allow this, and when those pointers are hit, the CPU will do a breakpoint, killing the process if no debugger is attached. Again, I insist that the CPU is entirely responsible for the exception. It has no knowledge of address spaces. This implies that these breakpoint addresses are global for the entire system. Windows 2000 allowed a lower-privilege application to set a debug register on on a specific address that would be hit in a remote process, and then crash that application. Careful crafting would allow the crash to be predictable, and exploitable, such as this advisory demonstrates.

This has long been fixed, and the entire way in which DR registers are handled has also been re-written to protect against some flaws that could happen under VDM or V8086 mode. The DISPATCHER_HEADER has a member called DebugActive, and it’s used for KTHREAD objects. This 1-byte value is actually a mask which represents which DR registers are valid for this thread. The masks are generated as follows:

//

// Thread Dispatcher Header DebugActive Mask

//

#define DR_MASK(x) 1 << x

#define DR_ACTIVE_MASK 0x10

#define DR_REG_MASK 0x4F

Notice, since there is no DR4 register, the 0x10 flag is actually used to specify whether debugging is actually active on the thread. Now if we take a look at KeContextToKframes, which converts a CONTEXT to a KTRAP_FRAME, the code is similar to this:

    /* Handle the Debug Registers */

    if ((ContextFlags & CONTEXT_DEBUG_REGISTERS) == CONTEXT_DEBUG_REGISTERS)

    {

        /* Loop DR registers */

        for (i = 0; i < 4; i++)

        {

            /* Sanitize the context DR Address */

            SafeDr = Ke386SanitizeDr(KiDrFromContext(i, Context), PreviousMode);

 

            /* Save it in the trap frame */

            *KiDrFromTrapFrame(i, TrapFrame) = SafeDr;

 

            /* Check if this DR address is active and add it in the DR mask */

            if (SafeDr) DrMask |= DR_MASK(i);

        }

 

        /* Now save and sanitize DR6 */

        TrapFrame->Dr6 = Context->Dr6 & DR6_LEGAL;

        if (TrapFrame->Dr6) DrMask |= DR_MASK(6);

 

        /* Save and sanitize DR7 */

        TrapFrame->Dr7 = Context->Dr7 & DR7_LEGAL;

        KiRecordDr7(&TrapFrame->Dr7, &DrMask);

 

        /* If we’re in user-mode */

        if (PreviousMode != KernelMode)

        {

            /* Save the mask */

            KeGetCurrentThread()->DispatcherHeader.DebugActive = DrMask;

        }

    }

Likewise, the converse function, KeContextFromKframes, uses the following blob:

    /* Handle debug registers */

    if ((Context->ContextFlags & CONTEXT_DEBUG_REGISTERS) ==

        CONTEXT_DEBUG_REGISTERS)

    {

        /* Make sure DR7 is valid */

        if (TrapFrame->Dr7 & ~DR7_RESERVED_MASK)

        {

            /* Copy the debug registers */

            Context->Dr0 = TrapFrame->Dr0;

            Context->Dr1 = TrapFrame->Dr1;

            Context->Dr2 = TrapFrame->Dr2;

            Context->Dr3 = TrapFrame->Dr3;

            Context->Dr6 = TrapFrame->Dr6;

 

            /* Update DR7 */

            Context->Dr7 = KiUpdateDr7(TrapFrame->Dr7);

        }

        else

        {

            /* Otherwise clear DR registers */

            Context->Dr0 =

            Context->Dr1 =

            Context->Dr3 =

            Context->Dr6 =

            Context->Dr7 = 0;

        }

    }

This new code ensures not only that DR7 and DR6 are valid, but also clears the DR registers if DR7 is invalid, as well as creates a specific per-thread mask specifying which DR registers are enabled and which are not, which protects from the random activation or use of DR addresses. Also, DR7 specifies which of the DRx registers are actually in use, so this information also needs to be kept into account. The KiUpdate/RecordDr7 routines are shown below:

ULONG

FASTCALL

KiUpdateDr7(IN ULONG Dr7)

{

    ULONG DebugMask = KeGetCurrentThread()->DispatcherHeader.DebugActive;

 

    /* Check if debugging is enabled */

    if (DebugMask & DR_ACTIVE_MASK)

    {

        /* Sanity checks */

        ASSERT((DebugMask & DR_REG_MASK) != 0);

        ASSERT((Dr7 & ~DR7_RESERVED_MASK) == DR7_OVERRIDE_MASK);

        return 0;

    }

 

    /* Return DR7 itself */

    return Dr7;

}

 

BOOLEAN

FASTCALL

KiRecordDr7(OUT PULONG Dr7Ptr,

            OUT PULONG DrMask)

{

    ULONG NewMask, Mask;

    UCHAR Result;

 

    /* Check if the caller gave us a mask */

    if (!DrMask)

    {

         /* He didn’t use the one from the thread */

         Mask = KeGetCurrentThread()->DispatcherHeader.DebugActive;

    }

    else

    {

        /* He did, read it */

        Mask = *DrMask;

    }

 

    /* Sanity check */

    ASSERT((*Dr7Ptr & DR7_RESERVED_MASK) == 0);

 

    /* Check if DR7 is empty */

    NewMask = Mask;

    if (*Dr7Ptr)

    {

        /* Assume failure */

        Result = FALSE;

 

        /* Check the DR mask */

        NewMask &= 0x7F;

        if (NewMask & DR_REG_MASK)

        {

            /* Set the active mask */

            NewMask |= DR_ACTIVE_MASK;

 

            /* Set DR7 override */

            *DrMask = DR7_OVERRIDE_MASK;

        }

        else

        {

            /* Sanity check */

            ASSERT(NewMask == 0);

        }

    }

    else

    {

        /* Check if we have a mask or not */

        Result = NewMask ? TRUE: FALSE;

 

        /* Update the mask to disable debugging */

        NewMask &= ~DR_ACTIVE_MASK;

        NewMask |= 0x80;

    }

 

    /* Check if caller wants the new mask */

    if (DrMask)

    {

        /* Update it */

        *DrMask = NewMask;

    }

    else

    {

        /* Check if the mask changed and update it directly */

        if (Mask != NewMask) KeGetCurrentThread()->DispatcherHeader.DebugActive;

    }

 

    /* Return the result */

    return Result;

}

The code above is from ReactOS and may contain bugs :). Some Macros/defines are missing but the overall point should be clear. Next time you’re debugging a thread and come across DebugActive having a value that you expected was TRUE or FALSE, hopefully this should give you some insight.

On another note, I have started working on the NDK article and hope to finish it by tomorrow.

GCC and Vista Incompatibility

Since ReactOS is still being built with GCC (unfortunately), some of our devs have started to report a problem when using the MinGW build under Windows Vista. The call to MapViewOfFileEx that the compiler users for precompiled header support fails, so the compilation fails for any project that uses a PCH.

This type of error might creep up in other system software as well, and it’s not really GCC’s fault for succumbing to it. If you look at the documentation for CreateFileMapping, you’ll notice this blurb in the Remarks section:

Creating a file mapping object from a session other than session zero requires the SeCreateGlobalPrivilege privilege. Note that this privilege check is limited to the creation of file mapping objects and does not apply to opening existing ones. For example, if a service or the system creates a file mapping object, any process running in any session can access that file mapping object provided that the caller has the required access rights.

Windows XP/2000: The requirement described in the previous paragraph was introduced with Windows Server 2003, Windows XP SP2 and Windows 2000 Server SP4.

Although this feature was added in SP2, the reason it doesn’t happen in Windows XP has to do with two changes in Vista. First, UAC means that programs don’t get the SeCreateGlobalPrivilege anymore, because they’re not running in administrator accounts anymore. Secondly, in Vista, Session 0 is now the SYSTEM account session, where the login screen and services are running. Therefore, any user processes will run in Session 1, even in a normal single-user system. These two factors combined mean that CreateFileMapping is now significantly reduced in functionality and that only services are allowed to create global shared memory.

There are three workarounds if you really need the functionality:

  1. Use the Microsoft Management Console (MMC) and the Local Security Policy Snap-In to give SeCreateGlobalPrivilege to the limited account.
  2. Write a wrapper program that executes with elevated rights and and uses RtlAcquire/AdjustPrivilege to get the privilege before running your target program (Such as gcc).
  3. Use the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel\ObUnsecureGlobalNames string array to add the name of the section to the list. Hopefully your program isn’t randomizing the name. Adding this name will disable the kernel protection check.

Why Io/NtCreateFile Fails…

If you’ve ever had a call to NtCreateFile or IoCreateFail fail with the very helpful “STATUS_INVALID_PARAMETER” status code, you can now how exciting it can be to track down which flag exactly you might’ve messed up. Hopefully this snippet of code should help you manually validate your call (I’m not sure if the latest PREFast checks for these things when compiling). This code is from ReactOS.

/* Check if we need to check parameters, or if we’re user mode */

if ((AccessMode != KernelMode) || (Options & IO_CHECK_CREATE_PARAMETERS))

{

    /* Validate parameters */

    if ((FileAttributes & ~FILE_ATTRIBUTE_VALID_FLAGS) ||

        (ShareAccess & ~FILE_SHARE_VALID_FLAGS) ||

        (Disposition > FILE_MAXIMUM_DISPOSITION) ||

        (CreateOptions & ~FILE_VALID_OPTION_FLAGS) ||

         (CreateOptions &

         (FILE_SYNCHRONOUS_IO_ALERT | FILE_SYNCHRONOUS_IO_NONALERT) &&

         (!(DesiredAccess & SYNCHRONIZE))) ||

        ((CreateOptions &

         (FILE_SYNCHRONOUS_IO_NONALERT | FILE_SYNCHRONOUS_IO_ALERT)) ==

         (FILE_SYNCHRONOUS_IO_NONALERT | FILE_SYNCHRONOUS_IO_ALERT)) ||

        ((CreateOptions & FILE_DIRECTORY_FILE) &&

         !(CreateOptions & FILE_NON_DIRECTORY_FILE) &&

          ((CreateOptions & ~(FILE_DIRECTORY_FILE |

                              FILE_SYNCHRONOUS_IO_ALERT |

                              FILE_SYNCHRONOUS_IO_NONALERT |

                              FILE_WRITE_THROUGH |

                              FILE_COMPLETE_IF_OPLOCKED |

                              FILE_OPEN_FOR_BACKUP_INTENT |

                              FILE_DELETE_ON_CLOSE |

                              FILE_OPEN_FOR_FREE_SPACE_QUERY |

                              FILE_OPEN_BY_FILE_ID |

                              FILE_OPEN_REPARSE_POINT)) ||

          ((Disposition != FILE_CREATE) &&

           (Disposition != FILE_OPEN) &&

           (Disposition != FILE_OPEN_IF)))) ||

        ((CreateOptions & FILE_COMPLETE_IF_OPLOCKED) &&

         (CreateOptions & FILE_RESERVE_OPFILTER)) ||

        ((CreateOptions & FILE_NO_INTERMEDIATE_BUFFERING) &&

         (DesiredAccess & FILE_APPEND_DATA)))

    {

        /*

         * Parameter failure. We’ll be as unspecific as NT as to

         * why this happened though, to make debugging a pain!

         */

        DPRINT1(“File Create Parameter Failure!\n”);

        return STATUS_INVALID_PARAMETER;

    }

 

    /* Now check if this is a named pipe */

    if (CreateFileType == CreateFileTypeNamedPipe)

    {

        /* Make sure we have extra parameters */

        if (!ExtraCreateParameters) return STATUS_INVALID_PARAMETER;

 

        /* Get the parameters and validate them */

        NamedPipeCreateParameters = ExtraCreateParameters;

        if ((NamedPipeCreateParameters->NamedPipeType >

             FILE_PIPE_MESSAGE_TYPE) ||

            (NamedPipeCreateParameters->ReadMode >

             FILE_PIPE_MESSAGE_MODE) ||

            (NamedPipeCreateParameters->CompletionMode >

             FILE_PIPE_COMPLETE_OPERATION) ||

            (ShareAccess & FILE_SHARE_DELETE) ||

            ((Disposition < FILE_OPEN) || (Disposition > FILE_OPEN_IF)) ||

            (CreateOptions & ~FILE_VALID_PIPE_OPTION_FLAGS))

        {

            /* Invalid named pipe create */

            return STATUS_INVALID_PARAMETER;

        }

    }

    else if (CreateFileType == CreateFileTypeMailslot)

    {

        /* Make sure we have extra parameters */

        if (!ExtraCreateParameters) return STATUS_INVALID_PARAMETER;

 

        /* Get the parameters and validate them */

        MailslotCreateParameters = ExtraCreateParameters;

        if ((ShareAccess & FILE_SHARE_DELETE) ||

            !(ShareAccess & ~FILE_SHARE_WRITE) ||

            (Disposition != FILE_CREATE) ||

            (CreateOptions & ~FILE_VALID_MAILSLOT_OPTION_FLAGS))

        {

            /* Invalid mailslot create */

            return STATUS_INVALID_PARAMETER;

        }

    }

}