Part 3 of User-Mode Debugging Internals

The last part in my series on how Windows XP and higher support user-mode debugging is now up on OpenRCE; this part covers the kernel-mode side of things, aka the Dbgk module. Read it and find out how to use the native system calls in your debugger, which let you do things like debug multiple processes from a single debugger!

I will post the article on my Publications page as well, shortly.

Coming up shortly: the secrets of RtlRemoteCall!

Recent Events

After initially being slashdotted, my blog post below got linked across the blogosphere, hit Digg, the Inquirer, BoingBoing and other major news sites, and I’ve reached some 60 000 visitors in less than 24 hours…

Since most of you are therefore new visitors, I just wanted to post a short introduction/information paragraph. First of all, I suggest you visit the About page of the blog, as well as my Wiki page on the ReactOS website. This is just to clear up any confusion about where I currently reside, my age, my education, etc. If you are interested in my other publications/works as a security researcher, you should visit the Publications page, as well as OpenRCE, where I usually post my latest articles. You can also find a recording of my REcon 2006 talk on Archive.Org. Search for my name; the PDF is available on the Publications page as well. Finally, my project, ReactOS, has a donation fund; if you’d like to donate some money, that would be much appreciated.

As for the DRM post, I never expected that it would get the kind of attention it has; to be fair, I had completely forgotten that today was Vista’s launch date (being a beta tester, I’ve had RTM for months now); I certainly don’t want to make it seem like I was specifically targeting this day to release anything. Later this week I will release some safe, generic, proof of concept code that targets what I believe is a flaw in the Code Integrity/Driver Signing model. My 64-bit VM is running extremely slowly, so it will take me some time to test the code. Because this code will require an initial reboot, Microsoft does not consider it to be a flaw from a security standpoint. And because it’s so generic, it has absolutely nothing to do with DRM or PMP. That being said, I’m sure someone with knowledge of the PMP implementation might be able to use this as a very smart building block of the entire code that would be required; but that would be like arresting every knife manufacturer because knives can kill people.

Finally, if any of you would like more information about ReactOS or would like to meet in person, I will be giving a talk at the SOCAL5X conference on February 9th, and I will be around LA on the 10th as well.

Update on Driver Signing Bypass

I apologize for the lack of news, but after attending CUSEC, I had to spend my time catching up on the two weeks of school and work that I had missed, and exploiting Vista ended up on the back burner, especially as I had to re-install VMWare 6.0 (which wasn’t cooperating) and a new Vista 64-bit image.

That being said, it turns out the code I’ve written does not work out of the box on a Vista RTM system. Although it can be effective when combined with a reboot, this doesn’t provide any advantage over any of the myriad other ways that this could be done (including booting with the disable integrity checks BCD option or the /TESTSIGN flag).

However, it does bypass DRM. As part of the Protected Media Path (PMP), Windows Vista sets up a number of requirements for A/V software and drivers in order to ensure they comply with the demands of the media companies. One of these features, which has been heavily criticized as being the actual reason behind driver signing, is that “some premium content may be unavailable” if test signing mode is used. Originally, I assumed that this meant that the kernel would set some sort of variable, but this didn’t make sense: once your unsigned driver could load, it could disable this check. After reading the PMP documentation, however, it seems to me that the “feature” described next is more likely the cause of this warning on premium content.

This feature is the ability of the PMP to notify A/V applications that there are unsigned drivers on the system, as well as provide a list of unsigned drivers. The idea is that the application can either outright refuse to play content, or that it can scan for known anti-DRM drivers which might be attempting to hook onto the unencrypted stream. This leads me to believe that it’s up to applications, not the OS, to enforce this DRM check.

The great thing about the code I’ve written is that it does NOT use test signing mode and it does NOT load an unsigned driver into the system. Therefore, to any A/V application running, the system seems totally safe when, in fact, it’s not. Now, because I’m still booting with a special flag, it’s possible for Microsoft to patch the PMP and have it report that this flag is set, thereby disabling premium content. However, because I already have kernel-mode code running at this point, I can disable this flag in memory, and PMP will never know that it was enabled. Again, Microsoft could fight this by caching the value, or obfuscating it somewhere inside PMP’s kernel-mode code, but as long as it’s in kernel-mode, and I’ve got code in kernel-mode, I can patch it.

To continue this game, Microsoft could then use Patchguard on the obfuscated value…but that would only mean that I can simply disable Patchguard using the numerous methods that Skywing documented in his latest paper.

In the end, the only way that PMP is going to work is with a Hypervisor, and even that will probably fail.

Unfortunately, with almost 0% use for the open source community (which can use test signing mode for their drivers), documenting my method and/or releasing a sample might be viewed as an anti-DRM tool, and definitely a DMCA violation. Although used on its own, this POC doesn’t do anything or go anywhere near the PMP (I don’t even have Protected Media, HDMI, HD-DVD, nor do I know where PMP lives or how someone can intercept decrypted streams), a particularly nasty group of lawyers could still somehow associate the DMCA with it, so I’m not going to take any chances.

It’s quite ironic: Microsoft claims driver signing is to fight malware and increase system stability, so if I get sued under DMCA, wouldn’t that be an admission that driver signing is an “anti-copyright infringement tool”?

I’d really love to release this tool to the public though, so I will look into my options — perhaps emphasizing the research aspect of it and crippling the binary would be a safe way.

Windows Vista 64-bit Driver Signing/PatchGuard Workaround

I’ve been sitting on this one for a while (over a year), awaiting confirmation of a final key component in the procedure, but I’ve now been able to test my method.
I will be spending tomorrow finishing up the paper and exploit code on my test Virtual PC image. Before you get all excited, please keep in mind this is a local, administrative-account-required workaround for the driver-signing requirement in Vista 64-bit and has no security implications whatsoever.

Since I wasn’t able to get a working POC until now, I haven’t made a lot of noise about it… if I get it working right tomorrow, I will probably send a little note to Microsoft to make sure they don’t go medieval on my ass — it has zero customer impact so I don’t think they will, but I apologize if I’ll have to can it.

Back from CUTC

I had the chance to attend the Canadian Undergraduate Technology Conference 2007 this year, in Toronto, and it was one of the most entertaining, informative and enjoyable events I’ve been to lately. Apart from the wonderful keynotes (one of them was by a Nobel laureate), the competitions, tech shows and sessions were extremely useful. I was extremely impressed by Apple’s Shark and Quartz Composer tools. I always imagined Mac development was a bit of a mystery and all command-line based magic, but their tools are a serious threat to Windows development. Windows doesn’t even have a tool that comes close to what Quartz Composer can do, and although tools like Shark already exist, none of them are so seamless, easy to use, and powerful. In 20 minutes we took code that we had never seen before, and optimized it from 900 ‘thoughts per second’ (a metric in an AI test case) to over 5000. The entire platform is built on open source tools (such as GCC), and even Shark is based on the Linux code analysis/profiling tool called DTrace (I believe that’s the name). But it’s the Apple UI and integration that makes it all worth it.

Meeting with various company executives, managers and engineers was great too, and they had a lot of insight into their experience working in the industry.

To make things even better, my team also won the “CUTC 2007 Best Design Award” in the AMD/ATI Tech Team competition. All five of our team members received an ATI Radeon video card. This week I’ll be attending CUSEC, the Canadian University Software Engineering Conference, which, thankfully, is in Montreal. I will most probably be doing a demo of ReactOS as well.

Solution to Challenge

The clock has ticked past midnight, so it’s now time to reveal the solution to my previous challenge. When I say “Solution” I mean what I and others believe to be the best method currently known. Nobody else has found anything better, and the two “winners” have presented the same solution (which Windows itself uses).

Since the question originally came to me from a developer at Microsoft, and I mentioned this, it was safe to assume that the method Windows used was probably “the right answer”. However, the hard part was explaining what exactly it was doing.

Correct solutions came, in order, from Matt Miller, Razvan Hobeanu and Ken Johnson. These are some of my favorite blogs to read and people I respect most, so I was honoured that they took the time to write up a solution (thanks to everyone else as well!). I will present a “full” solution, including the 64-bit implementation, and the actual code in the kernel responsible for this hack.

Before I start however, there’s one esoteric solution from Myria which I thought was funny enough to be shared. She proposed, roughly: 1) SetThreadAffinityMask(GetCurrentThread(), 1); 2) return 0;

This cute answer will first force the thread to run on CPU 0, then return… CPU 0. Technically this is true, but it’s also completely useless for any purpose you’d actually want the CPU number for in the first place.

Which brings us to the actual correct solution. Most people correctly identified the routine responsible for the code, RtlGetCurrentProcessorNumber, which is what kernel32’s GetCurrentProcessorNumber forwards to. Note that the WOW64 version actually forwards to NtGetCurrentProcessorNumber, and that this Native API also does exist on 32-bit versions of Windows, and reads the value stored in the PCR. While this is a simple solution, it involves an expensive system call. So let’s go back to the user-mode Rtl routine. The raw assembly code is as follows:

mov ecx, 03Bh
lsl eax, ecx
shr eax, 0Eh

When I first saw this code, I didn’t even know what the LSL instruction did, as I had never encountered it. The Intel Manual explains that LSL stands for “Load Segment Limit”, which is a nice way to get the limit for a selector in the GDT without actually having access to the GDT itself. 0x3B is a rather weird selector, but I recognized it as 0x38 ORed with 0x3. The former is the selector for the TEB, and the latter is called the RPL Mask, and selects the proper ring level (User-Mode is Ring 3, so RPL is 3). Converting this to nice C code using MSVC 2005’s intrinsics and the NDK (which has internal definitions), this function looks something like:

ULONG SegmentLimit;

// Get the current segment limit of the TEB
SegmentLimit = __segmentlimit(KGDT_R3_TEB | RPL_MASK);

// Get the CPU number from the limit. Each processor has its TEB
// selector with a limit composed of the CPU number in the 14th to 19th bits.
return (SegmentLimit >> 14);
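
The selector arithmetic above can be checked without touching a real GDT. What follows is only a minimal, portable sketch: the macro and function names are made up for illustration (they are not the NDK's definitions), and 0x38 is simply the TEB selector value quoted in the text.

```c
#include <stdint.h>

/* Architectural layout of an x86 segment selector. */
#define SELECTOR_RPL_MASK   0x3u   /* bits 0-1: requested privilege level */
#define SELECTOR_TI_BIT     0x4u   /* bit 2: 0 = GDT, 1 = LDT */

/* Hypothetical stand-in for the NDK's KGDT_R3_TEB (0x38 per the text). */
#define EXAMPLE_KGDT_R3_TEB 0x38u

/* Build the selector handed to LSL: TEB descriptor, ring 3. */
uint16_t make_teb_selector(void)
{
    return (uint16_t)(EXAMPLE_KGDT_R3_TEB | SELECTOR_RPL_MASK);
}

/* Index of the descriptor inside its table (selector bits 3-15). */
unsigned selector_index(uint16_t selector)
{
    return (unsigned)selector >> 3;
}

/* Privilege level requested by the selector. */
unsigned selector_rpl(uint16_t selector)
{
    return selector & SELECTOR_RPL_MASK;
}

/* Non-zero when the selector refers to the LDT rather than the GDT. */
unsigned selector_uses_ldt(uint16_t selector)
{
    return (selector & SELECTOR_TI_BIT) != 0;
}

/* Mirror of "shr eax, 0Eh": drop the low 14 bits of the limit. */
uint32_t cpu_from_segment_limit(uint32_t segment_limit)
{
    return segment_limit >> 14;
}
```

Read this way, 0x3B is GDT entry 7 requested at ring 3, and a limit whose bits 14 and up hold the value 5 decodes to CPU 5 regardless of what the low 14 bits contain.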

This explains what the code does, and in some sense, how it does it. However, what exactly is the CPU number doing there? Is this some sort of x86 feature? Is it added during each context switch, at boot-up, etc?

The answer lies in the KeStartAllProcessors routine in the kernel, where the following piece of assembly executes:

mov     ebx, [ebp-2Ch]
mov     eax, [ebp-328h]
shl     eax, 0Eh
mov     [ebx+38h], ax
mov     eax, [ebp-328h]
shl     eax, 0Eh
xor     eax, [ebx+3Ch]
and     eax, 0F0000h
xor     [ebx+3Ch], eax

With some help from IDA, we can make this a bit nicer and update some lines:

INIT:008F6605                 mov     ebx, [ebp+ProcessorState.SpecialRegisters.Gdtr.HighWord]
INIT:008F66D6                 mov     eax, [ebp+i]

And of course, [ebx+38h] is the KGDT_R3_TEB entry in the GDT. Because this routine initializes all processors, it loops over them, and i contains the current CPU number in the loop. The processor state contains the pointer to the actual GDT for this processor. Therefore, this is a specific hack that was added, and is fully dependent on the OS, which has to be Windows 2003 or newer.
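
To make the bit shuffling in that assembly easier to follow, here is a rough C model of it. This is a sketch and not the kernel's code: the struct and function names are hypothetical, the GDT entry is simplified to two 32-bit words, and it assumes a byte-granular descriptor so that the limit LSL reports is just the raw 20-bit field.

```c
#include <stdint.h>

/* Simplified view of one 8-byte GDT entry as two 32-bit words:
   low  word: limit bits 0-15 in its lower half, base bits 0-15 above
   high word: limit bits 16-19 sit in bits 16-19, the rest is base/flags */
typedef struct {
    uint32_t low;
    uint32_t high;
} gdt_entry_sketch;

/* Write 'cpu' into bits 14-19 of the entry's limit field. */
void stamp_cpu_number(gdt_entry_sketch *entry, uint32_t cpu)
{
    uint32_t shifted = cpu << 14;               /* shl eax, 0Eh */
    uint32_t diff;

    /* mov [ebx+38h], ax : replace only the 16-bit low limit field */
    entry->low = (entry->low & 0xFFFF0000u) | (shifted & 0xFFFFu);

    /* xor / and / xor : copy bits 16-19 of 'shifted' into the high
       word while leaving every other bit of that word untouched */
    diff = (shifted ^ entry->high) & 0x000F0000u;
    entry->high ^= diff;
}

/* Reassemble the 20-bit limit the way LSL would see it. */
uint32_t entry_limit(const gdt_entry_sketch *entry)
{
    return (entry->low & 0xFFFFu) | (entry->high & 0x000F0000u);
}

/* User-mode side: shr eax, 0Eh on the loaded limit. */
uint32_t cpu_from_entry(const gdt_entry_sketch *entry)
{
    return entry_limit(entry) >> 14;
}
```

The xor/and/xor sequence is a standard way to replace a masked bit field in place: the first xor finds the bits that differ, the mask keeps only the ones inside the field, and the final xor flips exactly those.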

Finally, on x64 versions, the selector used is actually 0x53, based on the 0x50 TEB selector in 64-bit mode. In WOW64 however, a fake WOW system call to NtGetCurrentProcessorNumber is done instead.

Full credit for this hack and the code behind it should go to Neill Clift, who came up with it.

Challenge of the Week (Month?)

Here’s a nice challenge question I got from a very ingenious developer working at Microsoft… now that I’ve found the solution, I thought I should ask it out in the open.

Correct, complete and full answer gets you a nice prize [ie: your name and solution published ;)].

Find the fastest (total cycles) and smallest (total size) method of obtaining the current CPU number that the current thread is executing on, on a Windows 2003 or higher computer (ie: this solution can take advantage of any API or system improvements added to NT 5.2+).

  1. You may use an API call if you wish, but be aware that the actual call and stack operations will count in your total.
  2. You may duplicate the contents of an API call, but be aware that you must explain what your code does in detail. Inlining an API you understand nothing about is not a complete solution.
  3. Code must work from user-mode. You can write a kernel driver or use a native function, but the total cycles spent on the ring transition will be factored into your total, plus any size of code spent in kernel-mode.

Email solutions to aionescu at gmail dot com. Post questions in the comments if you have any.

Heap Tagging is Broken

While developing the Native Development Library (NDL) that I’m working on, I attempted to play with a very undocumented feature of the Rtl Heap APIs: Tagging.

If you’ve used the familiar ExAllocatePool APIs in kernel-mode, then you’re already familiar with tagging. The Heap Manager supports the same idea, but allows you to define your own string tags of arbitrary size. This is done by a rather complex set of global flags, special APIs with strange string formatting (RtlCreateTagHeap), and a hidden little macro in winnt.h. Here’s how heap tagging works in the NDL:

A function called NdlpAllocateMemoryInternal allows the caller (the NDL) to allocate memory from the NDL Heap with a specific size, flags, and tag. The tag here is an index that we can define ourselves, such as NDL_STRING_TAG which is 0x2. Then, the NDL has other internal and/or external functions which allocate memory. For example, the LPC routines need to allocate PORT_MESSAGEs or other structures, so NDL_COMMUNICATIONS_TAG is used when calling NdlpAllocateMemoryInternal. There is also NdlpAllocateString, which uses NDL_STRING_TAG. Finally, users of the NDL (your application itself) get an API called NdlAllocateMemory. You only provide the size and flags, and internally the NDL will apply NDL_USER_TAG to your allocation.

So far so good.

Now there are two cool things we can do. First, the RtlQueryTagHeap API allows you to obtain statistics on each tag: allocations, frees, and bytes allocated. This can give you a nice memory map of the NDL’s current memory usage. Even better however, by using RtlWalkHeap, the NDL can scan for all active NDL_USER_TAG allocations. This is useful, since when your native application returns, an internal call to NdlUnregisterApplication is made. When this happens, the assumption is made that your code is done executing (unless you’ve registered as a “resident” application), so in order to promote good programming and to catch leaks, RtlWalkHeap is called, and all active heap entries are scanned. If a block with the NDL_USER_TAG tag index is found, a debug message is printed out, saying that a heap entry at 0xFOO of size 0xBAR is leaking. We can then use the User-Mode Stack Trace Database support and the AllocatorBackTraceIndex of the heap entry to give a complete stack trace on where this allocation was made.
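
The per-tag statistics and the leak scan described above can be pictured with a toy model. To be clear, this is not the NDL and it does not use the Rtl heap APIs: it is a small portable sketch built on malloc, with a fixed-size tracking table, and every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tag indices, mirroring the NDL_*_TAG idea from the text. */
enum { TAG_COMMUNICATIONS = 1, TAG_STRING = 2, TAG_USER = 3, TAG_COUNT = 4 };

#define MAX_BLOCKS 64

typedef struct {
    void  *address;   /* NULL when the slot is unused */
    size_t size;
    int    tag;
} tracked_block;

static tracked_block g_blocks[MAX_BLOCKS];
static size_t        g_bytes_in_use[TAG_COUNT];

/* Allocate 'size' bytes and remember which tag the block belongs to. */
void *tagged_alloc(size_t size, int tag)
{
    int i;
    if (tag <= 0 || tag >= TAG_COUNT)
        return NULL;
    for (i = 0; i < MAX_BLOCKS; i++) {
        if (g_blocks[i].address == NULL) {
            void *p = malloc(size);
            if (p == NULL)
                return NULL;
            g_blocks[i].address = p;
            g_blocks[i].size = size;
            g_blocks[i].tag = tag;
            g_bytes_in_use[tag] += size;
            return p;
        }
    }
    return NULL; /* tracking table is full */
}

/* Release a block previously returned by tagged_alloc. */
void tagged_free(void *p)
{
    int i;
    if (p == NULL)
        return;
    for (i = 0; i < MAX_BLOCKS; i++) {
        if (g_blocks[i].address == p) {
            g_bytes_in_use[g_blocks[i].tag] -= g_blocks[i].size;
            g_blocks[i].address = NULL;
            free(p);
            return;
        }
    }
}

/* "Walk the heap": count blocks still live under 'tag' (i.e. leaks). */
size_t count_live_blocks(int tag)
{
    size_t count = 0;
    int i;
    for (i = 0; i < MAX_BLOCKS; i++)
        if (g_blocks[i].address != NULL && g_blocks[i].tag == tag)
            count++;
    return count;
}

/* Per-tag statistic, loosely analogous to what a tag query reports. */
size_t bytes_in_use(int tag)
{
    return (tag > 0 && tag < TAG_COUNT) ? g_bytes_in_use[tag] : 0;
}
```

At unregister time, a non-zero count_live_blocks(TAG_USER) would be the cue to print the leak message; in this toy model the address and size of each offender are sitting in the tracking table.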

So far so good. Or Not.

Turns out I was getting Tag Indices such as 0x8007, 0x8004, etc. It seems that all heap allocations were instead indexed with 0x8000 | CurrentAllocationIndex. This wasn’t helpful at all, so I started analyzing the problem.

The first problem is the way in which heap tags are generated and then saved. To generate a tag, you use the MAKE_HEAP_TAG macro in winnt.h. This macro takes a “Tag base”, which is what RtlCreateTagHeap returns to you, as well as a tag index, which you define yourself, for example 0x2. The operation that’s done is Base | (Index << 18). So for index 2, with a base of 0x40000, this gives us 0xC0000. The problem is that when RtlpUpdateTagEntry runs, the code does the following:

shr ebx, 12h
and ebx, 0FFFF0FFFh

EBX contains the heap flags, which are the actual HEAP_XXX flags ORed with the tag. Suppose we didn’t use any flags, and are just sending our heap tag, 0xC0000. The result of this operation will be 3, not 2, because nothing is done to take into account the heap tag base. However, this bug should cause us to get tag indices that are off by one, not in the 0x8000 range. So more must be going on.

Recall that EBX also contains the typical heap flags. Some heap flags are as small as 0x8, others are bigger such as 0x100, and others yet are as high as 0x40000000. You can start seeing how this can corrupt the extracted index.

To make matters worse, when using a stack trace database, the heap understands that it’s working in “debugging mode”, so it calls a different set of APIs, such as RtlAllocateHeapSlowly and RtlDebugAllocateHeap. The latter ORs in some flags by default, such as Heap->ForceFlags, as well as HEAP_DISABLE_VALIDATION_CHECKS and HEAP_USER_SETTABLE_FLAGS. In my case, the total mask of the flags being ORed in was 0x50100000. Let’s bring in our heap tag, and the total becomes 0x501C0000. Let’s do the broken EBX code again, and the tag index becomes 0x407. Now RtlpUpdateTagEntry will check if 0x407 is above Heap->HighestTagIndex, and since I’ve created far fewer than 1031 tags, it will think this is a “pseudo-tag”. A pseudo-tag is the combination of HEAP_PSEUDO_TAG_MASK and the current allocation index… and you’ve guessed it, that mask is 0x8000.
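
The arithmetic in this walkthrough is easy to reproduce in plain C. The sketch below only models the numbers quoted above; the function names and the allocation_index parameter are hypothetical, and the fallback logic is a simplified reading of the behaviour described, not a copy of RtlpUpdateTagEntry.

```c
#include <stdint.h>

#define TAG_SHIFT        18u          /* shr ebx, 12h */
#define TAG_EXTRACT_MASK 0xFFFF0FFFu  /* and ebx, 0FFFF0FFFh */
#define PSEUDO_TAG_MASK  0x8000u      /* value quoted in the text */

/* Base | (Index << 18), as the tagging macro is described above. */
uint32_t make_heap_tag(uint32_t tag_base, uint32_t index)
{
    return tag_base | (index << TAG_SHIFT);
}

/* The extraction as described: shift, then mask. The tag base and
   any high heap flags are never removed first. */
uint32_t extract_tag_index(uint32_t flags)
{
    return (flags >> TAG_SHIFT) & TAG_EXTRACT_MASK;
}

/* Simplified model of the fallback: an index above the highest known
   tag is treated as a pseudo-tag built from the allocation index. */
uint32_t resolve_tag_index(uint32_t flags,
                           uint32_t highest_tag_index,
                           uint32_t allocation_index)
{
    uint32_t index = extract_tag_index(flags);
    if (index > highest_tag_index)
        return PSEUDO_TAG_MASK | allocation_index;
    return index;
}
```

With a base of 0x40000 and index 2, extraction yields 3; with 0x50100000 ORed in it yields 0x407, and with only a handful of real tags registered that falls through to a pseudo-tag such as 0x8007.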

Thankfully, I was able to find a workaround for the NDL, although not without a small (but not critical) loss of functionality. First, I disabled support for stack backtraces. It makes finding your leak a bit harder, but it’s not the end of the world, since this functionality is provided as a small benefit anyway. Since the stack trace functions are exported by Rtl, I will simply modify NdlAllocateMemory to capture the trace by itself. I can then use RtlSetUserFlagsHeap to associate the backtrace index or another similar device. If I want to get more evil, I can probably also play with the _HEAP_ENTRY structure itself and set the backtrace index myself.

The second “fix” was not to use the MAKE_HEAP_TAG macro at all, and ignore the “Tag base”. This solves the off-by-one problem but won’t work very reliably because it can conflict with actual heap flags.

This problem exists on Windows 2000 and XP. I haven’t checked Windows 2003 or Vista yet, but it’s possible that Vista fixed it after Adrian’s rewrite of the code for higher security.