
From KernelSnitch to Practical msg_msg/pipe_buffer Heap KASLR Leaks

This post presents a practical heap KASLR leak that does not rely on a memory-safety vulnerability. I start from KernelSnitch, a software-only timing side channel that recovers the location of the current task's mm_struct and, from it, the base of the backing mm_cachep slab page. I then combine that leak with same-page-order cross-cache reuse to pivot into exploit-relevant targets such as msg_msg and pipe_buffer. The result is a broadly applicable leak primitive that works across native Linux systems, virtualized KernelCTF setups, restricted Android environments, and both x86_64 and aarch64. Because the attack recovers valid kernel pointers without triggering invalid accesses, it remains exploitable on systems with MTE. More importantly, when the leaked mm_struct pointer is tagged (e.g., on Google Pixels), KernelSnitch can recover its logical tag as well, highlighting its potential as a tag oracle for the leaked object.


In this post, I combine KernelSnitch with cross-cache reuse to leak the location of msg_msg or pipe_buffer objects, both widely used in kernel exploits. In these attacks (and others), knowing exact object locations is valuable, and this is where the attack presented here becomes relevant. I evaluate the approach on nine environments: five native deployments and four virtualized setups.

A key goal of this project is to minimize system-specific dependencies and make this heap KASLR leak as universal as possible. The proof of concept therefore remains largely unchanged across these environments, and I have open-sourced the full source code. Specifically, the attack consists of an initial system-dependency estimation and the actual exploit (i.e., KernelSnitch plus cross-cache reuse), as demonstrated in the screen recordings of four instances shown in Figures 1 to 4. For the evaluation, I observed a high success rate that is suitable for practical use.

Disclaimer: I responsibly disclosed KernelSnitch three times: once before the NDSS publication, once shortly after the Nullcon Berlin talk, and once to Google while developing this broader leak pipeline. After the last disclosure, Google classified the issue as not a security vulnerability. I am publishing this write-up for reporting transparency and to support defensive research. This blog post is provided strictly for educational and defensive-security purposes and is not intended for malicious use. I made my best effort to report these issues responsibly.

TL;DR

Figure 1: Fedora v6.18 execution flow (msg_msg).
Figure 2: KernelCTF v6.12 execution flow (msg_msg).
Figure 3: KernelCTF v6.12 execution flow (pipe_buffer).
Figure 4: Android untrusted_app execution flow (pipe_buffer). Here, the second terminal validates the leaked `pipe_buffer` address, showing the correct `kmalloc-cg-1k` slab owner.

Overview

KernelSnitch: Leaking mm_struct Locations

It all started with a discussion I had with Daniel Gruss in 2023 about operating-system side channels. The key takeaway from that discussion was straightforward: when you zoom out and look at the operating system from a high level, the kernel resembles hardware in many important ways. In both worlds, everything flows through shared resources. In hardware, those shared resources are things like CPU caches and execution units. In the kernel, they show up as shared data structures, locks, and software caches. The kernel keeps optimizing the same fundamentals that hardware does: runtime and memory overhead, with the goal of delivering the best possible performance to end users. Once you adopt that lens, you should expect more operating-system side channels, because every shared resource has the potential to become a timing channel in disguise. In other words, the same fundamentals that make systems fast and scalable can also leak information under the right probing strategy. That perspective stuck with me and, in 2024, set the direction for KernelSnitch. With that in mind, I started by looking at one of the most fundamental building blocks in computer science: the linked list.

Linked Lists in the Linux Kernel

A linked list is a dynamic data structure where each element stores references to neighboring elements. In Linux, they are used in many subsystems to store sets of objects, such as queues and cache-managed objects. Linux standardizes this pattern through list_head (or alternatives like plist_head), its core linked-list primitive.

As most of us learned at some point, traversing a linked list with non-matching comparisons is an O(n) operation. That gives us a simple but powerful timing side-channel primitive: by timing these O(n) traversals, we can estimate how many elements are present. The challenge is turning that primitive into a heap KASLR leak. That is a big leap. We start from a general observation (a longer list takes longer to traverse) and push it towards a security-relevant outcome (recovering a kernel address). The key to achieving this is to go beyond "there is a timing difference". Rather, our goal is to make that timing difference measurable and stable enough to consistently correlate traversal timing with a secret kernel address. Concretely, we need a traversal whose length depends on attacker-controllable, address-dependent kernel state, and a measurement strategy that is robust against system noise.

This naturally leads to the question: where does the kernel use linked-list-style traversals in a way that depends on a kernel heap object's address? A plain linked list usually gives us length, not location. So we need a setting where an address influences which list we traverse and how long that traversal becomes. Linux hash tables provide exactly that bridge.

Hash Tables in the Linux Kernel

A hash table stores key-value entries in buckets chosen by a hash function. In Linux, hash tables are typically implemented as bucket arrays where collisions are handled by linked-list chaining, so traversing a busy bucket is still an O(n) walk. For this attack, the key property is that bucket indices depend on kernel addresses, which creates address-dependent traversal behavior. After digging through candidate kernel hash tables with Bootlin and CodeQL (this was in 2024, so mostly old-school code archaeology), one target stood out: the futex hash table. As you will see later, it has a combination of properties that makes it an especially attractive target.

Hash Table for Fast User Mutexes (Futex)

The futex subsystem uses a hash table to manage futex metadata. A futex (short for fast userspace mutex) is a synchronization primitive where uncontended locking stays in user space, while the kernel only steps in for contended cases such as wait and wake operations. Conceptually, futex waiters are organized through a hash table for efficient lookup by futex key. In this subsystem, futexes use priority linked lists (i.e., plist_head): bucketed lookup with collision chains that require traversal, and therefore expose timing behavior tied to the number of elements in a bucket. For indexing, Linux first derives an internal futex key and then hashes that key to select a bucket. At a high level, the key combines user-provided inputs (e.g., the futex user address) with kernel-side context. For the attack-relevant private futexes, the kernel-side context is the memory-management kernel object, mm_struct.

From a security perspective, this is exactly what makes futex hashing interesting: bucket selection depends partly on attacker-controlled input and partly on a kernel address. Let's take a deeper dive into how the Linux kernel implements this indexing. The next listings are implementation-heavy, so I will focus only on the fields and paths that are exploit-relevant.

struct futex_hash_bucket {
    [...]
    atomic_t waiters;
    struct plist_head chain;
} ____cacheline_aligned_in_smp;

static struct {
    struct futex_hash_bucket *queues;
    unsigned long            hashsize;
} __futex_data __read_mostly __aligned(2*sizeof(long));
#define futex_queues   (__futex_data.queues)
#define futex_hashsize (__futex_data.hashsize)

static int __init futex_init(void)
{
    [...]
    futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());

    futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
                                           futex_hashsize, 0, 0,
                                           &futex_shift, NULL,
                                           futex_hashsize, futex_hashsize);
    [...]
}
Listing 1: The futex hash table.

Listing 1 shows how the futex hash table is stored in memory. Each bucket is a futex_hash_bucket and stores a priority-list chain of waiters. The global futex object then exposes two core parameters: futex_queues (pointer to the bucket array) and futex_hashsize (number of buckets). During futex_init, the kernel computes a power-of-two table size, allocates the bucket array and initializes every bucket. This setup is the foundation that later lets futex_hash() map a key to a concrete bucket index.

union futex_key {
    [...]
    struct {
        union {
            struct mm_struct *mm;
            [...]
        };
        unsigned long address;
        unsigned int offset;
    } private;
    [...]
};

int get_futex_key(u32 __user *uaddr, unsigned int flags, union futex_key *key,
                  enum futex_access rw)
{
    unsigned long address = (unsigned long)uaddr;
    [...]
    // PROCESS_PRIVATE
    key->both.offset = address % PAGE_SIZE;
    key->private.address = address - key->both.offset;
    key->private.mm = current->mm;
    [...]
}

struct futex_hash_bucket *futex_hash(union futex_key *key)
{
    u32 hash = jhash2((u32 *)key, offsetof(typeof(*key), both.offset) / 4,
                      key->both.offset);
    return &futex_queues[hash & (futex_hashsize - 1)];
}
Listing 2: Futex key composition and bucket indexing in futex_hash() (compacted).

Listing 2 shows the full path from futex input to bucket selection. get_futex_key starts from the user-provided futex address (uaddr) and derives page-relative metadata (the in-page offset). In the private-futex path, it combines key->private.address (user-provided) with key->private.mm = current->mm (kernel-internal process context). The resulting key is then passed to futex_hash(), which applies jhash2 and maps the hash into the bucket array. Under the hood, both core futex operations, putting a thread to sleep (wait) and waking it up (wake), use this same get_futex_key() plus futex_hash() pipeline, as you can see in the following sections.

Compatibility note (Linux version 6.16+): Newer kernels introduced a variant for private futex hashing that can avoid using current->mm in the bucket index. However, this path can be generically disabled by setting the private futex hash table to zero slots (unprivileged), which restores the mm-dependent indexing behavior used throughout this post.

Futex Wait and Wake

From user space, both paths are reached via the same syscall interface: syscall(SYS_futex, uaddr, op, ...). When op is FUTEX_WAIT, the kernel enters the wait path (eventually futex_wait). When op is FUTEX_WAKE, it enters the wake path (eventually futex_wake).

int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val, ktime_t *abs_time, u32 bitset)
{
    union futex_key key = FUTEX_KEY_INIT;
    struct futex_hash_bucket *hb;
    int ret;
    [...]
    ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
    [...]
    hb = futex_hash(&key);
    [...]
    futex_wait_queue(hb);
    [...]
    return ret;
}
Listing 3: Using futex wait to put a thread to sleep (compacted).

Listing 3 shows the essential wait-side path in compact form. The kernel first derives a futex key from user input via get_futex_key. It then maps that key to a specific bucket with futex_hash. Finally, futex_wait_queue enqueues the current thread on that bucket and puts it to sleep. For the attack, this is the key sequence: each futex_wait call deterministically places one waiter into a bucket selected by attacker-controlled input (uaddr) plus per-process kernel state (notably mm_struct), and therefore makes subsequent traversals of this bucket longer.

int futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q *this, *next;
    union futex_key key = FUTEX_KEY_INIT;
    int ret;

    ret = get_futex_key(uaddr, flags, &key, FUTEX_READ);
    [...]
    hb = futex_hash(&key);
    [...]
    plist_for_each_entry_safe(this, next, &hb->chain, list) {
        if (futex_match(&this->key, &key)) {
            [...]
            this->wake(this);
            [...]
        }
    }
    [...]
}
Listing 4: Using futex wake to wake one or more threads up (compacted).

For the wake-side flow (see Listing 4), the kernel resolves the key again and traverses the selected bucket to find matching waiters. This is the critical timing measurement point. futex_wake recomputes the key, hashes into a bucket, and then iterates over hb->chain with plist_for_each_entry_safe until matching entries are found. In particular, it leaks an estimate of list occupancy when traversed entries do not match the requested key, because each non-match adds another comparison step before wake-up handling can terminate. By controlling how many colliding futexes end up in the same bucket, an attacker can influence the amount of traversal work and observe timing differences from user space.

In short, both futex operations share the same key and hash pipeline, but they serve different roles for the attack. The wait path helps shape bucket occupancy, while the wake path exposes measurable traversal time that reflects non-matching work in the bucket. Together, they provide the signal path to infer kernel-address-dependent state. At this point, we have covered the relevant kernel internals for KernelSnitch. The rest of this core leakage attack now moves to user space: we will drive these same kernel paths through carefully crafted syscall sequences, shape bucket state with wait operations, and extract timing signal with wake operations.

Shaping the Futex Hash Table with Wait

Recall from the previous section that the futex subsystem has useful exploitation-relevant properties; here is what I meant by that. Listing 3 gives us the key shaping lever: every successful futex_wait inserts one waiter into the bucket chain selected by futex_hash(uaddr, current->mm). To grow that chain, we spawn additional threads and let each thread call futex_wait. This gives us two attacker-relevant properties. First, uaddr is fully controlled from user space, while current->mm remains constant within one process. That makes bucket placement controllable under the mm_struct context. Second, the operation is repeatable with the same arguments. By creating many threads that wait on the same uaddr, we can pile up many waiters in one specific bucket. So bucket occupancy, and therefore traversal work, becomes largely attacker-controlled. This is our pile-up primitive (see Listing 5): controlled thread creation plus repeated futex_wait calls that intentionally grow one bucket chain and amplify traversal work.

unsigned char pileup_futex[PAGE_SIZE];
// Adds one waiter to futex_hash(pileup_futex, current->mm)
static void *__enqueue_one_waiter(void *arg)
{
    size_t uaddr = (size_t)pileup_futex;
    SYSCHK(syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, 0, 0, 0, 0));
    return 0;
}

// Spawns <num_waiters> threads to pile up one futex hash bucket
void pile_up_primitive(size_t num_waiters)
{
    pthread_t tid;
    for (size_t i = 0; i < num_waiters; ++i)
        SYSCHK(pthread_create(&tid, 0, __enqueue_one_waiter, 0));
    sched_yield(); sched_yield();
}
Listing 5: Pile-up primitive implementation.

Correlate Futex Wake Timing with Kernel Addresses

After the pile-up step, one bucket is intentionally crowded while most others stay empty or lightly populated. The key point here is that this crowded bucket is selected by futex_hash(uaddr, current->mm). In that expression, we control uaddr and we know the specific hash function (i.e., jhash2). The only missing piece is current->mm. So the next step is to correlate wake timing with candidate bucket indices until only the correct mm_struct-dependent mapping remains.

Before doing that, let's look at what futex_wake gives us as an attacker primitive. One property is especially helpful: if we call futex_wake with an invalid uaddr, we get a non-destructive probe primitive. That means we can trigger key/hash resolution and bucket traversal without consuming or removing waiters, so the bucket state remains intact across repeated measurements. This lets us repeat measurements on bucket state and separate heavily populated buckets from lightly populated ones by timing.

#define NUM_MEASUREMENTS 128
#define NUM_AVG_SAMPLES (1<<3)
// Comparator for ascending cycle counts
static int __compare(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);  // avoids size_t-to-int truncation
}

// Probes bucket occupancy via FUTEX_WAKE timing on an invalid uaddr
size_t probe_primitive(size_t invalid_uaddr)
{      
    size_t t_start;
    size_t t_end;
    size_t total_cycles = 0;
    static size_t samples[NUM_MEASUREMENTS];

    for (size_t i = 0; i < NUM_MEASUREMENTS; ++i) {
        sched_yield();
        t_start = timer_begin();
        SYSCHK(syscall(SYS_futex, invalid_uaddr, FUTEX_WAKE_PRIVATE, 0, 0, 0, 0));
        t_end = timer_end();
        samples[i] = t_end - t_start;
    }

    // Keep the lowest-latency window to suppress noisy outliers
    qsort(samples, NUM_MEASUREMENTS, sizeof(samples[0]), __compare);
    for (size_t l = 0; l < NUM_AVG_SAMPLES; ++l)
        total_cycles += samples[l];
    return total_cycles / NUM_AVG_SAMPLES;
}
Listing 6: Non-destructive probe primitive implementation.

Listing 6 shows how the non-destructive probe primitive, combined with lightweight signal processing, becomes a robust leakage primitive. Concretely, I repeat the measurement NUM_MEASUREMENTS times, timing each syscall with a timing helper, then sort the samples, keep only the NUM_AVG_SAMPLES lowest-latency ones, and average them to reject unstable outliers.

With this probe primitive, I correlate timing with the mm_struct address in two phases: online and offline. In the online phase (see Listing 7), we use the probe primitive to test many mapped user-space addresses and identify those that collide with the heavily piled-up bucket in the same process. Since we heavily apply the pile-up primitive beforehand, timing for the collided bucket stands out clearly from wake timings of non-colliding buckets. By combining pile-up and non-destructive probing, we get a robust way to detect hash collisions of different user addresses under the same mm_struct context. This makes the approach portable across systems, because the collision signal and filtering strategy are resilient to system noise. In the offline phase (see next section), we take the colliding uaddr set and brute-force candidate mm_struct values until the predicted hash-collision pattern matches the online observations.

#define FUTEX_REGION_SIZE (64ULL << 30)
#define PILEUP_WAITERS 4096
#define TARGET_COLLISIONS 8
#define THRESHOLD_MULTIPLIER 100

unsigned char futex_region[FUTEX_REGION_SIZE];
size_t colliding_futex_addrs[TARGET_COLLISIONS];

void collect_colliding_futex_addrs(void)
{
    size_t found = 1;  // index 0 is the known colliding seed

    // Baseline timing for an empty/light bucket
    size_t baseline_time = MIN(
        probe_primitive((size_t)&futex_region[0]),
        probe_primitive((size_t)&futex_region[PAGE_SIZE + 8])
    );

    // Pile up one bucket to make collisions stand out
    pile_up_primitive(PILEUP_WAITERS);

    // Known colliding address for the piled-up bucket
    colliding_futex_addrs[0] = (size_t)&pileup_futex;

    // Find additional colliding addresses
    for (size_t i = 1; found < TARGET_COLLISIONS; ++i) {
        size_t offset = (i * PAGE_SIZE) | ((i * 8) % PAGE_SIZE);
        if (offset >= FUTEX_REGION_SIZE)
            break;

        size_t candidate_addr = (size_t)&futex_region[offset];
        size_t probe_time = probe_primitive(candidate_addr);

        if (probe_time > baseline_time * THRESHOLD_MULTIPLIER) {
            colliding_futex_addrs[found++] = candidate_addr;
            pr_info("  collision with %016zx\n", candidate_addr);
        }
    }

    assert(found == TARGET_COLLISIONS);
}
Listing 7: Finding hash collisions implementation (online phase).

Brute-force mm_struct

At a high level, the offline search enumerates candidate mm_struct addresses and checks whether futex_hash(uaddr, candidate_mm_struct) reproduces the collision set found in the online phase, using a user-space reimplementation of the kernel's jhash2. The candidate that consistently matches those observed collisions is the recovered current->mm. Figure 5 illustrates this idea: almost all candidates stay near a low background score, while the correct candidate produces a clear, isolated peak.

Figure 5: Collision-match score across candidate mm_struct addresses. The true candidate appears as a dominant peak.

At first glance, brute-forcing the entire kernel virtual-address space for mm_struct seems prohibitively expensive. The turning point is that allocator constraints rule out the vast majority of candidates. In other words, we are not searching arbitrary addresses. Instead, we are searching addresses that satisfy the alignment and placement constraints of the allocator. Kernel code shows that mm_struct objects are allocated from the dedicated slab cache mm_cachep. So before continuing with brute-force reduction, we need a short allocator deep dive: how mm_cachep is backed by pages and how object addresses are aligned.

The exact alignment details are architecture-dependent (we consider x86_64; other architectures behave similarly). The allocation pipeline is layered. Technically, Linux uses two allocators here: the slab allocator and the page allocator. From an exploitation point of view, however, it is more useful to split the page allocator and think in three layers: the slab allocator, the per-CPU page lists (PCP), and the page (buddy) allocator. The slab allocator is the fast path, while allocations fall back to lower layers as needed. In this context, a slab cache (e.g., mm_cachep) is the per-context allocator. A slab is one allocation unit of that cache and slab pages are the backing pages that hold the actual objects. Most kernel heap allocations request object slots from slab caches, which obtain backing pages from PCP or, as a fallback, the buddy allocator. Those physical pages are visible in the direct physical map (physmap), which is a linear kernel mapping of RAM. Alignment constraints then appear at two levels for this attack: slab pages are aligned to allocation order (order-0: 4 KiB, order-1: 8 KiB, etc.) and objects inside the slab pages follow alignment rules (type alignment, rounded object size). Valid mm_struct addresses must therefore satisfy these allocator-induced alignment and placement rules, not arbitrary virtual-address values.

Figure 6: Alignment path for mm_struct from physmap pages to slab-cache objects on x86_64.

Figure 6 summarizes the alignment constraints for mm_struct placement. Listing 8 then shows how these constraints are used in the brute-force loop. On x86_64, physmap is randomized in a range between 0xffff888000000000 and 0xffffc87fffffffff, so we enumerate candidate slab bases in that range. For each slab base, we then test only valid object-slot offsets. Let mm_order = order(mm_cachep) and mm_slab_size = 1 << (12 + mm_order). Then each candidate slab contributes floor(mm_slab_size / sizeof(mm_struct)) possible mm_struct slots. So the brute-force is no longer an unconstrained virtual-address scan: it is an allocator-constrained scan over mm_order-aligned slab bases and valid slot offsets per slab. For instance, on Linux 6.8.0-101-generic, mm_order=3, mm_slab_size=32 KiB, and with sizeof(mm_struct)=1408 this yields 23 slots per slab. This reduces the search space to about 2^35.5 candidates, which makes the offline phase practical.

#define PHYS_MAP_START 0xffff888000000000ULL
#define PHYS_MAP_END 0xffffc88000000000ULL
#define TARGET_COLLISIONS 8
#define MM_ORDER 3
#define MM_SLAB_SIZE (1ULL<<(12+MM_ORDER))
#define MM_STRUCT_SIZE 1408
uint32_t futex_hash(size_t addr, size_t mm);

// Recovers mm_struct by matching observed futex collisions
size_t recover_mm_struct_addr(void)
{
    for (size_t slab_base = PHYS_MAP_START; slab_base < PHYS_MAP_END; slab_base += MM_SLAB_SIZE) {
        size_t slab_end = slab_base + MM_SLAB_SIZE;

        for (size_t mm_addr = slab_base; mm_addr < slab_end; mm_addr += MM_STRUCT_SIZE) {
            size_t all_collide = 1;

            for (size_t i = 1; i < TARGET_COLLISIONS; ++i) {
                if (futex_hash(colliding_futex_addrs[0], mm_addr) !=
                    futex_hash(colliding_futex_addrs[i], mm_addr)) {
                    all_collide = 0;
                    break;
                }
            }

            if (all_collide) {
                pr_success(" found mm_struct %016zx\n", mm_addr);
                return mm_addr;
            }
        }
    }

    return -1;
}
Listing 8: Brute-force phase to recover the mm_struct location (offline phase).

While this reduction is already strong, one caveat remains: the exact sizeof(mm_struct) and mm_order are kernel- and system-dependent and can change with kernel version, build options, and system configuration. So to keep the attack as universal as possible, we model this as an estimation problem instead of hard-coding one size. We therefore analyze which parameters influence mm_struct layout and build an estimator that predicts practical candidate strides and slot counts for the target system. The good news is that mm_struct is allocated with hardware-cache alignment, so we do not need byte-perfect reconstruction of the layout. In practice, accuracy on the order of one cache line is sufficient to estimate sizeof(mm_struct) and, hence, to run the brute-force phase.

To recap, KernelSnitch combines four pieces into one reliable leak pipeline: futex hash-table shaping via thread creation plus futex_wait, non-destructive futex_wake timing to correlate bucket behavior with mm_struct-dependent hashing, allocator-based search-space reduction, and an offline brute-force over the remaining candidates. In our setting, this yields a practical near-100% success rate for recovering mm_struct: we only accept a result when the collision-match score is exactly 1, and with a sufficiently large set of observed collisions, such a perfect match is overwhelmingly likely to occur only at the true mm_struct location.

At this point, I can reliably recover the location of mm_struct, but that alone is usually not interesting for privilege escalation. A natural next step would be to free the leaked mm_struct and reallocate the same slot with a more relevant target object. However, Linux's heap segregation blocks this straightforward pivot. The kernel does not treat heap memory as one undifferentiated pool. Instead, allocations are separated by object type, size class, and cache-specific rules, which strongly limits naive reuse assumptions.

Cross-Cache Reuse to Same-Page-Order Targets

So when this naive reuse approach does not work, we need a different strategy: cross-cache reuse. For that, we first need to understand how heap segregation works and how memory is reused across allocator layers.

Heap Segregation and Memory Hierarchy

In Linux, heap allocations are served by slab caches, where each cache manages one object layout or one size class. Some objects use dedicated caches (e.g., mm_struct via mm_cachep), while many generic allocations use size-based caches (e.g., kmalloc-* caches). A slab page belongs to exactly one cache at a time and is split into object slots for that cache only. This is why leaking one object and then directly reallocating a different object type in the same slot usually fails. The relevant pivot is therefore page lifecycle: once a slab page becomes fully free, it can leave one cache and later be reassigned to another.

Crucially, in this post, I consistently use the term slab page (singular), even when the slab is higher-order and backed by multiple contiguous base pages.

The Linux kernel memory hierarchy relevant for this attack is: heap objects (typed objects such as mm_struct or msg_msg), slabs (cache-backed pages with object slots), PCP (per-CPU page lists shaping short-term reuse), and buddy (the fallback page allocator). In practice, cross-cache reuse starts once an empty slab page leaves its cache, and then follows one of two reuse paths: First, the page is reclaimed directly from PCP for another allocation, which is constrained to matching page order. Second, allocator pressure can push the page from PCP back to buddy, after which buddy may reissue it to a different allocation context. If a different order is needed, buddy must change the order by either splitting a higher-order block or merging adjacent lower-order buddy pages. For reliability in this part of the attack, I deliberately avoid touching buddy and focus on same-page-order cross-cache reuse through the PCP. This keeps allocator behavior more stable and makes the reuse stages easier to control.

Figure 7: Same-page-order cross-cache reuse: free the mm_struct slab (i.e., order 3), return it to PCP, then reclaim it at the same page order in the target context (i.e., msg_msg).

To recap, the goal is to free the slab page from the dedicated mm_cachep and reclaim the memory chunk for a more relevant same-page-order target. The required order is system-dependent (mm_order = order(mm_cachep), e.g., 3 on some native systems and 2 on KernelCTF LTS instances). Figure 7 visualizes this transition in two phases: ① free enough mm_struct objects until the slab becomes empty and the slab is returned to PCP, and ② trigger target allocations so the same slab is reclaimed by the msg_msg cache.

As discussed in KernelSnitch: Leaking mm_struct Locations, we first leak the location of one mm_struct. From that single leak, we now derive the base address of its slab page using the known slab alignment (e.g., 8 pages in this setting). We then perform cross-cache reuse so that this same slab page is reclaimed in the target context. This yields a msg_msg slab and candidate msg_msg slots within those 8 pages.

SLUB Internals

As you might expect, it is not enough to free a few objects and hope the slab page returns to PCP. To do this step correctly and as universally as possible, we need a view of slab internals. Specifically, we focus on SLUB, the allocator used on modern Linux systems.

Figure 8: Internal SLUB lists relevant for returning a slab page to PCP before reclaim.

Figure 8 shows the SLUB internals that matter most for this transition. The key point is that every slab cache, whether dedicated (e.g., mm_cachep) or generic (e.g., kmalloc-*), is represented by one kmem_cache. That kmem_cache maintains per-CPU allocation state and slab-level metadata. Freelists are the core mechanism and hold free object slots in linked-list form. The active c->freelist drives fast-path object allocation/free on the current CPU, while slab->freelist and partial->freelist track reusable objects when slabs transition into these states. Partial slabs are slabs that still have live objects, so they remain tracked and reusable inside the same cache. If you are interested in the in-depth details of allocator internals and cross-cache mechanics, here are some useful references: CVE-2022-29582 io_uring write-up, Jann Horn's write-up, SLUBStick, PCPLost, and CROSS-X. I continue here with the practical deployment of cross-cache reuse.

In general, the strategy is to allocate many objects, leak one slab base address, and then free objects, including the leaked slab page, until that page is returned. A key objective is to minimize the number of deallocations so runtime stays short and background activity has less chance to disrupt the reuse path.

Returning the Leaked Slab Page to PCP

The naive approach would be to free many objects until several slab pages become empty and then hope that reuse eventually hits the right one. However, this takes longer and does not guarantee success. Instead, our goal is to shape slab-cache state so a minimal number of deallocations is enough to free exactly the leaked slab page. In practice, this is done by freeing the leaked slab and shaping the partial state to trigger reuse. After some experimentation, I found that the following is the most practical workflow across different kernel versions and systems, where mm_sz and mm_order are determined using the mm_struct estimator:

static size_t mm_sz;
static size_t mm_order;

#define MM_OBJS_PER_SLAB ((1<<(12+mm_order))/mm_sz)
#define MM_PARTIALS 10
#define MM_PREPARE (128*MM_OBJS_PER_SLAB)
#define MM_SPRAY ((MM_PARTIALS+1)*MM_OBJS_PER_SLAB)
#define MM_BEFORE (MM_OBJS_PER_SLAB-1)
#define MM_AFTER (MM_OBJS_PER_SLAB)

static pid_t pids_prep[MM_PREPARE];
static pid_t pids_spray[MM_SPRAY];
static pid_t pids_slab_before[MM_BEFORE];
static pid_t pid_leaked;
static pid_t pids_slab_after[MM_AFTER];

void return_leaked_page_to_pcp(void)
{
    // Allocation phase
    for (size_t i = 0; i < MM_PREPARE; ++i)
        pids_prep[i] = alloc_mm_struct();
    for (size_t i = 0; i < MM_SPRAY; ++i)
        pids_spray[i] = alloc_mm_struct();
    for (size_t i = 0; i < MM_BEFORE; ++i)
        pids_slab_before[i] = alloc_mm_struct();
    pid_leaked = alloc_mm_struct();
    for (size_t i = 0; i < MM_AFTER; ++i)
        pids_slab_after[i] = alloc_mm_struct();

    // Leak phase
    size_t mm = leak_mm_struct(pid_leaked);
    size_t mm_base_addr = (mm & ~(((size_t)1<<(12+mm_order))-1));

    // Return-to-PCP phase
    for (size_t i = 0; i < MM_SPRAY/2; i+=MM_OBJS_PER_SLAB)
        free_mm_struct(pids_spray[i]);
    for (size_t i = 0; i < MM_BEFORE; ++i)
        free_mm_struct(pids_slab_before[i]);
    for (size_t i = 0; i < MM_AFTER-1; ++i)
        free_mm_struct(pids_slab_after[i]);
    for (size_t i = MM_SPRAY/2; i < MM_SPRAY; i+=MM_OBJS_PER_SLAB)
        free_mm_struct(pids_spray[i]);
    free_mm_struct(pid_leaked);
}
Listing 9: Returning the leaked page to the PCP from user space.

Listing 9 is organized into three phases.

I found two practical ways to allocate a fresh mm_struct: fork and exec. Freeing typically happens in exit. These paths allocate and free many additional resources during process creation and destruction, which can introduce noise and instability if not handled carefully. On the allocation side, this is manageable because mm_struct uses a dedicated cache that is largely isolated from unrelated object traffic, so for spraying I use a lightweight clone-based path (shown in Listing 10), which works reliably. For the return-to-PCP phase, however, minimizing side allocations and deallocations is crucial: extra allocations may steal a freshly returned page, while extra frees may over-pressure PCP and drain pages back to buddy at the wrong time. So we need a cleaner, more controlled way to free mm_struct.

Looking through the Linux code, the key mechanism is that mm_struct lifetime is governed by a reference counter. A reference counter is an integer that tracks how many active users still hold a reference to an object. As long as this value is non-zero, the object must remain alive. Each release decrements the counter, and only when it reaches zero can the object be finalized and freed. This gives us control over the mm_struct's lifetime: by controlling where references are released, we can trigger mm_struct frees with less allocator noise than on process destruction.

// Allocates an mm_struct and initializes its reference counter to 1
static pid_t __clone(void)
{
    pid_t child = SYSCHK(syscall(SYS_clone, SIGCHLD, NULL, NULL, NULL, 0));
    if (child == 0) {
        pin_to_core(CORE);
        while (1)
            pause();
        exit(0);
    }
    return child;
}

// Opens /proc/<pid>/mem and takes an additional mm_struct reference
static int __memfd(pid_t child)
{
    char path[128];
    memset(path, 0, sizeof(path));
    snprintf(path, sizeof(path), "/proc/%d/mem", child);
    return SYSCHK(open(path, O_RDONLY));
}

// Holds two lifetime references to mm_struct (task + /proc/<pid>/mem fd)
struct mm_ctx {
    pid_t child;
    int fd;
};
void alloc_mm_struct(struct mm_ctx *ctx)
{
    ctx->child = __clone();
    ctx->fd = __memfd(ctx->child);
}
Listing 10: mm_struct allocation triggered from user space.

After inspecting the kernel path, I found that opening "/proc/<pid>/mem" in proc_mem_open increments the mm_struct lifetime reference. So the user-space allocation path is illustrated in Listing 10: clone a child and then open the child's mem file. On all evaluated systems, this is permitted for unprivileged users. This also holds in SELinux-constrained environments: I verified it on both a rooted and a non-rooted Google Pixel with SELinux in enforcing mode.

// Destroys the process, releases most resources, and decreases one mm_struct reference
static void __kill_child_process(pid_t child)
{
    SYSCHK(kill(child, SIGKILL));
    SYSCHK(waitpid(child, NULL, 0));
}
void free_mm_struct_pre(struct mm_ctx *ctx)
{
    __kill_child_process(ctx->child);
}

// Drops the remaining mm_struct reference and frees it with significantly less allocator noise
void free_mm_struct_final(struct mm_ctx *ctx)
{
    SYSCHK(close(ctx->fd));
}
Listing 11: mm_struct freeing triggered from user space.

The free path is split into two steps (see Listing 11). First, I kill the child process via SIGKILL and wait for termination with waitpid, which drops the process-owned reference. However, after opening "/proc/<pid>/mem", that alone does not free the mm_struct. Only when I close the corresponding file descriptor later does the reference counter reach zero, and the mm_struct is released with significantly less allocator noise than full process teardown.

After adapting the free strategy for mm_struct, Listing 9 also needs to be adjusted. After the allocation phase, I wait until all cloned processes are running and then kill them via free_mm_struct_pre. This way, during the return-to-PCP phase, I only need to call free_mm_struct_final to free mm_struct. One additional change is required to make this work. The leak_mm_struct function only leaks the mm_struct of the current process, but here I need the mm_struct of pid_leaked. To achieve that, I use a dedicated child process that leaks its own mm_struct and shares the result with the main process via shared memory.

Reclaiming as Target Objects

At this stage of the attack, I know the leaked slab page has already been returned to PCP. From there, same-page-order cross-cache reclaim follows two steps. First, prior to the return-to-PCP, in a preparation phase, I allocate many target objects to drain freelists and partial lists in the destination cache. Second, in the reclaim phase, I allocate targets so allocation falls back to PCP. Because PCP reuse is LIFO-like on the same CPU in this scenario, the recently returned leaked page is often reclaimed quickly.

Across all evaluated systems, I observe that allocation sizes from 513 bytes up to 8 KiB use slab caches with the same slab-page order as mm_cachep. Examples are the generic slab caches kmalloc-1k through kmalloc-8k. This matters for same-page-order reuse because it provides a broad set of reclaim candidates beyond a single object type.

typedef struct {
    long mtype;
    char mtext[1];
} msg_t;

void alloc_msg_msg_4k(void)
{
    static char buffer[4096] = {0};
    msg_t *message = (msg_t *)buffer;
    message->mtype = 0x41;

    // prepares the message queue qid
    int qid = SYSCHK(msgget(IPC_PRIVATE, 0666 | IPC_CREAT));
    // allocates the msg_msg from a 4k slab cache
    SYSCHK(msgsnd(qid, message, 4096-48, 0));
}
Listing 12: Allocate msg_msg-4k, which has the same slab-page order as mm_struct.

Listing 12 shows the user-space allocation primitive for msg_msg-4k. I first create a message queue with msgget, then send a carefully sized message with msgsnd so the kernel allocates the corresponding msg_msg object from the 4 KiB target cache. In reuse terms, this is the reclaim trigger: after returning the leaked mm_struct slab page to PCP, repeated calls to this primitive steer the allocator to reuse that page in the msg_msg context.

Listing 12 is illustrative. In the proof of concept, I split preparation and actual allocation to reduce potential allocator noise.

void alloc_pipe_buffer_1k(void)
{
    static int pipe_fd[2];
    // 16 slots will be allocated from kmalloc-cg-1k
    const size_t nr_slots = FLOOR_POW_OF_2(1024/40);

    // prepares the pipe file descriptor
    SYSCHK(pipe(pipe_fd));
    // allocates the pipe_buffer from kmalloc-cg-1k
    SYSCHK(fcntl(pipe_fd[0], F_SETPIPE_SZ, nr_slots << 12));
}
Listing 13: Allocate pipe_buffer-1k, which has the same slab-page order as mm_struct.

Listing 13 shows the alternative reclaim primitive for pipe_buffer-1k. The function creates a pipe and then sets its capacity so the kernel allocates the corresponding pipe_buffer structures from a 1 KiB generic cache in this setup.

In contrast to msg_msg allocation, resizing pipes also deallocates old pipe_buffer objects, which introduces additional allocator noise. This extra deallocation activity can disrupt the shaped slab/PCP state and can cause reclaim failures. To avoid that, I add an intermediate step. After returning the leaked mm_struct slab page to PCP, I first reclaim it as a data page with a low-noise slab-to-page transition. After that reclaim succeeds, I reshape PCP and perform the final page-to-slab reuse into pipe_buffer.

static char __buf[65536];
static struct iovec __iov_2_order = { .iov_base = __buf, .iov_len = 20000 };
static struct msghdr __m_2_order = { .msg_iov = &__iov_2_order, .msg_iovlen = 1 };
static struct iovec __iov_3_order = { .iov_base = __buf, .iov_len = 65536 };
static struct msghdr __m_3_order = { .msg_iov = &__iov_3_order, .msg_iovlen = 1 };
void alloc_free_data_page(void)
{
    int sv[2];
    // prepare the sk_buff for the data page allocation
    SYSCHK(socketpair(AF_UNIX, SOCK_STREAM, 0, sv));
    // order-2 data page allocation via sk_buff->frag
    SYSCHK(sendmsg(sv[0], &__m_2_order, 0));
    // free sk_buff->frag data page
    SYSCHK(close(sv[0]));
    SYSCHK(close(sv[1]));
}
Listing 14: Allocating and deallocating order-2 and order-3 data pages via sk_buff->frag.

Listing 14 shows a low-noise page-allocation primitive based on sk_buff->frag. The core idea is to create a local socket pair and use sendmsg so the kernel backs payload transfer with fragment pages. The payload size controls the requested page order (e.g., __m_2_order for order-2 and __m_3_order for order-3), and closing the sockets frees those pages again.

Evaluation

In this section, I first evaluate the estimator for the target-specific KernelSnitch parameters across all evaluated environments. Then, I evaluate KernelSnitch on its own and observe a near-100% success rate for leaking mm_struct with the same core proof of concept across targets. Finally, I combine KernelSnitch with same-page-order cross-cache reuse and evaluate end-to-end reliability and applicability. For each environment, I run 100 trials across 5 separate reboots, yielding 500 runs per environment.

Everything up to this point belongs to the unprivileged attack path: futex-based leakage, allocator shaping, return of the leaked slab page to PCP, reclamation as msg_msg or pipe_buffer, and final address recovery by brute force. The privileged interfaces used here (a custom kernel module and /sys/kernel/slab) serve only for validation and allocator introspection during the evaluation. They make the write-up easier to verify and understand, but they are not prerequisites for the attack. The kernel module provides ground-truth leakage and page diagnostics through four commands: LEAK_MM_STRUCT, LEAK_MSG_MSG, LEAK_PIPE_BUFFER, and PAGE_DIAG. The first three return direct object addresses for validation, while PAGE_DIAG reports page-level ownership and metadata. Concretely, the diagnostics distinguish slab/PCP/buddy/data-page ownership and include relevant fields such as slab-cache name, CPU, page order, and reference counter, depending on the current owner.

Estimator of mm_struct

Our attack depends on two target-specific allocator parameters: sizeof(mm_struct) and the slab-page order used by mm_cachep. Both depend on software and hardware characteristics. On the software side, the main factors are kernel configuration and build options. On the hardware side, allocator behavior depends in part on the number of physical CPU cores. If either parameter is wrong, the brute-force phase fails to recover mm_struct. In practice, this yields no valid candidate rather than a stable but incorrect one. A fallback is therefore to test multiple parameter combinations directly in brute force, but that increases runtime notably.

Instead, I implement an estimator that takes two inputs: the target kernel configuration (or a relevant subset) and the number of physical CPU cores. From these inputs, it estimates both outputs required by the attack: sizeof(mm_struct) and order(mm_cachep). In practice, this keeps the pipeline portable across systems without manual per-target retuning.

Environment                    Kernel                 Physical Cores   sizeof(mm_struct)   order(mm_cachep)
Ubuntu (ThinkPad P14s Gen 4)   6.8.0-101-generic      12               1408                3
Ubuntu (ThinkPad P14s Gen 4)   6.5.0-14-generic       12               1408                3
Fedora (ThinkPad P14s Gen 5)   6.18.13-200.fc43       16               1536                3
Debian (Raspberry Pi 3)        6.12.47+rpt-rpi-v8     4                1216                3
Android (Google Pixel 8a)      6.1.145-android14-11   9                1024                3
Buildroot                      6.12.66                8                1344                3
KernelCTF                      lts-6.1.81             2                1088                2
KernelCTF                      lts-6.6.98             2                1344                2
KernelCTF                      lts-6.12.77            2                1344                2
Table 1: Estimated sizeof(mm_struct) and order(mm_cachep) across evaluated targets.

Table 1 summarizes the estimator outputs for each target environment, i.e., ./estimate_mm_struct.py config nr_cpus. The results match the ground-truth values, which can be obtained from /sys/kernel/slab/mm_struct/slab_size for sizeof(mm_struct) and /sys/kernel/slab/mm_struct/order for order(mm_cachep).

KernelSnitch-Only Results

Before combining with cross-cache reuse, I evaluate KernelSnitch on its own by repeatedly running the mm_struct leakage phase on every environment listed in Table 2. I use the same core proof of concept code, with architecture-dependent timing and physmap parameters fixed at compile time for x86_64 and aarch64. Across these repeated runs, KernelSnitch recovers the target mm_struct location every time, validated with the privileged kernel module. Although this is 100% in the observed experiments, I refer to it as near-100% to acknowledge a negligibly small residual failure probability.

Combined Evaluation: KernelSnitch + Cross-Cache Reuse

For the combined evaluation, I again use the same core proof-of-concept code. The reclaim target is pipe_buffer on Android and msg_msg on the other environments (results are consistent when using pipe_buffer as well). Architecture-dependent parameters are handled exactly as in the KernelSnitch-only evaluation. The most reliable combined sequence is as follows (using msg_msg as the illustrative example).

For additional educational context, I also use privileged data from /sys/kernel/slab during evaluation, specifically slabs, total_objects, slabs_cpu_partial, and partial. This is not required for the attack itself, but it clearly illustrates allocator behavior. If you want to reproduce and understand how pages move across allocator layers, I recommend running the code in a virtual machine with both kernel-module diagnostics and /sys/kernel/slab signals enabled.

Environment                    Kernel                 Setup                Target Cache    R1   R2   R3   R4   R5   Overall
Ubuntu (ThinkPad P14s Gen 4)   6.8.0-101-generic      native x86_64        kmalloc-cg-4k   100  99   99   99   99   99.2
Ubuntu (ThinkPad P14s Gen 4)   6.5.0-14-generic       native x86_64        kmalloc-cg-4k   100  99   99   100  98   99.2
Fedora (ThinkPad P14s Gen 5)   6.18.13-200.fc43       native x86_64        msg_msg-4k      100  100  100  100  100  100
Debian (Raspberry Pi 3)        6.12.47+rpt-rpi-v8     native aarch64       kmalloc-4k      99   100  97   96   99   98.2
Android (Google Pixel 8a)      6.1.145-android14-11   native aarch64       kmalloc-cg-1k   75   65   88   76   77   76.2
Buildroot                      6.12.66                virtualized x86_64   msg_msg-4k      100  99   100  100  100  99.8
KernelCTF                      lts-6.1.81             virtualized x86_64   kmalloc-cg-1k   100  97   96   98   99   98
KernelCTF                      lts-6.6.98             virtualized x86_64   kmalloc-cg-1k   99   100  94   99   96   97.6
KernelCTF                      lts-6.12.77            virtualized x86_64   msg_msg-1k      100  100  99   100  100  99.8
Table 2: Observed reliability (successes per 100 trials for reboots R1-R5) across evaluated environments.

Table 2 shows stable performance across all nine environments and across reboot groups. This indicates that the proof of concept is not tied to a single kernel build or machine profile. The consistency across native and virtualized targets supports the claim that the approach transfers well to different real-world configurations.

Crucially, the target slab cache varies across systems because allocator configuration and order(mm_cachep) differ by target. For 6.8.0-101-generic, with CONFIG_MEMCG and order(mm_cachep)=3, the target cache is the segregated generic cache kmalloc-cg-4k. For 6.18.13-200.fc43, with CONFIG_SLAB_BUCKETS enabled and order(mm_cachep)=3, the target cache is the bucketed msg_msg-4k. On KernelCTF instances where order(mm_cachep)=2, the corresponding order-2 targets are used, for example kmalloc-cg-1k (lts-6.1.81) and msg_msg-1k (lts-6.12.77).

For evaluation and reproduction, the proof-of-concept exploit entry points are open source. For practical experiments, Listings 15 and 16 show the command flows, while Figures 1 to 4 show video demonstrations of representative runs. Concretely, I cover three Linux flows (KernelSnitch-only, KernelSnitch + msg_msg reclaim, and KernelSnitch + pipe_buffer reclaim) and two Android flows (native shell execution and untrusted_app APK execution). Note that I generated the Android results using the native shell flow and used the APK flow only for cross-validation on a smaller subset (due to manual setup). As a result, APK-based runs may have lower reliability; however, as discussed below, this level of reliability remains practically useful. For the KernelCTF evaluation, I slightly changed the nsjail.cfg to allow access to /dev/lkm.

ubuntu:/tmp/kernelsnitch$ ./exploits/scripts/estimate_mm_struct.py exploits/configs/config-6.8.0-101-generic 12
1408 3
ubuntu:/tmp/kernelsnitch$ ./exploits/kernelsnitch.x86.elf 1408 3 # KernelSnitch only
...
ubuntu:/tmp/kernelsnitch$ ./exploits/poc.x86.elf 1408 3 # KernelSnitch + msg_msg reclaim
...
ubuntu:/tmp/kernelsnitch$ ./exploits/pipe-poc.x86.elf 1408 3 # KernelSnitch + pipe_buffer reclaim
...
Listing 15: Linux execution flow (x86_64).
# Variant 1 (native shell flow)
ubuntu:/tmp/kernelsnitch$ ./exploits/scripts/estimate_mm_struct.py exploits/configs/config-6.1.145-android14 9
1024 3
akita:/data/local/tmp$ # Disable SELinux only for controlled evaluation and get access to /dev/lkm
akita:/data/local/tmp$ ./kernelsnitch.arm.elf 1024 3 # KernelSnitch only
...
akita:/data/local/tmp$ ./google-poc.arm.elf 1024 3 # KernelSnitch + pipe_buffer reclaim
...

# Variant 2 (untrusted_app flow)
ubuntu:/tmp/kernelsnitch$ ./exploits/android-google-poc/build_apk.sh # Build APK in untrusted_app context
...
ubuntu:/tmp/kernelsnitch$ ./exploits/android-google-poc/run_app.sh # Run APK in untrusted_app context
...
akita:/data/local/tmp$ ./page-diag.arm.elf <kaddr> # Validate as root
...
Listing 16: Android execution flows (native shell and untrusted_app).

Results on Android

Android is a harder setting than the other Linux environments. In practice, Android kernel exploitation is more difficult because the environment is more restrictive and background activity is higher. In this setup, msg_msg is not a viable reclaim target because SELinux, Android's mandatory access control system, blocks the necessary interfaces, so I use pipe_buffer instead. However, reusing the same (even slightly optimized) proof of concept reduces reliability to below 80%. The main reason is the combination of higher allocator pressure and smaller available PCP capacity on Android. Even the low-noise free path via closing the child's mem file can still overpressure PCP. Once that happens, the leaked page is likely drained into buddy and becomes harder to reclaim in the intended target context.

Before continuing, it helps to understand this PCP refill and drain behavior. When PCP occupancy approaches its upper bound, pages are drained back to buddy. When PCP occupancy drops close to empty, pages are refilled from buddy. Across the Linux systems in Table 2 except Android, effective PCP capacity varies with load and often spans from a few hundred to a few thousand pages. By contrast, capacity is tighter on Android, with a fixed capacity of around 500 pages, and the refill/drain behavior is more aggressive (at least from what I have observed).

In practice, this behavior explains most failure cases. During the critical phase of the cross-cache reuse, PCP pressure can become too high and trigger bulk refill/drain events, which makes reclaiming much less predictable. I analyzed this with multiple diagnostics to understand the gap between Android and the other environments. First, I used tracefs to observe when PCP falls back to buddy (which I recommend trying out). Then, I used /proc/buddyinfo and /proc/zoneinfo to track free blocks per buddy order and current PCP occupancy. In the other Linux environments from Table 2, these files are accessible to unprivileged users and provide strong side-channel signals for refill/drain detection, but on Android, SELinux blocks direct access. A possible workaround for /proc/buddyinfo is a producer/consumer design: a privileged tracing process reads /proc/buddyinfo, publishes the relevant counters, and the untrusted side consumes those updates without accessing the file directly. While this works in some security contexts, it also makes the proof of concept notably more complex, and the sampling period is limited to a few milliseconds.

Bearing in mind the main cause of failure, a reliability of 70-80% can still be practically useful. The key point is that in failure cases the page is typically still owned by PCP or buddy, rather than by an unrelated live kernel object. So an attacker can proceed under the reclaim assumption and test follow-up exploitation steps on the candidate pipe_buffer page, e.g., a controlled write. If reclaim did not happen, writes usually land on a page that is currently not used, so the write is effectively ignored.

MTE Implications

ARM Memory Tagging Extension (MTE) adds a lightweight hardware memory-safety mechanism by enforcing tag checks on memory accesses. As a result, it mitigates many type-confusion exploitation paths (e.g., in slab use-after-free and slab out-of-bounds scenarios). Concretely, each 16-byte memory granule has a 4-bit allocation tag, and each pointer carries a 4-bit logical tag in its top byte. On tag-checked loads and stores, the CPU compares the pointer's logical tag with the granule's allocation tag and raises a fault on mismatch. The pointer tag is encoded in the address top byte, while allocation tags are stored in dedicated hardware tag storage.

While MTE can detect tag-mismatched dereferences, it does not stop side-channel observations or pointer leakage that recover valid kernel addresses without ever triggering an invalid tagged access. The attack presented here relies on exactly that property, so it remains effective on MTE-enabled systems.

But KernelSnitch goes one step further. It leaks full kernel pointer values, and for tagged pointers this includes the top-byte logical tag. In practice, this means that in the brute-force phase of Listing 8, I additionally test the 16 possible tag values to leak the correct tag value. While this is a strong primitive, it should be interpreted carefully: on its own, it is a tag oracle for the leaked object (here, mm_struct) and not for the reclaimed object such as pipe_buffer. However, the broader idea can be extended, as also noted by Jann Horn, because similar pointer-leak patterns may apply to other kernel objects as well.

To verify these claims, I perform the evaluations in Table 2 on the Google Pixel 8a with kernel MTE enabled, leaking the address as well as its correct tag value.

Runtime Results

There are three phases with different effects on runtime: collision finding, candidate brute force, and cross-cache reuse. Collision finding and cross-cache reuse are the faster phases and usually complete within a few seconds. The runtime of the brute-force phase varies significantly across kernels and systems (as seen in Figures 1 to 4). One major factor is whether the kernel enables physical base randomization. For instance, on Ubuntu, Fedora, and KernelCTF, the physmap base is randomized, illustrated by the page_offset_base offset in Figure 6. As a result, the brute-force phase must search a much larger candidate space, and runtime depends in part on where the actual base falls within that range. In contrast, systems such as the Raspberry Pi 3 and Android do not use this kind of physical base randomization, as already shown by Seth Jenkins for Google Pixels. There, the brute-force phase only needs to recover the mm_struct position within a physmap with a known base address, which is significantly faster. A second major factor is how much parallelism the system allows. For example, the ThinkPad P14s Gen 5 provides 20 usable cores (see Figure 1), so the brute-force phase can split the search across 20 threads. A KernelCTF instance (see Figures 2 and 3), in contrast, only provides 2 cores, which limits concurrency and increases runtime. Overall, the dominant runtime cost is the brute-force phase, while the surrounding setup and reclaim stages are comparatively small and stable.

As a reference point, even on KernelCTF instances virtualized with KVM on my Ubuntu laptop, the full leak completes in under 5 minutes. On native Fedora, it typically completes in under 1 minute.

Conclusion

In this post, I showed how to combine KernelSnitch with same-page-order cross-cache reuse to build a practical heap KASLR leak without relying on a memory-safety vulnerability. Starting from a timing-based mm_struct leak, I demonstrated how to return the leaked mm_cachep slab page (order mm_order) to PCP and reclaim it in a more relevant target context. This reclaim targets msg_msg and pipe_buffer, two widely used objects for kernel exploitation. The evaluation confirms that this approach remains robust across kernels, devices, and deployment models. The key takeaway is that side-channel signal quality and allocator-state massaging together can yield this msg_msg/pipe_buffer leak primitive. Moreover, since I did not rely on a memory-safety vulnerability, defenses such as MTE cannot mitigate this kind of leakage attack.

Here, I showed how to generically leak mm_cachep slab pages across different mm_order values, including all candidate slots when the reclaimed page is used as a slab page. While this is useful for kernel exploitation, some targets require leaking different locations. In those cases, the target locations may not lie on the same page order as mm_cachep. This means the buddy allocator must be considered. One option is cross-page-order cross-cache reuse, where allocator state is massaged so pages returned from PCP are split or merged in a favorable way. Because the buddy allocator is globally shared, additional stabilization techniques are usually needed to make this reliable (such as /proc files if available). A second option is to avoid direct cross-cache reuse and instead leverage favorable neighbor placement between target objects and mm_struct. However, similar to cross-page-order reuse, this strategy also depends on buddy-state massaging.

If you find this work interesting, have questions, or notice technical inaccuracies, feel free to contact me at lukas.maar@tugraz.at.

While I developed the exploit itself, parts of the build scripts and test environments were created with AI assistance (notably Codex), and the text was edited with AI tools. Thanks also to Ernesto García, Lorenz Schumm, and Mathias Oberhuber for helping improve the readability of this post.

Useful References