The n_hdlc bug
The N_HDLC line discipline used a self-made singly linked list for data buffers and had an n_hdlc.tbuf pointer for retransmitting a buffer after an error. It worked, but the commit be10eb75893 added data flushing and introduced racy access to n_hdlc.tbuf.
After tx error concurrent
n_hdlc_send_frames() both use
n_hdlc.tbuf and can put one buffer into tx_free_buf_list twice. That causes an exploitable double-free error in n_hdlc_release(). The data buffers are represented by struct n_hdlc_buf and allocated in the kmalloc-8192 slab cache.
To fix this bug, I used a standard kernel linked list and got rid of the racy
n_hdlc.tbuf: in case of tx error the current
n_hdlc_buf item is put after the head of
I started the investigation when I got a suspicious kernel crash from syzkaller. It is a really great project, which has helped to fix an impressively long list of bugs in the Linux kernel.
This article is the only way for me to publish the exploit code. So, please, be patient and prepare for plenty of listings!
Winning the race
Let’s look at the code of the main loop: we are going to race until we succeed.
loop counter is incremented every iteration, so
tmo2 variables change too. They are used to introduce delays in the racing threads, which:
- synchronize at the
- spin for the specified number of microseconds in a busy loop,
- interact with
Colliding the threads this way helps to hit the race condition sooner.
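The thread-colliding pattern described above can be sketched roughly as follows. This is a minimal illustration, not the exploit's listing: the names race_round(), spin_usec() and racer_main() are hypothetical, the use of a pthread barrier as the synchronization point is an assumption, and the racing action is replaced by a harmless counter increment.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static pthread_barrier_t barrier;   /* assumed common start line for both threads */
static atomic_int actions_done;     /* placeholder for the real racing action */

struct racer { unsigned delay_us; };

/* Busy-wait for roughly `usec` microseconds without sleeping. */
static void spin_usec(unsigned usec)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000L +
             (now.tv_nsec - start.tv_nsec) / 1000L < (long)usec);
}

static void *racer_main(void *arg)
{
    const struct racer *r = arg;

    pthread_barrier_wait(&barrier);       /* both threads leave here together */
    spin_usec(r->delay_us);               /* per-thread lag, varied every round */
    atomic_fetch_add(&actions_done, 1);   /* stand-in for the racing call */
    return NULL;
}

/* Runs one round; returns how many threads performed their action, or -1. */
int race_round(unsigned tmo1, unsigned tmo2)
{
    struct racer r1 = { tmo1 }, r2 = { tmo2 };
    pthread_t t1, t2;

    atomic_store(&actions_done, 0);
    if (pthread_barrier_init(&barrier, NULL, 2) != 0)
        return -1;
    if (pthread_create(&t1, NULL, racer_main, &r1) != 0) {
        pthread_barrier_destroy(&barrier);
        return -1;
    }
    if (pthread_create(&t2, NULL, racer_main, &r2) != 0) {
        pthread_barrier_wait(&barrier);   /* release t1 so it can be joined */
        pthread_join(t1, NULL);
        pthread_barrier_destroy(&barrier);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_barrier_destroy(&barrier);
    return atomic_load(&actions_done);
}
```

Calling race_round() in a loop with slowly changing delays sweeps the relative timing of the two threads, which is the idea behind deriving the delays from the loop counter.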
Here we open a pseudoterminal master and slave pair and set the
N_HDLC line discipline for it. For more information about that, see
Documentation/serial/tty.txt and this great discussion about
N_HDLC ldisc for a serial line causes the
n_hdlc kernel module autoloading. You can get the same effect using
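As a rough sketch of this step, the following opens a pseudoterminal master through /dev/ptmx and asks the kernel to switch it to the N_HDLC line discipline with the TIOCSETD ioctl. The function name open_hdlc_pty() is hypothetical, and whether the original listing does it exactly this way is an assumption.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef N_HDLC
#define N_HDLC 13   /* line discipline number used by the kernel tty headers */
#endif

/* Returns a pty master fd with N_HDLC set, or -1 on any failure. */
int open_hdlc_pty(void)
{
    int ldisc = N_HDLC;
    int ptmd = open("/dev/ptmx", O_RDWR | O_NOCTTY);

    if (ptmd < 0)
        return -1;
    /* Fails if the n_hdlc module is unavailable or cannot be autoloaded. */
    if (ioctl(ptmd, TIOCSETD, &ldisc) < 0) {
        close(ptmd);
        return -1;
    }
    return ptmd;
}
```

A return value of -1 on a given machine simply means that no pseudoterminal could be opened or that the line discipline is not available there.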
Here we suspend the pseudoterminal output (see
man tty_ioctl) and write one data buffer. The
n_hdlc_send_frames() fails to send this buffer and saves its address in
We are ready for the race. Start two threads, which are allowed to run on all available CPU cores:
- thread 1: flush the data with
ioctl(ptmd, TCFLSH, TCIOFLUSH);
- thread 2: start the suspended output with
ioctl(ptmd, TCXONC, TCOON).
In a lucky case, they both put the only written buffer pointed by
Now we return to CPU 0 and trigger the possible double-free error:
We close the pseudoterminal master. n_hdlc_release() goes through the n_hdlc_buf_list items and frees the kernel memory used for data buffers. This is where the possible double-free error happens.
This particular bug is successfully detected by the Kernel Address Sanitizer (KASAN), which reports the use-after-free happening just before the second
The final part of the main loop:
Here we try to exploit the double-free error by overwriting
struct sk_buff. In case of success, we exit from the main loop and run the root shell in the child process using
Exploiting the sk_buff
As I mentioned, the doubly freed n_hdlc_buf item is allocated in the kmalloc-8192 slab cache. To exploit a double-free error in this cache, we need kernel objects that are slightly smaller than 8 kB. In fact, we need two types of such objects:
- one containing some function pointer,
- another one with the controllable payload, which can overwrite that pointer.
Searching for such kernel objects and experimenting with them was not easy and took me some time. In the end, I chose sk_buff with its struct skb_shared_info. This approach is not new – consider reading the cool write-up about CVE-2016-2384.
The network-related buffers in Linux kernel are represented by
struct sk_buff. See these great pictures describing
sk_buff data layout. The most important point for us is that the network data and skb_shared_info are placed in the same kernel memory block pointed to by sk_buff.head. So creating a 7500-byte network packet in userspace makes skb_shared_info end up in the kmalloc-8192 slab cache. Exactly what we want.
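To see why a 7500-byte packet ends up in that cache, here is a simplified model of kmalloc size classes. It is a sketch under assumptions: it rounds up to the next power of two, ignoring the non-power-of-two caches and protocol headers, and it assumes a 320-byte skb_shared_info, which is the difference between 8192 and the 7872 offset quoted in this article. The function name kmalloc_cache_size() is hypothetical.

```c
#include <stddef.h>

/* Simplified model: kmalloc serves a request from the smallest
 * power-of-two cache that fits it (minimum 8 bytes here). */
size_t kmalloc_cache_size(size_t size)
{
    size_t cache = 8;

    while (cache < size)
        cache <<= 1;
    return cache;
}
```

With these assumptions, 7500 bytes of packet data plus 320 bytes of skb_shared_info is 7820 bytes, too large for kmalloc-4096 and therefore served from kmalloc-8192.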
But there is one challenge: n_hdlc_release() frees 13 n_hdlc_buf items straight away. At first I tried to do the heap spray in parallel with n_hdlc_release(), but I didn’t manage to inject the corresponding kmalloc() between the needed kfree() calls. So I used another way: spraying after n_hdlc_release() can give two sk_buff items with the head pointing to the same memory. That’s promising.
So we need to spray hard but keep the 8 kB UDP packets allocated to avoid a mess in the allocator freelist. Socket queues are limited in size, so I’ve created a lot of sockets using
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP):
- one client socket for sending UDP packets,
- one dedicated server socket, which is likely to receive two packets with the same
- 200 server sockets for receiving other packets emitted during heap spray,
- 200 server sockets for receiving the packets emitted during slab exhaustion.
Ok. Now we need another kernel object for overwriting the function pointer in
skb_shared_info.destructor_arg. We can’t use
sk_buff.head for that again, because
skb_shared_info is placed at the same offset in
sk_buff.head and we don’t control it. I was really happy to find that
add_key syscall is able to allocate the controllable data in the
But I became upset when I encountered the key data quotas in /proc/sys/kernel/keys/ owned by root. The default value of /proc/sys/kernel/keys/maxbytes is 20000. It means that only 2 add_key syscalls can concurrently store our 8 kB payload in the kernel memory, and that’s not enough.
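The quota arithmetic can be checked with a small sketch. It is a simplified model that only divides the byte quota by the payload size; the helper names keys_fitting() and read_maxbytes() are hypothetical, and the real kernel accounting may charge additional bytes per key.

```c
#include <stdio.h>

/* How many payloads of `payload_bytes` fit under a `quota_bytes` limit. */
long keys_fitting(long quota_bytes, long payload_bytes)
{
    if (quota_bytes <= 0 || payload_bytes <= 0)
        return 0;
    return quota_bytes / payload_bytes;
}

/* Reads the current per-user byte quota; returns -1 if it cannot be read. */
long read_maxbytes(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/kernel/keys/maxbytes", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For the default quota of 20000 bytes and a payload of about 8192 bytes, keys_fitting() gives 2, matching the limit described above.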
But happiness returned when I came across a bright idea in the slides of Di Shen from Keen Security Lab: I can make the heap spray successful even if
So, let’s look at the
The definitions of struct skb_shared_info and struct ubuf_info are copied into the exploit code from the include/linux/skbuff.h kernel header.
The payload buffer will be passed to
add_key as a parameter, and the data which we put there at
7872 - 18 = 7854 byte offset will exactly overwrite
ubuf_info.callback is called in
SKBTX_DEV_ZEROCOPY flag set to 1. In our case,
the ubuf_info item resides in userspace memory, so dereferencing its pointer in kernel space will be detected by SMAP.
Anyway, now the
callback points to
root_it(), which does the classical
commit_creds(prepare_kernel_cred(0)). However, this shellcode resides in the userspace too, so executing it in the kernelspace will be detected by SMEP. We are going to bypass it soon.
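The payload offset arithmetic quoted earlier in this section can be double-checked with a short sketch. The two input numbers, 7872 and 18, come from the text; the constant and function names are hypothetical, and reading 18 as the size of the header that the kernel places before the key data is an assumption.

```c
#include <stddef.h>

enum {
    SLAB_OBJECT_SIZE = 8192,  /* kmalloc-8192 object */
    SHINFO_OFFSET    = 7872,  /* offset quoted in the article */
    KEY_HEADER_SIZE  = 18     /* assumed header before the key data */
};

/* Offset inside the userspace payload that lands on SHINFO_OFFSET. */
size_t payload_offset(void)
{
    return SHINFO_OFFSET - KEY_HEADER_SIZE;
}

/* Bytes left in the slab object from SHINFO_OFFSET to its end. */
size_t tail_room(void)
{
    return SLAB_OBJECT_SIZE - SHINFO_OFFSET;
}
```

Under these assumptions the payload offset is 7854 bytes, and 320 bytes remain at the end of the object.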
Heap spraying and stabilization
As I mentioned,
n_hdlc_release() frees thirteen
n_hdlc_buf items. Our
exploit_skb() is executed shortly after that. Here we do the actual heap spraying by sending twenty 7500-byte UDP packets. Experiments showed that packets number 12, 13, 14, and 15 are likely to be exploitable, so they are sent to the dedicated server socket.
Now we are going to perform the use-after-free on
- receive 4 network packets on the dedicated server socket one by one,
- execute several add_key syscalls with our payload after receiving each of them.
The exact number of
add_key syscalls giving the best results was found empirically by testing the exploit many times. The example of
If we won the race and the heap spray was lucky, then our shellcode is executed when the poisoned packet is received. After that we can invalidate the keys that were successfully allocated in the kernel memory:
Now we need to prepare the heap for the next round of n_hdlc racing. /proc/slabinfo shows that a kmalloc-8192 slab stores only 4 objects, so a double-free error has a high chance of crashing the allocator. The following trick helps to avoid that and makes the exploit much more stable: send a dozen UDP packets to fill the partially emptied slabs.
As I mentioned, the root_it() shellcode resides in userspace. Executing it in kernel space is detected by SMEP (Supervisor Mode Execution Protection). It is an x86 feature, which is enabled by setting bit 20 of the CR4 register.
There are several approaches to defeat it. For example, Vitaly Nikolenko describes how to switch off SMEP using a stack-pivoting ROP technique. It works great, but I didn’t want to copy it blindly. So I’ve created another, quite funny way to defeat SMEP without ROP. Please inform me if that approach is already known.
In arch/x86/include/asm/special_insns.h I’ve found this function:
It writes its first argument to CR4.
Now let’s look at skb_release_data(), which executes the hijacked callback in Ring 0:
We see that the destructor
uarg address as the first argument. And we control this address in the exploited
So I’ve decided to write the address of
ubuf_info.callback and put
ubuf_info item at the mmap’ed userspace address
0x406e0, which is the correct value of CR4 with disabled SMEP.
In that case SMEP is disabled on one CPU core without any ROP. However, now we need to win the race twice: the first time to disable SMEP, the second time to execute the shellcode. But that is not a problem for this particular exploit, since it is fast and reliable.
So let’s initialize the payload a bit differently:
That SMEP bypass looks witty, but introduces one additional requirement – it needs bit 18 (OSXSAVE) of CR4 set to 1. Otherwise
target_addr becomes 0 and
mmap() fails, since mapping the zero page is not allowed.
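The bit arithmetic behind this requirement can be illustrated with a short sketch. It assumes that the mapping address is obtained by rounding the CR4 value down to a 4 kB page boundary; that rounding, and the helper names cr4_bit_set() and page_base(), are assumptions made for illustration only.

```c
#include <stdint.h>

#define CR4_OSXSAVE_BIT 18
#define CR4_SMEP_BIT    20

/* Returns 1 if the given bit is set in a CR4 value, 0 otherwise. */
int cr4_bit_set(uint64_t cr4, unsigned bit)
{
    return (int)((cr4 >> bit) & 1);
}

/* Rounds an address down to the start of its 4 kB page. */
uint64_t page_base(uint64_t addr)
{
    return addr & ~(uint64_t)0xfff;
}
```

For 0x406e0, bit 20 (SMEP) is clear and bit 18 (OSXSAVE) is set, so the page base is 0x40000. Without bit 18 the value would be 0x6e0, whose page base is 0, the zero page.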
CVE-2017-2636 and writing this article were great fun for me. I want to thank Positive Technologies for giving me the opportunity to work on this research. I would really appreciate feedback.