Shellcode is an exercise in trade-offs.
To be really flexible and fit in the most exploits, shellcode must be small. On the other side of the scale, there are certain features that you need or want, each adding to the size. For instance, doing DNS resolution in the first stage payload is useful, but (in Windows) requires adding 80 bytes to the stager. So we have to balance size, which is very important for compatibility with some exploits that have limited buffers to work with, with features and reliability, which are important for world domination.
Metasploit's existing stagers are usually small enough, and our encoders get rid of bad characters for you, so it doesn't make sense to spend a lot of time writing your own payloads or optimizing to get rid of pesky bytes. Sometimes, though, a few bytes can make a big difference. For example, the exploit for MS08-067, ms08_067_netapi.rb, has a buffer size of 400 bytes; we'll come back to this in a moment.
block_api is a brilliant piece of kit that rummages through MZ headers looking for the function we want to call, so that finding a pointer to, say, "InternetOpenA" is portable and reliable on all versions of Windows. Because all of our Windows payloads use it, a win here makes every one of them a little bit smaller.
The first win comes from the fact that x86 has several ways of addressing memory. This is the original code:
add eax, edx ; Add the modules base address
mov eax, [eax+120] ; Get export tables RVA
The mov instruction here is using the "mov reg, r/m" form, which allows us to do some simple math on a register (adding 120) without having to store the result. The "r/m" argument is a little more flexible than that, though. It was intended to be able to reference tables and can take "[ base + scale*index + disp32 ]", where
base would normally be the beginning of the table,
scale would be the size of its elements,
index would be the index of the element you want, and
disp32 the offset within that element.
Armed with this knowledge, both of the above instructions can be condensed into one:
mov eax, [eax+edx+120] ; Get export tables RVA
for a one-byte saving. Every byte is sacred, after all.
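To check the byte count, here is a small Python sketch that hand-assembles the three instructions from their ModRM and SIB fields. The helper names (modrm, sib) are made up for illustration, and the encodings are the usual 32-bit ones an assembler would normally pick, with 120 stored as an 8-bit displacement.

```python
# Register numbers as used in the x86 ModRM and SIB fields.
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI = range(8)


def modrm(mod, reg, rm):
    """Pack a ModRM byte: mod (2 bits), reg (3 bits), r/m (3 bits)."""
    return (mod << 6) | (reg << 3) | rm


def sib(scale_bits, index, base):
    """Pack a SIB byte: log2(scale) (2 bits), index (3 bits), base (3 bits)."""
    return (scale_bits << 6) | (index << 3) | base


DISP8 = 120  # fits in a signed 8-bit displacement, so mod = 0b01

# add eax, edx            -> 01 /r (ADD r/m32, r32)
add_eax_edx = bytes([0x01, modrm(0b11, EDX, EAX)])
# mov eax, [eax+120]      -> 8B /r (MOV r32, r/m32) with a disp8
mov_disp = bytes([0x8B, modrm(0b01, EAX, EAX), DISP8])
# mov eax, [eax+edx+120]  -> 8B /r with r/m = 0b100, meaning "SIB byte follows"
mov_sib = bytes([0x8B, modrm(0b01, EAX, 0b100), sib(0, EDX, EAX), DISP8])

before = add_eax_edx + mov_disp
after = mov_sib

print(before.hex(" "), "->", len(before), "bytes")
print(after.hex(" "), "->", len(after), "bytes")
```

The SIB byte costs one byte, but it replaces a whole two-byte add, which is where the net saving comes from.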
The second reduction comes from the fact that the designers of x86 intended the ECX register to be used as a loop counter and thus added several instructions that treat it specially.
jecxz is one that allows us to jump without having to explicitly test a register. By using ECX for our EAT pointer instead of EAX, we can turn this:
test eax, eax ; Test if no export address table is present
jz get_next_mod1 ; If no EAT present, process the next module
into this:
jecxz get_next_mod1 ; If no EAT present, process the next module
saving another 2 bytes.
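The same kind of sanity check works here. This sketch lists the standard 32-bit encodings for the short forms of these instructions; REL8 is a placeholder for the jump offset, whose real value depends on where get_next_mod1 ends up. One caveat worth keeping in mind: jecxz only exists with an 8-bit relative offset, so the target has to stay close.

```python
# Placeholder for the short jump offset; the real value depends on layout.
REL8 = 0x00

test_eax_eax = bytes([0x85, 0xC0])  # test eax, eax
jz_short = bytes([0x74, REL8])      # jz get_next_mod1 (short form)
jecxz = bytes([0xE3, REL8])         # jecxz get_next_mod1

before = test_eax_eax + jz_short
after = jecxz
saved = len(before) - len(after)


def jecxz_can_reach(offset):
    """jecxz has no long form: the offset must fit in a signed byte."""
    return -128 <= offset <= 127


print("before:", len(before), "bytes; after:", len(after), "bytes; saved:", saved)
print("target 100 bytes ahead reachable:", jecxz_can_reach(100))
print("target 200 bytes ahead reachable:", jecxz_can_reach(200))
```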
That's all great for improving your shellcode golf score, but the big win came in the reverse_http and reverse_https payloads.
The first thing I noticed about reverse_http is that it used the time-honored tradition of jmp'ing ahead, then call'ing backwards and popping the return address to get the address of a string on the stack (in this case, it was the hostname and URI to call back to). Then it would store that value in a register and later push it as an argument to a function. Since the call instruction already puts the value on the stack, I simply rearranged it to be in the argument setup instead of beforehand, e.g. something like this pseudocode:
jmp get_uri
got_uri:
  pop ebx
  push eax
  push eax
  push ebx
  call ...
get_uri:
  call got_uri
  db "/12345", 0x00
becomes this:
push eax
push eax
jmp get_uri
got_uri:
  call ...
get_uri:
  call got_uri
  db "/12345", 0x00
The pop ebx and push ebx disappear, at one byte apiece.
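If it isn't obvious that the two orderings hand the API call the same arguments, a toy stack model makes it easy to see. This is only a sketch: URI_ADDR and EAX_VALUE are made-up values, and the stack is a plain Python list with the top at the end.

```python
URI_ADDR = 0x1000    # hypothetical address of the "/12345" string
EAX_VALUE = 0xAAAA   # hypothetical contents of eax at this point


def original_order():
    stack = []
    stack.append(URI_ADDR)   # call got_uri pushes the string's address
    ebx = stack.pop()        # pop ebx
    stack.append(EAX_VALUE)  # push eax
    stack.append(EAX_VALUE)  # push eax
    stack.append(ebx)        # push ebx
    return stack


def rearranged_order():
    stack = []
    stack.append(EAX_VALUE)  # push eax
    stack.append(EAX_VALUE)  # push eax
    stack.append(URI_ADDR)   # call got_uri pushes the string's address
    return stack


original_stack = original_order()
rearranged_stack = rearranged_order()

# pop r32 and push r32 are single-byte instructions (0x5B and 0x53 for ebx).
bytes_saved = len(bytes([0x5B])) + len(bytes([0x53]))

print("same stack at the API call:", original_stack == rearranged_stack)
print("bytes saved:", bytes_saved)
```

The rearranged version also leaves ebx untouched, which frees it up for something else.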
If we have an instruction like this:
mov esi, eax
and we don't need eax to keep its value (in this case we don't), we can replace it with
xchg esi, eax
and save another byte.
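Here is a quick sketch of why that works and what it costs. The register values are made up; the encodings are the standard ones (xchg with eax has a dedicated one-byte opcode). The catch is visible in the output: after xchg, eax holds whatever esi used to contain.

```python
mov_esi_eax = bytes([0x89, 0xC6])  # mov esi, eax
xchg_esi_eax = bytes([0x96])       # xchg esi, eax (one-byte 90+r form)


def mov(regs, dst, src):
    """Copy src into dst, leaving src unchanged."""
    out = dict(regs)
    out[dst] = regs[src]
    return out


def xchg(regs, a, b):
    """Swap the contents of two registers."""
    out = dict(regs)
    out[a], out[b] = regs[b], regs[a]
    return out


start = {"eax": 0x1234, "esi": 0xDEAD}  # hypothetical register contents
after_mov = mov(start, "esi", "eax")
after_xchg = xchg(start, "esi", "eax")

print("esi matches:", after_mov["esi"] == after_xchg["esi"])
print("eax after mov:", hex(after_mov["eax"]), "eax after xchg:", hex(after_xchg["eax"]))
print("bytes saved:", len(mov_esi_eax) - len(xchg_esi_eax))
```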
The next savings came from a need for zeros. In almost every function call made in reverse_http(s), we need a zero (or a NULL) for at least one argument. In fact, we need zero so often that it makes sense to just save it in a register and use it over and over. Previously this was ad-hoc, done at the beginning of each function with a different register. By zeroing one register at the beginning and using it throughout the payload, we can save even more space.
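A rough cost model shows where the space goes. The numbers of zero arguments per call below are invented for illustration, and ebx is just a stand-in for whichever register ends up holding the zero; the instruction sizes themselves (2 bytes for xor reg, reg and 1 byte for push reg) are standard.

```python
XOR_REG_REG = 2  # e.g. xor ebx, ebx encodes as 31 DB
PUSH_REG = 1     # e.g. push ebx encodes as 53

# Hypothetical payload: how many zero arguments each function call needs.
zero_args_per_call = [3, 2, 4, 1]


def ad_hoc_cost(calls):
    """Zero a fresh register before every call, then push it as needed."""
    return sum(XOR_REG_REG + n * PUSH_REG for n in calls)


def shared_cost(calls):
    """Zero one register once, then push it wherever a zero is needed."""
    return XOR_REG_REG + sum(n * PUSH_REG for n in calls)


saved = ad_hoc_cost(zero_args_per_call) - shared_cost(zero_args_per_call)
print("ad hoc:", ad_hoc_cost(zero_args_per_call), "bytes")
print("shared:", shared_cost(zero_args_per_call), "bytes")
print("saved:", saved, "bytes")
```

In this model the saving is two bytes for every call after the first, provided the chosen register survives the calls in between.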
Before this change, reverse_https was 368 bytes unencoded. Encoding with x86/call4_dword_xor knocks it up to 392 bytes. With x86/jmp_call_additive, it is 397 bytes. With the added stuff that ms08_067_netapi needs for fixing up the stack, it comes to 404 and 409, respectively. If you'll recall, that's too large for ms08_067_netapi. The encoders that produce decoding stubs with less overhead also produce badchars for this exploit, so shikata_ga_nai encoding (for instance) will not work.
After this change, reverse_https is 350 bytes unencoded. With all of ms08_067_netapi's restrictions, we can now encode it to a size that will fit. And there was much rejoicing.
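For the curious, the figures above are enough for a back-of-the-envelope estimate of the new encoded sizes. This sketch assumes each encoder's overhead and the stack fix-up cost stay the same as they were for the 368-byte version; that is an assumption, not a measurement, since an encoder may pad or vary its stub.

```python
BUFFER_SIZE = 400  # ms08_067_netapi's limit, from above
OLD_RAW = 368
NEW_RAW = 350

old_encoded = {"x86/call4_dword_xor": 392, "x86/jmp_call_additive": 397}
old_with_fixup = {"x86/call4_dword_xor": 404, "x86/jmp_call_additive": 409}

estimates = {}
for name, encoded in old_encoded.items():
    overhead = encoded - OLD_RAW             # decoder stub cost
    fixup = old_with_fixup[name] - encoded   # stack fix-up cost
    # Assumption: both costs carry over unchanged to the smaller payload.
    estimates[name] = NEW_RAW + overhead + fixup

for name, size in estimates.items():
    verdict = "fits" if size <= BUFFER_SIZE else "too big"
    print(f"{name}: ~{size} bytes ({verdict})")
```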