
Ehrm... Super should be writing this instead of me. Anyway, as I'm his pupil, I'm going to write down here what I have learnt during my time inside the Win32 coding world. This tutorial focuses on local optimization rather than structural optimization, because the latter is up to you and your style (for example, personally I'm *VERY* paranoid about the stack and delta offset calculations, as you can see in my code, especially in Win95.Garaipena). This article is full of my own ideas and of advice that Super gave me at the Valencian meetings. He's probably the best optimizer in the VX world ever. No lie. I won't discuss here how to optimize to the max as he does. No. I only want to show you the most obvious optimizations that can be done when coding for Win32. I won't comment on the very obvious optimization tricks already explained in my Virus Writing Guide for MS-DOS.

Check if a register is zero

I'm sick of always seeing the same thing, especially from Win32 coders, and it is killing me slowly and very painfully. No, no, my mind can't assimilate the idea of a CMP EAX,0. Ok, let's see why:

        cmp     eax,00000000h                   ; 5 bytes (imm32 form)
        jz      bribriblibli                    ; 2 bytes (if jz is short)

Heh, I know life's a shit, and you are wasting a lot of code on shitty comparisons. (Some assemblers pick the sign-extended imm8 encoding, which takes 3 bytes, but that is still too much.) Ok, let's see how to solve this situation with code that does the same in fewer bytes.

        or      eax,eax                         ; 2 bytes
        jz      bribriblibli                    ; 2 bytes (if jz is short)

TEST EAX,EAX has the same size and sets the flags in the same way. And there is an even more optimized way to do this, which is okay as long as it doesn't matter where the content of EAX ends up (after the code below, the old content of EAX will be in ECX, and the old ECX in EAX). Here you have it:

        xchg    eax,ecx                         ; 1 byte
        jecxz   bribriblibli                    ; 2 bytes (if it is short)

Do you see? No excuses about "I don't optimize because I lose stability", because with these tips you will optimize without losing anything besides bytes of code ;) Heh, we went from a 7 byte routine to 3 bytes... Heh? What do you say about that? Hahahaha.

Check if a register is -1

Many Ring-3 APIs return -1 (0FFFFFFFFh) when the function fails, so you must compare against that value to detect the failure. But here we have the same problem as before: many, many people do it with CMP EAX,0FFFFFFFFh, and it can be done in a more optimized way...

        cmp     eax,0FFFFFFFFh                  ; 5 bytes (imm32 form)
        jz      insumision                      ; 2 bytes (if short)

Let's do it in a more optimized way:

        inc     eax                             ; 1 byte
        xchg    eax,ecx                         ; 1 byte
        jecxz   insumision                      ; 2 bytes (if short)
        dec     ecx                             ; 1 byte

And another thingy could be this:

        inc     eax                             ; 1 byte
        jz      insumision                      ; 2 bytes
        dec     eax                             ; 1 byte

Heh, maybe it occupies more lines, but it occupies fewer bytes (4 bytes against 7). Just remember that when the jump is taken the register holds 0 and not -1, because the DEC is skipped.
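
To check the flag logic of the INC / JZ / DEC version without an assembler at hand, here is a minimal Python sketch. The helper names (`inc32`, `dec32`, `is_minus_one`) are made up for this illustration, and only the zero flag is modelled.

```python
MASK32 = 0xFFFFFFFF

def inc32(value):
    """Model of INC r32: returns (result, zero_flag)."""
    result = (value + 1) & MASK32
    return result, result == 0

def dec32(value):
    """Model of DEC r32: returns (result, zero_flag)."""
    result = (value - 1) & MASK32
    return result, result == 0

def is_minus_one(eax):
    """Mirror of INC EAX / JZ / DEC EAX: returns (jump_taken, eax_afterwards)."""
    eax, zf = inc32(eax)
    if zf:                  # JZ taken: EAX was 0FFFFFFFFh and is now 0
        return True, eax
    eax, _ = dec32(eax)     # fall through: DEC restores the original value
    return False, eax
```

The only input that wraps around to zero is 0FFFFFFFFh, so the jump is taken exactly in the failure case, and every other value comes back unchanged.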

Clear a 32 bit register and move something to its LSW

The clearest example is what all viruses do when loading the number of sections of a PE file into AX (as this value occupies 1 word in the PE header). Well, let's see what the majority of VX do:

        xor     eax,eax                         ; 2 bytes
        mov     ax,word ptr [esi+6]             ; 4 bytes

I'm still wondering why all VX use this "old" formula, especially when there is a 386+ instruction that saves us from zeroing the register before putting the word in AX. This instruction is MOVZX.

        movzx   eax,word ptr [esi+6]            ; 4 bytes

Heh, we avoided 1 instruction of 2 bytes. Cool, huh?
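
A small Python sketch of why the XOR can go away: MOV AX only replaces the low 16 bits of EAX, while MOVZX fills all 32 bits. The function names and the sample values are assumptions made for this illustration.

```python
def mov_ax(eax, word):
    """MOV AX,word: only the low 16 bits of EAX are replaced."""
    return (eax & 0xFFFF0000) | (word & 0xFFFF)

def movzx_eax(word):
    """MOVZX EAX,word: the word is zero-extended into all 32 bits."""
    return word & 0xFFFF

stale_eax = 0xDEADBEEF   # whatever was left in EAX before
word = 0x0005            # sample 16 bit value read from memory

with_xor = mov_ax(0, word)              # XOR EAX,EAX + MOV AX,...
without_xor = mov_ax(stale_eax, word)   # MOV AX,... alone keeps the high half
single = movzx_eax(word)                # MOVZX EAX,...
```

Here `with_xor` and `single` are both 5, while `without_xor` is 0DEAD0005h, which is the reason the old formula needs the XOR in the first place.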

Calling to an address stored in a variable

Heh, this is another thing that some VX do, and it makes me go crazy and scream. Let me remind you of it:

        mov     eax,dword ptr [ebp+ApiAddress]  ; 6 bytes
        call    eax                             ; 2 bytes

We can call through the variable directly, guys... It saves bytes and doesn't use a register that could be useful for other things.

        call    dword ptr [ebp+ApiAddress]      ; 6 bytes

Once again, we get rid of a useless instruction that occupies 2 bytes, and we do exactly the same thing.

Fun with push

Almost the same as above, but with push. Let's see what not to do and what to do:

        mov     eax,dword ptr [ebp+variable]    ; 6 bytes
        push    eax                             ; 1 byte

We could do the same with 1 byte less. See.

        push    dword ptr [ebp+variable]        ; 6 bytes

Cool, huh? ;) Well, if we need to push the same value many times, it is more optimized to put it in a register and push the register: with a big value (a 32 bit immediate) this pays off from 2 pushes, and with a small value (an 8 bit immediate) from 3 pushes. For example, if we need to push zero 3 times, it is more optimized to XOR a register with itself and then push the register. Let's see:

        push    00000000h                       ; 2 bytes
        push    00000000h                       ; 2 bytes
        push    00000000h                       ; 2 bytes

And let's see how to optimize that:

        xor     eax,eax                         ; 2 bytes
        push    eax                             ; 1 byte
        push    eax                             ; 1 byte
        push    eax                             ; 1 byte
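
The "2+ times" and "3+ times" limits come straight from the instruction sizes, and a few lines of Python can do the arithmetic. The sizes assumed below are the usual 32 bit encodings (PUSH r32 = 1 byte, PUSH imm8 = 2, PUSH imm32 = 5, XOR r32,r32 = 2, MOV r32,imm32 = 5); the function names are made up for this sketch.

```python
PUSH_REG = 1        # push r32
PUSH_IMM8 = 2       # push imm8 (sign-extended), e.g. push 0
PUSH_IMM32 = 5      # push imm32
XOR_REG_REG = 2     # xor r32,r32 (can only load zero)
MOV_REG_IMM32 = 5   # mov r32,imm32

def direct_size(pushes, push_size):
    """Bytes needed when the immediate is pushed every time."""
    return pushes * push_size

def register_size(pushes, load_size):
    """Bytes needed when the value is loaded once and the register is pushed."""
    return load_size + pushes * PUSH_REG

def break_even(push_size, load_size):
    """Smallest number of pushes for which the register version is shorter."""
    pushes = 1
    while register_size(pushes, load_size) >= direct_size(pushes, push_size):
        pushes += 1
    return pushes

zero_limit = break_even(PUSH_IMM8, XOR_REG_REG)      # pushing zero
big_limit = break_even(PUSH_IMM32, MOV_REG_IMM32)    # pushing a 32 bit value
```

With these sizes `zero_limit` is 3 (5 bytes against 6) and `big_limit` is 2 (7 bytes against 10), which matches the rule of thumb given above.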

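Something similar happens when using SEH, as we need to push fs:[0] and so on. Let's see how to optimize that:

        push    dword ptr fs:[00000000h]        ; 7 bytes
        mov     fs:[0],esp                      ; 7 bytes
        pop     dword ptr fs:[00000000h]        ; 7 bytes

Instead of that we should do this:

        xor     eax,eax                         ; 2 bytes
        push    dword ptr fs:[eax]              ; 3 bytes
        mov     fs:[eax],esp                    ; 3 bytes
        pop     dword ptr fs:[eax]              ; 3 bytes

Heh, seems a silly thing, but we have 10 bytes less! Whoa!!! Each absolute form pays for the FS prefix, the opcode, a ModR/M byte and a 32 bit displacement, while the register form drops the displacement. Just remember that EAX must still be zero when the POP is executed.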

Get the end of an ASCIIz string

This is very useful, especially in our API search engines. And of course, it can be done in a more optimized way than the typical one found in all viruses. Let's see:

        lea     edi,[ebp+ASCIIz_variable]       ; 6 bytes
@@1:    cmp     byte ptr [edi],00h              ; 3 bytes
        jz      @@2                             ; 2 bytes (before the INC, as INC changes ZF)
        inc     edi                             ; 1 byte
        jmp     @@1                             ; 2 bytes
@@2:    inc     edi                             ; 1 byte

This same code can be reduced a lot if you code it this way:

        lea     edi,[ebp+ASCIIz_variable]       ; 6 bytes
        xor     al,al                           ; 2 bytes
@@1:    scasb                                   ; 1 byte
        jnz     @@1                             ; 2 bytes

Hehehe. Useful, short and good looking. What else do you need? ;)
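
If you want to convince yourself of where EDI ends up, here is a minimal Python model of the SCASB / JNZ loop, assuming AL = 0 and the direction flag clear. The buffer contents and the function name are assumptions made for this sketch.

```python
def scan_past_nul(memory, edi, al=0):
    """Model of  @@1: SCASB / JNZ @@1  with the direction flag clear."""
    while True:
        zf = memory[edi] == al   # SCASB compares AL with the byte at [EDI]...
        edi += 1                 # ...and always advances EDI afterwards
        if zf:                   # JNZ falls through once the NUL was seen
            return edi

memory = b"hello\x00world"       # sample buffer, two strings back to back
end = scan_past_nul(memory, 0)   # EDI after the loop
```

`end` is 6 here, one byte past the terminating zero, so it already points to whatever follows the string (`memory[end:]` is `b"world"`). Keep in mind that the loop has no length limit: it trusts that a zero byte exists.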

Multiply shitz

For example, in the code for getting the last section, the most used code includes this (we have in EAX the number of sections - 1):

        mov     ecx,28h                         ; 5 bytes
        mul     ecx                             ; 2 bytes

And this saves the result in EAX, right? Well, we have a much better way to do this, with only one instruction:

        imul    eax,eax,28h                     ; 3 bytes

IMUL stores the result in the first register given; that result is the second operand multiplied by the third one, which in this case is an immediate. Heh, we saved 4 bytes by substituting only 2 instructions! As a bonus, ECX and EDX are left untouched.
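
The two ways of multiplying differ in more than size, and a tiny Python model makes it visible: MUL writes a 64 bit product to EDX:EAX, while the three operand IMUL keeps only the low 32 bits in its destination. The function names and the sample input are assumptions for this sketch.

```python
MASK32 = 0xFFFFFFFF

def mul_ecx(eax, ecx):
    """MUL ECX: unsigned 64 bit product goes to EDX:EAX. Returns (eax, edx)."""
    product = eax * ecx
    return product & MASK32, (product >> 32) & MASK32

def imul_imm(src, imm):
    """IMUL r32,r/m32,imm: only the low 32 bits are kept, EDX is not written."""
    return (src * imm) & MASK32

sections_minus_one = 4                            # sample input
eax_mul, edx_mul = mul_ecx(sections_minus_one, 0x28)
eax_imul = imul_imm(sections_minus_one, 0x28)
```

Both give 0A0h in EAX for this input. The difference is that MUL also overwrites EDX (with 0 here) and needs ECX for the constant, so the IMUL form leaves two more registers free.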

Infection mark

It should work, but I'm not sure, because it doesn't on my computer. Pff, maybe an Intel bug, or my system is crazy or something. Anyway, try it, as it is very interesting. Look at the unoptimized version:

        cmp     dword ptr [esi+44h],"MARK"      ; 7 bytes
        jz      oro_y_grana                     ; 2 bytes
        mov     dword ptr [esi+44h],"MARK"      ; 7 bytes

Optimized, it should look like this (I already said that it SHOULD work, but it doesn't on my PC):

        mov     eax,"MARK"                      ; 5 bytes
        cmpxchg dword ptr [esi+44h],eax         ; 4 bytes
        jz      oro_y_grana                     ; 2 bytes

Pfff, a really good optimization, 16 bytes reduced to 11 bytes ;) Beware, though: when the comparison fails, CMPXCHG writes nothing to memory and loads the old memory value into EAX instead. So this sequence only does the check and never stores the value, which would explain why it fails.
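
To see what CMPXCHG really does with these operands, here is a minimal Python model of the instruction as it is documented (compare EAX with the destination; if equal, store the source; otherwise load the destination into EAX). The function name and the constant are assumptions made for this sketch.

```python
def cmpxchg(mem, eax, src):
    """CMPXCHG [mem],src. Returns (new_mem, new_eax, zero_flag)."""
    if mem == eax:
        return src, eax, True    # equal: destination <- source, ZF = 1
    return mem, mem, False       # different: EAX <- destination, ZF = 0

VALUE = 0x4B52414D               # sample 32 bit constant

# Field already holds the value: ZF is set and nothing changes.
already_there = cmpxchg(VALUE, VALUE, VALUE)

# Field holds something else (0 here): memory keeps the old value.
not_there = cmpxchg(0, VALUE, VALUE)
```

`already_there` is `(VALUE, VALUE, True)` and `not_there` is `(0, 0, False)`: in the second case the field still holds 0 and EAX has lost the constant. With the same register as comparand and source, the instruction can test the field but can never write it, so the three instruction version is not equivalent to the short one.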


Convert a UNICODE string to ASCIIz

There is a lot to do here. Especially for Ring-0 viruses, there is a VxD service to do that. First I'm going to explain the optimization based on the use of this service, and finally I'll show Super's method, which saves TONS of bytes. Let's see the typical code (assuming EBP as a ptr to the ioreq structure and EDI pointing to the file name):

        xor     eax,eax                         ; 2 bytes
        push    eax                             ; 1 byte
        mov     eax,100h                        ; 5 bytes
        push    eax                             ; 1 byte
        mov     eax,[ebp+1Ch]                   ; 3 bytes
        mov     eax,[eax+0Ch]                   ; 3 bytes
        add     eax,4                           ; 3 bytes
        push    eax                             ; 1 byte
        push    edi                             ; 1 byte
@@3:    int     20h                             ; 2 bytes
        dd      00400041h                       ; 4 bytes

Well, only 1 improvement in particular can be made to that code: substitute the third line with this (EAX is still zero at that point, so setting AH to 1 gives 100h):

        mov     ah,1                            ; 2 bytes

Heh, but I said that Super improved this to the max. I haven't copied his code for getting the ptr to the unicode name of the file, because it is almost impossible to understand, but I caught the concept. The assumptions are EBP as a ptr to the ioreq structure and buffer as a 100h byte buffer. Here goes some code:

        mov     esi,[ebp+1Ch]                   ; 3 bytes
        mov     esi,[esi+0Ch]                   ; 3 bytes
        lea     edi,[ebp+buffer]                ; 6 bytes
@@l:    movsb                                   ; 1 byte  ─┐
        dec     edi                             ; 1 byte   │ This loop was
        cmpsb                                   ; 1 byte   │ made by Super ;)
        jnz     @@l                             ; 2 bytes ─┘

Heh, the first routine (without local optimization) is 26 bytes, the same one with that local optimization is 23 bytes, and the last routine, the structural optimization, is 17 bytes. Whoaaaa!!!

VirtualSize calculation

This title is an excuse to show you another strange opcode, very useful for VirtualSize calculations, as we have to add a value to it and get the value that was there before our addition. Of course, the opcode I am talking about is XADD. Ok, ok, let's see the unoptimized VirtualSize calculation (I assume ESI is a ptr to the last section header):

        mov     eax,[esi+8]                     ; 3 bytes
        push    eax                             ; 1 byte
        add     dword ptr [esi+8],virus_size    ; 7 bytes
        pop     eax                             ; 1 byte

And let's see how it should be with XADD:

        mov     eax,virus_size                  ; 5 bytes
        xadd    dword ptr [esi+8],eax           ; 4 bytes

With XADD we saved 3 bytes ;) Btw, XADD is a 486+ instruction.
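
For those who never used XADD, here is a minimal Python model of what it does: the destination receives the sum and the register receives the old destination value. The function name and the sample numbers are assumptions for this sketch, not values taken from any real file.

```python
MASK32 = 0xFFFFFFFF

def xadd(mem, reg):
    """XADD [mem],reg. Returns (new_mem, new_reg)."""
    return (mem + reg) & MASK32, mem   # mem <- mem + reg, reg <- old mem

old_size = 0x1000    # sample value stored in the field
added = 0x200        # sample value to add

new_field, eax = xadd(old_size, added)
```

After the call `new_field` is 1200h and `eax` is 1000h, so a single instruction updates the field and hands back the previous value. In the unoptimized listing the PUSH / POP pair is not really needed (ADD to memory doesn't touch EAX); without it that version is 10 bytes, and XADD still wins by 1.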

Setting STACK frames

Another Ring-0 thingy. Let's see it unoptimized:

        push    ebp                             ; 1 byte
        mov     ebp,esp                         ; 2 bytes
        sub     esp,20h                         ; 3 bytes

And if we optimize...

        enter   20h,00h                         ; 4 bytes

Charming, isn't it? ;)

Tips & tricks

Here I will put the unclassifiable optimization tricks, or the ones I assumed you already knew while writing this article ;)

Final words

I expect you understood at least the first optimizations in this article, because they are the ones that make me go mad. I know I am not the best at optimization, nor even one of the best. For me, size doesn't matter. Anyway, the obvious optimizations must be done, at least to demonstrate that you know how to do something in your life. Fewer useless bytes means a better virus, believe me. And don't come to me using the same words that QuantumG used in his Next Step virus. The optimizations I showed here WON'T make your virus lose stability. Just try to use them, ok? It's very logical, guyz.

Billy Belcebú,
mass killer and ass kicker.