                                -- Contact --
                     https://twitter.com/vxunderground
                     firstname.lastname@example.org
This article assumes a grounding in polymorphic engine construction, so you need a good working knowledge of decryptor generators and how they are built (it's not for newbies! ;)

I wrote this for win32 engines. I'm not very versed in Linux/Unix virusing, but by changing some words in this article (and some points in the index) it can be extrapolated to engines for those systems.

This article is meant for those who have already written their own polymorphic engine and want to improve their knowledge and techniques, building a better polymorphic engine. I have to warn you that the techniques I'm going to explain are very time consuming (an error in the coding, however small, can generate huge errors that aren't easy to trace back, or little errors in the code generation that pop up at the least expected moment).

Well, so let's go on. I've tried to make both an organized article and clear explanations, but sometimes it may be a little hard to understand. Considering that I'm not an expert at article writing, I've done what I could ;).
Many people still believe the old advice of virus coding: keep them small. This advice was valid when there were 40 Mb hard disks and no Pentiums+ (or similar), but nowadays it is obsolete. An average user now has a huge hard disk (2+ Gb) and doesn't even know the real free space of the disk due to the Windblows swap file. So we can make a huge virus (10 Kb or more) without being noticed at all. We can apply this to decryptors, then. Why couldn't we generate a 4 Kb decryptor? We can, and moreover, we are nearly obliged to do it (a 100-byte decryptor isn't a challenge to any of today's AV emulators). But we have to have a GOOD garbage generator, since a single instruction is 3 or 4 bytes on average, which means some thousands of instructions in a big decryptor, and that can make the AVers' task easier if they decide to detect our virus by algorithmic approaches or heuristic techniques. So, we have to code a good garbage generator with quite a lot of variety.

Another thing we have in favour of big decryptors is the impossibility for an emulator to determine in a few instructions whether it's a decryptor or not, forcing it to emulate deeply. 1 Kb of garbage before starting the decryption should be enough, but you have to think that processors get better all the time (faster, cheaper, etc.), so emulators do too. Garbage executes very fast under normal execution, but it can take a good while under an emulator. The more garbage you put, the more time an emulator needs and the fewer chances it has of reaching the decrypted virus, as long as you emit coherent garbage to avoid heuristic detection of "strange" instruction usage. You also have to maintain a good balance between quantity and quality of generated code (putting 20 Kb of very complex garbage can slow the initial execution of the application, telling the user there is something unusual on his/her system).

When possible, avoid linear decryption. This goes for both the decryption and the looping. Even if we have a 10 Kb decryptor, if we make one main loop (easily detectable by an emulator) and access the encrypted data consecutively while decrypting it, it's pointless to write very complex code, because we have given the AVs an algorithmic clue that many emulators use to defeat complex decryptors (they just put a breakpoint after the big loop and wait until that part is decrypted). I've developed two techniques that are useful to avoid this situation: PRIDE and Branching.
The name comes from Pseudo-Random Index DEcryption, and it was an idea that I had from the beginning, when I began to code poly engines. Due to the lack of information about this, I had to research this subject by myself, and now that's what I bring to you.
The idea is that a "normal" program doesn't normally perform sequential reads/writes over a memory zone, which is exactly what the decryptors of all polymorphic viruses do. There are some techniques that try to avoid this to a certain degree (look at Zhengxi's engine (29A#1) or MeDriPolEn in Squatter (29A#3)), like skipping some bytes to leave "holes" and then doing several decryption passes over the same code, each pass with a different displacement, leaving the code fully decrypted.

This particularity is also detected by AV emulators. The only way to hide it is to make an apparently random access pattern over that memory, to cheat the emulator and force it to conclude that these are the normal memory accesses of an application. That's what I researched for a long while until I found this formula, which is very easy to generate polymorphically. It's written for byte-by-byte decryption, but below I explain how to adapt it to other sizes.
+---------------------------------------------------------------------------+
|Random(Number) symbolizes a random number between 0 and Number-1 (just like|
| the C function)                                                           |
|                                                                           |
|Encrypted_Data_Size = The size of encrypted part, rounded to the next power|
| of 2 (I explain this later)                                               |
|InitialValue = Random(Encrypted_Data_Size)                                 |
|                                                                           |
|                                The formula                                |
|                                -----------                                |
| Register1 = Random(Encrypted_Data_Size)                                   |
| Register2 = InitialValue                                                  |
| Loop_Label:                                                               |
|  Decrypt [(Register1 XOR Register2)+Begin_Address_Of_Encrypted_Data]     |
|  Register1 += Random(Encrypted_Data_Size) AND -2                          |
|                                          (Take care with this one!)      |
|  Register1 = Register1 MOD Encrypted_Data_Size                            |
|  Register2++                                                              |
|  Register2 = Register2 MOD Encrypted_Data_Size                            |
|  if Register2!=InitialValue GOTO Loop_Label                               |
| GOTO Begin_Address_Of_Encrypted_Data                                      |
+---------------------------------------------------------------------------+
That's it! Very short, very easy to code, and very randomized. Let's see it by parts, and I'll explain the mathematical aspects of the formula (why it is like this and not some other way):

The first thing to consider is the fact that the size of the encrypted part must be a power of 2. If you look at the formula, you can see that the generated decryption address comes from a XOR between two random numbers. The point is that a XOR (unlike ADD, SUB, etc.) never modifies a bit higher than the highest bit of the two operands, which lets us always know the top limit of the resulting number (always below the power of 2).
Now, the registers used: Register1 acts as a modifier for Register2, and it's a different pseudo-random number every time, since we generate its initial value randomly and add a random number to it on every loop. The real work of this formula is done by Register2, and if you look at it, you can see that Register2 is nothing but a counter, so you can increase it or decrease it, it's up to you (or up to the engine :). Just keep it inside the limits (between 0 and Encrypted_Data_Size).

Now the real revolution of this formula: I found out, after many tests, that when you have a counter (Register2) and you XOR a random number into it (always inside the limits; I'm not going to repeat this anymore :) you get a different number, and if you add a slightly special random number to the XORing value and increase the counter, the next XOR against the counter gives you yet another different number. When you have completed the whole count sequence (from 0 to NumberPowerOf2), you have produced a sequence of random numbers which touches every number from 0 to NumberPowerOf2, but with none repeated! It's like generating a permutation of a sequence, but you don't have to store any vector or generate any data. Since the formula randomizes all its numbers, it doesn't differ very much from the "standard" decryptor.

I referred a few times to a "special random number", which is the one that gets added to the XORing value. It is special because you have to keep it one "level" below the other random numbers. Let me explain this, and please pay attention, because it is very important:

When you use this technique, most random numbers go from 0 up to the size of the data to decrypt (a power of 2). There's a particularity in the formula that is necessary for it to work correctly and return reliable values: you must "align" the numbers (so the result of Random(Encrypted_Data_Size) must be a multiple of 1 if we decrypt by byte ptr (nothing special there), a multiple of 2 if we decrypt by word ptr, and a multiple of 4 if we decrypt by dword ptr). But the number that we add to the XORing value is slightly special, because it has to be aligned one grade higher; I mean, if you use byte ptr for decryption, that number must be a multiple of 2, a multiple of 4 for word ptr, and of 8 for dword ptr. That is easily achieved by generating the random number to add before encoding the opcode of the instruction, and then doing an instruction:
AND Value,Encrypted_Data_Size-2 (for bytes), or
AND Value,Encrypted_Data_Size-4 (for words), etc.
(in the engine, not in the decryptor!).
Just take this into account because, otherwise, part of the data will be touched twice, leaving it corrupted (I found this out after going mad and furious through several hours of staring at a correct engine and an incorrect decryption, and after coding thousands of little test programs :).
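If you want to convince yourself that the formula really touches every offset exactly once, here is a little simulation (in Python instead of the generated x86, and the function name is mine, not part of any engine):

```python
import random

def pride_indices(size, r1_start, init, step):
    """Simulate the PRIDE loop. size must be a power of 2, and step
    (the constant added to Register1 each iteration) must be even --
    the 'special' alignment required for byte-wide decryption."""
    assert size & (size - 1) == 0 and step % 2 == 0
    r1, r2, out = r1_start, init, []
    while True:
        out.append(r1 ^ r2)          # offset that gets decrypted this pass
        r1 = (r1 + step) % size      # fixed even immediate in the decryptor
        r2 = (r2 + 1) % size         # Register2 is just a counter
        if r2 == init:
            return out

# every offset in [0, size) is hit exactly once, in a scrambled order
seq = pride_indices(256, random.randrange(256),
                    random.randrange(256), random.randrange(256) & ~1)
assert sorted(seq) == list(range(256))
```

Note that the number added to Register1 is a constant chosen at engine time and hard-coded into the decryptor; if you let it be odd (breaking the alignment rule above), some offsets get decrypted twice and others never, which is exactly the corruption described.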
This method, although powerful, can still be defeated by detecting the code loops, so we must do something to break the linearity of the decryptor's execution. The easiest way is to put some conditional jumps in the middle, but it seems that the emulators detect which zone of code is executed most frequently (or something like that), so I thought about it and created the next technique:
This one, combined with PRIDE (it seems a joke :P ) will allow us to defeat normal emulators (at least until this date), and of course with the help of the normal polymorphism techniques and complex garbage generation.
When you look at a legitimate application, you can see that it has many conditional jumps followed by code, and that normally no portion of code is executed an absurd number of times the way a decryption loop is. We must break this pattern, and the way can be this:
+----------------------------------------------------------------------------+
|First we have these arrays and values:                                      |
|                                                                            |
|ArrayOfJumps       dd N dup (0)                                             |
|ArrayOfJumpsNdx    dd 0                                                     |
|JumpsToComplete    dd N dup (0)                                             |
|JumpsToCompleteNdx dd 0                                                     |
|                                                                            |
|    |      This is the beginning of the decryptor. This is the part where   |
|    |      the registers are set to their starting value, and everything    |
|    |      else we must put at the beginning.                               |
|    |                                                                       |
|    x      First address stored into ArrayOfJumps                           |
|    |      Garbage                                                          |
|  .-*-.    Random conditional jump with a very random probability of jump   |
|  |   |    Garbage                                                          |
|  x   x    2nd and 3rd address stored into ArrayOfJumps                     |
|  |   |    Garbage                                                          |
| .*. .*.   Random conditional jumps                                         |
| | | | |                                                                    |
| | | | |   Four decryption algorithms that perform the same op. but with    |
| | | | |   different code.                                                  |
| | | | |                                                                    |
| | | | |   Final-of-decryption check                                        |
| R R R R   Loop to continue decrypting (jump randomly to one of the addr.   |
| | | | |   stored in ArrayOfJumps)                                          |
| | | | |   Garbage                                                          |
| | | | |                                                                    |
| V V V V   Jump to decrypted virus                                          |
|                                                                            |
|(This would be generated with 3 levels of recursivity. Just look down to see|
|the explanation)                                                            |
+----------------------------------------------------------------------------+
I think the technique is quite clear looking at the diagram, but I'll explain it in words:
Then we must store the address of the conditional jump we just generated, because we don't yet know which address it has to jump to, so we save it and later we'll calculate and complete these jumps. It's enough to push the address onto the stack.

After saving this, we code this leaf of the binary tree: you call "DoBranch" again, and when it returns... voilà! We have a complete branch coded, and of course the instruction-insertion index now points to the place where the next branch will go. So we pop the address we pushed before, calculate the distance between the current insertion index and the saved address, and then complete the conditional jump that the saved address points at with that calculated value. After this little operation, call "DoBranch" again, decrease [LevelOfRecursivity] and RET.

When you arrive at the comparison/cond_jmp pair (the one which loops back to continue decryption if the encrypted part isn't fully decrypted yet), you have to code the comparison and leave the jump uncompleted, since we must wait for the complete return of "DoBranch" to be sure that all branches are coded. So we store this address in the other prepared array, JumpsToComplete, just as we did with ArrayOfJumps.
Then insert some garbage, code the jump to the decrypted part and RET.
When we finish, we'll have a decryptor whose behaviour is exactly the same as a normal one, but where you never know which branch of code is going to be executed each time, since when you jump back to loop, you go through a random number of random comparisons and conditional jumps that drive you to a random decryption part. Since every final part of the tree does the same thing as the others, we don't care which one gets executed each time; we have broken the linearity of execution, and now, "from the outside", the decryptor resembles a normal application following its conditions, not a decryption loop.
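To make the idea concrete, here is a sketch of the recursive generator in Python; it emits a toy instruction list instead of real opcodes, and do_branch simply mirrors the "DoBranch" routine described above (the code is only my illustration, not taken from any engine):

```python
import random

code = []                 # emitted "instructions"
array_of_jumps = []       # entry addresses a loop can jump back to
jumps_to_complete = []    # jumps whose target isn't generated yet

def garbage():
    code.append("garbage")

def do_branch(level):
    array_of_jumps.append(len(code))          # entry point of this branch
    garbage()
    if level > 0:
        fixup = len(code)
        code.append(None)                     # cond. jump, target unknown yet
        do_branch(level - 1)                  # left branch
        code[fixup] = ("jcc", len(code))      # complete it: points to right branch
        do_branch(level - 1)                  # right branch
    else:
        code.append("decrypt-step")           # one of the equivalent decryptors
        code.append("check-end")              # final-of-decryption check
        target = random.choice(array_of_jumps)
        code.append(("loop-jcc", target))     # loop back to a random entry
        garbage()
        jumps_to_complete.append(len(code))
        code.append(None)                     # jump to virus, completed later

do_branch(2)                                  # 3 levels -> 4 decryptor leaves
for addr in jumps_to_complete:                # final fixup pass
    code[addr] = ("jmp", "decrypted_virus")

assert code.count("decrypt-step") == 4
assert None not in code                       # every jump got completed
```

The two-pass structure (emit a placeholder, recurse, then patch the saved address) is the whole trick; in the real engine the placeholder is a conditional jump opcode with an empty displacement and the patch writes the computed distance.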
We have seen that Branching requires recursion for an easy implementation of the technique, but since we have it already, we can orient the entire engine towards recursive functions, especially to generate indirect register/memory modifications. We are going to code some usual functions recursively, adding a variable that I call the "recursivity level", which is increased every time the function is called and is used to control the active instances of that function (so, when we reach a pre-decided number of calls, we avoid calling the function recursively again). Let's see the MOV Reg,Value instruction and what happens if we code the function that generates this type of instruction in a recursive way:

Now that we have these MOVs, we treat them as a "last chance": we use them, for example, in 25% of the calls, and whenever we are at a fairly deep recursivity level (5 or 6 is enough, I think). We'll call this function "DoMOVRegValue". So, at the beginning, we can put:
And, to finish, we jump here instead of making RET:
This would be the direct case, but what if we do this?:

Of course, we don't use only ADD, but also XOR, SUB, etc. And we don't use this only with memory addresses, so we can apply it in other situations too:

(this is inside DoMOVRegValue and we arrive here randomly, since there are other options, of course):
The possibilities, then, are infinite. We can combine all the functions we made before to generate a quite complex MOVing. We can do the same with DoMOVRegReg (taking as the non-recursive forms, for example, MOV Reg1,Reg2, LEA Reg1,[Reg2] or PUSH Reg2/POP Reg1). We can also use other functions to increase this: DoPUSHReg, DoPUSHMem, etc. Let's see an example of deep recursivity:

Phew! We are now very deep in recursive instances of those functions. After a while, it will exit from AdjustMemToValue, which maybe called other recursive functions that in turn called AdjustMemToValue again; that's why we have to control the number of recursive calls, since this can generate an enormous amount of code without much effort. After doing DoPUSHValue, it executes DoPOPMem, which strategically isn't recursive (and logically so, since there is no MOV Mem,Mem instruction to build a DoMOVMemMem from).
We know that DoPUSHMem doesn't make recursive calls, but the next function, DoPOPReg, can make them, for example:
Again, more recursive calls :).
After this, you can see how powerful recursive code generation is, and how a simple MOVing can turn into a quite complex set of assignments between memory and registers, giving as the final result the desired value in the desired register. Many functions can be done in this way, and later we'll see its application to making garbage.
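As an illustration, here is a DoMOVRegValue-style generator in Python (the 25% chance and the level cutoff follow the description above; the instruction list, the decompositions chosen and the little checking interpreter are my own sketch, not the engine's asm):

```python
import random

MASK = 0xFFFFFFFF

def do_mov_reg_value(reg, value, level=0, out=None):
    """Recursively generate code that leaves `value` in `reg`. The
    direct forms are the 'last chance': taken with 25% probability,
    or unconditionally once the recursivity level gets deep."""
    if out is None:
        out = []
    if level >= 5 or random.random() < 0.25:
        if random.random() < 0.5:
            out.append(("mov", reg, value))           # direct MOV Reg,Value
        else:
            out.append(("push", value))               # PUSH Value / POP Reg
            out.append(("pop", reg))
        return out
    x = random.randrange(1 << 32)
    op = random.choice(["add", "sub", "xor"])
    if op == "add":                                   # load value-x, then ADD x
        do_mov_reg_value(reg, (value - x) & MASK, level + 1, out)
    elif op == "sub":                                 # load value+x, then SUB x
        do_mov_reg_value(reg, (value + x) & MASK, level + 1, out)
    else:                                             # load value^x, then XOR x
        do_mov_reg_value(reg, value ^ x, level + 1, out)
    out.append((op, reg, x))
    return out

def run(code):
    """Tiny interpreter, used only to check the generated code."""
    regs, stack = {}, []
    for ins in code:
        if ins[0] == "mov":  regs[ins[1]] = ins[2]
        if ins[0] == "push": stack.append(ins[1])
        if ins[0] == "pop":  regs[ins[1]] = stack.pop()
        if ins[0] == "add":  regs[ins[1]] = (regs[ins[1]] + ins[2]) & MASK
        if ins[0] == "sub":  regs[ins[1]] = (regs[ins[1]] - ins[2]) & MASK
        if ins[0] == "xor":  regs[ins[1]] = regs[ins[1]] ^ ins[2]
    return regs

for _ in range(100):
    code = do_mov_reg_value("eax", 0xDEADBEEF)
    assert run(code)["eax"] == 0xDEADBEEF             # always the right value
```

Every recursion level first generates code that leaves a transformed value in the register, then appends the ADD/SUB/XOR that corrects it, so the final state is always right no matter how deep the chain went.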
But even the most recursive engine in the world can make all the work worthless if it's detected heuristically because it emits a strange instruction or a quite common polymorphic structure, like:
or similar, because no normal application does that.
What to do, then? It's easy: just keep an array of "inconcluded calls". I mean: you code the CALL instruction, but you don't code the subroutine yet. Then you save the address of this instruction in an array, and upon a random decision, or at the end of the decryptor, the engine generates the subroutines and completes the CALLs stored in the array to make them point to the newly generated subroutines. Another good way is to pre-generate some subroutines before the entry point of the decryptor and make calls to them (combining this type of CALLing with the "inconcluded call" type).

With this, the decryptor looks more like a compiler-generated application, at least in its use of CALL instructions. Another very powerful thing is using a stack frame inside the generated subroutines: you emit PUSH EBP / MOV EBP,ESP at the beginning and POP EBP at the end of the subroutine, and then you can't determine at first sight whether the decryptor is the product of a compiler or not. Even better if you use that stack frame to retrieve values! :)
Another thing: avoid this:

Do you think it's normal to find this in a normal application? Well, the emulator thinks the same :). Avoid it, and avoid inserting raw random data that isn't tied to any real instruction.
Have you ever scanned a virus that, although extremely polymorphic, inserts one-byte instructions like CMC, STI, etc.? If you have tried it with AVP, for example, you may have noticed that the antivirus automatically enters deep scan. Why? Because using these instructions is HIGHLY suspicious. Who uses CMC nowadays? And not only that: the random generator tends to insert quite a lot of these dummy instructions, and the antivirus emulator knows that fact, so when it finds a more or less large number of them (and for some instructions directly, even if there is only one), it decides that the file is so suspicious that it may be worth the time to enter deep scan. Maybe it doesn't find anything, but an average user may come to think that the file holds something more than the original application.

This advice also applies to some 16-bit instructions in win32 applications. While coding the TUAREG engine, I allowed nearly all the instructions the garbage generator could emit to use 8, 16 or 32 bits, and then, when scanning with AVP, the emulator always switched to deep scan. After thinking about it, I removed the generation of some 16-bit instructions and AVP didn't do that again. I don't know exactly which instructions make AVP raise that heuristic flag, but in any case I recommend using as few 16-bit instructions as possible.
Now we enter one of my favourite subjects: garbage generation. My opinion is that the main power of a polymorphic engine is its garbage generation, since garbage is the code that makes emulators give up tracing, or that helps them determine the nature of the program. So, the more normal the garbage seems, the less suspicious the decryptor looks, and the more complex the garbage is, the less an emulator can get into the decryptor to emulate. Let's see some types, although these aren't all of them (obviously I'm not going to explain the easy ones). Imagination also counts!
Nowadays it is a must to have this in an engine, if one wants to make it complex. What type of application doesn't perform any random-access memory operation in its first 300 bytes? Only a very weird program, or a program infected with an attached polymorphic virus that doesn't generate memory-writing instructions (besides the decryption process itself).

But the fact is that, while in MS-DOS we had all the memory accessible and we could read from everywhere, this isn't true for win32, where an attempt to read from "everywhere" will, in the great majority of cases, cause an exception. Writing is much more restrictive under win32, because we can only write to the sections we defined as WRITABLE in the PE header (unless we want to generate an exception, of course :). So, we can use some tricks to get frames of memory where we can read and/or write indiscriminately.

In win32 executables there is a section that exists in nearly all of them: the ".bss" section. This section has a physical size (in the file) of 0, but virtually it can be quite big (its size is normally at least 1000h bytes, but in huge programs it can reach 64 Kb or more). We can use that section to read and/or write anything we want, but only if we make our virus execute first, without Entry-Point Obscuring and such things, since the application would set all the empty data in the section to whatever it needs. There is another solution: use the empty holes in the virus that we reserve to retrieve, for example, the current directory with GetCurrentDirectory or similar functions. Since we don't need those fields until the virus is running, we can use those holes, if they are big enough, as frames of memory in the same way as .bss, where we can read and write things.
So, once we have those frames, and we are sure that at least 256 or 512 bytes are free to do animalities with :), we can code a function to retrieve a random memory address, for example:
That will return in EAX a memory address, which moreover is aligned on a dword boundary.
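The asm routine isn't reproduced here, but the logic it would implement (returning, in EAX, a dword-aligned address inside one of those scratch frames) looks like this; the Python function name, frame base and size are hypothetical values of mine:

```python
import random

def random_frame_address(frame_base, frame_size):
    """Pick a dword-aligned address inside a scratch frame (a spot in
    .bss or an unused buffer), so garbage can read/write it safely.
    frame_size is assumed to be a power of 2 (256, 512, ...)."""
    assert frame_size & (frame_size - 1) == 0
    offset = random.randrange(frame_size) & ~3   # mask low bits: dword align
    return frame_base + offset

# hypothetical frame at VA 403000h, 512 bytes long
addr = random_frame_address(0x00403000, 512)
assert addr % 4 == 0
assert 0x00403000 <= addr < 0x00403000 + 512
```

In the generated decryptor the same thing is two or three instructions: load a random value, AND it with frame_size-4, add the frame base.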
OK, so after coding a good decryptor structure and the memory accesses, to fuck up completely any initial suspicion about the "undesired traveller", we insert API calls.

It's not easy to put them in, and we can only call those we know how to call, so we need information about all of them (there isn't a "generic" way of calling them). Moreover, we need to find and scan the import directory of the victim host while infecting it: from the virtual address (which is all the PE header gives us about the import directory) we have to deduce the physical address, and then convert back from physical to virtual addresses to obtain the calling addresses of the imported APIs.
The method I follow is as follows, assuming we have the host mapped in memory:

After that, we get the addresses in the import table where the virtual addresses of the functions will be stored. Now, a call like CALL [Obtained_Address] will call the API. Just be careful with the parameters, and with the functions where a buffer is required.
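The virtual-to-physical translation reduces to walking the section table; a Python sketch under the assumption that we already parsed the section headers into tuples (the section values below are invented for the example):

```python
def rva_to_offset(rva, sections):
    """Translate an RVA (e.g. the import directory address from the PE
    header) into a file offset, using the section table of the mapped
    host. `sections` is a list of (virtual_addr, virtual_size, raw_ptr)."""
    for vaddr, vsize, raw in sections:
        if vaddr <= rva < vaddr + vsize:
            return raw + (rva - vaddr)   # offset inside that section's raw data
    return None                          # RVA not backed by any section

# hypothetical section table: .text at RVA 1000h, .idata at RVA 5000h
sections = [(0x1000, 0x3000, 0x400), (0x5000, 0x1000, 0x3400)]
assert rva_to_offset(0x5010, sections) == 0x3410
assert rva_to_offset(0x9000, sections) is None
```

The reverse direction (file offset back to virtual address, needed to emit the final CALL [address]) is the same walk with raw_ptr and virtual_addr swapped.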
Another thing: as Micro$oft programmers are dumb or worse, there are functions that can hang the application, like GetModuleHandleA. I've tried passing it a random pointer as the module name to get the handle of, and an exception occurs, instead of it returning an error like "no valid string" or "module not found" or something like that, so be careful with some functions.
Earlier we saw the potential of recursive calls, and now we apply that to garbage. There are some types of garbage that we can (and must) generate recursively, like CALLs, random loops and some more. Here I'll explain CALLs and random loops.

Earlier I explained a method of making CALLs without creating suspicious structures, and I said that later, at the end of the decryptor (for example), you can construct the subroutines targeted by those CALLs. To build the subroutines we should use recursion, so we have to code the DoGarbage function recursively, I mean, in such a way that the function can call itself. This must be done to produce better garbage inside the CALLs, and moreover, this way those subroutines can contain CALLs of their own. But be careful, because something like this can happen:
That situation can produce a hang of the decryptor, and thus the application never executes. To avoid that, we can use arrays to store "levels" of calls, in this way:
When we enter the CALL-generation part of the engine, we must increase a variable, a sort of internal recursivity level, to track the level of the call we are generating, in such a way that Level 1 calls will never generate a call to Level 1 or above, and the same for Level 2 and so on. That way we avoid situations like the one above, since a subroutine will never call another subroutine that drives execution back into the first one again.
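A minimal sketch of that level bookkeeping in Python (names and the exact limits are mine): each subroutine may only call subroutines of a strictly deeper level, so no call chain can ever come back to its origin:

```python
import random

MAX_LEVEL = 3
subs = {}          # subroutine name -> list of subroutines it calls

def make_subroutine(name, level):
    """Generate a subroutine; it may only CALL subroutines of a deeper
    level, so sub1 -> sub2 -> sub1 loops can never be produced."""
    callees = []
    if level < MAX_LEVEL:
        for _ in range(random.randrange(3)):     # 0..2 nested CALLs
            callee = f"{name}_{len(callees)}"
            make_subroutine(callee, level + 1)   # callee is one level deeper
            callees.append(callee)
    subs[name] = callees

def max_depth(name):
    """Longest CALL chain starting at `name` (finite by construction)."""
    return 1 + max((max_depth(c) for c in subs[name]), default=0)

make_subroutine("sub0", 0)
assert max_depth("sub0") <= MAX_LEVEL + 1   # bounded: no infinite call chain
```

Because levels only increase along a call chain and are capped, every chain terminates; the hang in the diagram above needs a cycle, which this scheme makes impossible.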
Another kind of recursive garbage generation is random loop generation. We can put in little, annoying loops that do nothing but, well, loop (just seven or eight iterations, nothing that lasts very long). Since a random loop inside a random loop means a geometric increase of the loop duration, we'd better avoid nesting them, keeping a variable that says "I'm in a loop now, so don't make another". The same applies to CALLs, since we code them later, not at that moment, and we might put another loop inside the generated subroutine, which would produce the same problem. So, when generating looping code, avoid making other loops or calls. To fill the loop, of course, we call DoGarbage to generate some looped-over code (an empty loop isn't normal, you know).
And, as you may have deduced, we can use DoMOVRegValue and all those functions that we coded before to generate more garbage: just take a garbage register and a random number and use those functions.
Well, this article is shorter than I expected, but I hope it'll be useful and bring you ideas while coding your new polymorphic engine. The great majority of the ideas I expose here have been used in the TUAREG engine, and sometimes the source code of TUAREG refers to this article for the explanation of some techniques. Laterz!