About Me

Bay Area, CA, United States
I'm a computer security professional, most interested in cybercrime and computer forensics. I'm also on Twitter @bond_alexander. All opinions are my own unless explicitly stated.

Thursday, May 26, 2011

Reverse engineering a malicious PDF Part 1

One of the projects I work on is a malicious Javascript scanner. It also scans PDFs, since the malicious part of a PDF is usually encoded Javascript. To test the scanner, we regularly collect malicious PDFs and run them against it to see if they're detected. Of course, to determine whether a file is really malicious, sometimes you need to go in by hand and see what's going on. To this end, Didier Stevens wrote a chapter on analyzing malicious PDFs, and I'll be using it as a reference as I go through a malicious PDF here. I recommend reading it alongside this article. Didier is far better at this than I am, so I won't try to explain the structural concepts that he covers far better. The PDF I'll be looking at is named 4469.pdf. It was downloaded "in the wild" from a website listed on the Malware Domain List.


Didier provides several Python scripts that are useful for analyzing PDFs; the first is pdfid.py. It examines the PDF for indicators that it may be malicious, such as the presence of Javascript, automatic actions, and document length (most malicious PDFs are only one page). Here's what the results look like for 4469.pdf.
In this case, the PDF is only one page, contains Javascript, and contains an action that will launch when the PDF is opened (OpenAction). This is potentially suspicious, so let's keep investigating. We know that the Javascript is where the malicious activity will happen, so let's look at that first, using Didier's pdf-parser.py.
pdf-parser.py located Javascript only in indirect object 2 0 of the PDF. However, indirect object 2 0 references indirect object 11 0 as both an OpenAction and Javascript. In a moment we'll see why pdf-parser.py didn't identify indirect object 11 0 as containing Javascript. For now, we see that the PDF invokes Javascript when the file is opened, which we expect from a malicious PDF, and we expect that indirect object 11 contains our payload.


Using pdf-parser.py again, I can parse out indirect object 11, which is a stream object compressed with the Flate method. Interestingly, this is exactly the same situation as in Didier's example, so it seems this is a common way to obfuscate malicious code in PDFs. Since it's common, Didier provides a way to decompress the stream: pdf-parser.py --object 11 --filter --raw 4469.pdf .... and voila, we have malicious code:
It keeps going like that for another couple pages.

Just like in the malicious Javascript I took a look at last month, the functions and variables all have random names: function hddd(fff), var fpziycpii, etc. There's also plenty of junk characters and excessive transformations to make analysis more annoying. Here's one example towards the end of the script:
for(yrauyiyqouoi=0;yrauyiyqouoi<gmgdouaeyd;yrauyiyqouoi++){var dsfsg = yrauyiyqouoi+1;xrywreom+='var oynaoyoyaia'+dsfsg+' = oynaoyoyaia'+yrauyiyqouoi+';';this[fuquoudieeel](xrywreom);}
There's even a section where every letter is interspersed with a bunch of exclamation marks. It's all messy, but nothing we can't eventually analyze. Let's start at the top.

The first code introduced is function hddd, which takes the parameter fff and replaces each ** in it with %u. Four separate strings are processed by this function, which means each one is actually a %u-escaped Unicode string. The results are stored as variables: shcode_geticon, shcode_newplayer, shcode_printf, and shcode_collab. Based on the names, these strings are likely the shellcode payloads, but we'll see when we get there.
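To make that decoding step concrete, here is a minimal sketch of what a function like hddd probably does. The name decodeStarEncoded and the sample string are hypothetical, and the call to unescape is my assumption: only the ** to %u replacement is described above, not what happens to the string afterwards.

```javascript
// Hypothetical re-creation of the decoder; the real hddd body is not shown in
// this post, so the unescape() step is an assumption.
function decodeStarEncoded(encoded) {
  // Swap every "**" marker for "%u" so the text becomes %uXXXX escapes.
  var percentEncoded = encoded.replace(/\*\*/g, '%u');
  // unescape() turns each %uXXXX escape into one UTF-16 code unit.
  return unescape(percentEncoded);
}

// Harmless made-up sample, not taken from 4469.pdf.
var sample = '**0041**0042';
var decoded = decodeStarEncoded(sample);
console.log(decoded); // prints "AB"
```

With real shellcode the decoded characters would mostly be unprintable, so when analyzing a sample it is more useful to dump the code units as hex than to print the string.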

Next we have:
var fpziycpii = 'e';var uinsenagexo = 'l';var fuquoudieeel = fpziycpii+'va'+uinsenagexo;var ioafyyad = this[fuquoudieeel];rtaoyuupaue = "ioa!fyya!d!('!t!hi!s![!fuqu!ou!d!ie!ee!l](o!y!na!o!yoy!aia!'+!gmg!do!u!a!e!yd!+!')!;!'!)!;".replace(/[!]/g, '');
Stepping through it, the variable fuquoudieeel takes the first two variables and combines them to get "eval", so ioafyyad is this[eval]. Next, rtaoyuupaue is a string that has the replace function executed on it. In this case, the replace function just removes all the extra exclamation points that are in there, yielding:

ioafyyad('this[fuquoudieeel](oynaoyoyaia'+gmgdouaeyd+');');

If we substitute in the known variables, we get:

this[eval]('this[eval](oynaoyoyaia'+gmgdouaeyd+');');

That's an improvement, but there's still work to do. The variable gmgdouaeyd is later defined as 1100, so we get oynaoyoyaia1100, a variable which isn't defined yet. There's a section towards the end with oynaoyoyaia0 = eiuaopyj; but obviously that's not the same variable. It may be a typo, or it may be junk code ... we'll see. For now, let's move on.
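The string handling in this step is safe to reproduce on its own, since nothing gets evaluated. The sketch below rebuilds the "eval" name and strips the exclamation marks from the string copied from the sample; the variable names evalName and cleaned are mine, chosen for readability.

```javascript
// Rebuild the property name the same way the sample does: 'e' + 'va' + 'l'.
var firstLetter = 'e';
var lastLetter = 'l';
var evalName = firstLetter + 'va' + lastLetter;

// The obfuscated string, copied from the sample. It is only cleaned here,
// never executed.
var obfuscated = "ioa!fyya!d!('!t!hi!s![!fuqu!ou!d!ie!ee!l](o!y!na!o!yoy!aia!'+!gmg!do!u!a!e!yd!+!')!;!'!)!;";

// Remove every exclamation mark, exactly as the sample's replace() call does.
var cleaned = obfuscated.replace(/[!]/g, '');

console.log(evalName); // prints "eval"
console.log(cleaned);
```

Printing the cleaned string instead of running it is the general pattern I use for this kind of analysis: keep the transformation, drop the execution.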


Next we have another function:

function iuoyzemuyyi(ieuohhrk)
{
var iuioathlpau = '!';
var unetoptou = '';
for(xqqauiae=0;xqqauiae<ieuohhrk.length;xqqauiae++)
{
var yaomwteez = ieuohhrk.charAt(xqqauiae);
if(yaomwteez == iuioathlpau) {  } else { unetoptou+=yaomwteez;
}
}
return unetoptou;
}
This function is a longer, more complicated way of removing the exclamation marks from a string. Like the last one, it is applied to another section of code that is stored as a string and obfuscated with five exclamation marks. The string is un-obfuscated and stored as the variable etppeifjeka. That's a long section and it looks like it's part of the payload, so we'll get to that in part 2. For now, let's skip past it and see how it's used.
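To check that claim, here is the same function with readable names next to the one-line replace it is equivalent to. The names stripMarker and demoInput are mine, and the input is a harmless made-up string rather than the real payload.

```javascript
// Readable rename of iuoyzemuyyi: copy every character except '!'.
function stripMarker(text) {
  var marker = '!';
  var result = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (ch !== marker) {
      result += ch;
    }
  }
  return result;
}

// Made-up input for illustration only.
var demoInput = 'v!ar! x! =! 1!;';
var viaLoop = stripMarker(demoInput);
var viaReplace = demoInput.replace(/!/g, '');

console.log(viaLoop);                // prints "var x = 1;"
console.log(viaLoop === viaReplace); // prints "true"
```

Both approaches give the same result, so the longer function adds nothing except more code for an analyst to read through.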

The last section is this:
eiuaopyj = ''+etppeifjeka+'';
var gmgdouaeyd = 1100;
var xrywreom = '';
oynaoyoyaia0 = eiuaopyj;
for(yrauyiyqouoi=0;yrauyiyqouoi<gmgdouaeyd;yrauyiyqouoi++)
{
var dsfsg = yrauyiyqouoi+1;
xrywreom+='var oynaoyoyaia'+dsfsg+' = oynaoyoyaia'+yrauyiyqouoi+';';
this[fuquoudieeel](xrywreom);
}
ioafyyad(rtaoyuupaue);
This section is odd, to say the least. The for loop constructs a string which is stored in the variable xrywreom. The loop counts from 0 to 1099 (1100 iterations) and builds a section of code that declares a series of variables oynaoyoyaiaX, where X is the current number plus one, and each variable is set equal to the variable with the previous number. The output looks like this:
var oynaoyoyaia1 = oynaoyoyaia0;var oynaoyoyaia2 = oynaoyoyaia1;var oynaoyoyaia3 = oynaoyoyaia2;var oynaoyoyaia4 = oynaoyoyaia3;var oynaoyoyaia5 = oynaoyoyaia4;
It goes up to var oynaoyoyaia1100 = oynaoyoyaia1099; On each pass, the loop runs this[fuquoudieeel](xrywreom); which executes all the code accumulated in the variable so far. This creates 1100 variables and sets them all equal to eiuaopyj (the variable holding the obfuscated section we haven't examined yet). Let's go back to the earlier section where we saw a reference to oynaoyoyaia. With gmgdouaeyd set to 1100, it becomes:
this[eval]('this[eval](oynaoyoyaia1100);');
which evaluates the contents of oynaoyoyaia1100, that is, etppeifjeka, the probable payload.

After the for loop is the last line of Javascript in this PDF: ioafyyad(rtaoyuupaue); As we've already discovered, ioafyyad is this[eval] and rtaoyuupaue, once its variables are substituted, is the string this[eval]('this[eval](oynaoyoyaia1100);'); so that is the line of code that actually triggers the exploit.
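The whole chain can be replayed safely with a stand-in payload. This is a sketch under my own assumptions, not the sample itself: the chain is shortened to 5 variables instead of 1100, the payload is a harmless string that only sets a flag, and indirectEval, chainLength, declarations, trigger and payloadRan are names I made up. I go through globalThis and an indirect eval because this[...] behaves differently outside the PDF reader's Javascript environment.

```javascript
// Calling eval through another name (indirect eval) runs code in global scope.
var indirectEval = eval;
var chainLength = 5; // the sample uses 1100

// Stand-in for the real payload string: harmless, it only sets a flag.
globalThis.oynaoyoyaia0 = 'globalThis.payloadRan = true;';

// Build and run the declaration chain the same way the sample's loop does.
var declarations = '';
for (var step = 0; step < chainLength; step++) {
  declarations += 'var oynaoyoyaia' + (step + 1) + ' = oynaoyoyaia' + step + ';';
  indirectEval(declarations); // re-runs every declaration built so far
}

// Same shape as the deobfuscated trigger: eval code that evals the last variable.
var trigger = "eval('eval(oynaoyoyaia" + chainLength + ");');";
indirectEval(trigger);

console.log(globalThis.oynaoyoyaia5 === globalThis.oynaoyoyaia0); // prints "true"
console.log(globalThis.payloadRan);                               // prints "true"
```

The last variable in the chain holds the same string as oynaoyoyaia0, and the nested evals end up executing that string. When working with a real sample, I would replace the innermost eval with a print so the payload is displayed instead of run.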

All that's left to do is deobfuscate the exploit itself and see what it does.
